The LINUX dataset consists of 48,747 Program Dependence Graphs (PDGs) generated from the Linux kernel. Each graph represents a function, where a node represents one statement and an edge represents a dependency between two statements.
https://www.enterpriseappstoday.com/privacy-policy
This guide is an overview of Linux commands for making directories, moving files, listing files (ls), changing directories, and so forth. It is publicly available through the Ubuntu website and is considered general training material.
Dataset Card for linux-man-pages-tldr-summarized
Dataset Summary
This dataset contains Linux man pages downloaded from man7, each prefixed with 'summarize: ', and the corresponding summaries downloaded from tldr-pages.
Supported Tasks
This dataset should be used to fine-tune language models for summarization tasks.
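The input format described above can be sketched as a hypothetical (input, target) pair; the man-page excerpt and summary below are placeholders, not actual records from the dataset:

```python
# Hypothetical example pair; real records come from man7 and tldr-pages
man_page_text = "ls - list directory contents. List information about the FILEs..."
tldr_summary = "List files one per line: ls -1"

# Each model input carries the 'summarize: ' prefix described above
model_input = "summarize: " + man_page_text
training_pair = (model_input, tldr_summary)
```

A sequence-to-sequence model fine-tuned on such pairs learns to emit the tldr-style summary given the prefixed man page.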
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains commits, with detailed information about changed files, from about 12 years of the Linux kernel master branch. It covers about 600,000 (filtered) commits, which break down into about 1.4 million file-change records.
Each row represents a changed file in a specific commit, with annotated deletions and additions to that file, as well as the filename and the subject of the commit. I also included anonymized information about the author of each changed file, as well as the time of commit and the timezone of the author.
The columns in detail:
I'm sure with this dataset nice visualizations can be created, let's see what we can come up with!
For everybody interested in how the dataset was created, I've set up a GitHub repo that contains all the required steps to reproduce it here.
If you have any questions, feel free to contact me via PM or discussions here.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Dataset containing measurements of Linux kernel binary size after compilation. The reported size, in the column "perf", is the size in bytes of the vmlinux file. It also contains a column "active_options" reporting the number of activated options (set to "y"). All other columns, listed in the file "Linux_options.json", are Linux kernel options. The sampling was done using randconfig. The version of Linux used is 4.13.3.
Not all available options are present. First, the dataset only contains options for the x86 64-bit version. Then, all non-tristate options have been ignored. Finally, options that do not take more than one value across the whole dataset, due to insufficient variability in the sampling, are ignored. All options are encoded as 0 for the "n" and "m" option values, and 1 for "y".
In Python, importing the dataset with pandas assigns all columns the int64 dtype, which leads to very high memory consumption (~50 GB). We provide the following way to import it using less than 1 GB of memory, by setting the option columns to int8.
import pandas as pd
import json
import numpy

with open("Linux_options.json", "r") as f:
    linux_options = json.load(f)

# Load the CSV, setting option columns to int8 to save a lot of memory
df = pd.read_csv("Linux.csv", dtype={opt: numpy.int8 for opt in linux_options})
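As a rough illustration of the saving, a synthetic stand-in (made-up option names and values, not the real Linux.csv) compares the int64 and int8 footprints:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 1,000 configurations x 100 binary options (names are made up)
rng = np.random.default_rng(0)
options = [f"OPTION_{i}" for i in range(100)]
df = pd.DataFrame(rng.integers(0, 2, size=(1000, 100)), columns=options)

bytes_int64 = df.astype(np.int64).memory_usage(deep=True).sum()
bytes_int8 = df.astype(np.int8).memory_usage(deep=True).sum()
print(bytes_int64 / bytes_int8)  # close to 8: int8 columns use one byte per value
```

Since every option is encoded as 0 or 1, int8 loses no information while cutting the per-value storage from eight bytes to one.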
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Explore Linux through unique data from multiple sources: key facts, real-time news, interactive charts, detailed maps & open datasets
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
With large-scale and complex configurable systems, it is hard for users to choose the right combination of options (i.e., configurations) in order to obtain the desired trade-off between functionality and performance goals such as speed or size. Machine learning can help in relating these goals to the configurable system options, and thus predict the effect of options on the outcome, typically after a costly training step. However, many configurable systems evolve at such a rapid pace that it is impractical to retrain a new model from scratch for each new version. Taking the extreme case of the Linux kernel with its ≈14,500 configuration options, we investigate how kernel binary size predictions degrade over successive versions (and how transfer learning can be adapted and applied to mitigate this degradation).
We used and are sharing a unique and large dataset consisting of the binary sizes (compressed and non-compressed) of thousands of configurations for different versions of the kernel, spanning three years (4.13, 4.15, 4.20, 5.0, 5.4, 5.7, and 5.8). Overall, around 200K configurations over 10K+ options/features and 6 versions.
This dataset has been used in the Transactions of Software Engineering (TSE) article "Transfer Learning Across Variants and Versions: The Case of Linux Kernel Size" (preprint: https://hal.inria.fr/hal-03358817)
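The paper evaluates several transfer techniques; the following is only a minimal "model shifting"-style sketch on synthetic data (not the article's exact method, and the data is made up): a model trained on an old version is reused on a new version by fitting just a scale and offset on a handful of new-version configurations.

```python
import numpy as np

rng = np.random.default_rng(42)
n_opts = 50

# Synthetic stand-in for two kernel "versions" (made-up data, not the shared dataset)
w_old = rng.normal(size=n_opts)
X_old = rng.integers(0, 2, size=(2000, n_opts)).astype(float)
y_old = X_old @ w_old                        # old-version "binary sizes"

X_new = rng.integers(0, 2, size=(30, n_opts)).astype(float)  # few new-version samples
y_new = 1.3 * (X_new @ w_old) + 100.0        # option effects scaled, baseline shifted

# Train a model on the old version only
coef_old, *_ = np.linalg.lstsq(X_old, y_old, rcond=None)

# Transfer: fit just a scale and offset mapping old predictions to new-version sizes
pred_old = X_new @ coef_old
A = np.column_stack([pred_old, np.ones_like(pred_old)])
(scale, offset), *_ = np.linalg.lstsq(A, y_new, rcond=None)

def predict_new_version(X):
    # Reuses the old model; only two parameters were fit on the new version
    return scale * (X @ coef_old) + offset
```

The appeal is data efficiency: instead of sampling and compiling thousands of configurations of the new version, only a small sample is needed to adapt the old model.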
This data contains 2k log lines from the Linux dataset, derived from the LogPai GitHub repository. The first file contains just the log lines. The second contains the log lines with their categorized fields: namely Month, Date, Time, Level, Component, PID, Content, EventID, and EventTemplate.
1. Understanding the frequency of different Event Types (EventID) that occur in the log set.
2. Identifying anomalies in the logs, if any exist.
3. Named Entity Recognition - To identify different fields of the log set from the set-aside data.
4. Multiclass classification - To identify what Event Type (EventID) the log line belongs to.
5. Giving the variable parts (<*>) a name, and adding them to the entity recognition task. [Boss level!]
Point 5 explanation: In the 3rd file, named Linux_2k.log_templates.csv, there is a template for each of the event types (given by EventIDs). The template consists of a variable portion (given by <*>) and a constant portion (the other words in the template). The value of the variable part can be found by comparing the template against a log line that matches it. A name could be assigned to the variable part and included in the named entity recognition task. Keep in mind that the frequency of a variable part might be limited.
Note: An important idea to have in mind is that one will have to focus on the syntax more than the semantics of a log line.
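Extracting the variable parts by comparing a template against a log line, as point 5 describes, can be sketched with a regular expression: escape the constant portion and turn each <*> into a capture group. The template and log line below are illustrative, not taken from Linux_2k.log_templates.csv:

```python
import re

def template_to_regex(template):
    # Escape the constant words, turn each <*> wildcard into a lazy capture group
    parts = [re.escape(p) for p in template.split("<*>")]
    return re.compile("(.+?)".join(parts) + r"$")

# Hypothetical template/line pair in the style of the templates file
template = "session opened for user <*> by (uid=<*>)"
line = "session opened for user root by (uid=0)"
m = template_to_regex(template).match(line)
print(m.groups())  # ('root', '0')
```

Each captured group is a candidate named entity, which is why the task stays syntactic: the regex only knows where a value sits in the template, not what it means.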
Have fun understanding how to apply NLP concepts to Log Datasets! 😀
Check out my other Datasets here
MIT License
Copyright (c) 2018 LogPAI
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
Dataset Card for "stackoverflow_linux"
Dataset information:
Source: Stack Overflow
Category: Linux
Number of samples: 300
Train/Test split: 270/30
Quality: Data come from the top 1k most upvoted questions
Additional Information
License
All Stack Overflow user contributions are licensed under CC BY-SA 3.0, with attribution required.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Feature code is often scattered across a software system. Scattering is not necessarily bad if used with care, as witnessed by systems with highly scattered features that evolved successfully. Feature scattering, often realized with a pre-processor, circumvents limitations of programming languages and software architectures. Unfortunately, little is known about the principles governing scattering in large and long-living software systems. We present a longitudinal study of feature scattering in the Linux kernel, complemented by a survey of 74 Linux kernel developers and interviews with nine of them. We analyzed almost eight years of the kernel's history, focusing on its largest subsystem: device drivers. We learned that the ratio of scattered features remained nearly constant and that most features were introduced without scattering. Yet, scattering easily crosses subsystem boundaries, and highly scattered outliers exist. Scattering often addresses a performance-maintenance tradeoff (alleviating complicated APIs), works around hardware design limitations, and avoids code duplication. While developers do not consciously enforce scattering limits, they do improve the system design and refactor code, thereby mitigating pre-processor idiosyncrasies or reducing its use.
https://www.gnu.org/copyleft/gpl.html
File versions from the Linux kernel, with vulnerability labels derived from the CVE database. Based on the work by Jimenez et al. [1].
[1] Jimenez, Matthieu, Mike Papadakis, and Yves Le Traon. "Vulnerability Prediction Models: A Case Study on the Linux Kernel." 2016 IEEE 16th International Working Conference on Source Code Analysis and Manipulation (SCAM). IEEE, 2016.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Explore A practical guide to Linux through unique data from multiple sources: key facts, real-time news, interactive charts, detailed maps & open datasets
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This dataset includes a list of games that have been released on the Linux operating system. Games that are natively playable on Linux are included in this list.
With so many amazing games available to play on Linux, it can be hard to decide which ones to try first. This comprehensive list includes something for everyone, from fast-paced action games to relaxing puzzle games and everything in between. With such a wide variety of genres represented, there's sure to be something here that appeals to you.
So what are you waiting for? Give one (or more) of these Linux games a try today!
The dataset includes the name of the game, the developer, the publisher, the genres, the operating systems, the date released, and the Metacritic score.
- Checking for release dates of Linux games.
- Finding the Metacritic score for a particular game.
- Searching for a specific game by name or genre.
This dataset was compiled by scraping Wikipedia.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
Files df_4.csv, df_11.csv, df_16.csv, df_17.csv, df_18.csv, df_20.csv, df_21.csv, df_25.csv, df_26.csv, and df_27.csv share the same schema:

| Column name       | Description                                               |
|:------------------|:----------------------------------------------------------|
| Name              | The name of the game. (String)                            |
| Developer         | The game's developer. (String)                            |
| Publisher         | The game's publisher. (String)                            |
| Genres            | The genres the game belongs to. (String)                  |
| Operating Systems | The operating systems the game can be played on. (String) |
| Date Released     | The date the game was released. (Date)                    |
| Metacritic        | The game's Metacritic score. (Integer)                    |

File: df_1.csv

| Column name | Description                |
|:------------|:---------------------------|
| 0           | Name of the game. (String) |

File: df_31.csv

File: df_24.csv
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Explore Linux and the Unix philosophy through unique data from multiple sources: key facts, real-time news, interactive charts, detailed maps & open datasets
https://doi.org/10.4121/resource:terms_of_use
This dataset contains changes performed by developers over 15 releases of the Linux kernel. It covers the feature-oriented change history of the kernel between releases 3.10 and 4.4. The changes are broken down by affected artefacts, and all changes pertaining to the same feature are grouped together. If you want to know things like: How many times was a feature touched in the kernel? How many feature changes came with makefile adjustments? Then this dataset may interest you. To access the data, you can install a Neo4j server via http://neo4j.com/
https://www.ine.es/aviso_legal
Survey on Equipment and Use of Information and Communication Technologies in Households: Computer use, by Autonomous community and knowledge of LINUX operating system. Autonomous Communities.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
I wanted to create a time-series forecasting model for tracking the number of commits on the Linux kernel repository per month. I extracted the data from https://www.phoronix.com/misc/linux-eoy2019/activity.html which already tracks the Linux repository using Gitstat. Having realized that the community might be able to create more robust models, I decided to upload the extracted data here as well. Link to repository: https://github.com/torvalds/linux
The dataset contains the Linux repository's commit / lines-added / lines-deleted counts, collected on a monthly basis. Phoronix's last extraction was performed at the end of 2019.
I would love to see how the community would create time-series models on a dataset that's as limited as this.
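A minimal forecasting baseline on a series of this shape could look like the following sketch; the monthly counts here are synthetic, not the Phoronix numbers:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the monthly commit counts (made-up values)
rng = np.random.default_rng(1)
idx = pd.date_range("2015-01-01", periods=60, freq="MS")
commits = pd.Series(5000 + rng.integers(-500, 500, size=60), index=idx)

# Naive baseline: forecast each month as the trailing 12-month mean
forecast = commits.rolling(12).mean().shift(1)
mae = (forecast - commits).abs().mean()  # pandas skips the initial NaN months
```

Any richer model (seasonal decomposition, ARIMA, etc.) would have to beat this trailing-mean baseline to justify itself on a dataset this small.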
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
A novel dataset, "Linux-APT-Dataset-2024", that includes the Tactics, Techniques and Procedures (TTPs) of APT attacks in a Linux environment. There are two files, 'combine.csv' and 'Processed Version.xlsx'; both cover 17 files ranging from 01 October 2023 to 07 January 2024, and each of the files contains all the essential data fields. The 17 files can be found via the link below.
Karim, S. (2024). Linux-APT-Dataset-2024 [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10685642
1. combine.csv is the raw file, a merger of all 17 files extracted from the SIEM 'WAZUH' after simulating the latest attacks in the environment. Due to WAZUH's limitation of producing files with at most 10,000 records, all of the files were combined so they can be used as input for other analyses.
2. Processed Version.xlsx is the compiled version of combine.csv; the file extension was changed to xlsx because of the support available on most systems, and Tactics and Techniques are separated for the convenience of different researchers. Records are also tagged as General or Malicious: a value of 1 means suspicious/malicious, otherwise 0 for a general/normal log.
The dataset contains both types of activities/logs, general as well as malicious/suspicious, to make it near real-time for better analysis and evaluation. It is most productive to map the TTPs against the MITRE framework. The simulated attacks include privilege-escalation payloads for Linux, recently discovered CVEs, and emulations of key-loggers and APTs such as APT41, APT28, APT29, and Turla. An effective way to decide whether a log/record is general or suspicious is to check whether it is TTP-tagged: if so, it is suspicious/malicious; otherwise it is considered general.
The dataset is also useful for analysing all the critical log resources in the Linux environment that could be considered when performing forensic activity.
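The labeling rule described above (a TTP-tagged record is suspicious/malicious, otherwise general) can be sketched with pandas; the rows and column names below are made up for illustration and may differ from the actual files:

```python
import pandas as pd

# Hypothetical records in the spirit of Processed Version.xlsx (column names assumed)
logs = pd.DataFrame({
    "log": ["sshd session opened", "sudo -l enumeration attempt", "cron job executed"],
    "ttp": [None, "T1548 - Abuse Elevation Control Mechanism", None],
})

# Rule from the description: a TTP-tagged record is malicious (1), otherwise general (0)
logs["label"] = logs["ttp"].notna().astype(int)
print(logs["label"].tolist())  # [0, 1, 0]
```

This reproduces the General/Malicious tagging described for the processed file from the raw TTP column alone.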