The CodeSearchNet Corpus is a large dataset of functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and Ruby from open source projects on GitHub. The CodeSearchNet Corpus includes:
* Six million methods overall
* Two million of which have associated documentation (docstrings, JavaDoc, and more)
* Metadata that indicates the original location (repository or line number, for example) where the data was found
SAIVT Semantic Person Search
Overview
The SAIVT Semantic Person Search Database was developed to provide a suitable platform to develop and evaluate techniques that search for a person using a semantic query (e.g. tall, red shirt, jeans). The database provides sequences for 110 subjects, each consisting of 30 initialisation frames (to, for instance, learn a background model) and a number of annotated frames containing the target subject, together with a description of the subject incorporating a number of traits, including clothing type and colour, gender, height, and build.
You can read our paper on eprints.
Contact Dr Simon Denman for further information.
Licensing
The SAIVT-SoftBioSearch database is © 2012 QUT and is licensed under the Creative Commons Attribution-ShareAlike 3.0 Australia License.
Attribution
To attribute this database, please include the following citation: Halstead, Michael, Denman, Simon, Sridharan, Sridha, & Fookes, Clinton B. (2014) Locating People in Video from Semantic Descriptions: A New Database and Approach. In the 22nd International Conference on Pattern Recognition, 24 - 28 August 2014, Stockholm, Sweden. Our paper is available on eprints.
Acknowledging the Database in your Publications
In addition to citing our paper, we kindly request that the following text be included in an acknowledgements section at the end of your publications: We would like to thank the SAIVT Research Labs at Queensland University of Technology (QUT) for freely supplying us with the SAIVT-SoftBioSearch database for our research.
Installing the SAIVT-SoftBioSearch Database
Download and unzip the following archives:
SAIVT-SoftBioSearchDB.tar.gz (157 MB, md5sum: a0c897c803c4d3a79be64a93c83060ef)
SAIVT-SoftBioSearchDB-C1.tar.gz (1.6 GB, md5sum: 692affdde65091bfe24f29ccb186dd65)
SAIVT-SoftBioSearchDB-C2.tar.gz (1.0 GB, md5sum: 573e953b0cce15c6f2ee5c3cb0cb4fe1)
SAIVT-SoftBioSearchDB-C3.tar.gz (1.5 GB, md5sum: 8870c3333e61cb27f49d5fd46c937937)
SAIVT-SoftBioSearchDB-C4.tar.gz (908 MB, md5sum: 79debc4b219bfc4dfe65eea06f57b458)
SAIVT-SoftBioSearchDB-C5.tar.gz (858 MB, md5sum: 44a973b0a0d8dfca340dc0945c479ace)
SAIVT-SoftBioSearchDB-C6.tar.gz (591 MB, md5sum: 9cdfc872104cb730d3be436cee39e5ce)
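If you want to check the downloads before unpacking, here is a minimal Python sketch that verifies the md5sums listed above and extracts the archives; it assumes the archives sit in the current directory, and only the first two digests are spelled out.

```python
import hashlib
import tarfile

# md5sums as listed above (first two shown; add the remaining archives likewise).
EXPECTED = {
    "SAIVT-SoftBioSearchDB.tar.gz": "a0c897c803c4d3a79be64a93c83060ef",
    "SAIVT-SoftBioSearchDB-C1.tar.gz": "692affdde65091bfe24f29ccb186dd65",
}

def md5sum(path, chunk_size=1 << 20):
    """Compute the md5 digest of a file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

for name, expected in EXPECTED.items():
    actual = md5sum(name)
    if actual != expected:
        raise ValueError(f"checksum mismatch for {name}: {actual}")
    with tarfile.open(name, "r:gz") as archive:
        archive.extractall(".")  # unpacks into the structure shown below
    print(f"{name}: verified and extracted")
```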
At this point you should have the following directory structure, and the SAIVT-SoftBioSearch database is installed:
SAIVT-SoftBioSearch
+-- C1-BlackBlueStripe-BlueJeans
+-- C1-BlackShirt-PinkShorts
+-- ...
+-- C6-YellowWhiteSpotDress
+-- Calibration
+-- Data
|   +-- CultureColours
|   |   +-- Black
|   |   +-- Blue
|   |   +-- ...
|   +-- Videos
|       +-- Cam1
|       +-- Cam2
|       +-- ...
+-- Halstead 2014 - Locating People in Video from Semantic Descriptions.pdf
+-- LICENSE.txt
+-- README.txt (this document)
+-- SAIVTSoftBioDatabase.xml
Sequences for the individual subjects are contained within the directories named C[1..6]-[TorsoDescription]-[LegDescription]. There are 110 subjects captured from six different cameras. Each directory contains an XML file with the annotation for that sequence, and the images that belong to the sequence. For each sequence, the first 30 frames are reserved for updating/learning a background model, and as such have no annotation.
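As a sketch of how the per-sequence layout can be consumed, assuming frames are stored as image files sorted in capture order (file extensions and the example directory name are illustrative):

```python
from pathlib import Path

def load_sequence(seq_dir):
    """Split a subject sequence into background and annotated frames.

    The first 30 frames are reserved for background modelling and
    carry no annotation.
    """
    seq_dir = Path(seq_dir)
    frames = sorted(p for p in seq_dir.iterdir()
                    if p.suffix.lower() in {".png", ".jpg", ".jpeg"})
    annotation = next(seq_dir.glob("*.xml"), None)  # per-sequence annotation file
    return frames[:30], frames[30:], annotation

# Directory name taken from the structure shown above.
background, annotated, annotation = load_sequence(
    "SAIVT-SoftBioSearch/C1-BlackShirt-PinkShorts")
```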
The 'Calibration' directory contains a camera calibration (using Tsai's method) for the six cameras used in the database.
The 'Data' directory contains additional data that may be of use. In particular, it contains a collection of colour patches within 'Data/CultureColours' that can be used to train models for a specific colour. It also contains a set of patches for skin, and for non-skin colours. 'Data/Videos' contains videos for each camera that can be used to learn the background. It should also be noted that for a portion of the time when the database was captured, a temporary wall was up due to construction works. This impacted the following sequences captured from cameras 1 and 6:
Camera 1:
C1-GreenWhiteHorizontal-BlackPants
C1-RedCheck-BlackSkirt
C1-GreenCheck-BrownPants
Camera 6:
C6-GreenFaceCover-Blue-Blue
C6-YellowWhiteSpotDress
Additional videos for these cameras are also included and are named CamX_Wall.avi. The 'SAIVTSoftBioSearchDB.xml' file defines the database. This file specifies the cameras and their calibrations/background sequences, includes definitions for the traits/soft biometrics, and lists the sequences.
This paper is also available alongside this document in the file 'Halstead 2014 - Locating People in Video from Semantic Descriptions.pdf'.
CodeQueries Benchmark dataset consists of instances of semantic queries, code context and code spans in the context corresponding to the semantic queries. The dataset can be used in experiments involving semantic query comprehension with an extractive question-answering methodology over code. More details can be found in the paper.
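To make the extractive-QA setup concrete, here is a minimal sketch using the Hugging Face pipeline API; the model name is a generic placeholder, not the benchmark's baseline, and the query/context pair is invented for illustration.

```python
from transformers import pipeline

# Placeholder model; substitute an extractive-QA model fine-tuned on code.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

code_context = '''
def connect(host, port, retries=3):
    for attempt in range(retries):
        pass
'''

# A semantic query whose answer is a span in the code context.
result = qa(question="How many times is the connection retried by default?",
            context=code_context)
print(result["answer"], result["start"], result["end"])
```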
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a collection of 2435 files; each file contains the snippets from a single search engine, retrieved using USearch. The search queries are the tokens from two semantic similarity datasets: (1) Miller-Charles and (2) Semantic IoT.
This dataset was used to develop a semantic similarity model that was published in the 8th International Conference on Future Internet of Things and Cloud (FiCloud 2021), the code can be found here.
This dataset contains source code and data used in the PhD thesis "Metrics of Graph-Based Meaning Representations with Applications from Parsing Evaluation to Explainable NLG Evaluation and Semantic Search". The dataset is split into five repositories:
S3BERT: source code to run experiments for chapter 9, "Building efficient and effective similarity models from MR metrics".
amr-metric-suite, weisfeiler-leman-amr-metrics: source code to run metric experiments for chapters 4, 5, and 6.
amr-argument-sim: source code to run experiments for chapter 8, "Exploring argumentation with MR metrics".
bamboo-amr-benchmark: benchmark for testing and developing metrics (chapter 5).
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
We introduce the “Shopping Queries Data Set”, a large dataset of difficult search queries, released with the aim of fostering research in the area of semantic matching of queries and products. For each query, the dataset provides a list of up to 40 potentially relevant results, together with ESCI relevance judgements (Exact, Substitute, Complement, Irrelevant) indicating the relevance of the product to the query. Each query-product pair is accompanied by additional information. The dataset is multilingual, as it contains queries in English, Japanese, and Spanish.
The primary objective of releasing this dataset is to create a benchmark for building new ranking strategies and simultaneously identifying interesting categories of results (i.e., substitutes) that can be used to improve the customer experience when searching for products. The three different tasks that are studied in the literature (see https://amazonkddcup.github.io/) using this Shopping Queries Dataset are:
Task 1 - Query-Product Ranking: Given a user specified query and a list of matched products, the goal of this task is to rank the products so that the relevant products are ranked above the non-relevant ones.
Task 2 - Multi-class Product Classification: Given a query and a result list of products retrieved for this query, the goal of this task is to classify each product as being an Exact, Substitute, Complement, or Irrelevant match for the query.
Task 3 - Product Substitute Identification: This task will measure the ability of the systems to identify the substitute products in the list of results for a given query.
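As an illustration of how the ESCI labels feed the three tasks above, here is a small sketch of per-task targets. The label spelling and the Task 1 gain values are assumptions (the gains are a common convention for graded ranking metrics, not prescribed by the dataset).

```python
# Illustrative per-task targets derived from the ESCI label of a query-product
# pair. The label encoding ("exact" vs. "E") is an assumption; check the data
# for the exact spelling.
LABELS = ["exact", "substitute", "complement", "irrelevant"]

def task2_target(esci_label: str) -> int:
    """Task 2: four-way E/S/C/I classification target."""
    return LABELS.index(esci_label)

def task3_target(esci_label: str) -> int:
    """Task 3: binary substitute-identification target."""
    return int(esci_label == "substitute")

# Task 1: graded relevance gains for ranking metrics such as NDCG.
# These particular values are a common convention, not fixed by the dataset.
GAINS = {"exact": 1.0, "substitute": 0.1, "complement": 0.01, "irrelevant": 0.0}
```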
We provide two different versions of the data set: a reduced version for Task 1, and a larger version for Tasks 2 and 3.
The training data set contains a list of query-result pairs with annotated E/S/C/I labels. The data is multilingual, with queries in English, Japanese, and Spanish. The examples in the data set have the following fields: example_id, query, query_id, product_id, product_locale, esci_label, small_version, large_version, split, product_title, product_description, product_bullet_point, product_brand, product_color, and source.
The Shopping Queries Data Set is a large-scale manually annotated data set composed of challenging customer queries.
There are two versions of the dataset. The reduced version contains 48,300 unique queries and 1,118,011 rows, each corresponding to a query-product pair.
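A sketch of selecting the two versions, assuming the examples are available as a single table with the fields listed above; the file name is a placeholder, and small_version/large_version are assumed to be 0/1 membership flags.

```python
import pandas as pd

# Placeholder file name; the table carries the fields listed above.
df = pd.read_parquet("shopping_queries_examples.parquet")

# Task 1 uses the reduced version; Tasks 2 and 3 use the larger one.
task1_train = df[(df["small_version"] == 1) & (df["split"] == "train")]
tasks23_train = df[(df["large_version"] == 1) & (df["split"] == "train")]

print(task1_train["query_id"].nunique(), "unique queries in the reduced train split")
```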
Text from CORD-19 dataset (April 2020) was segmented into sentences and annotated with entity span markers using SciSpacy (english, medium), then linked to UMLS concepts using the SciSpacy + UMLS integration (UMLSKnowledgeBase). This linking is noisy, i.e., a span can link with multiple UMLS concepts. We filtered for sentences where there is no duplicate linkage, and reduced them to sequences of UMLS semantic type codes, then computed Transition and Emission probabilities for consecutive semantic code pairs across the corpus of selected sentences.
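The transition probabilities described above reduce to bigram counting over the semantic-type code sequences; here is a minimal sketch with illustrative codes (emission probabilities would be computed analogously from code/token pairs).

```python
from collections import Counter, defaultdict

# Each sentence reduced to its sequence of UMLS semantic type codes (illustrative).
sequences = [
    ["T047", "T121", "T047"],  # e.g. Disease, Pharmacologic Substance, Disease
    ["T047", "T047", "T061"],
]

# Count consecutive code pairs across the corpus.
transitions = defaultdict(Counter)
for seq in sequences:
    for prev, curr in zip(seq, seq[1:]):
        transitions[prev][curr] += 1

# Normalise counts into transition probabilities P(curr | prev).
trans_prob = {
    prev: {curr: n / sum(counts.values()) for curr, n in counts.items()}
    for prev, counts in transitions.items()
}
print(trans_prob["T047"])  # each outgoing code from T047 gets probability 1/3
```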
Phrase in Context is a curated benchmark for phrase understanding and semantic search, consisting of three tasks of increasing difficulty: Phrase Similarity (PS), Phrase Retrieval (PR) and Phrase Sense Disambiguation (PSD). The datasets are annotated by 13 linguistic experts on Upwork and verified by two groups: ~1000 AMT crowdworkers and another set of 5 linguistic experts. PiC benchmark is distributed under CC-BY-NC 4.0.
MIT License: https://spdx.org/licenses/MIT.html
Code for the implementation and benchmark of NPCS, a Native Provenance Computation for SPARQL. The code in this dataset includes the implementation of the NPCS system, a middleware for SPARQL endpoints that rewrites queries into queries that annotate answers with provenance polynomials (i.e., how-provenance data). The translation rules implemented for the query rewriting can be seen in the paper. The dataset also contains scripts and services to automate query execution. We use GraphDB (version 10.2.0) and Stardog (version 9.1.0) for the SPARQL endpoints. Because of license restrictions, these software products cannot be included in this dataset and must be downloaded from the respective vendors. Also, the data must be loaded using the respective bulk loaders of GraphDB and Stardog. The datasets used in the experiments can be generated with the synthetic dataset generator of the WatDiv benchmark. The Wikidata dataset corresponds to the full RDF dump from May 22, 2023. Do not hesitate to contact the authors for any inquiries.
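For orientation, a minimal sketch of querying one of the SPARQL endpoints with SPARQLWrapper. The endpoint URL and query are placeholders; in the benchmark, the NPCS middleware would first rewrite the query so that answers carry provenance polynomials.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Placeholder endpoint: a local GraphDB repository loaded with WatDiv data.
# In the benchmark this would sit behind the NPCS middleware.
endpoint = SPARQLWrapper("http://localhost:7200/repositories/watdiv")
endpoint.setQuery("""
    SELECT ?s ?o WHERE { ?s <http://example.org/p> ?o } LIMIT 10
""")
endpoint.setReturnFormat(JSON)

for binding in endpoint.query().convert()["results"]["bindings"]:
    print(binding["s"]["value"], binding["o"]["value"])
```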
A comprehensive benchmark suite for testing and analyzing the performance of federated query processing strategies on semantic data. This benchmark is flexible enough to cover a wide range of semantic data application processing strategies and use cases, ranging from centralized processing over federation to pure Linked Data processing. You can find more information about the benchmark in: Google Code Archive: https://code.google.com/archive/p/fbench/ Michael Schmidt, Olaf Görlitz, Peter Haase, Günter Ladwig, Andreas Schwarte, Duc Thanh Tran. FedBench: A Benchmark Suite for Federated Semantic Data Query Processing. In The Semantic Web – ISWC 2011. Springer, 2011. DOI: 10.1007/978-3-642-25073-6_37
CC0 1.0: https://choosealicense.com/licenses/cc0-1.0/
Dataset Card
This dataset is a subset of Kaggle's The Movie Dataset that contains only the name, release year, and overview of every film in the original dataset that has that information complete. It is intended as a toy dataset for learning about embeddings in a workshop run by the AI Service Center Berlin-Brandenburg at the Hasso Plattner Institute. This dataset has a smaller version here.
Dataset Details
Dataset Description
The dataset has 44435… See the full description on the dataset page: https://huggingface.co/datasets/mt0rm0/movie_descriptors.
CC0 1.0: https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for arXiv Dataset
Dataset Summary
A dataset of 1.7 million arXiv articles for applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.
Supported Tasks and Leaderboards
[More Information Needed]
Languages
The language supported is English.
Dataset Structure
Data Instances
This dataset is a mirror of the original… See the full description on the dataset page: https://huggingface.co/datasets/shchoi1019/temp.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Many digital libraries recommend literature to their users considering the similarity between a query document and their repository. However, they often fail to distinguish what relationship makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph-Vectors, BERT, and XLNet, under different configurations (e.g., sequence length, vector concatenation scheme), including a Siamese architecture for the Transformer-based systems. We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations. Our results show vanilla BERT as the best performing system, with an F1-score of 0.93, which we manually examine to better understand its applicability to other domains. Our findings suggest that classifying semantic relations between documents is a solvable task and motivate the development of recommender systems based on the evaluated techniques. The discussions in this paper serve as first steps in the exploration of documents through SPARQL-like queries, such that one could find documents that are similar in one aspect but dissimilar in another.
Additional information can be found on GitHub.
The following data is supplemental to the experiments described in our research paper. The data consists of:
This package consists of the Dataset part.
Dataset
The Wikipedia article corpus is available in enwiki-20191101-pages-articles.weighted.10k.jsonl.bz2. The original data have been downloaded as an XML dump, and the corresponding articles were extracted as plain text with gensim.scripts.segment_wiki. The archive contains only articles that are available in the training or test data.
The actual dataset is provided as used in the stratified k-fold with k=4, in train_testdata_4folds.tar.gz.
├── 1
│   ├── test.csv
│   └── train.csv
├── 2
│   ├── test.csv
│   └── train.csv
├── 3
│   ├── test.csv
│   └── train.csv
└── 4
    ├── test.csv
    └── train.csv
4 directories, 8 files
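Given that layout, loading the four folds might look like the following sketch (assuming the fold directories sit in the current working directory).

```python
import pandas as pd

# Load the stratified 4-fold split shown above.
folds = []
for k in range(1, 5):
    train = pd.read_csv(f"{k}/train.csv")
    test = pd.read_csv(f"{k}/test.csv")
    folds.append((train, test))

# Report split sizes per fold.
for k, (train, test) in enumerate(folds, start=1):
    print(f"fold {k}: {len(train)} train rows, {len(test)} test rows")
```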
SECOLD contains structured source code facts from open source projects. It is developed to serve source code mining, search, and traceability research and tools by providing structural source code search over open source code on the Internet. It contains 1.5 billion facts extracted from more than 1 million source code files. SECOLD is connected to DBpedia, Freebase, and OpenCyc. It extracts fine-grained facts from source code at several levels (e.g. presentation, syntax, and semantics).
Dataset Card for Free-Law-Project/opinions-synthetic-query-512
This dataset is created from the opinions-metadata, and used for training the Free Law Project Semantic Search models, including Free-Law-Project/modernbert-embed-base_finetune_512.
Dataset Details
The dataset is curated by Free Law Project by selecting the train split from the opinions-metadata dataset. The dataset is created for finetuning encoder models for semantic search, with a 512-token context window. The… See the full description on the dataset page: https://huggingface.co/datasets/freelawproject/opinions-synthetic-query-512.
Webpage: https://ogb.stanford.edu/docs/graphprop/#ogbg-code
```python
from torch_geometric.data import DataLoader
from ogb.graphproppred import PygGraphPropPredDataset

# Download/load the dataset and recover the official project split.
dataset = PygGraphPropPredDataset(name='ogbg-code', root='/kaggle/input')
split_idx = dataset.get_idx_split()

batch_size = 32
train_loader = DataLoader(dataset[split_idx['train']], batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(dataset[split_idx['valid']], batch_size=batch_size, shuffle=False)
test_loader = DataLoader(dataset[split_idx['test']], batch_size=batch_size, shuffle=False)
```
Graph: The ogbg-code dataset is a collection of Abstract Syntax Trees (ASTs) obtained from approximately 450 thousand Python method definitions. Methods are extracted from a total of 13,587 different repositories across the most popular projects on GitHub. The collection of Python methods originates from GitHub CodeSearchNet, a collection of datasets and benchmarks for machine-learning-based code retrieval. In ogbg-code, the dataset authors contribute an additional feature extraction step, which includes AST edges, AST nodes, and tokenized method names. Altogether, ogbg-code allows you to capture source code with its underlying graph structure, beyond its token sequence representation.
Prediction task: The task is to predict the sub-tokens forming the method name, given the Python method body represented by its AST and node features. This task is often referred to as “code summarization”, because the model is trained to find a succinct and precise description (i.e., the method name chosen by the developer) for a complete logical unit (i.e., the method body). Code summarization is a representative task in the field of machine learning for code, not only for its straightforward adoption in developer tools, but also because it is a proxy measure for assessing how well a model captures the code's semantics [1]. Following [2,3], the dataset authors use an F1 score to evaluate predicted sub-tokens against ground-truth sub-tokens.
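A sketch of the sub-token F1 computation follows; the official OGB Evaluator is the reference implementation, so treat the tokenisation and duplicate handling here as illustrative.

```python
def subtoken_f1(pred, target):
    """F1 between predicted and ground-truth method-name sub-tokens.

    Illustrative only: reported numbers should come from the official
    ogb.graphproppred Evaluator.
    """
    pred, target = set(pred), set(target)
    if not pred or not target:
        return 0.0
    tp = len(pred & target)            # correctly predicted sub-tokens
    precision = tp / len(pred)
    recall = tp / len(target)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(subtoken_f1(["get", "user", "name"], ["get", "user", "id"]))  # 0.666...
```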
Dataset splitting: The dataset authors adopt a project split [4], where the ASTs for the train set are obtained from GitHub projects that do not appear in the validation and test sets. This split respects the practical scenario of training a model on a large collection of source code (obtained, for instance, from the popular GitHub projects), and then using it to predict method names on a separate code base. The project split stress-tests the model’s ability to capture code’s semantics, and prevents a model from achieving a high test score by trivially memorizing the idiosyncrasies of the training projects (such as the naming conventions and coding style of a specific developer).
| Package | #Graphs | #Nodes per Graph | #Edges per Graph | Split Type | Task Type | Metric |
|---|---|---|---|---|---|---|
| ogb>=1.2.0 | 452,741 | 125.2 | 124.2 | Project | Sub-token prediction | F1 score |
Website: https://ogb.stanford.edu
The Open Graph Benchmark (OGB) [5] is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner.
[1] Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. A survey of machine learning for big code and naturalness. ACM Computing Surveys, 51(4):1–37, 2018.
[2] Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. code2seq: Generating sequences from structured representations of code. arXiv preprint arXiv:1808.01400, 2018.
[3] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages, 3(POPL):1–29, 2019.
[4] Miltiadis Allamanis. The adverse effects of code duplication in machine learning models of code. Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, pp. 143–153, 2019.
[5] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open Graph Benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, pp. 22118–22133, 2020.
I am NOT the author of this dataset. It was downloaded from its official website. I assume no responsibility or liability for the content in this dataset. Any questions, problems or issues, please contact the original authors at their website or their GitHub repo.
DataONE (https://www.dataone.org) is a federation of institutions involved with the earth and environmental sciences that share data through common cyberinfrastructure. In 2016, the DataONE project carried out a quantification of the utility of semantic query, by measuring the precision and recall of relevant datasets available through that catalog. Precision is defined as the proportion of relevant data in the retrieved results, and recall is the proportion of relevant data retrieved, compared to all relevant data present in the repository (see Methods).
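Expressed over sets of dataset identifiers, the two measures defined above are straightforward; a minimal sketch:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved datasets that are relevant.
    Recall: fraction of all relevant datasets that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example: 4 of 8 returned datasets are relevant, out of 5 relevant overall.
print(precision_recall(range(8), [0, 1, 2, 3, 9]))  # (0.5, 0.8)
```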
This dataset contains the queries and results of that study. Four data tables are included. First, a table of the 10 queries, which were formatted in several ways, including natural language and text strings (for plain text searches of various parts of metadata), and URIs for measurements in the EcoSystem Ontology (ECSO). A second table contains 994 relevant datasets in the DataONE catalog, with a column for each of the ten queries and a boolean value indicating whether the dataset is a match for that query. Two query results tables are included, for the raw and summarized results of the query tests. A fifth entity contains the zipped code (R language) used to perform the queries in the DataONE system.
When run against approximately 1000 datasets (in October 2016), results for the ten queries ranged from 0–50% (precision) and 0–100% (recall), indicating that traditional searches may sometimes be adequate to return all relevant data in a corpus, but results can be erratic and inconsistent, with potentially large returns of irrelevant data in the result set. When querying through semantic classes, precision and recall were much higher and more consistent (90–100% and 75–100%, respectively).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains results of various metric tests performed with the SPARQL query engine nLDE (the network of Linked Data Eddies) in different configurations. The queries themselves are available via the nLDE website, and the tests are explained in depth in the associated publication. To compute the diefficiency metrics dief@t and dief@k, we need the answer trace produced by the SPARQL query engines when executing queries. Answer traces record the exact point in time when an engine produces an answer while executing a query. We executed SPARQL queries using three different configurations of the nLDE engine: Selective, NotAdaptive, and Random. The resulting answer trace for each query execution is stored in the CSV file nLDEBenchmark1AnswerTrace.csv. The structure of this file is as follows:
query: id of the query executed. Example: 'Q9.sparql'
approach: name of the approach (or engine) used to execute the query.
tuple: the value i indicates that this row corresponds to the i-th answer produced by approach when executing query.
time: elapsed time (in seconds) since approach started the execution of query until answer i is produced.
In addition, to compare the performance of the nLDE engine using the metrics dief@t and dief@k against conventional metrics used in the query processing literature (execution time, time for the first tuple, and number of answers produced), we measured the performance of the nLDE engine using those conventional metrics. The results are available in the CSV file nLDEBenchmark1Metrics.csv. The structure of this CSV file is as follows:
query: id of the query executed. Example: 'Q9.sparql'
approach: name of the approach (or engine) used to execute the query.
tfft: time (in seconds) required by approach to produce the first tuple when executing query.
totaltime: elapsed time (in seconds) since approach started the execution of query until the last answer of query is produced.
comp: number of answers produced by approach when executing query.
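To show how the answer trace feeds the diefficiency metrics, here is a sketch computing dief@t for one query/approach pair as the area under the answers-versus-time curve up to time t. This uses a trapezoidal approximation; the authors' own tooling is the reference implementation.

```python
import numpy as np
import pandas as pd

# Answer trace with columns: query, approach, tuple, time (as described above).
trace = pd.read_csv("nLDEBenchmark1AnswerTrace.csv")

def dief_at_t(trace, query, approach, t):
    """dief@t: area under the answers-vs-time curve up to time t (sketch)."""
    run = trace[(trace["query"] == query) & (trace["approach"] == approach)]
    run = run[run["time"] <= t].sort_values("time")
    # 'tuple' is the cumulative answer count at each recorded time point.
    return np.trapz(run["tuple"], run["time"])

print(dief_at_t(trace, "Q9.sparql", "Selective", t=10.0))
```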
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary of search process from different digital libraries.