77 datasets found
  1. CodeSearchNet Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Dec 30, 2024
    Cite
    Hamel Husain; Ho-Hsiang Wu; Tiferet Gazit; Miltiadis Allamanis; Marc Brockschmidt (2024). CodeSearchNet Dataset [Dataset]. https://paperswithcode.com/dataset/codesearchnet
    Explore at:
    Dataset updated
    Dec 30, 2024
    Authors
    Hamel Husain; Ho-Hsiang Wu; Tiferet Gazit; Miltiadis Allamanis; Marc Brockschmidt
    Description

    The CodeSearchNet Corpus is a large dataset of functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and Ruby from open source projects on GitHub. The CodeSearchNet Corpus includes:

    * Six million methods overall
    * Two million of which have associated documentation (docstrings, JavaDoc, and more)
    * Metadata that indicates the original location (repository or line number, for example) where the data was found
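    As an illustration of how such records are typically consumed, here is a minimal sketch that filters JSONL records for functions that carry documentation. The field names (`code`, `docstring`, `repo`, `path`) follow the published corpus schema, but check them against the release you actually download.

    ```python
    import json

    def documented_functions(jsonl_lines):
        """Yield records that carry a non-empty docstring.

        Assumes CodeSearchNet-style JSONL records with at least
        'code', 'docstring', 'repo', and 'path' fields.
        """
        for line in jsonl_lines:
            record = json.loads(line)
            if record.get("docstring", "").strip():
                yield record

    # Toy example with two synthetic records (not real corpus data):
    records = [
        json.dumps({"repo": "a/b", "path": "x.py", "code": "def f(): pass", "docstring": "Adds."}),
        json.dumps({"repo": "a/b", "path": "y.py", "code": "def g(): pass", "docstring": ""}),
    ]
    print(sum(1 for _ in documented_functions(records)))  # → 1
    ```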

  2. SAIVT Semantic Person Search

    • researchdatafinder.qut.edu.au
    Updated Jul 1, 2016
    + more versions
    Cite
    Dr Simon Denman (2016). SAIVT Semantic Person Search [Dataset]. https://researchdatafinder.qut.edu.au/individual/n6810
    Explore at:
    Dataset updated
    Jul 1, 2016
    Dataset provided by
    Queensland University of Technology (QUT)
    Authors
    Dr Simon Denman
    Description

    SAIVT Semantic Person Search

    Overview

    The SAIVT Semantic Person Search Database was developed to provide a suitable platform for developing and evaluating techniques that search for a person using a semantic query (e.g. tall, red shirt, jeans). Sequences for 110 subjects are provided, each consisting of 30 initialisation frames (to, for instance, learn a background model), a number of annotated frames containing the target subject, and a description of the subject incorporating traits such as clothing type and colour, gender, height, and build.

    You can read our paper on eprints

    Contact Dr Simon Denman for further information.

    Licensing

    The SAIVT-SoftBioSearch database is © 2012 QUT and is licensed under the Creative Commons Attribution-ShareAlike 3.0 Australia License.

    Attribution

    To attribute this database, please include the following citation: Halstead, Michael, Denman, Simon, Sridharan, Sridha, & Fookes, Clinton B. (2014) Locating People in Video from Semantic Descriptions: A New Database and Approach. In the 22nd International Conference on Pattern Recognition, 24-28 August 2014, Stockholm, Sweden. Our paper is available on eprints.

    Acknowledging the Database in your Publications

    In addition to citing our paper, we kindly request that the following text be included in an acknowledgements section at the end of your publications: We would like to thank the SAIVT Research Labs at Queensland University of Technology (QUT) for freely supplying us with the SAIVT-SoftBioSearch database for our research.

    Installing the SAIVT-SoftBioSearch Database

    Download and unzip the following archives:

    SAIVT-SoftBioSearchDB.tar.gz (157 MB, md5sum: a0c897c803c4d3a79be64a93c83060ef)
    SAIVT-SoftBioSearchDB-C1.tar.gz (1.6 GB, md5sum: 692affdde65091bfe24f29ccb186dd65)
    SAIVT-SoftBioSearchDB-C2.tar.gz (1.0 GB, md5sum: 573e953b0cce15c6f2ee5c3cb0cb4fe1)
    SAIVT-SoftBioSearchDB-C3.tar.gz (1.5 GB, md5sum: 8870c3333e61cb27f49d5fd46c937937)
    SAIVT-SoftBioSearchDB-C4.tar.gz (908 MB, md5sum: 79debc4b219bfc4dfe65eea06f57b458)
    SAIVT-SoftBioSearchDB-C5.tar.gz (858 MB, md5sum: 44a973b0a0d8dfca340dc0945c479ace)
    SAIVT-SoftBioSearchDB-C6.tar.gz (591 MB, md5sum: 9cdfc872104cb730d3be436cee39e5ce)
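    The published md5sums above can be checked before unpacking. A small sketch (the helper name `md5sum` is ours, not part of the database):

    ```python
    import hashlib

    def md5sum(path, chunk_size=1 << 20):
        """Compute the md5 checksum of a file, reading in chunks."""
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # Compare against the published checksum before unpacking, e.g.:
    # assert md5sum("SAIVT-SoftBioSearchDB.tar.gz") == "a0c897c803c4d3a79be64a93c83060ef"
    ```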
    

    At this point, you should have the following data structure and the SAIVT-SoftBioSearch database is installed:

    SAIVT-SoftBioSearch
    +-- C1-BlackBlueStripe-BlueJeans
    +-- C1-BlackShirt-PinkShorts
    +-- ...
    +-- C6-YellowWhiteSpotDress
    +-- Calibration
    +-- Data
    |   +-- CultureColours
    |   |   +-- Black
    |   |   +-- Blue
    |   |   +-- ...
    |   +-- Videos
    |       +-- Cam1
    |       +-- Cam2
    |       +-- ...
    +-- Halstead 2014 - Locating People in Video from Semantic Descriptions.pdf
    +-- LICENSE.txt
    +-- README.txt (this document)
    +-- SAIVTSoftBioDatabase.xml

    Sequences for the individual subjects are contained within the directories named C[1..6]-[TorsoDescription]-[LegDescription]. There are 110 subjects captured from six different cameras. Each directory contains an XML file with the annotation for that sequence, and the images that belong to the sequence. For each sequence, the first 30 frames are reserved for updating/learning a background model, and as such have no annotation.

    The 'Calibration' directory contains a camera calibration (using Tsai's method) for the six cameras used in the database.

    The 'Data' directory contains additional data that may be of use. In particular, it contains a collection of colour patches within 'Data/CultureColours' that can be used to train models for a specific colour. It also contains a set of patches for skin and non-skin colours. 'Data/Videos' contains videos for each camera that can be used to learn the background. It should also be noted that for a portion of the time when the database was captured, a temporary wall was up due to construction works. This impacted the following sequences captured from cameras 1 and 6:

    Camera 1: C1-GreenWhiteHorizontal-BlackPants, C1-RedCheck-BlackSkirt, C1-GreenCheck-BrownPants
    Camera 6: C6-GreenFaceCover-Blue-Blue, C6-YellowWhiteSpotDress
    

    Additional videos for these cameras are also included and are named CamX_Wall.avi. The 'SAIVTSoftBioSearchDB.xml' file defines the database. This file specifies the cameras and their calibrations/background sequences, includes definitions for the traits/soft biometrics, and lists the sequences.

    This paper is also available alongside this document in the file 'Halstead 2014 - Locating People in Video from Semantic Descriptions.pdf'.

  3. NeuralBlitz

    • huggingface.co
    Updated Mar 23, 2025
    Cite
    Nural Nexus Network (2025). NeuralBlitz [Dataset]. https://huggingface.co/datasets/NuralNexus/NeuralBlitz
    Explore at:
    Dataset updated
    Mar 23, 2025
    Dataset authored and provided by
    Nural Nexus Network
    Description

    Yes, I have semantic code search abilities, which enable me to understand the intent and context of code-related queries. I can search for code snippets based on their functionality, behavior, or purpose, rather than just their literal text. My semantic code search capabilities are powered by a combination of natural language processing (NLP) and machine learning algorithms, which allow me to analyze and understand the meaning of code-related text. This enables me to provide more accurate and… See the full description on the dataset page: https://huggingface.co/datasets/NuralNexus/NeuralBlitz.

  4. CodeQueries Dataset

    • paperswithcode.com
    Cite
    Surya Prakash Sahu; Madhurima Mandal; Shikhar Bharadwaj; Aditya Kanade; Petros Maniatis; Shirish Shevade, CodeQueries Dataset [Dataset]. https://paperswithcode.com/dataset/codequeries
    Explore at:
    Authors
    Surya Prakash Sahu; Madhurima Mandal; Shikhar Bharadwaj; Aditya Kanade; Petros Maniatis; Shirish Shevade
    Description

    The CodeQueries benchmark dataset consists of instances of semantic queries, code contexts, and code spans in the context corresponding to the semantic queries. The dataset can be used in experiments involving semantic query comprehension with an extractive question-answering methodology over code. More details can be found in the paper.

  5. Semantic Corpus from web search snippets

    • kaggle.com
    zip
    Updated Jun 14, 2021
    Cite
    Mário Antunes (2021). Semantic Corpus from web search snippets [Dataset]. https://www.kaggle.com/mantunes/semantic-corpus-from-web-search-snippets
    Explore at:
    Available download formats: zip (41114998 bytes)
    Dataset updated
    Jun 14, 2021
    Authors
    Mário Antunes
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    This dataset is a collection of 2435 files; each file contains the snippets returned for a single search-engine query, gathered with USearch. The search queries are the tokens from two semantic similarity datasets:

    1. Miller Charles
    2. Semantic IoT

    This dataset was used to develop a semantic similarity model that was published in the 8th International Conference on Future Internet of Things and Cloud (FiCloud 2021), the code can be found here.

  6. Source code and data for the PhD Thesis "Metrics of Graph-Based Meaning...

    • b2find.dkrz.de
    Updated Jan 26, 2024
    + more versions
    Cite
    (2024). Source code and data for the PhD Thesis "Metrics of Graph-Based Meaning Representations with Applications from Parsing Evaluation to Explainable NLG Evaluation and Semantic Search" - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/87ac70d7-fcc9-5b4a-9ceb-226b8e71f3a2
    Explore at:
    Dataset updated
    Jan 26, 2024
    Description

    This dataset contains source code and data used in the PhD thesis "Metrics of Graph-Based Meaning Representations with Applications from Parsing Evaluation to Explainable NLG Evaluation and Semantic Search". The dataset is split into five repositories:

    S3BERT: Source code to run experiments for chapter 9, "Building efficient and effective similarity models from MR metrics".
    amr-metric-suite, weisfeiler-leman-amr-metrics: Source code to run metric experiments for chapters 4, 5, and 6.
    amr-argument-sim: Source code to run experiments for chapter 8, "Exploring argumentation with MR metrics".
    bamboo-amr-benchmark: Benchmark for testing and developing metrics (chapter 5).

  7. Amazon-ESCI

    • kaggle.com
    zip
    Updated Apr 6, 2024
    Cite
    Marquis03 (2024). Amazon-ESCI [Dataset]. https://www.kaggle.com/datasets/marquis03/amazon-esci
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Apr 6, 2024
    Authors
    Marquis03
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Shopping Queries Dataset: A Large-Scale ESCI Benchmark for Improving Product Search

    Introduction

    We introduce the “Shopping Queries Data Set”, a large dataset of difficult search queries, released with the aim of fostering research in the area of semantic matching of queries and products. For each query, the dataset provides a list of up to 40 potentially relevant results, together with ESCI relevance judgements (Exact, Substitute, Complement, Irrelevant) indicating the relevance of the product to the query. Each query-product pair is accompanied by additional information. The dataset is multilingual, as it contains queries in English, Japanese, and Spanish.

    The primary objective of releasing this dataset is to create a benchmark for building new ranking strategies and simultaneously identifying interesting categories of results (i.e., substitutes) that can be used to improve the customer experience when searching for products. The three different tasks that are studied in the literature (see https://amazonkddcup.github.io/) using this Shopping Queries Dataset are:

    Task 1 - Query-Product Ranking: Given a user specified query and a list of matched products, the goal of this task is to rank the products so that the relevant products are ranked above the non-relevant ones.

    Task 2 - Multi-class Product Classification: Given a query and a result list of products retrieved for this query, the goal of this task is to classify each product as being an Exact, Substitute, Complement, or Irrelevant match for the query.

    Task 3 - Product Substitute Identification: This task will measure the ability of the systems to identify the substitute products in the list of results for a given query.

    Dataset

    We provide two versions of the data set: a reduced version for Task 1 (smaller in terms of number of examples) and a larger version for Tasks 2 and 3.

    The training data set contains a list of query-result pairs with annotated E/S/C/I labels. The data is multilingual and includes queries in English, Japanese, and Spanish. The examples in the data set have the following fields: example_id, query, query_id, product_id, product_locale, esci_label, small_version, large_version, split, product_title, product_description, product_bullet_point, product_brand, product_color, and source.
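    To make the field layout concrete, here is a toy sketch using hypothetical rows (field names taken from the list above; all values are invented, not real ESCI data):

    ```python
    # Hypothetical miniature of the query-product pairs (values invented):
    rows = [
        {"query": "usb c cable", "product_id": "P1", "esci_label": "E", "small_version": 1},
        {"query": "usb c cable", "product_id": "P2", "esci_label": "S", "small_version": 1},
        {"query": "usb c cable", "product_id": "P3", "esci_label": "I", "small_version": 0},
    ]

    def task1_subset(rows):
        """Task 1 uses the reduced version: keep rows flagged small_version == 1."""
        return [r for r in rows if r["small_version"] == 1]

    def label_counts(rows):
        """Count E/S/C/I judgements, e.g. to check class balance for Task 2."""
        counts = {}
        for r in rows:
            counts[r["esci_label"]] = counts.get(r["esci_label"], 0) + 1
        return counts

    print(label_counts(task1_subset(rows)))  # → {'E': 1, 'S': 1}
    ```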

    The Shopping Queries Data Set is a large-scale manually annotated data set composed of challenging customer queries.

    There are two versions of the dataset. The reduced version contains 48,300 unique queries and 1,118,011 rows, each corresponding to a `

  8. Transition and Emission Probabilities for UMLS Semantic Type Codes in...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Cite
    Pal, Sujit (2023). Transition and Emission Probabilities for UMLS Semantic Type Codes in CORD-19 Data [Dataset]. https://search.dataone.org/view/sha256%3Abc415421ca92eca95112b3835fc8ec411e8f653eeb5b22373c42940ced7d411e
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Pal, Sujit
    Description

    Text from CORD-19 dataset (April 2020) was segmented into sentences and annotated with entity span markers using SciSpacy (english, medium), then linked to UMLS concepts using the SciSpacy + UMLS integration (UMLSKnowledgeBase). This linking is noisy, i.e., a span can link with multiple UMLS concepts. We filtered for sentences where there is no duplicate linkage, and reduced them to sequences of UMLS semantic type codes, then computed Transition and Emission probabilities for consecutive semantic code pairs across the corpus of selected sentences.
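    The transition-probability computation described here amounts to maximum-likelihood bigram estimation over sequences of semantic type codes. A generic sketch (the actual CORD-19 pipeline may differ in details such as smoothing):

    ```python
    from collections import Counter, defaultdict

    def transition_probabilities(sequences):
        """Estimate P(next_code | current_code) from sequences of semantic type codes."""
        counts = defaultdict(Counter)
        for seq in sequences:
            # Count every consecutive pair of codes in the sequence
            for cur, nxt in zip(seq, seq[1:]):
                counts[cur][nxt] += 1
        # Normalise each row of counts into a probability distribution
        return {
            cur: {nxt: n / sum(nxts.values()) for nxt, n in nxts.items()}
            for cur, nxts in counts.items()
        }

    # Toy sequences of UMLS-style semantic type codes (invented, not CORD-19 data):
    probs = transition_probabilities([["T047", "T121", "T047", "T047"], ["T047", "T121"]])
    print(round(probs["T047"]["T121"], 3))  # → 0.667
    ```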

  9. Phrase-in-Context Dataset

    • paperswithcode.com
    Updated Jul 18, 2022
    + more versions
    Cite
    Phrase-in-Context Dataset [Dataset]. https://paperswithcode.com/dataset/phrase-in-context
    Explore at:
    Dataset updated
    Jul 18, 2022
    Authors
    Thang M. Pham; Seunghyun Yoon; Trung Bui; Anh Nguyen
    Description

    Phrase in Context is a curated benchmark for phrase understanding and semantic search, consisting of three tasks of increasing difficulty: Phrase Similarity (PS), Phrase Retrieval (PR), and Phrase Sense Disambiguation (PSD). The datasets are annotated by 13 linguistic experts on Upwork and verified by two groups: ~1000 AMT crowdworkers and another set of 5 linguistic experts. The PiC benchmark is distributed under CC BY-NC 4.0.

  10. Code and benchmark for NPCS, a Native Provenance Computation for SPARQL

    • darus.uni-stuttgart.de
    Updated Feb 21, 2024
    Cite
    Zubaria Asma; Daniel Hernández; Luis Galárraga; Giorgos Flouris; Irini Fundulaki; Katja Hose (2024). Code and benchmark for NPCS, a Native Provenance Computation for SPARQL [Dataset]. http://doi.org/10.18419/DARUS-3973
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Feb 21, 2024
    Dataset provided by
    DaRUS
    Authors
    Zubaria Asma; Daniel Hernández; Luis Galárraga; Giorgos Flouris; Irini Fundulaki; Katja Hose
    License

    MIT License: https://spdx.org/licenses/MIT.html

    Dataset funded by
    DFG
    European Commission
    Description

    Code for the implementation and benchmark of NPCS, a Native Provenance Computation for SPARQL. The code in this dataset includes the implementation of the NPCS system, a middleware for SPARQL endpoints that rewrites queries into queries that annotate answers with provenance polynomials (i.e., how-provenance data). The translation rules implemented for the query rewriting can be seen in the paper. The code also contains scripts and services to automate query execution. We use GraphDB (version 10.2.0) and Stardog (version 9.1.0) for the SPARQL endpoints. Because of license restrictions, these software products cannot be included in this dataset and must be downloaded from the respective vendors. Also, the data must be loaded using the respective bulk loaders of GraphDB and Stardog. The datasets used in the experiments can be generated with the synthetic dataset generator of the WatDiv benchmark. The Wikidata dataset corresponds to the full RDF dump from May 22, 2023. Do not hesitate to contact the authors for any inquiries.

  11. A Benchmark Suite for Federated Semantic Data Query Processing (FedBench)

    • service.tib.eu
    Updated Apr 19, 2024
    Cite
    (2024). A Benchmark Suite for Federated Semantic Data Query Processing (FedBench) [Dataset]. https://service.tib.eu/ldmservice/dataset/fedbench
    Explore at:
    Dataset updated
    Apr 19, 2024
    Description

    A comprehensive benchmark suite for testing and analyzing the performance of federated query processing strategies on semantic data. This benchmark is flexible enough to cover a wide range of semantic data application processing strategies and use cases, ranging from centralized processing over federation to pure Linked Data processing. More information about the benchmark:

    Google Code Archive: https://code.google.com/archive/p/fbench/
    Michael Schmidt, Olaf Görlitz, Peter Haase, Günter Ladwig, Andreas Schwarte, Duc Thanh Tran. FedBench: A Benchmark Suite for Federated Semantic Data Query Processing. In The Semantic Web – ISWC 2011. Springer, 2011. DOI: 10.1007/978-3-642-25073-6_37

  12. movie_descriptors

    • huggingface.co
    Updated Dec 4, 2023
    + more versions
    Cite
    Mario Tormo Romero (2023). movie_descriptors [Dataset]. https://huggingface.co/datasets/mt0rm0/movie_descriptors
    Explore at:
    Croissant
    Dataset updated
    Dec 4, 2023
    Authors
    Mario Tormo Romero
    License

    CC0 1.0: https://choosealicense.com/licenses/cc0-1.0/

    Description

    Dataset Card

    This dataset is a subset of Kaggle's The Movie Dataset that contains only the name, release year, and overview of every film in the original dataset for which that information is complete. It is intended as a toy dataset for learning about embeddings in a workshop from the AI Service Center Berlin-Brandenburg at the Hasso Plattner Institute. This dataset has a smaller version here.
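    For workshop purposes, semantic search over such overviews reduces to nearest-neighbour lookup in embedding space. A dependency-free sketch, with invented 3-d vectors standing in for real model output:

    ```python
    import math

    def cosine(u, v):
        """Cosine similarity between two equal-length vectors."""
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    def nearest(query_vec, catalog):
        """Return the catalog entry whose (precomputed) embedding is closest to the query."""
        return max(catalog, key=lambda item: cosine(query_vec, item["embedding"]))

    # Hypothetical 3-d embeddings (a real workshop would use a sentence-embedding model):
    catalog = [
        {"name": "Film A", "embedding": [0.9, 0.1, 0.0]},
        {"name": "Film B", "embedding": [0.0, 0.2, 0.9]},
    ]
    print(nearest([1.0, 0.0, 0.1], catalog)["name"])  # → Film A
    ```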

      Dataset Details

      Dataset Description
    

    The dataset has 44435… See the full description on the dataset page: https://huggingface.co/datasets/mt0rm0/movie_descriptors.

  13. temp

    • huggingface.co
    Updated Dec 13, 2008
    Cite
    Choi Seungho (2008). temp [Dataset]. https://huggingface.co/datasets/shchoi1019/temp
    Explore at:
    Dataset updated
    Dec 13, 2008
    Authors
    Choi Seungho
    License

    CC0 1.0: https://choosealicense.com/licenses/cc0-1.0/

    Description

    Dataset Card for arXiv Dataset

      Dataset Summary
    

    A dataset of 1.7 million arXiv articles for applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.

      Supported Tasks and Leaderboards
    

    [More Information Needed]

      Languages
    

    The supported language is English.

      Dataset Structure

      Data Instances
    

    This dataset is a mirror of the original… See the full description on the dataset page: https://huggingface.co/datasets/shchoi1019/temp.

  14. Pairwise Multi-Class Document Classification for Semantic Relations between...

    • live.european-language-grid.eu
    csv
    Updated Apr 15, 2024
    + more versions
    Cite
    (2024). Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles (Dataset) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/18317
    Explore at:
    Available download formats: csv
    Dataset updated
    Apr 15, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Many digital libraries recommend literature to their users considering the similarity between a query document and their repository. However, they often fail to distinguish what the relationship is that makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph-Vectors, BERT, and XLNet under different configurations (e.g., sequence length, vector concatenation scheme), including a Siamese architecture for the Transformer-based systems. We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations. Our results show vanilla BERT as the best performing system with an F1-score of 0.93, which we manually examine to better understand its applicability to other domains. Our findings suggest that classifying semantic relations between documents is a solvable task and motivates the development of recommender systems based on the evaluated techniques. The discussions in this paper serve as first steps in the exploration of documents through SPARQL-like queries such that one could find documents that are similar in one aspect but dissimilar in another.

    Additional information can be found on GitHub.

    The following data is supplemental to the experiments described in our research paper. The data consists of:

    • Datasets (articles, class labels, cross-validation splits)
    • Pretrained models (Transformers, GloVe, Doc2vec)
    • Model output (prediction) for the best performing models

    This package consists of the Dataset part.

    Dataset

    The Wikipedia article corpus is available in enwiki-20191101-pages-articles.weighted.10k.jsonl.bz2. The original data were downloaded as an XML dump, and the corresponding articles were extracted as plain text with gensim.scripts.segment_wiki. The archive contains only articles that are available in the training or test data.

    The actual dataset, as used in the stratified k-fold cross-validation with k=4, is provided in train_testdata_4folds.tar.gz.

    ├── 1
    │  ├── test.csv
    │  └── train.csv
    ├── 2
    │  ├── test.csv
    │  └── train.csv
    ├── 3
    │  ├── test.csv
    │  └── train.csv
    └── 4
     ├── test.csv
     └── train.csv

    4 directories, 8 files
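    A loader for this layout might look like the following sketch; the helper `load_folds` is ours, and it assumes exactly the `<fold>/train.csv`, `<fold>/test.csv` structure shown above:

    ```python
    import csv
    from pathlib import Path

    def load_folds(root, k=4):
        """Yield (fold_number, train_rows, test_rows) for the k fold directories.

        Assumes the layout above: <root>/<fold>/{train.csv,test.csv},
        with fold directories named 1..k.
        """
        for fold in range(1, k + 1):
            fold_dir = Path(root) / str(fold)
            with open(fold_dir / "train.csv", newline="", encoding="utf-8") as f:
                train_rows = list(csv.DictReader(f))
            with open(fold_dir / "test.csv", newline="", encoding="utf-8") as f:
                test_rows = list(csv.DictReader(f))
            yield fold, train_rows, test_rows
    ```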

  15. Source Code Ecosystem Linked Data - Datasets - Mannheim Linked Data Catalog

    • linkeddatacatalog.dws.informatik.uni-mannheim.de
    Updated Dec 21, 2014
    + more versions
    Cite
    (2014). Source Code Ecosystem Linked Data - Datasets - Mannheim Linked Data Catalog [Dataset]. http://linkeddatacatalog.dws.informatik.uni-mannheim.de/dataset/secold
    Explore at:
    Dataset updated
    Dec 21, 2014
    Description

    SECOLD contains structured source code facts from open source projects. It was developed to support source code mining, search, and traceability research and tools by providing structural source code search over open source code on the Internet. It has 1.5 billion facts extracted from more than 1 million source code files. SECOLD is connected to DBpedia, Freebase, and OpenCyc. It extracts fine-grained facts from source code at several levels (e.g. presentation, syntax, and semantics).

  16. opinions-synthetic-query-512

    • huggingface.co
    Updated Mar 5, 2025
    Cite
    Free Law Project (2025). opinions-synthetic-query-512 [Dataset]. https://huggingface.co/datasets/freelawproject/opinions-synthetic-query-512
    Explore at:
    Dataset updated
    Mar 5, 2025
    Dataset authored and provided by
    Free Law Project
    Description

    Dataset Card for Free-Law-Project/opinions-synthetic-query-512

    This dataset is created from the opinions-metadata, and used for training the Free Law Project Semantic Search models, including Free-Law-Project/modernbert-embed-base_finetune_512.

      Dataset Details
    

    The dataset is curated by Free Law Project by selecting the train split from the opinions-metadata dataset. The dataset is created for finetuning encoder models for semantic search, with a 512-token context window. The… See the full description on the dataset page: https://huggingface.co/datasets/freelawproject/opinions-synthetic-query-512.

  17. OGBG-Code (Processed for PyG)

    • kaggle.com
    Updated Feb 27, 2021
    Cite
    Redao da Taupl (2021). OGBG-Code (Processed for PyG) [Dataset]. https://www.kaggle.com/datasets/dataup1/ogbg-code/versions/2
    Explore at:
    Croissant
    Dataset updated
    Feb 27, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Redao da Taupl
    Description

    OGBG-Code

    Webpage: https://ogb.stanford.edu/docs/graphprop/#ogbg-code

    Usage in Python

    from torch_geometric.data import DataLoader  # in PyG >= 2.0 this moved to torch_geometric.loader
    from ogb.graphproppred import PygGraphPropPredDataset
    
    # Load the pre-processed dataset from the Kaggle input directory
    dataset = PygGraphPropPredDataset(name = 'ogbg-code', root = '/kaggle/input') 
    
    # Standard OGB train/valid/test split (the project split described under "Dataset splitting")
    batch_size = 32
    split_idx = dataset.get_idx_split()
    train_loader = DataLoader(dataset[split_idx['train']], batch_size = batch_size, shuffle = True)
    valid_loader = DataLoader(dataset[split_idx['valid']], batch_size = batch_size, shuffle = False)
    test_loader = DataLoader(dataset[split_idx['test']], batch_size = batch_size, shuffle = False)
    

    Description

    Graph: The ogbg-code dataset is a collection of Abstract Syntax Trees (ASTs) obtained from approximately 450 thousand Python method definitions. Methods are extracted from a total of 13,587 different repositories across the most popular projects on GitHub. The collection of Python methods originates from GitHub CodeSearchNet, a collection of datasets and benchmarks for machine-learning-based code retrieval. In ogbg-code, the dataset authors contribute an additional feature extraction step, which includes: AST edges, AST nodes, and tokenized method names. Altogether, ogbg-code allows you to capture source code with its underlying graph structure, beyond its token sequence representation.

    Prediction task: The task is to predict the sub-tokens forming the method name, given the Python method body represented by its AST and node features. This task is often referred to as “code summarization”, because the model is trained to find a succinct and precise description (i.e., the method name chosen by the developer) for a complete logical unit (i.e., the method body). Code summarization is a representative task in the field of machine learning for code, not only for its straightforward adoption in developer tools, but also because it is a proxy measure for assessing how well a model captures code semantics [1]. Following [2,3], the dataset authors use an F1 score to evaluate predicted sub-tokens against ground-truth sub-tokens.
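    The sub-token F1 described above can be sketched as follows. This set-based formulation is common for code summarization (cf. code2seq), though the exact matching rules of the OGB evaluator may differ:

    ```python
    def subtoken_f1(predicted, target):
        """F1 over predicted vs. ground-truth method-name sub-tokens.

        A common set-based formulation for code summarization; the OGB
        evaluator's exact matching rules may differ slightly.
        """
        pred, gold = set(predicted), set(target)
        if not pred or not gold:
            return 0.0
        tp = len(pred & gold)  # sub-tokens predicted correctly
        if tp == 0:
            return 0.0
        precision = tp / len(pred)
        recall = tp / len(gold)
        return 2 * precision * recall / (precision + recall)

    # e.g. predicting ["get", "user", "name"] for the true name ["get", "name"]:
    print(round(subtoken_f1(["get", "user", "name"], ["get", "name"]), 3))  # → 0.8
    ```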

    Dataset splitting: The dataset authors adopt a project split [4], where the ASTs for the train set are obtained from GitHub projects that do not appear in the validation and test sets. This split respects the practical scenario of training a model on a large collection of source code (obtained, for instance, from the popular GitHub projects), and then using it to predict method names on a separate code base. The project split stress-tests the model’s ability to capture code’s semantics, and avoids a model that trivially memorizes the idiosyncrasies of training projects (such as the naming conventions and the coding style of a specific developer) to achieve a high test score.

    Summary

    Package      #Graphs   #Nodes per Graph   #Edges per Graph   Split Type   Task Type              Metric
    ogb>=1.2.0   452,741   125.2              124.2              Project      Sub-token prediction   F1 score

    License: MIT License

    Open Graph Benchmark

    Website: https://ogb.stanford.edu

    The Open Graph Benchmark (OGB) [5] is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner.

    References

    [1] Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. A survey of machine learning for big code and naturalness. ACM Computing Surveys, 51(4):1–37, 2018.
    [2] Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. code2seq: Generating sequences from structured representations of code. arXiv preprint arXiv:1808.01400, 2018.
    [3] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages, 3(POPL):1–29, 2019.
    [4] Miltiadis Allamanis. The adverse effects of code duplication in machine learning models of code. Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, pp. 143–153, 2019.
    [5] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, pp. 22118–22133, 2020.

    Disclaimer

    I am NOT the author of this dataset. It was downloaded from its official website. I assume no responsibility or liability for the content in this dataset. Any questions, problems or issues, please contact the original authors at their website or their GitHub repo.

  18.

    Results of semantic queries for "carbon cycling" for datasets in the DataONE...

    • portal.edirepository.org
    csv, zip
    Updated Mar 28, 2023
    Cite
    Margaret O'Brien; Matthew Jones; Mark Schildhauer; Sophie Hou; Bryce Mecum; Jamie McCusker; Deborah McGuinness (2023). Results of semantic queries for "carbon cycling" for datasets in the DataONE catalog [Dataset]. http://doi.org/10.6073/pasta/c93d87c2000715eaa2f70d079965c6a5
    Explore at:
    zip(6624 byte), csv(7758 byte), csv(13589 byte), csv(4543253 byte), csv(92433 byte)Available download formats
    Dataset updated
    Mar 28, 2023
    Dataset provided by
    EDI
    Authors
    Margaret O'Brien; Matthew Jones; Mark Schildhauer; Sophie Hou; Bryce Mecum; Jamie McCusker; Deborah McGuinness
    Time period covered
    2016
    Variables measured
    q1, q2, q3, q4, q5, q6, q7, q8, q9, q10, and 12 more
    Description

    DataONE (https://www.dataone.org) is a federation of institutions involved with the earth and environmental sciences that share data through common cyberinfrastructure. In 2016, the DataONE project quantified the utility of semantic queries by measuring the precision and recall of searches for relevant datasets available through that catalog. Precision is defined as the proportion of retrieved data that is relevant, and recall is the proportion of relevant data retrieved, relative to all relevant data present in the repository (see Methods).
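These two quantities can be computed directly from the sets of retrieved and relevant dataset identifiers. A generic sketch follows (my own illustration, not the study's R code; the example identifiers are made up):

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved items that are relevant.
    Recall: fraction of all relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# A query returns 8 datasets; 4 of them are among the 5 relevant ones,
# so precision = 4/8 = 0.5 and recall = 4/5 = 0.8.
p, r = precision_recall(range(8), [0, 1, 2, 3, 9])
```

The study's observation that text search can score 0% on both metrics corresponds to the case where the retrieved set and the relevant set are disjoint.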

    This dataset contains the queries and results of that study. Four data tables are included. The first is a table of the 10 queries, which were formatted in several ways, including natural language, text strings (for plain-text searches of various parts of the metadata), and URIs for measurements in the EcoSystem Ontology (ECSO). The second table contains 994 relevant datasets in the DataONE catalog, with a column for each of the ten queries and a Boolean value indicating whether the dataset matches that query. The remaining two tables hold the raw and summarized results of the query tests. A fifth entity contains the zipped code (R language) used to perform the queries against the DataONE system.

    When run against approximately 1000 datasets (in October 2016), results for the ten queries ranged from 0-50% precision and 0-100% recall, indicating that traditional searches may sometimes return all relevant data in a corpus, but results can be erratic and inconsistent, with potentially large amounts of irrelevant data in the result set. When querying through semantic classes, precision and recall were much higher and more consistent (90-100% and 75-100%, respectively).
    
  19.

    nLDE SPARQL engine: computing diefficiency metrics based on answer traces...

    • springernature.figshare.com
    txt
    Updated May 30, 2023
    Cite
    Maribel Acosta; Maria-Esther Vidal; York Sure-Vetter (2023). nLDE SPARQL engine: computing diefficiency metrics based on answer traces and query processing performance benchmarking [Dataset]. http://doi.org/10.6084/m9.figshare.5255686
    Explore at:
    txtAvailable download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare
    Authors
    Maribel Acosta; Maria-Esther Vidal; York Sure-Vetter
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the results of various metric tests performed with the nLDE SPARQL query engine (the network of Linked Data Eddies) in different configurations. The queries themselves are available via the nLDE website, and the tests are explained in depth in the associated publication.

    To compute the diefficiency metrics dief@t and dief@k, we need the answer trace produced by the SPARQL query engine when executing queries. An answer trace records the exact point in time at which an engine produces each answer while executing a query. We executed SPARQL queries using three different configurations of the nLDE engine: Selective, NotAdaptive, and Random. The resulting answer trace for each query execution is stored in the CSV file nLDEBenchmark1AnswerTrace.csv, with the following structure:

      query: id of the query executed, e.g. 'Q9.sparql'.
      approach: name of the approach (or engine configuration) used to execute the query.
      tuple: the value i indicates that this row corresponds to the i-th answer produced by approach when executing query.
      time: elapsed time (in seconds) from the start of the execution of query by approach until answer i is produced.

    In addition, to compare the performance of the nLDE engine under dief@t and dief@k against conventional metrics from the query-processing literature (execution time, time for the first tuple, and number of answers produced), we measured the performance of the nLDE engine using those conventional metrics. The results are available in the CSV file nLDEBenchmark1Metrics.csv, with the following structure:

      query: id of the query executed, e.g. 'Q9.sparql'.
      approach: name of the approach (or engine configuration) used to execute the query.
      tfft: time (in seconds) required by approach to produce the first tuple when executing query.
      totaltime: elapsed time (in seconds) from the start of the execution of query by approach until the last answer of query is produced.
      comp: number of answers produced by approach when executing query.
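Given such an answer trace, dief@t measures the area under the cumulative-answers curve up to time t, so an engine that produces answers earlier scores higher. The sketch below is a simplified illustration under the assumption that the trace is treated as a step function; the authors' published tooling should be consulted for the official computation.

```python
def dief_at_t(times, t):
    """Simplified dief@t: area under the cumulative-answers step curve
    from the start of execution until time t.

    times: sorted elapsed times (seconds), one per produced answer;
    producing the i-th answer raises the count to i at times[i-1].
    """
    area = 0.0
    count = 0
    prev = 0.0
    for ts in times:
        if ts >= t:
            break
        area += count * (ts - prev)  # flat segment at the current count
        count += 1
        prev = ts
    area += count * (t - prev)       # final segment up to time t
    return area

# Three answers at 1s, 2s, and 4s; area under the curve up to t=5s:
# 0*1 + 1*1 + 2*2 + 3*1 = 8.0
auc = dief_at_t([1.0, 2.0, 4.0], 5.0)
```

Two configurations with the same total execution time can thus still be ranked: the one that delivers its answers earlier accumulates more area.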

  20.

    Summary of search process from different digital libraries.

    • plos.figshare.com
    xls
    Updated Feb 2, 2024
    Cite
    Fahmi H. Quradaa; Sara Shahzad; Rashad S. Almoqbily (2024). Summary of search process from different digital libraries. [Dataset]. http://doi.org/10.1371/journal.pone.0296858.t003
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Feb 2, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Fahmi H. Quradaa; Sara Shahzad; Rashad S. Almoqbily
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary of search process from different digital libraries.
