The CodeSearchNet Corpus is a large dataset of functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and Ruby from open source projects on GitHub. The CodeSearchNet Corpus includes:
* Six million methods overall
* Two million of which have associated documentation (docstrings, JavaDoc, and more)
* Metadata that indicates the original location (repository or line number, for example) where the data was found
SAIVT Semantic Person Search
Overview
The SAIVT Semantic Person Search Database was developed to provide a suitable platform to develop and evaluate techniques that search for a person using a semantic query (e.g. tall, red shirt, jeans). The database provides sequences for 110 subjects, each consisting of 30 initialisation frames (to, for instance, learn a background model) and a number of annotated frames containing the target subject, together with a description of the subject incorporating a number of traits, including clothing type and colour, gender, height, and build.
You can read our paper on eprints.
Contact Dr Simon Denman for further information.
Licensing
The SAIVT-SoftBioSearch database is © 2012 QUT and is licensed under the Creative Commons Attribution-ShareAlike 3.0 Australia License.
Attribution
To attribute this database, please include the following citation: Halstead, Michael, Denman, Simon, Sridharan, Sridha, & Fookes, Clinton B. (2014) Locating People in Video from Semantic Descriptions: A New Database and Approach. In the 22nd International Conference on Pattern Recognition, 24 - 28 August 2014, Stockholm, Sweden. Our paper is available on eprints.
Acknowledging the Database in your Publications
In addition to citing our paper, we kindly request that the following text be included in an acknowledgements section at the end of your publications: We would like to thank the SAIVT Research Labs at Queensland University of Technology (QUT) for freely supplying us with the SAIVT-SoftBioSearch database for our research.
Installing the SAIVT-SoftBioSearch Database
Download and unzip the following archives:
SAIVT-SoftBioSearchDB.tar.gz (157 MB, md5sum: a0c897c803c4d3a79be64a93c83060ef)
SAIVT-SoftBioSearchDB-C1.tar.gz (1.6 GB, md5sum: 692affdde65091bfe24f29ccb186dd65)
SAIVT-SoftBioSearchDB-C2.tar.gz (1.0 GB, md5sum: 573e953b0cce15c6f2ee5c3cb0cb4fe1)
SAIVT-SoftBioSearchDB-C3.tar.gz (1.5 GB, md5sum: 8870c3333e61cb27f49d5fd46c937937)
SAIVT-SoftBioSearchDB-C4.tar.gz (908 MB, md5sum: 79debc4b219bfc4dfe65eea06f57b458)
SAIVT-SoftBioSearchDB-C5.tar.gz (858 MB, md5sum: 44a973b0a0d8dfca340dc0945c479ace)
SAIVT-SoftBioSearchDB-C6.tar.gz (591 MB, md5sum: 9cdfc872104cb730d3be436cee39e5ce)
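If you want to check the downloads before unpacking, here is a minimal Python sketch that verifies the md5sums listed above and extracts the archives; it assumes the archives sit in the current directory, and only the first two digests are spelled out.

```python
import hashlib
import tarfile

# md5sums as listed above (first two shown; add the remaining archives likewise).
EXPECTED = {
    "SAIVT-SoftBioSearchDB.tar.gz": "a0c897c803c4d3a79be64a93c83060ef",
    "SAIVT-SoftBioSearchDB-C1.tar.gz": "692affdde65091bfe24f29ccb186dd65",
}

def md5sum(path, chunk_size=1 << 20):
    """Compute the md5 digest of a file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

for name, expected in EXPECTED.items():
    actual = md5sum(name)
    if actual != expected:
        raise ValueError(f"checksum mismatch for {name}: {actual}")
    with tarfile.open(name, "r:gz") as archive:
        archive.extractall(".")  # unpacks into the structure shown below
    print(f"{name}: verified and extracted")
```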
At this point you should have the following directory structure, and the SAIVT-SoftBioSearch database is installed:
SAIVT-SoftBioSearch
+-- C1-BlackBlueStripe-BlueJeans
+-- C1-BlackShirt-PinkShorts
+-- ...
+-- C6-YellowWhiteSpotDress
+-- Calibration
+-- Data
|   +-- CultureColours
|   |   +-- Black
|   |   +-- Blue
|   |   +-- ...
|   +-- Videos
|       +-- Cam1
|       +-- Cam2
|       +-- ...
+-- Halstead 2014 - Locating People in Video from Semantic Descriptions.pdf
+-- LICENSE.txt
+-- README.txt (this document)
+-- SAIVTSoftBioDatabase.xml
Sequences for the individual subjects are contained within the directories named C[1..6]-[TorsoDescription]-[LegDescription]. There are 110 subjects captured from six different cameras. Each directory contains an XML file with the annotation for that sequence, and the images that belong to the sequence. For each sequence, the first 30 frames are reserved for updating/learning a background model, and as such have no annotation.
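As a sketch of how the per-sequence layout can be consumed, assuming frames are stored as image files sorted in capture order (file extensions and the example directory name are illustrative):

```python
from pathlib import Path

def load_sequence(seq_dir):
    """Split a subject sequence into background and annotated frames.

    The first 30 frames are reserved for background modelling and
    carry no annotation.
    """
    seq_dir = Path(seq_dir)
    frames = sorted(p for p in seq_dir.iterdir()
                    if p.suffix.lower() in {".png", ".jpg", ".jpeg"})
    annotation = next(seq_dir.glob("*.xml"), None)  # per-sequence annotation file
    return frames[:30], frames[30:], annotation

# Directory name taken from the structure shown above.
background, annotated, annotation = load_sequence(
    "SAIVT-SoftBioSearch/C1-BlackShirt-PinkShorts")
```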
The 'Calibration' directory contains a camera calibration (using Tsai's method) for the six cameras used in the database.
The 'Data' directory contains additional data that may be of use. In particular, it contains a collection of colour patches within 'Data/CultureColours' that can be used to train models for a specific colour. It also contains a set of patches for skin, and for non-skin colours. 'Data/Videos' contains videos for each camera that can be used to learn the background. It should also be noted that for a portion of the time when the database was captured, a temporary wall was up due to construction works. This impacted the following sequences captured from cameras 1 and 6:
Camera 1:
C1-GreenWhiteHorizontal-BlackPants
C1-RedCheck-BlackSkirt
C1-GreenCheck-BrownPants
Camera 6:
C6-GreenFaceCover-Blue-Blue
C6-YellowWhiteSpotDress
Additional videos for these cameras are also included and are named CamX_Wall.avi. The 'SAIVTSoftBioSearchDB.xml' file defines the database. This file specifies the cameras and their calibrations/background sequences, includes definitions for the traits/soft biometrics, and lists the sequences.
This paper is also available alongside this document in the file 'Halstead 2014 - Locating People in Video from Semantic Descriptions.pdf'.
CodeQueries Benchmark dataset consists of instances of semantic queries, code context and code spans in the context corresponding to the semantic queries. The dataset can be used in experiments involving semantic query comprehension with an extractive question-answering methodology over code. More details can be found in the paper.
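To make the extractive-QA setup concrete, here is a minimal sketch using the Hugging Face pipeline API; the model name is a generic placeholder, not the benchmark's baseline, and the query/context pair is invented for illustration.

```python
from transformers import pipeline

# Placeholder model; substitute an extractive-QA model fine-tuned on code.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

code_context = '''
def connect(host, port, retries=3):
    for attempt in range(retries):
        pass
'''

# A semantic query whose answer is a span in the code context.
result = qa(question="How many times is the connection retried by default?",
            context=code_context)
print(result["answer"], result["start"], result["end"])
```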
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a collection of 2435 files; each file contains the snippets from a single search engine, retrieved using USearch. The search queries are the tokens from two semantic similarity datasets: (1) Miller-Charles and (2) Semantic IoT.
This dataset was used to develop a semantic similarity model that was published in the 8th International Conference on Future Internet of Things and Cloud (FiCloud 2021), the code can be found here.
This dataset contains source code and data used in the PhD thesis "Metrics of Graph-Based Meaning Representations with Applications from Parsing Evaluation to Explainable NLG Evaluation and Semantic Search". The dataset is split into five repositories:
S3BERT: source code to run experiments for chapter 9, "Building efficient and effective similarity models from MR metrics".
amr-metric-suite, weisfeiler-leman-amr-metrics: source code to run metric experiments for chapters 4, 5, and 6.
amr-argument-sim: source code to run experiments for chapter 8, "Exploring argumentation with MR metrics".
bamboo-amr-benchmark: benchmark for testing and developing metrics (chapter 5).
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
We introduce the “Shopping Queries Data Set”, a large dataset of difficult search queries, released with the aim of fostering research in the area of semantic matching of queries and products. For each query, the dataset provides a list of up to 40 potentially relevant results, together with ESCI relevance judgements (Exact, Substitute, Complement, Irrelevant) indicating the relevance of the product to the query. Each query-product pair is accompanied by additional information. The dataset is multilingual, as it contains queries in English, Japanese, and Spanish.
The primary objective of releasing this dataset is to create a benchmark for building new ranking strategies and simultaneously identifying interesting categories of results (i.e., substitutes) that can be used to improve the customer experience when searching for products. The three different tasks that are studied in the literature (see https://amazonkddcup.github.io/) using this Shopping Queries Dataset are:
Task 1 - Query-Product Ranking: Given a user specified query and a list of matched products, the goal of this task is to rank the products so that the relevant products are ranked above the non-relevant ones.
Task 2 - Multi-class Product Classification: Given a query and a result list of products retrieved for this query, the goal of this task is to classify each product as being an Exact, Substitute, Complement, or Irrelevant match for the query.
Task 3 - Product Substitute Identification: This task will measure the ability of the systems to identify the substitute products in the list of results for a given query.
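As an illustration of how the ESCI labels feed the three tasks above, here is a small sketch of per-task targets. The label spelling and the Task 1 gain values are assumptions (the gains are a common convention for graded ranking metrics, not prescribed by the dataset).

```python
# Illustrative per-task targets derived from the ESCI label of a query-product
# pair. The label encoding ("exact" vs. "E") is an assumption; check the data
# for the exact spelling.
LABELS = ["exact", "substitute", "complement", "irrelevant"]

def task2_target(esci_label: str) -> int:
    """Task 2: four-way E/S/C/I classification target."""
    return LABELS.index(esci_label)

def task3_target(esci_label: str) -> int:
    """Task 3: binary substitute-identification target."""
    return int(esci_label == "substitute")

# Task 1: graded relevance gains for ranking metrics such as NDCG.
# These particular values are a common convention, not fixed by the dataset.
GAINS = {"exact": 1.0, "substitute": 0.1, "complement": 0.01, "irrelevant": 0.0}
```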
We provide two different versions of the data set: a reduced version for Task 1, and a larger version for Tasks 2 and 3.
The training data set contains a list of query-result pairs with annotated E/S/C/I labels. The data is multilingual, with queries in English, Japanese, and Spanish. The examples in the data set have the following fields: example_id, query, query_id, product_id, product_locale, esci_label, small_version, large_version, split, product_title, product_description, product_bullet_point, product_brand, product_color, and source.
The Shopping Queries Data Set is a large-scale manually annotated data set composed of challenging customer queries.
There are two versions of the dataset. The reduced version contains 48,300 unique queries and 1,118,011 rows, each corresponding to a query-product pair.
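A sketch of selecting the two versions, assuming the examples are available as a single table with the fields listed above; the file name is a placeholder, and small_version/large_version are assumed to be 0/1 membership flags.

```python
import pandas as pd

# Placeholder file name; the table carries the fields listed above.
df = pd.read_parquet("shopping_queries_examples.parquet")

# Task 1 uses the reduced version; Tasks 2 and 3 use the larger one.
task1_train = df[(df["small_version"] == 1) & (df["split"] == "train")]
tasks23_train = df[(df["large_version"] == 1) & (df["split"] == "train")]

print(task1_train["query_id"].nunique(), "unique queries in the reduced train split")
```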
Text from CORD-19 dataset (April 2020) was segmented into sentences and annotated with entity span markers using SciSpacy (english, medium), then linked to UMLS concepts using the SciSpacy + UMLS integration (UMLSKnowledgeBase). This linking is noisy, i.e., a span can link with multiple UMLS concepts. We filtered for sentences where there is no duplicate linkage, and reduced them to sequences of UMLS semantic type codes, then computed Transition and Emission probabilities for consecutive semantic code pairs across the corpus of selected sentences.
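The transition probabilities described above reduce to bigram counting over the semantic-type code sequences; here is a minimal sketch with illustrative codes (emission probabilities would be computed analogously from code/token pairs).

```python
from collections import Counter, defaultdict

# Each sentence reduced to its sequence of UMLS semantic type codes (illustrative).
sequences = [
    ["T047", "T121", "T047"],  # e.g. Disease, Pharmacologic Substance, Disease
    ["T047", "T047", "T061"],
]

# Count consecutive code pairs across the corpus.
transitions = defaultdict(Counter)
for seq in sequences:
    for prev, curr in zip(seq, seq[1:]):
        transitions[prev][curr] += 1

# Normalise counts into transition probabilities P(curr | prev).
trans_prob = {
    prev: {curr: n / sum(counts.values()) for curr, n in counts.items()}
    for prev, counts in transitions.items()
}
print(trans_prob["T047"])  # each outgoing code from T047 gets probability 1/3
```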
Phrase in Context is a curated benchmark for phrase understanding and semantic search, consisting of three tasks of increasing difficulty: Phrase Similarity (PS), Phrase Retrieval (PR) and Phrase Sense Disambiguation (PSD). The datasets are annotated by 13 linguistic experts on Upwork and verified by two groups: ~1000 AMT crowdworkers and another set of 5 linguistic experts. PiC benchmark is distributed under CC-BY-NC 4.0.
MIT License: https://spdx.org/licenses/MIT.html
Code for the implementation and benchmark of NPCS, a Native Provenance Computation for SPARQL. The code in this dataset includes the implementation of the NPCS system, a middleware for SPARQL endpoints that rewrites queries into queries that annotate answers with provenance polynomials (i.e., how-provenance data). The translation rules implemented for the query rewriting can be seen in the paper. The dataset also contains scripts and services to automate query execution. We use GraphDB (version 10.2.0) and Stardog (version 9.1.0) for the SPARQL endpoints. Because of license restrictions, these software products cannot be included in this dataset and must be downloaded from the respective vendors. Also, the data must be loaded using the respective bulk loaders of GraphDB and Stardog. The datasets used in the experiments can be generated with the synthetic dataset generator of the WatDiv benchmark. The Wikidata dataset corresponds to the full RDF dump from May 22, 2023. Do not hesitate to contact the authors for any inquiries.
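For orientation, a minimal sketch of querying one of the SPARQL endpoints with SPARQLWrapper. The endpoint URL and query are placeholders; in the benchmark, the NPCS middleware would first rewrite the query so that answers carry provenance polynomials.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Placeholder endpoint: a local GraphDB repository loaded with WatDiv data.
# In the benchmark this would sit behind the NPCS middleware.
endpoint = SPARQLWrapper("http://localhost:7200/repositories/watdiv")
endpoint.setQuery("""
    SELECT ?s ?o WHERE { ?s <http://example.org/p> ?o } LIMIT 10
""")
endpoint.setReturnFormat(JSON)

for binding in endpoint.query().convert()["results"]["bindings"]:
    print(binding["s"]["value"], binding["o"]["value"])
```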
A comprehensive benchmark suite for testing and analyzing the performance of federated query processing strategies on semantic data. This benchmark is flexible enough to cover a wide range of semantic data application processing strategies and use cases, ranging from centralized processing over federation to pure Linked Data processing. You can find more information about the benchmark in: Google Code Archive: https://code.google.com/archive/p/fbench/ Michael Schmidt, Olaf Görlitz, Peter Haase, Günter Ladwig, Andreas Schwarte, Duc Thanh Tran. FedBench: A Benchmark Suite for Federated Semantic Data Query Processing. In The Semantic Web – ISWC 2011. Springer, 2011. DOI: 10.1007/978-3-642-25073-6_37
CC0 1.0: https://choosealicense.com/licenses/cc0-1.0/
Dataset Card
This dataset is a subset of Kaggle's The Movie Dataset that contains only the name, release year, and overview of every film in the original dataset that has that information complete. It is intended as a toy dataset for learning about embeddings in a workshop run by the AI Service Center Berlin-Brandenburg at the Hasso Plattner Institute. This dataset has a smaller version here.
Dataset Details
Dataset Description
The dataset has 44435… See the full description on the dataset page: https://huggingface.co/datasets/mt0rm0/movie_descriptors.
CC0 1.0: https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for arXiv Dataset
Dataset Summary
A dataset of 1.7 million arXiv articles for applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.
Supported Tasks and Leaderboards
[More Information Needed]
Languages
The language supported is English.
Dataset Structure
Data Instances
This dataset is a mirror of the original… See the full description on the dataset page: https://huggingface.co/datasets/shchoi1019/temp.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Many digital libraries recommend literature to their users considering the similarity between a query document and their repository. However, they often fail to distinguish what relationship makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph-Vectors, BERT, and XLNet, under different configurations (e.g., sequence length, vector concatenation scheme), including a Siamese architecture for the Transformer-based systems. We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations. Our results show vanilla BERT as the best performing system, with an F1-score of 0.93, which we manually examine to better understand its applicability to other domains. Our findings suggest that classifying semantic relations between documents is a solvable task and motivate the development of recommender systems based on the evaluated techniques. The discussions in this paper serve as first steps in the exploration of documents through SPARQL-like queries, such that one could find documents that are similar in one aspect but dissimilar in another.
Additional information can be found on GitHub.
The following data is supplemental to the experiments described in our research paper. The data consists of:
This package consists of the Dataset part.
Dataset
The Wikipedia article corpus is available in enwiki-20191101-pages-articles.weighted.10k.jsonl.bz2. The original data have been downloaded as an XML dump, and the corresponding articles were extracted as plain text with gensim.scripts.segment_wiki. The archive contains only articles that are available in the training or test data.
The actual dataset is provided as used in the stratified k-fold with k=4, in train_testdata_4folds.tar.gz.
├── 1
│   ├── test.csv
│   └── train.csv
├── 2
│   ├── test.csv
│   └── train.csv
├── 3
│   ├── test.csv
│   └── train.csv
└── 4
    ├── test.csv
    └── train.csv
4 directories, 8 files
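Given that layout, loading the four folds might look like the following sketch (assuming the fold directories sit in the current working directory).

```python
import pandas as pd

# Load the stratified 4-fold split shown above.
folds = []
for k in range(1, 5):
    train = pd.read_csv(f"{k}/train.csv")
    test = pd.read_csv(f"{k}/test.csv")
    folds.append((train, test))

# Report split sizes per fold.
for k, (train, test) in enumerate(folds, start=1):
    print(f"fold {k}: {len(train)} train rows, {len(test)} test rows")
```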
SECOLD contains structured source code facts from open source projects. It is developed to serve source code mining, search, and traceability research and tools by providing structural source code search over open source code on the Internet. It contains 1.5 billion facts extracted from more than 1 million source code files. SECOLD is connected to DBpedia, Freebase, and OpenCyc. It extracts fine-grained facts from source code at several levels (e.g. presentation, syntax, and semantics).
Dataset Card for Free-Law-Project/opinions-synthetic-query-512
This dataset is created from the opinions-metadata, and used for training the Free Law Project Semantic Search models, including Free-Law-Project/modernbert-embed-base_finetune_512.
Dataset Details
The dataset is curated by Free Law Project by selecting the train split from the opinions-metadata dataset. The dataset is created for finetuning encoder models for semantic search, with a 512-token context window. The… See the full description on the dataset page: https://huggingface.co/datasets/freelawproject/opinions-synthetic-query-512.
Webpage: https://ogb.stanford.edu/docs/graphprop/#ogbg-code
```python
from torch_geometric.data import DataLoader
from ogb.graphproppred import PygGraphPropPredDataset

# Download/load the dataset and recover the official project split.
dataset = PygGraphPropPredDataset(name='ogbg-code', root='/kaggle/input')
split_idx = dataset.get_idx_split()

batch_size = 32
train_loader = DataLoader(dataset[split_idx['train']], batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(dataset[split_idx['valid']], batch_size=batch_size, shuffle=False)
test_loader = DataLoader(dataset[split_idx['test']], batch_size=batch_size, shuffle=False)
```
Graph: The ogbg-code dataset is a collection of Abstract Syntax Trees (ASTs) obtained from approximately 450 thousand Python method definitions. Methods are extracted from a total of 13,587 different repositories across the most popular projects on GitHub. The collection of Python methods originates from GitHub CodeSearchNet, a collection of datasets and benchmarks for machine-learning-based code retrieval. In ogbg-code, the dataset authors contribute an additional feature extraction step, which includes AST edges, AST nodes, and tokenized method names. Altogether, ogbg-code allows you to capture source code with its underlying graph structure, beyond its token sequence representation.
Prediction task: The task is to predict the sub-tokens forming the method name, given the Python method body represented by its AST and node features. This task is often referred to as “code summarization”, because the model is trained to find a succinct and precise description (i.e., the method name chosen by the developer) for a complete logical unit (i.e., the method body). Code summarization is a representative task in the field of machine learning for code, not only for its straightforward adoption in developer tools, but also because it is a proxy measure for assessing how well a model captures the code's semantics [1]. Following [2,3], the dataset authors use an F1 score to evaluate predicted sub-tokens against ground-truth sub-tokens.
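A sketch of the sub-token F1 computation follows; the official OGB Evaluator is the reference implementation, so treat the tokenisation and duplicate handling here as illustrative.

```python
def subtoken_f1(pred, target):
    """F1 between predicted and ground-truth method-name sub-tokens.

    Illustrative only: reported numbers should come from the official
    ogb.graphproppred Evaluator.
    """
    pred, target = set(pred), set(target)
    if not pred or not target:
        return 0.0
    tp = len(pred & target)            # correctly predicted sub-tokens
    precision = tp / len(pred)
    recall = tp / len(target)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(subtoken_f1(["get", "user", "name"], ["get", "user", "id"]))  # 0.666...
```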
Dataset splitting: The dataset authors adopt a project split [4], where the ASTs for the train set are obtained from GitHub projects that do not appear in the validation and test sets. This split respects the practical scenario of training a model on a large collection of source code (obtained, for instance, from the popular GitHub projects), and then using it to predict method names on a separate code base. The project split stress-tests the model’s ability to capture code’s semantics, and prevents a model from achieving a high test score by trivially memorizing the idiosyncrasies of the training projects (such as the naming conventions and coding style of a specific developer).
| Package | #Graphs | #Nodes per Graph | #Edges per Graph | Split Type | Task Type | Metric |
|---|---|---|---|---|---|---|
| ogb>=1.2.0 | 452,741 | 125.2 | 124.2 | Project | Sub-token prediction | F1 score |
Website: https://ogb.stanford.edu
The Open Graph Benchmark (OGB) [5] is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner.
[1] Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. A survey of machine learning for big code and naturalness. ACM Computing Surveys, 51(4):1–37, 2018.
[2] Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. code2seq: Generating sequences from structured representations of code. arXiv preprint arXiv:1808.01400, 2018.
[3] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages, 3(POPL):1–29, 2019.
[4] Miltiadis Allamanis. The adverse effects of code duplication in machine learning models of code. Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, pp. 143–153, 2019.
[5] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open Graph Benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, pp. 22118–22133, 2020.
I am NOT the author of this dataset. It was downloaded from its official website. I assume no responsibility or liability for the content in this dataset. Any questions, problems or issues, please contact the original authors at their website or their GitHub repo.
DataONE (https://www.dataone.org) is a federation of institutions involved with the earth and environmental sciences that share data through common cyberinfrastructure. In 2016, the DataONE project carried out a quantification of the utility of semantic query, by measuring the precision and recall of relevant datasets available through that catalog. Precision is defined as the proportion of relevant data in the retrieved results, and recall is the proportion of relevant data retrieved, compared to all relevant data present in the repository (see Methods).
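Expressed over sets of dataset identifiers, the two measures defined above are straightforward; a minimal sketch:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved datasets that are relevant.
    Recall: fraction of all relevant datasets that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example: 4 of 8 returned datasets are relevant, out of 5 relevant overall.
print(precision_recall(range(8), [0, 1, 2, 3, 9]))  # (0.5, 0.8)
```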
This dataset contains the queries and results of that study. Four data tables are included. First, a table of the 10 queries, which were formatted in several ways, including natural language and text strings (for plain text searches of various parts of metadata), and URIs for measurements in the EcoSystem Ontology (ECSO). A second table contains 994 relevant datasets in the DataONE catalog, with a column for each of the ten queries and a boolean value indicating whether the dataset is a match for that query. Two query results tables are included, for the raw and summarized results of the query tests. A fifth entity contains the zipped code (R language) used to perform the queries in the DataONE system.
When run against approximately 1000 datasets (in October 2016), results for the ten queries ranged from 0–50% (precision) and 0–100% (recall), indicating that traditional searches may sometimes be adequate to return all relevant data in a corpus, but results can be erratic and inconsistent, with potentially large returns of irrelevant data in the result set. When querying through semantic classes, precision and recall were much higher and more consistent (90–100% and 75–100%, respectively).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains results of various metric tests performed with the SPARQL query engine nLDE (the network of Linked Data Eddies) in different configurations. The queries themselves are available via the nLDE website, and the tests are explained in depth in the associated publication. To compute the diefficiency metrics dief@t and dief@k, we need the answer trace produced by the SPARQL query engines when executing queries. Answer traces record the exact point in time when an engine produces an answer while executing a query. We executed SPARQL queries using three different configurations of the nLDE engine: Selective, NotAdaptive, and Random. The resulting answer trace for each query execution is stored in the CSV file nLDEBenchmark1AnswerTrace.csv. The structure of this file is as follows:
query: id of the query executed. Example: 'Q9.sparql'
approach: name of the approach (or engine) used to execute the query.
tuple: the value i indicates that this row corresponds to the i-th answer produced by approach when executing query.
time: elapsed time (in seconds) since approach started the execution of query until answer i is produced.
In addition, to compare the performance of the nLDE engine using the metrics dief@t and dief@k against conventional metrics used in the query processing literature (execution time, time for the first tuple, and number of answers produced), we measured the performance of the nLDE engine using those conventional metrics. The results are available in the CSV file nLDEBenchmark1Metrics.csv. The structure of this CSV file is as follows:
query: id of the query executed. Example: 'Q9.sparql'
approach: name of the approach (or engine) used to execute the query.
tfft: time (in seconds) required by approach to produce the first tuple when executing query.
totaltime: elapsed time (in seconds) since approach started the execution of query until the last answer of query is produced.
comp: number of answers produced by approach when executing query.
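To show how the answer trace feeds the diefficiency metrics, here is a sketch computing dief@t for one query/approach pair as the area under the answers-versus-time curve up to time t. This uses a trapezoidal approximation; the authors' own tooling is the reference implementation.

```python
import numpy as np
import pandas as pd

# Answer trace with columns: query, approach, tuple, time (as described above).
trace = pd.read_csv("nLDEBenchmark1AnswerTrace.csv")

def dief_at_t(trace, query, approach, t):
    """dief@t: area under the answers-vs-time curve up to time t (sketch)."""
    run = trace[(trace["query"] == query) & (trace["approach"] == approach)]
    run = run[run["time"] <= t].sort_values("time")
    # 'tuple' is the cumulative answer count at each recorded time point.
    return np.trapz(run["tuple"], run["time"])

print(dief_at_t(trace, "Q9.sparql", "Selective", t=10.0))
```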
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary of search process from different digital libraries.