Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The PAN plagiarism corpus 2011 (PAN-PC-11) is a corpus for the evaluation of automatic plagiarism detection algorithms. For research purposes the corpus can be used free of charge.
The PAN-PC-11 contains documents in which plagiarism has been inserted automatically as well as documents in which plagiarism has been inserted manually. The former have been constructed using a so-called random plagiarist, a computer program which constructs plagiarism according to a number of parameters, while the latter have been obtained with crowdsourcing via Amazon's Mechanical Turk.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This corpus is outdated. Please use its successor PAN-PC-11.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This corpus is outdated. Please use its successor PAN-PC-11: https://doi.org/10.5281/zenodo.3250095
The PAN plagiarism corpus 2010 (PAN-PC-10) is a corpus for the evaluation of automatic plagiarism detection algorithms. For research purposes the corpus can be used free of charge.
The PAN-PC-10 contains documents in which artificial plagiarism has been inserted automatically as well as documents in which simulated plagiarism has been inserted manually. The former have been constructed using a so-called random plagiarist, a computer program which constructs plagiarism according to a number of parameters, while the latter have been obtained with crowdsourcing via Amazon's Mechanical Turk.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This corpus is outdated. Please use its successor PAN-PC-11: https://doi.org/10.5281/zenodo.3250095
The Webis plagiarism corpus 2008 (Webis-PC-08) is a corpus for the evaluation of automatic plagiarism detection algorithms. For research purposes the corpus can be used free of charge. However, since the documents in the corpus are not free of copyright, we need assurance that you have legal access to the ACM digital library.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Details
ASPEC. The Asian Scientific Paper Excerpt Corpus comprises excerpts of scientific papers in Japanese that have been manually translated into English and Chinese. We use both subsets of the ASPEC corpus.
JRC-Acquis. The corpus consists of legislative texts in 22 languages, which the European Union's Joint Research Centre (JRC) selected from the cumulative body of EU laws (the so-called Acquis communautaire). We sampled our test cases from the 10,000 document pairs in the English-French subset of the corpus.
Europarl. The corpus contains transcripts of European Parliament proceedings in 21 European languages. We exclusively sampled test cases from the 9,443 document pairs in the English-French subset of the corpus.
PAN-PC-11. The corpus contains instances of simulated monolingual and cross-language plagiarism that were used for evaluating plagiarism detection methods as part of the workshop series Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (PAN). Most of the 26,939 documents in the corpus were created by extracting text from openly available books. The documents are partially interspersed with instances of simulated plagiarism that were created and obfuscated automatically or by crowdsourced workers. We exclusively sampled test cases from the 2,921 Spanish-English aligned document pairs in the corpus, for which simulated plagiarism instances were either machine-generated or created manually by crowdsourced workers.
==========================================================================
File Structure
[corpus_documents] folder: Corpora of translation-aligned documents used in our experiments composed of:
Each sub-corpus consists of 4,000 translation-aligned files (2,000 per language); the entire corpus thus has 20,000 files.
Each set of translation-aligned documents was randomly selected from the original datasets (details in the paper).
The Japanese files in aspec and aspecx do not necessarily overlap even though they are from the same dataset.
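The random selection of translation-aligned documents described above could be sketched as follows. This is only an illustration under stated assumptions: the helper name `sample_aligned_pairs`, the fixed seed, and the toy file names are hypothetical and are not taken from the paper or the original datasets.

```python
import random

def sample_aligned_pairs(pairs, k, seed=0):
    """Draw k translation-aligned pairs without replacement (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return rng.sample(pairs, k)

# Toy stand-in for an original dataset: (source_file, target_file) pairs.
pairs = [(f"doc{i}.en", f"doc{i}.fr") for i in range(10)]
subset = sample_aligned_pairs(pairs, 4)
```

Sampling whole pairs rather than single files keeps each selected document together with its translation. With k = 2,000, every pair contributes one file per language, which matches the 4,000 files per sub-corpus stated above.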
[corpus_paragraphs] folder: 2,000 translation-aligned paragraphs randomly selected from:
[vectors_documents] folder: Average vector representation of the documents in the datasets from two pre-trained models:
Two granularities are provided:
The structure for each level of granularity follows the same pattern as their respective corpus.
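An average vector representation of a text is commonly obtained by taking the element-wise mean of the vectors of its tokens. The sketch below illustrates that idea only; the function name `average_vector`, the toy two-dimensional embeddings, and the choice to skip tokens missing from the model are assumptions, not a description of how the files in this folder were produced.

```python
def average_vector(tokens, embeddings):
    """Element-wise mean of the vectors of all tokens found in `embeddings`."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None  # no known token, so no representation can be built
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Toy 2-dimensional embeddings standing in for a pre-trained model (assumption).
toy = {"plagiarism": [1.0, 3.0], "corpus": [3.0, 5.0]}
doc_vec = average_vector(["plagiarism", "corpus", "unknown"], toy)
```

The same function applies to either granularity: pass the tokens of a whole document or of a single paragraph.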
Naming convention:
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Webis Crowd Paraphrase Corpus 2011 (Webis-CPC-11) contains 7,859 candidate paraphrases obtained from Mechanical Turk crowdsourcing. The corpus is made up of 4,067 accepted paraphrases, 3,792 rejected non-paraphrases, and the original texts. These samples formed part of the PAN 2010 international plagiarism detection competition but were not previously available separately from the rest of the competition data.