8 datasets found
  1. PAN Plagiarism Corpus 2011 (PAN-PC-11)

    • explore.openaire.eu
    • zenodo.org
    Updated Jun 1, 2011
    Cite
    Martin Potthast; Benno Stein; Andreas Eiselt; Alberto Barrón-Cedeño; Paolo Rosso (2011). PAN Plagiarism Corpus 2011 (PAN-PC-11) [Dataset]. http://doi.org/10.5281/zenodo.3250094
    Authors
    Martin Potthast; Benno Stein; Andreas Eiselt; Alberto Barrón-Cedeño; Paolo Rosso
    Description

    The PAN plagiarism corpus 2011 (PAN-PC-11) is a corpus for the evaluation of automatic plagiarism detection algorithms. For research purposes the corpus can be used free of charge. The PAN-PC-11 contains documents in which plagiarism has been inserted automatically as well as documents in which plagiarism has been inserted manually. The former have been constructed using a so-called random plagiarist, a computer program which constructs plagiarism according to a number of parameters, while the latter have been obtained with crowdsourcing via Amazon's Mechanical Turk.

    Reference: Benno Stein, Martin Potthast, Alberto Barrón-Cedeño, Paolo Rosso, Efstathios Stamatatos, and Moshe Koppel. 4th International Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 2010). SIGIR Forum, 45(1):45-48, June 2011.

  2. PAN-PC-10

    • webis.de
    Updated 2010
    Cite
    Martin Potthast; Benno Stein; Andreas Eiselt (2010). PAN-PC-10 [Dataset]. http://doi.org/10.5281/zenodo.3250123
    Dataset provided by
    Bauhaus-Universität Weimar
    The Web Technology & Information Systems Network
    University of Kassel, hessian.AI, and ScaDS.AI
    Authors
    Martin Potthast; Benno Stein; Andreas Eiselt
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This corpus is outdated. Please use its successor PAN-PC-11.

  3. PAN Plagiarism Corpus 2009 (PAN-PC-09)

    • zenodo.org
    Available download formats: bin
    Updated Jan 24, 2020
    Cite
    Martin Potthast; Benno Stein; Andreas Eiselt; Alberto Barrón-Cedeño; Paolo Rosso (2020). PAN Plagiarism Corpus 2009 (PAN-PC-09) [Dataset]. http://doi.org/10.5281/zenodo.3250083
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Martin Potthast; Benno Stein; Andreas Eiselt; Alberto Barrón-Cedeño; Paolo Rosso
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This corpus is outdated. Please use its successor PAN-PC-11: https://doi.org/10.5281/zenodo.3250095

    The PAN plagiarism corpus 2009 (PAN-PC-09) is a corpus for the evaluation of automatic plagiarism detection algorithms. For research purposes the corpus can be used free of charge.

    The PAN-PC-09 contains documents in which artificial plagiarism has been inserted automatically. The plagiarism cases have been constructed using a so-called random plagiarist, a computer program which constructs plagiarism according to a number of random variables. The variables include the percentage of plagiarism in the whole corpus, the percentage of plagiarism per document, the length of a single plagiarized section, and the degree of obfuscation per plagiarized section.
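
The four random variables named above can be illustrated with a toy sketch. This is not the authors' random plagiarist: the class `PlagiaristParams`, its default values, the word-shuffling obfuscation, and all function names are assumptions made purely for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class PlagiaristParams:
    # All defaults are hypothetical; the listing gives no concrete values.
    doc_fraction: float = 0.5       # share of corpus documents that receive plagiarism
    per_doc_fraction: float = 0.2   # inserted words relative to the document length
    section_len: tuple = (5, 15)    # min/max words per plagiarized section
    obfuscation: float = 0.3        # share of words permuted inside a section


def obfuscate(words, degree, rng):
    """Permute a random subset of positions; a stand-in for real obfuscation."""
    words = list(words)
    idx = rng.sample(range(len(words)), int(len(words) * degree))
    shuffled = idx[:]
    rng.shuffle(shuffled)
    out = list(words)
    for i, j in zip(idx, shuffled):
        out[i] = words[j]
    return out


def insert_plagiarism(suspicious, source, params, rng):
    """Return (words, annotations); annotations are (offset, length) pairs."""
    if not source:
        raise ValueError("source document must not be empty")
    budget = int(len(suspicious) * params.per_doc_fraction)
    sections = []
    while budget > 0:
        n = min(rng.randint(*params.section_len), budget, len(source))
        start = rng.randrange(len(source) - n + 1)
        sections.append(obfuscate(source[start:start + n], params.obfuscation, rng))
        budget -= n
    # Insertion points refer to the original document, processed left to right.
    positions = sorted(rng.randrange(len(suspicious) + 1) for _ in sections)
    out, annotations, prev = [], [], 0
    for pos, section in zip(positions, sections):
        out.extend(suspicious[prev:pos])
        annotations.append((len(out), len(section)))
        out.extend(section)
        prev = pos
    out.extend(suspicious[prev:])
    return out, annotations


def plagiarize_corpus(docs, source, params, rng):
    """Apply insert_plagiarism to a random share of the documents."""
    result = []
    for doc in docs:
        if rng.random() < params.doc_fraction:
            result.append(insert_plagiarism(doc, source, params, rng))
        else:
            result.append((list(doc), []))
    return result
```

The returned annotations play the role of ground truth: a detector's output can be compared against the recorded offsets and lengths of the inserted sections.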

  4. PAN-PC-09

    • webis.de
    Updated 2009
    Cite
    Martin Potthast; Benno Stein; Andreas Eiselt (2009). PAN-PC-09 [Dataset]. http://doi.org/10.5281/zenodo.3250083
    Dataset provided by
    Bauhaus-Universität Weimar
    The Web Technology & Information Systems Network
    University of Kassel, hessian.AI, and ScaDS.AI
    Authors
    Martin Potthast; Benno Stein; Andreas Eiselt
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This corpus is outdated. Please use its successor PAN-PC-11.

  5. PAN Plagiarism Corpus 2010 (PAN-PC-10)

    • zenodo.org
    • explore.openaire.eu
    Available download formats: bin
    Updated Jan 24, 2020
    Cite
    Martin Potthast; Benno Stein; Andreas Eiselt; Alberto Barrón-Cedeño; Paolo Rosso (2020). PAN Plagiarism Corpus 2010 (PAN-PC-10) [Dataset]. http://doi.org/10.5281/zenodo.3250123
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Martin Potthast; Benno Stein; Andreas Eiselt; Alberto Barrón-Cedeño; Paolo Rosso
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This corpus is outdated. Please use its successor PAN-PC-11: https://doi.org/10.5281/zenodo.3250095

    The PAN plagiarism corpus 2010 (PAN-PC-10) is a corpus for the evaluation of automatic plagiarism detection algorithms. For research purposes the corpus can be used free of charge.

    The PAN-PC-10 contains documents in which artificial plagiarism has been inserted automatically as well as documents in which simulated plagiarism has been inserted manually. The former have been constructed using a so-called random plagiarist, a computer program which constructs plagiarism according to a number of parameters, while the latter have been obtained with crowdsourcing via Amazon's Mechanical Turk.

  6. Webis Plagiarism Corpus 2008 (Webis-PC-08)

    • explore.openaire.eu
    • zenodo.org
    Updated Jan 1, 2008
    Cite
    Sven Meyer zu Eißen; Benno Stein (2008). Webis Plagiarism Corpus 2008 (Webis-PC-08) [Dataset]. http://doi.org/10.5281/zenodo.3254617
    Authors
    Sven Meyer zu Eißen; Benno Stein
    Description

    This corpus is outdated. Please use its successor PAN-PC-11: https://doi.org/10.5281/zenodo.3250095

    The Webis plagiarism corpus 2008 (Webis-PC-08) is a corpus for the evaluation of automatic plagiarism detection algorithms. For research purposes the corpus can be used free of charge; however, since the documents in the corpus are not free of copyright, we need assurance that you have legal access to the ACM Digital Library.

  7. Webis-PC-08

    • anthology.aicmu.ac.cn
    • webis.de
    Updated 2008
    Cite
    Benno Stein; Sven Meyer zu Eißen (2008). Webis-PC-08 [Dataset]. http://doi.org/10.5281/zenodo.3254618
    Dataset provided by
    Bauhaus-Universität Weimar
    The Web Technology & Information Systems Network
    Authors
    Benno Stein; Sven Meyer zu Eißen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This corpus is outdated. Please use its successor PAN-PC-11.

  8. Data from: Detecting Cross-Language Plagiarism using Open Knowledge Graphs

    • zenodo.org
    Available download formats: zip
    Updated Oct 19, 2021
    Cite
    Johannes Stegmüller; Fabian Bauer-Marquart; Norman Meuschke; Terry Ruas; Moritz Schubotz; Bela Gipp (2021). Detecting Cross-Language Plagiarism using Open Knowledge Graphs [Dataset]. http://doi.org/10.5281/zenodo.3616683
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Johannes Stegmüller; Fabian Bauer-Marquart; Norman Meuschke; Terry Ruas; Moritz Schubotz; Bela Gipp
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Corresponding authors: Norman Meuschke, Terry Ruas
    Venue: TBA (under review)

    ==========================================================================

    Source code: https://github.com/ag-gipp/cl-osa

    ==========================================================================

    Dataset Details

    ASPEC. The Asian Scientific Paper Excerpt Corpus comprises excerpts of scientific papers in Japanese that have been manually translated into English and Chinese. We use both subsets of the ASPEC corpus.

    • ASPEC-JC contains abstracts and paragraphs from the main text of research papers that were translated manually from Japanese to Chinese.
    • ASPEC-JE contains abstracts of approx. two million research papers that were translated manually from Japanese to English.

    JRC-Acquis. The corpus consists of legislative texts in 22 languages, which the European Union's Joint Research Centre (JRC) selected from the cumulative body of EU laws (the so-called Acquis communautaire). We sampled our test cases from the 10,000 document pairs in the English-French subset of the corpus.

    Europarl. The corpus contains transcripts of European Parliament proceedings in 21 European languages. We exclusively sampled test cases from the 9,443 document pairs in the English-French subset of the corpus.

    PAN-PC-11. The corpus contains instances of simulated monolingual and cross-language plagiarism that were used for evaluating plagiarism detection methods as part of the workshop series Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (PAN). Most of the 26,939 documents in the corpus were created by extracting text from openly available books. The documents are partially interspersed with instances of simulated plagiarism that were created and obfuscated automatically or by crowdsourced workers. We exclusively sampled test cases from the 2,921 Spanish-English aligned document pairs in the corpus, for which simulated plagiarism instances were either machine-generated or created manually by crowdsourced workers.

    ==========================================================================

    File Structure

    [corpus_documents] folder: Corpora of translation-aligned documents used in our experiments composed of:

    • aspec: Japanese and English
    • aspecx: Japanese and Chinese
    • jrc: English and French
    • europarl: English and French
    • pan: English and Spanish

    Each sub-corpus consists of 4,000 translation-aligned files (2,000 per language); the entire corpus thus has 20,000 files.
    Each set of translation-aligned documents was randomly selected from the original datasets (details in the paper).
    The Japanese files in aspec and aspecx do not necessarily overlap even though they are from the same dataset.
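
The file counts stated above can be checked with a few lines of arithmetic. The sketch below only restates the listing's numbers; the dictionary name `SUB_CORPORA` and the constant `FILES_PER_LANGUAGE` are illustrative, not part of the dataset.

```python
# Sub-corpus folder names and their language pairs, as listed above.
SUB_CORPORA = {
    "aspec": ("Japanese", "English"),
    "aspecx": ("Japanese", "Chinese"),
    "jrc": ("English", "French"),
    "europarl": ("English", "French"),
    "pan": ("English", "Spanish"),
}
FILES_PER_LANGUAGE = 2000  # stated in the description

# Files per sub-corpus: one set of 2,000 files for each of its two languages.
files_per_sub_corpus = {
    name: len(languages) * FILES_PER_LANGUAGE
    for name, languages in SUB_CORPORA.items()
}
total_files = sum(files_per_sub_corpus.values())

print(files_per_sub_corpus["jrc"])  # 4000
print(total_files)                  # 20000
```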

    [corpus_paragraphs] folder: 2,000 translation-aligned paragraphs randomly selected from:

    • jrc: English and French
    • europarl: English and French
    • pan: English and Spanish


    [vectors_documents] folder: Average vector representation of the documents in the datasets from two pre-trained models:

    • Universal Sentence Encoder - Multilingual (USE-ML)
    • ConceptNet Numberbatch

    Two granularities are provided:

    • vector_paragraphs
    • vector_documents

    The structure for each level of granularity follows the same pattern as their respective corpus.

    Naming convention:

    • Example: cn_jrc_es:
      • model: ConceptNet Numberbatch
      • corpus: JRC-Acquis
      • language: Spanish
    • Labels:
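
A minimal sketch of how a name following this convention could be split into its parts. Only the labels confirmed by the example above (`cn`, `jrc`, `es`) are mapped; the full label list is not reproduced in this listing, so any other label is returned unchanged. The function `parse_vector_name` and the `VectorName` type are hypothetical helpers, not part of the dataset or its source code.

```python
from typing import NamedTuple

# Only the labels that appear in the example above; the remaining labels are unknown here.
MODELS = {"cn": "ConceptNet Numberbatch"}
CORPORA = {"jrc": "JRC-Acquis"}
LANGUAGES = {"es": "Spanish"}


class VectorName(NamedTuple):
    model: str
    corpus: str
    language: str


def parse_vector_name(name: str) -> VectorName:
    """Split '<model>_<corpus>_<language>' and expand the labels known from the example."""
    parts = name.split("_")
    if len(parts) != 3:
        raise ValueError(f"expected '<model>_<corpus>_<language>', got {name!r}")
    model, corpus, language = parts
    return VectorName(
        MODELS.get(model, model),          # unknown labels pass through unchanged
        CORPORA.get(corpus, corpus),
        LANGUAGES.get(language, language),
    )


print(parse_vector_name("cn_jrc_es"))
```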

PAN Plagiarism Corpus 2011 (PAN-PC-11)

24 scholarly articles cite this dataset.
