Webis Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18)

    Explore at: zenodo.org, explore.openaire.eu
    Cite as: Milad Alshomary; Michael Völske; Henning Wachsmuth; Benno Stein; Matthias Hagen; Martin Potthast (2022). Webis Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18) [Dataset]. http://doi.org/10.5281/zenodo.3546193
    Available download formats: bz2, xz
    Dataset updated: Aug 29, 2022
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Milad Alshomary; Michael Völske; Henning Wachsmuth; Benno Stein; Matthias Hagen; Martin Potthast
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18) contains text reuse cases extracted from within Wikipedia and between Wikipedia and a sample of the Common Crawl.

    The corpus has the following structure:

    • wikipedia.jsonl.bz2: Each line represents a Wikipedia article and contains a JSON array of article_id, article_title, and article_body
    • within-wikipedia-tr-01.jsonl.bz2: Each line represents a text reuse case and contains a JSON array of s_id (source article id), t_id (target article id), s_text (source text), and t_text (target text)
    • within-wikipedia-tr-02.jsonl.bz2: Each line represents a text reuse case and contains a JSON array of s_id (source article id), t_id (target article id), s_text (source text), and t_text (target text)
    • preprocessed-web-sample.jsonl.xz: Each line represents a web page and contains a JSON object of d_id, d_url, and content
    • without-wikipedia-tr.jsonl.bz2: Each line represents a text reuse case and contains a JSON array of s_id (Wikipedia article id), d_id (web page id), s_text (article text), and d_content (web page content)
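
    The files listed above could be read line by line along the following lines. This is a minimal sketch using only the Python standard library, not code shipped with the corpus. The helper names iter_jsonl and as_record and the constant REUSE_FIELDS are hypothetical, and the field order in REUSE_FIELDS is an assumption taken from the order in which the description lists the fields. Because the description mentions both JSON arrays and JSON objects, as_record accepts either form.

```python
import bz2
import json
import lzma
from typing import Any, Dict, Iterator, Sequence

# Assumed field order for the within-wikipedia-tr-*.jsonl.bz2 files,
# taken from the order given in the dataset description.
REUSE_FIELDS = ("s_id", "t_id", "s_text", "t_text")


def iter_jsonl(path: str) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of a JSONL file.

    The decompressor is chosen from the file extension (.bz2 or .xz);
    any other extension is read as plain text.
    """
    if path.endswith(".bz2"):
        opener = bz2.open
    elif path.endswith(".xz"):
        opener = lzma.open
    else:
        opener = open
    # Text mode streams the file, so it is never fully held in memory.
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def as_record(value: Any, fields: Sequence[str]) -> Dict[str, Any]:
    """Return a dict whether the line held a JSON object or a JSON array."""
    if isinstance(value, dict):
        return value
    # Assumption: array entries follow the order given in `fields`.
    return dict(zip(fields, value))
```

    With these helpers, looping over iter_jsonl("within-wikipedia-tr-01.jsonl.bz2") and passing each value through as_record(value, REUSE_FIELDS) would give one dictionary per text reuse case, provided the file sits in the working directory and the assumed field order holds.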

    The datasets were extracted in the work by Alshomary et al. (2018), which aimed to study text reuse phenomena related to Wikipedia at scale. A pipeline for large-scale text reuse extraction was developed and applied to Wikipedia and the Common Crawl.
