    PAN Wikipedia Quality Flaw Corpus 2012 (PAN-WQF-12)

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    Updated Apr 24, 2024
    Cite
    (2024). PAN Wikipedia Quality Flaw Corpus 2012 (PAN-WQF-12) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7531
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The PAN Wikipedia Quality Flaw Corpus 2012 (PAN-WQF-12) provides human-labeled English Wikipedia articles that contain specific quality flaws. The corpus comprises 1,592,226 articles extracted from the English Wikipedia snapshot of January 4, 2012. A subset of 208,228 articles is labeled with ten specific quality flaws, which are listed in the following table. The labeling is based on human-defined cleanup tags. The remaining 1,383,998 articles have not been tagged with any cleanup tag.


2 scholarly articles cite this dataset.