2 datasets found
  1. Webis-Sentences-17

    • webis.de
    • anthology.aicmu.ac.cn
    205950
    Updated 2017
    Cite
    Johannes Kiesel; Benno Stein; Stefan Lucks (2017). Webis-Sentences-17 [Dataset]. http://doi.org/10.5281/zenodo.205950
    Available download formats: 205950
    Dataset updated
    2017
    Dataset provided by
    Bauhaus-Universität Weimar
    The Web Technology & Information Systems Network
    GESIS - Leibniz Institute for the Social Sciences
    Authors
    Johannes Kiesel; Benno Stein; Stefan Lucks
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Webis-Sentences-17 corpus is a collection of 3,369,618,811 sentences extracted from the ClueWeb12 web crawl. It is designed to allow for statistical analyses of human-written sentences. More details on the sentence extraction can be found in the associated publication.

  2. Webis-Simple-Sentences-17 Corpus

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    • +1 more
    txt
    Updated Apr 17, 2024
    Cite
    Webis-Simple-Sentences-17 Corpus [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7442
    Available download formats: txt
    Dataset updated
    Apr 17, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A corpus of 471,085,690 English sentences extracted from the ClueWeb12 web crawl. The sentences were sampled from a larger corpus to achieve a level of sentence complexity similar to that of sentences that humans make up as a memory aid for remembering passwords. Sentence complexity was determined by syllables per word. The corpus is split into a training set and a test set, as used in the associated publication. The test set is extracted from part 00 of ClueWeb12, while the training set is extracted from the other parts. More information on the corpus can be found on the corpus web page at our university (listed under "documented by").
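
    To illustrate the "syllables per word" complexity measure mentioned in the description, here is a minimal sketch. It is not the corpus authors' method: the vowel-group heuristic and the function names `count_syllables` and `syllables_per_word` are assumptions made for illustration, and the associated publication should be consulted for the actual procedure.

```python
import re


def count_syllables(word: str) -> int:
    """Estimate syllables by counting runs of consecutive vowels.

    Assumption: a crude heuristic for English, not the method used
    to build the corpus. Every word counts as at least one syllable.
    """
    vowel_groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(vowel_groups))


def syllables_per_word(sentence: str) -> float:
    """Average estimated syllables per alphabetic word in a sentence."""
    words = re.findall(r"[A-Za-z]+", sentence)
    if not words:
        return 0.0
    return sum(count_syllables(w) for w in words) / len(words)


if __name__ == "__main__":
    # Hypothetical usage: score one candidate sentence.
    print(syllables_per_word("The sentences were sampled from a larger corpus"))
```

    A sampler built on such a score could keep only sentences whose value falls within a target range, but the range used for this corpus is not stated here.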

