22 results found
  1. Webis-Sentences-17

    • webis.de
    • temir.org
    Published Feb 27, 2017
  2. Webis-Simple-Sentences-17 Corpus

    • zenodo.org
    • search.datacite.org
    Published Feb 27, 2017
  3. Webis-QSpell-17

    • webis.de
    • temir.org
    • +1 more
    Published 2017
  4. Webis-Mnemonics-17

    • webis.de
    • temir.org
    • +1 more
    Published 2017
  5. Data for: Language Models, Surprisal and Fantasy in Slavic...

    • data.mendeley.com
    Updated Aug 29, 2018
  6. Bengali [bn-bd] ASR

    • ai.google
    • research.google
  7. Bilingual English-Icelandic parallel corpus from Nordisk eTax website

    • data.europa.eu
    Updated Oct 10, 2019
  8. Phrases in email subject lines

    • www.getresponse.com
  9. SNAP Memetracker

    • www.kaggle.com
    Updated Nov 21, 2016
  10. Pre-trained Word Vectors for Spanish

    • www.kaggle.com
    Updated Aug 9, 2017
  11. Ambulance Services, England - 2014-15

    • digital.nhs.uk
    Published Jun 17, 2015
  12. Bilingual hr-en parallel corpus from the National and University Library in...

    • data.europa.eu
    Updated Oct 10, 2019
  13. Blueways Conservation Decision Support Tool

    • geospatial.tnc.org
    Published Oct 30, 2019
  14. Hazmat Routes (National)

    • catalog.data.gov
    • www.geoplatform.gov
    • +1 more
    Updated Aug 17, 2017
  15. Archival Version

    • www.da-ra.de
    Published Feb 17, 1999
  16. A Canadian French Emotional Speech Dataset

    • zenodo.org
    Published Apr 17, 2018
  17. Archival Version

    • www.da-ra.de
    • www.icpsr.umich.edu
    • +4 more
    Published Jul 13, 1996
  18. National Jail Census Series

    • www.da-ra.de
    Published Jul 13, 1996
  19. Data from: Evaluation of Boot Camps for Juvenile Offenders in Cleveland,...

    • www.da-ra.de
    Published Nov 2, 1999
  20. HF radar daily averaged surface currents from the MOOSE MEDTLN sites (Toulon...

    • erddap.osupytheas.fr
    Created Aug 23, 2018
  21. Data from: Multisite Evaluation of Shock Incarceration: [Florida, Georgia,...

    • www.da-ra.de
    Published Jul 28, 1998
  22. Federal Justice Statistics Program Data Series

    • www.da-ra.de
    Published Mar 2, 1990


  • Dataset published Feb 27, 2017
Dataset provided by
Bauhaus University, Weimar (http://www.uni-weimar.de/)
The Web Technology & Information Systems Network
Lucks, Stefan; Stein, Benno; Kiesel, Johannes

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically


The Webis-Sentences-17 corpus is a collection of 3,369,618,811 sentences extracted from the ClueWeb12 web crawl. It is designed to support statistical analyses of human-written sentences. More details on the sentence extraction can be found in the associated publication. The Webis-Simple-Sentences-17 corpus contains 471,085,690 English sentences from the Webis-Sentences-17 corpus. The sentences were sampled to match the complexity of sentences that humans make up as a memory aid for remembering passwords. Sentence complexity was measured in syllables per word. Both corpora are split into training and test sets as used in the associated publication. The test set is extracted from part 00 of ClueWeb12, while the training set is extracted from the other parts.
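
The two rules in the description (complexity as syllables per word, and the split by ClueWeb12 part) can be illustrated with a minimal Python sketch. It rests on assumptions that are not part of the corpus documentation: the syllable counter is a crude vowel-group heuristic rather than the method used by the corpus authors, and the function names and the string form of the part label ("00") are hypothetical.

```python
import re

# Assumption: a crude vowel-group heuristic, NOT the syllable counter
# used by the corpus authors.
_VOWEL_GROUP = re.compile(r"[aeiouy]+")
_WORD = re.compile(r"[A-Za-z]+")


def count_syllables(word: str) -> int:
    """Estimate syllables as the number of vowel groups (at least 1)."""
    return max(1, len(_VOWEL_GROUP.findall(word.lower())))


def syllables_per_word(sentence: str) -> float:
    """Average estimated syllables per alphabetic word in a sentence."""
    words = _WORD.findall(sentence)
    if not words:
        return 0.0
    return sum(count_syllables(w) for w in words) / len(words)


def split_for_part(part: str) -> str:
    """Hypothetical helper: part "00" maps to the test set, all others to training."""
    return "test" if part == "00" else "training"
```

For example, `syllables_per_word("The cat sat")` returns `1.0` under this heuristic, and `split_for_part("07")` returns `"training"`. A real analysis would likely need a dictionary-based syllable counter, since vowel-group counting miscounts words with silent vowels.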
