9 datasets found
  1. Data from: Webis-Web-Archive-17

    • zenodo.org
    • webis.de
    • +2 more
    png, txt, zip
    Updated Jan 21, 2021
  2. Webis-Web-Archive-Quality-22

    • webis.de
    6881334
    Updated 2022
  3. Webis-Web-Archive-17 Content Error Annotations

    • explore.openaire.eu
    • zenodo.org
    Updated Jan 25, 2019
  4. Webis-Web-Errors-19

    • zenodo.org
    • webis.de
    • +1 more
    csv, png, txt
    Updated Sep 21, 2020
  5. Webis-Web-Archive-17 Content Error Annotations

    • zenodo.org
    csv, txt
    Updated Sep 21, 2020
  6. Webis-WebSeg-20

    • zenodo.org
    • webis.de
    • +1 more
    txt, zip
    Updated Feb 16, 2023
    + more versions
  7. Webis-Web-Segments-20

    • zenodo.org
    • explore.openaire.eu
    txt, zip
    Updated Feb 16, 2023
  8. Webis Clickbait Corpus 2017 (Webis-Clickbait-17)

    • live.european-language-grid.eu
    • zenodo.org
    html
    Updated May 29, 2022
    + more versions
  9. geohist.ca website files/fichiers du site web geohist.ca

    • borealisdata.ca
    application/gzip
    Updated Jun 2, 2022

Cite
Johannes Kiesel; Martin Potthast; Matthias Hagen; Florian Kneist; Benno Stein (2021). Webis-Web-Archive-17 [Dataset]. http://doi.org/10.5281/zenodo.1002204

Data from: Webis-Web-Archive-17

Available download formats: zip, txt, png
Dataset updated
Jan 21, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Johannes Kiesel; Martin Potthast; Matthias Hagen; Florian Kneist; Benno Stein
License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

The Webis-Web-Archive-17 comprises a total of 10,000 web page archives from mid-2017 that were carefully sampled from the Common Crawl to include a mixture of high-ranking and low-ranking web pages. The dataset contains the web archive files, HTML DOM, and screenshots of each web page, as well as per-page annotations of visual web archive quality. See this overview for all datasets that build upon this one. If you use this dataset in your research, please cite it using this paper.
