4 results found
  1. Webis-Web-Archive-17

    Published 2017
  2. Webis-Web-Archive-17 Content Error Annotations

    Published Mar 22, 2019
  3. Webis-Clickbait-17

    Published 2017
  4. Webis-Web-Archive-17 Content Error Annotations

    Published Jan 25, 2019

  • Dataset published 2017
Dataset provided by
Bauhaus University, Weimar
The Web Technology & Information Systems Network
Kiesel, Johannes; Potthast, Martin; Hagen, Matthias; Stein, Benno; Kneist, Florian

The Webis-Web-Archive-17 comprises 10,000 web page archives from mid-2017, each annotated with how well the web page can be reproduced from its archive. This data aims to be the foundation for other web datasets. Archiving the web pages makes them reproducible, which we see as a requirement for conducting research on web page analysis tools. A key question in this regard is how well the web pages can be reproduced from the archive with current technology. To answer it, we had human annotators grade the achieved reproduction on a 5-point scale. Annotations were collected using crowdsourcing and a tailored annotation interface. Each archive contains the files requested by the browser to display a single web page. The web pages were carefully sampled from the Common Crawl to include a mixture of high-ranking and low-ranking web pages. The process is described in detail in an upcoming publication.
