  1. Webis-Web-Archive-17

    • webis.de
    • zenodo.org
    Published 2017
  2. Webis-Web-Archive-17 Content Error Annotations

    • zenodo.org
    Published Mar 22, 2019
  3. Webis-Clickbait-17

    • webis.de
    Published 2017
  4. Webis-Web-Archive-17 Content Error Annotations

    • zenodo.org
    • zenodo.figshare.com
    Published Jan 25, 2019

  • Dataset published 2017
Dataset provided by
Bauhaus University, Weimar (http://www.uni-weimar.de/)
The Web Technology & Information Systems Network
Authors
Kiesel, Johannes; Potthast, Martin; Hagen, Matthias; Stein, Benno; Kneist, Florian
Description

The Webis-Web-Archive-17 comprises a total of 10,000 web page archives from mid-2017, each annotated with how well the web page can be reproduced from the archive. This data aims to be the foundation for other web datasets. Archiving the web pages makes them reproducible, which we see as a requirement for conducting research on web page analysis tools. A key question in this regard is how well the web pages can be reproduced from the archive with current technology. To answer it, we had human annotators grade the achieved reproduction on a 5-point scale. Annotations were collected using crowdsourcing and a tailored annotation interface. Each archive contains the files requested by the browser to display a single web page. The web pages were carefully sampled from the Common Crawl to include a mixture of high-ranking and low-ranking web pages. The process is described in detail in an upcoming publication.
