7 datasets found
  1. Data from: Webis-Web-Archive-17

    • webis.de
    • anthology.aicmu.ac.cn
    • +2 more
    1002203
    Updated 2017
    Cite
    Johannes Kiesel; Martin Potthast; Matthias Hagen; Benno Stein; Florian Kneist (2017). Webis-Web-Archive-17 [Dataset]. http://doi.org/10.5281/zenodo.1002203
    Available download formats: 1002203
    Dataset updated
    2017
    Dataset provided by
    GESIS - Leibniz Institute for the Social Sciences
    Friedrich Schiller University Jena
    University of Kassel, hessian.AI, and ScaDS.AI
    Bauhaus-Universität Weimar
    The Web Technology & Information Systems Network
    Authors
    Johannes Kiesel; Martin Potthast; Matthias Hagen; Benno Stein; Florian Kneist
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Webis-Web-Archive-17 comprises a total of 10,000 web page archives from mid-2017 that were carefully sampled from the Common Crawl to include a mixture of high-ranking and low-ranking web pages. The dataset contains the web archive files, HTML DOM, and screenshots of each web page, as well as per-page annotations of visual web archive quality.

  2. Webis-Web-Archive-Quality-22

    • anthology.aicmu.ac.cn
    • webis.de
    6881334
    Updated 2022
    Cite
    Martin Potthast; Johannes Kiesel; Benno Stein (2022). Webis-Web-Archive-Quality-22 [Dataset]. http://doi.org/10.5281/zenodo.6881334
    Available download formats: 6881334
    Dataset updated
    2022
    Dataset provided by
    Bauhaus-Universität Weimar
    Leipzig University
    The Web Technology & Information Systems Network
    Authors
    Martin Potthast; Johannes Kiesel; Benno Stein
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Webis-Web-Archive-Quality-22 comprises a total of 6,500 pairs of screenshots of web pages, each pair showing a page as it was archived and as it was reproduced from that archive, along with archive quality annotations and information about the DOM elements shown in the screenshots.

  3. Webis-Web-Errors-19

    • data.niaid.nih.gov
    • webis.de
    • +2 more
    Updated Jul 24, 2024
    Cite
    Potthast, Martin (2024). Webis-Web-Errors-19 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2549837
    Dataset updated
    Jul 24, 2024
    Dataset provided by
    Stein, Benno
    Potthast, Martin
    Kiesel, Johannes
    Hubricht, Fabienne
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Webis-Web-Errors-19 comprises various annotations for the 10,000 web page archives of the Webis-Web-Archive-17. The annotations record whether the page is (1) mostly advertisement, (2) cut off, (3) still loading, or (4) pornographic, and to what degree (not / a bit / very) it shows (5) pop-ups, (6) CAPTCHAs, or (7) error messages. If you use this dataset in your research, please cite it using this paper.

  4. Webis-WebSeg-20

    • zenodo.org
    • webis.de
    • +1 more
    txt, zip
    Updated Feb 16, 2023
    Cite
    Johannes Kiesel; Florian Kneist; Lars Meyer; Kristof Komlossy; Benno Stein; Martin Potthast (2023). Webis-WebSeg-20 [Dataset]. http://doi.org/10.5281/zenodo.3988124
    Available download formats: zip, txt
    Dataset updated
    Feb 16, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Johannes Kiesel; Florian Kneist; Lars Meyer; Kristof Komlossy; Benno Stein; Martin Potthast
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Webis-WebSeg-20 dataset comprises 42,450 crowdsourced segmentations for 8,490 web pages from the Webis-Web-Archive-17. Segmentations were fused from the segmentations of five crowd workers each. If you use this dataset in your research, please cite it using this paper.
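
    The two counts in this description agree if every page received exactly five crowd segmentations. A minimal sketch of that consistency check, assuming exactly five workers per page (the function and constant names are hypothetical and not part of the dataset's tooling):

```python
# Figures quoted in the dataset description.
PAGES = 8_490
WORKERS_PER_PAGE = 5  # assumption: exactly five crowd workers per page


def expected_segmentations(pages: int, workers_per_page: int) -> int:
    """Return the number of individual segmentations, one per worker and page."""
    return pages * workers_per_page


total = expected_segmentations(PAGES, WORKERS_PER_PAGE)
print(total)
```

    The result matches the 42,450 segmentations stated in the description.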

  5. Webis-Web-Archive-17 Content Error Annotations

    • zenodo.org
    csv
    Updated Sep 21, 2020
    Cite
    Johannes Kiesel; Fabienne Hubricht; Benno Stein; Martin Potthast (2020). Webis-Web-Archive-17 Content Error Annotations [Dataset]. http://doi.org/10.5281/zenodo.2549838
    Available download formats: csv
    Dataset updated
    Sep 21, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Johannes Kiesel; Fabienne Hubricht; Benno Stein; Martin Potthast
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Annotations of content errors in the Webis-Web-Archive-17.

    Described in more detail in an upcoming publication.

  6. Webis-Web-Segments-20

    • zenodo.org
    • data.niaid.nih.gov
    txt, zip
    Updated Feb 16, 2023
    Cite
    Johannes Kiesel; Florian Kneist; Lars Meyer; Kristof Komlossy; Benno Stein; Martin Potthast (2023). Webis-Web-Segments-20 [Dataset]. http://doi.org/10.5281/zenodo.3884468
    Available download formats: zip, txt
    Dataset updated
    Feb 16, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Johannes Kiesel; Florian Kneist; Lars Meyer; Kristof Komlossy; Benno Stein; Martin Potthast
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset of crowdsourced annotations for web page segmentations.

    Web pages are taken from the Webis-Web-Archive-17.

  7. Webis Clickbait Corpus 2017 (Webis-Clickbait-17)

    • data.niaid.nih.gov
    • live.european-language-grid.eu
    • +1 more
    Updated Jun 11, 2022
    Cite
    Potthast, Martin (2022). Webis Clickbait Corpus 2017 (Webis-Clickbait-17) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3346490
    Dataset updated
    Jun 11, 2022
    Dataset provided by
    Schuster, Sebastian
    Hagen, Matthias
    Fernandez, Erika P. Garces
    Komlossy, Kristof
    Wiegmann, Matti
    Gollub, Tim
    Stein, Benno
    Potthast, Martin
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Webis Clickbait Corpus 2017 (Webis-Clickbait-17) comprises a total of 38,517 Twitter posts from 27 major US news publishers. In addition to the posts, information about the articles linked in the posts is included. The posts were published between November 2016 and June 2017. To avoid publisher and topical biases, a maximum of ten posts per day and publisher were sampled. All posts were annotated on a 4-point scale [not click baiting (0.0), slightly click baiting (0.33), considerably click baiting (0.66), heavily click baiting (1.0)] by five annotators from Amazon Mechanical Turk. A total of 9,276 posts are considered clickbait by the majority of annotators. In terms of size, this corpus exceeds the Webis Clickbait Corpus 2016 by one order of magnitude. The corpus is divided into two logical parts, a training and a test dataset. The training dataset has been released in the course of the Clickbait Challenge and a download link is provided below. To allow for an objective evaluation of clickbait detection systems, the test dataset is currently available only through the Evaluation-as-a-Service platform TIRA. On TIRA, developers can deploy clickbait detection systems and execute them against the test dataset. The performance of the submitted systems can be viewed on the TIRA page of the Clickbait Challenge.
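
    The annotation scheme above can be illustrated with a small aggregation sketch. The description does not state how an individual judgment is turned into a clickbait vote, so the 0.5 cut-off below is an assumption for illustration only, as are the function and constant names; this is not the corpus authors' released code.

```python
from statistics import mean

# The four scale values named in the corpus description.
SCALE = (0.0, 0.33, 0.66, 1.0)

# Assumption (not stated in the description): a single judgment counts as a
# clickbait vote when it lies above the midpoint of the scale.
CLICKBAIT_THRESHOLD = 0.5


def aggregate(judgments):
    """Summarise one post's judgments: mean score, vote count, majority flag."""
    if any(j not in SCALE for j in judgments):
        raise ValueError("judgments must use the 4-point scale")
    votes = sum(1 for j in judgments if j > CLICKBAIT_THRESHOLD)
    return {
        "mean": mean(judgments),
        "clickbait_votes": votes,
        "majority_clickbait": votes > len(judgments) / 2,
    }


# Hypothetical judgments from five annotators for a single post.
summary = aggregate([0.0, 0.33, 0.66, 1.0, 1.0])
print(summary["clickbait_votes"], summary["majority_clickbait"])
```

    With the hypothetical judgments shown, three of the five annotators vote clickbait, so the post would count toward the majority-clickbait total under this assumed rule.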

    To make working with the Webis Clickbait Corpus 2017 convenient, and to allow for its validation and replication, we are developing and sharing a number of software tools:

    Corpus Viewer. Our Django web service for exploring corpora. For importing the Webis Clickbait Corpus 2017 into the corpus viewer, we provide an appropriate configuration file.

    MTurk Manager. Our Django web service for conducting sophisticated crowdsourcing tasks on Amazon Mechanical Turk. The service makes it possible to manage projects, upload batches of HITs, apply custom reviewing interfaces, and more. To make the clickbait crowdsourcing task replicable, we share the worker template that we used to instruct the workers and to display the tweets. Also shared is a reviewing template that can be used to accept or reject assignments and to assess the quality of the received annotations quickly.

    Web Archiver. Software for archiving web pages as WARC files and reproducing them later on. This software can be used to open the WARC archives provided above.

    In addition to the corpus "clickbait17-train-170630.zip", we provide the original WARC archives of the articles that are linked in the posts. They are split into 5 archives that can be extracted separately.

