Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This corpus contains preprocessed posts from the Reddit dataset, suitable for abstractive summarization using deep learning. The data is stored as a JSON Lines file, where each line is a JSON object representing one post. The schema of each post is shown below:
Specifically, the content and summary fields can be used directly as inputs to a deep learning model (e.g. a sequence-to-sequence model). The dataset consists of 3,848,330 posts with an average length of 270 words for the content and 28 words for the summary. The dataset combines both Submissions and Comments, merged on a common schema; as a result, most comments, which do not belong to any submission, have null as their title.
Note: This corpus does not include a separate test set, so it is up to users to divide it into appropriate training, validation, and test sets (see the sketch below).
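Because the corpus ships as a single JSON Lines file with no official split, a minimal Python sketch along the following lines can read the (content, summary) pairs and create a train/validation/test split. The file name and the 90/5/5 ratios are assumptions for illustration, not part of the corpus itself.

```python
import json
import random

def load_pairs(path):
    """Yield (content, summary) pairs from the JSON Lines corpus."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            post = json.loads(line)
            # Only the content and summary fields are needed for seq2seq training.
            if post.get("content") and post.get("summary"):
                yield post["content"], post["summary"]

# Hypothetical file name; point this at the downloaded corpus file.
pairs = list(load_pairs("corpus-webis-tldr-17.json"))

# No official split is provided, so create one (90/5/5 is an arbitrary choice).
random.seed(42)
random.shuffle(pairs)
n = len(pairs)
train = pairs[: int(0.9 * n)]
val = pairs[int(0.9 * n): int(0.95 * n)]
test = pairs[int(0.95 * n):]
print(len(train), len(val), len(test))
```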
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Webis TLDR Corpus (2017) consists of approximately 4 million content-summary pairs extracted from the Reddit dataset for the years 2006-2016, intended for abstractive summarization. It is the first corpus of its kind from the social media domain in English and was created to compensate for the lack of variety in the datasets used for abstractive summarization research with deep learning models.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This corpus contains preprocessed posts from the Reddit dataset (Webis-TLDR-17). The dataset consists of 3,848,330 posts with an average length of 270 words for the content and 28 words for the summary. Features include the string fields author, body, normalizedBody, content, summary, subreddit, and subreddit-id. The content field is used as the document and the summary field as the reference summary.
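If the corpus is accessed through the Hugging Face datasets library, a sketch like the following shows how the fields map to a document-summary pair. The hub identifier "webis/tldr-17" is an assumption about how the corpus is listed there, not something stated above.

```python
from datasets import load_dataset

# Assumed hub identifier for this corpus; trust_remote_code may be required
# for script-based datasets in recent versions of the library.
ds = load_dataset("webis/tldr-17", split="train", trust_remote_code=True)

example = ds[0]
document = example["content"]   # used as the source document
reference = example["summary"]  # used as the target summary
print(document[:200])
print("TL;DR:", reference[:100])
```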
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains 3 million pairs of content and self-written summaries mined from Reddit. It is one of the first large-scale summarization datasets from the social media domain.