Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This corpus contains preprocessed posts from the Reddit dataset. The dataset consists of 3,848,330 posts, with an average length of 270 words for the content and 28 words for the summary.
Features include the strings author, body, normalizedBody, content, summary, subreddit, and subreddit_id. The content field is used as the document and the summary field as the summary.
This corpus contains preprocessed posts from the Reddit dataset, suitable for abstractive summarization using deep learning. The format is a JSON Lines file, where each line is a JSON object representing a post. The schema of each post is shown below:
- author: string (nullable = true)
- body: string (nullable = true)
- normalizedBody: string (nullable = true)
- content: string (nullable = true)
- content_len: long (nullable = true)
- summary: string (nullable = true)
- summary_len: long (nullable = true)
- id: string (nullable = true)
- subreddit: string (nullable = true)
- subreddit_id: string (nullable = true)
- title: string (nullable = true)
Specifically, the content and summary fields can be used directly as inputs to a deep learning model (e.g. a sequence-to-sequence model). The dataset consists of 3,848,330 posts, with an average length of 270 words for the content and 28 words for the summary. The dataset is a combination of both the Submissions and Comments, merged on the common schema. As a result, most of the comments, which do not belong to any submission, have null as their title.
Note: This corpus does not contain a separate test set. It is therefore up to users to divide the corpus into appropriate training, validation, and test sets.
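As an illustration, here is a minimal R sketch that streams the JSON Lines file, keeps the content and summary fields, and draws a simple random split; the file name and the 90/5/5 proportions are assumptions, not part of the corpus distribution:
library(jsonlite)

# Stream the corpus (one JSON object per line) into a data frame.
# "corpus-webis-tldr-17.json" is an assumed file name; adjust it to your local copy.
posts <- jsonlite::stream_in(file("corpus-webis-tldr-17.json"))

# Keep only the model inputs: document (content) and reference summary (summary).
pairs <- posts[, c("content", "summary")]

# Random 90/5/5 train/validation/test split (proportions are illustrative).
set.seed(42)
split <- sample(c("train", "valid", "test"), nrow(pairs), replace = TRUE,
                prob = c(0.90, 0.05, 0.05))
train <- pairs[split == "train", ]
valid <- pairs[split == "valid", ]
test  <- pairs[split == "test", ]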
The Webis TLDR Corpus (2017) consists of approximately 4 million content-summary pairs extracted for abstractive summarization from the Reddit dataset for the years 2006-2016. This corpus is the first of its kind from the social media domain in English and was created to compensate for the lack of variety in the datasets used for abstractive summarization research with deep learning models.
This repository contains code and data for reproducing the study Geospatiality: The effect of topics on the presence of geolocation in English text data.
The study analyzed the frequency of geolocations in texts across several distinct datasets from different sources: Reddit, Stack Exchange, Nairaland, and GDELT.
For each source, a dataset was acquired and tested for the presence of geolocations in the texts, as well as annotated with topic labels.
The scripts take as input the data from the zip files in the data directory; these files need to be unzipped before running the scripts, for example as sketched below. Note that usernames have been anonymized.
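A short R sketch of this step, assuming the zip archives sit in a directory named data relative to the scripts (an assumption about the repository layout):
# List and extract every zip archive in the data directory.
zips <- list.files("data", pattern = "\\.zip$", full.names = TRUE)
invisible(lapply(zips, unzip, exdir = "data"))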
The repository contains the following R scripts:
- E_Modeling.R: applies the mixed modeling approach described in the article.
- F1_Analyze_FracGeo.R: produces figures and tables visualising FracGeo, the fraction of geolocated text items per supertopic and dataset (Table 3 and Figure 3).
- F2_Explore_Variables.R: analyses FracGeo across timesteps, authors, and text length (Figure 4).
- F3_Analyze_Models.R: analyses the fixed effects of the GLMM models for each dataset and compares their correlation across datasets (Table 4, Figure 5, and Appendices A1-A6).
- F4_Validate.R: compares the georeferences and supertopic assignments of the models to the human annotations (Appendix 9 and Table 5).
The file topic_taxonomy.xlsx contains the topic taxonomy, which matches topics to site-specific categories (e.g. subreddits, subforums, Stack Exchange sites). For users without access to MS Office, the file can be loaded using open scripting languages, for example R:
library(openxlsx2)

# Path to the taxonomy workbook; it contains one sheet per source.
path <- "../2_Data_Processing/Topic_taxonomy.xlsx"

# Read each source-specific taxonomy sheet into a data frame.
tax_reddit <- openxlsx2::wb_read(path, sheet = "Topic_Taxonomy_Reddit")
tax_Stackexchange <- openxlsx2::wb_read(path, sheet = "Topic_Taxonomy_Stackexchange")
tax_Nairaland <- openxlsx2::wb_read(path, sheet = "Topic_Taxonomy_Nairaland")
tax_GDELT <- openxlsx2::wb_read(path, sheet = "Topic_Taxonomy_GDELT")
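As a usage example, the four taxonomies can then be stacked into a single data frame with a source column; this sketch assumes the sheets share the same column layout, which is an assumption about the workbook rather than something stated above:
# Combine the per-source taxonomies; assumes identical columns across sheets.
tax_all <- do.call(rbind, list(
  cbind(source = "Reddit",        tax_reddit),
  cbind(source = "Stackexchange", tax_Stackexchange),
  cbind(source = "Nairaland",     tax_Nairaland),
  cbind(source = "GDELT",         tax_GDELT)
))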