2 datasets found

Webis-Gmane-19
webis.de
anthology.aicmu.ac.cn
3766984
Updated 2019
Cite
Janek Bevendorff; Khalid Al-Khatib; Martin Potthast; Benno Stein (2019). Webis-Gmane-19 [Dataset]. http://doi.org/10.5281/zenodo.3766984
Unique identifier
https://doi.org/10.5281/zenodo.3766984
Dataset updated
2019
Dataset provided by
The Web Technology & Information Systems Network
Bauhaus-Universität Weimar
University of Kassel, hessian.AI, and ScaDS.AI
Bauhaus-Universität Weimar and Leipzig University
University of Groningen
Authors
Janek Bevendorff; Khalid Al-Khatib; Martin Potthast; Benno Stein
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A large-scale corpus of over 153 million fully segmented emails from 14,635 public mailing lists.
The Webis Gmane Email Corpus 2019 is a dataset of more than 153 million parsed and segmented emails crawled between February and May 2019 from gmane.io covering more than 20 years of public mailing lists. The dataset has been published as a resource at ACL 2020.
Webis Gmane Email Corpus 2019
zenodo.org
Updated Jun 4, 2020
Cite
Janek Bevendorff; Khalid Al-Khatib; Martin Potthast; Benno Stein (2020). Webis Gmane Email Corpus 2019 [Dataset]. http://doi.org/10.5281/zenodo.3766985
Unique identifier
https://doi.org/10.5281/zenodo.3766985
Dataset updated
Jun 4, 2020
Dataset provided by
Zenodo, http://zenodo.org/
Authors
Janek Bevendorff; Khalid Al-Khatib; Martin Potthast; Benno Stein
Description
The Webis Gmane Email Corpus 2019 is a dataset of more than 153 million parsed and segmented emails crawled between February and May 2019 from gmane.io covering more than 20 years of public mailing lists. The dataset has been published as a resource at ACL 2020.

The dataset comes as a set of Gzip-compressed files containing line-based JSON in the Elasticsearch bulk format. Each data record consists of two lines:

{"index": {"_id": "

The first line is the Elasticsearch index action with a document UUID. The second line is the parsed email itself, with a (reduced and anonymized) set of headers, the detected language, the original Gmane group name, and the predicted content segments as character spans. The Gzip files are splittable every 1,000 records (line pairs) for parallel processing with, for example, Hadoop.
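The two-line record layout described above can be read with a small generator. The following is a minimal sketch in Python, not the authors' tooling: the helper name `iter_bulk_records` is hypothetical, the sample record is made up, and the `lang` key in it is only a placeholder for the real document fields.

```python
import gzip
import io
import json
from typing import BinaryIO, Iterator, Tuple, Union


def iter_bulk_records(source: Union[str, BinaryIO]) -> Iterator[Tuple[str, dict]]:
    """Yield (document_id, email) pairs from a Gzip file in Elasticsearch bulk format.

    `source` is a file path or a binary file object. Hypothetical helper.
    """
    with gzip.open(source, mode="rt", encoding="utf-8") as lines:
        for action_line in lines:
            if not action_line.strip():
                continue  # tolerate blank lines between records
            # The document line always follows its index action line.
            email_line = next(lines, None)
            if email_line is None:
                raise ValueError("index action without a following document line")
            action = json.loads(action_line)
            yield action["index"]["_id"], json.loads(email_line)


# Made-up sample: one record (action line + document line), compressed in memory.
sample = io.BytesIO()
with gzip.GzipFile(fileobj=sample, mode="wb") as gz:
    gz.write(b'{"index": {"_id": "doc-1"}}\n{"lang": "en"}\n')
sample.seek(0)

records = list(iter_bulk_records(sample))
print(records)
```

Because the generator reads one line pair at a time, it never holds a whole file in memory, which matters for a corpus of this size.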

Available email headers are:

message_id

date (yyyy-MM-dd HH:mm:ssZZ)

subject

from

to

cc

in_reply_to

references

list_id
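The date pattern listed above can be parsed with the Python standard library. This is a sketch under assumptions: it reads `yyyy-MM-dd HH:mm:ssZZ` as a Joda-style pattern in which `ZZ` is a numeric UTC offset, the helper name `parse_date_header` is hypothetical, and the sample value is made up rather than taken from the corpus. It relies on Python 3.7 or later, where `%z` accepts offsets with or without a colon.

```python
from datetime import datetime

# strptime equivalent of the documented pattern (assumed interpretation).
DATE_FORMAT = "%Y-%m-%d %H:%M:%S%z"


def parse_date_header(value: str) -> datetime:
    """Parse a `date` header value into a timezone-aware datetime."""
    return datetime.strptime(value, DATE_FORMAT)


# Made-up sample value for illustration.
parsed = parse_date_header("2019-03-01 12:34:56+01:00")
print(parsed.isoformat())
```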

Available segment classes are:

paragraph

closing

inline_headers

log_data

mua_signature

patch

personal_signature

quotation

quotation_marker

raw_code

salutation

section_heading

tabular

technical

visual_separator
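Since segments are stored as character spans rather than as text, a consumer has to slice the email body to recover them. The sketch below groups segment text by class. The key names `begin`, `end`, and `label`, the end-exclusive span convention, and the sample email are all assumptions made for illustration; check them against the actual records before relying on them.

```python
from typing import Dict, List


def segments_by_label(text: str, segments: List[dict]) -> Dict[str, List[str]]:
    """Group the text of each span under its segment class.

    Assumes each segment is a dict with `begin`, `end` (exclusive), and `label`.
    """
    grouped: Dict[str, List[str]] = {}
    for segment in segments:
        span_text = text[segment["begin"]:segment["end"]]
        grouped.setdefault(segment["label"], []).append(span_text)
    return grouped


# Made-up email body and spans using class names from the list above.
body = "Hi all,\nthe patch works.\nBest, Sam\n"
spans = [
    {"begin": 0, "end": 7, "label": "salutation"},
    {"begin": 8, "end": 24, "label": "paragraph"},
    {"begin": 25, "end": 34, "label": "closing"},
]

grouped = segments_by_label(body, spans)
print(grouped["paragraph"])
```

Filtering on classes such as `quotation` or `mua_signature` in the same way would leave only the text the sender actually wrote.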

Find more information about the dataset and the segmentation model at webis.de.

If you are using this resource in your work, please cite it as:

@InProceedings{stein:2020o,
  author = {Janek Bevendorff and Khalid Al-Khatib and Martin Potthast and Benno Stein},
  booktitle = {58th Annual Meeting of the Association for Computational Linguistics (ACL 2020)},
  month = jul,
  publisher = {Association for Computational Linguistics},
  site = {Seattle, USA},
  title = {{Crawling and Preprocessing Mailing Lists At Scale for Dialog Analysis}},
  year = 2020
}

4 scholarly articles cite the Webis-Gmane-19 dataset (per Google Scholar).
