Webis-Ambient-15
zenodo.org
webis.de
application/gzip
Updated Jan 24, 2020
Cite
Matthias Hagen; Tim Gollub; Matthias Busse (2020). Webis-Ambient-15 [Dataset]. http://doi.org/10.5281/zenodo.3250669
Available download formats
application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.3250669
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Matthias Hagen; Tim Gollub; Matthias Busse
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This corpus extends the Ambient data set created by Carpineto and Romano. For each subtopic, the web pages behind the given URLs were downloaded (if accessible). These documents are named after the original documents, for example, 1/1.4/1.3.html. Each subtopic was then manually enriched to ten documents with web pages retrieved by Google (for example, 1/1.1/g00.html: 'g' for Google, 00 for the first Google result). Some subtopics could not be sufficiently enriched and were discarded. Subtopics that were duplicates or not interpretable were discarded as well.
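
The naming scheme can be illustrated with a small Python sketch. It assumes, based only on the two example paths above, that every document path has the form topic/subtopic/name.html and that Google-enriched files are named 'g' followed by digits. The names DocName and parse_doc_path are hypothetical helpers and are not part of the corpus.

```python
import re
from typing import NamedTuple, Optional


class DocName(NamedTuple):
    topic: str                   # e.g. "1"
    subtopic: str                # e.g. "1.4"
    source: str                  # "google" or "original"
    google_index: Optional[int]  # 0 for g00, i.e. the first Google result
    original_name: Optional[str] # file stem of an original document, e.g. "1.3"


# Assumed layout: <topic>/<topic>.<n>/<name>.html (inferred from the examples).
_PATH = re.compile(r"^(?P<topic>\d+)/(?P<subtopic>\d+\.\d+)/(?P<name>[^/]+)\.html$")
_GOOGLE = re.compile(r"g(\d+)")


def parse_doc_path(path: str) -> DocName:
    """Split a corpus-relative path into topic, subtopic, and document origin."""
    match = _PATH.match(path)
    if match is None:
        raise ValueError(f"unexpected document path: {path!r}")
    name = match["name"]
    google = _GOOGLE.fullmatch(name)
    if google:
        # 'g' marks a page retrieved via Google; the digits are its index.
        return DocName(match["topic"], match["subtopic"], "google",
                       int(google.group(1)), None)
    # Anything else is treated as a page from the original Ambient URLs.
    return DocName(match["topic"], match["subtopic"], "original", None, name)


print(parse_doc_path("1/1.1/g00.html"))
print(parse_doc_path("1/1.4/1.3.html"))
```

Paths that do not follow the assumed layout raise a ValueError instead of being guessed at, which makes deviations in the actual archive visible early.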

The data set consists of 44 topics (topics.txt) and 481 subtopics (subtopics.txt). Some subtopics are topically very similar and therefore rather difficult to cluster. These 13 subtopics (11.2, 12.13, 14.2, 19.33, 20.2, 20.5, 21.2, 24.3, 24.4, 27.26, 31.16, 36.7, 44.9) are omitted from the file subtopics-filtered.txt, which lists only the remaining 468 subtopics.
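
The relationship between the two subtopic files can be sketched in Python as follows. The sketch works on subtopic IDs already held in memory, because the line format of subtopics.txt is not described here. The helper name filter_subtopics is hypothetical; the ID strings are copied from the list above.

```python
# IDs of the subtopics that the description lists as omitted from
# subtopics-filtered.txt (topically too similar to cluster reliably).
DISCARDED = {
    "11.2", "12.13", "14.2", "19.33", "20.2", "20.5", "21.2",
    "24.3", "24.4", "27.26", "31.16", "36.7", "44.9",
}


def filter_subtopics(subtopic_ids):
    """Return the given IDs without the discarded ones, preserving order.

    IDs are compared as strings so that "20.5" and "20.50" stay distinct.
    """
    return [sid for sid in subtopic_ids if sid not in DISCARDED]


# Consistency check against the numbers stated in the description.
assert 481 - len(DISCARDED) == 468

print(filter_subtopics(["11.1", "11.2", "12.13", "44.9", "44.10"]))
```

Comparing IDs as strings rather than floats is a deliberate choice: the part after the dot is a counter, not a decimal fraction.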