PAN20 Authorship Analysis: Celebrity Profiling
zenodo.org
commons.datacite.org
zip
Updated Oct 25, 2023
Cite: Matti Wiegmann; Benno Stein; Martin Potthast (2023). PAN20 Authorship Analysis: Celebrity Profiling [Dataset]. http://doi.org/10.5281/zenodo.4461887
Available download formats: zip
Unique identifier: https://doi.org/10.5281/zenodo.4461887
Dataset updated: Oct 25, 2023
Dataset provided by: Zenodo (http://zenodo.org/)
Authors: Matti Wiegmann; Benno Stein; Martin Potthast
Description
Synopsis

Task: Given the Twitter feeds of a celebrity's followers, determine the celebrity's occupation, age, and gender.

Evaluation: [code]

Baselines: [code]

See the full Shared Task [here]

Each dataset contains three files: follower-feeds.ndjson as input, labels.ndjson as output, and celebrity-feeds.ndjson for additional study. Each file lists all celebrities as JSON objects, one per line, identified by the id key. The training dataset contains 1,920 celebrities and is balanced with respect to gender and occupation. The supplement dataset contains the remaining 8,265 celebrities and is not balanced in any way.

The follower-feeds.ndjson contains English tweets from at least 10 followers of each celebrity, with at least 50 tweets per follower, excluding retweets. It is formatted as:

{"id": 1234, "text": [["a tweet of follower 1", "another tweet of follower 1", ...], ["a tweet of follower 2", ...], ...]}
{"id": 5678, "text": [["a tweet of follower 1", "another tweet of follower 1", ...], ["a tweet of follower 2", ...], ...]}
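
The nested layout above (one JSON object per line, one inner list per follower) could be read along the following lines. This is a minimal sketch rather than the organizers' tooling: the helper names iter_records and follower_stats are hypothetical, and the sample lines are toy data standing in for the real file.

```python
import json
from typing import Dict, Iterable, Iterator, Tuple


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one JSON object per non-empty line (NDJSON)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def follower_stats(record: dict) -> Tuple[int, int]:
    """Return (number of followers, smallest per-follower tweet count)."""
    feeds = record["text"]  # list of followers, each a list of tweet strings
    return len(feeds), min((len(feed) for feed in feeds), default=0)


# Toy stand-in for the lines of follower-feeds.ndjson (not real data).
sample = [
    '{"id": 1234, "text": [["a", "b", "c"], ["d", "e"]]}',
    '{"id": 5678, "text": [["f"]]}',
]
stats: Dict[int, Tuple[int, int]] = {
    rec["id"]: follower_stats(rec) for rec in iter_records(sample)
}
print(stats)  # {1234: (2, 2), 5678: (1, 1)}
```

Because iter_records accepts any iterable of lines, an open file handle can be passed in place of the sample list, which avoids loading the whole file into memory. On the real data, the description implies each celebrity should yield at least 10 followers and at least 50 tweets per follower.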

The celebrity-feeds.ndjson contains the Twitter timelines of the original celebrities, formatted as:

{"id": 1234, "text": ["a tweet of celebrity 1", "another tweet of celebrity 1", ...]}
{"id": 5678, "text": ["a tweet of celebrity 2", "another tweet", ...]}
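
Since every file identifies celebrities by the id key, a celebrity's own timeline can be matched with the feeds of their followers by that key. The sketch below assumes both files have been read as lines; index_by_id is a hypothetical helper and the lines are toy data.

```python
import json


def index_by_id(lines):
    """Map each record's id to its "text" field, skipping blank lines."""
    return {
        rec["id"]: rec["text"]
        for rec in map(json.loads, filter(str.strip, lines))
    }


# Toy stand-ins for celebrity-feeds.ndjson and follower-feeds.ndjson.
celebrity_lines = [
    '{"id": 1234, "text": ["t1", "t2"]}',
    '{"id": 5678, "text": ["t3"]}',
]
follower_lines = [
    '{"id": 1234, "text": [["a", "b"], ["c"]]}',
]

celebrities = index_by_id(celebrity_lines)
followers = index_by_id(follower_lines)

# For ids present in both files: (celebrity tweets, total follower tweets).
shared = sorted(celebrities.keys() & followers.keys())
summary = {
    cid: (len(celebrities[cid]), sum(len(feed) for feed in followers[cid]))
    for cid in shared
}
print(summary)  # {1234: (2, 3)}
```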

The labels.ndjson contains the classes to be predicted. A valid submission must produce a labels.ndjson from the follower-feeds.ndjson, with one entry for each id in the input.

{"id": 1234, "occupation": "sports", "gender": "female", "birthyear": 2002}
{"id": 5678, "occupation": "professional", "gender": "male", "birthyear": 1990}
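
A submission in this shape could be produced roughly as follows. The sketch only illustrates the one-entry-per-id requirement: predict is a hypothetical placeholder that returns a constant guess, not a baseline from the shared task, and an in-memory buffer stands in for the output file.

```python
import io
import json


def predict(feeds):
    """Hypothetical placeholder: a constant guess, not a real model."""
    return {"occupation": "sports", "gender": "female", "birthyear": 1980}


def write_labels(follower_lines, out):
    """Write one label object per input id; return the number written."""
    count = 0
    for line in follower_lines:
        if not line.strip():
            continue  # tolerate blank lines in the input
        record = json.loads(line)
        label = {"id": record["id"], **predict(record["text"])}
        out.write(json.dumps(label) + "\n")
        count += 1
    return count


# Toy stand-in for follower-feeds.ndjson.
follower_lines = [
    '{"id": 1234, "text": [["a", "b"], ["c"]]}',
    '{"id": 5678, "text": [["d"]]}',
]
buffer = io.StringIO()  # stands in for open("labels.ndjson", "w")
written = write_labels(follower_lines, buffer)
output_ids = [json.loads(row)["id"] for row in buffer.getvalue().splitlines()]
print(written, output_ids)  # 2 [1234, 5678]
```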

The following values are possible for each of the traits:

occupation := {sports, performer, creator, politics}
birthyear  := {1940, ..., 1999}
gender     := {male, female}
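
The value sets above lend themselves to a quick sanity check before submitting. The following is a sketch under one assumption: that birthyear covers every integer from 1940 through 1999. The function name label_errors is hypothetical.

```python
OCCUPATIONS = {"sports", "performer", "creator", "politics"}
GENDERS = {"male", "female"}
BIRTHYEARS = range(1940, 2000)  # assumption: all integers 1940..1999


def label_errors(label):
    """Return the names of traits whose values fall outside the listed sets."""
    errors = []
    if label.get("occupation") not in OCCUPATIONS:
        errors.append("occupation")
    if label.get("gender") not in GENDERS:
        errors.append("gender")
    if label.get("birthyear") not in BIRTHYEARS:
        errors.append("birthyear")
    return errors


ok = label_errors(
    {"id": 1, "occupation": "creator", "gender": "male", "birthyear": 1975}
)
bad = label_errors(
    {"id": 2, "occupation": "professional", "gender": "male", "birthyear": 2002}
)
print(ok, bad)  # [] ['occupation', 'birthyear']
```

Under this reading, the two sample label lines shown earlier would each be flagged (birthyear 2002 and occupation "professional"), so they are best taken as illustrating the format rather than the admissible values.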