Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains 2,227 sentences, with a part-of-speech (POS) tag specified for a single word in the sentence. The data file is a tab-separated text file where each row (after the header row) is formatted as follows: sentence
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
POS Tagging Dataset
Original Data Source
CoNLL-2003
E. F. Tjong Kim Sang and F. De Meulder, Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 2003, pp. 142–147.
The Penn Treebank
M. P. Marcus, B. Santorini and M. A. Marcinkiewicz, Comput. Linguist., 1993, 19, 313–330.
Citation
BatteryDataExtractor: battery-aware text-mining software embedded with BERT models
Ensemble Tagger Training and Testing Set
This data includes two files: the training set used to create the SCANL Ensemble tagger [1] and the "unseen" testing set, which includes words from systems that are not available in the training set. These are derived from a prior dataset of Grammar Patterns, described in a different paper [2]. Within each of these CSV files you'll find several columns. We explain these columns below:
Type (only in training set) - Type (or return type) of the identifier to which the current word belongs.
Identifier - The full identifier from which the current word was split.
Grammar Pattern - The sequence of part-of-speech tags generated by splitting the identifier into words and annotating them with part-of-speech tags.
Word - The current word; derived by splitting the corresponding identifier.
SWUM annotation - The annotation that the SWUM POS tagger applied to a given word.
POSSE annotation - The annotation that the POSSE POS tagger applied to a given word.
Stanford annotation - The annotation that the Stanford POS tagger applied to a given word.
Flair annotation - The annotation that the FLAIR POS tagger applied to a given word.
Position - The position of a given word within its original identifier. For example, given an identifier: GetXMLReaderHandler, Get is in position 1, XML is in position 2, Reader is in position 3 and Handler is in position 4.
Identifier size (max position) - The length, in words, of the identifier of which the word was originally part.
Normalized position - We normalized the position metric described above such that the first word in the identifier is in position 1, all middle words are in position 2, and the last word is in position 3. For example, given an identifier: GetXMLReaderHandler, Get is in position 1, XML is in position 2, Reader is in position 2 and Handler is in position 3. The reason for this feature is to mitigate the sometimes-negative effect of very long identifiers [2].
Context - The dataset contains five categories of identifier name: function, parameter, attribute, declaration, and class. We provide the category to which the given identifier belongs as one of the features to allow the ensemble to learn patterns that are more pervasive for certain identifier types versus others. For example, function identifiers contain verbs at a higher rate than other types of identifiers [2].
Correct - The correct part-of-speech tag for the current word.
System - System in which the current word was found.
Identifier Code - Each identifier has a unique number. Each word that has the same number is a part of the same identifier. For example, you can concatenate each word with a code of 0 to recreate the original identifier.
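As an illustration, here is a minimal sketch that loads one of the CSV files with pandas and recreates an identifier from its Identifier Code; the file name and exact column headers are assumptions and should be adjusted to match the actual files.

```python
# Minimal sketch, assuming pandas and that the CSV headers match the column
# names described above; "ensemble_training_set.csv" is a placeholder file name.
import pandas as pd

df = pd.read_csv("ensemble_training_set.csv")

# All rows sharing an "Identifier Code" belong to the same identifier,
# so concatenating their words in order recreates it.
first = df[df["Identifier Code"] == 0].sort_values("Position")
print("".join(first["Word"]))      # the reassembled identifier
print(list(first["Correct"]))      # the gold POS tag for each of its words
```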
Context
The numbers under the context feature represent the following categories (number -> category): 1. attribute, 2. class, 3. declaration, 4. function, 5. parameter.
Best Features
We found [1] that, of the features described above, the best were: 1. SWUM, 2. POSSE, 3. Stanford, 4. Normalized position, 5. Context.
Tagset
The tagset that we use is a subset of the Penn Treebank tagset. Each of our annotations and an example can be found below. Further examples and definitions can be found in the paper [1].
Abbreviation | Expanded Form | Examples |
---|---|---|
N | noun | Disneyland, shoe, faucet, mother, bedroom |
DT | determiner | the, this, that, these, those, which |
CJ | conjunction | and, for, nor, but, or, yet, so |
P | preposition | behind, in front of, at, under, beside, above, beneath, despite |
NPL | noun plural | streets, cities, cars, people, lists, items, elements. |
NM | noun modifier (adjective) | red, cold, hot, scary, beautiful, happy, faster, small |
NM | noun modifier (noun adjunct) | employeeName, filePath, fontSize, userId |
V | verb | run, jump, drive, spin |
VM | verb modifier (adverb) | very, loudly, seriously, impatiently, badly |
PR | pronoun | she, he, her, him, it, we, us, they, them, I, me, you |
D | digit | 1, 2, 10, 4.12, 0xAF |
PRE | preamble (e.g., Hungarian) | Gimp, GLEW, GL, G, p_, m_, b_ |
Word of Caution
Flair and Stanford recognize a larger number of verb conjugations (e.g., VBZ, VBD) than the ensemble, POSSE, and SWUM. We left these conjugations in, in case someone wants to use them. If you are not interested in using these conjugations, you should normalize them to just V, in line with our tagset.
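For instance, a minimal normalization sketch; the set of Penn Treebank verb tags listed here is standard, but check it against the tags that actually occur in the files.

```python
# Collapse Penn Treebank verb conjugation tags (VBZ, VBD, ...) to the ensemble's V tag.
PTB_VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def normalize_tag(tag):
    return "V" if tag in PTB_VERB_TAGS else tag

print([normalize_tag(t) for t in ["NM", "VBZ", "N", "VBD"]])  # ['NM', 'V', 'N', 'V']
```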
Identifier Naming Structure Catalogue
We have put together a catalogue of identifier naming structures in source code. This catalogue explains in much more detail why this work is important, how we are using the ensemble tagger, and why the tagset looks the way it does.
The actual tagger implementation
You can find the tagger that was trained using this data here: https://github.com/SCANL/ensemble_tagger
Please cite the paper!
[1] C. D. Newman, M. J. Decker, R. S. AlSuhaibani, A. Peruma, S. Mohapatra, T. Vishoi, M. Zampieri, M. W. Mkaouer, T. J. Sheldon, and E. Hill, "An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags," in IEEE Transactions on Software Engineering, doi: 10.1109/TSE.2021.3098242.
[2] Christian D. Newman, Reem S. Alsuhaibani, Michael J. Decker, Anthony Peruma, Dishant Kaushik, Mohamed Wiem Mkaouer, Emily Hill, "On the generation, structure, and semantics of grammar patterns in source code identifiers," Journal of Systems and Software, 2020, 110740, ISSN 0164-1212, https://doi.org/10.1016/j.jss.2020.110740. (http://www.sciencedirect.com/science/article/pii/S0164121220301680)
Interested in our research? Check out https://scanl.org/
https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
nlpnet is a Python library for natural language processing tasks based on neural networks. Currently, it performs part-of-speech tagging and semantic role labeling for Portuguese and English. This implementation exposes an endpoint only for part-of-speech tagging in Portuguese.
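A minimal usage sketch, assuming nlpnet is installed and a Portuguese POS model has been downloaded to a local directory; the model path below is a placeholder.

```python
# Sketch of tagging Portuguese text with nlpnet; the model directory is a placeholder.
import nlpnet

tagger = nlpnet.POSTagger("/path/to/pos-pt-model/", language="pt")
print(tagger.tag("O rato roeu a roupa do rei de Roma."))
# -> one list of (token, tag) pairs per sentence
```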
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
This data is used with the Text Normalization Challenge - English Language.
In many NLP problems, part-of-speech tagging and WordNet can help.
At this stage, we provide POS tagging for the PLAIN class.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This data set can be used as a reference for tagging POS (parts of speech) in lexical analysis.
A part-of-speech (POS) tagger is a fundamental component in natural language processing (NLP) that assigns a grammatical category (part of speech) to each word in a given text.
Notebook: https://www.kaggle.com/code/gagnadrengur/brute-force-pos-tagging-a-sample-sentence
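For a quick illustration of the task itself, here is a short sketch using NLTK's off-the-shelf tagger rather than the notebook's brute-force approach.

```python
# Tag a sample sentence with NLTK's averaged perceptron tagger.
import nltk

# Resource names vary across NLTK versions, so try both old and new ones.
for res in ("punkt", "punkt_tab", "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(res, quiet=True)

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ...]
```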
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
POSH: The POS-tagged Headline corpus was created for the paper “Cross-Register Projection for Headline Part of Speech Tagging”, published at EMNLP 2021. This dataset contains headlines with gold-annotated POS tags.
The GSCh evaluation set is here: GSCh/gsc-headline-gold.test.conllu
The smaller evaluation set of GSC headlines sampled uniformly at random (described in section 2.3) is here: gold_unconstrained_headlines/unifrand_gsc.test.conllu
The POS-tagged NYT headlines described in section 2.3 are not shared directly, as this text was drawn from the New York Times Annotated Corpus (LDC2008T19) and is subject to license constraints. However, if you have access to LDC2008T19 and have untarred it, you can recover this evaluation set with:
TAG_PATH="./unifrand_onlynyt.tags.json" # mapping from NYT headline span to gold POS tag
python build_gold_nyt_headlines.py --nyt_dir /PATH/TO/ANNOTATED/NYT/CORPUS/ --tag_path ${TAG_PATH} --num_proc 4
Increase the argument to --num_proc to process more shards from the NYT corpus in parallel and reduce build time.
Under GSCproj we also share the GSCproj folds that we used to train and validate our models. These are not gold POS tags and are shared purely for reproducibility's sake.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We produce Sejong-style morphological analysis and part-of-speech tagging results, which have been the de facto standard for Korean language processing, using UDPipe (http://ufal.mff.cuni.cz/udpipe):
udpipe --tokenize --tag sjmorph.model input > output
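The same model can also be used from Python through the ufal.udpipe bindings; a minimal sketch follows, reusing the model file name from the command above (the input sentence is only an example).

```python
# Sketch using the ufal.udpipe bindings (pip install ufal.udpipe).
from ufal.udpipe import Model, Pipeline, ProcessingError

model = Model.load("sjmorph.model")          # model file referenced in the command above
assert model is not None, "model file not found"

pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")
error = ProcessingError()
print(pipeline.process("나는 학교에 간다.", error))
if error.occurred():
    print(error.message)
```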
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Annotates tokens and sentences in English text, adding part-of-speech and morphological root and affix to each token.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a comprehensive linguistic resource for the Amazigh (Berber) language, focusing on three key components: Part-of-Speech (POS) tagging, Named Entity Recognition (NER), and a parallel corpus of Amazigh sentences in Tifinagh script with their English translations. The dataset is designed to support research in natural language processing (NLP), computational linguistics, and machine translation for the Amazigh language, which is historically underrepresented in digital and linguistic resources.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This part-of-speech (POS) lexicon of Classical Tibetan was prepared in the course of the research project 'Tibetan in Digital Communication' (2012-2015), hosted at SOAS, University of London and funded by the UK's Arts and Humanities Research Council (grant code: AH/J00152X/1). The data for verbs comes from a digitized version of A Lexicon of Tibetan Verb Stems as Reported by the Grammatical Tradition (Munich: Bayerische Akademie der Wissenschaften, 2010) by Nathan W. Hill. Otherwise, the data comes from the manually part-of-speech-tagged training data of the corpus, plus a few lexical items added by hand to improve rule-based tagging.
funded by the UK's Arts and Humanities Research Council (grant code: AH/J00152X/1)
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Arabic POS Dialect
Dataset Summary
This dataset was created to support part of speech (POS) tagging in dialects of Arabic. It contains sets of 350 manually segmented and POS tagged tweets for each of four dialects: Egyptian, Levantine, Gulf, and Maghrebi.
Supported Tasks and Leaderboards
The dataset can be used to train a model for Arabic token segmentation and part of speech tagging in Arabic dialects. Success on this task is… See the full description on the dataset page: https://huggingface.co/datasets/QCRI/arabic_pos_dialect.
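To load it with the Hugging Face datasets library, a minimal sketch is shown below; the config name "egy" (Egyptian) is an assumption based on the four dialects listed above, so check the dataset page for the available configurations.

```python
# Sketch of loading one dialect subset; older script-based datasets may
# additionally require trust_remote_code=True depending on your datasets version.
from datasets import load_dataset

ds = load_dataset("QCRI/arabic_pos_dialect", "egy")
print(ds)                          # shows the available splits and features
split = next(iter(ds.values()))    # take whichever split is present
print(split[0])
```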
The data consists of about 1.5 million English tweets annotated for part of speech using Ritter's extension of the PTB tagset. The tweets are from 2012 and 2013, tokenized using the GATE tokenizer and tagged jointly with the CMU ARK tagger and Ritter's T-POS tagger. A tweet is added to the dataset only when both taggers' outputs are completely compatible over the whole tweet.
Part-of-speech tagger for the Slovene language, implemented using convolutional and LSTM neural networks. The tagger uses a character-level representation of sentences and has been trained on the ssj500k 2.1 corpus, http://hdl.handle.net/11356/1181.
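To make the architecture concrete, here is a heavily simplified PyTorch sketch of a character-level LSTM tagger; it only illustrates the general idea and is not the actual Slovene model, which also uses convolutional layers and different training code.

```python
# Toy character-level LSTM tagger: characters -> word vectors -> sentence context -> tags.
import torch
import torch.nn as nn

class CharLSTMTagger(nn.Module):
    def __init__(self, n_chars, n_tags, char_dim=32, hidden=64):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        # Character LSTM builds a vector for each word from its characters.
        self.char_lstm = nn.LSTM(char_dim, hidden, batch_first=True, bidirectional=True)
        # Word-level LSTM contextualizes the word vectors over the sentence.
        self.word_lstm = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, char_ids):                 # (words, max_word_len)
        emb = self.char_emb(char_ids)            # (words, max_word_len, char_dim)
        _, (h, _) = self.char_lstm(emb)          # h: (2, words, hidden)
        word_vecs = torch.cat([h[0], h[1]], dim=-1).unsqueeze(0)  # (1, words, 2*hidden)
        ctx, _ = self.word_lstm(word_vecs)       # (1, words, 2*hidden)
        return self.out(ctx).squeeze(0)          # (words, n_tags)

model = CharLSTMTagger(n_chars=100, n_tags=20)
fake_sentence = torch.randint(1, 100, (5, 12))   # 5 words, up to 12 characters each
print(model(fake_sentence).shape)                # torch.Size([5, 20])
```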
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
ELG API for a part-of-speech tagger for Icelandic (POS), by the Language and Voice Lab at Reykjavik University, which is licensed under the Apache License 2.0.
The Tokenizer PyPI package is pip-installed when the Docker image is built. This tool was developed by Miðeind and is licensed under the MIT license. The ELG API implementation imports this PyPI package.
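A minimal sketch of the imported package follows; the attribute names are taken from the Tokenizer package's documented interface and should be treated as assumptions.

```python
# Sketch: tokenizing Icelandic text with Miðeind's Tokenizer package (pip install tokenizer).
from tokenizer import tokenize

for tok in tokenize("Hér er stutt íslensk setning."):
    print(tok.kind, repr(tok.txt))
```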
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Universal Dependencies POS tagger for ca / Catalan, based on a simple window-based maximum entropy model.
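To illustrate what a window-based maximum entropy tagger looks like, here is a toy sketch using scikit-learn's logistic regression (a maximum entropy classifier); it is not the ELG service's actual features or code.

```python
# Toy window-based maximum entropy POS tagger over hand-made Catalan examples.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def window_features(tokens, i, size=2):
    """Features drawn from a fixed window of tokens around position i."""
    feats = {}
    for off in range(-size, size + 1):
        j = i + off
        feats[f"w[{off}]"] = tokens[j].lower() if 0 <= j < len(tokens) else "<pad>"
    feats["suffix3"] = tokens[i][-3:].lower()
    return feats

# Tiny toy training data; a real model trains on a UD treebank.
sentences = [
    (["la", "casa", "és", "gran"], ["DET", "NOUN", "AUX", "ADJ"]),
    (["el", "gos", "corre", "ràpid"], ["DET", "NOUN", "VERB", "ADV"]),
]

X = [window_features(toks, i) for toks, tags in sentences for i in range(len(toks))]
y = [tag for _, tags in sentences for tag in tags]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)

test = ["la", "casa", "corre"]
print(model.predict([window_features(test, i) for i in range(len(test))]))
```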
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Hachidaishu part-of-speech dataset
This dataset contains the part-of-speech information of the Hachidaishu, the first eight imperial anthologies of Japanese poetry.
Data offset
Example: #1 Kokinshu
10001 年/名/とし の/格助/の 内/名/うち に/格助/に 春/名/はる は/係助/は き/カ変-用:来:く/き に/完-用:ぬ:ぬ/に けり/過-終:けり:けり/けり 一とせ/名/ひととせ を/*助/を こそ/名/こぞ と/格助/と や/係助/や いは/ハ四-未:言ふ:いふ/いは ん/推-終体:む:む/む ことし/名/ことし と/格助/と や/係助/や いは/ハ四-未:言ふ:いふ/いは ん/推-終体:む:む/む
Each line is one poem: tokens are separated by spaces, and each token consists of POS elements separated by slashes.
The 1st column ("10001") contains two elements: the first digit is the anthology ID and the rest is the poem ID. Anthology IDs: 1 = Kokinshu, 2 = Gosenshu, 3 = Shuishu, 4 = Goshuishu, 5 = Kin'yoshu, 6 = Shikashu, 7 = Senzaishu, 8 = Shinkokinshu.
The poem ID is the same as in the database "Nijuichidaishu."
The 2nd and following columns contain the information for each token.
For nouns, particles, and other tokens without conjugation: text/POS/reading.
For verbs, adjectives, and other tokens with conjugation: text/POS:lemma-kanji:lemma-reading/reading.
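A small parsing sketch based on the format described above; the field handling is inferred from this description only, so edge cases may differ.

```python
# Parse one line of the Hachidaishu POS format into (anthology, poem, token) records.
line = "10001 年/名/とし の/格助/の 内/名/うち に/格助/に 春/名/はる"

poem_code, *tokens = line.split()
anthology_id, poem_id = int(poem_code[0]), poem_code[1:]

for tok in tokens:
    text, pos, reading = tok.split("/")
    # Conjugating tokens pack POS:lemma-kanji:lemma-reading into the middle field.
    parts = pos.split(":")
    pos_tag = parts[0]
    lemma_kanji = parts[1] if len(parts) == 3 else None
    lemma_reading = parts[2] if len(parts) == 3 else None
    print(anthology_id, poem_id, text, pos_tag, lemma_kanji, lemma_reading, reading)
```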
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The MULTEXT-East morphosyntactic lexicons have a simple structure, where each line is a lexical entry with three tab-separated fields: (1) the word-form, the inflected form of the word; (2) the lemma, the base-form of the word; (3) the MSD, the morphosyntactic description of the word-form, i.e., its fine-grained PoS tag, as defined in the MULTEXT-East morphosyntactic specifications.
This submission contains the freely available MULTEXT-East lexicons, while a separate submission (http://hdl.handle.net/11356/1042) gives those that are available only for non-commercial use.
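Reading a lexicon in the format described above is straightforward; a minimal sketch follows, with the file name as a placeholder for any of the downloaded lexicon files.

```python
# Read the three tab-separated fields of each MULTEXT-East lexicon entry.
with open("mte-lexicon.txt", encoding="utf-8") as f:
    for line in f:
        word_form, lemma, msd = line.rstrip("\n").split("\t")
        # e.g. index entries by word form to look up (lemma, MSD) analyses
        print(word_form, lemma, msd)
```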
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Mac-Morpho is a corpus of Brazilian Portuguese texts annotated with part-of-speech tags. Its first version was released in 2003 [1], and since then, two revisions have been made in order to improve the quality of the resource [2, 3]. The corpus is available for download split into train, development and test sections. These are 76%, 4% and 20% of the corpus total, respectively (the reason for the unusual numbers is that the corpus was first split into 80%/20% train/test, and then 5% of the train section was set aside for development). This split was used in [3], and new POS tagging research with Mac-Morpho is encouraged to follow it in order to make consistent comparisons possible.
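The split proportions can be checked with a couple of lines:

```python
# 80/20 train/test split, then 5% of the original train section set aside for development.
train_pct, test_pct = 80, 20
dev_pct = 0.05 * train_pct            # 4% of the whole corpus
train_pct -= dev_pct                  # 76% of the whole corpus
print(train_pct, dev_pct, test_pct)   # 76.0 4.0 20
```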
[1] Aluísio, S., Pelizzoni, J., Marchi, A.R., de Oliveira, L., Manenti, R., Marquiafável, V. 2003. An account of the challenge of tagging a reference corpus for Brazilian Portuguese. In: Proceedings of the 6th International Conference on Computational Processing of the Portuguese Language (PROPOR 2003).
[2] Fonseca, E.R., Rosa, J.L.G. 2013. Mac-Morpho revisited: Towards robust part-of-speech tagging. In: Proceedings of the 9th Brazilian Symposium in Information and Human Language Technology (STIL).
[3] Fonseca, E.R., Aluísio, Sandra Maria, Rosa, J.L.G. 2015. Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese. Journal of the Brazilian Computer Society.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This model for morphosyntactic annotation of non-standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the hr500k training corpus (http://hdl.handle.net/11356/1210), the ReLDI-NormTagNER-hr corpus (http://hdl.handle.net/11356/1241), the RAPUT corpus (https://www.aclweb.org/anthology/L16-1513/) and the ReLDI-NormTagNER-sr corpus (http://hdl.handle.net/11356/1240), using the CLARIN.SI-embed.hr word embeddings (http://hdl.handle.net/11356/1205). These corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The model produces simultaneously UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~95.11.
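A sketch of the kind of diacritic-stripping augmentation described above; the exact character set and procedure used when training the model are assumptions.

```python
# Strip Croatian/Serbian Latin diacritics to simulate non-standard text written without them.
STRIP_DIACRITICS = str.maketrans("čćžšđČĆŽŠĐ", "cczsdCCZSD")

def strip_diacritics(text):
    return text.translate(STRIP_DIACRITICS)

print(strip_diacritics("Već je kasno, moraš ići kući."))  # -> "Vec je kasno, moras ici kuci."
```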