Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains 2,227 sentences, with a part-of-speech (POS) tag specified for a single word in the sentence. The data file is a tab-separated text file where each row (after the header row) is formatted as follows: sentence
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
POS Tagging Dataset
Original Data Source
CoNLL-2003
E. F. Tjong Kim Sang and F. De Meulder, Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 2003, pp. 142–147.
The Penn Treebank
M. P. Marcus, B. Santorini and M. A. Marcinkiewicz, Comput. Linguist., 1993, 19, 313–330.
Citation
BatteryDataExtractor: battery-aware text-mining software embedded with BERT models
Ensemble Tagger Training and Testing Set
This data includes two files: the training set used to create the SCANL Ensemble tagger [1] and the "unseen" testing set, which includes words from systems that are not available in the training set. These are derived from a prior dataset of Grammar Patterns, described in a different paper [2]. Within each of these CSV files you'll find several columns. We explain these columns below:
Type (only in training set) - Type (or return type) of the identifier to which the current word belongs.
Identifier - The full identifier from which the current word was split.
Grammar Pattern - The sequence of part-of-speech tags generated by splitting the identifier into words and annotating them with part-of-speech tags.
Word - The current word; derived by splitting the corresponding identifier.
SWUM annotation - The annotation that the SWUM POS tagger applied to a given word.
POSSE annotation - The annotation that the POSSE POS tagger applied to a given word.
Stanford annotation - The annotation that the Stanford POS tagger applied to a given word.
Flair annotation - The annotation that the FLAIR POS tagger applied to a given word.
Position - The position of a given word within its original identifier. For example, given an identifier: GetXMLReaderHandler, Get is in position 1, XML is in position 2, Reader is in position 3 and Handler is in position 4.
Identifier size (max position) - The length, in words, of the identifier of which the word was originally part.
Normalized position - We normalized the position metric described above such that the first word in the identifier is in position 1, all middle words are in position 2, and the last word is in position 3. For example, given an identifier: GetXMLReaderHandler, Get is in position 1, XML is in position 2, Reader is in position 2 and Handler is in position 3. The reason for this feature is to mitigate the sometimes-negative effect of very long identifiers [2].
Context - The dataset contains five categories of identifier name: function, parameter, attribute, declaration, and class. We provide the category to which the given identifier belongs as one of the features to allow the ensemble to learn patterns that are more pervasive for certain identifier types versus others. For example, function identifiers contain verbs at a higher rate than other types of identifiers [2].
Correct - The correct part-of-speech tag for the current word.
System - System in which the current word was found.
Identifier Code - Each identifier has a unique number. Each word that has the same number is a part of the same identifier. For example, you can concatenate each word with a code of 0 to recreate the original identifier.
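As an illustration, here is a minimal sketch that loads one of the CSV files with pandas and recreates an identifier from its Identifier Code; the file name and exact column headers are assumptions and should be adjusted to match the actual files.

```python
# Minimal sketch, assuming pandas and that the CSV headers match the column
# names described above; "ensemble_training_set.csv" is a placeholder file name.
import pandas as pd

df = pd.read_csv("ensemble_training_set.csv")

# All rows sharing an "Identifier Code" belong to the same identifier,
# so concatenating their words in order recreates it.
first = df[df["Identifier Code"] == 0].sort_values("Position")
print("".join(first["Word"]))      # the reassembled identifier
print(list(first["Correct"]))      # the gold POS tag for each of its words
```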
Context
The numbers under the context feature represent the following categories (number -> category): 1. attribute, 2. class, 3. declaration, 4. function, 5. parameter.
Best Features
We found [1] that, of the features described above, the best were: 1. SWUM, 2. POSSE, 3. Stanford, 4. Normalized position, 5. Context.
Tagset
The tagset that we use is a subset of the Penn Treebank tagset. Each of our annotations and an example can be found below. Further examples and definitions can be found in the paper [1].
Abbreviation | Expanded Form | Examples |
---|---|---|
N | noun | Disneyland, shoe, faucet, mother, bedroom |
DT | determiner | the, this, that, these, those, which |
CJ | conjunction | and, for, nor, but, or, yet, so |
P | preposition | behind, in front of, at, under, beside, above, beneath, despite |
NPL | noun plural | streets, cities, cars, people, lists, items, elements. |
NM | noun modifier (adjective) | red, cold, hot, scary, beautiful, happy, faster, small |
NM | noun modifier (noun adjunct) | employeeName, filePath, fontSize, userId |
V | verb | run, jump, drive, spin |
VM | verb modifier (adverb) | very, loudly, seriously, impatiently, badly |
PR | pronoun | she, he, her, him, it, we, us, they, them, I, me, you |
D | digit | 1, 2, 10, 4.12, 0xAF |
PRE | preamble (e.g., Hungarian) | Gimp, GLEW, GL, G, p_, m_, b_ |
Word of Caution
Flair and Stanford recognize a larger number of verb conjugations (e.g., VBZ, VBD) than the ensemble, POSSE, and SWUM. We left these conjugations in, in case someone wants to use them. If you are not interested in using these conjugations, you should normalize them to just V, in line with our tagset.
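For instance, a minimal normalization sketch; the set of Penn Treebank verb tags listed here is standard, but check it against the tags that actually occur in the files.

```python
# Collapse Penn Treebank verb conjugation tags (VBZ, VBD, ...) to the ensemble's V tag.
PTB_VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def normalize_tag(tag):
    return "V" if tag in PTB_VERB_TAGS else tag

print([normalize_tag(t) for t in ["NM", "VBZ", "N", "VBD"]])  # ['NM', 'V', 'N', 'V']
```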
Identifier Naming Structure Catalogue
We have put together a catalogue of identifier naming structures in source code. This catalogue explains in much more detail why this work is important, how we are using the ensemble tagger, and why the tagset looks the way it does.
The actual tagger implementation
You can find the tagger that was trained using this data here: https://github.com/SCANL/ensemble_tagger
Please cite the paper!
[1] C. D. Newman, M. J. Decker, R. S. AlSuhaibani, A. Peruma, S. Mohapatra, T. Vishoi, M. Zampieri, M. W. Mkaouer, T. J. Sheldon, and E. Hill, "An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags," in IEEE Transactions on Software Engineering, doi: 10.1109/TSE.2021.3098242.
[2] Christian D. Newman, Reem S. Alsuhaibani, Michael J. Decker, Anthony Peruma, Dishant Kaushik, Mohamed Wiem Mkaouer, Emily Hill, "On the generation, structure, and semantics of grammar patterns in source code identifiers," Journal of Systems and Software, 2020, 110740, ISSN 0164-1212, https://doi.org/10.1016/j.jss.2020.110740. (http://www.sciencedirect.com/science/article/pii/S0164121220301680)
Interested in our research? Check out https://scanl.org/
https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
nlpnet is a Python library for natural language processing tasks based on neural networks. Currently, it performs part-of-speech tagging and semantic role labeling for Portuguese and English. This implementation exposes an endpoint only for part-of-speech tagging in Portuguese.
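A minimal usage sketch, assuming nlpnet is installed and a Portuguese POS model has been downloaded to a local directory; the model path below is a placeholder.

```python
# Sketch of tagging Portuguese text with nlpnet; the model directory is a placeholder.
import nlpnet

tagger = nlpnet.POSTagger("/path/to/pos-pt-model/", language="pt")
print(tagger.tag("O rato roeu a roupa do rei de Roma."))
# -> one list of (token, tag) pairs per sentence
```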
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
This data is used with the Text Normalization Challenge - English Language.
In many NLP problems, part-of-speech tagging and WordNet can help.
At this stage, we provide POS tagging for the PLAIN class.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This data set can be used as a reference for tagging POS (parts of speech) in lexical analysis.
A part-of-speech (POS) tagger is a fundamental component in natural language processing (NLP) that assigns a grammatical category (part of speech) to each word in a given text.
Notebook: https://www.kaggle.com/code/gagnadrengur/brute-force-pos-tagging-a-sample-sentence
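For a quick illustration of the task itself, here is a short sketch using NLTK's off-the-shelf tagger rather than the notebook's brute-force approach.

```python
# Tag a sample sentence with NLTK's averaged perceptron tagger.
import nltk

# Resource names vary across NLTK versions, so try both old and new ones.
for res in ("punkt", "punkt_tab", "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(res, quiet=True)

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ...]
```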
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
POSH: The POS-tagged Headline corpus was created for the paper “Cross-Register Projection for Headline Part of Speech Tagging”, published at EMNLP 2021. This dataset contains headlines with gold-annotated POS tags.
The GSCh evaluation set is here: GSCh/gsc-headline-gold.test.conllu
The smaller evaluation set of GSC headlines sampled uniformly at random (described in section 2.3) is here: gold_unconstrained_headlines/unifrand_gsc.test.conllu
The POS-tagged NYT headlines described in section 2.3 are not shared directly, as this text was drawn from the New York Times Annotated Corpus (LDC2008T19) and is subject to license constraints. However, if you have access to LDC2008T19 and have untarred it, you can recover this evaluation set with:
TAG_PATH="./unifrand_onlynyt.tags.json" # mapping from NYT headline span to gold POS tag
python build_gold_nyt_headlines.py --nyt_dir /PATH/TO/ANNOTATED/NYT/CORPUS/ --tag_path ${TAG_PATH} --num_proc 4
Increase the argument to --num_proc to process more shards from the NYT corpus in parallel and reduce build time.
Under GSCproj we also share the GSCproj folds that we used to train and validate our models. These are not gold POS tags and are shared purely for reproducibility's sake.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We produce Sejong-style morphological analysis and part-of-speech tagging results, which have been the de facto standard for Korean language processing, using UDPipe (http://ufal.mff.cuni.cz/udpipe):
udpipe --tokenize --tag sjmorph.model input > output
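The same model can also be used from Python through the ufal.udpipe bindings; a minimal sketch follows, reusing the model file name from the command above (the input sentence is only an example).

```python
# Sketch using the ufal.udpipe bindings (pip install ufal.udpipe).
from ufal.udpipe import Model, Pipeline, ProcessingError

model = Model.load("sjmorph.model")          # model file referenced in the command above
assert model is not None, "model file not found"

pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")
error = ProcessingError()
print(pipeline.process("나는 학교에 간다.", error))
if error.occurred():
    print(error.message)
```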
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Annotates tokens and sentences in English text, adding part-of-speech and morphological root and affix to each token.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a comprehensive linguistic resource for the Amazigh (Berber) language, focusing on three key components: Part-of-Speech (POS) tagging, Named Entity Recognition (NER), and a parallel corpus of Amazigh sentences in Tifinagh script with their English translations. The dataset is designed to support research in natural language processing (NLP), computational linguistics, and machine translation for the Amazigh language, which is historically underrepresented in digital and linguistic resources.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This part-of-speech (POS) lexicon of Classical Tibetan was prepared in the course of the research project 'Tibetan in Digital Communication' (2012-2015), hosted at SOAS, University of London and funded by the UK's Arts and Humanities Research Council (grant code: AH/J00152X/1). The data for verbs comes from a digitized version of A Lexicon of Tibetan Verb Stems as Reported by the Grammatical Tradition (Munich: Bayerische Akademie der Wissenschaften, 2010) by Nathan W. Hill. Otherwise, the data comes from the manually part-of-speech-tagged training data of the corpus, plus a few lexical items added by hand to improve rule-based tagging.
funded by the UK's Arts and Humanities Research Council (grant code: AH/J00152X/1)
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Arabic POS Dialect
Dataset Summary
This dataset was created to support part of speech (POS) tagging in dialects of Arabic. It contains sets of 350 manually segmented and POS tagged tweets for each of four dialects: Egyptian, Levantine, Gulf, and Maghrebi.
Supported Tasks and Leaderboards
The dataset can be used to train a model for Arabic token segmentation and part of speech tagging in Arabic dialects. Success on this task is… See the full description on the dataset page: https://huggingface.co/datasets/QCRI/arabic_pos_dialect.
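To load it with the Hugging Face datasets library, a minimal sketch is shown below; the config name "egy" (Egyptian) is an assumption based on the four dialects listed above, so check the dataset page for the available configurations.

```python
# Sketch of loading one dialect subset; older script-based datasets may
# additionally require trust_remote_code=True depending on your datasets version.
from datasets import load_dataset

ds = load_dataset("QCRI/arabic_pos_dialect", "egy")
print(ds)                          # shows the available splits and features
split = next(iter(ds.values()))    # take whichever split is present
print(split[0])
```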
The data consists of about 1.5 million English tweets annotated for part of speech using Ritter's extension of the PTB tagset. The tweets are from 2012 and 2013, tokenized using the GATE tokenizer and tagged jointly with the CMU ARK tagger and Ritter's T-POS tagger. A tweet is added to the dataset only when both taggers' outputs are completely compatible over the whole tweet.
Part-of-speech tagger for the Slovene language, implemented using convolutional and LSTM neural networks. The tagger uses a character-level representation of sentences and has been trained on the ssj500k 2.1 corpus, http://hdl.handle.net/11356/1181.
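To make the architecture concrete, here is a heavily simplified PyTorch sketch of a character-level LSTM tagger; it only illustrates the general idea and is not the actual Slovene model, which also uses convolutional layers and different training code.

```python
# Toy character-level LSTM tagger: characters -> word vectors -> sentence context -> tags.
import torch
import torch.nn as nn

class CharLSTMTagger(nn.Module):
    def __init__(self, n_chars, n_tags, char_dim=32, hidden=64):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        # Character LSTM builds a vector for each word from its characters.
        self.char_lstm = nn.LSTM(char_dim, hidden, batch_first=True, bidirectional=True)
        # Word-level LSTM contextualizes the word vectors over the sentence.
        self.word_lstm = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, char_ids):                 # (words, max_word_len)
        emb = self.char_emb(char_ids)            # (words, max_word_len, char_dim)
        _, (h, _) = self.char_lstm(emb)          # h: (2, words, hidden)
        word_vecs = torch.cat([h[0], h[1]], dim=-1).unsqueeze(0)  # (1, words, 2*hidden)
        ctx, _ = self.word_lstm(word_vecs)       # (1, words, 2*hidden)
        return self.out(ctx).squeeze(0)          # (words, n_tags)

model = CharLSTMTagger(n_chars=100, n_tags=20)
fake_sentence = torch.randint(1, 100, (5, 12))   # 5 words, up to 12 characters each
print(model(fake_sentence).shape)                # torch.Size([5, 20])
```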
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
ELG API for a part-of-speech tagger for Icelandic (POS), by the Language and Voice Lab at Reykjavik University, which is licensed under the Apache License 2.0.
The Tokenizer PyPI package is pip-installed when the Docker image is built. This tool was developed by Miðeind and is licensed under the MIT license. The ELG API implementation imports this PyPI package.
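A minimal sketch of the imported package follows; the attribute names are taken from the Tokenizer package's documented interface and should be treated as assumptions.

```python
# Sketch: tokenizing Icelandic text with Miðeind's Tokenizer package (pip install tokenizer).
from tokenizer import tokenize

for tok in tokenize("Hér er stutt íslensk setning."):
    print(tok.kind, repr(tok.txt))
```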
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Universal Dependencies POS tagger for ca / Catalan, based on a simple window-based maximum entropy model.
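To illustrate what a window-based maximum entropy tagger looks like, here is a toy sketch using scikit-learn's logistic regression (a maximum entropy classifier); it is not the ELG service's actual features or code.

```python
# Toy window-based maximum entropy POS tagger over hand-made Catalan examples.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def window_features(tokens, i, size=2):
    """Features drawn from a fixed window of tokens around position i."""
    feats = {}
    for off in range(-size, size + 1):
        j = i + off
        feats[f"w[{off}]"] = tokens[j].lower() if 0 <= j < len(tokens) else "<pad>"
    feats["suffix3"] = tokens[i][-3:].lower()
    return feats

# Tiny toy training data; a real model trains on a UD treebank.
sentences = [
    (["la", "casa", "és", "gran"], ["DET", "NOUN", "AUX", "ADJ"]),
    (["el", "gos", "corre", "ràpid"], ["DET", "NOUN", "VERB", "ADV"]),
]

X = [window_features(toks, i) for toks, tags in sentences for i in range(len(toks))]
y = [tag for _, tags in sentences for tag in tags]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)

test = ["la", "casa", "corre"]
print(model.predict([window_features(test, i) for i in range(len(test))]))
```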
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Hachidaishu part-of-speech dataset
This dataset contains the part-of-speech information of the Hachidaishu, the first eight imperial anthologies of Japanese poetry.
Data offset
Example: #1 Kokinshu
10001 年/名/とし の/格助/の 内/名/うち に/格助/に 春/名/はる は/係助/は き/カ変-用:来:く/き に/完-用:ぬ:ぬ/に けり/過-終:けり:けり/けり 一とせ/名/ひととせ を/*助/を こそ/名/こぞ と/格助/と や/係助/や いは/ハ四-未:言ふ:いふ/いは ん/推-終体:む:む/む ことし/名/ことし と/格助/と や/係助/や いは/ハ四-未:言ふ:いふ/いは ん/推-終体:む:む/む
Each line is one poem: tokens are separated by spaces, and each token consists of POS elements separated by slashes.
The 1st column ("10001") contains two elements: the first digit is the anthology ID and the rest is the poem ID. Anthology IDs: 1 = Kokinshu, 2 = Gosenshu, 3 = Shuishu, 4 = Goshuishu, 5 = Kin'yoshu, 6 = Shikashu, 7 = Senzaishu, 8 = Shinkokinshu.
The poem ID is the same as in the database "Nijuichidaishu."
The 2nd and following columns contain the information for each token.
For nouns, particles, and other tokens without conjugation: text/POS/reading.
For verbs, adjectives, and other tokens with conjugation: text/POS:lemma-kanji:lemma-reading/reading.
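A small parsing sketch based on the format described above; the field handling is inferred from this description only, so edge cases may differ.

```python
# Parse one line of the Hachidaishu POS format into (anthology, poem, token) records.
line = "10001 年/名/とし の/格助/の 内/名/うち に/格助/に 春/名/はる"

poem_code, *tokens = line.split()
anthology_id, poem_id = int(poem_code[0]), poem_code[1:]

for tok in tokens:
    text, pos, reading = tok.split("/")
    # Conjugating tokens pack POS:lemma-kanji:lemma-reading into the middle field.
    parts = pos.split(":")
    pos_tag = parts[0]
    lemma_kanji = parts[1] if len(parts) == 3 else None
    lemma_reading = parts[2] if len(parts) == 3 else None
    print(anthology_id, poem_id, text, pos_tag, lemma_kanji, lemma_reading, reading)
```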
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The MULTEXT-East morphosyntactic lexicons have a simple structure, where each line is a lexical entry with three tab-separated fields: (1) the word-form, the inflected form of the word; (2) the lemma, the base-form of the word; (3) the MSD, the morphosyntactic description of the word-form, i.e., its fine-grained PoS tag, as defined in the MULTEXT-East morphosyntactic specifications.
This submission contains the freely available MULTEXT-East lexicons, while a separate submission (http://hdl.handle.net/11356/1042) gives those that are available only for non-commercial use.
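Reading a lexicon in the format described above is straightforward; a minimal sketch follows, with the file name as a placeholder for any of the downloaded lexicon files.

```python
# Read the three tab-separated fields of each MULTEXT-East lexicon entry.
with open("mte-lexicon.txt", encoding="utf-8") as f:
    for line in f:
        word_form, lemma, msd = line.rstrip("\n").split("\t")
        # e.g. index entries by word form to look up (lemma, MSD) analyses
        print(word_form, lemma, msd)
```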
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Mac-Morpho is a corpus of Brazilian Portuguese texts annotated with part-of-speech tags. Its first version was released in 2003 [1], and since then, two revisions have been made in order to improve the quality of the resource [2, 3]. The corpus is available for download split into train, development and test sections. These are 76%, 4% and 20% of the corpus total, respectively (the reason for the unusual numbers is that the corpus was first split into 80%/20% train/test, and then 5% of the train section was set aside for development). This split was used in [3], and new POS tagging research with Mac-Morpho is encouraged to follow it in order to make consistent comparisons possible.
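The split proportions can be checked with a couple of lines:

```python
# 80/20 train/test split, then 5% of the original train section set aside for development.
train_pct, test_pct = 80, 20
dev_pct = 0.05 * train_pct            # 4% of the whole corpus
train_pct -= dev_pct                  # 76% of the whole corpus
print(train_pct, dev_pct, test_pct)   # 76.0 4.0 20
```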
[1] Aluísio, S., Pelizzoni, J., Marchi, A.R., de Oliveira, L., Manenti, R., Marquiafável, V. 2003. An account of the challenge of tagging a reference corpus for Brazilian Portuguese. In: Proceedings of the 6th International Conference on Computational Processing of the Portuguese Language (PROPOR 2003).
[2] Fonseca, E.R., Rosa, J.L.G. 2013. Mac-Morpho revisited: Towards robust part-of-speech tagging. In: Proceedings of the 9th Brazilian Symposium in Information and Human Language Technology (STIL).
[3] Fonseca, E.R., Aluísio, Sandra Maria, Rosa, J.L.G. 2015. Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese. Journal of the Brazilian Computer Society.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This model for morphosyntactic annotation of non-standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the hr500k training corpus (http://hdl.handle.net/11356/1210), the ReLDI-NormTagNER-hr corpus (http://hdl.handle.net/11356/1241), the RAPUT corpus (https://www.aclweb.org/anthology/L16-1513/) and the ReLDI-NormTagNER-sr corpus (http://hdl.handle.net/11356/1240), using the CLARIN.SI-embed.hr word embeddings (http://hdl.handle.net/11356/1205). These corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The model produces simultaneously UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~95.11.
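A sketch of the kind of diacritic-stripping augmentation described above; the exact character set and procedure used when training the model are assumptions.

```python
# Strip Croatian/Serbian Latin diacritics to simulate non-standard text written without them.
STRIP_DIACRITICS = str.maketrans("čćžšđČĆŽŠĐ", "cczsdCCZSD")

def strip_diacritics(text):
    return text.translate(STRIP_DIACRITICS)

print(strip_diacritics("Već je kasno, moraš ići kući."))  # -> "Vec je kasno, moras ici kuci."
```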