100+ datasets found
  1. Data from: A challenging data set for evaluating part-of-speech taggers

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 24, 2024
    Cite
    Wahde, Mattias (2024). A challenging data set for evaluating part-of-speech taggers [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10299107
    Explore at:
    Dataset updated
    Feb 24, 2024
    Dataset provided by
    Wahde, Mattias
    Suvanto, Minerva
    Della Vedova, Marco L.
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set contains 2,227 sentences, with a part-of-speech (POS) tag specified for a single word in the sentence. The data file is a tab-separated text file where each row (after the header row) is formatted as follows: sentence

  2. pos_tagging

    • huggingface.co
    Updated Feb 2, 2024
    Cite
    Shu Huang (2024). pos_tagging [Dataset]. https://huggingface.co/datasets/batterydata/pos_tagging
    Explore at:
    Dataset updated
    Feb 2, 2024
    Authors
    Shu Huang
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    POS Tagging Dataset

    Original Data Source

    Conll2003

    E. F. Tjong Kim Sang and F. De Meulder, Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 2003, pp. 142–147.

    The Penn Treebank

    M. P. Marcus, B. Santorini and M. A. Marcinkiewicz, Comput. Linguist., 1993, 19, 313–330.

    Citation

    BatteryDataExtractor: battery-aware text-mining software embedded with BERT models

  3. Source Code Tagger Training Set Dataset

    • paperswithcode.com
    Updated Aug 31, 2021
    Cite
    Christian D. Newman; Michael J. Decker; Reem S. AlSuhaibani; Anthony Peruma; Satyajit Mohapatra; Tejal Vishnoi; Marcos Zampieri; Mohamed W. Mkaouer; Timothy J. Sheldon; Emily Hill (2021). Source Code Tagger Training Set Dataset [Dataset]. https://paperswithcode.com/dataset/source-code-tagger-training-set
    Explore at:
    Dataset updated
    Aug 31, 2021
    Authors
    Christian D. Newman; Michael J. Decker; Reem S. AlSuhaibani; Anthony Peruma; Satyajit Mohapatra; Tejal Vishnoi; Marcos Zampieri; Mohamed W. Mkaouer; Timothy J. Sheldon; Emily Hill
    Description

    Ensemble Tagger Training and Testing Set

    This data includes two files: the training set used to create the SCANL Ensemble tagger [1] and the "unseen" testing set, which includes words from systems that are not available in the training set. Both are derived from a prior dataset of Grammar Patterns, described in a different paper [2]. Each of these CSV files contains several columns, which we explain below:

    Type (only in training set) - Type (or return type) of the identifier to which current word belongs.

    Identifier - The full identifier from which the current word was split.

    Grammar Pattern - The sequence of part-of-speech tags generated by splitting the identifier into words and annotating with part-of-speech tags.

    Word - The current word; derived by splitting the corresponding identifier.

    SWUM annotation - The annotation that the SWUM POS tagger applied to a given word.

    POSSE annotation - The annotation that the POSSE POS tagger applied to a given word.

    Stanford annotation - The annotation that the Stanford POS tagger applied to a given word.

    Flair annotation - The annotation that the FLAIR POS tagger applied to a given word.

    Position - The position of a given word within its original identifier. For example, given an identifier: GetXMLReaderHandler, Get is in position 1, XML is in position 2, Reader is in position 3 and Handler is in position 4.

    Identifier size (max position) - The length, in words, of the identifier of which the word was originally part.

    Normalized position - We normalized the position metric described above such that the first word in the identifier is in position 1, all middle words are in position 2, and the last word is in position 3. For example, given an identifier: GetXMLReaderHandler, Get is in position 1, XML is in position 2, Reader is in position 2 and Handler is in position 3. The reason for this feature is to mitigate the sometimes-negative effect of very long identifiers [2].
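    The normalized-position rule above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `normalized_position` is hypothetical, and treating a single-word identifier as position 1 is an assumption the description does not cover.

```python
def normalized_position(position: int, size: int) -> int:
    """Map a 1-based word position to 1 (first), 2 (middle), or 3 (last)."""
    if not 1 <= position <= size:
        raise ValueError("position must be between 1 and size")
    if position == 1:
        return 1  # first word (assumption: also used for one-word identifiers)
    if position == size:
        return 3  # last word
    return 2  # every middle word collapses to position 2


# Words of GetXMLReaderHandler, already split as in the description
words = ["Get", "XML", "Reader", "Handler"]
normalized = [normalized_position(i, len(words)) for i in range(1, len(words) + 1)]
print(normalized)  # [1, 2, 2, 3]
```

    Collapsing all middle words into one value keeps the feature bounded regardless of identifier length, which matches the stated motivation for it.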

    Context - The dataset contains five categories of identifier name: function, parameter, attribute, declaration, and class. We provide the category to which the given identifier belongs as one of the features to allow the ensemble to learn patterns that are more pervasive for certain identifier types versus others. For example, function identifiers contain verbs at a higher rate than other types of identifiers [2].

    Correct - The correct part-of-speech tag for the current word.

    System - System in which the current word was found.

    Identifier Code - Each identifier has a unique number. Each word that has the same number is a part of the same identifier. For example, you can concatenate each word with a code of 0 to recreate the original identifier.
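    As a rough sketch of how the identifier code can be used, the snippet below groups words by code and joins them in position order. The row layout `(identifier_code, position, word)` and the sample rows are hypothetical; the real CSV column names should be checked before adapting this.

```python
from collections import defaultdict


def rebuild_identifiers(rows):
    """Group (identifier_code, position, word) rows and join words by position."""
    groups = defaultdict(list)
    for code, position, word in rows:
        groups[code].append((position, word))
    # Sorting the (position, word) pairs restores the original word order.
    return {
        code: "".join(word for _, word in sorted(parts))
        for code, parts in groups.items()
    }


# Hypothetical rows, deliberately out of order
rows = [
    (0, 1, "Get"), (0, 2, "XML"), (1, 1, "file"),
    (0, 3, "Reader"), (1, 2, "Path"), (0, 4, "Handler"),
]
identifiers = rebuild_identifiers(rows)
print(identifiers[0])  # GetXMLReaderHandler
```

    Plain concatenation will not restore separators such as underscores if the splitter dropped them, so the Identifier column remains the reliable source for the original spelling.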

    Context

    The numbers under the context feature represent the following categories (number -> category):

    1. attribute
    2. class
    3. declaration
    4. function
    5. parameter

    Best Features

    We found [1] that the best features, of the features described above, were:

    1. SWUM
    2. POSSE
    3. Stanford
    4. Normalized position
    5. Context

    Tagset

    The tagset that we use is a subset of the Penn Treebank tagset. Each of our annotations and an example can be found below. Further examples and definitions can be found in the paper [1].

    Abbreviation  Expanded Form                            Examples
    N             noun                                     Disneyland, shoe, faucet, mother, bedroom
    DT            determiner                               the, this, that, these, those, which
    CJ            conjunction                              and, for, nor, but, or, yet, so
    P             preposition                              behind, in front of, at, under, beside, above, beneath, despite
    NPL           noun plural                              streets, cities, cars, people, lists, items, elements
    NM            noun modifier (adjective)                red, cold, hot, scary, beautiful, happy, faster, small
    NM            noun modifier (noun-adjunct italicized)  employeeName, filePath, fontSize, userId
    V             verb                                     run, jump, drive, spin
    VM            verb modifier (adverb)                   very, loudly, seriously, impatiently, badly
    PR            pronoun                                  she, he, her, him, it, we, us, they, them, I, me, you
    D             digit                                    1, 2, 10, 4.12, 0xAF
    PRE           preamble (e.g., Hungarian)               Gimp, GLEW, GL, G, p_, m_, b_

    Word of Caution

    Flair and Stanford recognize a larger number of verb conjugations (e.g., VBZ, VBD) than the ensemble, POSSE, and SWUM. We left these conjugations in just in case someone wants to use them. If you are not interested in using these conjugations, you should normalize them to just V, in line with our tagset.
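    One possible way to apply the normalization suggested above is to collapse every Penn Treebank verb tag into V. This sketch assumes the Flair and Stanford columns use the standard Penn Treebank verb tags (VB, VBD, VBG, VBN, VBP, VBZ), which all start with "VB"; the function name is hypothetical.

```python
def normalize_verb_tag(tag: str) -> str:
    """Collapse Penn Treebank verb tags (VB, VBD, VBG, VBN, VBP, VBZ) to V."""
    # All Penn Treebank verb tags share the "VB" prefix; other tags pass through.
    return "V" if tag.startswith("VB") else tag


tags = ["VBZ", "N", "VBD", "NM", "V"]
normalized_tags = [normalize_verb_tag(t) for t in tags]
print(normalized_tags)  # ['V', 'N', 'V', 'NM', 'V']
```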

    Identifier Naming Structure Catalogue

    We have put together a catalogue of identifier naming structures in source code. This catalogue explains a lot more about why this work is important, how we are using the ensemble tagger and why the tagset looks the way it does.

    The actual tagger implementation

    You can find the tagger that was trained using this data here: https://github.com/SCANL/ensemble_tagger

    Please cite the paper!

    [1] C. D. Newman, M. J. Decker, R. S. AlSuhaibani, A. Peruma, S. Mohapatra, T. Vishnoi, M. Zampieri, M. W. Mkaouer, T. J. Sheldon, and E. Hill, "An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags," in IEEE Transactions on Software Engineering, doi: 10.1109/TSE.2021.3098242.

    [2] Christian D. Newman, Reem S. Alsuhaibani, Michael J. Decker, Anthony Peruma, Dishant Kaushik, Mohamed Wiem Mkaouer, Emily Hill, "On the generation, structure, and semantics of grammar patterns in source code identifiers," Journal of Systems and Software, 2020, 110740, ISSN 0164-1212, https://doi.org/10.1016/j.jss.2020.110740. (http://www.sciencedirect.com/science/article/pii/S0164121220301680)

    Interested in our research? Check out https://scanl.org/

  4. NLPnet - Part of Speech Tagging in Portuguese

    • live.european-language-grid.eu
    Updated Apr 7, 2022
    Cite
    (2022). NLPnet - Part of Speech Tagging in Portuguese [Dataset]. https://live.european-language-grid.eu/catalogue/tool-service/20210
    Explore at:
    Dataset updated
    Apr 7, 2022
    License

    https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Description

    nlpnet is a Python library for Natural Language Processing tasks based on neural networks. Currently, it performs part-of-speech tagging and semantic role labeling for Portuguese and English. This implementation has an endpoint just for the part-of-speech tagging in Portuguese.

  5. En Part-Of-Speech tags

    • kaggle.com
    Updated Sep 25, 2017
    Cite
    Marouane Benmeida (2017). En Part-Of-Speech tags [Dataset]. https://www.kaggle.com/atmarouane/en-partofspeech-tags/tasks
    Explore at:
    Dataset updated
    Sep 25, 2017
    Dataset provided by
    Kaggle
    Authors
    Marouane Benmeida
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Context

    This data is used with Text Normalization Challenge - English Language.

    In many NLP problems, Part-Of-Speech tagging and WordNet can help.

    At this stage we provide POS tagging for PLAIN class.

    Content

    • sentence_id
    • token_id
    • pos (the meaning of POS tags can be found here)
  6. Parts of Speech Tags

    • kaggle.com
    Updated Jul 25, 2023
    Cite
    Gagnadrengur (2023). Parts of Speech Tags [Dataset]. http://doi.org/10.34740/kaggle/dsv/6191528
    Explore at:
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gagnadrengur
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This data set can be used as a reference for tagging POS (parts of speech) in lexical analysis.

    A part-of-speech (POS) tagger is a fundamental component in natural language processing (NLP): it assigns a grammatical category (part of speech) to each word in a given text.

    Notebook: https://www.kaggle.com/code/gagnadrengur/brute-force-pos-tagging-a-sample-sentence

  7. Data from: Cross-Register Projection for Headline Part of Speech Tagging

    • data.niaid.nih.gov
    Updated Sep 9, 2021
    Cite
    Malioutov, Igor (2021). Cross-Register Projection for Headline Part of Speech Tagging [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5495667
    Explore at:
    Dataset updated
    Sep 9, 2021
    Dataset provided by
    Benton, Adrian
    Malioutov, Igor
    Li, Hangyang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    POSH: The POS-tagged Headline corpus was created for the paper “Cross-Register Projection for Headline Part of Speech Tagging” published in EMNLP 2021. This dataset contains headlines with gold annotated POS tags.

    The GSCh evaluation set is here: GSCh/gsc-headline-gold.test.conllu

    The smaller evaluation set of GSC headlines sampled uniformly at random (described in section 2.3) is here: gold_unconstrained_headlines/unifrand_gsc.test.conllu

    The POS-tagged NYT headlines described in section 2.3 are not shared directly as this text was drawn from the New York Times Annotated Corpus (LDC2008T19), and subject to license constraints. However, if you have access to and have untarred LDC2008T19, you can recover this evaluation set with:

    TAG_PATH="./unifrand_onlynyt.tags.json" # mapping from NYT headline span to gold POS tag
    
    
    python build_gold_nyt_headlines.py --nyt_dir /PATH/TO/ANNOTATED/NYT/CORPUS/ --tag_path ${TAG_PATH} --num_proc 4
    

    Increase the argument to --num_procs to process more shards from the NYT corpus in parallel and reduce build time.

    Under GSCproj we also share the GSCproj folds that we used to train and validate our models. These are not gold POS tags, and are shared purely for reproducibility's sake.

  8. Data from: A New Annotation Scheme for the Sejong Part-of-speech Tagged...

    • zenodo.org
    Updated Oct 9, 2020
    Cite
    Jungyeul Park; Francis Tyers (2020). A New Annotation Scheme for the Sejong Part-of-speech Tagged Corpus [Dataset]. http://doi.org/10.5281/zenodo.3236528
    Explore at:
    Available download formats: bin
    Dataset updated
    Oct 9, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jungyeul Park; Francis Tyers
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Using UDPipe (http://ufal.mff.cuni.cz/udpipe), we produce Sejong-style morphological analysis and part-of-speech tagging results, which have been the de facto standard for Korean language processing:

    udpipe --tokenize --tag sjmorph.model input > output

    see https://github.com/jungyeul/sjmorph

  9. GATE: English Part-of-Speech and Morphology Analyzer

    • live.european-language-grid.eu
    Updated Feb 24, 2020
    Cite
    (2020). GATE: English Part-of-Speech and Morphology Analyzer [Dataset]. https://live.european-language-grid.eu/catalogue/tool-service/517
    Explore at:
    Dataset updated
    Feb 24, 2020
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Annotates tokens and sentences in English text, adding part-of-speech and morphological root and affix to each token.

  10. Amazigh Linguistic Dataset: Part-of-Speech Tagging, Named Entity...

    • data.mendeley.com
    Updated Mar 4, 2025
    Cite
    Otman MAAROUF (2025). Amazigh Linguistic Dataset: Part-of-Speech Tagging, Named Entity Recognition, and Parallel Corpus (Tifinagh-English) [Dataset]. http://doi.org/10.17632/vdgfhfnr26.1
    Explore at:
    Dataset updated
    Mar 4, 2025
    Authors
    Otman MAAROUF
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is a comprehensive linguistic resource for the Amazigh (Berber) language, focusing on three key components: Part-of-Speech (POS) tagging, Named Entity Recognition (NER), and a parallel corpus of Amazigh sentences in Tifinagh script with their English translations. The dataset is designed to support research in natural language processing (NLP), computational linguistics, and machine translation for the Amazigh language, which is historically underrepresented in digital and linguistic resources.

  11. Data from: A part-of-speech (POS) lexicon of Classical Tibetan for NLP

    • live.european-language-grid.eu
    • zenodo.org
    Updated Jun 15, 2021
    Cite
    (2021). A part-of-speech (POS) lexicon of Classical Tibetan for NLP [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/1333
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 15, 2021
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Tibet
    Description

    This part-of-speech (POS) lexicon of Classical Tibetan was prepared in the course of the research project 'Tibetan in Digital Communication' (2012-2015), hosted at SOAS, University of London and funded by the UK's Arts and Humanities Research Council (grant code: AH/J00152X/1). The data for verbs comes from a digitized version of A Lexicon of Tibetan Verb Stems as Reported by the Grammatical Tradition (Munich: Bayerische Akademie der Wissenschaften, 2010) by Nathan W. Hill. Otherwise, data comes from the manually part-of-speech tagged training data produced by the corpus and a few lexical items specifically added by hand to improve rule-based tagging.

  12. arabic_pos_dialect

    • huggingface.co
    Updated Nov 20, 2021
    + more versions
    Cite
    arabic_pos_dialect [Dataset]. https://huggingface.co/datasets/QCRI/arabic_pos_dialect
    Explore at:
    Dataset updated
    Nov 20, 2021
    Dataset authored and provided by
    Arabic Language Technologies, Qatar Computing Research Institute
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for Arabic POS Dialect

      Dataset Summary
    

    This dataset was created to support part of speech (POS) tagging in dialects of Arabic. It contains sets of 350 manually segmented and POS tagged tweets for each of four dialects: Egyptian, Levantine, Gulf, and Maghrebi.

      Supported Tasks and Leaderboards
    

    The dataset can be used to train a model for Arabic token segmentation and part of speech tagging in Arabic dialects. Success on this task is… See the full description on the dataset page: https://huggingface.co/datasets/QCRI/arabic_pos_dialect.

  13. Twitter PoS VCB Dataset

    • paperswithcode.com
    Updated Aug 31, 2013
    + more versions
    Cite
    Twitter PoS VCB Dataset [Dataset]. https://paperswithcode.com/dataset/twitter-pos-vcb
    Explore at:
    Dataset updated
    Aug 31, 2013
    Authors
    Leon Derczynski; Alan Ritter; Sam Clark; Kalina Bontcheva
    Description

    The data consists of about 1.5 million English tweets annotated for part-of-speech using Ritter's extension of the PTB tagset. The tweets are from 2012 and 2013, tokenized using the GATE tokenizer and tagged jointly using the CMU ARK tagger and Ritter's T-POS tagger. A tweet is added to the dataset only when the outputs of both taggers are completely compatible over the whole tweet.

  14. Character-level part-of-speech tagger of Slovene language - Dataset - B2FIND...

    • b2find.dkrz.de
    Updated Oct 24, 2023
    + more versions
    Cite
    (2023). Character-level part-of-speech tagger of Slovene language - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/b6ff53f7-ccc8-5662-a68c-96ee7289eea1
    Explore at:
    Dataset updated
    Oct 24, 2023
    Description

    Part-of-speech tagger for Slovene language implemented using convolutional and LSTM neural networks. Tagger uses character-level representation of sentences. The tagger has been trained on the ssj500k 2.1 corpus, http://hdl.handle.net/11356/1181.

  15. ELG API for Icelandic POS tagger

    • live.european-language-grid.eu
    Updated Sep 26, 2022
    Cite
    Reykjavik University (2022). ELG API for Icelandic POS tagger [Dataset]. https://live.european-language-grid.eu/catalogue/tool-service/18084
    Explore at:
    Dataset updated
    Sep 26, 2022
    Dataset authored and provided by
    Reykjavik University
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    ELG API for POS, a part-of-speech tagger for Icelandic by the Language and Voice Lab at Reykjavik University, which is licensed under the Apache License 2.0.

    The Tokenizer PyPI package is installed with pip when the Docker image is built. This tool was developed by Miðeind and is licensed under the MIT license. The ELG API implementation imports this PyPI package.

  16. GATE: Universal Dependencies POS Tagger for ca / Catalan

    • live.european-language-grid.eu
    Updated Nov 12, 2020
    + more versions
    Cite
    (2020). GATE: Universal Dependencies POS Tagger for ca / Catalan [Dataset]. https://live.european-language-grid.eu/catalogue/tool-service/1431
    Explore at:
    Dataset updated
    Nov 12, 2020
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Universal Dependencies POS tagger for ca / Catalan, based on a simple window-based maximum entropy model.

  17. Hachidaishu part of speech dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 21, 2022
    Cite
    Yamamoto, Hilofumi (2022). Hachidaishu part of speech dataset [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_4835805
    Explore at:
    Dataset updated
    Feb 21, 2022
    Dataset provided by
    Yamamoto, Hilofumi
    Hodošček, Bor
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Hachidaishu part-of-speech dataset

    This dataset contains the part-of-speech information of the Imperial Anthology of Japanese Poetry and the Hachidaishu.

    Data offset

    Example: #1 Kokinshu

    10001 年/名/とし の/格助/の 内/名/うち に/格助/に 春/名/はる は/係助/は き/カ変-用:来:く/き に/完-用:ぬ:ぬ/に けり/過-終:けり:けり/けり 一とせ/名/ひととせ を/*助/を こそ/名/こぞ と/格助/と や/係助/や いは/ハ四-未:言ふ:いふ/いは ん/推-終体:む:む/む ことし/名/ことし と/格助/と や/係助/や いは/ハ四-未:言ふ:いふ/いは ん/推-終体:む:む/む

    Each line is one poem: tokens are separated by spaces, and each token consists of POS elements separated by slashes.

    The first column ("10001" in the example) contains two elements: the first digit is an anthology ID and the rest is a poem ID. The anthology IDs are: 1..Kokinshu, 2..Gosenshu, 3..Shuishu, 4..Goshuishu, 5..Kin'yoshu, 6..Shikashu, 7..Senzaishu, and 8..Shinkokinshu.

    The poem ID is the same as in the database "Nijuichidaishu."

    The second and following columns contain the information for each token.

    For tokens without conjugation, such as nouns and particles: text/POS/reading.

    For tokens with conjugation, such as verbs and adjectives: text/POS:lemma-kanji:lemma-reading/reading.
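    The line format described above can be parsed roughly as follows. This is a sketch under stated assumptions, not a tool shipped with the dataset: the function names and the dictionary keys are hypothetical, the sample line is a shortened version of the Kokinshu example, and the poem ID is kept as a string.

```python
# Anthology IDs as listed in the dataset description
ANTHOLOGIES = {
    1: "Kokinshu", 2: "Gosenshu", 3: "Shuishu", 4: "Goshuishu",
    5: "Kin'yoshu", 6: "Shikashu", 7: "Senzaishu", 8: "Shinkokinshu",
}


def parse_token(token: str) -> dict:
    """Split text/POS/reading, or text/POS:lemma-kanji:lemma-reading/reading."""
    text, pos_field, reading = token.split("/")
    pos, _, lemma = pos_field.partition(":")
    entry = {"text": text, "pos": pos, "reading": reading}
    if lemma:  # only conjugating tokens carry lemma information
        lemma_kanji, _, lemma_reading = lemma.partition(":")
        entry["lemma_kanji"] = lemma_kanji
        entry["lemma_reading"] = lemma_reading
    return entry


def parse_poem_line(line: str) -> dict:
    """Parse one poem line: an ID column followed by space-separated tokens."""
    first, *tokens = line.split()
    return {
        "anthology": ANTHOLOGIES[int(first[0])],  # first digit
        "poem_id": first[1:],                     # remaining digits
        "tokens": [parse_token(t) for t in tokens],
    }


sample = "10001 年/名/とし の/格助/の き/カ変-用:来:く/き"
poem = parse_poem_line(sample)
print(poem["anthology"], poem["poem_id"], len(poem["tokens"]))
```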

  18. Data from: MULTEXT-East free lexicons 4.0

    • clarin.si
    • live.european-language-grid.eu
    Updated Jul 13, 2015
    Cite
    Tomaž Erjavec; Ştefan Bruda; Ivan Derzhanski; Ludmila Dimitrova; Radovan Garabík; Peter Holozan; Nancy Ide; Heiki-Jaan Kaalep; Natalia Kotsyba; Csaba Oravecz; Vladimír Petkevič; Greg Priest-Dorman; Igor Shevchenko; Kiril Simov; Lydia Sinapova; Han Steenwijk; Laszlo Tihanyi; Dan Tufiş; Jean Véronis (2015). MULTEXT-East free lexicons 4.0 [Dataset]. https://www.clarin.si/repository/xmlui/handle/11356/1041?show=full
    Explore at:
    Dataset updated
    Jul 13, 2015
    Authors
    Tomaž Erjavec; Ştefan Bruda; Ivan Derzhanski; Ludmila Dimitrova; Radovan Garabík; Peter Holozan; Nancy Ide; Heiki-Jaan Kaalep; Natalia Kotsyba; Csaba Oravecz; Vladimír Petkevič; Greg Priest-Dorman; Igor Shevchenko; Kiril Simov; Lydia Sinapova; Han Steenwijk; Laszlo Tihanyi; Dan Tufiş; Jean Véronis
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The MULTEXT-East morphosyntactic lexicons have a simple structure, where each line is a lexical entry with three tab-separated fields: (1) the word-form, the inflected form of the word; (2) the lemma, the base-form of the word; (3) the MSD, the morphosyntactic description of the word-form, i.e., its fine-grained PoS tag, as defined in the MULTEXT-East morphosyntactic specifications.

    This submission contains the freely available MULTEXT-East lexicons, while a separate submission (http://hdl.handle.net/11356/1042) gives those that are available only for non-commercial use.
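    Given the three-field, tab-separated line format described above, reading a lexicon entry might look like the sketch below. The `LexiconEntry` type and the sample line are illustrative assumptions; the sample is not copied from the lexicons themselves.

```python
from typing import NamedTuple


class LexiconEntry(NamedTuple):
    word_form: str  # inflected form of the word
    lemma: str      # base form of the word
    msd: str        # morphosyntactic description (fine-grained PoS tag)


def parse_lexicon_line(line: str) -> LexiconEntry:
    """Parse one 'word-form<TAB>lemma<TAB>MSD' lexicon line."""
    word_form, lemma, msd = line.rstrip("\n").split("\t")
    return LexiconEntry(word_form, lemma, msd)


# Illustrative line only; the MSD value is a made-up example of the tag shape
entry = parse_lexicon_line("walks\twalk\tVmip3s\n")
print(entry.lemma, entry.msd)  # walk Vmip3s
```

    Unpacking into exactly three names makes a malformed line fail loudly with a `ValueError` instead of silently producing a partial entry.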

  19. mac_morpho

    • huggingface.co
    • paperswithcode.com
    Updated Dec 21, 2020
    + more versions
    Cite
    NILC NLP (2020). mac_morpho [Dataset]. https://huggingface.co/datasets/nilc-nlp/mac_morpho
    Explore at:
    Dataset updated
    Dec 21, 2020
    Dataset authored and provided by
    NILC NLP
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Mac-Morpho is a corpus of Brazilian Portuguese texts annotated with part-of-speech tags. Its first version was released in 2003 [1], and since then, two revisions have been made in order to improve the quality of the resource [2, 3]. The corpus is available for download split into train, development and test sections. These are 76%, 4% and 20% of the corpus total, respectively (the reason for the unusual numbers is that the corpus was first split into 80%/20% train/test, and then 5% of the train section was set aside for development). This split was used in [3], and new POS tagging research with Mac-Morpho is encouraged to follow it in order to make consistent comparisons possible.
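    The 76% / 4% / 20% figures follow directly from the two-step split described above, as this small arithmetic check shows. Exact fractions are used to avoid floating-point rounding; the variable names are illustrative.

```python
from fractions import Fraction

train_and_dev = Fraction(80, 100)        # first split: 80% train, 20% test
test = Fraction(20, 100)
dev = Fraction(5, 100) * train_and_dev   # 5% of the train section set aside
train = train_and_dev - dev              # what remains for training

print(f"{float(train):.0%} {float(dev):.0%} {float(test):.0%}")  # 76% 4% 20%
```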

    [1] Aluísio, S., Pelizzoni, J., Marchi, A.R., de Oliveira, L., Manenti, R., Marquiafável, V. 2003. An account of the challenge of tagging a reference corpus for brazilian portuguese. In: Proceedings of the 6th International Conference on Computational Processing of the Portuguese Language. PROPOR 2003

    [2] Fonseca, E.R., Rosa, J.L.G. 2013. Mac-morpho revisited: Towards robust part-of-speech. In: Proceedings of the 9th Brazilian Symposium in Information and Human Language Technology – STIL

    [3] Fonseca, E.R., Aluísio, Sandra Maria, Rosa, J.L.G. 2015. Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese. Journal of the Brazilian Computer Society.

  20. Data from: The CLASSLA-StanfordNLP model for morphosyntactic annotation of...

    • clarin.si
    • live.european-language-grid.eu
    Updated Jul 17, 2020
    Cite
    Nikola Ljubešić; Vanja Štefanec (2020). The CLASSLA-StanfordNLP model for morphosyntactic annotation of non-standard Croatian 1.0 [Dataset]. https://www.clarin.si/repository/xmlui/handle/11356/1331
    Explore at:
    Dataset updated
    Jul 17, 2020
    Authors
    Nikola Ljubešić; Vanja Štefanec
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This model for morphosyntactic annotation of non-standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the hr500k training corpus (http://hdl.handle.net/11356/1210), the ReLDI-NormTagNER-hr corpus (http://hdl.handle.net/11356/1241), the RAPUT corpus (https://www.aclweb.org/anthology/L16-1513/) and the ReLDI-NormTagNER-sr corpus (http://hdl.handle.net/11356/1240), using the CLARIN.SI-embed.hr word embeddings (http://hdl.handle.net/11356/1205). These corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The model produces simultaneously UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~95.11.
