Meaning of Shorthand Tags

I tagged some Spanish text with the Stanford POS Tagger (via NLTK in Python).

Here is my code:

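import nltk
from nltk.tag.stanford import POSTagger

# Load the Spanish model and the tagger jar (paths relative to the working directory)
spanish_postagger = POSTagger('models/spanish.tagger', 'stanford-postagger.jar')
# Tag a whitespace-tokenized test sentence
spanish_postagger.tag('esta es una oracion de prueba'.split())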

Result:

 [(u'esta', u'pd000000'), (u'es', u'vsip000'), (u'una', u'di0000'), (u'oracion', u'nc0s000'), (u'de', u'sp000'), (u'prueba', u'nc0s000')] 

Where can I find out exactly what pd000000, vsip000, di0000, nc0s000, and sp000 mean?

1 answer

This is a simplified version of the tag set used in the AnCora treebank. You can find the documentation for that tag set here: https://web.archive.org/web/20160325024315/http://nlp.lsi.upc.edu/freeling/doc/tagsets/tagset-es.html

The "simplification" consists of nulling out many of the final fields that don't strictly belong in a part-of-speech tag. For example, our part-of-speech tagger will always give you null (0) values for the NER field of the original tag set (see the EAGLES noun documentation).

In short: the fields in the POS tags produced by our tagger correspond exactly to the AnCora POS fields, but many of those fields will be null. For most practical purposes, you only need to look at the first 2-4 characters of the tag. The first character always indicates the broad POS category, and the second character indicates some kind of subtype.
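As a rough illustration of reading only the leading characters, here is a minimal sketch in Python. The `BROAD_CATEGORY` table and the `describe_tag` helper are hypothetical names introduced for this example, and the letter-to-category mapping is my reading of the EAGLES-style conventions, so check it against the tag set documentation linked above before relying on it.

```python
# Assumed mapping from the first tag character to a broad POS category.
# It follows the EAGLES-style conventions as I understand them; verify it
# against the linked tag set documentation.
BROAD_CATEGORY = {
    "a": "adjective",
    "c": "conjunction",
    "d": "determiner",
    "f": "punctuation",
    "i": "interjection",
    "n": "noun",
    "p": "pronoun",
    "r": "adverb",
    "s": "preposition",
    "v": "verb",
    "w": "date",
    "z": "numeral",
}


def describe_tag(tag):
    """Return (broad category, tag without its trailing null fields)."""
    category = BROAD_CATEGORY.get(tag[:1].lower(), "unknown")
    # Trailing zeros are null fields; zeros in the middle of a tag are kept.
    significant = tag.rstrip("0")
    return category, significant


# Tagger output copied from the question.
tagged = [
    ("esta", "pd000000"),
    ("es", "vsip000"),
    ("una", "di0000"),
    ("oracion", "nc0s000"),
    ("de", "sp000"),
    ("prueba", "nc0s000"),
]

for word, tag in tagged:
    category, significant = describe_tag(tag)
    print(word, tag, category, significant)
```

Stripping the trailing zeros leaves exactly the informative part of each tag from the question, for example `vsip` for `vsip000`; the second and later characters still have to be looked up in the documentation.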


We are just beginning to write introductory documentation for using Spanish with CoreNLP (which covers understanding these tags and much more). For the moment, you can find more information on the first page of our technical documentation.


Source: https://habr.com/ru/post/1207326/

