Spanish POS tagging with Stanford NLP - is it possible to get person / number / gender?

I am using Stanford NLP to POS-tag Spanish texts. I can get a POS tag for each word, but I notice that I only get the first four sections of the Ancora tag; the last three sections, for person, number and gender, are left out.

  • Why does Stanford NLP use only a smaller version of the Ancora tag?

  • Is it possible to get the whole tag using Stanford NLP?

Here is my code (please excuse the JRuby ...):

    props = java.util.Properties.new()
    props.put("tokenize.language", "es")
    props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse")
    props.put("ner.model", "edu/stanford/nlp/models/ner/spanish.ancora.distsim.s512.crf.ser.gz")
    props.put("pos.model", "/stanford-postagger-full-2015-01-30/models/spanish-distsim.tagger")
    props.put("parse.model", "edu/stanford/nlp/models/lexparser/spanishPCFG.ser.gz")
    pipeline = StanfordCoreNLP.new(props)
    annotation = Annotation.new("No sé qué estoy haciendo. Me pregunto si esto va a funcionar.")

I get this as output:

    [Text=No CharacterOffsetBegin=0 CharacterOffsetEnd=2 PartOfSpeech=rn Lemma=no NamedEntityTag=O]
    [Text=sé CharacterOffsetBegin=3 CharacterOffsetEnd=5 PartOfSpeech=vmip000 Lemma=sé NamedEntityTag=O]
    [Text=qué CharacterOffsetBegin=6 CharacterOffsetEnd=9 PartOfSpeech=pt000000 Lemma=qué NamedEntityTag=O]
    [Text=estoy CharacterOffsetBegin=10 CharacterOffsetEnd=15 PartOfSpeech=vmip000 Lemma=estoy NamedEntityTag=O]
    [Text=haciendo CharacterOffsetBegin=16 CharacterOffsetEnd=24 ...]
    [Text=. CharacterOffsetBegin=24 CharacterOffsetEnd=25 PartOfSpeech=fp Lemma=. NamedEntityTag=O]

(I notice that the lemmas are incorrect, but that is probably a matter for a separate question. Never mind, I see that Stanford NLP does not support Spanish lemmatization.)

2 answers

Why does Stanford NLP use only a smaller version of the Ancora tag?

This was a practical decision made to ensure high tagging accuracy. (Retaining the morphological information in the tags caused the entire tagger to suffer from data sparsity, and it did worse not only on morphological annotation but across the board.)
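To make concrete which slots are affected, here is a minimal sketch in plain Ruby (it should also run under JRuby) that splits a seven-character verb tag such as vmip000 into named fields. The field names assume the EAGLES-style layout for verb tags (category, type, mood, tense, person, number, gender), and the helper names decode_verb_tag and missing_fields are made up for this illustration; they are not part of Stanford CoreNLP.

```ruby
# Assumed EAGLES-style layout of a 7-character verb tag.
# These names are illustrative, not a Stanford CoreNLP API.
VERB_FIELDS = %i[category type mood tense person number gender].freeze

# Split a verb tag into a hash of field name => character.
def decode_verb_tag(tag)
  unless tag.length == VERB_FIELDS.length && tag.start_with?("v")
    raise ArgumentError, "expected a 7-character verb tag, got #{tag.inspect}"
  end
  VERB_FIELDS.zip(tag.chars).to_h
end

# List the fields whose slot holds the placeholder "0".
def missing_fields(tag)
  decode_verb_tag(tag).select { |_, value| value == "0" }.keys
end
```

For the reduced tag from the question, missing_fields("vmip000") returns [:person, :number, :gender]. Note that a "0" does not always mean the information was dropped: a finite verb has no gender, so that slot is "0" in a full tag as well.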

Is it possible to get the whole tag using Stanford NLP?

No. You could get quite far with a simple rule-based system, or use the Stanford Classifier to train your own morphological annotator. (Feel free to share your code if you go down either path!)
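As a rough illustration of the rule-based route, here is a minimal sketch in plain Ruby (it should also run under JRuby). It restores person and number only for present-indicative verb tags such as vmip000, using a tiny table of irregular forms plus a few suffix rules. The names fill_person_number, IRREGULAR_PRESENT and SUFFIX_RULES are hypothetical and not part of Stanford CoreNLP, and the rules are deliberately incomplete: they will mis-tag many forms and ignore every other tense and mood.

```ruby
# Hypothetical helper tables; not part of Stanford CoreNLP.
# Irregular present-indicative forms => [person, number].
IRREGULAR_PRESENT = {
  "sé"    => ["1", "s"],
  "estoy" => ["1", "s"],
  "soy"   => ["1", "s"],
  "voy"   => ["1", "s"],
  "es"    => ["3", "s"],
  "va"    => ["3", "s"]
}.freeze

# Suffix rules, tried in order; the first match wins.
SUFFIX_RULES = [
  [/(amos|emos|imos)\z/, "1", "p"],
  [/(áis|éis|ís)\z/,     "2", "p"],
  [/(an|en)\z/,          "3", "p"],
  [/(as|es)\z/,          "2", "s"],
  [/o\z/,                "1", "s"],
  [/(a|e)\z/,            "3", "s"]
].freeze

# Return the tag with person and number filled in, or the tag
# unchanged when it is not a 7-character present-indicative verb
# tag or when no rule applies.
def fill_person_number(word, tag)
  return tag unless tag.length == 7 && tag.start_with?("v") && tag[2, 2] == "ip"

  key = word.downcase
  rule = SUFFIX_RULES.find { |pattern, _, _| key =~ pattern }
  person, number = IRREGULAR_PRESENT[key] || (rule && rule[1, 2])
  return tag if person.nil?

  # Keep the first four slots and the gender slot as they are.
  tag[0, 4] + person + number + tag[6]
end
```

With these rules, fill_person_number("sé", "vmip000") gives "vmip1s0", and a non-verb tag such as "rn" is returned untouched. A real system would need rules for the other tenses and moods, and for nouns, adjectives and pronouns, which is where training a classifier starts to look attractive.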


If you are not strictly limited to the Stanford POS tagger, you could try RDRPOSTagger, a POS and morphological tagging toolkit. RDRPOSTagger provides pre-trained POS and morphological tagging models for 40 different languages, including Spanish.

For Spanish POS and morphological tagging, RDRPOSTagger was trained on the IULA Spanish LSP Treebank. It obtained a tagging accuracy of 97.95% at a tagging speed of 200 thousand words / second in the Java implementation (10 thousand words / second in the Python implementation), measured on a Windows 7 64-bit computer with a Core i5 2.50GHz CPU and 6 GB of memory.


Source: https://habr.com/ru/post/1207330/

