I use Stanford NLP to POS-tag Spanish texts. I get a POS tag for each word, but I notice that the tags contain only the first four fields of the AnCora tag; the last three fields (person, number and gender) are missing.
Here is my code (please excuse the JRuby):
    props = java.util.Properties.new()
    props.put("tokenize.language", "es")
    props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse")
    props.put("ner.model", "edu/stanford/nlp/models/ner/spanish.ancora.distsim.s512.crf.ser.gz")
    props.put("pos.model", "/stanford-postagger-full-2015-01-30/models/spanish-distsim.tagger")
    props.put("parse.model", "edu/stanford/nlp/models/lexparser/spanishPCFG.ser.gz")
    pipeline = StanfordCoreNLP.new(props)
    annotation = Annotation.new("No sé qué estoy haciendo. Me pregunto si esto va a funcionar.")
I get this as output:
    [Text = No CharacterOffsetBegin = 0 CharacterOffsetEnd = 2 PartOfSpeech = rn Lemma = no NamedEntityTag = O]
    [Text = sé CharacterOffsetBegin = 3 CharacterOffsetEnd = 5 PartOfSpeech = vmip000 Lemma = sé NamedEntityTag = O]
    [Text = qué CharacterOffsetBegin = 6 CharacterOffsetEnd = 9 PartOfSpeech = pt000000 Lemma = qué NamedEntityTag = O]
    [Text = estoy CharacterOffsetBegin = 10 CharacterOffsetEnd = 15 PartOfSpeech = vmip000 Lemma = estoy NamedEntityTag = O]
    [Text = haciendo CharacterOffsetBegin = 16 ...]
    [Text = . CharacterOffsetBegin = 24 CharacterOffsetEnd = 25 PartOfSpeech = fp Lemma = . NamedEntityTag = O]
(I notice that the lemmas are incorrect, but that is probably a matter for a separate question. Never mind: I see that Stanford NLP does not support Spanish lemmatization.)
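To make clear what I mean by the missing fields, here is a minimal Ruby sketch that splits a verb tag into the seven positions I described (four leading fields, then person, number and gender). The field names and the helpers `split_verb_tag` and `empty_fields` are my own labels for illustration, not part of Stanford NLP, and the sketch assumes the seven-character verb layout only; other categories such as `rn` or `fp` have different lengths.

```ruby
# Positional fields of a seven-character verb tag, as described in the question.
# These names are illustrative labels, not identifiers from Stanford NLP.
VERB_FIELDS = %i[category type mood tense person number gender].freeze

# Split a verb tag into a hash of field name => character.
def split_verb_tag(tag)
  unless tag.length == VERB_FIELDS.length && tag.start_with?("v")
    raise ArgumentError, "expected a 7-character verb tag, got #{tag.inspect}"
  end
  VERB_FIELDS.zip(tag.chars).to_h
end

# Names of the fields whose slot holds the placeholder "0".
def empty_fields(tag)
  split_verb_tag(tag).select { |_field, char| char == "0" }.keys
end

# The tag I get for "sé": person, number and gender are all "0".
p split_verb_tag("vmip000")
p empty_fields("vmip000")
```

For `vmip000`, `empty_fields` reports person, number and gender as unfilled, which is exactly the information I would like the tagger to provide.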