Difference in output of the Stanford NER tagger: NLTK (Python) vs. Java

I run the Stanford NER tagger from both Python and Java, but I see a difference in the results.

For example, I tagged the sentence "Involved in all aspects of data modeling using ERwin as the primary software for this."

JAVA Result:

"ERwin": "PERSON" 

Python result:

    In [6]: NERTagger.tag("Involved in all aspects of data modeling using ERwin as the primary software for this.".split())
    Out[6]:
    [(u'Involved', u'O'), (u'in', u'O'), (u'all', u'O'), (u'aspects', u'O'),
     (u'of', u'O'), (u'data', u'O'), (u'modeling', u'O'), (u'using', u'O'),
     (u'ERwin', u'O'), (u'as', u'O'), (u'the', u'O'), (u'primary', u'O'),
     (u'software', u'O'), (u'for', u'O'), (u'this.', u'O')]

The Python NLTK wrapper fails to tag "ERwin" as PERSON.

Interestingly, Python and Java use the same trained model (english.all.3class.caseless.distsim.crf.ser.gz) from the 2015-04-20 release.

My ultimate goal is to get Python to produce the same results as Java.

I looked at StanfordNERTagger in nltk.tag to see if there is anything I can change. Here is its source code:

    class StanfordNERTagger(StanfordTagger):
        """
        A class for Named-Entity Tagging with Stanford Tagger. The input is the paths to:

        - a model trained on training data
        - (optionally) the path to the stanford tagger jar file. If not specified here,
          then this jar file must be specified in the CLASSPATH environment variable.
        - (optionally) the encoding of the training data (default: UTF-8)

        Example:

            >>> from nltk.tag import StanfordNERTagger
            >>> st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz') # doctest: +SKIP
            >>> st.tag('Rami Eid is studying at Stony Brook University in NY'.split()) # doctest: +SKIP
            [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
             ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
             ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
        """

        _SEPARATOR = '/'
        _JAR = 'stanford-ner.jar'
        _FORMAT = 'slashTags'

        def __init__(self, *args, **kwargs):
            super(StanfordNERTagger, self).__init__(*args, **kwargs)

        @property
        def _cmd(self):
            # Adding -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer
            # -tokenizerOptions tokenizeNLs=false for not using stanford Tokenizer
            return ['edu.stanford.nlp.ie.crf.CRFClassifier',
                    '-loadClassifier', self._stanford_model,
                    '-textFile', self._input_file_path,
                    '-outputFormat', self._FORMAT,
                    '-tokenizerFactory', 'edu.stanford.nlp.process.WhitespaceTokenizer',
                    '-tokenizerOptions', '\"tokenizeNLs=false\"']

        def parse_output(self, text, sentences):
            if self._FORMAT == 'slashTags':
                # Join everything together into one big list
                tagged_sentences = []
                for tagged_sentence in text.strip().split("\n"):
                    for tagged_word in tagged_sentence.strip().split():
                        word_tags = tagged_word.strip().split(self._SEPARATOR)
                        tagged_sentences.append((''.join(word_tags[:-1]), word_tags[-1]))

                # Separate it back out according to the input sentences
                result = []
                start = 0
                for sent in sentences:
                    result.append(tagged_sentences[start:start + len(sent)])
                    start += len(sent)
                return result

            raise NotImplementedError

Or could the difference be due to the use of a different classifier? (The Java code seems to use an AbstractSequenceClassifier, whereas the NLTK Python wrapper uses a CRFClassifier.) Is there a way to use AbstractSequenceClassifier from a Python wrapper?

1 answer

Try setting maxAdditionalKnownLCWords to 0 in the properties file (or on the command line) for CoreNLP and, if possible, for NLTK as well. This disables an option that lets the NER system learn a little from the test data, which can lead to slightly different results.
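NLTK's StanfordNERTagger does not expose this property directly, but since its _cmd property just builds the CRFClassifier command line, one workaround is to build the same command with the extra flag appended. The sketch below is a hypothetical helper (build_crf_cmd is not part of NLTK); in practice you would subclass StanfordNERTagger and override _cmd to return a list like this one.

```python
def build_crf_cmd(model_path, input_path, max_lc_words=0):
    """Build a CRFClassifier command line, pinning maxAdditionalKnownLCWords.

    `max_lc_words=0` disables learning lowercased words from the test data,
    which is the behavior the answer above suggests for reproducible output.
    This mirrors the list returned by StanfordNERTagger._cmd, plus the flag.
    """
    return ['edu.stanford.nlp.ie.crf.CRFClassifier',
            '-loadClassifier', model_path,
            '-textFile', input_path,
            '-outputFormat', 'slashTags',
            '-tokenizerFactory', 'edu.stanford.nlp.process.WhitespaceTokenizer',
            '-tokenizerOptions', '"tokenizeNLs=false"',
            '-maxAdditionalKnownLCWords', str(max_lc_words)]


# Example: the command a subclassed _cmd could return
cmd = build_crf_cmd('english.all.3class.caseless.distsim.crf.ser.gz', 'input.txt')
```

On the Java/CoreNLP side the equivalent is simply adding `maxAdditionalKnownLCWords = 0` to the properties file passed to CRFClassifier.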


Source: https://habr.com/ru/post/1239962/

