I run the Stanford NER tagger from both Python and Java, but I see a difference in the results.
For example, take the sentence "Involved in all aspects of data modeling using ERwin as the primary software for this."
Java result:
"ERwin": "PERSON"
Python result:
In [6]: NERTagger.tag("Involved in all aspects of data modeling using ERwin as the primary software for this.".split())
Out[6]:
[(u'Involved', u'O'), (u'in', u'O'), (u'all', u'O'), (u'aspects', u'O'),
 (u'of', u'O'), (u'data', u'O'), (u'modeling', u'O'), (u'using', u'O'),
 (u'ERwin', u'O'), (u'as', u'O'), (u'the', u'O'), (u'primary', u'O'),
 (u'software', u'O'), (u'for', u'O'), (u'this.', u'O')]
The Python NLTK wrapper does not tag "ERwin" as PERSON.
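To make the mismatch easier to spot on longer inputs, the two outputs can be compared token by token. This is only a minimal sketch: `diff_entity_tags` is a hypothetical helper of mine, not part of NLTK or Stanford NER, and it assumes the Java output lists only the tokens that received an entity label, so every other token is treated as `O`.

```python
# Hypothetical helper (not part of NLTK or Stanford NER) for comparing the two outputs.
def diff_entity_tags(java_entities, python_tags, default="O"):
    """Return (token, java_label, python_label) for each token whose labels differ.

    java_entities: dict mapping token -> label; tokens not listed are
                   assumed to carry the `default` label.
    python_tags:   list of (token, label) pairs, as returned by the tagger.
    """
    mismatches = []
    for token, python_label in python_tags:
        java_label = java_entities.get(token, default)
        if java_label != python_label:
            mismatches.append((token, java_label, python_label))
    return mismatches


# The two results quoted above.
java_entities = {"ERwin": "PERSON"}
python_tags = [
    ("Involved", "O"), ("in", "O"), ("all", "O"), ("aspects", "O"),
    ("of", "O"), ("data", "O"), ("modeling", "O"), ("using", "O"),
    ("ERwin", "O"), ("as", "O"), ("the", "O"), ("primary", "O"),
    ("software", "O"), ("for", "O"), ("this.", "O"),
]

print(diff_entity_tags(java_entities, python_tags))
# prints [('ERwin', 'PERSON', 'O')]
```

For this sentence the only disagreement is the "ERwin" token.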
Interestingly, both Python and Java use the same trained model (english.all.3class.caseless.distsim.crf.ser.gz), released on 2015-04-20.
My ultimate goal is to get Python to behave exactly as Java does.
I looked at StanfordNERTagger in nltk.tag to see if there is anything I can change. The following is the wrapper code:
class StanfordNERTagger(StanfordTagger):
    """
    A class for Named-Entity Tagging with Stanford Tagger. The input is the paths to:

    - a model trained on training data
    - (optionally) the path to the stanford tagger jar file. If not specified here,
      then this jar file must be specified in the CLASSPATH environment variable.
    - (optionally) the encoding of the training data (default: UTF-8)

    Example:

        >>> from nltk.tag import StanfordNERTagger
        >>> st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
Or is the difference due to a different classifier being used? The Java code seems to use an AbstractSequenceClassifier, whereas the NLTK Python wrapper uses a CRFClassifier. Is there a way to use AbstractSequenceClassifier in the Python wrapper?