DistSim file value for Stanford NER

Question

In one of the example .prop files that ship with the Stanford NER software, there are two options that I don't understand:

useDistSim = true
distSimLexicon = /u/nlp/data/pos_tags_are_useless/egw4-reut.512.clusters

Does anyone have a hint about what DistSim means, and where I can find additional documentation on how to use these options?

UPDATE: I just found out that DistSim stands for distributional similarity. I am still wondering what this means in this context.

nlp named-entity-recognition stanford-nlp

titusn Jul 18 '13 at 12:59

1 answer

Christopher Manning · Accepted Answer · 2013-07-20T18:02:33+0000

"DistSim" refers to using features based on word classes/clusters built with distributional similarity clustering methods (e.g., Brown clustering, exchange clustering). Word classes group words that are similar, semantically and/or syntactically, and allow an NER system to generalize better, including better handling of words that are not in the NER system's training data. Many of our distributed models use distributional similarity clustering features as well as word identity features, and gain substantially from doing so.

In Stanford NER there is a whole group of flags/properties that affect how distributional similarity is interpreted and used: useDistSim, distSimLexicon, distSimFileFormat, distSimMaxBits, casedDistSim, numberEquivalenceDistSim, and unknownWordDistSimClass. You need to look at the code in NERFeatureFactory.java to decode the details, but in the simple case you just need the first two, and they need to be set when training the model as well as at test time.

The default format of the lexicon is just a text file with a series of lines, each with two tab-separated columns: word and clusterName. The cluster names are arbitrary.
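To make the lexicon format concrete, here is a minimal Python sketch of reading a two-column, tab-separated file and turning a word's cluster into a feature string. It only illustrates the idea and is not how NERFeatureFactory.java is implemented: the function names, the sample words and cluster names, the "DISTSIM=" feature prefix, and the "UNK" fallback class are all assumptions made up for this example.

```python
import io


def load_distsim_lexicon(lines):
    """Parse 'word<TAB>clusterName' lines into a dict (hypothetical helper)."""
    lexicon = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != 2:
            raise ValueError("expected two tab-separated columns: %r" % line)
        word, cluster = parts
        lexicon[word] = cluster
    return lexicon


def cluster_feature(word, lexicon, unknown_class="UNK"):
    """Return a cluster-based feature; unseen words share one fallback class.

    The 'DISTSIM=' prefix and 'UNK' default are illustrative assumptions.
    """
    return "DISTSIM=" + lexicon.get(word, unknown_class)


# Made-up sample data: cluster names are arbitrary labels.
SAMPLE = "London\tc17\nParis\tc17\nwalked\tc42\n"
lexicon = load_distsim_lexicon(io.StringIO(SAMPLE))

print(cluster_feature("Paris", lexicon))   # same cluster as "London"
print(cluster_feature("Zurich", lexicon))  # not in the lexicon, falls back
```

Because "London" and "Paris" share a cluster, a model that saw only "London" labeled as a location during training can still fire the same cluster feature on "Paris", which is the generalization benefit described above.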
