"DistSim" refers to using features based on word classes/clusters built with distributional similarity clustering methods (e.g., Brown clustering, exchange clustering). Word classes group words that are similar, semantically and/or syntactically, and allow the NER system to generalize better, including better handling of words that are not in the NER system's training data. Many of the models we distribute use distributional similarity clustering features as well as word identity features, and benefit greatly from doing so.

There is a whole group of flags/properties in Stanford NER that affect how distributional similarity is interpreted and used: useDistSim, distSimLexicon, distSimFileFormat, distSimMaxBits, casedDistSim, numberEquivalenceDistSim, and unknownWordDistSimClass. To decode the details you need to look at the code in NERFeatureFactory.java, but in the simple case you just need the first two, and they need to be used when training the model as well as at test time.

The default lexicon format is just a text file with a series of lines, each with two tab-separated columns: word clusterName. The cluster names are arbitrary.
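To make the file format concrete, here is a minimal sketch in Python (not part of Stanford NER, which loads the file itself in Java). It writes a tiny lexicon in the two-column, tab-separated layout described above, reads it back, and builds the two property lines for the simple case. The words, the cluster names, the file name `example.clusters`, and all function names are made up for illustration; the assumption that `useDistSim` takes the value `true` should be checked against NERFeatureFactory.java.

```python
import os
import tempfile


def write_lexicon(path, word_to_cluster):
    """Write one 'word<TAB>clusterName' pair per line."""
    with open(path, "w", encoding="utf-8") as f:
        for word, cluster in word_to_cluster.items():
            f.write(f"{word}\t{cluster}\n")


def read_lexicon(path):
    """Parse the two-column, tab-separated file back into a dict."""
    lexicon = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            word, cluster = line.split("\t", 1)
            lexicon[word] = cluster
    return lexicon


def distsim_properties(lexicon_path):
    """Build the two property lines for the simple case.

    Assumption: useDistSim is a boolean flag set to 'true'.
    """
    return f"useDistSim = true\ndistSimLexicon = {lexicon_path}\n"


# Hypothetical words and arbitrary cluster names, for illustration only.
example = {"London": "c17", "Paris": "c17", "runs": "c42"}

lexicon_path = os.path.join(tempfile.mkdtemp(), "example.clusters")
write_lexicon(lexicon_path, example)
loaded = read_lexicon(lexicon_path)
props = distsim_properties(lexicon_path)

print(loaded)
print(props)
```

The same two property lines would have to appear in the properties used for training and for testing, since the model's features depend on the cluster lookup.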