I am trying to check the spelling accuracy of text samples using Stanford NLP. This is only a text metric, not a filter or anything else, so it's fine if it's a little off, as long as the error is uniform.
My first idea was to check whether each word is known to the parser's vocabulary:
    private static LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");

    @Analyze(weight=25, name="Spelling")
    public double spelling() {
        int result = 0;
        for (List<? extends HasWord> list : sentences) {
            for (HasWord w : list) {
                // Count every token the parser's lexicon does not know
                if (!lp.getLexicon().isKnown(w.word())) {
                    System.out.format("misspelled: %s\n", w.word());
                    result++;
                }
            }
        }
        // Cast first; int / int would truncate the average
        return (double) result / sentences.size();
    }
However, this creates quite a few false positives:
misspelled: Sincerity
misspelled: Sisyphus
misspelled: Sisyphus
misspelled: fidelity
misspelled: negates
misspelled: gods
misspelled: henceforth
misspelled: atom
misspelled: flake
misspelled: Sisyphus
misspelled: Camus
misspelled: foandf
misspelled: foandf
misspelled: babby
misspelled: formd
misspelled: gurl
misspelled: pregnent
misspelled: babby
misspelled: formd
misspelled: gurl
misspelled: pregnent
misspelled: Camus
misspelled: Sincerity
misspelled: Sisyphus
misspelled: Sisyphus
misspelled: fidelity
misspelled: negates
misspelled: gods
misspelled: henceforth
misspelled: atom
misspelled: flake
misspelled: Sisyphus
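For comparison, here is a minimal sketch of the same metric computed against a plain word list instead of the parser's lexicon. Everything in it is an assumption on my part rather than anything from Stanford NLP: the class name `SpellingMetric`, the helper `misspelledPerSentence`, and the idea that `vocabulary` is a set of lowercase words loaded from some dictionary file. It lowercases each token before the lookup and skips tokens that are not purely alphabetic. I have not verified that this removes the false positives above.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class SpellingMetric {

    // Hypothetical helper: average number of unknown words per sentence.
    // `vocabulary` is assumed to contain lowercase word forms from a word list.
    public static double misspelledPerSentence(List<List<String>> sentences,
                                               Set<String> vocabulary) {
        if (sentences.isEmpty()) {
            return 0.0; // avoid dividing by zero on empty input
        }
        int unknown = 0;
        for (List<String> sentence : sentences) {
            for (String token : sentence) {
                // Skip empty tokens, punctuation and numbers; only alphabetic tokens are checked.
                if (token.isEmpty() || !token.chars().allMatch(Character::isLetter)) {
                    continue;
                }
                // Lowercase so that "Sincerity" is looked up as "sincerity".
                if (!vocabulary.contains(token.toLowerCase(Locale.ROOT))) {
                    unknown++;
                }
            }
        }
        // Cast first so the division is not truncated to an integer.
        return (double) unknown / sentences.size();
    }
}
```

Proper nouns such as Sisyphus and Camus would still be counted unless the word list happens to include them.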
Any ideas on how to make this better?