Language Discovery Using Stanford NLP

Question

I am wondering whether Stanford CoreNLP can be used to determine which language a sentence is written in. If so, how accurate can these algorithms be?

+8

nlp stanford-nlp

Kelvin lee Mar 26 '15 at 10:34


2 answers

Stanford CoreNLP does not have a language identifier (at least not yet); see http://nlp.stanford.edu/software/corenlp.shtml

There are many other tools for detecting / identifying a language. But take the stated accuracy with a pinch of salt. It is usually evaluated narrowly, limited to:

a fixed list of languages,
test sentences of significant length,
text written in a single language, and
a skewed proportion of training to test data.
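
One way to check a stated accuracy against these caveats is to score a tool separately on short and long inputs. The sketch below assumes a hypothetical `identify(text)` callable that returns a language code; the helper name `accuracy_by_length` and the 20-character cut-off are illustrative choices, not part of any of the tools discussed here.

```python
from collections import defaultdict


def accuracy_by_length(identify, labelled_texts, short_limit=20):
    """Return accuracy per length bucket for a classifier `identify(text) -> lang`.

    `labelled_texts` is an iterable of (text, gold_language) pairs.
    Texts shorter than `short_limit` characters go into the "short" bucket.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for text, gold in labelled_texts:
        bucket = "short" if len(text) < short_limit else "long"
        totals[bucket] += 1
        if identify(text) == gold:
            hits[bucket] += 1
    # Only buckets that actually received texts appear in the result.
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


def always_english(text):
    """Stand-in classifier (hypothetical): swap in a real tool's function."""
    return "en"


sample = [
    ("hi", "en"),
    ("ja", "de"),
    ("this is a longer english sentence", "en"),
]
print(accuracy_by_length(always_english, sample))
```

Reporting the buckets separately makes it harder for a single headline number to hide weak performance on short inputs.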

Well-known language identification tools include:

TextCat ( http://cran.r-project.org/web/packages/textcat/index.html )
CLD2 ( https://code.google.com/p/cld2/ )
LingPipe ( http://alias-i.com/lingpipe/demos/tutorial/langid/read-me.html )
LangID ( https://github.com/saffsd/langid.py )
CLD3 ( https://github.com/google/cld3 )

For a longer list compiled by meta-guide.com, see http://meta-guide.com/software-meta-guide/100-best-github-language-identification/

The general task associated with identifying a language (with training / testing data) includes:

Also look at:

+10

alvas Mar 27 '15 at 7:44


Nikita Astrakhantsev · Accepted Answer · 2015-03-26T22:53:36+0000

Stanford CoreNLP almost certainly does not have a language identifier at the moment. I say "almost" because non-existence is much harder to prove.

EDIT: Nonetheless, here is some indirect evidence:

language identification is not mentioned on the main page, the CoreNLP page, or the FAQ (although the FAQ does include the question "How to run CoreNLP in other languages?"), nor in the 2014 articles by the CoreNLP authors;
tools that integrate several NLP libraries, including Stanford CoreNLP, use a different library for language identification, for example DKPro Core ASL; also, other users discussing language identification did not mention CoreNLP as having this feature;
the CoreNLP source contains Language classes, but none of them has anything to do with language identification; you can manually check all 84 occurrences of the word "language" here.

Try TIKA, TextCat, or the Java Language Detection Library (its authors claim "99% accuracy for 53 languages").

In general, the quality depends on the size of the input text: if it is long enough (say, at least a few words that were not specially chosen), the accuracy can be pretty good, about 95%.
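
To see why input length matters, here is a toy character-trigram identifier that uses only the Python standard library. It is an illustrative sketch, not how CoreNLP or any of the tools named above is implemented; the three one-sentence training samples, the `identify` function, and the cosine scoring are all assumptions chosen for brevity.

```python
from collections import Counter
from math import sqrt

# Tiny made-up training samples; real tools train on far more text.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is a short english text",
    "de": "der schnelle braune fuchs springt über den faulen hund und das ist ein kurzer deutscher text",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et ceci est un court texte français",
}


def trigram_counts(text):
    """Count overlapping character trigrams, padding both ends with a space."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


# One trigram profile per language in the fixed list.
PROFILES = {lang: trigram_counts(sample) for lang, sample in SAMPLES.items()}


def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def identify(text):
    """Return (best_language, scores). The answer is always one of PROFILES,
    even when the text is in a language the model has never seen."""
    counts = trigram_counts(text)
    scores = {lang: cosine(counts, profile) for lang, profile in PROFILES.items()}
    return max(scores, key=scores.get), scores


# A short input yields few trigrams, so the scores rest on little evidence.
print(identify("hund"))
print(identify("der hund springt über den fuchs und das ist ein text"))
```

A one-word input contributes only a handful of trigrams, so a single shared or missing trigram can flip the decision, whereas a full sentence spreads the evidence over many more. The sketch also forces an answer from its fixed list of languages, which is one reason published accuracy figures do not carry over to arbitrary text.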
