Language Detection Using Stanford NLP

I am wondering whether Stanford CoreNLP can be used to determine which language a sentence is written in. If so, how accurate can such algorithms be?

+8
2 answers

Stanford CoreNLP almost certainly does not have a language identifier at the moment. I say "almost" because nonexistence is much harder to prove.

EDIT: Nonetheless, indirect evidence is given below:

  • language identification is not mentioned on the main page, the CoreNLP page, or the FAQ (although there is a question "How to run CoreNLP in other languages?"), nor in the 2014 articles by the CoreNLP authors;
  • tools that integrate several NLP libraries, including Stanford CoreNLP, use a different library for language identification, for example DKPro Core ASL; likewise, other users discussing language identification together with CoreNLP do not mention such a feature;
  • the CoreNLP source contains Language classes, but none of them has anything to do with language identification; you can manually check all 84 occurrences of the word "language" here.

Try Apache Tika, TextCat, or the Language Detection Library for Java (its authors report "99% accuracy for 53 languages").

In general, the quality depends on the size of the input text: if it is long enough (say, at least a few words, and not specially chosen), accuracy can be quite good, around 95%.
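To illustrate why input length matters, here is a minimal sketch of the ranked character n-gram approach (the Cavnar and Trenkle "out-of-place" measure) that TextCat-style tools are built on. It is not the code of any of the libraries named above; the function names, the `top_k` cutoff, and the tiny training sentences are assumptions made for illustration only.

```python
from collections import Counter

def ngram_profile(text, n_max=3, top_k=300):
    """Map each of the top_k most frequent character n-grams (n = 1..n_max) to its rank."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending frequency, then alphabetically, so ranks are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: rank for rank, (gram, _) in enumerate(ranked[:top_k])}

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams unknown to the language get a fixed maximum penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )

def identify(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))

# Illustrative training samples (an assumption; far too small for real use).
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and then it runs "
          "through the forest with the other animals",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann "
          "läuft er mit den anderen tieren durch den wald",
    "fr": "le renard brun rapide saute par-dessus le chien paresseux et puis "
          "il court dans la forêt avec les autres animaux",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in TRAINING.items()}

if __name__ == "__main__":
    print(identify("the animals run through the forest", PROFILES))
```

A one- or two-word input produces only a handful of n-grams, so its profile is dominated by noise and the distances to the language profiles become nearly indistinguishable; longer inputs give a more stable ranking, which is consistent with the length dependence described above.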

+11

Stanford CoreNLP does not have a language identifier (at least not yet); see http://nlp.stanford.edu/software/corenlp.shtml


There are many other tools for detecting / identifying a language, but take the stated accuracy with a pinch of salt. They are usually evaluated narrowly, restricted to:

  • a fixed list of languages,
  • test sentences of substantial length,
  • test sentences written in a single language, and
  • a skewed proportion of training to testing data.
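One way to check a published accuracy figure against these caveats is to measure accuracy separately for short and long inputs on your own labeled data. The sketch below assumes a detector exposed as a plain function from text to a language code; `accuracy_by_length`, the bucket thresholds, and the stand-in `toy_identify` detector are all hypothetical names introduced for illustration, not part of any library.

```python
from collections import defaultdict

def accuracy_by_length(identify, samples, bucket_edges=(20, 100)):
    """Compute accuracy per input-length bucket.

    identify: callable mapping text -> language code (any detector).
    samples: iterable of (text, true_language) pairs.
    bucket_edges: ascending character-length thresholds separating buckets.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for text, truth in samples:
        # Bucket index = number of thresholds the text length reaches.
        bucket = sum(len(text) >= edge for edge in bucket_edges)
        total[bucket] += 1
        correct[bucket] += identify(text) == truth
    return {b: correct[b] / total[b] for b in sorted(total)}

def toy_identify(text):
    # Stand-in detector for illustration only: guesses German on "ü" or "ß".
    return "de" if any(ch in text for ch in "üß") else "en"

samples = [
    ("ok", "de"),  # too short to carry any signal
    ("hi", "en"),
    ("über den Fluss und durch den Wald", "de"),
    ("over the river and through the woods", "en"),
]

result = accuracy_by_length(toy_identify, samples)
print(result)  # {0: 0.5, 1: 1.0}
```

Reporting one number per bucket, rather than a single overall figure, makes it visible when a headline accuracy was obtained mostly on long, single-language sentences from a closed set of languages.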

Well-known language identification tools include:

For a complete list from meta-guide.com, see http://meta-guide.com/software-meta-guide/100-best-github-language-identification/


Shared tasks related to language identification (with training / testing data) include:


Also look at:

+10

Source: https://habr.com/ru/post/984290/