Language detection

Question

I use Tesseract for OCR, mainly on invoices. However, Tesseract requires the language to be specified before it starts processing a file.

My plan is to run OCR with a predefined default language first, then use the resulting text to detect which language the document is actually written in. If it is not the default language, I would run OCR again with the detected language to get the best result from Tesseract.

But how can I implement a language detection algorithm? Is there a C++ library that I could use?

+4

c++ ocr nlp language-detection

Pedro Nov 16 '11 at 19:15

3 answers

I am not sure if this will help, as the library is in Java. But I found it really cool: it can detect about 50 languages from a given text with a fairly good level of accuracy. You might like to take a look at it, and since it is open source, you could port the code to C++ and contribute it back to the open source community if your application has to be written in C++ only.

Here is the link:

http://code.google.com/p/language-detection/

Note. For analysis, the Apache Nutch and Tika libraries are used.

+3

Abhishek jain Oct 9 '12 at 7:11

You might want to read my paper, "The WiLI benchmark dataset for written language identification", and try lidtk.

TL;DR: Try CLD-2.

0

Martin thoma Jan 25 '18 at 17:35

nguyenq · Accepted Answer · 2011-11-18T02:38:25+0000

This paper, "Natural Language Identification for OCR Applications," describes language identification methods for a task similar to yours.
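
For a rough idea of how the two-pass flow from the question could fit together, here is a minimal C++ sketch. It is not taken from the paper or from any of the libraries mentioned in this thread. The names `guessLanguage`, `ocrWithLanguageRetry`, `OcrFn` and the tiny stopword lists are all hypothetical, and the stopword counting is only a toy stand-in for a real detector such as CLD-2. The OCR step is passed in as a callback, so the sketch makes no assumptions about how Tesseract is actually invoked.

```cpp
#include <cctype>
#include <functional>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Toy stopword lists keyed by Tesseract-style language codes.
// Illustrative only; a real detector uses far richer models.
static const std::map<std::string, std::set<std::string>> kStopwords = {
    {"eng", {"the", "and", "of", "to", "is", "with", "invoice"}},
    {"deu", {"der", "die", "und", "das", "ist", "mit", "rechnung"}},
    {"fra", {"le", "la", "et", "les", "des", "est", "facture"}},
};

// Guess the language by counting stopword hits per language.
// Returns `fallback` when no stopword matches at all.
std::string guessLanguage(const std::string& text, const std::string& fallback) {
    std::map<std::string, int> hits;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        // Normalise the token: keep ASCII letters only, lowercased.
        std::string clean;
        for (unsigned char c : word) {
            if (std::isalpha(c)) clean += static_cast<char>(std::tolower(c));
        }
        for (const auto& entry : kStopwords) {
            if (entry.second.count(clean)) ++hits[entry.first];
        }
    }
    std::string best = fallback;
    int bestHits = 0;
    for (const auto& entry : hits) {
        if (entry.second > bestHits) {
            best = entry.first;
            bestHits = entry.second;
        }
    }
    return best;
}

// Hypothetical OCR callback: takes a language code, returns recognised text.
using OcrFn = std::function<std::string(const std::string& lang)>;

// First pass with the default language; second pass only if the
// detected language differs from the default.
std::string ocrWithLanguageRetry(const OcrFn& runOcr, const std::string& defaultLang) {
    std::string text = runOcr(defaultLang);
    const std::string detected = guessLanguage(text, defaultLang);
    if (detected != defaultLang) {
        text = runOcr(detected);
    }
    return text;
}
```

In a real application the callback would wrap your Tesseract call, and `guessLanguage` would be replaced by a proper language identification library. Short or noisy OCR output can easily fool a stopword count, so treat the detected language as a hint, not a certainty.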
