Language detection

I use Tesseract for OCR, mainly on invoices. However, Tesseract requires the language to be specified before it starts processing a file.

My plan is to run OCR with a predefined default language and then use the resulting text to check which language the document is actually written in. If it is not the default language, I would run OCR again with the detected language to get the best result from Tesseract.
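The two-pass flow described above could be sketched roughly as follows. This is only an outline under stated assumptions: `ocr_with_language_retry`, `OcrFn`, and `DetectFn` are hypothetical names introduced for illustration, and the OCR and detection steps are passed in as callbacks so the control flow stays independent of any particular library. In a real program the OCR callback would presumably wrap `tesseract::TessBaseAPI` (`Init`, `SetImage`, `GetUTF8Text`).

```cpp
#include <functional>
#include <string>

// Hypothetical callback types: the real implementations are supplied by the caller.
using OcrFn = std::function<std::string(const std::string& lang)>;
using DetectFn = std::function<std::string(const std::string& text)>;

// Run OCR once with the default language, detect the language of the result,
// and run OCR a second time only if a different language was detected.
std::string ocr_with_language_retry(const OcrFn& run_ocr,
                                    const DetectFn& detect_language,
                                    const std::string& default_lang) {
    std::string text = run_ocr(default_lang);
    const std::string detected = detect_language(text);

    // An empty result means "could not tell"; keep the first pass in that case.
    if (!detected.empty() && detected != default_lang) {
        text = run_ocr(detected);
    }
    return text;
}
```

The detector must return codes in the same form the OCR step expects (for Tesseract, codes such as `eng` or `deu`), otherwise the comparison with the default language will never match.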

But how can I implement a language detection algorithm? Is there a C++ library that I could use?
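For illustration only, one very simple way to approach the detection step is to count common stop words per language and pick the language with the most hits. The sketch below is a toy, not a replacement for a trained detector: the word lists are tiny hypothetical samples, `guess_language` and `kStopWords` are made-up names, and only ASCII letters are handled (any other byte is treated as a word separator).

```cpp
#include <cctype>
#include <map>
#include <set>
#include <string>

// Hypothetical, deliberately tiny stop-word lists keyed by Tesseract-style codes.
static const std::map<std::string, std::set<std::string>> kStopWords = {
    {"eng", {"the", "and", "of", "to", "is", "for", "with"}},
    {"deu", {"der", "die", "und", "das", "ist", "mit", "nicht"}},
    {"fra", {"le", "la", "et", "les", "des", "est", "pour"}},
};

// Return the code of the language whose stop words occur most often in `text`,
// or an empty string if no stop word from any list was found.
std::string guess_language(const std::string& text) {
    std::map<std::string, int> hits;
    std::string word;

    // Score the word collected so far against every list, then reset it.
    auto flush = [&]() {
        if (word.empty()) return;
        for (const auto& entry : kStopWords) {
            if (entry.second.count(word) > 0) ++hits[entry.first];
        }
        word.clear();
    };

    for (unsigned char c : text) {
        if (std::isalpha(c)) {
            word += static_cast<char>(std::tolower(c));
        } else {
            flush();  // any non-letter ends the current word
        }
    }
    flush();

    std::string best;
    int best_hits = 0;
    for (const auto& entry : hits) {
        if (entry.second > best_hits) {
            best = entry.first;
            best_hits = entry.second;
        }
    }
    return best;
}
```

Short or noisy OCR output may contain few or no stop words, so a caller should treat an empty result as "unknown" and keep the default language.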

+4
3 answers

The article "Natural Language Identification for OCR Applications" describes language identification methods for a task similar to yours.

+3

I am not sure whether this will help, since the library is written in Java. But I found it really impressive: it can detect about 50 languages from a given text with fairly good accuracy. You might like to take a look at it. Since it is open source, you could port the code to C++ and contribute it back to the open source community if your application has to be written in C++ only.

Here is the link:

http://code.google.com/p/language-detection/

Note: For analysis, the Apache Nutch and Tika libraries are used.

+3

You might want to read my paper "The WiLI benchmark dataset for written language identification" and try lidtk.

TL;DR: Give CLD-2 a try.

0

Source: https://habr.com/ru/post/1381553/

