Which library should I use to extract text from images?

Question

I am writing a program that, when given an image of a simple math problem (for example, 98 * 13), should be able to output the answer. The numbers will be black on a white background. This is not a captcha, just an image of a math problem.

The math problems will have only two numbers and one operator, and the operator will only be +, -, * or /.

Obviously, I know how to do the calculation ;) I'm just not sure how to get the text out of the image.

A free library would be ideal ... although if I needed to write the code myself, I could probably handle it.

c# ocr

Entity Feb 28 '11 at 19:26

5 answers

Taylor bird · Answer 1 · 2011-02-28T19:30:56+0000

Try this post about using the C++ Google Tesseract OCR library from C#:

Tesseract OCR

Lou franco · Answer 2 · 2011-02-28T19:33:04+0000

You need an OCR library. Google has a free one called Tesseract, but it's C code. You can use it in a C++/CLI project and access it through .NET.

This article has some information on recognizing numbers (it is about sudoku, but your problem is similar):

http://sudokugrab.blogspot.com/2009/07/how-does-it-all-work.html

Loïc sombart · Answer 3 · 2017-02-28T10:50:21+0000

To extract words from an image, I use Tesseract, which I consider the most accurate open source OCR engine. It is available here or directly as a NuGet package.

This is my C# code, which extracts the words from the image at sourceFilePath. Set EngineMode to TesseractAndCube; it detects more words than the other modes.

    using System.IO;
    using Tesseract;

    // Folder that contains the "tessdata" directory.
    var path = "YourSolutionDirectoryPath";

    // "fra" loads the French language data; use the code that matches your tessdata files.
    using (var engine = new TesseractEngine(path + Path.DirectorySeparatorChar + "tessdata", "fra", EngineMode.TesseractAndCube))
    using (var img = Pix.LoadFromFile(sourceFilePath))
    using (var page = engine.Process(img))
    {
        var text = page.GetText();
        // text variable contains a string with all words found
    }
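Once the OCR step has produced a string, the problem described in the question (two numbers and one of +, -, * or /) still has to be parsed and evaluated. The following is only a minimal sketch of that step. It assumes the recognized text looks like "98 * 13", possibly with extra whitespace or a trailing newline; the class name OcrMath and the method Evaluate are hypothetical names chosen for illustration and are not part of any OCR library.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class OcrMath
{
    // Matches "<integer> <operator> <integer>", tolerating stray whitespace
    // (including a trailing newline) around the tokens.
    private static readonly Regex Expression =
        new Regex(@"^\s*([0-9]+)\s*([+\-*/])\s*([0-9]+)\s*$");

    // Returns the value of the recognized expression, or null when the text
    // does not have the expected shape or the divisor is zero.
    public static double? Evaluate(string ocrText)
    {
        var match = Expression.Match(ocrText ?? string.Empty);
        if (!match.Success)
        {
            return null;
        }

        double left = double.Parse(match.Groups[1].Value, CultureInfo.InvariantCulture);
        double right = double.Parse(match.Groups[3].Value, CultureInfo.InvariantCulture);

        switch (match.Groups[2].Value)
        {
            case "+": return left + right;
            case "-": return left - right;
            case "*": return left * right;
            case "/": return right == 0 ? (double?)null : left / right;
            default: return null;
        }
    }
}
```

For example, OcrMath.Evaluate("98 * 13") returns 1274, while text that does not match the pattern (such as a misread operator) returns null, so the caller can decide whether to retry the OCR step.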

I hope this helps.

user6736260 · Answer 4 · 2016-08-19T19:08:10+0000

You can use Microsoft Office Document Imaging (Interop.MODI.dll) from Visual Studio to extract the text from images:

    // Requires a reference to Interop.MODI.dll.
    Document modiDocument = new Document();
    modiDocument.Create(filePath);
    modiDocument.OCR(MiLANGUAGES.miLANG_ENGLISH);
    MODI.Image modiImage = (modiDocument.Images[0] as MODI.Image);
    string extractedText = modiImage.Layout.Text;
    modiDocument.Close();
    return extractedText;

Tienkamp · Answer 5 · 2017-02-28T11:16:44+0000

Here are two useful example projects for C#:

Using Tesseract : A free, open source OCR application for the Windows desktop - A modern GUI interface for the Tesseract OCR engine. The application also includes support for reading and OCR'ing PDF files: https://github.com/A9T9/Free-Ocr-Windows-Desktop
Using Microsoft OCR : The free, open source OCR application for the Windows Store, a modern GUI for the Microsoft OCR library. The application also includes support for reading and OCR'ing PDF files: https://github.com/A9T9/Free-OCR-Software
