Background reading for analyzing sloppy / bizarre / "almost structured" data?

I support a program that needs to parse data that appears in text in an "almost structured" form. That is, the various programs that produce it use several different formats, some of it may have been printed and then OCR'd back in (yes, I know) with errors, and so on. So I have to use heuristics that guess how a given input was produced and then apply various quirks modes, etc. This is frustrating because I'm somewhat familiar with the theory and practice of parsing when everything is well-behaved and there are nice grammars and so on, but the unreliability of the data has led me to write very sloppy ad-hoc code. Things are fine at the moment, but I'm worried that extending it to handle more variations and more complex data will get out of hand. So my question is:
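Just to make the kind of hack I mean concrete, here is a minimal sketch (in Python, with made-up field names and formats that are not from my real data) of the "guess how it was produced, then apply a quirks mode" dispatch I'm describing:

```python
import re
from typing import Callable, Optional

# Hypothetical record type: just a dict of field name -> raw string value.
Record = dict

def parse_tab_separated(line: str) -> Optional[Record]:
    """Strict parser for the well-behaved tab-separated variant."""
    parts = line.split("\t")
    if len(parts) == 3:
        return {"id": parts[0], "name": parts[1], "amount": parts[2]}
    return None

def parse_ocr_quirks(line: str) -> Optional[Record]:
    """Looser 'quirks mode' for OCR'd input: split on runs of spaces and
    tolerate common OCR confusions such as 'O' for '0' in the id field."""
    parts = re.split(r"\s{2,}", line.strip())
    if len(parts) == 3:
        ident = parts[0].replace("O", "0").replace("l", "1")
        return {"id": ident, "name": parts[1], "amount": parts[2]}
    return None

# Try the strictest parser first, then fall back to progressively sloppier
# ones; remember which mode matched so the guess can be audited later.
PARSERS: list[tuple[str, Callable[[str], Optional[Record]]]] = [
    ("tsv", parse_tab_separated),
    ("ocr", parse_ocr_quirks),
]

def parse_line(line: str) -> tuple[str, Optional[Record]]:
    for mode, parser in PARSERS:
        record = parser(line)
        if record is not None:
            return mode, record
    return "unparsed", None
```

It works, but every new input variant means another hand-written fallback parser and another heuristic for ordering them, which is exactly the part I worry won't scale.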

Since there are many existing commercial products that do related things (quirks modes in web browsers, error recovery in compilers, even natural language processing and data mining, etc.), I'm sure some smart people have put thought into this and tried to develop a theory. What are the best sources of background reading on parsing unprincipled data as well as possible?

I realize this is somewhat open-ended, but my problem is that I think I need more background even to know what the right questions to ask are.

+3
1 answer

[The body of the answer did not survive extraction; only fragments remain. It appears to have discussed handling OCR errors and "Parsing Frameworks" for this kind of messy input.]
+1

Source: https://habr.com/ru/post/1716711/

