I am trying to parse thousands of text files containing information about the company, materials, chemical properties, etc. (e.g. material safety data sheets) with Python. The files contain similar information in loosely structured formatting: they are easy for a person to read, but not structured enough to be easy to parse (for example, they are not XML or CSV). In short, the formatting is all over the place.
The data was originally entered manually by different people working at different companies. Another set of people then transcribed the information into these text files (OCR to text file).
Is there a parsing library or set of templates for extracting pieces of this kind of information? (This seems to be a "common" data entry problem.) Of course, regular expressions will be used a lot. I have no experience with natural language processing libraries. Would they be suitable for this problem?
My initial thought was to group the files by format, and then write a set of parsing functions for each format. Unfortunately, this would only work for a small subset of the files, and the number of special cases could quickly get out of hand.
To make the question concrete, here are a few examples illustrating the problem.
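To show what I mean by sorting files by layout and handing each one to its own parser, here is a minimal sketch. The format names, the detection patterns, and the `parse_*` helpers are all hypothetical placeholders I made up for illustration, not an existing library.

```python
import re

# Hypothetical format signatures: a name plus a regex that identifies
# one layout. These patterns are illustrative assumptions only.
FORMAT_SIGNATURES = [
    ("labeled", re.compile(r"^\s*MANUFACTURER(?: NAME)?\s*:", re.I | re.M)),
    ("sectioned", re.compile(r"^\s*SECTION\s+[IVX\d]+", re.I | re.M)),
]


def detect_format(text):
    """Return the name of the first signature that matches, else 'unknown'."""
    for name, pattern in FORMAT_SIGNATURES:
        if pattern.search(text):
            return name
    return "unknown"


def parse_labeled(text):
    """Placeholder parser: pull the manufacturer name from a labeled line."""
    match = re.search(r"MANUFACTURER(?: NAME)?\s*:\s*(.+)", text, re.I)
    return {"manufacturer": match.group(1).strip()} if match else {}


def parse_unknown(text):
    """Fallback for layouts that no parser handles yet."""
    return {}


# One parsing function per detected format; formats without an entry
# fall back to parse_unknown.
PARSERS = {"labeled": parse_labeled, "unknown": parse_unknown}


def parse_file(text):
    fmt = detect_format(text)
    parser = PARSERS.get(fmt, parse_unknown)
    return fmt, parser(text)
```

Every new layout needs both a signature and a parser, which is exactly how the number of cases gets out of hand.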
ADDRESS INFORMATION
Each file contains company information, such as name and address. The information may or may not have a field label, it may or may not be on the same line, etc. In short, it seems like every combination occurs.
Example (with field information):
MANUFACTURER: Foo Bar Inc. ADDRESS: 123 Foo St. Bar, CA 90012
Example (without field information):
Foo Bar Inc. 123 Foo St. Bar, CA 90012
Example (sometimes with extra lines between pieces of information):
FOO BAR INC. 123 FOO ST. BAR, CA 90012
Example (inconsistent field names):
MANUFACTURER NAME: FOO BAR INC. CREATIVE DIVISION ADDRESS: 123 FOO ST. CITY, STATE & ZIP: BAR, CALIFORNIA 90012 PHONE NUMBER: 310-111-2222
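For the labeled variants above, the kind of thing I have in mind is a table of known label spellings mapped to canonical field names, with the text split at each label. This is only a sketch: the `LABELS` table and the canonical names are my own assumptions, and it does nothing for the unlabeled case.

```python
import re

# Assumed label variants mapped to canonical field names; the real list
# would have to be collected from the files themselves.
LABELS = {
    "MANUFACTURER NAME": "manufacturer",
    "MANUFACTURER": "manufacturer",
    "CITY, STATE & ZIP": "city_state_zip",
    "ADDRESS": "address",
    "PHONE NUMBER": "phone",
}

# Longest labels first, so "MANUFACTURER NAME" wins over "MANUFACTURER".
_LABEL_RE = re.compile(
    "("
    + "|".join(re.escape(label) for label in sorted(LABELS, key=len, reverse=True))
    + r")\s*:",
    re.IGNORECASE,
)


def extract_fields(text):
    """Split text at known labels and return {canonical_name: value}."""
    matches = list(_LABEL_RE.finditer(text))
    fields = {}
    for i, match in enumerate(matches):
        # A value runs from the end of its label to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        value = " ".join(text[match.end():end].split())  # collapse whitespace
        fields[LABELS[match.group(1).upper()]] = value
    return fields
```

Because values are taken between consecutive labels, it does not matter whether the fields sit on one line or several. Text without any known label yields an empty dict, so those files would still need some other approach.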
SECTION INFORMATION
The data sheets also have similar sections, but with inconsistent ordering, headers, numbering styles, and delimiters.
Example:
Example:
Section I. Materials ------------------------------------------
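For headers like this one, a tolerant regular expression is the obvious first attempt. The sketch below assumes that a header starts with the word "Section", followed by a Roman or Arabic number, an optional delimiter, and a title that may be trailed by a run of dashes or equals signs. The pattern and the `parse_section_header` name are my own guesses and certainly do not cover every variant.

```python
import re

# Assumed header shape: "Section", a Roman or Arabic number, an optional
# delimiter (. : -), a title, then optional trailing ---- or ==== filler.
SECTION_RE = re.compile(
    r"^\s*SECTION\s+(?P<num>[IVXLC]+|\d+)\s*[.:\-]?\s*(?P<title>.*?)\s*[-=]*\s*$",
    re.IGNORECASE,
)


def parse_section_header(line):
    """Return (number, title) if the line looks like a section header, else None."""
    match = SECTION_RE.match(line)
    if not match:
        return None
    return match.group("num").upper(), match.group("title").strip()
```

Lines that do not start with "Section" return None, so this can be run over every line of a file to find where sections begin.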
Example:
And sometimes the width of a file was changed, so long lines wrap onto the next line.
Example:
becomes:
Here is a complete example:
I hope this clarifies the problems with parsing these files. You will notice the wrapping of long lines, the splitting of information across different lines, etc. Not every file has this exact structure; some are formatted differently, with information in different places. Here is a link to a paper hard copy.
MATERIAL SAFETY DATA SHEET
=================================================================
=========
SECTION I-PRODUCT AND PREPARATION INFORMATION
=================================================================
=========
MANUFACTURER: Some Company Inc EMERGENCY AND INFORMATION TELEPHONE (111)222-3333 ADDRESS: Some Road City, ST 12346 IDENTITY (AS USED ON LABEL AND LIST): Some Identity PREPARATION DATE: Some Date
=================================================================
=========
SECTION II-HAZARDOUS INGREDIENTS/IDENTITY INFORMATION
=================================================================
=========
OSHA ACGIH HAZARDOUS COMPONENTS CAS