Python text analysis: unstructured but similar information with various formatting

I am trying to parse thousands of text files containing information about companies, materials, chemical properties, etc. (e.g. material safety data sheets) with Python. The files contain similar information in a loosely structured layout that makes them easy for a person to read, but they are not structured in a way that makes them easy to parse (for example, they are not XML or CSV). In short, the formatting is all over the place.

The data was originally entered manually by different people working at different companies. Another set of people then transcribed the information into these text files (OCR to text file).

Is there a parsing library or templates for extracting bits of this type of information? (This seems to be a "common" data entry problem.) Of course, regular expressions will be used a lot. I have no experience with natural language processing libraries. Will they be suitable for the problem?

My initial thought was to group the files by format and then write a set of parsing functions for each format. Unfortunately, that may only work for a small subset of the problem, and the number of cases can quickly get out of hand.

To make the question concrete, here are a few examples illustrating the problem.

ADDRESS INFORMATION
Each file contains company information, such as the name and address. A piece of information may or may not have a label, may or may not be on the same line as its label, and so on. In short, it seems like every combination occurs.

Example (with field information):

    MANUFACTURER: Foo Bar Inc.
    ADDRESS: 123 Foo St.
             Bar, CA 90012

Example (without field information):

    Foo Bar Inc.
    123 Foo St.
    Bar, CA 90012

Example (sometimes with extra lines between pieces of information):

    FOO BAR INC.

    123 FOO ST.

    BAR, CA 90012

Example (inconsistent field names):

    MANUFACTURER NAME: FOO BAR INC. CREATIVE DIVISION
    ADDRESS: 123 FOO ST.
    CITY, STATE & ZIP: BAR, CALIFORNIA 90012
    PHONE NUMBER: 310-111-2222
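One way to cope with the label variants above is to map every spelling of a label to a canonical key and to recognise the "city, state ZIP" line by its shape. The following is only a minimal sketch: the alias table, the key names, and the helper names `split_label` and `parse_city_state_zip` are assumptions made for illustration, and the pattern assumes the city line sits on a line of its own.

```python
import re

# Hypothetical alias table: label variants from the examples -> one canonical key.
LABEL_ALIASES = {
    "manufacturer": "manufacturer",
    "manufacturer name": "manufacturer",
    "address": "address",
    "city, state & zip": "city_state_zip",
    "phone number": "phone",
}

# Matches "Bar, CA 90012" or "BAR, CALIFORNIA 90012" (5-digit ZIP, optional +4).
CITY_STATE_ZIP = re.compile(
    r"(?P<city>[A-Za-z .'-]+),\s*(?P<state>[A-Za-z ]+?)\s+(?P<zip>\d{5}(?:-\d{4})?)\s*$"
)


def split_label(line):
    """Return (canonical_label, value); the label is None if none is recognised."""
    head, sep, tail = line.partition(":")
    key = LABEL_ALIASES.get(" ".join(head.lower().split()))
    if sep and key:
        return key, tail.strip()
    return None, line.strip()


def parse_city_state_zip(text):
    """Return a dict with city, state and zip, or None if the line has another shape."""
    match = CITY_STATE_ZIP.search(text)
    return match.groupdict() if match else None


print(split_label("MANUFACTURER NAME: FOO BAR INC."))
print(parse_city_state_zip("BAR, CALIFORNIA 90012"))
```

New label spellings then only require a new entry in the alias table rather than a new parsing function.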

SECTION INFORMATION
The data sheets also have similar sections, but with inconsistent ordering, headers, numbering styles, and delimiters.

Example:

    ========================================
    SECTION 1 -- MATERIALS
    ========================================

Example:

    Section I. Materials
    ------------------------------------------

Example:

    -----
    Section 3 Materials
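The header variants shown so far can probably be covered by a single tolerant regular expression. This is a sketch rather than a complete solution: the pattern, the group names, and the function `parse_section_header` are assumptions, and it only handles the keyword "Section" followed by an Arabic or Roman numeral, or a bare "1." style number.

```python
import re

SECTION_HEADER = re.compile(
    r"""
    ^[\s=*_-]*                                 # leading blanks or rule characters
    (?:
        section\s+(?P<keyed>\d+|[ivxlc]+)\b    # "Section 1", "SECTION IV"
        \s*(?:--|[-.:])?                       # optional delimiter after the number
      |
        (?P<bare>\d+)\s*[.)]                   # bare "1." or "1)" without the keyword
    )
    \s*(?P<title>[a-z].*?)                     # title text
    [\s=*_-]*$                                 # trailing blanks or rule characters
    """,
    re.IGNORECASE | re.VERBOSE,
)

ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100}


def roman_to_int(text):
    """Convert a Roman numeral such as 'VIII' to an integer."""
    values = [ROMAN[ch] for ch in text.lower()]
    total = 0
    for i, value in enumerate(values):
        # A smaller value before a larger one is subtracted (IV = 4).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total


def parse_section_header(line):
    """Return (section_number, TITLE) or None if the line is not a header."""
    match = SECTION_HEADER.match(line)
    if not match:
        return None
    raw = match.group("keyed") or match.group("bare")
    number = int(raw) if raw.isdigit() else roman_to_int(raw)
    return number, match.group("title").upper()


for sample in ["SECTION 1 -- MATERIALS", "Section I. Materials", "Section 3 Materials"]:
    print(parse_section_header(sample))
```

Normalising the number and title this way gives each section a stable key regardless of how the header was typed.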

Sometimes the width of a file was changed, so long lines wrap onto the next line.

Example:

    ===================================================
    1. Materials
    ===================================================

becomes:

    =========================================
    ==========
    1. Materials
    =========================================
    ==========
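Wrapped separator lines like these could be rejoined in a preprocessing pass before any header detection. The sketch below assumes a rule line consists of at least four identical characters from `=`, `-`, `_` or `*`; the function name `merge_wrapped_rules` and that threshold are illustrative choices. Note that it would also merge two rules that were deliberately placed on consecutive lines.

```python
import re

# A line made only of 4 or more identical rule characters (assumed threshold).
RULE_LINE = re.compile(r"^\s*([=\-_*])\1{3,}\s*$")


def merge_wrapped_rules(lines):
    """Collapse consecutive rule lines that use the same character into one line."""
    merged = []
    for line in lines:
        current = RULE_LINE.match(line)
        previous = RULE_LINE.match(merged[-1]) if merged else None
        if current and previous and current.group(1) == previous.group(1):
            # Continuation of a wrapped rule: glue it onto the previous line.
            merged[-1] = merged[-1].rstrip() + line.strip()
        else:
            merged.append(line.rstrip("\n"))
    return merged


wrapped = ["=" * 41, "=" * 10, "1. Materials", "=" * 41, "=" * 10]
print(merge_wrapped_rules(wrapped))
```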

I hope this clarifies the problems with parsing the files. You will notice lines that wrap, information split across different lines, and so on. Not every file has this exact structure; some are formatted differently, with information in different places. Here is a link to a paper hard copy.

Here is a complete example:

    MATERIAL SAFETY DATA SHEET
    =================================================================
    =========
    SECTION I-PRODUCT AND PREPARATION INFORMATION
    =================================================================
    =========
    MANUFACTURER: Some Company Inc
    EMERGENCY AND INFORMATION TELEPHONE (111)222-3333
    ADDRESS: Some Road
    City, ST 12346
    IDENTITY (AS USED ON LABEL AND LIST): Some Identity
    PREPARATION DATE: Some Date
    =================================================================
    =========
    SECTION II-HAZARDOUS INGREDIENTS/IDENTITY INFORMATION
    =================================================================
    =========
    OSHA ACGIH
    HAZARDOUS COMPONENTS CAS# PEL TWA TLV %
    (SPECIFIC CHEMICAL IDENTITY; COMMON NAME(S)
    -----------------------------------------------------------------
    ---------
    Some Chemical 111-22-3 15 10 10 12.34
    =================================================================
    =========
    SECTION III-PHYSICAL/CHEMICAL CHARACTERISTICS
    =================================================================
    =========
    Boiling Point: N/A
    Specific Gravity (H20=1): N/A
    Vapor Pressure (mm Hg): N/A
    Melting Point: N/A
    Vapor Density (AIR=1) N/A
    Evaporation Rate (Butyl Acetate=1) N/A
    Solubility in Water: None
    Appearance: Solid, various colors, may have slight odor.
    N/A = Not applicable
    =================================================================
    =========
    SECTION IV-FIRE AND EXPLOSION HAZARD DATA
    =================================================================
    =========
    FLASH POINT (METHOD USED): None
    FLAMMABLE LIMITS: None LEL: N/A UEL: N/A
    EXTINGUISHING MEDIA: None
    SPECIAL FIRE FIGHTING PROCEDURES: None required.
    UNUSUAL FIRE AND EXPLOSION HAZARDS: None.
    =================================================================
    =========
    SECTION V-REACTIVITY DATA
    =================================================================
    =========
    STABILITY: Stable
    CONDITIONS TO AVOID: None
    INCOMPATIBILITY (MATERIALS TO AVOID): None
    HAZARDOUS POLYMERIZATION: Will not occur
    =================================================================
    =========
    SECTION VI-HEALTH HAZARD DATA
    =================================================================
    =========
    ROUTES OF ENTRY: INHALATION: Yes SKIN: Possibly INGESTION: Possibly EYES: Possibly
    HEALTH HAZARDS (ACUTE AND CHRONIC): Pneumoconiosis, silicosis, emphysema, nose and throat irritation, eye irritation, skin irritation in some.
    CARCINOGENICITY: No applicable information found.
    SIGNS AND SYMPTOMS OF EXPOSURE: Coughing, sneezing; irritation of the mucous membranes; eye irritation; skin irritation or rash, dry throat.
    MEDICAL CONDITIONS GENERALLY AGGRAVATED BY EXPOSURE: Nasal, bronchial or pulmonary conditions which tend to restrict breathing, skin abrasions.
    EMERGENCY AND FIRST AID PROCEDURES: Remove to fresh air, irrigate eyes, wash with soap and water, contact physician if necessary.
    =================================================================
    =========
    SECTION VII-PRECAUTIONS FOR SAFE HANDLING AND USE
    =================================================================
    =========
    STEPS TO BE TAKEN IN CASE MATERIAL IS RELEASED OR SPILLED: Normal clean-up procedures.
    WASTE DISPOSAL METHOD: Standard landfill methods consistent with applicable state and federal regulations.
    PRECAUTIONS TO BE TAKEN IN HANDLING AND STORING: Use caution not to drop, crush, break or chip.
    OTHER PRECAUTIONS: Do not use at speeds greater than the not-to-exceed speed printed on the hub assembly.
    =================================================================
    =========
    SECTION VIII-CONTROL MEASURES
    =================================================================
    =========
    RESPIRATORY PROTECTION (SPECIFY TYPE): OSHA or NIOSH approved respirators may be required.
    VENTILATION: Local exhaust recommended. Special: N/A. Mechanical: Useful. Other: N/A.
    PROTECTIVE GLOVES: May be useful.
    EYE PROTECTION: Recommended.
    OTHER PROTECTIVE CLOTHING OR EQUIPMENT: Not required.
    WORK/HYGIENIC PRACTICES: Keep clothing and area clean. Wash to remove
2 answers

I would write a for loop with several state variables, process the file line by line, and use the state variables to keep track of what is happening. The conditionals (if statements) inside the for loop would ask the same "questions" that a person would ask when parsing the file by hand.

    for line in file:
        if ":" in line:
            # Labelled line: the text before the first colon names the field.
            label, _, data = line.partition(":")
            field_name = normalize(label)  # placeholder helper
        else:
            # Unlabelled line: assume it is the field after the previous one.
            field_name = next_field_in_list(previous_field)  # placeholder helper
            data = line
        previous_field = field_name

And so on. I could not tell from the examples whether the fields at least come in a fixed order, and whether there is either a maximum number of fields per record or a distinct record separator. Without those, I think this will be harder to write.
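To make the loop idea concrete, here is a small runnable sketch. It assumes the unlabelled fields do come in a fixed order; `FIELD_ORDER`, the bodies of `normalize` and `next_field_in_list`, the `"unparsed"` bucket, and `parse_record` are all hypothetical and would need adapting to the real files.

```python
# Hypothetical field order, used only when a line carries no label.
FIELD_ORDER = ["manufacturer", "address", "city_state_zip"]


def normalize(label):
    """Lower-case a label and collapse runs of whitespace."""
    return " ".join(label.lower().split())


def next_field_in_list(previous_field):
    """Return the field that follows previous_field, or 'unparsed' if unknown."""
    if previous_field is None:
        return FIELD_ORDER[0]
    try:
        return FIELD_ORDER[FIELD_ORDER.index(previous_field) + 1]
    except (ValueError, IndexError):
        # Unknown previous field, or no fields left in the assumed order.
        return "unparsed"


def parse_record(lines):
    """Collect the values of one record into a dict of lists, keyed by field name."""
    record = {}
    previous_field = None  # state carried from one line to the next
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines carry no data
        if ":" in line:
            label, _, data = line.partition(":")
            field_name = normalize(label)
        else:
            field_name = next_field_in_list(previous_field)
            data = line
        record.setdefault(field_name, []).append(data.strip())
        previous_field = field_name
    return record


print(parse_record(["MANUFACTURER: Foo Bar Inc.", "ADDRESS: 123 Foo St.", "Bar, CA 90012"]))
print(parse_record(["Foo Bar Inc.", "", "123 Foo St.", "Bar, CA 90012"]))
```

Lines that fall outside the assumed order end up under `"unparsed"`, which makes it easy to see which files need another rule.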


It is also true that an MSDS has a fixed number of sections, so the section number could be used as an index ... My real goal is to analyze the information collected from 200 different MSDSs.


Source: https://habr.com/ru/post/1347399/

