Parsing data from a text file

I have a text file with this content:

    ******** ENTRY 01 ********
    ID: 01
    Data1: 0.1834869385E-002
    Data2: 10.9598489301
    Data3: -0.1091356549E+001
    Data4: 715

This is followed by an empty line and then more blocks like it, all with the same data fields.

I am porting C++ code to Python. One part of it reads the file line by line, detects the header text, and then detects each field to extract the data. That doesn't look like smart code, and I think Python should have some library for parsing data like this easily. After all, it's almost like a CSV!

Any idea for this?

3 answers

Actually it is very far from CSV.

You can use the file as an iterator; the following generator function yields complete sections:

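    def load_sections(filename):
        with open(filename, 'r') as infile:
            line = ''
            while True:
                # Skip ahead to the next '****' header line.
                while not line.startswith('****'):
                    # A default avoids StopIteration escaping the generator
                    # (that is a RuntimeError since Python 3.7).
                    line = next(infile, None)
                    if line is None:
                        return  # end of file ends the generator
                line = ''  # forget the header so it is not matched twice
                entry = {}
                for line in infile:
                    line = line.strip()
                    if not line:
                        break  # a blank line closes the section
                    key, value = map(str.strip, line.split(':', 1))
                    entry[key] = value
                yield entry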

This treats the file as an iterator, which means that every loop advances the file to the next line. The outer loop only serves to move from section to section; the inner while and for loops do all the real work. First, lines are skipped until a **** header line is found (the header itself is discarded), then all non-empty lines are looped over to build the section.

Use the function in a loop:
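A minimal sketch of that shared-iterator behaviour, using an in-memory list in place of an open file (the `lines` data is made up for illustration; a file object advances the same way):

```python
# Hypothetical stand-in for an open file: any iterator behaves the same way.
lines = iter(["junk", "**** header", "a: 1", "b: 2", "", "**** header", "c: 3"])

# The first loop consumes lines up to and including the header.
for line in lines:
    if line.startswith("****"):
        break

# The second loop resumes exactly where the first one stopped.
section = []
for line in lines:
    if not line:
        break  # the blank line is consumed too
    section.append(line)

print(section)  # ['a: 1', 'b: 2']

# Whatever has not been consumed yet is still waiting in the iterator.
remaining = list(lines)
```

Because both loops pull from the same iterator, no line is ever seen twice and no index bookkeeping is needed.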

    for section in load_sections(filename):
        print(section)

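Repeating your sample data in a text file results in the following (this key order comes from an older Python; from Python 3.7 on, keys print in file order):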

    >>> for section in load_sections('/tmp/test.txt'):
    ...     print(section)
    ...
    {'Data4': '715', 'Data1': '0.1834869385E-002', 'ID': '01', 'Data3': '-0.1091356549E+001', 'Data2': '10.9598489301'}
    {'Data4': '715', 'Data1': '0.1834869385E-002', 'ID': '01', 'Data3': '-0.1091356549E+001', 'Data2': '10.9598489301'}
    {'Data4': '715', 'Data1': '0.1834869385E-002', 'ID': '01', 'Data3': '-0.1091356549E+001', 'Data2': '10.9598489301'}

You can add data converters if you want; a mapping from each key to a callable would be:

 converters = {'ID': int, 'Data1': float, 'Data2': float, 'Data3': float, 'Data4': int} 

Then, in the generator function, instead of entry[key] = value, do entry[key] = converters.get(key, lambda v: v)(value).
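Putting the two pieces together, a sketch of the converter variant might look like this. The name `parse_sections`, its `converters` parameter, and the choice to accept any iterable of lines rather than a filename are assumptions made for illustration, not part of the answer above:

```python
def parse_sections(lines, converters=None):
    """Yield one dict per '****' section from an iterable of lines."""
    converters = converters or {}
    lines = iter(lines)
    for line in lines:
        if not line.startswith('****'):
            continue  # skip anything before the next header
        entry = {}
        for line in lines:  # same iterator, so this resumes after the header
            line = line.strip()
            if not line:
                break  # a blank line closes the section
            key, value = map(str.strip, line.split(':', 1))
            # Unknown keys fall back to the identity function.
            entry[key] = converters.get(key, lambda v: v)(value)
        yield entry


sample = """******** ENTRY 01 ********
ID: 01
Data1: 0.1834869385E-002
Data2: 10.9598489301
Data3: -0.1091356549E+001
Data4: 715
""".splitlines()

converters = {'ID': int, 'Data1': float, 'Data2': float,
              'Data3': float, 'Data4': int}
sections = list(parse_sections(sample, converters))
print(sections[0])
```

An open file can be passed in directly, since a file object is itself an iterable of lines. Note that `int` turns the ID '01' into 1; keep it as a string if the leading zero matters.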


my_file:

    ******** ENTRY 01 ********
    ID: 01
    Data1: 0.1834869385E-002
    Data2: 10.9598489301
    Data3: -0.1091356549E+001
    Data4: 715
    ID: 02
    Data1: 0.18348674325E-012
    Data2: 10.9598489301
    Data3: 0.0
    Data4: 5748
    ID: 03
    Data1: 20.1834869385E-002
    Data2: 10.954576354
    Data3: 10.13476858762435E+001
    Data4: 7456

Python script:

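    import re

    with open('my_file', 'r') as f:
        data = list()
        group = dict()
        # Each match is a (key, numeric value) pair; the hyphen goes last
        # in the character class so it is a literal '-', not a range.
        for key, value in re.findall(r'(.*):\s*([\dE+.-]+)', f.read()):
            if key in group:
                # A repeated key means a new group has started.
                data.append(group)
                group = dict()
            group[key] = value
        data.append(group)  # store the final group

    print(data)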

Printed Output:

    [
        {
            'Data4': '715',
            'Data1': '0.1834869385E-002',
            'ID': '01',
            'Data3': '-0.1091356549E+001',
            'Data2': '10.9598489301'
        },
        {
            'Data4': '5748',
            'Data1': '0.18348674325E-012',
            'ID': '02',
            'Data3': '0.0',
            'Data2': '10.9598489301'
        },
        {
            'Data4': '7456',
            'Data1': '20.1834869385E-002',
            'ID': '03',
            'Data3': '10.13476858762435E+001',
            'Data2': '10.954576354'
        }
    ]
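One detail of the value pattern is worth a quick check: inside a character class, a hyphen between two characters denotes a range, so `+-.` covers every character from `+` to `.`, which includes the comma. Placing the hyphen last makes it a literal. A small sketch comparing the two spellings (the variable names are only illustrative):

```python
import re

# '+-.' inside [...] is a range from '+' (0x2B) to '.' (0x2E),
# which also covers ',' (0x2C) and '-' (0x2D).
range_class = re.compile(r'[\dE+-.]+')
# With the hyphen last it is a literal '-', so ',' is no longer accepted.
literal_class = re.compile(r'[\dE+.-]+')

value = '-0.1091356549E+001'
both_match = bool(range_class.fullmatch(value)) and bool(literal_class.fullmatch(value))
comma_in_range = bool(range_class.fullmatch('1,5'))
comma_in_literal = bool(literal_class.fullmatch('1,5'))

print(both_match, comma_in_range, comma_in_literal)  # True True False
```

Both spellings accept the numbers in the sample file; they differ only on input that contains commas.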

A very simple approach could be:

    all_objects = []
    with open("datafile") as f:
        for L in f:
            if L[:3] == "***":
                # Line starts with asterisks, create a new object
                all_objects.append({})
            elif ":" in L:
                # Line is a key/value field, update current object
                k, v = map(str.strip, L.split(":", 1))
                all_objects[-1][k] = v

Source: https://habr.com/ru/post/1486204/

