Parsing a text file in Python (pyparsing)

For reasons I really don't understand, the REST API I use does not return JSON or XML but a peculiar structured text format. In its simplest form:

 SECTION_NAME  entry   other qualifying bits of the entry
               entry2  other qualifying bits
               ...

The fields are not tab-delimited, as the layout might suggest, but space-delimited, and the qualifying bits may contain words with spaces. The gap between SECTION_NAME and the entries also varies from one to several (6 or more) spaces.

In addition, one part of the format contains entries in the form

 SECTION_NAME  entry
   SUB_SECTION   more information
   SUB_SECTION2  more information

For reference, an extract of real data (some sections are omitted), which shows the use of the structure:

 ENTRY       hsa04064                    Pathway
 NAME        NF-kappa B signaling pathway - Homo sapiens (human)
 DRUG        D09347  Fostamatinib (USAN)
             D09348  Fostamatinib disodium (USAN)
             D09692  Veliparib (USAN/INN)
             D09730  Olaparib (JAN/INN)
             D09913  Iniparib (USAN/INN)
 REFERENCE   PMID:21772278
   AUTHORS   Oeckinghaus A, Hayden MS, Ghosh S
   TITLE     Crosstalk in NF-kappaB signaling pathways.
   JOURNAL   Nat Immunol 12:695-708 (2011)

I am trying to parse this strange format into something more robust (a dictionary that can then be converted to JSON), but I'm not sure how to go about it: splitting blindly on whitespace makes a mess (it also breaks up values that contain spaces), and I'm not sure how to tell where a section starts. Is plain text manipulation enough for the job, or should I use more sophisticated methods?

EDIT:

I started using pyparsing for the job, but multi-line entries are tripping me up. Here is an example with DRUG:

  from pyparsing import *

  punctuation = ",.'`&-"
  special_chars = "\()[]"

  drug = Keyword("DRUG")
  drug_content = (Word(alphanums)
                  + originalTextFor(OneOrMore(Word(alphanums + special_chars)))
                  + ZeroOrMore(LineEnd()))
  drug_lines = OneOrMore(drug_content)
  drug_parser = drug + drug_lines

When I apply the DRUG parser to the first 3 DRUG lines of the example, I get the wrong result (\n shown as actual whitespace for easier reading):

  ['DRUG', ['D09347', 'Fostamatinib (USAN) D09348 Fostamatinib disodium (USAN) D09692 Veliparib (USAN']] 

As you can see, the subsequent entries are merged into the first one, whereas I expected:

  ['DRUG', [['D09347', 'Fostamatinib (USAN)'], ["D09348", "Fostamatinib disodium (USAN)"], ['D09692', ' Veliparib (USAN)']]] 
2 answers

I would recommend a parser-based approach. For example, PLY (Python Lex-Yacc) can be used for this task.


A better approach is to use regular expressions, for example:

 import re

 # Compile the pattern once, then match each line against it.
 entry_re = re.compile(r'^ENTRY\s+(.*)$')
 m = entry_re.search(line)
 if m:
     entry = m.group(1).strip()

For lines without a section name, use the last section name you found.
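The rule about reusing the last section name can be sketched as below. This is only a minimal sketch under stated assumptions: a section name is taken to be an uppercase word at the very start of a line, and any line that begins with whitespace is treated as a continuation of the previous section (indented sub-sections are not handled separately). `SECTION_RE` and `parse_sections` are hypothetical names chosen for illustration.

```python
import re

# Assumption: a section name is an uppercase word in the first column,
# followed by one or more spaces and the first value of the section.
SECTION_RE = re.compile(r'^([A-Z_]+)\s+(.*)$')


def parse_sections(text):
    """Group values by section; continuation lines go to the last section seen."""
    sections = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        m = SECTION_RE.match(line)
        if m:
            current = m.group(1)
            sections.setdefault(current, []).append(m.group(2).strip())
        elif current is not None:
            # No section name on this line: it belongs to the last one found.
            sections[current].append(line.strip())
    return sections


sample = (
    "ENTRY       hsa04064                    Pathway\n"
    "NAME        NF-kappa B signaling pathway - Homo sapiens (human)\n"
    "DRUG        D09347  Fostamatinib (USAN)\n"
    "            D09348  Fostamatinib disodium (USAN)\n"
    "            D09692  Veliparib (USAN/INN)\n"
)
result = parse_sections(sample)
```

The resulting dictionary maps each section name to a list of its raw values, so it can be passed to `json.dumps` directly.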

A simpler approach is to split the line on the section name, for example:

 vals = line.split('DRUG')
 if len(vals) > 1:
     drug_field = vals[1].strip()
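Once a DRUG value has been isolated this way, it can be split into an identifier and a description. A minimal sketch, assuming every entry starts with an identifier that contains no spaces; `split_entry` is a hypothetical helper name, not part of any library.

```python
def split_entry(value):
    """Split 'D09347  Fostamatinib (USAN)' into ['D09347', 'Fostamatinib (USAN)']."""
    # maxsplit=1 splits on the first run of whitespace only, so spaces
    # inside the description are preserved.
    return value.split(None, 1)


drug_lines = [
    "D09347  Fostamatinib (USAN)",
    "D09348  Fostamatinib disodium (USAN)",
]
drugs = [split_entry(v) for v in drug_lines]
```

Limiting the split to the first whitespace run is what keeps multi-word names such as "Fostamatinib disodium (USAN)" in one piece.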

Source: https://habr.com/ru/post/919653/

