For reasons I really don't understand, a REST API I'm using does not output JSON or XML but a peculiar structured text format. In its simplest form:
    SECTION_NAME  entry   other qualifying bits of the entry
                  entry2  other qualifying bits
    ...
The columns are not tab-delimited, as the layout might suggest, but space-delimited, and the qualifying bits may themselves contain words separated by spaces. The gap between SECTION_NAME and the entries also varies from one to several (six or more) spaces.
In addition, one part of the format contains entries in the form
    SECTION_NAME  entry
      SUB_SECTION   more information
      SUB_SECTION2  more information
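To make the sub-section form concrete, here is a minimal sketch of how such a block might be split, using the REFERENCE block from my real data. It assumes the line breaks are intact and that every sub-section line is indented and starts with an upper-case keyword; `parse_with_subsections` and `SUBSECTION_RE` are names I made up for illustration.

```python
import re

# REFERENCE block from the real data; line breaks and spacing are assumed.
REFERENCE_TEXT = """\
REFERENCE   PMID:21772278
  AUTHORS   Oeckinghaus A, Hayden MS, Ghosh S
  TITLE     Crosstalk in NF-kappaB signaling pathways.
  JOURNAL   Nat Immunol 12:695-708 (2011)
"""

# Assumption: a sub-section line is indented and begins with an upper-case keyword.
SUBSECTION_RE = re.compile(r"^\s+([A-Z][A-Z_0-9]*)\s+(.*)$")


def parse_with_subsections(text):
    """Split one section into its entry and a dict of sub-sections (hypothetical helper)."""
    lines = text.splitlines()
    # The first line holds the section keyword and its entry.
    name, entry = lines[0].split(None, 1)
    parsed = {"section": name, "entry": entry.strip(), "subsections": {}}
    for line in lines[1:]:
        match = SUBSECTION_RE.match(line)
        if match:
            parsed["subsections"][match.group(1)] = match.group(2).strip()
    return parsed


parsed = parse_with_subsections(REFERENCE_TEXT)
print(parsed["subsections"]["AUTHORS"])
```

The catch is that an indented continuation line whose first token happens to be upper-case letters and digits (an identifier such as D09348, say) would also match this pattern, so the sketch is only safe for sections known to carry sub-sections.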
For reference, an extract of real data (some sections are omitted), which shows the use of the structure:
    ENTRY       hsa04064                    Pathway
    NAME        NF-kappa B signaling pathway - Homo sapiens (human)
    DRUG        D09347  Fostamatinib (USAN)
                D09348  Fostamatinib disodium (USAN)
                D09692  Veliparib (USAN/INN)
                D09730  Olaparib (JAN/INN)
                D09913  Iniparib (USAN/INN)
    REFERENCE   PMID:21772278
      AUTHORS   Oeckinghaus A, Hayden MS, Ghosh S
      TITLE     Crosstalk in NF-kappaB signaling pathways.
      JOURNAL   Nat Immunol 12:695-708 (2011)
I'm trying to parse this odd format into something more robust (a dictionary that can then be converted to JSON), but I'm not sure how to go about it: splitting blindly on spaces makes a mess (it also breaks up values that contain spaces), and I'm not sure how to tell where a section starts and where it ends. Is plain text manipulation enough for the job, or should I use more sophisticated methods?
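For what it's worth, the plain text-manipulation route I have in mind looks roughly like the sketch below. It rests on assumptions I have not verified against the whole format: a section keyword is upper-case and starts in column 0, and every other non-empty line is indented and belongs to the section before it. `parse_sections` and `SECTION_RE` are hypothetical names, and the sample is a shortened copy of my data.

```python
import json
import re

# Shortened sample; the line breaks and column spacing are assumed.
SAMPLE = """\
ENTRY       hsa04064                    Pathway
NAME        NF-kappa B signaling pathway - Homo sapiens (human)
DRUG        D09347  Fostamatinib (USAN)
            D09348  Fostamatinib disodium (USAN)
REFERENCE   PMID:21772278
  AUTHORS   Oeckinghaus A, Hayden MS, Ghosh S
"""

# Assumption: a section line starts in column 0 with an upper-case keyword.
SECTION_RE = re.compile(r"^([A-Z][A-Z_]*)\s+(.*)$")


def parse_sections(text):
    """Group lines under their top-level section keyword (hypothetical helper)."""
    sections = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        match = SECTION_RE.match(line)
        if match:
            # New section: remember the keyword and store the first value.
            current = match.group(1)
            sections.setdefault(current, []).append(match.group(2).strip())
        elif current is not None:
            # Indented line: continuation of the current section.
            sections[current].append(line.strip())
    return sections


record = parse_sections(SAMPLE)
print(json.dumps(record, indent=2))
```

This only handles the top level: sub-section lines such as AUTHORS end up as plain strings under REFERENCE, and repeated sections are merged into one list, so it is a starting point rather than a complete parser.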
EDIT:
I have started using pyparsing for the job, but multi-line entries are tripping me up. Here is an example with DRUG:
    from pyparsing import *

    punctuation = ",.'`&-"
    special_chars = "\()[]"

    drug = Keyword("DRUG")
    drug_content = Word(alphanums) + originalTextFor(OneOrMore(Word(
        alphanums + special_chars))) + ZeroOrMore(LineEnd())
    drug_lines = OneOrMore(drug_content)
    drug_parser = drug + drug_lines
When I apply the DRUG parser to the first three DRUG lines of the example, I get the wrong result (\n converted to actual line breaks for easier reading):
['DRUG', ['D09347', 'Fostamatinib (USAN) D09348 Fostamatinib disodium (USAN) D09692 Veliparib (USAN']]
As you can see, the subsequent entries get lumped together, whereas I would expect:
['DRUG', [['D09347', 'Fostamatinib (USAN)'], ["D09348", "Fostamatinib disodium (USAN)"], ['D09692', ' Veliparib (USAN)']]]
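A possible explanation, which I have not confirmed: pyparsing skips whitespace between tokens by default, and that default includes newlines, so `OneOrMore(Word(...))` can run straight across line ends. The match stopping at `(USAN` would also fit the fact that `/` is not among the characters I allow. For comparison, here is a minimal sketch that sidesteps pyparsing and splits each line on its first run of whitespace. It assumes one entry per line, and `parse_drug_section` is a hypothetical helper, not part of any library.

```python
# The first three DRUG lines of the example; line breaks are assumed intact.
DRUG_TEXT = """\
DRUG        D09347  Fostamatinib (USAN)
            D09348  Fostamatinib disodium (USAN)
            D09692  Veliparib (USAN/INN)
"""


def parse_drug_section(text):
    """Return [keyword, [[id, name], ...]] for one section (hypothetical helper)."""
    lines = text.splitlines()
    # Peel the section keyword off the first line.
    keyword, first_entry = lines[0].split(None, 1)
    entries = []
    for line in [first_entry] + lines[1:]:
        if not line.strip():
            continue  # ignore blank lines
        # Split only once, so spaces inside the name are preserved.
        drug_id, name = line.split(None, 1)
        entries.append([drug_id, name.strip()])
    return [keyword, entries]


result = parse_drug_section(DRUG_TEXT)
print(result)
```

Because each line is handled on its own, the entries cannot bleed into each other, and the full `(USAN/INN)` qualifier is kept without having to enumerate allowed characters.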