This is an example of a complex tab-separated file that I am trying to parse.
ENTRY map0010\tNAME Glycolysis\tDESCRIPTION Glycolysis is the process of converting glucose into pyruvate\tCLASS Metabolism\tDISEASE H00071 Hereditary fructose intolerance\tH00072 Pyruvate dehydrogenase complex deficiency\tDBLINKS GO: 0006096 0006094
ENTRY map00020\tNAME Citrate cycle (TCA cycle)\tCLASS Metabolism; Carbohydrate Metabolism\tDISEASE H00073 Pyruvate carboxylase deficiency\tDBLINKS GO: 0006099\tREL_PATHWAY map00010 Glycolysis / Gluconeogenesis\tmap00053 Ascorbate and aldarate metabolism
I am trying to get output containing only some fields, for example:
ENTRY map0010\tNAME Glycolysis\tCLASS Metabolism\tDISEASE H00071 Hereditary fructose intolerance H00072 Pyruvate dehydrogenase complex deficiency\tDBLINKS GO: 0006096 0006094\tNA
ENTRY map00020\tNAME Citrate cycle (TCA cycle)\tCLASS Metabolism; Carbohydrate Metabolism\tDISEASE H00073 Pyruvate carboxylase deficiency\tDBLINKS GO: 0006099\tREL_PATHWAY map00010 Glycolysis / Gluconeogenesis\tmap00053 Ascorbate and aldarate metabolism
The main problem is that not all rows contain the same number of fields, so I need to drop some fields, for example DESCRIPTION, and add an empty field to lines where a field, for example CLASS, is missing.
In addition, some fields have their data split across more than one tab-separated field (for instance, in line 1 the field following DISEASE also contains disease data), and I need to join them.
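The joining step can be pictured like this: any field whose first word is not a known keyword is treated as a continuation of the previous field. This is only a minimal sketch. The names `KEYWORDS` and `group_fields` are made up for illustration, the keyword set is taken from the sample rows, and it assumes a continuation field never starts with one of the keywords.

```python
# Hypothetical keyword set, taken from the sample rows above.
KEYWORDS = {"ENTRY", "NAME", "DESCRIPTION", "CLASS",
            "DISEASE", "DBLINKS", "REL_PATHWAY"}


def group_fields(line):
    """Map each keyword in one tab-separated line to its (joined) value."""
    record = {}
    current = None  # keyword that the most recent field belongs to
    for field in line.rstrip("\n").split("\t"):
        key, _, rest = field.partition(" ")
        if key in KEYWORDS:
            current = key
            record[current] = rest
        elif current is not None:
            # No keyword at the start: assume it continues the previous field.
            record[current] += " " + field
    return record


line = ("ENTRY map0010\tNAME Glycolysis\tCLASS Metabolism\t"
        "DISEASE H00071 Hereditary fructose intolerance\t"
        "H00072 Pyruvate dehydrogenase complex deficiency\t"
        "DBLINKS GO: 0006096 0006094")
record = group_fields(line)
print(record["DISEASE"])
```

With the sample line, the two disease fields end up as one space-joined value under `DISEASE`.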
I tried:
input = open('file', 'r')
dict = ["ENTRY", "NAME", "CLASS", "DISEASE", "DBLINKS", "REL_PATHWAY"]
split_tab = []
output = []
for line in input:
    split_tab.append(line.split('\t'))
for item in dict:
    for element in split_tab:
        if item in element:
            output.append(element)
        else:
            output.append('\tNA\t')
But it stores everything, not just the elements specified in the dict. Could you help me?
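Two things in the attempt are worth noting. `item in element` checks whether the list `element` contains a string exactly equal to `item`, so "ENTRY" does not match a field such as "ENTRY map0010". And when the test does succeed, `output.append(element)` appends the whole split line rather than the one matching field, which may be why everything gets stored. Below is a sketch of one possible alternative, not a definitive solution. `select_fields`, `WANTED` and `ALL_KEYWORDS` are hypothetical names, and it assumes every field starts with its keyword followed by a space, with keyword-less fields being continuations.

```python
# Fields to keep, in output order (same list as in the attempt above).
WANTED = ["ENTRY", "NAME", "CLASS", "DISEASE", "DBLINKS", "REL_PATHWAY"]
# Hypothetical: every keyword that can start a field, including dropped ones.
ALL_KEYWORDS = set(WANTED) | {"DESCRIPTION"}


def select_fields(line, wanted=WANTED, missing="NA"):
    """Return one output line with only the wanted fields, 'NA' where absent."""
    grouped = {}
    current = None
    for field in line.rstrip("\n").split("\t"):
        key = field.split(" ", 1)[0]  # first word of the field
        if key in ALL_KEYWORDS:
            current = key
            grouped[current] = field          # keep the keyword in the value
        elif current is not None:
            grouped[current] += " " + field   # continuation of previous field
    return "\t".join(grouped.get(name, missing) for name in wanted)


sample = ("ENTRY map0010\tNAME Glycolysis\t"
          "DESCRIPTION Glycolysis is the process of converting glucose into pyruvate\t"
          "CLASS Metabolism\tDISEASE H00071 Hereditary fructose intolerance\t"
          "H00072 Pyruvate dehydrogenase complex deficiency\t"
          "DBLINKS GO: 0006096 0006094")
result = select_fields(sample)
print(result)
```

This sketch joins every continuation with a space, as in the DISEASE example. The desired output keeps a tab inside REL_PATHWAY, so the separator would need adjusting per field if that distinction matters.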
Sonny