Parsing tab delimited file with missing fields

This is an example of a complex tab-separated file that I am trying to parse.

    ENTRY map0010\tNAME Glycolysis\tDESCRIPTION Glycolysis is the process of converting glucose into pyruvate\tCLASS Metabolism\tDISEASE H00071 Hereditary fructose intolerance\tH00072 Pyruvate dehydrogenase complex deficiency\tDBLINKS GO: 0006096 0006094
    ENTRY map00020\tNAME Citrate cycle (TCA cycle)\tCLASS Metabolism; Carbohydrate Metabolism\tDISEASE H00073 Pyruvate carboxylase deficiency\tDBLINKS GO: 0006099\tREL_PATHWAY map00010 Glycolysis / Gluconeogenesis\tmap00053 Ascorbate and aldarate metabolism

I am trying to get output containing only some fields, for example:

    ENTRY map0010\tNAME Glycolysis\tCLASS Metabolism\tDISEASE H00071 Hereditary fructose intolerance H00072 Pyruvate dehydrogenase complex deficiency\tDBLINKS GO: 0006096 0006094\tNA
    ENTRY map00020\tNAME Citrate cycle (TCA cycle)\tCLASS Metabolism; Carbohydrate Metabolism\tDISEASE H00073 Pyruvate carboxylase deficiency\tDBLINKS GO: 0006099\tREL_PATHWAY map00010 Glycolysis / Gluconeogenesis\tmap00053 Ascorbate and aldarate metabolism

The main problem is that not all rows contain the same number of fields, so I need to delete some fields (for example, the DESCRIPTION field) and add an empty field to the rows where a field such as CLASS is missing.

In addition, for some fields the data is split across more than one field (e.g., in line 1, the field following DISEASE also contains disease data), and I need to join them.

I tried:

    input = open('file', 'r')
    dict = ["ENTRY", "NAME", "CLASS", "DISEASE", "DBLINKS", "REL_PATHWAY"]
    split_tab = []
    output = []
    for line in input:
        split_tab.append(line.split('\t'))
    for item in dict:
        for element in split_tab:
            if item in element:
                output.append(element)
            else:
                output.append('\tNA\t')

But it stores everything, not just the elements specified in the dict. Could you help me?

+4
3 answers

Use the built-in csv library. Your work will be much easier.

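Some sample code:

    import csv

    fieldnames = ['ENTRY', 'NAME', 'CLASS', 'DISEASE', 'DBLINKS', 'REL_PATHWAY']

    # 'output.csv' is a placeholder name; write to a different file than you read.
    with open('myfile.csv', newline='') as infile, \
         open('output.csv', 'w', newline='') as outfile:
        reader = csv.reader(infile, dialect='excel-tab')
        # restval fills in missing fields; extrasaction='ignore' drops unlisted ones.
        writer = csv.DictWriter(outfile, fieldnames, restval='',
                                extrasaction='ignore', dialect='excel-tab')
        for row in reader:
            newrow = {}
            for field in row:
                # The first word of each field is its key, e.g. "NAME".
                key = field.split(' ', 1)[0]
                newrow[key] = field
            writer.writerow(newrow)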

Pay particular attention to how DictWriter is configured. This is much simpler if you include restval and extrasaction. They allow you to pass a dictionary with more or fewer keys than the writer expects.
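To see what the two options do, here is a minimal sketch that writes to an in-memory buffer instead of a file. The sample rows and the `write_rows` helper are made up for illustration, and I use restval='NA' here only because that is the placeholder the question asks for.

```python
import csv
import io

def write_rows(rows, fieldnames):
    """Write dicts as tab-separated lines and return the resulting text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames,
        restval='NA',           # value written for keys missing from a row
        extrasaction='ignore',  # silently drop keys that are not in fieldnames
        dialect='excel-tab',
    )
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()

rows = [
    # Has an extra key (DESCRIPTION) and lacks CLASS.
    {'ENTRY': 'map0010', 'NAME': 'Glycolysis', 'DESCRIPTION': 'dropped'},
    # Has exactly the expected keys.
    {'ENTRY': 'map00020', 'NAME': 'Citrate cycle', 'CLASS': 'Metabolism'},
]
text = write_rows(rows, ['ENTRY', 'NAME', 'CLASS'])
print(text)
```

Without extrasaction='ignore', the first row would raise a ValueError because of the unexpected DESCRIPTION key.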

Just enter your field names and set the reader to the correct dialect. This may mean defining your own dialect, but the csv documentation contains instructions on how to do this.
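If none of the built-in dialects fits, one can be registered. A minimal sketch, assuming the file uses no quoting at all; the dialect name `kegg-tab` and the sample line are invented for this example.

```python
import csv
import io

# Register a tab-delimited dialect that treats quote characters as plain text.
csv.register_dialect('kegg-tab', delimiter='\t', quoting=csv.QUOTE_NONE)

# In-memory stand-in for a file with one record.
sample = io.StringIO('ENTRY map0010\tNAME "Glycolysis"\tCLASS Metabolism\n')
rows = list(csv.reader(sample, dialect='kegg-tab'))
print(rows)
```

With QUOTE_NONE the reader leaves the double quotes in the NAME field untouched, which may or may not be what you want for your data.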

EDIT

After Rob's comment below, I revised this to take into account the fact that the csv dialects are not as powerful as I thought.

+6
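    requiredKeys = 'ENTRY NAME CLASS DISEASE DBLINKS REL_PATHWAY'.split(' ')
    for line in open('file', 'r'):
        # Strip the newline so it does not end up inside the last field.
        fields = line.rstrip('\n').split('\t')
        fieldMap = {}
        for field in fields:
            # The first word of each field is its key, e.g. "NAME".
            key = field.split(' ', 1)[0]
            fieldMap[key] = field
        # Emit the required fields in order, with NA for any that are missing.
        print('\t'.join([fieldMap.get(key, 'NA') for key in requiredKeys]))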
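The loop above keeps one field per key, so continuation fields such as the second DISEASE entry are dropped. Here is a rough sketch of one way to join them, assuming every field name that can occur is known in advance and that any other field belongs to the field before it. `KNOWN_KEYS` and `select_fields` are names I made up, and joining with a space is my reading of the expected output in the question.

```python
# Assumption: every field name that can occur in the file is listed here.
KNOWN_KEYS = {'ENTRY', 'NAME', 'DESCRIPTION', 'CLASS', 'DISEASE',
              'DBLINKS', 'REL_PATHWAY'}
REQUIRED_KEYS = ['ENTRY', 'NAME', 'CLASS', 'DISEASE', 'DBLINKS', 'REL_PATHWAY']

def select_fields(line):
    """Return the required fields of one record, tab-joined, with NA for gaps."""
    field_map = {}
    current_key = None
    for field in line.rstrip('\n').split('\t'):
        first_word = field.split(' ', 1)[0]
        if first_word in KNOWN_KEYS:
            # A new named field starts here.
            current_key = first_word
            field_map[current_key] = field
        elif current_key is not None:
            # Continuation field: append it to the field that precedes it.
            field_map[current_key] += ' ' + field
    return '\t'.join(field_map.get(key, 'NA') for key in REQUIRED_KEYS)

line = ('ENTRY map0010\tNAME Glycolysis\tDESCRIPTION Glycolysis is the process\t'
        'CLASS Metabolism\tDISEASE H00071 Hereditary fructose intolerance\t'
        'H00072 Pyruvate dehydrogenase complex deficiency\t'
        'DBLINKS GO: 0006096 0006094\n')
result = select_fields(line)
print(result)
```

Continuation text attached to a field that is not required, such as DESCRIPTION, is discarded together with that field.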
+3

Your line

    split_tab.append(line.split('\t'))

is the problem: it creates a list inside a list. Try this instead:

    split_tab = line.split('\t')
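The difference matters because `in` behaves differently on lists and on strings. A small illustration with a made-up line (the variable names are only for this example):

```python
line = 'ENTRY map0010\tNAME Glycolysis\tCLASS Metabolism'

fields = line.split('\t')   # flat list of field strings
nested = []
nested.append(fields)       # what append() produces: a list inside a list

# On a list, "in" looks for an element equal to the whole string.
key_in_nested = 'ENTRY' in nested
key_in_fields = 'ENTRY' in fields
# On a single field (a string), "in" looks for a substring.
key_in_field = 'ENTRY' in fields[0]
# A prefix check is stricter than a substring check.
matches_prefix = fields[0].startswith('ENTRY ')

print(key_in_nested, key_in_fields, key_in_field, matches_prefix)
```

So even with a flat list, the key has to be compared against the start of each field rather than against the list as a whole.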

+2

Source: https://habr.com/ru/post/1381190/

