Parsing tab delimited file with missing fields

Question

This is an example of a complex tab-separated file that I am trying to parse.

    ENTRY map0010\tNAME Glycolysis\tDESCRIPTION Glycolysis is the process of converting glucose into pyruvate\tCLASS Metabolism\tDISEASE H00071 Hereditary fructose intolerance\tH00072 Pyruvate dehydrogenase complex deficiency\tDBLINKS GO: 0006096 0006094
    ENTRY map00020\tNAME Citrate cycle (TCA cycle)\tCLASS Metabolism; Carbohydrate Metabolism\tDISEASE H00073 Pyruvate carboxylase deficiency\tDBLINKS GO: 0006099\tREL_PATHWAY map00010 Glycolysis / Gluconeogenesis\tmap00053 Ascorbate and aldarate metabolism

I am trying to get output containing only some fields, for example:

    ENTRY map0010\tNAME Glycolysis\tCLASS Metabolism\tDISEASE H00071 Hereditary fructose intolerance H00072 Pyruvate dehydrogenase complex deficiency\tDBLINKS GO: 0006096 0006094\tNA
    ENTRY map00020\tNAME Citrate cycle (TCA cycle)\tCLASS Metabolism; Carbohydrate Metabolism\tDISEASE H00073 Pyruvate carboxylase deficiency\tDBLINKS GO: 0006099\tREL_PATHWAY map00010 Glycolysis / Gluconeogenesis\tmap00053 Ascorbate and aldarate metabolism

The main problem is that not all rows contain the same number of fields. I need to delete some fields (for example, the DESCRIPTION field) and add an empty field to rows where an expected field (for example, CLASS) is missing.

In addition, the data for some fields is split across more than one field (for instance, in line 1 the field following DISEASE also contains disease data), and I need to join them.

I tried:

    input = open('file', 'r')
    dict = ["ENTRY", "NAME", "CLASS", "DISEASE", "DBLINKS", "REL_PATHWAY"]
    split_tab = []
    output = []
    for line in input:
        split_tab.append(line.split('\t'))
    for item in dict:
        for element in split_tab:
            if item in element:
                output.append(element)
            else:
                output.append('\tNA\t')

But it stores everything, not just the elements specified in the dict. Could you help me?

+4

python parsing csv

Sonny Nov 14 '11 at 20:18

3 answers

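    requiredKeys = 'ENTRY NAME CLASS DISEASE DBLINKS REL_PATHWAY'.split(' ')
    for line in open('file', 'r'):
        # Drop the trailing newline so it does not end up in the last field.
        fields = line.rstrip('\n').split('\t')
        fieldMap = {}
        for field in fields:
            # The first word of each field is its key, e.g. "ENTRY" or "NAME".
            key = field.split(' ', 1)[0]
            fieldMap[key] = field
        # Emit the required fields in order, with 'NA' for any that are missing.
        print('\t'.join([fieldMap.get(key, 'NA') for key in requiredKeys]))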
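Note that keying each field by its first word, as in the code above, stores a continuation field such as "H00072 Pyruvate dehydrogenase complex deficiency" under its own key, so it is dropped rather than joined to DISEASE. Here is a minimal sketch of one way to join such fields, assuming that any field whose first word is not a known tag continues the previous tagged field. The names `ALL_TAGS`, `REQUIRED`, `parse_row` and `format_row` are made up for this illustration, and the tag list is taken from the sample rows in the question.

```python
# Every tag seen in the sample data (an assumption; extend for real files).
ALL_TAGS = {"ENTRY", "NAME", "DESCRIPTION", "CLASS",
            "DISEASE", "DBLINKS", "REL_PATHWAY"}
# The tags to keep, in output order.
REQUIRED = ["ENTRY", "NAME", "CLASS", "DISEASE", "DBLINKS", "REL_PATHWAY"]


def parse_row(line):
    """Map each tag to its field, joining untagged fields onto the previous tag."""
    merged = {}
    current = None
    for field in line.rstrip("\n").split("\t"):
        tag = field.split(" ", 1)[0]
        if tag in ALL_TAGS:
            current = tag
            merged[current] = field
        elif current is not None:
            # Continuation field: append it to the last tagged field.
            merged[current] += " " + field
    return merged


def format_row(line):
    """Return the required fields in order, with 'NA' for missing ones."""
    merged = parse_row(line)
    return "\t".join(merged.get(tag, "NA") for tag in REQUIRED)


row = ("ENTRY map0010\tNAME Glycolysis\t"
       "DESCRIPTION Glycolysis is the process of converting glucose into pyruvate\t"
       "CLASS Metabolism\tDISEASE H00071 Hereditary fructose intolerance\t"
       "H00072 Pyruvate dehydrogenase complex deficiency\t"
       "DBLINKS GO: 0006096 0006094")
print(format_row(row))
```

Joining with a space matches the DISEASE field in the first row of the desired output. If a continuation field could itself begin with one of the tag words, this heuristic would misfire.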

+3

rob mayoff Nov 14 '11 at 20:42

Your line

    split_tab.append(line.split('\t'))

is the problem: it creates a list inside a list. Try this instead:

    split_tab = line.split('\t')
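To illustrate the difference, here is a small sketch using a shortened sample line (the variable names `nested` and `flat` are just for this example). It also shows that `in` on a list looks for an exact element match, so a bare key like "ENTRY" is not found among fields such as "ENTRY map0010".

```python
line = "ENTRY map0010\tNAME Glycolysis\tCLASS Metabolism\n"

# append() adds the whole list of fields as a single element.
nested = []
nested.append(line.split("\t"))

# Assigning the result of split() gives a flat list of fields.
flat = line.rstrip("\n").split("\t")

print(len(nested))  # 1: one inner list
print(len(flat))    # 3: one entry per field

# Membership on a list compares whole elements, not prefixes.
print("ENTRY" in flat)                            # False
print(any(f.startswith("ENTRY ") for f in flat))  # True
```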

+2

Rookie Nov 14 '11 at 20:35

Spencer rathbun · Accepted Answer · 2011-11-14T20:36:36+0000

Use the built-in csv module. It will make your work much easier.

For some sample code:

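    import csv

    # Only these columns are written; each key is the first word of a field.
    fieldnames = ['ENTRY', 'NAME', 'CLASS', 'DISEASE', 'DBLINKS', 'REL_PATHWAY']

    # 'myfile_out.csv' is a placeholder; the output must not be the input file.
    with open('myfile.csv', newline='') as infile, \
            open('myfile_out.csv', 'w', newline='') as outfile:
        reader = csv.reader(infile, dialect='excel-tab')
        # restval fills missing fields; extrasaction='ignore' drops unlisted keys.
        writer = csv.DictWriter(outfile, fieldnames, restval='',
                                extrasaction='ignore', dialect='excel-tab')
        for row in reader:
            newrow = {}
            for field in row:
                key = field.split(' ', 1)[0]
                newrow[key] = field
            writer.writerow(newrow)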

Pay particular attention to how DictWriter is configured. Setting restval and extrasaction makes this much simpler: they let you pass a dictionary with more or fewer keys than the writer expects.
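As a small illustration of those two options, the sketch below writes one row to an in-memory buffer instead of a file. The two-column `fieldnames` list and the choice of "NA" as `restval` (to match the question's desired output) are assumptions made for the example.

```python
import csv
import io

fieldnames = ["ENTRY", "CLASS"]
buffer = io.StringIO()  # in-memory stand-in for an output file
writer = csv.DictWriter(buffer, fieldnames, restval="NA",
                        extrasaction="ignore", dialect="excel-tab")

# CLASS is absent, so it is filled from restval.
# DESCRIPTION is not in fieldnames, so it is silently ignored.
writer.writerow({"ENTRY": "ENTRY map0010",
                 "DESCRIPTION": "DESCRIPTION Glycolysis is ..."})

result = buffer.getvalue()
print(repr(result))  # 'ENTRY map0010\tNA\r\n'
```

With the default `extrasaction='raise'`, the same call would raise a ValueError because of the unexpected DESCRIPTION key.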

Just fill in your field names and set the reader to the correct dialect. This may mean defining your own dialect, but the csv documentation explains how to do that.
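For instance, a custom dialect can be registered with `csv.register_dialect`. The sketch below assumes the data never uses quoting, so quote characters should be treated as ordinary text; the dialect name "plain-tab" is made up for this example.

```python
import csv

# Tab-delimited, with quote characters treated as ordinary text.
csv.register_dialect("plain-tab", delimiter="\t", quoting=csv.QUOTE_NONE)

lines = ['"Glycolysis" pathway\tCLASS Metabolism']
rows = list(csv.reader(lines, dialect="plain-tab"))
print(rows)  # [['"Glycolysis" pathway', 'CLASS Metabolism']]
```

Whether disabling quoting is appropriate depends on the file. If some fields really are quoted, the stock excel-tab dialect is the better fit.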

EDIT

After Rob's comment below, I revised this to account for the fact that csv dialects are not as powerful as I thought.
