I have been provided with a file from which I want to extract useful data. The file format looks something like this:
LINE: 1
TOKENKIND: somedata
TOKENKIND: somedata
LINE: 2
TOKENKIND: somedata
LINE: 3
etc...
What I would like to do is delete each LINE: marker together with its line number, as well as each TOKENKIND: marker, so that I am left with a single line consisting of "somedata somedata somedata ..."
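To make the goal concrete, here is a minimal sketch of the transformation I am after, written with a plain re.sub rather than the scanner approach I ask about. The name strip_markers and the two patterns are assumptions for illustration only: I am guessing that every token kind is an uppercase word followed by a colon, so any data that happens to look like that would be removed too.

```python
import re

# Assumed patterns (illustrative only): a "LINE: <number>" marker,
# or any uppercase word followed by a colon, such as "TOKENKIND:".
UNWANTED = re.compile(r"LINE:\s*[0-9]+|[A-Z]+:")


def strip_markers(text):
    """Remove the markers and collapse the remaining whitespace to single spaces."""
    # Replace each marker with a space, then split on any whitespace
    # (including newlines) and rejoin so the result is a single line.
    return " ".join(UNWANTED.sub(" ", text).split())


sample = (
    "LINE: 1 TOKENKIND: somedata TOKENKIND: somedata "
    "LINE: 2 TOKENKIND: somedata LINE: 3"
)
print(strip_markers(sample))
```

For the sample above this should leave only the three somedata values, separated by single spaces.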
I am using Python with regular expressions (I am not sure they are correct) to match the parts of the file I would like to delete.
My question is: how can I get Python to match multiple regex groups and ignore them by adding anything that doesn't match my regex to my output string? My current code is as follows:
import re
import sys

ignoredTokens = re.compile('''
    (?P<WHITESPACE> \s+ ) |
    (?P<LINE> LINE:\s[0-9]+ ) |
    (?P<TOKEN> [AZ]+: )
''', re.VERBOSE)

tokenList = open(sys.argv[1], 'r').read()
cleanedList = ''

scanner = ignoredTokens.scanner(tokenList)

for line in tokenList:
    match = scanner.match()

    if match.lastgroup not in ('WHITESPACE', 'LINE', 'TOKEN'):
        cleanedList = cleanedList + match.group(match.lastindex) + ' '

print cleanedList
python regex lexical-analysis
greenie Nov 24 '09 at 16:12