Matching and removing multiple regex groups

I have been provided with a file from which I want to extract useful data. The file format looks something like this:

    LINE: 1
    TOKENKIND: somedata
    TOKENKIND: somedata
    LINE: 2
    TOKENKIND: somedata
    LINE: 3

etc...

What I would like to do is delete LINE: and the line number, as well as TOKENKIND:, so that I am just left with a line consisting of "somedata somedata somedata ..."

I am doing this in Python, using regular expressions (I'm not sure they are correct) to match the parts of the file I would like to delete.

My question is: how can I get Python to match multiple regex groups and ignore them by adding anything that doesn't match my regex to my output string? My current code is as follows:

    import re
    import sys

    ignoredTokens = re.compile('''
        (?P<WHITESPACE> \s+           ) |
        (?P<LINE>       LINE:\s[0-9]+ ) |
        (?P<TOKEN>      [AZ]+:        )
    ''', re.VERBOSE)

    tokenList = open(sys.argv[1], 'r').read()
    cleanedList = ''

    scanner = ignoredTokens.scanner(tokenList)

    for line in tokenList:
        match = scanner.match()

        if match.lastgroup not in ('WHITESPACE', 'LINE', 'TOKEN'):
            cleanedList = cleanedList + match.group(match.lastindex) + ' '

    print cleanedList
+2
python regex lexical-analysis
Nov 24 '09 at 16:12
3 answers
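    import re

    x = '''LINE: 1
    TOKENKIND: somedata
    TOKENKIND: somedata
    LINE: 2
    TOKENKIND: somedata
    LINE: 3'''

    # Match either a "LINE: <number>" marker or a "TOKENKIND:" label,
    # together with any whitespace (including newlines) around it.
    junkre = re.compile(r'(\s*LINE:\s*\d*\s*)|(\s*TOKENKIND:)', re.DOTALL)

    # Replace every match with the empty string and print what is left.
    print(junkre.sub('', x))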
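If the named-group style from the question is preferred over a plain substitution, one option is to add a catch-all group for the data and keep only the matches that land in it. This is a rough Python 3 sketch, not a tested drop-in: the `DATA` group and the names `token_re`, `kept`, and `cleaned` are made up for illustration, and `[A-Z]+` is an assumption about what the question's `[AZ]+` was meant to be.

```python
import re

# Hypothetical tokenizer: the three "junk" groups from the question plus a
# catch-all DATA group. [A-Z]+ is an assumption, not the question's pattern.
token_re = re.compile(r"""
    (?P<WHITESPACE> \s+           ) |
    (?P<LINE>       LINE:\s[0-9]+ ) |
    (?P<TOKEN>      [A-Z]+:       ) |
    (?P<DATA>       \S+           )
""", re.VERBOSE)

sample = """LINE: 1
TOKENKIND: somedata
TOKENKIND: somedata
LINE: 2
TOKENKIND: somedata
LINE: 3"""

# finditer walks the text match by match; lastgroup names the alternative
# that matched, so only DATA matches are kept.
kept = [m.group() for m in token_re.finditer(sample) if m.lastgroup == "DATA"]
cleaned = " ".join(kept)
print(cleaned)  # somedata somedata somedata
```

Note that data consisting of capital letters followed by a colon would be classified as a TOKEN under this pattern, so the groups would need adjusting for such input.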
+4
Nov 24 '09 at 16:26

There is no need to use regex in Python. It's Python, after all, not Perl. Just think it through and use its built-in string-processing capabilities.

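    # Open the input file; the with-block closes it automatically.
    with open("file") as f:
        for line in f:
            # Skip the line-number markers entirely.
            if line.startswith("LINE:"):
                continue
            # Print whatever follows the "TOKENKIND:" label.
            if "TOKENKIND" in line:
                print(line.split(" ", 1)[-1].strip())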
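The loop above prints one value per line. If the goal is the single space-separated line described in the question, the same string-method approach can collect the values and join them at the end. A minimal Python 3 sketch, assuming each LINE: or TOKENKIND: entry sits on its own line; the function name `extract_values` is hypothetical.

```python
def extract_values(lines):
    """Return the data after each 'TOKENKIND:' label, joined by spaces."""
    values = []
    for line in lines:
        line = line.strip()
        if line.startswith("LINE:"):
            continue  # drop line-number markers
        if line.startswith("TOKENKIND:"):
            # keep everything after the first colon
            values.append(line.split(":", 1)[1].strip())
    return " ".join(values)


sample = """LINE: 1
TOKENKIND: somedata
TOKENKIND: somedata
LINE: 2
TOKENKIND: somedata
LINE: 3"""

print(extract_values(sample.splitlines()))  # somedata somedata somedata
```

Since an open file object yields lines when iterated, it could be passed to `extract_values` in place of `sample.splitlines()`.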
+2
Nov 25 '09 at 0:55

What about replacing (^LINE: \d+$)|(^\w+:) with the empty string ""?

Use \n instead of ^ and $ to remove unnecessary blank lines.
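The replacement suggested above can be tried with `re.sub` and the `re.MULTILINE` flag, which lets ^ and $ match at every line boundary rather than only at the ends of the string. A minimal Python 3 sketch, assuming one LINE: or TOKENKIND: entry per line; the names `sample`, `junk`, and `cleaned` are illustrative only.

```python
import re

sample = """LINE: 1
TOKENKIND: somedata
TOKENKIND: somedata
LINE: 2
TOKENKIND: somedata
LINE: 3"""

# Pattern from the answer; re.MULTILINE makes ^ and $ work per line.
junk = re.compile(r"(^LINE: \d+$)|(^\w+:)", re.MULTILINE)

# Remove the markers and labels, leaving the data plus stray whitespace.
stripped = junk.sub("", sample)

# split() with no arguments drops the leftover newlines and spaces,
# so joining the pieces yields one space-separated line.
cleaned = " ".join(stripped.split())
print(cleaned)  # somedata somedata somedata
```

The final split-and-join step is one way to deal with the blank lines that the ^ and $ version leaves behind.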

+1
Nov 24 '09 at 16:21


