Matching and removing multiple regex groups

Question

I have been provided with a file from which I want to extract useful data. The file format looks something like this:

    LINE: 1
    TOKENKIND: somedata
    TOKENKIND: somedata
    LINE: 2
    TOKENKIND: somedata
    LINE: 3

etc...

What I would like to do is delete LINE: and the line number, as well as TOKENKIND:, so that I am just left with a line consisting of "somedata somedata somedata ..."

I am using Python regular expressions (I'm not sure they are correct) to match the parts of the file I would like to delete.

My question is: how can I get Python to match multiple regex groups and ignore them by adding anything that doesn't match my regex to my output string? My current code is as follows:

    import re
    import sys

    ignoredTokens = re.compile('''
        (?P<WHITESPACE>  \s+            ) |
        (?P<LINE>        LINE:\s[0-9]+  ) |
        (?P<TOKEN>       [A-Z]+:        )
    ''', re.VERBOSE)

    tokenList = open(sys.argv[1], 'r').read()
    cleanedList = ''

    scanner = ignoredTokens.scanner(tokenList)

    for line in tokenList:
        match = scanner.match()

        if match.lastgroup not in ('WHITESPACE', 'LINE', 'TOKEN'):
            cleanedList = cleanedList + match.group(match.lastindex) + ' '

    print(cleanedList)

+2

python regex lexical-analysis

greenie Nov 24 '09 at 16:12

3 answers

There is no need for a regex here. It's Python after all, not Perl. Just use its built-in string methods.

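    # "file" is a placeholder for the path to the input file
    with open("file") as f:
        for line in f:
            # skip the "LINE: <number>" rows entirely
            if line.startswith("LINE:"):
                continue
            # keep only what follows the first space after "TOKENKIND:"
            if "TOKENKIND" in line:
                print(line.split(" ", 1)[-1].strip())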
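To get the single line of output the question asks for, the same idea can be wrapped in a function that collects the values and joins them. This is only a sketch: the function name `extract_data` is made up for illustration, and it assumes every `LINE:` and `TOKENKIND:` record sits on its own line.

```python
def extract_data(lines):
    """Collect the text after each 'TOKENKIND:' prefix, skipping 'LINE:' rows."""
    values = []
    for line in lines:
        line = line.strip()
        if line.startswith("LINE:"):
            continue
        if line.startswith("TOKENKIND:"):
            # partition splits on the first ':' only, so data containing ':' survives
            values.append(line.partition(":")[2].strip())
    return " ".join(values)


sample = """LINE: 1
TOKENKIND: somedata
TOKENKIND: somedata
LINE: 2
TOKENKIND: somedata
LINE: 3"""

print(extract_data(sample.splitlines()))  # somedata somedata somedata
```

Passing an open file object instead of `sample.splitlines()` should work the same way, since both yield one line at a time.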

+2

ghostdog74 Nov 25 '09 at 0:55

What about replacing (^LINE: \d+$)|(^\w+:) with the empty string ""?

Use \n in place of the ^ and $ anchors so that the newline is consumed as well; that gets rid of the blank lines left behind.
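In Python this might look like the sketch below. Note that `^` and `$` only match at line boundaries when `re.MULTILINE` is set. The second pattern is one possible reading of the `\n` suggestion (it keeps `^` and swaps `$` for `\n`), and the sample string assumes one record per line.

```python
import re

sample = (
    "LINE: 1\n"
    "TOKENKIND: somedata\n"
    "TOKENKIND: somedata\n"
    "LINE: 2\n"
    "TOKENKIND: somedata\n"
    "LINE: 3\n"
)

# Anchored version: the LINE rows are emptied but their newlines stay behind.
anchored = re.compile(r"(^LINE: \d+$)|(^\w+:)", re.MULTILINE)
with_blanks = anchored.sub("", sample)

# Consuming the newline (and the space after the colon) removes those rows whole.
consuming = re.compile(r"(^LINE: \d+\n)|(^\w+: ?)", re.MULTILINE)
without_blanks = consuming.sub("", sample)

print(repr(with_blanks))
print(repr(without_blanks))

# Collapse the remaining newlines to get everything on one line.
print(" ".join(without_blanks.split()))  # somedata somedata somedata
```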

+1

Amarghosh Nov 24 '09 at 16:21

Alex Martelli · Accepted Answer · 2009-11-24 16:26

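    import re

    x = '''LINE: 1
    TOKENKIND: somedata
    TOKENKIND: somedata
    LINE: 2
    TOKENKIND: somedata
    LINE: 3'''

    # match either a "LINE: <number>" run or a "TOKENKIND:" prefix,
    # together with any surrounding whitespace
    junkre = re.compile(r'(\s*LINE:\s*\d*\s*)|(\s*TOKENKIND:)', re.DOTALL)

    # replace every junk match with nothing, leaving only the data
    print(junkre.sub('', x))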
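If you would rather keep the named-group approach from the question, one option is to add a catch-all group and keep only the matches that land in it. The sketch below is an assumption about how that could be done, not part of the answer above: the `DATA` group and the `clean` function are hypothetical names, and `re.finditer` stands in for the scanner loop.

```python
import re

token_re = re.compile(r"""
    (?P<WHITESPACE> \s+          )
  | (?P<LINE>       LINE:\s*\d+  )
  | (?P<TOKEN>      [A-Z]+:      )
  | (?P<DATA>       \S+          )   # assumed catch-all for anything else
""", re.VERBOSE)


def clean(text):
    """Keep only the chunks that matched the catch-all DATA group."""
    kept = [m.group() for m in token_re.finditer(text) if m.lastgroup == "DATA"]
    return " ".join(kept)


x = '''LINE: 1
TOKENKIND: somedata
TOKENKIND: somedata
LINE: 2
TOKENKIND: somedata
LINE: 3'''

print(clean(x))  # somedata somedata somedata
```

Because only one alternative can match at a time, `m.lastgroup` names the kind of token found. One caveat: the alternatives are tried in order, so data that itself starts with uppercase letters followed by a colon would be split and its prefix dropped as a `TOKEN`.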
