Here is an approach using regular expressions:
    import re

    text = """1,000 barrels
    5 Megawatts hours (MWh)
    80 Megawatt hours (MWh)
    (...)"""

    # unit: a word inside parentheses; value: digits with optional commas
    r_unit = re.compile(r"\((\w+)\)")
    r_value = re.compile(r"([\d,]+)")

    for line in text.splitlines():
        unit = r_unit.search(line)
        if unit:
            unit = unit.groups()[0]
        else:
            unit = ""
        value = r_value.search(line)
        if value:
            value = value.groups()[0]
        else:
            value = ""
        print(value, unit)
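If the values are needed for further processing rather than just printed, the two searches above can be wrapped in a function. This is only a minimal sketch: it assumes the values are whole numbers with commas as thousands separators, and `parse_line` is a hypothetical helper name, not part of the code above.

```python
import re

R_UNIT = re.compile(r"\((\w+)\)")   # a word inside parentheses
R_VALUE = re.compile(r"([\d,]+)")   # digits, possibly with commas

def parse_line(line):
    """Return (value, unit); value is None and unit is "" when not found."""
    unit_match = R_UNIT.search(line)
    value_match = R_VALUE.search(line)
    unit = unit_match.group(1) if unit_match else ""
    value = None
    if value_match:
        digits = value_match.group(1).replace(",", "")
        # A lone comma would leave an empty string, so guard before converting.
        if digits:
            value = int(digits)
    return value, unit
```

For example, `parse_line("1,000 barrels")` gives `(1000, "")` and `parse_line("5 Megawatts hours (MWh)")` gives `(5, "MWh")`.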
Alternatively, a simpler approach uses a single regexp:
    r = re.compile(r"(([\d,]+)[^(\n]*\(?(\w+)?\)?)")
    for line, value, unit in r.findall(text):
        print(value, unit)

(I thought of it right after writing the previous one :-p)

Full description of this regexp:

    (             <- LINE GROUP
      (           <- VALUE GROUP
        [         <- character class (i.e. the character read is one of the following)
          \d      <- any digit
          ,       <- a comma
        ]
        +         <- one or more of the previous expression
      )
      [^(\n]      <- any character except an opening parenthesis or a newline
      *           <- zero or more of the previous expression
      \(          <- a literal opening parenthesis
      ?           <- zero or one of the previous expression
      (           <- UNIT GROUP
        \w        <- any word character (letter, digit or underscore)
        +         <- one or more of the previous expression
      )
      ?           <- zero or one of the previous expression
      \)          <- a literal closing parenthesis
      ?           <- zero or one of the previous expression
    )
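The breakdown above can also live inside the code itself by compiling the pattern with `re.VERBOSE`, which ignores whitespace and `#` comments in the pattern. The following is a sketch rather than a definitive version: `PATTERN` and `results` are names chosen for illustration, and the middle part is written as `[^(\n]*` so that the match stops before a parenthesised unit (a greedy `.*` in that position would run to the end of the line and leave the unit group empty).

```python
import re

PATTERN = re.compile(r"""
    (               # LINE GROUP
      ([\d,]+)      # VALUE GROUP: one or more digits or commas
      [^(\n]*       # anything up to an opening parenthesis or end of line
      \(?           # optional literal opening parenthesis
      (\w+)?        # UNIT GROUP (optional): one or more word characters
      \)?           # optional literal closing parenthesis
    )
""", re.VERBOSE)

text = "1,000 barrels\n5 Megawatts hours (MWh)\n80 Megawatt hours (MWh)"

# findall returns one (line, value, unit) tuple per match;
# a unit group that did not take part in the match comes back as "".
results = [(value, unit) for _line, value, unit in PATTERN.findall(text)]
print(results)
# [('1,000', ''), ('5', 'MWh'), ('80', 'MWh')]
```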