Python numeric data extraction

Suppose I read some lines like these:

    1,000 barrels
    5 Megawatts hours (MWh)
    80 Megawatt hours (MWh) (5 MW per peak hour).

What is the best way to capture the numeric elements (namely, only the first instance) and the contents of the first parentheses, if they exist?

My current approach is to split each string on ' ' and use str.isalpha to find the non-alpha elements. But I am not sure how to get the first entry in parentheses.

2 answers

Here is a regex approach:

    import re

    text = """1,000 barrels
    5 Megawatts hours (MWh)
    80 Megawatt hours (MWh) (...)"""

    r_unit = re.compile(r"\((\w+)\)")   # a single word inside parentheses
    r_value = re.compile(r"([\d,]+)")   # a run of digits and commas

    for line in text.splitlines():
        # search() returns the first match only, or None if there is none
        unit = r_unit.search(line)
        unit = unit.group(1) if unit else ""
        value = r_value.search(line)
        value = value.group(1) if value else ""
        print(value, unit)

Alternatively, a simpler approach uses a single regex:

    r = re.compile(r"(([\d,]+).*\(?(\w+)?\)?)")
    # Caveat: `.*` is greedy, so it swallows any "(unit)" and the unit group ends up empty.
    for line, value, unit in r.findall(text):
        print(value, unit)

(I thought of it right after writing the previous one :-p)

Full breakdown of the latter regex:

    (           <- LINE GROUP
      (         <- VALUE GROUP
        [       <- character class (the character read is one of the following)
          \d    <- any digit
          ,     <- a comma
        ]
        +       <- one or more of the previous expression
      )
      .         <- any character
      *         <- zero or more of the previous expression
      \(        <- a literal opening parenthesis
      ?         <- zero or one of the previous expression
      (         <- UNIT GROUP
        \w      <- any word character (letter, digit, or underscore)
        +       <- one or more of the previous expression
      )
      ?         <- zero or one of the previous expression
      \)        <- a literal closing parenthesis
      ?         <- zero or one of the previous expression
    )
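In the single pattern above, `.*` matches as much as it can, so it runs to the end of the line and the optional unit group has nothing left to capture. A minimal sketch of one way around this, assuming one record per line: anchor the pattern at the start of each line and stop scanning at the first `(`. The names `PATTERN` and `extract` are hypothetical and not part of the original answer.

```python
import re

# Hypothetical pattern: first number on each line, then the first "(word)" if present.
PATTERN = re.compile(
    r"^[^\d\n]*"          # skip any non-digits at the start of the line
    r"(\d[\d,]*)"         # VALUE: a digit followed by more digits or commas
    r"[^(\n]*"            # everything up to the first "(" or the end of the line
    r"(?:\((\w+)\))?",    # UNIT: optional single word inside parentheses
    re.MULTILINE,         # let ^ match at the start of every line
)

def extract(text):
    """Return a list of (value, unit) pairs, one per line that contains a number."""
    return [(m.group(1), m.group(2) or "") for m in PATTERN.finditer(text)]

text = """1,000 barrels
5 Megawatts hours (MWh)
80 Megawatt hours (MWh) (5 MW per peak hour)"""

for value, unit in extract(text):
    print(value, unit)
```

Because each match must begin at the start of a line, the `5` inside `(5 MW per peak hour)` is not reported as a separate value. A first parenthesized group containing more than one word is treated as "no unit" in this sketch.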

To extract the numeric values you can use re:

    import re

    value = """1,000 barrels
    5 Megawatts hours (MWh)
    80 Megawatt hours (MWh) (5 MW per peak hour)"""

    # findall() returns every match in the string, not just the first one
    print(re.findall(r"[0-9]+,?[0-9]*", value))
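Note that `re.findall` returns every number in the string, including the `5` inside the last parentheses. If only the first number on a line is wanted, as the question asks, `re.search` stops at the first match. A minimal sketch; the helper name `first_number` is an assumption for illustration, and it treats commas as thousands separators.

```python
import re

def first_number(line):
    """Return the first number in `line` as an int, or None if there is none."""
    match = re.search(r"\d[\d,]*", line)
    if match is None:
        return None
    # Drop thousands separators before converting, e.g. "1,000" -> 1000.
    return int(match.group().replace(",", ""))

lines = [
    "1,000 barrels",
    "5 Megawatts hours (MWh)",
    "80 Megawatt hours (MWh) (5 MW per peak hour)",
]
print([first_number(line) for line in lines])
```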

Source: https://habr.com/ru/post/946696/

