Python: tokenize a sentence with optional key/value pairs

Question

I am trying to parse a line of text that contains a sentence, possibly followed by some key/value pairs on the same line. The keys are not fixed in advance; both keys and values are dynamic. I am looking for a result like this:

Input:

"There was a cow at home. home=mary cowname=betsy date=10-jan-2013"

Output:

 Values = {'theSentence' : "There was a cow at home.", 'home' : "mary", 'cowname' : "betsy", 'date' : "10-jan-2013" }

Input:

 "Mike ordered a large hamburger. lastname=Smith store=burgerville"

Output:

 Values = {'theSentence' : "Mike ordered a large hamburger.", 'lastname' : "Smith", 'store' : "burgerville" }

Input:

 "Sam is nice."

Output:

 Values = {'theSentence' : "Sam is nice."}

Thanks for any input / direction. I know this looks like a homework problem, but I'm just new to Python. I know this probably calls for a regex solution, but I'm not the best when it comes to regex.

+4

python regex tokenize text-parsing

tazzytazzy Jul 22 '13 at 18:50

8 answers

The first step is to do

 inputStr = "There was a cow at home. home=mary cowname=betsy date=10-jan-2013"
 theSentence, others = inputStr.split('.')

Then you will want to break up "others". Play around with split() (the argument you pass tells Python what to split the string on) and see what you can do. :)
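To make the hint concrete, here is a minimal sketch of where playing with split() can lead. It assumes the sentence contains exactly one period and that no value contains whitespace. The function name parse_line is made up for illustration, and str.partition is swapped in for split so that a line without a period does not raise on unpacking.

```python
def parse_line(line):
    """Split a line into its sentence and any trailing key=value pairs."""
    # partition() always returns three parts, even when '.' is missing
    sentence, dot, others = line.partition('.')
    values = {'theSentence': sentence + dot}
    # split() with no argument splits on any run of whitespace
    for pair in others.split():
        key, _, value = pair.partition('=')
        values[key] = value
    return values


print(parse_line("There was a cow at home. home=mary cowname=betsy date=10-jan-2013"))
print(parse_line("Sam is nice."))
```

A sentence with more than one period (for example an abbreviation) would be cut at the first one, so this only holds under the single-period assumption.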

+1

A.Wan Jul 22 '13 at 18:53

If your sentence is guaranteed to end with a '.', then you can use the following approach.

 >>> Values = {}
 >>> testList = inputString.split('.')
 >>> Values['theSentence'] = testList[0]+'.'

For the rest of the values, just do:

 >>> for elem in testList[1].split():
 ...     key, val = elem.split('=')
 ...     Values[key] = val

This gives you the Values like so:

 >>> Values
 {'date': '10-jan-2013', 'home': 'mary', 'cowname': 'betsy', 'theSentence': 'There was a cow at home.'}
 >>> Values2
 {'lastname': 'Smith', 'theSentence': 'Mike ordered a large hamburger.', 'store': 'burgerville'}
 >>> Values3
 {'theSentence': 'Sam is nice.'}

+1

Sukrit kalra Jul 22 '13 at 18:58

Assuming there can be only one period, the one that separates the sentence from the assignment pairs:

 input = "There was a cow at home. home=mary cowname=betsy date=10-jan-2013"
 sentence, assignments = input.split(". ")
 result = {'theSentence': sentence + "."}
 for item in assignments.split():
     key, value = item.split("=")
     result[key] = value
 print(result)

prints:

 {'date': '10-jan-2013', 'home': 'mary', 'cowname': 'betsy', 'theSentence': 'There was a cow at home.'}

+1

alecxe Jul 22 '13 at 18:58

Assuming that = does not appear in the sentence itself. This seems more reasonable than assuming that the sentence ends with a '.'.

 s = "There was a cow at home. home=mary cowname=betsy date=10-jan-2013"
 eq_loc = s.find('=')
 if eq_loc > -1:
     # the metadata starts after the last space before the first '='
     meta_loc = s[:eq_loc].rfind(' ')
     # take the metadata before truncating s, otherwise it is lost
     metastr = s[meta_loc + 1:]
     s = s[:meta_loc]
     metadict = dict(m.split('=') for m in metastr.split())
 else:
     metadict = {}
 metadict["theSentence"] = s
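As a rough illustration of why keying on the first = is more forgiving than keying on the period, here is the same idea wrapped in a function and run on a sentence with several periods. The name split_on_first_equals and the sample lines are hypothetical, and the sketch still assumes that = never occurs inside the sentence.

```python
def split_on_first_equals(line):
    """Treat everything from the word containing the first '=' onward as metadata."""
    eq_loc = line.find('=')
    if eq_loc == -1:
        # no pairs at all: the whole line is the sentence
        return {'theSentence': line}
    # last space before the first '=' (-1 if the line starts with a pair)
    meta_loc = line[:eq_loc].rfind(' ')
    sentence = line[:max(meta_loc, 0)]
    metastr = line[meta_loc + 1:]
    # split each pair only on its first '=' so values may contain '='
    result = dict(m.split('=', 1) for m in metastr.split())
    result['theSentence'] = sentence
    return result


print(split_on_first_equals("Dr. Who arrived at 5 p.m. name=who"))
print(split_on_first_equals("Sam is nice."))
```

Splitting that first line on '.' would cut the sentence after "Dr", whereas here the periods are simply left alone.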

0

Fast turtle Jul 22 '13 at 19:00

As usual, there are many ways to do this. Here's a regexp-based approach that looks for key=value pairs:

 import re

 sentence = "..."
 values = {}
 for match in re.finditer(r"(\w+)=(\S+)", sentence):
     if not values:
         # everything to the left of the first key/value pair is the sentence
         values["theSentence"] = sentence[:match.start()].strip()
     # store every pair, including the first one
     key, value = match.groups()
     values[key] = value
 if not values:
     # no key/value pairs, keep the entire sentence
     values["theSentence"] = sentence

This assumes that the key is a Python-style identifier and that the value consists of one or more non-whitespace characters.
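To see what those two assumptions mean in practice, here is a small sketch that runs the same pattern over a few made-up strings. The name PAIR and the sample inputs are only for illustration.

```python
import re

# same pattern as above: word characters, '=', then non-whitespace characters
PAIR = re.compile(r"(\w+)=(\S+)")

# well-behaved input: every pair is found
print(PAIR.findall("home=mary cowname=betsy"))
# → [('home', 'mary'), ('cowname', 'betsy')]

# a value containing a space is cut off at the space
print(PAIR.findall("note=hello world id=7"))
# → [('note', 'hello'), ('id', '7')]

# a key containing '-' is only matched from the last word-character run
print(PAIR.findall("cow-name=betsy"))
# → [('name', 'betsy')]
```

If values with spaces or keys with punctuation can occur, the pattern would need to be adapted.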

0

Fredrik Jul 22 '13 at 19:01

Assuming the first period separates the sentence from the values, you can use something like this:

 #! /usr/bin/python3
 a = "There was a cow at home. home=mary cowname=betsy date=10-jan-2013"
 values = (lambda s, tail: (lambda d, kv: (d, d.update (kv) ) ) (
     {'theSentence': s},
     {k: v for k, v in (x.split ('=') for x in tail.strip ().split (' ') ) }
 ) ) (*a.split ('.', 1) ) [0]
 print (values)

0

Hyperboreus Jul 22 '13 at 19:04

Nobody has posted a comprehensible one-liner yet. The question has been answered, but it has to be done in one line; that's the Python way!

 dict({"theSentence": sentence.split(".")[0]}, **{item.split("=")[0]: item.split("=")[1] for item in sentence.split(".")[1].split()})

OK, not super elegant, but it is all on one line. No imports, even.

0

Slater victoroff Jul 22 '13 at 19:12

georg · Accepted Answer · 2013-07-22T19:04:31+0000

I would use re.sub:

 import re

 s = "There was a cow at home. home=mary cowname=betsy date=10-jan-2013"
 d = {}

 def add(m):
     # record the pair and replace it with nothing in the sentence
     d[m.group(1)] = m.group(2)
     return ''

 s = re.sub(r'(\w+)=(\S+)', add, s)
 d['theSentence'] = s.strip()
 print(d)

Here's a more compact version if you prefer:

 d = {}
 d['theSentence'] = re.sub(r'(\w+)=(\S+)', lambda m: d.setdefault(m.group(1), m.group(2)) and '', s).strip()

Or maybe findall is a better option:

 rx = r'(\w+)=(\S+)|(\S.+?)(?=\w+=|$)'
 d = {
     a or 'theSentence': (b or c).strip()
     for a, b, c in re.findall(rx, s)
 }
 print(d)
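For reference, here is a sketch of the findall variant wrapped in a function and run against the three inputs from the question. The names RX and parse are hypothetical; the pattern is the one shown above.

```python
import re

# first alternative: a key=value pair
# second alternative: free text up to the next "key=" or the end of the line
RX = re.compile(r'(\w+)=(\S+)|(\S.+?)(?=\w+=|$)')


def parse(line):
    """Return the sentence under 'theSentence' plus one entry per key=value pair."""
    return {
        key or 'theSentence': (value or text).strip()
        for key, value, text in RX.findall(line)
    }


for line in (
    "There was a cow at home. home=mary cowname=betsy date=10-jan-2013",
    "Mike ordered a large hamburger. lastname=Smith store=burgerville",
    "Sam is nice.",
):
    print(parse(line))
```

Each findall tuple has either the first two groups filled (a pair) or the third one (the sentence), which is why `key or 'theSentence'` and `value or text` pick the right parts.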
