How to split a line with an irregular format

Question

I have lines in this format:

2013-06-05T11:01:02.955 LASTNAME=Jone FIRSTNAME=Jason PERSONNELID=salalm QID=231412 READER_NAME="CAZ.1 LOBBY LEFT TURNSTYLE OUT" ACCESS_TYPE="Access Granted" EVENT_TIME_UTC=1370480141.000 REGION=UTAH

Some of them look like this:

  2013-06-05T11:15:48.670 LASTNAME=Ga FIRSTNAME="Je " PERSONNELID=jega QID=Q10138202 READER_NAME="CAZ.1 ELEVATOR LOBBY DBL GLASS" ACCESS_TYPE="Access Granted" EVENT_TIME_UTC=1370481333.000 REGION=UTAH

I want to extract the values of PERSONNELID, REGION, ACCESS_TYPE, and EVENT_TIME_UTC.

I was going to use split(" "), but the values of READER_NAME and ACCESS_TYPE contain spaces. Could I convert the line to JSON and look the values up by key?

What is the best way to extract these values?

Thank you in advance

+4

python

user1413449 Jun 05 '13 at 18:16

3 answers

Let's analyze the problem: you want to match one of the four identifiers, and then the = sign, and then either a quoted string or a sequence of characters without spaces.

This is the perfect job for regex:

    >>> s = '2013-06-05T11:01:02.955 LASTNAME=Jone FIRSTNAME=Jason PERSONNELID=salalm QID=231412 READER_NAME="CAZ.1 LOBBY LEFT TURNSTYLE OUT" ACCESS_TYPE="Access Granted" EVENT_TIME_UTC=1370480141.000 REGION=UTAH'
    >>> import re
    >>> regex = re.compile(r"""\b(PERSONNELID|REGION|ACCESS_TYPE|EVENT_TIME_UTC)
    ...                        =
    ...                        ("[^"]*"|\S+)""", re.VERBOSE)
    >>> result = regex.findall(s)
    >>> result
    [('PERSONNELID', 'salalm'), ('ACCESS_TYPE', '"Access Granted"'), ('EVENT_TIME_UTC', '1370480141.000'), ('REGION', 'UTAH')]
    >>> dict(result)
    {'EVENT_TIME_UTC': '1370480141.000', 'PERSONNELID': 'salalm', 'ACCESS_TYPE': '"Access Granted"', 'REGION': 'UTAH'}

Explanation:

\b ensures that the match starts at a word boundary.

"[^"]*" matches a quote, followed by any number of non-quote characters and another quote.

\S+ matches one or more non-whitespace characters.

Enclosing the "interesting" parts of the regular expression in parentheses creates capturing groups, so findall returns a list of tuples, one tuple per match, holding the text matched by each group. Note that quoted values keep their surrounding quotes.
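As a follow-up to the explanation above, here is a minimal sketch of the same regex wrapped in a function that also strips the surrounding quotes from quoted values. The names WANTED, FIELD_RE, and extract_fields are made up for illustration and are not part of the original answer.

```python
import re

# Field names the question asks for.
WANTED = ("PERSONNELID", "REGION", "ACCESS_TYPE", "EVENT_TIME_UTC")

# One of the wanted names, '=', then either a double-quoted string
# or a run of non-whitespace characters.
FIELD_RE = re.compile(r'\b(' + "|".join(WANTED) + r')=("[^"]*"|\S+)')


def extract_fields(line):
    """Return a dict of the wanted fields, with surrounding quotes removed."""
    fields = {}
    for name, value in FIELD_RE.findall(line):
        # Quoted values come back with their quotes attached; drop them.
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        fields[name] = value
    return fields


line = ('2013-06-05T11:01:02.955 LASTNAME=Jone FIRSTNAME=Jason '
        'PERSONNELID=salalm QID=231412 '
        'READER_NAME="CAZ.1 LOBBY LEFT TURNSTYLE OUT" '
        'ACCESS_TYPE="Access Granted" EVENT_TIME_UTC=1370480141.000 REGION=UTAH')
fields = extract_fields(line)
print(fields)
```

One caveat of this approach: the regex does not know whether it is inside a quoted value, so a wanted name followed by = inside, say, READER_NAME would also be picked up.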

+3

Tim Pietzcker Jun 05 '13 at 18:22

Looking for an existing parser is a good idea. If you can find a format that already describes your data, or one that your data can be trivially converted to, you win.

In this case, converting to JSON looks like about as much work as parsing the line in the first place.

But you are just trying to split the line into simple value and name=value components, where the values can be quoted... which are the same rules as minimal shell syntax. So shlex will do it for you:

    >>> import shlex
    >>> shlex.split('2013-06-05T11:01:02.955 LASTNAME=Jone FIRSTNAME=Jason PERSONNELID=salalm QID=231412 READER_NAME="CAZ.1 LOBBY LEFT TURNSTYLE OUT" ACCESS_TYPE="Access Granted" EVENT_TIME_UTC=1370480141.000 REGION=UTAH')
    ['2013-06-05T11:01:02.955', 'LASTNAME=Jone', 'FIRSTNAME=Jason', 'PERSONNELID=salalm', 'QID=231412', 'READER_NAME=CAZ.1 LOBBY LEFT TURNSTYLE OUT', 'ACCESS_TYPE=Access Granted', 'EVENT_TIME_UTC=1370480141.000', 'REGION=UTAH']

You still need to split each name=value pair into its name and value components, but that is just namevalue.split('=', 1). It makes sense that this is a separate step, given that some of your elements are not name-value pairs at all (2013-06-05T11:01:02.955).
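The separate step described above could be sketched as follows. The function name split_record is hypothetical; it uses str.partition rather than split so that tokens without an = sign (such as the timestamp) can be collected on their own. The sample is the second line from the question, whose FIRSTNAME value is quoted and ends in a space.

```python
import shlex


def split_record(line):
    """Split one log line into (bare_tokens, fields)."""
    bare, fields = [], {}
    for token in shlex.split(line):
        name, sep, value = token.partition("=")
        if sep:
            # The token contained '=', so treat it as a name=value pair.
            fields[name] = value
        else:
            # No '=' at all, e.g. the leading timestamp.
            bare.append(token)
    return bare, fields


line = ('2013-06-05T11:15:48.670 LASTNAME=Ga FIRSTNAME="Je " PERSONNELID=jega '
        'QID=Q10138202 READER_NAME="CAZ.1 ELEVATOR LOBBY DBL GLASS" '
        'ACCESS_TYPE="Access Granted" EVENT_TIME_UTC=1370481333.000 REGION=UTAH')
bare, fields = split_record(line)
print(bare)
print(fields["FIRSTNAME"], fields["READER_NAME"])
```

Because shlex removes the quotes while splitting, the quoted FIRSTNAME value keeps its trailing space but loses the quote characters.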

Of course, you can always consider them as pairs of names and values with empty values:

    >>> dict(namevalue.partition('=')[::2] for namevalue in shlex.split(s))
    {'2013-06-05T11:01:02.955': '', 'ACCESS_TYPE': 'Access Granted', 'EVENT_TIME_UTC': '1370480141.000', 'FIRSTNAME': 'Jason', 'LASTNAME': 'Jone', 'PERSONNELID': 'salalm', 'QID': '231412', 'READER_NAME': 'CAZ.1 LOBBY LEFT TURNSTYLE OUT', 'REGION': 'UTAH'}

+3

abarnert Jun 05 '13 at 18:28

DSM · Accepted Answer · 2013-06-05T18:27:10+0000

One hack I have found useful in the past is to use shlex.split:

    >>> import shlex
    >>> s = '2013-06-05T11:01:02.955 LASTNAME=Jone FIRSTNAME=Jason PERSONNELID=salalm QID=231412 READER_NAME="CAZ.1 LOBBY LEFT TURNSTYLE OUT" ACCESS_TYPE="Access Granted" EVENT_TIME_UTC=1370480141.000 REGION=UTAH'
    >>> split = shlex.split(s)
    >>> split
    ['2013-06-05T11:01:02.955', 'LASTNAME=Jone', 'FIRSTNAME=Jason', 'PERSONNELID=salalm', 'QID=231412', 'READER_NAME=CAZ.1 LOBBY LEFT TURNSTYLE OUT', 'ACCESS_TYPE=Access Granted', 'EVENT_TIME_UTC=1370480141.000', 'REGION=UTAH']

And then we can turn this into a dictionary:

    >>> parsed = dict(k.split("=", 1) for k in split if '=' in k)
    >>> parsed
    {'EVENT_TIME_UTC': '1370480141.000', 'FIRSTNAME': 'Jason', 'LASTNAME': 'Jone', 'REGION': 'UTAH', 'ACCESS_TYPE': 'Access Granted', 'PERSONNELID': 'salalm', 'QID': '231412', 'READER_NAME': 'CAZ.1 LOBBY LEFT TURNSTYLE OUT'}
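To get from that dictionary to the four values the question asks about, something like the sketch below might do. WANTED and wanted_fields are illustrative names, and using .get() so that a missing key becomes None is a design choice made here, not something the answer specifies.

```python
import shlex

# The four keys named in the question.
WANTED = ("PERSONNELID", "REGION", "ACCESS_TYPE", "EVENT_TIME_UTC")


def wanted_fields(line, wanted=WANTED):
    """Parse one line with shlex and keep only the wanted keys."""
    parsed = dict(tok.split("=", 1) for tok in shlex.split(line) if "=" in tok)
    # .get() yields None for a key that is absent instead of raising KeyError.
    return {key: parsed.get(key) for key in wanted}


lines = [
    '2013-06-05T11:01:02.955 LASTNAME=Jone FIRSTNAME=Jason PERSONNELID=salalm '
    'QID=231412 READER_NAME="CAZ.1 LOBBY LEFT TURNSTYLE OUT" '
    'ACCESS_TYPE="Access Granted" EVENT_TIME_UTC=1370480141.000 REGION=UTAH',
    '2013-06-05T11:15:48.670 LASTNAME=Ga FIRSTNAME="Je " PERSONNELID=jega '
    'QID=Q10138202 READER_NAME="CAZ.1 ELEVATOR LOBBY DBL GLASS" '
    'ACCESS_TYPE="Access Granted" EVENT_TIME_UTC=1370481333.000 REGION=UTAH',
]
rows = [wanted_fields(line) for line in lines]
for row in rows:
    print(row)
```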

As @abarnert points out, you can store more information if you want:

    >>> dict(k.partition('=')[::2] for k in split)
    {'2013-06-05T11:01:02.955': '', 'EVENT_TIME_UTC': '1370480141.000', 'FIRSTNAME': 'Jason', 'LASTNAME': 'Jone', 'REGION': 'UTAH', 'ACCESS_TYPE': 'Access Granted', 'PERSONNELID': 'salalm', 'QID': '231412', 'READER_NAME': 'CAZ.1 LOBBY LEFT TURNSTYLE OUT'}

Et cetera. The key point, as he put it nicely, is that the syntax you showed is very similar to minimal shell syntax. OTOH, if other lines in your data break the pattern you showed, you might want to fall back to writing a custom parser. The shlex approach is convenient when it applies, but it may not be as robust as you need.
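One way the pattern can break: shlex.split raises ValueError when a line contains an unbalanced quote, for example an apostrophe in a name. The sketch below tries shlex first and falls back to a simple regex in that case. The function name parse_line, the PAIR_RE pattern, and the sample line with O'Brien are assumptions made up for illustration; the fallback is best-effort only and may truncate a value whose quoting is damaged.

```python
import re
import shlex

# Fallback pattern: name, '=', then a double-quoted string or non-whitespace run.
PAIR_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_line(line):
    """Parse name=value pairs with shlex, falling back to a regex on error."""
    try:
        tokens = shlex.split(line)
    except ValueError:
        # shlex could not tokenize the line (e.g. "No closing quotation").
        return {name: value.strip('"') for name, value in PAIR_RE.findall(line)}
    return dict(tok.split("=", 1) for tok in tokens if "=" in tok)


good = 'PERSONNELID=salalm ACCESS_TYPE="Access Granted" REGION=UTAH'
# Hypothetical malformed line: the apostrophe opens a quote that never closes.
bad = "LASTNAME=O'Brien PERSONNELID=ob REGION=UTAH"

print(parse_line(good))
print(parse_line(bad))
```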
