Calculate values from a block of text based on specific keys

Question

I am parsing text from a source that is beyond my control, which is not in a very convenient format. I have lines like this:

Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.

I want to split such a line into keys and values like this:

Problem_Category = "Human Endeavors"
Problem_Subcategory = "Space Exploration"
Problem_Type = "Failure to Launch"
Software_Version = "9.8.77.omni.3"
Problem_Details = "Issue with signal barrier chamber."

The keys will always be in the same order and are always followed by a colon, but there is not necessarily a space or a newline between a value and the next key. I'm not sure what can be used as a delimiter for parsing this, since colons and spaces can also appear in the values. How can I parse this text?

python regex parsing

Jesse bikman Oct 3 '14 at 13:30

4 answers

unutbu · Answer 1 · 2014-10-03T13:43:18+0000

If your block of text is a single string:

text = 'Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.'

Then

import re
names = ['Problem Category', 'Problem Subcategory', 'Problem Type', 'Software Version', 'Problem Details']

text = 'Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.'

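# The capture group makes re.split keep each matched key in the result
pat = r'({}):'.format('|'.join(names))
# Drop the empty string before the first key, then pair up key, value, key, value, ...
# (the third positional argument of re.split is maxsplit, so no flag is passed here)
data = dict(zip(*[iter(re.split(pat, text)[1:])]*2))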
print(data)

gives the dict

{'Problem Category': ' Human Endeavors ',
 'Problem Details': ' Issue with signal barrier chamber.',
 'Problem Subcategory': ' Space Exploration',
 'Problem Type': ' Failure to Launch',
 'Software Version': ' 9.8.77.omni.3'}

So you can assign

text = df_dict['NOTE_DETAILS'][0]
...
df_dict['NOTE_DETAILS'][0] = data

and then you can access the sub-categories with dict indexing:

df_dict['NOTE_DETAILS'][0]['Problem Category']

Note, however, that nesting dicts inside dicts/DataFrames is usually not a good idea. As the Zen of Python says, "Flat is better than nested."
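
A minimal sketch building on the snippet above, assuming the values should have their surrounding whitespace stripped and the keys should use underscores as in the question. The helper name `parse_fields` is hypothetical and not part of the original answer; `re.escape` is added so that key names containing regex metacharacters are matched literally.

```python
import re

NAMES = ['Problem Category', 'Problem Subcategory', 'Problem Type',
         'Software Version', 'Problem Details']

def parse_fields(text, names=NAMES):
    """Split text on the known keys; return {key_with_underscores: stripped value}."""
    # Escape each name so it is matched literally; the capture group makes
    # re.split keep the matched key in the result list.
    pat = r'({}):'.format('|'.join(re.escape(n) for n in names))
    parts = re.split(pat, text)[1:]  # drop whatever precedes the first key
    # parts alternates key, value, key, value, ...
    return {key.replace(' ', '_'): value.strip()
            for key, value in zip(parts[0::2], parts[1::2])}

text = ('Problem Category: Human Endeavors Problem Subcategory: Space Exploration'
        'Problem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3'
        'Problem Details: Issue with signal barrier chamber.')
fields = parse_fields(text)
print(fields['Problem_Category'])  # Human Endeavors
```

Returning a flat dict of plain strings keeps the result easy to spread across separate columns instead of storing a dict inside a cell.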

davidism · Answer 2 · 2014-10-03T14:22:02+0000

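Since the keys are known in advance and always appear in the same order, you can walk through them with str.partition, without regex.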

# get input from somewhere
raw = 'Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.'

# these are the keys, in order, without the colon, that will be captured
keys = ['Problem Category', 'Problem Subcategory', 'Problem Type', 'Software Version', 'Problem Details']
prev_key = None
remaining = raw
out = {}

for key in keys:
    # get the value from before the key and after the key
    prev_value, _, remaining = remaining.partition(key + ':')

    # start storing values after the first iteration, since we need to partition the second key to get the first value
    if prev_key is not None:
        out[prev_key] = prev_value.strip()

    # what key to store next iteration
    prev_key = key

# capture the final value (since it lags behind the parse loop)
out[prev_key] = remaining.strip()

# out now contains the parsed values, print it out nicely
for key in keys:
    print('{}: {}'.format(key, out[key]))

Output:

Problem Category: Human Endeavors
Problem Subcategory: Space Exploration
Problem Type: Failure to Launch
Software Version: 9.8.77.omni.3
Problem Details: Issue with signal barrier chamber.
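
The loop above can be wrapped in a reusable function. This is a sketch under one added assumption that is not in the original answer: a key that is absent from the input should raise an error rather than silently produce wrong values. The name `parse_in_order` is hypothetical.

```python
def parse_in_order(raw, keys):
    """Parse 'key: value' pairs whose keys appear in the given order."""
    out = {}
    remaining = raw
    prev_key = None
    for key in keys:
        before, sep, remaining = remaining.partition(key + ':')
        if not sep:
            # str.partition returns an empty separator when nothing matched
            raise ValueError('missing key: {!r}'.format(key))
        # the text before this key is the value of the previous key
        if prev_key is not None:
            out[prev_key] = before.strip()
        prev_key = key
    # the final value is whatever follows the last key
    if prev_key is not None:
        out[prev_key] = remaining.strip()
    return out

raw = ('Problem Category: Human Endeavors Problem Subcategory: Space Exploration'
       'Problem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3'
       'Problem Details: Issue with signal barrier chamber.')
keys = ['Problem Category', 'Problem Subcategory', 'Problem Type',
        'Software Version', 'Problem Details']
parsed = parse_in_order(raw, keys)
print(parsed['Problem Type'])  # Failure to Launch
```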

Kevin · Answer 3 · 2014-10-03T14:53:56+0000

Here is an approach that does not use regex.

#splits a string using multiple delimiters.
def multi_split(s, delims):
    strings = [s]
    for delim in delims:
        strings = [x for s in strings for x in s.split(delim) if x]
    return strings

s = "Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber."
categories = ["Problem Category", "Problem Subcategory", "Problem Type", "Software Version", "Problem Details"]
headers = [category + ": " for category in categories]

details = multi_split(s, headers)
print(details)

details_dict = dict(zip(categories, details))
print(details_dict)

Output (reformatted for readability):

[
    'Human Endeavors ', 
    'Space Exploration', 
    'Failure to Launch', 
    '9.8.77.omni.3', 
    'Issue with signal barrier chamber.'
]

{
    'Problem Subcategory': 'Space Exploration', 
    'Problem Details': 'Issue with signal barrier chamber.', 
    'Problem Category': 'Human Endeavors ', 
    'Software Version': '9.8.77.omni.3', 
    'Problem Type': 'Failure to Launch'
}
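
Because `multi_split` discards empty pieces and `zip` pairs items by position, a missing or empty field would silently shift values onto the wrong keys. A minimal sketch of a guard, assuming every key must be present with a non-empty value; `split_to_dict` is a hypothetical helper, not part of the original answer, and it also strips the values.

```python
def multi_split(s, delims):
    """Split a string on each delimiter in turn, dropping empty pieces."""
    strings = [s]
    for delim in delims:
        strings = [piece for part in strings for piece in part.split(delim) if piece]
    return strings

def split_to_dict(s, categories):
    """Pair each category with its value, refusing mismatched counts."""
    headers = [category + ": " for category in categories]
    details = multi_split(s, headers)
    if len(details) != len(categories):
        raise ValueError('expected {} values, found {}'.format(
            len(categories), len(details)))
    return dict(zip(categories, (d.strip() for d in details)))

s = ("Problem Category: Human Endeavors Problem Subcategory: Space Exploration"
     "Problem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3"
     "Problem Details: Issue with signal barrier chamber.")
categories = ["Problem Category", "Problem Subcategory", "Problem Type",
              "Software Version", "Problem Details"]
result = split_to_dict(s, categories)
print(result["Problem Category"])  # Human Endeavors
```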

rns · Answer 4 · 2014-10-03T15:44:30+0000

This is a job for general BNF parsing, which handles ambiguity well. I used Perl and Marpa, a general BNF parser. Hope this helps.

use 5.010;
use strict;
use warnings;

use Marpa::R2;

my $g = Marpa::R2::Scanless::G->new( { source => \(<<'END_OF_SOURCE'),

    :default ::= action => [ name, values ]

    pairs ::= pair+

    pair ::= name (' ') value

    name ::= 'Problem Category:'
    name ::= 'Problem Subcategory:'
    name ::= 'Problem Type:'
    name ::= 'Software Version:'
    name ::= 'Problem Details:'

    value ::= [\s\S]+

    :discard ~ whitespace
    whitespace ~ [\s]+

END_OF_SOURCE
} );

my $input = <<EOI;
Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.
EOI

my $ast = ${ $g->parse( \$input ) };

my @pairs;

ast_traverse($ast);

for my $pair (@pairs){
    my ($name, $value) = @$pair;
    say "$name = $value";
}

sub ast_traverse{
    my $ast = shift;
    if (ref $ast){
        my ($id, @children) = @$ast;
        if ($id eq 'pair'){

            my ($name, $value) = @children;

            chop $name->[1];

            shift @$value;
            $value = join('', @$value);
            chomp $value;

            push @pairs, [ $name->[1], '"' . $value . '"' ];
        }
        else {
            ast_traverse($_) for @children;
        }
    }
}

Output:

Problem Category = "Human Endeavors "
Problem Subcategory = "Space Exploration"
Problem Type = "Failure to Launch"
Software Version = "9.8.77.omni.3"
Problem Details = "Issue with signal barrier chamber."
