Unstructured text analysis in Python

I wanted to parse a text file containing unstructured text. I need to get the address, date of birth, name, gender and identifier.

. 55 MORILLO ZONE VIII,
BARANGAY ZONE VIII
(POB.), LUISIANA, LAGROS
F
01/16/1952
ALOMO, TERESITA CABALLES
3412-00000-A1652TCA2
12    
. 22 FABRICANTE ST. ZONE
VIII LUISIANA LAGROS,
BARANGAY ZONE VIII
(POB.), LUISIANA, LAGROS
M
10/14/1967
AMURAO, CALIXTO MANALO13

In the above example, the first three lines are the address, the line with "F" is the gender, DOB is the line after "F", name after DOB, identifier after name, and no. 12 under the identifier - index / record number

However, the format is not consistent. In the second group, the address is 4 lines instead of 3 and there is no index / record. added after the name (if the person does not have an identifier field).

I wanted to rewrite the text in the following format:

name, ID, address, sex, DOB
+3
source share
5 answers

( pyparsing pastebin). , .

data = """\
. 55 MORILLO ZONE VIII,
BARANGAY ZONE VIII
(POB.), LUISIANA, LAGROS
F
01/16/1952
ALOMO, TERESITA CABALLES
3412-00000-A1652TCA2
12
. 22 FABRICANTE ST. ZONE
VIII LUISIANA LAGROS,
BARANGAY ZONE VIII
(POB.), LUISIANA, LAGROS
M
10/14/1967
AMURAO, CALIXTO MANALO13
"""

from pyparsing import LineEnd, oneOf, Word, nums, Combine, restOfLine, \
    alphanums, Suppress, empty, originalTextFor, OneOrMore, alphas, \
    Group, ZeroOrMore

NL = LineEnd().suppress()
gender = oneOf("M F")
integer = Word(nums)
date = Combine(integer + '/' + integer + '/' + integer)

# define the simple line definitions
gender_line = gender("sex") + NL
dob_line = date("DOB") + NL
name_line = restOfLine("name") + NL
id_line = Word(alphanums+"-")("ID") + NL
recnum_line = integer("recnum") + NL

# define forms of address lines
first_addr_line = Suppress('.') + empty + restOfLine + NL
# a subsequent address line is any line that is not a gender definition
subsq_addr_line = ~(gender_line) + restOfLine + NL

# a line with a name and a recnum combined, if there is no ID
name_recnum_line = originalTextFor(OneOrMore(Word(alphas+',')))("name") + \
    integer("recnum") + NL

# defining the form of an overall record, either with or without an ID
record = Group((first_addr_line + ZeroOrMore(subsq_addr_line))("address") + 
    gender_line + 
    dob_line +
    ((name_line +
        id_line + 
        recnum_line) |
      name_recnum_line))

# parse data
records = OneOrMore(record).parseString(data)

# output the desired results (note that address is actually a list of lines)
for rec in records:
    if rec.ID:
        print "%(name)s, %(ID)s, %(address)s, %(sex)s, %(DOB)s" % rec
    else:
        print "%(name)s, , %(address)s, %(sex)s, %(DOB)s" % rec
print

# how to access the individual fields of the parsed record
for rec in records:
    print rec.dump()
    print rec.name, 'is', rec.sex
    print

ALOMO, TERESITA CABALLES, 3412-00000-A1652TCA2, ['55 MORILLO ZONE VIII,', 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS'], F, 01/16/1952
AMURAO, CALIXTO MANALO, , ['22 FABRICANTE ST. ZONE', 'VIII LUISIANA LAGROS,', 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS'], M, 10/14/1967

['55 MORILLO ZONE VIII,', 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS', 'F', '01/16/1952', 'ALOMO, TERESITA CABALLES', '3412-00000-A1652TCA2', '12']
- DOB: 01/16/1952
- ID: 3412-00000-A1652TCA2
- address: ['55 MORILLO ZONE VIII,', 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS']
- name: ALOMO, TERESITA CABALLES
- recnum: 12
- sex: F
ALOMO, TERESITA CABALLES is F

['22 FABRICANTE ST. ZONE', 'VIII LUISIANA LAGROS,', 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS', 'M', '10/14/1967', 'AMURAO, CALIXTO MANALO', '13']
- DOB: 10/14/1967
- address: ['22 FABRICANTE ST. ZONE', 'VIII LUISIANA LAGROS,', 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS']
- name: AMURAO, CALIXTO MANALO
- recnum: 13
- sex: M
AMURAO, CALIXTO MANALO is M
+12

.

, , person. , , .

+4

, . , .

, . . Mallet CRF ++.

+3

. , python, redemo.py( , c:\python26\Tools\scripts).

( ). , , , , :

import re
re_entity_splitter = re.compile(r'^\.')

entities = re_entity_splitter.split(open(textfile).read())

, ( ). r . R "raw string", escape-, " ".

After you divide the file into individuals, you can choose the gender and date of birth. Use them:

re_gender     = re.compile(r'^[MF]')
re_birth_Date = re.compile(r'\d\d/\d\d/\d\d')

And go away. You can insert a flat file into the demo GUI and experiment with creating templates according to what you need. You disassembled it as soon as possible. Once you achieve this, you can use the names of symbolic groups (see Documents) to quickly and cleanly highlight individual elements.

+2
source

Hack work quickly here.

f = open('data.txt')

def process(file):
    address = ""

    for line in file:
        if line == '': raise StopIteration
        line = line.rstrip() # to ignore \n
        if line in ('M','F'):
            sex = line
            break
        else:
            address += line

    DOB = file.readline().rstrip() # to ignore \n
    name = file.readline().rstrip()

    if name[-1].isdigit():
        name = re.match(r'^([^\d]+)\d+', name).group(1)
        ID = None
    else:
        ID = file.readline().rstrip()
        file.readline() # ignore the record #

    print (name, ID, address, sex, DOB)

while True:
    process(f)
+1
source

Source: https://habr.com/ru/post/1717606/


All Articles