How to strip variable spaces in each line of a text file based on a special condition: a one-liner in Python?

I have some data (text files) that are formatted in the most uneven way. I am trying to minimize the amount of manual work on parsing this data.

Data examples

Name        Degree      CLASS       CODE        EDU     Scores
--------------------------------------------------------------------------------------
John Marshall       CSC   78659944   89989        BE   900
Think Code DB I10   MSC  87782  1231  MS            878
Mary 200 Jones    CIVIL      98993483  32985        BE       898
John G. S  Mech 7653 54 MS 65
Silent Ghost  Python Ninja 788505  88448  MS Comp  887

Conditions:

  • Runs of more than one space should be compressed into a delimiter (a pipe, perhaps?). The end goal is to store these files in a database.
  • Except for the first column, the columns contain no spaces, so all the spaces between them can be compressed into a pipe.
  • Only the first column can contain multiple words separated by spaces (Mary K Jones). The remaining columns are mostly numbers and some letters.
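  • The first and second columns are both strings. They almost always have more than one space between them, which is how the two can be told apart. (If there is only a single space, that is a risk I am willing to take, given the horrible formatting!).
  • The number of columns varies, so there is no need to worry about column names. All I want is to extract each column's data.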

I hope that makes sense! I have a feeling this task can be done in a oneliner. I don't want to loop, loop, loop :(

Muchas gracias, "Pythonistas"!


Assuming the lines follow a consistent pattern, a regex can pull out the fields:

>>> import re
>>> # s holds the whole text of the file
>>> regex = r'^(.+)\b\s{2,}\b(.+)\s+(\d+)\s+(\d+)\s+(.+)\s+(\d+)'
>>> for line in s.splitlines():
    lst = [i.strip() for j in re.findall(regex, line) for i in j if j]
    print(lst)


[]
[]
['John Marshall', 'CSC', '78659944', '89989', 'BE', '900']
['Think Code DB I10', 'MSC', '87782', '1231', 'MS', '878']
['Mary 200 Jones', 'CIVIL', '98993483', '32985', 'BE', '898']
['John G. S', 'Mech', '7653', '54', 'MS', '65']
['Silent Ghost', 'Python Ninja', '788505', '88448', 'MS Comp', '887']

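The regex is fairly rigid: it relies on two or more whitespace characters (\s) between word breaks (\b) to separate the first two columns. When a line does not match, as with the header and the dashes line, you get an empty lst. To skip those first two lines, read from the file like this: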

>>> file = open(fname)
>>> [next(file) for _ in range(2)]
>>> for line in file:
    ...  # here empty lst indicates issues with regex
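
One way to act on that empty-list signal is to set aside the lines the regex rejects so they can be reviewed by hand. This is only a sketch under the same pattern as above; the `parse_lines` helper and its return shape are hypothetical and not part of the original answer:

```python
import re

# Same pattern as in the answer: name, degree, two numeric columns, EDU, score.
PATTERN = re.compile(r'^(.+)\b\s{2,}\b(.+)\s+(\d+)\s+(\d+)\s+(.+)\s+(\d+)')

def parse_lines(lines):
    """Return (rows, rejected): parsed field lists and the lines that did not match."""
    rows, rejected = [], []
    for line in lines:
        match = PATTERN.match(line)
        if match:
            # Greedy groups can pick up trailing spaces, hence the strip().
            rows.append([field.strip() for field in match.groups()])
        else:
            rejected.append(line)
    return rows, rejected

sample = [
    "Name        Degree      CLASS       CODE        EDU     Scores",
    "John Marshall       CSC   78659944   89989        BE   900",
    "John G. S  Mech 7653 54 MS 65",
]
rows, rejected = parse_lines(sample)
print(rows)
print(rejected)
```

Here the header line ends up in `rejected` because it contains no digits, while the two data lines are parsed into six fields each.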

Alternatively, split each line on runs of two or more whitespace characters:

>>> import re
>>> for line in open(fname):
    lst = re.split(r'\s{2,}', line.strip())  # strip() drops the trailing newline
    l = len(lst)
    if l in (2,3):
        lst[l-1:] = lst[l-1].split()
    print(lst)

['Name', 'Degree', 'CLASS', 'CODE', 'EDU', 'Scores']
['--------------------------------------------------------------------------------------']
['John Marshall', 'CSC', '78659944', '89989', 'BE', '900']
['Think Code DB I10', 'MSC', '87782', '1231', 'MS', '878']
['Mary 200 Jones', 'CIVIL', '98993483', '32985', 'BE', '898']
['John G. S', 'Mech', '7653', '54', 'MS', '65']

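If the split yields too few fields, one fallback is to ask the user which elements belong to which column: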

if l < 3:
    lst = line.split()
    print(lst)
    iname = input('enter indexes of the elements of name: ')     # use raw_input in py2k
    idegr = input('enter indexes of the elements of degree: ')

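Uhm, given the stated conditions, the task looks simpler than that: split the name off at the first double space, then split the rest on whitespace: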

>>> for line in open(fname):
    name, _, rest = line.partition('  ')
    lst = [name] + rest.split()
    print(lst)
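
Since the stated goal is a pipe-delimited result for loading into a database, the split can be followed by a join. A minimal sketch, assuming at least two spaces follow the name as the question describes; `to_pipe_row` is a hypothetical helper name:

```python
def to_pipe_row(line):
    """Split off the name at the first double space, then pipe-join all fields."""
    name, _, rest = line.strip().partition('  ')
    return '|'.join([name] + rest.split())

row = to_pipe_row("Mary 200 Jones    CIVIL      98993483  32985        BE       898")
print(row)
# Mary 200 Jones|CIVIL|98993483|32985|BE|898
```

Any later column that does contain a space, such as "Python Ninja" in the sample data, would come out as two separate fields with this approach.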

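A variation on SilentGhost's answer that splits the name off with a regex, so any run of two or more whitespace characters counts as the separator: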

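import re

for line in open(fname):
    # Split the name off at the first run of two or more whitespace characters.
    # Note: a line without such a run (the dashes line, say) makes the unpacking fail.
    name, rest = re.split(r'\s{2,}', line, maxsplit=1)
    print([name] + rest.split())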

If the file contains tab characters ("\t"), expand them to spaces first (3 spaces, for example).


Tabs can be expanded with line.replace('\t', ' ' * 3) or with line.expandtabs(); see the Python documentation for expandtabs.
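
The two calls are not interchangeable, which a small comparison makes visible. This is an illustrative sketch with a made-up input string, not data from the question:

```python
line = "ab\tcd\tef"

# replace() swaps every tab for exactly three spaces, wherever it occurs.
fixed = line.replace('\t', ' ' * 3)

# expandtabs(3) pads each tab up to the next multiple-of-3 column instead.
aligned = line.expandtabs(3)

print(repr(fixed))    # 'ab   cd   ef'
print(repr(aligned))  # 'ab cd ef'
```

Because expandtabs() can turn a tab into a single space, a later split on two or more spaces may no longer see that position as a column boundary.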


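Also, what about the line of "-" characters: is it always one unbroken run? In some files the dashes are broken up by spaces and mark the column widths, like this: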

RecordType  ID1                  ID2         Description           
----------- -------------------- ----------- ----------------------
1           12345678             123456      Widget                
4           87654321             654321      Gizmoid

In that case the data is fixed-width, and the column sizes can be read from the dashes line:

sizes = [len(dashes) for dashes in dash_line.split()]  # a list, also under Python 3
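
Once the dash runs are known, each data line can be sliced at the same positions. A minimal sketch, assuming tabs have already been expanded and that every dash run marks the extent of one column; `split_fixed_width` is a hypothetical helper, and the sample strings are rebuilt here with the same widths as the example above:

```python
import re

def split_fixed_width(line, dash_line):
    """Slice `line` at the positions of the dash runs found in `dash_line`."""
    return [line[m.start():m.end()].strip() for m in re.finditer(r'-+', dash_line)]

# Dash runs of width 11, 20, 11 and 22, separated by single spaces.
dash_line = " ".join("-" * width for width in (11, 20, 11, 22))
data_line = "1".ljust(12) + "12345678".ljust(21) + "123456".ljust(12) + "Widget"

fields = split_fixed_width(data_line, dash_line)
print(fields)
# ['1', '12345678', '123456', 'Widget']
```

Slicing by position keeps values that contain spaces intact, which splitting on whitespace cannot do.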

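If expandtabs() does not sort things out, it would help to see the output of print(repr(line)) for the first 5 lines or so, since repr makes tabs and other hidden characters visible.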


Source: https://habr.com/ru/post/1768258/

