Find a list in a string

I see many examples of finding strings in strings or searching for strings in lists, but how do I find a list in a string? For example, I have a csv file with columns of data, and the last column is either a string or, sometimes, a list. Here is a subset of the data showing only the last 3 columns.

TRUE, 93877, S26476961
TRUE, 93878, ['S26489167', 'S26492524']
FALSE, 93879, S26476962
FALSE, 93880, ['S26489168', 'S26492527', 'S26492528']

At first I tried splitting each line of the csv file on commas, but that also splits on the commas inside the list (creating additional columns). I just want the list to be recognized as a single piece of data, so I can work with it as a list of n elements. The comment from @TemporalWolf helps a lot, because if I use the csv module (specifically csv.reader) like this ...

reader = csv.reader(inFile)
for row in reader:
    print(row)

it stores the list in a single column. The remaining problem is that this is just a string. In other words, row[n][0] returns the left bracket ([), but I want it to be a list.

+4
3 answers

This relies on the ' character with which the items in your list are quoted. Using that, you can split only on commas that are neither preceded nor followed by this character, with a regular expression:

import re
import pandas as pd
import io


text = """TRUE, 93877, S26476961
TRUE, 93878, ['S26489167', 'S26492524']
FALSE, 93879, S26476962
FALSE, 93880, ['S26489168', 'S26492527', 'S26492528']"""

with io.StringIO(text) as f:
    for line in f:
        print(re.split("(?<!'), (?!')", line.strip()))


# ['TRUE', '93877', 'S26476961']
# ['TRUE', '93878', "['S26489167', 'S26492524']"]
# ['FALSE', '93879', 'S26476962']
# ['FALSE', '93880', "['S26489168', 'S26492527', 'S26492528']"]

# Or with pandas

with io.StringIO(text) as f:
    print(pd.read_csv(f,
                  header=None,
                  sep="(?<!'), (?!')",
                  engine='python'))

#        0      1                                        2
# 0   True  93877                                S26476961
# 1   True  93878               ['S26489167', 'S26492524']
# 2  False  93879                                S26476962
# 3  False  93880  ['S26489168', 'S26492527', 'S26492528']
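
To see why the lookarounds matter, here is a small sketch comparing a plain split with the guarded split on one of the sample lines. The last case uses a hypothetical line with an unquoted numeric list, which is not in the original data, to show that the pattern only protects commas that sit next to a quote character.

```python
import re

# Separator from the answer above: comma-space with no single quote on either side.
SEPARATOR = r"(?<!'), (?!')"

line = "TRUE, 93878, ['S26489167', 'S26492524']"

# A plain split also cuts inside the list, giving four fields.
naive = line.split(", ")

# The guarded split keeps the quoted list together, giving three fields.
guarded = re.split(SEPARATOR, line)

# Hypothetical line (not from the original data): the list items are unquoted,
# so the lookarounds cannot protect the inner comma.
unquoted = re.split(SEPARATOR, "TRUE, 93881, [1, 2]")

print(naive)     # 4 fields
print(guarded)   # 3 fields
print(unquoted)  # 4 fields, the list is cut in two
```

So the approach is only safe as long as every list item is wrapped in single quotes.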

Edit:

If you are using Python 2, you will need to make the text unicode (by prefixing the string with u) in order to be able to use io.StringIO:

import re
import pandas as pd
import io


text = u"""TRUE, 93877, S26476961
TRUE, 93878, ['S26489167', 'S26492524']
FALSE, 93879, S26476962
FALSE, 93880, ['S26489168', 'S26492527', 'S26492528']"""

with io.StringIO(text) as f:
    for line in f:
        print(re.split("(?<!'), (?!')", line.strip()))


# ['TRUE', '93877', 'S26476961']
# ['TRUE', '93878', "['S26489167', 'S26492524']"]
# ['FALSE', '93879', 'S26476962']
# ['FALSE', '93880', "['S26489168', 'S26492527', 'S26492528']"]

# Or with pandas

with io.StringIO(text) as f:
    print(pd.read_csv(f,
                  header=None,
                  sep="(?<!'), (?!')",
                  engine='python'))

#        0      1                                        2
# 0   True  93877                                S26476961
# 1   True  93878               ['S26489167', 'S26492524']
# 2  False  93879                                S26476962
# 3  False  93880  ['S26489168', 'S26492527', 'S26492528']

Edit 2:

If you want the list column as an actual list rather than a string, you can use ast.literal_eval:

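import ast
import io
import re


# `text` is the same sample string as above
with io.StringIO(text) as f:
    for line in f:
        # split only on the comma-space that precedes an opening bracket
        parts = re.split(r", (?=\[)", line.strip())
        row = []
        for part in parts:
            if "[" in part and "]" in part:
                # a list literal: convert it into a real list
                row.append(ast.literal_eval(part))
            else:
                # plain columns: split them on comma-space
                row += part.split(", ")
        print(row)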

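This splits each line on the comma that precedes an opening bracket, then handles each part separately:

  • A part containing both [ and ] is converted to a real list with ast.literal_eval.
  • Any other part is split on ", " into plain string columns.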
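
The split in Edit 2 expects the list to be the last column: a list followed by further columns ends up in one part that would not come back from ast.literal_eval as a single list. As a hedged sketch, one way to handle a list in any column is to reuse the lookaround separator from the first part of this answer and convert bracketed fields afterwards. The name parse_line is hypothetical, and this still assumes every list item is wrapped in single quotes.

```python
import ast
import re

# Comma-space with no single quote on either side (same pattern as above).
SEPARATOR = re.compile(r"(?<!'), (?!')")


def parse_line(line):
    """Split one line and turn any bracketed field into a real list."""
    parsed = []
    for field in SEPARATOR.split(line.strip()):
        if field.startswith("[") and field.endswith("]"):
            # Safely evaluate the list literal; no arbitrary code is run.
            parsed.append(ast.literal_eval(field))
        else:
            parsed.append(field)
    return parsed


row = parse_line("FALSE, 93880, ['S26489168', 'S26492527', 'S26492528']")
print(row)
print(len(row[2]))  # the last column is now a list of 3 elements
```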

+4

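Another approach: let Python's own parser do the work. Split the line on commas and, whenever a section is not valid Python on its own, join it with the next section: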

import ast

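def get_columns(line):
    def valid(code):
        # a section is complete if it parses as Python on its own
        try:
            ast.parse(code.strip())
        except SyntaxError:
            return False
        return True
    sections = line.split(',')
    columns = []
    for i, section in enumerate(sections):
        if i == len(sections) - 1 or valid(section):
            # strip the padding (and the trailing newline on the last column)
            columns.append(section.strip())
        else:
            # incomplete (e.g. an unclosed bracket): merge with the next section
            sections[i + 1] = ','.join([section, sections[i + 1]])
    return columns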

with open(inFile) as f:
    for line in f:
        for column in get_columns(line):
            print(column)

This works in both Python 2 and 3.
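
The columns produced this way are still strings, so the list column needs one more step before it can be used as a list. A minimal sketch, assuming a hypothetical helper named to_value and hand-written sample columns in the shape get_columns produces (with or without surrounding whitespace):

```python
import ast


def to_value(column):
    """Return a list if the column text is a Python list literal, else the stripped text."""
    text = column.strip()
    if text.startswith("["):
        try:
            return ast.literal_eval(text)
        except (ValueError, SyntaxError):
            # Not a valid literal after all: fall back to the plain text.
            pass
    return text


# Hand-written sample columns, not read from a file.
columns = ["TRUE", " 93878", " ['S26489167', 'S26492524']\n"]
values = [to_value(column) for column in columns]
print(values)
# ['TRUE', '93878', ['S26489167', 'S26492524']]
```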

+2

Here is a different approach. It finds lists in the rows returned by csv.reader by checking for a leading [ and a trailing ] in the items of each row.

import csv

def find_lists(row):
    sublist = []
    for item in row:
        if not sublist:
            if item.startswith('['):
                if item.endswith(']'):
                    yield [item[1:-1]]
                else:
                    sublist.append(item[1:])
            else:
                yield item
        else:
            if item.endswith(']'):
                sublist.append(item[:-1])
                yield sublist
                sublist = []
            else:
                sublist.append(item)
    for item in sublist:
        yield item

with open('test.csv') as infile:
    reader = csv.reader(infile, skipinitialspace=True)
    for row in reader:
        print(list(find_lists(row)))
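
Because the default quote character of csv.reader is the double quote, the single quotes around each list item are not removed, so the sublists yielded by find_lists still contain them ("'S26489168'" rather than "S26489168"). A minimal sketch of cleaning that up, assuming a hypothetical helper named strip_quotes and a hand-written sample in the shape find_lists yields for the last data row:

```python
def strip_quotes(value):
    """Remove surrounding single quotes from each element of a sublist."""
    if isinstance(value, list):
        return [element.strip("'") for element in value]
    return value


# Hand-written sample, not read from test.csv.
items = ["FALSE", "93880", ["'S26489168'", "'S26492527'", "'S26492528'"]]

cleaned = [strip_quotes(item) for item in items]
print(cleaned)
# ['FALSE', '93880', ['S26489168', 'S26492527', 'S26492528']]
```
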
0

Source: https://habr.com/ru/post/1682882/

