Pandas: error reading CSV file using `sep` and `comment` arguments

Situation

I need to create a pandas DataFrame from a CSV file that has the following characteristics:

  • The separator used by the file can be either a comma or a space, and I don’t know in advance which one a given file uses.
  • At the top of the file, there can be one or more comment lines starting with #.

Problem

I tried to solve this problem using pd.read_csv with the arguments sep=None and comment='#'. As I understand it, sep=None tells pandas to detect the delimiter character automatically, and comment='#' tells pandas that all lines starting with # are comment lines that should be ignored.

These arguments work fine when used individually. However, when I use them together, I get the error TypeError: expected string or bytes-like object. The following code example demonstrates this:

from io import StringIO
import pandas as pd

# Simulated data file contents
tabular_data = (
    '# Data generated on 04 May 2017\n'
    'col1,col2,col3\n'
    '5.9,7.8,3.2\n'
    '7.1,0.4,8.1\n'
    '9.4,5.4,1.9\n'
)

# This works
df1 = pd.read_csv(StringIO(tabular_data), sep=None)
print(df1)

# This also works
df2 = pd.read_csv(StringIO(tabular_data), comment='#')
print(df2)

# This will give an error
df3 = pd.read_csv(StringIO(tabular_data), sep=None, comment='#')
print(df3)

Unfortunately, I really don't understand what causes the error. Can anyone here help me solve this problem?

1 answer

Try the following:

In [186]: df = pd.read_csv(StringIO(tabular_data), sep=r'(?:,|\s+)',
                           comment='#', engine='python')

In [187]: df
Out[187]:
   col1  col2  col3
0   5.9   7.8   3.2
1   7.1   0.4   8.1
2   9.4   5.4   1.9

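r'(?:,|\s+)' is a regular expression that matches either a single comma or a run of one or more consecutive whitespace characters (spaces or tabs). A regular-expression separator like this is only supported by the Python parsing engine, which is why engine='python' is passed explicitly.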
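
To check that the same call handles both kinds of file, here is a small sketch that reads a comma-separated and a space-separated version of the sample data. The helper name read_either and the constant names are made up for this illustration. The second pattern, ROBUST_SEP, is not part of the answer above; it is a suggested variant for files that might put a space after the comma, where the original pattern would split out an extra empty field.

```python
import re
from io import StringIO

import pandas as pd

# Pattern from the answer: a single comma OR a run of whitespace.
SEP = r'(?:,|\s+)'
# Hypothetical variant (an assumption, not from the answer): also absorbs
# whitespace around a comma, so "a, b" splits into two fields, not three.
ROBUST_SEP = r'\s*,\s*|\s+'

comma_data = (
    '# Data generated on 04 May 2017\n'
    'col1,col2,col3\n'
    '5.9,7.8,3.2\n'
    '7.1,0.4,8.1\n'
)
space_data = (
    '# Data generated on 04 May 2017\n'
    'col1 col2 col3\n'
    '5.9 7.8 3.2\n'
    '7.1 0.4 8.1\n'
)


def read_either(text, sep=SEP):
    """Read comma- or space-delimited text, skipping '#' comment lines."""
    return pd.read_csv(StringIO(text), sep=sep, comment='#', engine='python')


df_comma = read_either(comma_data)
df_space = read_either(space_data)
print(df_comma.equals(df_space))

# Splitting a line by hand shows the difference between the two patterns
# when a comma is followed by a space.
print(re.split(SEP, '5.9, 7.8'))
print(re.split(ROBUST_SEP, '5.9, 7.8'))
```

If the files never mix a comma with surrounding spaces, the pattern from the answer is enough; the variant only matters for input such as '5.9, 7.8'.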


Source: https://habr.com/ru/post/1676371/

