How to find the input line with mixed types

I read a big CSV file into pandas with:

features = pd.read_csv(filename, header=None, names=['Time','Duration','SrcDevice','DstDevice','Protocol','SrcPort','DstPort','SrcPackets','DstPackets','SrcBytes','DstBytes'], usecols=['Duration','SrcDevice', 'DstDevice', 'Protocol', 'DstPort','SrcPackets','DstPackets','SrcBytes','DstBytes'])

I get:

sys:1: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.
  %!PS-Adobe-3.0

How can I find the first line of input that triggers this warning? I need to do this in order to debug a problem with an input file that should not have mixed types.

2 answers
for endrow in range(1000, 4000000, 1000):
    startrow = endrow - 1000
    rows = 1000
    try:
        pd.read_csv(filename, dtype={"DstPort": int}, skiprows=startrow, nrows=rows, header=None,
                names=['Time','Duration','SrcDevice','DstDevice','Protocol','SrcPort',
                       'DstPort','SrcPackets','DstPackets','SrcBytes','DstBytes'],
                usecols=['Duration','SrcDevice', 'DstDevice', 'Protocol', 'DstPort',
                         'SrcPackets','DstPackets','SrcBytes','DstBytes'])
    except ValueError:
        print(f"Error is from row {startrow} to row {endrow}")

Read the file in separate blocks of 1000 rows, forcing DstPort to int, to see which range of rows contains the mixed-type value that causes the problem.
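Once a range is flagged, the offending values can be pinned down more precisely. The following is a minimal sketch, assuming the suspect column is supposed to be numeric: it reads only that column as text and reports the values that pd.to_numeric cannot convert. The helper name find_non_numeric_rows and the sample data are illustrative and not part of the original answer.

```python
import io

import pandas as pd


def find_non_numeric_rows(source, column, names, chunksize=1000):
    """Yield (row_number, raw_value) for non-numeric values in `column`.

    Hypothetical helper: row numbers are zero-based and assume a file
    without a header row, as in the question.
    """
    reader = pd.read_csv(
        source,
        header=None,
        names=names,
        usecols=[column],
        dtype=str,            # keep the raw text so nothing is coerced silently
        chunksize=chunksize,  # iterate in chunks to keep memory use bounded
    )
    for chunk in reader:
        values = chunk[column]
        converted = pd.to_numeric(values, errors="coerce")
        # A value is suspicious if it was present but failed to convert.
        bad = values[converted.isna() & values.notna()]
        for row_number, raw in bad.items():
            yield row_number, raw


# Illustrative data only: the third line carries a stray text value.
sample = io.StringIO("1,80\n2,443\n3,%!PS-Adobe-3.0\n4,22\n")
for row_number, raw in find_non_numeric_rows(sample, "DstPort", ["Time", "DstPort"], chunksize=2):
    print(row_number, raw)
```

Because the index of the chunks keeps counting across the whole file, the reported number is the position of the line in the file, not within the chunk.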


Once Pandas finishes reading the file, you can NOT find out which lines were problematic (see this answer to find out why).

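That means the check has to happen while the file is being read: for example, read it row by row, record the type of each value, and stop at the first row where the types change.

With Pandas, you can pass chunksize=1 to pd.read_csv() to read the file in chunks (DataFrames of N rows, 1 in this case). pd.read_csv() then returns a reader to iterate over rather than a single DataFrame.

The code looks something like this: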

# read the file in chunks of size 1. This returns a reader rather than a DataFrame
reader = pd.read_csv(filename,chunksize=1)

# get the first chunk (DataFrame), to calculate the "true" expected types
first_row_df = reader.get_chunk()
expected_types = [type(val) for val in first_row_df.iloc[0]] # a list of the expected types.

i = 1 # the current index. Start from 1 because we've already read the first row.
for row_df in reader:
    row_types = [type(val) for val in row_df.iloc[0]]
    if row_types != expected_types:
        print(i) # this row is the wanted one
        break
    i += 1
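Reading one row at a time can be slow on a large file. A possible variation, sketched here under the assumption that it is enough to narrow the problem down to a block of rows first, compares the inferred column dtypes of each chunk with those of the first chunk. The function name first_chunk_with_dtype_change and the sample data are hypothetical. Note that a column switching from int to float (for example because of a missing value) is also reported as a change.

```python
import io

import pandas as pd


def first_chunk_with_dtype_change(source, chunksize=1000, **read_csv_kwargs):
    """Return (first_row, last_row, changed_columns) for the first chunk whose
    inferred dtypes differ from those of the first chunk, or None if none do.

    Hypothetical helper: extra keyword arguments are passed to pd.read_csv.
    """
    reader = pd.read_csv(source, chunksize=chunksize, **read_csv_kwargs)
    expected = None
    for chunk in reader:
        if expected is None:
            # The first chunk defines the "true" expected dtypes.
            expected = chunk.dtypes
            continue
        changed = [col for col in chunk.columns if chunk[col].dtype != expected[col]]
        if changed:
            return int(chunk.index[0]), int(chunk.index[-1]), changed
    return None


# Illustrative data only: the text value sits in the second chunk of two rows.
sample = io.StringIO("1,80\n2,443\n3,abc\n4,22\n")
print(first_chunk_with_dtype_change(sample, chunksize=2, header=None, names=["Time", "DstPort"]))
```

The returned row range can then be inspected with chunksize=1, as in the code above, without rereading the whole file row by row.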



Source: https://habr.com/ru/post/1690868/

