Error tokenizing data during Pandas read_csv. How to actually see the bad lines?

I have a big csv which I load as follows

df=pd.read_csv('my_data.tsv',sep='\t',header=0, skiprows=[1,2,3])

I get some errors during the loading process.

  • Firstly, if I do not specify warn_bad_lines=True,error_bad_lines=False, I get:

    Error tokenizing data. C error: Expected 22 fields in line 329867, saw 24

  • Secondly, if I use the options above, I now get:

    CParserError: Error tokenizing data. C error: EOF inside string starting at line 32357585

Question: how can I have a look at these bad lines to understand what is going on? Is it possible to have read_csv return these bogus lines?

I tried the following hint (Pandas ParserError EOF character when reading multiple csv files to HDF5):

from pandas import parser

try:
  df=pd.read_csv('mydata.tsv',sep='\t',header=0, skiprows=[1,2,3])
except (parser.CParserError) as detail:
  print  detail

but still get

Error tokenizing data. C error: Expected 22 fields in line 329867, saw 24

3 answers

In my case, specifying the separator explicitly helped:

data = pd.read_csv('/Users/myfile.csv', encoding='cp1251', sep=';')

We can get the line number from the error and print that line to see what it looks like.

Try:

import re
import subprocess

import pandas as pd

filename = 'mydata.tsv'
try:
    df = pd.read_csv(filename, sep='\t', header=0, skiprows=[1, 2, 3])
except pd.errors.ParserError as detail:  # was pandas.parser.CParserError in old pandas versions
    print(detail)
    # All numbers in the message, e.g. ['22', '329867', '24']; the line number is at index 1
    err = re.findall(r'\b\d+\b', str(detail))
    # Runs the shell command 'sed -n 329867p mydata.tsv', which prints only that line
    line = subprocess.check_output(['sed', '-n', err[1] + 'p', filename], stderr=subprocess.STDOUT)
    print('Bad line')
    print(line.decode(errors='replace'))  # to see the line
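If sed is not available (for example on Windows), the same lookup can probably be done in pure Python. This is only a sketch: line_number_from_error and read_line are made-up helper names, and it assumes the number after the word "line" in the message is the 1-based line number in the file, which is how the sed call treats it.

```python
import re
from itertools import islice


def line_number_from_error(message):
    """Return the number that follows the word 'line' in an error message, or None."""
    match = re.search(r"line (\d+)", message)
    return int(match.group(1)) if match else None


def read_line(path, line_number):
    """Return line `line_number` (1-based) of `path`, or None if the file is shorter."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        # islice skips the first line_number - 1 lines without loading the whole file
        return next(islice(handle, line_number - 1, line_number), None)


message = "Error tokenizing data. C error: Expected 22 fields in line 329867, saw 24"
print(line_number_from_error(message))  # 329867
```

Inside the except block you would then call something like read_line(filename, line_number_from_error(str(detail))).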

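I'll give my answer in two parts:

Part 1: the OP asked how to output these bad lines. To answer this, we can use the Python csv module in simple code like this: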

import csv

filename = 'your_filename.csv'  # use your filename
lines_set = {100, 200}  # use your bad line numbers here
last_line = max(lines_set)

with open(filename, newline='') as f_obj:
    for line_number, row in enumerate(csv.reader(f_obj)):  # line_number counts from 0
        if line_number in lines_set:
            print(line_number, row)
        if line_number > last_line:
            break

And we can put it in a more general function like this:

import csv


def read_my_lines(file, lines_list, reader=csv.reader):
    lines_set = set(lines_list)
    last_line = max(lines_set)
    with open(file, newline='') as f_obj:
        for line_number, row in enumerate(reader(f_obj)):  # use the reader that was passed in
            if line_number > last_line:
                break
            elif line_number in lines_set:
                print(line_number, row)


if __name__ == '__main__':
    read_my_lines(file='your_filename.csv', lines_list=[100, 200])
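In newer pandas (1.4 and later) there is also an on_bad_lines parameter that accepts a callable when engine='python' is used, so read_csv itself can hand over the rows it rejects. A minimal sketch, assuming such a pandas version; the in-memory sample stands in for the real file and remember_bad_line is a hypothetical name. As far as I know, only rows with too many fields are passed to the callable, while rows with too few fields are simply padded with NaN.

```python
import io

import pandas as pd

bad_lines = []


def remember_bad_line(fields):
    """Keep the rejected row (a list of strings); returning None makes pandas skip it."""
    bad_lines.append(fields)
    return None


# Tiny stand-in for the real TSV file: the third line has one field too many.
sample = io.StringIO("a\tb\tc\n1\t2\t3\n4\t5\t6\t7\n8\t9\t10\n")

df = pd.read_csv(sample, sep="\t", engine="python", on_bad_lines=remember_bad_line)

print(bad_lines)  # the rows pandas rejected, already split on the separator
print(df)
```

The Python engine is slower than the C engine, so on a very large file it may be worth doing this once just to collect the bad rows.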

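Part 2: the cause of the error you get:

It is hard to diagnose a problem like this without a sample of the file you use. But you should try this:

pd.read_csv(filename)

Does it parse the file with no error? If so, I will explain why.

The number of columns is inferred from the first line.

By using skiprows and header=0 you skipped 3 rows, which I guess contain the column names or the header that should carry the correct number of columns.

Basically, you are constraining what the parser is doing.

So parse without skiprows or header=0, then reindex to what you need later.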
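To check whether the header line really has the same number of fields as the data, one option is to tally the field counts over the whole file with the csv module. This is only a sketch: field_count_report is a made-up name, the tab delimiter is assumed from the question, and quoting is switched off on purpose so that every physical line counts as one row.

```python
import csv
from collections import Counter


def field_count_report(path, delimiter="\t", max_examples=5):
    """Return (Counter of fields per line, {field count: first few line numbers})."""
    counts = Counter()
    examples = {}
    with open(path, newline="", encoding="utf-8", errors="replace") as handle:
        # QUOTE_NONE: quote characters are treated as ordinary text, so an
        # unbalanced quote cannot swallow the following lines.
        reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        for row in reader:
            n_fields = len(row)
            counts[n_fields] += 1
            seen = examples.setdefault(n_fields, [])
            if len(seen) < max_examples:
                seen.append(reader.line_num)  # 1-based line number in the file
    return counts, examples
```

Calling field_count_report('my_data.tsv') and looking at the less common counts should show which lines have, say, 24 fields instead of 22.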

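Note:

If you are unsure which delimiter is used in the file, use sep=None, but it will be slower.

From the pandas.read_csv docs:

sep: str, default ','. Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and will automatically detect the separator using Python's builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'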
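The sniffer mentioned in the docs can also be used on its own to see which delimiter would be guessed before loading anything. A rough sketch: sniff_delimiter is a made-up helper, and the list of candidate delimiters and the sample size are my assumptions.

```python
import csv


def sniff_delimiter(path, candidates=",;\t|", sample_size=64 * 1024):
    """Guess the delimiter of `path` from its first `sample_size` characters."""
    with open(path, newline="", encoding="utf-8", errors="replace") as handle:
        sample = handle.read(sample_size)
    # sniff() raises csv.Error if it cannot decide on a delimiter
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
```

On the pandas side the equivalent would be pd.read_csv(filename, sep=None, engine='python'), which lets the Python engine do the detection itself.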


Source: https://habr.com/ru/post/1650948/

