Pandas read_csv always crashes when working with a small file

I am trying to import a rather small CSV (217 lines, 87 columns, around 15K) for analysis in Python using pandas. The file is rather poorly structured, but I would like to import it as-is, since this is raw data that I do not want to manipulate by hand outside of Python (for example, in Excel). Unfortunately, this always leads to a crash: "The kernel appears to have died. It will restart automatically."

https://www.wakari.io/sharing/bundle/uniquely/ReadCSV

I did some research and found reports of read_csv crashes, but always with really large files, so I don't understand the problem here. The failure occurs both with my local installation (64-bit Anaconda, IPython (Py 2.7) Notebook) and on Wakari.

Can someone help me? It would be really appreciated. Many thanks!

the code:

    # I have a somewhat ugly, illustrative csv file, but it is not too big:
    # 217 rows, 87 columns.
    # File can be downloaded at http://www.win2day.at/download/lo_1986.csv

    # In[1]:
    file_csv = 'lo_1986.csv'
    f = open(file_csv, mode="r")
    x = 0
    for line in f:
        print x, ": ", line
        x = x + 1
    f.close()

    # Now I'd like to import this csv into Python using pandas -
    # but this always leads to a crash:
    # "The kernel appears to have died. It will restart automatically."

    # In[ ]:
    import pandas as pd
    pd.read_csv(file_csv, delimiter=';')

    # What am I doing wrong?
2 answers

This is caused by an invalid character (e.g. the byte 0xe0) in the file.

If you add an encoding parameter to the read_csv() call, you will see this stack trace instead of a segfault:

    >>> df = pandas.read_csv("/tmp/lo_1986.csv", delimiter=";", encoding="utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/antkong/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 400, in parser_f
        return _read(filepath_or_buffer, kwds)
      File "/Users/antkong/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 205, in _read
        return parser.read()
      File "/Users/antkong/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 608, in read
        ret = self._engine.read(nrows)
      File "/Users/antkong/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 1028, in read
        data = self._reader.read(nrows)
      File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas/parser.c:6745)
      File "parser.pyx", line 728, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:6964)
      File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas/parser.c:7780)
      File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:8793)
      File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:9484)
      File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas/parser.c:10642)
      File "parser.pyx", line 1051, in pandas.parser.TextReader._string_convert (pandas/parser.c:10905)
      File "parser.pyx", line 1278, in pandas.parser._string_box_utf8 (pandas/parser.c:15657)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 0: unexpected end of data

You can preprocess the file to remove these characters before asking pandas to read it.
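One way to do that, as a minimal sketch (the `clean_csv_bytes` helper is my own illustration, not a pandas function): decode the raw bytes with `errors="replace"` so undecodable bytes become the Unicode replacement character instead of crashing the parser, then hand the cleaned text to `read_csv`:

```python
def clean_csv_bytes(raw, encoding='utf-8'):
    # Any byte that is invalid in the target encoding is replaced
    # with U+FFFD instead of raising a UnicodeDecodeError.
    return raw.decode(encoding, errors='replace')

# Example: a lone 0xe0 is not a valid UTF-8 sequence.
raw = b'Zahl;Quote\n5;\xe0 12,30\n'
text = clean_csv_bytes(raw)
print(text)

# The cleaned text can then be fed to pandas, e.g.:
#   import io, pandas as pd
#   df = pd.read_csv(io.StringIO(text), delimiter=';')
```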

I've attached an image highlighting the invalid characters in the file.

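If you want to locate the offending bytes programmatically rather than in a hex editor, here is a small stdlib sketch (the `find_invalid_utf8` helper is hypothetical, not part of pandas):

```python
def find_invalid_utf8(data):
    """Return the absolute byte offsets where UTF-8 decoding fails."""
    offsets = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode('utf-8')
            break  # the rest decodes cleanly
        except UnicodeDecodeError as e:
            offsets.append(pos + e.start)  # e.start is relative to the slice
            pos += e.start + 1             # skip past the offending byte
    return offsets

# Example with two stray 0xe0 bytes:
print(find_invalid_utf8(b'ab\xe0cd\xe0'))  # → [2, 5]
```

You could run this over the raw file contents (`open(file_csv, 'rb').read()`) to see exactly which positions to clean up.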


Thanks so much for your comments. I couldn't agree more with the comment that this is indeed a very messy CSV. But unfortunately, this is how the Austrian state lottery publishes its drawn numbers and payout quotas.

I kept experimenting, also looking at the special characters. In the end, at least for me, the solution was surprisingly simple:

 pd.read_csv(file_csv, delimiter=';', encoding='latin-1', engine='python') 

Adding the encoding makes the special characters display correctly, but the game changer was the engine parameter. Honestly, I don't understand why, but now it works.
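For what it's worth, a likely reason the encoding part helps: latin-1 assigns a character to every possible byte value 0x00-0xFF, so decoding with it can never fail, whereas a lone 0xe0 (the byte from the traceback above) is an invalid UTF-8 sequence. A quick demonstration:

```python
raw = b'\xe0'  # the byte reported in the UnicodeDecodeError

# latin-1 maps all 256 byte values to code points, so this always succeeds:
print(raw.decode('latin-1'))  # 'à'

# ...while the same byte on its own is not valid UTF-8:
try:
    raw.decode('utf-8')
except UnicodeDecodeError as e:
    print('invalid utf-8:', e.reason)
```

Why `engine='python'` also matters here I can't say for certain either; the source thread doesn't explain it, only reports that it made the difference.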

Thanks again!


