Pandas skiprows beyond 900000 not working

My CSV file contains 6 million entries, and I am trying to split it into several smaller files using skiprows. My version of pandas is "0.12.0" and the code is

pd.read_csv(TRAIN_FILE, chunksize=50000, header=None, skiprows=999999, nrows=100000) 

It works as long as skiprows is less than 900,000. Any idea if this is expected? If I do not use skiprows, nrows can go up to 5 million records. I have not tried beyond that yet, but I will try that as well.

I also tried a CSV splitter, but it does not work properly for the first record, perhaps because each cell spans several lines of code, etc.

EDIT: I was able to split it by reading the entire 7 GB file with pandas read_csv and writing it out in parts to several CSV files.

1 answer

The problem is that you are specifying both nrows and chunksize. At least in pandas 0.14.0, using

 pandas.read_csv(filename, nrows=some_number, chunksize=another_number) 

returns a DataFrame (reading all the data), whereas

 pandas.read_csv(filename, chunksize=another_number) 

returns a TextFileReader that loads the file lazily.
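To illustrate the lazy behaviour, here is a minimal sketch that assumes a reasonably recent pandas and uses a small made-up in-memory CSV (the sample data and the names csv_text, reader and chunk_lengths are hypothetical, not from the question):

```python
import io

import pandas as pd

# Hypothetical sample data: five rows, no header line.
csv_text = "1,a\n2,b\n3,c\n4,d\n5,e\n"

# With chunksize set (and no nrows), read_csv returns an iterator
# rather than a fully loaded DataFrame.
reader = pd.read_csv(io.StringIO(csv_text), header=None, chunksize=2)

# Each step of the iteration parses only the next chunk of rows.
chunk_lengths = [len(chunk) for chunk in reader]
print(chunk_lengths)  # [2, 2, 1]
```

Each chunk is an ordinary DataFrame, so anything you would do to the full frame can be done per chunk without holding the whole file in memory.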

Splitting csv works as follows:

 for chunk in pandas.read_csv(filename, chunksize=your_chunk_size):
     chunk.to_csv(some_filename)  # use a different some_filename for each chunk
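The loop above can be fleshed out into a runnable sketch. The helper name split_csv, the part_NNN.csv naming scheme and the tiny demo file are assumptions made for illustration; header=None mirrors the call in the question, and index=False / header=False are chosen here so the parts look like the original rows:

```python
import os
import tempfile

import pandas as pd


def split_csv(path, out_dir, chunk_size):
    """Write consecutive chunks of a headerless CSV to numbered files."""
    out_paths = []
    # No nrows here: chunksize alone gives a lazy reader.
    reader = pd.read_csv(path, header=None, chunksize=chunk_size)
    for i, chunk in enumerate(reader):
        # Hypothetical naming scheme: part_000.csv, part_001.csv, ...
        out_path = os.path.join(out_dir, "part_%03d.csv" % i)
        # Skip the index and the auto-generated column names.
        chunk.to_csv(out_path, header=False, index=False)
        out_paths.append(out_path)
    return out_paths


# Small demonstration on a made-up 10-row file.
work_dir = tempfile.mkdtemp()
source = os.path.join(work_dir, "train.csv")
with open(source, "w") as f:
    for n in range(10):
        f.write("%d,row%d\n" % (n, n))

parts = split_csv(source, work_dir, chunk_size=4)
print([os.path.basename(p) for p in parts])
```

With 10 rows and chunk_size=4 this writes three files of 4, 4 and 2 rows. For the 6-million-row file in the question you would pass something like chunk_size=100000.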

Source: https://habr.com/ru/post/959074/

