Get the last 10,000 lines of a csv file

In pandas, I can just use pandas.read_csv("file.csv", nrows=10000) to get the first 10,000 lines of a CSV file.

But since my CSV file is huge and the last lines are more relevant than the first, I would like to read the last 10,000 lines instead. This is not so simple, even if I know the file length: if I skip the first 990,000 lines of a 1,000,000-line CSV file with pandas.read_csv("file.csv", nrows=10000, skiprows=990000), the line containing the file header is skipped as well (header=0 is applied after skiprows, so it does not help).

How can I get the last 10,000 lines of a CSV file whose header is on line 0, preferably without knowing the file length in lines up front?
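For reference, a minimal sketch of the two calls described above (the 1,000,000-line figure is taken from the question; nothing here is new API):

 import pandas as pd

 # First 10,000 rows: straightforward
 head = pd.read_csv('file.csv', nrows=10000)

 # Naive attempt at the last 10,000 rows of a 1,000,000-line file:
 # the header row is skipped together with the first 990,000 data rows,
 # because header=0 is resolved only after skiprows has been applied.
 tail = pd.read_csv('file.csv', nrows=10000, skiprows=990000)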

+5
4 answers

First you can calculate the length of your file in lines:

 size = sum(1 for l in open('file.csv')) 

Then use skiprows with a range:

 df = pd.read_csv('file.csv', skiprows=range(1, size - 10000)) 
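Putting the two steps together (just the calls shown above with the line count reused; the variable names are illustrative):

 import pandas as pd

 n = 10000
 size = sum(1 for line in open('file.csv'))              # total line count, header included
 # keep row 0 (the header) and only the last n data rows
 df = pd.read_csv('file.csv', skiprows=range(1, size - n))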

EDIT

As @ivan_pozdeev pointed out, this solution requires going through the file twice. I also tried reading the entire file with pandas and then using the tail method, but that method turned out to be slower.

Example dataframe:

 pd.DataFrame(np.random.randn(1000000,3), columns=list('abc')).to_csv('file.csv') 

Timing

 def f1():
     size = sum(1 for l in open('file.csv'))
     return pd.read_csv('file.csv', skiprows=range(1, size - 10000))

 def f2():
     return pd.read_csv('file.csv').tail(10000)

 In [10]: %timeit f1()
 1 loop, best of 3: 1.8 s per loop

 In [11]: %timeit f2()
 1 loop, best of 3: 1.94 s per loop
+5

Using @Anton Protopopov's sample file. Reading a slice of the file and the header in separate operations is much cheaper than reading the entire file.

Just read the last lines

 In [22]: df = read_csv("file.csv", nrows=10000, skiprows=990001, header=None, index_col=0)

 In [23]: df
 Out[23]:
                1         2         3
 0
 990000 -0.902507 -0.274718  1.155361
 990001 -0.591442 -0.318853 -0.089092
 990002 -1.461444 -0.070372  0.946964
 990003  0.608169 -0.076891  0.431654
 990004  1.149982  0.661430  0.456155
 ...          ...       ...       ...
 999995  0.057719  0.370591  0.081722
 999996  0.157751 -1.204664  1.150288
 999997 -2.174867 -0.578116  0.647010
 999998 -0.668920  1.059817 -2.091019
 999999 -0.263830 -1.195737 -0.571498

 [10000 rows x 3 columns]

This is very fast

 In [24]: %timeit read_csv("file.csv", nrows=10000, skiprows=990001, header=None, index_col=0)
 1 loop, best of 3: 262 ms per loop

It is also pretty cheap to determine the file length a priori

 In [25]: %timeit sum(1 for l in open('file.csv'))
 10 loops, best of 3: 104 ms per loop

Reading in the header

 In [26]: df.columns = read_csv('file.csv', header=0, nrows=1, index_col=0).columns

 In [27]: df
 Out[27]:
                a         b         c
 0
 990000 -0.902507 -0.274718  1.155361
 990001 -0.591442 -0.318853 -0.089092
 990002 -1.461444 -0.070372  0.946964
 990003  0.608169 -0.076891  0.431654
 990004  1.149982  0.661430  0.456155
 ...          ...       ...       ...
 999995  0.057719  0.370591  0.081722
 999996  0.157751 -1.204664  1.150288
 999997 -2.174867 -0.578116  0.647010
 999998 -0.668920  1.059817 -2.091019
 999999 -0.263830 -1.195737 -0.571498

 [10000 rows x 3 columns]
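Combining the three pieces, with the hard-coded 990001 replaced by a computed line count (the variable names are illustrative):

 import pandas as pd

 n = 10000
 size = sum(1 for line in open('file.csv'))         # total line count, header included
 df = pd.read_csv('file.csv', header=None, index_col=0,
                  skiprows=size - n)                # only the last n rows, no header parsing
 df.columns = pd.read_csv('file.csv', header=0, nrows=1, index_col=0).columns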
+3

The only way to take exactly the last N lines, as in Anton Protopopov's answer, is to go through the entire file first, counting the lines.

But for the next step, actually taking them, you can apply an optimization (which is what tail does):

  • as you go, save the line offsets in a circular buffer of length N. Then, at the end, the oldest element in the buffer will be the offset you need. All that remains is to f.seek() on the file object, as in Working with 10 + GB Dataset in Python Pandas (see the sketch below).
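A minimal single-pass sketch of that idea (the tail_csv helper and the naive header.split(',') are illustrative, not part of the answer; it assumes no embedded newlines or quoted commas):

 from collections import deque
 import pandas as pd

 def tail_csv(path, n=10000):
     # One pass over the file: remember where each of the last n lines starts,
     # then seek back to the oldest remembered offset and let pandas parse
     # only that final chunk.
     with open(path, 'rb') as f:
         header = f.readline().decode().rstrip('\r\n')   # column names, split naively below
         offsets = deque(maxlen=n)                       # circular buffer of line-start offsets
         pos = f.tell()
         for _ in iter(f.readline, b''):
             offsets.append(pos)
             pos = f.tell()
         f.seek(offsets[0] if offsets else pos)          # start of the n-th line from the end
         return pd.read_csv(f, header=None, names=header.split(','))

 df = tail_csv('file.csv')   # last 10,000 data rows, columns taken from the header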

A faster way, which does not involve traversing the entire file at all, is to not require the exact number of lines: from what I can see, you only need some arbitrarily large number. So you can (see the sketch after this list):

  • get a rough estimate of the offset you need to seek to (for example, by calculating or estimating the average line length),
  • seek there, then scan to the next (or previous) line break.

    This requires extra care if your data can contain embedded line breaks: in that case there is no reliable way to tell which quotes are opening and which are closing. You have to make assumptions about what can and cannot appear inside/outside quotation marks... and even about how far to look for a quote to decide whether a given line break is embedded or not.
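A rough sketch of that estimate-and-seek approach (approx_tail_csv, the est_line_len guess and the naive header.split(',') are illustrative; it assumes no embedded line breaks inside quoted fields, exactly the caveat above):

 import os
 import pandas as pd

 def approx_tail_csv(path, n=10000, est_line_len=80):
     # Jump to roughly n * est_line_len bytes before the end of the file,
     # discard the (probably partial) line we land in, and parse the rest.
     # Returns roughly n rows -- more or fewer depending on the estimate.
     size = os.path.getsize(path)
     with open(path, 'rb') as f:
         header = f.readline().decode().rstrip('\r\n')
         data_start = f.tell()
         guess = max(data_start, size - n * est_line_len)
         f.seek(guess)
         if guess > data_start:
             f.readline()                 # skip the partial line we landed in
         return pd.read_csv(f, header=None, names=header.split(','))

 df = approx_tail_csv('file.csv', n=10000, est_line_len=60)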

+1

You can try tail from pandas; it returns the last n rows:

 df.tail(10000) 
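Note that this assumes the whole file has already been read into a DataFrame df, i.e. something like:

 import pandas as pd

 df = pd.read_csv('file.csv')   # the entire file is read (and held in memory) first
 last = df.tail(10000)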
0

Source: https://habr.com/ru/post/1244989/

