How can I filter rows while loading a CSV with the pandas read_csv function?

How can I filter which rows of a CSV are loaded into memory with pandas? It feels like there should be an option for this in read_csv. Am I missing something?

Example: we have a CSV with a timestamp column, and we would like to load only the rows whose timestamp exceeds a given constant.

+64
pandas
Nov 30 '12 at 18:38
5 answers

There is no option to filter the rows before the CSV file is loaded into a pandas object.

You can either load the whole file and then filter with df[df['field'] > constant], or, if the file is very large and you are worried about running out of memory, use an iterator and apply the filter as you concatenate the chunks, for example:

import pandas as pd

# Read the file in chunks of 1000 rows instead of all at once.
iter_csv = pd.read_csv('file.csv', iterator=True, chunksize=1000)
# Filter each chunk, then concatenate only the surviving rows.
df = pd.concat([chunk[chunk['field'] > constant] for chunk in iter_csv])

You can change chunksize according to available memory. See here for more details.
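Tying this back to the timestamp example from the question, here is a hedged sketch that parses the timestamp column while chunking and keeps only rows past a cutoff. The file name 'file.csv', the column name 'timestamp', and the cutoff value are assumptions for illustration, not names from the answer above:

import pandas as pd

cutoff = pd.Timestamp('2012-11-30')  # assumed cutoff value

chunks = pd.read_csv(
    'file.csv',
    parse_dates=['timestamp'],  # parse the column to datetime while reading
    iterator=True,
    chunksize=1000,
)
# Keep only rows whose timestamp exceeds the cutoff, chunk by chunk,
# so only the filtered rows are ever concatenated in memory.
df = pd.concat(chunk[chunk['timestamp'] > cutoff] for chunk in chunks)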

+115
Nov 30 '12 at 21:31

I did not find a direct way to do this in the context of read_csv. However, read_csv returns a DataFrame that can be filtered by selecting rows with a boolean vector, df[bool_vec]:

filtered = df[(df['timestamp'] > targettime)]

This selects all rows in df (assuming df is any DataFrame, such as the result of a read_csv call, that contains at least a datetime column named timestamp) whose values in the timestamp column are greater than targettime.
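A minimal, self-contained sketch of that approach; the file name 'data.csv' and the cutoff value are illustrative assumptions, while 'timestamp' matches the column from the question:

import pandas as pd

# 'data.csv' is an assumed file name; parse_dates turns the column into datetimes.
df = pd.read_csv('data.csv', parse_dates=['timestamp'])

targettime = pd.Timestamp('2012-11-30 18:00:00')  # assumed cutoff

# Boolean-vector row selection, as described above.
filtered = df[df['timestamp'] > targettime]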

+7
Nov 30 '12 at 19:43

You can specify the nrows parameter, which limits the read to the first N rows (a truncation rather than a value-based filter).

import pandas as pd

# Read only the first 100 rows of the file.
df = pd.read_csv('file.csv', nrows=100)

This code works well in pandas 0.20.3.
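Since nrows only truncates, a related option for dropping rows by position at read time is passing a callable to skiprows, which read_csv supports in recent pandas versions. A sketch, with the file name assumed:

import pandas as pd

# 'file.csv' is an assumed name. The callable receives each row index and
# returns True to skip that row: here we keep the header (row 0) and every
# tenth data row.
df = pd.read_csv('file.csv', skiprows=lambda i: i > 0 and i % 10 != 0)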

+2
Nov 12 '18 at 5:59

If you are using Linux, you can use grep.

# Works on either Python 2 or Python 3.
import subprocess  # missing from the original snippet
import pandas as pd
from time import time  # not needed, just for timing

try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3

def zgrep_data(f, string):
    '''grep multiple items
    f is filepath, string is what you are filtering for'''

    grep = 'grep'  # change to zgrep for gzipped files

    print('{} for {} from {}'.format(grep, string, f))
    start_time = time()
    if string == '':
        out = subprocess.check_output([grep, string, f])
        grep_data = StringIO(out.decode())  # check_output returns bytes on Python 3
        data = pd.read_csv(grep_data, sep=',', header=0)
    else:
        # Read only the first row to get the columns. May need to change
        # depending on how the data is stored.
        columns = pd.read_csv(f, sep=',', nrows=1, header=None).values.tolist()[0]
        out = subprocess.check_output([grep, string, f])
        grep_data = StringIO(out.decode())  # grep output drops the header line
        data = pd.read_csv(grep_data, sep=',', names=columns, header=None)

    print('{} finished for {} - {} seconds'.format(grep, f, time() - start_time))
    return data
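A hypothetical call (the file path and search string are made-up examples). Note that grep matches the pattern anywhere on the line, so this is a textual filter rather than a per-column comparison:

# Hypothetical usage: keep only lines containing '2012-11-30'.
df = zgrep_data('data.csv', '2012-11-30')
print(df.shape)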
+1
Dec 13 '17 at 14:26

I was able to load the CSV in chunks this way.

import pandas as pd

iter_csv = pd.read_csv('file.csv', iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['field'] > constant] for chunk in iter_csv])

However, I noticed that the number of rows in the result (shown by df.shape) varies with the chunk size. Any idea why?
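For a purely row-wise filter the result should not depend on the chunk size. A sanity-check sketch under the same assumed names as the snippet above ('file.csv', 'field', and a placeholder constant):

import pandas as pd

constant = 0  # placeholder threshold, standing in for the value above

# One full read as the reference result.
full = pd.read_csv('file.csv')
expected = full[full['field'] > constant]

# The chunked version should select exactly the same rows.
chunked = pd.concat(
    chunk[chunk['field'] > constant]
    for chunk in pd.read_csv('file.csv', iterator=True, chunksize=1000)
)
assert len(chunked) == len(expected)

If the counts differ, the filter is probably not row-wise after all, for example because it depends on a statistic computed per chunk.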

0
Feb 19 '19 at 6:42


