Pandas read_csv and filter columns using usecols

I have a CSV file that pandas.read_csv does not read correctly when I filter columns with usecols and use multiple index columns.

    import pandas as pd

    csv = r"""dummy,date,loc,x
    bar,20090101,a,1
    bar,20090102,a,3
    bar,20090103,a,5
    bar,20090101,b,1
    bar,20090102,b,3
    bar,20090103,b,5"""

    f = open('foo.csv', 'w')
    f.write(csv)
    f.close()

    df1 = pd.read_csv('foo.csv',
            index_col=["date", "loc"],
            usecols=["dummy", "date", "loc", "x"],
            parse_dates=["date"],
            header=0,
            names=["dummy", "date", "loc", "x"])
    print(df1)

    # Ignore the dummy columns
    df2 = pd.read_csv('foo.csv',
            index_col=["date", "loc"],
            usecols=["date", "loc", "x"],  # <----------- Changed
            parse_dates=["date"],
            header=0,
            names=["dummy", "date", "loc", "x"])
    print(df2)

I expect df1 and df2 to be the same except for the missing dummy column, but the columns come in mislabeled. Also, the dates in df2 are left unparsed (20090101 rather than 2009-01-01).

    In [118]: %run test.py
                   dummy  x
    date       loc
    2009-01-01 a     bar  1
    2009-01-02 a     bar  3
    2009-01-03 a     bar  5
    2009-01-01 b     bar  1
    2009-01-02 b     bar  3
    2009-01-03 b     bar  5
                  date
    date loc
    a    1    20090101
         3    20090102
         5    20090103
    b    1    20090101
         3    20090102
         5    20090103

Using column numbers instead of names gives me the same problem. I can work around it by dropping the dummy column after the read_csv step, but I'm trying to understand what is going wrong. I am using pandas 0.10.1.

edit: fixed bad header usage.

+42
python pandas
Feb 22 '13 at 4:50
4 answers

If your csv file contains additional data, columns can be removed from the DataFrame after import.

    import pandas as pd
    from StringIO import StringIO  # Python 2; on Python 3 use: from io import StringIO

    csv = r"""dummy,date,loc,x
    bar,20090101,a,1
    bar,20090102,a,3
    bar,20090103,a,5
    bar,20090101,b,1
    bar,20090102,b,3
    bar,20090103,b,5"""

    df = pd.read_csv(StringIO(csv),
            index_col=["date", "loc"],
            usecols=["dummy", "date", "loc", "x"],
            parse_dates=["date"],
            header=0,
            names=["dummy", "date", "loc", "x"])

    # Drop the unwanted column after the import
    del df['dummy']

Which gives us:

                    x
    date       loc
    2009-01-01 a    1
    2009-01-02 a    3
    2009-01-03 a    5
    2009-01-01 b    1
    2009-01-02 b    3
    2009-01-03 b    5
+7
Feb 26 '13 at 22:01

The answer from @chip completely misses the point of two keyword arguments.

  • names is only necessary when there is no header row and you want to specify other arguments using column names rather than integer indices.
  • usecols is supposed to provide a filter before the entire DataFrame is read into memory; if used properly, there should never be a need to delete columns after reading.

This solution fixes these oddities:

    import pandas as pd
    from StringIO import StringIO  # Python 2; on Python 3 use: from io import StringIO

    csv = r"""dummy,date,loc,x
    bar,20090101,a,1
    bar,20090102,a,3
    bar,20090103,a,5
    bar,20090101,b,1
    bar,20090102,b,3
    bar,20090103,b,5"""

    # The header row supplies the column names, so names is not passed
    df = pd.read_csv(StringIO(csv),
            header=0,
            index_col=["date", "loc"],
            usecols=["date", "loc", "x"],
            parse_dates=["date"])

Which gives us:

                    x
    date       loc
    2009-01-01 a    1
    2009-01-02 a    3
    2009-01-03 a    5
    2009-01-01 b    1
    2009-01-02 b    3
    2009-01-03 b    5
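
For completeness, here is a minimal sketch of the one case where names is actually needed: a file with no header row. It assumes a recent pandas on Python 3 (not the 0.10.1 from the question), and the headerless data in csv_no_header is a hypothetical variant of the question's file, made up for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical variant of the question's data with the header line removed.
csv_no_header = """bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5
"""

df = pd.read_csv(
    StringIO(csv_no_header),
    header=None,                          # the data has no header row
    names=["dummy", "date", "loc", "x"],  # so label all four columns ourselves
    usecols=["date", "loc", "x"],         # filter out "dummy" while parsing
    index_col=["date", "loc"],
    parse_dates=["date"],
)
print(df)
```

Here names labels every column in the file, and usecols then selects from those labels, so nothing has to be deleted after the read.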
+40
Jan 06 '15 at 2:47

This code achieves what you want, although it is weird and certainly buggy:

I noticed that it works when:

a) you specify index_col relative to the columns you actually use, so three columns in this example, not four (you drop dummy and start counting from there)

b) the same goes for parse_dates

c) not so for usecols ;) for obvious reasons

d) here I adapted names to mirror this behavior

    import pandas as pd
    from StringIO import StringIO  # Python 2; on Python 3 use: from io import StringIO

    csv = """dummy,date,loc,x
    bar,20090101,a,1
    bar,20090102,a,3
    bar,20090103,a,5
    bar,20090101,b,1
    bar,20090102,b,3
    bar,20090103,b,5
    """

    # index_col and parse_dates count positions among the columns kept by usecols
    df = pd.read_csv(StringIO(csv),
            index_col=[0,1],
            usecols=[1,2,3],
            parse_dates=[0],
            header=0,
            names=["date", "loc", "", "x"])

    print(df)

which prints

                    x
    date       loc
    2009-01-01 a    1
    2009-01-02 a    3
    2009-01-03 a    5
    2009-01-01 b    1
    2009-01-02 b    3
    2009-01-03 b    5
+9
Feb 22 '13 at 18:04

Import csv first and use csv.DictReader; it is easy to work with ...
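
A rough sketch of what that could look like, assuming Python 3 and the same sample data as in the question. The names data, wanted and rows are made up for this illustration. Note that csv.DictReader returns every value as a string, so there is no date parsing and no index here.

```python
import csv
from io import StringIO

# Same sample data as in the question.
data = """dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5
"""

wanted = ["date", "loc", "x"]  # columns to keep; "dummy" is skipped

# DictReader takes the field names from the first line of the input.
reader = csv.DictReader(StringIO(data))
rows = [{name: record[name] for name in wanted} for record in reader]

for row in rows:
    print(row)
```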

-3
Feb 25