Not reading all rows when importing csv to pandas dataframe

Question


I am trying to work on a Kaggle challenge, and unfortunately I am stuck at a very basic step. Blame it on my limited knowledge of Python. I am trying to read a dataset into a pandas DataFrame by running the following command:

test = pd.DataFrame.from_csv("C:/Name/DataMining/hillary/data/output/emails.csv")

The problem is that this file has more than 300,000 entries, but only 7945 rows (with 21 columns) are read.

 print (test.shape)
 #(7945, 21)

I double-checked the file, and I can't find anything special at line 7945. Any pointers as to why this might be happening? It seems like a very ordinary situation, so I hope someone who has encountered this problem can help.


python-3.x pandas csv machine-learning kaggle

kushal bhola Oct 16 '15 at 2:50


1 answer

jezrael · Accepted Answer · 2015-10-16T05:57:14+0000

I think it's better to use the read_csv function with the parameters quoting=csv.QUOTE_NONE and error_bad_lines=False:

 import pandas as pd
 import csv
 test = pd.read_csv("output/Emails.csv", quoting=csv.QUOTE_NONE, error_bad_lines=False)
 print (test.shape)
 #(381422, 22)

However, some (problematic) rows will be skipped.
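On newer pandas versions the call above may raise an error, because error_bad_lines was deprecated in pandas 1.3 and removed in 2.0 in favour of on_bad_lines. The following is a minimal sketch of the equivalent call, assuming pandas 1.3 or later. The in-memory sample is made up for illustration and only stands in for Emails.csv; it contains one line with too many fields, which is the kind of line that gets skipped.

```python
import csv
import io

import pandas as pd

# Hypothetical sample standing in for Emails.csv (not the real data).
# The line starting with "3" has four fields instead of three.
sample = io.StringIO(
    "Id,Subject,Body\n"
    "1,hello,first\n"
    "2,meeting,second\n"
    "3,broken,third,extra\n"
    "4,bye,fourth\n"
)

# on_bad_lines="skip" replaces error_bad_lines=False (assumes pandas >= 1.3).
df = pd.read_csv(sample, quoting=csv.QUOTE_NONE, on_bad_lines="skip")

print(df.shape)         # (3, 3): the malformed line was dropped
print(df["Id"].tolist())
```

Passing on_bad_lines="warn" instead of "skip" should report each dropped line, which can help to judge how much data is being lost.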

If you want to drop the rows that contain only fragments of email body text rather than complete records, you can use:

 import pandas as pd
 import csv
 test = pd.read_csv("output/Emails.csv", quoting=csv.QUOTE_NONE, sep=',', error_bad_lines=False, header=None,
     names=["Id","DocNumber","MetadataSubject","MetadataTo","MetadataFrom","SenderPersonId",
            "MetadataDateSent","MetadataDateReleased","MetadataPdfLink","MetadataCaseNumber",
            "MetadataDocumentClass","ExtractedSubject","ExtractedTo","ExtractedFrom","ExtractedCc",
            "ExtractedDateSent","ExtractedCaseNumber","ExtractedDocNumber","ExtractedDateReleased",
            "ExtractedReleaseInPartOrFull","ExtractedBodyText","RawText"])
 print (test.shape)
 #delete row with NaN in column MetadataFrom
 test = test.dropna(subset=['MetadataFrom'])
 #delete headers in data
 test = test[test.MetadataFrom != 'MetadataFrom']
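To see what the two cleanup steps do, here is a small sketch on made-up data. The three-column layout and the sample rows are assumptions for illustration only, not the real Emails.csv schema. With quoting=csv.QUOTE_NONE, a quoted field that contains a newline is no longer treated as one field, so its continuation line becomes a separate row with NaN in the remaining columns.

```python
import csv
import io

import pandas as pd

# Hypothetical, shortened column list (the real file has 22 columns).
names = ["Id", "MetadataFrom", "RawText"]

# Made-up sample: record 1 has a quoted body that spans two physical lines.
sample = io.StringIO(
    "Id,MetadataFrom,RawText\n"
    '1,H,"first line of body\n'
    'second line of body"\n'
    "2,abedin,short message\n"
)

# header=None keeps the header line as an ordinary data row.
raw = pd.read_csv(sample, quoting=csv.QUOTE_NONE, header=None, names=names)
print(raw.shape)  # header row + 3 physical lines

# Drop continuation lines: they have no value in MetadataFrom.
cleaned = raw.dropna(subset=["MetadataFrom"])
# Drop the header line that was read as data.
cleaned = cleaned[cleaned.MetadataFrom != "MetadataFrom"]
print(cleaned.shape)
```

Note that this keeps only the first physical line of each multi-line record, so the body text of such records is truncated rather than reassembled.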
