Not all rows are read when importing a CSV into a pandas DataFrame

I am working on a Kaggle challenge, and unfortunately I am stuck at a very basic step; blame my limited knowledge of Python. I am trying to read a dataset into a pandas DataFrame by running the following command:

test = pd.DataFrame.from_csv("C:/Name/DataMining/hillary/data/output/emails.csv") 

The problem is that this file has more than 300,000 entries, but only 7945 rows (with 21 columns) are read:

print(test.shape)
# (7945, 21)

I double-checked the file, and I can't find anything special at line 7945. Is there any indication of why this might happen? It seems like a fairly common situation, so I hope someone who has encountered it can help.

1 answer

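I think it's better to use the read_csv function with the quoting=csv.QUOTE_NONE and error_bad_lines=False parameters. Note that error_bad_lines was deprecated in pandas 1.3 and removed in pandas 2.0; on those versions, on_bad_lines="skip" is the equivalent setting.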

import csv
import pandas as pd

test = pd.read_csv("output/Emails.csv", quoting=csv.QUOTE_NONE, error_bad_lines=False)
print(test.shape)
# (381422, 22)

Note that the problematic lines (those with more fields than expected) will be skipped, so some data is lost.
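To illustrate what these two parameters do, here is a minimal sketch on a small made-up CSV held in memory (not the Kaggle file). It assumes pandas 1.3 or newer, so it uses on_bad_lines="skip" in place of error_bad_lines=False. The sample, the column names and the helper functions are hypothetical. One quoted field spans several lines: the default parser keeps it inside a single record, while quoting=csv.QUOTE_NONE treats every physical line as its own row and drops the line that has too many fields.

```python
import csv
import io

import pandas as pd

# Hypothetical sample: two records, but the quoted Body of record 1
# spreads over three physical lines, one of which has extra commas.
SAMPLE = (
    "Id,Subject,Body\n"
    '1,Hello,"line one\n'
    "line two, with, extra, commas\n"
    'line three"\n'
    "2,Bye,short\n"
)


def read_default(text):
    # Default quoting: newlines inside double quotes stay in one field.
    return pd.read_csv(io.StringIO(text))


def read_quote_none(text):
    # QUOTE_NONE: quotes are ordinary characters, so every physical line
    # is parsed as a row; lines with too many fields are skipped.
    return pd.read_csv(
        io.StringIO(text),
        quoting=csv.QUOTE_NONE,
        on_bad_lines="skip",
    )


default_df = read_default(SAMPLE)
raw_df = read_quote_none(SAMPLE)

print(default_df.shape)
print(raw_df.shape)
```

Comparing the two shapes shows why the row count grows with QUOTE_NONE: the number of physical lines in a file is not necessarily the number of records when quoted fields contain line breaks.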

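If you also want to drop the rows that are not valid email records (rows with no value in MetadataFrom, and header rows that end up in the data), you can use: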

import csv
import pandas as pd

test = pd.read_csv("output/Emails.csv", quoting=csv.QUOTE_NONE, sep=',', error_bad_lines=False, header=None,
    names=["Id","DocNumber","MetadataSubject","MetadataTo","MetadataFrom","SenderPersonId",
           "MetadataDateSent","MetadataDateReleased","MetadataPdfLink","MetadataCaseNumber",
           "MetadataDocumentClass","ExtractedSubject","ExtractedTo","ExtractedFrom","ExtractedCc",
           "ExtractedDateSent","ExtractedCaseNumber","ExtractedDocNumber","ExtractedDateReleased",
           "ExtractedReleaseInPartOrFull","ExtractedBodyText","RawText"])
print(test.shape)

# delete rows with NaN in column MetadataFrom
test = test.dropna(subset=['MetadataFrom'])

# delete header rows that ended up in the data
test = test[test.MetadataFrom != 'MetadataFrom']
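The two cleanup steps at the end can be tried in isolation. The sketch below uses a small hand-made frame with hypothetical values standing in for the parsed file: one stray header row, one fragment with no sender, and two real records.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, not taken from the real dataset.
test = pd.DataFrame(
    {
        "Id": ["Id", "1", "continued text", "2"],
        "MetadataFrom": ["MetadataFrom", "sender A", np.nan, "sender B"],
    }
)

# Drop rows that have no value in MetadataFrom (fragments of broken records).
test = test.dropna(subset=["MetadataFrom"])

# Drop rows where the header text itself was read as data.
test = test[test["MetadataFrom"] != "MetadataFrom"]

print(test.shape)
```

Only the two rows with a real sender remain. Note that dropna runs first here, but the order does not matter, since comparing NaN with a string yields True for != and the NaN row would still be removed by dropna afterwards.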

Source: https://habr.com/ru/post/1233828/

