Parsing a CSV file in pandas with commas in the last column

Question

I am stuck with some poorly formatted CSV data that I need to read into a pandas DataFrame. I cannot change how the data is written (that happens elsewhere), so please, no solutions suggesting that.

Most of the data is fine, but some rows have commas in the last column. A simplified example:

column1 is fine,column 2 is fine,column3, however, has commas in it!

Every row should have the same number of columns (3), but this example of course breaks the CSV reader, because the commas imply 5 columns when there are really only 3.

Please note that the fields are not quoted, so standard CSV reading tools cannot resolve this on their own.

However, I do know that the extra commas always occur in the last (rightmost) column. This means I can use a solution that boils down to:

"Always assume that there are three columns counting from the left, and treat any additional commas as part of the string in column 3." Or, in other words, "interpret the first two commas as column separators, but assume that any subsequent commas belong to the string in column 3."

I can think of plenty of convoluted ways to do this, but my question is: is there an elegant and concise way to handle it, preferably within my call to pandas.read_csv(...)?

python pandas

mustachio Jun 11 '14 at 13:30

1 answer

unutbu · Accepted Answer · 2014-06-11T14:32:12+0000

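You could preprocess the file with the csv module, so that any last field containing commas gets quoted, and then read the fixed file: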

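import csv

import pandas as pd

# Python 3: open both files in text mode with newline='' as the csv module expects.
with open('path/to/broken.csv', newline='') as f, \
        open('path/to/fixed.csv', 'w', newline='') as g:
    writer = csv.writer(g, delimiter=',')
    for line in f:
        # Split on the first two commas only; everything after stays in column 3.
        # Strip the line ending first so it does not end up inside the last field.
        row = line.rstrip('\r\n').split(',', 2)
        # The writer quotes any field that contains a comma.
        writer.writerow(row)

df = pd.read_csv('path/to/fixed.csv')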
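
If writing a second file is undesirable, the same split can be done in memory. The following is only a sketch under stated assumptions: the helper name read_loose_csv and its n_columns parameter are made up for illustration, the first line is assumed to be a header, and no field is assumed to be quoted. Every value comes back as a string, so numeric columns would still need converting afterwards (for example with pd.to_numeric).

```python
import io

import pandas as pd


def read_loose_csv(lines, n_columns=3):
    """Build a DataFrame from lines whose last column may contain commas.

    Hypothetical helper: assumes the first non-blank line is a header
    and that fields are never quoted.
    """
    rows = [
        # Split on the first n_columns - 1 commas; the rest stays in the last field.
        line.rstrip("\r\n").split(",", n_columns - 1)
        for line in lines
        if line.strip()  # skip blank lines
    ]
    header, data = rows[0], rows[1:]
    return pd.DataFrame(data, columns=header)


# Stand-in for an open file; the header names a, b, c are invented for this demo.
sample = io.StringIO(
    "a,b,c\n"
    "column1 is fine,column 2 is fine,column3, however, has commas in it!\n"
    "1,2,3\n"
)
df = read_loose_csv(sample)
print(df.loc[0, "c"])  # column3, however, has commas in it!
```

With a real file, the helper could be called as read_loose_csv(open('path/to/broken.csv')), ideally inside a with block. A row with fewer than n_columns fields would make the DataFrame constructor raise, which may be the desired behaviour for spotting truly broken lines.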
