Merge two CSV files based on data from a column

Question

Merge two CSV files based on data from a column

I have two csv files as shown below.

CSV1

data13 data23 d main_data1;main_data2 data13 data23 data12 data22 d main_data1;main_data2 data12 data22 data11 data21 d main_data1;main_data2 data11 data21 data3 data4 d main_data2;main_data4 data3 data4 data52 data62 d main_data3 data51 data62 data51 data61 d main_data3 main_data3 data61 data7 data8 d main_data4 data7 data8

CSV2

 id1 main_data1 a1 a2 a3 id2 main_data2 b1 b2 b3 id3 main_data3 c1 c2 c3 id4 main_data4 d1 d2 d3 id5 main_data5 e1 e2 e3

Now my question is: I know how to combine two CSV files when one of the columns in both files is the same. But my question is a little different. column 4 of CSV1 may contain column 2 of CSV2. I would like to get the CSV file below

FINAL_CSV

 id1 main_data1 a1 a2 a3 data13 id2 main_data2 b1 b2 b3 data3 id3 main_data3 c1 c2 c3 main_data3 id4 main_data4 d1 d2 d3 data7 id5 main_data5 e1 e2 e3

Where:
1. It compares the data from both columns and gets the corresponding rows from the first occurrence and writes to the csv file.
2. If there is no match, it can leave the last column in FINAL_CSV blank or write “NA” or something like that.
3. When the data in columns 4 and 5 of CSV1 match exactly, it returns this row instead of the first occurrence.

I completely lost how to do this. Help with its part is also wonderful. Any suggestions are welcome. PS-I know that the data from the csv file should be separated by a comma, but for clarity, I prefer tabs, although the actual data is separated by commas.

EDIT: Actually, "main_data" can be in any column in CSV2, and not just in column2. The same "main_data" can also be repeated in several lines, then I would like to get all the corresponding lines.

+5

python merge regex csv

abn Nov 20 '14 at 22:16

source share

4 answers

user3442743 · Answer 1 · 2015-01-02T22:24:08+0000

Method with (g) awk.

  awk -F, 'NR==FNR{a[$2]=$0;next} {split($4,b,";");x=b[1]} (x in a)&&!c[x]++{d[x]=$5} ($5 in a){d[$5]=$5} END{n=asorti(a,e);for(i=1;i<=n;i++)print a[e[i]]","d[e[i]]}' CSV1 CSV2

Exit

 id1,main_data1,a1,a2,a3,data13 id2,main_data2,b1,b2,b3,data3 id3,main_data3,c1,c2,c3,main_data3 id4,main_data4,d1,d2,d3,data7 id5,main_data5,e1,e2,e3,

lafras · Answer 2 · 2015-01-08T14:24:53+0000

Do you consider using pandas ? If you are familiar with R, then the data frames should be pretty simple. The following gives you what you want:

 from pandas import merge, read_table csv1 = read_table('CSV1.csv', sep=r"[;,]", header=None) csv2 = read_table('CSV2.csv', sep=r"[,]", header=None) print csv1 print csv2

Notice that I replaced the tabs with commas and split them into half-columns. The conclusion so far should be:

  0 1 2 3 4 5 6 0 data13 data23 d main_data1 main_data2 data13 data23 1 data12 data22 d main_data1 main_data2 data12 data22 2 data11 data21 d main_data1 main_data2 data11 data21 3 data3 data4 d main_data2 main_data4 data3 data4 4 data52 data62 d main_data3 NaN data51 data62 5 data51 data61 d main_data3 NaN main_data3 data61 6 data7 data8 d main_data4 NaN data7 data8 [7 rows x 7 columns] 0 1 2 3 4 0 id1 main_data1 a1 a2 a3 1 id2 main_data2 b1 b2 b3 2 id3 main_data3 c1 c2 c3 3 id4 main_data4 d1 d2 d3 4 id5 main_data5 e1 e2 e3 [5 rows x 5 columns]

Using the left join:

 kw1 = dict(how='left', \ left_on=[3,4], \ right_on=[1,1], \ suffixes=('l', 'r')) df1 = merge(csv1, csv2, **kw1) df1.drop_duplicates(cols=[3], inplace=True) print df1[[0,7]]

Gives the zero and seventh merge columns:

  3 5 0 main_data1 data13 3 main_data2 data3 4 main_data3 data51 6 main_data4 data7 [4 rows x 2 columns]

And to give the result how you want it, do another merge (this time an external join) with CSV2 :

 kw2 = dict(how='outer', \ left_on=[3], \ right_on=[1], \ suffixes=('l', 'r')) df2 = merge(df1, csv2, **kw2) print df2[[15,16,17,18,19,8]]

Conclusion:

  0 1 2 3r 4r 5 0 id1 main_data1 a1 a2 a3 data13 1 id2 main_data2 b1 b2 b3 data3 2 id3 main_data3 c1 c2 c3 data51 3 id4 main_data4 d1 d2 d3 data7 4 id5 main_data5 e1 e2 e3 NaN

You do not need to use **kw for keyword arguments. I just used it to do everything horizontally.

I allow read_table and merge define column names. If you assign column names yourself, you will get better output.

Matthias Ossadnik · Answer 3 · 2014-11-21T00:23:44+0000

Because the merge condition seems complex, it may be useful to load data into a database and use SQL. Using SQLite in memory, you can do it like this (assuming the data is separated by commas)

 import csv import sqlite3 def createTable(cursor, rows, tablename): tableCreated = False for row in rows: if not tableCreated: sql = "CREATE TABLE %s(ROW INTEGER PRIMARY KEY, " + ", ".join(["c%d" % (i+1) for i in range(len(row))]) + ")" cur.execute(sql % tablename) tableCreated = True sql = "INSERT INTO %s VALUES(NULL, " + ", ".join(["'" + c + "'" for c in row]) + ")" cur.execute(sql % tablename) conn.commit() conn = sqlite3.connect(":memory:") cur = conn.cursor() for filename, tablename in [(path_to_csv1, "CSV1"), (path_to_csv2, "CSV2")]: with open(filename, "r") as f: reader = csv.reader(f, delimiter=',') rows = [row for row in reader] createTable(cur, rows, tablename)

Then you can formulate your join logic in SQL. You can run queries as follows:

 for row in cur.execute(your_sql_statement): print row

The following query gives the desired result:

 WITH MATCHES AS( -- get all matches SELECT CSV2.* , CSV1.ROW as ROW_1 , CSV1.C4 as C4_1 , CSV1.C5 as C5_1 FROM CSV2 LEFT JOIN CSV1 ON CSV1.C4 LIKE '%' || CSV2.C2 || '%' ), EXACT AS( -- matches where CSV1.C4 = CSV1.C5 SELECT * FROM MATCHES WHERE C4_1 = C5_1 ), MIN_ROW AS( -- CSV1.ROW of first occurence for each CSV2.C1 SELECT C1 , min(ROW_1) as ROW_1 FROM MATCHES WHERE C1 NOT IN (SELECT C1 FROM EXACT) GROUP BY C1, C2, C3, C4, C5 ) -- use C4=C5 first SELECT * FROM EXACT UNION -- if match not in exact, use first occurence SELECT MATCHES.* FROM MIN_ROW INNER JOIN MATCHES ON MIN_ROW.C1 = MATCHES.C1 AND (MIN_ROW.ROW_1 = MATCHES.ROW_1 OR MIN_ROW.ROW_1 IS NULL) ORDER BY C1

mfitzp · Answer 4 · 2015-01-02T16:14:40+0000

Since you originally asked for a Python solution for this, I thought I provided it. The simplest solution that happened was to first load CSV1 and use it to create a mapping dictionary to use when generating output from CSV2.

If I understand the input file correctly, you need to consider only the values to the left of ; (if there is). This can be achieved using split(';') and accepting a null element. If not ; , then the zero element will be the entire string. Assigning a mapper then you just need to follow the rules that you defined (add only if they do not already exist, unless the columns 4 and 5 correspond).

In the code below you will get the requested result:

 import csv mapper = dict() with open('CSV1', 'r') as f1: reader = csv.reader(f1) for row in reader: # Column 3 contains the match; but we only want the left-most (before semi-colon) i = row[3].split(';')[0] # Column 4 contains the target value for output t = row[4] if i not in mapper: mapper[i] = t elif row[3] == row[4]: mapper[i] = t with open('CSV2', 'r') as f2: with open('FINAL_CSV', 'wb') as fo: reader = csv.reader(f2) writer = csv.writer(fo) for row in reader: if row[1] in mapper: row.append( mapper[ row[1] ] ) writer.writerow(row)

Output file:

 id1,main_data1,a1,a2,a3,data13 id2,main_data2,b1,b2,b3,data3 id3,main_data3,c1,c2,c3,main_data3 id4,main_data4,d1,d2,d3,data7 id5,main_data5,e1,e2,e3

To refer to the modification "main_data can be in any CSV column", use the following code:

 for row in reader: for r in row: if r in mapper: row.append( mapper[ r ] ) break writer.writerow(row)

This will search for each entry in the current CSV2 line and, if there is a match (with the original data of the mapper), add the matching data to the line. Then the string will be recorded as before.

Merge two CSV files based on data from a column

More articles: