Loop through the lines of one CSV file to find the corresponding data in another

I had an interesting problem:

file1.csv has several hundred lines, for example:

    Code,DTime
    1,2010-12-26 17:01
    2,2010-12-26 17:07
    2,2010-12-26 17:15

file2.csv has about 11 million lines, for example:

    id,D,Sym,DateTime,Bid,Ask
    1375022797,D,USD,2010-12-26 17:00:15,1.311400,1.311700
    1375022965,D,USD,2010-12-26 17:00:56,1.311200,1.311500
    1375022984,D,USD,2010-12-26 17:00:56,1.311300,1.311600
    1375023013,D,USD,2010-12-26 17:01:01,1.311200,1.311500
    1375023039,D,USD,2010-12-26 17:01:02,1.311100,1.311400
    1375023055,D,USD,2010-12-26 17:01:03,1.311200,1.311500
    1375023063,D,USD,2010-12-26 17:01:03,1.311300,1.311600

What I'm trying to do is write a script that takes each DTime value in file1.csv, finds the first partial match in the DateTime column of file2.csv, and displays DateTime, Bid and Ask for that line. A partial match means the first 16 characters are equal.

Both files are sorted from oldest to newest, so if "2010-12-26 17:01" from file1.csv matches 4 lines in file2.csv, I only need to extract the first one: "2010-12-26 17:01:01".

I don't know how to continue. I tried a dictionary, but the order of the values is important, so I'm not sure that will work. Maybe read the DTime column of file1 into a list and, for each entry in that list, search the DateTime column of file2?

Thanks guys,

+4
3 answers

If you don't have duplicate DTime values, this should work:

    import csv

    with open("file1.csv", newline="") as f1, open("file2.csv", newline="") as f2:
        file1reader = csv.reader(f1, delimiter=",")
        file2reader = csv.reader(f2, delimiter=",")
        header1 = next(file1reader)  # skip header
        header2 = next(file2reader)  # skip header

        for Code, DTime in file1reader:
            for id_, D, Sym, DateTime, Bid, Ask in file2reader:
                if DateTime.startswith(DTime):  # found it
                    print(DateTime, Bid, Ask)  # output data
                    break  # the next DTime resumes from this position in file2

Edit

    import csv
    from datetime import datetime

    with open("file1.csv", newline="") as f1, open("file2.csv", newline="") as f2:
        file1reader = csv.reader(f1, delimiter=",")
        file2reader = csv.reader(f2, delimiter=",")
        header1 = next(file1reader)  # skip header
        header2 = next(file2reader)  # skip header

        for Code, DTime in file1reader:
            DTime = datetime.strptime(DTime, "%Y-%m-%d %H:%M")
            for id_, D, Sym, DateTime, Bid, Ask in file2reader:
                DateTime = datetime.strptime(DateTime, "%Y-%m-%d %H:%M:%S")
                if DateTime >= DTime:  # first row at or after DTime
                    print(DateTime, Bid, Ask)  # output data
                    break  # the next DTime resumes from this position in file2
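
For what it's worth, here is a minimal sketch of a variant of the same single pass that compares the 16-character prefix as a plain string. It assumes both files are sorted oldest to newest, as the question says, and that the timestamps always use the "YYYY-MM-DD HH:MM[:SS]" layout, so string order equals time order. It does not consume the matching row, so a repeated DTime gets the same row again, and a minute with no rows in file2 is simply skipped. The names `first_matches` and `read_rows` are made up for illustration, and the sample data is copied from the question.

```python
import csv
import io


def first_matches(file1_rows, file2_rows):
    """Yield (DTime, DateTime, Bid, Ask) for the first file2 row of each DTime minute."""
    file2_iter = iter(file2_rows)
    current = next(file2_iter, None)  # the file2 row we are currently looking at
    for _code, dtime in file1_rows:
        # Skip file2 rows that belong to minutes earlier than dtime.
        while current is not None and current[3][:16] < dtime:
            current = next(file2_iter, None)
        # Report the row only if it falls in exactly the requested minute.
        if current is not None and current[3][:16] == dtime:
            yield dtime, current[3], current[4], current[5]


def read_rows(text):
    """Return a csv.reader over text, positioned after the header line."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip header
    return reader


# Sample data from the question; real code would open file1.csv and file2.csv instead.
FILE1 = """Code,DTime
1,2010-12-26 17:01
2,2010-12-26 17:07
2,2010-12-26 17:15
"""

FILE2 = """id,D,Sym,DateTime,Bid,Ask
1375022797,D,USD,2010-12-26 17:00:15,1.311400,1.311700
1375022965,D,USD,2010-12-26 17:00:56,1.311200,1.311500
1375022984,D,USD,2010-12-26 17:00:56,1.311300,1.311600
1375023013,D,USD,2010-12-26 17:01:01,1.311200,1.311500
1375023039,D,USD,2010-12-26 17:01:02,1.311100,1.311400
1375023055,D,USD,2010-12-26 17:01:03,1.311200,1.311500
1375023063,D,USD,2010-12-26 17:01:03,1.311300,1.311600
"""

for dtime, date_time, bid, ask in first_matches(read_rows(FILE1), read_rows(FILE2)):
    print(date_time, bid, ask)
```

With real files, `read_rows` would be replaced by `csv.reader` objects over the opened files, so file2.csv is still read only once from start to finish.
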
+6

If you only need to do this once, you really should use a database. Add a column to table2 containing the DATETIME without seconds, so you can join on exact matches instead of using LIKE.

It will be fast, and even faster if you index these columns. And if you also store file1.csv in the database, you don't need to iterate at all: you can get the whole result set with a single SELECT query. This is exactly the kind of task SQL was made for.

PS. If you decide to go with this approach, feel free to ask for help with the query.
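
A rough sketch of what the database approach could look like with Python's built-in `sqlite3` module. Everything here is an assumption made for illustration: the table layout, the column names, the extra `minute` column, and the index name are not from the question. It also assumes the rows of file2 are inserted in file order (oldest first), so the smallest `rowid` within a minute is the first line of that minute. Only a few sample rows from the question are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path here would keep the data on disk
conn.executescript("""
    CREATE TABLE table1 (code INTEGER, dtime TEXT);
    CREATE TABLE table2 (id INTEGER, d TEXT, sym TEXT, datetime TEXT,
                         bid TEXT, ask TEXT, minute TEXT);
    CREATE INDEX idx_table2_minute ON table2 (minute);
""")

# A few sample rows from the question; real code would feed csv.reader rows instead.
file1_rows = [
    (1, "2010-12-26 17:01"),
    (2, "2010-12-26 17:07"),
    (2, "2010-12-26 17:15"),
]
file2_rows = [
    (1375022984, "D", "USD", "2010-12-26 17:00:56", "1.311300", "1.311600"),
    (1375023013, "D", "USD", "2010-12-26 17:01:01", "1.311200", "1.311500"),
    (1375023039, "D", "USD", "2010-12-26 17:01:02", "1.311100", "1.311400"),
]

conn.executemany("INSERT INTO table1 VALUES (?, ?)", file1_rows)
# The 7th value is the DateTime cut to 16 characters, i.e. without seconds.
conn.executemany(
    "INSERT INTO table2 VALUES (?, ?, ?, ?, ?, ?, substr(?, 1, 16))",
    [row + (row[3],) for row in file2_rows],
)

QUERY = """
    SELECT t1.dtime, t2.datetime, t2.bid, t2.ask
    FROM table1 AS t1
    JOIN table2 AS t2 ON t2.minute = t1.dtime
    WHERE t2.rowid = (SELECT MIN(rowid) FROM table2 WHERE minute = t1.dtime)
    ORDER BY t1.rowid
"""
rows = conn.execute(QUERY).fetchall()
for dtime, date_time, bid, ask in rows:
    print(date_time, bid, ask)
```

The join is an exact comparison on the indexed `minute` column, so no LIKE is needed. Bid and Ask are stored as text here only to keep the formatting of the CSV values.
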

+3

You can create a dictionary from file2, where the key is the time prefix you want, and the value is either the first line or all the lines matching that prefix. Then it is just a matter of:

    entries = file2Dict.get(file1Entry)  # list of file2 lines for this prefix, or None
    if entries:
        print("First entry is %s" % entries[0])
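
Building that dictionary might look roughly like the sketch below. It keeps only the first line per 16-character prefix, so each value is a single row rather than a list, and the dictionary has one entry per distinct minute instead of one per line of file2. The function name `build_first_row_index` and the variable names are hypothetical, and the sample rows are taken from the question.

```python
import csv
import io


def build_first_row_index(rows):
    """Map each 'YYYY-MM-DD HH:MM' prefix to the first file2 row that has it."""
    index = {}
    for row in rows:
        key = row[3][:16]  # DateTime column without the seconds
        index.setdefault(key, row)  # later rows of the same minute are ignored
    return index


# Sample data from the question; real code would use open("file2.csv", newline="").
FILE2 = """id,D,Sym,DateTime,Bid,Ask
1375022984,D,USD,2010-12-26 17:00:56,1.311300,1.311600
1375023013,D,USD,2010-12-26 17:01:01,1.311200,1.311500
1375023039,D,USD,2010-12-26 17:01:02,1.311100,1.311400
"""

reader = csv.reader(io.StringIO(FILE2))
next(reader)  # skip header
file2_dict = build_first_row_index(reader)

for dtime in ["2010-12-26 17:01", "2010-12-26 17:07"]:
    row = file2_dict.get(dtime)
    if row:
        print(row[3], row[4], row[5])
```

Since the lookups go through the dictionary, the order of file1 does not matter with this approach; only the order of file2 decides which row counts as the first one.
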
+1

Source: https://habr.com/ru/post/1399743/

