Quickly extract pieces of lines from a large CSV file

I have a large CSV file full of stock data formatted as such:

Ticker Symbol, Date, [some variables ...]

So each line starts with a ticker symbol (for example, "AMZN"), then has a date, then has 12 variables related to price or volume on that date. The file covers about 10,000 different securities, with one line for every day each of them was publicly traded. The file is sorted first alphabetically by ticker symbol and then by date. The whole file is about 3.3 GB.

The task is to retrieve the last n rows of data for a given ticker symbol relative to the current date. I have code that does this, but from what I have observed it takes on average about 8-10 seconds per lookup (all tests retrieved 100 rows).

I have functions I would like to run that require grabbing such chunks for hundreds or thousands of symbols, so I really need to cut the time down. My code is inefficient, but I'm not sure how to make it faster.

First, I have a function called getData:

    def getData(symbol, filename):
        out = ["Symbol","Date","Open","High","Low","Close","Volume","Dividend",
               "Split","Adj_Open","Adj_High","Adj_Low","Adj_Close","Adj_Volume"]
        l = len(symbol)
        beforeMatch = True
        with open(filename, 'r') as f:
            for line in f:
                match = checkMatch(symbol, l, line)
                if beforeMatch and match:
                    beforeMatch = False
                    out.append(formatLineData(line[:-1].split(",")))
                elif not beforeMatch and match:
                    out.append(formatLineData(line[:-1].split(",")))
                elif not beforeMatch and not match:
                    break
        return out

(This code uses a couple of helper functions, checkMatch and formatLineData, which I show below.) Then there is another function, getDataColumn, that pulls out the column I want over the right number of days:

    def getDataColumn(symbol, col=12, numDays=100, changeRateTransform=False):
        dataset = getData(symbol)
        if not changeRateTransform:
            column = [day[col] for day in dataset[-numDays:]]
        else:
            n = len(dataset)
            column = [(dataset[i][col] - dataset[i-1][col])/dataset[i-1][col]
                      for i in range(n - numDays, n)]
        return column

(changeRateTransform converts raw numbers to daily rate of change numbers if True.) Helper functions:

    from datetime import datetime

    def checkMatch(symbol, symbolLength, line):
        out = False
        if line[:symbolLength+1] == symbol + ",":
            out = True
        return out

    def formatLineData(lineData):
        out = [lineData[0]]
        out.append(datetime.strptime(lineData[1], '%Y-%m-%d').date())
        out += [float(d) for d in lineData[2:6]]
        out += [int(float(d)) for d in lineData[6:9]]
        out += [float(d) for d in lineData[9:13]]
        out.append(int(float(lineData[13])))
        return out

Does anyone have an idea which parts of my code are slow and how I could do this better? I can't do the kind of analysis I want to do without speeding it up.


EDIT: In response to the comments, I made some changes to the code to use existing methods in the csv module:

    import csv

    def getData(symbol, database):
        out = ["Symbol","Date","Open","High","Low","Close","Volume","Dividend",
               "Split","Adj_Open","Adj_High","Adj_Low","Adj_Close","Adj_Volume"]
        l = len(symbol)
        beforeMatch = True
        with open(database, 'r') as f:
            databaseReader = csv.reader(f, delimiter=",")
            for row in databaseReader:
                match = (row[0] == symbol)
                if beforeMatch and match:
                    beforeMatch = False
                    out.append(formatLineData(row))
                elif not beforeMatch and match:
                    out.append(formatLineData(row))
                elif not beforeMatch and not match:
                    break
        return out

    def getDataColumn(dataset, col=12, numDays=100, changeRateTransform=False):
        if not changeRateTransform:
            out = [day[col] for day in dataset[-numDays:]]
        else:
            n = len(dataset)
            out = [(dataset[i][col] - dataset[i-1][col])/dataset[i-1][col]
                   for i in range(n - numDays, n)]
        return out

Performance was worse with the csv.reader class. I tested two stocks, AMZN (near the top of the file) and ZNGA (near the bottom of the file). With the original method, the runtimes were 0.99 seconds and 18.37 seconds, respectively. With the new csv-module method, the runtimes were 3.04 seconds and 64.94 seconds, respectively. Both return correct results.

My guess is that more of the time goes into searching for the stock than into parsing. If I run these methods on the first stock in the file, "A", they execute in about 0.12 seconds.
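
(For reference, a rough sketch of how such a comparison can be timed with time.perf_counter, assuming the functions above are importable; the filename here is a placeholder:)

    import time

    # Hypothetical timing harness for the lookups described above
    for sym in ('AMZN', 'ZNGA'):
        start = time.perf_counter()
        rows = getData(sym, 'stock_data.csv')   # placeholder filename
        print(sym, len(rows), 'rows in %.2f s' % (time.perf_counter() - start))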

+6
3 answers

When you are going to do a lot of analysis on the same data set, the pragmatic approach is to read it all into a database first. Databases are made for fast querying; CSV files are not. Use the sqlite command line tools, for example, which can import directly from CSV. Then add a single index on (Symbol, Date) and lookups will be practically instantaneous.
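
For what it's worth, here is a rough sketch of that route using Python's built-in sqlite3 module instead of the sqlite command line tools. The database name, table name and filename are made up, the import step only needs to run once, and it assumes the CSV header row contains simple column names like Symbol and Date:

    import csv
    import sqlite3

    conn = sqlite3.connect('stocks.db')              # hypothetical database file

    # One-time import: create a table from the CSV header and load all rows
    with open('stock_data.csv', newline='') as f:    # hypothetical CSV filename
        reader = csv.reader(f)
        header = next(reader)                        # e.g. Symbol, Date, Open, ...
        conn.execute('CREATE TABLE IF NOT EXISTS prices (%s)' % ', '.join(header))
        conn.executemany('INSERT INTO prices VALUES (%s)' % ', '.join('?' * len(header)),
                         reader)
    conn.execute('CREATE INDEX IF NOT EXISTS idx_symbol_date ON prices (Symbol, Date)')
    conn.commit()

    # Each lookup is then a single indexed query: last 100 rows for one symbol
    rows = conn.execute(
        'SELECT * FROM prices WHERE Symbol = ? ORDER BY Date DESC LIMIT 100',
        ('AMZN',),
    ).fetchall()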

If for some reason that is not possible, for example because new files can show up at any time and you cannot afford the preparation time before analyzing them, you will need to work with the CSV as efficiently as possible, which is what the rest of my answer focuses on. Remember that this is a balancing act: either you pay a lot up front, or a bit more for every lookup. Eventually, beyond some number of lookups, it would have been cheaper to pay up front.

Optimization is about maximizing the amount of work not done. Using generators and the built-in csv module isn't going to help much in this case. You would still read the whole file and parse all of it, at least for the line breaks. With that much data, that's a no-go.

Parsing requires reading, so you need a way to find the relevant lines first. The best practice of leaving all the intricacies of the CSV format to a specialized module doesn't count for much when it can't give you the performance you need. Some cheating must be done, but as little as possible. In this case, I suppose it's safe to assume that the start of a line of interest can be identified as b'\n"AMZN",' (sticking with your example). Yes, binary, because remember: no parsing yet. You could scan the file as binary from the beginning until you find the first such line. From there on, read the number of lines you need, decode and parse them the proper way, and so on. There's no need to optimize that part, because 100 lines are nothing to worry about compared to the hundreds of thousands of irrelevant lines you are not doing all that work for.

Dropping all that parsing buys you a lot, but the reading needs to be optimized too. Don't load the whole file into memory first, and skip as many layers of Python as you can. Using mmap lets the OS decide transparently what to load into memory, and lets you work with the data directly.
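
A bare-bones sketch of the idea so far (a linear scan, no binary search yet), assuming symbols are quoted in the file as in the example above; the function and file names are made up. The header line guarantees that every data line is preceded by a newline:

    import mmap

    def raw_symbol_lines(filename, symbol):
        """Yield the undecoded CSV lines for one symbol, scanning the file as bytes."""
        ident = b'\n"' + symbol.encode() + b'",'
        with open(filename, 'rb') as f:
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            start = mm.find(ident)        # linear scan for the first matching line
            if start == -1:
                return                    # symbol not present
            mm.seek(start + 1)            # position right after the newline
            while True:
                line = mm.readline()
                if not line.startswith(ident[1:]):   # symbol changed: we're done
                    break
                yield line                # still bytes; decode/parse only these

The full version further below replaces that linear find with the binary search described next.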

Still, you are potentially reading the whole file if the symbol is near the end. This is a linear search, which means the time it takes is directly proportional to the number of lines in the file. You can do better. Since the file is sorted, you can improve the function to perform a kind of binary search instead. The number of steps that takes (where a step is reading a line) is close to the binary logarithm of the number of lines; in other words, the number of times you can split the file into two (almost) equally sized parts. When there are a million lines, that is a difference of about five orders of magnitude!
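
As a quick sanity check of that claim (plain arithmetic, nothing specific to the file):

    import math

    lines = 1_000_000
    steps = math.ceil(math.log2(lines))   # line reads needed by a binary search
    print(steps, lines // steps)          # ~20 reads instead of ~1,000,000: roughly 50,000x fewer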

Here's what I came up with based on Python's own bisect_left with some measures to take into account the fact that your “values” span more than one index:

    import csv
    from itertools import islice
    import mmap

    def iter_symbol_lines(f, symbol):
        # How to recognize the start of a line of interest
        ident = b'"' + symbol.encode() + b'",'
        # The memory-mapped file
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        # Skip the header
        mm.readline()
        # The inclusive lower bound of the byte range we're still interested in
        lo = mm.tell()
        # The exclusive upper bound of the byte range we're still interested in
        hi = mm.size()
        # As long as the range isn't empty
        while lo < hi:
            # Find the position of the beginning of a line near the middle of the range
            mid = mm.rfind(b'\n', 0, (lo+hi)//2) + 1
            # Go to that position
            mm.seek(mid)
            # Is it a line that comes before lines we're interested in?
            if mm.readline() < ident:
                # If so, ignore everything up to right after this line
                lo = mm.tell()
            else:
                # Otherwise, ignore everything from right before this line
                hi = mid
        # We found where the first line of interest would be expected; go there
        mm.seek(lo)
        while True:
            line = mm.readline()
            if not line.startswith(ident):
                break
            yield line.decode()

    with open(filename) as f:
        r = csv.reader(islice(iter_symbol_lines(f, 'AMZN'), 10))
        for line in r:
            print(line)

No guarantees about this code: I didn't pay much attention to edge cases, and I couldn't test it against (any of) your file(s), so consider it a proof of concept. It is very fast, however; think tens of milliseconds on an SSD!

+3

So I have an alternative solution that I ran and tested myself, using a dataset I got from Quandl that seems to have all the same headers and similar data. (Assuming I haven't misunderstood the result you are trying to achieve.)

We have a command line tool that one of our engineers built for parsing massive CSVs, since I process an absurd amount of data daily. It is open source and you can get it here: https://github.com/DataFoxCo/gocsv

I also wrote a short bash script around it in case you don't want to pipe the commands by hand, though it supports piping too.

The command to run the short script below follows a simple convention:

bash tickers.sh wikiprices.csv 'AMZN' '2016-12-\d+|2016-11-\d+'

    #!/bin/bash

    dates="$3"

    cat "$1" \
      | gocsv filter --columns 'ticker' --regex "$2" \
      | gocsv filter --columns 'date' --regex "$dates" > "$2"'-out.csv'
  • Both the ticker argument and the dates argument are regular expressions.
  • You can add as many variations as you want within a single regular expression by separating them with | .
  • So if you want both AMZN and MSFT, you just change it to: AMZN|MSFT

  • I did something very similar with the dates, but limited them to anything from this month or last month.

Final result

Initial data:

    myusername$ gocsv dims wikiprices.csv
    Dimensions:
    Rows: 23946
    Columns: 14

    myusername$ bash tickers.sh wikiprices.csv 'AMZN|MSFT' '2016-12-\d+'

    myusername$ gocsv dims AMZN|MSFT-out.csv
    Dimensions:
    Rows: 24
    Columns: 14

Here is an example where I limited it to just those two tickers, and then only to December:

[screenshot of the resulting filtered CSV]

Voila: in a matter of seconds you have a second file saved containing only the data you need.

The gocsv program has excellent documentation, by the way, and a ton of other features, for example running a vlookup at basically any scale (which is what inspired its creator to build the tool).

+2

In addition to using csv.reader, I think using itertools.groupby would speed up finding the sections you want, so the actual iteration could look something like this:

    import csv
    from itertools import groupby
    from operator import itemgetter #for the keyfunc for groupby

    def getData(wanted_symbol, filename):
        with open(filename) as file:
            reader = csv.reader(file)
            #so each line in reader is basically line[:-1].split(",") from the plain file
            for symb, lines in groupby(reader, itemgetter(0)):
                #so here symb is the symbol at the start of each line of lines
                #and lines is the lines that all have that symbol in common
                if symb != wanted_symbol:
                    continue #skip this whole section if it has a different symbol
                for line in lines:
                    #here we have each line as a list of fields
                    #for only the lines that have `wanted_symbol` as the first element
                    <DO STUFF HERE>

So in the <DO STUFF HERE> spot you could do out.append(formatLineData(line)) to match what your current code does, but that function has a lot of unnecessary slicing and += operations which, in my opinion, are fairly expensive for lists (I may be wrong). Another way to apply the conversions is to keep a list of all of them:

    from datetime import datetime

    def conv_date(date_str):
        return datetime.strptime(date_str, '%Y-%m-%d').date()

    #the conversions applied to each element (taken from original formatLineData)
    castings = [str, conv_date,               #0, 1
                float, float, float, float,   #2:6
                int, int, int,                #6:9
                float, float, float, float,   #9:13
                int]                          #13

then use zip to apply them to each field in the line in a list comprehension:

  [conv(val) for conv, val in zip(castings, line)] 

so you would replace <DO STUFF HERE> with an out.append of that comprehension.


I also wonder whether it would be better to switch the order of groupby and reader, since you don't need to parse most of the file as CSV, only the parts you are actually iterating over. You could then use a keyfunc that splits off only the first field of each line:

    def getData(wanted_symbol, filename):
        out = [] #why are you starting this with strings in it?
        def checkMatch(line): #define the function to only take the line
            #this would be the keyfunc for groupby in this example
            return line.split(",",1)[0] #only split once, return the first element
        with open(filename) as file:
            for symb, lines in groupby(file, checkMatch):
                #so here symb is the symbol at the start of each line of lines
                if symb != wanted_symbol:
                    continue #skip this whole section if it has a different symbol
                for line in csv.reader(lines):
                    out.append([typ(val) for typ, val in zip(castings, line)])
        return out
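
For completeness, a small usage sketch of that last version (the filename is made up, and castings is the list defined earlier; column 12 is Adj_Close in your header order):

    data = getData('AMZN', 'stock_data.csv')          # converted rows for one symbol
    adj_close = [row[12] for row in data[-100:]]      # last 100 days of column 12 (Adj_Close)
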
+1

Source: https://habr.com/ru/post/1013246/

