I have a large CSV file full of stock data formatted as such:
Ticker Symbol, Date, [some variables ...]
So, each line starts with a ticker symbol (for example, "AMZN"), then has a date, then has 12 variables related to price or volume on that date. The file covers about 10,000 different securities, and I have one line for every day on which each of them was publicly traded. The file is sorted first alphabetically by ticker symbol and then by date. The entire file is about 3.3 GB.
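For concreteness, a single line might look something like this (the values here are invented for illustration; only the field order matters):

```
AMZN,2017-01-03,757.92,758.76,747.70,753.67,3521066,0,1,757.92,758.76,747.70,753.67,3521066
```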
The task I want to solve is to retrieve the last n rows of data for a given ticker symbol, relative to the current date. I have code that does this, but from my observations it takes an average of about 8-10 seconds per lookup (all tests retrieved 100 rows).
I have analyses I would like to run that would require me to grab such chunks for hundreds or thousands of symbols, so I really need to reduce this time. My code is inefficient, but I'm not sure how to make it faster.
First, I have a function called getData:
```python
def getData(symbol, filename):
    out = ["Symbol", "Date", "Open", "High", "Low", "Close", "Volume", "Dividend",
           "Split", "Adj_Open", "Adj_High", "Adj_Low", "Adj_Close", "Adj_Volume"]
    l = len(symbol)
    beforeMatch = True
    with open(filename, 'r') as f:
        for line in f:
            match = checkMatch(symbol, l, line)
            if beforeMatch and match:
                beforeMatch = False
                out.append(formatLineData(line[:-1].split(",")))
            elif not beforeMatch and match:
                out.append(formatLineData(line[:-1].split(",")))
            elif not beforeMatch and not match:
                break
    return out
```
(This code uses a couple of helper functions, checkMatch and formatLineData, which I show below.) Then there is another function, getDataColumn, that pulls out the column I want over the correct number of days:
```python
def getDataColumn(symbol, filename, col=12, numDays=100, changeRateTransform=False):
    dataset = getData(symbol, filename)
    if not changeRateTransform:
        column = [day[col] for day in dataset[-numDays:]]
    else:
        n = len(dataset)
        column = [(dataset[i][col] - dataset[i-1][col]) / dataset[i-1][col]
                  for i in range(n - numDays, n)]
    return column
```
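Given the header list above, col=12 corresponds to Adj_Close, so a typical call looks something like this (the filename is a placeholder for my data file):

```python
# Last 100 adjusted closes for AMZN; "prices.csv" is a placeholder path.
adjCloses = getDataColumn("AMZN", "prices.csv", col=12, numDays=100)
```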
(changeRateTransform converts the raw numbers into daily rate-of-change numbers when True.) The helper functions:
```python
from datetime import datetime

def checkMatch(symbol, symbolLength, line):
    # True if the line begins with "<symbol>,".
    out = False
    if line[:symbolLength + 1] == symbol + ",":
        out = True
    return out

def formatLineData(lineData):
    out = [lineData[0]]
    out.append(datetime.strptime(lineData[1], '%Y-%m-%d').date())
    out += [float(d) for d in lineData[2:6]]        # Open, High, Low, Close
    out += [int(float(d)) for d in lineData[6:9]]   # Volume, Dividend, Split
    out += [float(d) for d in lineData[9:13]]       # Adj_Open .. Adj_Close
    out.append(int(float(lineData[13])))            # Adj_Volume
    return out
```
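The timings I quote are simple wall-clock measurements, along these lines (a minimal sketch; the filename is again a placeholder):

```python
import time

start = time.perf_counter()
rows = getData("AMZN", "prices.csv")
elapsed = time.perf_counter() - start
print(f"fetched {len(rows)} items in {elapsed:.2f} s")
```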
Does anyone have an idea which parts of my code are slow and how I can do this better? I can't do the kind of analysis I want to do without speeding it up.
EDIT: In response to the comments, I changed the code to use the built-in csv module:
```python
import csv

def getData(symbol, database):
    out = ["Symbol", "Date", "Open", "High", "Low", "Close", "Volume", "Dividend",
           "Split", "Adj_Open", "Adj_High", "Adj_Low", "Adj_Close", "Adj_Volume"]
    beforeMatch = True
    with open(database, 'r') as f:
        databaseReader = csv.reader(f, delimiter=",")
        for row in databaseReader:
            match = (row[0] == symbol)
            if beforeMatch and match:
                beforeMatch = False
                out.append(formatLineData(row))
            elif not beforeMatch and match:
                out.append(formatLineData(row))
            elif not beforeMatch and not match:
                break
    return out

def getDataColumn(dataset, col=12, numDays=100, changeRateTransform=False):
    if not changeRateTransform:
        out = [day[col] for day in dataset[-numDays:]]
    else:
        n = len(dataset)
        out = [(dataset[i][col] - dataset[i-1][col]) / dataset[i-1][col]
               for i in range(n - numDays, n)]
    return out
```
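Note the changed calling convention: getDataColumn now takes the already-loaded dataset rather than a symbol, so the two steps are separate (filename again a placeholder):

```python
dataset = getData("AMZN", "prices.csv")
adjCloses = getDataColumn(dataset, col=12, numDays=100)
```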
Performance was worse with the csv.reader version. I tested two stocks, AMZN (near the top of the file) and ZNGA (near the bottom of the file). With the original method, the runtimes were 0.99 seconds and 18.37 seconds, respectively; with the new csv-based method, they were 3.04 seconds and 64.94 seconds. Both return correct results.
My guess is that more of the time is spent searching for the stock than parsing it. If I try these methods on the first stock in the file, A, both run in about 0.12 seconds.
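One way to check that guess is to time a scan that does no parsing at all; if this alone takes many seconds for a symbol near the end of the file, the linear search is the bottleneck (a minimal sketch, with a placeholder filename):

```python
import time

start = time.perf_counter()
count = 0
with open("prices.csv", "r") as f:
    for line in f:
        # No parsing -- just count the lines that belong to the symbol.
        if line.startswith("ZNGA,"):
            count += 1
print(count, "matching lines in", round(time.perf_counter() - start, 2), "s")
```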