Python MySQLdb SSCursor is slow compared to exporting to and reading from a CSV file. Can it be sped up?

As part of building a data warehouse, I have to query a source database table containing on the order of 75M rows.

I want to do some processing on those 75M rows and then add the result to another database. This is quite a lot of data, and I have had some success with basically two approaches:

1) Exporting the query result to a CSV file using MySQL's "SELECT ... INTO" capability and reading it with Python's fileinput module, and

2) connecting to the MySQL database with a MySQLdb SSCursor (the default cursor loads the entire result set into memory, which kills the Python script) and fetching the results in chunks of roughly 10 thousand rows (that is the chunk size I found to be fastest).

The first approach is a manual SQL query (which takes about 6 minutes), followed by a Python script that reads the CSV file and processes it. The reason I use fileinput to read the file is that fileinput does not load the entire file into memory up front and copes well with large files. Simply traversing the file (reading each line and doing nothing with it) takes about 80 seconds, i.e. roughly 1M lines/s.
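
For reference, a minimal sketch of that baseline traversal test (the timing code and counter are illustrative only; the path matches the one used in the processing script further down):

    import fileinput
    import time

    INPUT_PATH = 'path/to/csv/dump/dump.csv'

    # Baseline: walk the file line by line without doing anything with the
    # lines, to measure pure fileinput read speed (~80 s for ~75M lines).
    start = time.time()
    n = 0
    for line in fileinput.input([INPUT_PATH]):
        n += 1
    print '%d lines in %.1f s' % (n, time.time() - start)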

The second approach is a Python script executing the same query (which also takes about 6 minutes, or slightly longer), followed by a while loop that keeps fetching chunks for as long as any rows remain in the SSCursor. Here, just reading the rows (fetching one chunk after another and doing nothing else) takes about 15 minutes, i.e. roughly 85 thousand lines/s.

The two rates (lines/s) above are admittedly not directly comparable, but when comparing the two approaches in my application, the first takes about 20 minutes in total (of which about five are spent on the MySQL dump to the CSV file) and the second takes about 35 minutes (of which about five are spent executing the query). That means dumping to and reading from a CSV file is roughly twice as fast as using the SSCursor directly.

This would not be a problem if it did not limit the portability of my system: the "SELECT ... INTO" statement requires MySQL to have file-writing privileges, and I suspect it is not as safe as using cursors. On the other hand, 15 minutes (and growing as the source database grows) is not really something I can spare on every build.

So, am I missing something? Is there a known reason why an SSCursor should be so much slower than dumping to and reading from a CSV file, e.g. that fileinput is C-optimized while the SSCursor is not? Any ideas on how to attack this problem? Anything to check? I would like to believe that the SSCursor can be as fast as the first approach, but after reading everything I could find on the subject, I'm stumped.

Now, to the code:

Not that I think the query is the problem (it is as fast as I could ask for and takes a similar time in both approaches), but here it is for completeness:

    SELECT LT.SomeID, LT.weekID, W.monday,
           GREATEST(LT.attr1, LT.attr2)
    FROM LargeTable LT JOIN Week W ON LT.weekID = W.ID
    ORDER BY LT.someID ASC, LT.weekID ASC;

The primary code in the first approach looks something like this:

    import fileinput

    INPUT_PATH = 'path/to/csv/dump/dump.csv'

    event_list = []
    ID = -1
    for line in fileinput.input([INPUT_PATH]):
        split_line = line.split(';')
        if split_line[0] == ID:
            # Same ID as the previous row: extend the current group.
            event_list.append(split_line[1:])
        else:
            # New ID: process the finished group and start a new one.
            process_function(ID, event_list)
            event_list = [split_line[1:]]
            ID = split_line[0]
    # Process the last group.
    process_function(ID, event_list)

The primary code in the second approach:

    import MySQLdb

    # ... open the connection and create an SSCursor called ssc ...

    CHUNK_SIZE = 100000

    query_stmt = """SELECT LT.SomeID, LT.weekID, W.monday,
                           GREATEST(LT.attr1, LT.attr2)
                    FROM LargeTable LT JOIN Week W ON LT.weekID = W.ID
                    ORDER BY LT.someID ASC, LT.weekID ASC"""
    ssc.execute(query_stmt)

    event_list = []
    ID = -1
    data_chunk = ssc.fetchmany(CHUNK_SIZE)
    while data_chunk:
        for row in data_chunk:
            if row[0] == ID:
                # Same ID as the previous row: extend the current group.
                event_list.append([row[1], row[2], row[3]])
            else:
                # New ID: process the finished group and start a new one.
                process_function(ID, event_list)
                event_list = [[row[1], row[2], row[3]]]
                ID = row[0]
        data_chunk = ssc.fetchmany(CHUNK_SIZE)
    # Process the last group.
    process_function(ID, event_list)
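
Since the connection setup is elided above, here is a rough sketch of how an SSCursor is typically obtained with MySQLdb (host, user, password and database name are placeholders):

    import MySQLdb
    import MySQLdb.cursors

    # A server-side cursor streams rows from the server instead of loading
    # the whole result set into client memory, as the default cursor would.
    conn = MySQLdb.connect(host='localhost', user='dwh_user',
                           passwd='secret', db='source_db',
                           cursorclass=MySQLdb.cursors.SSCursor)
    ssc = conn.cursor()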

Finally, I'm on Ubuntu 13.04 with a MySQL 5.5.31 server. I am using Python 2.7.4 with MySQLdb 1.2.3. Thank you for staying with me for so long!

1 answer

Using cProfile I found that a lot of time was spent implicitly constructing Decimal objects, since that was the numeric type returned from the SQL query to my Python script. In the first approach, the Decimal values were written to the CSV file as integers and then read back as such by the Python script, so the CSV file I/O effectively flattened the data and made that script faster. The two scripts now run at roughly the same speed (the second approach is still a tad slower).
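
For reference, a minimal sketch of the kind of cProfile run that surfaces this, assuming the chunk-reading loop is wrapped in a main() function (the function name and the stats file name are placeholders):

    import cProfile
    import pstats

    # Run the processing under the profiler and save the stats to a file.
    # (Roughly equivalent on the command line:
    #  python -m cProfile -o ssc.stats script.py)
    cProfile.run('main()', 'ssc.stats')

    # Sort by cumulative time; Decimal construction shows up near the top
    # when the driver converts every numeric column to decimal.Decimal.
    pstats.Stats('ssc.stats').sort_stats('cumulative').print_stats(20)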

I also pushed the conversion of the date to an integer type into the MySQL query. My query now reads:

    SELECT LT.SomeID, LT.weekID,
           CAST(DATE_FORMAT(W.monday, '%Y%m%d') AS UNSIGNED),
           CAST(GREATEST(LT.attr1, LT.attr2) AS UNSIGNED)
    FROM LargeTable LT JOIN Week W ON LT.weekID = W.ID
    ORDER BY LT.someID ASC, LT.weekID ASC;

This almost eliminates the difference in processing time between the two approaches.

The lesson here is that when running large queries, the post-processing of data types DOES matter! Rewriting the query to reduce function calls on the Python side can improve the overall processing speed significantly.
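
As an alternative (not what was done above, just a sketch of the same idea applied on the Python side), MySQLdb can be told to convert DECIMAL columns straight to int at the driver level, which avoids the Decimal constructor entirely; the connection parameters are placeholders, and this is only safe if the column values really are integral:

    import MySQLdb
    import MySQLdb.cursors
    from MySQLdb.constants import FIELD_TYPE
    from MySQLdb.converters import conversions

    # Copy the default type map and make DECIMAL/NEWDECIMAL columns come
    # back as plain ints instead of decimal.Decimal objects.
    # NOTE: int('123.45') raises ValueError, so only do this for columns
    # that are known to hold whole numbers.
    conv = conversions.copy()
    conv[FIELD_TYPE.DECIMAL] = int
    conv[FIELD_TYPE.NEWDECIMAL] = int

    conn = MySQLdb.connect(host='localhost', user='dwh_user',
                           passwd='secret', db='source_db',
                           conv=conv,
                           cursorclass=MySQLdb.cursors.SSCursor)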

