On our production server we need to split about 900,000 images across different servers and update about 400,000 rows (MySQL, InnoDB engine). I wrote a Python script that goes through the following steps:
- Select a small batch of rows from the database (10 rows at a time)
- Make the new directories
- Copy the files into the created directories and rename them (these two steps are handled by helpers, sketched just after this list)
- Update the database (there are some UPDATE triggers that put extra load on the server)
- Repeat
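The directory-creation and copy steps are handled by two small helpers that I have left out of the main script below; roughly, they do something like this (a simplified sketch, the exact destination/rename scheme is omitted):

import os, shutil

def make_dst_dirs(dst_path, dirnames):
    # Create every destination directory that does not exist yet.
    for dirname in dirnames:
        dst_dir = os.path.join(dst_path, dirname)
        if not os.path.isdir(dst_dir):
            os.makedirs(dst_dir)

def copy_news_images(src_paths, dst_path, news_images):
    # Copy each image into its new directory under its new name and
    # return the items whose db rows need to be updated afterwards.
    news_to_update = []
    for item in news_images:
        for src_path in src_paths:
            src_file = os.path.join(src_path, item['src_filename'])
            if not os.path.isfile(src_file):
                continue
            dst_file = os.path.join(dst_path, item['dst_dirname'], item['filename'])
            shutil.copy2(src_file, dst_file)
            news_to_update.append(item)
            break
    return news_to_update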
My code is:
import os, shutil
import database

# DB_HOST, DB_NAME, DB_USER, DB_PASSWD are defined elsewhere in the script.

LIMIT_START_OFFSET = 0
LIMIT_ROW_COUNT = 10

SRC_PATHS = ('/var/www/site/public/upload/images/',)
DST_PATH = '/var/www/site/public/upload/new_images/'

def main():
    offset = LIMIT_START_OFFSET
    while True:
        db = database.Connection(DB_HOST, DB_NAME, DB_USER, DB_PASSWD)
        db_data = db.query('''
            SELECT id AS news_id, image AS src_filename
            FROM emd_news
            ORDER BY id ASC
            LIMIT %s, %s''', offset, LIMIT_ROW_COUNT)
        offset = offset + LIMIT_ROW_COUNT

        news_images = get_news_images(db_data)
        make_dst_dirs(DST_PATH, [i['dst_dirname'] for i in news_images])
        news_to_update = copy_news_images(SRC_PATHS, DST_PATH, news_images)

        db.executemany('''
            UPDATE emd_news
            SET image = %s
            WHERE id = %s
            LIMIT 1''', [(i['filename'], i['news_id']) for i in news_to_update])
        db.close()

        if not db_data:
            break

if __name__ == '__main__':
    main()
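The remaining helper, get_news_images, just reshapes the query rows into the dicts used above. The real destination naming scheme isn't shown here, so the bucketing below (roughly 1000 images per directory, keyed by id) is only an illustration:

def get_news_images(db_data):
    # Turn each row into the dict that the copy/update steps expect.
    # Grouping about 1000 images per directory is only an example scheme.
    news_images = []
    for row in db_data:
        ext = os.path.splitext(row['src_filename'])[1]
        dirname = str(row['news_id'] // 1000)
        news_images.append({
            'news_id': row['news_id'],
            'src_filename': row['src_filename'],
            'dst_dirname': dirname,
            'filename': '%s%s' % (row['news_id'], ext),
        })
    return news_images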
Pretty simple task, but I'm a little nervous about performance.
How can I make this script more efficient?
UPD: In the end, I used the original script without any changes. It took about 5 hours, and it was fast at the beginning and very slow at the end.
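The slowdown toward the end was most likely the growing OFFSET: MySQL has to scan past all of the skipped rows on every batch, so each SELECT gets progressively more expensive. Paging by the last id handled (keyset pagination) avoids that; a rough, untested variant of the main loop:

# Rough sketch: page by the last processed id instead of a growing OFFSET.
# Connection handling, copying and the UPDATEs stay the same as above.
last_id = 0
while True:
    db_data = db.query('''
        SELECT id AS news_id, image AS src_filename
        FROM emd_news
        WHERE id > %s
        ORDER BY id ASC
        LIMIT %s''', last_id, LIMIT_ROW_COUNT)
    if not db_data:
        break
    last_id = db_data[-1]['news_id']
    # ...make dirs, copy files and run the UPDATEs exactly as before...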