How to efficiently read a large (30 GB+) TAR file of BZ2-compressed JSON Twitter files into PostgreSQL

I am trying to get Twitter data from the archive.org archives and load it into a database. The plan is to first download all the tweets for a specific month, then make a selection and store only the tweets that interest me (for example, by language or hashtag).

The script below accomplishes what I'm looking for, but it is incredibly slow: after running for about half an hour it has read only ~6 of the 50,000 inner .bz2 files in a single TAR file.

Some statistics of the sample tar file:

  • Total size: ~30-40 GB
  • Number of inner .bz2 files (in folders): 50,000
  • Size of a single .bz2 file: ~600 KB
  • Size of one extracted JSON file: ~5 MB, ~3,600 tweets

What should I look for when optimizing this process for speed?

  • Should I extract files to disk instead of buffering them in Python?
  • Should I look into multithreading? Which part of the process would benefit most from it?
  • Alternatively, is the speed that I am currently getting relatively normal for such a script?

Currently, the script uses ~3% of my CPU and ~6% of my RAM.

Any help is greatly appreciated.

    import bz2       # needed for bz2.decompress below (missing from the original listing)
    import datetime
    import json
    import tarfile

    import dataset   # Using dataset as I'm still iteratively developing the table structure(s)


    def scrape_tar_contents(filename):
        """Iterate over an input TAR file, retrieving each .bz2 container:
        extract & retrieve the JSON contents; store the JSON contents in a PostgreSQL database."""
        tar = tarfile.open(filename, 'r')
        inner_files = [name for name in tar.getnames() if name.endswith('.bz2')]
        num_bz2_files = len(inner_files)
        bz2_count = 1
        print('Starting work on file... ' + filename[-20:])
        for bz2_filename in inner_files:  # Loop over all files in the TAR archive
            print('Starting work on inner file... ' + bz2_filename[-20:] +
                  ': ' + str(bz2_count) + '/' + str(num_bz2_files))
            t_extract = tar.extractfile(bz2_filename)
            data = t_extract.read()
            txt = bz2.decompress(data).decode('utf-8')
            tweet_errors = 0
            num_lines = len(txt.split('\n'))
            # Loop over the lines of the resulting text file, one JSON tweet per line.
            for current_line, line in enumerate(txt.split('\n'), start=1):
                if current_line % 100 == 0:
                    print('Working on line ' + str(current_line) + '/' + str(num_lines))
                try:
                    tweet = json.loads(line)
                except ValueError as e:
                    error_log = {'Date_time': datetime.datetime.now(),
                                 'File_TAR': filename,
                                 'File_BZ2': bz2_filename,
                                 'Line_number': current_line,
                                 'Line': line,
                                 'Error': str(e)}
                    tweet_errors += 1
                    db['error_log'].upsert(error_log, ['File_TAR', 'File_BZ2', 'Line_number'])
                    print('Error occurred, now at ' + str(tweet_errors))
                try:
                    tweet_id = tweet['id']
                    tweet_text = tweet['text']
                    tweet_locale = tweet['lang']
                    created_at = tweet['created_at']
                    tweet_json = tweet
                    data = {'tweet_id': tweet_id,
                            'tweet_text': tweet_text,
                            'tweet_locale': tweet_locale,
                            'created_at_str': created_at,
                            'date_loaded': datetime.datetime.now(),
                            'tweet_json': tweet_json}
                    db['tweets'].upsert(data, ['tweet_id'])
                except KeyError as e:
                    error_log = {'Date_time': datetime.datetime.now(),
                                 'File_TAR': filename,
                                 'File_BZ2': bz2_filename,
                                 'Line_number': current_line,
                                 'Line': line,
                                 'Error': str(e)}
                    tweet_errors += 1
                    db['error_log'].upsert(error_log, ['File_TAR', 'File_BZ2', 'Line_number'])
                    print('Error occurred, now at ' + str(tweet_errors))
                    continue
            bz2_count += 1


    if __name__ == "__main__":
        with open("postgresConnecString.txt", 'r') as f:
            db_connectionstring = f.readline()
        db = dataset.connect(db_connectionstring)  # module-level handle used inside scrape_tar_contents
        filename = r'H:/Twitter datastream/Sourcefiles/archiveteam-twitter-stream-2013-01.tar'
        scrape_tar_contents(filename)
1 answer

A tar file does not contain an index of where its member files are located. Moreover, a tar archive may contain multiple copies of the same file. Therefore, when you extract a single file, the entire tar file has to be read; even after the file is found, the rest of the archive must still be read to check whether a later copy exists.

This makes extracting a single file as costly as extracting all files.

Therefore, never use tar.extractfile(...) repeatedly on a large tar file (unless you need only one file or you don't have space to extract everything).
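(As an aside, not part of this answer's main recommendation: if disk space were the limiting factor, tarfile's streaming mode reads the archive exactly once and so avoids the repeated scans. A minimal, untested sketch; handle_tweet_line is a placeholder for the JSON/DB logic from the question:)

    import bz2
    import tarfile

    def stream_tar(tar_path):
        # 'r|' opens the archive in streaming mode: members are yielded in
        # archive order in a single sequential pass, with no seeking back.
        with tarfile.open(tar_path, 'r|') as tar:
            for member in tar:
                if not member.isfile() or not member.name.endswith('.bz2'):
                    continue
                fileobj = tar.extractfile(member)  # valid only for the member currently being streamed
                if fileobj is None:
                    continue
                txt = bz2.decompress(fileobj.read()).decode('utf-8')
                for line in txt.splitlines():
                    if line.strip():
                        handle_tweet_line(line)    # placeholder: parse JSON, upsert into the database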

If you have the space (and given the size of modern hard drives, you almost certainly do), extract everything, either with tar.extractall or with a system call to tar xf ..., and then process the extracted files, as sketched below.
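
A minimal sketch of that approach (untested; process_tweet_line stands in for the JSON-parsing and upsert logic from the question, and bz2.open(..., 'rt') requires Python 3):

    import bz2
    import os
    import tarfile

    def extract_then_process(tar_path, extract_dir):
        # Single sequential read of the archive: unpack every member to disk once.
        with tarfile.open(tar_path, 'r') as tar:
            tar.extractall(path=extract_dir)

        # Each extracted .bz2 file is then read exactly once, with no re-scanning of the tar.
        for root, _dirs, files in os.walk(extract_dir):
            for name in files:
                if not name.endswith('.bz2'):
                    continue
                with bz2.open(os.path.join(root, name), 'rt', encoding='utf-8') as f:
                    for line in f:
                        if line.strip():
                            process_tweet_line(line)  # placeholder: parse JSON, upsert into PostgreSQL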


