I am trying to get Twitter data from the archive.org archives and load it into a database. The plan is to first download all the tweets for a specific month, then filter them and keep only the ones that interest me (for example, by language or hashtag).
I can run the script below to accomplish this, but it is incredibly slow: it ran for about half an hour and processed only ~6 of the 50,000 internal .bz2 files in a single TAR file.
Some statistics for the sample TAR file:
- Total size: ~30-40 GB
- Number of internal .bz2 files (in folders): 50,000
- Size of a single .bz2 file: ~600 KB
- Size of one extracted JSON file: ~5 MB, ~3,600 tweets
What should I look for when optimizing this process for speed?
- Should I extract files to disk instead of buffering them in Python?
- Should I look at multithreading or multiprocessing for part of the process? If so, which part would benefit most? (A rough sketch of one possibility follows this list.)
- Alternatively, is the speed that I am currently getting relatively normal for such a script?
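To illustrate what I mean by the multithreading/multiprocessing question, I had in mind something like the sketch below, where the main process reads the compressed members sequentially and a worker pool handles the decompression and JSON parsing. This is untested; the pool size, the placeholder path, and what to do with the parsed tweets are guesses on my part:

    import bz2
    import json
    import multiprocessing as mp
    import tarfile

    TAR_PATH = "twitter-stream-sample.tar"  # placeholder path

    def parse_member(raw):
        """Decompress one .bz2 member and parse its tweets (one JSON object per line)."""
        lines = bz2.decompress(raw).splitlines()
        return [json.loads(line) for line in lines if line.strip()]

    if __name__ == "__main__":
        with tarfile.open(TAR_PATH) as tar, mp.Pool(processes=4) as pool:
            # The main process only reads the compressed bytes; workers do the CPU work.
            raw_members = (tar.extractfile(m).read()
                           for m in tar if m.name.endswith(".bz2"))
            for tweets in pool.imap_unordered(parse_member, raw_members):
                pass  # filter the tweets and insert them into the database here

I have not measured whether decompression is actually the bottleneck, which is part of what I am unsure about.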
Currently, the script uses ~3% of my CPU and ~6% of my RAM.
Any help is greatly appreciated.
    import tarfile
    import dataset
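Simplified, the script looks roughly like this (the connection string, archive path, table name, and the English-only filter are placeholders standing in for my real values):

    import bz2
    import json
    import tarfile

    import dataset

    db = dataset.connect("sqlite:///tweets.db")      # placeholder connection string
    table = db["tweets"]                             # placeholder table name

    with tarfile.open("twitter-stream-2017-01.tar") as tar:  # placeholder archive path
        for member in tar:                           # iterate members as a stream
            if not member.name.endswith(".bz2"):
                continue
            raw = tar.extractfile(member).read()     # compressed member, buffered in memory
            for line in bz2.decompress(raw).splitlines():  # one JSON tweet per line
                if not line.strip():
                    continue
                tweet = json.loads(line)
                if "text" not in tweet:              # skip delete notices and other non-tweet records
                    continue
                if tweet.get("lang") != "en":        # placeholder filter (language/hashtag)
                    continue
                table.insert({
                    "tweet_id": tweet["id_str"],
                    "text": tweet["text"],
                    "lang": tweet["lang"],
                    "created_at": tweet["created_at"],
                })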