This is not about reading large JSON files, but about how to most efficiently read a large number of JSON files.
Question
I am working with the last.fm dataset from the Million Song Dataset. The data is available as a set of text files encoded in JSON, where the keys are: track_id, artist, title, timestamp, similars and tags.
After going through a few options, I am currently reading them into pandas as follows, since this turned out to be the fastest approach, as shown here:
import os
import pandas as pd
try:
    import ujson as json
except ImportError:
    try:
        import simplejson as json
    except ImportError:
        import json
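A minimal sketch of the loading step itself (the dataset directory path and the way all_files is built here are assumptions on my part; the per-file json.load call matches the update below):

# Sketch only: data_dir and the os.walk scan are assumed, not taken
# from the original code.
data_dir = 'lastfm_train'  # placeholder path to the extracted train set

# Collect the path of every JSON file under the dataset directory.
all_files = [os.path.join(root, name)
             for root, _, names in os.walk(data_dir)
             for name in names
             if name.endswith('.json')]

# Parse each file and build a DataFrame with the keys listed above.
data_list = [json.load(open(file)) for file in all_files]
df = pd.DataFrame(data_list,
                  columns=['track_id', 'artist', 'title',
                           'timestamp', 'similars', 'tags'])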
The current method reads a small subset (1% of the full data set) in less than a second. However, reading the entire train set takes forever (I waited a couple of hours and it still had not finished), and this has become a bottleneck for further tasks, such as those shown here.
I also already use ujson to speed up JSON parsing, as you can see here and here.
UPDATE 1: Using a generator expression instead of a list comprehension:
data_list = (json.load(open(file)) for file in all_files)
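A minimal sketch of how the generator can be consumed, assuming the records go straight into a DataFrame; the helper that closes each file handle explicitly is my addition and not part of the one-liner above:

def load_json(path):
    # Explicitly close each file handle; the one-liner above relies on
    # garbage collection to do this.
    with open(path) as f:
        return json.load(f)

data_list = (load_json(file) for file in all_files)
# pandas materializes the iterator internally, so every parsed record
# still ends up in memory while the DataFrame is being built.
df = pd.DataFrame(data_list)

Note that the DataFrame constructor still has to materialize every parsed record before building the frame, so switching to a generator mostly delays the work rather than reducing the parsing time, which is why it has not helped so far.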