Reading a huge number of JSON files in Python?

This is not about reading large JSON files, but about how to most efficiently read a large number of JSON files.

Question

I am working with the last.fm dataset from the Million Song Dataset. The data is available as a set of text files encoded in JSON, where the keys are: track_id, artist, title, timestamp, similars and tags.
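
For reference, json.load on one of these files returns a dict with those keys. A hypothetical illustration of the shape, with placeholder values rather than real dataset entries:

    {
        "track_id": "TR_PLACEHOLDER",
        "artist": "Some Artist",
        "title": "Some Title",
        "timestamp": "2011-08-02 00:00:00",
        "similars": [["TR_OTHER_A", 0.9], ["TR_OTHER_B", 0.5]],   # similar tracks with scores
        "tags": [["rock", 100], ["indie", 40]]                    # tags with counts
    }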

After going through a few options, I am currently reading them into pandas as follows, since it was the fastest approach as shown here:

    import os
    import pandas as pd

    try:
        import ujson as json
    except ImportError:
        try:
            import simplejson as json
        except ImportError:
            import json

    # Path to the dataset
    path = "../lastfm_train/"

    # Getting list of all json files in dataset
    all_files = [os.path.join(root, file)
                 for root, dirs, files in os.walk(path)
                 for file in files
                 if file.endswith('.json')]

    data_list = [json.load(open(file)) for file in all_files]

    df = pd.DataFrame(data_list, columns=['similars', 'track_id'])
    df.set_index('track_id', inplace=True)

The current method reads a small subset (1% of the full data set) in less than a second. However, reading the entire train set is far too slow: I waited a couple of hours and it still had not finished, and it has become a bottleneck for further tasks, such as the one shown here.

I also use ujson to speed up parsing of the JSON files, as you can see here and here.

UPDATE 1: Using a generator expression instead of a list comprehension:

    data_list = (json.load(open(file)) for file in all_files)
2 answers

If you need to read and write the dataset several times, you can try converting the .json files to a faster binary format. For example, in pandas 0.20+, you can try using the .feather format.
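
A minimal sketch of that one-time conversion, assuming pandas 0.20+ with the feather dependency installed, and assuming the list-valued similars column serializes cleanly (the file name lastfm_train.feather is arbitrary; to_feather needs a default integer index, so set_index is applied only after reloading):

    import pandas as pd

    # One-time conversion: build the DataFrame from the parsed JSON files
    # (data_list as in the question), then write it out in binary form.
    df = pd.DataFrame(data_list, columns=['similars', 'track_id'])
    df.to_feather('lastfm_train.feather')

    # Later runs skip JSON parsing entirely and just reload the binary file:
    df = pd.read_feather('lastfm_train.feather')
    df.set_index('track_id', inplace=True)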


I would build an iterator over the files and yield just the two columns you want.

Then you can create an instance of the DataFrame with this iterator.

    import os
    import json
    import pandas as pd

    # Path to the dataset
    path = "../lastfm_train/"

    def data_iterator(path):
        for root, dirs, files in os.walk(path):
            for f in files:
                if f.endswith('.json'):
                    fp = os.path.join(root, f)
                    with open(fp) as o:
                        data = json.load(o)
                    yield {"similars": data["similars"], "track_id": data["track_id"]}

    df = pd.DataFrame(data_iterator(path))
    df.set_index('track_id', inplace=True)

This way you walk the list of files only once, and you do not duplicate the data in memory before handing it to the DataFrame.
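
A side benefit of the iterator approach is that it is lazy, so it can be sliced for a quick sanity check before committing to the full pass over the train set (a sketch; the sample size of 1000 files is arbitrary):

    from itertools import islice

    # Build a DataFrame from only the first 1000 files the iterator yields.
    sample_df = pd.DataFrame(islice(data_iterator(path), 1000))
    sample_df.set_index('track_id', inplace=True)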


Source: https://habr.com/ru/post/1262752/

