How to efficiently create a sparse adjacency matrix from an adjacency list?

I am working with the last.fm dataset from the Million Song Dataset. The data is available as a set of JSON-encoded text files, each containing the keys track_id, artist, title, timestamp, similars and tags.

Using the similars and track_id fields, I'm trying to create a sparse adjacency matrix so that I can perform further tasks with the dataset. Below is my attempt. However, it is very slow (especially the to_sparse call, the opening and loading of all the JSON files, and slowest of all the apply function I came up with, even after several improvements). I am new to pandas, and this is already improved over my first attempt, but I'm sure some vectorization or other approach would significantly increase the speed and efficiency.

    import os
    import json
    import pandas as pd
    import numpy as np

    # Path to the dataset
    path = "../lastfm_subset/"

    # Getting list of all json files in the dataset
    all_files = [os.path.join(root, file)
                 for root, dirs, files in os.walk(path)
                 for file in files if file.endswith('.json')]

    data_list = [json.load(open(file)) for file in all_files]

    df = pd.DataFrame(data_list, columns=['similars', 'track_id'])
    df.set_index('track_id', inplace=True)

    # Dense zero matrix over all tracks, converted to a sparse DataFrame
    a = pd.DataFrame(0, columns=df.index, index=df.index).to_sparse()

    # NOTE: threshold (the similarity cutoff) is assumed to be defined earlier
    def make_graph(adjacent):
        importance = 1 / len(adjacent['similars'])
        # Keep only neighbours whose similarity score exceeds the threshold
        neighbors = list(filter(lambda x: x[1] > threshold, adjacent['similars']))
        if len(neighbors) == 0:
            return
        t_id, similarity_score = map(list, zip(*neighbors))
        a.loc[list(t_id), adjacent['track_id']] = importance

    df[(df['similars'].str.len() > 0)].reset_index()[['track_id', 'similars']].apply(make_graph, axis=1)

I also believe that the way the dataset is read in can be significantly improved and written more cleanly.

So, I just need to read the data and then efficiently build a weighted adjacency matrix from the adjacency list.

The similars key holds a list of lists. Each inner list is a 1x2 pair containing the track_id of a similar song and its similarity score.
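
To illustrate (the values below are invented, only the structure matters), each file parses into a dict roughly like this:

    # Hypothetical record, just to show the structure described above
    record = {
        "track_id": "TR_A",                # ID of this track
        "artist": "...", "title": "...",   # metadata I don't use here
        "timestamp": "...",
        "similars": [["TR_B", 0.92],       # [track_id of similar song, score]
                     ["TR_C", 0.31]],
        "tags": [],
    }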

Since I am new to this topic, I am open to tips, suggestions, and best practices for any part of tasks like this.

UPDATE 1

After incorporating suggestions from the comments, here is a slightly improved version, although it is still far from an acceptable speed. The good part is that the apply function is now fast enough. However, the list comprehension that opens and loads the JSON files to build data_list is very slow. Moreover, to_sparse runs forever, so I have proceeded without creating a sparse matrix.

    import os
    import json
    import pandas as pd
    import numpy as np

    # Path to the dataset
    path = "../lastfm_subset/"

    # Getting list of all json files in the dataset
    all_files = [os.path.join(root, file)
                 for root, dirs, files in os.walk(path)
                 for file in files if file.endswith('.json')]

    data_list = [json.load(open(file)) for file in all_files]

    df = pd.DataFrame(data_list, columns=['similars', 'track_id'])
    df.set_index('track_id', inplace=True)

    # Update 1: precompute each track's importance (1 / number of similars)
    df.loc[(df['similars'].str.len() > 0), 'importance'] = 1 / df['similars'].str.len()

    # Initialise the adjacency matrix with zeros (to_sparse left out for now)
    a = pd.DataFrame(0, columns=df.index, index=df.index)  # .to_sparse(fill_value=0)

    # NOTE: threshold (the similarity cutoff) is assumed to be defined earlier
    def make_graph(row):
        neighbors = list(filter(lambda x: x[1] > threshold, row['similars']))
        if len(neighbors) == 0:
            return
        t_id, similarity_score = map(list, zip(*neighbors))
        a.loc[list(t_id), row['track_id']] = row['importance']

    df[(df['similars'].str.len() > 0)].reset_index()[['track_id', 'similars', 'importance']].apply(make_graph, axis=1)
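
For reference, here is a minimal sketch of how I imagine the same matrix could be built directly with scipy.sparse's coo_matrix instead of a dense DataFrame plus to_sparse. The threshold value and the mapping of track IDs to integer positions are my own assumptions, and I have not benchmarked it:

    from scipy.sparse import coo_matrix

    threshold = 0.5  # placeholder similarity cutoff, not fixed in my code above

    # Map every track_id in the subset to an integer row/column index
    idx = {tid: i for i, tid in enumerate(df.index)}

    rows, cols, vals = [], [], []
    for track_id, similars in df['similars'].items():
        if not similars:
            continue
        importance = 1 / len(similars)
        for neighbor_id, score in similars:
            # Keep only neighbours above the threshold that exist in the subset
            if score > threshold and neighbor_id in idx:
                rows.append(idx[neighbor_id])
                cols.append(idx[track_id])
                vals.append(importance)

    # Sparse adjacency matrix: entry (neighbour, track) = importance of track
    adjacency = coo_matrix((vals, (rows, cols)), shape=(len(idx), len(idx))).tocsr()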

Update 2

Using a generator expression instead of a list comprehension:

    data_list = (json.load(open(file)) for file in all_files)
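
Building on that, a small sketch of how I feed such a generator into pandas while keeping only the two fields I actually use; iter_records is a hypothetical helper of mine, not part of the dataset's tooling:

    def iter_records(files):
        # Stream the files one at a time, closing each handle properly,
        # and keep only the fields needed for the adjacency matrix
        for file in files:
            with open(file) as fh:
                record = json.load(fh)
            yield {'track_id': record['track_id'], 'similars': record['similars']}

    df = pd.DataFrame(iter_records(all_files))
    df.set_index('track_id', inplace=True)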

I also use ujson to speed up parsing of the JSON files, which, as can be seen here and here, makes an obvious difference:

    try:
        import ujson as json
    except ImportError:
        try:
            import simplejson as json
        except ImportError:
            import json