How to efficiently create a sparse adjacency matrix from an adjacency list?

I am working with the last.fm dataset from the Million Song Dataset. The data is available as a set of JSON-encoded text files, each containing the keys track_id, artist, title, timestamp, similars and tags.

Using the similars and track_id fields, I'm trying to create a sparse adjacency matrix so that I can perform further tasks with the dataset. Below is my attempt. However, it is very slow (especially the to_sparse call, the opening and loading of all the JSON files, and slowest of all the apply function I came up with, even after several improvements). I am new to pandas, and this is already improved over my first attempt, but I'm sure some vectorization or other approach would significantly increase the speed and efficiency.

    import os
    import json
    import pandas as pd
    import numpy as np

    # Path to the dataset
    path = "../lastfm_subset/"

    # Getting list of all json files in the dataset
    all_files = [os.path.join(root, file)
                 for root, dirs, files in os.walk(path)
                 for file in files if file.endswith('.json')]

    data_list = [json.load(open(file)) for file in all_files]

    df = pd.DataFrame(data_list, columns=['similars', 'track_id'])
    df.set_index('track_id', inplace=True)

    # Dense zero matrix over all tracks, converted to a sparse DataFrame
    a = pd.DataFrame(0, columns=df.index, index=df.index).to_sparse()

    # NOTE: threshold (the similarity cutoff) is assumed to be defined earlier
    def make_graph(adjacent):
        importance = 1 / len(adjacent['similars'])
        # Keep only neighbours whose similarity score exceeds the threshold
        neighbors = list(filter(lambda x: x[1] > threshold, adjacent['similars']))
        if len(neighbors) == 0:
            return
        t_id, similarity_score = map(list, zip(*neighbors))
        a.loc[list(t_id), adjacent['track_id']] = importance

    df[(df['similars'].str.len() > 0)].reset_index()[['track_id', 'similars']].apply(make_graph, axis=1)

I also believe that the way the dataset is read in can be significantly improved and written more cleanly.

So, I just need to read the data and then efficiently build a weighted adjacency matrix from the adjacency list.

The similars key holds a list of lists. Each inner list is a 1x2 pair containing the track_id of a similar song and its similarity score.
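
To illustrate (the values below are invented, only the structure matters), each file parses into a dict roughly like this:

    # Hypothetical record, just to show the structure described above
    record = {
        "track_id": "TR_A",                # ID of this track
        "artist": "...", "title": "...",   # metadata I don't use here
        "timestamp": "...",
        "similars": [["TR_B", 0.92],       # [track_id of similar song, score]
                     ["TR_C", 0.31]],
        "tags": [],
    }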

Since I am new to this topic, I am open to tips, suggestions, and best practices for any part of tasks like this.

UPDATE 1

After incorporating suggestions from the comments, here is a slightly improved version, although it is still far from an acceptable speed. The good part is that the apply function is now fast enough. However, the list comprehension that opens and loads the JSON files to build data_list is very slow. Moreover, to_sparse runs forever, so I have proceeded without creating a sparse matrix.

    import os
    import json
    import pandas as pd
    import numpy as np

    # Path to the dataset
    path = "../lastfm_subset/"

    # Getting list of all json files in the dataset
    all_files = [os.path.join(root, file)
                 for root, dirs, files in os.walk(path)
                 for file in files if file.endswith('.json')]

    data_list = [json.load(open(file)) for file in all_files]

    df = pd.DataFrame(data_list, columns=['similars', 'track_id'])
    df.set_index('track_id', inplace=True)

    # Update 1: precompute each track's importance (1 / number of similars)
    df.loc[(df['similars'].str.len() > 0), 'importance'] = 1 / df['similars'].str.len()

    # Initialise the adjacency matrix with zeros (to_sparse left out for now)
    a = pd.DataFrame(0, columns=df.index, index=df.index)  # .to_sparse(fill_value=0)

    # NOTE: threshold (the similarity cutoff) is assumed to be defined earlier
    def make_graph(row):
        neighbors = list(filter(lambda x: x[1] > threshold, row['similars']))
        if len(neighbors) == 0:
            return
        t_id, similarity_score = map(list, zip(*neighbors))
        a.loc[list(t_id), row['track_id']] = row['importance']

    df[(df['similars'].str.len() > 0)].reset_index()[['track_id', 'similars', 'importance']].apply(make_graph, axis=1)
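
For reference, here is a minimal sketch of how I imagine the same matrix could be built directly with scipy.sparse's coo_matrix instead of a dense DataFrame plus to_sparse. The threshold value and the mapping of track IDs to integer positions are my own assumptions, and I have not benchmarked it:

    from scipy.sparse import coo_matrix

    threshold = 0.5  # placeholder similarity cutoff, not fixed in my code above

    # Map every track_id in the subset to an integer row/column index
    idx = {tid: i for i, tid in enumerate(df.index)}

    rows, cols, vals = [], [], []
    for track_id, similars in df['similars'].items():
        if not similars:
            continue
        importance = 1 / len(similars)
        for neighbor_id, score in similars:
            # Keep only neighbours above the threshold that exist in the subset
            if score > threshold and neighbor_id in idx:
                rows.append(idx[neighbor_id])
                cols.append(idx[track_id])
                vals.append(importance)

    # Sparse adjacency matrix: entry (neighbour, track) = importance of track
    adjacency = coo_matrix((vals, (rows, cols)), shape=(len(idx), len(idx))).tocsr()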

Update 2

Using a generator expression instead of a list comprehension:

    data_list = (json.load(open(file)) for file in all_files)
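
Building on that, a small sketch of how I feed such a generator into pandas while keeping only the two fields I actually use; iter_records is a hypothetical helper of mine, not part of the dataset's tooling:

    def iter_records(files):
        # Stream the files one at a time, closing each handle properly,
        # and keep only the fields needed for the adjacency matrix
        for file in files:
            with open(file) as fh:
                record = json.load(fh)
            yield {'track_id': record['track_id'], 'similars': record['similars']}

    df = pd.DataFrame(iter_records(all_files))
    df.set_index('track_id', inplace=True)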

I also use ujson to speed up parsing of the JSON files, which, as can be seen here and here, makes an obvious difference:

    try:
        import ujson as json
    except ImportError:
        try:
            import simplejson as json
        except ImportError:
            import json