Converting a Python sparse matrix dict to a SciPy sparse matrix

I use python scikit-learn to cluster documents, and I have a sparse matrix stored in a dict object:

For instance:

    doc_term_dict = {
        ('d1','t1'): 12,
        ('d2','t3'): 10,
        ('d3','t2'): 5
    }  # from mysql data table
    <type 'dict'>

I want to use scikit-learn for clustering, where the input matrix type is scipy.sparse.csr.csr_matrix

Example:

    (0, 2164)    0.245793088885
    (0, 2076)    0.205702177467
    (0, 2037)    0.193810934784
    (0, 2005)    0.14547028437
    (0, 1953)    0.153720023365
    ...
    <class 'scipy.sparse.csr.csr_matrix'>

I cannot find a way to convert the dict to this CSR matrix (I have never used SciPy).

3 answers

Pretty simple. First read the dictionary and convert the keys to the corresponding rows and columns. Scipy supports (and recommends for this purpose) the COO-rdinate format for sparse matrices.

Pass data, row and column, where A[row[k], column[k]] = data[k] (for all k) defines the matrix. Then let Scipy do the conversion to CSR.
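To make the triplet idea concrete, here is a minimal sketch using the three entries from the question, already converted to 0-based indices. The explicit `shape=(3, 3)` is my assumption; if you leave it out, Scipy infers the shape from the largest indices.

```python
from scipy.sparse import coo_matrix

# One entry per stored value: A[row[k], col[k]] = data[k]
data = [12, 10, 5]
row = [0, 1, 2]
col = [0, 2, 1]

# Build the COO matrix, then let Scipy convert it to CSR.
coo = coo_matrix((data, (row, col)), shape=(3, 3))
csr = coo.tocsr()

print(csr.toarray().tolist())  # [[12, 0, 0], [0, 0, 10], [0, 5, 0]]
```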

Please check that I have the rows and columns the way you want them; I may have swapped them. I also assumed that the input is 1-indexed.

My code below prints:

    (0, 0)    12
    (1, 2)    10
    (2, 1)    5

the code:

    #!/usr/bin/env python3
    # http://stackoverflow.com/questions/26335059/converting-python-sparse-matrix-dict-to-scipy-sparse-matrix

    from scipy.sparse import csr_matrix, coo_matrix

    def convert(term_dict):
        ''' Convert a dictionary with elements of form ('d1', 't1'): 12 to a CSR type matrix.
        The element ('d1', 't1'): 12 becomes entry (0, 0) = 12.
        * Conversion from 1-indexed to 0-indexed.
        * d is row
        * t is column.
        '''
        # Create the appropriate format for the COO format.
        data = []
        row = []
        col = []
        for k, v in term_dict.items():
            r = int(k[0][1:])
            c = int(k[1][1:])
            data.append(v)
            row.append(r-1)
            col.append(c-1)
        # Create the COO-matrix
        coo = coo_matrix((data, (row, col)))
        # Let Scipy convert COO to CSR format and return
        return csr_matrix(coo)

    if __name__ == '__main__':
        doc_term_dict = {
            ('d1','t1'): 12,
            ('d2','t3'): 10,
            ('d3','t2'): 5
        }
        print(convert(doc_term_dict))
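The code above parses the number out of labels such as 'd1' and 't3'. If your document and term labels are arbitrary strings, one possible variation is to assign indices from the sorted unique labels instead. This is only a sketch: the function name `dict_to_csr_with_labels` and the choice to return the label lists alongside the matrix are my own, not part of Scipy. Note that sorting is lexicographic, so 'd10' comes before 'd2'.

```python
from scipy.sparse import coo_matrix

def dict_to_csr_with_labels(term_dict):
    """Hypothetical helper: map arbitrary (doc, term) labels to indices.

    Returns the CSR matrix plus the row and column labels, so that an
    index can be translated back to its original label.
    """
    # Sorted unique labels define the row and column order.
    docs = sorted({d for d, _ in term_dict})
    terms = sorted({t for _, t in term_dict})
    doc_index = {d: i for i, d in enumerate(docs)}
    term_index = {t: j for j, t in enumerate(terms)}

    # Keys and values of a dict iterate in the same order.
    rows = [doc_index[d] for d, _ in term_dict]
    cols = [term_index[t] for _, t in term_dict]
    data = list(term_dict.values())

    shape = (len(docs), len(terms))
    matrix = coo_matrix((data, (rows, cols)), shape=shape).tocsr()
    return matrix, docs, terms

doc_term_dict = {('d1', 't1'): 12, ('d2', 't3'): 10, ('d3', 't2'): 5}
matrix, docs, terms = dict_to_csr_with_labels(doc_term_dict)
print(docs, terms)
print(matrix.toarray().tolist())  # [[12, 0, 0], [0, 0, 10], [0, 5, 0]]
```

Row i of the result corresponds to docs[i] and column j to terms[j].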

We can make @Unapiedra's (excellent) answer a little more sparse:

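    from itertools import repeat

    import numpy as np
    from scipy.sparse import csr_matrix

    def _dict_to_csr(term_dict):
        # Python 2 code; assumes the keys are (row, col) tuples of integers.
        term_dict_v = list(term_dict.itervalues())
        term_dict_k = list(term_dict.iterkeys())
        # Square shape, large enough for the largest index on either axis.
        shape = list(repeat(np.asarray(term_dict_k).max() + 1, 2))
        csr = csr_matrix((term_dict_v, zip(*term_dict_k)), shape=shape)
        return csr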

Same as @carsonc, but for Python 3.X:

    from scipy.sparse import csr_matrix

    def _dict_to_csr(term_dict):
        # Assumes the keys are (row, col) tuples of non-negative integers.
        term_dict_v = list(term_dict.values())
        rows, cols = map(list, zip(*term_dict.keys()))
        # The shape must cover the largest index on each axis,
        # not the number of stored entries.
        shape = (max(rows) + 1, max(cols) + 1)
        csr = csr_matrix((term_dict_v, (rows, cols)), shape=shape)
        return csr

Source: https://habr.com/ru/post/1204544/

