How can I handle huge matrices?

I am performing topic discovery with supervised learning. However, my matrices are very large (202180 x 15000), and they do not fit in memory for most of the models I want to use. Most of the matrix consists of zeros; only logistic regression works. Is there a way to keep working with the same data but make it usable by the other models? How should I create my matrices differently?

Here is my code:

    import numpy as np
    import subprocess
    from sklearn.linear_model import SGDClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn import metrics

    def run(command):
        # Run a shell command and return its output as a string
        # (check_output returns bytes in Python 3).
        output = subprocess.check_output(command, shell=True)
        return output.decode()

Load the dictionary

    f = open('/Users/win/Documents/wholedata/RightVo.txt', 'r')
    vocab_temp = f.read().split()
    f.close()
    col = len(vocab_temp)
    print("Training column size:")
    print(col)

Create the training matrix

    row = int(run('cat /Users/win/Documents/wholedata/X_tr.txt | wc -l').split()[0])
    print("Training row size:")
    print(row)
    matrix_tmp = np.zeros((row, col), dtype=np.int64)
    print("Train matrix size:")
    print(matrix_tmp.size)
    label_tmp = np.zeros(row, dtype=np.int64)

    # Map each word to its column index once; calling list.index() inside
    # the loop would rescan the whole vocabulary for every token.
    word_index = {word: i for i, word in enumerate(vocab_temp)}

    f = open('/Users/win/Documents/wholedata/X_tr.txt', 'r')
    count = 0
    for line in f:
        for word in line.split():
            if word in word_index:
                matrix_tmp[count, word_index[word]] = 1
        count += 1
    f.close()
    print("Train matrix is:\n")
    print(matrix_tmp)
    print(label_tmp)
    print("Train label size:")
    print(len(label_tmp))

    # Re-read the vocabulary so the test matrix uses the same columns.
    f = open('/Users/win/Documents/wholedata/RightVo.txt', 'r')
    vocab_tmp = f.read().split()
    f.close()
    col = len(vocab_tmp)
    print("Test column size:")
    print(col)

Make a test matrix

    row = int(run('cat /Users/win/Documents/wholedata/X_te.txt | wc -l').split()[0])
    print("Test row size:")
    print(row)
    matrix_tmp_test = np.zeros((row, col), dtype=np.int64)
    print("Test matrix size:")
    print(matrix_tmp_test.size)
    label_tmp_test = np.zeros(row, dtype=np.int64)

    word_index_test = {word: i for i, word in enumerate(vocab_tmp)}

    f = open('/Users/win/Documents/wholedata/X_te.txt', 'r')
    count = 0
    for line in f:
        for word in line.split():
            if word in word_index_test:
                matrix_tmp_test[count, word_index_test[word]] = 1
        count += 1
    f.close()
    print("Test matrix is:\n")
    print(matrix_tmp_test)
    print(label_tmp_test)
    print("Test label size:")
    print(len(label_tmp_test))

    # Load the labels: Y_te.txt holds the test labels and Y_tr.txt the
    # training labels (the xtrain/ytrain names are kept from the original).
    xtrain = []
    with open("/Users/win/Documents/wholedata/Y_te.txt") as filer:
        for line in filer:
            xtrain.append(line.strip().split())
    label_tmp_test = np.ravel(xtrain)

    ytrain = []
    with open("/Users/win/Documents/wholedata/Y_tr.txt") as filer:
        for line in filer:
            ytrain.append(line.strip().split())
    label_tmp = np.ravel(ytrain)

Train and evaluate the model

    model = LogisticRegression()
    model = model.fit(matrix_tmp, label_tmp)
    #print(model)
    print("Entered 1")
    y_train_pred = model.predict(matrix_tmp_test)
    print("Entered 2")
    print(metrics.accuracy_score(label_tmp_test, y_train_pred))
1 answer

You can use the dedicated data structure available in the scipy package called a sparse matrix: http://docs.scipy.org/doc/scipy/reference/sparse.html

A sparse matrix stores only the nonzero entries, so a mostly-zero 202180 x 15000 matrix fits comfortably in memory, and many scikit-learn estimators accept it directly.
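For example, here is a minimal sketch of how the training matrix from the question could be built sparsely, replacing the dense np.zeros array with scipy.sparse.lil_matrix. It reuses vocab_temp, row, label_tmp, and the file path from the question's code, and SGDClassifier (already imported there) as the model; treat it as an illustration rather than a drop-in replacement:

    import numpy as np
    from scipy.sparse import lil_matrix
    from sklearn.linear_model import SGDClassifier

    # Assumes vocab_temp, row, and label_tmp from the question's code.
    # Map each word to its column once instead of calling list.index().
    word_index = {word: i for i, word in enumerate(vocab_temp)}

    # lil_matrix supports cheap incremental element assignment while building.
    matrix_tmp = lil_matrix((row, len(vocab_temp)), dtype=np.int64)

    with open('/Users/win/Documents/wholedata/X_tr.txt') as f:
        for count, line in enumerate(f):
            for word in line.split():
                if word in word_index:
                    matrix_tmp[count, word_index[word]] = 1

    # Convert to CSR before training: fast row access, and scikit-learn's
    # linear models accept CSR input without densifying it.
    matrix_tmp = matrix_tmp.tocsr()

    model = SGDClassifier()
    model.fit(matrix_tmp, label_tmp)

Build the test matrix the same way; LogisticRegression also accepts sparse input, so the rest of the code can stay as it is.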

