Storing a lot of boolean data in python

I need to store sparse matrix data. The data size is 10^6 10^4 . In each column, I store the vector 0, except for a few values, where it is true .

Then I need to sum over the columns in each matrix and multiply each row with a scalar. I tried dictionaries, but they fail when I need to add and multiply.

What would you use?

PS. numpy.zeros is too small

+4
source share
3 answers

How about two dictations? Assuming this is a matrix ( x for True ):

  0 1 2 3 4 5 6 7 0 xxx 1 x 2 x 3 x 4 5 6 xx 7 

You will need to store

 rows = {0: [0, 2, 5], 1: [1], 2: [7], 3: [4], 6: [2, 5]} 

You can easily convert this to

 columns = {0: [0], 1: [1], 2: [0, 6], 4: [3], 5: [0, 6], 7: [2]} 

using something like

 columns = {} for row in rows: for column in rows[row]: columns.setdefault(column, []).append(row) 

and then summarize by columns ( sum(1 for x in column[2]) ) or by rows and multiply the result by what you want.

+4
source

As already mentioned, you should look at scipy.sparse :

http://docs.scipy.org/doc/scipy/reference/sparse.html

There are many different formats that are optimized for various sparse operations, including scalar multiplication and summation.

For instance:

 import scipy.sparse import numpy as np rows = np.array([1,100,1000]) cols = np.array([100,99,1474]) vals = np.ones_like(rows) A = scipy.sparse.coo_matrix((vals,(rows,cols)),shape=(int(1E6),int(1E6)),dtype=np.bool) 

Then multiply by scalar and take the sum:

 B = 3*A B.sum() # 9 
+2
source

There are literally hundreds of methods for this, depending on your needs. The rare Wikipedia matrix is a good start to understand a method that is specific to your needs.

As an extremely simple example, you can use the Dictionary of Keys class as follows:

 class SparseDOK(dict): def __init__(self): pass def __setitem__(self,key,value): if value in[0,0.0,False,None]: dict.__setitem__(self,key,False) dict.__delitem__(self,key) else: dict.__setitem__(self,key,True) def __getitem__(self, key): try: return dict.__getitem__(self, key) except KeyError: return False >>> dok=SparseDOK() >>> dok[10,20]=55 >>> print dok {(10, 20): True} >>> print dok[10,20] True >>> print dok[55,300] False >>> dok[10,20]=False >>> print dok[10,20] False 

Each entry in an arbitrary β€œmatrix” is considered False, unless True is set. You will need to add error checking, but it will be very compact and fast.

The advantage of constructing a key dictionary is a very efficient construction of the data structure. You only need to look at the source data once, and you can easily add or delete data. The disadvantage is less interactive processing of the matrix after its creation.

Since dictionary keys are tuples, it is trivial to add indexes by row or column. Since to complete this work, we need the entire matrix to build, we can simply build a dict with any sum or desired product once, and then turn to this type of processed data.

 >>> dok[10,20]=True >>> dok[10,2000]=True >>> dok[11,2000]=True >>> dok[35000,2000]=True >>> dok[10,35000]=True >>> print dok {(11, 2000): True, (10, 2000): True, (35000, 2000): True, (10, 20): True, (10, 35000): True} cols={} for tup in dok.keys(): if tup[1] not in cols: cols[tup[1]]=1 else: cols[tup[1]]+=1 >>> print cols {2000: 3, 35000: 1, 20: 1} 

Now you can refer to the col key in cols for the sum of the rows by col. It is trivial to add a product, etc. Just remember that you need to recalculate the amounts / products if the original DOK is edited or modified. You can save the total amount if you expect the DOK to change often after it is created.

If your needs are more complex, consider using SciPy or Pysparse . As you can see, SciPy has 7 different sparse matrix formats. Do not invent something that others have already done better ...

+1
source

Source: https://habr.com/ru/post/1395951/


All Articles