A quick way to get the number of common observations behind each entry of a Python pandas correlation matrix

Suppose I have a pandas.DataFrame called df. The columns of df represent different individuals and the index axis represents time, so the (i, j) entry is individual j's observation for time period i. We can assume that all data are of type float, possibly with NaN values.

In my case, I have about 14,000 columns and several hundred rows.

df.corr() returns a 14,000-by-14,000 correlation matrix, and its time performance is fine for my application.

But I would also like to know, for each pair of individuals (j_1, j_2), how many non-null observations went into calculating the correlation, so that I can identify the correlation cells that suffer from poor data coverage.
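For context, here is a minimal sketch on made-up data of what "poor data coverage" means here. It relies on the fact that `DataFrame.corr` computes each cell only from the rows where both columns are non-null, and that its `min_periods` argument blanks out cells backed by too few shared rows. The toy frame and the threshold of 3 are assumptions for illustration; `min_periods` hides weak cells but does not report the counts themselves.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows are time periods, columns are individuals.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, np.nan, 6.0, 8.0, 11.0],
    "c": [np.nan, np.nan, 1.0, np.nan, 2.0],
})

# Each cell uses only the rows where both columns are non-null.
# The pair (a, c) shares just two rows, so its correlation is poorly supported.
corr_all = df.corr()

# Require at least 3 shared rows; cells with fewer become NaN.
corr_min3 = df.corr(min_periods=3)

print(corr_all)
print(corr_min3)
```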

The best I could come up with is the following:

    not_null_locations = pandas.notnull(df).values.astype(int)
    common_obs = pandas.DataFrame(not_null_locations.T.dot(not_null_locations),
                                  columns=df.columns, index=df.columns)

The memory footprint and speed of this approach are becoming a bit problematic.

Is there a faster way to get the common observation counts with pandas?

2 answers

You can do this, but you will need to cythonize it (otherwise it is much slower); the memory footprint should be better, though. Note that this counts NaN observations, whereas yours counts valid observations, but it is easy to convert (build the mask with pd.notnull instead of pd.isnull).

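    import numpy as np
    import pandas as pd

    l = len(df.columns)
    results = np.zeros((l, l))
    mask = pd.isnull(df)
    # Index the mask by column label (ac, bc), not by the loop position
    for i, ac in enumerate(df):
        for j, bc in enumerate(df):
            results[j, i] = (mask[ac] & mask[bc]).sum()
    results = pd.DataFrame(results, index=df.columns, columns=df.columns)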
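The conversion to valid-observation counts mentioned above can be sketched as follows, assuming a small made-up frame. It uses the same double loop with a `pd.notnull` mask, then cross-checks the result against the matrix-product formulation from the question. The names `valid`, `counts` and `expected` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: rows are time periods, columns are individuals.
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [np.nan, 2.0, 3.0, 4.0],
    "c": [1.0, np.nan, np.nan, 4.0],
})

valid = pd.notnull(df)  # True where a value is present
n = len(df.columns)
counts = np.zeros((n, n), dtype=np.int64)
for i, ac in enumerate(df.columns):
    for j, bc in enumerate(df.columns):
        # Rows in which both individuals have an observation
        counts[i, j] = (valid[ac] & valid[bc]).sum()
common_obs = pd.DataFrame(counts, index=df.columns, columns=df.columns)

# Cross-check against the matrix product used in the question.
v = valid.values.astype(int)
expected = v.T.dot(v)
print(common_obs)
```

The diagonal holds each individual's own number of valid observations, which is a quick sanity check on the output.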

In fact, you can make @Jeff's answer a little faster by iterating only up to (but not including) i + 1 in the nested loop; since the correlation matrix is symmetric, you can assign both values at the same time. You can also move the access to mask[i] outside of the nested loop, which is a small optimization but can yield some performance improvement for very large frames.

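    import numpy as np
    import pandas as pd

    l = len(df.columns)
    results = np.zeros((l, l))
    # Transposed boolean array: mask[i] holds the NaN flags of column i
    mask = pd.isnull(df).values.T
    for i in range(l):
        maski = mask[i]
        for j in range(i + 1):
            results[i, j] = results[j, i] = (maski & mask[j]).sum()
    results = pd.DataFrame(results, index=df.columns, columns=df.columns)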
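As a further sketch that comes from neither answer: if the Python-level loops are too slow without Cython, one option is to keep the question's matrix product but cast the mask to float32 instead of int. NumPy typically hands floating-point matrix products to an optimized BLAS routine, which it does not do for integer arrays, and float32 represents integer counts exactly up to 2**24, far above a few hundred rows. The helper name `common_obs_blas` is hypothetical, and the actual speed-up depends on the NumPy build.

```python
import numpy as np
import pandas as pd

def common_obs_blas(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: pairwise non-null counts via a float32 matrix product."""
    # 1.0 where a value is present, 0.0 where it is NaN
    valid = df.notna().to_numpy(dtype=np.float32)
    # Entry (j_1, j_2) is the number of rows where both columns are non-null
    counts = valid.T @ valid
    return pd.DataFrame(counts.astype(np.int32),
                        index=df.columns, columns=df.columns)

# Usage on made-up data with roughly 30% missing values
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 6))
data[rng.random(size=data.shape) < 0.3] = np.nan
toy = pd.DataFrame(data, columns=list("abcdef"))
result = common_obs_blas(toy)
print(result)
```

The result is still a dense square matrix, so at 14,000 columns it needs several hundred megabytes in either float32 or int32; skipping the final `astype` avoids holding both copies at once.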

Source: https://habr.com/ru/post/951727/

