Fast way to see common observation counts for Python pandas correlation matrix entries

Question

Suppose I have a pandas.DataFrame called df. The columns of df represent different individuals and the index axis represents time, so the (i, j) entry is individual j's observation for time period i. We can assume all data are of type float, possibly with NaN values.

In my case, I have about 14,000 columns and several hundred rows.

df.corr() will give me back the 14,000-by-14,000 correlation matrix, and its time performance is fine for my application.

But I would also like to know, for each pair of individuals (j_1, j_2), how many non-null observations went into the calculation of the correlation, so I can flag the correlation cells that suffer from poor data coverage.

The best I could come up with is the following:

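    import pandas

    # 1 where an observation is present, 0 where it is NaN
    not_null_locations = pandas.notnull(df).values.astype(int)
    # Entry (j_1, j_2) counts the rows where both columns are present
    common_obs = pandas.DataFrame(not_null_locations.T.dot(not_null_locations),
                                  columns=df.columns, index=df.columns)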

The memory usage and speed of this are starting to become a bit problematic.

Is there a faster way to get the common observation counts with pandas?
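For reference, the matrix-product approach above can be checked on a toy frame. This is a minimal sketch, not a benchmark: the helper name `common_obs_counts` and the sample data are hypothetical, and casting the mask to float32 rather than int is an assumption, made so that NumPy can pass the product to its BLAS backend. The counts stay exact as long as the number of rows is below 2**24.

```python
import numpy as np
import pandas as pd


def common_obs_counts(df):
    """Count, for each pair of columns, the rows where both are non-null."""
    # Assumption: float32 instead of int, so the product can use BLAS.
    valid = df.notna().to_numpy(dtype=np.float32)
    counts = valid.T @ valid
    # int32 halves the size of the result compared with int64.
    return pd.DataFrame(counts.astype(np.int32),
                        index=df.columns, columns=df.columns)


# Hypothetical toy data: rows are time periods, columns are individuals.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, 2.0, 3.0, 4.0],
    "c": [1.0, 2.0, np.nan, np.nan],
})
common_obs = common_obs_counts(df)
print(common_obs)
```

The diagonal holds each column's own count of valid observations, and the table is symmetric, so only one triangle carries new information.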

+6

python numpy pandas missing-data

ely Aug 14 '13 at 14:00

2 answers

You can actually make @Jeff's answer a little faster by iterating only up to (but not including) i + 1 in the nested loop; because correlation is symmetric, you can assign both entries at the same time. You can also move the lookup of column i of the mask (maski in the code) out of the inner loop, which is a tiny optimization but might yield some performance gains for very large frames.

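    import numpy as np
    import pandas as pd

    l = len(df.columns)
    results = np.zeros((l, l))
    # NaN mask as a plain boolean array, so columns are selected by position
    mask = pd.isnull(df).values
    for i in range(l):
        maski = mask[:, i]
        # Fill the lower triangle and mirror it, since the result is symmetric
        for j in range(i + 1):
            results[i, j] = results[j, i] = (maski & mask[:, j]).sum()
    results = pd.DataFrame(results, index=df.columns, columns=df.columns)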
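The same idea can be pushed one step further by replacing the inner loop with a single vectorized reduction per column, so the Python-level loop runs once per column instead of once per pair. The following is a sketch under stated assumptions and not part of the original answer: the function name `pairwise_valid_counts` and the random sample data are hypothetical, and it uses `notna` so that it counts valid (non-null) observations rather than NaNs.

```python
import numpy as np
import pandas as pd


def pairwise_valid_counts(df):
    """Count rows where both columns of each pair are non-null."""
    valid = df.notna().to_numpy()              # bool array, (n_rows, n_cols)
    n_cols = valid.shape[1]
    counts = np.empty((n_cols, n_cols), dtype=np.int32)
    for i in range(n_cols):
        # Keep only the rows where column i is valid, then count the
        # valid entries of every column within those rows.
        counts[i] = valid[valid[:, i]].sum(axis=0)
    return pd.DataFrame(counts, index=df.columns, columns=df.columns)


# Hypothetical example data with roughly 30% missing values.
rng = np.random.default_rng(0)
values = rng.normal(size=(50, 8))
values[rng.random(values.shape) < 0.3] = np.nan
df = pd.DataFrame(values, columns=list("abcdefgh"))

counts = pairwise_valid_counts(df)
```

Each iteration copies at most one mask-sized boolean slice, so the extra memory beyond the result matrix stays small. Whether this beats the matrix product on a 14,000-column frame would need to be measured.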

+3

Phillip cloud Aug 14 '13 at 15:02

Jeff · Accepted Answer · 2013-08-14T14:44:56+0000

You can do this, but you would need to cythonize it (otherwise it is much slower); the memory footprint should be better, however. Note that this gives the number of NaN observations, whereas yours gives the number of valid observations, but it is easily converted by using pd.notnull in place of pd.isnull.

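    import numpy as np
    import pandas as pd

    l = len(df.columns)
    results = np.zeros((l, l))
    # NaN mask as a plain boolean array, so columns are selected by position
    mask = pd.isnull(df).values
    for i in range(l):
        for j in range(l):
            # Number of rows where columns i and j are both NaN
            results[j, i] = (mask[:, i] & mask[:, j]).sum()
    results = pd.DataFrame(results, index=df.columns, columns=df.columns)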
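Once the counts are expressed as valid (non-null) observations, they can be used to blank out the poorly covered cells of the correlation matrix. The sketch below is illustrative only: the threshold `min_obs` and the sample data are hypothetical. It also shows `DataFrame.corr(min_periods=...)`, which applies the same kind of cutoff while computing the correlation, although it does not return the counts themselves.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows are time periods, columns are individuals.
df = pd.DataFrame({
    "a": [1.0, 2.0, 4.0, 3.0, np.nan, 6.0],
    "b": [2.0, 1.0, np.nan, 5.0, 4.0, 7.0],
    "c": [np.nan, np.nan, 1.0, np.nan, 3.0, 2.0],
})

# Common valid-observation counts via the matrix product from the question.
valid = df.notna().to_numpy().astype(int)
counts = pd.DataFrame(valid.T.dot(valid), index=df.columns, columns=df.columns)

min_obs = 3  # hypothetical coverage threshold
corr = df.corr()
# Keep a correlation only where enough rows had both columns present.
well_covered = corr.where(counts >= min_obs)

# pandas can apply the same cutoff while computing the correlation.
same_cutoff = df.corr(min_periods=min_obs)
```

If only the filtered correlations are needed, `min_periods` avoids building the count table at all; the explicit counts are still useful when the coverage itself has to be reported.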
