Suppose I have a pandas.DataFrame called df. The columns of df represent different individuals and the index axis represents time, so entry (i, j) is the observation for individual j at time period i. All the data can be assumed to be of type float, possibly with NaN values.
In my case, I have about 14,000 columns and several hundred rows.
df.corr() returns a 14,000 by 14,000 correlation matrix, and its runtime is acceptable for my application.
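For concreteness, here is a minimal sketch of the setup on a small made-up frame (column names and data are placeholders, not my real data):

import numpy as np
import pandas as pd

# rows are time periods, columns are individuals, with some NaN gaps
df = pd.DataFrame(np.random.randn(500, 3), columns=["ind_a", "ind_b", "ind_c"])
df[df > 1.5] = np.nan  # introduce missing observations

corr = df.corr()  # pairwise correlations; NaNs are dropped pairwise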
But for each pair of individuals (j_1, j_2) I would also like to know how many non-null observations went into computing their correlation, so that I can flag the correlation cells that suffer from poor data coverage.
The best I could come up with is the following:
import pandas

# 1 where an observation is present, 0 where it is NaN
not_null_locations = pandas.notnull(df).values.astype(int)

# entry (j_1, j_2): number of time periods where both individuals have data
common_obs = pandas.DataFrame(not_null_locations.T.dot(not_null_locations),
                              columns=df.columns, index=df.columns)
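As a sanity check on a single pair (using the placeholder column names from above), the count should equal the number of rows where both columns are non-null:

# "ind_a" and "ind_b" are hypothetical column names for illustration
assert common_obs.loc["ind_a", "ind_b"] == len(df[["ind_a", "ind_b"]].dropna())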
The memory use and speed of this approach are becoming a bit problematic.
Is there a faster way to compute these pairwise observation counts with pandas?