Suppose I have the following data:
import pandas as pd
import numpy as np
import random
from string import ascii_uppercase
random.seed(100)
n = 1000000
data = {letter: [random.choice(list(ascii_uppercase) + [np.nan]) for _ in range(n)] for letter in ascii_uppercase}
df = pd.DataFrame(data)
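(For reference, this produces a 1,000,000 × 26 frame of object dtype in which every cell is either a single uppercase letter or NaN; the quick check below is only illustrative.)
# The frame is 1,000,000 rows x 26 object-dtype columns of letters and NaNs
print(df.shape)            # (1000000, 26)
print(df.dtypes.unique())  # [dtype('O')]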
I want to quickly calculate the global occurrence of each value in the set of all values in the data frame.
This works:
from collections import Counter
c = Counter([v for c in df for v in df[c].fillna(-999)])
But it is very slow:
%timeit Counter([v for c in df for v in df[c].fillna(-999)])
1 loop, best of 3: 4.12 s per loop
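(The slowness is not too surprising: the comprehension materialises all 26 million cells as individual Python objects, column by column, before Counter ever sees them. An equivalent way to write the same baseline, shown here only for clarity and not timed separately, flattens the frame once:)
# Same baseline, written as a single flatten-and-count (not benchmarked here)
c_flat = Counter(df.fillna(-999).values.ravel())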
I thought I could speed things up using some pandas power:
def quick_global_count(df, na_value=-999):
    df = df.fillna(na_value)
    # count values within each column, stack the per-column counts into one
    # Series, then sum the counts per value to get the global totals
    group_bys = {c: df.groupby(c).size() for c in df}
    stacked = pd.concat([v for k, v in group_bys.items()])
    global_counts = stacked.reset_index().groupby('index').sum()
    return global_counts
This is definitely faster (about 75% of the runtime of the previous approach), but there should be something faster still ...
%timeit quick_global_count(df)
10 loops, best of 3: 3.01 s per loop
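To make the intermediate steps of quick_global_count concrete, here is what the per-column sizes and the stacked sum look like on a tiny made-up frame (the names demo, per_column and stacked below are purely illustrative):
# Tiny illustration of the intermediates inside quick_global_count
demo = pd.DataFrame({'x': ['A', 'B', 'A'], 'y': ['B', 'B', 'C']})
per_column = {c: demo.groupby(c).size() for c in demo}  # value counts per column
stacked = pd.concat(list(per_column.values()))          # one Series indexed by value
print(stacked.reset_index().groupby('index').sum())     # A -> 2, B -> 3, C -> 1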
The results of the two approaches above are identical (after a slight modification of the result returned by quick_global_count):
dict(c) == quick_global_count(df).to_dict()[0]
True
What is a faster way to count the global occurrences of values in a DataFrame?
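For comparison, one direction that might be worth exploring (only a sketch, not benchmarked against the timings above) is to flatten the whole frame into a single Series and let value_counts do the counting:
# Sketch only (not benchmarked): flatten once, then count with value_counts;
# fillna keeps the -999 placeholder so NaNs are still counted
flat_counts = pd.Series(df.fillna(-999).values.ravel()).value_counts()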