A quick way to count occurrences of all values in a pandas DataFrame

Suppose I have the following data:

import pandas as pd
import numpy as np
import random
from string import ascii_uppercase

random.seed(100)

n = 1000000

# Create a bunch of factor data... throw some NaNs in there for good measure
data = {letter: [random.choice(list(ascii_uppercase) + [np.nan]) for _ in range(n)] for letter in ascii_uppercase}

df = pd.DataFrame(data)

I want to quickly count the global occurrences of each value across all values in the DataFrame.

This works:

from collections import Counter
c = Counter([v for c in df for v in df[c].fillna(-999)])

But it's very slow:

%timeit Counter([v for c in df for v in df[c].fillna(-999)])
1 loop, best of 3: 4.12 s per loop

I thought this function might speed things up by using some pandas power:

def quick_global_count(df, na_value=-999):
    df = df.fillna(na_value)
    # Get counts of each element for each column in the passed dataframe
    group_bys = {c: df.groupby(c).size() for c in df}
    # Stack each of the Series objects in `group_bys`... This is faster than reducing a bunch of dictionaries by keys
    stacked = pd.concat([v for k, v in group_bys.items()])
    # Call `reset_index()` to access the index column, which indicates the factor level for each column in dataframe
    # Then groupby and sum on that index to get global counts
    global_counts = stacked.reset_index().groupby('index').sum()
    return global_counts

This is definitely faster (about 75% of the runtime of the previous approach), but there must be something faster...

%timeit quick_global_count(df)
10 loops, best of 3: 3.01 s per loop

The results of the two approaches above are identical (with a slight modification of the result returned by quick_global_count):

dict(c) == quick_global_count(df).to_dict()[0]
True

What is a faster way to count global occurrences of values in a DataFrame?

+4
3 answers

Approach #1

The character data can be viewed as a NumPy uint8 array (and hence counted with efficient NumPy methods), which lets us use bincount -

a = df.fillna('[').values.astype(str).view(np.uint8)
count = np.bincount(a.ravel())[65:-1]

Note that np.bincount(a.ravel()) counts every byte, including the '[' used to fill NaNs; the slice [65:-1] keeps only the counts for 'A' through 'Z' (ASCII 65-90).
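
As a quick sanity check of the ASCII arithmetic behind that slice:

# 'A'..'Z' occupy codes 65..90 and the fill character '[' is 91, so the last
# bincount index is exactly the NaN fill and [65:-1] keeps just A-Z.
print(ord('A'), ord('Z'), ord('['))  # 65 90 91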

Approach #1S (super-charged)

The bottleneck of the previous approach was astype(str), and fillna() was another show-stopper, so some trickery is needed to get around both. Converting with astype('S1') up front keeps everything working on single bytes, and it reduces NaNs to just the character 'n'. That lets us skip fillna entirely, since the count for 'n' can simply be dropped later by slicing.
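
A minimal sketch of why that works, assuming NumPy is imported as np:

import numpy as np

# np.nan cast to a 1-byte string keeps only the leading 'n' of 'nan';
# its code (110) lies outside the 65..90 window kept by the bincount slice.
cell = np.array([np.nan], dtype=object).astype('S1')
print(cell, cell.view(np.uint8))  # [b'n'] [110]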

Hence, the implementation would be -

def app1S(df):
    ar = df.values.astype('S1')
    a = ar.view(np.uint8)
    count = np.bincount(a.ravel())[65:65+26]
    return count
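
A usage sketch, assuming df and the Counter c from the question are still in scope, that labels the raw counts with their letters and checks them against the Counter:

from string import ascii_uppercase

# Sketch only: attach the letters 'A'..'Z' to the 26 raw counts and compare
# with the Counter built in the question (the -999 NaN entry is excluded).
letter_counts = dict(zip(ascii_uppercase, app1S(df)))
assert all(letter_counts[k] == c[k] for k in ascii_uppercase)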

Timings with pandas-0.20.3 and numpy-1.13.3 -

In [3]: # Setup input
   ...: random.seed(100)
   ...: n = 1000000
   ...: data = {letter: [random.choice(list(ascii_uppercase) + 
   ...:         [np.nan]) for _ in range(n)] for letter in ascii_uppercase}
   ...: df = pd.DataFrame(data)
   ...: 

# @Wen soln
In [4]: %timeit df.melt().value.value_counts()
1 loop, best of 3: 2.5 s per loop

# @andrew_reece soln
In [5]: %timeit df.apply(pd.value_counts).sum(axis=1)
1 loop, best of 3: 2.14 s per loop

# Super-charged one
In [6]: %timeit app1S(df)
1 loop, best of 3: 501 ms per loop

np.unique could also be used to get both the values and their counts -

unq, count = np.unique(df.fillna(-999), return_counts=1)
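
As a small follow-up sketch (assuming the unq/count pair from the line above), the two parallel arrays can be zipped into a plain dict, with the -999 NaN placeholder dropped:

# Sketch only: turn np.unique's parallel arrays into a {value: count} dict,
# skipping the -999 fill value used for NaNs.
global_counts = {k: int(v) for k, v in zip(unq, count) if k != -999}
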
+6
df.apply(pd.value_counts).sum(axis=1)

Timings:

# example data
N = 10000000
rownum = int(N/1000.)
colnum = int(N/10000.)

str_vals = ['A','B','C','D']
str_data = np.random.choice(str_vals, size=N).reshape(rownum, colnum)
str_df = pd.DataFrame(str_data)

num_vals = [1,2,3,4]
num_data = np.random.choice(num_vals, size=N).reshape(rownum, colnum)
num_df = pd.DataFrame(num_data)

num_df.shape 
# (10000, 1000)

%%timeit
num_df.apply(pd.value_counts).sum(axis=1)
# 1 loop, best of 3: 883 ms per loop

%%timeit
str_df.apply(pd.value_counts).sum(axis=1)
# 1 loop, best of 3: 2.76 s per loop
+5

melt, then value_counts (PS: it still can't compete with the numpy solution)

 df.melt().value.value_counts()

Time

%timeit df.melt().value.value_counts()
100 loops, best of 3: 1.43 ms per loop
%timeit c = Counter([v for c in df for v in df[c].fillna(-999)])
100 loops, best of 3: 5.23 ms per loop
%timeit df.apply(pd.value_counts).sum()
100 loops, best of 3: 18.5 ms per loop
+4

Source: https://habr.com/ru/post/1687524/

