Python: create a pandas DataFrame with columns based on unique values in a nested list

I have a list containing the regions present in each sample. I would like to make a DataFrame in which each row (sample) records the presence or absence of each region (column). For example, the data may look like this:

region_list = [['North America'], ['North America', 'South America'], ['Asia'], ['North America', 'Asia', 'Australia']]

And the final DataFrame would look something like this:

North America    South America     Asia     Australia
1                0                 0        0
1                1                 0        0
0                0                 1        0
1                0                 1        1

I think I could probably do this with loops and appends (roughly like the sketch below), but is there a more pythonic way? Perhaps with numpy.where?
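For reference, this is roughly the loop-and-append version I have in mind (just a sketch; columns are ordered by first appearance):

import pandas as pd

region_list = [['North America'], ['North America', 'South America'], ['Asia'], ['North America', 'Asia', 'Australia']]

# collect the unique regions in order of first appearance
columns = []
for regions in region_list:
    for r in regions:
        if r not in columns:
            columns.append(r)

# one row of 0/1 flags per sample
rows = [[1 if c in regions else 0 for c in columns] for regions in region_list]
df = pd.DataFrame(rows, columns=columns)
print(df)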

4 answers

pandas
str.get_dummies

Join each sample's regions into a single '|'-delimited string, then let str.get_dummies split it back into indicator columns:

import pandas as pd

pd.Series(region_list).str.join('|').str.get_dummies()

   Asia  Australia  North America  South America
0     0          0              1              0
1     0          0              1              1
2     1          0              0              0
3     1          1              1              0
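Note that the join('|') step relies on '|' being str.get_dummies' default separator, so it assumes no region name contains a '|'; if that could happen, pick a different delimiter and pass it via str.get_dummies(sep=...).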

numpy
np.bincount with pd.factorize

import numpy as np

n = len(region_list)
# row number of every individual region occurrence
i = np.arange(n).repeat([len(x) for x in region_list])
# integer code of every occurrence (f) and the array of unique labels (u)
f, u = pd.factorize(np.concatenate(region_list))
m = u.size

pd.DataFrame(
    np.bincount(i * m + f, minlength=n * m).reshape(n, m),
    columns=u
)

   North America  South America  Asia  Australia
0              1              0     0          0
1              1              1     0          0
2              0              0     1          0
3              1              0     1          1
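For reference, the intermediates for the sample region_list look like this (worked out by hand to show what the flat-index trick is doing):

i          # array([0, 1, 1, 2, 3, 3, 3])   row of each region occurrence
f          # array([0, 0, 1, 2, 0, 2, 3])   integer code of each occurrence
u          # array(['North America', 'South America', 'Asia', 'Australia'], dtype=object)
i * m + f  # array([ 0,  4,  5, 10, 12, 14, 15])

i * m + f maps every occurrence to a distinct cell of the flattened n x m grid, np.bincount counts the hits per cell, and reshape(n, m) turns the counts back into the sample-by-region table.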

Timing

%timeit pd.Series(region_list).str.join('|').str.get_dummies()
1000 loops, best of 3: 1.42 ms per loop

%%timeit
n = len(region_list)
i = np.arange(n).repeat([len(x) for x in region_list])
f, u = pd.factorize(np.concatenate(region_list))
m = u.size

pd.DataFrame(
    np.bincount(i * m + f, minlength=n * m).reshape(n, m),
    columns=u
)
1000 loops, best of 3: 204 µs per loop
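Most of that gap presumably comes from the string route having to build and re-split the joined strings, while the numpy route only shuffles integer codes; both timings are for this tiny example, so treat them as indicative rather than general.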

Try:

df = pd.DataFrame(region_list)

# stack() drops the NaN padding that shorter lists get and yields one row per
# (sample, region) pair; 'level_0' is the original sample index
df2 = df.stack().reset_index(name='region')

# one-hot encode the regions, then sum the dummies back to one row per sample
df_out = pd.get_dummies(df2.set_index('level_0')['region']).groupby(level=0).sum().rename_axis(None)

print(df_out)

Output:

         Asia  Australia  North America  South America                                               
0           0          0              1              0
1           0          0              1              1
2           1          0              0              0
3           1          1              1              0
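On pandas 0.25 or newer, the same stack-then-regroup idea can be written more directly with Series.explode (a sketch along the same lines, not part of the original answer):

s = pd.Series(region_list).explode()
df_out = pd.get_dummies(s).groupby(level=0).sum()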


import pandas as pd
import itertools

# note: chaining flattens region_list, so this yields one row per region
# occurrence rather than one row per sample
pd.get_dummies(pd.DataFrame(list(itertools.chain(*region_list))))

Output
       0_Asia  0_Australia  0_North America  0_South America
    0       0            0                1                0
    1       0            0                1                0
    2       0            0                0                1
    3       1            0                0                0
    4       0            0                1                0
    5       1            0                0                0
    6       0            1                0                0

Using itertools.chain.from_iterable and a list comprehension:

from itertools import chain
import pandas as pd

region_list = [['North America'], ['North America', 'South America'], ['Asia'], ['North America', 'Asia', 'Australia']]

regions = list(set(chain.from_iterable(region_list)))
vals = [[1 if j in k else 0 for j in regions] for k in region_list]
df = pd.DataFrame(vals, columns=regions)
print(df)

Output:

   Australia  Asia  North America  South America
0          0     0              1              0
1          0     0              1              1
2          0     1              0              0
3          1     1              1              0
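Note that building regions from a set makes the column order arbitrary (it can change between runs), which is why it differs from the other answers; sort regions, or collect it in first-appearance order, if a stable column order matters.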

Source: https://habr.com/ru/post/1679096/

