Pure Python:
    from itertools import chain

    def count(d):
        cols = set(chain(*d.values()))
        yield ['name'] + list(cols)
        for row, values in d.items():
            yield [row] + [(col in values) for col in cols]
Testing:
    >>> food2 = {
    ...     "apple": ["fruit", "round"],
    ...     "bananna": ["fruit", "yellow", "long"],
    ...     "carrot": ["veg", "orange", "long"],
    ...     "raddish": ["veg", "red"],
    ... }
    >>> list(count(food2))
    [['name', 'long', 'veg', 'fruit', 'yellow', 'orange', 'round', 'red'],
     ['bananna', True, False, True, True, False, False, False],
     ['carrot', True, True, False, False, True, False, False],
     ['apple', False, False, True, False, False, True, False],
     ['raddish', False, True, False, False, False, False, True]]
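Note that the columns come from a set, so their order is arbitrary and can change between runs. If you need reproducible output, a small variant (the name count_sorted is mine, not part of the answer above) can simply sort the attribute names first:

```python
from itertools import chain

def count_sorted(d):
    # Sorting the attribute names makes the column order deterministic.
    cols = sorted(set(chain(*d.values())))
    yield ['name'] + cols
    for row, values in d.items():
        yield [row] + [(col in values) for col in cols]

food2 = {
    "apple": ["fruit", "round"],
    "bananna": ["fruit", "yellow", "long"],
    "carrot": ["veg", "orange", "long"],
    "raddish": ["veg", "red"],
}
table = list(count_sorted(food2))
print(table[0])  # header row, columns in alphabetical order
```

The row order still follows dict iteration order, but every run now produces the same header.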
[update]
Performance test:
    >>> from itertools import product
    >>> labels = list("".join(_) for _ in product(*(["ABCDEF"] * 7)))
    >>> attrs = labels[:1000]
    >>> import random
    >>> sample = {}
    >>> for k in labels:
    ...     sample[k] = random.sample(attrs, 5)
    >>> import time
    >>> n = time.time(); list(count(sample)); print time.time() - n
    62.0367980003
Just over a minute (62 s) for 279,936 rows of 1,000 columns on my busy machine (many Chrome tabs open). Let me know if that performance is unacceptable.
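Since count() is a generator, the full table never has to sit in memory at once; for example, the rows can be streamed straight to a CSV writer. A minimal sketch (Python 3 here; io.StringIO stands in for a real output file):

```python
import csv
import io
from itertools import chain

def count(d):
    cols = set(chain(*d.values()))
    yield ['name'] + list(cols)
    for row, values in d.items():
        yield [row] + [(col in values) for col in cols]

food2 = {"apple": ["fruit", "round"], "carrot": ["veg", "orange"]}

# Stream rows to CSV one at a time instead of building list(count(...)).
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerows(count(food2))

lines = buf.getvalue().splitlines()
print(lines[0])  # the header row, e.g. "name,fruit,round,veg,orange"
```

For a large input like the performance test above, this keeps memory usage proportional to one row rather than the whole table.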
[update]
Testing performance from another answer:
    >>> n = time.time(); \
    ... df = pd.DataFrame(dict([(k, pd.Series(v)) for k, v in sample.items()])); \
    ... print time.time() - n
    72.0512290001
The next line ( df = pd.melt(...) ) took too long, so I canceled the test. Take this result with a grain of salt, since it was run on the same busy machine.