Pure Python:
    from itertools import chain

    def count(d):
        cols = set(chain(*d.values()))
        yield ['name'] + list(cols)
        for row, values in d.items():
            yield [row] + [(col in values) for col in cols]
Testing:
    >>> food2 = {
    ...     "apple": ["fruit", "round"],
    ...     "bananna": ["fruit", "yellow", "long"],
    ...     "carrot": ["veg", "orange", "long"],
    ...     "raddish": ["veg", "red"],
    ... }
    >>> list(count(food2))
    [['name', 'long', 'veg', 'fruit', 'yellow', 'orange', 'round', 'red'],
     ['bananna', True, False, True, True, False, False, False],
     ['carrot', True, True, False, False, True, False, False],
     ['apple', False, False, True, False, False, True, False],
     ['raddish', False, True, False, False, False, False, True]]
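Note that the columns come from a set, so their order is arbitrary and can change between runs. If you need reproducible output, a small variant (the name count_sorted is mine, not part of the answer above) can simply sort the attribute names first:

```python
from itertools import chain

def count_sorted(d):
    # Sorting the attribute names makes the column order deterministic.
    cols = sorted(set(chain(*d.values())))
    yield ['name'] + cols
    for row, values in d.items():
        yield [row] + [(col in values) for col in cols]

food2 = {
    "apple": ["fruit", "round"],
    "bananna": ["fruit", "yellow", "long"],
    "carrot": ["veg", "orange", "long"],
    "raddish": ["veg", "red"],
}
table = list(count_sorted(food2))
print(table[0])  # header row, columns in alphabetical order
```

The row order still follows dict iteration order, but every run now produces the same header.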
[update]
Performance test:
    >>> from itertools import product
    >>> labels = list("".join(_) for _ in product(*(["ABCDEF"] * 7)))
    >>> attrs = labels[:1000]
    >>> import random
    >>> sample = {}
    >>> for k in labels:
    ...     sample[k] = random.sample(attrs, 5)
    >>> import time
    >>> n = time.time(); list(count(sample)); print time.time() - n
    62.0367980003
Just over a minute (62 s) for 279,936 rows of 1,000 columns on my busy machine (many Chrome tabs open). Let me know if that performance is unacceptable.
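Since count() is a generator, the full table never has to sit in memory at once; for example, the rows can be streamed straight to a CSV writer. A minimal sketch (Python 3 here; io.StringIO stands in for a real output file):

```python
import csv
import io
from itertools import chain

def count(d):
    cols = set(chain(*d.values()))
    yield ['name'] + list(cols)
    for row, values in d.items():
        yield [row] + [(col in values) for col in cols]

food2 = {"apple": ["fruit", "round"], "carrot": ["veg", "orange"]}

# Stream rows to CSV one at a time instead of building list(count(...)).
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerows(count(food2))

lines = buf.getvalue().splitlines()
print(lines[0])  # the header row, e.g. "name,fruit,round,veg,orange"
```

For a large input like the performance test above, this keeps memory usage proportional to one row rather than the whole table.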
[update]
Testing performance from another answer:
    >>> n = time.time(); \
    ... df = pd.DataFrame(dict([(k, pd.Series(v)) for k, v in sample.items()])); \
    ... print time.time() - n
    72.0512290001
The next line ( df = pd.melt(...) ) took too long, so I canceled the test. Take this result with a grain of salt, since it was run on the same busy machine.