Creating a dataframe from a dictionary where records have different lengths

Say I have a dictionary with 10 key-value pairs. Each entry contains a numpy array. However, the length of the array is not the same for all of them.

How can I create a DataFrame where each column holds one of these records?

When I try:

pd.DataFrame(my_dict) 

I get:

 ValueError: arrays must all be the same length 

Is there any way to overcome this? I'd be happy for Pandas to fill the shorter columns with NaN.
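A minimal reproduction of the error, using a hypothetical two-key version of the dict described above:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 10-key dict: two arrays of unequal length
my_dict = {'A': np.array([1, 2]), 'B': np.array([1, 2, 3, 4])}

try:
    pd.DataFrame(my_dict)
except ValueError as e:
    # Message wording varies by pandas version,
    # e.g. "All arrays must be of the same length"
    print(e)
```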

+86
7 answers

In Python 3.x:

    In [6]: d = dict( A = np.array([1,2]), B = np.array([1,2,3,4]) )

    In [7]: pd.DataFrame(dict([ (k, pd.Series(v)) for k, v in d.items() ]))
    Out[7]:
         A  B
    0    1  1
    1    2  2
    2  NaN  3
    3  NaN  4

In Python 2.x:

replace d.items() with d.iteritems().

+104

Here is an easy way to do this:

    In [20]: my_dict = dict( A = np.array([1,2]), B = np.array([1,2,3,4]) )

    In [21]: df = pd.DataFrame.from_dict(my_dict, orient='index')

    In [22]: df
    Out[22]:
       0  1   2   3
    A  1  2 NaN NaN
    B  1  2   3   4

    In [23]: df.transpose()
    Out[23]:
         A  B
    0    1  1
    1    2  2
    2  NaN  3
    3  NaN  4
+70

The following is a way to tidy up your syntax, but it does essentially the same as the other answers:

    >>> mydict = {'one': [1,2,3], 2: [4,5,6,7], 3: 8}
    >>> dict_df = pd.DataFrame({ key: pd.Series(value) for key, value in mydict.items() })
    >>> dict_df
       one  2    3
    0  1.0  4  8.0
    1  2.0  5  NaN
    2  3.0  6  NaN
    3  NaN  7  NaN

A similar syntax exists for lists:

    >>> mylist = [ [1,2,3], [4,5], 6 ]
    >>> list_df = pd.DataFrame([ pd.Series(value) for value in mylist ])
    >>> list_df
         0    1    2
    0  1.0  2.0  3.0
    1  4.0  5.0  NaN
    2  6.0  NaN  NaN

Another syntax for lists:

    >>> mylist = [ [1,2,3], [4,5], 6 ]
    >>> list_df = pd.DataFrame({ i: pd.Series(value) for i, value in enumerate(mylist) })
    >>> list_df
       0    1    2
    0  1  4.0  6.0
    1  2  5.0  NaN
    2  3  NaN  NaN

In all of these cases, be careful to check which data type pandas infers for each column. Any column that contains NaN values will be upcast, for example to floating point.
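A minimal sketch of that coercion, using a hypothetical two-column frame: the column padded with NaN becomes float64, while the complete column keeps its integer dtype.

```python
import pandas as pd

# 'A' is shorter, so it gets NaN padding and is upcast to float64;
# 'B' is complete and stays int64
df = pd.DataFrame({'A': pd.Series([1, 2]), 'B': pd.Series([1, 2, 3, 4])})
print(df.dtypes)
```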

+10

Although this does not directly answer the OP's question, I found it a great solution for my case, where I had arrays of unequal length that shared an index, and I would like to share it:

from pandas documentation

    In [31]: d = {'one' : Series([1., 2., 3.], index=['a', 'b', 'c']),
       ....:      'two' : Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
       ....:

    In [32]: df = DataFrame(d)

    In [33]: df
    Out[33]:
       one  two
    a    1    1
    b    2    2
    c    3    3
    d  NaN    4
+3

You can also use pd.concat along axis=1 with a list of pd.Series objects:

    import numpy as np
    import pandas as pd

    d = {'A': np.array([1,2]), 'B': np.array([1,2,3,4])}
    res = pd.concat([pd.Series(v, name=k) for k, v in d.items()], axis=1)
    print(res)

         A  B
    0  1.0  1
    1  2.0  2
    2  NaN  3
    3  NaN  4
+3

Both of the following lines work fine:

    pd.DataFrame.from_dict(my_dict, orient='index').transpose()            # A
    pd.DataFrame(dict([ (k, pd.Series(v)) for k, v in my_dict.items() ]))  # B (better)

But with %timeit in Jupyter, I measured a 4x speed ratio in favor of B over A, which is quite impressive, especially when working with a huge dataset (mostly one with many columns/features).
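Outside of Jupyter, the same comparison can be sketched with the stdlib timeit module; the exact ratio will vary with the pandas version, the data shape, and the hardware, so treat the numbers as indicative only.

```python
import timeit

import numpy as np
import pandas as pd

my_dict = {'A': np.array([1, 2]), 'B': np.array([1, 2, 3, 4])}

def method_a():
    return pd.DataFrame.from_dict(my_dict, orient='index').transpose()

def method_b():
    return pd.DataFrame({k: pd.Series(v) for k, v in my_dict.items()})

# Both produce the same 4x2 NaN-padded frame; only the timings differ
t_a = timeit.timeit(method_a, number=1000)
t_b = timeit.timeit(method_b, number=1000)
print(f"A: {t_a:.3f}s  B: {t_b:.3f}s")
```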

+1

If you do not want NaN displayed and you have exactly two lengths, padding each remaining cell of the shorter list with a "space" will also work.

    import pandas as pd

    long = [6, 4, 7, 3]
    short = [5, 6]

    # Pad the shorter list with spaces so both columns have the same length
    for n in range(len(long) - len(short)):
        short.append(' ')

    df = pd.DataFrame({'A': long, 'B': short})

    # Write the result to an Excel file in the working directory
    datatoexcel = pd.ExcelWriter('example1.xlsx', engine='xlsxwriter')
    df.to_excel(datatoexcel, sheet_name='Sheet1')
    datatoexcel.save()

       A  B
    0  6  5
    1  4  6
    2  7
    3  3

If you have more than two record lengths, it is better to write a function that applies the same padding method to every column.
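Such a function might look like the sketch below; `pad_to_longest` is a hypothetical name, and it generalizes the loop above to any number of columns.

```python
import pandas as pd

def pad_to_longest(columns, fill=' '):
    """Pad every value list to the length of the longest one.

    Hypothetical helper generalizing the two-list padding above.
    """
    longest = max(len(v) for v in columns.values())
    return {k: list(v) + [fill] * (longest - len(v)) for k, v in columns.items()}

# Three columns with three different lengths
df = pd.DataFrame(pad_to_longest({'A': [6, 4, 7, 3], 'B': [5, 6], 'C': [1]}))
print(df)
```

Note that, as in the original answer, the padded columns end up with object dtype because they mix numbers and strings.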

+1

Source: https://habr.com/ru/post/1243885/

