Problems merging two pandas data with different forms?

This is pretty simple, but I don’t understand why I cannot combine the two data frames. I have the following dfwith different shapes (one bigger and wider than the other):

df1

    A                      id
0   microsoft inc          1
1   apple computer.        2
2   Google Inc.            3
3   IBM                    4
4   amazon, Inc.           5

df2

    B   C   D   E   id
0   (01780-500-01)  237489 - 342    API     True.   1
0   (409-6043-01)   234324  API     Other   2
0   23423423    API     NaN     NaN     3
0   (001722-5e240-60)   NaN     NaN     Other   4
1   (0012172-52411-60)  32423423.   NaN     Other   4
0   29849032-29482390   API     Yes     False   5
1   329482030-23490-1   API     Yes     False   5

I would like to merge df1and df2in the column index:

Df3

    A   B   C   D   E   id
0   microsoft inc   (01780-500-01)  237489 - 342    API     True.   1
1   apple computer. (409-6043-01)   234324  API     Other   2
2   Google Inc. 23423423    API     NaN     NaN     3
3   IBM (001722-5e240-60)   NaN     NaN     Other   4
4   IBM (0012172-52411-60)  32423423.   NaN     Other   4
5   amazon, Inc.    29849032-29482390   API     Yes     False   5
6   amazon, Inc.    329482030-23490-1   API     Yes     False   5

I know that this can be done using merge () . Also, I read this excellent tutorial and tried:

AT:

pd.merge(df1, df2, on=df1.id, how='outer')

Of:

IndexError: indices are out-of-bounds

Then I tried:

pd.merge(df2, df1, on='id', how='outer')

And, apparently, he repeated the combined lines several times, something like this:

    A   B   C   D   E   index
0   microsoft inc   (01780-500-01)  237489 - 342    API     True.   1
1   apple computer. (409-6043-01)   234324  API     Other   2
2   apple computer. (409-6043-01)   234324  API     Other   2
3   apple computer. (409-6043-01)   234324  API     Other   2
4   apple computer. (409-6043-01)   234324  API     Other   2
5   apple computer. (409-6043-01)   234324  API     Other   2
6   apple computer. (409-6043-01)   234324  API     Other   2
7   apple computer. (409-6043-01)   234324  API     Other   2
8   apple computer. (409-6043-01)   234324  API     Other   2
...

I think this is due to the fact that I created a temporary index df2['position'] = df2.index, as the indexes look weird and then deleted. So my question is: how to get it df3?

UPDATE

df2 :

df2.reset_index(drop=True, inplace=True)

:

    B   C   D   E   id
0   (01780-500-01)  237489 - 342    API     True.   1
1   (409-6043-01)   234324  API     Other   2
2   23423423    API     NaN     NaN     3
3   (001722-5e240-60)   NaN     NaN     Other   4
4   (0012172-52411-60)  32423423.   NaN     Other   4
5   29849032-29482390   API     Yes     False   5
6   329482030-23490-1   API     Yes     False   5

. .

>>>print(df2.dtypes)
B    object
C    object
D    object
E    object
id   int64
dtype: object

>>>print(df1.dtypes)
A                object
id               int64
dtype: object

Update2

>>>print(df2['id'])
0        1
1        2
2        3
3        4
4        4
5        5
6        5
7        6
8        6
9        7
10       8
11       8
12       8
13       8
14       9
15      10
16      11
17      11
18      12
19      12
20      13
21      13
22      14
23      15
24      16
25      16
26      17
27      17
28      18
29      18
      ... 
476    132
477    132
478    132
479    132
480    132
481    132
482    132
483    132
484    133
485    133
486    133
487    133
488    134
489    134
490    134
491    134
492    135
493    135
494    136
495    136
496    137
497    137
498    137
499    137
500    137
501    137
502    137
503    138
504    138
505    138
Name: id, dtype: int64

>>>print(df1)

0       1
1       2
2       3
3       4
4       5
5       6
6       7
7       8
8       9
9      10
10     11
11      8
12     12
13      6
14      7
15      8
16      6
17     11
18     13
19     14
20     15
21     11
22      2
23     16
24     17
25     18
26      9
27     19
28     11
29     20
       ..
108    57
109    43
110    22
111     2
112    58
113    49
114    22
115    59
116     2
117     6
118    22
119     2
120    37
121     2
122     9
123    60
124    61
125    62
126    63
127    42
128    64
129     4
130    29
131    11
132     2
133    25
134     4
135    65
136    66
137     4
Name: id, dtype: int64
+4
1

id, join:

df1 = pd.DataFrame([('microsoft inc',1),
('apple computer.',2),
('Google Inc.',3),
('IBM',4),
('amazon, Inc.',5)],columns = ('A','id'))

df2 = pd.DataFrame([('(01780-500-01)','237489', '- 342','API',   1),
('(409-6043-01)','234324', ' API','Other   ',2),
('23423423','API', 'NaN','NaN',     3),
('(001722-5e240-60)','NaN', 'NaN','Other',   4),
('(0012172-52411-60)','32423423','   NaN','Other',   4),
('29849032-29482390','API', '    Yes','     False',   5),
('329482030-23490-1','API', '    Yes','     False',   5)],
columns = ['B','C','D','E','id'])

df1  =df1.set_index('id')
df1.drop_duplicates(inplace=True)
df2  = df2.set_index('id')
df3  = df1.join(df2,how='outer')

(aka join keys) , on='id'.

. pd.merge(df1, df2, on='id', how='outer'). , id , @JohnE

+1

Source: https://habr.com/ru/post/1661091/


All Articles