How to efficiently combine two columns into one column / concatenate strings?

I have two columns (A and Date) that need to be combined into a single column, for example column C. The data set contains more than 900,000 rows.

I ran into two main problems.

  1. The column "Date" has dtype Timestamp, so combining it with a string column raises an error:

TypeError: unsupported operand type(s) for +: 'Timestamp' and 'str'

  2. The code is also too slow. I wrote a for loop to do the combination, as shown below:

    for i in range(0, 911462):
        df['Combine'][i] = df['Date'][i] + df['A'][i]

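I suspect the for loop goes through the data row by row and that each step costs a lot of I/O.

Is there a more efficient way to do this?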


Convert the dates to strings first, then concatenate. You can do this with strftime:

In [11]: df = pd.DataFrame([[pd.Timestamp("2017-01-01"), 'a'], [pd.Timestamp("2017-01-02"), 'b']], columns=["A", "B"])

In [12]: df["A"].dt.strftime("%Y-%m-%d") + df["B"]
Out[12]:
0    2017-01-01a
1    2017-01-02b
dtype: object
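As a rough check that the vectorized expression produces the same strings as a row-by-row loop, here is a minimal sketch. The sample values are made up for illustration; only the column names A and Date come from the question.

```python
import pandas as pd

# Hypothetical sample data using the question's column names.
df = pd.DataFrame({
    "A": ["XX", "YY", "ZZ"],
    "Date": pd.to_datetime(["2016-01-01", "2016-01-15", "2016-12-01"]),
})

# Row-by-row version: format each Timestamp, then concatenate.
looped = [d.strftime("%Y-%m-%d") + a for d, a in zip(df["Date"], df["A"])]

# Vectorized version: format the whole column at once, then add the strings.
df["Combine"] = df["Date"].dt.strftime("%Y-%m-%d") + df["A"]

# Both approaches should yield identical strings.
assert df["Combine"].tolist() == looped
print(df["Combine"].tolist())
```

The vectorized form avoids assigning into the frame once per row, which is where the loop in the question spends most of its time.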

Use astype to convert the Timestamp column to string:

import pandas as pd
df = pd.DataFrame({'A':['XX','YY','ZZ','AA'], 'Date':[pd.Timestamp("2016-01-01"),pd.Timestamp('2016-01-15'),pd.Timestamp('2016-12-01'),pd.Timestamp('2016-07-12')]})
df['Combine'] = df['Date'].astype(str) + '_'+df['A']
df

df now looks like this:

    A   Date        Combine
0   XX  2016-01-01  2016-01-01_XX
1   YY  2016-01-15  2016-01-15_YY
2   ZZ  2016-12-01  2016-12-01_ZZ
3   AA  2016-07-12  2016-07-12_AA
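One thing that may be worth checking before relying on astype(str): it does not let you choose the format, so timestamps that carry a time of day keep it in the resulting string. The sketch below uses made-up values to show the difference from strftime; it is an illustration, not part of the answer above.

```python
import pandas as pd

# Hypothetical data: the first timestamp carries a time of day.
s = pd.Series(pd.to_datetime(["2016-01-01 12:30:00", "2016-01-15 00:00:00"]))

# astype(str) keeps the time component of the values.
as_text = s.astype(str)

# strftime always produces exactly the requested format.
as_date = s.dt.strftime("%Y-%m-%d")

print(as_text.tolist())
print(as_date.tolist())
```

If the column only ever holds midnight timestamps, as in the example above, both approaches should give the same date-only strings.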

import numpy as np
import pandas as pd

# Sample data used by the options below
df = pd.DataFrame(dict(
    A='XX YY ZZ AA'.split(),
    Date=pd.date_range('2017-03-31', periods=4)
))

Option 1
Use apply with a lambda and str.format.

df.assign(C=df.apply(lambda x: '{Date:%Y-%m-%d}_{A}'.format(**x), 1))

    A       Date              C
0  XX 2017-03-31  2017-03-31_XX
1  YY 2017-04-01  2017-04-01_YY
2  ZZ 2017-04-02  2017-04-02_ZZ
3  AA 2017-04-03  2017-04-03_AA

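Option 2
Use numpy.core.defchararray.add (publicly exposed as np.char.add).
Cast the dates to 'datetime64[D]' first so that they convert to date-only strings.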

chr_add = np.char.add  # public alias of np.core.defchararray.add

d = df.Date.values.astype('datetime64[D]').astype(str)
a = df.A.values.astype(str)
df.assign(C=chr_add(chr_add(d, '_'), a))

    A       Date              C
0  XX 2017-03-31  2017-03-31_XX
1  YY 2017-04-01  2017-04-01_YY
2  ZZ 2017-04-02  2017-04-02_ZZ
3  AA 2017-04-03  2017-04-03_AA

Option 3
A rip-off of @AndyHayden's answer, with the '_' folded into the strftime format. See the timeit results below.

df.assign(C=df.Date.dt.strftime('%Y-%m-%d_') + df.A)

    A       Date              C
0  XX 2017-03-31  2017-03-31_XX
1  YY 2017-04-01  2017-04-01_YY
2  ZZ 2017-04-02  2017-04-02_ZZ
3  AA 2017-04-03  2017-04-03_AA

%%timeit
chr_add = np.char.add  # public alias of np.core.defchararray.add

d = df.Date.values.astype('datetime64[D]').astype(str)
a = df.A.values.astype(str)
chr_add(chr_add(d, '_'), a)

%timeit df.assign(C=df.apply(lambda x: '{Date:%Y-%m-%d}_{A}'.format(**x), 1))
%timeit df.assign(C=df.Date.dt.strftime('%Y-%m-%d_') + df.A)

10000 loops, best of 3: 53.2 µs per loop
1000 loops, best of 3: 1.14 ms per loop
1000 loops, best of 3: 831 µs per loop

df = pd.concat([df] * 10000, ignore_index=True)

10 loops, best of 3: 80.3 ms per loop
1 loop, best of 3: 4.58 s per loop
1 loop, best of 3: 233 ms per loop
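The timings above come from IPython's %timeit magic. For anyone who wants to rerun the comparison as a plain script, the following is a minimal sketch using the standard timeit module. The function names with_apply, with_char_add and with_strftime are hypothetical wrappers introduced here, the frame is kept smaller than in the answer so the apply variant finishes quickly, and absolute numbers will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

# Same shape of data as in the answer, repeated to 2,000 rows (an arbitrary size).
df = pd.DataFrame({
    "A": "XX YY ZZ AA".split(),
    "Date": pd.date_range("2017-03-31", periods=4),
})
df = pd.concat([df] * 500, ignore_index=True)


def with_apply(frame):
    # Option 1: format every row with str.format.
    return frame.apply(lambda x: "{Date:%Y-%m-%d}_{A}".format(**x), axis=1)


def with_char_add(frame):
    # Option 2: NumPy string arrays joined element-wise.
    d = frame["Date"].to_numpy().astype("datetime64[D]").astype(str)
    a = frame["A"].to_numpy(dtype=str)
    return np.char.add(np.char.add(d, "_"), a)


def with_strftime(frame):
    # Option 3: strftime with the separator folded into the format.
    return frame["Date"].dt.strftime("%Y-%m-%d_") + frame["A"]


# All three variants should agree before their speed is compared.
expected = with_strftime(df).tolist()
assert with_apply(df).tolist() == expected
assert with_char_add(df).tolist() == expected

for fn in (with_char_add, with_apply, with_strftime):
    seconds = timeit.timeit(lambda: fn(df), number=3) / 3
    print(f"{fn.__name__}: {seconds * 1e3:.1f} ms per call")
```

Checking that the outputs match first guards against benchmarking variants that silently produce different strings.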

Alternatively, consider a map/reduce approach, for example with MrJob.


Source: https://habr.com/ru/post/1679767/