Pythonic way to filter columns and then create a new column

Question


I have a .xlsx file that I open with this code:

import pandas as pd

df = pd.read_excel('file.xlsx')
df['Description'].head()

and I get the following result, which looks fine:

ID     | Description
:----- | :-----------------------------
0      | Some Description with no hash
1      | Text with #one hash
2      | Text with #two #hashes

Now I want to create a new column that keeps only the words starting with #, like this:

ID     | Description                      |  Only_Hash
:----- | :-----------------------------   |  :-----------------
0      | Some Description with no hash    |   NaN
1      | Text with #one hash              |   #one
2      | Text with #two #hashes           |   #two #hashes

I managed to count the rows that contain a #:

descriptionWithHash = df['Description'].str.contains('#').sum()

but now I want to create a column as described above. What is the easiest way to do this?

Thanks!

PS: the tables above were supposed to render as formatted tables in the question, but I can't figure out why they don't!

+4

python pandas

Claudio Jul 31 '17 at 11:14


2 answers

In [126]: df.join(df.Description
     ...:           .str.extractall(r'(\#\w+)')
     ...:           .unstack(-1)
     ...:           .T.apply(lambda x: x.str.cat(sep=' ')).T
     ...:           .to_frame(name='Hash'))
Out[126]:
   ID                    Description          Hash
0   0  Some Description with no hash           NaN
1   1            Text with #one hash          #one
2   2         Text with #two #hashes  #two #hashes

+5

Maxu Jul 31 '17 at 11:20


jezrael · Accepted Answer · 2017-07-31T11:19:38+0000

You can use str.findall with str.join:

df['new'] = df['Description'].str.findall(r'(#\w+)').str.join(' ')
print(df)
   ID                    Description           new
0   0  Some Description with no hash              
1   1            Text with #one hash          #one
2   2         Text with #two #hashes  #two #hashes
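To see why the first row ends up as an empty string rather than NaN, here is a minimal sketch using only the standard library's re module. pandas documents str.findall as equivalent to applying re.findall to each element, so this should mirror what happens per row. The helper name find_hashtags and the sample list are made up for illustration.

```python
import re

# Sample descriptions mirroring the table in the question.
descriptions = [
    "Some Description with no hash",
    "Text with #one hash",
    "Text with #two #hashes",
]

# '#' followed by one or more word characters (letters, digits, underscore).
HASHTAG = re.compile(r"#\w+")


def find_hashtags(text):
    """Return every hashtag found in text, in order of appearance."""
    return HASHTAG.findall(text)


for text in descriptions:
    tags = find_hashtags(text)
    # Joining an empty list yields '', which is why a row
    # without hashtags becomes an empty string, not NaN.
    joined = " ".join(tags)
    print(repr(text), "->", tags, "->", repr(joined))
```

A row without any match gives an empty list, and joining an empty list gives an empty string.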

To get NaN instead of empty strings:

import numpy as np
df['new'] = df['Description'].str.findall(r'(#\w+)').str.join(' ').replace('', np.nan)
print(df)
   ID                    Description           new
0   0  Some Description with no hash           NaN
1   1            Text with #one hash          #one
2   2         Text with #two #hashes  #two #hashes
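
If you want to try this without the Excel file, here is a self-contained sketch. It assumes a small DataFrame built by hand to mirror the table in the question, and it names the new column Only_Hash as the question asked; adjust both to your real data.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the data read from file.xlsx (an assumption).
df = pd.DataFrame({
    "ID": [0, 1, 2],
    "Description": [
        "Some Description with no hash",
        "Text with #one hash",
        "Text with #two #hashes",
    ],
})

df["Only_Hash"] = (
    df["Description"]
    .str.findall(r"#\w+")   # list of hashtags per row
    .str.join(" ")          # lists -> space-separated strings
    .replace("", np.nan)    # rows without hashtags -> NaN
)

print(df)
```

The raw string r"#\w+" avoids Python's invalid-escape warnings, and the capture group is not needed when the whole match is what you want.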
