How to conditionally copy a substring into a new pandas dataframe column?

Question

This is my first post, so I hope I have explained this clearly. This is a two-part question. I need to write code that first checks whether column A equals "VALID". If so, I need to extract a substring from column B and place it in a new column, "C". If the condition is false, I would like to add "NA" instead. The second table below shows my desired result.

| A       | B                                  |
|---------|------------------------------------|
| VALID   | asdfafX'XextractthisY'Yeaaadf      |
| INVALID | secondrowX'XsubtextY'Yelakj        |
| VALID   | secondrowX'XextractthistooY'Yelakj |

| A       | B                                    | C              |
|---------|--------------------------------------|----------------|
| VALID   | "asdfafX'XextractthisY'Yeaaadf"      | extractthis    |
| INVALID | "secondrowX'XsubtextY'Yelakj"        | NA             |
| VALID   | "secondrowX'XextractthistooY'Yelakj" | extractthistoo |

A few notes:

- A substring always starts after the phrase “X'X” and ends right before “Y'Y”.

- The substring will have different lengths from cell to cell.

I know that the following code is incorrect, but I wanted to show you how I tried to solve this problem:

    import pandas as pd

    if df[A] == "VALID":
        df[C] = df[B]df.str[start:finish]
    else:
        df[C].isna()

I apologize for the errors in this base code, as I am new to python in general and still rely on the IDE and trial and error to guide me. Any help you can provide is appreciated.

python string substring pandas dataframe

ParalysisByAnalysis Aug 23 '17 at 23:44

1 answer

cᴏʟᴅsᴘᴇᴇᴅ · Accepted Answer · 2017-08-23T23:51:43+0000

You can use pd.Series.str.extract:

    In [737]: df
    Out[737]:
             A                                   B
    0    VALID       asdfafX'XextractthisY'Yeaaadf
    1  INVALID         secondrowX'XsubtextY'Yelakj
    2    VALID  secondrowX'XextractthistooY'Yelakj

    In [745]: df['C'] = df[df.A == 'VALID'].B.str.extract("(?<=X'X)(.*?)(?=Y'Y)", expand=False)

    In [746]: df
    Out[746]:
             A                                   B               C
    0    VALID       asdfafX'XextractthisY'Yeaaadf     extractthis
    1  INVALID         secondrowX'XsubtextY'Yelakj             NaN
    2    VALID  secondrowX'XextractthistooY'Yelakj  extractthistoo
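The question asks for the string "NA" rather than a missing value in the rows that are not "VALID". The following is a minimal, self-contained sketch of one way to get there, assuming the sample data from the question. The names `mask` and `pattern` are illustrative only, and the final `fillna` step is an addition that is not part of the original answer:

```python
import pandas as pd

# Sample data reconstructed from the question's first table.
df = pd.DataFrame({
    "A": ["VALID", "INVALID", "VALID"],
    "B": [
        "asdfafX'XextractthisY'Yeaaadf",
        "secondrowX'XsubtextY'Yelakj",
        "secondrowX'XextractthistooY'Yelakj",
    ],
})

# Capture the text between X'X and Y'Y.
pattern = r"(?<=X'X)(.*?)(?=Y'Y)"

# Extract only from the rows where A == "VALID"; on assignment pandas
# aligns on the index, so the other rows are left as missing values.
mask = df["A"] == "VALID"
df["C"] = df.loc[mask, "B"].str.extract(pattern, expand=False)

# Assumption: the literal string "NA" is wanted instead of a missing value.
df["C"] = df["C"].fillna("NA")

print(df)
```

Note that `fillna("NA")` also replaces the missing value produced by a "VALID" row whose column B contains no match, so the two cases become indistinguishable. If that distinction matters, leaving the column as NaN may be the better choice.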

Regular expression pattern:

 (?<=X'X)(.*?)(?=Y'Y)

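(?<=X'X) is a lookbehind for X'X: the match must be preceded by X'X, but X'X is not part of the match.
(.*?) lazily captures everything between the lookbehind and the lookahead, stopping at the first Y'Y.
(?=Y'Y) is a lookahead for Y'Y: the match must be followed by Y'Y, but Y'Y is not part of the match.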
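To see how the pieces of the pattern behave outside pandas, here is a small sketch using Python's standard `re` module on one of the sample strings. The variable names are illustrative only:

```python
import re

pattern = re.compile(r"(?<=X'X)(.*?)(?=Y'Y)")

match = pattern.search("asdfafX'XextractthisY'Yeaaadf")

# Lookarounds consume no characters, so the whole match (group 0)
# and the captured group (group 1) are the same text.
whole = match.group(0)
captured = match.group(1)
print(whole, captured)

# A string without the delimiters produces no match at all.
no_match = pattern.search("no delimiters here")
print(no_match)
```

Because the delimiters are excluded from the match by the lookarounds, the capturing group is what `str.extract` needs here: it returns the contents of the group, not the overall match.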
