How to conditionally copy a substring into a new pandas dataframe column?

This is my first post, so I hope my question is clear. It is a two-part question. I need code that first checks whether column A equals "VALID". If it does, I need to extract a substring from column B and place it in a new column, "C". If the condition is false, I want to put "NA" there instead. See the second table for my desired result.

| A       | B                                  |
|---------|------------------------------------|
| VALID   | asdfafX'XextractthisY'Yeaaadf      |
| INVALID | secondrowX'XsubtextY'Yelakj        |
| VALID   | secondrowX'XextractthistooY'Yelakj |

| A       | B                                    | C              |
|---------|--------------------------------------|----------------|
| VALID   | "asdfafX'XextractthisY'Yeaaadf"      | extractthis    |
| INVALID | "secondrowX'XsubtextY'Yelakj"        | NA             |
| VALID   | "secondrowX'XextractthistooY'Yelakj" | extractthistoo |

A few notes:

- The substring always starts right after the phrase "X'X" and ends right before "Y'Y".

- The substring will have different lengths from cell to cell.

I know that the following code is incorrect, but I wanted to show you how I tried to solve this problem:

    import pandas as pd

    if df[A] == "VALID":
        df[C] = df[B]df.str[start:finish]
    else:
        df[C].isna()

I apologize for the errors in this code; I am new to Python in general and still rely on the IDE and trial and error to guide me. Any help you can provide is appreciated.

1 answer

You can use pd.Series.str.extract:

    In [737]: df
    Out[737]:
             A                                   B
    0    VALID       asdfafX'XextractthisY'Yeaaadf
    1  INVALID         secondrowX'XsubtextY'Yelakj
    2    VALID  secondrowX'XextractthistooY'Yelakj

    In [745]: df['C'] = df[df.A == 'VALID'].B.str.extract("(?<=X'X)(.*?)(?=Y'Y)", expand=False)

    In [746]: df
    Out[746]:
             A                                   B               C
    0    VALID       asdfafX'XextractthisY'Yeaaadf     extractthis
    1  INVALID         secondrowX'XsubtextY'Yelakj             NaN
    2    VALID  secondrowX'XextractthistooY'Yelakj  extractthistoo
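For anyone who wants to run this outside IPython, here is a minimal self-contained sketch of the same approach. The sample frame is rebuilt from the tables in the question. The extra column name `C_filled` and the `fillna("NA")` step are my own assumptions, added only because the question asks for the literal text "NA" rather than a missing value.

```python
import pandas as pd

# Sample data rebuilt from the question's first table.
df = pd.DataFrame({
    "A": ["VALID", "INVALID", "VALID"],
    "B": [
        "asdfafX'XextractthisY'Yeaaadf",
        "secondrowX'XsubtextY'Yelakj",
        "secondrowX'XextractthistooY'Yelakj",
    ],
})

pattern = r"(?<=X'X)(.*?)(?=Y'Y)"

# Extract only from rows where A == "VALID". The result keeps the original
# index labels, so assigning it back leaves the other rows as NaN.
valid_b = df.loc[df["A"] == "VALID", "B"]
df["C"] = valid_b.str.extract(pattern, expand=False)

# Hypothetical extra column: replace missing values with the string "NA".
df["C_filled"] = df["C"].fillna("NA")

print(df)
```

Keeping NaN in `C` is usually more convenient for later filtering with `isna()` or `dropna()`. The string "NA" is only worth using if the column is going straight into a report or an export.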

Regular expression pattern:

 (?<=X'X)(.*?)(?=Y'Y) 
  • (?<=X'X) is a lookbehind for X'X

  • (.*?) is a non-greedy capture group that matches everything between the lookbehind and the lookahead.

  • (?=Y'Y) is a lookahead for Y'Y
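To see how the three parts of the pattern behave on their own, here is a small sketch using Python's built-in `re` module instead of pandas. The string in `two_markers` is a made-up example, not data from the question. It is there only to show why the non-greedy `.*?` matters when a cell contains more than one X'X ... Y'Y pair.

```python
import re

pattern = re.compile(r"(?<=X'X)(.*?)(?=Y'Y)")

# A string taken from the question's table.
text = "asdfafX'XextractthisY'Yeaaadf"
match = pattern.search(text)
extracted = match.group(1) if match else None
print(extracted)

# Hypothetical string with two marker pairs, to compare lazy and greedy matching.
two_markers = "aX'XfirstY'YbX'XsecondY'Yc"
lazy = re.search(r"(?<=X'X)(.*?)(?=Y'Y)", two_markers).group(1)
greedy = re.search(r"(?<=X'X)(.*)(?=Y'Y)", two_markers).group(1)
print(lazy)    # stops at the first Y'Y
print(greedy)  # runs on to the last Y'Y
```

Because the pattern already has a capture group, the plain form `X'X(.*?)Y'Y` should capture the same text here. The lookarounds mainly keep the markers out of the overall match.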


Source: https://habr.com/ru/post/1271170/

