Select data using regular expression

I have such a data frame

import pandas as pd

df = pd.DataFrame({'a': ['abc', 'r00001', 'r00010', 'rfoo', 'r01234', 'r1234'], 'b': range(6)})

        a  b
0     abc  0
1  r00001  1
2  r00010  2
3    rfoo  3
4  r01234  4
5   r1234  5

Now I want to select all the columns of this data block, where the entries in the column abegin with r, and then five numbers.

From here I learned how to do this if it started with rno numbers:

print df.loc[df['a'].str.startswith('r'), :]

        a  b
1  r00001  1
2  r00010  2
3    rfoo  3
4  r01234  4
5   r1234  5

Something like that

print df.loc[df['a'].str.startswith(r'[r]\d{5}'), :]

certainly not working. How to do it right?

+4
source share
2 answers

Option 1
pd.Series.str.match

df.a.str.match('^r\d{5}$')

1     True
2     True
3    False
4     True
5    False
Name: a, dtype: bool

Use it as a filter

df[df.a.str.match('^r\d{5}$')]

        a  b
1  r00001  1
2  r00010  2
4  r01234  4

Option 2
Accounting for a custom list using string methods

f = lambda s: s.startswith('r') and (len(s) == 6) and s[1:].isdigit()
[f(s) for s in df.a.values.tolist()]

[False, True, True, False, True, False]

Use it as a filter

df[[f(s) for s in df.a.values.tolist()]]

        a  b
1  r00001  1
2  r00010  2
4  r01234  4

The timing

df = pd.concat([df] * 10000, ignore_index=True)

%timeit df[[s.startswith('r') and (len(s) == 6) and s[1:].isdigit() for s in df.a.values.tolist()]]
%timeit df[df.a.str.match('^r\d{5}$')]
%timeit df[df.a.str.contains('^r\d{5}$')]

10 loops, best of 3: 22.8 ms per loop
10 loops, best of 3: 33.8 ms per loop
10 loops, best of 3: 34.8 ms per loop
+5
source

You can use str.containsand pass a regular expression pattern:

In[112]:
df.loc[df['a'].str.contains(r'^r\d{5}')]

Out[112]: 
        a  b
1  r00001  1
2  r00010  2
4  r01234  4

^r - r, \d{5} 5

startswith , ,

str.contains str.match, , str.contains re.search, str.match re.match, , . docs.

, $, , . :

In[117]:
df = pd.DataFrame({'a': ['abc', 'r000010', 'r00010', 'rfoo', 'r01234', 'r1234'], 'b': range(6)})
df

Out[117]: 
         a  b
0      abc  0
1  r000010  1
2   r00010  2
3     rfoo  3
4   r01234  4
5    r1234  5


In[118]:
df.loc[df['a'].str.match(r'r\d{5}$')]

Out[118]: 
        a  b
2  r00010  2
4  r01234  4
+5

Source: https://habr.com/ru/post/1680898/


All Articles