Why does pandas extractall() not extract all matches for a given regular expression?

I have a nested list of strings from which I would like to extract dates. The date format is:

Two digits (from 01 to 12), a hyphen, letters (a valid month), a hyphen, and two digits, for example: 08-Jan—07 or 03-Oct—01

I tried using the following regex:

    r'\d{2}(—|-)(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{2,4}'

Then I tested it as follows:

    import pandas as pd

    df = pd.DataFrame({'blobs': [
        "6-Feb- 1 4 Facebook's virtual-reality division created a 3-EBÚ7 11 network of 500 free demo stations in Best Buy stores to give people a taste of VR using the Oculus Rift 90 GT 48 headset. But according to a Wednesday report from Business Insider, about 200 of the demo stations will close after low interest from consumers. 17-Feb-2014",
        'I think in a store environment getting people to sit down and go through that experience of getting a headset on and getting set up is quite a difficult thing to achieve," said Geoff Blaber, a CCS Insight analyst. 29—Oct-2012 Blaber 32 FAX 2978 expects that it will get easier when companies can convince 18-Oct-12 credit cards. ',
    ]})
    df

Then:

    df['blobs'].str.extractall(r'\d{2}(—|-)(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{2,4}')

However, this does not work. The regex above doesn't return any dates (i.e., just hyphens, -):

        Col
    0   NaN
    1   -
    2   -
    3   NaN
    4   NaN
    5   -
    ...
    n   -

How can I fix it to get the following?

        Col
    0   6-Feb-14, 17-Feb-2014
    1   29—Oct-2012, 18-Oct-12

UPDATE

I also tried:

    import re

    df['col'] = df.blobs.apply(lambda x: re.findall('\d{2}(—|-)(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{2,4}',x))
    s = df.apply(lambda x: pd.Series(x['col']),axis=1).stack().reset_index(level=1, drop=True)
    s.name = "col"
    df = df.drop('col')
    df

However, I also got:

    ValueError                                Traceback (most recent call last)
    <ipython-input-4-5e9a34bd159f> in <module>()
          3 s = df.apply(lambda x: pd.Series(x['col']),axis=1).stack().reset_index(level=1, drop=True)
          4 s.name = "col"
    ----> 5 df = df.drop('col')
          6 df

    /usr/local/lib/python3.5/site-packages/pandas/core/generic.py in drop(self, labels, axis, level, inplace, errors)
       1905                 new_axis = axis.drop(labels, level=level, errors=errors)
       1906             else:
    -> 1907                 new_axis = axis.drop(labels, errors=errors)
       1908             dropped = self.reindex(**{axis_name: new_axis})
       1909             try:

    /usr/local/lib/python3.5/site-packages/pandas/indexes/base.py in drop(self, labels, errors)
       3260             if errors != 'ignore':
       3261                 raise ValueError('labels %s not contained in axis' %
    -> 3262                                  labels[mask])
       3263             indexer = indexer[~mask]
       3264             return self.delete(indexer)

    ValueError: labels ['col'] not contained in axis
1 answer

When you use Series.str.extract or Series.str.extractall, only the captured substrings are returned, not the whole matches. The only capturing group in your pattern is (—|-), which is why you get just the separator back. You therefore need to wrap the part of the pattern you want returned in a capturing group, ( ).
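To make the difference concrete, here is a minimal sketch. The two-row Series is a made-up sample (not the data from the question), and `months`, `sep_only` and `whole` are illustrative names.

```python
import pandas as pd

# Hypothetical sample data, shortened for illustration.
s = pd.Series(["visit on 17-Feb-2014 and 18-Oct-12", "no dates here"])

months = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Only the separator sits inside a capturing group,
# so only the separator is returned.
sep_only = s.str.extractall(r"\d{2}(—|-)(?:" + months + r")-\d{2,4}")

# Wrapping the whole pattern in one capturing group returns the full date.
whole = s.str.extractall(r"(\d{2}[—-](?:" + months + r")-\d{2,4})")

print(sep_only[0].tolist())  # ['-', '-']
print(whole[0].tolist())     # ['17-Feb-2014', '18-Oct-12']
```

Note that extractall returns one row per match, indexed by the original row and a match counter, so rows without any match simply do not appear in the result.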

Because each of your strings can contain several matches, extractall is awkward to work with here, since it returns one row per match. It seems you can use Series.str.findall instead, which returns a list of all whole matches for each row when the pattern contains no capturing group.

Use:

    rx = r'\b\d{1,2}[-–—](?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[-–—](?:\d{4}|\d{2})\b'
    df['Col'] = df['blobs'].str.findall(rx).apply(','.join)

The .apply(','.join) call converts each list of matches into a comma-separated string in the Col column.
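As a rough end-to-end check, the sketch below runs the same approach on shortened, made-up strings (assumptions, not the full texts from the question). It joins with ", " rather than "," only to mirror the spacing of the desired output shown in the question.

```python
import pandas as pd

# Shortened, hypothetical stand-ins for the texts in the question.
df = pd.DataFrame({"blobs": [
    "closed 6-Feb-14 after low interest. 17-Feb-2014",
    "analyst. 29—Oct-2012 Blaber expects 18-Oct-12 credit cards.",
    "nothing to find",
]})

rx = (r"\b\d{1,2}[-–—](?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
      r"[-–—](?:\d{4}|\d{2})\b")

# findall yields one list of whole matches per row;
# join flattens each list into a single string.
df["Col"] = df["blobs"].str.findall(rx).apply(", ".join)

print(df["Col"].tolist())
# ['6-Feb-14, 17-Feb-2014', '29—Oct-2012, 18-Oct-12', '']
```

Rows without a match end up as an empty string, because joining an empty list gives ''.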

Pattern details:

  • \b - a word boundary
  • \d{1,2} - 1 or 2 digits
  • [-–—] - a hyphen, en dash or em dash
  • (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) - any of the 12 month abbreviations
  • [-–—] - a hyphen, en dash or em dash
  • (?:\d{4}|\d{2}) - 4 or 2 digits
  • \b - a word boundary
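The effect of each piece can be checked with Python's re module on a few short strings. The test strings below are invented for illustration; `results` is just a helper dict.

```python
import re

rx = re.compile(
    r"\b\d{1,2}[-–—](?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
    r"[-–—](?:\d{4}|\d{2})\b"
)

# Hypothetical inputs, one per feature of the pattern.
tests = [
    "6-Feb-14",      # one-digit day, allowed by \d{1,2}
    "29—Oct-2012",   # em dash as the first separator
    "18–Oct-12",     # en dash as the first separator
    "17-Feb-201",    # three-digit year, rejected by (?:\d{4}|\d{2})\b
    "117-Feb-2014",  # three-digit day, rejected by the leading \b
    "17-Foo-2014",   # not a month abbreviation
]

results = {t: rx.findall(t) for t in tests}
for text, found in results.items():
    print(repr(text), "->", found)
```

The first three strings are returned whole and the last three give an empty list. The one-digit day case is also why the \d{2} from the question would have missed 6-Feb-14.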

Source: https://habr.com/ru/post/1014997/

