I have a nested list of strings from which I would like to extract dates. The date format is:
two digits (from 01 to 12), a hyphen, three letters (a valid month abbreviation), a hyphen, and two digits, for example: 08-Jan–07 or 03-Oct–01
I tried using the following regex:
r'\d{2}(–|-)(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{2,4}'
Then I tested it as follows:
import pandas as pd

df = pd.DataFrame({'blobs': [
    "6-Feb- 1 4 Facebook's virtual-reality division created a 3-EBร7 11 network of 500 free demo stations in Best Buy stores to give people a taste of VR using the Oculus Rift 90 GT 48 headset. But according to a Wednesday report from Business Insider, about 200 of the demo stations will close after low interest from consumers. 17-Feb-2014",
    'I think in a store environment getting people to sit down and go through that experience of getting a headset on and getting set up is quite a difficult thing to achieve," said Geoff Blaber, a CCS Insight analyst. 29–Oct-2012 Blaber 32 FAX 2978 expects that it will get easier when companies can convince 18-Oct-12 credit cards. ',
]})
df
Then:
df['blobs'].str.extractall(r'\d{2}(–|-)(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{2,4}')
However, it does not work. The regex above doesn't return the dates, only NaN or hyphens (-):

Col
0    NaN
1    -
2    -
3    NaN
4    NaN
5    -
...
n    -
How can I fix it to get the following?

Col
0    6-Feb-14, 17-Feb-2014
1    29–Oct-2012, 18-Oct-12
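One possible reading of the symptom, sketched with the standard `re` module only: when a pattern contains a capturing group, `findall` returns the group rather than the whole match, which would explain getting just the separator back. The sketch below assumes the non-ASCII separator is an en dash (U+2013), that the day may have one or two digits (the expected output includes 6-Feb-14), and that either separator may appear on both sides of the month (as in 08-Jan–07). The names `EN_DASH`, `SEP`, `MONTHS`, `text`, `capturing` and `whole` are hypothetical, and `text` is a shortened stand-in for the real blobs.

```python
import re

EN_DASH = "\u2013"  # assumption: the non-ASCII separator is an en dash
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
SEP = "[" + EN_DASH + "-]"  # character class: en dash or ASCII hyphen

# Hypothetical, shortened sample text (not the original blobs).
text = (
    "6-Feb-14 demo 17-Feb-2014 and 29" + EN_DASH + "Oct-2012 "
    "then 18-Oct-12, also 08-Jan" + EN_DASH + "07"
)

# Same shape as the pattern in the question: the separator sits in a
# capturing group, so findall returns only that group.
capturing = re.compile(r"\d{2}(" + EN_DASH + r"|-)(?:" + MONTHS + r")-\d{2,4}")
print(capturing.findall(text))

# Sketch of an alternative: no capturing groups, 1-2 digit day,
# either separator on both sides, and a 4- or 2-digit year.
whole = re.compile(
    r"\b\d{1,2}" + SEP + r"(?:" + MONTHS + r")" + SEP + r"(?:\d{4}|\d{2})\b"
)
print(whole.findall(text))
```

With pandas, the same pattern string could presumably be passed to `df['blobs'].str.findall(whole.pattern)` to get one list of dates per row, though that has not been checked against the full data here.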
UPDATE
I also tried:
import re
df['col'] = df.blobs.apply(lambda x: re.findall('\d{2}(–|-)(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{2,4}',x))
s = df.apply(lambda x: pd.Series(x['col']),axis=1).stack().reset_index(level=1, drop=True)
s.name = "col"
df = df.drop('col')
df
However, I also got:
ValueError                                Traceback (most recent call last)
<ipython-input-4-5e9a34bd159f> in <module>()
      3 s = df.apply(lambda x: pd.Series(x['col']),axis=1).stack().reset_index(level=1, drop=True)
      4 s.name = "col"
----> 5 df = df.drop('col')
      6 df

/usr/local/lib/python3.5/site-packages/pandas/core/generic.py in drop(self, labels, axis, level, inplace, errors)
   1905                 new_axis = axis.drop(labels, level=level, errors=errors)
   1906             else:
-> 1907                 new_axis = axis.drop(labels, errors=errors)
   1908             dropped = self.reindex(**{axis_name: new_axis})
   1909             try:

/usr/local/lib/python3.5/site-packages/pandas/indexes/base.py in drop(self, labels, errors)
   3260             if errors != 'ignore':
   3261                 raise ValueError('labels %s not contained in axis' %
-> 3262                                  labels[mask])
   3263             indexer = indexer[~mask]
   3264             return self.delete(indexer)

ValueError: labels ['col'] not contained in axis
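The traceback appears to come from `DataFrame.drop` looking for the label 'col' in the row index, since `axis` defaults to 0. A minimal sketch of that behaviour, using a small hypothetical frame (the data, `error_name` and `dropped` are illustrative only); depending on the pandas version the exception may be a `ValueError` or a `KeyError`:

```python
import pandas as pd

# Hypothetical two-row frame with a 'col' column of lists.
df = pd.DataFrame({
    "blobs": ["a 17-Feb-2014", "b 18-Oct-12"],
    "col": [["17-Feb-2014"], ["18-Oct-12"]],
})

error_name = None
try:
    # axis defaults to 0, so 'col' is looked up among the row labels.
    df.drop("col")
except (KeyError, ValueError) as exc:
    error_name = type(exc).__name__

# Passing axis=1 drops the column instead and returns a new frame.
dropped = df.drop("col", axis=1)
print(error_name, list(dropped.columns))
```

Because `drop` returns a new frame, `df` itself still contains 'col' unless the result is assigned back.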