Replacing punctuation in a data frame based on a punctuation list

Question


Using Canopy and pandas, I have a data frame a, defined as follows:

a=pd.read_csv('text.txt')

df=pd.DataFrame(a)

df.columns=["test"]

text.txt is a single-column file in which each line contains a mix of text, numbers, and punctuation.

Assuming df looks like this:

Test
% HGH & 12
abv123 !!!
porkyfries

I want my results to be:

Test
hgh12
abv123
porkyfries

Efforts so far:

import pandas as pd
from string import punctuation  # import the punctuation list from Python itself

a=pd.read_csv('text.txt')

df=pd.DataFrame(a)

df.columns=["test"]  # define the data frame

for p in list(punctuation):
    df2=df.test.str.replace(p,'')
    df2=pd.DataFrame(df2)
    df2

The code above just returns the same dataset, unchanged. Any suggestions are appreciated.

Edit: the reason I use pandas is that the data is huge, about 1M lines, and the same code will later be applied to a list of up to 30M lines. In short, I need to clean the data efficiently for large data sets.


python pandas dataframe large-data

BernardL 10 . '14 8:49


A plain-Python alternative (Python 2 syntax) that strips all punctuation except the double quote:

import string
text = text.translate(None, string.punctuation.translate(None, '"'))

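For Python 3, where str.translate takes only a translation table, a rough equivalent can be sketched with str.maketrans and the pandas Series.str.translate accessor. The sample strings and the names keep_quote_table, s and cleaned are made up for illustration; this is a sketch, not the commenter's own code.

```python
import string
import pandas as pd

# Table that deletes every punctuation character except the double quote
keep_quote_table = str.maketrans('', '', string.punctuation.replace('"', ''))

# Hypothetical sample data, loosely based on the question
s = pd.Series(['% HGH & 12', 'abv123 !!!', 'say "hi"'])

# Apply the table to every string in the Series
cleaned = s.str.translate(keep_quote_table)
print(cleaned.tolist())
```

Whitespace is left in place, so a further .str.replace or .str.strip step would be needed to drop it.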


philshem 10 . '14 9:12

Build a regex character class from string.punctuation:

import re
import string
rem = string.punctuation
pattern = r"[{}]".format(rem)

pattern

Output:

'[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]'

Create a sample data frame:

df = pd.DataFrame({'text':['book...regh', 'book...', 'boo,', 'book. ', 'ball, ', 'ballnroll"', '"rope"', 'rick % ']})
df

Output:

        text
0  book...regh
1      book...
2         boo,
3       book. 
4       ball, 
5   ballnroll"
6       "rope"
7      rick %

Remove the punctuation with the pattern:

df['text'] = df['text'].str.replace(pattern, '', regex=True)  # regex=True is required in recent pandas versions
df

You can also substitute any other character instead of deleting. Ex - replace(pattern, '$')

Output:

        text
0   bookregh
1       book
2        boo
3      book 
4      ball 
5  ballnroll
6       rope
7     rick
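One caveat with the hand-built character class is that the backslash in string.punctuation ends up escaping the following ] rather than being matched itself. A slightly more defensive sketch, assuming Python 3.7+ and a pandas version that accepts the regex argument, escapes the characters with re.escape first. The sample rows and the names safe_pattern and df_safe are illustrative only.

```python
import re
import string
import pandas as pd

# Escape every special character so each one is matched literally inside the class
safe_pattern = "[{}]".format(re.escape(string.punctuation))

# Hypothetical sample rows, including a backslash and a hyphen
df_safe = pd.DataFrame({'text': ['book...regh', 'a\\b', 'x-y', '"rope"']})

# regex=True makes pandas treat safe_pattern as a regular expression
df_safe['text'] = df_safe['text'].str.replace(safe_pattern, '', regex=True)
print(df_safe['text'].tolist())
```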


Aakash Saxena 10 '17 13:51

EdChum · Accepted Answer · 2014-02-10T09:22:44+0000

Using str.replace with the right regex is easier:

In [41]:

import pandas as pd
pd.set_option('display.notebook_repr_html', False)
df = pd.DataFrame({'text':['test','%hgh&12','abc123!!!','porkyfries']})
df
Out[41]:
         text
0        test
1     %hgh&12
2   abc123!!!
3  porkyfries

[4 rows x 1 columns]

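Now replace using the regex [^\w\s], which matches any character that is not a word character or whitespace: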

In [49]:

df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)  # regex=True is required in recent pandas versions
df
Out[49]:
         text
0        test
1       hgh12
2      abc123
3  porkyfries

[4 rows x 1 columns]
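The sample in the question also drops the spaces and lowercases the text, which [^\w\s] alone does not do. A minimal sketch of that variant, assuming underscores should be removed as well, uses the class [\W_] followed by str.lower. The name df_q is hypothetical.

```python
import pandas as pd

# The three sample rows from the question
df_q = pd.DataFrame({'test': ['% HGH & 12', 'abv123 !!!', 'porkyfries']})

# [\W_]+ matches runs of anything that is not a letter or digit,
# which covers punctuation, whitespace and the underscore
df_q['test'] = df_q['test'].str.replace(r'[\W_]+', '', regex=True).str.lower()
print(df_q['test'].tolist())
```

Both steps are vectorized string operations on the whole column, so no Python-level loop over the punctuation list is needed.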
