Python: getting rid of a string using regular expressions

Question

I have a web scraper that takes forum questions, splits them into individual words, and writes them to a text file. The words are stored in a list of tuples, each containing a word and its frequency. For example:

[(u'move', 3), (u'exploration', 4), (u'prediction', 21),
 (u'find', 5), (u'user', 2), (u'interface', 2), (u'pleasant', 2),
 (u'am', 11), (u'puzzled', 2), (u'find', 5), (u'way', 5),
 (u'prediction', 21), (u'mode', 2), (u'have', 21),
 (u'explored', 2), (u'file', 9), (u'Can', 7), (u'help', 6),
 (u'Possible', 1), (u'bug', 2), (u'data', 31), (u'is', 17)

However, someone on the forum used a character that breaks all my code, because that character is no longer treated as Unicode whitespace:

(u'used\u200b', 1)

Printing this does not raise an error, but writing it to a text file does. I found that string.strip() and string.replace() didn't help, so I was wondering how to use the regular expression library to get rid of this character. I plan to go through the entire list of tuples to find it.

python regex unicode

ceilingfan999 Jul 20 '15 at 17:11

roeland · Accepted Answer · 2015-07-20T22:23:31+0000

I tested this with Python 2.7. replace works:

>>> u'used\u200b'.replace(u'\u200b', '*')
u'used*'

strip:

>>> u'used\u200b'.strip(u'\u200b')
u'used'

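Just make sure the arguments are Unicode strings. In Python 2, u'\u200b' is the single zero-width space character, while '\u200b' without the u prefix is a six-character byte string, so it never matches.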
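
If you still want a regular expression, here is a minimal sketch that cleans the whole list in one pass. It assumes the list-of-tuples layout from the question. The names ZERO_WIDTH and clean_words are made up for illustration, and the set of characters to remove is only a guess at what else might show up in scraped text.

```python
import re

# Zero-width characters to drop. This set is an assumption; extend it as needed.
ZERO_WIDTH = re.compile(u'[\u200b\u200c\u200d\ufeff]')


def clean_words(pairs):
    """Return a new list of (word, count) tuples with zero-width characters removed."""
    return [(ZERO_WIDTH.sub(u'', word), count) for word, count in pairs]


words = [(u'move', 3), (u'used\u200b', 1)]
cleaned = clean_words(words)
print(cleaned)
```

Unlike strip, this also removes the character when it sits in the middle of a word.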

To write such a string to a file, open the file with an explicit encoding:

>>> import codecs
>>> f = codecs.open('a.txt', encoding='utf-8', mode='w')
>>> f.write(u'used\u200bZero')
>>> f.close()
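
As a variation on the above, io.open also takes an encoding argument and works the same way in Python 2.6+ and Python 3. The sketch below is only an illustration: the helper name write_text is hypothetical, and it writes to a temporary directory so it does not touch your real files.

```python
import io
import os
import tempfile


def write_text(path, text):
    """Write unicode text to path, encoded as UTF-8."""
    # The with block closes the file even if the write fails.
    with io.open(path, mode='w', encoding='utf-8') as f:
        f.write(text)


# A throwaway location for the demo; in real code use your own path.
path = os.path.join(tempfile.mkdtemp(), 'a.txt')
write_text(path, u'used\u200bZero')

# Read the file back to confirm the character survived the round trip.
with io.open(path, encoding='utf-8') as f:
    content = f.read()
```

With an explicit encoding, the zero-width space is written as UTF-8 bytes instead of going through the default ASCII codec.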
