I have a web scraper that takes forum questions, splits them into individual words, and writes them to a text file. The words are stored in a list of tuples, where each tuple contains a word and its frequency, like so:
[(u'move', 3), (u'exploration', 4), (u'prediction', 21),
(u'find', 5), (u'user', 2), (u'interface', 2), (u'pleasant', 2),
(u'am', 11), (u'puzzled', 2), (u'find', 5), (u'way', 5),
(u'prediction', 21), (u'mode', 2), (u'have', 21),
(u'explored', 2), (u'file', 9), (u'Can', 7), (u'help', 6),
(u'Possible', 1), (u'bug', 2), (u'data', 31), (u'is', 17)
However, someone on the forum used a character, \u200b, that breaks all my code, because that character is no longer classified as Unicode whitespace:
(u'used\u200b', 1)
Printing the list does not raise an error, but writing it to a text file does. I found that string.strip() and string.replace() didn't help, so I was wondering how to use the regular expression library to get rid of this character. I plan to go through the entire list of tuples to find it.
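To make the goal concrete, here is a minimal sketch of the kind of cleanup I have in mind, assuming the only offending character is \u200b (zero-width space). The names `ZERO_WIDTH`, `clean_pairs`, and `words` are made up for illustration and are not from my actual scraper.

```python
# -*- coding: utf-8 -*-
import re

# Pattern matching the zero-width space; assumed to be the only
# problem character (widen the pattern if others turn up).
ZERO_WIDTH = re.compile(u'\u200b')


def clean_pairs(pairs):
    """Return a new list of (word, count) tuples with \u200b removed from each word."""
    return [(ZERO_WIDTH.sub(u'', word), count) for word, count in pairs]


# Hypothetical sample in the same shape as the scraper's output.
words = [(u'find', 5), (u'used\u200b', 1)]
cleaned = clean_pairs(words)
```

Tuples are immutable, so the sketch builds a new list rather than editing the words in place. Note that two entries that differ only by the zero-width space would remain separate tuples after cleaning; their counts are not merged here.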