Python regex: string to a list of words (including hyphenated words)

Question

I would like to parse a string into a list of all its words (hyphenated words too). Current code:

import re
s = '-this is. A - sentence;one-word'
re.compile(r"\W+", re.UNICODE).split(s)

returns:

['', 'this', 'is', 'A', 'sentence', 'one', 'word']

and I would like it to return:

['', 'this', 'is', 'A', 'sentence', 'one-word']

+3

python regex

Antonio Aug 4 '10 at 14:56

5 answers

kennytm · Answer 1 · 2010-08-04T15:15:00+0000

If you don't need the leading empty string, you can match the words directly with the pattern \w(?:[-\w]*\w)? instead of splitting:

>>> import re
>>> s = '-this is. A - sentence;one-word'
>>> rx = re.compile(r'\w(?:[-\w]*\w)?')
>>> rx.findall(s)
['this', 'is', 'A', 'sentence', 'one-word']

Please note that it will not keep words with apostrophes together: won't, for example, comes back as won and t.
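
If apostrophes matter, one possible tweak is to allow them inside the word as well. This is only a sketch and not part of the original answer: the names WORD_RX and words are made up for illustration, and it assumes an apostrophe in the middle of a word should stay attached to it.

```python
import re

# Assumption: apostrophes and hyphens are allowed inside a word,
# but a word must still start and end with a \w character.
WORD_RX = re.compile(r"\w(?:[-'\w]*\w)?")

def words(text):
    """Return every word in text, keeping inner hyphens and apostrophes."""
    return WORD_RX.findall(text)

result = words("-this is. A - sentence;one-word won't")
print(result)
```

Because the pattern still requires a \w at both ends, a lone hyphen or a leading or trailing quote is not picked up as part of a word.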

Tony veijalainen · Answer 2 · 2010-08-04T19:33:49+0000

Here's my traditional “why use regexp when you can use Python”:

import string
s = "-this is. A - sentence;one-word what's"
# strip punctuation from both ends of each token, then drop empty strings
s = list(filter(None, [word.strip(string.punctuation)
                       for word in s.replace(';', '; ').split()
                       ]))
print(s)
""" Output:
['this', 'is', 'A', 'sentence', 'one-word', "what's"]
"""

Jens · Answer 3 · 2010-08-04T14:58:19+0000

Split on "[^\w-]+" instead of "\W+".

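Applied to the string from the question, that pattern keeps one-word together but also leaves the stray hyphens in place. The sketch below shows the raw result and one possible clean-up; the clean-up step is an assumption added for illustration, and it also drops the leading empty string the question asked for.

```python
import re

s = '-this is. A - sentence;one-word'

# Split on runs of characters that are neither word characters nor hyphens.
parts = re.split(r"[^\w-]+", s)
print(parts)

# Hypothetical clean-up: trim hyphens from both ends, drop empty leftovers.
cleaned = [p.strip('-') for p in parts]
cleaned = [p for p in cleaned if p]
print(cleaned)
```
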
pyInTheSky · Answer 4 · 2010-08-04T16:50:39+0000

s = "-this is. A - sentence;one-word what's"
re.findall(r"\w+-\w+|[\w']+", s)

returns: ['this', 'is', 'A', 'sentence', 'one-word', "what's"]

Note that the order matters: the alternative for hyphenated words has to come first!
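
To see why the order matters, compare the two orderings side by side. The regex engine tries the alternatives from left to right and takes the first one that matches at the current position. The variable names below are made up for this sketch.

```python
import re

s = "-this is. A - sentence;one-word what's"

# Hyphenated alternative first: 'one-word' is matched as a whole.
hyphen_first = re.findall(r"\w+-\w+|[\w']+", s)

# Plain-word alternative first: it matches 'one' before the hyphenated
# alternative is ever tried, so the word is split in two.
hyphen_last = re.findall(r"[\w']+|\w+-\w+", s)

print(hyphen_first)
print(hyphen_last)
```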

fasouto · Answer 5 · 2010-08-04T15:02:35+0000

You can try with the NLTK library:

>>> import nltk
>>> s = '-this is a - sentence;one-word'
>>> hyphen = r'(?:\w+\-\s?\w+)'  # non-capturing, so whole matches are returned
>>> wordr = r'(?:\w+)'
>>> r = "|".join([hyphen, wordr])
>>> tokens = nltk.tokenize.regexp_tokenize(s, r)
>>> print(tokens)
['this', 'is', 'a', 'sentence', 'one-word']

I found it here: http://www.cs.oberlin.edu/~jdonalds/333/lecture03.html

Hope this helps.
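
If installing NLTK is not an option, roughly the same tokenization can be sketched with the standard re module alone. This is an approximation for illustration, not NLTK's implementation. The extra "re- use" in the sample string is an assumption added to show that the \s? in the hyphen pattern also joins words with a single space after the hyphen.

```python
import re

# Same two alternatives as above, hyphenated form first.
TOKEN_RX = re.compile(r"\w+-\s?\w+|\w+")

s = '-this is a - sentence;one-word re- use'
tokens = TOKEN_RX.findall(s)
print(tokens)
```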
