Elegant way to get hashtags from a string in Python?

Question

I'm looking for a clean way to get a set (list, array, whatever) of words starting with # inside a given line.

In C# I would write

 var hashtags = input
     .Split(' ')
     .Where(s => s[0] == '#')
     .Select(s => s.Substring(1))
     .Distinct();

What would be comparably elegant code for this in Python?

EDIT

Input Example: "Hey guys! #stackoverflow really #rocks #rocks #announcement"
Expected Result: ["stackoverflow", "rocks", "announcement"]

+6

python string list-comprehension hashtag

Dan Abramov Jun 13 '11 at 2:04

6 answers

 [i[1:] for i in line.split() if i.startswith("#")]

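This version copes with empty strings (a concern raised in the comments): split() with no arguments never yields empty tokens, and startswith is safe on any string. Note that a token that is only "#" still passes the filter and produces an empty string. Also, as in Bertrand Marron's answer, it is better to turn the result into a set as follows (to avoid duplicates and to get O(1) lookup time):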

 set([i[1:] for i in line.split() if i.startswith("#")])
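
The edge cases in question (repeated whitespace and a bare "#") are easy to check by hand. The following is only a sketch: the helper name extract_tags and the sample line are made up for illustration, and the extra len(word) > 1 test is an added assumption about what should count as a hashtag.

```python
def extract_tags(line):
    """Return hashtag words (without '#') in order of appearance."""
    # split() with no arguments collapses runs of whitespace, so it never
    # yields empty tokens; len(word) > 1 additionally drops a bare "#".
    return [word[1:] for word in line.split()
            if word.startswith("#") and len(word) > 1]


sample = "Hey  guys! # #stackoverflow   really #rocks"
print(extract_tags(sample))  # ['stackoverflow', 'rocks']
```

Wrapping the result in set(...) works the same way as in the one-liner.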

+15

inspectorG4dget Jun 13 '11 at 14:09

The findall method of a compiled regex object can get them all at once:

 >>> import re
 >>> s = "this #is a #string with several #hashtags"
 >>> pat = re.compile(r"#(\w+)")
 >>> pat.findall(s)
 ['is', 'string', 'hashtags']

+8

bgporter Jun 13 '11 at 14:17

I would say

 hashtags = [word[1:] for word in input.split() if word[0] == '#']

Edit: this will create a set without any duplicates.

 set(hashtags)
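
A set drops duplicates but also discards the order of appearance, while the expected result in the question is an ordered list. If order matters, one possible sketch is to deduplicate through dictionary keys, assuming Python 3.7 or later, where dicts preserve insertion order. The variable name text is an illustrative choice (it avoids shadowing the built-in input).

```python
text = "Hey guys! #stackoverflow really #rocks #rocks #announcement"

# Collect hashtag words in order of appearance.
hashtags = [word[1:] for word in text.split() if word.startswith("#")]

# dict.fromkeys keeps the first occurrence of each key, in insertion order
# (guaranteed from Python 3.7 on; this is an assumption about the runtime).
unique_hashtags = list(dict.fromkeys(hashtags))
print(unique_hashtags)  # ['stackoverflow', 'rocks', 'announcement']
```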

+7

Bertrand Marron Jun 13 '11 at 14:08

Another option is a regex:

 import re
 inputLine = "Hey guys! #stackoverflow really #rocks #rocks #announcement"
 re.findall(r'(?i)\#\w+', inputLine)       # will include the #
 re.findall(r'(?i)(?<=\#)\w+', inputLine)  # will not include the #

+1

Artsiom Rudzenka Jun 13 '11 at 14:14

There are some problems with the answers presented here.

1. These two:

 {tag.strip("#") for tag in tags.split() if tag.startswith("#")}
 [i[1:] for i in string.split() if i.startswith("#")]

won't work if you have hashtags like '#one#two#'.

2. re.compile(r"#(\w+)") does not work for many Unicode languages (even using re.UNICODE).
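
To make the second point concrete, here is a small check under Python 3, where \w is Unicode-aware by default for str patterns. The sample strings are made up for illustration. Letters match, but combining marks (such as the vowel signs in Devanagari) do not, so a tag can be cut short. Matching everything up to the next whitespace or "#" is one possible workaround, though it also keeps punctuation.

```python
import re

word_pattern = re.compile(r"#(\w+)")
# Alternative sketch: take everything up to the next whitespace or '#'.
loose_pattern = re.compile(r"#([^\s#]+)")

# Scripts made only of letters are matched in full.
print(word_pattern.findall("#日本語 #привет"))

hindi = "#हिंदी"  # contains combining vowel signs, which \w does not match
print(word_pattern.findall(hindi))   # the tag is cut at the first vowel sign
print(loose_pattern.findall(hindi))  # the whole word is captured
```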

I have seen more ways to extract hashtags, but found that none of them handles all cases, so I wrote a little Python code to handle most cases. It works for me.

 def get_hashtagslist(string):
     ret = []
     s = ''
     hashtag = False
     for char in string:
         if char == '#':
             # A new '#' starts a tag and closes any tag in progress.
             hashtag = True
             if s:
                 ret.append(s)
                 s = ''
             continue
         # Take only the prefix of the hashtag if it contains one of these
         # characters (e.g. for '#happy,but i..' only 'happy' is taken).
         if hashtag and char in [' ', '.', ',', '(', ')', ':', '{', '}'] and s:
             ret.append(s)
             s = ''
             hashtag = False
         if hashtag:
             s += char
     if s:
         ret.append(s)
     return set(ret)
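
For comparison, a regular expression can approximate the same scanning logic in one line. This is only a sketch, not an exact equivalent of the function above: the name get_hashtags_regex is hypothetical, and the pattern treats any whitespace (not just a space) as a delimiter.

```python
import re

# A tag starts at '#' and runs until whitespace, another '#',
# or one of the punctuation characters the function above stops at.
TAG_PATTERN = re.compile(r"#([^\s#.,():{}]+)")


def get_hashtags_regex(string):
    """Return the set of hashtags found in string, without the '#'."""
    return set(TAG_PATTERN.findall(string))


print(get_hashtags_regex("#one#two#") == {"one", "two"})  # True
print(get_hashtags_regex("#happy,but i..") == {"happy"})  # True
```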

0

Eyal Ch Sep 10 '15 at 9:55

utdemir · Accepted Answer · 2011-06-13T14:20:37+0000

Building on @inspectorG4dget's answer, if you don't want duplicates, you can use a set comprehension instead of a list comprehension.

 >>> tags = "Hey guys! #stackoverflow really #rocks #rocks #announcement"
 >>> {tag.strip("#") for tag in tags.split() if tag.startswith("#")}
 set(['announcement', 'rocks', 'stackoverflow'])

Note that the { } syntax for set comprehensions only works in Python 2.7 and later.
If you are working with older versions, pass a list comprehension ( [ ] ) to set as suggested by @Bertrand.
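
One detail worth knowing about this snippet: tag.strip("#") removes every leading and trailing "#", whereas slicing with tag[1:] removes only the first character. A small illustration with made-up tokens:

```python
tokens = ["#rocks", "##double", "#trailing#"]

stripped = [tag.strip("#") for tag in tokens]  # drops all '#' at both ends
sliced = [tag[1:] for tag in tokens]           # drops only the first character

print(stripped)  # ['rocks', 'double', 'trailing']
print(sliced)    # ['rocks', '#double', 'trailing#']
```

Which behavior is preferable depends on whether tokens such as "##double" should count as the tag "double".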
