Elegant way to get hashtags from a string in Python?

I'm looking for a clean way to get a set (list, array, whatever) of words starting with # inside a given line.

In C# I would write

 var hashtags = input
     .Split(' ')
     .Where(s => s[0] == '#')
     .Select(s => s.Substring(1))
     .Distinct();

What would be comparably elegant code for this in Python?

EDIT

Input Example: "Hey guys! #stackoverflow really #rocks #rocks #announcement"
Expected Result: ["stackoverflow", "rocks", "announcement"]

+6
6 answers

Building on @inspectorG4dget's answer: if you do not want duplicates, you can use a set comprehension instead of a list comprehension.

 >>> tags = "Hey guys! #stackoverflow really #rocks #rocks #announcement"
 >>> {tag.strip("#") for tag in tags.split() if tag.startswith("#")}
 set(['announcement', 'rocks', 'stackoverflow'])

Note that the { } syntax for set comprehensions only works in Python 2.7 and later.
If you are working with older versions, pass the list comprehension ( [ ] ) to set(), as suggested by @Bertrand.
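
One caveat: a set does not keep the order in which the tags appeared, whereas the expected result in the question is listed in order of first appearance. A minimal sketch of an order-preserving variant, assuming Python 3.7+ (where dict preserves insertion order); the function name unique_hashtags is hypothetical:

```python
def unique_hashtags(line):
    """Return hashtags from line without duplicates, in order of first appearance."""
    # dict.fromkeys keeps the first occurrence of each key; dicts preserve
    # insertion order in Python 3.7+ (assumption about the interpreter version).
    tags = (word[1:] for word in line.split() if word.startswith("#"))
    return list(dict.fromkeys(tags))


print(unique_hashtags("Hey guys! #stackoverflow really #rocks #rocks #announcement"))
# ['stackoverflow', 'rocks', 'announcement']
```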

+15
 [i[1:] for i in line.split() if i.startswith("#")] 

Because split() with no arguments never returns empty strings, this version avoids the empty-string problem raised in the comments. Note, however, that a token consisting only of "#" still produces an empty string. Also, as in Bertrand Marron's answer, it is better to turn the result into a set as follows (to avoid duplicates and to get O(1) lookup time):

 set([i[1:] for i in line.split() if i.startswith("#")]) 
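
A token that is just "#" passes the startswith test and contributes an empty string. A small sketch that filters those out; the sample input and the name hashtags_nonempty are made up for illustration:

```python
def hashtags_nonempty(line):
    """Collect hashtag names, ignoring tokens that are just '#'."""
    # len(word) > 1 ensures there is at least one character after the '#'.
    return {word[1:] for word in line.split() if word.startswith("#") and len(word) > 1}


print(hashtags_nonempty("# alone and #real #real") == {"real"})  # True
```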
+15

The findall method of a compiled pattern object can get them all at once:

 >>> import re
 >>> s = "this #is a #string with several #hashtags"
 >>> pat = re.compile(r"#(\w+)")
 >>> pat.findall(s)
 ['is', 'string', 'hashtags']
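
As a side note on non-ASCII input: in Python 3, str patterns are Unicode-aware by default, so \w also matches letters in many other scripts. Scripts that rely on combining marks may still be cut short, since \w does not match those marks. A short sketch, assuming Python 3; the second sample string is invented for illustration:

```python
import re

# Assumes Python 3, where str patterns are Unicode-aware by default.
HASHTAG = re.compile(r"#(\w+)")

print(HASHTAG.findall("this #is a #string with several #hashtags"))
# ['is', 'string', 'hashtags']
print(HASHTAG.findall("#привет мир #日本語"))
# ['привет', '日本語']
```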
+8

I would say

 hashtags = [word[1:] for word in input.split() if word[0] == '#'] 

Edit: this will create a set without any duplicates.

 set(hashtags) 
+7

Another option is a regex:

 import re
 inputLine = "Hey guys! #stackoverflow really #rocks #rocks #announcement"
 re.findall(r'(?i)\#\w+', inputLine)       # results include the leading #
 re.findall(r'(?i)(?<=\#)\w+', inputLine)  # results do not include the #
+1

There are some problems with the answers presented here.

  1. These two approaches

      {tag.strip("#") for tag in tags.split() if tag.startswith("#")}

      [i[1:] for i in line.split() if i.startswith("#")]

     won't work if hashtags are run together, as in '#one#two#'.

  2. re.compile(r"#(\w+)") does not work for many Unicode languages (even using re.UNICODE).

I have seen other ways to extract hashtags, but found that none of them handled all cases, so I wrote a little Python code to handle most of them. It works for me.

 def get_hashtagslist(string):
     ret = []
     s = ''
     hashtag = False
     for char in string:
         if char == '#':
             # a new tag starts here; flush the previous one, if any
             hashtag = True
             if s:
                 ret.append(s)
                 s = ''
             continue
         # take only the prefix of the hashtag if it contains one of these chars
         # (e.g. for '#happy,but i..' it takes only 'happy')
         if hashtag and char in [' ', '.', ',', '(', ')', ':', '{', '}'] and s:
             ret.append(s)
             s = ''
             hashtag = False
         if hashtag:
             s += char
     if s:
         ret.append(s)
     return set(ret)
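
For comparison, roughly the same rule (a tag runs from "#" up to the next "#", whitespace, or one of the listed punctuation characters) can be sketched with a regular expression. This is an approximation under that assumption, not a drop-in replacement for the function above: for instance, it treats any whitespace as a delimiter, not just a space. The names TAG_RE and get_hashtags_re are hypothetical:

```python
import re

# A tag is '#' followed by one or more characters that are not '#',
# whitespace, or one of the delimiters . , ( ) : { }
TAG_RE = re.compile(r"#([^#\s.,():{}]+)")


def get_hashtags_re(text):
    """Return the set of hashtag names found in text."""
    return set(TAG_RE.findall(text))


print(get_hashtags_re("#one#two#") == {"one", "two"})              # True
print(get_hashtags_re("#happy,but i.. #sad") == {"happy", "sad"})  # True
```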
0

Source: https://habr.com/ru/post/890409/

