Python: replace urls with name names from a string

I would like to remove the urls from the string and replace them with the names of the original content.

For instance:

mystring = "Ah I like this site: http://www.stackoverflow.com. Also I must say I like http://www.digg.com"

sanitize(mystring) # it becomes "Ah I like this site: Stack Overflow. Also I must say I like Digg - The Latest News Headlines, Videos and Images"

To replace url with a header, I wrote this snipplet:

#get_title: string -> string
def get_title(url):
    """Returns the title of the input URL"""

    output = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
    return output.title.string

I somehow need to apply this function to strings where it catches URLs and converts them to headers via get_title.

+3
source share
2 answers

Here is a question with URL validation information in Python: How do you validate a regex URL in Python?

The urlparse module is probably the best choice. You still have to decide what a valid URL is in the context of your application.

URL-, , , url .

( valid_url):

def sanitize(mystring):
  for word in mystring.split(" "):
    if valid_url(word):
      mystring = mystring.replace(word, get_title(word))
  return mystring
+3

, , , (re.sub , Match ):

url = re.compile("http:\/\/(.*?)/")
text = url.sub(get_title, text)

, URL-, , .

+2

Source: https://habr.com/ru/post/1744631/


All Articles