Getting href attribute value in all <a> tags in html file using Python

I am creating a Python application and I need to get the URLs of all the links on a web page. I already have a function that uses urllib to download an HTML file from the Internet and turn it into a list of lines with readlines().

I currently have this code, which uses a regex (I'm not very good with them) to search for links in each line:

for line in lines:
    result = re.match ('/href="(.*)"/iU', line)
    print result

This does not work: it only prints "None" for each line in the file, but I am sure there are at least 3 links in the file I am opening.

Can someone give me a hint?

Thank you in advance

+1
7 answers

For completeness, here is the best answer I found, from the book "Dive Into Python".

Code that lists all the URLs on a web page:

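# urllister.py (Python 2: sgmllib was removed in Python 3)
from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        # reset() is called by SGMLParser.__init__, so urls always starts empty
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        # called for every <a> tag; attrs is a list of (name, value) pairs
        href = [v for k, v in attrs if k=='href']
        if href:
            self.urls.extend(href)

# Separate script that imports the urllister module above
import urllib, urllister
usock = urllib.urlopen("http://diveintopython.net/")
parser = urllister.URLLister()
parser.feed(usock.read())
usock.close()
parser.close()
for url in parser.urls: print url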
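
A note for anyone on Python 3: sgmllib no longer exists there. A rough equivalent can be put together with html.parser from the standard library. This is only a sketch, not code from the book; the list_urls helper and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class URLLister(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased;
        # attrs is a list of (name, value) pairs.
        if tag == "a":
            self.urls.extend(v for k, v in attrs if k == "href" and v)


def list_urls(html_text):
    # Hypothetical helper: feed a whole document, return the links found.
    parser = URLLister()
    parser.feed(html_text)
    parser.close()
    return parser.urls


sample = '<body><a href="123">qwe</a><A HREF=\'456\'>asd</A><a name="x">no link</a></body>'
for url in list_urls(sample):
    print(url)
```

To run it against a live page, the downloaded text (for example from urllib.request.urlopen(...).read().decode(...)) would be passed to list_urls instead of the sample string.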


+1

Beautiful Soup can do this:

from BeautifulSoup import BeautifulSoup as soup

html = soup('<body><a href="123">qwe</a><a href="456">asd</a></body>')
print [tag.attrMap['href'] for tag in html.findAll('a', {'href': True})]
+11

An alternative to BeautifulSoup is lxml (http://lxml.de/):

import lxml.html
links = lxml.html.parse("http://stackoverflow.com/").xpath("//a/@href")
for link in links:
    print link
+8

There is an HTML parser in Python's standard library. Check out htmllib.

+4

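Regexes are a fragile way to do this.
For example: the tag may be written <A>, the text "href=" may show up inside a <textarea> or an HTML comment, and the href value may be quoted in different ways or not quoted at all.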
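
To make the pitfalls concrete, here is a small comparison of a naive regex with the standard-library parser on a tricky input. It is a sketch only: the SAMPLE markup and both helper functions are invented for the illustration, and it is written for Python 3.

```python
import re
from html.parser import HTMLParser

# Invented sample: a commented-out link, an upper-case tag with single
# quotes, an unquoted value, and one ordinary link.
SAMPLE = """
<!-- <a href="old.html">commented out</a> -->
<A HREF='single.html'>single quotes, upper case</A>
<a href=bare.html>no quotes</a>
<a href="ok.html">plain</a>
"""


def hrefs_by_regex(text):
    # Naive pattern: lower-case href, double quotes only.
    return re.findall(r'href="(.*?)"', text)


class _Links(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # Comments never reach this method; names are already lowercased.
        if tag == "a":
            self.found.extend(v for k, v in attrs if k == "href" and v)


def hrefs_by_parser(text):
    parser = _Links()
    parser.feed(text)
    parser.close()
    return parser.found


print(hrefs_by_regex(SAMPLE))   # ['old.html', 'ok.html']
print(hrefs_by_parser(SAMPLE))  # ['single.html', 'bare.html', 'ok.html']
```

The regex reports a link that is commented out and misses two real ones; the parser gets all three and skips the comment.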

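A better way: XPath, a DOM parser, and so on, that is, tools that treat HTML as what it is, a DOM tree.
XPath is a W3C standard. Use XPath rather than regexps.
adw's answer shows XPath in use.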
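
If installing lxml is not an option, the standard library's xml.etree.ElementTree understands a small subset of XPath. The sketch below rests on two assumptions: the input is well-formed XHTML (ElementTree raises an error on ordinary sloppy HTML), and the hrefs_from_xhtml name and sample markup are made up here. The subset cannot select attributes directly the way //a/@href does, so the elements are selected first and the attribute is read from each one.

```python
import xml.etree.ElementTree as ET


def hrefs_from_xhtml(markup):
    # Assumes markup is well-formed XML; raises ET.ParseError otherwise.
    root = ET.fromstring(markup)
    # './/a[@href]' selects every <a> descendant that has an href attribute.
    return [a.get("href") for a in root.iterfind(".//a[@href]")]


sample = '<body><a href="123">qwe</a><p><a href="456">asd</a></p><a name="x">no link</a></body>'
print(hrefs_from_xhtml(sample))  # ['123', '456']
```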

+3

First: don't use regex to parse HTML. Use an HTML parser. This has come up about 200 times already.

Use an HTML parser.

As for your code:

re.match ('/href="(.*)"/iU', line)

The "/.../flags" syntax does not exist in Python. Pass the flags as a separate argument instead:

re.match('href="(.*)"', line, re.I|re.U)

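That still breaks when there are several hrefs on one line, because the greedy ".*" runs on to the last quote on the line. The value may also be quoted with '. A non-greedy ".*?" helps, or, better, "[^"]*".

Better still, use an HTML parser.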
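
A short demonstration of these regex details, written for Python 3. The sample line is invented. One more thing worth knowing: re.match only tries to match at the very start of the string, which on its own is enough to get None for a link in the middle of a line; re.search or re.findall scan the whole string.

```python
import re

line = '<p><a href="a.html">A</a> and <a HREF="b.html">B</a></p>'

# re.match only tries to match at position 0, so a link in mid-line is missed.
at_start = re.match(r'href="(.*)"', line, re.I)

# re.search scans the line, but the greedy .* runs on to the last quote.
greedy = re.search(r'href="(.*)"', line, re.I).group(1)

# Non-greedy .*? stops at the first closing quote; findall returns every match.
lazy = re.findall(r'href="(.*?)"', line, re.I)

# A character class that excludes the quote does the same job.
by_class = re.findall(r'href="([^"]*)"', line, re.I)

print(at_start)   # None
print(greedy)     # a.html">A</a> and <a HREF="b.html
print(lazy)       # ['a.html', 'b.html']
print(by_class)   # ['a.html', 'b.html']
```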

+3

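Don't split the HTML content into lines, because a single line may contain several matches. Also, don't assume the URL is always in quotes.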

Do something like this:

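import re

# content is the whole HTML document as one string
links = re.finditer(r' href="?([^\s">]+)', content)

for link in links:
    print link.group(1)  # group(1) is the captured URL, not the match object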
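
The same idea as a self-contained Python 3 sketch that runs over the whole document at once. The extract_hrefs name and the sample content are made up, and the pattern is a slightly widened variant (it also accepts single quotes and upper-case HREF), not the exact one above. It is still a regex, so it will also pick up hrefs inside comments.

```python
import re

# Optional quote after '=', then everything up to whitespace, a quote or '>'.
HREF_RE = re.compile(r'''\shref\s*=\s*["']?([^\s"'>]+)''', re.I)


def extract_hrefs(content):
    # Work on the whole document, not line by line, so several links
    # on one line are all found.
    return [m.group(1) for m in HREF_RE.finditer(content)]


content = '<a href="a.html">x</a><a href=b.html>y</a>\n<A HREF=\'c.html\'>z</A>'
for url in extract_hrefs(content):
    print(url)
```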
+1

Source: https://habr.com/ru/post/1710797/

