Getting href attribute value in all <a> tags in html file using Python
I am creating a Python application and I need to get the URLs of all links on a web page. I already have a function that uses urllib to download an HTML file from the Internet and convert it to a list of lines with readlines().
I currently have this code that uses regex (I'm not very good at this) to search for links in each line:
for line in lines:
    result = re.match('/href="(.*)"/iU', line)
    print result
This does not work: it only prints "None" for each line in the file, but I am sure there are at least 3 links in the file that I open.
Can someone give me a hint?
Thank you in advance
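For what it's worth, the snippet in the question prints None for two reasons: re.match only matches at the very start of a string, and the /.../iU delimiters are PHP-style syntax that Python treats as literal characters, so flags have to be passed separately. A minimal sketch of a regex version that searches anywhere in each line follows. The names HREF_RE, find_hrefs and sample_lines are made up for illustration, and the pattern also matches href attributes on tags other than <a>, so a real HTML parser is usually the safer choice.

```python
import re

# href = "..." or href = '...': the backreference \1 makes the closing quote
# match the opening one, and .*? is non-greedy so it stops at the first one.
HREF_RE = re.compile(r'href\s*=\s*(["\'])(.*?)\1', re.IGNORECASE)


def find_hrefs(lines):
    """Return every href value found in an iterable of text lines."""
    urls = []
    for line in lines:
        # finditer scans the whole line, unlike re.match, which only
        # tries to match at position 0.
        for match in HREF_RE.finditer(line):
            urls.append(match.group(2))
    return urls


# Hypothetical sample data standing in for the result of readlines().
sample_lines = [
    '<p>No links here</p>\n',
    '<a href="http://example.com/a">A</a> <a HREF="/b">B</a>\n',
]

for url in find_hrefs(sample_lines):
    print(url)
```

This only handles quoted attribute values that sit on a single line.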
The following approach comes from the book "Dive Into Python".
Code to list the URLs on a page:
# Python 2 only: sgmllib was removed in Python 3.
from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        # Called for every <a> start tag; attrs is a list of (name, value) pairs.
        href = [v for k, v in attrs if k=='href']
        if href:
            self.urls.extend(href)

# Usage, assuming the class above is saved as urllister.py:
import urllib, urllister

usock = urllib.urlopen("http://diveintopython.net/")
parser = urllister.URLLister()
parser.feed(usock.read())
usock.close()
parser.close()
for url in parser.urls: print url
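Since sgmllib is not available in Python 3, here is a rough equivalent of URLLister built on the standard-library html.parser module. This is a sketch, not the code from the book, and the sample_html string is invented so the example runs without a network connection.

```python
from html.parser import HTMLParser


class URLLister(HTMLParser):
    """Collect the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                # value is None for attributes written without a value.
                if name == "href" and value is not None:
                    self.urls.append(value)


# Hypothetical input; in the real application this would be the downloaded page.
sample_html = (
    '<a href="http://example.com/">home</a>'
    '<A HREF="/about">about</A>'
    '<a name="top">no href here</a>'
)

parser = URLLister()
parser.feed(sample_html)
parser.close()
for url in parser.urls:
    print(url)
```

The parser accepts text rather than bytes, so a page fetched with urllib.request.urlopen would need to be decoded before being passed to feed().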
Another option besides BeautifulSoup is lxml (http://lxml.de/):
import lxml.html

# The XPath expression selects the href attribute of every <a> element.
links = lxml.html.parse("http://stackoverflow.com/").xpath("//a/@href")
for link in links:
    print(link)