Web crawler - following links

Please bear with me. I am new to Python, but I am having a lot of fun with it. I am trying to code a web crawler that crawls the election results of the last referendum in Denmark. I have managed to extract all the relevant links from the main page. Now I want Python to follow each of the 92 links and gather 9 pieces of information from each of those pages. But I am stuck. I hope you can give me a hint.

Here is my code:

    import requests
    import urllib2
    from bs4 import BeautifulSoup

    # This is the original url: http://www.kmdvalg.dk/
    soup = BeautifulSoup(urllib2.urlopen('http://www.kmdvalg.dk/').read())

    my_list = []
    all_links = soup.find_all("a")
    for link in all_links:
        link2 = link["href"]
        my_list.append(link2)

    for i in my_list[1:93]:
        print i

The output shows all the links that I would like to follow and gather information from. How do I do that?
4 answers

A simple approach would be to iterate over your list of URLs and parse each one separately:

    for url in my_list:
        soup = BeautifulSoup(urllib2.urlopen(url).read())
        # then parse each page individually here
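For example, here is a minimal sketch of that per-page step, assuming Python 2 with urllib2 as in your question and reusing the 'StemmerNu' cell class that the other answers point at (adjust the selector to the nine fields you actually need):

    results = []
    for url in my_list[1:93]:
        soup = BeautifulSoup(urllib2.urlopen(url).read())
        # the 'StemmerNu' cells hold the vote counts on each result page
        stemmer = soup.find_all('td', class_='StemmerNu')
        results.append([td.get_text(strip=True) for td in stemmer])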

Alternatively, you can significantly speed up the process using futures (the requests-futures package):

    from requests_futures.sessions import FuturesSession
    from bs4 import BeautifulSoup

    def my_parse_function(html):
        """Use this function to parse each page"""
        soup = BeautifulSoup(html)
        all_paragraphs = soup.find_all('p')
        return all_paragraphs

    session = FuturesSession(max_workers=5)
    futures = [session.get(url) for url in my_list]
    page_results = [my_parse_function(future.result().content) for future in futures]
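Note that requests-futures is a separate package (pip install requests-futures), and each future.result() is an ordinary requests Response, which is why its .content is passed to the parse function. A small usage sketch of the collected results might look like this:

    # page_results holds one list of <p> tags per URL, in the same order as my_list
    for url, paragraphs in zip(my_list, page_results):
        print url, len(paragraphs)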

Here is my solution using lxml. It works similarly to the BeautifulSoup approach.

    import lxml
    from lxml import html
    import requests

    page = requests.get('http://www.kmdvalg.dk/main')
    tree = html.fromstring(page.content)
    # grab all links
    my_list = tree.xpath('//div[@class="LetterGroup"]//a/@href')
    print 'Length of all links = ', len(my_list)

my_list is a list of all the links. You can now loop over it to scrape the information inside each page.

We can loop through each link and extract information from each page. As an example, this is only for the top table:

    table_information = []
    for t in my_list:
        page_detail = requests.get(t)
        tree = html.fromstring(page_detail.content)
        table_key = tree.xpath('//td[@class="statusHeader"]/text()')
        table_value = tree.xpath('//td[@class="statusText"]/text()') + tree.xpath('//td[@class="statusText"]/a/text()')
        table_information.append(zip([t]*len(table_key), table_key, table_value))

And for the table at the bottom of the page:

    table_information_below = []
    for t in my_list:
        page_detail = requests.get(t)
        tree = html.fromstring(page_detail.content)
        l1 = tree.xpath('//tr[@class="tableRowPrimary"]/td[@class="StemmerNu"]/text()')
        l2 = tree.xpath('//tr[@class="tableRowSecondary"]/td[@class="StemmerNu"]/text()')
        table_information_below.append([t] + l1 + l2)
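Both loops above fetch every page a second time. As a small refinement (just a sketch combining the two snippets, reusing the same requests and lxml imports), you can download each page once and run both sets of XPath queries on the same tree:

    table_information = []
    table_information_below = []
    for t in my_list:
        # fetch and parse each result page only once
        tree = html.fromstring(requests.get(t).content)
        table_key = tree.xpath('//td[@class="statusHeader"]/text()')
        table_value = tree.xpath('//td[@class="statusText"]/text()') + tree.xpath('//td[@class="statusText"]/a/text()')
        table_information.append(zip([t]*len(table_key), table_key, table_value))
        l1 = tree.xpath('//tr[@class="tableRowPrimary"]/td[@class="StemmerNu"]/text()')
        l2 = tree.xpath('//tr[@class="tableRowSecondary"]/td[@class="StemmerNu"]/text()')
        table_information_below.append([t] + l1 + l2)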

Hope this helps!


That would be my solution for your problem.

    import requests
    from bs4 import BeautifulSoup

    def spider():
        url = "http://www.kmdvalg.dk/main"
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll('div', {'class': 'LetterGroup'}):
            anc = link.find('a')
            href = anc.get('href')
            print(anc.getText())
            print(href)
            # call a second function from here that is similar to this one (making url = href)
            spider2(href)
            print("\n")

    def spider2(linktofollow):
        url = linktofollow
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll('tr', {'class': 'tableRowPrimary'}):
            anc = link.find('td')
            print(anc.getText())
            print("\n")

    spider()

It's not finished ... it only grabs a single element from the table, but you get the idea of how it should work.
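As a rough sketch of one way to finish it, spider2 could collect every cell in each primary row instead of only the first one (the names and return value here are just illustrative):

    def spider2(linktofollow):
        source_code = requests.get(linktofollow)
        soup = BeautifulSoup(source_code.text, 'html.parser')
        rows = []
        for row in soup.findAll('tr', {'class': 'tableRowPrimary'}):
            # collect the text of every cell in the row, not just the first one
            cells = [td.getText(strip=True) for td in row.findAll('td')]
            rows.append(cells)
        return rows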


Here is my final code, which runs smoothly. Please let me know if I could make it smarter!

    import urllib2
    from bs4 import BeautifulSoup
    import codecs

    f = codecs.open("eu2015valg.txt", "w", encoding="iso-8859-1")
    soup = BeautifulSoup(urllib2.urlopen('http://www.kmdvalg.dk/').read())

    liste = []
    alle_links = soup.find_all("a")
    for link in alle_links:
        link2 = link["href"]
        liste.append(link2)

    for url in liste[1:93]:
        soup = BeautifulSoup(urllib2.urlopen(url).read().decode('iso-8859-1'))
        tds = soup.findAll('td')
        stemmernu = soup.findAll('td', class_='StemmerNu')
        print >> f, tds[5].string, ";", tds[12].string, ";", tds[14].string, ";", tds[16].string, ";", stemmernu[0].string, ";", stemmernu[1].string, ";", stemmernu[2].string, ";", stemmernu[3].string, ";", stemmernu[6].string, ";", stemmernu[8].string, ";", '\r\n'

    f.close()
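One possible way to make it a little smarter, purely as a sketch: let requests handle the download and decoding and let the csv module handle the delimiters and line endings, reusing the liste of links built above (the column indices are kept from your code but untested against the live site):

    import csv
    import requests
    from bs4 import BeautifulSoup

    with open("eu2015valg.csv", "wb") as f:
        writer = csv.writer(f, delimiter=';')
        for url in liste[1:93]:
            soup = BeautifulSoup(requests.get(url).text)
            tds = soup.findAll('td')
            stemmernu = soup.findAll('td', class_='StemmerNu')
            row = [tds[i].string for i in (5, 12, 14, 16)] + \
                  [stemmernu[i].string for i in (0, 1, 2, 3, 6, 8)]
            # csv in Python 2 expects byte strings, so encode the Danish text
            writer.writerow([(c or '').encode('iso-8859-1') for c in row])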
