Download .xls files from a webpage using Python and BeautifulSoup

I want to download all the .xls, .xlsx, and .csv files from this site into a specified folder.

 https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009 

I looked into mechanize, Beautiful Soup, urllib2, etc. mechanize does not work in Python 3, and urllib2 also gave me problems with Python 3; I looked for workarounds but could not find any. So I'm currently trying to get it working with Beautiful Soup.
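
For reference, the parts of urllib2 needed for plain downloading live in urllib.request and urllib.error in Python 3, so no third-party replacement is required for that piece. A minimal sketch (using only the page URL from the question):

    # Python 3: what urllib2 provided is now split across urllib.request and urllib.error
    from urllib.request import Request, urlopen
    from urllib.error import HTTPError

    page_url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
    try:
        with urlopen(Request(page_url)) as response:
            html = response.read().decode('utf-8')  # the page HTML as a str
        print('fetched', len(html), 'characters')
    except HTTPError as err:
        print('request failed with HTTP status', err.code)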

I found some sample code and tried to adapt it to my problem, as follows:

    from bs4 import BeautifulSoup
    # Python 3.x
    from urllib.request import urlopen, urlretrieve, quote
    from urllib.parse import urljoin

    url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009/'
    u = urlopen(url)
    try:
        html = u.read().decode('utf-8')
    finally:
        u.close()

    soup = BeautifulSoup(html)
    for link in soup.select('div[webpartid] a'):
        href = link.get('href')
        if href.startswith('javascript:'):
            continue
        filename = href.rsplit('/', 1)[-1]
        href = urljoin(url, quote(href))
        try:
            urlretrieve(href, filename)
        except:
            print('failed to download')

However, when I run this code, no files are retrieved from the target page and no error message is displayed (not even "failed to download").

  • How can I use BeautifulSoup to select Excel files from a page?
  • How do I download these files to a local folder using Python?
4 answers

Problems with your script as it stands:

  • url has a trailing / which, when requested, returns an invalid (404) page rather than the page listing the files you want to download.
  • The CSS selector in soup.select(...) looks for div elements with a webpartid attribute, which does not exist anywhere in the linked document.
  • You urljoin and quote the URL, even though the links on the page are already absolute URLs and do not need to be quoted.
  • The try:...except: block hides any errors that occur while trying to download a file. Using an except block without a specific exception is bad practice and should be avoided (see the sketch after this list).
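
On that last point, here is a minimal sketch of catching a specific exception instead of a bare except; urllib.error.HTTPError is just one concrete error worth handling, and the file URL and filename below are placeholders:

    from urllib.request import urlretrieve
    from urllib.error import HTTPError

    # Hypothetical file URL and target name, purely for illustration
    file_url = 'https://example.com/some-file.xls'
    filename = 'some-file.xls'

    try:
        urlretrieve(file_url, filename)
    except HTTPError as err:
        # Only HTTP errors are caught, and the status code is reported;
        # anything unexpected (typos, bad arguments, ...) still surfaces.
        print('failed to download %s (HTTP %s)' % (file_url, err.code))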

A modified version of your code that will find the correct files and try to download them looks like this:

    from bs4 import BeautifulSoup
    # Python 3.x
    from urllib.request import urlopen, urlretrieve, quote
    from urllib.parse import urljoin

    # Remove the trailing / you had, as that gives a 404 page
    url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
    u = urlopen(url)
    try:
        html = u.read().decode('utf-8')
    finally:
        u.close()

    soup = BeautifulSoup(html, "html.parser")

    # Select all <a> elements whose href attribute starts with http://
    for link in soup.select('a[href^="http://"]'):
        href = link.get('href')
        # Make sure it has one of the correct extensions
        if not any(href.endswith(x) for x in ['.csv', '.xls', '.xlsx']):
            continue
        filename = href.rsplit('/', 1)[-1]
        print("Downloading %s to %s..." % (href, filename))
        urlretrieve(href, filename)
        print("Done.")

However, if you run this, you will notice that a urllib.error.HTTPError: HTTP Error 403: Forbidden exception is raised, even though the same file can be downloaded in the browser. At first I thought it was a referrer check (to prevent hotlinking); however, if you watch the request in your browser (for example, in the Chrome developer tools), you will notice that the initial http:// request is blocked there too, and Chrome then retries the request for the same file over https://.
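
You can check the same distinction outside the browser. A small sketch, assuming one of the http:// links scraped from the page (the exact file URL below is only a placeholder; substitute a real link):

    from urllib.request import urlopen
    from urllib.error import HTTPError

    # Placeholder document URL in the http:// form the page uses;
    # substitute a real link scraped from the page.
    http_href = 'http://rbidocs.rbi.org.in/rdocs/content/docs/example.xls'

    for candidate in (http_href, http_href.replace('http://', 'https://')):
        try:
            with urlopen(candidate) as response:
                print(candidate, '->', response.getcode())
        except HTTPError as err:
            print(candidate, '-> HTTP', err.code)

If the site behaves as described, the http:// form is rejected with a 403 while the https:// form succeeds.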

In other words, the request must go over HTTPS to work (even though the URLs on the page say http://). To fix this, you need to rewrite http: to https: before using the URL for the request. The following code rewrites the URLs correctly and downloads the files. I also added a variable specifying the output folder, which is joined to the file name using os.path.join:

    import os
    from bs4 import BeautifulSoup
    # Python 3.x
    from urllib.request import urlopen, urlretrieve

    URL = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
    OUTPUT_DIR = ''  # path to output folder, '.' or '' uses current folder

    u = urlopen(URL)
    try:
        html = u.read().decode('utf-8')
    finally:
        u.close()

    soup = BeautifulSoup(html, "html.parser")

    for link in soup.select('a[href^="http://"]'):
        href = link.get('href')
        if not any(href.endswith(x) for x in ['.csv', '.xls', '.xlsx']):
            continue
        filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1])
        # We need a https:// URL for this site
        href = href.replace('http://', 'https://')
        print("Downloading %s to %s..." % (href, filename))
        urlretrieve(href, filename)
        print("Done.")
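
One small caveat if you point OUTPUT_DIR at a folder that does not exist yet: urlretrieve does not create directories, so it can be worth creating the folder first. A minimal sketch (the folder name here is only an example, not from the answer above):

    import os

    OUTPUT_DIR = 'rbi_downloads'  # example folder name

    # Create the output folder (and any missing parents) before downloading;
    # exist_ok avoids an error if it already exists.
    os.makedirs(OUTPUT_DIR, exist_ok=True)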

I found this to be a good working example using the BeautifulSoup4, requests and wget modules for Python 2.7:

    import requests
    import wget
    import os
    from bs4 import BeautifulSoup, SoupStrainer
    from urlparse import urljoin  # Python 2.7

    url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
    file_types = ['.xls', '.xlsx', '.csv']

    for file_type in file_types:
        response = requests.get(url)
        for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
            if link.has_attr('href'):
                if file_type in link['href']:
                    # urljoin leaves absolute links (as on this page) unchanged
                    # and resolves relative ones against the page URL
                    full_path = urljoin(url, link['href'])
                    wget.download(full_path)
I tried the code above and it is still giving me urllib.error.HTTPError: HTTP Error 403: Forbidden. I also tried adding a user agent; my modified code:

    import os
    from bs4 import BeautifulSoup
    # Python 3.x
    from urllib.request import Request, urlopen, urlretrieve

    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    URL = Request('https://www.rbi.org.in/scripts/bs_viewcontent.aspx?Id=2009', headers=headers)
    #URL = 'https://www.rbi.org.in/scripts/bs_viewcontent.aspx?Id=2009'
    OUTPUT_DIR = 'E:\python\out'  # path to output folder, '.' or '' uses current folder

    u = urlopen(URL)
    try:
        html = u.read().decode('utf-8')
    finally:
        u.close()

    soup = BeautifulSoup(html, "html.parser")

    for link in soup.select('a[href^="http://"]'):
        href = link.get('href')
        if not any(href.endswith(x) for x in ['.csv', '.xls', '.xlsx']):
            continue
        filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1])
        # We need a https:// URL for this site
        href = href.replace('http://', 'https://')
        print("Downloading %s to %s..." % (href, filename))
        urlretrieve(href, filename)
        print("Done.")
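
One thing worth noting about the modified code above: the User-Agent header is only attached to the page request. urlretrieve does not take a headers argument, so the individual file downloads still go out with urllib's default user agent. One way around that (a sketch, not tested against this site) is to download each file through a Request object and write the bytes yourself:

    import shutil
    from urllib.request import Request, urlopen

    HEADERS = {'User-Agent': 'Mozilla/5.0'}  # same idea as the header used above

    def download(href, filename):
        """Fetch href with a custom User-Agent and write the bytes to filename."""
        req = Request(href, headers=HEADERS)
        with urlopen(req) as response, open(filename, 'wb') as out_file:
            shutil.copyfileobj(response, out_file)

    # Hypothetical usage (the file URL is a placeholder, for illustration only):
    # download('https://rbidocs.rbi.org.in/rdocs/content/docs/example.xls', 'example.xls')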

This worked better for me... using Python 3:

    import os
    import urllib
    from bs4 import BeautifulSoup
    # Python 3.x
    from urllib.request import urlopen, urlretrieve
    from urllib.error import HTTPError

    URL = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
    OUTPUT_DIR = ''  # path to output folder, '.' or '' uses current folder

    u = urlopen(URL)
    try:
        html = u.read().decode('utf-8')
    finally:
        u.close()

    soup = BeautifulSoup(html, "html.parser")

    for link in soup.select('a[href^="http://"]'):
        href = link.get('href')
        if not any(href.endswith(x) for x in ['.csv', '.xls', '.xlsx']):
            continue
        filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1])
        # We need a https:// URL for this site
        href = href.replace('http://', 'https://')
        try:
            print("Downloading %s to %s..." % (href, filename))
            urlretrieve(href, filename)
            print("Done.")
        except urllib.error.HTTPError as err:
            if err.code == 404:
                continue
