Download file from Blob URL using Python

I want my Python script to download the master data Excel file (Download, XLSX) from this Frankfurt Stock Exchange web page.

When I try to fetch it with urllib and wget, it turns out that the URL leads to a blob, and the downloaded file is only 289 bytes and unreadable.

http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx

I am not familiar with blobs and have the following questions:

  • Is it possible to retrieve a file "behind a blob" using Python?

  • If so, is it necessary to uncover the "true" URL behind the blob, if there is such a thing, and how? My concern is that the link above is not static and in fact changes often.

2 answers

A length of 289 bytes suggests that you received the HTML of a 403 Forbidden page instead of the file. The server rejects requests that do not specify a user agent, so send a User-Agent header with the request.

Python 3

# python3
import urllib.request as request

url = 'http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx'
# fake user agent of Safari
fake_useragent = 'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25'
r = request.Request(url, headers={'User-Agent': fake_useragent})
f = request.urlopen(r)

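# the body is the binary .xlsx content, so write it to disk rather than printing it
with open('All-tradable-ETFs-ETCs-and-ETNs.xlsx', 'wb') as out:
    out.write(f.read())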

Python 2

# python2
import urllib2

url = 'http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx'
# fake user agent of safari
fake_useragent = 'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25'

r = urllib2.Request(url, headers={'User-Agent': fake_useragent})
f = urllib2.urlopen(r)

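# the body is the binary .xlsx content, so write it to disk rather than printing it
with open('All-tradable-ETFs-ETCs-and-ETNs.xlsx', 'wb') as out:
    out.write(f.read())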
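
To check whether a download is the real workbook or a small HTML error page like the 289-byte response described above, one option is to inspect the bytes before saving them. The following is a minimal sketch, not part of the original answer. The helper name looks_like_xlsx is hypothetical, and it relies on the fact that an .xlsx file is a ZIP archive containing a [Content_Types].xml entry.

```python
import io
import zipfile


def looks_like_xlsx(data: bytes) -> bool:
    """Return True if the bytes look like an .xlsx workbook.

    Hypothetical helper: an .xlsx file is a ZIP archive that contains
    a "[Content_Types].xml" entry, while an HTML error page is not.
    """
    # ZIP archives start with the signature bytes "PK"
    if not data.startswith(b"PK"):
        return False
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return "[Content_Types].xml" in archive.namelist()
    except zipfile.BadZipFile:
        return False


# An HTML error body fails the check
print(looks_like_xlsx(b"<html><body>403 Forbidden</body></html>"))
```

With either snippet above, read the response into a variable first, for example data = f.read(), and write the file only if looks_like_xlsx(data) returns True.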
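
I recommend requests and BeautifulSoup, which are good libraries for this. Since the blob link may change, look it up on the listing page each time and then download it:

from bs4 import BeautifulSoup
import requests
import re

url = 'http://www.xetra.com/xetra-en/instruments/etf-exchange-traded-funds/list-of-tradable-etfs'
html = requests.get(url)
# name the parser explicitly so the result does not depend on what is installed
page = BeautifulSoup(html.content, 'html.parser')
# find the <span> labelled "Master data"; its parent <a> holds the file URL
reg = re.compile('Master data')
find = page.find('span', string=reg)
file_url = 'http://www.xetra.com' + find.parent['href']
response = requests.get(file_url)
with open(r'C:\Users\user\Downloads\file.xlsx', 'wb') as ff:
    ff.write(response.content)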
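
The lookup in this answer depends on the link's label text staying "Master data". As an alternative sketch, one could collect every link on the page that ends in .xlsx, using only the standard library. The class XlsxLinkFinder, the function find_xlsx_links, and the sample HTML (including its blob path) are made up for illustration. The real page markup is an assumption here and may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class XlsxLinkFinder(HTMLParser):
    """Collect the href values of <a> tags that point at .xlsx files."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".xlsx"):
            self.links.append(href)


def find_xlsx_links(html_text, base_url):
    """Return absolute URLs for all .xlsx links found in html_text."""
    finder = XlsxLinkFinder()
    finder.feed(html_text)
    # urljoin resolves relative paths such as "/blob/..." against the page URL
    return [urljoin(base_url, href) for href in finder.links]


# Made-up markup that mimics a link wrapping a "Master data" label
sample_html = '<a href="/blob/123/abc/data/example.xlsx"><span>Master data</span></a>'
print(find_xlsx_links(sample_html, "http://www.xetra.com/xetra-en/"))
```

In practice the HTML text would come from the listing page, for example requests.get(url).text, and the first returned URL would then be downloaded as shown in the answers.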


Source: https://habr.com/ru/post/1689660/

