Getting top wallpapers from Reddit

I am trying to get the hottest wallpapers from the Reddit wallpapers subreddit. I use Beautiful Soup to get the HTML for the first wallpaper and then a regex to extract the URL from the anchor tag. But more often than not, I get a URL that does not match my regular expression. Here is the code I'm using:

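import re

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.reddit.com/r/wallpapers")
if r.status_code == 200:
    print(r.status_code)
    soup = BeautifulSoup(r.text, "html.parser")

    # First anchor with class "title", converted to its HTML string
    search_string = str(soup.find('a', {'class': 'title'}))
    # Regex that tries to pull a URL out of that string
    photo_url = str(re.search('[htps:/]{7,8}[a-zA-Z0-9._/:.]+[a-zA-Z0-9./:.-]+', search_string).group())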

Is there any way around this?

+4
2 answers

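A better way to do this:
Append .json to the URL, and Reddit returns JSON instead of HTML.
For example, https://www.reddit.com/r/wallpapers returns HTML content, while
https://www.reddit.com/r/wallpapers/.json returns a JSON object, which is easy to work with using the json module in Python.

Here is an example (Python 2):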

>>> import urllib
>>> import json

>>> data = urllib.urlopen('https://www.reddit.com/r/wallpapers/.json')
>>> wallpaper_dict = json.loads(data.read())

>>> wallpaper_dict['data']['children'][1]['data']['url']
u'http://i.imgur.com/C49VtMu.jpg'

>>> wallpaper_dict['data']['children'][1]['data']['title']
u'Space Shuttle'

>>> wallpaper_dict['data']['children'][1]['data']['domain']
u'i.imgur.com'

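This is cleaner: there is no need to fetch Reddit's HTML and extract the URL from it with a regex.
You read the fields straight from the JSON instead of parsing HTML.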

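PS: The index after ['children'] selects the post. Each entry in the list is one post, in the order of the listing, so ['data']['children'][2]['data']['url'] gives the URL of the entry at index 2, and so on. Get the idea? :)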
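The indexing described in the PS can be sketched as a small helper that walks every entry under ['data']['children'] instead of picking one index by hand. This is a minimal sketch written for Python 3; the function name `post_urls` and the sample dictionary are made up for illustration and only mimic the fields shown in the session above (`title`, `url`, `domain`), so a real listing will contain many more keys.

```python
def post_urls(listing):
    """Return (title, url) pairs for each post in a parsed listing, in order."""
    children = listing.get("data", {}).get("children", [])
    pairs = []
    for child in children:
        post = child.get("data", {})
        if "url" in post:  # skip entries that carry no link
            pairs.append((post.get("title", ""), post["url"]))
    return pairs


# Hypothetical sample shaped like the JSON in the session above.
# Only the entry at index 1 reuses values from that session.
sample_listing = {
    "data": {
        "children": [
            {"data": {"title": "First post", "url": "https://example.com/a.png",
                      "domain": "example.com"}},
            {"data": {"title": "Space Shuttle", "url": "http://i.imgur.com/C49VtMu.jpg",
                      "domain": "i.imgur.com"}},
            {"data": {"title": "Third post", "url": "https://example.com/c.jpg",
                      "domain": "example.com"}},
        ]
    }
}

for index, (title, url) in enumerate(post_urls(sample_listing)):
    print(index, title, url)
```

Returning the whole list keeps the choice of "which post" out of the parsing code, and an empty or unexpected response yields an empty list rather than a KeyError.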

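PPS: This works with plain urllib. Normally, when scraping Reddit, you have to send a custom User-Agent header (otherwise you may get a 429 response).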
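If a request does come back with 429, the User-Agent header mentioned in the PPS can be set with the standard library alone. The sketch below is for Python 3, where `urlopen` lives in `urllib.request`. The User-Agent string and the helper names `build_request` and `fetch_listing` are assumptions chosen for this example, not values Reddit prescribes.

```python
import json
import urllib.request

LISTING_URL = "https://www.reddit.com/r/wallpapers/.json"
# Hypothetical identifier; pick something that describes your own script.
USER_AGENT = "wallpaper-fetcher/0.1 (example script)"


def build_request(url=LISTING_URL, user_agent=USER_AGENT):
    """Create a Request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_listing(url=LISTING_URL, user_agent=USER_AGENT, timeout=10):
    """Download the listing and parse the JSON body (performs a network call)."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_listing()` returns the same kind of dictionary as `wallpaper_dict` in the session above, so `fetch_listing()['data']['children'][1]['data']['url']` reads the URL in the same way.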

+5

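Yes, the approach in Jarwins answer is better. But if you still want to parse the HTML, you do not need a regex. Just read the anchor's href attribute to get the URL: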

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.reddit.com/r/wallpapers")
if r.status_code == 200:
    soup = BeautifulSoup(r.text, "html.parser")
    # Take the second anchor with class "title" and read its href directly
    url = str(soup.find_all('a', {'class': 'title'})[1]["href"])
    print(url)
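The same idea (read the href attribute rather than matching the tag text with a regex) can also be sketched without any third-party package. The example below swaps BeautifulSoup for the standard-library `html.parser` module and runs against a made-up HTML snippet. The class names in the snippet, and the names `TitleLinkParser` and `title_links`, are assumptions for illustration, and Reddit's real markup may differ.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect href values from <a> tags whose class list includes 'title'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if "title" in classes and href:
            self.hrefs.append(href)


def title_links(html):
    """Return every matching href in document order (empty list if none)."""
    parser = TitleLinkParser()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Made-up markup for demonstration only.
SAMPLE_HTML = """
<div><a class="title extra" href="http://i.imgur.com/C49VtMu.jpg">Space Shuttle</a></div>
<div><a class="comments" href="/r/wallpapers/comments/example">12 comments</a></div>
<div><a class="title" href="https://example.com/second.png">Second</a></div>
"""

links = title_links(SAMPLE_HTML)
print(links)
```

Because `title_links` returns a list, the caller can check its length before indexing, which avoids an IndexError when the page contains fewer matching anchors than expected.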
+1

Source: https://habr.com/ru/post/1622220/

