Python - How to get a Wikipedia page to redirect me?

I want to save several different Wikipedia links, but I do not want to store two different links that point to the same page. For example, the following links are different, but they point to the same page on Wikipedia:

https://en.wikipedia.org/w/index.php?title=(1S)-1-Methyl-2,3,4,9-tetrahydro-1H-pyrido-3,4-b-indole&redirect=no 
https://en.wikipedia.org/w/index.php?title=(1S)-1-methyl-2,3,4,9-tetrahydro-1H-pyrido-3,4-b-indole&redirect=no
__________________________________________________|___________________________________________________________

The only difference is that one character is uppercase. Or the following links:

https://en.wikipedia.org/wiki/(0,1)-matrix 
https://en.wikipedia.org/wiki/(0,1)_matrix 
___________________________________|______ 

These differ only in that one has a "-" and the other has a "_". Therefore, I want to save only one link from each pair, or else the following links, which they redirect to:

https://en.wikipedia.org/wiki/Tetrahydroharman 
https://en.wikipedia.org/wiki/Logical_matrix 

I have tried the answer to this question: qaru.site/questions/1245271 / ... , but that did not work for me. (The result I get is the source URL, not the one Wikipedia redirects me to in the browser.) So, how can I achieve what I'm looking for?


This is not a 301/302 redirect: the response you get has status code 200, and the URL change is done in JS.

You can do something like this. First, remove &redirect=no from the URL:

In [42]: import requests

In [43]: r = requests.get('https://en.wikipedia.org/w/index.php?title=(1S)-1-Methyl-2,3,4,9-tetrahydro-1H-pyrido-3,4-b-indole')

In [44]: # r.text is a str; r.content is bytes in Python 3 and cannot be split on a str
    ...: tmp = r.text.split('<link rel="canonical" href="')[-1]

In [45]: idx = tmp.find('"')  # closing quote of the href value

In [46]: real_link = tmp[:idx]

In [47]: real_link
Out[47]: 'https://en.wikipedia.org/wiki/Tetrahydroharman'

The canonical URL follows <link rel="canonical" href=" in the page source.

This is a hack, of course; you could use bs4 or something similar to make it cleaner.
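The canonical link can also be read with an HTML parser instead of string splitting. Below is a minimal sketch that uses the standard library's html.parser in place of bs4, so it has no extra dependencies. The names CanonicalLinkParser and canonical_url are made up for this illustration and do not come from the answer above.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalLinkParser(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> tag it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only <link> tags matter, and only the first canonical one is kept.
        if tag != "link" or self.canonical is not None:
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens, or be missing entirely.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attr_map.get("href")


def canonical_url(html: str) -> Optional[str]:
    """Return the canonical URL declared in the HTML text, or None."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical
```

With requests, this would be called as canonical_url(requests.get(url).text). A None result means the page declared no canonical link.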


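You can do this with the MediaWiki API, which returns JSON. If the requested title is a redirect, the response gives the title of the page it redirects to.

For example, for "Halab":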

https://en.wikipedia.org/w/api.php?action=query&titles=Halab&redirects&format=json

Result:

{  
   "batchcomplete":"",
   "query":{  
      "redirects":[  
         {  
            "from":"Halab",
            "to":"Aleppo"
         }
      ],
      "pages":{  
         "159244":{  
            "pageid":159244,
            "ns":0,
            "title":"Aleppo"
         }
      }
   }
}

Python:

import json
import requests

# The "redirects" parameter asks the API to resolve redirects for the given title.
query = requests.get('https://en.wikipedia.org/w/api.php?action=query&titles={}&redirects&format=json'.format('Halab'))

data = json.loads(query.text)
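To get from the parsed response to a single title you can store, one option is to follow the "from"/"to" pairs in the response. This is only a sketch: resolve_title is a name invented here, and it assumes the response has the shape shown above, plus an optional "normalized" list that the API can return when it adjusts the capitalisation or underscores of the requested title.

```python
def resolve_title(data: dict, title: str) -> str:
    """Follow the normalized/redirects entries of a query response to the final title."""
    query = data.get("query", {})
    mapping = {}
    # Both lists hold {"from": ..., "to": ...} entries; either may be absent.
    for key in ("normalized", "redirects"):
        for entry in query.get(key, []):
            mapping[entry["from"]] = entry["to"]
    seen = set()
    # Walk the chain, guarding against a cycle in the mapping.
    while title in mapping and title not in seen:
        seen.add(title)
        title = mapping[title]
    return title
```

With the response shown above, resolve_title(data, 'Halab') gives 'Aleppo', and a title that is not a redirect is returned unchanged.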

Amit Tripati's answer throws an exception. This is my answer:

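import lxml.html
import requests

res = requests.get(url)  # url is the link to resolve
doc = lxml.html.fromstring(res.content)
new_url = None  # stays None if the page has no canonical link
for t in doc.xpath("//link[contains(@rel, 'canonical')]"):
    new_url = str(t.attrib['href'])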

In my experience, there might be a redirect to the same URL, so it is better to check url != new_url before using new_url.
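Once each link can be resolved to its canonical form, storing only one link per page comes down to keeping a set of the canonical URLs already seen. Here is a rough sketch of that step. The function name unique_by_canonical and its resolve argument are hypothetical; resolve stands for whichever lookup you use (canonical link tag or API), and it is assumed to return None when it cannot resolve a URL.

```python
from typing import Callable, Iterable, List, Optional


def unique_by_canonical(
    urls: Iterable[str],
    resolve: Callable[[str], Optional[str]],
) -> List[str]:
    """Return one canonical URL per distinct page, in first-seen order."""
    seen = set()
    kept = []
    for url in urls:
        # Fall back to the original URL when resolution yields nothing.
        canonical = resolve(url) or url
        if canonical not in seen:
            seen.add(canonical)
            kept.append(canonical)
    return kept
```

Passing the resolver in as an argument keeps the network access separate, so the deduplication logic can be tried out with a plain dictionary lookup.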


Source: https://habr.com/ru/post/1689973/

