Download PDF file from Wikipedia

Each Wikipedia article has a link (in the left sidebar, under Print/export) to download the article as a PDF. I wrote a small Haskell script that first fetches a Wikipedia article URL and prints the rendering URL it finds there. When I then request that rendering URL, I get empty tags back, but the same URL in a browser gives me a download link.

Can someone please tell me how to solve this problem? The code, as posted on ideone:

    import Network.HTTP
    import Text.HTML.TagSoup
    import Data.Maybe

    parseHelp :: Tag String -> Maybe String
    parseHelp (TagOpen _ y) =
      if any (\(a, b) -> b == "Download a PDF version of this wiki page") y
        then Just $ "http://en.wikipedia.org" ++ snd (y !! 0)
        else Nothing

    parse :: [Tag String] -> Maybe String
    parse [] = Nothing
    parse (x : xs)
      | isTagOpen x = case parseHelp x of
          Just s  -> Just s
          Nothing -> parse xs
      | otherwise = parse xs

    main = do
      x <- getLine
      tags_1 <- fmap parseTags $ getResponseBody =<< simpleHTTP (getRequest x) -- open url
      let lst = head . sections (~== "<div class=portal id=p-coll-print_export>") $ tags_1
          url = fromJust . parse $ lst -- rendering url
      putStrLn url
      tags_2 <- fmap parseTags $ getResponseBody =<< simpleHTTP (getRequest url)
      print tags_2
1 answer

If you request the URL with an external tool such as wget, you will see that Wikipedia does not serve the result page directly. It actually returns a 302 Moved Temporarily redirect.

Entering this URL in a browser works fine, because the browser follows the redirect automatically. simpleHTTP, however, does not. As the name suggests, simpleHTTP is pretty simple: it does not handle things like cookies, SSL, or redirects.
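To see the redirect from Haskell rather than from wget, something like the following could be used. It is only a sketch: isRedirect and describeResponse are names made up for this illustration, and it assumes the URL is reachable over plain HTTP, since the HTTP package does not speak SSL.

```haskell
import Network.HTTP

-- A 3xx status code means the server is sending us somewhere else.
-- (Hypothetical helper, not part of the HTTP package.)
isRedirect :: ResponseCode -> Bool
isRedirect (3, _, _) = True
isRedirect _         = False

-- Fetch a URL with simpleHTTP (which does not follow redirects)
-- and describe what came back. (Hypothetical helper.)
describeResponse :: String -> IO String
describeResponse url = do
  result <- simpleHTTP (getRequest url)
  case result of
    Left err -> return ("connection error: " ++ show err)
    Right rsp
      | isRedirect (rspCode rsp) ->
          -- The redirect target, if any, is in the Location header.
          return ("redirect to " ++ show (findHeader HdrLocation rsp))
      | otherwise ->
          return ("status " ++ show (rspCode rsp)
                  ++ ", body length " ++ show (length (rspBody rsp)))

main :: IO ()
main = getLine >>= describeResponse >>= putStrLn
```

If the rendering URL answers with a 3xx code, the body that simpleHTTP hands back is the body of the redirect response itself, not the page behind it, which would explain the empty tags.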

Instead, you will want to use Network.Browser, which offers much more control over how requests are made. In particular, the setAllowRedirects function makes it follow redirects automatically.

Here is a quick and dirty function for loading a URL into a String with redirection support:

    import Network.Browser
    import Network.HTTP (getRequest, rspBody) -- needed for getRequest and rspBody

    grabUrl :: String -> IO String
    grabUrl url = fmap (rspBody . snd) . browse $ do
        -- Disable logging output
        setErrHandler $ const (return ())
        setOutHandler $ const (return ())

        setAllowRedirects True
        request $ getRequest url
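As a rough sketch of how this could replace the second request in the question's main, the variant below also returns the URI the response was finally fetched from, and counts the opening tags in the body as a quick check that the result is no longer empty. The names fetchFollowingRedirects and openTagCount are made up for this illustration, and the sketch again assumes a URL served over plain HTTP.

```haskell
import Network.Browser
import Network.HTTP (getRequest, rspBody)
import Network.URI (URI)
import Text.HTML.TagSoup (isTagOpen, parseTags)

-- Fetch a URL, following redirects. Returns the URI the response was
-- fetched from together with the body. (Hypothetical helper.)
fetchFollowingRedirects :: String -> IO (URI, String)
fetchFollowingRedirects url = browse $ do
  -- Disable logging output
  setErrHandler $ const (return ())
  setOutHandler $ const (return ())
  setAllowRedirects True
  (finalUri, rsp) <- request (getRequest url)
  return (finalUri, rspBody rsp)

-- Number of opening tags in an HTML string. (Hypothetical helper.)
openTagCount :: String -> Int
openTagCount = length . filter isTagOpen . parseTags

main :: IO ()
main = do
  renderUrl <- getLine -- the rendering URL printed by the question's script
  (finalUri, body) <- fetchFollowingRedirects renderUrl
  putStrLn ("fetched from: " ++ show finalUri)
  putStrLn ("opening tags: " ++ show (openTagCount body))
```

From here, the question's own parse function could be applied to parseTags body to pick out the download link.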

Source: https://habr.com/ru/post/896920/

