Dissecting the script Aaron Swartz used to download several thousand articles from the JSTOR archive

Aaron Swartz played an important role in shaping the Internet during its early years. Those familiar with Aaron probably know that he committed suicide while facing up to 35 years in prison for downloading a large number of articles from the JSTOR archive, a digital library of academic journals and books. The script he used to download the articles has been released and is shown below. (Here's a link to the documentary about Aaron for anyone interested.)

keepgrabbing.py

This is the code:

import subprocess, urllib, random
class NoBlocks(Exception): pass
def getblocks():
    r = urllib.urlopen("http://{?REDACTED?}/grab").read()
    if '<html' in r.lower(): raise NoBlocks
    return r.split()


import sys
if len(sys.argv) > 1:
    prefix = ['--socks5', sys.argv[1]]
else:
    prefix = []#'-interface','eth0:1']
line = lambda x: ['curl'] + prefix + ['-H', "Cookie: TENACIOUS=" + str(random.random())[3:], '-o', 'pdfs/' + str(x) + '.pdf', "http://www.jstor.org/stable/pdfplus/" + str(x) + ".pdf?acceptTC=true"]


while 1:
    blocks = getblocks()
    for block in blocks:
        print block
        subprocess.Popen(line(block)).wait()

Among other things, Aaron was involved in the development of RSS and Reddit, and he died at the age of 26.


Jstor - . 2010 , JSTOR , . , , , , .

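As far as I can tell, getblocks() uses urllib to request a list of article identifiers (the URL is redacted) and raises NoBlocks if the response is an HTML page rather than a plain list.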
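The check inside getblocks() can be tried without any network access. The sketch below is written for Python 3 (the original script is Python 2) and isolates only the parsing step; `parse_blocks` is a hypothetical helper name that does not appear in keepgrabbing.py, and the identifiers passed to it are made-up placeholders.

```python
class NoBlocks(Exception):
    """Raised when the server answers with an HTML page instead of a list."""


def parse_blocks(body):
    # Hypothetical helper: mirrors the logic inside getblocks(), minus the
    # network request. An HTML response is treated as "nothing to fetch".
    if '<html' in body.lower():
        raise NoBlocks
    # Whitespace-separated tokens become the list of identifiers.
    return body.split()


# Placeholder identifiers, used only to show the splitting behaviour.
print(parse_blocks("1234567 2345678\n3456789"))

try:
    parse_blocks("<HTML><body>error</body></HTML>")
except NoBlocks:
    print("no blocks available")
```

Note that in the original script nothing catches NoBlocks, so an HTML response would simply stop the while loop with an unhandled exception.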

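What I do not understand is the part after import sys, the if/else.

What does the len(sys.argv) > 1 check do, and what is the else branch for?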

if len(sys.argv) > 1:
    prefix = ['--socks5', sys.argv[1]]
else:
    prefix = []#'-interface','eth0:1']
line = lambda x: ['curl'] + prefix + ['-H', "Cookie: TENACIOUS=" + str(random.random())[3:], '-o', 'pdfs/' + str(x) + '.pdf', "http://www.jstor.org/stable/pdfplus/" + str(x) + ".pdf?acceptTC=true"]

Aaron seige .


, , . Jstor , . Jstor , " IP" .

" cookie , Literatum.... IP , , . 8500 . , , , , . MDC , .


The command built by line is then passed to subprocess.Popen:

subprocess.Popen(line(block)).wait()

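getblocks returns a list of blocks from some server (the URL is redacted, so not necessarily jstor), and each block is then downloaded as a PDF. That is the whole script.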
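Because the loop calls Popen(...).wait() for every block, the downloads run strictly one after another. The Python 3 sketch below illustrates just that behaviour; it is an illustration under assumptions, with the Python interpreter standing in for curl and `run_and_wait` being a hypothetical helper name.

```python
import subprocess
import sys


def run_and_wait(argv):
    # Popen starts the child process; wait() blocks until it exits and
    # returns its exit code, so commands run one at a time.
    return subprocess.Popen(argv).wait()


# Stand-in command (assumption): the interpreter echoes the block name
# where the original script would launch curl.
for block in ["first", "second"]:
    code = run_and_wait(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", block]
    )
    print(block, "exit code", code)
```

The original script ignores the value returned by wait(), so a failed download is not noticed or retried.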

The lambda line builds the argument list with which Popen launches curl for each block, including the extra options (among them the cookie).
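To make the one-line lambda easier to read, here is the same argument list assembled step by step. This is a Python 3 sketch, not the author's code: `build_curl_args` is a hypothetical name, the block number and proxy address are placeholders, and the function only builds the list without running anything.

```python
import random


def build_curl_args(block, prefix=()):
    # Same pieces as the lambda `line`, spelled out one at a time.
    # str(random.random()) looks like '0.8444218515250481'; [3:] drops the
    # leading '0.' and the first digit, typically leaving only digits.
    cookie_value = str(random.random())[3:]
    return (
        ['curl']
        + list(prefix)                                   # optional proxy flags
        + ['-H', 'Cookie: TENACIOUS=' + cookie_value]    # extra request header
        + ['-o', 'pdfs/' + str(block) + '.pdf']          # output file
        + ['http://www.jstor.org/stable/pdfplus/' + str(block) + '.pdf?acceptTC=true']
    )


# Placeholder values; the list is printed, not executed.
for part in build_curl_args(1234567, ['--socks5', 'localhost:1080']):
    print(part)
```

Since random.random() is called each time the function runs, every generated command carries a different cookie value.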


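The # in the else branch starts a comment. Everything after it on that line is ignored, so prefix is simply an empty list.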
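The if/else can be checked in isolation. In the Python 3 sketch below, `choose_prefix` is a hypothetical function name and 'localhost:1080' is a placeholder proxy address; the logic itself follows the quoted lines.

```python
def choose_prefix(argv):
    # argv[0] is the script name, so len(argv) > 1 means at least one
    # command-line argument was supplied; it is passed to curl's --socks5.
    if len(argv) > 1:
        return ['--socks5', argv[1]]
    # Everything after '#' on the original line is a comment, so the
    # else branch leaves prefix as an empty list.
    return []


print(choose_prefix(['keepgrabbing.py']))                    # []
print(choose_prefix(['keepgrabbing.py', 'localhost:1080']))  # ['--socks5', 'localhost:1080']
```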


Source: https://habr.com/ru/post/1695915/