Downloading a large number of files using python

Is there a way to download many files quickly using Python? This code is fast enough for about 100 or so files, but I need to download 300,000. Obviously they are all very small files (or I wouldn't be downloading 300,000 of them :)), so this loop seems to be the real bottleneck. Anyone have any thoughts? Maybe use MPI or threads?

Do I just need to live with the bottleneck? Or is there a faster way, maybe not even using Python?

(I included the full start of the code just for completeness)

from __future__ import division
import pandas as pd
import numpy as np
import urllib2
import os
import linecache 

#we start with a huge file of urls

data = pd.read_csv("edgar.csv")
datatemp2 = data[data['form'].str.contains("14A")]
datatemp3 = data[data['form'].str.contains("14C")]

#data2 is the cut-down file

data2 = datatemp2.append(datatemp3)
flist = np.array(data2['filename'])
print len(flist)
print flist

###below we have a script to download all of the files in the data2 database
###here you will need to create a new directory named edgar14A14C in your CWD

original = os.getcwd()  # strings have no .copy(); getcwd() already returns a new string
os.chdir(os.path.join(original, 'edgar14A14C'))


for i in xrange(len(flist)):
    url = "ftp://ftp.sec.gov/" + str(flist[i])
    file_name = url.split('/')[-1]
    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')
    f.write(u.read())
    f.close()
    u.close()  # close the FTP connection too, or 300,000 iterations will leak handles
    print i
1 answer

Pull the body of your loop out into a job() function and use multiprocessing to map it over a pool of worker processes.

Something like this (untested):

from multiprocessing import Pool

def job(url):
    file_name = str(url.split('/')[-1])
    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')
    f.write(u.read())
    f.close()

pool = Pool()
urls = ["ftp://ftp.sec.gov/{0:s}".format(f) for f in flist]
pool.map(job, urls)

A few notes:

  • By default, Pool() starts one worker process per CPU core on your machine.
  • Each worker process runs job() on one URL at a time.
  • map() hands every element of urls to a job() call and collects the results.

Note that Python's multiprocessing.Pool.map blocks until all the jobs have finished, and returns the results in the same order as the input.
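
That blocking, order-preserving behaviour of Pool.map is easy to see with a toy stand-in for the download job (the squaring job() below is purely illustrative, not the real download):

```python
from multiprocessing import Pool

def job(n):
    # toy stand-in for the download: workers need a picklable, top-level function
    return n * n

if __name__ == "__main__":
    pool = Pool(2)                      # two worker processes
    results = pool.map(job, [1, 2, 3])  # blocks until all three jobs are done
    pool.close()
    pool.join()
    print(results)                      # input order is preserved: [1, 4, 9]
```

Swapping the squaring for the urlopen/write body gives you the parallel downloader.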

If you want to see progress while the jobs run, use imap together with a progress bar instead:

from multiprocessing import Pool


from progress.bar import Bar


def job(input):
    # do some work
    pass


pool = Pool()
inputs = range(100)
bar = Bar('Processing', max=len(inputs))
for i in pool.imap(job, inputs):
    bar.next()
bar.finish()

The bar gives you a live counter, and the progress package can also show percentage, ETA, and so on.

Also, if you end up fetching over HTTP(S) rather than FTP, the requests library has a much friendlier API than urllib2.
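
For completeness, here is a minimal sketch of one download over HTTPS with requests; the https://www.sec.gov/Archives/ base URL and the edgar_url()/download() helpers are assumptions for illustration, not something from the original answer:

```python
import os

def edgar_url(filename):
    # hypothetical helper: assumes the filings are also served over HTTPS
    return "https://www.sec.gov/Archives/" + filename

def download(filename, dest_dir="."):
    import requests  # third-party: pip install requests

    url = edgar_url(filename)
    r = requests.get(url, timeout=30)
    r.raise_for_status()  # fail loudly instead of saving an HTML error page
    path = os.path.join(dest_dir, url.split('/')[-1])
    with open(path, 'wb') as f:
        f.write(r.content)
    return path
```

A download() like this drops straight into the pool.map(job, urls) pattern shown earlier.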

