Python / BeautifulSoup scraper: multithreading does not speed it up at all

I have a CSV file ("SomeSiteValidURLs.csv") that lists all the links I need to scrape. The code works: it goes through the URLs in the CSV, scrapes the information, and writes/saves it to another CSV file ("Output.csv"). However, since I plan to do this for a large portion of the site (over 10,000,000 pages), speed is important. Each link takes about 1 second to scrape and save to the CSV, which is too slow for a project of this scale. So I brought in the threading module and, to my surprise, it does not speed things up at all; it still processes one link at a time. Did I do something wrong? Is there another way to speed up the processing?

Without multithreading:

import urllib2
import csv
from bs4 import BeautifulSoup
import threading


def crawlToCSV(FileName):
    with open(FileName, "rb") as f:
        for URLrecords in f:
            OpenSomeSiteURL = urllib2.urlopen(URLrecords)
            Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
            OpenSomeSiteURL.close()

            tbodyTags = Soup_SomeSite.find("tbody")
            trTags = tbodyTags.find_all("tr", class_="result-item ")

            placeHolder = []

            for trTag in trTags:
                tdTags = trTag.find("td", class_="result-value")
                tdTags_string = tdTags.string
                placeHolder.append(tdTags_string)

            with open("Output.csv", "ab") as f:
                writeFile = csv.writer(f)
                writeFile.writerow(placeHolder)

crawlToCSV("SomeSiteValidURLs.csv")

With multithreading:

import urllib2
import csv
from bs4 import BeautifulSoup
import threading


def crawlToCSV(FileName):
    with open(FileName, "rb") as f:
        for URLrecords in f:
            OpenSomeSiteURL = urllib2.urlopen(URLrecords)
            Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
            OpenSomeSiteURL.close()

            tbodyTags = Soup_SomeSite.find("tbody")
            trTags = tbodyTags.find_all("tr", class_="result-item ")

            placeHolder = []

            for trTag in trTags:
                tdTags = trTag.find("td", class_="result-value")
                tdTags_string = tdTags.string
                placeHolder.append(tdTags_string)

            with open("Output.csv", "ab") as f:
                writeFile = csv.writer(f)
                writeFile.writerow(placeHolder)

fileName = "SomeSiteValidURLs.csv"

if __name__ == "__main__":
    t = threading.Thread(target=crawlToCSV, args=(fileName, ))
    t.start()
    t.join()
1 answer

You are not parallelizing this correctly. What you actually want is for the work done inside the for loop to happen concurrently across many workers. Right now you are moving all of the work into a single background thread, which does everything synchronously. That will not improve performance at all (in fact, it will hurt slightly).

Here is an example that uses a ThreadPool to parallelize the network operations and the parsing. It is not safe to write to a CSV file from many threads at once, so instead each worker returns the data it would have written back to the parent, and the parent writes all the results to the file at the end.

import urllib2
import csv
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool  # This is a thread-based Pool
from multiprocessing import cpu_count


def crawlToCSV(URLrecord):
    OpenSomeSiteURL = urllib2.urlopen(URLrecord)
    Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
    OpenSomeSiteURL.close()

    tbodyTags = Soup_SomeSite.find("tbody")
    trTags = tbodyTags.find_all("tr", class_="result-item ")

    placeHolder = []

    for trTag in trTags:
        tdTags = trTag.find("td", class_="result-value")
        tdTags_string = tdTags.string
        placeHolder.append(tdTags_string)

    return placeHolder

if __name__ == "__main__":
    fileName = "SomeSiteValidURLs.csv"
    pool = Pool(cpu_count() * 2)  # Creates a Pool with cpu_count * 2 threads.
    with open(fileName, "rb") as f:
        # results is a list of all the placeHolder lists returned from each call to crawlToCSV
        results = pool.map(crawlToCSV, f)
    with open("Output.csv", "ab") as f:
        writeFile = csv.writer(f)
        for result in results:
            writeFile.writerow(result)

Note that in Python, threads only really speed up I/O operations. Because of the GIL, CPU-bound work (such as the BeautifulSoup parsing/searching) cannot actually run in parallel across threads, since only one thread can execute CPU-bound operations at a time. So you still may not see the speedup you were hoping for with this approach. When you need to speed up CPU-bound operations in Python, you have to use multiple processes instead of threads. Fortunately, you can easily make this script run with multiple processes instead of multiple threads: just change from multiprocessing.dummy import Pool to from multiprocessing import Pool . No other changes are required.
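For reference, a minimal sketch of what the process-based variant might look like, assuming the crawlToCSV function from the example above stays defined at module level (it has to, so that it can be pickled and handed to the worker processes); the only functional difference from the threaded version is the import:

import csv
from multiprocessing import Pool, cpu_count  # process-based Pool instead of multiprocessing.dummy

if __name__ == "__main__":
    fileName = "SomeSiteValidURLs.csv"
    pool = Pool(cpu_count() * 2)  # worker processes instead of threads
    with open(fileName, "rb") as f:
        # Each URL line is fetched and parsed in a child process.
        results = pool.map(crawlToCSV, f)
    with open("Output.csv", "ab") as f:
        writeFile = csv.writer(f)
        for result in results:
            writeFile.writerow(result)

With processes the BeautifulSoup work genuinely runs in parallel, but every URL and every returned placeHolder list has to be pickled and shuttled between processes, so the per-item overhead is a bit higher than with threads.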

Edit:

If you need to scale this to a file with 10,000,000 lines, you will have to adjust the code a bit - Pool.map converts the iterable you pass it into a list before sending it out to the workers, which obviously is not going to work well with a 10-million-entry list; holding all of that in memory is likely to bring your system to its knees. The same problem applies to storing all of the results in a list. Instead, you should use Pool.imap :

imap(func, iterable[, chunksize])

A lazier version of map().

The chunksize argument is the same as the one used by the map() method. For very long iterables, using a large value for chunksize can make the job complete much faster than using the default value of 1.

if __name__ == "__main__":
    fileName = "SomeSiteValidURLs.csv"
    FILE_LINES = 10000000
    NUM_WORKERS = cpu_count() * 2
    # Try to get a good chunksize. You're probably going to have to tweak this,
    # though. Try smaller and larger values and see how performance changes.
    chunksize = FILE_LINES // (NUM_WORKERS * 4)
    pool = Pool(NUM_WORKERS)

    with open(fileName, "rb") as f:
        result_iter = pool.imap(crawlToCSV, f, chunksize)
        with open("Output.csv", "ab") as out:
            writeFile = csv.writer(out)
            for result in result_iter:  # lazily iterate over results.
                writeFile.writerow(result)

With imap we never load all of f into memory at once, nor do we store all of the results in memory at once. The most we ever hold in memory is chunksize lines of f , which should be much more manageable.
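If the order of the rows in Output.csv does not matter to you, a related option (not used in the code above) is Pool.imap_unordered, which yields results as workers finish them rather than in input order, so a batch of slow pages does not hold up rows that are already done. A minimal sketch, again assuming the crawlToCSV function defined earlier and an arbitrary starting chunksize:

if __name__ == "__main__":
    fileName = "SomeSiteValidURLs.csv"
    NUM_WORKERS = cpu_count() * 2
    pool = Pool(NUM_WORKERS)

    with open(fileName, "rb") as f:
        # imap_unordered yields results chunk by chunk as workers complete them,
        # not in the original input order.
        result_iter = pool.imap_unordered(crawlToCSV, f, 100)  # chunksize=100 is a guess; tune it
        with open("Output.csv", "ab") as out:
            writeFile = csv.writer(out)
            for result in result_iter:
                writeFile.writerow(result)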


Source: https://habr.com/ru/post/1200630/

