Python: using Beautiful Soup to extract specific content from HTML

I have decided to parse content from a website, for example http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx

I want to extract the ingredients into a text file. The ingredients are located in:

<div class="ingredients" style="margin-top: 10px;">

and in this every ingredient is stored between

<li class="plaincharacterwrap">
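
Put together, the relevant part of the page presumably looks something like this (a reconstruction from the description above; whether the items sit inside a <ul>, and the exact whitespace, are guesses):

    <div class="ingredients" style="margin-top: 10px;">
        <ul>
            <li class="plaincharacterwrap">1/4 cup olive oil</li>
            <li class="plaincharacterwrap">1 cup chicken broth</li>
            <!-- ... one <li> per ingredient ... -->
        </ul>
    </div>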

Someone was kind enough to provide code using a regular expression, but it gets confused when you move from site to site. So I wanted to use Beautiful Soup instead, since it has many built-in functions. However, I am confused about how to actually do this.

Code:

    import re
    import urllib2, sys
    from BeautifulSoup import BeautifulSoup, NavigableString

    html = urllib2.urlopen("http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx")
    soup = BeautifulSoup(html)

    try:
        ingrdiv = soup.find('div', attrs={'class': 'ingredients'})
    except IOError:
        print 'IO error'

Where do I start? I want to find the div with that class, and then extract every ingredient contained in those li elements.

Any help would be appreciated! Thanks!

+4
2 answers
    import urllib2
    import BeautifulSoup

    def main():
        url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
        data = urllib2.urlopen(url).read()
        bs = BeautifulSoup.BeautifulSoup(data)

        # Locate the ingredients <div>, then grab the text of each <li> inside it.
        ingreds = bs.find('div', {'class': 'ingredients'})
        ingreds = [s.getText().strip() for s in ingreds.findAll('li')]

        # Write one ingredient per line to a text file.
        fname = 'PorkChopsRecipe.txt'
        with open(fname, 'w') as outf:
            outf.write('\n'.join(ingreds))

    if __name__ == "__main__":
        main()

which produces

    1/4 cup olive oil
    1 cup chicken broth
    2 cloves garlic, minced
    1 tablespoon paprika
    1 tablespoon garlic powder
    1 tablespoon poultry seasoning
    1 teaspoon dried oregano
    1 teaspoon dried basil
    4 thick cut boneless pork chops
    salt and pepper to taste

Following up on @eyquem's answer:

    from time import clock
    import urllib
    import re
    import BeautifulSoup
    import lxml.html

    start = clock()
    url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
    data = urllib.urlopen(url).read()
    print "Loading took", (clock()-start), "s"

    # by regex
    start = clock()
    x = data.find('Ingredients</h3>')
    patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
    res1 = '\n'.join(patingr.findall(data,x))
    print "Regex parse took", (clock()-start), "s"

    # by BeautifulSoup
    start = clock()
    bs = BeautifulSoup.BeautifulSoup(data)
    ingreds = bs.find('div', {'class': 'ingredients'})
    res2 = '\n'.join(s.getText().strip() for s in ingreds.findAll('li'))
    print "BeautifulSoup parse took", (clock()-start), "s - same =", (res2==res1)

    # by lxml
    start = clock()
    lx = lxml.html.fromstring(data)
    ingreds = lx.xpath('//div[@class="ingredients"]//li/text()')
    res3 = '\n'.join(s.strip() for s in ingreds)
    print "lxml parse took", (clock()-start), "s - same =", (res3==res1)

gives

    Loading took 1.09091222621 s
    Regex parse took 0.000432703726233 s
    BeautifulSoup parse took 0.28126133314 s - same = True
    lxml parse took 0.0100940499505 s - same = True

The regex is much faster (except when it's wrong); but if you consider loading the page and parsing it as a whole, BeautifulSoup still accounts for only about 20% of the total runtime. If you are seriously concerned about speed, I recommend lxml.

+4

Yes, a dedicated regular expression pattern must be written for each site.

But I think that:

1. The processing done with Beautiful Soup must also be tailored to each site.

2. Regular expressions are not that hard to write, and with a little practice it can be done quickly.

I am curious to see what kind of processing I would need to do with Beautiful Soup to get the same result that I obtained in a few minutes with a regex. I once tried to learn Beautiful Soup, but I couldn't make sense of that mess. I should try again, now that I am a little more skilled in Python. But regular expressions have been fine and sufficient for me so far.

Here is the code for this new site:

    import urllib
    import re

    url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
    sock = urllib.urlopen(url)
    ch = sock.read()
    sock.close()

    x = ch.find('Ingredients</h3>')
    patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
    print '\n'.join(patingr.findall(ch,x))

EDIT

I downloaded and installed BeautifulSoup and compared it with the regex approach.

I don't think I made a mistake in the comparison code:

    import urllib
    import re
    from time import clock
    import BeautifulSoup

    url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
    data = urllib.urlopen(url).read()

    te = clock()
    x = data.find('Ingredients</h3>')
    patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
    res1 = '\n'.join(patingr.findall(data,x))
    t1 = clock()-te

    te = clock()
    bs = BeautifulSoup.BeautifulSoup(data)
    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [s.getText().strip() for s in ingreds.findAll('li')]
    res2 = '\n'.join(ingreds)
    t2 = clock()-te

    print res1
    print
    print res2
    print
    print 'res1==res2 is ',res1==res2
    print '\nRegex :',t1
    print '\nBeautifulSoup :',t2
    print '\nBeautifulSoup execution time / Regex execution time ==',t2/t1

result

    1/4 cup olive oil
    1 cup chicken broth
    2 cloves garlic, minced
    1 tablespoon paprika
    1 tablespoon garlic powder
    1 tablespoon poultry seasoning
    1 teaspoon dried oregano
    1 teaspoon dried basil
    4 thick cut boneless pork chops
    salt and pepper to taste

    1/4 cup olive oil
    1 cup chicken broth
    2 cloves garlic, minced
    1 tablespoon paprika
    1 tablespoon garlic powder
    1 tablespoon poultry seasoning
    1 teaspoon dried oregano
    1 teaspoon dried basil
    4 thick cut boneless pork chops
    salt and pepper to taste

    res1==res2 is  True

    Regex : 0.00210892725193

    BeautifulSoup : 2.32453566026

    BeautifulSoup execution time / Regex execution time == 1102.23605776

No comment!

EDIT 2

I realized that in my code I am not using a regular expression alone; I am using a method that combines a regular expression with find().

This is the method I use when I resort to regular expressions, because in some cases it speeds up the processing: find() is very fast, so it can be used to narrow down the region that the regex has to scan.
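
To make the idea concrete, here is a minimal sketch of the technique (the page text is a made-up stand-in; in the real code, data comes from urllib.urlopen(url).read()). It relies on the fact that a compiled pattern's findall() accepts a start offset as its second argument:

    import re

    # Made-up stand-in for the downloaded page, so the sketch is self-contained.
    data = ('<h3>Ingredients</h3>\r\n'
            '<li class="plaincharacterwrap">\r\n    1 cup chicken broth</li>\r\n')

    # find() locates the start of the interesting region very quickly...
    x = data.find('Ingredients</h3>')

    # ...and findall(data, x) starts matching at that offset, so the regex
    # never has to scan the part of the page before the marker.
    patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
    print patingr.findall(data, x)    # -> ['1 cup chicken broth']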

To know exactly what we are comparing, here are the different versions of the code.

In snippets 3 and 4, I took into account Achim's remarks in another thread: using re.IGNORECASE and re.DOTALL, and ["\'] instead of ".

These snippets are kept separate because they must be executed in different files in order to get reliable results: I don't know why, but if all of them run in one file, some of the resulting times are very different (0.00075 instead of 0.0022, for example).
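
As an aside, run-to-run variation like this is what the standard library's timeit module is designed to smooth out: it repeats a statement many times and reports the aggregate. A minimal sketch of how one measurement could be redone with it (the page text here is again a made-up stand-in):

    import re
    import timeit

    # Made-up stand-in for the downloaded page, repeated so the regex has
    # something non-trivial to scan; the real benchmark uses the fetched HTML.
    data = ('<li class="plaincharacterwrap">\r\n'
            '    1 cup chicken broth</li>\r\n') * 100

    patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')

    # timeit calls the function 1000 times and returns the total elapsed time,
    # which smooths out the noise of timing a single call with clock().
    total = timeit.timeit(lambda: patingr.findall(data), number=1000)
    print 'Average regex time per call :', total / 1000, 's'

With that caveat, here are the snippets and their timings: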

    import urllib
    import re
    import BeautifulSoup
    from time import clock

    url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
    data = urllib.urlopen(url).read()

    # Simple regex , without x
    te = clock()
    patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
    res0 = '\n'.join(patingr.findall(data))
    t0 = clock()-te
    print '\nSimple regex , without x :',t0

and

    # Simple regex , with x
    te = clock()
    x = data.find('Ingredients</h3>')
    patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
    res1 = '\n'.join(patingr.findall(data,x))
    t1 = clock()-te
    print '\nSimple regex , with x :',t1

and

    # Regex with flags , without x and y
    te = clock()
    patingr = re.compile('<li class=["\']plaincharacterwrap["\']>\r\n +(.+?)</li>\r\n',
                         flags=re.DOTALL|re.IGNORECASE)
    res10 = '\n'.join(patingr.findall(data))
    t10 = clock()-te
    print '\nRegex with flags , without x and y :',t10

and

    # Regex with flags , with x and y
    te = clock()
    x = data.find('Ingredients</h3>')
    y = data.find('h3>\r\n Footnotes</h3>\r\n')
    patingr = re.compile('<li class=["\']plaincharacterwrap["\']>\r\n +(.+?)</li>\r\n',
                         flags=re.DOTALL|re.IGNORECASE)
    res11 = '\n'.join(patingr.findall(data,x,y))
    t11 = clock()-te
    print '\nRegex with flags , with x and y :',t11

and

    # BeautifulSoup
    te = clock()
    bs = BeautifulSoup.BeautifulSoup(data)
    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [s.getText().strip() for s in ingreds.findAll('li')]
    res2 = '\n'.join(ingreds)
    t2 = clock()-te
    print '\nBeautifulSoup :',t2

result

    Simple regex , without x : 0.00230488284125

    Simple regex , with x : 0.00229121279385

    Regex with flags , without x and y : 0.00758719458758

    Regex with flags , with x and y : 0.00183724493364

    BeautifulSoup : 2.58728860791

Using x does not affect speed for a simple regular expression.

The regex with flags and without x and y takes longer, and its result does not match the others, since it catches an additional chunk of text. So in a real application, it is the regex with flags and with x/y that would have to be used.

The more complex regex, with flags and with x and y, takes about 20% less time.

Well, the results don't change very much, with or without x / y.

So my conclusion is the same:

using a regular expression, whether resorting to find() or not, remains about 1000 times faster than BeautifulSoup, and I estimate 100 times faster than lxml (I did not install lxml).

In response to what you wrote, Hugh, I would say:

When a regular expression is wrong, it is neither faster nor slower. It simply doesn't work.

When a regular expression is wrong, the coder fixes it, that's all.

I don't understand why 95% of the people on stackoverflow.com want to convince the other 5% that regular expressions must not be used to parse HTML or XML or anything else. I say "analyze", not "parse". As I understand it, a parser first analyzes the WHOLE of the text and then exposes the content of whatever elements we want. A regex, by contrast, just matches what is searched for; it does not build an HTML/XML tree or whatever else a parser does (and that I do not know very well).
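
To illustrate the distinction on a toy document (a sketch, reusing the same BeautifulSoup 3 calls as the code above):

    import re
    import BeautifulSoup

    doc = '<div class="ingredients"><li class="plaincharacterwrap">olive oil</li></div>'

    # A parser builds a tree: the div becomes an object whose children can be walked.
    tree = BeautifulSoup.BeautifulSoup(doc)
    div = tree.find('div', {'class': 'ingredients'})
    print div.findAll('li')[0].getText()             # -> olive oil

    # A regex builds nothing: it simply returns the matched substrings.
    print re.findall(r'<li[^>]*>(.+?)</li>', doc)    # -> ['olive oil']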

So I am very happy with regular expressions. I have no problem writing even very long REs, and regexes let you write programs that must react quickly after analyzing a text. BS or lxml would work too, but it would be a hassle.

I would have other comments, but I do not have time for a subject on which, in the end, I let others do as they prefer.

+2
