Yes, a specific regular expression pattern must be written for each site.
But I think that:

1- the processing done with BeautifulSoup must also be tailored to each site;

2- regular expressions are not that hard to write, and with a little practice it can be done quickly.
I am curious to see what kind of processing I would need to do with BeautifulSoup to obtain the same result that took me only a few minutes with a regex. I once tried to learn BeautifulSoup, but I couldn't make sense of that mess. I should try again now that I am a little more skilled in Python. But so far regular expressions have been OK and sufficient for me.
Here is the code for this new site:
    import urllib
    import re

    url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
    sock = urllib.urlopen(url)
    ch = sock.read()
    sock.close()

    x = ch.find('Ingredients</h3>')
    patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
    print '\n'.join(patingr.findall(ch,x))
.
EDIT
I downloaded and installed BeautifulSoup and compared it with the regex.

I don't think I made an error in the comparison code:
    import urllib
    import re
    from time import clock
    import BeautifulSoup

    url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
    data = urllib.urlopen(url).read()

    te = clock()
    x = data.find('Ingredients</h3>')
    patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
    res1 = '\n'.join(patingr.findall(data,x))
    t1 = clock()-te

    te = clock()
    bs = BeautifulSoup.BeautifulSoup(data)
    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [s.getText().strip() for s in ingreds.findAll('li')]
    res2 = '\n'.join(ingreds)
    t2 = clock()-te

    print res1
    print
    print res2
    print
    print 'res1==res2 is ',res1==res2
    print '\nRegex :',t1
    print '\nBeautifulSoup :',t2
    print '\nBeautifulSoup execution time / Regex execution time ==',t2/t1
result
    1/4 cup olive oil
    1 cup chicken broth
    2 cloves garlic, minced
    1 tablespoon paprika
    1 tablespoon garlic powder
    1 tablespoon poultry seasoning
    1 teaspoon dried oregano
    1 teaspoon dried basil
    4 thick cut boneless pork chops
    salt and pepper to taste

    1/4 cup olive oil
    1 cup chicken broth
    2 cloves garlic, minced
    1 tablespoon paprika
    1 tablespoon garlic powder
    1 tablespoon poultry seasoning
    1 teaspoon dried oregano
    1 teaspoon dried basil
    4 thick cut boneless pork chops
    salt and pepper to taste

    res1==res2 is  True

    Regex : 0.00210892725193

    BeautifulSoup : 2.32453566026

    BeautifulSoup execution time / Regex execution time == 1102.23605776
No comment!
.
EDIT 2
I realize that in my code I was not using a pure regular expression but a method that combines a regular expression with find().

This is the method I use when I resort to regular expressions, because in some cases it speeds up the processing; that is because the find() function is very fast.
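A minimal sketch of this find()-then-regex technique, on a made-up snippet standing in for the downloaded page:

    import re

    # hypothetical data standing in for the real page
    data = 'junk <h3>Ingredients</h3>\r\n<li class="plaincharacterwrap">\r\n    1 cup flour</li>\r\n junk'

    # str.find() locates a cheap landmark very quickly...
    x = data.find('Ingredients</h3>')

    # ...and the compiled pattern then searches only from that offset onwards
    patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
    print patingr.findall(data, x)   # -> ['1 cup flour']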
To know exactly what we are comparing, here are the codes.

In codes 3 and 4, I took into account Achim's remarks in another thread: using re.IGNORECASE and re.DOTALL, and ["\'] instead of ".
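For clarity, the character class ["\'] simply matches either kind of quote around the attribute value; a toy example:

    import re

    # ["\'] matches a double OR a single quote
    pat = re.compile('class=["\']x["\']')
    print pat.findall('<a class="x"> <a class=\'x\'>')   # -> ['class="x"', "class='x'"]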
These codes are kept separate because they must be executed in different files in order to obtain reliable results: I don't know why, but if all the codes are run in a single file, some of the resulting times are very different (0.00075 instead of 0.0022, for example).
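As an aside, the standard timeit module might give more stable timings than clock(); a minimal sketch (assuming data, x and patingr are defined as in the codes below):

    import timeit

    # run the findall step 100 times and average, to smooth out noise
    t = timeit.timeit("patingr.findall(data, x)",
                      setup="from __main__ import patingr, data, x",
                      number=100)
    print 'average time per run :', t/100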
    import urllib
    import re
    import BeautifulSoup
    from time import clock

    url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
    data = urllib.urlopen(url).read()
and
    # Simple regex , with x
    te = clock()
    x = data.find('Ingredients</h3>')
    patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
    res1 = '\n'.join(patingr.findall(data,x))
    t1 = clock()-te
    print '\nSimple regex , with x :',t1
and
    # Regex with flags , without x and y
    te = clock()
    patingr = re.compile('<li class=["\']plaincharacterwrap["\']>\r\n +(.+?)</li>\r\n',
                         flags=re.DOTALL|re.IGNORECASE)
    res10 = '\n'.join(patingr.findall(data))
    t10 = clock()-te
    print '\nRegex with flags , without x and y :',t10
and
    # Regex with flags , with x and y
    te = clock()
    x = data.find('Ingredients</h3>')
    y = data.find('h3>\r\n Footnotes</h3>\r\n')
    patingr = re.compile('<li class=["\']plaincharacterwrap["\']>\r\n +(.+?)</li>\r\n',
                         flags=re.DOTALL|re.IGNORECASE)
    res11 = '\n'.join(patingr.findall(data,x,y))
    t11 = clock()-te
    print '\nRegex with flags , with x and y :',t11
and
    # BeautifulSoup
    te = clock()
    bs = BeautifulSoup.BeautifulSoup(data)
    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [s.getText().strip() for s in ingreds.findAll('li')]
    res2 = '\n'.join(ingreds)
    t2 = clock()-te
    print '\nBeautifulSoup :',t2
result
    Simple regex , without x : 0.00230488284125
    Simple regex , with x : 0.00229121279385
    Regex with flags , without x and y : 0.00758719458758
    Regex with flags , with x and y : 0.00183724493364
    BeautifulSoup : 2.58728860791
Using x does not change the speed of the simple regex.

The regex with flags and without x and y takes longer, but its result doesn't match the others, because it catches an additional chunk of text. So in a real application, it is the regex with flags and with x/y that would have to be used.

The more complicated regex with flags and with x and y takes 20% less time.

So the results don't change very much, with or without x/y.
Hence my conclusion is the same:

using a regular expression, whether it resorts to find() or not, remains roughly 1000 times faster than BeautifulSoup, and, I estimate, 100 times faster than lxml (I didn't install lxml).
.
In response to what you wrote, Hugh, I would say:

When a regular expression is wrong, it is not a matter of faster or slower: it simply doesn't work.

When a regular expression is wrong, the coder corrects it, that's all.
I don't understand why 95% of the people on stackoverflow.com want to convince the other 5% that regular expressions must not be used to parse HTML or XML or anything else. I say "analyze", not "parse". As I understand it, a parser first analyzes the WHOLE of the text and then displays the content of the elements we want. A regular expression, on the contrary, matches only what is searched for; it doesn't build an HTML/XML tree or do anything else a parser does, and which I don't know very well.
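A toy illustration of the difference I mean, reusing the BeautifulSoup module from above (the HTML snippet is made up):

    import re
    import BeautifulSoup

    html = '<div><p>hello</p></div>'

    # the regex returns only the substrings that were searched for...
    print re.findall('<p>(.+?)</p>', html)   # -> ['hello']

    # ...whereas BeautifulSoup first builds a tree of the whole document,
    # which can then be navigated
    tree = BeautifulSoup.BeautifulSoup(html)
    print tree.div.p.string                  # -> u'hello'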
So I am very pleased with regular expressions. I have no problem writing even very long REs, and regular expressions allow me to run programs that must react quickly after analyzing a text. BS or lxml would work, but it would be a hassle.
I would have other remarks to make, but I don't have time for a subject on which, in the end, I let others do as they prefer.