Beautiful Soup 4 find_all does not find links that Beautiful Soup 3 finds

I noticed a really annoying bug: BeautifulSoup 4 (package: bs4) often finds fewer tags than the previous version (package: BeautifulSoup).

Here is a reproducible instance of the problem:

import requests
import bs4
import BeautifulSoup

r = requests.get('http://wordpress.org/download/release-archive/')
s4 = bs4.BeautifulSoup(r.text)
s3 = BeautifulSoup.BeautifulSoup(r.text)

print 'With BeautifulSoup 4 : {}'.format(len(s4.findAll('a')))
print 'With BeautifulSoup 3 : {}'.format(len(s3.findAll('a')))

Output:

With BeautifulSoup 4 : 557
With BeautifulSoup 3 : 1701

As you can see, the difference is not minor.

Here are the exact versions of the modules, in case anyone is wondering:

In [20]: bs4.__version__
Out[20]: '4.2.1'

In [21]: BeautifulSoup.__version__
Out[21]: '3.2.1'
1 answer

You have lxml installed, which means that BeautifulSoup 4 will use that parser in preference to the standard-library html.parser option.

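You can upgrade lxml to 3.2.1 (which, for me, returns 1701 results for your test page); lxml itself uses libxml2 and libxslt, which may also be to blame here. You may have to upgrade those instead, or as well. See the lxml requirements page; libxml2 2.7.8 or newer is currently recommended.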

Or explicitly specify the other parser when parsing the page:

s4 = bs4.BeautifulSoup(r.text, 'html.parser')
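
To check which of the two counts is closer to the truth without relying on either Beautiful Soup version or on lxml, the <a> start tags can be counted with the standard library's html.parser directly. The following is only a minimal sketch in Python 3 (the question itself uses Python 2): AnchorCounter and count_anchors are hypothetical helper names, and the sample markup is invented for illustration.

```python
from html.parser import HTMLParser


class AnchorCounter(HTMLParser):
    """Counts every <a> start tag reported by the tokenizer (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this hook,
        # so <A> and <a> are both counted.
        if tag == "a":
            self.count += 1


def count_anchors(html):
    """Return the number of <a> start tags found in the given HTML string."""
    counter = AnchorCounter()
    counter.feed(html)
    counter.close()
    return counter.count


# Invented sample markup; in practice, pass r.text from the question instead.
sample = '<ul><li><A href="/a.zip">a</A></li><li><a href="/b.zip">b</a></li></ul>'
print(count_anchors(sample))
```

If the number this reports for the real page matches the Beautiful Soup 3 result, that suggests the lower count comes from the installed lxml/libxml2 build dropping part of the document rather than from find_all itself.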

Source: https://habr.com/ru/post/1538777/

