Parse a nested HTML list with BeautifulSoup

Question

I need to parse a nested HTML list and convert it to a parent-child dict. Given this list:

    <ul>
      <li>Operating System
        <ul>
          <li>Linux
            <ul>
              <li>Debian</li>
              <li>Fedora</li>
              <li>Ubuntu</li>
            </ul>
          </li>
          <li>Windows</li>
          <li>OS X</li>
        </ul>
      </li>
      <li>Programming Languages
        <ul>
          <li>Python</li>
          <li>C#</li>
          <li>Ruby</li>
        </ul>
      </li>
    </ul>

I want to convert it to a dict like this:

    {
        'Operating System': {
            'Linux': {
                'Debian': None,
                'Fedora': None,
                'Ubuntu': None,
            },
            'Windows': None,
            'OS X': None,
        },
        'Programming Languages': {
            'Python': None,
            'C#': None,
            'Ruby': None,
        }
    }

My initial attempt was to use find_all('li', recursive=False). It returns the top-level elements (Operating System and Programming Languages) as well as their children.

How can I do this with BeautifulSoup?

python dictionary html-parsing beautifulsoup

flowfree Jul 25 '13 at 5:56

1 answer

Zero piraeus · Accepted Answer · 2013-07-25T06:39:54+0000

Here is one way:

    def dictify(ul):
        result = {}
        for li in ul.find_all("li", recursive=False):
            key = next(li.stripped_strings)
            ul = li.find("ul")
            if ul:
                result[key] = dictify(ul)
            else:
                result[key] = None
        return result
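What keeps each level separate here is recursive=False, which limits find_all to direct children of the tag it is called on. A small sketch of the difference, assuming Python 3 and the built-in "html.parser" backend (the sample markup and variable names are made up for illustration):

```python
from bs4 import BeautifulSoup

# Illustrative markup: "B" is nested one level below "A".
soup = BeautifulSoup(
    "<ul><li>A<ul><li>B</li></ul></li><li>C</li></ul>",
    "html.parser",
)
top = soup.ul

# Default (recursive=True): every <li> descendant, at any depth.
all_labels = [next(li.stripped_strings) for li in top.find_all("li")]

# recursive=False: only the <li> tags that are direct children of `top`.
direct_labels = [
    next(li.stripped_strings)
    for li in top.find_all("li", recursive=False)
]

# Called on the soup object itself, recursive=False only inspects the
# document's top-level nodes, so no <li> is found here.
from_root = soup.find_all("li", recursive=False)

print(all_labels)
print(direct_labels)
print(from_root)
```

So the call has to be made on the ul tag whose items you want, not on the whole document.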

Usage example:

    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup("""
    ... <ul>
    ...   <li>Operating System
    ...     <ul>
    ...       <li>Linux
    ...         <ul>
    ...           <li>Debian</li>
    ...           <li>Fedora</li>
    ...           <li>Ubuntu</li>
    ...         </ul>
    ...       </li>
    ...       <li>Windows</li>
    ...       <li>OS X</li>
    ...     </ul>
    ...   </li>
    ...   <li>Programming Languages
    ...     <ul>
    ...       <li>Python</li>
    ...       <li>C#</li>
    ...       <li>Ruby</li>
    ...     </ul>
    ...   </li>
    ... </ul>
    ... """)
    >>> ul = soup.body.ul
    >>> from pprint import pprint
    >>> pprint(dictify(ul), width=1)
    {u'Operating System': {u'Linux': {u'Debian': None,
                                      u'Fedora': None,
                                      u'Ubuntu': None},
                           u'OS X': None,
                           u'Windows': None},
     u'Programming Languages': {u'C#': None,
                                u'Python': None,
                                u'Ruby': None}}
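The session above is from Python 2 and relies on the parser wrapping the fragment in a body tag (as lxml does). For reference, here is a self-contained sketch of the same approach under different assumptions: Python 3 and the built-in "html.parser" backend, which does not add html/body wrappers, so the list is reached through soup.ul instead of soup.body.ul. The names HTML, child_ul and tree are illustrative only.

```python
from bs4 import BeautifulSoup

HTML = """
<ul>
  <li>Operating System
    <ul>
      <li>Linux
        <ul>
          <li>Debian</li>
          <li>Fedora</li>
          <li>Ubuntu</li>
        </ul>
      </li>
      <li>Windows</li>
      <li>OS X</li>
    </ul>
  </li>
  <li>Programming Languages
    <ul>
      <li>Python</li>
      <li>C#</li>
      <li>Ruby</li>
    </ul>
  </li>
</ul>
"""


def dictify(ul):
    """Map each direct <li> of `ul` to a nested dict, or None for a leaf."""
    result = {}
    for li in ul.find_all("li", recursive=False):
        # The first non-blank string inside the <li> is its own label.
        key = next(li.stripped_strings)
        # The first <ul> below this item, if any, holds its children.
        child_ul = li.find("ul")
        result[key] = dictify(child_ul) if child_ul is not None else None
    return result


# "html.parser" is an assumption; it adds no <html>/<body> wrappers.
soup = BeautifulSoup(HTML, "html.parser")
tree = dictify(soup.ul)
print(tree)
```

Using a separate name for the nested list avoids rebinding the ul argument inside the loop; the behaviour is otherwise the same as in the answer. Note that an li with no text at all would make next() raise StopIteration, which this sketch does not handle.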
