Splitting an HTML document with BeautifulSoup

We work with long aggregated HTML documents (for conversion to PDF). In some situations, an aggregated HTML document has to be split into sections (separate HTML pages, each starting at an H1 tag) or sub-chapters (separate HTML pages, each starting at an H1 or H2 tag). We already use BeautifulSoup to process the aggregated HTML, but we have not found the right way to use it to extract a subdocument (e.g. from the first H1 to the next H2).

2 answers

I have some experience with BeautifulSoup, and I'm not sure it directly supports what you want to do. Here are two ideas.

Search

Below is the documentation for its search tools. Perhaps you can search for both the H1 and H2 tags and see whether that helps you extract the subdocuments.

http://www.crummy.com/software/BeautifulSoup/documentation.html#Search Tree Analysis
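
As a rough illustration of the search idea, here is a minimal sketch. It assumes the current bs4 package (not the older release the link above documents) and assumes the headings are direct children of the body, which may not hold for your aggregated HTML. The function name split_on_headings and its heading_tags parameter are made up for this example.

```python
from bs4 import BeautifulSoup


def split_on_headings(html, heading_tags=("h1",)):
    """Return a list of HTML fragments, each starting at a heading tag.

    Assumption: the headings are top-level children of <body> (or of the
    document itself when there is no <body>). Nested headings are not found.
    """
    soup = BeautifulSoup(html, "html.parser")
    root = soup.body or soup

    chunks = []
    current = []
    for node in root.children:
        # Text nodes have no usable tag name, so fall back to None.
        name = getattr(node, "name", None)
        if name in heading_tags and current:
            # A new heading closes the fragment collected so far.
            chunks.append("".join(str(n) for n in current))
            current = []
        current.append(node)

    if current:
        chunks.append("".join(str(n) for n in current))
    return chunks
```

Calling it with heading_tags=("h1",) would give sections, and with heading_tags=("h1", "h2") sub-chapters. Anything that precedes the first heading ends up in a fragment of its own.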

Pretty Print + grep

BeautifulSoup has a very useful prettify feature for pretty-printing HTML. Once the document is pretty-printed, each H1 or H2 tag sits on its own line, so you can use text-processing utilities such as grep to find the line numbers that contain H1 and H2 and then take the text between them.

http://www.crummy.com/software/BeautifulSoup/documentation.html#Print a document
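
The same idea can be sketched without leaving Python, scanning the pretty-printed lines instead of calling grep. This assumes the current bs4 package; heading_line_numbers is a hypothetical helper name, and the line indexes it returns are 0-based, unlike the 1-based numbers grep -n would print.

```python
from bs4 import BeautifulSoup


def heading_line_numbers(html):
    """Pretty-print the HTML and locate the lines that open an H1 or H2.

    Returns (lines, hits): the pretty-printed lines and the 0-based
    indexes of the lines that start with an <h1 or <h2 opening tag.
    """
    pretty = BeautifulSoup(html, "html.parser").prettify()
    lines = pretty.splitlines()
    hits = [
        i
        for i, line in enumerate(lines)
        if line.strip().lower().startswith(("<h1", "<h2"))
    ]
    return lines, hits


def slice_sections(lines, hits):
    """Join the lines from each heading up to (not including) the next one."""
    bounds = hits + [len(lines)]
    return ["\n".join(lines[a:b]) for a, b in zip(bounds, bounds[1:])]
```

Keep in mind that prettify changes the whitespace of the document, so the slices are not byte-for-byte copies of the original markup.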

+2
source

Since nobody has offered you a parser-based solution, may I assume that you will have to manage on your own with regular expressions?

Dane's second point is of the same nature, since the name grep comes from "global regular expression print". But it is complicated by the fact that it needs a preprocessing step using the pretty-printing functionality.

By contrast, regular expressions are a powerful tool that can be applied directly to the text.
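
To make that concrete, here is a minimal sketch that uses only the standard re module. The helper name split_with_regex is invented for the example, and the pattern assumes opening tags are written as <h1 ...> or <h2 ...>; it will also match such text inside comments or scripts, so treat it as a starting point rather than a robust solution.

```python
import re

# Matches the start of an opening <h1> or <h2> tag, in any letter case.
HEADING_START = re.compile(r"<h[12]\b", re.IGNORECASE)


def split_with_regex(html):
    """Cut the raw HTML text at every opening H1 or H2 tag."""
    starts = [m.start() for m in HEADING_START.finditer(html)]
    if not starts:
        return [html]

    parts = []
    # Keep any non-blank text that precedes the first heading.
    if html[: starts[0]].strip():
        parts.append(html[: starts[0]])

    bounds = starts + [len(html)]
    for begin, end in zip(bounds, bounds[1:]):
        parts.append(html[begin:end])
    return parts
```

Changing the character class from [12] to 1 would split on H1 only.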

Could you give more information on what you want to do?

-1
source

Source: https://habr.com/ru/post/1341076/

