This is a little difficult to explain.
I am extracting some web pages using BeautifulSoup and trying to organize the results in a list. I only retrieve elements on the page that have a "text" (or "indent-1") class. Like this:
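import requests
from bs4 import BeautifulSoup, SoupStrainer

content = requests.get(url, verify=True)
# Parse only the <p> elements of the page
soup = BeautifulSoup(content.text, 'lxml', parse_only=SoupStrainer('p'))
# Keep spans whose class list contains "text" or "indent-1"
filtered_soup = soup.find_all("span", {"class": ["text", "indent-1"]})
line_list = list(filtered_soup)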
This works fine, but I would also like to combine some of the items in the list. On the web page, some elements with class="text..." also have id="en...". Logically they should be the parents of the other class="text..." elements that follow them, but the web page was not set up that way.
In my list line_list there is an element with both class="text..." and id="en...", then several elements with only class="text...", then another element with class="text..." and id="en...", and this pattern keeps repeating. You can think of it like this:
line_list = [A, a, a, a, B, b, b, C, c, c, c, c]
Now here is the part that is hard to explain. Let's say line_list[0] has both attributes, line_list[1] through line_list[3] have only the class attribute, and line_list[4] has both attributes again. I would like to iterate through line_list and concatenate the elements into one line. But when the iteration hits an element containing both "id" and "class" (i.e. line_list[4]), I would like it to start a new line.
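The iteration described above could be sketched roughly as follows. This is only a minimal sketch under assumptions: the helper names group_lines and lines_as_text are made up for illustration, a "header" is taken to be any element whose id starts with "en", and the code relies only on Tag.get and Tag.get_text from BeautifulSoup.

```python
def group_lines(tags):
    """Split a flat list of tags into groups; a tag whose id starts
    with "en" opens a new group (hypothetical helper)."""
    groups = []
    for tag in tags:
        # Tag.get("id") returns None when the attribute is missing.
        is_header = (tag.get("id") or "").startswith("en")
        if is_header or not groups:
            groups.append([tag])      # start a new line
        else:
            groups[-1].append(tag)    # continue the current line
    return groups


def lines_as_text(tags, sep=" "):
    """Concatenate the text of each group into one string per line."""
    return [
        sep.join(tag.get_text(strip=True) for tag in group)
        for group in group_lines(tags)
    ]
```

Calling lines_as_text(line_list) would then give one string per "id" element, each followed by the text of the elements that come after it. Elements that appear before the first "id" element end up in a group of their own.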
Or, if someone can come up with a better way to do this, that would be awesome. I was going to try this:
line_string = ''.join(line_list)
split_list = line_string.split('id="en')
But the join fails at line_string, because the items in the list are not strings.
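Or would a dictionary be better? Something where each element with an "id" is a key, and the elements with only a "class" that follow it are its values. Like this: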
line_dic = {A: [a, a, a], B: [b, b], C: [c, c, c, c]}
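A dictionary like that could be built in a single pass over the list. The sketch below is an assumption-laden illustration, not a tested solution for the real page: group_by_id is a hypothetical helper name, the keys are the id strings rather than the tags themselves, and only Tag.get from BeautifulSoup is used.

```python
def group_by_id(tags):
    """Map each "en..." id to the list of tags that follow it, up to
    the next tag carrying such an id (hypothetical helper)."""
    groups = {}
    current_id = None
    for tag in tags:
        tag_id = tag.get("id")  # None when the tag has no id attribute
        if tag_id is not None and tag_id.startswith("en"):
            current_id = tag_id
            groups[current_id] = []          # new key, no children yet
        elif current_id is not None:
            groups[current_id].append(tag)   # child of the latest key
        # Tags that appear before the first id are skipped here.
    return groups
```

Since Python 3.7 a dict keeps insertion order, so the keys come out in page order. Using the id string as the key avoids hashing whole tags; the "id" element itself can be stored alongside the children if its text is needed too.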
Here is a sample of the html in line_list:
line_list = [<span class="text 1" id="en-13987">A<span class="small-caps" style="font-variant: small-caps">A</span></span>,
<span class="indent-1"><span class="indent-1-breaks"> </span><span class="text 1">a</span></span>,
<span class="text 1">a</span>,
<span class="text 2" id="en-13988">B<span class="small-caps" style="font-variant: small-caps">B</span>B</span>,
<span class="indent-1"><span class="indent-1-breaks"> </span><span class="text 2">b<span class="small-caps" style="font-variant: small-caps">b</span>b</span></span>,
<span class="text 2">b<span class="small-caps" style="font-variant: small-caps">b</span>b</span>,
<span class="text 3" id="en-13989">C</span>,
<span class="indent-1"><span class="indent-1-breaks"> </span><span class="text 3">c<span class="small-caps" style="font-variant: small-caps">c</span>c</span></span>,
<span class="text 3">c<span class="small-caps" style="font-variant: small-caps">c</span>c</span>,]