Combine items in a list until an item containing specific text is found?

This is a bit difficult to explain.

I am scraping some web pages with BeautifulSoup and trying to organize the results in a list. I only retrieve the elements on the page that have a text class, like this:

import requests
from bs4 import BeautifulSoup, SoupStrainer

content = requests.get(url, verify=True)
soup = BeautifulSoup(content.text, 'lxml', parse_only=SoupStrainer('p'))
filtered_soup = soup.find_all("span", {"class":["text",
                                                "indent-1"]})
line_list = [line for line in filtered_soup]
#text_list = [line.get_text() for line in filtered_soup]

This works fine, but I would also like to combine some of the items in the list. On the web page, some of the class="text..." elements also have id="en...". Logically they should be the parents of the other class="text..." elements, but the web page was not set up that way.

In my list line_list there is an element with both class="text..." and id="en...", followed by several elements with only class="text...", then another element with both class="text..." and id="en...", and the pattern keeps repeating. You can picture it like this:

line_list = [A, a, a, a, B, b, b, C, c, c, c, c]

Now here is the part that is hard to explain. Say line_list[0] has both attributes, line_list[1] through line_list[3] have only the class, and line_list[4] has both again. I would like to iterate through line_list and concatenate the elements into one line. But when the iteration hits an element with both "id" and "class" (i.e. line_list[4]), I would like it to start a new line.

Or, if someone can come up with a better way to do this, that would be awesome. I was going to try this:

line_string = ''.join(line_list)
split_list = line_string.split('id="en')

But join fails on line_list because its items are Tag objects rather than strings, so line_string never gets built.

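Or would a dictionary be better, with each "id" element as a key and the elements that follow it as its values? Something like this: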

line_dic = {A: [a, a, a], B: [b, b], C: [c, c, c, c]}

The html looks something like this:

line_list = [<span class="text 1" id="en-13987">A<span class="small-caps" style="font-variant: small-caps">A</span></span>,
             <span class="indent-1"><span class="indent-1-breaks">    </span><span class="text 1">a</span></span>,
             <span class="text 1">a</span>,
             <span class="text 2" id="en-13988">B<span class="small-caps" style="font-variant: small-caps">B</span>B</span>,
             <span class="indent-1"><span class="indent-1-breaks">    </span><span class="text 2">b<span class="small-caps" style="font-variant: small-caps">b</span>b</span></span>,
             <span class="text 2">b<span class="small-caps" style="font-variant: small-caps">b</span>b</span>,
             <span class="text 3" id="en-13989">C</span>,
             <span class="indent-1"><span class="indent-1-breaks">    </span><span class="text 3">c<span class="small-caps" style="font-variant: small-caps">c</span>c</span></span>,
             <span class="text 3">c<span class="small-caps" style="font-variant: small-caps">c</span>c</span>,]

A plain loop that starts a new group whenever it meets an id="en..." element does this:

text_list = []
current = []
for line in line_list:
    if line.get('id', '').startswith('en'):
        if current:
            text_list.append(' '.join(current))
            current = []
    current.append(line.text)
if current:
    text_list.append(' '.join(current))

For example, with this input:

import bs4

content = '''
<span class='text indent-1' id='en00'>And one</span>
<span class='text indent-1'>And two</span>
<span class='text indent-1'>And three</span>
<span class='text indent-1' id='en01'>And four</span>
<span class='text indent-1'>And five</span>
'''

soup = bs4.BeautifulSoup(content, 'html.parser')
filtered_soup = soup.find_all("span", {"class":["text", "indent-1"]})
line_list = [line for line in filtered_soup]

After running the loop above, for x in text_list: print(x) prints:

And one And two And three
And four And five

More generally, the grouping logic can be factored out into a reusable generator:

def has_id_en(elem):
    return elem.get('id', '').startswith('en')

def segment(sequence, is_head):
    current = []
    for x in sequence:
        if is_head(x):
            if current:
                yield current
                current = []
        current.append(x)
    if current:
        yield current

text_list = [' '.join(e.text for e in bunch)
             for bunch in segment(line_list, has_id_en)]

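This way segment itself does not depend on bs4: the test for a "head" element is passed in as a function, so the generator can be reused on other sequences.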
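To illustrate that point, here is a minimal sketch that runs the same generator over the letter list from the question instead of bs4 tags. Using str.isupper as the head test is an assumption that stands in for the real id="en..." check, and the name letters is made up for the example.

```python
def segment(sequence, is_head):
    """Yield lists of items, starting a new list at each head item."""
    current = []
    for x in sequence:
        if is_head(x) and current:
            yield current
            current = []
        current.append(x)
    if current:
        yield current

# Stand-in data: uppercase letters play the role of id="en..." elements.
letters = ['A', 'a', 'a', 'a', 'B', 'b', 'b', 'C', 'c', 'c', 'c', 'c']

groups = list(segment(letters, str.isupper))
for group in groups:
    print(' '.join(group))
```

Each printed line begins with a head item followed by the items that belong to it, which is the shape the question asks for.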

You could use itertools.groupby, something like this:

import itertools

def has_id_en(elem):
    # return True if the elem has id="en..."
    return elem.get('id', '').startswith('en')

for is_id_en, elems in itertools.groupby(filtered_soup, has_id_en):
    if is_id_en:
        # this is the parent
        continue
    else:
        # do something with this group of elems (elems is an iterator)
        ...
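
As a rough sketch of how the groupby loop could be completed, the version below remembers the most recent parent and attaches the following group to it, building the dictionary from the question. It uses plain letters instead of bs4 tags, and is_head is a hypothetical stand-in for has_id_en. It also assumes the parents are unique and hashable, since they become dictionary keys.

```python
import itertools

def is_head(item):
    # Hypothetical stand-in for the real id="en..." test.
    return item.isupper()

letters = ['A', 'a', 'a', 'a', 'B', 'b', 'b', 'C', 'c', 'c', 'c', 'c']

line_dic = {}
parent = None
for head, group in itertools.groupby(letters, is_head):
    items = list(group)  # groupby yields one-shot iterators, so materialize
    if head:
        # A run of consecutive parents arrives as a single group.
        for p in items:
            line_dic[p] = []
        parent = items[-1]
    else:
        # Children seen before any parent end up under the key None.
        line_dic.setdefault(parent, []).extend(items)

print(line_dic)
```

Note that groupby only groups consecutive items with the same key, so the parent has to be carried across iterations by hand.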

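itertools.takewhile might seem to fit, but it consumes the first element that fails the predicate, which here is the next "parent" acting as the "separator". A plain loop is simpler: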

def has_both(x):
    return x.isupper() # or whatever your actual condition is

line_dic = {}
last = None
for x in line_list:
    if has_both(x):
        last = x
        line_dic[last] = []
    else:
        line_dic[last].append(x)

{'A': ['a', 'a', 'a'], 'C': ['c', 'c', 'c', 'c'], 'B': ['b', 'b']}

For Python 2.7 and later, you can also use collections.OrderedDict to keep the order in which elements are inserted into the dictionary. Also, if you expect to see "children" elements before any "parent" element, initialize line_dic as {None: []}.
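
Both suggestions can be combined in a short sketch. The input list here is made up for illustration and starts with a "child" that has no "parent", and has_both uses the same placeholder condition as above.

```python
from collections import OrderedDict

def has_both(x):
    return x.isupper()  # placeholder for the real id + class test

# Made-up input: 'x' is a child that appears before any parent.
line_list = ['x', 'A', 'a', 'a', 'B', 'b']

# Pre-seeding the None key gives leading children somewhere to go.
line_dic = OrderedDict([(None, [])])
last = None
for x in line_list:
    if has_both(x):
        last = x
        line_dic[last] = []
    else:
        line_dic[last].append(x)

for key, children in line_dic.items():
    print(key, children)
```

Without the None entry, the first append would raise a KeyError because no parent has been seen yet.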

Source: https://habr.com/ru/post/1569141/