Extract string from html tag with beautiful soup

Question

I have an HTML file similar to this in the the_files subdirectory:

 <div class='log'>start</div>
 <div class='ts'>2017-03-14 09:17:52.859 +0800&nbsp;</div><div class='log'>bla bla bla</div>
 <div class='ts'>2017-03-14 09:17:55.619 +0800&nbsp;</div><div class='log'>aba aba aba</div>
 ...
 ...

I want to extract the text of each tag and print it to the terminal, one timestamp and its log message per line, like this:

 2017-03-14 09:17:52.859 +0800 , bla bla bla
 2017-03-14 09:17:55.619 +0800 , aba aba aba
 ...
 ...

I want to ignore the first line, <div class='log'>start</div>.

My code is below:

 import os
 from bs4 import BeautifulSoup

 path = "the_files/"

 def do_task_html():
     dir_path = os.listdir(path)
     for file in dir_path:
         if file.endswith(".html"):
             soup = BeautifulSoup(open(path+file))
             # all timestamps joined into one string
             item1 = [element.text for element in soup.find_all("div", "ts")]
             string1 = ''.join(item1)
             # all log messages joined into one string
             item2 = [element.text for element in soup.find_all("div", "log")]
             string2 = ''.join(item2)
             print(string1 + "," + string2)

This code prints all the timestamps first and then all the log messages, including start:

 2017-03-14 09:17:52.859 +0800 2017-03-14 09:17:55.619 +0800 , start bla bla bla aba aba aba ... ...

Is there any way to fix this?

Thank you for your help.

python beautifulsoup

Ling Mar 24 '17 at 10:50

1 answer

Zroq · Accepted Answer · 2017-03-24T10:54:04+0000

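Find each div with class ts, then print its text together with the text of its next_sibling, which is the matching log div. The leading <div class='log'>start</div> is never printed because it does not follow a ts div. This relies on each log div following its ts div immediately, with no whitespace in between, as in your sample.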

 for div in soup.find_all("div", class_="ts"):
     # pair each timestamp with the div that directly follows it
     print("%s, %s" % (div.get_text(strip=True), div.next_sibling.get_text(strip=True)))

Outputs:

 2017-03-14 09:17:52.859 +0800, bla bla bla
 2017-03-14 09:17:55.619 +0800, aba aba aba
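
If the real files have a newline or spaces between a ts div and its log div, next_sibling may return that whitespace text node instead of the div. A slightly more defensive sketch uses find_next_sibling, which only considers tags. The helper name extract_pairs and the SAMPLE_HTML constant are made up for this illustration, and the sketch assumes every ts div is followed by its own log div.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (two entries only).
SAMPLE_HTML = """<div class='log'>start</div>
<div class='ts'>2017-03-14 09:17:52.859 +0800&nbsp;</div><div class='log'>bla bla bla</div>
<div class='ts'>2017-03-14 09:17:55.619 +0800&nbsp;</div><div class='log'>aba aba aba</div>
"""


def extract_pairs(html):
    """Return (timestamp, message) tuples, one per 'ts' div (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for ts_div in soup.find_all("div", class_="ts"):
        # find_next_sibling looks only at tags, so whitespace between
        # the two divs does not get in the way.
        log_div = ts_div.find_next_sibling("div", class_="log")
        if log_div is None:
            continue  # a timestamp with no log div after it
        pairs.append((ts_div.get_text(strip=True), log_div.get_text(strip=True)))
    return pairs


for timestamp, message in extract_pairs(SAMPLE_HTML):
    print("%s, %s" % (timestamp, message))
```

To use it on the files from the question, read each .html file in the_files/ and pass its contents to extract_pairs in place of SAMPLE_HTML.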
