Convert html to text using python language

import urllib2

from BeautifulSoup import *

resp = urllib2.urlopen("file:///D:/sample.html")

rawhtml = resp.read()

resp.close()
print rawhtml

I use this code to get text from an html document, but it also gives me html code. What should I do to extract only text from an html document?

+3
source share
5 answers

Please note that your example does not use Beautifulsoup. See doc and follow the examples below.

The following example, taken from the link above, searches for items soupfor <td>.

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.icc-ccs.org/prc/piracyreport.php")
soup = BeautifulSoup(page)
for incident in soup('td', width="90%"):
    where, linebreak, what = incident.contents[:3]
    print where.strip()
    print what.strip()
    print
+4
source

There is a way in the module documentation itself to extract all the lines from the document. @ http://www.crummy.com/software/BeautifulSoup/

from BeautifulSoup import BeautifulSoup
import urllib2

resp = urllib2.urlopen("http://www.google.com")
rawhtml = resp.read()
soup = BeautifulSoup(rawhtml)

all_strings = [e for e in soup.recursiveChildGenerator() 
         if isinstance(e,unicode)])
print all_strings
+3

(. 60):

def gettextonly(soup):
    v=soup.string
    if v == None:
        c=soup.contents
        resulttext=''
        for t in c:
            subtext=gettextonly(t)
            resulttext+=subtext+'\n'
        return resulttext
    else:
        return v.strip()

:

>>>from BeautifulSoup import BeautifulSoup

>>>doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
>>>''.join(doc)
'<html><head><title>Page title</title></head><body><p id="firstpara" align="center">
This is paragraph <b>one</b>.<p id="secondpara" align="blah">This is
paragraph<b>two</b>.</html>'

>>>soup = BeautifulSoup(''.join(doc))
>>>gettextonly(soup)
u'Page title\n\nThis is paragraph\none\n.\n\nThis is paragraph\ntwo\n.\n\n\n\n'

, , (\n).

, , (. 61):

import re
def separatewords(text):
    splitter=re.compile('\\W*')
    return [s.lower() for s in splitter.split(text) if s!='']

:

>>>separatewords(gettextonly(soup))
[u'page', u'title', u'this', u'is', u'paragraph', u'one', u'this', u'is', 
u'paragraph', u'two']
+1

html2text.

- "lynx -dump"

+1

html2text , . html2text auml ouml, Auml Ouml .

unicode_coded_entities_html = unicode(BeautifulStoneSoup(html,convertEntities=BeautifulStoneSoup.HTML_ENTITIES))
text = html2text.html2text(unicode_coded_entities_html)

html2text converts to markdown text syntax, so the converted text can also be returned to the html format (of course, some information will be lost during the conversion).

0
source

Source: https://habr.com/ru/post/1762122/


All Articles