Python regex for highlighting script tags

Question

Python regex for highlighting script tags

I'm a little afraid to ask about this, fearing retaliation from SO. "You cannot parse HTML with regular expressions." Why re.subn(r'<(script).*?</\1>', '', data, re.DOTALL)doesn't it separate a multi-line 'script', but only two single-line ones at the end, please?

Thanks HC

>>> import re
>>> data = """\
<nothtml> 
  <head> 
    <title>Regular Expression HOWTO &mdash; Python v2.7.1 documentation</title> 
    <script type="text/javascript"> 
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    '../',
        VERSION:     '2.7.1',
        COLLAPSE_MODINDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script> 
    <script type="text/javascript" src="../_static/jquery.js"></script> 
    <script type="text/javascript" src="../_static/doctools.js"></script>
"""

>>> print (re.subn(r'<(script).*?</\1>', '', data, re.DOTALL)[0])
<nothtml> 
  <head> 
    <title>Regular Expression HOWTO &mdash; Python v2.7.1 documentation</title> 
    <script type="text/javascript"> 
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    '../',
        VERSION:     '2.7.1',
        COLLAPSE_MODINDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script>

+3

python regex

hcvst Jan 15 '11 at 6:38

source share

4 answers

, , , lxml, HTML. (lxml , BeautifulSoup, .)

, , . , HTML-, , , , .

script

HTMLParser lxml:

from lxml import etree
from StringIO import StringIO

broken_html = '''
<html> 
  <head> 
    <title>Regular Expression HOWTO &mdash; Python v2.7.1 documentation</title> 
    <script type="text/javascript"> 
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    '../',
        VERSION:     '2.7.1',
        COLLAPSE_MODINDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script> 
    <script type="text/javascript" src="../_static/jquery.js"></script>
'''

parser = etree.HTMLParser()
tree = etree.parse(StringIO(broken_html), parser)

for s in tree.xpath('//script'):
    s.getparent().remove(s)

print etree.tostring(tree.getroot(), pretty_print=True)

:

<html>
  <head>
    <title>Regular Expression HOWTO &#8212; Python v2.7.1 documentation</title>
  </head>
</html>

lxml

, , , <script>, , Cleaner lxml , :

from lxml.html.clean import Cleaner

broken_html = '''
<html> 
  <head> 
    <title>Regular Expression HOWTO &mdash; Python v2.7.1 documentation</title> 
    <script type="text/javascript"> 
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    '../',
        VERSION:     '2.7.1',
        COLLAPSE_MODINDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script> 
    <script type="text/javascript" src="../_static/jquery.js"></script>
'''

cleaner = Cleaner(page_structure=False)
print cleaner.clean_html(broken_html)

... :

<html><head><title>Regular Expression HOWTO — Python v2.7.1 documentation</title></head></html>

(nb) nothtml html - , 1 , <html><body>, 2 , , :))

+4

Mark Longair 15 . '11 8:11

To remove html, style and script, you can use re.

def stripTags(text):
  # scripts = re.compile(r'<script.*?/script>')
  scripts = re.compile(r'<(script).*?</\1>(?s)')
  css = re.compile(r'<style.*?/style>')
  tags = re.compile(r'<.*?>')

  text = scripts.sub('', text)
  text = css.sub('', text)
  text = tags.sub('', text)

I can work easily

+2

Shafiq Jan 17 '17 at 9:24

source share

Short answer, do not do this. Use Beautiful Soup or elementree to get rid of them. Parse your data as HTML or XML. Regular expressions will not work and are the wrong answer to this problem.

+1

Omnifarious Jan 15 '11 at 6:42

source share

Mark Longair · Accepted Answer · 2011-01-15T07:00:02+0000

Leaving aside the question of whether this is a good idea at all, the problem with your example is that the fourth parameter is equal - there is no parameter in Python 2.6, although it was introduced as the fifth parameter in Python 2.7. Instead, you can add `(? S) to the end of your regular expression for the same effect: re.subncountflags

>>> print (re.subn(r'<(script).*?</\1>(?s)', '', data)[0])

<nothtml> 
  <head> 
    <title>Regular Expression HOWTO &mdash; Python v2.7.1 documentation</title> 




>>>

... Python 2.7, :

>>> print (re.subn(r'<(script).*?</\1>(?s)', '', 0, data)[0])

... .. 0 count.

Python regex for highlighting script tags

script

lxml

More articles: