Python regex for stripping script tags

I'm a little afraid to ask about this, fearing retaliation from SO: "You cannot parse HTML with regular expressions." Still, why doesn't re.subn(r'<(script).*?</\1>', '', data, re.DOTALL) strip the multi-line 'script', but only the two single-line ones at the end, please?

Thanks, HC

>>> import re
>>> data = """\
<nothtml> 
  <head> 
    <title>Regular Expression HOWTO &mdash; Python v2.7.1 documentation</title> 
    <script type="text/javascript"> 
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    '../',
        VERSION:     '2.7.1',
        COLLAPSE_MODINDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script> 
    <script type="text/javascript" src="../_static/jquery.js"></script> 
    <script type="text/javascript" src="../_static/doctools.js"></script>
"""

>>> print (re.subn(r'<(script).*?</\1>', '', data, re.DOTALL)[0])
<nothtml> 
  <head> 
    <title>Regular Expression HOWTO &mdash; Python v2.7.1 documentation</title> 
    <script type="text/javascript"> 
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    '../',
        VERSION:     '2.7.1',
        COLLAPSE_MODINDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script> 
+3
4 answers

Leaving aside the question of whether this is a good idea at all, the problem with your example is that the fourth parameter to re.subn is count. There is no flags parameter in Python 2.6, although it was introduced as the fifth parameter in Python 2.7. Instead, you can add (?s) to your regular expression for the same effect (placed at the start here, which is the only position newer Python versions accept for a global inline flag):

>>> print (re.subn(r'(?s)<(script).*?</\1>', '', data)[0])

<nothtml> 
  <head> 
    <title>Regular Expression HOWTO &mdash; Python v2.7.1 documentation</title> 




>>>

... or, if you're using Python 2.7, this should work:

>>> print (re.subn(r'<(script).*?</\1>', '', data, 0, re.DOTALL)[0])

... i.e. inserting 0 as the count parameter.
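
To see why the call in the question removed only the single-line scripts: re.DOTALL is the integer 16, so in the fourth position it is read as count=16 and no flag is set at all. Below is a minimal sketch, written for Python 3 and using a shortened sample string of my own rather than the page from the question, that compares the two calls. Passing flags by keyword avoids the positional mix-up.

```python
import re

# Shortened stand-in for the page in the question (hypothetical sample).
data = (
    "<head>\n"
    "<script type='text/javascript'>\n"
    "  var x = 1;\n"
    "</script>\n"
    "<script src='a.js'></script>\n"
    "<script src='b.js'></script>\n"
    "</head>\n"
)

pattern = r'<(script).*?</\1>'

# re.DOTALL is just an integer flag value.
print(int(re.DOTALL))  # 16

# Equivalent to the call in the question: at most 16 substitutions, no flags,
# so '.' cannot cross a newline and the multi-line script survives.
wrong, n_wrong = re.subn(pattern, '', data, count=int(re.DOTALL))

# Passing the flag by keyword lets '.' match newlines as intended.
right, n_right = re.subn(pattern, '', data, flags=re.DOTALL)

print(n_wrong, n_right)  # 2 3
```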

+6

In general, it is better to use a parser such as lxml to deal with HTML. (lxml is one option; BeautifulSoup is another.)

Removing script tags

You can use the HTMLParser from lxml:

from lxml import etree
from StringIO import StringIO

broken_html = '''
<html> 
  <head> 
    <title>Regular Expression HOWTO &mdash; Python v2.7.1 documentation</title> 
    <script type="text/javascript"> 
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    '../',
        VERSION:     '2.7.1',
        COLLAPSE_MODINDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script> 
    <script type="text/javascript" src="../_static/jquery.js"></script>
'''

parser = etree.HTMLParser()
tree = etree.parse(StringIO(broken_html), parser)

for s in tree.xpath('//script'):
    s.getparent().remove(s)

print etree.tostring(tree.getroot(), pretty_print=True)

This outputs:

<html>
  <head>
    <title>Regular Expression HOWTO &#8212; Python v2.7.1 documentation</title>
  </head>
</html>

Using lxml's Cleaner

Or, if you want to remove more than just <script> tags, the Cleaner class from lxml may be what you want:

from lxml.html.clean import Cleaner

broken_html = '''
<html> 
  <head> 
    <title>Regular Expression HOWTO &mdash; Python v2.7.1 documentation</title> 
    <script type="text/javascript"> 
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    '../',
        VERSION:     '2.7.1',
        COLLAPSE_MODINDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script> 
    <script type="text/javascript" src="../_static/jquery.js"></script>
'''

cleaner = Cleaner(page_structure=False)
print cleaner.clean_html(broken_html)

... which outputs:

<html><head><title>Regular Expression HOWTO — Python v2.7.1 documentation</title></head></html>

(NB: I changed nothtml to html in the sample input. Otherwise method 1 wraps the content in <html><body>, and method 2 is affected as well :)

+4

To strip HTML tags, along with style and script blocks, you can use re:

import re

def stripTags(text):
  # (?s) makes '.' match newlines, so multi-line blocks are removed too
  scripts = re.compile(r'(?s)<(script).*?</\1>')
  css = re.compile(r'(?s)<style.*?/style>')
  tags = re.compile(r'(?s)<.*?>')

  text = scripts.sub('', text)
  text = css.sub('', text)
  text = tags.sub('', text)
  return text

This works fine for me.

+2

Short answer: do not do this. Use Beautiful Soup or ElementTree to get rid of them. Parse your data as HTML or XML. Regular expressions will not work and are the wrong answer to this problem.
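
As a rough illustration of the parse-instead-of-regex advice, here is a sketch for Python 3 that uses the standard library's html.parser. This is a swapped-in choice so the example has no third-party dependencies; it is not Beautiful Soup or ElementTree. The names ScriptStripper and strip_scripts are hypothetical, and the approach simply re-emits everything the parser reports except what sits inside <script> elements.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit the input while skipping <script> elements (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &mdash; as written.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.depth = 0  # greater than 0 while inside a <script> element

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self.depth += 1
        elif not self.depth:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through once.
        if tag != 'script' and not self.depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'script':
            self.depth = max(self.depth - 1, 0)
        elif not self.depth:
            self.out.append('</%s>' % tag)

    def handle_data(self, data):
        if not self.depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.depth:
            self.out.append('&%s;' % name)

    def handle_charref(self, name):
        if not self.depth:
            self.out.append('&#%s;' % name)

    def handle_comment(self, data):
        if not self.depth:
            self.out.append('<!--%s-->' % data)

    def handle_decl(self, decl):
        self.out.append('<!%s>' % decl)


def strip_scripts(html):
    """Return html with every <script> element removed."""
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)
```

Calling strip_scripts(data) on a page like the one in the question should leave the title and surrounding markup in place. Unlike lxml, this sketch does not repair broken markup; it only copies through what the parser reports.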

+1

Source: https://habr.com/ru/post/1785590/

