Python multiple regex replaces

I am new to Python. I searched for a few days, but found only small fragments of what I need. I am using Python 2.7 on Windows (I chose Python because it is cross-platform and the result can be made portable on Windows).

I would like to write a script that scans a folder for *.txt UTF-8 text files, reads their contents (one file at a time), converts the non-ASCII characters to HTML entities, and then adds HTML tags at the beginning and end of each line. There are two tag variants: one for the head of the file and one for the tail, where head and tail are separated by an empty line. The whole result should then be written to another text file, for example *.htm. To illustrate:

unicode1.txt:

    űnícődé text line1
    űnícődé text line2
    [empty line]
    űnícődé text line3
    űnícődé text line4

The result should be in unicode1.htm:

    <p class='aaa'>&#369;n&iacute;c&#337;d&eacute; text line1</p>
    <p class='aaa'>&#369;n&iacute;c&#337;d&eacute; text line2</p>
    [empty line]
    <p class='bbb'>&#369;n&iacute;c&#337;d&eacute; text line3</p>
    <p class='bbb'>&#369;n&iacute;c&#337;d&eacute; text line4</p>

I started developing the core of my solution, but got stuck. The versions of the script are shown below (for simplicity, I chose to encode with xmlcharrefreplace).

V1:

    import re, cgi, fileinput
    file="_utf8.txt"
    text=""
    for line in fileinput.input(file, inplace=0):
        line=cgi.escape(line.decode('utf8'),1).encode('ascii', 'xmlcharrefreplace')
        line=re.sub(r"^", "<p>", line, 1)
        text=text+re.sub(r"$", "</p>", line, 1)
    print text

This worked and gave a good result, but fileinput is not suitable for this task.

V2:

    import re, cgi, codecs
    file="_utf8.txt"
    text=""
    f=codecs.open(file, encoding='utf-8')
    for line in f:
        line=cgi.escape(line,1).encode('ascii', 'xmlcharrefreplace')
        line=re.sub(r"^", "<p>", line, 1)
        text=text+re.sub(r"$", "</p>", line, 1)
    f.close()
    print text

This garbled the result: the closing tag showed up at the start of the line, overwriting the first letters, and so on.

V3 (tried the multi-line flag):

    import re, cgi, codecs
    file="_utf8.txt"
    text=""
    f=codecs.open(file, encoding='utf-8')
    for line in f:
        line=cgi.escape(line,1).encode('ascii', 'xmlcharrefreplace')
        line=re.sub(r"^", "<p>", line, 1, flags=re.M)
        text=text+re.sub(r"$", "</p>", line, 1, flags=re.M)
    f.close()
    print text

The same result.

V4 (tried 1 regex instead of 2):

    import re, cgi, codecs
    file="_utf8.txt"
    text=""
    f=codecs.open(file, encoding='utf-8')
    for line in f:
        line=cgi.escape(line,1).encode('ascii', 'xmlcharrefreplace')
        text=text+re.sub(r"^(.*)$", r"<p>\1</p>", line, 1)
    f.close()
    print text

The same result. Please, help.

Edit: I just checked the result file in a hex editor, and there is a 0x0D byte in front of each closing tag! Why?

Edit2: changed the concatenation to a more logical form:

 text+=re.sub(r"^(.*)$", r"<p>\1</p>", line, 1) 

Edit3: in the hex dump I saw the reason for the garbled result: an extra CR (0x0D) byte before each CRLF. I traced the CR problem to what seems to cause it, the concatenation in:

    # -*- coding: utf-8 -*-
    text=""
    f=u"unicode text line1\r\n unicode text line2"
    for line in f:
        text+=line
    print text

This leads to:

 unicode text line1\r\r\n unicode text line2 

Any idea how to fix this?
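The doubling can be reproduced without any loop, which suggests the concatenation itself is not the cause. A minimal sketch in Python 3 syntax, assuming the output stream translates every `\n` to `\r\n` on write, as a text-mode stream does on Windows; `io.TextIOWrapper` with `newline="\r\n"` is used here only to imitate that behaviour on any platform:

```python
import io

raw = io.BytesIO()
# newline="\r\n" makes the wrapper translate each "\n" written to "\r\n".
out = io.TextIOWrapper(raw, encoding="ascii", newline="\r\n")

# The text already contains CRLF, so the "\n" half gets translated again.
out.write("unicode text line1\r\n unicode text line2")
out.flush()
written = raw.getvalue()
print(repr(written))

# Writing a bare "\n" instead leaves a single CRLF after translation.
raw2 = io.BytesIO()
out2 = io.TextIOWrapper(raw2, encoding="ascii", newline="\r\n")
out2.write("unicode text line1\n unicode text line2")
out2.flush()
written_lf = raw2.getvalue()
print(repr(written_lf))
```

Under that assumption, normalising line endings to `\n` before printing or writing in text mode (or writing the bytes in binary mode) should avoid the extra CR.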

2 answers
    #!/usr/bin/env python
    import cgi
    import fileinput
    import os
    import shutil
    import sys

    def textfiles(rootdir, extensions=('.txt',)):
        for dirpath, dirs, files in os.walk(rootdir):
            for f in files:
                if f.lower().endswith(extensions):
                    yield os.path.join(dirpath, f)

    def htmlfiles(files):
        for f in files:
            root, _ = os.path.splitext(f)
            newf = root + '.html'
            shutil.copy2(f, newf)
            yield newf

    for line in fileinput.input(htmlfiles(textfiles(sys.argv[1])), inplace=True):
        if fileinput.isfirstline():
            klass = 'aaa' # start head part

        line = cgi.escape(line.decode('utf-8').strip())
        line = line.encode('ascii', 'xmlcharrefreplace')
        if not line: # empty line
            klass = 'bbb' # start tail part
            print(line)
        else:
            print('<p class="%s">%s</p>' % (klass, line))

Example

 $ python txt2html.py c:\root\dir 

There is no need for regular expressions, just do the following:

    with open('utf8.txt') as f:
        class_name = 'aaa'
        for line in f:
            if line == '\n':
                class_name = 'bbb'
            else:
                # decode / convert line
                line = '<p class="{0}">{1}</p>\n'.format(class_name, line.rstrip())
            # write line to file
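The two placeholder comments above could be filled in along these lines. This is only a sketch in Python 3 syntax (the question uses 2.7, where `cgi.escape` would stand in for `html.escape`). The function name `text_to_paragraphs` and its parameters are made up for illustration, and it emits numeric character references such as `&#237;` rather than named entities such as `&iacute;`:

```python
import html


def text_to_paragraphs(text, head_class="aaa", tail_class="bbb"):
    """Wrap each non-empty line in <p>; switch class after the first empty line."""
    class_name = head_class
    out = []
    for line in text.splitlines():   # splitlines() handles both LF and CRLF
        line = line.strip()
        if not line:
            class_name = tail_class  # empty line: the tail part starts here
            out.append("")
            continue
        # Escape &, <, > and quotes, then turn non-ASCII into &#NNN; references.
        escaped = html.escape(line).encode("ascii", "xmlcharrefreplace").decode("ascii")
        out.append('<p class="{0}">{1}</p>'.format(class_name, escaped))
    return "\n".join(out)


sample = "űnícődé text line1\r\nűnícődé text line2\r\n\r\nűnícődé text line3\r\n"
print(text_to_paragraphs(sample))
```

Working on the whole text with `splitlines()` sidesteps the question of how the file was opened, since a stray CR never reaches the tag-wrapping step.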

The results you get do not appear to be caused by the regular expressions, which look correct. The problem is most likely in the line where you encode/convert. Print that line without adding the tags to check whether it is what you expect.
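For that check, `repr()` is handy because it makes invisible characters such as a trailing CR visible. A minimal sketch in Python 3 syntax, assuming the line still ends in CRLF when it reaches `re.sub` (the variable names are illustrative):

```python
import re

line = "text line1\r\n"   # assumed: the CRLF line ending was not translated on read
print(repr(line))         # repr() shows the otherwise invisible '\r'

# '.' matches '\r' but not '\n', and '$' matches just before the final '\n',
# so the CR ends up inside the captured group, right before the closing tag.
wrapped = re.sub(r"^(.*)$", r"<p>\1</p>", line, count=1)
print(repr(wrapped))

# Stripping the line ending first keeps the CR out of the paragraph.
clean = "<p>{0}</p>\n".format(line.rstrip("\r\n"))
print(repr(clean))
```

If the line really does end in `\r\n` at that point, this would account for the 0x0D byte in front of each closing tag that the question mentions.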


Source: https://habr.com/ru/post/1392329/

