I am new to Python. I searched for a few days but found only small fragments of what I need. I am using Python 2.7 on Windows (I chose Python because it is multi-platform and the result can be portable on Windows).
I would like to make a script that searches a folder for *.txt UTF-8 text files, reads their contents (one file at a time), converts the non-ASCII characters to HTML entities, and then adds HTML tags at the beginning and end of each line. There are two tag variants: one for the head of the file and one for the tail, where head and tail are separated by an empty line. After that, the whole result should be written to another text file (or files), for example *.htm. To illustrate:
unicode1.txt:
    űnícődé text line1
    űnícődé text line2
    [empty line]
    űnícődé text line3
    űnícődé text line4
The result should be in unicode1.htm:
    <p class='aaa'>űnícődé text line1</p>
    <p class='aaa'>űnícődé text line2</p>
    [empty line]
    <p class='bbb'>űnícődé text line3</p>
    <p class='bbb'>űnícődé text line4</p>
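To make the goal concrete, here is a minimal sketch of the pipeline described above, written so that it should run on Python 2.7 and Python 3. The names `escape_html`, `convert_text`, and `convert_folder` are hypothetical helpers, not part of any library. `escape_html` is a hand-written stand-in for `cgi.escape(text, 1)`, and the sketch assumes that any blank line marks the switch from the head class to the tail class.

```python
import glob
import io
import os


def escape_html(text):
    """Escape the characters that are special in HTML (stand-in for cgi.escape(text, 1))."""
    return (text.replace(u"&", u"&amp;")
                .replace(u"<", u"&lt;")
                .replace(u">", u"&gt;")
                .replace(u'"', u"&quot;"))


def convert_text(text, head_class=u"aaa", tail_class=u"bbb"):
    """Wrap each non-empty line in <p>; switch the class after the first empty line."""
    css_class = head_class
    out = []
    # splitlines() removes the line ending, whether it is \n, \r\n or \r
    for line in text.splitlines():
        if not line.strip():
            css_class = tail_class   # the empty line separates head from tail
            out.append(u"")
            continue
        # Non-ASCII characters become numeric references such as &#369;
        body = escape_html(line).encode("ascii", "xmlcharrefreplace").decode("ascii")
        out.append(u"<p class='%s'>%s</p>" % (css_class, body))
    return u"\n".join(out) + u"\n"


def convert_folder(folder):
    """Convert every *.txt file in `folder` to a *.htm file next to it."""
    for src in glob.glob(os.path.join(folder, "*.txt")):
        dst = os.path.splitext(src)[0] + ".htm"
        with io.open(src, encoding="utf-8") as f:
            html = convert_text(f.read())
        # newline="\n" disables newline translation, so no stray CR is added
        with io.open(dst, "w", encoding="ascii", newline="\n") as f:
            f.write(html)
```

Calling `convert_folder(".")` would then write a `.htm` file next to each `.txt` file in the current folder. Because the lines are split with `splitlines()` and the output is written with an explicit `newline` setting, no regular expression is needed and the line endings stay under the script's control; passing `newline="\r\n"` instead would produce Windows line endings.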
I began to develop the core of my solution, but I got stuck. Here are the versions of the script so far (for simplicity, I chose to encode with xmlcharrefreplace).
V1:
    import re, cgi, fileinput
    file="_utf8.txt"
    text=""
    for line in fileinput.input(file, inplace=0):
        line=cgi.escape(line.decode('utf8'),1).encode('ascii', 'xmlcharrefreplace')
        line=re.sub(r"^", "<p>", line, 1)
        text=text+re.sub(r"$", "</p>", line, 1)
    print text

This worked and gave a good result, but fileinput is not suitable for this task.
V2:
    import re, cgi, codecs
    file="_utf8.txt"
    text=""
    f=codecs.open(file, encoding='utf-8')
    for line in f:
        line=cgi.escape(line,1).encode('ascii', 'xmlcharrefreplace')
        line=re.sub(r"^", "<p>", line, 1)
        text=text+re.sub(r"$", "</p>", line, 1)
    f.close()
    print text
This ruined the result: the closing tag showed up at the start of the line, overwriting the first letters, and so on.
V3 (tried the multi-line flag):
    import re, cgi, codecs
    file="_utf8.txt"
    text=""
    f=codecs.open(file, encoding='utf-8')
    for line in f:
        line=cgi.escape(line,1).encode('ascii', 'xmlcharrefreplace')
        line=re.sub(r"^", "<p>", line, 1, flags=re.M)
        text=text+re.sub(r"$", "</p>", line, 1, flags=re.M)
    f.close()
    print text
The same result.
V4 (tried 1 regex instead of 2):
    import re, cgi, codecs
    file="_utf8.txt"
    text=""
    f=codecs.open(file, encoding='utf-8')
    for line in f:
        line=cgi.escape(line,1).encode('ascii', 'xmlcharrefreplace')
        text=text+re.sub(r"^(.*)$", r"<p>\1</p>", line, 1)
    f.close()
    print text
The same result. Please help.
Edit: I just checked the result file in a hex viewer, and there is a 0x0D byte in front of each closing tag! Why?
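A likely explanation, assuming the input file uses Windows CRLF line endings: `codecs.open` opens the file in binary mode, so each line still ends in `\r\n`, whereas `fileinput` in V1 reads in text mode, where Windows turns `\r\n` into `\n`. The sketch below reproduces the effect with a plain string; the variable names are only illustrative.

```python
import re

# A line as it is read from a CRLF file when no newline translation happens
line = "text line1\r\n"

# Without re.M, "$" matches just before a trailing "\n" (and at the very end),
# so the first match sits between "\r" and "\n".
broken = re.sub(r"$", "</p>", line, count=1)

# "." does not match "\n" but it does match "\r", so the CR ends up inside the group.
also_broken = re.sub(r"^(.*)$", r"<p>\1</p>", line, count=1)

# Stripping the line ending first and adding a single "\n" back avoids both problems.
fixed = "<p>" + line.rstrip("\r\n") + "</p>\n"
```

In both broken variants the CR stays directly in front of `</p>`, which matches the 0x0D byte seen in the hex dump. In a console, a CR moves the cursor back to the start of the line, which would explain why the closing tag appears to overwrite the first letters.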
Edit2: changed the concatenation to a more logical form:

    text+=re.sub(r"^(.*)$", r"<p>\1</p>", line, 1)
Edit3: using a hex viewer, I found the reason for the garbled result: an extra CR (0x0D) byte before each CRLF. I tried to trace where the extra CR comes from, and it seems to appear with this concatenation:

    # -*- coding: utf-8 -*-
    text=""
    f=u"unicode text line1\r\n unicode text line2"
    for line in f:
        text+=line
    print text
This leads to:
unicode text line1\r\r\n unicode text line2
Any idea how to fix this?
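One possible reading of this, offered as a sketch rather than a definitive diagnosis: the concatenation itself leaves the string unchanged, and the extra CR appears only when the text is written in Windows text mode, where every `\n` is expanded to `\r\n` (so an existing `\r\n` becomes `\r\r\n`). The example simulates that translation on any platform with `io.open(..., newline="\r\n")`; `written_bytes` is a hypothetical helper introduced only for this demonstration.

```python
import io
import os
import tempfile

text = u"unicode text line1\r\n unicode text line2"

# Iterating over a string yields single characters, so this loop simply
# rebuilds the same string; the concatenation adds nothing.
rebuilt = u""
for ch in text:
    rebuilt += ch


def written_bytes(data, newline):
    """Write `data` to a temporary text file and return the raw bytes on disk."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with io.open(path, "w", encoding="utf-8", newline=newline) as f:
            f.write(data)
        with io.open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


# newline="\r\n" mimics Windows text mode: every "\n" is written as "\r\n".
translated = written_bytes(rebuilt, "\r\n")
# newline="" writes the characters unchanged.
untouched = written_bytes(rebuilt, "")
```

If that is the cause, two fixes suggest themselves: strip the `\r` before building the output (for example with `line.rstrip("\r\n")`), or write the result to a file opened in binary mode or with `newline=""` instead of using `print`.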