Python: restricting the newline characters used by readlines()

I am trying to split text that uses a combination of newline characters: LF, CRLF and NEL. I need a better way to exclude the NEL character from consideration.

Is it possible to tell readlines() to exclude NEL when splitting lines? I could read() the whole file and look only for LF and CRLF split points in a loop.

Is there a better solution?

I use codecs.open() to open the UTF-8 text file.

When I call readlines(), it splits on the NEL characters:

[session screenshot]

File contents:

    u'Line 1 \x85 Line 1.1\r\nLine 2\r\nLine 3\r\n'
1 answer

file.readlines() will only ever split on \n, \r or \r\n, depending on the OS and on whether universal newline support is enabled.

U+0085 NEXT LINE (NEL) is not recognized as a newline separator in that context, and you don't need to do anything special to make file.readlines() ignore it.

Quoting the open() function documentation:

Python is usually built with universal newlines support; supplying 'U' opens the file as a text file, but lines may be terminated by any of the following: the Unix end-of-line convention '\n', the Macintosh convention '\r', or the Windows convention '\r\n'. All of these external representations are seen as '\n' by the Python program. If Python is built without universal newlines support a mode with 'U' is the same as normal text mode. Note that file objects so opened also have an attribute called newlines which has a value of None (if no newlines have yet been seen), '\n', '\r', '\r\n', or a tuple containing all the newline types seen.

and the glossary entry for universal newlines:

A manner of interpreting text streams in which all of the following are recognized as ending a line: the Unix end-of-line convention '\n', the Windows convention '\r\n', and the old Macintosh convention '\r'. See PEP 278 and PEP 3116, as well as str.splitlines() for an additional use.
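As a rough illustration of the universal newline behaviour described in these quotes, here is a minimal sketch. It assumes Python 3 (where the built-in open() is io.open()); the sample text and the temporary file are made up for the example and mix LF, CRLF and NEL:

```python
import io
import os
import tempfile

# Hypothetical sample data mixing NEL (U+0085), CRLF and LF.
sample = u'Line 1 \x85 Line 1.1\r\nLine 2\nLine 3\r\n'

# Write raw bytes so that no newline translation happens on the way out.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as raw:
    raw.write(sample.encode('utf8'))

# The default newline=None enables universal newlines when reading:
# \n, \r and \r\n all end a line and are translated to \n.
with io.open(path, encoding='utf8') as f:
    universal_lines = f.readlines()
    seen_newlines = f.newlines  # the terminators encountered so far

os.remove(path)

print(universal_lines)
print(seen_newlines)
```

NEL stays inside the first line, and the newlines attribute reports only the LF and CRLF terminators that were actually seen.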

Unfortunately, codecs.open() breaks with this rule; its documentation vaguely alludes to the specific codec being asked:

Line-endings are implemented using the codec's decoder method and are included in the list entries if keepends is true.

Instead of codecs.open(), use io.open() to open the file with the correct encoding, and then read the lines from it:

    import io

    # filename and correct_encoding are placeholders for your own values
    with io.open(filename, encoding=correct_encoding) as f:
        lines = f.readlines()

io is the new I/O infrastructure that replaces the Python 2 file system entirely in Python 3. It only treats \n, \r and \r\n as line terminators:

    >>> open('/tmp/test.txt', 'wb').write(u'Line 1 \x85 Line 1.1\r\nLine 2\r\nLine 3\r\n'.encode('utf8'))
    >>> import codecs
    >>> codecs.open('/tmp/test.txt', encoding='utf8').readlines()
    [u'Line 1 \x85', u' Line 1.1\r\n', u'Line 2\r\n', u'Line 3\r\n']
    >>> import io
    >>> io.open('/tmp/test.txt', encoding='utf8').readlines()
    [u'Line 1 \x85 Line 1.1\n', u'Line 2\n', u'Line 3\n']
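Note that io.open() translated the \r\n endings to \n in the session above. If the original terminators need to be preserved, passing newline='' should keep universal newline splitting while returning the endings untranslated. A minimal sketch, assuming Python 3 and a made-up temporary file holding the same contents:

```python
import io
import os
import tempfile

# Same contents as in the question, written as raw UTF-8 bytes.
sample = u'Line 1 \x85 Line 1.1\r\nLine 2\r\nLine 3\r\n'
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as raw:
    raw.write(sample.encode('utf8'))

# newline='' still splits on \n, \r and \r\n only,
# but hands the terminators back exactly as they appear in the file.
with io.open(path, encoding='utf8', newline='') as f:
    untranslated_lines = f.readlines()

os.remove(path)

print(untranslated_lines)
```

With this setting a lone \r would also end a line, so it only matches the "LF and CRLF only" requirement if the file contains no bare carriage returns.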

The codecs.open() result is due to its code using str.splitlines(), which has a documentation bug: when splitting a unicode string, it splits on anything that the Unicode standard deems to be a line break (which is quite a complex issue). The documentation for this method falls short of explaining this; it claims to split only according to the universal newline rules.
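The difference can be seen on an in-memory string. The sketch below assumes Python 3; split_universal is a hypothetical helper written for this example, not a standard library function. It splits on \r\n, \r and \n only and keeps the terminators:

```python
import re

text = u'Line 1 \x85 Line 1.1\r\nLine 2\r\nLine 3\r\n'

# str.splitlines() on a Unicode string also treats NEL (U+0085) as a line break.
unicode_split = text.splitlines()

def split_universal(s):
    # Hypothetical helper: split on CRLF, CR or LF only, keeping the terminators.
    # The second alternative picks up a final line that has no terminator.
    return re.findall(r'[^\r\n]*(?:\r\n|\r|\n)|[^\r\n]+', s)

universal_split = split_universal(text)

print(unicode_split)    # NEL produces an extra split point
print(universal_split)  # NEL stays inside the first line
```

This is close to the "read() and split manually" approach mentioned in the question; for files, io.open() gives the same line boundaries without the extra code.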


Source: https://habr.com/ru/post/1270546/

