How to exclude U+2028 from line separators in Python when reading a file?

Question

I have a UTF-8 file in which some lines contain the line separator character U+2028 ( http://www.fileformat.info/info/unicode/char/2028/index.htm ). I do not want it to be treated as a line break when I read lines from the file. Is there a way to exclude it from the separators when I iterate over the file or use readlines()? (Other than reading the entire file into a string and then splitting it on \n.) Thanks!

+3

python readline utf-8 separator

user135773 Jul 9 '09 at 16:44

5 answers

I could not reproduce this behavior, but here is a naive solution that simply joins the readline results for as long as they end in U+2028.

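    #!/usr/bin/env python
    from __future__ import with_statement

    def my_readlines(f):
        buf = u""
        for line in f.readlines():
            uline = line.decode('utf8')
            buf += uline
            # Keep accumulating while the chunk ends in U+2028;
            # emit the joined line as soon as one does not.
            if uline[-1] != u'\u2028':
                yield buf
                buf = u""
        # Emit whatever is left if the file ended on U+2028.
        if buf:
            yield buf

    with open("in.txt", "rb") as fin:
        for l in my_readlines(fin):
            print l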

+2

Alexander Ljungberg Jul 9 '09 at 18:04

Thanks to everyone for answering. I think I know why you might not be able to reproduce this. I just realized that it only happens if I decode the file when I open it, as in:

    import codecs

    f = codecs.open(filename, encoding='utf-8')
    for line in f:
        print line

Lines are not split on U+2028 if I open the file first and then decode the individual lines:

    f = open(filename)
    for line in f:
        print line.decode("utf8")

(I am using Python 2.6 on Windows. The file was originally UTF-16LE and was then converted to UTF-8.)

This is very interesting, I think I will no longer use codecs.open :-).
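The same idea can be sketched for Python 3: open the file in binary mode, so that iteration splits on the newline byte only, and decode each line afterwards. This is only a minimal sketch; the function names `read_lines_split_on_newline_only` and `demo` are made up for illustration, and the sample file is a temporary one created by the sketch itself.

```python
import os
import tempfile


def read_lines_split_on_newline_only(path, encoding="utf-8"):
    # Binary mode: iteration splits on the newline byte only, so the
    # UTF-8 bytes of U+2028 (E2 80 A8) stay inside the line.
    with open(path, "rb") as f:
        for raw in f:
            yield raw.decode(encoding)


def demo():
    # Write a small sample file whose first line contains U+2028.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write("first\u2028still first\nsecond\n".encode("utf-8"))
        return list(read_lines_split_on_newline_only(path))
    finally:
        os.remove(path)


if __name__ == "__main__":
    for line in demo():
        print(repr(line))
```

Decoding line by line like this is safe for UTF-8, because the byte 0x0A never occurs inside a multi-byte sequence. It would not be safe for UTF-16.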

+1

user135773 Jul 9 '09 at 10:24

If you are using Python 3.0 (note that I am not, so I cannot test this), then according to the documentation you can pass the optional newline parameter to open to specify which line separator to use. However, the documentation does not mention U+2028 at all (it only mentions \r, \n and \r\n as line separators), so it is a surprise to me that this happens at all (although I can confirm it with Python 2.6).
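For what it is worth, here is a small Python 3 sketch of the newline parameter in use. With `newline="\n"`, only `\n` ends a line and nothing is translated, so a U+2028 inside a line is left alone. The helper names `read_text_lines` and `demo_newline_param` are hypothetical, and the input is a temporary file written by the sketch.

```python
import os
import tempfile


def read_text_lines(path):
    # newline="\n": only "\n" terminates a line, and no translation is done.
    with open(path, encoding="utf-8", newline="\n") as f:
        return list(f)


def demo_newline_param():
    # Create a sample file; its first line contains U+2028.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write("a\u2028b\nc\n".encode("utf-8"))
        return read_text_lines(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo_newline_param())
```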

0

balpha Jul 9 '09 at 17:03

The codecs module does the right thing. U+2028 is named "LINE SEPARATOR", with the comment "may be used to represent this semantic unambiguously". It is therefore reasonable to treat it as a line separator.

Presumably the creator would not put in U+2028 characters for no good reason ... does the file contain u"\n" as well? Why do you want lines not to be split on U+2028?
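The difference between the two notions of "line" can be seen on a decoded string. In Python 3, `str.splitlines()` counts U+2028 as a line boundary, while splitting on `"\n"` does not. The sketch below is illustrative only; `split_on_newline_only` is a made-up helper that keeps the terminators, as `readlines()` does.

```python
def split_on_newline_only(s):
    # Split on "\n" only and keep each terminator, like readlines().
    parts = s.split("\n")
    lines = [p + "\n" for p in parts[:-1]]
    # The last piece has no trailing newline; keep it only if non-empty.
    if parts[-1]:
        lines.append(parts[-1])
    return lines


if __name__ == "__main__":
    sample = "one\u2028still one\ntwo\n"
    # splitlines() breaks at U+2028 as well as at "\n".
    print(sample.splitlines())
    # The helper breaks at "\n" only.
    print(split_on_newline_only(sample))
```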

0

John Machin Jul 10 '09 at 1:15

Markus · Accepted Answer · 2009-07-09T21:04:52+0000

I cannot reproduce this behavior in Python 2.5, 2.6 or 3.0 on Mac OS X: U+2028 is never treated as a line ending. Could you tell us more about where you see this error?

However, here is a subclass of the file class that can do what you want:

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    # Python 2 only: the built-in file type does not exist in Python 3.

    class MyFile(file):
        def __init__(self, *arg, **kwarg):
            file.__init__(self, *arg, **kwarg)
            self.EOF = False

        def next(self, catchEOF=False):
            if self.EOF:
                raise StopIteration("End of file")
            try:
                nextLine = file.next(self)
            except StopIteration:
                self.EOF = True
                if not catchEOF:
                    raise
                return ""
            # If the chunk ends in U+2028, glue the next chunk onto it.
            if nextLine.decode("utf8")[-1] == u'\u2028':
                return nextLine + self.next(catchEOF=True)
            else:
                return nextLine

    A = MyFile("someUnicode.txt")
    for line in A:
        print line.strip("\n").decode("utf8")
