Thanks to everyone for the answers. I think I know why you might not be able to reproduce this: I have only just realized that it happens only if I decode the file as I open it, as in:
import codecs

f = codecs.open(filename, encoding='utf-8')
for line in f:
    print line
Lines are not split on U+2028 if I open the file first and then decode the individual lines:
f = open(filename)
for line in f:
    print line.decode("utf8")
(I use Python 2.6 on Windows. The file was originally UTF-16LE and was then converted to UTF-8.)
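For anyone who wants to try this without my file, here is a minimal sketch of the two approaches side by side. It rests on a few assumptions: the sample text is made up, an in-memory `io.BytesIO` stands in for the file on disk, and `codecs.getreader("utf-8")` stands in for `codecs.open`, which, as far as I can tell, wraps the stream in the same kind of reader.

```python
import codecs
import io

# Hypothetical sample data: the first "line break" is U+2028, the rest are "\n".
raw = u"first\u2028second\nthird\n".encode("utf-8")

# Approach 1: decode while reading. The UTF-8 stream reader splits the
# decoded text, so U+2028 counts as a line boundary.
reader = codecs.getreader("utf-8")(io.BytesIO(raw))
decoded_on_open = [line for line in reader]

# Approach 2: read raw bytes line by line (split on "\n" only),
# then decode each line afterwards.
decoded_per_line = [line.decode("utf-8") for line in io.BytesIO(raw)]

print(repr(decoded_on_open))
print(repr(decoded_per_line))
```

The first list should contain three lines and the second only two, because in the second case U+2028 stays inside the first line.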
This is very interesting, I think I will no longer use codecs.open :-).