Read a very large single-line text file and split it

I have the following problem: I have a file of almost 500 MB. It is text, all on one line. The virtual line ends are marked by a delimiter called ROW_DEL, which appears in the text like this:

this is a line ROW_DEL and this is a line 

I want to split this file into real lines, to get a file like this:

 this is a line
 and this is a line

The problem is that even opening it in a Windows text editor fails, because the file is so large.

Is it possible to split this file as described using C#, Java or Python? What would be the best approach so as not to overload my processor?

3 answers

In fact, 500 MB of text is not that big; it's just that Notepad sucks. You probably don't have sed since you are on Windows, but at least try a naive solution in Python. I think it will work fine:

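    # '\n' is enough here: a file opened in text mode writes it as the OS
    # line ending (writing os.linesep would give '\r\r\n' on Windows)
    with open('infile.txt') as f_in, open('outfile.txt', 'w') as f_out:
        f_out.write(f_in.read().replace('ROW_DEL ', '\n'))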

Read this file in chunks; for example, use StreamReader.ReadBlock in C#. You can set the maximum number of characters to read there.

For each chunk you read, you can replace ROW_DEL with \r\n and append the result to a new file.

Remember to increase the current index by the number of characters you just read.
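
A minimal Python sketch of this chunked approach (the answer names C#'s StreamReader.ReadBlock; Python is used here only to match the other answers). The function name split_rows, its parameters and the file paths are hypothetical. The sketch also holds back the last few bytes of each chunk so that a ROW_DEL cut in half by a chunk boundary is still found, and it assumes the replacement cannot combine with neighbouring bytes to form a new separator:

```python
def split_rows(src_path, dst_path, sep=b'ROW_DEL', eol=b'\n', chunk_size=262144):
    """Copy src_path to dst_path, replacing every sep with eol, chunk by chunk."""
    keep = len(sep) - 1   # longest part of a separator that can end a chunk
    tail = b''            # bytes held back from the previous chunk
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            buf = (tail + chunk).replace(sep, eol)
            if keep:
                # The last `keep` bytes may be the start of a separator:
                # write everything before them and carry them over.
                dst.write(buf[:-keep])
                tail = buf[-keep:]
            else:
                dst.write(buf)
        dst.write(tail)   # whatever is left cannot contain a separator
```

Called as split_rows('infile.txt', 'outfile.txt'), it keeps memory use close to the chunk size, whatever the size of the file.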


Here is my solution.
It is easy in principle (ŁukaszW.pl gave the principle), but it is not so easy to code if you want to take care of the special cases (which ŁukaszW.pl did not).

Special cases: when the ROW_DEL separator is split across two of the chunks read (as I4V pointed out), and, more subtly, when there are two adjacent ROW_DEL, of which the second is split across two chunks.
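
To see why these cases matter, the sketch below applies replace to each chunk independently and compares the result with a replace over the whole data. The sample data and the function name naive are made up for illustration:

```python
sep = b'ROW_DEL'
data = b'first ROW_DELsecond ROW_DELROW_DELthird'

def naive(data, chunk_size):
    """Replace sep in each chunk on its own, ignoring chunk boundaries."""
    out = b''
    for i in range(0, len(data), chunk_size):
        out += data[i:i + chunk_size].replace(sep, b'\n')
    return out

expected = data.replace(sep, b'\n')

# Chunk sizes for which a separator is cut in two and therefore missed
bad_sizes = [n for n in range(len(sep), 20) if naive(data, n) != expected]
print(bad_sizes)
```

Any chunk size whose boundary falls inside a ROW_DEL leaves that separator untouched, which is exactly what the look-ahead of x characters described below is meant to prevent.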

Since ROW_DEL is longer than any of the possible newlines ('\r', '\n', '\r\n'), it can be replaced in place by the newline used by the OS. That is why I chose to rewrite the file onto itself.
For this I use the 'r+' mode, which does not create a new file.
It is also imperative to use the binary mode 'b', so that seek() and tell() work with exact byte offsets.

The principle is to read a chunk (in real life its size would be 262144, for example) plus x additional characters, where x is the separator length minus 1.
Then check whether a separator is present in the last x characters of the chunk followed by the x additional characters.
Depending on whether it is present or not, the chunk is shortened or not before the ROW_DEL conversion is performed and the result is rewritten in place.

Bare code:

    text = ('The hospital roommate of a man infected ROW_DEL'
            'with novel coronavirus (NCoV)ROW_DEL'
            '—a SARS-related virus first identified ROW_DELROW_DEL'
            'last year and already linked to 18 deaths—ROW_DEL'
            'has contracted the illness himself, ROW_DEL'
            'intensifying concerns about the ROW_DEL'
            "virus ability to spread ROW_DEL"
            'from person to person.')

    # Python 3: the file is handled as bytes, so the separator is bytes too
    with open('eessaa.txt', 'w', encoding='utf-8') as f:
        f.write(text)

    with open('eessaa.txt', 'rb') as f:
        ch = f.read()
        print(ch.replace(b'ROW_DEL', b'ROW_DEL\n').decode('utf-8'))
        print('\nlength of the text : %d bytes\n' % len(ch))

    #==========================================

    from os.path import getsize
    from os import fsync, linesep

    def rewrite(whichfile, sep, chunk_length, OSeol=linesep.encode('ascii')):
        if chunk_length < len(sep):
            print('Length of second argument, %d , is '
                  'the minimum value for the third argument'
                  % len(sep))
            return
        x = len(sep) - 1
        x2 = 2*x
        file_length = getsize(whichfile)
        # Two handles on the same file: fR reads ahead, fW overwrites behind it
        with open(whichfile, 'rb+') as fR, \
             open(whichfile, 'rb+') as fW:
            while True:
                chunk = fR.read(chunk_length)
                pch = fR.tell()
                # End of the chunk plus a look-ahead of x bytes
                twelve = chunk[-x:] + fR.read(x)

                if sep in twelve:
                    # sep straddles the end of the chunk: cut just before it
                    pt = twelve.find(sep)
                    y = chunk[0:-x+pt].replace(sep, OSeol)
                else:
                    pt = x
                    y = chunk.replace(sep, OSeol)

                fW.write(y)
                fW.flush()
                fsync(fW.fileno())

                if fR.tell() < file_length:
                    fR.seek(-x2+pt, 1)
                elif pch-x+pt < file_length:
                    # The look-ahead reached the end of the file, but the
                    # bytes after the processed part still have to be handled
                    fR.seek(pch-x+pt)
                else:
                    fW.truncate()
                    break

    rewrite('eessaa.txt', b'ROW_DEL', 14)

    with open('eessaa.txt', 'rb') as f:
        ch = f.read()
        print('\n'.join(repr(line)[2:-1] for line in ch.splitlines(True)))
        print('\nlength of the text : %d bytes\n' % len(ch))

To follow the execution step by step, here is the same code with print statements added:

    text = ('The hospital roommate of a man infected ROW_DEL'
            'with novel coronavirus (NCoV)ROW_DEL'
            '—a SARS-related virus first identified ROW_DELROW_DEL'
            'last year and already linked to 18 deaths—ROW_DEL'
            'has contracted the illness himself, ROW_DEL'
            'intensifying concerns about the ROW_DEL'
            "virus ability to spread ROW_DEL"
            'from person to person.')

    # Python 3: the file is handled as bytes, so the separator is bytes too
    with open('eessaa.txt', 'w', encoding='utf-8') as f:
        f.write(text)

    with open('eessaa.txt', 'rb') as f:
        ch = f.read()
        print(ch.replace(b'ROW_DEL', b'ROW_DEL\n').decode('utf-8'))
        print('\nlength of the text : %d bytes\n' % len(ch))

    #==========================================

    from os.path import getsize
    from os import fsync, linesep

    def rewrite(whichfile, sep, chunk_length, OSeol=linesep.encode('ascii')):
        if chunk_length < len(sep):
            print('Length of second argument, %d , is '
                  'the minimum value for the third argument'
                  % len(sep))
            return
        x = len(sep) - 1
        x2 = 2*x
        file_length = getsize(whichfile)
        # Two handles on the same file: fR reads ahead, fW overwrites behind it
        with open(whichfile, 'rb+') as fR, \
             open(whichfile, 'rb+') as fW:
            while True:
                chunk = fR.read(chunk_length)
                pch = fR.tell()
                # End of the chunk plus a look-ahead of x bytes
                twelve = chunk[-x:] + fR.read(x)
                ptw = fR.tell()

                if sep in twelve:
                    # sep straddles the end of the chunk: cut just before it
                    pt = twelve.find(sep)
                    m = ("\n   !! %r is "
                         "at position %d in twelve !!" % (sep, pt))
                    y = chunk[0:-x+pt].replace(sep, OSeol)
                else:
                    pt = x
                    m = ''
                    y = chunk.replace(sep, OSeol)
                print('chunk  == %r   %d chars\n'
                      ' -> fR now at position  %d\n'
                      'twelve == %r   %d chars   %s\n'
                      ' -> fR now at position  %d'
                      % (chunk, len(chunk), pch, twelve, len(twelve), m, ptw))

                pos = fW.tell()
                fW.write(y)
                fW.flush()
                fsync(fW.fileno())
                print('          %r   %d long\n'
                      ' has been written from position %d\n'
                      ' => fW now at position  %d'
                      % (y, len(y), pos, fW.tell()))

                if fR.tell() < file_length:
                    fR.seek(-x2+pt, 1)
                    print(' -> fR moved %d characters back to position %d'
                          % (x2-pt, fR.tell()))
                elif pch-x+pt < file_length:
                    # The look-ahead reached the end of the file, but the
                    # bytes after the processed part still have to be handled
                    fR.seek(pch-x+pt)
                    print(' -> fR moved back to position %d' % fR.tell())
                else:
                    print(" => fR is at position %d == file size\n"
                          '    File has thoroughly been read' % fR.tell())
                    fW.truncate()
                    break

                input('\npress Enter to continue')

    rewrite('eessaa.txt', b'ROW_DEL', 14)

    with open('eessaa.txt', 'rb') as f:
        ch = f.read()
        print('\n'.join(repr(line)[2:-1] for line in ch.splitlines(True)))
        print('\nlength of the text : %d bytes\n' % len(ch))

There is some subtlety in processing the ends of chunks, to detect whether ROW_DEL straddles two chunks and whether two ROW_DEL are adjacent. That is why it took me a long time to post my solution: I finally had to write fR.seek(-x2+pt,1), and not just fR.seek(-2*x,1) or fR.seek(-x,1) depending on whether sep straddles the boundary or not (2*x is x2 in the code; with ROW_DEL, x and x2 are 6 and 12). Anyone interested in this point can study it by changing the code in the two branches that handle whether 'ROW_DEL' is in twelve or not.
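
The offset arithmetic can be checked on a small in-memory example. The sketch below uses io.BytesIO as a stand-in for the file, with made-up sample data; it reproduces a single read step and shows where the relative seek lands:

```python
import io

sep = b'ROW_DEL'
x = len(sep) - 1       # 6
x2 = 2 * x             # 12
chunk_length = 14

# Hypothetical data: the separator starts at offset 10 and
# is cut in two by the end of the first 14-byte chunk
data = b'0123456789ROW_DELabcdefghij'
f = io.BytesIO(data)

chunk = f.read(chunk_length)        # ends in the middle of the separator
twelve = chunk[-x:] + f.read(x)     # last x bytes of chunk + x bytes of look-ahead
pt = twelve.find(sep)               # position of the separator inside twelve

f.seek(-x2 + pt, 1)                 # same relative seek as in rewrite()
landed = f.tell()
print(pt, landed, data[landed:landed + len(sep)])
```

The reader ends up exactly at the start of the straddling separator, so the next chunk begins with a complete ROW_DEL.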


Source: https://habr.com/ru/post/1481125/

