Since you want to replace strings with strings of the same length, the replacements can be made in place: only the bytes that need to change are rewritten, and there is no need to write a whole new modified file.
With a regex this is very easy to do. The fact that the file is a CSV file does not matter at all for this method:
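
    from os import listdir
    from os.path import join
    import re

    # Bytes patterns and replacements, because the files are opened in binary mode.
    pat = re.compile(br'ww|\.\.')
    dicrepl = {b'ww': b'vv', b'..': b'--'}

    # path: the directory containing the files to process
    for filename in listdir(path):
        with open(join(path, filename), 'rb+') as f:
            ch = f.read()
            f.seek(0, 0)  # back to the start: read() left the pointer at the end
            pos = 0       # position of the pointer after the previous write
            for match in pat.finditer(ch):
                # move forward from the current position to the start of the match
                f.seek(match.start() - pos, 1)
                f.write(dicrepl[match.group()])
                pos = match.end()
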
For this kind of processing it is absolutely necessary to open the file in binary mode: that is the 'b' in the 'rb+' mode. In binary mode, the positions reported by the regex are exactly the byte offsets used by f.seek().
Opening the file in 'r+' mode allows reading and writing anywhere in it (had it been opened in 'a' mode, we could only write at the end of the file).
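To make the difference between the two modes concrete, here is a small sketch. The function name `overwrite_vs_append` and the sample bytes are made up for illustration; it only uses a temporary file, so it does not touch any real data.

```python
import os
import tempfile

def overwrite_vs_append():
    """Return the file contents after an 'rb+' write and after an 'ab' write."""
    fd, name = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(name, 'wb') as f:
            f.write(b'aaww..bb')

        # 'rb+': the pointer can be placed anywhere and the bytes are overwritten.
        with open(name, 'rb+') as f:
            f.seek(2, 0)
            f.write(b'vv')
        with open(name, 'rb') as f:
            after_rplus = f.read()

        # 'ab': the seek has no effect on writing; data always goes to the end.
        with open(name, 'ab') as f:
            f.seek(2, 0)
            f.write(b'ZZ')
        with open(name, 'rb') as f:
            after_append = f.read()
        return after_rplus, after_append
    finally:
        os.remove(name)
```

With the sample content above, the 'rb+' write turns `aaww..bb` into `aavv..bb`, while the 'ab' write leaves that text alone and yields `aavv..bbZZ`.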
However, if the files are so large that the ch object would consume too much memory, the algorithm should be changed.
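One possible way to change it, offered only as a sketch and not as the method described above, is to map the file into memory with the standard `mmap` module and let the regex scan the map, so that the operating system pages the data in on demand instead of `f.read()` loading everything at once. The function name `replace_in_place_mmap` is an assumption made for this example.

```python
import mmap
import os
import re

PAT = re.compile(br'ww|\.\.')
REPL = {b'ww': b'vv', b'..': b'--'}

def replace_in_place_mmap(filename):
    """Apply same-length replacements through a memory map; return the match count."""
    # mmap cannot map an empty file, and there would be nothing to replace anyway.
    if os.path.getsize(filename) == 0:
        return 0
    count = 0
    with open(filename, 'rb+') as f:
        with mmap.mmap(f.fileno(), 0) as mm:
            for match in PAT.finditer(mm):
                # Same length on both sides, so the size of the map never changes.
                mm[match.start():match.end()] = REPL[match.group()]
                count += 1
    return count
```

This keeps the same-length constraint: slice assignment on a memory map cannot grow or shrink the file.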
If the replacements had a different length than the original strings, it would be necessary to write a new file with the modifications. (If the replacement strings are always shorter than the strings they replace, that is a special case that can still be handled without writing a new file, which may be worthwhile for a big file.)
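For the different-length case, a minimal sketch could look like the following. The replacement strings in `REPL` are hypothetical (chosen only because their lengths differ from the matches), and `rewrite_with_new_file` is a name invented for this example. It reads the whole file, so it is meant for files that fit in memory.

```python
import os
import re
import tempfile

PAT = re.compile(br'ww|\.\.')
# Hypothetical replacements whose lengths differ from the matched strings.
REPL = {b'ww': b'vvv', b'..': b'-'}

def rewrite_with_new_file(filename):
    """Write the modified content to a temporary file, then swap it in."""
    with open(filename, 'rb') as src:
        data = src.read()
    new_data = PAT.sub(lambda m: REPL[m.group()], data)

    # Create the temporary file in the same directory so that os.replace
    # stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(filename))
    fd, tmp_name = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, 'wb') as dst:
            dst.write(new_data)
        os.replace(tmp_name, filename)
    except BaseException:
        # Do not leave a stray temporary file behind on failure.
        os.remove(tmp_name)
        raise
```

Writing to a temporary file first means the original is only replaced once the new content has been written completely.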
The advantage of f.seek(match.start() - pos, 1) over f.seek(match.start(), 0) is that it moves the pointer directly from pos to match.start(), without first moving it from pos back to 0 and then from 0 to match.start().
Conversely, with f.seek(match.start(), 0), the pointer must first be brought back to position 0 (the beginning of the file) and then moved forward by match.start() characters to stop at the right position, because seek(..., 0) means the position is counted from the beginning of the file, whereas seek(..., 1) means the move is made relative to the CURRENT position.
EDIT:
If you want to replace only the isolated ww strings, and not the ww inside longer strings such as wwwwwww, the regular expression should be

    pat = re.compile(br'(?<!w)ww(?!w)|(?<!\.)\.\.(?!\.)')

This is something a regex can do that cannot be obtained with replace() without complicated string manipulation.
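As a quick illustration of that difference, here is a small comparison on a made-up sample string (text strings are used instead of bytes only to keep the example short):

```python
import re

ISOLATED = re.compile(r'(?<!w)ww(?!w)|(?<!\.)\.\.(?!\.)')
REPL = {'ww': 'vv', '..': '--'}

sample = 'ww,wwww,a..b,....'

# str.replace touches every occurrence, including those inside longer runs.
with_replace = sample.replace('ww', 'vv').replace('..', '--')
# with_replace == 'vv,vvvv,a--b,----'

# The lookarounds only accept a pair that is not adjacent to the same character.
with_regex = ISOLATED.sub(lambda m: REPL[m.group()], sample)
# with_regex == 'vv,wwww,a--b,....'
```

The lookbehind `(?<!w)` and lookahead `(?!w)` consume no characters, so the match is still exactly two characters long and the in-place replacement keeps working.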
EDIT:
I had forgotten the f.seek(0,0) instruction after f.read(). This instruction is necessary to move the file pointer back to the beginning of the file, because reading moves the pointer to the end.
I have corrected the code and it works now.
Here is a version of the code that traces the processing step by step:

    from os import listdir
    from os.path import join
    import re

    pat = re.compile(br'(?<!w)ww(?!w)|(?<!\.)\.\.(?!\.)')
    dicrepl = {b'ww': b'vv', b'..': b'ZZ'}

    path = ...................................

    with open(path, 'rb+') as f:
        print("file has just been opened, file pointer is at position %d" % f.tell())
        print('- reading of the file : ch = f.read()')
        ch = f.read()
        print("file has just been read\n"
              "file pointer is now at position %d, the end of the file" % f.tell())
        print("- file pointer is moved back to the beginning of the file : f.seek(0,0)")
        f.seek(0, 0)
        print("file pointer is now again at position %d" % f.tell())
        pos = 0
        print('\n- process of replacement is now launched :')
        for match in pat.finditer(ch):
            print('')
            print('is at position %d' % f.tell())
            print('group %r detected on span %r' % (match.group(), match.span()))
            # relative move from the current position to the start of the match
            f.seek(match.start() - pos, 1)
            print('pointer having been moved on position %d' % f.tell())
            f.write(dicrepl[match.group()])
            print('detected group has been replaced with %r' % dicrepl[match.group()])
            print('now at position %d' % f.tell())
            pos = match.end()
