Memory limit using Regex on a massive text file

I have a text file like this:

('1', '2')
('3', '4')
     .
     .
     .

and I'm trying to make it look like this:

1 2
3 4
etc...

I am trying to do this with the re module in python by combining the re.sub commands as follows:

for line in file:
    s = re.sub(r"\(", "", line)
    s1 = re.sub(r",", "", s)
    s2 = re.sub(r"'", "", s1)
    s3 = re.sub(r"\)", "", s2)
    output.write(s3)
output.close()

It seems to work just fine until I get near the end of my output file; then it becomes inconsistent and stops working. I suspect this is because of the sheer size of the file I'm working with: 300 MB, or roughly 12 million lines.

Can someone help me confirm that I just do not have enough memory? Or if it is something else? Suitable alternatives or ways around this?
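A minimal sketch of the same idea with a single substitution instead of four chained ones: the four characters to delete go into one character class, so each line is scanned once. The file names in the commented loop are placeholders, not from the question.

```python
import re

# Characters to delete: parentheses, single quotes, and commas.
PUNCT = re.compile(r"[()',]")

def clean(line):
    """Strip the tuple punctuation, leaving the digits and the space."""
    return PUNCT.sub("", line)

print(clean("('1', '2')"))  # 1 2

# Streaming over the file looks the same as the original loop
# (file names here are placeholders):
# with open("input.txt") as src, open("output.txt", "w") as dst:
#     for line in src:
#         dst.write(clean(line))
```

Iterating `for line in src` reads one line at a time, so memory use stays flat regardless of file size.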

4 answers

You can simply pull the digits out of each line with re.findall:

import re

with open(file_name) as input, open(output_name, 'w') as output:
    for line in input:
        output.write(' '.join(re.findall(r'\d+', line)))
        output.write('\n')

Since each line is a valid Python tuple literal, you can parse it with ast.literal_eval. Using with, for example:

import ast

with open(file_name) as input, open(output_name, 'w') as output:
    for line in input:
        output.write(' '.join(ast.literal_eval(line.strip())) + '\n')

I would use a namedtuple; it makes the code more readable.

# Python 3

from collections import namedtuple
from ast import literal_eval
#...

Row = namedtuple('Row', 'x y')
with open(in_file) as f, open(out_file, 'w') as output:
    for line in f:  # iterate lazily instead of f.readlines(), which loads the whole file
        row = Row._make(literal_eval(line))
        output.write("{0.x} {0.y}\n".format(row))

This is one way to do this without the re module:

with open(r'd:\temp\02\input.txt') as in_file, \
     open(r'd:\temp\02\output.txt', 'w') as out_file:
    for line in in_file:
        out_file.write(line.replace("'", '')
                           .replace('(', '')
                           .replace(', ', ' ')
                           .replace(')', ''))

Source: https://habr.com/ru/post/1608529/
