Memory limit using Regex on a massive text file

I have a text file like this:

('1', '2')
('3', '4')
     .
     .
     .

and I'm trying to make it look like this:

1 2
3 4
etc...

I am trying to do this with the re module in python by combining the re.sub commands as follows:

for line in file:
    s = re.sub(r"\(", "", line)
    s1 = re.sub(r",", "", s)
    s2 = re.sub(r"'", "", s1)
    s3 = re.sub(r"\)", "", s2)
    output.write(s3)
output.close()

It seems to work just fine until I get near the end of my output file; then it becomes inconsistent and stops working. I suspect this is because of the sheer size of the file I'm working with: 300 MB, or roughly 12 million lines.

Can someone help me confirm that I just do not have enough memory? Or if it is something else? Suitable alternatives or ways around this?
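A minimal sketch of the same idea with a single substitution instead of four chained ones: the four characters to delete go into one character class, so each line is scanned once. The file names in the commented loop are placeholders, not from the question.

```python
import re

# Characters to delete: parentheses, single quotes, and commas.
PUNCT = re.compile(r"[()',]")

def clean(line):
    """Strip the tuple punctuation, leaving the digits and the space."""
    return PUNCT.sub("", line)

print(clean("('1', '2')"))  # 1 2

# Streaming over the file looks the same as the original loop
# (file names here are placeholders):
# with open("input.txt") as src, open("output.txt", "w") as dst:
#     for line in src:
#         dst.write(clean(line))
```

Iterating `for line in src` reads one line at a time, so memory use stays flat regardless of file size.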

4 answers

You can simply pull the digits out of each line with re.findall:

import re

with open(file_name) as input, open(output_name, 'w') as output:
    for line in input:
        output.write(' '.join(re.findall(r'\d+', line)))
        output.write('\n')

Since each line is a valid Python tuple literal, you can parse it with ast.literal_eval. Using with, for example:

import ast

with open(file_name) as input, open(output_name, 'w') as output:
    for line in input:
        output.write(' '.join(ast.literal_eval(line.strip())) + '\n')

I would use a namedtuple; it makes the code more readable.

# Python 3

from collections import namedtuple
from ast import literal_eval
#...

Row = namedtuple('Row', 'x y')
with open(in_file) as f, open(out_file, 'w') as output:
    for line in f:  # iterate lazily instead of f.readlines(), which loads the whole file
        row = Row._make(literal_eval(line))
        output.write("{0.x} {0.y}\n".format(row))

This is one way to do this without the re module:

with open(r'd:\temp\02\input.txt') as in_file, \
     open(r'd:\temp\02\output.txt', 'w') as out_file:
    for line in in_file:
        out_file.write(line.replace("'", '')
                           .replace('(', '')
                           .replace(', ', ' ')
                           .replace(')', ''))

Source: https://habr.com/ru/post/1608529/
