Upper memory limit?

Is there a memory limit for Python? I have been using a Python script to calculate average values from a file that is at least 150 MB in size.

Depending on the size of the file, I sometimes run into a MemoryError.

Is it possible to assign more memory to Python so that I don't encounter this error?




EDIT: code below

NOTE: File sizes can vary greatly (up to 20 GB); the minimum file size is 150 MB.

    file_A1_B1 = open("A1_B1_100000.txt", "r")
    file_A2_B2 = open("A2_B2_100000.txt", "r")
    file_A1_B2 = open("A1_B2_100000.txt", "r")
    file_A2_B1 = open("A2_B1_100000.txt", "r")
    file_write = open("average_generations.txt", "w")
    mutation_average = open("mutation_average", "w")

    files = [file_A2_B2, file_A2_B2, file_A1_B2, file_A2_B1]

    for u in files:
        line = u.readlines()
        list_of_lines = []
        for i in line:
            values = i.split('\t')
            list_of_lines.append(values)

        count = 0
        for j in list_of_lines:
            count += 1

        for k in range(0, count):
            list_of_lines[k].remove('\n')

        length = len(list_of_lines[0])
        print_counter = 4

        for o in range(0, length):
            total = 0
            for p in range(0, count):
                number = float(list_of_lines[p][o])
                total = total + number
            average = total/count
            print average
            if print_counter == 4:
                file_write.write(str(average)+'\n')
                print_counter = 0
            print_counter += 1
        file_write.write('\n')
5 answers

(This is my third answer, because I misunderstood what your code was doing in my original, and then made a small but critical mistake in my second; hopefully the third time's the charm.)

Edit: Since this seems to be a popular answer, I've made a few changes over the years to improve its implementation, most of them not very significant. That way, if people use it as a template, it will provide an even better foundation.

As others have pointed out, your MemoryError problem is most likely due to the fact that you are trying to read the entire contents of huge files into memory, and then, on top of that, effectively doubling the amount of memory used by creating a list of lists of the string values from each line.

Python's memory limits are determined by how much physical RAM and virtual-memory disk space your computer and operating system have available. Even if you don't use it all and your program "works", using that much can be impractical because it simply takes too long.

In any case, the most obvious way to avoid this is to process each file one line at a time, which means that you must do the processing step by step.

To do that, a list of running totals for each of the file's fields is kept. When a file is finished, the average value of each field can be calculated by dividing the corresponding total by the count of lines read. Once that is done, the averages can be printed out, and some are written to one of the output files. I also made a conscious effort to use very descriptive variable names to try to make it understandable.

    try:
        from itertools import izip_longest
    except ImportError:    # Python 3
        from itertools import zip_longest as izip_longest

    GROUP_SIZE = 4
    input_file_names = ["A1_B1_100000.txt", "A2_B2_100000.txt", "A1_B2_100000.txt",
                        "A2_B1_100000.txt"]
    file_write = open("average_generations.txt", 'w')
    mutation_average = open("mutation_average", 'w')  # left in, but nothing written

    for file_name in input_file_names:
        with open(file_name, 'r') as input_file:
            print('processing file: {}'.format(file_name))

            totals = []
            for count, fields in enumerate((line.split('\t') for line in input_file), 1):
                totals = [sum(values) for values in izip_longest(totals, map(float, fields),
                                                                 fillvalue=0)]
            averages = [total/count for total in totals]

            for print_counter, average in enumerate(averages):
                print(' {:9.4f}'.format(average))
                if print_counter % GROUP_SIZE == 0:
                    file_write.write(str(average)+'\n')

    file_write.write('\n')
    file_write.close()
    mutation_average.close()

You read the entire file into memory (line = u.readlines()), which will of course fail if the file is too large (and you say that some of them are up to 20 GB), so that is where your problem lies.

It is better to iterate over each line:

    for current_line in u:
        do_something_with(current_line)

This is the recommended approach.

Later in the script you do some very strange things, such as first counting all the elements in the list and then constructing a for loop over the range of that count. Why not iterate over the list directly? What is the purpose of your script? I get the impression that it could be done much more simply.
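For example, a minimal sketch with made-up data (not your actual files), contrasting the count-then-index pattern with direct iteration:

    # Hypothetical rows: lines already split on tabs, each still carrying
    # the trailing newline as its own element, as in your script.
    rows = [['1.0', '2.0', '\n'], ['3.0', '4.0', '\n']]

    # Instead of counting the rows yourself and indexing with range(count),
    # iterate over the list directly and let Python do the bookkeeping.
    for row in rows:
        row.remove('\n')        # mutates each inner list in place

    row_count = len(rows)       # len() already gives the count if you need it
    print(rows, row_count)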

This is one of the advantages of higher-level languages such as Python (as opposed to C, where you have to do this housekeeping yourself): let Python handle the iteration for you, and only collect in memory what you actually need to have in memory at any given time.

Also, since you are apparently processing TSV files (tab-separated values), you should take a look at the csv module, which will handle all the splitting, removal of \n, and so on for you.
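For instance, a minimal sketch (assuming a hypothetical tab-separated file called data.tsv, not one of your actual files):

    import csv

    # csv.reader handles the tab-splitting and strips the line terminator,
    # so there is no need for manual split('\t') or remove('\n').
    with open("data.tsv", "r") as tsv_file:
        reader = csv.reader(tsv_file, delimiter='\t')
        for fields in reader:               # one list of strings per row
            numbers = [float(value) for value in fields]
            # ... accumulate running totals here ...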


Python can use all the memory available to its environment. My simple “memory test” crashes on ActiveState Python 2.6 after using

 1959167 [MiB] 

With Jython 2.5 it crashes earlier:

  239000 [MiB] 

Perhaps Jython can be configured to use more memory (it is limited by the settings of the JVM).

Test application:

    import sys

    sl = []
    i = 0
    # some magic 1024 - overhead of string object
    fill_size = 1024
    if sys.version.startswith('2.7'):
        fill_size = 1003
    if sys.version.startswith('3'):
        fill_size = 497
    print(fill_size)
    MiB = 0
    while True:
        s = str(i).zfill(fill_size)
        sl.append(s)
        if i == 0:
            try:
                sys.stderr.write('size of one string %d\n' % (sys.getsizeof(s)))
            except AttributeError:
                pass
        i += 1
        if i % 1024 == 0:
            MiB += 1
            if MiB % 25 == 0:
                sys.stderr.write('%d [MiB]\n' % (MiB))



In your application, you immediately read the entire file. For such large files, you should read line by line.


No, Python does not impose a limit on how much memory an application can use. I regularly work with Python applications that use several gigabytes of memory. Most likely, your script is simply using more memory than is available on the machine you are running it on.

In that case, the solution is to rewrite the script to be more memory efficient, or to add more physical memory if the script is already optimized to minimize memory usage.
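If you want to see what the operating system itself allows the process, here is a minimal Unix-only sketch using the standard library's resource module (it only reports the limits, it does not grant more memory):

    import resource

    # Report the address-space limit the OS imposes on this process.
    # RLIM_INFINITY means there is no explicit cap, i.e. the practical
    # ceiling is whatever physical RAM plus swap the machine has.
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    for name, limit in (("soft", soft), ("hard", hard)):
        if limit == resource.RLIM_INFINITY:
            print("{} address-space limit: unlimited".format(name))
        else:
            print("{} address-space limit: {} MiB".format(name, limit // (1024 * 1024)))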

Edit:

Your script reads all the contents of your files into memory at once ( line = u.readlines() ). Since you process files up to 20 GB in size, you will get memory errors with this approach if you do not have a huge amount of memory on your computer.

A better approach would be to read the files one line at a time:

    for u in files:
        for line in u:          # this iterates over each line in the file
            # read the values from the line and do the necessary calculations
            pass                # placeholder for the per-line processing

Not only do you read the whole of each file into memory, you also laboriously replicate that information in a table called list_of_lines.

You have a secondary problem: your choice of variable names severely obfuscates what you are doing.

Here is your script rewritten without the readlines() call and with meaningful names:

    file_A1_B1 = open("A1_B1_100000.txt", "r")
    file_A2_B2 = open("A2_B2_100000.txt", "r")
    file_A1_B2 = open("A1_B2_100000.txt", "r")
    file_A2_B1 = open("A2_B1_100000.txt", "r")
    file_write = open("average_generations.txt", "w")
    mutation_average = open("mutation_average", "w")  # not used
    files = [file_A2_B2, file_A2_B2, file_A1_B2, file_A2_B1]

    for afile in files:
        table = []
        for aline in afile:
            values = aline.split('\t')
            values.remove('\n')  # why?
            table.append(values)

        row_count = len(table)
        row0length = len(table[0])
        print_counter = 4
        for column_index in range(row0length):
            column_total = 0
            for row_index in range(row_count):
                number = float(table[row_index][column_index])
                column_total = column_total + number
            column_average = column_total/row_count
            print column_average
            if print_counter == 4:
                file_write.write(str(column_average)+'\n')
                print_counter = 0
            print_counter += 1
        file_write.write('\n')

It quickly becomes apparent that (1) you are calculating column averages and (2) the obfuscation led some others to think you were calculating row averages.

Since you are calculating column averages, no output is required until the end of each file, and the amount of extra memory actually needed is proportional to the number of columns.

Here is a revised version of the outer loop code:

    for afile in files:
        for row_count, aline in enumerate(afile, start=1):
            values = aline.split('\t')
            values.remove('\n')  # why?
            fvalues = map(float, values)
            if row_count == 1:
                row0length = len(fvalues)
                column_index_range = range(row0length)
                column_totals = fvalues
            else:
                assert len(fvalues) == row0length
                for column_index in column_index_range:
                    column_totals[column_index] += fvalues[column_index]
        print_counter = 4
        for column_index in column_index_range:
            column_average = column_totals[column_index] / row_count
            print column_average
            if print_counter == 4:
                file_write.write(str(column_average)+'\n')
                print_counter = 0
            print_counter += 1


