How to calculate the average of numbers from multiple .dat files using Python?

So, I have 50-60 .dat files, each containing m rows and n columns of numbers. I need to take the average over all the files and create a new file in the same format. I have to do this in Python. Can anyone help me with this?

I wrote the code. I understand that there are some incompatible types, but I can’t come up with alternatives, so I haven’t changed anything.

    #! /usr/bin/python
    import os

    CC = 1.96
    average = []
    total = []
    count = 0

    os.chdir("./")
    for files in os.listdir("."):
        if files.endswith(".dat"):
            infile = open(files)
            cur = []
            cur = infile.readlines()
            for i in xrange(0, len(cur)):
                cur[i] = cur[i].split()
            total += cur
            count += 1
    average = [x/count for x in total]

    #calculate uncertainty
    uncert = []
    for files in os.listdir("."):
        if files.endswith(".dat"):
            infile = open(files)
            cur = []
            cur = infile.readlines
            for i in xrange(0, len(cur)):
                cur[i] = cur[i].split()
            uncert += (cur - average)**2
    uncert = uncert**.5
    uncert = uncert*CC
2 answers

Here's a fairly time- and resource-efficient approach that reads the values and computes their averages across all files in parallel, reading only one line from each file at a time. It does, however, temporarily read the entire first .dat file into memory in order to determine how many rows and columns of numbers each file will have.

You did not say whether your "numbers" are integers, floats, or something else, so the code reads them as floats (which works even if they are not). Either way, the averages are calculated and displayed as floating-point numbers.

Update

I modified my original answer to also calculate the population standard deviation (sigma) of the values in each row and column, as requested in your comment. It does this immediately after calculating their mean, so a second pass to re-read all the data is not required. In addition, in response to a suggestion made in the comments, a context manager was added to ensure that all input files are closed.

Please note that the standard deviations are only printed and are not written to the output file, but writing them to the same or a separate file should be simple enough to add (see the sketch after the code).

    from contextlib import contextmanager
    from itertools import izip
    from glob import iglob
    from math import sqrt
    from sys import exit

    @contextmanager
    def multi_file_manager(files, mode='rt'):
        files = [open(file, mode) for file in files]
        yield files
        for file in files:
            file.close()

    # generator function to read, convert, and yield each value from a text file
    def read_values(file, datatype=float):
        for line in file:
            for value in (datatype(word) for word in line.split()):
                yield value

    # enumerate multiple equal length iterables simultaneously as (i, n0, n1, ...)
    def multi_enumerate(*iterables, **kwds):
        start = kwds.get('start', 0)
        return ((n,)+t for n, t in enumerate(izip(*iterables), start))

    DATA_FILE_PATTERN = 'data*.dat'
    MIN_DATA_FILES = 2

    with multi_file_manager(iglob(DATA_FILE_PATTERN)) as datfiles:
        num_files = len(datfiles)
        if num_files < MIN_DATA_FILES:
            print('Less than {} .dat files were found to process, '
                  'terminating.'.format(MIN_DATA_FILES))
            exit(1)

        # determine number of rows and cols from first file
        temp = [line.split() for line in datfiles[0]]
        num_rows = len(temp)
        num_cols = len(temp[0])
        datfiles[0].seek(0)  # rewind first file
        del temp  # no longer needed
        print '{} .dat files found, each must have {} rows x {} cols\n'.format(
            num_files, num_rows, num_cols)

        means = []
        std_devs = []
        divisor = float(num_files-1)  # Bessel correction for sample standard dev
        generators = [read_values(file) for file in datfiles]
        for _ in xrange(num_rows):  # main processing loop
            for _ in xrange(num_cols):
                # create a sequence of next cell values from each file
                values = tuple(next(g) for g in generators)
                mean = float(sum(values)) / num_files
                means.append(mean)
                means_diff_sq = ((value-mean)**2 for value in values)
                std_dev = sqrt(sum(means_diff_sq) / divisor)
                std_devs.append(std_dev)

        print 'Average and (standard deviation) of values:'
        with open('means.txt', 'wt') as averages:
            for i, mean, std_dev in multi_enumerate(means, std_devs):
                print '{:.2f} ({:.2f})'.format(mean, std_dev),
                averages.write('{:.2f}'.format(mean))  # note std dev not written
                if i % num_cols != num_cols-1:  # not last column?
                    averages.write(' ')  # delimiter between values on line
                else:
                    print  # newline
                    averages.write('\n')
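If you do want the standard deviations written to a file as well, here is a minimal sketch of one way to do it, reusing the std_devs list and num_cols from the code above; the std_devs.txt filename and the two-decimal formatting are just assumptions:

    # hypothetical follow-up: write the standard deviations computed above to
    # their own file, laid out in the same rows-by-columns grid as means.txt
    with open('std_devs.txt', 'wt') as devs_file:   # filename is an assumption
        for i, std_dev in enumerate(std_devs):
            devs_file.write('{:.2f}'.format(std_dev))
            if i % num_cols != num_cols-1:          # not the last column in the row?
                devs_file.write(' ')                # space-delimit values on the line
            else:
                devs_file.write('\n')               # end of the row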

I'm not sure which aspect of the process is giving you trouble, so I will just answer the specific part about getting the average value of each of the .dat files.

Assuming a data structure as follows:

    72 12 94 79 76
    5 30 98 97 48
    79 95 63 74 70
    18 92 20 32 50
    77 88 60 98 19
    17 14 66 80 24
    ...

Getting the average value of each file:

    import glob
    import itertools

    avgs = []
    for datpath in glob.iglob("*.dat"):
        with open(datpath, 'r') as f:
            str_nums = itertools.chain.from_iterable(i.strip().split() for i in f)
            nums = map(int, str_nums)
            avg = sum(nums) / len(nums)
            avgs.append(avg)

    print avgs

It iterates over each .dat file, reads and splits its lines, converts the values to int (use float if you prefer), and appends each file's average to avgs.
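Note that the snippet above is Python 2 (print statement, map() returning a list, integer division). If you happen to be on Python 3, a roughly equivalent sketch would be:

    import glob
    import itertools

    avgs = []
    for datpath in glob.iglob("*.dat"):
        with open(datpath) as f:
            str_nums = itertools.chain.from_iterable(line.split() for line in f)
            nums = [int(s) for s in str_nums]   # map() is lazy in Python 3, so build a list
            avgs.append(sum(nums) / len(nums))  # / is true (float) division in Python 3

    print(avgs)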

If these files are huge and you are worried about how much you are reading into memory, you can loop over each line explicitly and keep a running count, much like your original example did:

    for datpath in glob.iglob("*.dat"):
        with open(datpath, 'r') as f:
            count = 0
            total = 0
            for line in f:
                nums = [int(i) for i in line.strip().split()]
                count += len(nums)
                total += sum(nums)
            avgs.append(total / count)
  • Note: I do not handle exceptional cases, such as an empty file causing a division-by-zero.
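Since the question actually asks for the element-wise average across all of the files, written out in the same m rows by n columns layout, here is a minimal sketch of that variant. It assumes every .dat file has exactly the same shape, and the output filename average.txt is just an example:

    import glob

    paths = sorted(glob.glob("*.dat"))

    # read every file into a 2-D grid of floats
    grids = []
    for datpath in paths:
        with open(datpath) as f:
            grids.append([[float(x) for x in line.split()] for line in f])

    # element-wise average, assuming all grids have the same m x n shape
    with open("average.txt", "w") as out:       # output name is just an example
        for rows in zip(*grids):                # the i-th row from every file
            averaged = [sum(cells) / float(len(grids)) for cells in zip(*rows)]
            out.write(" ".join("{:g}".format(v) for v in averaged) + "\n")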
