"Reading in a large text file in hdf5 via PyTables or PyHDF?

I am trying to do some statistics with SciPy, but my input dataset is quite large (~1.9 GB) and in dbf format. The file is large enough that NumPy raises a MemoryError when I try to create an array with genfromtxt. (I have 3 GB of RAM, but I am running 32-bit Windows.)

i.e.:

Traceback (most recent call last):

  File "<pyshell#5>", line 1, in <module>
    ind_sum = numpy.genfromtxt(r"W:\RACER_Analyses\Terrestrial_Heterogeneity\IND_SUM.dbf", dtype = (int, int, int, float, float, int), names = True, usecols = (5))

File "C:\Python26\ArcGIS10.0\lib\site-packages\numpy\lib\npyio.py", line 1335, in genfromtxt
    for (i, line) in enumerate(itertools.chain([first_line, ], fhd)):

MemoryError

From other posts, I can see that the chunked arrays provided by PyTables may be useful, but my problem is reading this data in the first place. In other words, PyTables or PyHDF can easily create the HDF5 output that I need, but what should I do to get my data into an array first?

For instance:

import numpy, scipy, tables

h5file = tables.openFile(r"W:\RACER_Analyses\Terrestrial_Heterogeneity\HET_IND_SUM2.h5", mode = "w", title = "Diversity Index Results")

group = h5file.createGroup("/", "IND_SUM", "Aggregated Index Values")

and then I could create a table or array, but how do I refer back to the original dbf data? In the description?

Thanks in advance!


If the data is too big to fit in memory, you can use a memory-mapped file (it behaves like a numpy array but is stored on disk - see the docs), though you may be able to get similar results with HDF5, depending on what operations you need to perform on the array. Obviously this will make many steps slower, but it is better than not being able to do them at all.
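
A minimal sketch of a memory-mapped array (the file name, shape, and dtype below are made up for illustration):

import numpy as np

# Disk-backed array that behaves like an ordinary ndarray.
data = np.memmap("ind_sum.dat", dtype="float64", mode="w+", shape=(25000000,))

data[:100] = 1.0   # normal slicing and assignment work
data.flush()       # make sure the changes reach the disk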

Because you are hitting a memory limit, I think you cannot use genfromtxt. Instead, you should iterate through your text file one line at a time and write the values into the appropriate positions in the memmap/hdf5 object.
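
For example, here is a sketch that parses a comma-separated export one line at a time and appends column 5 to an extendable PyTables array (the CSV file name and the array name are assumptions, not from the original post):

import tables

h5file = tables.openFile("het_ind_sum.h5", mode="w", title="Diversity Index Results")
col5 = h5file.createEArray(h5file.root, "col5", tables.Float64Atom(), (0,))

with open("IND_SUM.csv") as fh:
    next(fh)                             # skip the header row
    for line in fh:
        fields = line.split(",")
        col5.append([float(fields[5])])  # only one line is in memory at a time

h5file.close()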

, " dbf"? , , - . HDF5 "", .

Also, I have found that using h5py is a much simpler and cleaner way to access hdf5 files than pytables, although that is largely a matter of preference.
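
The same kind of workflow with h5py might look roughly like this (the file and dataset names are made up, and the chunk of values stands in for whatever you parse from the input):

import numpy as np
import h5py

with h5py.File("het_ind_sum_h5py.h5", "w") as f:
    # Resizable one-dimensional dataset that can grow as values arrive.
    dset = f.create_dataset("col5", shape=(0,), maxshape=(None,),
                            dtype="float64", chunks=(1024,))
    dset.attrs["source_file"] = "IND_SUM.dbf"

    chunk = np.arange(1000, dtype="float64")   # stand-in for parsed values
    old = dset.shape[0]
    dset.resize((old + chunk.shape[0],))
    dset[old:] = chunk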


If your data is in a dbf file, you could try my dbf package - it only keeps in memory the records that are being accessed, so looping through the records and pulling out the values you need should be easy:

import dbf

table = dbf.Table(r"W:\RACER_Analyses\Terrestrial_Heterogeneity\IND_SUM.dbf")

sums = [0, 0, 0, 0.0, 0.0, 0]

for record in table:        # only the current record is held in memory
    for index in range(5):  # accumulate the first five columns
        sums[index] += record[index]
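
In the same spirit, the values could be streamed straight into the HDF5 file from the question while iterating over the dbf records (field index 5 and the paths are taken from the question; the array name is an assumption):

import dbf
import tables

table = dbf.Table(r"W:\RACER_Analyses\Terrestrial_Heterogeneity\IND_SUM.dbf")

h5file = tables.openFile(r"W:\RACER_Analyses\Terrestrial_Heterogeneity\HET_IND_SUM2.h5",
                         mode="w", title="Diversity Index Results")
col5 = h5file.createEArray(h5file.root, "col5", tables.Float64Atom(), (0,))

for record in table:
    col5.append([float(record[5])])   # one record at a time, so memory use stays small

h5file.close()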

Source: https://habr.com/ru/post/1784162/

