cPickle a very large amount of data

I have about 0.8 million 256x256 RGB images, which is over 7 GB.

I want to use them as training data for a convolutional neural network (CNN), and I want to store them in a cPickle file along with their labels.

Right now this takes up so much memory that it has to swap to the hard drive, and it ends up consuming almost all of that as well.

Is this a bad idea?

What would be a smarter / more practical way of loading the data into the CNN, or of preparing it, without causing too many memory problems?

The code looks like this:

    import numpy as np
    import cPickle
    from PIL import Image
    import sys, os

    pixels = []
    labels = []
    traindata = []
    data = []

    for subdir, dirs, files in os.walk('images'):
        curdir = ''
        for file in files:
            if file.endswith(".jpg"):
                floc = str(subdir) + '/' + str(file)
                im = Image.open(floc)
                pix = np.array(im.getdata())   # flat (65536, 3) pixel array per image
                pixels.append(pix)
                labels.append(1)               # every image in this walk gets label 1

    pixels = np.array(pixels)
    labels = np.array(labels)
    traindata.append(pixels)
    traindata.append(labels)
    traindata = np.array(traindata)
    # ..... do the same for validation and test data
    # ..... put all data and labels into the 'data' array
    cPickle.dump(data, open('data.pkl', 'wb'))
1 answer

Is this a bad idea?

Yes indeed.

You are trying to load 7 GB of compressed image data into memory all at once (about 195 GB for 800K decompressed 256*256 RGB files). This will not work. You need to find a way to feed the images to the CNN incrementally, updating and saving its state as you go.
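As a minimal sketch of that idea, a generator can walk the same 'images' directory as in the question and yield small mini-batches, so only one batch is ever in memory at a time. The batch size, the constant label and the train_on_batch call at the end are placeholders, not anything from the original code:

    import os
    import numpy as np
    from PIL import Image

    def iterate_minibatches(root='images', batch_size=32):
        """Yield (images, labels) mini-batches instead of one huge array."""
        batch, labels = [], []
        for subdir, dirs, files in os.walk(root):
            for name in files:
                if not name.endswith('.jpg'):
                    continue
                im = Image.open(os.path.join(subdir, name))
                batch.append(np.asarray(im, dtype=np.uint8))
                labels.append(1)  # replace with the real label for this image
                if len(batch) == batch_size:
                    yield np.array(batch), np.array(labels)
                    batch, labels = [], []
        if batch:  # final, possibly smaller, batch
            yield np.array(batch), np.array(labels)

    # The network is then updated batch by batch, so the full 7 GB
    # never has to sit in RAM at once:
    # for X, y in iterate_minibatches():
    #     net.train_on_batch(X, y)   # hypothetical training call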

Also consider how large your set of CNN parameters will be. Pickle is not designed for large amounts of data. If you need to store gigabytes' worth of neural network data, you are much better off using a database. If the network's parameter set is only a few MB, pickling it will be fine.
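If you do go the database route, even the sqlite3 module from the standard library is enough to store arrays as blobs and read them back one at a time. The table layout and parameter names below are invented purely for illustration:

    import sqlite3
    import numpy as np

    conn = sqlite3.connect('network.db')
    conn.execute('CREATE TABLE IF NOT EXISTS params'
                 ' (name TEXT PRIMARY KEY, shape TEXT, data BLOB)')

    def save_param(name, arr):
        # store the raw float32 bytes plus the shape needed to rebuild the array
        arr = np.ascontiguousarray(arr, dtype=np.float32)
        shape = ','.join(str(d) for d in arr.shape)
        conn.execute('INSERT OR REPLACE INTO params VALUES (?, ?, ?)',
                     (name, shape, sqlite3.Binary(arr.tobytes())))
        conn.commit()

    def load_param(name):
        shape, blob = conn.execute('SELECT shape, data FROM params'
                                   ' WHERE name = ?', (name,)).fetchone()
        dims = tuple(int(d) for d in shape.split(','))
        return np.frombuffer(blob, dtype=np.float32).reshape(dims)

    save_param('conv1_w', np.random.randn(32, 3, 5, 5))
    print(load_param('conv1_w').shape)   # -> (32, 3, 5, 5)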

You can also take a look at the documentation for pickle.HIGHEST_PROTOCOL, so that you don't end up stuck with an old and unoptimized pickle file format.
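For the small-parameter case, passing the protocol explicitly is a one-line change; the 'params' dict here is a made-up stand-in for a small set of network weights:

    import numpy as np
    import cPickle  # just 'import pickle' on Python 3

    # hypothetical, few-MB parameter set -- small enough for pickle
    params = {'conv1_w': np.random.randn(32, 3, 5, 5).astype(np.float32),
              'conv1_b': np.zeros(32, dtype=np.float32)}

    with open('params.pkl', 'wb') as f:
        # HIGHEST_PROTOCOL selects the newest binary format instead of the
        # old, slow, text-based protocol 0 that cPickle.dump() defaults to
        cPickle.dump(params, f, cPickle.HIGHEST_PROTOCOL)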

