Memory error with large images when running a convolutional neural network with TensorFlow on an AWS g2.2xlarge instance

I am running a convolutional neural network on an AWS g2.2xlarge instance. The model works fine with 30,000 64x64 images. However, when I try to run it with 128x128 images, it gives a memory error (see below), even when I feed in only one image (with 2 channels, real and imaginary).
Since the error mentions a tensor of shape [32768,16384], I assume this happens in the first (fully connected) layer, which takes an input image with two channels, 128 * 128 * 2 = 32768 values, and outputs a vector of 128 * 128 = 16384 values. I found recommendations to reduce the batch size, but I am already using only 1 input image.
Here it is written that with cuDNN you can go up to 700-900 px images on the same AWS instance that I use (although I don't know whether they use fully connected layers). I tried two different AMIs (1 and 2), both with cuDNN installed, but I still get the memory error.

My questions:
1. How do I calculate how much memory is required for a tensor of shape [32768,16384]? I am not a computer scientist, so I would appreciate a detailed answer.
2. I am trying to figure out whether the instance I use really has too little memory for my data (the g2.2xlarge has 15 gigabytes), or whether I am just doing something wrong.

Error:

2018-01-24 16:36:53.666427: I 
tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports 
instructions that this TensorFlow binary was not compiled to use: SSE4.1 
SSE4.2 AVX
2018-01-24 16:36:55.069050: I 
tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node 
read from SysFS had negative value (-1), but there must be at least one NUMA 
node, so returning NUMA node zero
2018-01-24 16:36:55.069287: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:1062] Found device 0 with 
properties: 
name: GRID K520 major: 3 minor: 0 memoryClockRate(GHz): 0.797
pciBusID: 0000:00:03.0
totalMemory: 3.94GiB freeMemory: 3.90GiB
2018-01-24 16:36:55.069316: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:1152] Creating TensorFlow 
device (/device:GPU:0) -> (device: 0, name: GRID K520, pci bus id: 
0000:00:03.0, compute capability: 3.0)
2018-01-24 16:37:59.766001: W 
tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran 
out of memory trying to allocate 2.00GiB.  Current allocation summary follows.
2018-01-24 16:37:59.766054: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (256):     Total 
Chunks: 10, Chunks in use: 10. 2.5KiB allocated for chunks. 2.5KiB in use in 
bin. 40B client-requested in use in bin.
2018-01-24 16:37:59.766070: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (512):     Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766084: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (1024):    Total 
Chunks: 1, Chunks in use: 1. 1.2KiB allocated for chunks. 1.2KiB in use in 
bin. 1.0KiB client-requested in use in bin.
2018-01-24 16:37:59.766094: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (2048):    Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766108: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (4096):    Total 
Chunks: 2, Chunks in use: 2. 12.5KiB allocated for chunks. 12.5KiB in use in 
bin. 12.5KiB client-requested in use in bin.
2018-01-24 16:37:59.766122: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (8192):    Total 
Chunks: 2, Chunks in use: 2. 24.5KiB allocated for chunks. 24.5KiB in use in 
bin. 24.5KiB client-requested in use in bin.
2018-01-24 16:37:59.766134: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (16384):   Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766143: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (32768):   Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766155: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (65536):   Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766163: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (131072):  Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766177: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (262144):  Total 
Chunks: 2, Chunks in use: 2. 800.0KiB allocated for chunks. 800.0KiB in use in 
bin. 800.0KiB client-requested in use in bin.
2018-01-24 16:37:59.766196: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (524288):  Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766208: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (1048576):     Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766221: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (2097152):     Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766230: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (4194304):     Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766241: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (8388608):     Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766250: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (16777216):    Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766262: I         
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (33554432):    Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766271: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (67108864):    Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766282: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (134217728):   Total 
Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B 
client-requested in use in bin.
2018-01-24 16:37:59.766292: I 
tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (268435456):   Total 
Chunks: 2, Chunks in use: 1. 3.57GiB allocated for chunks. 2.00GiB in use in 
bin. 2.00GiB client-requested in use in bin.
2018-01-24 16:37:59.766304: I 
tensorflow/core/common_runtime/bfc_allocator.cc:644] Bin for 2.00GiB was 
256.00MiB, Chunk State: 
2018-01-24 16:37:59.766335: I 
tensorflow/core/common_runtime/bfc_allocator.cc:650]   Size: 1.57GiB | 
Requested Size: 0B | in_use: 0, prev:   Size: 2.00GiB | Requested Size: 
2.00GiB | in_use: 1
2018-01-24 16:37:59.766358: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680000 of 
size 1280
2018-01-24 16:37:59.766374: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680500 of 
size 256
2018-01-24 16:37:59.766381: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680600 of 
size 256
2018-01-24 16:37:59.766387: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680700 of 
size 256
2018-01-24 16:37:59.766397: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680800 of 
size 256
2018-01-24 16:37:59.766402: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680900 of 
size 256
2018-01-24 16:37:59.766412: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680a00 of 
size 256
2018-01-24 16:37:59.766422: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680b00 of 
size 256
2018-01-24 16:37:59.766429: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680c00 of 
size 256
2018-01-24 16:37:59.766435: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680d00 of 
size 256
2018-01-24 16:37:59.766459: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680e00 of 
size 256
2018-01-24 16:37:59.766471: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702680f00 of 
size 6400
2018-01-24 16:37:59.766477: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702682800 of 
size 6400
2018-01-24 16:37:59.766482: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702684100 of 
size 409600
2018-01-24 16:37:59.766492: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x7026e8100 of 
size 409600
2018-01-24 16:37:59.766499: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x70274c100 of 
size 12544
2018-01-24 16:37:59.766509: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x70274f200 of 
size 12544
2018-01-24 16:37:59.766517: I 
tensorflow/core/common_runtime/bfc_allocator.cc:662] Chunk at 0x702752300 of 
size 2147483648
2018-01-24 16:37:59.766523: I 
tensorflow/core/common_runtime/bfc_allocator.cc:671] Free at 0x782752300 of 
size 1684724992
2018-01-24 16:37:59.766530: I 
tensorflow/core/common_runtime/bfc_allocator.cc:677]      Summary of in-use 
Chunks by size: 
2018-01-24 16:37:59.766543: I 
tensorflow/core/common_runtime/bfc_allocator.cc:680] 10 Chunks of size 256 
totalling 2.5KiB
2018-01-24 16:37:59.766557: I 
tensorflow/core/common_runtime/bfc_allocator.cc:680] 1 Chunks of size 1280 
totalling 1.2KiB
2018-01-24 16:37:59.766569: I 
tensorflow/core/common_runtime/bfc_allocator.cc:680] 2 Chunks of size 6400 
totalling 12.5KiB
2018-01-24 16:37:59.766577: I 
tensorflow/core/common_runtime/bfc_allocator.cc:680] 2 Chunks of size 12544 
totalling 24.5KiB
2018-01-24 16:37:59.766585: I 
tensorflow/core/common_runtime/bfc_allocator.cc:680] 2 Chunks of size 409600 
totalling 800.0KiB
2018-01-24 16:37:59.766596: I 
tensorflow/core/common_runtime/bfc_allocator.cc:680] 1 Chunks of size 
2147483648 totalling 2.00GiB
2018-01-24 16:37:59.766606: I 
tensorflow/core/common_runtime/bfc_allocator.cc:684] Sum Total of in-use 
chunks: 2.00GiB
2018-01-24 16:37:59.766620: I 
tensorflow/core/common_runtime/bfc_allocator.cc:686] Stats: 
Limit:                  3833069568
InUse:                  2148344576
MaxInUse:               2148344576
NumAllocs:                      18
MaxAllocSize:           2147483648

2018-01-24 16:37:59.766635: W 
tensorflow/core/common_runtime/bfc_allocator.cc:277] 

2018-01-24 16:37:59.766660: W tensorflow/core/framework/op_kernel.cc:1188] 
Resource exhausted: OOM when allocating tensor of shape [32768,16384] and type 
float
2018-01-24 16:38:00.828932: E tensorflow/core/common_runtime/executor.cc:651] 
Executor failed to create kernel. Resource exhausted: OOM when allocating 
tensor of shape [32768,16384] and type float
[[Node: fc1/weights/RMSProp_1/Initializer/zeros = Const[_class=
["loc:@fc1/weights"], dtype=DT_FLOAT, value=Tensor<type: float shape: 
[32768,16384] values: [0 0 0]...>, 
_device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Traceback (most recent call last):
File "myAutomap.py", line 278, in <module>
print_cost=True)
File "myAutomap.py", line 240, in model
sess.run(init)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py", 
line 889, in run
run_metadata_ptr)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py", 
line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py", 
line 1317, in _do_run
options, run_metadata)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py", 
line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when 
allocating tensor of shape [32768,16384] and type float
[[Node: fc1/weights/RMSProp_1/Initializer/zeros = Const[_class=
["loc:@fc1/weights"], dtype=DT_FLOAT, value=Tensor<type: float shape: 
[32768,16384] values: [0 0 0]...>, 
_device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Caused by op u'fc1/weights/RMSProp_1/Initializer/zeros', defined at:
File "myAutomap.py", line 278, in <module>
print_cost=True)
File "myAutomap.py", line 228, in model
optimizer = tf.train.RMSPropOptimizer(learning_rate).minimize(cost)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/training/optimizer.py", line 365, in minimize
name=name)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/training/optimizer.py", line 516, in 
apply_gradients
self._create_slots([_get_variable_for(v) for v in var_list])
File "/usr/lib/python2.7/dist-packages/tensorflow/python/training/rmsprop.py", 
line 113, in _create_slots
self._zeros_slot(v, "momentum", self._name)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/training/optimizer.py", line 882, in _zeros_slot
named_slots[_var_key(var)] = slot_creator.create_zeros_slot(var, op_name)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/training/slot_creator.py", line 174, in 
create_zeros_slot
colocate_with_primary=colocate_with_primary)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/training/slot_creator.py", line 148, in 
create_slot_with_initializer
dtype)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/training/slot_creator.py", line 67, in 
_create_slot_var
validate_shape=validate_shape)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/ops/variable_scope.py", line 1256, in get_variable
constraint=constraint)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/ops/variable_scope.py", line 1097, in get_variable
constraint=constraint)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/ops/variable_scope.py", line 435, in get_variable
constraint=constraint)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/ops/variable_scope.py", line 404, in _true_getter
use_resource=use_resource, constraint=constraint)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/ops/variable_scope.py", line 806, in 
_get_single_variable
constraint=constraint)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", 
line 229, in __init__
constraint=constraint)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", 
line 323, in _init_from_args
initial_value(), name="initial_value", dtype=dtype)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/ops/variable_scope.py", line 780, in <lambda>
shape.as_list(), dtype=dtype, partition_info=partition_info)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/init_ops.py", 
line 93, in __call__
return array_ops.zeros(shape, dtype)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", 
line 1509, in zeros
output = constant(zero, shape=shape, dtype=dtype, name=name)
File "/usr/lib/python2.7/dist-
packages/tensorflow/python/framework/constant_op.py", line 218, in constant
name=name).outputs[0]
File "/usr/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", 
line 3069, in create_op
op_def=op_def)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", 
line 1579, in __init__
self._traceback = self._graph._extract_stack()  # pylint: disable=protected-
access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor 
of shape [32768,16384] and type float
[[Node: fc1/weights/RMSProp_1/Initializer/zeros = Const[_class=
["loc:@fc1/weights"], dtype=DT_FLOAT, value=Tensor<type: float shape: 
[32768,16384] values: [0 0 0]...>, 
_device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Errore di segmentazione (Segmentation fault)
Answer:

It depends on the data type of your tensor (int32, int64, float16, float32, float64). For question 1: the memory is 32768 x 16384 x memory_size_of_your_datatype (for example, float64 is 64 bits, i.e. 8 bytes per element, which comes to about 4.3e9 bytes, roughly 4.3 GB). If your precision requirements allow it, switching from float64 to float32 or float16 cuts this to 1/2 or 1/4. Regarding AWS: what matters here is not the instance's 15 GB of system memory but the GPU RAM, which on your instance is only about 4 GiB (your log shows totalMemory: 3.94GiB), so a fully connected layer of this size can easily exhaust it.
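As a quick check of that arithmetic in Python, using the exact shape from the error:

# Size of the fc1 weight tensor [32768, 16384] for different data types.
shape = (32768, 16384)          # 128*128*2 inputs -> 128*128 outputs
num_elements = shape[0] * shape[1]

for dtype, bytes_per_element in [("float16", 2), ("float32", 4), ("float64", 8)]:
    print("%s: %.2f GiB" % (dtype, num_elements * bytes_per_element / 2.0**30))

# float32: 2.00 GiB -- exactly the allocation that fails in the log above.
# RMSProp also creates slot variables of the same shape (the failing node is
# fc1/weights/RMSProp_1/Initializer/zeros), so the total footprint of this
# one layer during training is several times the 2 GiB of the weights alone.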

To see in detail where the memory goes, you can also use the TensorFlow profiler: https://www.tensorflow.org/api_docs/python/tf/profiler/Profiler
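A minimal sketch of one way to use it with the TF 1.x API (it assumes the model graph has already been built, e.g. by myAutomap.py); it prints the parameter count of every trainable variable, which you can multiply by the element size to estimate memory:

import tensorflow as tf

# Build the model first, then profile the default graph.
opts = tf.profiler.ProfileOptionBuilder.trainable_variables_parameter()
tf.profiler.profile(tf.get_default_graph(), options=opts)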

Edit: you can also control how TensorFlow uses GPU memory by creating a tf.ConfigProto() and passing it via tf.Session(config=...).

In particular, have a look at the allow_growth, allow_soft_placement and per_process_gpu_memory_fraction options (the last one limits the fraction of GPU memory the process may use).
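A minimal sketch of such a session configuration for the TF 1.x API (the 0.9 fraction is just an example value):

import tensorflow as tf

config = tf.ConfigProto(allow_soft_placement=True)  # fall back to CPU when an op cannot be placed on the GPU
config.gpu_options.allow_growth = True              # grab GPU memory on demand instead of all at once
config.gpu_options.per_process_gpu_memory_fraction = 0.9  # cap TensorFlow at ~90% of GPU memory

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    # ... run training as before ...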


Source: https://habr.com/ru/post/1693475/

