Cost and Activation Functions for Multiple Independent Labels

After completing the MNIST / CIFAR tutorials, I thought I would experiment with TensorFlow by creating my own "big" dataset. For simplicity, I settled on black-and-white oval shapes whose height and width vary independently on a 0.0-1.0 scale, rendered as 28x28-pixel images (I have 5,000 training images and 1,000 test images).

My code uses the MNIST expert tutorial as its basis (shortened for speed), but I switched to a squared-error cost and, based on the tips here, changed the final activation to a sigmoid, since this is not classification but rather a "best fit" between two tensors, y_ and y_conv.

However, over more than 100,000 iterations, the loss quickly settles between 400 and 900 (which works out to roughly 0.2-0.3 off any given label, averaged over the 2 labels in a batch of 50), so I assume I am just producing noise. Maybe I am wrong, but I was hoping to use TensorFlow's convolutions on the images to predict maybe 10 or more independent labelled variables. Am I missing something fundamental here?
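That 0.2-0.3 figure is just a back-of-the-envelope estimate from the reported loss values, assuming the (sum of absolute errors) squared cost defined in the code below:

import math

# A reported loss of 900 means the sum of absolute errors over the batch is sqrt(900) = 30.
total_abs_error = math.sqrt(900)
# A batch of 50 examples with 2 labels each gives 100 predicted values per batch.
print(total_abs_error / (50 * 2))  # -> 0.3 average error per label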

def train(images, labels):
    # Import data
    oval = blender_input_data.read_data_sets(images, labels)

    sess = tf.InteractiveSession()

    # Establish placeholders
    x = tf.placeholder("float", shape=[None, 28, 28, 1])
    tf.image_summary('images', x)
    y_ = tf.placeholder("float", shape=[None, 2])

    # Functions for Weight Initialization.
    def weight_variable(shape):
        initial = tf.truncated_normal(shape, stddev=0.1)
        return tf.Variable(initial)

    def bias_variable(shape):
        initial = tf.constant(0.1, shape=shape)
        return tf.Variable(initial)

    # Functions for convolution and pooling
    def conv2d(x, W):
        return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

    def max_pool_2x2(x):
        return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

    # First Variables
    W_conv1 = weight_variable([5, 5, 1, 16])
    b_conv1 = bias_variable([16])

    # First Convolutional Layer.
    h_conv1 = tf.nn.relu(conv2d(x, W_conv1) + b_conv1)
    h_pool1 = max_pool_2x2(h_conv1)
    _ = tf.histogram_summary('weights 1', W_conv1)
    _ = tf.histogram_summary('biases 1', b_conv1)

    # Second Variables
    W_conv2 = weight_variable([5, 5, 16, 32])
    b_conv2 = bias_variable([32])

    # Second Convolutional Layer
    h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
    h_pool2 = max_pool_2x2(h_conv2)
    _ = tf.histogram_summary('weights 2', W_conv2)
    _ = tf.histogram_summary('biases 2', b_conv2)

    # Fully connected Variables
    W_fc1 = weight_variable([7 * 7 * 32, 512])
    b_fc1 = bias_variable([512])

    # Fully connected Layer
    h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*32])
    h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
    _ = tf.histogram_summary('weights 3', W_fc1)
    _ = tf.histogram_summary('biases 3', b_fc1)

    # Drop out to reduce overfitting
    keep_prob = tf.placeholder("float")
    h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

    # Readout layer with sigmoid activation function.
    W_fc2 = weight_variable([512, 2])
    b_fc2 = bias_variable([2])

    with tf.name_scope('Wx_b'):
        y_conv = tf.sigmoid(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)
    _ = tf.histogram_summary('weights 4', W_fc2)
    _ = tf.histogram_summary('biases 4', b_fc2)
    _ = tf.histogram_summary('y', y_conv)

    # Loss with squared errors
    with tf.name_scope('diff'):
        error = tf.reduce_sum(tf.abs(tf.sub(y_, y_conv)))
        diff = (error * error)
        _ = tf.scalar_summary('diff', diff)

    # Train
    with tf.name_scope('train'):
        train_step = tf.train.AdamOptimizer(1e-4).minimize(diff)

    # Merge summaries and write them out.
    merged = tf.merge_all_summaries()
    writer = tf.train.SummaryWriter('/home/user/TBlogs/oval_logs', sess.graph_def)

    # Add ops to save and restore all the variables.
    saver = tf.train.Saver()

    # Launch the session.
    sess.run(tf.initialize_all_variables())

    # Restore variables from disk.
    saver.restore(sess, "/home/user/TBlogs/model.ckpt")

    for i in range(100000):
        batch = oval.train.next_batch(50)
        t_batch = oval.test.next_batch(50)
        if i % 10 == 0:
            feed = {x: t_batch[0], y_: t_batch[1], keep_prob: 1.0}
            result = sess.run([merged, diff], feed_dict=feed)
            summary_str = result[0]
            df = result[1]
            writer.add_summary(summary_str, i)
            print('Difference: %s' % df)
        else:
            feed = {x: batch[0], y_: batch[1], keep_prob: 0.5}
            sess.run(train_step, feed_dict=feed)
        if i % 1000 == 0:
            save_path = saver.save(sess, "/home/user/TBlogs/model.ckpt")

    # Completion
    print("Session Done")

What bothers me the most is that TensorBoard seems to show the weights barely changing, even after many hours of training and a decaying learning rate (not shown in the code). My understanding of machine learning is that, when convolving images, the early layers effectively become edge-detection layers, so I am confused as to why they hardly change at all.

My current theories are:
1. I missed / misunderstood something about the loss function.
2. I misunderstood how the weights are initialized / updated.
3. I greatly underestimated how long the process should take... although, again, the loss just seems to fluctuate.

Any help would be greatly appreciated, thanks!

1 answer

From what I can see, your cost function is not the usual mean squared error.
You are optimizing the square of tf.reduce_sum(tf.abs(tf.sub(y_, y_conv))). That function is not differentiable at 0 (it is the square of the l1 norm), which can cause stability problems (especially during backpropagation; I don't know which subgradient is used in that case).
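To see the difference concretely, here is a rough NumPy sketch (with made-up toy numbers, not your data) comparing the square-of-l1 cost you are using with a per-example mean squared error:

import numpy as np

# Toy batch: 3 examples, 2 labels each.
y_true = np.array([[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]])
y_pred = np.array([[0.3, 0.6], [0.4, 0.7], [0.8, 0.2]])

residual = y_true - y_pred
square_of_l1 = np.sum(np.abs(residual)) ** 2   # the cost in your code; grows with batch size
mse = np.mean(np.sum(residual ** 2, axis=1))   # per-example average; stays on the label scale
print(square_of_l1, mse)                       # -> 0.64 vs 0.04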

A typical mean squared error can be written as:

residual = tf.sub(y_, y_conv)
error = tf.reduce_mean(tf.reduce_sum(residual * residual, reduction_indices=[1]))

(Taking the mean over the batch, rather than a plain sum, keeps the value from depending on the batch size.) It is differentiable and should give you better behaviour.
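For example, a sketch of how this could slot into the training code from your question (same variable names and the same TF 0.x-era API as in your post; untested):

with tf.name_scope('diff'):
    residual = tf.sub(y_, y_conv)
    # Mean over the batch, sum over the 2 labels of each example.
    diff = tf.reduce_mean(tf.reduce_sum(residual * residual, reduction_indices=[1]))
    _ = tf.scalar_summary('diff', diff)

with tf.name_scope('train'):
    train_step = tf.train.AdamOptimizer(1e-4).minimize(diff)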


Source: https://habr.com/ru/post/1239894/

