The num_output value of the last fully connected layer will not be 1 for pixel prediction; it will be equal to w*h of the input image.
What made you think the value would be 1?
EDIT 1:
Below are the sizes of each layer mentioned in link 1, page 3:
LAYER     OUTPUT DIM [c*h*w]
coarse1   96*h1*w1      conv layer
coarse2   256*h2*w2     conv layer
coarse3   384*h3*w3     conv layer
coarse4   384*h4*w4     conv layer
coarse5   256*h5*w5     conv layer
coarse6   4096*1*1      fc layer
coarse7   X*1*1         fc layer

where 'X' can be interpreted as w*h.
To see this more concretely, suppose we have a network that predicts image pixels, and the images have a size of 10*10. Then the final fc layer will have an output dimension of 100*1*1 (as in coarse7), which can be interpreted as 10*10.
Now the question is how a 1-d array can correctly predict a 2-d image. Note that the loss for this output is computed against labels that correspond to the pixel data, so during training the weights learn to predict the pixel values.
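To illustrate (a minimal NumPy sketch, not Caffe code; the 10*10 size and the L2 loss are assumptions made just for this example), the flat fc output can be reinterpreted as an image grid and compared pixel-wise against the labels:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical flat output of the last fc layer: 100 values for a 10x10 image.
fc_output = rng.random(100)

# Reinterpret the 1-d vector as a 2-d image: 100*1*1 -> 10*10.
predicted = fc_output.reshape(10, 10)

# Pixel-wise labels of the same shape; the loss ties each fc unit to one pixel,
# so gradients push each output unit toward "its" pixel during training.
labels = rng.random((10, 10))
l2_loss = np.mean((predicted - labels) ** 2)

print(predicted.shape)  # (10, 10)
```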
EDIT 2:
Drawing the graph of your net using draw_net.py in caffe gives you the following:
The ReLU layers associated with conv6 and fc6 have the same name, which leads to the tangled connectivity in the drawn graph. I'm not sure whether this causes problems during training, but I would suggest renaming one of the ReLU layers to a unique name to avoid unforeseen issues.
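A quick way to spot such duplicates (a pure-Python sketch using a regex over the prototxt text; the fragment below is a made-up example, not your actual net definition) is to count the layer names:

```python
import re
from collections import Counter

# Example prototxt fragment where two layers accidentally share the name "relu6".
prototxt = '''
layer { name: "conv6" type: "Convolution" }
layer { name: "relu6" type: "ReLU" bottom: "conv6" top: "conv6" }
layer { name: "fc6"   type: "InnerProduct" }
layer { name: "relu6" type: "ReLU" bottom: "fc6" top: "fc6" }
'''

names = re.findall(r'name:\s*"([^"]+)"', prototxt)
duplicates = [n for n, c in Counter(names).items() if c > 1]
print(duplicates)  # ['relu6']
```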
Coming back to your question: there is no upsampling after the fully connected layers. As you can see from the log:
I1108 19:34:57.881680  4277 net.cpp:150] Setting up fc7
I1108 19:34:57.881718  4277 net.cpp:157] Top shape: 1 4070 (4070)
I1108 19:34:57.881826  4277 net.cpp:150] Setting up reshape
I1108 19:34:57.881846  4277 net.cpp:157] Top shape: 1 1 55 74 (4070)
I1108 19:34:57.884768  4277 net.cpp:150] Setting up conv6
I1108 19:34:57.885309  4277 net.cpp:150] Setting up pool6
I1108 19:34:57.885327  4277 net.cpp:157] Top shape: 1 63 55 74 (256410)
fc7 has an output size of 4070*1*1. It is reshaped to 1*55*74 before being fed into the conv6 layer.
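The numbers in the log are consistent (a NumPy sketch of what the reshape step does; the batch size of 1 matches the log, and no values are changed, only the layout):

```python
import numpy as np

# fc7 emits a flat vector of 4070 values per image (log: "Top shape: 1 4070").
fc7_out = np.zeros((1, 4070))

# 4070 = 55 * 74, so the flat vector can be laid out as a 1-channel 55x74 map
# (log: "Top shape: 1 1 55 74") without adding or dropping any elements.
assert 55 * 74 == 4070
reshaped = fc7_out.reshape(1, 1, 55, 74)

print(reshaped.shape)  # (1, 1, 55, 74)
```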
The final output of the network is produced by conv9, which has an output size of 1*55*74, exactly the same as the size of the labels (the depth data).
If my answer is still not clear, please point out exactly where you think upsampling is taking place.