Update Oct 9: things slow down because the computation is too fast for Python to pre-fill the queue and get the prefetching threads scheduled. The compute in the main thread takes about 2 ms per step, which is apparently too short a window for a prefetching thread to grab the GIL. A prefetching thread has a long startup delay, so it can always be starved by a steady stream of compute calls. As a result, the compute thread burns through all the pre-fetched examples and then spends most of its time waiting on the GIL while the occasional prefetching thread gets scheduled and produces a single example. The fix is to increase the number of Python prefetching threads, increase the queue capacity to fit the entire dataset, start the queue runners, and then sleep the main thread for a couple of seconds to let the runners pre-fill the queue.
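A minimal sketch of that fix, assuming a TF 1.x-style queue pipeline (the example tensor, queue capacity, and thread count here are placeholders, not the original code):

```python
import time
import tensorflow as tf

# Hypothetical producer of one example; stands in for the real input pipeline.
example = tf.random_uniform([784])
queue = tf.FIFOQueue(capacity=60000,  # large enough to fit the whole dataset
                     dtypes=[tf.float32], shapes=[[784]])
enqueue_op = queue.enqueue([example])

# More Python prefetching threads: replicate the enqueue op across 8 threads.
qr = tf.train.QueueRunner(queue, [enqueue_op] * 8)
tf.train.add_queue_runner(qr)

batch = queue.dequeue_many(100)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    time.sleep(5)  # pause the main thread so the runners can pre-fill the queue
    sess.run(batch)
    coord.request_stop()
    coord.join(threads)
```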
Old stuff
This is surprisingly slow.
This looks like a corner case that makes the last 3 examples unnecessarily slow (most of the optimization effort went into large models such as ImageNet, so MNIST-scale workloads were neglected).
You can diagnose problems by capturing timelines as described here.
Here are versions of these examples with timeline collection included.
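For reference, a minimal sketch of capturing a timeline with the TF 1.x tracing API (the traced op is just a stand-in):

```python
import tensorflow as tf
from tensorflow.python.client import timeline

x = tf.random_normal([1000, 1000])
y = tf.matmul(x, x)  # stand-in for the computation being profiled

with tf.Session() as sess:
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(y, options=run_options, run_metadata=run_metadata)
    # Write a Chrome trace, viewable at chrome://tracing
    trace = timeline.Timeline(run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
        f.write(trace.generate_chrome_trace_format())
```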
Here's the timeline of the feed_dict implementation:

Note that matmul takes a good chunk of the time here, so the reading overhead is negligible.
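A sketch of what such a feed_dict setup looks like (the model and batch size are illustrative, not the original example):

```python
import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])
w = tf.Variable(tf.truncated_normal([784, 10]))
y = tf.matmul(x, w)  # this matmul dominates the timeline

batch = np.random.rand(100, 784).astype(np.float32)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(y, feed_dict={x: batch})  # data copied in from Python each step
```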
Now here is the timeline for the reader-based implementation:
You can see that the bottleneck is the QueueDequeueMany op, which takes a whopping 45 ms.
If you zoom in, you will see a bunch of tiny MEMCPY and Cast operations, which is a sign that some op is CPU-only ( parse_single_example ), so the dequeue has to schedule several separate CPU->GPU transfers.
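A rough sketch of the kind of pipeline that produces this pattern, assuming MNIST-sized TFRecords (the filename and feature keys are placeholders): parse_single_example runs on the CPU, and tf.train.batch creates the queue behind the QueueDequeueMany op seen in the trace.

```python
import tensorflow as tf

filename_queue = tf.train.string_input_producer(['mnist.tfrecords'])
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)

# parse_single_example is a CPU-only op
features = tf.parse_single_example(serialized, features={
    'image_raw': tf.FixedLenFeature([], tf.string),
    'label': tf.FixedLenFeature([], tf.int64),
})
image = tf.decode_raw(features['image_raw'], tf.uint8)
image.set_shape([784])
image = tf.cast(image, tf.float32)  # the tiny Cast ops in the timeline

# tf.train.batch creates the queue whose QueueDequeueMany op shows up
images, labels = tf.train.batch([image, features['label']], batch_size=100)
```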
In the variant below with the GPU disabled, I don't see the tiny ops, but QueueDequeueMany still takes more than 10 ms. The time seems to scale linearly with batch size, so there is some fundamental slowness there. Filed #4740
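A micro-benchmark along these lines (my own sketch, not the filed repro) that measures how the dequeue time grows with batch size:

```python
import time
import tensorflow as tf

n = 10000
q = tf.FIFOQueue(capacity=n, dtypes=[tf.float32], shapes=[[784]])
fill = q.enqueue_many([tf.zeros([n, 784])])  # pre-fill with dummy examples
dequeues = {bs: q.dequeue_many(bs) for bs in [10, 100, 1000]}

with tf.Session() as sess:
    sess.run(fill)
    for bs, op in sorted(dequeues.items()):
        sess.run(op)  # warm-up run
        start = time.time()
        sess.run(op)
        print('batch %4d: %.1f ms' % (bs, (time.time() - start) * 1000))
```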