(Edit: sorry, my original argument was a reason why this makes sense, but I realized it does not hold, so this is a bit off-topic.)
I have not found the TF team's reasoning behind this, but it does not make computational sense, since the ops are written in C++.
Intuitively, we want to combine (multiply/add, etc.) different features from the same sequence at the same timestep. Different timesteps cannot be run in parallel, while batches/sequences can, so the preferred adjacency order is feature > batch/sequence > timestep.
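To make that concrete, here is a minimal sketch (all names and sizes are made up for illustration) of an RNN-style loop: the time dimension must be walked sequentially because each step depends on the previous hidden state, while within a timestep the whole batch is processed as one matrix product:

```python
import numpy as np

time, batch, feature, hidden = 5, 4, 3, 8    # arbitrary illustrative sizes
rng = np.random.default_rng(0)

x = rng.normal(size=(time, batch, feature))  # time-major input
W = rng.normal(size=(feature, hidden))       # input-to-hidden weights
U = rng.normal(size=(hidden, hidden))        # hidden-to-hidden weights
h = np.zeros((batch, hidden))

# The time loop is inherently sequential: step t needs h from step t-1.
# Within a step, x[t] holds the features of *all* sequences at that
# timestep, so the matmul over the batch is the parallelizable part.
for t in range(time):
    h = np.tanh(x[t] @ W + h @ U)

print(h.shape)  # (batch, hidden)
```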
By default, NumPy and C++ use row-major (C-like) memory layout, so

```
[[ 0.  1.  2.]
 [ 3.  4.  5.]
 [ 6.  7.  8.]]
```

is laid out as `[0,1,2,3,4,5,6,7,8]` in memory. This means that if we have
```
x = np.zeros([time, batch, feature])
```

(`time_major=True` in TensorFlow),
then in row-major memory we get a layout like `x[0,0,0], x[0,0,1], x[0,0,2], …, x[0,1,0], ...`, so e.g. the dot product of the weights with a vector from the same sequence and timestep (`w*x[t,b,:]`) is the most contiguous operation, followed by the next sequence `w*x[t,b+1,:]`, etc. This is what we want during training.
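A small check of this claim (sizes are arbitrary): filling a `[time, batch, feature]` array with a running counter shows that one timestep's feature vectors for consecutive sequences sit next to each other in memory:

```python
import numpy as np

time, batch, feature = 2, 2, 3
x = np.arange(time * batch * feature).reshape(time, batch, feature)

print(x.ravel(order='C'))   # memory order: [ 0  1  2  3  4  5  6  7  8  9 10 11]
# x[0,0,:] -> 0,1,2   then x[0,1,:] -> 3,4,5   (same timestep, next sequence)
# x[1,0,:] -> 6,7,8   then x[1,1,:] -> 9,10,11 (next timestep)

# x[t] is one contiguous (batch, feature) block, so sweeping w over
# x[t,0,:], x[t,1,:], ... walks memory sequentially:
print(x[0].flags['C_CONTIGUOUS'])  # True
```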
With `time_major=False`, which is the default and gives `[batch, time, feature]`, features from the same sequence but different timesteps are more adjacent instead, i.e. `w*x[batch,t,:]` is followed by `w*x[batch,t+1,:]`, etc. This might be faster for predicting a single sequence if the RNN is unrolled, but that is speculation.
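The same comparison via strides, as a sketch with made-up sizes (strides are the byte steps NumPy takes per axis, so they show what is adjacent in each layout):

```python
import numpy as np

time, batch, feature = 10, 32, 128        # arbitrary sizes
x_tm = np.zeros([time, batch, feature])   # time_major=True  layout
x_bm = np.zeros([batch, time, feature])   # time_major=False layout

# float64 is 8 bytes; the smallest stride is always the feature axis.
print(x_tm.strides)  # (32768, 1024, 8): after x[t,b,:] comes x[t,b+1,:]
print(x_bm.strides)  # (10240, 1024, 8): after x[b,t,:] comes x[b,t+1,:]
```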
If you came to this question for the same reason I did: I learned to be careful with NumPy's slightly unintuitive indexing, which is meant to be pythonic, not necessarily row-major. Look at this. As expected:
```
x = np.zeros([3,3])
x[0:9].flat = np.arange(10)
print(x)
> [[ 0.  1.  2.]
>  [ 3.  4.  5.]
>  [ 6.  7.  8.]]
```
We would also expect `x[1] == x[0,1]`, but
```
print(x[1])
> [ 3.  4.  5.]

print(x[np.arange(10)<=4])
> IndexError: index 3 is out of bounds for axis 0 with size 3
```
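If flat, row-major semantics are actually what you want, a sketch of the explicit way to get them is `flat`/`ravel` (a hypothetical continuation of the example above):

```python
import numpy as np

x = np.zeros([3, 3])
x.flat = np.arange(9)           # fill in row-major (C) order

print(x.flat[1])                # 1.0, the second element in memory, == x[0,1]

flat = x.ravel()                # row-major flattened view of x
print(flat[np.arange(9) <= 4])  # [0. 1. 2. 3. 4.]: mask length matches now
```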