I think the answer is as follows:
In both cases, you compute A[1:] + A[:-1], and in both cases you actually create an intermediate array.
However, in the second case you also explicitly copy the entire newly allocated array into the reserved memory. Copying an array of that size takes about as long as the original operation, so the total time roughly doubles.
To summarize, in the first case you do:
compute A[1:] + A[:-1] (~10ms)
In the second case, you do:
compute A[1:] + A[:-1] (~10ms)
copy the result into out (~10ms)
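The two cases can be sketched roughly as follows. This is only an illustration: it assumes the second case is written as a slice assignment into a preallocated array named out, and the function names, array size and repeat count are made up for the example. Actual timings depend on the machine, so none are shown.

```python
import timeit
import numpy as np

def first_case(A):
    # The sum is written into a newly allocated array, which is returned.
    return A[1:] + A[:-1]

def second_case(A, out):
    # The same sum is first written into a temporary array,
    # then the slice assignment copies that temporary into `out`.
    out[:] = A[1:] + A[:-1]
    return out

def time_both(n=1_000_000, repeats=20):
    # Hypothetical helper: average seconds per call for each case.
    A = np.random.rand(n)
    out = np.empty(n - 1)
    t1 = timeit.timeit(lambda: first_case(A), number=repeats) / repeats
    t2 = timeit.timeit(lambda: second_case(A, out), number=repeats) / repeats
    return t1, t2
```

Calling time_both() returns the two averages, so you can check on your own machine how much the extra copy costs.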