Another easy way is to use the carry flag:
Repeat 8x:
    lsl r20 ; shift one bit into the carry flag
    ror r0  ; rotate carry flag into result
(Input in r20, output in r0; the contents of r20 are destroyed. Any other registers can be substituted freely.)
Fully unrolled, this uses 16 instructions at 2 bytes and 1 cycle each, i.e. 32 bytes of program memory and 16 cycles to reverse the bits of one byte. Wrapping the two instructions in a loop reduces the code size, but the execution time increases.
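To check the logic on a host machine, the two-instruction sequence can be modelled in C. This is only a sketch of what lsl and ror do to the registers and the carry flag, not AVR code; the function name reverse_bits_via_carry and the variables in, out and carry are made up for illustration and stand in for r20, r0 and the C flag.

```c
#include <stdint.h>

/* Host-side model of "lsl r20 / ror r0" repeated 8 times.
   in models r20, out models r0, carry models the C flag
   (all three names are illustrative, not part of the original answer). */
uint8_t reverse_bits_via_carry(uint8_t in)
{
    uint8_t out = 0; /* models r0; its old contents are shifted out after 8 rotations */

    for (int i = 0; i < 8; i++) {
        /* lsl: bit 7 goes into the carry, the register shifts left by one */
        uint8_t carry = (uint8_t)((in >> 7) & 1u);
        in = (uint8_t)(in << 1);

        /* ror: the carry goes into bit 7, the register shifts right by one.
           The bit leaving bit 0 would become the new carry on real hardware,
           but the next lsl overwrites it, so it is not modelled here. */
        out = (uint8_t)((out >> 1) | (carry << 7));
    }
    return out;
}
```

For example, reverse_bits_via_carry(0xB4) returns 0x2D (binary 10110100 becomes 00101101). The model also shows why r0 does not need to be cleared first: after eight rotations every one of its original bits has been shifted out.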