This is faster because you can use the << operator instead of the * operator, i.e. it is faster to perform "left shift by 1" (multiply by two) than to perform "multiply by 43". You can get around this limitation by adding padding bytes to the end of each line of the image (as MS did for bitmaps in memory), but essentially this is a consequence of the difference in speed between the two instructions.
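To make the padding idea concrete, here is a minimal sketch in C. It assumes each row is padded up to the next power of two, which is my assumption for illustration and not necessarily the rule any real bitmap format uses. The names stride_shift_for and padded_offset are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest shift such that (1 << shift) >= width.
   (1 << shift) is the padded row length in bytes (assumed 1 byte per pixel). */
unsigned stride_shift_for(size_t width)
{
    unsigned shift = 0;
    assert(width > 0);
    while (((size_t)1 << shift) < width)
        shift++;
    return shift;
}

/* Offset of pixel (x, y) when every row occupies (1 << shift) bytes.
   The row lookup is a single shift, with no multiply. */
size_t padded_offset(size_t x, size_t y, unsigned shift)
{
    return (y << shift) + x;
}
```

For a 320-pixel-wide image this gives a shift of 9, so each row takes 512 bytes and 192 of them are padding. That is the trade: wasted memory in exchange for a cheaper address calculation.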
In the old days of 8-bit 320x200 (mode 13h), you could index a pixel using a simple formula:
pixOffset = xPos + yPos * 320;
But that was slow. A much better alternative was to use
C
pixOffset = xPos + (yPos << 8) + (yPos << 6);  /* yPos * 256 + yPos * 64 */
Asm
mov ax, xPos   ; ax = xPos
mov bx, yPos   ; bx = yPos
shl bx, 6      ; bx = yPos * 64
add ax, bx     ; ax = xPos + (yPos * 64)
shl bx, 2      ; bx = yPos * 256
add ax, bx     ; ax = xPos + yPos * 320
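The trick works because 320 = 256 + 64, and both of those are powers of two. If you want to convince yourself that the two forms agree, here is a small sketch in C that compares them for every pixel of a 320x200 screen. The function names are mine, made up for illustration.

```c
#include <assert.h>

/* Mode 13h dimensions, as given in the text above. */
enum { SCREEN_W = 320, SCREEN_H = 200 };

/* Plain formula: one multiply. */
unsigned offset_mul(unsigned x, unsigned y)
{
    assert(x < SCREEN_W && y < SCREEN_H);
    return x + y * 320u;
}

/* Shift form: 320 = 256 + 64 = (1 << 8) + (1 << 6). */
unsigned offset_shift(unsigned x, unsigned y)
{
    assert(x < SCREEN_W && y < SCREEN_H);
    return x + (y << 8) + (y << 6);
}

/* Returns 1 if both forms agree for every pixel on the screen, else 0. */
int offsets_agree(void)
{
    unsigned x, y;
    for (y = 0; y < SCREEN_H; y++)
        for (x = 0; x < SCREEN_W; x++)
            if (offset_mul(x, y) != offset_shift(x, y))
                return 0;
    return 1;
}
```

Note that a modern optimizing compiler will usually do this kind of rewrite for you when the multiplier is a compile-time constant, so on current hardware the hand-written shift form is mostly of historical interest.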
This may seem counterintuitive, but when written well it uses only single-clock instructions, i.e. I can calculate the offset in 6 clocks. Of course, pipelining and cache issues complicate the picture.
In addition, it is much cheaper to implement shift registers in hardware than a complete multiplication block, both in $$ and in transistors. Therefore, the same number of transistors can be used to provide better performance, or a smaller number can be used for the same performance with less power dissipation.
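As a rough software analogy for why a general multiplier costs more than a shifter, here is the classic shift-and-add method sketched in C. This is not how any particular CPU implements its multiplier (real hardware is far more parallel); it only shows that a general multiply is built from many shifts and adds, where a multiply by a power of two needs just one shift. The name mul_shift_add is hypothetical.

```c
#include <stdint.h>

/* Multiply by repeated shift-and-add: one add for each set bit of b. */
uint32_t mul_shift_add(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    while (b != 0) {
        if (b & 1u)
            result += a;   /* add the current shifted copy of a */
        a <<= 1;           /* the next bit of b is worth twice as much */
        b >>= 1;
    }
    return result;         /* wraps modulo 2^32, like unsigned a * b in C */
}
```

Multiplying by 43 (binary 101011) takes four adds and six shifts this way, while multiplying by 64 takes one add and seven shifts, and in hardware that collapses to a single fixed shift.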
AFAIK, the mul (and div) instructions of modern processors are implemented using look-up tables. For the most part this mitigates the problem, but it is not without issues of its own. For further reading, look up the Pentium FDIV bug (the look-up table inside the chips was filled in incorrectly).
http://en.wikipedia.org/wiki/Pentium_FDIV_bug
So, in conclusion, it is essentially an artifact of the hardware / software used to implement the functions.