Good asm style is pretty universal across all ISAs (and across different asm dialects for the same ISA). Compiler output (e.g. from gcc / clang) usually does all the things I mention below, so it is a good guide. (And C compiler output is often a good starting point for optimizing a small function.)
As a rule, indent instructions one level deeper than labels and assembler directives.
Indent operands to a consistent column (so that mnemonics of different lengths do not leave your code ragged, and it is easy to scan down a block and see the destination register of each instruction as the first operand; see footnote 1).
Indent end-of-line comments to a consistent column on the right, far enough past the operands to avoid visual noise.
Group blocks of related instructions together, with a blank line to separate them. (Or, if you are optimizing for in-order CPUs by scheduling instructions, you cannot do this and should instead use comments to keep track of which part of the problem each instruction is working on. Using different levels of indentation for the comments can be useful then.)
Footnote 1:
Except for MIPS store instructions such as sw $t0, 1234($t1), where the first operand is actually the source. The designers chose to make asm source use the same operand order for both loads and stores, possibly because both are I-type instructions in machine code. This is typical of asm for RISC load/store architectures, so it is something to get used to if you are coming from CISC, where mov eax, [rdi] is a load and mov [rdi], eax is a store. And add [rdi], eax is both.
Example: an atoi function for unsigned integers, for real MIPS with branch delay slots. It does not target MIPS I, so it does not assume load delay slots, although I still tried to avoid load-use stalls. (Godbolt for a C version)
# unsigned decimal ASCII string to integer
This is probably not optimal for any particular MIPS implementation. An in-order superscalar CPU would probably benefit from placing more of the shift / add work between the load and the branch, even if that means the last iteration does more redundant work. As written, it is probably good on out-of-order execution like the r10k. Modern MIPS32r6 would use lsa to shift left and accumulate, as gcc does with -march=mips32r6, and would use the versions of the branch instructions that have no branch delay slot.
This can be pretty good on early scalar MIPS, though. The pointer increment fills the slot after the load, avoiding a stall inside the loop. (The immediate offset of 1 is there because we skipped the increment in the peeled first iteration.)
Filling the delay slot of the branch before .Lloop_entry would be possible if we wanted to compute more of the next iteration after addu $v0, $v0, $t0 inside the main loop. But that would create a dependency on $v0, which would hurt ILP on in-order superscalar CPUs. (Currently, everything from the top of the loop up to the addu can run in parallel, and then the addu can run in parallel with the lbu while it produces the new total.)
That would be fine on scalar in-order CPUs (e.g. MIPS I / MIPS II) or on out-of-order CPUs.
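
As a reference point for what such a loop computes, here is a minimal C sketch. It is not the Godbolt version mentioned above: the name atoi_unsigned is hypothetical, and both stopping at the first non-digit and splitting the multiply by 10 into two shifts and adds are assumptions made for illustration.

```c
/* Hypothetical reference version, not the original listing:
 * parse leading decimal digits of s into an unsigned total. */
unsigned atoi_unsigned(const char *s)
{
    unsigned total = 0;

    for (;;) {
        /* For anything below '0' the subtraction is negative and wraps
         * to a huge unsigned value, so one compare rejects both
         * non-digit ranges (and the terminating 0 byte). */
        unsigned digit = (unsigned char)*s++ - '0';
        if (digit > 9)
            break;

        /* total*10 + digit, written as shifts and adds:
         * (total << 3) + (total << 1) == total*8 + total*2 */
        total = (total << 3) + (total << 1) + digit;
    }
    return total;
}
```

Compiling something like this with a MIPS gcc or clang and reading the output is one way to get a starting point that already follows the layout rules above.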