Is my MIPS compiler crazy or am I crazy for choosing MIPS?

I am using a MIPS processor (PIC32) in an embedded project, but I'm starting to doubt my choice. I understand that a RISC processor such as MIPS will generate more instructions than you might expect, but I did not think it would be this bad. Here is a snippet from the disassembly listing:

    225:  LATDSET = 0x0040;
        sw      s1,24808(s2)
        sw      s4,24808(s2)
        sw      s4,24808(s2)
        sw      s1,24808(s2)
        sw      s4,24808(s3)
        sw      s4,24808(s3)
        sw      s1,24808(s3)
    226:  {
    227:      porte = PORTE;
        lw      t1,24848(s4)
        andi    v0,t1,0xffff
        lw      v1,24848(s6)
        andi    ra,v1,0xffff
        lw      v1,24848(s6)
        andi    ra,v1,0xffff
        lw      v0,24848(s6)
        andi    t2,v0,0xffff
        lw      a2,24848(s5)
        andi    v1,a2,0xffff
        lw      t2,24848(s5)
        andi    v1,t2,0xffff
        lw      v0,24848(s5)
        andi    t2,v0,0xffff
    228:      if (porte & 0x0004)
        andi    t2,v0,0x4
        andi    s8,ra,0x4
        andi    s8,ra,0x4
        andi    ra,t2,0x4
        andi    a1,v1,0x4
        andi    a2,v1,0x4
        andi    a2,t2,0x4
    229:          pst_bytes_somi[0] |= sliding_bit;
        or      t3,t4,s0
        xori    a3,t2,0x0
        movz    t3,s0,a3
        addu    s0,t3,zero
        or      t3,t4,s1
        xori    a3,s8,0x0
        movz    t3,s1,a3
        addu    s1,t3,zero
        or      t3,t4,s1
        xori    a3,s8,0x0
        movz    t3,s1,a3
        addu    s1,t3,zero
        or      v1,t4,s0
        xori    a3,ra,0x0
        movz    v1,s0,a3
        addu    s0,v1,zero
        or      a0,t4,s2
        xori    a3,a1,0x0
        movz    a0,s2,a3
        addu    s2,a0,zero
        or      t3,t4,s2
        xori    a3,a2,0x0
        movz    t3,s2,a3
        addu    s2,t3,zero
        or      v1,t4,s0
        xori    a3,a2,0x0
        movz    v1,s0,a3

This seems like a crazy number of instructions for simply reading, writing and testing variables at fixed addresses. On another processor, I could probably get every C statement down to about 1 to 3 instructions without resorting to hand-written asm. Obviously the clock speed is quite high, but it is not 10 times higher than that of another processor (for example, dsPIC).

My optimization is set to the maximum. Is my C compiler awful (this is gcc 3.4.4)? Or is this typical of MIPS?

+6
6 answers
I finally figured out the answer. The disassembly listing is completely misleading. The compiler is unrolling a loop, and what we see under each C statement is actually 8 times the real instruction count, because the loop is unrolled 8x. The instructions shown together are not at consecutive addresses! Turning off loop unrolling in the compiler options produces the following:
    225:  LATDSET = 0x0040;
        sw      s3,24808(s2)
    226:  {
    227:      porte = PORTE;
        lw      t1,24848(s5)
        andi    v0,t1,0xffff
    228:      if (porte & 0x0004)
        andi    t2,v0,0x4
    229:          pst_bytes_somi[0] |= sliding_bit;
        or      t3,t4,s0
        xori    a3,t2,0x0
        movz    t3,s0,a3
        addu    s0,t3,zero
    230:

Panic over, everyone.
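To make the effect concrete, here is a minimal sketch in C of what 8x unrolling does to a loop body. It is not the project code from the question: the function names and the buffer are hypothetical, and the second function is unrolled by hand only to imitate what the optimizer does on its own. A listing that groups instructions by source line will show all eight copies of a statement under that one line, which is why each statement above appeared to cost so many instructions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the original post): OR `bit` into
   every element of `buf`, written as a plain loop. */
void set_bit_rolled(uint32_t *buf, size_t n, uint32_t bit)
{
    size_t i;
    for (i = 0; i < n; i++)
        buf[i] |= bit;
}

/* The same work unrolled 8x by hand, as an imitation of what an
   unrolling optimizer emits. The single source statement
   `buf[i] |= bit` now corresponds to eight separate groups of
   machine instructions. */
void set_bit_unrolled8(uint32_t *buf, size_t n, uint32_t bit)
{
    size_t i = 0;

    /* Main body: eight elements per iteration. */
    for (; i + 8 <= n; i += 8) {
        buf[i + 0] |= bit;
        buf[i + 1] |= bit;
        buf[i + 2] |= bit;
        buf[i + 3] |= bit;
        buf[i + 4] |= bit;
        buf[i + 5] |= bit;
        buf[i + 6] |= bit;
        buf[i + 7] |= bit;
    }

    /* Remainder: the elements left over when n is not a multiple of 8. */
    for (; i < n; i++)
        buf[i] |= bit;
}
```

Both functions produce the same result for any `n`; only the instruction count per source line in a listing differs. With GCC, unrolling is controlled by the `-funroll-loops` option and its negative form `-fno-unroll-loops`; whether a given vendor toolchain enables it at its highest optimization level is something to check in that toolchain's documentation.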

+6

I think your compiler is behaving badly ... Check, for example, this statement:

    228:      if (porte & 0x0004)
        andi    t2,v0,0x4    (1)
        andi    s8,ra,0x4    (2)
        andi    s8,ra,0x4    (3)
        andi    ra,t2,0x4    (4)
        andi    a1,v1,0x4    (5)
        andi    a2,v1,0x4    (6)
        andi    a2,t2,0x4    (7)

Obviously, there are instructions that basically do nothing. Instruction (3) does nothing new, since it stores in s8 the same result already computed by instruction (2). Instruction (6) also has no effect, since its result is overwritten by the following instruction (7). I believe that any compiler that performs some static analysis phase would at least remove instructions (3) and (6).

A similar analysis applies to other parts of your code. For example, in the first expression, you can see that some registers (v0 and v1) are loaded with the same value twice.

I think your compiler is not doing a good job of optimizing compiled code.

+3

MIPS is basically the embodiment of everything that was stupid in RISC design. These days, x86 (and x86_64) have absorbed all the ideas that came out of RISC, and ARM has evolved to be much more efficient than traditional RISC while staying true to the RISC concept of keeping a small, systematic instruction set.

To answer the question, I would say that you are crazy for choosing MIPS or, more importantly, for choosing it without first learning a bit about the MIPS ISA, why it is so bad, and how much inefficiency you have to put up with if you want to use it. I would choose ARM for low-power / embedded systems in most situations, or, even better, Intel Atom, if you can afford a little more power.

Edit: Actually, there is a second reason you might be crazy. From the comments, it seems that you are using 16-bit integers. You should never use types smaller than int in C, except in arrays or in a structure that will be allocated in large numbers (either in an array or in some other way, for example a linked list or tree). Using small types never gains you anything except saved space (which does not matter until you have a large number of values of that type) and will almost certainly be less efficient than using "normal" types. In the case of MIPS, the difference is extreme. Switch to int and see if your problem goes away.
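A small sketch of where the extra work for narrow types comes from, under the assumption of a target with 32-bit registers and a 32-bit int. The helper names are hypothetical and not taken from the question. The result of 16-bit arithmetic has to wrap at 16 bits, so the compiler must either mask the value or use halfword loads and stores, which is consistent with the `andi ...,0xffff` instructions in the listings.

```c
#include <stdint.h>

/* Hypothetical helper: 16-bit addition. The operands are promoted to
   int, added at full register width, and the sum is then truncated
   back to 16 bits. That truncation is the extra step a 32-bit
   register machine has to perform for a narrow type. */
uint16_t add16(uint16_t a, uint16_t b)
{
    return (uint16_t)(a + b);
}

/* Hypothetical helper: the same addition at the natural register
   width, where no truncation step is needed. */
uint32_t add32(uint32_t a, uint32_t b)
{
    return a + b;
}
```

For example, `add16(0xFFFF, 1)` wraps to 0, while `add32(0xFFFF, 1)` gives 0x10000. How many extra instructions the truncation actually costs depends on the compiler and on the surrounding code.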

+2

The only thing I can think of is that the compiler might be inserting extra meaningless instructions to match the processor speed to a much slower data bus. Even that explanation is not quite sufficient, since the store/load instructions show the same redundancy.

While the compiler is the suspect, remember that focusing on the compiler can give you tunnel vision. Errors may be hidden in other parts of the toolchain.

Where did you get the compiler? I find that some of the "easy" sources often ship some pretty awful tools. Embedded developer friends of mine tend to compile their own toolchain, sometimes with much better results.

0

I tried compiling the following code with CodeSourcery MIPS GCC 4.4-303 with -O4. I tried this with uint32_t and uint16_t:

    #include <stdint.h>
    void foo(uint32_t PORTE, uint32_t pst_bytes_somi[], uint32_t sliding_bit) {
        uint32_t LATDSET = 0x0040;
        {
            uint32_t porte = PORTE;
            if (porte & 0x0004)
                pst_bytes_somi[0] |= sliding_bit;
            if (porte & LATDSET)
                pst_bytes_somi[1] |= sliding_bit;
        }
    }

Here is the disassembly with uint32_t integers:

            uint32_t porte = PORTE;
            if (porte & 0x0004)
       0:   30820004    andi    v0,a0,0x4
       4:   10400004    beqz    v0,18 <foo+0x18>
       8:   00000000    nop
    ./foo32.c:7
                pst_bytes_somi[0] |= sliding_bit;
       c:   8ca20000    lw      v0,0(a1)
      10:   00461025    or      v0,v0,a2
      14:   aca20000    sw      v0,0(a1)
    ./foo32.c:8
            if (porte & LATDSET)
      18:   30840040    andi    a0,a0,0x40
      1c:   10800004    beqz    a0,30 <foo+0x30>
      20:   00000000    nop
    ./foo32.c:9
                pst_bytes_somi[1] |= sliding_bit;
      24:   8ca20004    lw      v0,4(a1)
      28:   00463025    or      a2,v0,a2
      2c:   aca60004    sw      a2,4(a1)
      30:   03e00008    jr      ra
      34:   00000000    nop

Here is the disassembly with uint16_t integers:

            if (porte & 0x0004)
       4:   30820004    andi    v0,a0,0x4
       8:   10400004    beqz    v0,1c <foo+0x1c>
       c:   30c6ffff    andi    a2,a2,0xffff
    ./foo16.c:7
                pst_bytes_somi[0] |= sliding_bit;
      10:   94a20000    lhu     v0,0(a1)
      14:   00c21025    or      v0,a2,v0
      18:   a4a20000    sh      v0,0(a1)
    ./foo16.c:8
            if (porte & LATDSET)
      1c:   30840040    andi    a0,a0,0x40
      20:   10800004    beqz    a0,34 <foo+0x34>
      24:   00000000    nop
    ./foo16.c:9
                pst_bytes_somi[1] |= sliding_bit;
      28:   94a20002    lhu     v0,2(a1)
      2c:   00c23025    or      a2,a2,v0
      30:   a4a60002    sh      a2,2(a1)
      34:   03e00008    jr      ra
      38:   00000000    nop

As you can see, each C statement takes two to three instructions. Using 16-bit integers makes the function only one instruction longer.

0

Have you turned on compiler optimization? Unoptimized code has a lot of redundancy.

-1

Source: https://habr.com/ru/post/898767/

