ARM assembly: can't find a register in class 'GENERAL_REGS' while reloading 'asm'

I am trying to implement a function that multiplies a 32-bit operand by a 256-bit operand in ARM assembly on an ARM Cortex-A8. The problem is that I am running out of registers, and I have no idea how to reduce the number of registers used. Here is my function:

typedef struct UN_256fe{
    uint32_t uint32[8];
}UN_256fe;

typedef struct UN_288bite{
    uint32_t uint32[9];
}UN_288bite;

void multiply32x256(uint32_t A, UN_256fe* B, UN_288bite* res){
    asm (
        "umull r3, r4, %9, %10;\n\t"
        "mov %0, r3; \n\t"   /*res->uint32[0] = r3*/
        "umull r3, r5, %9, %11;\n\t"
        "adds r6, r3, r4; \n\t"   /*res->uint32[1] = r3 + r4*/
        "mov %1, r6; \n\t"
        "umull r3, r4, %9, %12;\n\t"
        "adcs r6, r5, r3; \n\t"
        "mov %2, r6; \n\t"   /*res->uint32[2] = r6*/
        "umull r3, r5, %9, %13;\n\t"
        "adcs r6, r3, r4; \n\t"
        "mov %3, r6; \n\t"   /*res->uint32[3] = r6*/
        "umull r3, r4, %9, %14;\n\t"
        "adcs r6, r3, r5; \n\t"
        "mov %4, r6; \n\t"   /*res->uint32[4] = r6*/
        "umull r3, r5, %9, %15;\n\t"
        "adcs r6, r3, r4; \n\t"
        "mov %5, r6; \n\t"   /*res->uint32[5] = r6*/
        "umull r3, r4, %9, %16;\n\t"
        "adcs r6, r3, r5; \n\t"
        "mov %6, r6; \n\t"   /*res->uint32[6] = r6*/
        "umull r3, r5, %9, %17;\n\t"
        "adcs r6, r3, r4; \n\t"
        "mov %7, r6; \n\t"   /*res->uint32[7] = r6*/
        "adc r6, r5, #0 ; \n\t"
        "mov %8, r6; \n\t"   /*res->uint32[8] = r6*/
        : "=r"(res->uint32[8]), "=r"(res->uint32[7]), "=r"(res->uint32[6]),
          "=r"(res->uint32[5]), "=r"(res->uint32[4]), "=r"(res->uint32[3]),
          "=r"(res->uint32[2]), "=r"(res->uint32[1]), "=r"(res->uint32[0])
        : "r"(A), "r"(B->uint32[7]), "r"(B->uint32[6]), "r"(B->uint32[5]),
          "r"(B->uint32[4]), "r"(B->uint32[3]), "r"(B->uint32[2]),
          "r"(B->uint32[1]), "r"(B->uint32[0]), "r"(temp)
        : "r3", "r4", "r5", "r6", "cc", "memory");
}

EDIT-1: I updated the clobber list based on the first comment, but I still get the same error.

1 answer

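A simple solution is to break the asm statement up and not use clobbers. Declare C variables such as "tmp1" instead of naming registers. Try not to use any mov instructions; let the compiler insert them if it has to. The compiler uses its register allocation algorithm to work out the best "flow" of information, and it cannot reuse any register that you list as a clobber. As written, your single statement also forces every input word to be loaded from memory into its own register before the assembler code runs, which is what exhausts the registers. It is bad for performance too, because you want the memory accesses and the CPU ALU work to be pipelined.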

void multiply32x256(uint32_t A, UN_256fe* B, UN_288bite* res)
{
    uint32_t mulhi1, mullo1;
    uint32_t mulhi2, mullo2;
    uint32_t tmp;

    asm("umull %0, %1, %2, %3;\n\t"
        : "=r" (mullo1), "=r" (mulhi1)
        : "r"(A), "r"(B->uint32[7])
    );
    res->uint32[8] = mullo1; /* was 'mov %0, r3;' */

    /* "=&r" (earlyclobber): umull writes %0 and %1 before adds reads %5,
       so they must not share a register with an input. */
    asm volatile("umull %0, %1, %3, %4;\n\t"
        "adds %2, %0, %5; \n\t" /* low word of this product + high word of the previous one */
        : "=&r" (mullo2), "=&r" (mulhi2), "=r" (tmp)
        : "r"(A), "r"(B->uint32[6]), "r"(mulhi1)
        : "cc"
    );
    res->uint32[7] = tmp; /* was 'mov %1, r6;' */
    /* ... etc */
}

The whole purpose of the gcc inline assembler is not to write assembler directly in a C file. It is to use the compiler's register allocation logic AND do something that cannot easily be done in C; in your case, that is the carry logic.

When the code is not one huge asm clause, the compiler can schedule the loads from memory as it needs new registers. It can also pipeline your UMULL ALU activity with the load/store unit.
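One caveat for the split-up version: GCC does not promise that the condition flags survive between two separate asm statements, so a carry produced by adds in one statement may not be there for an adcs in the next. A minimal sketch of one way around this, assuming the carry is kept in an ordinary register by using umlal (a swapped-in instruction, not the adds/adcs chain from the question). The names mul_add_step and multiply32x256_steps are hypothetical, the word order (least significant word at the highest index) is taken from the operand lists in the question, and the #else branch is only a plain C fallback for non-ARM builds:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct UN_256fe   { uint32_t uint32[8]; } UN_256fe;
typedef struct UN_288bite { uint32_t uint32[9]; } UN_288bite;

/* Hypothetical helper: computes a * b + *carry as a 64-bit value,
   returns its low word and stores its high word back in *carry.
   The sum cannot overflow 64 bits. */
static inline uint32_t mul_add_step(uint32_t a, uint32_t b, uint32_t *carry)
{
    uint32_t lo = *carry;
    uint32_t hi = 0;
#if defined(__arm__)
    /* umlal: hi:lo += a * b. No clobbers: flags and memory are untouched,
       and the compiler picks every register. */
    __asm__("umlal %0, %1, %2, %3" : "+r"(lo), "+r"(hi) : "r"(a), "r"(b));
#else
    /* Portable fallback so the sketch also builds on other targets. */
    uint64_t t = (uint64_t)a * b + lo;
    lo = (uint32_t)t;
    hi = (uint32_t)(t >> 32);
#endif
    *carry = hi;
    return lo;
}

/* Assumption: uint32[7] of B and uint32[8] of res are the least
   significant words, as in the question's operand lists. */
void multiply32x256_steps(uint32_t A, const UN_256fe *B, UN_288bite *res)
{
    assert(B != NULL && res != NULL);
    uint32_t carry = 0;
    for (int i = 7; i >= 0; i--)
        res->uint32[i + 1] = mul_add_step(A, B->uint32[i], &carry);
    res->uint32[0] = carry;
}
```

Because each asm statement holds a single instruction and only four operands, the compiler is free to interleave the loads and stores with the multiplies, and there is nothing to list as a clobber.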

You should use a clobber only if an instruction implicitly modifies a specific register. You can also use something like

 register int *p1 asm ("r0"); 

and use it as an output. However, I don't know of any ARM instructions that do this, apart from those that alter the stack, which your code does not use, and, of course, the carry flag.

GCC knows that memory changes if it is listed as an input/output operand, so you don't need a memory clobber. In fact, it is detrimental: a memory clobber acts as a compiler memory barrier, so it forces memory to be written at that point when the compiler might otherwise have scheduled the write for later.


The moral is to use the gcc inline assembler to work with the compiler. If you code in assembler and have huge routines, register use can become complex and confusing. Typical assembler coders keep only one thing in a given register for a whole routine, but that is not always the best use of registers. Once the code gets larger, the compiler shuffles the data around in a fairly smart way that is difficult to beat (and not very satisfying to hand code, IMO).
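As a baseline for comparison, here is a minimal sketch that hands the whole job to the compiler by doing the carry propagation in a 64-bit accumulator in plain C. The function name multiply32x256_c is hypothetical, and the word order (least significant word at the highest index) is again an assumption taken from the question's operand lists. A compiler targeting ARM can typically map the 32x32 to 64-bit multiply-accumulate onto umull/umlal, but that is worth confirming in the generated assembly rather than taking on trust:

```c
#include <stdint.h>

typedef struct UN_256fe   { uint32_t uint32[8]; } UN_256fe;
typedef struct UN_288bite { uint32_t uint32[9]; } UN_288bite;

/* Plain C version: acc holds the running carry in its low 32 bits
   before each step, so acc + A * B[i] always fits in 64 bits. */
void multiply32x256_c(uint32_t A, const UN_256fe *B, UN_288bite *res)
{
    uint64_t acc = 0;
    for (int i = 7; i >= 0; i--) {
        acc += (uint64_t)A * B->uint32[i]; /* 32x32 -> 64-bit product plus carry */
        res->uint32[i + 1] = (uint32_t)acc; /* low word is the result word */
        acc >>= 32;                         /* high word becomes the next carry */
    }
    res->uint32[0] = (uint32_t)acc;         /* final carry is the top word */
}
```

If the output of this version is already good enough, the inline assembler is not needed at all; if it is not, it still serves as a reference to test a hand-written version against.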

You might want to look at the GMP library, which has many ways to efficiently tackle some of the same problems your code appears to have.


Source: https://habr.com/ru/post/1238048/