Mixing NEON Assembly with Non-Vector Operations

I think I found the answer to my question. VFP has an fmacs instruction (written vmla.f32 with s registers in the unified syntax) that performs a scalar multiply-accumulate on the NEON / VFP registers.
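A minimal sketch of that trick, assuming GCC or Clang targeting 32-bit ARM with a VFP unit. The function name scalar_mla_f32 is made up for illustration, and on any other target the code falls back to plain C so it still compiles:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: *acc += a * b on a single float. */
void scalar_mla_f32(float *acc, float a, float b)
{
    assert(acc != NULL);
    float r = *acc;
#if defined(__arm__) && defined(__ARM_FP) && (__ARM_FP & 4)
    /* "t" requests a single-precision VFP register (s0-s31), so this
       assembles to the scalar form of vmla.f32 (fmacs in the old syntax). */
    __asm__ ("vmla.f32 %0, %1, %2" : "+t"(r) : "t"(a), "t"(b));
#else
    /* Fallback for targets without 32-bit ARM VFP. */
    r += a * b;
#endif
    *acc = r;
}
```

Only s0-s31 exist, and they overlap d0-d15 (q0-q7), so a scalar operand has to live in the lower half of the register file.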


I am very new to NEON and ARM programming ...

I want to load an upper triangular matrix into the NEON registers and accumulate the outer product of a vector into it in single precision. The key idea is A += x' * x, where A is the upper triangular matrix. Some of the operations can be vectorized with the NEON instruction "vmla.f32" on quad (q) or double (d) registers. However, sometimes I have to operate on just one single-precision register at a time, i.e. not on 2 or 4 single-precision values at once. In the example below (which doesn't work) I'm interested in the line

    // A[8-14] += A[1]*x[1-7]
    "mla s16, s16, d0[1]\n\t"

I want to use the NEON registers to perform a scalar single-precision operation.

Code snippet:

    __asm__ volatile (
        // load x into registers
        "vldmia %0, {d0-d3}\n\t"
        // load A into registers
        "vldmia %1, {d4-d12}\n\t"
        "vldmia %1, {d13-d21}\n\t"
        // A[0-7] += x[0]*x[0-7]
        "vmla.f32 q2, q2, d0[0]\n\t"
        "vmla.f32 q3, q3, d0[0]\n\t"
        // A[8-14] += A[1]*x[1-7]
        "mla s16, s16, d0[1]\n\t"
        // output
        :
        // input
        : "r"(A), "r"(x)
        // registers
        : "q0", "q1", "q2", "q3", "q4", "q5", "q6", "q7", "q8", "q9", "q10"
    );
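For reference, a plain C sketch of what the snippet above appears to be computing. It assumes that A stores the upper triangle row by row (row i holds n - i floats), which is inferred from the A[0-7] and A[8-14] comments and would give n = 8 and 36 floats in total. The function name is made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference version: A += x' * x on the packed upper triangle.
   Assumed layout: row 0 is A[0..n-1], row 1 is A[n..2n-2], and so on. */
void outer_accumulate_upper(float *A, const float *x, size_t n)
{
    assert(A != NULL && x != NULL);
    size_t k = 0;                       /* running index into packed A */
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = i; j < n; ++j) {
            A[k++] += x[i] * x[j];      /* A(i,j) += x[i] * x[j], j >= i */
        }
    }
}
```

Under that layout, row i has n - i elements, so rows whose length is not a multiple of 4 leave a remainder that cannot fill a whole q register. That is where the single-register operation becomes necessary.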
1 answer

So, I think you are asking about multiplying a vector by a scalar?

I would use "vdup" to copy the scalar into all lanes of a NEON register and then multiply.
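A minimal sketch of that idea, written with NEON intrinsics instead of inline assembly. The helper name axpy_row_f32 is made up for illustration, and the code falls back to a scalar loop when NEON is not available:

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: row[j] += s * v[j] for j in [0, len). */
void axpy_row_f32(float *row, const float *v, float s, size_t len)
{
    assert(row != NULL && v != NULL);
    size_t j = 0;
#if defined(__ARM_NEON)
    float32x4_t vs = vdupq_n_f32(s);          /* vdup: copy s into all 4 lanes */
    for (; j + 4 <= len; j += 4) {
        float32x4_t acc = vld1q_f32(row + j); /* load 4 accumulators */
        float32x4_t vv  = vld1q_f32(v + j);   /* load 4 vector elements */
        acc = vmlaq_f32(acc, vv, vs);         /* acc += vv * vs, per lane */
        vst1q_f32(row + j, acc);
    }
#endif
    for (; j < len; ++j) {                    /* leftover elements, or all of them without NEON */
        row[j] += s * v[j];
    }
}
```

If the matrix is a row-packed upper triangle, row i could be updated with axpy_row_f32(A + offset_i, x + i, x[i], n - i), where offset_i is the start of row i in the packed array. The scalar loop at the end handles row lengths that are not a multiple of 4.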

If you can post a simple C version of what you are trying to do, I could try and help more ...


Source: https://habr.com/ru/post/888251/

