Efficient copying on ARM: two 16-bit fetches or one 32-bit fetch?

I am working on an embedded system using an ARM7TDMI processor.

In a time-critical ISR, I need to take a snapshot (a copy) of 24 16-bit values from hardware registers in SRAM. The values are consecutive and can be treated as an array.

The data bus (for SRAM and hardware registers) is 16 bits, and we work in ARM mode (8/32).

At the shop, we are debating the best method for copying the data: as 16-bit quantities or as 32-bit values.

My argument is that the ARM is in 32-bit mode, so it can fetch two 16-bit values with one instruction faster than with two instructions each fetching 16 bits. There are also half as many fetch instructions, which should roughly halve the time.

Does anyone have data to support either method? (My oscilloscopes are all tied up, so I cannot take measurements on the embedded system. I also cannot run a huge number of iterations, because the ISR fires every millisecond.) (Profiling is difficult because our JTAGjet probes do not provide a means for accurate profiling.)

Sample code, 16-bit copy:

    #define MAX_16_BIT_VALUES 24U

    uint16_t volatile * p_hardware;
    uint16_t data_from_hardware[MAX_16_BIT_VALUES];

    data_from_hardware[0] = p_hardware[0];
    data_from_hardware[1] = p_hardware[1];
    data_from_hardware[2] = p_hardware[2];
    data_from_hardware[3] = p_hardware[3];
    //...
    data_from_hardware[20] = p_hardware[20];
    data_from_hardware[21] = p_hardware[21];
    data_from_hardware[22] = p_hardware[22];
    data_from_hardware[23] = p_hardware[23];

Sample code, 32-bit copy:

    uint32_t * p_data_from_hardware = (uint32_t *)&data_from_hardware[0];
    uint32_t volatile * p_hardware_32_ptr = (uint32_t volatile *)p_hardware;

    p_data_from_hardware[0] = p_hardware_32_ptr[0];
    p_data_from_hardware[1] = p_hardware_32_ptr[1];
    p_data_from_hardware[2] = p_hardware_32_ptr[2];
    p_data_from_hardware[3] = p_hardware_32_ptr[3];
    //...
    p_data_from_hardware[ 8] = p_hardware_32_ptr[ 8];
    p_data_from_hardware[ 9] = p_hardware_32_ptr[ 9];
    p_data_from_hardware[10] = p_hardware_32_ptr[10];
    p_data_from_hardware[11] = p_hardware_32_ptr[11];
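One caveat with the 32-bit version (my observation, not from the original post): casting a `uint16_t` array to `uint32_t *` is only safe if the array happens to be 4-byte aligned, which the language does not guarantee for a `uint16_t` array, and the ARM7TDMI does not support unaligned 32-bit accesses. A minimal check, assuming nothing about the toolchain:

```c
#include <stdint.h>

/* Returns nonzero if p is 4-byte aligned, i.e. safe to access as a
   uint32_t on the ARM7TDMI (no unaligned 32-bit accesses). */
static int is_word_aligned(const void *p)
{
    return ((uintptr_t)p & 3u) == 0;
}
```

With IAR EW, the array's alignment can also be forced at compile time (e.g. with `#pragma data_alignment=4` before the definition), which avoids the runtime check entirely.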

Details: ARM7TDMI processor, working in 8/32-bit mode, IAR EW compiler.

Note: The code is unrolled (no loop) to prevent cache reloading. Note: The assembly-language listing shows that addressing memory with constant indices is more efficient than using an incremented pointer.
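For reference, the pointer-increment form that the note compares against would look something like this (a sketch; the post uses the unrolled constant-index form above precisely because it compiled to better code):

```c
#include <stdint.h>

#define MAX_16_BIT_VALUES 24U

/* Copy loop with incrementing pointers -- this forces the compiler to
   update two pointer registers each iteration, whereas the unrolled
   constant-index version compiles to LDRH/STRH with immediate offsets. */
static void copy_incrementing(uint16_t *dst, uint16_t volatile *src)
{
    for (unsigned int i = 0U; i < MAX_16_BIT_VALUES; i++) {
        *dst++ = *src++;
    }
}
```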

Edit 1: Testing

Per Chris Stratton's comment, 32-bit fetches will not work against our FPGA's 16-bit registers, so the 32-bit optimization is not possible.

However, I did profile using DMA. The performance gain from using the DMA controller was 30 µs. For our project, we were hoping for more significant time savings, so this optimization is not worth it. The experiment did show that DMA would be very useful if we had more data to transfer, or if the transfer could run in parallel.
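The DMA setup itself is hardware-specific, but as an illustration only, programming a memory-mapped DMA channel for a transfer like this typically looks something like the following. Every register name and bit assignment here is invented for the sketch; the project's actual controller will differ:

```c
#include <stdint.h>

/* Hypothetical DMA channel register block -- layout and bit fields are
   invented for illustration; consult the real controller's datasheet. */
typedef struct {
    volatile uint32_t src;    /* source address                      */
    volatile uint32_t dst;    /* destination address                 */
    volatile uint32_t count;  /* transfer count, in half-words       */
    volatile uint32_t ctrl;   /* bit 0: enable, bit 1: 16-bit width  */
} dma_channel_t;

/* Program and start a 16-bit-wide copy of `halfwords` half-words. */
static void dma_start_copy(dma_channel_t *ch,
                           uint32_t src, uint32_t dst, uint32_t halfwords)
{
    ch->src   = src;
    ch->dst   = dst;
    ch->count = halfwords;
    ch->ctrl  = (1u << 1) | (1u << 0);  /* 16-bit transfers, enable */
}
```

Even in this toy form it is easy to see where a real controller's "17 instructions of configuration" come from: several register writes plus the address and count computations that feed them.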

It is interesting to note that DMA requires 17 instructions for configuration.

1 answer

If speed is paramount, your best bet, if the hardware can support it, will be an assembly routine like:

    ; Assume R0 holds the source base and R1 the destination base.
    ; Note the ! writeback, so the second pair copies the next six words.
    PUSH  {R4-R7}
    LDMIA R0!,{R2-R7}
    STMIA R1!,{R2-R7}
    LDMIA R0!,{R2-R7}
    STMIA R1!,{R2-R7}
    POP   {R4-R7}

I believe that on an ARM7TDMI with a 32-bit bus, LDR takes three cycles and STR takes two; loading or storing n words with LDMIA/STMIA takes 3+n cycles. Thus, 12 LDRs and 12 STRs would take 60 cycles, while the above sequence takes 50 (including saving/restoring the registers). I would expect a 16-bit bus to add an extra cycle of penalty to each 32-bit load or store, but if the LDM*/STM* instructions split each 32-bit operation into two 16-bit ones, they should still come out well ahead of discrete loads and stores, especially if the code itself has to be fetched from 16-bit memory.
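The cycle arithmetic above can be checked with a quick sketch, using the answer's assumed timings (these are the answer's estimates, not measured values):

```c
/* Cycle model assumed in the answer (ARM7TDMI, 32-bit bus):
   LDR = 3 cycles, STR = 2 cycles, LDM/STM of n words = 3 + n cycles. */

/* One LDR plus one STR per word. */
static int cycles_discrete(int words)
{
    return words * 3 + words * 2;
}

/* Copy in LDMIA/STMIA bursts of regs_per_burst registers, plus the
   PUSH/POP {R4-R7} prologue/epilogue (4 words each, modeled as STM/LDM). */
static int cycles_block(int words, int regs_per_burst)
{
    int bursts   = words / regs_per_burst;      /* assumes an even split */
    int ldm_stm  = bursts * 2 * (3 + regs_per_burst);
    int push_pop = 2 * (3 + 4);
    return ldm_stm + push_pop;
}
```

For the 12 words in question, `cycles_discrete(12)` gives 60 and `cycles_block(12, 6)` gives 50, matching the figures in the answer.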


Source: https://habr.com/ru/post/1502675/

