I am working on an embedded system using an ARM7TDMI processor.
In a time-critical ISR, I need to take a snapshot (copy) of 24 16-bit values from hardware registers into SRAM. The values are consecutive and can be treated as an array.
The data bus (to both the SRAM and the hardware registers) is 16 bits wide, and we run in ARM mode (8/32).
In the shop, we are debating the best way to copy the data: as 16-bit quantities or as 32-bit quantities.
My argument is that the ARM is in 32-bit mode, so one 32-bit instruction that fetches two 16-bit values will be faster than two 16-bit instructions that each fetch one. In addition, there are half as many instructions to fetch, which should reduce the time by 1/2.
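One way to frame the argument above is to count instructions and bus transfers separately. The sketch below is a simplified model, not a measurement: it assumes a 32-bit access on a 16-bit bus is split into two 16-bit transfers, and it ignores wait states, instruction fetches, and pipeline effects. The names `copy_cost_t` and `copy_cost` are made up for illustration.

```c
#include <stdint.h>

/* Counts for copying a block of 16-bit values (hypothetical model). */
typedef struct {
    unsigned load_instructions;   /* loads executed (halfword or word) */
    unsigned store_instructions;  /* stores executed (halfword or word) */
    unsigned bus_transfers;       /* data transfers on the external bus */
} copy_cost_t;

/* halfwords:   number of 16-bit values to copy
 * access_bits: width of each load/store, 16 or 32
 * bus_bits:    width of the external data bus (16 in the question)
 * Assumption: an access wider than the bus is split into several transfers. */
static copy_cost_t copy_cost(unsigned halfwords, unsigned access_bits,
                             unsigned bus_bits)
{
    copy_cost_t cost;
    unsigned accesses = (halfwords * 16U) / access_bits;
    unsigned transfers_per_access = (access_bits + bus_bits - 1U) / bus_bits;

    cost.load_instructions = accesses;
    cost.store_instructions = accesses;
    /* Every value is read once and written once. */
    cost.bus_transfers = 2U * accesses * transfers_per_access;
    return cost;
}
```

Under this model, copying 24 values takes 24 loads and 24 stores as 16-bit accesses, or 12 and 12 as 32-bit accesses, while the number of data transfers on the bus is 48 either way. If the model holds, any saving would come from executing fewer instructions rather than from fewer bus cycles.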
Does anyone have data to support either method? (My oscilloscopes are all in use, so I cannot take measurements on the embedded system. I also cannot run the code a large number of times, because the ISR fires every millisecond.) (Profiling is difficult because our JTAG Jet probes do not provide a means of accurate profiling.)
Sample code, 16-bit copy:

    #include <stdint.h>

    #define MAX_16_BIT_VALUES 24U

    uint16_t volatile * p_hardware;   /* points at the first hardware register */
    uint16_t data_from_hardware[MAX_16_BIT_VALUES];

    data_from_hardware[0] = p_hardware[0];
    data_from_hardware[1] = p_hardware[1];
    data_from_hardware[2] = p_hardware[2];
    data_from_hardware[3] = p_hardware[3];
    /* ... continues for the remaining values */
Sample code, 32-bit copy:

    /* Both buffers must be 4-byte aligned for the 32-bit accesses. */
    uint32_t * p_data_from_hardware = (uint32_t *)&data_from_hardware[0];
    uint32_t volatile * p_hardware_32_ptr = (uint32_t volatile *)p_hardware;

    p_data_from_hardware[0] = p_hardware_32_ptr[0];
    p_data_from_hardware[1] = p_hardware_32_ptr[1];
    p_data_from_hardware[2] = p_hardware_32_ptr[2];
    p_data_from_hardware[3] = p_hardware_32_ptr[3];
    /* ... continues for the remaining values */
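Before timing anything, it may help to confirm on a host machine that the two copies produce the same snapshot. The sketch below is only that check, under some assumptions: `snapshot_t`, `copy_as_16_bit`, `copy_as_32_bit`, and `copies_match` are hypothetical names, a plain array stands in for the hardware registers, and loops replace the unrolled statements for brevity. A union is used so that both views are 4-byte aligned and the 32-bit access is well-defined C.

```c
#include <stdint.h>

#define MAX_16_BIT_VALUES 24U

/* Two views of the same 48 bytes: 24 halfwords or 12 words. */
typedef union {
    uint16_t half[MAX_16_BIT_VALUES];
    uint32_t word[MAX_16_BIT_VALUES / 2U];
} snapshot_t;

/* Copy one 16-bit value per access. */
static void copy_as_16_bit(snapshot_t *dst, const volatile snapshot_t *src)
{
    unsigned i;
    for (i = 0U; i < MAX_16_BIT_VALUES; ++i) {
        dst->half[i] = src->half[i];
    }
}

/* Copy two 16-bit values per access. */
static void copy_as_32_bit(snapshot_t *dst, const volatile snapshot_t *src)
{
    unsigned i;
    for (i = 0U; i < MAX_16_BIT_VALUES / 2U; ++i) {
        dst->word[i] = src->word[i];
    }
}

/* Returns 1 when both methods yield identical snapshots, 0 otherwise. */
static int copies_match(void)
{
    static volatile snapshot_t fake_hardware; /* stand-in for the registers */
    snapshot_t by_half;
    snapshot_t by_word;
    unsigned i;

    for (i = 0U; i < MAX_16_BIT_VALUES; ++i) {
        fake_hardware.half[i] = (uint16_t)(0x1000U + i);
    }
    copy_as_16_bit(&by_half, &fake_hardware);
    copy_as_32_bit(&by_word, &fake_hardware);

    for (i = 0U; i < MAX_16_BIT_VALUES; ++i) {
        if (by_half.half[i] != by_word.half[i]) {
            return 0;
        }
    }
    return 1;
}
```

This says nothing about the target: on real hardware the peripheral also has to tolerate 32-bit accesses, which a host-side array cannot show.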
Details: ARM7TDMI processor, working in 8/32-bit mode, IAR EW compiler.
Note: The code is unrolled to prevent instruction cache reloads.
Note: The assembly language listing shows that memory access using constant indexes is more efficient than using an incrementing pointer.
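For readers unsure what the second note compares, the two addressing styles might look like the sketch below. The function names are hypothetical and only four values are copied to keep it short. Which form generates better code depends on the compiler and its options; the note above reports only what this particular listing showed.

```c
#include <stdint.h>

/* Constant indexes: each access is a fixed offset from an unchanged base. */
static void copy_constant_index(uint16_t *dst, const volatile uint16_t *src)
{
    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
    dst[3] = src[3];
}

/* Incrementing pointers: both pointers advance after every access. */
static void copy_incrementing_pointer(uint16_t *dst,
                                      const volatile uint16_t *src)
{
    *dst++ = *src++;
    *dst++ = *src++;
    *dst++ = *src++;
    *dst++ = *src++;
}
```

Both functions copy the same four values, so comparing their assembly listings isolates the addressing style.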
Edit 1: Testing
Following Chris Stratton's comment, we found that we have problems making 32-bit fetches from our 16-bit FPGAs, so the 32-bit optimization is not possible.
However, I did profile using DMA. The performance gain from using the DMA controller was 30 microseconds. For our project, we were hoping for a more significant time saving, so this optimization is not worth it. The experiment did show that DMA would be very useful if we had more data to transfer, or if the transfer could be done in parallel.
It is interesting to note that DMA requires 17 instructions for configuration.
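The trade-off between a fixed setup cost and a per-item saving can be put as a break-even count. The sketch below is a rough model with a hypothetical function name, `dma_break_even_items`. The cycle figures passed to it would have to come from your own measurements, and the model ignores any overlap between DMA and CPU work.

```c
#include <stdint.h>

/* Smallest item count n for which
 *     setup_cycles + n * dma_cycles_per_item < n * cpu_cycles_per_item
 * or 0 if DMA never wins under this model. All inputs are assumptions. */
static unsigned dma_break_even_items(unsigned setup_cycles,
                                     unsigned cpu_cycles_per_item,
                                     unsigned dma_cycles_per_item)
{
    unsigned saving_per_item;

    if (dma_cycles_per_item >= cpu_cycles_per_item) {
        return 0U; /* no per-item saving, so the setup cost is never repaid */
    }
    saving_per_item = cpu_cycles_per_item - dma_cycles_per_item;
    /* n must be strictly greater than setup_cycles / saving_per_item. */
    return setup_cycles / saving_per_item + 1U;
}
```

With made-up figures of 34 setup cycles, 6 CPU cycles per item, and 4 DMA cycles per item, the function returns 18, so a transfer shorter than that would not repay the setup.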