I am trying to drive an LED matrix with PWM from an Arduino. I need to prepare the data before drawing each row, but the innermost loop of this step is too slow, so the screen currently flickers. The loop should finish in under 500 µs. The Arduino has an 84 MHz Cortex-M3 ARM processor.
This is the concept of how I need to assemble the bits for output:
5-bit color data:
R1=12, G1=4, B1=7, R2=0, G2=2, B2=27
The next step is to create a 32-bit word holding a run of consecutive 1s. The number of 1s is given by the color value:
r1 = 0b00000000000000000000111111111111
g1 = 0b00000000000000000000000000001111
b1 = 0b00000000000000000000000001111111
r2 = 0b00000000000000000000000000000000
g2 = 0b00000000000000000000000000000011
b2 = 0b00000111111111111111111111111111
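As a minimal sketch of this step, a run of 1s can be built directly from a color value, or from the inverted value (31 minus the color) by shifting a constant right. The function names `thermometer` and `thermometerFromInverted` are hypothetical and only serve as an illustration; both assume the input is in the range 0..31.

```cpp
#include <cstdint>

// Hypothetical helper: returns a word whose lowest `value` bits are 1.
// Assumes value is in 0..31 (a shift by 32 would be undefined).
constexpr uint32_t thermometer(uint32_t value) {
    return (UINT32_C(1) << value) - 1u;
}

// Same result starting from the inverted value (31 - value):
// shifting 31 ones right by `inverted` leaves (31 - inverted) ones.
constexpr uint32_t thermometerFromInverted(uint32_t inverted) {
    return UINT32_C(0x7FFFFFFF) >> inverted;
}

// Compile-time checks against the example values R1=12 and G1=4.
static_assert(thermometer(12) == 0x00000FFFu, "12 ones expected");
static_assert(thermometerFromInverted(31 - 4) == 0x0000000Fu, "4 ones expected");
```

The two forms agree for every value from 0 to 31, so storing the colors inverted saves the subtraction at run time.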
The final step is to collect every nth bit of 10 pixels (30 color values in total) into a 32-bit integer:
pack1 = 0b00 ... 111011
pack2 = 0b00 ... 111011
pack3 = 0b00 ... 111001
pack4 = 0b00 ... 111001
pack5 = 0b00 ... 101001
...
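To make the packing step concrete, here is a small reference sketch that gathers bit `j` of each channel word into one output word. The helper `packBit` and the array `kExample` are hypothetical names, and the loop form is meant for checking results rather than for speed.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: collects bit `j` of every mask into one word.
// masks[0] lands in the most significant of the `count` output bits,
// masks[count - 1] lands in bit 0.
inline uint32_t packBit(const uint32_t* masks, std::size_t count, unsigned j) {
    uint32_t out = 0;
    for (std::size_t k = 0; k < count; ++k) {
        out = (out << 1) | ((masks[k] >> j) & 1u);
    }
    return out;
}

// The six example channels as runs of 1s, in the order r1, g1, b1, r2, g2, b2
// (color values 12, 4, 7, 0, 2, 27).
static const uint32_t kExample[6] = {
    0x00000FFFu, 0x0000000Fu, 0x0000007Fu,
    0x00000000u, 0x00000003u, 0x07FFFFFFu
};
```

With these six inputs, `packBit(kExample, 6, j)` for j = 0 to 4 yields 111011, 111011, 111001, 111001 and 101001 in binary, which matches the low six bits of pack1 to pack5.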
This is the code:
// In my case scanwidth is 64*2 (64 is the width of the LED matrix and two lines are scanned at once)
for (i = 0; i < scanwidth/5; i++) { // each run uses 5 upper and 5 lower pixels
    data = *lineptr++; // each int in the line buffer contains 2*15-bit inverted color data (red = 31-red etc.)
    p1uR = 0x7FFFFFFF >> (data >> 26); // pixel 1 of upper line red channel
    p1uG = 0x7FFFFFFF >> (data >> 21 & 0b11111);
    p1uB = 0x7FFFFFFF >> (data >> 16 & 0b11111);
    p1lR = 0x7FFFFFFF >> (data >> 10 & 0b11111);
    p1lG = 0x7FFFFFFF >> (data >> 5 & 0b11111);
    p1lB = 0x7FFFFFFF >> (data & 0b11111);
    data = *lineptr++;
    p2uR = 0x7FFFFFFF >> (data >> 26);
    p2uG = 0x7FFFFFFF >> (data >> 21 & 0b11111);
    p2uB = 0x7FFFFFFF >> (data >> 16 & 0b11111);
    p2lR = 0x7FFFFFFF >> (data >> 10 & 0b11111);
    p2lG = 0x7FFFFFFF >> (data >> 5 & 0b11111);
    p2lB = 0x7FFFFFFF >> (data & 0b11111);
    data = *lineptr++;
    p3uR = 0x7FFFFFFF >> (data >> 26);
    p3uG = 0x7FFFFFFF >> (data >> 21 & 0b11111);
    p3uB = 0x7FFFFFFF >> (data >> 16 & 0b11111);
    p3lR = 0x7FFFFFFF >> (data >> 10 & 0b11111);
    p3lG = 0x7FFFFFFF >> (data >> 5 & 0b11111);
    p3lB = 0x7FFFFFFF >> (data & 0b11111);
    data = *lineptr++;
    p4uR = 0x7FFFFFFF >> (data >> 26);
    p4uG = 0x7FFFFFFF >> (data >> 21 & 0b11111);
    p4uB = 0x7FFFFFFF >> (data >> 16 & 0b11111);
    p4lR = 0x7FFFFFFF >> (data >> 10 & 0b11111);
    p4lG = 0x7FFFFFFF >> (data >> 5 & 0b11111);
    p4lB = 0x7FFFFFFF >> (data & 0b11111);
    data = *lineptr++;
    p5uR = 0x7FFFFFFF >> (data >> 26);
    p5uG = 0x7FFFFFFF >> (data >> 21 & 0b11111);
    p5uB = 0x7FFFFFFF >> (data >> 16 & 0b11111);
    p5lR = 0x7FFFFFFF >> (data >> 10 & 0b11111);
    p5lG = 0x7FFFFFFF >> (data >> 5 & 0b11111);
    p5lB = 0x7FFFFFFF >> (data & 0b11111);
    index = i;
    for (j = 0; j < 31; j++) { // loop over all 30 bits
        index += (scanwidth/5+1);
        scanbuff[index] = (p5uR>>j&1)<<29 | (p5uG>>j&1)<<28 | (p5uB>>j&1)<<27
                        | (p5lR>>j&1)<<26 | (p5lG>>j&1)<<25 | (p5lB>>j&1)<<24
                        | (p4uR>>j&1)<<23 | (p4uG>>j&1)<<22 | (p4uB>>j&1)<<21
                        | (p4lR>>j&1)<<20 | (p4lG>>j&1)<<19 | (p4lB>>j&1)<<18
                        | (p3uR>>j&1)<<17 | (p3uG>>j&1)<<16 | (p3uB>>j&1)<<15
                        | (p3lR>>j&1)<<14 | (p3lG>>j&1)<<13 | (p3lB>>j&1)<<12
                        | (p2uR>>j&1)<<11 | (p2uG>>j&1)<<10 | (p2uB>>j&1)<<9
                        | (p2lR>>j&1)<<8 | (p2lG>>j&1)<<7 | (p2lB>>j&1)<<6
                        | (p1uR>>j&1)<<5 | (p1uG>>j&1)<<4 | (p1uB>>j&1)<<3
                        | (p1lR>>j&1)<<2 | (p1lG>>j&1)<<1 | (p1lB>>j&1);
    }
}
I do not think the outer loop needs improving. I tried unrolling the inner loop, but that made no noticeable difference.
The Cortex-M3 can perform shifts and logic operations in a single clock cycle. I estimate that the outer and inner loops together take about 51,000 clock cycles (about 600 µs).
Is there anything I can improve with standard C++ code? Are there any improvements that could be made with inline assembly?
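The timing figures can be cross-checked with a little arithmetic. The sketch below reads the 51,000 figure as clock cycles and takes the 84 MHz clock, the 500 µs target, scanwidth = 64*2 and the 31 inner iterations from the question. All constant and function names are hypothetical, and the per-iteration budget ignores the unpacking work done in the outer loop.

```cpp
#include <cstdint>

constexpr uint32_t kCyclesPerUs     = 84;     // 84 MHz clock, from the question
constexpr uint32_t kBudgetUs        = 500;    // target time for the whole loop
constexpr uint32_t kEstimatedCycles = 51000;  // estimate from the question

// Convert between time and clock cycles (integer arithmetic, rounds down).
constexpr uint32_t cyclesForMicroseconds(uint32_t us) { return us * kCyclesPerUs; }
constexpr uint32_t microsecondsForCycles(uint32_t cycles) { return cycles / kCyclesPerUs; }

constexpr uint32_t kOuterRuns = (64 * 2) / 5;     // scanwidth/5 = 25 outer iterations
constexpr uint32_t kInnerRuns = kOuterRuns * 31;  // 775 inner iterations in total

// Cycles available per inner iteration if the whole budget went to the inner loop.
constexpr uint32_t kCyclesPerInnerRun = cyclesForMicroseconds(kBudgetUs) / kInnerRuns;

static_assert(cyclesForMicroseconds(kBudgetUs) == 42000, "500 us is 42,000 cycles");
static_assert(microsecondsForCycles(kEstimatedCycles) == 607, "51,000 cycles is about 607 us");
```

Under these assumptions the 500 µs target allows 42,000 cycles, or roughly 54 cycles per inner iteration, while the current estimate works out to about 65.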