How to speed up planar to packed / striped graphics conversion in C++?

Question

I am trying to program an Arduino Due to drive an LED matrix with PWM. I need to prepare the data before drawing each row, but the innermost loop of this process is too slow, so the screen currently flickers. The loop should finish in under 500 µs. The Arduino Due has an 84 MHz ARM Cortex-M3 processor.

This is the concept of how I need to assemble the bits for output:

5-bit color data:

R1=12, G1=4, B1=7, R2=0, G2=2, B2=27

The next step is to create a 32-bit word of consecutive 1s for each color value. The number of 1s is given by the color value:

    r1 = 0b00000000000000000000111111111111
    g1 = 0b00000000000000000000000000001111
    b1 = 0b00000000000000000000000001111111
    r2 = 0b00000000000000000000000000000000
    g2 = 0b00000000000000000000000000000011
    b2 = 0b00000111111111111111111111111111
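As a minimal sketch of this step, the run of 1s can be built with one shift and one subtraction. The helper name `ones_mask` is hypothetical and not part of the original code, which gets the same result by shifting `0x7FFFFFFF` right by the inverted color value.

```cpp
#include <cstdint>

// Hypothetical helper: returns a 32-bit word whose lowest `value` bits are 1.
// `value` is assumed to be a 5-bit color in the range 0..31.
constexpr std::uint32_t ones_mask(std::uint32_t value) {
    return (std::uint32_t{1} << (value & 31u)) - 1u;
}

// The values from the example above.
static_assert(ones_mask(12) == 0x00000FFFu, "R1 = 12 gives twelve 1s");
static_assert(ones_mask(0)  == 0x00000000u, "R2 = 0 gives no 1s");
static_assert(ones_mask(27) == 0x07FFFFFFu, "B2 = 27 gives twenty-seven 1s");
```

Because the function is `constexpr`, the three example values are checked at compile time.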

The final step is to collect the nth bit of each color value of 10 pixels (30 color values in total) into one 32-bit integer:

    pack1 = 0b00 ... 111011
    pack2 = 0b00 ... 111011
    pack3 = 0b00 ... 111001
    pack4 = 0b00 ... 111001
    pack5 = 0b00 ... 101001
    ...
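For reference, the gathering step can be written as a small, unoptimized function. This is only a sketch to pin down the intended result; `gather_bit` is a hypothetical name, and the channel order (first channel in the most significant used bit) is assumed from the example above.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical reference helper: collects bit `n` of every channel into one word.
// channels[0] lands in the most significant used bit, channels[N-1] in bit 0,
// which matches the r1 g1 b1 r2 g2 b2 order of the example.
template <std::size_t N>
std::uint32_t gather_bit(const std::array<std::uint32_t, N>& channels, unsigned n) {
    static_assert(N <= 32, "at most 32 channels fit in one 32-bit word");
    std::uint32_t pack = 0;
    for (std::size_t c = 0; c < N; ++c) {
        pack = (pack << 1) | ((channels[c] >> n) & 1u);
    }
    return pack;
}
```

Calling `gather_bit` with the six example words and `n = 0` yields `0b111011`, and `n = 4` yields `0b101001`, matching `pack1` and `pack5`.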

This is the code:

    // In my case scanwidth is 64*2 (64 is the width of the LED matrix and two lines are scanned at once)
    for (i = 0; i < scanwidth/5; i++) {  // each run uses 5 upper and 5 lower pixels
        data = *lineptr++;  // each int in the line buffer contains 2*15-bit inverted color data (red = 31-red etc.)
        p1uR = 0x7FFFFFFF >> (data >> 26);  // pixel 1 of upper line red channel
        p1uG = 0x7FFFFFFF >> (data >> 21 & 0b11111);
        p1uB = 0x7FFFFFFF >> (data >> 16 & 0b11111);
        p1lR = 0x7FFFFFFF >> (data >> 10 & 0b11111);
        p1lG = 0x7FFFFFFF >> (data >> 5 & 0b11111);
        p1lB = 0x7FFFFFFF >> (data & 0b11111);
        data = *lineptr++;
        p2uR = 0x7FFFFFFF >> (data >> 26);
        p2uG = 0x7FFFFFFF >> (data >> 21 & 0b11111);
        p2uB = 0x7FFFFFFF >> (data >> 16 & 0b11111);
        p2lR = 0x7FFFFFFF >> (data >> 10 & 0b11111);
        p2lG = 0x7FFFFFFF >> (data >> 5 & 0b11111);
        p2lB = 0x7FFFFFFF >> (data & 0b11111);
        data = *lineptr++;
        p3uR = 0x7FFFFFFF >> (data >> 26);
        p3uG = 0x7FFFFFFF >> (data >> 21 & 0b11111);
        p3uB = 0x7FFFFFFF >> (data >> 16 & 0b11111);
        p3lR = 0x7FFFFFFF >> (data >> 10 & 0b11111);
        p3lG = 0x7FFFFFFF >> (data >> 5 & 0b11111);
        p3lB = 0x7FFFFFFF >> (data & 0b11111);
        data = *lineptr++;
        p4uR = 0x7FFFFFFF >> (data >> 26);
        p4uG = 0x7FFFFFFF >> (data >> 21 & 0b11111);
        p4uB = 0x7FFFFFFF >> (data >> 16 & 0b11111);
        p4lR = 0x7FFFFFFF >> (data >> 10 & 0b11111);
        p4lG = 0x7FFFFFFF >> (data >> 5 & 0b11111);
        p4lB = 0x7FFFFFFF >> (data & 0b11111);
        data = *lineptr++;
        p5uR = 0x7FFFFFFF >> (data >> 26);
        p5uG = 0x7FFFFFFF >> (data >> 21 & 0b11111);
        p5uB = 0x7FFFFFFF >> (data >> 16 & 0b11111);
        p5lR = 0x7FFFFFFF >> (data >> 10 & 0b11111);
        p5lG = 0x7FFFFFFF >> (data >> 5 & 0b11111);
        p5lB = 0x7FFFFFFF >> (data & 0b11111);

        index = i;
        for (j = 0; j < 31; j++) {  // loop over all 30 bits
            index += (scanwidth/5 + 1);
            scanbuff[index] =
                (p5uR>>j&1)<<29 | (p5uG>>j&1)<<28 | (p5uB>>j&1)<<27 |
                (p5lR>>j&1)<<26 | (p5lG>>j&1)<<25 | (p5lB>>j&1)<<24 |
                (p4uR>>j&1)<<23 | (p4uG>>j&1)<<22 | (p4uB>>j&1)<<21 |
                (p4lR>>j&1)<<20 | (p4lG>>j&1)<<19 | (p4lB>>j&1)<<18 |
                (p3uR>>j&1)<<17 | (p3uG>>j&1)<<16 | (p3uB>>j&1)<<15 |
                (p3lR>>j&1)<<14 | (p3lG>>j&1)<<13 | (p3lB>>j&1)<<12 |
                (p2uR>>j&1)<<11 | (p2uG>>j&1)<<10 | (p2uB>>j&1)<<9 |
                (p2lR>>j&1)<<8 | (p2lG>>j&1)<<7 | (p2lB>>j&1)<<6 |
                (p1uR>>j&1)<<5 | (p1uG>>j&1)<<4 | (p1uB>>j&1)<<3 |
                (p1lR>>j&1)<<2 | (p1lG>>j&1)<<1 | (p1lB>>j&1);
        }
    }

I don't think the outer loop needs to be improved. I tried unrolling the inner loop, but it did not noticeably help.

The Cortex-M3 can perform a shift and a logic operation in a single clock cycle. I estimate that the outer and inner loops together take about 51,000 clock cycles (600 µs).
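The arithmetic behind that estimate is easy to check. The sketch below only converts between cycles and microseconds at the stated 84 MHz clock; the constant and function names are hypothetical.

```cpp
// Clock frequency stated in the question (84 MHz).
constexpr double kClockHz = 84e6;

// Converts a cycle count to microseconds at kClockHz.
constexpr double cycles_to_us(double cycles) {
    return cycles / kClockHz * 1e6;
}

// Converts a time budget in microseconds to a cycle count at kClockHz.
constexpr double us_to_cycles(double us) {
    return us * kClockHz / 1e6;
}
```

At 84 cycles per microsecond, 51,000 cycles come to roughly 607 µs, while the 500 µs target leaves a budget of 42,000 cycles.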

Is there anything I can improve using standard C++ code? Are there any improvements that could be made with inline assembly?

+5

c++ arm inline-assembly graphics

uzumaki Jul 25 '17 at 18:53

1 answer

Ext3h · Answer 1 · 2017-07-25T19:48:33+0000

Time for some Cortex-M3 black magic.

    #include <cstddef>   // std::size_t
    #include <cstdint>
    #include <cstring>
    #include <memory>
    #include <utility>   // std::swap

    // Base of the SRAM bit-band region and base of its alias region.
    // Every bit of the packed region appears as one 32-bit word in the alias region.
    volatile char *const bitband_packed = (volatile char*)0x20000000;
    volatile uint32_t *const bitband_exploded = (volatile uint32_t*)0x22000000;

    // Transposes a 32x32 bit matrix in place.
    // Note: the first 128 bytes of SRAM serve as scratch space and are overwritten.
    static inline void transform_32_32(uint32_t buff[32]) {
        const std::size_t size = sizeof(buff[0])*32;
        volatile char *const tmp = bitband_packed;
        std::memcpy(const_cast<char*>(tmp), buff, size);
        for(std::size_t i = 0; i < 32; i++) {
            for(std::size_t j = i + 1; j < 32; j++) {
                // Swap bit j of word i with bit i of word j.
                std::swap(bitband_exploded[(32 * i + j)], bitband_exploded[(32 * j + i)]);
            }
        }
        std::memcpy(buff, const_cast<char*>(tmp), size);
    }

    void transform_pwm_32channel_5bit(const uint8_t input[32], uint32_t output[32]) {
        for(std::size_t i = 0; i < 32; i++) {
            output[i] = 0xffffffff >> input[i];
        }
        transform_32_32(output);
    }

The Cortex-M series has a nice feature called bit-banding. It allows a fairly efficient bitwise matrix transpose, which happens to be exactly what is needed for efficient bit-banging.
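To make the addressing in the code above easier to follow, here is a sketch of how a bit in the SRAM bit-band region maps to its alias word on Cortex-M3. The function and constant names are hypothetical; the two base addresses are the ones used in the answer's code.

```cpp
#include <cstdint>

constexpr std::uint32_t kSramBitBandBase  = 0x20000000u;  // start of the SRAM bit-band region
constexpr std::uint32_t kSramBitBandAlias = 0x22000000u;  // start of its alias region

// Address of the 32-bit alias word that maps to bit `bit` (0..7)
// of the byte at `byte_addr` inside the bit-band region.
constexpr std::uint32_t bitband_alias_address(std::uint32_t byte_addr, std::uint32_t bit) {
    return kSramBitBandAlias + (byte_addr - kSramBitBandBase) * 32u + bit * 4u;
}

// Bit 0 of the first byte maps to the first alias word.
static_assert(bitband_alias_address(0x20000000u, 0) == 0x22000000u, "first alias word");
// Each further bit is one 32-bit word (4 bytes) further on.
static_assert(bitband_alias_address(0x20000000u, 1) == 0x22000004u, "second alias word");
```

On a little-endian core this means bit `j` of the `i`-th 32-bit word of the region is alias word number `32 * i + j`, which is the index expression used in `transform_32_32`.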

The conversion should take about 3 cycles per bit (compiled with GCC 6.3 and -funroll-loops), so the whole thing should need only about 12 thousand cycles, or about 150 µs.

The only catch? This assumes that your particular Cortex-M3 actually supports the bit-band feature. I have not had the opportunity to test this on an Arduino.
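If bit-banding turns out not to be available, a portable alternative is the well-known recursive block-swap bit-matrix transpose (described, for example, in Hacker's Delight). This is a different technique from the one above, not part of the original answer, and I have not measured its cycle count on a Cortex-M3. The function name is hypothetical.

```cpp
#include <cstdint>

// Hypothetical replacement for transform_32_32 that needs no bit-band hardware.
// Transposes a 32x32 bit matrix in place: afterwards, bit c of a[r]
// holds what used to be bit r of a[c].
inline void transpose_32x32_portable(std::uint32_t a[32]) {
    std::uint32_t mask = 0x0000FFFFu;
    // Block sizes 16, 8, 4, 2, 1; the mask selects the low half of each block pair.
    for (unsigned j = 16; j != 0; j >>= 1, mask ^= (mask << j)) {
        // Visit every row k whose bit j is clear; its partner row is k + j.
        for (unsigned k = 0; k < 32; k = (k + j + 1) & ~j) {
            const std::uint32_t t = ((a[k] >> j) ^ a[k + j]) & mask;
            a[k]     ^= t << j;  // upper-right block receives the lower-left block
            a[k + j] ^= t;       // lower-left block receives the upper-right block
        }
    }
}
```

The routine makes five passes over 16 row pairs each and works on the buffer directly, so it needs no scratch area at a fixed SRAM address.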
