Fast threshold and bit packing algorithm (possible improvements?)

Question

Fast threshold and bit packing algorithm (possible improvements?)

I am working on an algorithm that performs a global threshold value of an 8-bit grayscale image into a single-bit (bit packed, for example, 1 byte contains 8 pixels) monochrome image. Each pixel in a Grayscale image can have a brightness value from 0 to 255.

My environment is Win32 in Microsoft Visual Studio C ++.

I'm interested in optimizing the algorithm as much as possible out of curiosity, a 1-bit image will be converted to TIFF. I am currently setting FillOrder as MSB2LSB (the most significant bit for the least significant bit) just because the TIFF specification offers this (it doesn't have to be MSB2LSB)

Just some kind of background for those who don’t know:

The MSB2LSB arranges the pixels from left to right in the byte in the same way that the pixels orient in the image as the X coordinate increases. If you look at the Grayscale image from left to right along the X axis, this obviously requires you to think back when you pack the bit into your current byte. Having said that, let me show you what I have (this is in C, I did not try to use ASM or Compiler Intrinsics, but only because I have very little experience with it, but that would be possible).

Since the monochrome image will have 8 pixels per byte, the width of the monochrome image will be

  (grayscaleWidth + 7) / 8;

FYI, I assume that my largest image should be 6000 pixels wide:

The first thing I do (before processing any image) is

1) calculate the search table for the amounts that I need to translate into a specific byte, if you specify the X coordinate from my image in shades of gray:

int _shift_lut[6000]; for( int x = 0 ; x < 6000; x++) { _shift_lut[x] = 7-(x%8); }

Using this lookup table, I can pack the monochrome bit value into the current byte I'm working on, for example:

 monochrome_pixel |= 1 << _shift_lut[ grayX ];

which ends with a huge increase in speed than

 monochrome_pixel |= 1 << _shift_lut[ 7-(x%8)];

The second lookup table that I am calculating is a lookup table that tells me the X index in my monochrome pixel, taking into account the X pixel per grayscale pixels. This very simple LUT is calculated like this:

 int xOffsetLut[6000]; int element_size=8; //8 bits for( int x = 0; x < 6000; x++) { xOffsetLut[x]=x/element_size; }

This LUT allows me to do something like

 monochrome_image[ xOffsetLut[ GrayX ] ] = packed_byte; //packed byte contains 8 pixels

My grayscale image is a simple unsigned char *, as well as my monochrome image;

This is how I initialize a monochrome image:

 int bitPackedScanlineStride = (grayscaleWidth+7)/8; int bitpackedLength=bitPackedScanlineStride * grayscaleHeight; unsigned char * bitpack_image = new unsigned char[bitpackedLength]; memset(bitpack_image,0,bitpackedLength);

Then I call my binarization function as follows:

 binarize( gray_image.DataPtr(), bitpack_image, globalFormThreshold, grayscaleWidth, grayscaleHeight, bitPackedScanlineStride, bitpackedLength, _shift_lut, xOffsetLut);

And here is my Binarize function (as you can see, I did a few expanding cycles, which may or may not help).

 void binarize( unsigned char grayImage[], unsigned char bitPackImage[], int threshold, int grayscaleWidth, int grayscaleHeight, int bitPackedScanlineStride, int bitpackedLength, int shiftLUT[], int xOffsetLUT[] ) { int yoff; int byoff; unsigned char bitpackPel=0; unsigned char pel1=0; unsigned char pel2=0; unsigned char pel3=0; unsigned char pel4=0; unsigned char pel5=0; unsigned char pel6=0; unsigned char pel7=0; unsigned char pel8=0; int checkX=grayscaleWidth; int checkY=grayscaleHeight; for ( int by = 0 ; by < checkY; by++) { yoff=by*grayscaleWidth; byoff=by*bitPackedScanlineStride; for( int bx = 0; bx < checkX; bx+=32) { bitpackPel = 0; //pixel 1 in bitpack image pel1=grayImage[yoff+bx]; pel2=grayImage[yoff+bx+1]; pel3=grayImage[yoff+bx+2]; pel4=grayImage[yoff+bx+3]; pel5=grayImage[yoff+bx+4]; pel6=grayImage[yoff+bx+5]; pel7=grayImage[yoff+bx+6]; pel8=grayImage[yoff+bx+7]; bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx]); bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+1] ); bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+2] ); bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+3] ); bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+4] ); bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+5] ); bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+6] ); bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+7] ); bitPackImage[byoff+(xOffsetLUT[bx])] = bitpackPel; //pixel 2 in bitpack image pel1=grayImage[yoff+bx+8]; pel2=grayImage[yoff+bx+9]; pel3=grayImage[yoff+bx+10]; pel4=grayImage[yoff+bx+11]; pel5=grayImage[yoff+bx+12]; pel6=grayImage[yoff+bx+13]; pel7=grayImage[yoff+bx+14]; pel8=grayImage[yoff+bx+15]; bitpackPel = 0; bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx+8] ); bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+9] ); bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+10] ); bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+11] ); bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+12] ); bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+13] ); bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+14] ); bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+15] ); bitPackImage[byoff+(xOffsetLUT[bx+8])] = bitpackPel; //pixel 3 in bitpack image pel1=grayImage[yoff+bx+16]; pel2=grayImage[yoff+bx+17]; pel3=grayImage[yoff+bx+18]; pel4=grayImage[yoff+bx+19]; pel5=grayImage[yoff+bx+20]; pel6=grayImage[yoff+bx+21]; pel7=grayImage[yoff+bx+22]; pel8=grayImage[yoff+bx+23]; bitpackPel = 0; bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx+16] ); bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+17] ); bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+18] ); bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+19] ); bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+20] ); bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+21] ); bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+22] ); bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+23] ); bitPackImage[byoff+(xOffsetLUT[bx+16])] = bitpackPel; //pixel 4 in bitpack image pel1=grayImage[yoff+bx+24]; pel2=grayImage[yoff+bx+25]; pel3=grayImage[yoff+bx+26]; pel4=grayImage[yoff+bx+27]; pel5=grayImage[yoff+bx+28]; pel6=grayImage[yoff+bx+29]; pel7=grayImage[yoff+bx+30]; pel8=grayImage[yoff+bx+31]; bitpackPel = 0; bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx+24] ); bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+25] ); bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+26] ); bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+27] ); bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+28] ); bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+29] ); bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+30] ); bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+31] ); bitPackImage[byoff+(xOffsetLUT[bx+24])] = bitpackPel; } } }

I know that this algorithm could potentially skip some trailing pixels in each row, but don't worry about that.

As you can see for each monochrome byte, I process 8 gradations of gray.

Where you see pel8 <= threshold is a neat little trick that resolves to 0 or 1 and is much faster than if {} else {}

For each increment X, I pack the bits into bits of a higher order than the previous X

therefore, for the first set of 8 pixels in the image in grayscale

 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

Here's what the bits in a byte look like (obviously, each numbered bit is only a threshold result of processing the corresponding numbered pixel, but you get the idea)

 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

PHEW , which should be this. Feel free to have fun with some graceful tricks that squeeze more juice from this algorithm.

With compiler optimization turned on, this function takes an average of 16 milliseconds on an image of approximately 5000 x 2200 pixels in size on a dual-core kernel processor.

EDIT:

Sentence

R .. was to remove the LUT shift and just use constants that are actually perfectly logical ... I changed the OR'ing of each pixel as such:

 void binarize( unsigned char grayImage[], unsigned char bitPackImage[], int threshold, int grayscaleWidth, int grayscaleHeight, int bitPackedScanlineStride, int bitpackedLength, int shiftLUT[], int xOffsetLUT[] ) { int yoff; int byoff; unsigned char bitpackPel=0; unsigned char pel1=0; unsigned char pel2=0; unsigned char pel3=0; unsigned char pel4=0; unsigned char pel5=0; unsigned char pel6=0; unsigned char pel7=0; unsigned char pel8=0; int checkX=grayscaleWidth-32; int checkY=grayscaleHeight; for ( int by = 0 ; by < checkY; by++) { yoff=by*grayscaleWidth; byoff=by*bitPackedScanlineStride; for( int bx = 0; bx < checkX; bx+=32) { bitpackPel = 0; //pixel 1 in bitpack image pel1=grayImage[yoff+bx]; pel2=grayImage[yoff+bx+1]; pel3=grayImage[yoff+bx+2]; pel4=grayImage[yoff+bx+3]; pel5=grayImage[yoff+bx+4]; pel6=grayImage[yoff+bx+5]; pel7=grayImage[yoff+bx+6]; pel8=grayImage[yoff+bx+7]; /*bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx]); bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+1] ); bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+2] ); bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+3] ); bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+4] ); bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+5] ); bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+6] ); bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+7] );*/ bitpackPel |= ( (pel1<=threshold) << 7); bitpackPel |= ( (pel2<=threshold) << 6 ); bitpackPel |= ( (pel3<=threshold) << 5 ); bitpackPel |= ( (pel4<=threshold) << 4 ); bitpackPel |= ( (pel5<=threshold) << 3 ); bitpackPel |= ( (pel6<=threshold) << 2 ); bitpackPel |= ( (pel7<=threshold) << 1 ); bitpackPel |= ( (pel8<=threshold) ); bitPackImage[byoff+(xOffsetLUT[bx])] = bitpackPel; //pixel 2 in bitpack image pel1=grayImage[yoff+bx+8]; pel2=grayImage[yoff+bx+9]; pel3=grayImage[yoff+bx+10]; pel4=grayImage[yoff+bx+11]; pel5=grayImage[yoff+bx+12]; pel6=grayImage[yoff+bx+13]; pel7=grayImage[yoff+bx+14]; pel8=grayImage[yoff+bx+15]; bitpackPel = 0; /*bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx+8] ); bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+9] ); bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+10] ); bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+11] ); bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+12] ); bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+13] ); bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+14] ); bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+15] );*/ bitpackPel |= ( (pel1<=threshold) << 7); bitpackPel |= ( (pel2<=threshold) << 6 ); bitpackPel |= ( (pel3<=threshold) << 5 ); bitpackPel |= ( (pel4<=threshold) << 4 ); bitpackPel |= ( (pel5<=threshold) << 3 ); bitpackPel |= ( (pel6<=threshold) << 2 ); bitpackPel |= ( (pel7<=threshold) << 1 ); bitpackPel |= ( (pel8<=threshold) ); bitPackImage[byoff+(xOffsetLUT[bx+8])] = bitpackPel; //pixel 3 in bitpack image pel1=grayImage[yoff+bx+16]; pel2=grayImage[yoff+bx+17]; pel3=grayImage[yoff+bx+18]; pel4=grayImage[yoff+bx+19]; pel5=grayImage[yoff+bx+20]; pel6=grayImage[yoff+bx+21]; pel7=grayImage[yoff+bx+22]; pel8=grayImage[yoff+bx+23]; bitpackPel = 0; /*bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx+16] ); bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+17] ); bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+18] ); bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+19] ); bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+20] ); bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+21] ); bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+22] ); bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+23] );*/ bitpackPel |= ( (pel1<=threshold) << 7); bitpackPel |= ( (pel2<=threshold) << 6 ); bitpackPel |= ( (pel3<=threshold) << 5 ); bitpackPel |= ( (pel4<=threshold) << 4 ); bitpackPel |= ( (pel5<=threshold) << 3 ); bitpackPel |= ( (pel6<=threshold) << 2 ); bitpackPel |= ( (pel7<=threshold) << 1 ); bitpackPel |= ( (pel8<=threshold) ); bitPackImage[byoff+(xOffsetLUT[bx+16])] = bitpackPel; //pixel 4 in bitpack image pel1=grayImage[yoff+bx+24]; pel2=grayImage[yoff+bx+25]; pel3=grayImage[yoff+bx+26]; pel4=grayImage[yoff+bx+27]; pel5=grayImage[yoff+bx+28]; pel6=grayImage[yoff+bx+29]; pel7=grayImage[yoff+bx+30]; pel8=grayImage[yoff+bx+31]; bitpackPel = 0; /*bitpackPel |= ( (pel1<=threshold) << shiftLUT[bx+24] ); bitpackPel |= ( (pel2<=threshold) << shiftLUT[bx+25] ); bitpackPel |= ( (pel3<=threshold) << shiftLUT[bx+26] ); bitpackPel |= ( (pel4<=threshold) << shiftLUT[bx+27] ); bitpackPel |= ( (pel5<=threshold) << shiftLUT[bx+28] ); bitpackPel |= ( (pel6<=threshold) << shiftLUT[bx+29] ); bitpackPel |= ( (pel7<=threshold) << shiftLUT[bx+30] ); bitpackPel |= ( (pel8<=threshold) << shiftLUT[bx+31] );*/ bitpackPel |= ( (pel1<=threshold) << 7); bitpackPel |= ( (pel2<=threshold) << 6 ); bitpackPel |= ( (pel3<=threshold) << 5 ); bitpackPel |= ( (pel4<=threshold) << 4 ); bitpackPel |= ( (pel5<=threshold) << 3 ); bitpackPel |= ( (pel6<=threshold) << 2 ); bitpackPel |= ( (pel7<=threshold) << 1 ); bitpackPel |= ( (pel8<=threshold) ); bitPackImage[byoff+(xOffsetLUT[bx+24])] = bitpackPel; } } }

Now I am testing Intel Xeon 5670 using (GCC) 4.1.2. According to these specifications, a hard-coded bitshift is 4 ms slower than using my original LUT algorithm. In Xeon and GCC, the LUT algorithm takes an average of 8.61 ms, and hard-coded bit-bits take an average of 12.285 ms.

+1

optimization c image-processing

alessandro ferrucci Sep 14 '10 at 0:15

source share

2 answers

You can do this with SSE quite easily, processing 16 pixels at a time, for example.

load vector (16 x 8 bit unsigned)
add (255 - threshold) to each element
use PMOVMSKB to extract character bits into a 16-bit word
store 16-bit word

Sample code using the built-in SSE functions (warning: unverified!):

 void threshold_and_pack( const uint8_t * in_image, // input image, 16 byte aligned, height rows x width cols, width = multiple of 16 uint8_t * out_image, // output image, 2 byte aligned, height rows x width/8 cols, width = multiple of 2 const uint8_t threshold, // threshold const int width, const int height) { const __m128i vThreshold = _mm_set1_epi8(255 - threshold); int i, j; for (i = 0; i < height; ++i) { const __m128i * p_in = (__m128i *)&in_image[i * width]; uint16_t * p_out = (uint16_t *)&out_image[i * width / CHAR_BIT]; for (j = 0; j < width; j += 16) { __m128i v = _mm_load_si128(p_in); uint16_t b; v = _mm_add_epi8(v, vThreshold); b = _mm_movemask_epi8(v); // use PMOVMSKB to pack sign bits into 16 bit word *p_out = b; p_in++; p_out++; } } }

+1

Paul r Sep 14 '10 at 10:55

source share

R .. · Accepted Answer · 2010-09-14T01:49:08+0000

Try something like:

 unsigned i, w8=w>>3, x; for (i=0; i<w8; i++) { x = thres-src[0]>>1&0x80; x |= thres-src[1]>>2&0x40; x |= thres-src[2]>>3&0x20; x |= thres-src[3]>>4&0x10; x |= thres-src[4]>>5&0x08; x |= thres-src[5]>>6&0x04; x |= thres-src[6]>>7&0x02; x |= thres-src[7]>>8&0x01; out[i] = x; src += 8; }

You can find additional code for the remainder at the end of the line of width not a multiple of 8, or you can simply overlay / align the source so that it is a multiple of 8.

Fast threshold and bit packing algorithm (possible improvements?)

More articles: