I'm not sure which would be the most efficient way, but this is a little shorter:

    #include <stdio.h>

    int main() {
        unsigned x = 0x1234;

        x = (x << 8) | x;                                /* 0x1234     -> 0x123634   */
        x = ((x & 0x00f000f0) << 4) | (x & 0x000f000f);  /* 0x123634   -> 0x01020304 */
        x = (x << 4) | x;                                /* 0x01020304 -> 0x11223344 */

        printf("0x1234 -> 0x%08x\n", x);
        return 0;
    }

If you need to do this repeatedly and very quickly, as suggested in your edit, you might consider creating a lookup table and using it. The following function dynamically allocates and initializes such a table:

    #include <stdlib.h>  /* for malloc() */

    unsigned *makeLookupTable(void) {
        unsigned *tbl = malloc(sizeof(unsigned) * 65536);
        if (!tbl) return NULL;

        int i;
        for (i = 0; i < 65536; i++) {
            unsigned x = i;
            x |= (x << 8);
            x = ((x & 0x00f000f0) << 4) | (x & 0x000f000f);
            x |= (x << 4);
            tbl[i] = x;
        }

        return tbl;
    }

After that, each conversion is something like:

    result = lookuptable[input];

... or maybe:

    result = lookuptable[input & 0xffff];
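
For example, here's a minimal sketch of how the big table might be used over a whole buffer; the expandBuffer name and its parameters are just placeholders, not part of the original test code:

    #include <stddef.h>

    /* Hypothetical helper: expand `count` 16-bit values into 32-bit values
       using a table built once beforehand with makeLookupTable(). */
    void expandBuffer(const unsigned short *src, unsigned *dst, size_t count,
                      const unsigned *tbl)
    {
        size_t i;
        for (i = 0; i < count; i++)
            dst[i] = tbl[src[i]];   /* one table lookup per value */
    }
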
Alternatively, a smaller, more cache-friendly lookup table (or pair of tables) could be used, with one lookup each for the high and low bytes (as noted in the comments by @LưuVĩnhPhúc). In that case, the table generation code might be:

    unsigned *makeLookupTableLow(void) {
        unsigned *tbl = malloc(sizeof(unsigned) * 256);
        if (!tbl) return NULL;

        int i;
        for (i = 0; i < 256; i++) {
            unsigned x = i;
            x = ((x & 0xf0) << 4) | (x & 0x0f);
            x |= (x << 4);
            tbl[i] = x;
        }

        return tbl;
    }

... and an optional second table:

    unsigned *makeLookupTableHigh(void) {
        unsigned *tbl = malloc(sizeof(unsigned) * 256);
        if (!tbl) return NULL;

        int i;
        for (i = 0; i < 256; i++) {
            unsigned x = i;
            x = ((x & 0xf0) << 20) | ((x & 0x0f) << 16);
            x |= (x << 4);
            tbl[i] = x;
        }

        return tbl;
    }

... and a value can then be converted using the two tables:

    result = hightable[input >> 8] | lowtable[input & 0xff];

... or using just one (only the low table above):

    result = (lowtable[input >> 8] << 16) | lowtable[input & 0xff];
    result ^= 0xff000000; /* to invert the high byte */
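
Putting the two-table variant together, a rough sketch (expandWithSmallTables and its buffer parameters are placeholder names, not part of the benchmarked code):

    #include <stddef.h>
    #include <stdlib.h>

    /* Hypothetical example: expand a buffer of 16-bit values using the two
       256-entry tables defined above. Returns 0 on success, -1 on failure. */
    int expandWithSmallTables(const unsigned short *src, unsigned *dst, size_t count)
    {
        unsigned *hightable = makeLookupTableHigh();
        unsigned *lowtable  = makeLookupTableLow();
        if (!hightable || !lowtable) {
            free(hightable);
            free(lowtable);
            return -1;
        }

        size_t i;
        for (i = 0; i < count; i++)
            dst[i] = hightable[src[i] >> 8] | lowtable[src[i] & 0xff];

        free(hightable);
        free(lowtable);
        return 0;
    }

In real use the tables would of course be built once up front and reused, rather than allocated on every call.
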
If the upper part of the value (alpha?) doesn't change much, even the single large table might perform well, since consecutive lookups would land closer together in the table.
I took @Apriori's performance test code, made some adjustments, and added tests for the other answers that he hadn't included originally... then compiled three versions of it with different settings: one 64-bit build with SSE4.1 enabled, where the compiler may use SSE for its optimizations, and two 32-bit builds, one with SSE and one without. Although all three were run on the same fairly recent processor, the results show how the optimal solution can vary depending on the processor's features:

                                64b SSE4.1  32b SSE4.1  32b no SSE
    --------------------------  ----------  ----------  ----------
    ExpandOrig time:               3.502 s     3.501 s     6.260 s
    ExpandLookupSmall time:        3.530 s     3.997 s     3.996 s
    ExpandLookupLarge time:        3.434 s     3.419 s     3.427 s
    ExpandIsalamon time:           3.654 s     3.673 s     8.870 s
    ExpandIsalamonOpt time:        3.784 s     3.720 s     8.719 s
    ExpandChronoKitsune time:      3.658 s     3.463 s     6.546 s
    ExpandEvgenyKluev time:        6.790 s     7.697 s    13.383 s
    ExpandIammilind time:          3.485 s     3.498 s     6.436 s
    ExpandDmitri time:             3.457 s     3.477 s     5.461 s
    ExpandNitish712 time:          3.574 s     3.800 s     6.789 s
    ExpandAdamLiss time:           3.673 s     5.680 s     6.969 s
    ExpandAShelly time:            3.524 s     4.295 s     5.867 s
    ExpandAShellyMulOp time:       3.527 s     4.295 s     5.852 s
    ExpandSSE4 time:               3.428 s
    ExpandSSE4Unroll time:         3.333 s
    ExpandSSE2 time:               3.392 s
    ExpandSSE2Unroll time:         3.318 s
    ExpandAShellySSE4 time:        3.392 s

The executables were compiled on 64-bit Linux with gcc 4.8.1, using -m64 -O3 -march=core2 -msse4.1, -m32 -O3 -march=core2 -msse4.1, and -m32 -O3 -march=core2 -mno-sse respectively. @Apriori's SSE tests were omitted for the 32-bit builds (they crashed on 32-bit with SSE enabled, and obviously won't work with SSE disabled).
Among the adjustments was the use of actual image data instead of random values (photos of objects with transparent backgrounds), which significantly improved the performance of the large lookup table but made little difference for the others.
Essentially, the lookup tables win by a landslide when SSE is unavailable (or unused)... and the manually coded SSE solutions win otherwise. It's also worth noting, though, that when the compiler could use SSE for its optimizations, most of the bit-manipulation solutions were almost as fast as the manually coded SSE: still slower, but only marginally.
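
For reference, here's a rough idea of what a manually coded SSE2 version of the same bit-shifting steps can look like. This is not @Apriori's test code, just an illustrative sketch; it assumes the four 16-bit inputs have already been zero-extended into the 32-bit lanes of the vector:

    #include <emmintrin.h>  /* SSE2 intrinsics */

    /* Expand four 16-bit values (one per 32-bit lane of v) to 0xAABBCCDD form,
       using the same three steps as the scalar version above. */
    static __m128i expand4_sse2(__m128i v)
    {
        const __m128i mask_hi = _mm_set1_epi32(0x00f000f0);
        const __m128i mask_lo = _mm_set1_epi32(0x000f000f);

        v = _mm_or_si128(_mm_slli_epi32(v, 8), v);
        v = _mm_or_si128(_mm_slli_epi32(_mm_and_si128(v, mask_hi), 4),
                         _mm_and_si128(v, mask_lo));
        v = _mm_or_si128(_mm_slli_epi32(v, 4), v);
        return v;
    }
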