What is the fastest way to swap values in C?

I want to swap two integers, and I want to know which of these two implementations will be faster. The obvious way, with a temporary variable:

    void swap(int* a, int* b)
    {
        int temp = *a;
        *a = *b;
        *b = temp;
    }

Or the XOR version that I'm sure most people have seen:

    void swap(int* a, int* b)
    {
        *a ^= *b;
        *b ^= *a;
        *a ^= *b;
    }

It appears that the first uses an extra register, but the second performs three loads and three stores while the first performs only two of each. Can someone tell me which is faster, and why? The why is more important.

+46
performance c
Aug 31 '08 at 15:12
21 answers

The XOR method fails if a and b point to the same address. The first XOR will clear all of the bits at the memory address pointed to by both variables, so once the function returns (*a == *b == 0), regardless of the initial value.
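To make the failure mode concrete, here is a minimal sketch; the guard shown is one common fix, and the function name swap_xor is ours:

    #include <stdio.h>

    void swap_xor(int *a, int *b)
    {
        if (a == b)        /* guard: without it, aliasing zeroes the value */
            return;
        *a ^= *b;
        *b ^= *a;
        *a ^= *b;
    }

    int main(void)
    {
        int x = 42;
        /* Without the guard, swap_xor(&x, &x) would leave x == 0:
           the first XOR computes x ^ x, which is always 0. */
        swap_xor(&x, &x);
        printf("%d\n", x);   /* prints 42 with the guard in place */
        return 0;
    }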

More information on the Wikipedia page: XOR swap algorithm

Although this problem is unlikely to come up, I always prefer to use the method that is guaranteed to work, not the clever method that fails at unexpected times.

+75
Aug 31 '08 at 16:17

Number 2 is often cited as the "clever" way to do it. It is in fact most likely slower, since it obscures the programmer's explicit intent of swapping two variables. This means the compiler can't optimize it to use the actual assembler swap operations. It also assumes the ability to do a bitwise XOR on the objects.

Stick to number 1; it's the most common and most understandable swap, and it can be easily templated/generalized.

This wikipedia section explains the problems fairly well: http://en.wikipedia.org/wiki/XOR_swap_algorithm#Reasons_for_avoidance_in_practice

+91
Aug 31 '08 at 15:19

On a modern processor, you could use the following when sorting large arrays and see no difference in speed:

    void swap(int *a, int *b)
    {
        for (int i = 1; i; i <<= 1)
        {
            if ((*a & i) != (*b & i))
            {
                *a ^= i;
                *b ^= i;
            }
        }
    }

In fact, the important part of your question is the "why?" part. Going back 20 years to the 8086 days, the above would have been a real performance killer, but on the latest Pentium it would be a match speed-wise for the two you posted.

The reason comes down to memory and has nothing to do with the processor.

CPU speed relative to memory speed has increased astronomically. Accessing memory has become the major bottleneck in application performance. All the swap algorithms will spend most of their time waiting for data to be fetched from memory. A modern system can have up to 5 levels of memory:

  • Cache Level 1 - runs at the same speed as the CPU, has negligible access time, but is small.
  • Cache Level 2 - runs a bit slower than L1, but is bigger and has more overhead to access (usually, data must be moved to L1 first).
  • Cache Level 3 - (not always present) often outside the CPU chip, slower and bigger than L2.
  • RAM - the main system memory; it usually implements a pipeline, so there is latency in read requests (the CPU requests the data, the message is sent to RAM, the RAM gets the data, the RAM sends the data to the CPU).
  • Hard disk - when there is not enough RAM, data is paged out to the HD, which is really slow and not really under CPU control as such.

Sorting algorithms make memory access worse, since they usually access memory in a very unordered fashion, thus incurring the inefficient overhead of fetching data from L2, RAM, or HD.

Thus, optimizing the swap method is really pointless - if it's called only a few times, any inefficiency is hidden by the small number of calls; if it's called a lot, any inefficiency is hidden by the number of cache misses (where the CPU has to get data from L2 (ones of cycles), L3 (tens of cycles), RAM (hundreds of cycles), HD (!)).

What you really need to do is look at the algorithm that calls the swap method. This is not a trivial exercise. Although Big-O notation is useful, an O(n) algorithm can be significantly faster than an O(log n) one for small n. (I'm sure there's a CodingHorror article about this.) Also, many algorithms have degenerate cases where the code does more than is necessary (using qsort on nearly ordered data can be slower than a bubble sort with an early-exit check). So, you need to analyse your algorithm and the data it's being used on.
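To illustrate that early-exit point, here is a minimal sketch (ours, not from the answer) of a bubble sort that stops as soon as a pass makes no swaps, which makes it effectively O(n) on already-sorted input:

    #include <stdbool.h>
    #include <stddef.h>

    /* Bubble sort with an early-exit check: one clean pass means sorted. */
    void bubble_sort(int *arr, size_t n)
    {
        for (size_t end = n; end > 1; end--)
        {
            bool swapped = false;
            for (size_t i = 1; i < end; i++)
            {
                if (arr[i - 1] > arr[i])
                {
                    int temp = arr[i - 1];   /* the plain temp swap */
                    arr[i - 1] = arr[i];
                    arr[i] = temp;
                    swapped = true;
                }
            }
            if (!swapped)   /* nearly ordered data exits after very few passes */
                break;
        }
    }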

Which leads on to analysing the code. Profilers are useful, but you need to know how to interpret the results. Never use a single run to gather results; always average the results over many runs - your test application could get paged out to disk by the OS halfway through. Always profile release, optimized builds; profiling debug code is pointless.
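In that spirit, a minimal timing-harness sketch (ours; it assumes a POSIX system for clock_gettime, and the workload is a stand-in for whatever you are measuring):

    #include <stdio.h>
    #include <time.h>

    #define RUNS 100

    static void workload(void)
    {
        volatile int a = 1, b = 2;          /* volatile keeps the compiler   */
        for (int i = 0; i < 1000000; i++)   /* from optimizing the loop away */
        {
            int temp = a;
            a = b;
            b = temp;
        }
    }

    int main(void)
    {
        double total = 0.0;
        for (int run = 0; run < RUNS; run++)
        {
            struct timespec start, end;
            clock_gettime(CLOCK_MONOTONIC, &start);
            workload();
            clock_gettime(CLOCK_MONOTONIC, &end);
            total += (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;
        }
        /* Averaging over many runs smooths out OS scheduling noise. */
        printf("average: %f s\n", total / RUNS);
        return 0;
    }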

As for the original question - which is faster? - it's like trying to work out whether a Ferrari is faster than a Lamborghini by looking at the size and shape of the wing mirror.

+39
Sep 05 '08 at 10:30

The first one is faster, because bitwise operations such as XOR are usually very difficult for the reader to visualize.

Faster to understand, which is the most important part ;)

+11
Aug 31 '08 at 15:39

@Harry: Go stand in the corner and think about what you've suggested. Come back when you've realized the error of your ways.

Never implement functions as macros, for the following reasons:

  • Type safety. There is none. The following only generates a warning when compiling but fails at run time:

      float a = 1.5f, b = 4.2f;
      swap(a, b);

    A templated function will always be of the correct type (and why aren't you treating warnings as errors?).

    EDIT: since there are no templates in C, you have to write a separate swap for each type or use hacky memory access.

  • It's a text substitution. The following fails at run time (this time with no compiler warnings):

      int a = 1, temp = 3;
      swap(a, temp);
  • It's not a function. So it can't be used as an argument to something like qsort.

  • Compilers are clever. I mean really clever. Made by really clever people. They can inline functions. Even at link time (which is even cleverer). Don't forget that inlining increases code size. Big code means more chance of a cache miss when fetching instructions, which means slower code.
  • Side effects. Macros have side effects! Consider:

      int& f1();
      int& f2();

      void func()
      {
          swap(f1(), f2());
      }

    Here, f1 and f2 will each be called twice.

    EDIT: a C version with nasty side effects:

      int a[10], b[10], i = 0, j = 0;
      swap(a[i++], b[j++]);

Macros: Just say no!
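For contrast, a minimal sketch (ours) of the alternative being advocated - a plain (optionally static inline) function, which has none of the pitfalls above:

    #include <stdio.h>

    /* Type-checked, usable as a value, evaluates each argument exactly once. */
    static inline void swap_int(int *a, int *b)
    {
        int temp = *a;
        *a = *b;
        *b = temp;
    }

    int main(void)
    {
        int a[10] = {1}, b[10] = {2};
        int i = 0, j = 0;

        swap_int(&a[i++], &b[j++]);                  /* i and j increment exactly once */
        printf("%d %d %d %d\n", a[0], b[0], i, j);   /* prints "2 1 1 1" */
        return 0;
    }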

EDIT: That's why I prefer to define macro names in UPPERCASE, so they stand out in the code as a warning to use with care.

EDIT2: To respond to Leahn Novash's comment:

Suppose we have a non-inlined function, f, that is converted by the compiler into a sequence of bytes. We can define the number of bytes like this:

 bytes = C(p) + C(f) 

where C() gives the number of bytes produced, C(f) is the bytes for the function proper, and C(p) is the bytes for the "housekeeping" code - the preamble and postamble the compiler adds to the function (creating and destroying the function's stack frame and so on). Now, calling the function f requires C(c) bytes. If the function is called n times, the total code size is:

 size = C(p) + C(f) + nC(c) 

Now inline the function. C(p), the function's "housekeeping", becomes zero, since the function can use the caller's stack frame. C(c) is also zero, since there is now no call opcode. But f is replicated wherever there was a call. So, the total code size is now:

 size = nC(f) 

Now, if C(f) is less than C(c), the total executable size will be reduced. But if C(f) is greater than C(c), the code size will increase. If C(f) and C(c) are similar, then you need to consider C(p) as well.

So, how many bytes do C(f) and C(c) produce? Well, the simplest C++ function would be a getter:

    int GetValue() { return m_value; }

which would probably generate the single four-byte instruction:

 mov eax,[ecx + offsetof (m_value)] 

The call opcode is five bytes, so there's an overall size saving. If the function is more complex - say, an array index ("return m_value[index];") or a calculation ("return m_value_a + m_value_b;") - the code will be bigger.
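Plugging those numbers into the formulas above (our arithmetic; take C(p) to be some positive constant):

    size_inlined = nC(f)               = 4n
    size_called  = C(p) + C(f) + nC(c) = C(p) + 4 + 5n

Since 4n is less than 5n + 4 + C(p) for every n >= 0, inlining this particular getter never increases code size - the call sequence is simply bigger than the function body.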

+9
Sep 05 '08 at 15:58

For those who stumble upon this question and decide to use the XOR method: you should consider inlining your function or using a macro to avoid the overhead of a function call:

    #define swap(a, b) \
        do { \
            int temp = a; \
            a = b; \
            b = temp; \
        } while (0)
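One caveat worth noting (a sketch of ours, echoing the macro discussion above): because this macro declares its own temp, a caller whose argument is also named temp gets silently wrong results:

    #include <stdio.h>

    #define swap(a, b) \
        do { \
            int temp = a; \
            a = b; \
            b = temp; \
        } while (0)

    int main(void)
    {
        int x = 1, temp = 3;
        swap(x, temp);              /* expands so the caller's 'temp' is shadowed */
        printf("%d %d\n", x, temp); /* prints "1 3" -- nothing was swapped */
        return 0;
    }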
+8
Sep 05 '08 at 11:13

You're optimizing the wrong thing; both of them should be so fast that you'd have to run them billions of times just to get any measurable difference.

And almost anything else will have a much greater effect on your performance - for instance, whether the values you swap are close in memory to the last value you touched, so that they're likely to be in the processor cache, or whether you have to go out to memory - and that is several orders of magnitude slower than any operation you perform inside the processor.

In any case, your bottleneck is far more likely to be an inefficient algorithm or an inappropriate data structure (or communication overhead) than how you swap numbers.

+6
Aug 31 '08 at 20:34

I never understood the hatred for macros. When used properly, they can make code more compact and readable. I believe most programmers know that macros should be used with care; what is important is making it clear that a particular call is a macro and not a function call (all caps). If SWAP(a++, b++); is a constant source of problems, perhaps programming is not for you.

Admittedly, the XOR trick is neat the first 5000 times you see it, but all it really does is save one temporary at the expense of reliability. Looking at the assembly generated above, it saves a register but creates dependencies. Also, I would not recommend xchg, since it has an implied lock prefix.

In the end, we all arrive at the same place after countless hours wasted on unproductive optimization and debugging caused by our cleverest code - keep it simple.

    #include <stddef.h>   /* size_t */

    #define SWAP(type, a, b) \
        do { type t = (a); (a) = (b); (b) = t; } while (0)

    void swap(size_t esize, void* a, void* b)
    {
        char* x = (char*) a;
        char* y = (char*) b;
        char* z = x + esize;

        for (; x < z; x++, y++)
            SWAP(char, *x, *y);
    }
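A hypothetical usage sketch (ours; it assumes the byte-wise swap() defined above is in scope). Because the function swaps raw bytes, it works for any object type, structs included:

    struct point { int x, y; };

    int main(void)
    {
        struct point p = {1, 2}, q = {3, 4};

        swap(sizeof p, &p, &q);   /* byte-wise: p is now {3, 4}, q is {1, 2} */
        return 0;
    }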
+6
Feb 21 '13 at 15:53

The only way to really know is to measure it, and the answer may even vary depending on which compiler and platform you're on. Modern compilers are really good at optimizing code these days, and you should never try to outsmart the compiler unless you can prove that your way really is faster.

With that said, you'd better have a damn good reason to choose #2 over #1. The code in #1 is far more readable and should therefore always be chosen first. Only switch to #2 if you can prove that you need to, and if you do, comment it to explain what's happening and why you did it the non-obvious way.

As an anecdote, I work with a couple of people who love to optimize prematurely, and it makes for really hideous, unmaintainable code. I'm also willing to bet that more often than not they shoot themselves in the foot, because they've hampered the compiler's ability to optimize the code by writing it in a convoluted way.

+4
Aug 31 '08 at 15:58

I wouldn't do it with pointers unless you have to. The compiler can't optimize them very well because of the possibility of pointer aliasing (although if you can GUARANTEE that the pointers point to non-overlapping locations, GCC at least has extensions to optimize this).
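For what it's worth, C99's restrict qualifier expresses exactly that guarantee portably; a minimal sketch (ours):

    /* 'restrict' promises the compiler that a and b never alias,
       so it may keep both values in registers across the stores. */
    void swap_noalias(int *restrict a, int *restrict b)
    {
        int temp = *a;
        *a = *b;
        *b = temp;
    }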

And I wouldn't do it with functions at all, since it's such a simple operation and the function-call overhead is significant.

The best way to do it is with a macro, if raw speed and the possibility of optimization is what you're after. In GCC you can use the typeof() built-in to make a flexible version that works on any built-in type.

Something like this:

    #define swap(a, b) \
        do { \
            typeof(a) temp; \
            temp = a; \
            a = b; \
            b = temp; \
        } while (0)

    ...

    {
        int a, b;
        swap(a, b);
        unsigned char x, y;
        swap(x, y);   /* works with any type */
    }

With other compilers, or if you require strict compliance with the C89/99 standard, you would have to make a separate macro for each type.

A good compiler will optimize this as aggressively as possible, given the context, if it's called with local/global variables as arguments.

+4
Oct 01 '08 at 1:44

None of the top-rated answers are actually definitive "facts"... they are people who are speculating!

You can definitively know which code executes fewer assembly instructions, because you can look at the assembly output the compiler generates and see which one runs in fewer instructions!

Here is the C code I compiled with the flags "gcc -std=c99 -S -O3 lookAtAsmOutput.c":

    #include <stdio.h>
    #include <stdlib.h>

    void swap_traditional(int * restrict a, int * restrict b)
    {
        int temp = *a;
        *a = *b;
        *b = temp;
    }

    void swap_xor(int * restrict a, int * restrict b)
    {
        *a ^= *b;
        *b ^= *a;
        *a ^= *b;
    }

    int main()
    {
        int a = 5;
        int b = 6;
        swap_traditional(&a, &b);
        swap_xor(&a, &b);
    }

ASM output for swap_traditional() takes >>> 11 <<< instructions (not including "leave", "ret", "size"):

    .globl swap_traditional
    .type swap_traditional, @function
    swap_traditional:
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %edx
        movl    12(%ebp), %ecx
        pushl   %ebx
        movl    (%edx), %ebx
        movl    (%ecx), %eax
        movl    %ebx, (%ecx)
        movl    %eax, (%edx)
        popl    %ebx
        popl    %ebp
        ret
    .size swap_traditional, .-swap_traditional
    .p2align 4,,15

ASM output for swap_xor() takes >>> 11 <<< instructions (not including "leave" and "ret"):

    .globl swap_xor
    .type swap_xor, @function
    swap_xor:
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %ecx
        movl    12(%ebp), %edx
        movl    (%ecx), %eax
        xorl    (%edx), %eax
        movl    %eax, (%ecx)
        xorl    (%edx), %eax
        xorl    %eax, (%ecx)
        movl    %eax, (%edx)
        popl    %ebp
        ret
    .size swap_xor, .-swap_xor
    .p2align 4,,15

Summary of the assembly:
swap_traditional() takes 11 instructions
swap_xor() takes 11 instructions

Conclusion:
Both methods use the same number of instructions to execute, and therefore are approximately the same speed on this hardware platform.

Lesson learned:
When you have small snippets of code, looking at the asm output is useful for quickly iterating on your code and arriving at the fastest code (i.e. the fewest instructions). It saves time because you don't have to run the program for every code change; you only need to run the final code change through a profiler to show that it really is faster.

I use this method for heavy DSP code that requires speed.

+4
Mar 05 '09 at 18:32

Answering your question as stated would require digging into the instruction timings of the particular CPU this code will run on, and would therefore require me to make a bunch of assumptions about the state of the caches in the system and the assembly code emitted by the compiler. It would be an interesting and useful exercise for understanding how your processor of choice actually works, but in the real world the difference will be negligible.

+3
Sep 02 '08 at 19:15

On modern CPU architectures, method 1 will be faster, and more readable than method 2 as well.

On modern CPU architectures, the XOR technique is considerably slower than using a temporary variable to do the swap. One reason is that modern CPUs strive to execute instructions in parallel via instruction pipelines. In the XOR technique, the inputs to each operation depend on the results of the previous operation, so they must be executed in strictly sequential order. If efficiency is of tremendous concern, it is advised to test the speeds of both the XOR technique and temporary-variable swapping on the target architecture. See here for more details.
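To visualize that dependency chain, here is an annotated sketch (ours) of both bodies:

    void swap_xor(int *a, int *b)
    {
        *a ^= *b;   /* depends on *a and *b              */
        *b ^= *a;   /* depends on the previous XOR       */
        *a ^= *b;   /* depends on the previous XOR       */
    }               /* strictly serial: no two steps overlap */

    void swap_temp(int *a, int *b)
    {
        int temp = *a;   /* load *a                                */
        *a = *b;         /* load *b: independent of the load above,
                            so the CPU can issue both in parallel  */
        *b = temp;
    }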




Edit: Method 2 is a way of swapping in place (i.e., without using extra variables). To round out the question, I will add another in-place swap, using +/-:

    void swap(int* a, int* b)
    {
        if (a != b)   /* important: guard against a and b aliasing the same address */
        {
            *a = *a + *b;   /* note: may overflow signed int, which is undefined behavior */
            *b = *a - *b;
            *a = *a - *b;
        }
    }
+2
Jan 15 '14 at 7:23

In my opinion, local optimizations like this should be regarded as tightly tied to the platform. It makes a huge difference whether you compile the code with a 16-bit microcontroller compiler or with GCC targeting x64.

If you have a specific target in mind, just try both of them and look at the generated asm code, or profile your application with both methods and see which is actually faster on your platform.

+1
Oct 10 '08 at 12:11

x = x + y - (y = x);

    float x;
    cout << "X: ";
    cin >> x;
    float y;
    cout << "Y: ";
    cin >> y;
    cout << "---------------------" << endl;
    cout << "X=" << x << ", Y=" << y << endl;
    x = x + y - (y = x);   // note: unsequenced read and write of y -- undefined behavior
    cout << "X=" << x << ", Y=" << y << endl;
+1
Aug 23 '17 at 15:54

If you can use some inline assembler and do the following (pseudo-assembler):

    PUSH A
    A = B
    POP B

you will save a lot of parameter passing and stack-fixing code, etc.

0
Aug 31 '08 at 16:34

I just dropped both swaps (as macros) into a handwritten quicksort I've been playing with. The XOR version was considerably faster (0.1 sec) than the one with the temporary variable (0.6 sec). However, the XOR version corrupted the data in the array (probably the same aliasing problem Ant mentioned).

I then tried a third version of swap, the easiest one to understand, and it took the same time as the single-temporary version:

    acopy = a;
    bcopy = b;
    a = bcopy;
    b = acopy;

[I put an if (a != b) guard around the XOR swap, and it then ran at the same speed as the temporary-variable version (0.6 sec)]

-1
Sep 04 '08 at 22:41

Of course, on 32-bit x86 there is the XCHG instruction, which swaps two operands in a single instruction...

Here is an example, in MSVC++:

    #include <stdio.h>

    #define exchange(a,b)   __asm mov eax, a \
                            __asm xchg eax, b \
                            __asm mov a, eax

    int main(int arg, char** argv)
    {
        int a = 1, b = 2;
        printf("%d %d --> ", a, b);
        exchange(a, b)
        printf("%d %d\r\n", a, b);
        return 0;
    }
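For what it's worth, a rough GCC-flavored sketch of the same idea (ours; note that register-register XCHG carries no implicit LOCK prefix, unlike the memory-operand form):

    #include <stdio.h>

    static inline void exchange(int *a, int *b)
    {
        /* "+r" loads both values into registers; XCHG between two
           registers swaps them in a single instruction, no LOCK. */
        __asm__("xchg %0, %1" : "+r"(*a), "+r"(*b));
    }

    int main(void)
    {
        int a = 1, b = 2;
        printf("%d %d --> ", a, b);
        exchange(&a, &b);
        printf("%d %d\n", a, b);
        return 0;
    }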
-1
22 '09 at 17:03

You can swap two numbers without using a third variable, like this:

    x = x ^ y;
    y = x ^ y;
    x = x ^ y;
-1
09 '15 at 4:26
    void swap(int* a, int* b)
    {
        *a = (*b - *a) + (*b = *a);   /* unsequenced write to *b -- undefined behavior */
    }

// a one-line C version, with no temporary variable :)

-2
18 '09 at 14:52

Here is the XOR swap as a macro:

    #define Swap(a, b) (a) ^= (b) ^= (a) ^= (b)

It avoids the temporary variable entirely.

Note:

This modifies its operands twice without an intervening sequence point, which is undefined behavior in C, and it fails when a and b refer to the same object.

-4
07 '09 at 17:57


