Why is this Java code 6 times faster than identical C# code?

I have several different solutions to Project Euler problem 5, but the time difference between the two languages/platforms in this particular implementation intrigues me. I did not do any optimization with compiler flags, just plain javac (via the command line) and csc (via Visual Studio).

Here is the Java code. It completes in 55 ms.

 public class Problem005b
 {
     public static void main(String[] args)
     {
         long begin = System.currentTimeMillis();
         int i = 20;
         while (true)
         {
             if (
                     (i % 19 == 0) &&
                     (i % 18 == 0) &&
                     (i % 17 == 0) &&
                     (i % 16 == 0) &&
                     (i % 15 == 0) &&
                     (i % 14 == 0) &&
                     (i % 13 == 0) &&
                     (i % 12 == 0) &&
                     (i % 11 == 0)
                 )
             {
                 break;
             }
             i += 20;
         }
         long end = System.currentTimeMillis();
         System.out.println(i);
         System.out.println((end - begin) + "ms");
     }
 }

Here is the identical C# code. It completes in 320 ms.

 using System;

 namespace ProjectEuler05
 {
     class Problem005
     {
         static void Main(String[] args)
         {
             DateTime begin = DateTime.Now;
             int i = 20;
             while (true)
             {
                 if (
                         (i % 19 == 0) &&
                         (i % 18 == 0) &&
                         (i % 17 == 0) &&
                         (i % 16 == 0) &&
                         (i % 15 == 0) &&
                         (i % 14 == 0) &&
                         (i % 13 == 0) &&
                         (i % 12 == 0) &&
                         (i % 11 == 0)
                     )
                 {
                     break;
                 }
                 i += 20;
             }
             DateTime end = DateTime.Now;
             TimeSpan elapsed = end - begin;
             Console.WriteLine(i);
             Console.WriteLine(elapsed.TotalMilliseconds + "ms");
         }
     }
 }
+42
java c# execution-time
May 10 '11 at 15:11
8 answers

The key to getting the two closer together is to ensure that the comparison is fair.

First of all, make sure that you are not including the costs of running a Debug build and loading the pdb symbols, as you were.

Next, you need to make sure that initialization costs are not being counted. Obviously these are real costs, and may matter to some people, but in this case we are interested in the loop itself.

Then you need to deal with platform-specific behaviour. If you are on a 64-bit Windows machine, you may be running in either 32-bit or 64-bit mode. In 64-bit mode the JIT is different in many respects, often changing the resulting code considerably. In particular, and I would guess pertinently here, you get access to twice as many general-purpose registers.

In this case, the inner part of the loop, when naively translated into machine code, would have to load the constants used in the tests into registers. If there are not enough registers for everything needed in the loop, it must push some of them out to memory. Even coming from the level 1 cache, this would be a significant hit compared with keeping them all in registers.

In VS 2010, MS changed the default target from anycpu to x86. I have nothing like the resources or customer-facing knowledge of MSFT, so I won't try to second-guess that decision. However, anyone doing the kind of performance analysis you are doing should certainly try both.

Once those disparities are ironed out, the numbers look far more reasonable. Any further differences would likely require more than educated guessing; they would instead need investigation of the actual differences in the generated machine code.
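
As an illustrative sketch (my own addition, not part of the original answer): you can have the benchmark print the conditions it actually ran under, so Debug-vs-Release and 32-bit-vs-64-bit mix-ups show up immediately. Environment.Is64BitProcess and Debugger.IsAttached are standard .NET APIs (the former needs .NET 4.0 or later); the class name here is made up.

 using System;
 using System.Diagnostics;

 static class BenchmarkEnvironmentCheck   // hypothetical helper, not the OP's code
 {
     static void Main()
     {
         // Report whether the process is 64-bit and whether a debugger is attached,
         // so the timings can be tied to the configuration that produced them.
         Console.WriteLine("64-bit process:    " + Environment.Is64BitProcess);
         Console.WriteLine("Debugger attached: " + Debugger.IsAttached);
     }
 }

Building with optimization on and debug info off (for example csc /optimize+ /debug- /platform:anycpu, assuming the classic command-line compiler) removes the Debug-build overhead described above, and the /platform switch is how you flip between x86 and anycpu.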

There are a few things that I think would be interesting for an optimizing compiler to attempt here.

  • The ones already mentioned:
    • The lcm option is interesting, but I can't see a compiler writer bothering with it.
    • The strength reduction of division into multiplication and masking.
      • I don't know enough about this myself, but others who have tried it note that the divider has improved significantly on more recent Intel chips.
      • Perhaps something complex could be arranged with SSE2.
      • Certainly the modulo-16 operation is ripe for conversion into a mask or shift (see the sketch after this list).
    • The compiler might notice that none of the tests have side effects.
      • It could speculatively try to evaluate several of them at once; on a superscalar processor this could push things along quite a bit faster, but it would depend heavily on how well the generated code interacts with the out-of-order execution engine.
    • If register pressure were tight, you could implement the constants as a single variable that is set at the start of each loop iteration and then incremented as you go around.

These are all guesses, and should be treated as such. If you want to find out, disassemble the generated code and take a look.
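
As a concrete illustration of the mask point above (my own sketch, not from the original answer): applied by hand to the loop in the question, the modulo-16 test becomes a bitwise AND. For non-negative i, (i & 15) == 0 holds exactly when i % 16 == 0, so the behaviour is unchanged; this is a drop-in replacement for the if inside the existing while loop.

 // Drop-in replacement for the test inside the original while (true) loop.
 // (i & 15) == 0 is equivalent to i % 16 == 0 for non-negative i,
 // and the AND avoids one integer division.
 if (((i & 15) == 0) &&   // was: (i % 16 == 0)
     (i % 19 == 0) &&
     (i % 18 == 0) &&
     (i % 17 == 0) &&
     (i % 15 == 0) &&
     (i % 14 == 0) &&
     (i % 13 == 0) &&
     (i % 12 == 0) &&
     (i % 11 == 0))
 {
     break;
 }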

+12
May 11 '11 at 20:55
  • You should use the Stopwatch class to time the code.
  • In addition, you should take the JIT, the runtime, etc. into account, so let the test run a sufficient number of times (for example, 10,000 or 100,000 times) and take an average. It is important to run the code several times, not just once. So put the work into a method and loop over that method from your main method to get your measurements (see the sketch after this list).
  • Remove all debugging information from the assemblies and let the code run on its own, as a Release build outside the debugger.
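
A minimal sketch of such a harness for the C# side (my own illustration; Problem005Benchmark, Solve and Runs are made-up names, not the OP's code): put the work in a method, make one warm-up call so JIT compilation is not measured, then time many runs with Stopwatch and report the average.

 using System;
 using System.Diagnostics;

 class Problem005Benchmark
 {
     const int Runs = 10000;

     static int Solve()
     {
         int i = 20;
         // Loop until i is divisible by every number from 11 to 19.
         while ((i % 19 != 0) || (i % 18 != 0) || (i % 17 != 0) ||
                (i % 16 != 0) || (i % 15 != 0) || (i % 14 != 0) ||
                (i % 13 != 0) || (i % 12 != 0) || (i % 11 != 0))
         {
             i += 20;
         }
         return i;
     }

     static void Main()
     {
         Solve(); // warm-up run so JIT compilation is not included in the timing

         Stopwatch sw = Stopwatch.StartNew();
         int result = 0;
         for (int run = 0; run < Runs; run++)
         {
             result = Solve();
         }
         sw.Stop();

         Console.WriteLine(result);
         Console.WriteLine((sw.Elapsed.TotalMilliseconds / Runs) + " ms per run (average)");
     }
 }

The same shape works on the Java side, with System.nanoTime() in place of Stopwatch.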
+39
May 10 '11 at 15:17

Several optimizations are possible here. Perhaps the Java JIT performs them and the CLR does not.

Optimization #1:

 (x % a == 0) && (x % b == 0) && ... && (x % z == 0) 

is equivalent to

 (x % lcm(a, b, ... z) == 0) 

So, in your example, the comparison chain can be replaced by

 if (i % 232792560 == 0) break; 

(but of course, if you have already calculated the LCM, there is not much point in running the program in the first place!)
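
If you want to check that constant, here is a small sketch of my own (in C#; Gcd and the class name are made-up helpers) that computes lcm(11..20) with Euclid's algorithm:

 using System;

 class LcmCheck
 {
     static long Gcd(long a, long b)
     {
         while (b != 0)
         {
             long t = a % b;
             a = b;
             b = t;
         }
         return a;
     }

     static void Main()
     {
         long lcm = 1;
         for (int n = 11; n <= 20; n++)
         {
             lcm = lcm / Gcd(lcm, n) * n;   // lcm(a, n) = a / gcd(a, n) * n
         }
         Console.WriteLine(lcm);            // prints 232792560
     }
 }

It prints 232792560, which is both the LCM used above and the answer to the problem.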

Optimization #2:

This is also equivalent to:

 if (i % (14549535 * 16) == 0) break; 

or

 if ((i % 16 == 0) && (i % 14549535 == 0)) break; 

The first test can be replaced by a mask and a comparison with zero:

 if (((i & 15) == 0) && (i % 14549535 == 0)) break; 

The second division can be replaced by multiplication by the modular inverse:

 final long LCM = 14549535;
 final long INV_LCM = 8384559098224769503L; // == 14549535**-1 mod 2**64
 final long MAX_QUOTIENT = Long.MAX_VALUE / LCM;
 // ...
 if (((i & 15) == 0) &&
     (0 <= (i >> 4) * INV_LCM) &&
     ((i >> 4) * INV_LCM < MAX_QUOTIENT))
 {
     break;
 }

It is somewhat unlikely that a JIT would use this, but it is not as far-fetched as you might think; some C compilers implement pointer subtraction this way.

+23
May 10 '11 at 16:32

DateTime is not accurate enough for benchmarking (as indicated here; read the notes); look at the Stopwatch class instead.

+10
May 10 '11 at 15:20

Perhaps because constructing DateTime objects is much more expensive than calling System.currentTimeMillis().

+2
May 10 '11 at 15:17

This is too short a task to time properly. You need to run it at least 1000 times and see what happens. It looks like you are running these from the command line, in which case you may be comparing the JIT compilers for both. Try putting each behind a button in a simple GUI and have that button loop over the work a few hundred times before reporting the elapsed time. Even ignoring JIT compilation, your numbers can be thrown off by the granularity of the OS scheduler.

Oh, and because of the JIT... only count the result of the SECOND button click. :)
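
To make the "second click" advice concrete, here is a small sketch of my own (C#; it assumes using System and System.Diagnostics, plus a Solve() method containing the loop from the question, like the harness sketched earlier): the first call pays the JIT-compilation cost and is discarded, the second call measures the compiled code.

 Solve();                               // warm-up: pays the JIT cost, result discarded
 Stopwatch sw = Stopwatch.StartNew();
 int result = Solve();                  // "second click": times only the compiled code
 sw.Stop();
 Console.WriteLine(result + " in " + sw.Elapsed.TotalMilliseconds + " ms");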

+2
May 10 '11 at 15:29

In Java I would use System.nanoTime(). Any test that takes less than 2 seconds should be run for longer. It is worth noting that Java is pretty good at optimizing away inefficient code, or code that does nothing. A more interesting test would be one where you optimized the code yourself.

You are trying to get a solution that you can determine without using a loop at all; that is, it is a problem better solved a different way.

You want the product of the prime factors needed to cover 11 to 20, which are 2, 2, 2, 2, 3, 3, 5, 7, 11, 13, 17, 19. Multiply them together and you have the answer: 16 · 9 · 5 · 7 · 11 · 13 · 17 · 19 = 232,792,560.

+1
May 10 '11 at 15:28

(Moved from OP)

Changing the target from x86 to anycpu reduced the average execution time to 84 ms per run, down from 282 ms. Perhaps I should split this off into a second question?

UPDATE:
Thanks to Femaref below, who pointed out some testing problems; indeed, after following his suggestions the times were lower, which indicates that VM setup time was significant in Java, but probably not in C#. In C#, it was the debugging symbols that were significant.

I updated my code to run each loop 10,000 times and to display only the average ms at the end. The only significant change I made was to the C# version, where I switched to the [StopWatch class][3] for better resolution. I stuck with milliseconds because that is good enough.

Results:
The changes to the testing do not explain why Java is (still) much faster than C#. C# performance was better, but this can be explained entirely by removing the debugging symbols. If you read the comments that [Mike Two][4] and I exchanged on this OP, you will see that I got ~280 ms on average over five runs of the C# code just by switching from Debug to Release.

Numbers:

  • 10,000 timed loops of the unmodified Java code gave me an average of 45 ms (down from 55 ms).
  • 10,000 timed loops of the C# code using the StopWatch class gave me an average of 282 ms (down from 320 ms).

All of this leaves the difference unexplained. In fact, the gap has worsened: Java has gone from ~5.8x faster to ~6.2x faster.

+1
Nov 02


