Can Java recognize the SIMD advantages of a CPU, or is there just the optimizing effect of loop unrolling?

This piece of code is from the dot-product method of my vector class. The method computes the inner products against a target array of 1000 vectors.

When the vector length is an odd number (262145), the calculation takes 4.37 seconds. When the vector length N is 262144 (a multiple of 8), the calculation takes 1.93 seconds.

  time1 = System.nanoTime();
  int count = 0;
  for (int j = 0; j < 1000; i++) {
      b = vektors[i]; // selects next vector (b) to multiply as inner product;
                      // each vector has an array of float elements
      if (((N / 2) * 2) != N) {
          for (int i = 0; i < N; i++) {
              t1 += elements[i] * b.elements[i];
          }
      } else if (((N / 8) * 8) == N) {
          float[] vek = new float[8];
          for (int i = 0; i < (N / 8); i++) {
              vek[0] = elements[i]     * b.elements[i];
              vek[1] = elements[i + 1] * b.elements[i + 1];
              vek[2] = elements[i + 2] * b.elements[i + 2];
              vek[3] = elements[i + 3] * b.elements[i + 3];
              vek[4] = elements[i + 4] * b.elements[i + 4];
              vek[5] = elements[i + 5] * b.elements[i + 5];
              vek[6] = elements[i + 6] * b.elements[i + 6];
              vek[7] = elements[i + 7] * b.elements[i + 7];
              t1 += vek[0] + vek[1] + vek[2] + vek[3]
                  + vek[4] + vek[5] + vek[6] + vek[7];
              // t1 is the total sum of all dot products
          }
      }
  }
  time2 = System.nanoTime();
  time3 = (time2 - time1) / 1000000000.0; // seconds

Question: Can the reduction from 4.37 s to 1.93 s (about 2x faster) be a JIT decision to use SIMD instructions, or is it just the positive effect of my loop unrolling?

If the JIT cannot apply SIMD optimization automatically, is it also true that the JIT does not apply loop unrolling automatically in this example?

For 1M iterations (vectors) and a vector size of 64, the speedup reaches 3.5x. Is that a cache advantage?

Thanks.

2 answers

There are a lot of problems with your code. Are you sure you are measuring what you think you are measuring?

Your first loop does this, albeit a bit more conditionally:

  for (int j = 0; j < 1000; i++) {
      b = vektors[i]; // selects next vector (b) to multiply as inner product;
                      // each vector has an array of float elements
  }

Your rolled loop contains one really long chain of dependent loads and stores. Your unrolled loop contains 8 separate chains of dependent loads and stores. The JVM cannot turn one into the other when you use floating-point arithmetic, because the float additions happen in a different order, so they are fundamentally different computations. Breaking dependent load-store chains can lead to major speedups on modern processors.
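
To illustrate, here is a minimal sketch (hypothetical method names, assuming two equal-length float arrays): in the rolled form every addition waits on the previous sum, while separate accumulators give the CPU independent chains to run in parallel. Note the unrolled form adds the floats in a different order, which is exactly why the JVM cannot make this transformation for you.

  // Rolled: one long dependency chain; each += waits on the previous sum.
  static float dotRolled(float[] a, float[] b) {
      float sum = 0f;
      for (int i = 0; i < a.length; i++) {
          sum += a[i] * b[i];
      }
      return sum;
  }

  // Unrolled with independent accumulators: four chains proceed in parallel.
  static float dotUnrolled(float[] a, float[] b) {
      float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f;
      int i = 0;
      for (; i + 3 < a.length; i += 4) {
          s0 += a[i]     * b[i];
          s1 += a[i + 1] * b[i + 1];
          s2 += a[i + 2] * b[i + 2];
          s3 += a[i + 3] * b[i + 3];
      }
      for (; i < a.length; i++) { // remainder elements
          s0 += a[i] * b[i];
      }
      return s0 + s1 + s2 + s3;   // different summation order than dotRolled
  }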

Your rolled loop iterates over the whole vector. Your unrolled loop iterates only over (roughly) the first eighth, since the index advances by 1 rather than 8. Thus the unrolled loop again computes something fundamentally different.
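
For reference, a sketch of what the unrolled loop would have to look like to cover the whole vector (using the fields from your question; the index must advance by 8 per iteration, not by 1):

  for (int i = 0; i + 7 < N; i += 8) { // step by 8 so every element is visited
      t1 += elements[i]     * b.elements[i]
          + elements[i + 1] * b.elements[i + 1]
          + elements[i + 2] * b.elements[i + 2]
          + elements[i + 3] * b.elements[i + 3]
          + elements[i + 4] * b.elements[i + 4]
          + elements[i + 5] * b.elements[i + 5]
          + elements[i + 6] * b.elements[i + 6]
          + elements[i + 7] * b.elements[i + 7];
  }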

I have not seen a JVM generate vectorized code for something like your second loop, but I may be a few years out of date on what JVMs do. Try using -XX:+PrintAssembly when you run your code and inspect the code that opto generates.
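
For example (class name hypothetical; -XX:+PrintAssembly is a diagnostic option, so it has to be unlocked first, and it needs the hsdis disassembler library installed for your platform):

  java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly DotBenchmark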


I have worked a bit on this (and I'm drawing on knowledge from a similar project I did in C with matrix multiplication), but take my answer with a grain of salt, as I am by no means an expert on this topic.

As for your first question, I think the speedup comes from the loop unrolling: you do about 87% fewer conditional checks of the for-loop condition (an 8-way unroll needs N/8 checks instead of N, a reduction of 7/8 = 87.5%). As far as I know, the JVM has supported SSE since version 1.4, but to actually control whether your code uses vectorization (and to know for sure), you need to use JNI.

See the JNI example here: Do any JVM JIT compilers generate code that uses vectorized floating point instructions?

When you reduce the size of your vector to 64 from 262144, cache is definitely a factor. When I did this project in C, we had to implement cache blocking for larger matrices in order to take advantage of the cache. One thing you might want to do is check your cache size.
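
For illustration, a rough sketch of cache blocking in Java rather than the original C (the block size here is a guess and should be tuned so that three tiles fit in your cache):

  // Blocked (tiled) multiply: C += A * B, all matrices n x n.
  static final int BLOCK = 64; // hypothetical; tune to your cache size

  static void multiplyBlocked(float[][] a, float[][] b, float[][] c, int n) {
      for (int ii = 0; ii < n; ii += BLOCK)
          for (int kk = 0; kk < n; kk += BLOCK)
              for (int jj = 0; jj < n; jj += BLOCK)
                  // work on one tile at a time so it stays cache-resident
                  for (int i = ii; i < Math.min(ii + BLOCK, n); i++)
                      for (int k = kk; k < Math.min(kk + BLOCK, n); k++) {
                          float aik = a[i][k];
                          for (int j = jj; j < Math.min(jj + BLOCK, n); j++)
                              c[i][j] += aik * b[k][j];
                      }
  }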

As a side note: it is better to measure performance in FLOPS rather than seconds, because the execution time (in seconds) of your program can vary with many different factors, such as the CPU load at the time.
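
A sketch of the conversion for the setup in your question (a length-N inner product does N multiplies and N adds, repeated for 1000 vectors):

  int N = 262144;               // vector length from the question
  long t1 = System.nanoTime();
  // ... run the 1000 inner products here ...
  long t2 = System.nanoTime();
  double seconds = (t2 - t1) / 1e9;
  double flop = 2.0 * N * 1000; // one multiply + one add per element, per vector
  System.out.printf("%.2f MFLOPS%n", flop / seconds / 1e6);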


