Multithreaded matrix multiplication

Question

Multithreaded matrix multiplication

I recently started using multithreading in java. Since I am writing a program for numerical calculations at my university, I decided to make my first attempts by performing multi-threaded matrix multiplication.

This is my code. Please keep in mind that this was only done as a first attempt and not very clean.

public class MultithreadingTest{ public static void main(String[] args) { // TODO Auto-generated method stub double[][] matrix1 = randomSquareMatrix(2000); double[][] matrix2 = randomSquareMatrix(2000); matrixMultiplication(matrix1,matrix2,true); matrixMultiplicationSingleThread(matrix1, matrix2); try { matrixMultiplicationParallel(matrix1,matrix2, true); } catch (InterruptedException | ExecutionException e) { // TODO Auto-generated catch block e.printStackTrace(); } try { matrixMultiplicationParallel2(matrix1,matrix2, true); } catch (InterruptedException | ExecutionException e) { // TODO Auto-generated catch block e.printStackTrace(); } } public static double[][] randomSquareMatrix(int n){ double[][] mat = new double[n][n]; Random rand = new Random(); for(int i=0; i<n; i++) for(int j=0; j<n; j++) mat[i][j]=rand.nextInt(10); return mat; } public static void printSquareMat(double[][] mat){ int n=mat.length; for(int i=0; i<n; i++){ for(int j=0; j<n; j++) System.out.print(mat[i][j]+" "); System.out.print("\n");} System.out.print("\n"); } public static void average(double[][] matrix) { int n=matrix.length; double sum=0; for(int i=0; i<n; i++) for(int j=0; j<n; j++) sum+=matrix[i][j]; System.out.println("Average of all Elements of Matrix : "+(sum/(n*n))); } public static void matrixMultiplication(double[][] matrix1, double[][] matrix2, boolean printMatrix){ int n=matrix1.length; double[][] resultMatrix = new double[n][n]; double startTime = System.currentTimeMillis(); for(int i=0; i<n; i++)for(int j=0; j<n; j++)for(int k=0; k<n; k++) resultMatrix[i][j]+=matrix1[i][k]*matrix2[k][j]; if (printMatrix && n<=5)for(int i=0; i<n; i++){for(int j=0; j<n; j++) System.out.print(resultMatrix[i][j]+" ");System.out.print("\n"); } System.out.print("\n"); System.out.println(((System.currentTimeMillis()-startTime)/1000)+ " seconds for matrix of size "+n+" in main thread."); average(resultMatrix); } public static void matrixMultiplicationSingleThread(double[][] m1, double[][] m2) { int n=m1.length; double startTime = System.currentTimeMillis(); Thread t = new Thread(new multiSingle(m1,m2)); t.start(); try { t.join(); } catch (InterruptedException e) { // TODO Auto-generated catch block System.out.println("Error"); e.printStackTrace(); } System.out.print("\n"); System.out.println(((System.currentTimeMillis()-startTime)/1000)+ " seconds for matrix of size "+n+" in external Thread."); } public static void matrixMultiplicationParallel(double[][] matrix1, double[][] matrix2, boolean printMatrix) throws InterruptedException, ExecutionException{ int n=matrix1.length; double[][] resultMatrix=new double[n][n]; double tmp; ExecutorService exe = Executors.newFixedThreadPool(2); Future<Double>[][] result = new Future[n][n]; double startTime = System.currentTimeMillis(); for(int i=0; i<n; i++) { for(int j=0; j<=i; j++) { tmp=matrix2[i][j]; matrix2[i][j]=matrix2[j][i]; matrix2[j][i]=tmp; } } for(int i=0; i<n; i++) { for(int j=0; j<n; j++) { result[i][j] = exe.submit(new multi(matrix1[i],matrix2[j])); } } exe.shutdown(); exe.awaitTermination(1, TimeUnit.DAYS); for(int i=0; i<n; i++) { for(int j=0; j<n; j++) { resultMatrix[i][j] = result[i][j].get(); } } for(int i=0; i<n; i++) { for(int j=0; j<=i; j++) { tmp=matrix2[i][j]; matrix2[i][j]=matrix2[j][i]; matrix2[j][i]=tmp; } } if (printMatrix && n<=5)for(int i=0; i<n; i++){for(int j=0; j<n; j++) System.out.print(resultMatrix[i][j]+" ");System.out.print("\n"); } System.out.print("\n"); System.out.println(((System.currentTimeMillis()-startTime)/1000)+ " seconds for matrix of size "+n+" multithreaded with algorithm 1."); average(resultMatrix); } public static void matrixMultiplicationParallel2(double[][] matrix1, double[][] matrix2, boolean printMatrix) throws InterruptedException, ExecutionException{ int n=matrix1.length; double[][] resultMatrix=new double[n][n]; double tmp; ExecutorService exe = Executors.newFixedThreadPool(2); Future<Double>[][] result = new Future[n][n]; double startTime = System.currentTimeMillis(); for(int i=0; i<n; i++) { for(int j=0; j<n; j++) { result[i][j] = exe.submit(new multi2(i,j,matrix1,matrix2)); } } exe.shutdown(); exe.awaitTermination(1, TimeUnit.DAYS); for(int i=0; i<n; i++) { for(int j=0; j<n; j++) { resultMatrix[i][j] = result[i][j].get(); } } if (printMatrix && n<=5)for(int i=0; i<n; i++){for(int j=0; j<n; j++) System.out.print(resultMatrix[i][j]+" ");System.out.print("\n"); } System.out.print("\n"); System.out.println(((System.currentTimeMillis()-startTime)/1000)+ " seconds for matrix of size "+n+" multithreaded with algorithm 2."); average(resultMatrix); } public static class multi implements Callable<Double>{ multi(double[] vec1, double[] vec2){ this.vec1=vec1; this.vec2=vec2; } double result; double[] vec1, vec2; @Override public Double call() { result=0; for(int i=0; i<vec1.length; i++) result+=vec1[i]*vec2[i]; return result; } } public static class multi2 implements Callable<Double>{ multi2(int a, int b, double[][] vec1, double[][] vec2){ this.a=a; this.b=b; this.vec1=vec1; this.vec2=vec2; } int a,b; double result; double[][] vec1, vec2; @Override public Double call() { result=0; for(int i=0; i<vec1.length; i++) result+=vec1[a][i]*vec2[i][b]; return result; } } public static class multiSingle implements Runnable{ double[][] matrix1, matrix2; multiSingle(double[][] m1, double[][] m2){ matrix1=m1; matrix2=m2; } public static void matrixMultiplication(double[][] matrix1, double[][] matrix2, boolean printMatrix){ int n=matrix1.length; double[][] resultMatrix = new double[n][n]; for(int i=0; i<n; i++)for(int j=0; j<n; j++)for(int k=0; k<n; k++) resultMatrix[i][j]+=matrix1[i][k]*matrix2[k][j]; MultithreadingTest.average(resultMatrix); } @Override public void run() { matrixMultiplication(matrix1, matrix2, false); } } }

I have two general questions for multithreading, I hope that he will not open a new topic for this.

Is there a way to write code without additional classes for threads that implement runnable or callable? I looked at the approaches using anonymous inner classes and lambdas, but since I have information about the font, I cannot pass parameters to threads this way, since run () and call () do not accept any, that is, if the parameters are final. But, assuming that I am writing a class for operations with matrices, I would prefer not to write an additional class for each operation that I want to run in the stream.
Assuming that my class does a lot of multithreaded operations, creating a thread pool and closing it in each method will waste a lot of resources, I think. Therefore, I would like to create a thread pool as a member of ob my class, creating it if necessary and using invokeAll. But what happens if my object is deleted? Will I have problems since I never close the thread pool? In C ++, I would use a destructor for this. Or does gc take care of everything in this case?

Now directly with a link to my code:

I implemented matrix multiplication in four different ways, like a method running in my main thread, like a method running in a new thread, but still not multi-threaded (so that there are no background taks in my main thread slowing it down), and two different ways multithreaded matrix multiplication. The first version transfers the second matrix, transfers the multiplication as a vector-vector-multiplication, and transfers the matrix back to its original form. The second version takes matrices directly, and additinaly accepts two indexes to determine the row and column of matrices for vector-vector multiplication.

For all versions, I measured the time required for the multiplication and calculated the average value af in the resulting matrices to see if the results are the same.

I ran this code on two computers, both JVM and Windows 10. My first laptop, the 5th generation i5, is 2.6 GHz dual-core, and the second is my desktop PC, the 4th generation i5, 4.2 GHz quad core.

I expected my desktop PC to be much faster. I also expected that multi-threaded versions would take about half / quarter of the time for a streaming version with an inscription, but more so since there is additional work to create streams, etc. And finally, I was expecting a second multi-threaded version that does not transfer one matrix twice to be faster, since there are fewer operations.

After running the code, I got a little confused in the results, I hope someone can explain this to me:

For one-button methods, my laptop needs about 340 (for a matrix size of 3000). Therefore, I assume that there are no expensive background tasks in my main thread. On the other hand, my desktop PC needs 440s. Now the question is why is my laptop, which runs slower, much faster? Even if the fifth generation is faster than the 4th generation, since my desktop PC is 1.6 times faster than the speed of a laptop, I still expect it to be faster. The difference between these generations is unlikely to be large.

For multithreaded methods, my laptop needs about 34 s. If multithreading is ideal, then it should not occupy less than half. Why is it two streams ten times faster? The same goes for my desktop PC. Using four threads, multiplication is done in 16 s instead of 440. It looks like my desktop PC is running at the same speed as my laptop, only on four, not two threads.

Now for comparison between the two multi-threaded methods, the version that transfers the same matrix twice takes about 34 seconds on my laptop, the version that takes the matrices takes about 200 seconds. This sounds realistic as it is more than half the single threaded method. But why is it so much slower than the first version? I would suggest that matrix transfer will be twice slower than the extra time to get the matrix elements? Is there something that I am missing or working with a matrix that is much slower than working with a vector?

Hope someone can answer these questions. Sorry to write such a long post.

Sincerely yours Torsten

+5

java multithreading

Thorsten schmitz Feb 15 '16 at 12:01

source share

2 answers

You must be careful to minimize the overhead when creating threads. A good estimate is to use the ForkJoin infrastructure to separate problems using a thread pool. This structure

Reuses an existing thread pool.
splits the task until it is enough that the pool is busy, but no more.

There is only one floating point block per core, so your scalability will be based on the number of cores you have.

I suggest you read the Fork Join Matrix Multiplication in Java . I could not find the source code for this code.

http://gee.cs.oswego.edu/dl/papers/fj.pdf

http://gee.cs.oswego.edu/dl/cpjslides/fj.pdf using the ForkJoin infrastructure.

+1

Peter Lawrey Feb 15 '16 at 13:12

source share

Matt timmermans · Accepted Answer · 2016-02-15T14:37:13+0000

The answer to the big secret: this time required to multiply the matrix depends largely on the time taken to move the data from RAM to the processor cache. You can have 4 cores, but you only have 1 RAM bus, so you will not get any benefit using more cores (multithreading) if they all block each other, waiting for access to memory.

The first experiment you should try is the following: Write a single-threaded version using matrix transpose and vector multiplication. You will find that it is much faster - probably about as fast as the multi-threaded version with transpose.

The reason the original single-threaded version is so slow is because it needs to load a cache block for each cell in the column that is being multiplied. If you use the transpose matrix, then all these cells are sequential in memory, and loading a single block gives you a bunch.

So, if you want to optimize matrix multiplication, FIRST optimize memory access for cache efficiency, THEN split the work between several threads - no more than twice as many threads as you have kernels. Anything that even more simply wastes time and resources with context switches, etc.

regarding other issues:

1) It is convenient to use lambdas, which capture variables from the area that creates them, for example:

 for(int i=0; i<n; i++) { for(int j=0; j<n; j++) { final double[] v1 = matrix1[i]; final double[] v2 = matrix2[j]; result[i][j] = exe.submit(() -> vecdot(v1,v2)); } }

2) GC will take care of this. You do not need to explicitly close the thread pool to free up any resources.

Multithreaded matrix multiplication

More articles: