You picked a bad example, as Tudor was kind enough to point out. A device with spinning platters is subject to physical constraints on how the platters and heads move, and the most efficient way to read is to read each block in order, which reduces the need to move the head or wait for the disk to rotate into position.
However, operating systems do not always store data contiguously on disk, and for those who remember, defragmentation could improve disk performance if the OS / file system did not do that job for you.
Since you mentioned wanting a program that would benefit, let me offer a simple one: matrix addition.
Assuming you create one thread per core, you can trivially divide any two matrices to be added into N groups of rows (one group for each thread). Matrix addition (if you recall) works like this:
A + B = C
or
[ a11, a12, a13 ]   [ b11, b12, b13 ]   [ (a11+b11), (a12+b12), (a13+b13) ]
[ a21, a22, a23 ] + [ b21, b22, b23 ] = [ (a21+b21), (a22+b22), (a23+b23) ]
[ a31, a32, a33 ]   [ b31, b32, b33 ]   [ (a31+b31), (a32+b32), (a33+b33) ]
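For reference, here is a minimal single-threaded sketch of that element-wise rule in Python. The function name `add_matrices` and the sample values are only illustrative; they are not from the original question.

```python
def add_matrices(a, b):
    """Return C where C[i][j] = A[i][j] + B[i][j]; both inputs must have the same shape."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same dimensions")
    # Each output row depends only on the matching rows of a and b.
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
c = add_matrices(a, b)
```

Because each row of the result depends only on the same row of the two inputs, the rows can be handed out to different threads without any coordination between them.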
So, to distribute this among N threads, we just take each row number modulo the number of threads to get the "thread ID" of the thread that will add that row.
matrix with 20 rows across 3 threads
row % 3 == 0 (for rows 0, 3, 6, 9, 12, 15, and 18)
row % 3 == 1 (for rows 1, 4, 7, 10, 13, 16, and 19)
row % 3 == 2 (for rows 2, 5, 8, 11, 14, and 17)
// row 20 doesn't exist, because we number rows from 0
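The same assignment can be sketched in a few lines of Python. The helper name `rows_for_thread` is hypothetical and exists only to illustrate the modulo rule.

```python
def rows_for_thread(thread_id, n_rows, n_threads):
    """Rows (numbered from 0) for which row % n_threads == thread_id."""
    return [row for row in range(n_rows) if row % n_threads == thread_id]


# 20 rows across 3 threads, as in the listing above.
assignment = {t: rows_for_thread(t, 20, 3) for t in range(3)}
```

Every row lands in exactly one thread's list, so no two threads ever write the same row of the result.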
Now each thread "knows" which rows it should handle, and the result for each row can be computed trivially, because no row's result touches another thread's part of the computation.
All that is needed now is a "result" data structure that tracks which values have been computed; once the last value is set, the computation is complete. In this contrived example of adding matrices with two threads, computing the answer takes roughly half the time.
More complex problems can be solved with multithreading, and different problems call for different techniques. I deliberately chose one of the simplest examples.
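Putting the pieces together, here is a rough sketch of what such a "result" structure could look like in Python. The names `Result`, `worker` and `parallel_add` are all hypothetical, and this is only one of several ways to track completion. Note also that in CPython the global interpreter lock keeps pure-Python arithmetic like this from running on two cores at once, so the sketch shows the structure rather than the speedup; the "half the time" estimate assumes a runtime where threads really do run in parallel.

```python
import threading


class Result:
    """Holds the output matrix and counts how many rows are still missing."""

    def __init__(self, n_rows, n_cols):
        self.rows = [[0] * n_cols for _ in range(n_rows)]
        self.remaining = n_rows
        self.lock = threading.Lock()
        self.done = threading.Event()

    def set_row(self, i, values):
        self.rows[i] = values      # each row is written by exactly one thread
        with self.lock:            # the shared counter does need protection
            self.remaining -= 1
            if self.remaining == 0:
                self.done.set()    # last value set: the computation is complete


def worker(thread_id, n_threads, a, b, result):
    # Handle every row whose index satisfies row % n_threads == thread_id.
    for i in range(thread_id, len(a), n_threads):
        result.set_row(i, [x + y for x, y in zip(a[i], b[i])])


def parallel_add(a, b, n_threads=2):
    if not a:
        return []
    result = Result(len(a), len(a[0]))
    threads = [
        threading.Thread(target=worker, args=(t, n_threads, a, b, result))
        for t in range(n_threads)
    ]
    for t in threads:
        t.start()
    result.done.wait()             # block until the last row has been set
    for t in threads:
        t.join()
    return result.rows
```

The only shared state that needs a lock is the counter of remaining rows; the rows themselves need none, because the modulo assignment gives each row to a single thread.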