Go: Unexpected performance when accessing an array through a slice of slices (2D slice)

I did some performance experiments in Go with matrix multiplication and came across some unexpected results.

Version 1:

    func newMatrix(n int) [][]int {
        m := make([][]int, n)
        buf := make([]int, n*n)
        for i := range m {
            m[i] = buf[i*n : (i+1)*n]
        }
        return m
    }

    func mult1(m1, m2, res [][]int) [][]int {
        for i := range m1 {
            for k := range m1[0] {
                for j := range m2[0] {
                    res[i][j] += m1[i][k] * m2[k][j]
                }
            }
        }
        return res
    }

From a single linear array, I create several slices that represent the rows of the matrix.

Version 2:

    func mult2(m1, m2, res []int, n int) []int {
        for i := 0; i < n; i++ {
            for k := 0; k < n; k++ {
                for j := 0; j < n; j++ {
                    res[i*n+j] += m1[i*n+k] * m2[k*n+j]
                }
            }
        }
        return res
    }

In this version, I just use a single linear array and index into it with computed offsets (i*n+j) in the multiplication.

Multiplying two 2048x2048 matrices gives the following runtimes:

    version 1: 35.550813801s
    version 2: 19.090223468s

Version 2 is almost twice as fast.

I used the approach below to measure:

    start := time.Now()
    mult(m1, m2, m3)
    stop := time.Now()
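For completeness, a runnable version of that harness might look like this (a sketch using newMatrix and mult1 from above; the final printout is just for illustration):

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        n := 2048
        m1 := newMatrix(n)
        m2 := newMatrix(n)
        res := newMatrix(n)

        start := time.Now()
        mult1(m1, m2, res)
        stop := time.Now()

        fmt.Println("version 1:", stop.Sub(start))
    }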

I know that using slices adds another level of indirection that can hurt cache performance, but I did not expect the difference to be this large. Unfortunately, I could not find a good tool that works on a Mac for analyzing cache behavior of Go programs, so I can't say for sure what exactly causes the difference.

So I guess what I'm asking is: is this expected behavior, or is there something I'm missing?

Software and hardware: Go version 1.4.2 darwin/amd64; OS X 10.10.3; 2 GHz quad-core i7.

2 answers

The main problem with the version 1 code seems to be indirection. Even though the in-memory layout of the matrices is the same in both versions, the indirect addressing can lead to:

  • More generated instructions for the same code. The compiler may have trouble determining when to use packed SIMD instructions (e.g., SSE, AVX). You can verify this by dumping the assembly: look for XMM or YMM registers and check whether the instructions operating on those registers are packed.
  • It is harder for the compiler to insert software prefetches. Because of the indirect addressing, the compiler has difficulty determining where to add software prefetching. You can search for prefetch instructions in the assembly.
  • The hardware prefetcher will also be less effective because of the indirect addressing. You first have to access the row's start address and only then the row's elements, so it is hard for the hardware prefetcher to recognize that it should simply fetch sequential addresses. This can only be measured through profiling, for example with perf.

So in the case of version 1, indirect addressing is the main problem. I also recommend running both versions for several iterations to remove the cache-warming penalty, which may be higher for version 1 for the reasons explained above.
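A sketch of how that might look with Go's standard testing package, which runs each benchmark b.N times and so amortizes the warm-up cost (the 512x512 size is an arbitrary choice to keep runs short):

    // matmul_test.go
    package main

    import "testing"

    const benchN = 512 // arbitrary size for illustration

    func BenchmarkMult1(b *testing.B) {
        m1 := newMatrix(benchN)
        m2 := newMatrix(benchN)
        res := newMatrix(benchN)
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            mult1(m1, m2, res)
        }
    }

    func BenchmarkMult2(b *testing.B) {
        m1 := make([]int, benchN*benchN)
        m2 := make([]int, benchN*benchN)
        res := make([]int, benchN*benchN)
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            mult2(m1, m2, res, benchN)
        }
    }

Running it with go test -bench=. then gives per-iteration timings for both versions.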


Unfortunately, I don't have enough reputation to post this as a comment, but in addition to VAndrei's points, it's worth noting that the two examples use their for loops differently. How does the first example perform after s/i := range m1/i := 0; i < n; i++/ (and the analogous change to the k and j loops)? Something like the sketch below.
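A sketch of what that replacement might look like (mult1 rewritten with explicit index loops, matching the loop style of mult2 while keeping the [][]int representation; n is assumed to be the common dimension of the square matrices):

    func mult1Indexed(m1, m2, res [][]int, n int) [][]int {
        for i := 0; i < n; i++ {
            for k := 0; k < n; k++ {
                for j := 0; j < n; j++ {
                    res[i][j] += m1[i][k] * m2[k][j]
                }
            }
        }
        return res
    }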

It can also be useful to check how the output of "list mult1" and "list mult2" looks in pprof. There is a great tutorial to get started with Go's pprof quickly: Profiling Go Programs by Russ Cox.
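A minimal way to collect a CPU profile from a standalone run (a sketch; the output filename is arbitrary), after which go tool pprof supports the list command:

    package main

    import (
        "log"
        "os"
        "runtime/pprof"
    )

    func main() {
        // Write a CPU profile to cpu.prof for later inspection with
        // `go tool pprof` and its `list mult1` / `list mult2` commands.
        f, err := os.Create("cpu.prof")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        if err := pprof.StartCPUProfile(f); err != nil {
            log.Fatal(err)
        }
        defer pprof.StopCPUProfile()

        n := 1024 // arbitrary size for illustration
        mult2(make([]int, n*n), make([]int, n*n), make([]int, n*n), n)
    }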


Source: https://habr.com/ru/post/986931/

