Implications of a large number of mutexes

Suppose I have an array of 1,000,000 elements and several worker threads, each of which manipulates data in this array. Workers can overwrite already-filled elements with new data, but each operation is limited to a single array element and does not depend on the value of any other element.

Using one mutex to protect the entire array would clearly lead to high contention. At the other extreme, I could create an array of mutexes the same length as the original array, and lock mutex[i] while working on array[i]. Assuming an even distribution of data, this essentially eliminates lock contention, at the cost of a lot of memory.

I think a more reasonable solution is an array of n mutexes (where 1 < n < 1,000,000). Then for each element array[i] I would lock mutex[i % n] while working on it. If n is large enough, contention should still be minimal.
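A minimal sketch of that striping idea with pthreads (the names and the choice of n = 1024 are illustrative, not from the question):

    #include <pthread.h>
    #include <stddef.h>

    #define ARRAY_LEN 1000000
    #define NUM_LOCKS 1024                /* n: tunable, 1 < n < ARRAY_LEN */

    static double data[ARRAY_LEN];
    static pthread_mutex_t locks[NUM_LOCKS];

    static void locks_init(void)
    {
        for (size_t i = 0; i < NUM_LOCKS; i++)
            pthread_mutex_init(&locks[i], NULL);
    }

    /* Update element i while holding the stripe lock that covers it. */
    static void update_element(size_t i, double value)
    {
        pthread_mutex_t *m = &locks[i % NUM_LOCKS];
        pthread_mutex_lock(m);
        data[i] = value;                  /* touches only element i */
        pthread_mutex_unlock(m);
    }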

So my question is: is there a performance penalty to using a large number of mutexes (e.g., >= 1,000,000) in this way, beyond the increased memory usage? If so, how many mutexes can you reasonably use before degradation sets in?

I am sure the answer to this is somewhat platform specific; I am using pthreads on Linux. I am also working on setting up my own benchmarks, but the scale of the data I am working with makes that time-consuming, so some initial guidance would be appreciated.


That was the original question. For those asking for more detail about the problem: I have four multi-gigabyte binary files describing somewhere around half a billion events that need to be analyzed. The array in question is actually an array of pointers backing a very large chained hash table. We read the four data files into the hash table, possibly merging entries together if they share certain characteristics. In the existing implementation there are 4 threads, each reading one file and inserting that file's records into the hash table. The hash table has 997 locks and 997 * 9973 = ~10,000,000 pointers. When inserting an element with hash h, I first lock mutex[h % 997], then insert or modify the element in bucket[h % 9943081]. This works reasonably well, and as far as I can tell we have not had too much trouble with contention, but there is a performance bottleneck in that we use only 4 cores of a 16-core machine. (And even fewer as the load progresses, since the files are generally not all the same size.) Once all the data has been read into memory, we analyze it, using new threads and a new locking strategy tuned to the different workload.
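A minimal sketch of that striping scheme as I understand it (the entry layout is hypothetical, and the real code also merges matching entries rather than always prepending):

    #include <pthread.h>
    #include <stdint.h>

    #define NUM_LOCKS   997               /* lock count from the question */
    #define NUM_BUCKETS 9943081           /* 997 * 9973 pointer slots */

    struct entry {
        uint64_t key;                     /* hypothetical payload */
        struct entry *next;
    };

    static struct entry   *buckets[NUM_BUCKETS];
    static pthread_mutex_t locks[NUM_LOCKS];  /* initialized at startup */

    static void ht_insert(uint64_t h, struct entry *e)
    {
        pthread_mutex_lock(&locks[h % NUM_LOCKS]);
        e->next = buckets[h % NUM_BUCKETS];
        buckets[h % NUM_BUCKETS] = e;
        pthread_mutex_unlock(&locks[h % NUM_LOCKS]);
    }

Note that because 997 divides 9,943,081 = 997 * 9973, all entries that share a bucket also share a lock, so each bucket is consistently protected by exactly one mutex.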

I am trying to improve the performance of the data-loading phase by switching to a thread pool. In the new model, I still have one thread per file, but it simply reads the file in ~1 MB chunks and hands each chunk to a worker thread in the pool for parsing and insertion. So far the performance gain has been minimal, and the profiling I did seemed to indicate that the time spent locking and unlocking the bucket array was the likely culprit. Locking is built into the hash table implementation we use, but it does allow the number of locks to be specified independently of the table size. I am hoping to speed things up without changing the hash table implementation itself.

+2
3 answers

(A very partial and possibly indirect answer to your question.)

I once took a huge performance hit trying exactly this (on CentOS), increasing the number of locks from a prime of ~1K to a prime of ~1M. Although I never fully understood the reason, I eventually realized (or simply convinced myself) that it was the wrong question.

Suppose you have an array of length M, with n workers. Suppose further that you protect its elements with m < M mutexes chosen by a hash function (e.g., some random grouping). Then, using the square approximation to the birthday paradox, the probability of a collision between two workers, p, is given by:

p ~ n² / (2m)


It follows that the number of mutexes you need, m, does not depend on M at all - it is a function of only p and n.
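Rearranged for the lock count, that gives m ~ n² / (2p). To make it concrete (the numbers here are illustrative, not from the answer): with n = 16 workers and a target collision probability of p = 0.01, you need only m ~ 256 / 0.02 = 12,800 mutexes, whether M is one million or one billion.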

+3

On Linux, there is no cost other than memory associated with a large number of mutexes.

However, remember that the memory occupied by your mutexes counts toward your working set - and if your working set exceeds the relevant cache size, you will see significant performance degradation. So you do not want an overly large mutex array.
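A quick back-of-the-envelope check (the 40-byte figure is typical of glibc on x86-64, but it is platform dependent):

    #include <pthread.h>
    #include <stdio.h>

    int main(void)
    {
        size_t n = 1000000;
        /* glibc on x86-64 typically has sizeof(pthread_mutex_t) == 40,
         * so a million mutexes add roughly 40 MB to the working set -
         * far larger than a typical L2 cache and a big slice of L3. */
        printf("one mutex:  %zu bytes\n", sizeof(pthread_mutex_t));
        printf("%zu mutexes: ~%zu MiB\n",
               n, n * sizeof(pthread_mutex_t) / (1024 * 1024));
        return 0;
    }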

As Ami Tavory points out, contention depends on the number of mutexes and the number of threads, not on the number of protected data elements - so there is no reason to tie the number of mutexes to the number of data elements (with the obvious caveat that it never makes sense to have more mutexes than elements).

+2

In general, I would advise either:

  • Simply locking the entire array (simple, and very often "good enough" if your application mostly does other things besides accessing the array)

    ... or ...

  • Implementing a read/write lock for the entire array (assuming reads equal or outnumber writes) - see the sketch after this list
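A minimal sketch of the second option with pthreads (the names are illustrative):

    #include <pthread.h>
    #include <stddef.h>

    static double data[1000000];
    static pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;

    static double read_element(size_t i)
    {
        pthread_rwlock_rdlock(&rwlock);   /* many readers can hold this */
        double v = data[i];
        pthread_rwlock_unlock(&rwlock);
        return v;
    }

    static void write_element(size_t i, double v)
    {
        pthread_rwlock_wrlock(&rwlock);   /* writers are exclusive */
        data[i] = v;
        pthread_rwlock_unlock(&rwlock);
    }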

But it sounds like your scenario matches neither case.

Q: Have you considered implementing some kind of "write queue"?

In the worst case, you need only a single mutex; at best, you might even be able to manage the queue with a lockless mechanism. See here for some ideas that may apply: https://msdn.microsoft.com/en-us/library/windows/desktop/ee418650%28v=vs.85%29.aspx
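To illustrate the queue idea (a sketch only, not the code from the linked article; the node layout is hypothetical):

    #include <pthread.h>
    #include <stddef.h>

    struct write_op {
        struct write_op *next;
        size_t index;                     /* pending write: where and what */
        double value;
    };

    static struct write_op *head, *tail;
    static pthread_mutex_t  qlock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t   qcond = PTHREAD_COND_INITIALIZER;

    /* Producers enqueue writes under the single queue mutex. */
    static void enqueue(struct write_op *op)
    {
        op->next = NULL;
        pthread_mutex_lock(&qlock);
        if (tail) tail->next = op; else head = op;
        tail = op;
        pthread_cond_signal(&qcond);
        pthread_mutex_unlock(&qlock);
    }

    /* A single consumer pops writes and applies them to the array,
     * so the array itself needs no locking at all. */
    static struct write_op *dequeue(void)
    {
        pthread_mutex_lock(&qlock);
        while (head == NULL)
            pthread_cond_wait(&qcond, &qlock);
        struct write_op *op = head;
        head = op->next;
        if (head == NULL) tail = NULL;
        pthread_mutex_unlock(&qlock);
        return op;
    }

With a single consumer draining the queue, the queue mutex becomes the only point of contention.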

0
