The interviewer's initial question is: "... multithreading is allowed." The wording of this question may be a little ambiguous, but the spirit of the question is obvious: the interviewer is asking the candidate to write a program that solves the problem, and also to analyze and justify the use (or not) of multithreading as part of the proposed solution. It is a simple question to test a candidate's ability to reason about a large-scale problem and explain the algorithmic choices they make, ensuring that the candidate hasn't just ripped something off a website without understanding it.
Given this, this particular interview question can be solved effectively in O(n log n) (asymptotically) whether or not multithreading is used, and multithreading can additionally be used to speed up the actual execution time by a logarithmic factor.
Solution Overview
If you were asked the OP's question in an interview at a top-flight company, the following approach would show that you really understand the problem and the issues involved. Here we propose a two-step approach: first, partition the file into name/count pairs, each name starting with a count of 1; then merge consecutive pairs of partitions, keeping each merged partition sorted and summing the counts of duplicate names, until a single sorted list remains.
As an example, consider a file with 32 names, each one letter long and each with an initial frequency of 1. The above strategy can be visualized as follows:
1. File:
   ARBIKJLOSNUITDBSCPBNJDTLGMGHQMRH                                 32 names

2. Partitions:
   A|R|B|I|K|J|L|O|S|N|U|I|T|D|B|S|C|P|B|N|J|D|T|L|G|M|G|H|Q|M|R|H  32 partitions
   1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1  with counts

3. First merge pass (each consecutive pair sorted and combined):
   AR BI JK LO NS IU DT BS CP BN DJ LT GM GH MQ HR
Subsequent merge passes combine these partitions in the same way until only one remains. So, if we read that final list in memory from beginning to end, it gives a sorted list:
A|B|C|D|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U
-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
1|3|1|2|2|2|2|2|1|2|2|2|1|1|1|2|2|2|1  = 32 name instances (== original file)
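The strategy above can be sketched in Python. This is a hypothetical illustration, not code from the original question; the function names `merge_counts` and `count_sort` are my own:

```python
def merge_counts(left, right):
    # Merge two sorted lists of (name, count) pairs,
    # summing the counts of any duplicate names.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] == right[j][0]:
            merged.append((left[i][0], left[i][1] + right[j][1]))
            i += 1
            j += 1
        elif left[i][0] < right[j][0]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

def count_sort(names):
    # Each name starts as its own partition with a count of 1;
    # merge consecutive pairs of partitions until one list remains.
    parts = [[(n, 1)] for n in names]
    while len(parts) > 1:
        parts = [merge_counts(parts[k], parts[k + 1])
                 if k + 1 < len(parts) else parts[k]
                 for k in range(0, len(parts), 2)]
    return parts[0] if parts else []
```

Running `count_sort` on the 32-letter example file yields the sorted name/frequency list shown above, without ever building a hash table.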
Why is this solution efficient?
Whether or not a hash table is used (as the original poster suggested), and whether or not multithreading is used, no solution to this question can do better than O(n log n), because the output must be sorted. Given this limitation, there are two strategies to consider:
(1) Read the data from disk, use a hash table to maintain the name/frequency totals, then sort the contents of the hash table (what the original poster suggested).
(2) Read the data from disk, initialize each name with a count of 1, then merge sort the names, summing the counts for each duplicate name (this answer's solution).
Solution (1) requires the hash table to be sorted after all of the data has been read. Solution (2) tallies frequencies as it sorts, so the hash table is eliminated entirely. Without considering multithreading at all, we can already see that even with the most efficient hash table implementation, solution (2) is more efficient, since it has no hash table overhead at all.
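For contrast, solution (1) can be sketched in a few lines of Python (a hypothetical illustration; `collections.Counter` stands in for "the most efficient hash table implementation"):

```python
from collections import Counter

def hash_then_sort(names):
    # Solution (1): tally frequencies in a hash table first,
    # then sort the (name, count) entries in a separate pass.
    return sorted(Counter(names).items())
```

Note the two distinct phases: the sort cannot begin until the entire hash table has been populated, and the table itself is pure overhead relative to solution (2).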
Constraints on multithreading
In both solution (1) and solution (2), assuming that solution (1) uses the most efficient possible hash table implementation, the two algorithms perform the same asymptotically, O(n log n); it is just that the order of their operations differs slightly. However, while multithreading solution (1) actually slows down its execution, multithreading solution (2) yields a significant speed improvement. How is this possible?
If we multithread solution (1), then whether during the reads from disk or during the subsequent sort, we run into contention on the hash table, since all threads are trying to access it at the same time. Particularly for writes to the table, this contention can hurt the execution time of solution (1) so badly that running it without multithreading would actually be faster.
For multithreading to speed up execution time, we need to make sure that each block of work performed by each thread is independent of every other thread. This lets all threads run at full speed with no contention on shared resources. Solution (2) does exactly that: it removes the hash table entirely and applies merge sort, a divide-and-conquer algorithm that breaks the problem into subproblems that are independent of one another.
Multithreading and file partitioning to further improve runtime
To multithread the merge sort, the file can be divided into partitions, with a new thread created to merge each consecutive pair of partitions. Since the names in the file are of variable length, the file must be scanned serially from beginning to end in order to partition it; random access into the file cannot be used. However, since any solution must scan the file's contents at least once anyway, allowing only serial access still yields an optimal solution.
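The pairwise-merge structure can be sketched as follows (a hypothetical illustration; the names `merge_pair` and `parallel_merge` are my own, and note that in CPython the GIL prevents true CPU parallelism for threads, so a real implementation would use a process pool or a GIL-free language — the point here is the structure: each merge touches only its own two partitions, so workers never contend on shared state):

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge
from itertools import groupby

def merge_pair(pair):
    # Merge two sorted (name, count) lists, summing counts of duplicate names.
    a, b = pair
    return [(name, sum(c for _, c in grp))
            for name, grp in groupby(merge(a, b), key=lambda p: p[0])]

def parallel_merge(parts, workers=4):
    # Repeatedly merge consecutive pairs of partitions; each pairwise merge
    # is independent of every other, so no locks are needed.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while len(parts) > 1:
            pairs = list(zip(parts[0::2], parts[1::2]))
            tail = [parts[-1]] if len(parts) % 2 else []
            parts = list(pool.map(merge_pair, pairs)) + tail
    return parts[0] if parts else []
```

Each round halves the number of partitions, so a file split into n partitions is fully merged after log2(n) rounds.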
What runtime speedup can be expected from multithreading solution (2)? The analysis of this algorithm is quite tricky given its simplicity, and it has been the subject of various white papers. In essence, dividing the file into n partitions lets the program run about (n / log(n)) times faster than on a single CPU with no file partitioning. Simply put, if one processor takes 1 hour to process a 640 GB file, then splitting the file into 64 fragments of 10 GB each and running on a machine with 32 processors lets the program finish in about 6 minutes, a 10x speedup (ignoring disk overhead).
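The arithmetic behind that estimate can be checked directly (an idealized model that assumes enough CPUs and ignores disk and merge overheads):

```python
from math import log2

def ideal_speedup(partitions):
    # Idealized n / log2(n) speedup from partitioned, parallel merge sorting:
    # n partitions of work run in parallel, over log2(n) merge rounds.
    return partitions / log2(partitions)

# With 64 partitions: 64 / log2(64) = 64 / 6, roughly a 10.7x speedup,
# so a 1-hour single-CPU run drops to about 60 / 10.7, roughly 5.6 minutes.
```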