Processing a large file with multithreading

There is a rather large file on disk (> 10 GB); each line of the file consists of a line number and a person's name, for example:

1 Jane
2 Perk
3 Sime
4 Perk
...

I need to read this large file, count the frequency of each name, and finally output the results in descending order of frequency, for example:

Perk 2
Jane 1
Sime 1

The interviewer asked that the task be performed as efficiently as possible, and multithreading is allowed. My solution looks something like this:

  • Since the file is too large, I split it into several small chunks of about 100 MB each; using lseek I can find the beginning and end of each chunk (beg, end);

  • For these chunks there is a shared hash map that uses the person's name as the key and the number of times it appears as the value;

  • Each chunk is processed by one thread; every time the thread encounters a name, it increments the corresponding value in the shared hash map;

  • When all the threads finish, I think it's time to sort the hash map by the value field (a rough sketch of this plan is shown below).
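A minimal sketch of this plan in C++ (the file name, thread count, and chunk boundaries below are placeholder assumptions; the real (beg, end) pairs would come from the lseek-based splitting described above):

#include <fstream>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical chunk descriptor produced by the lseek-based splitting step.
struct Chunk { std::streamoff beg; std::streamoff end; };

std::unordered_map<std::string, long> counts;  // the shared hash map
std::mutex counts_mutex;                       // guards every update

void count_chunk(const std::string& path, Chunk c) {
    std::ifstream in(path);
    in.seekg(c.beg);
    long line_no; std::string name;
    while (in.tellg() < c.end && in >> line_no >> name) {
        std::lock_guard<std::mutex> lock(counts_mutex);  // the contention point
        ++counts[name];
    }
}

int main() {
    // Placeholder boundaries; the real ones come from the splitting step.
    std::vector<Chunk> chunks = { {0, 100'000'000}, {100'000'000, 200'000'000} };
    std::vector<std::thread> workers;
    for (Chunk c : chunks) workers.emplace_back(count_chunk, "names.txt", c);
    for (auto& t : workers) t.join();
    // ... sort `counts` by value here - the step the question asks about ...
}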

But since there may be very many distinct names in the file, sorting could be slow, and I don't have a good idea of how to output the names in descending order efficiently.

I hope someone can help me with the above problem and suggest a better way to do the work using multithreading and sorting.

+6
5 answers

Using the map-reduce approach might be a good idea for your problem. This approach will consist of two steps:

  • Map: read chunks of data from the file and create a thread to process each chunk;
  • Reduce: the main thread waits for all other threads to complete and then combines the results of each individual thread.

The advantage of this solution is that you do not need locking between threads, since each of them works on a different piece of data. Using a shared data structure, as you suggest, may also work, but you may have some overhead due to lock contention.

The sorting has to happen during the reduce phase, once the data from all threads is available. But you can do some work during the map step to make the full sort at the reduce stage easier (faster).
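A minimal sketch of this map/reduce split in C++ (the file name and chunk boundaries are placeholder assumptions): each map thread counts its chunk into its own local map, so no locking is needed, and the main thread merges the partial maps in the reduce step.

#include <cstddef>
#include <fstream>
#include <functional>
#include <string>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

using Counts = std::unordered_map<std::string, long>;

// Map step: count one chunk of the file into a thread-local map, no locking.
void map_chunk(const std::string& path, std::streamoff beg, std::streamoff end,
               Counts& local) {
    std::ifstream in(path);
    in.seekg(beg);
    long line_no; std::string name;
    while (in.tellg() < end && in >> line_no >> name) ++local[name];
}

int main() {
    // Placeholder chunk boundaries; in practice they come from splitting the
    // file on line boundaries, as the question describes.
    std::vector<std::pair<std::streamoff, std::streamoff>> chunks =
        { {0, 100'000'000}, {100'000'000, 200'000'000} };

    std::vector<Counts> partial(chunks.size());
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < chunks.size(); ++i)
        workers.emplace_back(map_chunk, "names.txt",
                             chunks[i].first, chunks[i].second,
                             std::ref(partial[i]));
    for (auto& t : workers) t.join();

    // Reduce step: the main thread merges the per-thread maps; the single
    // sort by frequency then happens here, on the merged result.
    Counts total;
    for (const Counts& p : partial)
        for (const auto& kv : p) total[kv.first] += kv.second;
}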

If you prefer to avoid a sequential sort at the end, you can use a custom data structure instead. I would use a map (something like a red-black tree or a hash table) to look names up quickly, plus a heap to maintain the ordering of frequencies between names. Of course, you would need concurrent versions of these data structures. Depending on how coarse-grained the parallelization is, you may or may not run into contention problems.

+7

If I asked this as an interview question and used the word "efficiently", I would expect an answer along the lines of cut -f 2 -d ' ' < file | sort | uniq -c, because the most common way to waste efficiency is to spend time solving a problem that has already been solved. This is actually a good idea; I will add something similar to our interview questions.

Your bottleneck will be disk I/O, so any amount of multithreading over-engineers the solution (which also counts against "efficiency"). Splitting your reads like this will either slow things down on spinning disks, or at least confuse the buffer cache and make the drop-behind algorithm less likely to kick in. Bad idea, don't do it.

+6

I don't think multithreading is a good idea. The "slow" part of the program is reading from disk, and reading from disk is not sped up by multithreading. It would also make the program much more complicated: for each chunk you need, for example, to find the first "full" line, you need to coordinate the various threads, and you need to lock the shared hash map every time you access it. Instead, each thread can work on a "local" hash map, and when all the threads finish (at the end of the 10 GB) the partial hash maps are merged. Then you do not need to synchronize access to a shared map at all.

I think sorting the resulting hash map will be the easiest part, provided the whole map fits in memory :-) You just copy it into a malloc(ed) block and qsort it by the counter.
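A minimal C++ equivalent of that copy-and-sort step (std::sort is used here in place of qsort; the merged map is assumed to fit in memory, and the function name is illustrative):

#include <algorithm>
#include <cstdio>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Copy the merged name->count map into a contiguous array and sort it by
// count, descending (the equivalent of the malloc + qsort step above).
void print_by_frequency(const std::unordered_map<std::string, long>& counts) {
    std::vector<std::pair<std::string, long>> entries(counts.begin(), counts.end());
    std::sort(entries.begin(), entries.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    for (const auto& e : entries)
        std::printf("%s %ld\n", e.first.c_str(), e.second);
}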

+3

Steps (2) and (4) of your solution make it essentially sequential: the second requires locking to keep the shared hash map consistent, and the last is where you try to sort all the data at once.

Sorting the whole hash map in one step at the end is a bit strange; you should use an incremental sorting technique such as heapsort (with locking on the shared data structure) or mergesort (sort parts of the "histogram" separately, but avoid merging everything in one main thread at the end; try to build a sorting network and merge the partial outputs at each stage of the sort).

Multithreaded reads can be a problem, but with modern SSDs and aggressive read caching they are not the main slowdown factor. It is all about synchronizing the sorting of the results.

Here's a parallel merge sort sample: http://dzmitryhuba.blogspot.com/2010/10/parallel-merge-sort.html

Once again, as I said, a sorting network can help provide efficient parallel sorting, rather than just "wait for all sub-threads and then sort their results". Maybe a bitonic sort if you have many processors.

+3

The interviewer's original question says: "... multithreading is allowed." The wording may be a little ambiguous, but the spirit of the question is obvious: the interviewer asks the candidate to write a program to solve the problem and also to analyze and justify the use (or not) of multithreading as part of the proposed solution. It is a simple question to test a candidate's ability to think through a large-scale problem and explain the algorithmic choices they make, making sure the candidate didn't just rip something off a website without understanding it.

Given this, this particular interview question can be solved efficiently in O(n log n) (asymptotically) whether or not multithreading is used, and multithreading can additionally be used to speed up the actual execution time.

Solution Overview

If you were asked the OP's question at a top-flight company, the following approach would show that you really understand the problem and the issues associated with it. The approach proposed here has two phases:

  • The file is first partitioned and read into memory.

  • A special version of merge sort is then applied to the partitions, one that simultaneously tallies the frequency of each name while sorting.

As an example, consider a file with 32 names, each consisting of a single letter and each starting with a frequency of 1. The above strategy can be visualized as follows:

1. File:  ARBIKJLOSNUITDBSCPBNJDTLGMGHQMRH                                32 Names

2. A|R|B|I|K|J|L|O|S|N|U|I|T|D|B|S|C|P|B|N|J|D|T|L|G|M|G|H|Q|M|R|H        32 Partitions
   1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1        with counts

3. AR BI JK LO NS IU DT BS CP BN DJ LT GM GH MQ HR                        Merge #1
   11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11                        and tally

4. ABRI JKLO INSU BDST BCNP DJLT GHM HMQR                                  Merge #2
   1111 1111 1111 1111 1111 1111 211 1111                                  and tally

5. ABIJKLOR BDINSTU BCDJLNPT GHMQR                                         Merge #3
   11111111 1111211 11111111 22211                                         and tally

6. ABDIJKLNORSTU BCDGHJLMNPQRT                                             Merge #4
   1212111111211 1112211211111                                             and tally

7. ABCDGHIJKLMNOPQRSTU                                                     Merge #5
   1322111312132113121                                                     and tally

So, if we read the final list in memory from beginning to end, it gives a sorted list:

A|B|C|D|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U
-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
1|3|2|2|1|1|1|3|1|2|1|3|2|1|1|3|1|2|1   = 32 name instances (== original file)



Why is this solution efficient?

Whether or not a hash table is used (as the original poster suggested), and whether or not multithreading is used, this problem cannot be solved more efficiently than O(n log n), because a sort must be performed. Given this limit, there are two strategies to compare:

  • Read the data from disk, use a hash table to maintain the name/frequency tallies, then sort the contents of the hash table (what the original poster suggested);

  • Read the data from disk, initialize each name with its tally, and then merge-sort the names while summing the tallies for each name (this solution).

Solution (1) requires the hash table to be sorted after all the data has been read. Solution (2) performs its frequency tallying as it sorts, so the hash table is eliminated entirely. Without even considering multithreading, we can already see that even with the most efficient hash table implementation, solution (2) is more efficient, since it carries no hash table overhead at all.

Multithreading restrictions

Assuming that solution (1) uses the most efficient hash table implementation, both algorithms perform the same asymptotically, O(n log n); only the order of their operations differs slightly. However, while multithreading solution (1) actually slows it down, multithreading solution (2) significantly improves its speed. How is this possible?

If we multithread solution (1), whether in the reading from disk or in the sorting, we run into contention on the hash table, since all threads try to access it at the same time. Especially for writes to the table, this contention can hurt the execution time of solution (1) so badly that running it without multithreading would actually be faster.

For multithreading to speed up the execution time, each block of work performed by a thread must be independent of every other thread. This lets all threads run at full speed without contention on shared resources. Solution (2) does exactly that: it removes the hash table entirely and uses merge sort, a divide-and-conquer algorithm that breaks the problem into subproblems that are independent of each other.

Multithreading and partitioning to further improve runtime

For the multithreaded merge sort, the file can be divided into partitions and a new thread created to merge each consecutive pair of partitions. Since the names in the file are of variable length, the file must be scanned sequentially from beginning to end in order to partition it; random access to the file cannot be used. However, since any solution must examine the file's contents at least once anyway, allowing only sequential access still yields an optimal solution.

What runtime acceleration can be expected from multithreading solution (2)? The analysis of this algorithm, despite its simplicity, is quite involved and has been the subject of various white papers. In short, dividing the file into n partitions lets the program run about n / log(n) times faster than on one CPU without splitting the file. Put simply, if one processor takes 1 hour to process a 640 GB file, then splitting the file into 64 fragments of 10 GB each and running on a 32-CPU machine lets the program finish in about 6 minutes, a roughly 10x speedup (ignoring disk overhead).
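A minimal sketch of the merge-and-tally step and the thread-per-pair merging described above (C++; representing each run as a name-sorted vector of name/count pairs, and the function names, are illustrative assumptions, not part of the original answer):

#include <cstddef>
#include <string>
#include <thread>
#include <utility>
#include <vector>

using Run = std::vector<std::pair<std::string, long>>;  // sorted by name

// Merge two name-sorted runs, summing the counts of equal names as we go
// (the "merge and tally" step from the diagram above).
Run merge_and_tally(const Run& a, const Run& b) {
    Run out;
    std::size_t i = 0, j = 0;
    while (i < a.size() || j < b.size()) {
        if (j == b.size() || (i < a.size() && a[i].first < b[j].first)) {
            out.push_back(a[i++]);
        } else if (i == a.size() || b[j].first < a[i].first) {
            out.push_back(b[j++]);
        } else {  // same name in both runs: collapse and tally
            out.emplace_back(a[i].first, a[i].second + b[j].second);
            ++i; ++j;
        }
    }
    return out;
}

// Merge level by level, one thread per pair of consecutive runs, until a
// single run remains. Each initial run is one file partition with every
// name carrying a count of 1.
Run parallel_merge_all(std::vector<Run> runs) {
    while (runs.size() > 1) {
        std::vector<Run> next((runs.size() + 1) / 2);
        std::vector<std::thread> workers;
        for (std::size_t k = 0; k + 1 < runs.size(); k += 2)
            workers.emplace_back([&runs, &next, k] {
                next[k / 2] = merge_and_tally(runs[k], runs[k + 1]);
            });
        if (runs.size() % 2 == 1) next.back() = runs.back();  // odd run carries over
        for (auto& t : workers) t.join();
        runs = std::move(next);
    }
    // The result is sorted by name with total counts; a final pass (or a
    // different comparator) orders it by descending frequency for output.
    return runs.empty() ? Run{} : runs.front();
}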

+2

Source: https://habr.com/ru/post/920877/

