Processing huge text files

Problem: I have a huge text file (say 3 GB), and I need to go through every word in the file and count how many times each word appears.

My suggestion: split the huge file into several smaller files, with each split file holding its words in sorted order. For example, all words starting with "a" would be stored in a file "_a.dic". That way, at any time we would not need more than 26 files.

The problem with this approach is:

I can use streams to read the file, but I want to use threads to read specific parts of it. For example, read bytes 0-1024 with a separate thread (say 4-8 threads, based on the number of processors in the box). Is this possible, or am I dreaming?

Any better approach?

Note: this should be a pure C++ or C based solution. No databases, etc.

+5
c++ multithreading text-files
Oct 26 '09 at 14:54
10 answers

You need to look at Kernighan and Pike's "The Practice of Programming", and in particular chapter 3.

In C++, use a map keyed by strings with counts as values ( std::map<string,size_t> , IIRC). Read the file (once; it is too large to read more than once), splitting it into words as you go (for some definition of "word"), and increment the count in the map entry for each word found.
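A minimal sketch of that approach might look like this (whitespace-separated "words", no punctuation or case handling; the command-line file argument is just for illustration):

    #include <cstddef>
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <string>

    int main(int argc, char* argv[]) {
        if (argc < 2) { std::cerr << "usage: wordcount <file>\n"; return 1; }

        std::ifstream in(argv[1]);
        std::map<std::string, std::size_t> counts;

        std::string word;
        while (in >> word)    // operator>> splits on whitespace: one crude "definition of word"
            ++counts[word];

        for (const auto& entry : counts)
            std::cout << entry.second << '\t' << entry.first << '\n';
        return 0;
    }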

In C, you would have to build the map yourself. (Or find David Hanson's "C Interfaces and Implementations".)

Or you can use Perl, Python, or Awk, all of which have associative arrays equivalent to a map.

+15
Oct 26 '09 at 15:08

I do not think that using multiple threads to read parts of the file in parallel will help much. I would expect this application to be bound by the bandwidth and latency of your hard drive, not by the actual word counting. Such a multithreaded version might actually perform worse, because quasi-random file access is typically slower than linear file access.

If the CPU really is busy in a single-threaded version, there may be a potential speedup. One thread could read the data in big chunks and put them into a queue of limited capacity. A bunch of other worker threads could each take a chunk of their own and count the words in it. After the counting threads have finished, you merge the per-thread word counts.
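A rough sketch of that layout, assuming C++11 threads, whole lines batched into chunks so that no word straddles a chunk boundary, and arbitrary choices for the chunk size, queue capacity, and thread count:

    #include <algorithm>
    #include <condition_variable>
    #include <cstddef>
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <mutex>
    #include <queue>
    #include <sstream>
    #include <string>
    #include <thread>
    #include <vector>

    // Shared bounded queue of text chunks (illustrative; all names are made up).
    std::queue<std::string> chunks;
    std::mutex mtx;
    std::condition_variable not_full, not_empty;
    bool done = false;
    const std::size_t kMaxQueued = 8;           // capacity limit keeps memory bounded

    void count_words(std::map<std::string, std::size_t>& local) {
        for (;;) {
            std::string chunk;
            {
                std::unique_lock<std::mutex> lock(mtx);
                not_empty.wait(lock, [] { return !chunks.empty() || done; });
                if (chunks.empty()) return;     // done and nothing left to count
                chunk = std::move(chunks.front());
                chunks.pop();
                not_full.notify_one();
            }
            std::istringstream words(chunk);
            std::string w;
            while (words >> w) ++local[w];
        }
    }

    int main(int argc, char* argv[]) {
        if (argc < 2) return 1;
        std::ifstream in(argv[1]);

        unsigned n = std::max(1u, std::thread::hardware_concurrency());
        std::vector<std::map<std::string, std::size_t>> partial(n);
        std::vector<std::thread> workers;
        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back(count_words, std::ref(partial[i]));

        // Reader thread: batch whole lines so no word straddles a chunk boundary.
        std::string line, chunk;
        while (std::getline(in, line)) {
            chunk += line;
            chunk += '\n';
            if (chunk.size() >= 1 << 20) {      // roughly 1 MB per chunk
                std::unique_lock<std::mutex> lock(mtx);
                not_full.wait(lock, [] { return chunks.size() < kMaxQueued; });
                chunks.push(std::move(chunk));
                chunk.clear();
                not_empty.notify_one();
            }
        }
        {
            std::lock_guard<std::mutex> lock(mtx);
            if (!chunk.empty()) chunks.push(std::move(chunk));
            done = true;
        }
        not_empty.notify_all();
        for (auto& t : workers) t.join();

        // Merge the per-thread counts once everything is done.
        std::map<std::string, std::size_t> total;
        for (const auto& m : partial)
            for (const auto& kv : m) total[kv.first] += kv.second;
        for (const auto& kv : total)
            std::cout << kv.second << '\t' << kv.first << '\n';
        return 0;
    }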

+6
Oct 26 '09 at 15:09

First, decide on the data structure for storing the words.

The obvious choice is a map. But perhaps a trie would serve you better. In each node you store the count for the word ending there; 0 means it is only part of a word. You can insert into the trie as you stream through the file, reading it character by character.
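A minimal node for such a trie might look like this (restricted to lowercase a-z for illustration; real input would need to be normalized first):

    #include <cstddef>
    #include <memory>
    #include <string>

    // Trie keyed on 'a'..'z'; count == 0 means the node is only a prefix, not a word.
    struct TrieNode {
        std::size_t count = 0;
        std::unique_ptr<TrieNode> child[26];
    };

    void add_word(TrieNode& root, const std::string& word) {
        TrieNode* node = &root;
        for (char c : word) {
            int i = c - 'a';                   // assumes input is already lowercased a-z
            if (!node->child[i])
                node->child[i] = std::make_unique<TrieNode>();
            node = node->child[i].get();
        }
        ++node->count;                         // a word ends here; bump its count
    }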

Second: multithreading, yes or no? This is not easy to answer. Depending on how large the data structure grows and how you parallelize, the answer can differ.

  • Single-threaded: simple, and easy to implement.
  • Multithreaded, with multiple reader threads and one data structure. Then you have to synchronize access to the data structure. In a trie, you only need to lock the node you are actually in, so multiple readers can access the data structure without much interference. A self-balancing tree may behave differently, especially when rebalancing.
  • Multithreaded, with multiple reader threads, each with its own data structure. Each thread builds its own data structure while reading its part of the file. After they finish, the results have to be merged (which should be easy).

One thing you have to think about: you have to find a word boundary for each thread to start at, but that should not be a big problem (e.g., each thread skips forward to the first word boundary after its starting offset and begins there; at its end, each thread finishes the word it is currently working on).
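A sketch of that start-boundary adjustment, assuming the whole file is available as one contiguous buffer (e.g. memory-mapped):

    #include <cctype>
    #include <cstddef>

    // Slice 'i' of 'nthreads' nominally starts at len / nthreads * i. If that
    // position falls in the middle of a word, the previous thread owns the word
    // (it keeps scanning past its nominal end until the word finishes), so this
    // thread skips forward to the next word boundary.
    std::size_t slice_start(const char* buf, std::size_t len,
                            std::size_t nthreads, std::size_t i) {
        std::size_t pos = len / nthreads * i;
        if (pos == 0) return 0;
        if (!std::isspace(static_cast<unsigned char>(buf[pos - 1])))
            while (pos < len && !std::isspace(static_cast<unsigned char>(buf[pos])))
                ++pos;
        return pos;
    }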

+3
Oct 26 '09 at 16:10

While you can use a second thread to analyze the data after reading it, you probably won't gain a huge amount by doing so. Trying to use more than one thread to read the data will almost certainly hurt speed rather than improve it. Using multiple threads to process the data is pointless: the processing will be many times faster than the reading, so even with only one extra thread, the limit will be the disk speed.

One (possible) way to gain significant speed is to bypass the usual iostreams. While some implementations are nearly as fast as using C FILE*, I don't know of any that are really faster, and some are substantially slower. If you run this on a system (e.g. Windows) whose I/O model is noticeably different from C's, you can gain considerably more with a little care.

The problem is fairly simple: the file you are reading is (potentially) larger than the cache space you have available, but you gain nothing from caching, because you are not going to re-read pieces of the file again (at least if you do things sensibly). So you want to tell the system to bypass any caching and just transfer the data as directly as possible from the disk into your memory, where you can process it. On a Unix-like system, that is probably open() and read() (and they won't gain you much). On Windows, it is CreateFile and ReadFile, passing the FILE_FLAG_NO_BUFFERING flag to CreateFile, and that will probably roughly double your speed if you do it right.
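For illustration, a bare-bones unbuffered sequential read on Windows might look roughly like this (FILE_FLAG_NO_BUFFERING requires sector-aligned buffers, offsets, and read sizes; the 64 KB block size assumes the sector size divides it, "big.txt" is a placeholder name, and error handling is mostly omitted):

    #include <windows.h>

    int main() {
        HANDLE file = CreateFileA("big.txt", GENERIC_READ, FILE_SHARE_READ, NULL,
                                  OPEN_EXISTING,
                                  FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN,
                                  NULL);
        if (file == INVALID_HANDLE_VALUE) return 1;

        const DWORD block = 64 * 1024;   // assumed to be a multiple of the sector size
        // VirtualAlloc returns page-aligned memory, which satisfies the alignment rule.
        char* buffer = static_cast<char*>(
            VirtualAlloc(NULL, block, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE));

        DWORD got = 0;
        while (ReadFile(file, buffer, block, &got, NULL) && got > 0) {
            // ... scan buffer[0..got) and count words here ...
        }

        VirtualFree(buffer, 0, MEM_RELEASE);
        CloseHandle(file);
        return 0;
    }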

You have also gotten some answers advocating doing the processing with various parallel constructs. I think these are fundamentally mistaken. Unless you do something horribly stupid, the time to count the words in the file will be only a few milliseconds longer than the time it takes simply to read the file.

The structure I would use is two buffers of, say, a megabyte apiece. Read data into one buffer. Hand that buffer to the counting thread to count the words in it. While that is happening, read data into the second buffer. When both are done, essentially swap buffers and continue. There is a bit of extra processing you need to do when swapping buffers to deal with a word that may cross the boundary from one buffer to the next, but it is pretty trivial (basically, if the buffer does not end with whitespace, you are still in a word when you start processing the next buffer of data).
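A skeleton of that two-buffer scheme, using std::async for the counting side and a carry string for the word that may straddle the buffer boundary (the buffer size and the whitespace-only word splitting are simplifications):

    #include <cctype>
    #include <cstddef>
    #include <fstream>
    #include <future>
    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    // Count whitespace-separated words in buf[0..n). 'carry' holds a fragment left
    // over from the previous buffer and receives any fragment cut off at the end.
    void count_chunk(const char* buf, std::size_t n, std::string& carry,
                     std::map<std::string, std::size_t>& counts) {
        std::string word = carry;          // continue a word cut off at the boundary
        for (std::size_t i = 0; i < n; ++i) {
            if (std::isspace(static_cast<unsigned char>(buf[i]))) {
                if (!word.empty()) { ++counts[word]; word.clear(); }
            } else {
                word += buf[i];
            }
        }
        carry = word;                      // non-empty if this buffer ended mid-word
    }

    int main(int argc, char* argv[]) {
        if (argc < 2) return 1;
        std::ifstream in(argv[1], std::ios::binary);

        const std::size_t kBuf = 1 << 20;            // a megabyte apiece
        std::vector<char> a(kBuf), b(kBuf);
        std::vector<char>* reading = &a;             // buffer currently being filled
        std::vector<char>* counting = &b;            // buffer currently being counted

        std::map<std::string, std::size_t> counts;
        std::string carry;
        std::future<void> pending;                   // counting of the previous buffer

        while (in) {
            in.read(reading->data(), kBuf);          // overlaps with the pending count
            std::size_t n = static_cast<std::size_t>(in.gcount());
            if (n == 0) break;
            if (pending.valid()) pending.get();      // wait for the other buffer
            std::swap(reading, counting);            // the "swap buffers" step
            const char* data = counting->data();
            pending = std::async(std::launch::async, [&counts, &carry, data, n] {
                count_chunk(data, n, carry, counts);
            });
        }
        if (pending.valid()) pending.get();
        if (!carry.empty()) ++counts[carry];         // flush a trailing word

        for (const auto& kv : counts)
            std::cout << kv.second << '\t' << kv.first << '\n';
        return 0;
    }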

As long as you are sure it will only ever be used on a multiprocessor (multi-core) machine, using real threads is fine. If there is a chance it may ever run on a single-core machine, you would be somewhat better off using a single thread with overlapped I/O instead.

+1
Oct 26 '09 at 15:34

As others have indicated, disk I/O will be the bottleneck. I therefore suggest you use overlapped I/O. This basically inverts the program logic. Instead of your code deciding when to do I/O, you simply tell the operating system to call your code whenever it has finished a piece of I/O. If you use I/O completion ports, you can even tell the OS to use multiple threads to process the file chunks.
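A bare-bones outline of that pattern with an I/O completion port might look like this (the chunk size, the fixed number of requests, the placeholder file name, and the missing handling of words that straddle chunk boundaries are all simplifications; real code would track outstanding requests instead of sleeping):

    #include <windows.h>

    struct Request {
        OVERLAPPED ov;          // first member, so an LPOVERLAPPED casts back to Request*
        char buffer[64 * 1024];
    };

    DWORD WINAPI worker(LPVOID port) {
        DWORD bytes; ULONG_PTR key; LPOVERLAPPED pov;
        while (GetQueuedCompletionStatus((HANDLE)port, &bytes, &key, &pov, INFINITE)) {
            Request* req = (Request*)pov;
            // ... count words in req->buffer[0..bytes), then reuse or delete req ...
        }
        return 0;
    }

    int main() {
        HANDLE file = CreateFileA("big.txt", GENERIC_READ, FILE_SHARE_READ, NULL,
                                  OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
        HANDLE port = CreateIoCompletionPort(file, NULL, 0, 0);  // new port, file attached

        for (int i = 0; i < 4; ++i)                   // OS dispatches completions to these
            CreateThread(NULL, 0, worker, port, 0, NULL);

        for (int i = 0; i < 8; ++i) {                 // queue a few reads at fixed offsets
            Request* req = new Request();
            LARGE_INTEGER off;
            off.QuadPart = (LONGLONG)i * sizeof(req->buffer);
            req->ov.Offset = off.LowPart;
            req->ov.OffsetHigh = off.HighPart;
            // Returns FALSE with ERROR_IO_PENDING while the read is in flight.
            ReadFile(file, req->buffer, sizeof(req->buffer), NULL, &req->ov);
        }

        Sleep(1000);                                  // placeholder: wait for completions
        CloseHandle(file);
        return 0;
    }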

+1
Oct 27 '09 at 10:07

A C based solution?

I think Perl was born for exactly this purpose.

0
Oct 26 '09 at 14:58

A stream has only one cursor. If you access the stream from several threads at a time, you cannot be sure you are reading where you want to, since reads happen at the cursor position.

What I would do is have only one thread (probably the main one) read the stream and dispatch the bytes it reads to the other threads.

Example:

  • Thread #i is ready and asks the main thread to give it the next part,
  • The main thread reads the next 1 MB and hands it to thread #i,
  • Thread #i processes the 1 MB and counts the words as you wish,
  • Thread #i finishes and asks for the next 1 MB again.

That way, you can separate reading the stream from analyzing it.

0
Oct 26 '09 at 15:10

What you are looking for is a regex. This Stack Overflow thread on C++ regex libraries should help:

C++: which regular expression library should I use?

0
Oct 26 '09 at 15:11

First off, I'm fairly sure C/C++ is not the best way to handle this. Ideally, you would also use some map/reduce for the parallelism.

But given your constraints, here is what I would do.

1) Split the text file into smaller chunks. You don't have to do this by the first letter of the word. Just break them into, say, chunks of 5,000 words each. In pseudocode, you would do something like this:

index = 0
numwords = 0
mysplitfile = openfile(index-split.txt)
while (bigfile >> word)
    mysplitfile << word
    numwords++
    if (numwords > 5000)
        mysplitfile.close()
        index++
        mysplitfile = openfile(index-split.txt)

2) Use a shared map data structure and pthreads to spawn new threads to read each of the subfiles. Again, pseudocode:

maplock = create_pthread_lock()
sharedmap = std::map()

for each index-split.txt file:
    spawn-new-thread(myfunction, filename, sharedmap, lock)

dump_map(sharedmap)

void myfunction(filename, sharedmap) {
    localmap = std::map<string, size_t>();
    file = openfile(filename)
    while (file >> word)
        if !localmap.contains(word)
            localmap[word] = 0
        localmap[word]++
    acquire(lock)
    for key,value in localmap
        if !sharedmap.contains(key)
            sharedmap[key] = 0
        sharedmap[key] += value
    release(lock)
}

Sorry about the syntax. I have been writing a lot of Python lately.

0
Oct 26 '09 at 15:16

Not C, and a bit UGLY, but it took only 2 minutes to bang out:

perl -lane '$h{$_}++ for @F; END{for $w (sort {$h{$b}<=>$h{$a} || $a cmp $b} keys %h) {print "$h{$w}\t$w"}}' file > freq

Loop over each line with -n
Split each line into words in @F with -a
Each word $_ increments its count in the hash %h
Once the END of the file is reached,
sort the hash by frequency with $h{$b}<=>$h{$a}
If two frequencies are identical, sort alphabetically with $a cmp $b
Print the frequency $h{$w} and the word $w
Redirect the output to the file 'freq'

I ran this code on a 3.3 GB text file containing 580,000,000 words.
Perl 5.22 completed it in 173 seconds.

My input file had already had punctuation stripped out and uppercase converted to lowercase, using this bit of code:
perl -pe "s/[^a-zA-Z \t\n']/ /g; tr/A-Z/a-z/" file_raw > file
(run time 144 seconds)




The word-frequency script can also be written in awk:
awk '{for (i=1; i<=NF; i++){h[$i]++}} END{for (w in h){printf("%s\t%s\n", h[w], w)}}' file | sort -rn > freq

0
Oct 12 '15 at 22:53


