Is there a better way to calculate the frequency of all characters in a file?

Well, therefore, let's say I have a text file (not necessarily containing all possible characters), and I would like to calculate the frequency of each character and, after calculating the frequency, then I need to access each character and its frequency from the most common to least frequent. Characters are not necessarily ASCII characters, they can be arbitrary sequences of bytes, albeit of the same length.

I was considering doing something like this (in pseudocode):

function add_to_heap (symbol) freq = heap.find(symbol).frequency if (freq.exists? == true) freq++ else symbol.freq = 1 heap.insert(symbol) MaxBinaryHeap heap while somefile != EOF symbol = read_byte(somefile) heap.add_to_heap(symbol) heap.sort_by_frequency() while heap.root != empty root = heap.extract_root() do_stuff(root) 

I was wondering: is there a better, easier way to calculate and save how many times each character appears in the file?

+6
source share
2 answers

You can always use the HashMap isntead from the Heap. Like you, you will perform operations that are in O (1) for each character found instead of O (log n) wheres n - the number of elements currently on the heap.

However, if the number of different characters is limited by a reasonable number (1 byte is ideal, 2 bytes should still be exact), you can simply use an array of this size and again have O (1), but with significantly lower fixed costs.

+3
source

If you are looking for the β€œbest” solution based on runtime, here's what I suggest:

When you read a file, you should sort (or hash) your characters by the meaning of the characters themselves, and not by their frequencies. This will allow you to quickly find the current character in the list of already seen characters, instead of looking for the entire list. You should also have this initial structure to be able to perform quick inserts - I would recommend a hash binary tree.

After you read all of your characters, you should switch your ordering based on the frequency count. I read everything into an array and then sorted in place, but there are several equivalent ways to do this.

Hope this helps!

+2
source

Source: https://habr.com/ru/post/898619/


All Articles