The most efficient character counting algorithm?

Let's say you want to count the occurrences of each character in some text.

The fastest way I could think of is to use an array, unsigned char charcounts[256], initialize it with zeros, and then look at each char c in the text input and do charcounts[c]++. Then do a linear search of charcounts[], using two variables to track the lowest-count character so far and its count, replacing them whenever we find a lower one, until we reach the end.

So "text" would give t = 2, e = 1, x = 1.

Is there a faster way to do this?

+4
4 answers

The first part is counting letter frequencies. Two issues to point out, assuming that the language here is C or C++:

  • Your code will not handle letters that occur more than 255 times (or 127 if char is signed). Making charcounts an array of int will probably not have a big impact on performance.
  • Your code will not work for Unicode / international characters.
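
A minimal sketch of the counting and scanning steps with the first issue addressed, assuming 8-bit text. The function names `count_chars` and `least_frequent` are made up for illustration, and skipping characters that never occur is an assumption, since the question does not say how zero counts should be treated.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Count how often each byte value occurs in text[0..len-1].
   size_t counters cannot overflow for any string that fits in memory. */
void count_chars(const char *text, size_t len, size_t counts[UCHAR_MAX + 1])
{
    assert(text != NULL || len == 0);
    for (size_t i = 0; i <= UCHAR_MAX; i++)
        counts[i] = 0;
    for (size_t i = 0; i < len; i++)
        counts[(unsigned char)text[i]]++;  /* cast avoids negative indexes when char is signed */
}

/* Linear scan for the least frequent byte that occurs at least once
   (assumption: zero counts are ignored). Ties go to the lowest byte value.
   Returns the byte value, or -1 if every count is zero. */
int least_frequent(const size_t counts[UCHAR_MAX + 1])
{
    int best = -1;
    for (int c = 0; c <= UCHAR_MAX; c++) {
        if (counts[c] == 0)
            continue;
        if (best < 0 || counts[c] < counts[best])
            best = c;
    }
    return best;
}
```

This still treats the input as a sequence of bytes, so the second issue (Unicode) is not addressed here.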

The second part is determining the least frequent letter:

  • If you are dealing with short strings ("text", "fred"), then scanning all 256 entries in your table is the rate-determining step. You would likely be better off keeping track of the lowest-frequency letter during the first scan loop.
  • But if you do want to scan all 256 entries, you can exit the loop as soon as you hit a count of "one" (or zero, if that's how your algorithm is designed to work).
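
One way to read these two points together, as a sketch rather than the answerer's exact method: record each distinct character the first time it is seen, then search only those characters and stop as soon as a count of one turns up. The name `least_frequent_short` and the `seen` array are hypothetical, and the code again assumes 8-bit text.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Returns the least frequent byte in text[0..len-1], or -1 if len is 0.
   Only bytes that actually occur are examined in the second phase. */
int least_frequent_short(const char *text, size_t len)
{
    size_t counts[UCHAR_MAX + 1] = {0};
    unsigned char seen[UCHAR_MAX + 1];  /* distinct bytes, in order of first appearance */
    size_t nseen = 0;

    assert(text != NULL || len == 0);

    /* Phase 1: count, remembering each byte the first time it shows up. */
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)text[i];
        if (counts[c]++ == 0)
            seen[nseen++] = c;
    }

    /* Phase 2: look only at bytes that occurred. */
    int best = -1;
    for (size_t i = 0; i < nseen; i++) {
        unsigned char c = seen[i];
        if (best < 0 || counts[c] < counts[best])
            best = c;
        if (counts[best] == 1)
            break;  /* a byte that occurs cannot be rarer than once */
    }
    return best;
}
```

Note that zero-initializing `counts` is itself a pass over the whole table, although compilers typically turn it into a fast block clear, so the saving is in the comparison loop rather than in touching the table at all.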
+4

The first part of your algorithm - counting characters - is simply generating keys for sorting.

If you know that you use only the alphabetic characters [A-Za-z]*, then you can optimize your algorithm by reducing the number of buckets used, but this is only a minor tweak.

The second part is just a stable sort, and there are many ways to do this: the Wikipedia page on sorting gives a good summary. If you are only interested in the least frequent character, then the method ("Phase 2") that you describe is probably as efficient as you can get.

The only other improvement I can come up with is to divide your letters into a fixed number of buckets (e.g. 16), spread uniformly across the character range, and then recurse on each bucket. Any bucket without characters can be thrown away, which reduces the time spent in the scanning / sorting phase. Similarly, if a bucket holds a single character, it is done. You also want to make sure that you only split a bucket into 16 more when you know that it holds several different characters.

Using the word "test" as an example (assuming 4 buckets and only lowercase letters):

  • generate 4 buckets (A-G, H-M, N-T, U-Z)
  • split the word "test":
    • A-G: e
    • H-M:
    • N-T: tst
    • U-Z:
  • recurse into the non-empty buckets (A-G has one character, so this must be the least frequent and we can stop)
  • If this is not the case (as for the word "testes"), we can see that H-M and U-Z are empty, so we only need to check N-T (which will contain tsts).
    • We create 4 buckets (N-O, P-Q, R-S and T).
    • Separate the letters
    • and so on.

The advantage of this method is that we did not have to scan an entry for every possible letter. If the character range is of fixed size, then both of these methods are O(n) at best, where n is the length of the string (this is unavoidable, since we always have to look at every character), although building the lists of characters in my example could make the algorithm as bad as O(n^2). However, as the character range grows, especially for short strings, using sub-buckets will significantly improve performance. For a Unicode string you could use a hybrid approach, for example separating out all non-ASCII characters in the first phase and using your simpler method for the ASCII part.
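
A simplified, non-recursive variant of the bucket idea can be sketched as follows. It is not the recursive scheme described above: it keeps a single level of 16 coarse buckets over the byte range, maintains a total per bucket while counting, and skips every empty bucket during the search. The name `least_frequent_bucketed` and the constants are made up for illustration, and the code assumes 8-bit bytes.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BUCKET_BITS 4                                /* 16 byte values per bucket */
#define BUCKET_SIZE (1 << BUCKET_BITS)
#define NUM_BUCKETS ((UCHAR_MAX + 1) / BUCKET_SIZE)  /* 16 buckets when UCHAR_MAX is 255 */

/* Returns the least frequent byte in text[0..len-1], or -1 if len is 0.
   Ties go to the lowest byte value. */
int least_frequent_bucketed(const char *text, size_t len)
{
    size_t counts[UCHAR_MAX + 1] = {0};
    size_t bucket_totals[NUM_BUCKETS] = {0};

    assert(text != NULL || len == 0);

    /* Counting pass: update the per-byte count and its bucket's total. */
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)text[i];
        counts[c]++;
        bucket_totals[c >> BUCKET_BITS]++;
    }

    /* Search pass: an empty bucket lets us skip 16 table entries at once. */
    int best = -1;
    for (int b = 0; b < NUM_BUCKETS; b++) {
        if (bucket_totals[b] == 0)
            continue;
        for (int c = b * BUCKET_SIZE; c < (b + 1) * BUCKET_SIZE; c++) {
            if (counts[c] == 0)
                continue;
            if (best < 0 || counts[c] < counts[best])
                best = c;
        }
    }
    return best;
}
```

For a 256-entry table the gain is small; the skipping is more likely to pay off when the character range is much larger than the number of distinct characters in the input.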

+4

You have described two tasks here. The first is to count the number of times each ASCII character occurs in the stream, and the second is to find the least frequent character.

The first algorithm seems pretty efficient. Off the top of my head, I can't think of a faster way.

I am less sure about your second algorithm. You do not say explicitly why you want to find the least frequent character or what the input is, but I can imagine it is easy to have more than one character with a frequency of zero, so how do you want to distinguish between them?

+1

This sounds like one of the most efficient ways to do what you describe. I'm not sure what you want to do with the second part; it looks like you want to find the character with the fewest occurrences in the sorted data?

0

Source: https://habr.com/ru/post/1277643/
