The first part of your algorithm - counting characters - is simply generating keys for sorting.
If you know that you use only the alphabetic characters [A-Za-z] *, then you can optimize your algorithm by reducing the number of buckets used, but this is only a minor setting.
The second part - it's just a stable view - there are many ways to do this: the wikipedia page gives a good summary when sorting . If you are only interested in the character that matters the least, then the method ("Phase 2") that you describe is probably as effective as you can get.
The only other way I can come up with is to improve if you can divide your letters into a fixed number of buckets (e.g. 16) uniformly across a range of characters, and then recurs on each bucket. Any buckets without symbols can be thrown away, which will reduce the time during the scanning / sorting phase. Similarly, if the bucket has one character, this is done. You also want to make sure that you divide the bucket by 16 more when you know that it has several different characters.
Using a test word as an example (assuming 4 buckets and only lowercase letters:
- generate 4 buckets (AG, HM, NT, UZ)
- split test words:
- recursion to other buckets - (AG has one character - this should be the smallest so that we can stop
- If this is not the case (as for the word "testes"), we can see that HM and UZ are empty, so we only need to check NT (which will contain tsts).
- We create 4 buckets (NO, PQ, RS and T).
- Separate the letters
- and etc.
The advantage of this method is that we did not have to scan each letter. If the range of characters has the same size, then both of these methods are O (n) at best, where n is the length of the string (this is inevitable, since we should always look at each character), although building lists of characters in my example can do the algorithm is as bad as O (n ^ 2). However, as the range of characters increases, especially for short strings, using auxiliary buckets will significantly increase performance. For a Unicode string, you can use a hybrid approach - for example, separating all non-ascii characters in the first phase and using your simpler method for the ascii part.
source share