It is not possible to understand the string search method as described. What is uFFFF?

I am reading something about finding a string (range) in a sorted array of strings.

It says:

If you want to find all lines starting with "h", you can run a binary search for the lines "h" and "h \ uFFFF". This gives range indices for all keys starting with "h". Note that a binary search can return an index where the string would be even if it is not actually in the array.

I do not understand anything from this paragraph.

What is h\uFFFF , how does it help / use in binary search, and the last value also means that even this search is faulty?

Any help to understand what is being said here, please?

+6
source share
4 answers

\ uFFFF is the character that is sorted last in the 16-bit "alphabet", that is, after any valid character of a letter, character, or special character.

When you do a binary search for a string in a sorted array, you will find a place where this string can be inserted. When you have several identical lines, you get a place in front of the first. When you add the "last letter of the alphabet" after your row, the insertion point will be after the last of the identical rows, so it gives you a series of identical rows in a sorted array.

Imagine: suppose you are not allowed to use the letter Z in your words. You now have a sorted array of strings:

 0 1 2 3 4 5 6 aab abb abc abc abd bcx bdy 

If you are looking for abc , binary search tells you the first place you can insert it, that is 2. If you are looking for abcZ , then you will get binary search 4, because abcZ comes in alphabetical order immediately after abc . This lets you know that the range between 2, inclusive and 4, exclusive, is occupied by the string abc . If both queries return the same number, you know that the string is not in the array.

In the paragraph that you indicated, \uFFFF plays the role of the "forbidden letter Z" in my example.

+9
source

\uFFFF is the maximum possible character in Java. Since the strings are sorted, a search h will find the beginning of the range, while h\uFFFF will find the end (assuming Unicode strings here), since the second character cannot be greater than \uFFFF . Even if it cannot exactly match the string, the search will return an index where the target will be, even if it is actually missing.

update: \uFFFF is the largest possible unicode sortable character in a 16-bit block, if you are working with 32-bit blocks, use U+10FFFF (whatever in Java). I personally have never worked with 32-bit unicode blocks in Java. See Section 16.7 specification 5.2.0 .

U + FFFF and U + 10FFFF. These two uncharacteristic code points have a communication attribute associated with the largest code block values ​​for specific Unicode encoding forms. In UTF-16, U + FFFF is associated with the highest value of the 16-bit code, FFFF . U + 10FFFF - associated with the largest legal value of the 32-bit UTF-32 code, 10FFFF. This attribute displays these two uncharacteristic code points useful for internal purposes as sentries. For example, they can be used to indicate the end of a list, to represent a value in an index is guaranteed to be higher than any valid character value, etc.

+3
source

The sequence \uFFFF in Java denotes a character with the Unicode code number U + FFFF. However, the code does not encode the character at all:

U + FFFF is used to represent a numeric value that is not guaranteed to be a symbol for applications such as the final value at the end of the index.

See these links: Unicode Technical Report No. 16 , this Unicode character diagram and this symbol definition .

+1
source

As other answers pointed out, a search of h will find the beginning of the range of rows starting with h , and h\uFFFF will find the end (exception) of the range of rows starting with h in your dataset.

The last sentence means that the search h\uFFFF will show you where you insert such a line if it does not exist in your data, so it gives you an exclusive end to your range.

+1
source

Source: https://habr.com/ru/post/907043/


All Articles