Ignoring Hebrew Vowels When Comparing Strings

Good evening, I hope you can help me with this problem, as I struggle to find solutions.

I have a word provider that gives me Hebrew words with vowel points, for example:

With vowels - בַּיִת Without vowels - בית

With vowels - הַבַּיְתָה Without vowels - הביתה

Unlike my provider, my users normally cannot type Hebrew vowels (and I do not want them to). The user story is a user searching for a word among the provided words. The problem is comparing voweled words with unvoweled ones: each is represented by a different byte array in memory, so the equals method returns false.

I tried to understand how UTF-8 handles Hebrew vowels, and it seems like these are just normal characters.

I want to present vowels to the user, so I want to keep the string as it is in memory, but when comparing, I want to ignore them. Is there an easy way to solve this problem?

2 answers

You can use Collator. I can't tell you exactly how it works, since it is new to me, but this seems to do the trick:

    import java.text.Collator;
    import java.util.Locale;

    public class CollatorDemo {
        public static void main( String[] args ) {
            String withVowels = "בַּיִת";
            String withoutVowels = "בית";
            String withVowelsTwo = "הַבַּיְתָה";
            String withoutVowelsTwo = "הביתה";

            // Plain String.equals compares every character, vowel points included.
            System.out.println( "These two strings are " + (withVowels.equals( withoutVowels ) ? "" : "not ") + "equal" );
            System.out.println( "The second two strings are " + (withVowelsTwo.equals( withoutVowelsTwo ) ? "" : "not ") + "equal" );

            // A Hebrew collator at PRIMARY strength compares base letters only.
            Collator collator = Collator.getInstance( new Locale( "he" ) );
            collator.setStrength( Collator.PRIMARY );
            System.out.println( collator.equals( withVowels, withoutVowels ) );
            System.out.println( collator.equals( withVowelsTwo, withoutVowelsTwo ) );
        }
    }

From this, I get the following output:

    These two strings are not equal
    The second two strings are not equal
    true
    true
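For the search use case in the question, the same collator can be run over the whole list of provided words. The following is only a minimal sketch: the class name `VowelInsensitiveSearch` and the helper `findMatches` are hypothetical, and it assumes the Hebrew collation rules of your Java runtime behave as in the output above. The strings are written with Unicode escapes so the example does not depend on the source file encoding.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of any library.
final class VowelInsensitiveSearch {

    /** Returns every word that the given collator considers equal to the query. */
    static List<String> findMatches(List<String> words, String query, Collator collator) {
        List<String> matches = new ArrayList<>();
        for (String word : words) {
            if (collator.compare(word, query) == 0) {
                matches.add(word); // the stored string is returned unchanged, vowels included
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(new Locale("he"));
        collator.setStrength(Collator.PRIMARY);

        // "bayit" and "habayta" with vowel points, as delivered by the provider.
        List<String> provided = List.of(
                "\u05D1\u05B7\u05BC\u05D9\u05B4\u05EA",
                "\u05D4\u05B7\u05D1\u05B7\u05BC\u05D9\u05B0\u05EA\u05B8\u05D4");

        // "bayit" typed by the user without vowel points.
        String userInput = "\u05D1\u05D9\u05EA";

        System.out.println(findMatches(provided, userInput, collator));
    }
}
```

Because the stored strings are never modified, the matches can be shown to the user with their vowels intact. For large word lists, a linear scan like this may be slow; precomputing `CollationKey` objects with `collator.getCollationKey(...)` is one option worth measuring.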

AFAIK there isn't. The vowels are characters in their own right. Even some combinations of letters and points are single characters. See the Wikipedia page.

http://en.wikipedia.org/wiki/Unicode_and_HTML_for_the_Hebrew_alphabet

You can store a search key for each word that contains only the characters in the 05Dx-05Ex range (the Hebrew letters, U+05D0 to U+05EA). You can add another field for the word with vowels.
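The search-key idea can be sketched roughly as follows. The class name `HebrewSearchKey` and the method `searchKey` are hypothetical, and the sketch assumes that dropping everything outside U+05D0 to U+05EA (including spaces and punctuation) is acceptable for your data. Strings are written with Unicode escapes to avoid depending on the source file encoding.

```java
// Hypothetical helper class, not part of any library.
final class HebrewSearchKey {

    /** Keeps only the Hebrew letters U+05D0..U+05EA and drops every other character. */
    static String searchKey(String word) {
        StringBuilder key = new StringBuilder(word.length());
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (c >= '\u05D0' && c <= '\u05EA') {
                key.append(c);
            }
        }
        return key.toString();
    }

    public static void main(String[] args) {
        // "bayit" with vowel points, as delivered by the provider.
        String withVowels = "\u05D1\u05B7\u05BC\u05D9\u05B4\u05EA";
        // "bayit" typed by the user without vowel points.
        String userInput = "\u05D1\u05D9\u05EA";

        // Compare the keys; keep withVowels itself for display.
        System.out.println(searchKey(withVowels).equals(searchKey(userInput))); // prints true
    }
}
```

The key would be computed once when a word is stored and once for each user query, so the comparison itself stays a plain `equals`.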

Of course, you should expect the following:

  • You will need to consider words that have different meanings depending on the nikkud.
  • You should consider “misspellings” involving י and ו, which are common.

Source: https://habr.com/ru/post/1438191/

