I am writing a helper that performs a series of transformations on an input string to create a search-friendly representation of that string.
Think of the following scenario:
- Full-text search in German or French texts
- The entries in your data store contain
  1. Müller
  2. Großmann
  3. Çingletòn
  4. Bjørk
  5. Æreogramme
- The search should be fuzzy, in that
  1. ull, Üll, etc. match Müller
  2. Gros, groß, etc. match Großmann
  3. cin, etc. match Çingletòn
  4. bjö, bjo, etc. match Bjørk
  5. aereo, etc. match Æreogramme
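To pin down the behaviour I am after, here is a rough reference sketch in Python (not Core Foundation). The names `search_key`, `matches`, `EXTRA_FOLDS` and `EXPECTED` are made up for illustration, and the hand-written `EXTRA_FOLDS` table is exactly the kind of stopgap I would like to avoid. In this Python version ø also needs a table entry.

```python
import unicodedata

# Hypothetical stopgap table: letters that Unicode normalization in Python
# leaves untouched, mapped by hand to an ASCII approximation.
EXTRA_FOLDS = str.maketrans({"ß": "ss", "æ": "ae", "ø": "o"})


def search_key(text: str) -> str:
    """Lowercase, expand the special letters, then drop combining marks."""
    lowered = text.lower().translate(EXTRA_FOLDS)
    decomposed = unicodedata.normalize("NFKD", lowered)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(query: str, entry: str) -> bool:
    """Substring match on the folded forms of query and entry."""
    return search_key(query) in search_key(entry)


# (query, entry) pairs taken from the scenario above.
EXPECTED = [
    ("ull", "Müller"), ("Üll", "Müller"),
    ("Gros", "Großmann"), ("groß", "Großmann"),
    ("cin", "Çingletòn"),
    ("bjö", "Bjørk"), ("bjo", "Bjørk"),
    ("aereo", "Æreogramme"),
]

for query, entry in EXPECTED:
    print(f"{query!r} matches {entry!r}: {matches(query, entry)}")
```

Every pair in `EXPECTED` should match under this sketch, but only because of the manual table, which has to be extended for each new letter of this kind.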
So far, I have been successful in cases (1), (3) and (4).
What I cannot understand is how to handle (2) and (5).
So far, I have tried the following methods to no avail:
    CFStringNormalize()  // with all documented normalization forms
    CFStringTransform()  // using kCFStringTransformToLatin, kCFStringTransformStripCombiningMarks, kCFStringTransformStripDiacritics
    CFStringFold()       // using kCFCompareNonliteral, kCFCompareWidthInsensitive, kCFCompareLocalized in a number of combinations
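A possible explanation for why no normalization form helps with (2) and (5): in the Unicode Character Database, ü, ç and ò decompose into a base letter plus a combining mark, whereas ß and æ have no decomposition mapping at all. The Python sketch below only inspects that data through the standard `unicodedata` module. That Core Foundation's normalization behaves the same way is my assumption, not something I have verified.

```python
import unicodedata

# Print the decomposition mapping of each character; an empty mapping means
# normalization (NFD/NFKD) has nothing to split the character into.
for ch in "üçòßæ":
    mapping = unicodedata.decomposition(ch) or "(none)"
    print(f"U+{ord(ch):04X} {ch} -> {mapping}")

# Consequently NFKD leaves ß and æ as they are, so a later
# "strip combining marks" step finds nothing to strip.
print(unicodedata.normalize("NFKD", "ß") == "ß")
print(unicodedata.normalize("NFKD", "æ") == "æ")
```

If that assumption holds, these letters need a different mechanism, such as case folding or a transliteration rule, rather than another normalization form.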
I had a look at the ICU User Guide for Transforms, but have not invested much effort in it ... for what I think are obvious reasons.
I know that I could catch case (2) by converting to upper case and then back to lower case, which would work within this particular application. However, I am interested in solving this problem at a more fundamental level, ideally in a way that also works for case-sensitive applications.
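For illustration, this is what the upper/lower round trip does in Python (a sketch of the idea, not of the Core Foundation calls; `round_trip` is a made-up helper name). It expands ß but discards the original case, and it does nothing for case (5):

```python
def round_trip(text: str) -> str:
    """Upper-case, then lower-case again; ß becomes SS and then ss."""
    return text.upper().lower()


print(round_trip("Großmann"))    # grossmann
print("Großmann".casefold())     # grossmann (Unicode case folding, one step)
print(round_trip("Æreogramme"))  # æreogramme, the ligature is not expanded
```

Both variants throw away the case information, which is why they do not carry over to a case-sensitive search.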
Any hints would be greatly appreciated!