Folding / normalizing ligatures (e.g. Æ to ae) Using (Core) Foundation

I am writing a helper that performs a series of transformations on an input string to create a search-friendly representation of that string.

Think of the following scenario:

  • Full-text search in German or French texts
  • The entries in your data warehouse contain
    • Müller
    • Großmann
    • Çingletòn
    • Bjørk
    • Æreogramme
  • Search should be fuzzy, in that
    • üll, Üll, etc. match Müller
    • Gros, groß, etc. match Großmann
    • cin, etc. match Çingletòn
    • bjö, bjo, etc. match Bjørk
    • aereo, etc. match Æreogramme

So far, I have been successful in cases (1), (3) and (4).

What I cannot understand is how to handle (2) and (5).

So far, I have tried the following methods to no avail:

    CFStringNormalize()  // with all documented normalization forms
    CFStringTransform()  // using kCFStringTransformToLatin, kCFStringTransformStripCombiningMarks, kCFStringTransformStripDiacritics
    CFStringFold()       // using kCFCompareNonliteral, kCFCompareWidthInsensitive, kCFCompareLocalized in a number of combinations
                         // -- aside: how on earth do I normalize by simply _composing_ already decomposed strings???
                         // as soon as I pack that in, my formerly passing tests fail as well...

I have skimmed the ICU User Guide on Transforms, but did not invest much effort in it, for what I think are obvious reasons.

I know that I could catch case (2) by converting to upper case and then back to lower case, which would work within this particular application. However, I am interested in solving this problem at a more fundamental level, ideally in a way that also works for case-sensitive applications.

Any hints would be greatly appreciated!

2 answers

Congratulations, you have found one of the more painful parts of text processing!

First of all, NamesList.txt and CaseFolding.txt are indispensable resources for such things, if you have not seen them.

Part of the problem is that you are trying to do something almost-correct that works across all the languages/locales you care about, whereas Unicode is more concerned with doing the correct thing when handling strings in a single language and locale.

For (2), ß has case-folded to ss ever since the earliest CaseFolding.txt I can find (3.0-Update1/CaseFolding-2.txt). CFStringFold() and -[NSString stringByFoldingWithOptions:locale:] ought to do the right thing, but if they do not, a "locale-independent" s.upper().lower() appears to give a sensible answer for all inputs (and also handles the infamous "Turkish I").
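To make the folding concrete, here is a minimal sketch in Python, the language of the s.upper().lower() idiom, rather than Core Foundation. It uses only built-in str methods; the helper names fold_case and upper_lower are made up for this illustration, and whether CFStringFold() gives the same results is not something this sketch checks.

```python
def fold_case(s: str) -> str:
    """Locale-independent folding using Python's built-in full case folding."""
    return s.casefold()


def upper_lower(s: str) -> str:
    """The upper-then-lower round trip mentioned above."""
    return s.upper().lower()


for word in ("Großmann", "groß", "Gros"):
    print(word, "->", fold_case(word), "/", upper_lower(word))

# Python's str methods never consult the user's locale, so the dotless
# Turkish i is mapped the same way on every machine.
print(upper_lower("I"), upper_lower("\u0131"))
```

After either helper, "gros" is a prefix of the folded "grossmann", which is what case (2) needs.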

For (5), you are a bit out of luck: Unicode 6.2 does not appear to contain a normative mapping from Æ to AE, and the character has changed from "letter" to "ligature" and back again (U+00C6 is LATIN CAPITAL LETTER AE in 1.0, LATIN CAPITAL LIGATURE AE in 1.1, and LATIN CAPITAL LETTER AE in 2.0). You could search NamesList.txt for "ligature" and add a bunch of special cases.
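A rough sketch of that special-casing, again in Python rather than Core Foundation. The table SPECIAL_CASES and both helper names are hypothetical, and the table is only a starting point: it lists Æ/æ, Œ/œ and Ø/ø because those are the kind of characters the question needs, not because it is complete.

```python
import unicodedata

# Hypothetical special-case table; extend it after reviewing NamesList.txt.
SPECIAL_CASES = {
    "Æ": "AE", "æ": "ae",
    "Œ": "OE", "œ": "oe",
    "Ø": "O", "ø": "o",
}


def expand_special_cases(s: str) -> str:
    """Replace characters that have no Unicode decomposition with ASCII stand-ins."""
    return "".join(SPECIAL_CASES.get(ch, ch) for ch in s)


def named_ligatures(first: int, last: int) -> list:
    """List characters in [first, last] whose Unicode name contains 'LIGATURE'."""
    return [
        chr(cp)
        for cp in range(first, last + 1)
        if "LIGATURE" in unicodedata.name(chr(cp), "")
    ]


# Compatibility decomposition leaves Æ alone, and its current name says
# "LETTER", so a name search alone will not find it either.
print(unicodedata.normalize("NFKD", "Æ") == "Æ")
print(unicodedata.name("Æ"))
print(named_ligatures(0x00C0, 0x017F))
print(expand_special_cases("Æreogramme"), expand_special_cases("Bjørk"))
```

Searching names for "ligature" is therefore a way to seed the table, and characters such as Æ and ø still have to be added by hand.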

Notes:

  • CFStringNormalize() does not do what you want. You do want to normalize strings before adding them to the index; I suggest NFKC at the start and end of the other processing.
  • CFStringTransform() does not quite do what you want either; all the scripts involved are already "Latin".
  • CFStringFold() is order-dependent: the combining ypogegrammeni and prosgegrammeni are stripped by kCFCompareDiacriticInsensitive, but converted to a lowercase iota by kCFCompareCaseInsensitive. The "correct" thing appears to be to do the case fold first, followed by the others, although stripping it may make more linguistic sense.
  • You almost certainly don't want to use kCFCompareLocalized unless you want to rebuild the search index every time the locale changes.
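The ordering advice in these notes can be sketched as follows. This is Python's unicodedata module, not Core Foundation, so it only illustrates the Unicode behaviour being described; strip_marks and index_key are names invented for this example.

```python
import unicodedata


def strip_marks(s: str) -> str:
    """Decompose, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_key(s: str) -> str:
    """NFKC, case fold, strip marks, NFKC again (fold before strip)."""
    s = unicodedata.normalize("NFKC", s)
    s = s.casefold()
    s = strip_marks(s)
    return unicodedata.normalize("NFKC", s)


for word in ("Müller", "Großmann", "Çingletòn"):
    print(word, "->", index_key(word))

# Order dependence: alpha followed by COMBINING GREEK YPOGEGRAMMENI (U+0345).
sample = "\u03b1\u0345"
fold_then_strip = strip_marks(sample.casefold())  # mark becomes a real iota first
strip_then_fold = strip_marks(sample).casefold()  # mark is dropped before folding
print(fold_then_strip != strip_then_fold)
```

Because the key is computed without consulting any locale, the index does not need rebuilding when the user's locale changes. The final NFKC pass also recomposes anything left decomposed by the mark-stripping step.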

Note to readers coming from other languages: make sure that the function you use does not depend on the user's current locale! Java users should use something like s.toUpperCase(Locale.ENGLISH), and .NET users should use s.ToUpperInvariant(). If you really do want the user's current locale, specify it explicitly.


I used the following extension for String, which seems to work beautifully.

    import Foundation

    extension String {
        /// Normalized version of the string for comparisons and database lookups.
        /// If normalization fails or results in an empty string, the original string is returned.
        var normalized: String? {
            // Expand ligatures and other joined characters and flatten to plain ASCII
            // (æ => ae, etc.) by converting to ASCII data and back.
            guard let data = self.data(using: .ascii, allowLossyConversion: true) else {
                print("WARNING: Unable to convert string to ASCII data: \(self)")
                return self
            }
            guard let processed = String(data: data, encoding: .ascii) else {
                print("WARNING: Unable to decode ASCII data while normalizing string: \(self)")
                return self
            }

            var normalized = processed
            // Curly quotes and the like are destroyed by the lossy conversion above
            // and come back as "?"; remove them.
            normalized = normalized.replacingOccurrences(of: "?", with: "")
            // Strip apostrophes.
            normalized = normalized.replacingOccurrences(of: "'", with: "")
            // Replace the remaining non-alphanumeric characters with spaces.
            normalized = normalized
                .components(separatedBy: CharacterSet.alphanumerics.inverted)
                .joined(separator: " ")
            // Lowercase the string.
            normalized = normalized.lowercased()
            // Collapse runs of spaces, tabs and line breaks, and trim both ends.
            normalized = normalized
                .split(whereSeparator: { $0.isWhitespace })
                .joined(separator: " ")

            // The result may be empty if there were no alphanumeric characters.
            // In that case, use the raw string as the "normalized" form.
            return normalized.isEmpty ? self : normalized
        }
    }

Source: https://habr.com/ru/post/1397591/

