L10N: reliable test data for locale-specific sorting

I am working on an internationalized database application that supports multiple locales in a single instance. When international users sort data in applications built on top of the database, the database is supposed to sort the data using the collation that matches the language of the data the user is viewing.

I am trying to find sorted lists of words that meet two criteria:

  • the sorted order follows the sorting rules for the locale
  • the listed words exercise most or all of the locale-specific sorting rules

I am having trouble finding such reliable test data. Are such data sets currently available for sorting, and if so, where are they?

"words.en.txt" is an example text file containing American English words:

Andrew Brian Chris Zachary 

I plan to load the word list into my database in a randomized order and check whether the sorted output matches the order of the original file.
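The round-trip check described above can be sketched roughly as follows. This is only a minimal illustration, not the actual test harness: `survives_shuffle` and `load_words` are hypothetical names, and Python's built-in `sorted` stands in for the database sort. In a real test, `sort_fn` would be a function that inserts the shuffled words and reads them back with the locale's collation.

```python
import random


def load_words(path):
    """Read a whitespace-separated word list such as words.en.txt."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().split()


def survives_shuffle(expected, sort_fn=sorted, rounds=20, seed=0):
    """Return True if sorting shuffled copies always reproduces `expected`.

    `sort_fn` is a stand-in (an assumption for this sketch) for whatever
    performs the sort under test, e.g. a database round trip.
    """
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    expected = list(expected)
    for _ in range(rounds):
        shuffled = list(expected)
        rng.shuffle(shuffled)
        if list(sort_fn(shuffled)) != expected:
            return False
    return True


english = "Andrew Brian Chris Zachary".split()
print(survives_shuffle(english))
```

Shuffling several times with a fixed seed makes it less likely that a wrong sort passes by accident, while keeping any failure repeatable.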

Since I don’t speak any language other than English, I don’t know how to create data samples for other languages, such as the following French sample (call it “words.fr.txt”):

 cote côte coté côté 

French sorting requires diacritics to be compared from right to left. If you sorted this list by code point order, it would most likely come out like this (which is the wrong order):

 cote coté côte côté 
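To make the difference concrete, here is a small sketch that reproduces both orders. It is a simplified illustration of right-to-left accent comparison, not the CLDR or ICU algorithm: `french_key` is a hypothetical helper that compares base letters first and then the accents in reverse order, and it ignores case and every other collation level.

```python
import unicodedata


def french_key(word):
    """Simplified sort key: base letters first, then accents right to left."""
    bases = []
    accents = []  # one tuple of combining marks per base letter
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if accents:  # attach the mark to the preceding base letter
                accents[-1] += (ord(ch),)
        else:
            bases.append(ch)
            accents.append(())
    # Reversing the accent list makes the last accent the most significant.
    return ("".join(bases), tuple(reversed(accents)))


# cote, côte, coté, côté (escapes avoid any normalization ambiguity)
words = ["cote", "c\u00f4te", "cot\u00e9", "c\u00f4t\u00e9"]

print(sorted(words))                  # code point order: cote coté côte côté
print(sorted(words, key=french_key))  # accents right to left: cote côte coté côté
```

A real test should still rely on the database's own collation; a key like this is only useful for sanity-checking hand-made sample lists.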

Thanks for the help, Chris.

1 answer

Here is what I found.

The Unicode Common Locale Data Repository (CLDR) is pretty much the authority on collation for international text. I managed to find several word lists that match the CLDR rules in the ICU project's "ICU Demonstration - Locale Explorer". It turns out that ICU (International Components for Unicode) uses the CLDR rules to help solve common internationalization problems. It is a great library; check it out.

In some cases, it was useful to make up some meaningless terms by working directly from the CLDR rules. Search engines available in the United States were not well suited to finding foreign terms with the accents, diacritics, and other nuances that I cared about for this testing (in retrospect, I wonder whether search engines local to those languages would have been better suited to the task).


Source: https://habr.com/ru/post/1335450/

