Comparing strings with characters from different alphabets

I want to compare two strings containing characters from different alphabets (e.g. Russian and English). I want characters that look the same to be considered equal to each other.

eg. in the word "mother" the letter "o" is written from the English alphabet (code 043 in Unicode), and in the world "MOM" the letter "o" is written from the Russian alphabet (code 006F in Unicode). Therefore ("Mom" = "Mm") => false, but I want this to be true. Is there some kind of standard SAS function, or should I use a macro for this.

Thanks!

2 answers

I would do this:

First I would build a map that says which Russian letter corresponds to which English letter. Example:
б = b
в = v
...

I would save this map in a separate table or as macro variables. Then I would write a macro that loops over the map and applies the tranwrd function once per entry.

For example:

 data _null_;
   stringBefore = "";
   /* one tranwrd call per map entry: Cyrillic letter -> Latin letter */
   stringAfter = tranwrd(stringBefore, "а", "a");
   stringAfter = tranwrd(stringAfter, "б", "b");
   stringAfter = tranwrd(stringAfter, "в", "v");
   ...
 run;

After this conversion, I think you can compare your strings.
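
The loop over the map described above could be sketched roughly as follows. This is only an illustration, not the answerer's actual macro: the map is held in two temporary arrays instead of a table or macro variables, the three pairs and the input word are made up for the example, and the `$2` length assumes a UTF-8 session where one Cyrillic letter takes two bytes.

```sas
/* Sketch: apply a letter map with TRANWRD in a loop.                */
/* The arrays cyr/lat and the sample input are hypothetical.         */
data _null_;
    length stringBefore stringAfter $200;
    array cyr[3] $2 _temporary_ ("а", "б", "в");   /* Cyrillic letters    */
    array lat[3] $1 _temporary_ ("a", "b", "v");   /* Latin replacements  */

    stringBefore = "баба";        /* hypothetical input */
    stringAfter  = stringBefore;

    /* One TRANWRD call per map entry; TRIM drops the padding blank  */
    /* that a one-byte value would carry in a $2 array element.      */
    do i = 1 to dim(cyr);
        stringAfter = tranwrd(stringAfter, trim(cyr[i]), trim(lat[i]));
    end;

    put stringBefore= stringAfter=;
run;
```

Keeping the map in one place means a new letter pair is a one-line change rather than another hand-written tranwrd call.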


I also wrote some functions to handle typos caused by the wrong keyboard layout. Here is the code:

 /***************************************************************************/
 /* FUNCTION count_rus_letters RETURNS NUMBER OF CYRILLIC LETTERS IN STRING */
 /***************************************************************************/
 proc fcmp outlib=sasuser.userfuncs.mystring;
   function count_rus_letters(string $);
     length letter $2;
     rus_count=0;
     len=klength(string);
     do i=1 to len;
       letter=ksubstr(string,i,1);
       if letter in ("А","Б","В","Г","Д","Е","Ё","Ж","З","И","Й","К","Л","М","Н","О",
                     "П","Р","С","Т","У","Ф","Х","Ц","Ч","Ш","Щ","Ъ","Ы","Ь","Э","Ю","Я",
                     "а","б","в","г","д","е","ё","ж","з","и","й","к","л","м","н","о",
                     "п","р","с","т","у","ф","х","ц","ч","ш","щ","ъ","ы","ь","э","ю","я")
         then rus_count+1;
     end;
     return(rus_count);
   endsub;
 run;

 /**************************************************************************/
 /* FUNCTION count_eng_letters RETURNS NUMBER OF ENGLISH LETTERS IN STRING */
 /**************************************************************************/
 proc fcmp outlib=sasuser.userfuncs.mystring;
   function count_eng_letters(string $);
     length letter $2;
     eng_count=0;
     len=klength(string);
     do i=1 to len;
       letter=ksubstr(string,i,1);
       /* test A-Z and a-z separately so that [ \ ] ^ _ ` are not counted */
       if (rank('A') <= rank(letter) <= rank('Z'))
          or (rank('a') <= rank(letter) <= rank('z'))
         then eng_count+1;
     end;
     return(eng_count);
   endsub;
 run;

 /**************************************************************************/
 /* FUNCTION is_string_russian RETURNS 1 IF NUMBER OF RUSSIAN SYMBOLS IN   */
 /* STRING >= NUMBER OF ENGLISH SYMBOLS                                    */
 /**************************************************************************/
 proc fcmp outlib=sasuser.userfuncs.mystring;
   function is_string_russian(string $);
     length letter $2 result 8;
     eng_count=0;
     rus_count=0;
     len=klength(string);
     do i=1 to len;
       letter=ksubstr(string,i,1);
       if letter in ("А","Б","В","Г","Д","Е","Ё","Ж","З","И","Й","К","Л","М","Н","О",
                     "П","Р","С","Т","У","Ф","Х","Ц","Ч","Ш","Щ","Ъ","Ы","Ь","Э","Ю","Я",
                     "а","б","в","г","д","е","ё","ж","з","и","й","к","л","м","н","о",
                     "п","р","с","т","у","ф","х","ц","ч","ш","щ","ъ","ы","ь","э","ю","я")
         then rus_count+1;
       if (rank('A') <= rank(letter) <= rank('Z'))
          or (rank('a') <= rank(letter) <= rank('z'))
         then eng_count+1;
     end;
     if rus_count>=eng_count then result=1;
     else result=0;
     return(result);
   endsub;
 run;

 /**************************************************************************/
 /* FUNCTION fix_layout_misprints REPLACES MISPRINTED SYMBOLS BY ANALYSING */
 /* LANGUAGE OF THE STRING (FOR ENGLISH STRING RUSSIAN SYMBOLS ARE         */
 /* REPLACED BY ENGLISH COPIES AND FOR RUSSIAN STRING SYMBOLS ARE          */
 /* REPLACED BY RUSSIAN COPIES)                                            */
 /**************************************************************************/
 proc fcmp outlib=sasuser.userfuncs.mystring;
   function fix_layout_misprints(string $) $ 1000;
     length letter $2 result $1000;
     eng_count=0;
     rus_count=0;
     len=klength(string);
     do i=1 to len;
       letter=ksubstr(string,i,1);
       if letter in ("А","Б","В","Г","Д","Е","Ё","Ж","З","И","Й","К","Л","М","Н","О",
                     "П","Р","С","Т","У","Ф","Х","Ц","Ч","Ш","Щ","Ъ","Ы","Ь","Э","Ю","Я",
                     "а","б","в","г","д","е","ё","ж","з","и","й","к","л","м","н","о",
                     "п","р","с","т","у","ф","х","ц","ч","ш","щ","ъ","ы","ь","э","ю","я")
         then rus_count+1;
       if (rank('A') <= rank(letter) <= rank('Z'))
          or (rank('a') <= rank(letter) <= rank('z'))
         then eng_count+1;
     end;
     /* ktranslate(source, to, from): the two lists pair look-alike letters */
     if rus_count>=eng_count then
       result=ktranslate(string,"АаВЕеКкМОоРрСсТХх","AaBEeKkMOoPpCcTXx");
     else
       result=ktranslate(string,"AaBEeKkMOoPpCcTXx","АаВЕеКкМОоРрСсТХх");
     return(result);
   endsub;
 run;

 /***********/
 /* EXAMPLE */
 /***********/
 options cmplib=sasuser.userfuncs;
 data _null_;
   good_str="";
   err_str="a";
   fixed_str=fix_layout_misprints(err_str);
   put "Good string=" good_str;
   put "Error string=" err_str;
   put "Fixed string=" fixed_str;
   rus_count_in_err=count_rus_letters(err_str);
   put "Count of Cyrillic symbols in error string=" rus_count_in_err;
   eng_count_in_err=count_eng_letters(err_str);
   put "Count of English symbols in error string=" eng_count_in_err;
   is_error_str_russian=is_string_russian(err_str);
   put "Is error string language Russian=" is_error_str_russian;
   if (good_str ne err_str) then
     put "Before clearing - strings are not equal to each other";
   if (good_str = fixed_str) then
     put "After clearing - strings are equal to each other";
 run;

Source: https://habr.com/ru/post/1242703/
