DmitryErshov
Obsidian | Level 7

I want to compare two strings that contain symbols from different alphabets (e.g. Russian and English). Symbols that look alike should be treated as equal to each other.

 

E.g. in the word "Mom" the letter "o" is from the English alphabet (code 006F in Unicode), and in the word "Mоm" the letter "о" is from the Russian alphabet (code 043E in Unicode). So ("Mom" = "Mоm") => false, but I want it to be true. Is there a standard SAS function for this, or should I write a macro to do it?

 

Thanks!

Accepted Solution
TomKari
Onyx | Level 15

Can you use the KTRANSLATE function to convert the characters outside the 7-bit range to their 7-bit look-alikes? All of the "o" variants above 7F would be converted first, and then you do your compare (e.g. convert 043E to 006F).
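
A minimal sketch of that idea, assuming a UTF-8 SAS session. The variable names are hypothetical, and the mapping covers only the look-alike letters that appear in the code later in this thread; it is not a complete homoglyph table.

```sas
/* Sketch: fold Cyrillic look-alikes to Latin, then compare.       */
/* Assumes a UTF-8 session; a, b, a_norm, b_norm are illustrative. */
data _null_;
  length a b a_norm b_norm $20;

  a = "Mom";   /* Latin o, U+006F    */
  b = "Mоm";   /* Cyrillic о, U+043E */

  /* KTRANSLATE(source, to, from): each character found in "from" */
  /* is replaced by the character at the same position in "to".   */
  a_norm = ktranslate(a, "AaBEeKkMOoPpCcTXx", "АаВЕеКкМОоРрСсТХх");
  b_norm = ktranslate(b, "AaBEeKkMOoPpCcTXx", "АаВЕеКкМОоРрСсТХх");

  if a = b then put "Raw strings are equal";
  else put "Raw strings differ";

  if a_norm = b_norm then put "Normalized strings are equal";
  else put "Normalized strings differ";
run;
```

Normalizing both sides in the same direction keeps the comparison symmetric; which alphabet to fold into is a design choice.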

 

Tom

DmitryErshov
Obsidian | Level 7

Thanks a lot Tom!

 

One more question. I want to check whether a letter belongs to the Russian alphabet. I can do it by comparing directly against the Cyrillic letters:

 

letter in ('А', 'Б', 'В', 'Г', 'Д', 'Ж', ...)

 

Is there a simpler approach? E.g. for the English alphabet I could use the rank() function:

 

rank('A') <= rank(letter) <= rank('Z') or rank('a') <= rank(letter) <= rank('z')

But this function doesn't work for multibyte characters in UTF-8 encoding. How can I get the position of a letter in the UTF-8 table?

 

Regards,

Dmitry 

TomKari
Onyx | Level 15

Hi, Dmitry

 

I'm definitely getting onto thinner ice here, but since it looks like the Cyrillic characters are between x'0400' and x'0513', what if you just coded

 

'0400'x <= letter <= '0513'x

 

I'm not set up to try internationalized stuff...give it a try and post back what happens.
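
One caveat when trying this: 0400 and 0513 are Unicode code points, whereas a UTF-8 session stores each of these characters as a two-byte sequence (U+0400 is D0 80 and U+0513 is D4 93). A minimal, untested sketch of the same range check on the UTF-8 bytes, assuming a UTF-8 session on an ASCII-based host where character comparison is byte-wise; the variable names are hypothetical:

```sas
/* Sketch: count characters in the Cyrillic range by comparing */
/* UTF-8 byte sequences instead of code points.                */
data _null_;
  length string $40 letter $4;
  string = "Ивaнов";   /* the third character is a Latin a */
  cyr_count = 0;

  do i = 1 to klength(string);
    letter = ksubstr(string, i, 1);
    /* U+0400..U+0513 are exactly the two-byte sequences */
    /* D0 80 .. D4 93 in UTF-8.                          */
    if lengthn(letter) = 2 and 'D080'x <= letter <= 'D493'x
    then cyr_count = cyr_count + 1;
  end;

  put cyr_count=;
run;
```

The range is wider than the Russian alphabet: it also admits other Cyrillic letters (Ukrainian, Serbian and so on), so an explicit letter list remains the stricter test.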

 

Tom

DmitryErshov
Obsidian | Level 7

Thanks Tom! I also coded some functions to deal with keyboard layout misprints. Here is the code:

 

/***************************************************************************/
/* FUNCTION count_rus_letters RETURNS NUMBER OF CYRILLIC LETTERS IN STRING */
/***************************************************************************/ 
proc fcmp outlib=sasuser.userfuncs.mystring;
FUNCTION count_rus_letters(string $);
length letter $2;

rus_count=0;

len=klength(string);

do i=1 to len;
  letter=ksubstr(string,i,1);
  if letter in ("А","а","Б","б","В","в","Г","г","Д","д","Е","е","Ё","ё","Ж","ж"
      "З","з","И","и","Й","й","К","к","Л","л","М","м","Н","н","О","о","П","п","Р","р",
      "С","с","Т","т","У","у","Ф","ф","Х","х","Ц","ц","Ч","ч","Ш","ш","Щ","щ","Ъ","ъ"
      "Ы","ы","Ь","ь","Э","э","Ю","ю","Я","я") 
  then rus_count+1;
end;

return(rus_count);
endsub;
run;

/**************************************************************************/
/* FUNCTION count_eng_letters RETURNS NUMBER OF ENGLISH LETTERS IN STRING */
/**************************************************************************/ 
proc fcmp outlib=sasuser.userfuncs.mystring;
FUNCTION count_eng_letters(string $);
length letter $2;

eng_count=0;

len=klength(string);

do i=1 to len;
  letter=ksubstr(string,i,1);
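  /* two ranges, so the punctuation between 'Z' and 'a' is not counted */
  if rank('A') <= rank(letter) <= rank('Z') or rank('a') <= rank(letter) <= rank('z')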
  then eng_count+1;
end;

return(eng_count);
endsub;
run;

/**************************************************************************/
/* FUNCTION is_string_russian RETURNS 1 IF NUMBER OF RUSSIAN SYMBOLS IN   */
/* STRING >= NUMBER OF ENGLISH SYMBOLS                                    */
/**************************************************************************/ 
proc fcmp outlib=sasuser.userfuncs.mystring;
FUNCTION is_string_russian(string $);
length letter $2 result 8;

eng_count=0;
rus_count=0;

len=klength(string);

do i=1 to len;
  letter=ksubstr(string,i,1);
  if letter in ("А","а","Б","б","В","в","Г","г","Д","д","Е","е","Ё","ё","Ж","ж"
      "З","з","И","и","Й","й","К","к","Л","л","М","м","Н","н","О","о","П","п","Р","р",
      "С","с","Т","т","У","у","Ф","ф","Х","х","Ц","ц","Ч","ч","Ш","ш","Щ","щ","Ъ","ъ"
      "Ы","ы","Ь","ь","Э","э","Ю","ю","Я","я") 
  then rus_count+1;
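  /* two ranges, so the punctuation between 'Z' and 'a' is not counted */
  if rank('A') <= rank(letter) <= rank('Z') or rank('a') <= rank(letter) <= rank('z')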
  then eng_count+1;
end;

if rus_count>=eng_count
then result=1;
else result=0;

return(result);
endsub;
run;

/**************************************************************************/
/* FUNCTION fix_layout_misprints REPLACES MISPRINTED SYMBOLS BY ANALYSING */
/* LANGUAGE OF THE STRING (FOR ENGLISH STRING RUSSIAN SYMBOLS ARE         */
/* REPLACED BY ENGLISH COPIES AND FOR RUSSIAN STRING SYMBOLS ARE          */
/* REPLACED BY RUSSIAN COPIES)                                            */
/**************************************************************************/ 
proc fcmp outlib=sasuser.userfuncs.mystring;
FUNCTION fix_layout_misprints(string $) $ 1000;
length letter $2 result $1000;

eng_count=0;
rus_count=0;

len=klength(string);

do i=1 to len;
  letter=ksubstr(string,i,1);
  if letter in ("А","а","Б","б","В","в","Г","г","Д","д","Е","е","Ё","ё","Ж","ж"
      "З","з","И","и","Й","й","К","к","Л","л","М","м","Н","н","О","о","П","п","Р","р",
      "С","с","Т","т","У","у","Ф","ф","Х","х","Ц","ц","Ч","ч","Ш","ш","Щ","щ","Ъ","ъ"
      "Ы","ы","Ь","ь","Э","э","Ю","ю","Я","я") 
  then rus_count+1;
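  /* two ranges, so the punctuation between 'Z' and 'a' is not counted */
  if rank('A') <= rank(letter) <= rank('Z') or rank('a') <= rank(letter) <= rank('z')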
  then eng_count+1;
end;

if rus_count>=eng_count
then result=ktranslate(string,"АаВЕеКкМОоРрСсТХх","AaBEeKkMOoPpCcTXx");
else result=ktranslate(string,"AaBEeKkMOoPpCcTXx","АаВЕеКкМОоРрСсТХх");

return(result);
endsub;
run;

/***********/
/* EXAMPLE */
/***********/
options cmplib=sasuser.userfuncs;
data _null_;
good_str="Иванов";
err_str="Ивaнов";
fixed_str=fix_layout_misprints(err_str);

put "Good string=" good_str;
put "Error string=" err_str;
put "Fixed string=" fixed_str;

rus_count_in_err=count_rus_letters(err_str);
put "Count of Cyrillic symbols in error string=" rus_count_in_err;

eng_count_in_err=count_eng_letters(err_str);
put "Count of English symbols in error string=" eng_count_in_err;

is_error_str_russian=is_string_russian(err_str);
put "Is error string language Russian=" is_error_str_russian;

if (good_str ne err_str) 
then put "Before clearing - strings are not equal to each other";

if (good_str = fixed_str) 
then put "After clearing - strings are equal to each other";
run; 
