BookmarkSubscribeRSS Feed
OrangePeel
Fluorite | Level 6

Hi, everyone, for example, how to match the following two datasets?  

Of course, the actual situation is more complex. Dataset a or b have many observations, and the number and order of observations are not always equal.

 

This function is not work.

OrangePeel_0-1710983968074.png

 

data a;
length text $200;
text1="既往用药(ITT分析集)";output;
text1="合并用药(ITT分析集)";output;
text1="基于IRC评估的无进展生存期(PFS)汇总分析 (ITT分析集)";output;
text1="按系统器官分类和首选术语总结的任意一组发生率≥5%的TEAE(安全性分析集)";output;
run;


data b;
length text $200;
text2="既往用药(全分析集/意向性治疗分析集)";output;
text2="既往用药(全分析集/意向性治疗分析集)";output;
text2="基于独立中心评估的无进展生存期(PFS)汇总分析-基于删失规则I(全分析集/意向性治疗分析集)";output;
text2="按系统器官分类和首选术语总结的任意一组发生率≥{xx}%的TEAE(安全性分析集)";output;
run;

 

9 REPLIES 9
Patrick
Opal | Level 21

If you've got the SAS Data Quality products licensed then you could look into the DQMATCH() function.

You would create match codes and then join your data via these match codes.

OrangePeel
Fluorite | Level 6
unfortunately,not have
Patrick
Opal | Level 21

@OrangePeel wrote:
unfortunately,not have

Do you have access to a SAS Viya environment?

 

With SAS9.4 you could use Proc FCMP and implement the Levenshtein distance function yourself that works for multibyte characters.

Assuming you don't want to do this another option could be to look into something like Python to create such matchcodes or similarity scores that you then can use on the SAS side for a join.


@Ksharp, @Reeza Can you think of anything else?

OrangePeel
Fluorite | Level 6
Thank you reply. This is really a challenge for me. I'll give it a try。
Ksharp
Super User

@Patrick 

I wrote such kind of function before, if you like it , here is :

UPDATED: change FIND() into KFIND() for multi byte data.

data a;
infile cards truncover;
input a $200.;
cards;
二巯丙磺钠
ACYW群脑膜炎球菌多糖疫苗
A型肉毒毒素
DL-蛋氨酸,甘氨酸,甘草酸铵
L-焦谷氨酸艾托格列净
一枝黄花草本,穿心莲草本
丁溴东莨菪碱
七叶皂苷,水杨酸二乙胺
田七花叶
七叶神安
三七通舒
侧柏种子
保泰松,利多卡因,氨基比林
倍他米松
倍他米松,氯霉素
倍他米松,盐酸苯海拉明,硝酸咪康唑,硫酸庆大霉素
倍他米松磷酸钠
;
data b;
infile cards truncover;
input b $200.;
cards;
2, 3-二巯基丙磺酸钠盐
acyw135群脑膜炎球菌多糖疫苗
A型肉毒毒素
复方甘草酸苷
艾托格列净
穿黄清热
丁溴东莨菪碱
复方七叶皂苷
三七叶总皂苷提取物
柏子仁配方
复方氨基比林
复方保泰松鸡血藤
倍他米松
倍松
联邦倍松
得肤宝
倍他米松磷酸钠
;

proc fcmp outlib=work.math.func;
function fuzz_match(a $,b $)  ;
length new_a new_b str_short str_long  $ 200;
new_a=kcompress(a,,'kad');
new_b=kcompress(b,,'kad');
if klength(a)<klength(b) then do;
 str_short=new_a;str_long=new_b;
end;
else do;
  str_short=new_b;str_long=new_a;
end;

array count{10} _temporary_  (0 0 0  0  0  0  0   0    0    0);  /*The count of one string ,two string , three string....*/
array weight{10} _temporary_ (2 4 8 16 32 64 128 256 512 1024);  /*The weight of one string ,two string , three string....*/
max_str_len=min(10,klength(a),klength(b));  /*Search the max length of str is 10*/
do len=1 to max_str_len;      /*Search one string ,two string , three string .......*/
 n=0;
 do start=1 to klength(str_short)+1-len;  /*Search sub-string in  long string*/
  if kfind(str_long,strip(ksubstr(str_short,start,len))) then n+1;
 end;
 count{len}=n; 
end;


sum=0;w=0;mean=0;
do k=1 to max_str_len;
  if count{k} ne 0 then do;sum+count{k}*weight{k}; w+weight{k};end;
end;
/*Calculated weight mean
if w=0 then mean=0;
 else mean=sum/w;
*/
if a=b then sum=constant('exactint');
return (sum); /*return weight sum or weight mean(mean)*/
endsub;
run;

options cmplib=work.math;
proc sql;
create table want as
select a,b,fuzz_match(a,b) as weight
 from  a, b
  order by a,weight desc;
quit;

 

Reeza
Super User
There are the K level functions but unfortunately not for DBCS, KCOMPARE is a start but probably not close enough.
OrangePeel
Fluorite | Level 6
Although the complex funtion describes SBCS, I tried it and it works for DBCS as well.The returned result is Levenshtein Distance。
Patrick
Opal | Level 21

@OrangePeel wrote:
Although the complex funtion describes SBCS, I tried it and it works for DBCS as well.The returned result is Levenshtein Distance。

If the function hasn't been qualified for use with multibyte then you shouldn't use it for multibyte. Only because the function doesn't return an error doesn't mean it's doing the right thing. 

What could happen for example: The function treats each byte of a multibyte string as separate "character" and though will return a wrong result.

OrangePeel
Fluorite | Level 6
You're right, just because the program doesn't report errors doesn't mean the results are accurate.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 1203 views
  • 7 likes
  • 4 in conversation