Hello
Lets say I have list of column names that I am looking for .
I want to search these col names in SASHELP.CARS data set.
The problem that sometimes the var name I search doesnt exist in sashelp.cars .
Then, my task is - Find the best candidate var name that has highest matching to var name I am looking for.
For example:
For field CAR_Model best candidate is CAR
For field MSRP best candidate is MSRP (Here have full matching)
For field CarEngineSize best candidate is EngineSize
and so on
what is the way to do it please
The want data set will have 2 columns- var_search, var_best_candidate
proc contents data=sashelp.cars;
ods output variables=variables_List;
run;
Data Search_Fields;
input field $
Cylin
DriveTrain
CarEngineSize
Horsepower
InvoiceAmnt
Length
MPG_IN_City
MPG_IN_Highway
MSRP
Make
CAR_Model
Origin
CAR_Type
Weight
Wheelbase
;
Run;
Any time someone asks for "best" solutions I tend to ask "what is the measure used to determine best".
Typically programming code looks at minimizing time of execution, time to program or resources used such as memory, storage space or network traffic.
If you have a data set with the names of the variables such as shown then you might consider use of one of the spelling distance functions such as COMPGED or SPEDIS to get a score for similarity to your search name.
An example:
proc contents data=sashelp.cars; ods output variables=variables_List; run; data search; set variables_list; Tofind = 'mpg'; distance_score = compged(upcase(tofind),upcase(variable)); keep tofind variable distance_score label; run; proc sort data=search; by distance_score; run;
The sort would get the typical "closest" matches at the start of the data set.
I use the UPCASE versions because the functions used will return a different score based on case of the letters.
Since variable names may be stored in mixed case but functionally there would be no difference for most other code between names of "mpg" "Mpg" "mPG" etc then the singe case for any comparison makes more sense.
Note that I also picked a variable request that might have 2 "correct" choices. To demonstrate that you may still require another manual step.
Note that this will return 0 for the distance for a perfect match but the other values may be quite a bit different.
Experiment with the related functions and read the documentation.
You might try adding a variable such a Really_long_mpg_variable to see how this scoring approach works.
Having dealt with a number of searches matching peoples' names without access the fancier SAS packages, I used an approach of look for an EXACT match first , Upcase(tofind)=Upcase(variable) and only bothered with the score if there was not a match.
You may want to investigate a search for is one a substring of the other as well. See the result for the Really_long_mpg_variable suggested. The INDEX function might be useful for that.
Any time someone asks for "best" solutions I tend to ask "what is the measure used to determine best".
Typically programming code looks at minimizing time of execution, time to program or resources used such as memory, storage space or network traffic.
If you have a data set with the names of the variables such as shown then you might consider use of one of the spelling distance functions such as COMPGED or SPEDIS to get a score for similarity to your search name.
An example:
proc contents data=sashelp.cars; ods output variables=variables_List; run; data search; set variables_list; Tofind = 'mpg'; distance_score = compged(upcase(tofind),upcase(variable)); keep tofind variable distance_score label; run; proc sort data=search; by distance_score; run;
The sort would get the typical "closest" matches at the start of the data set.
I use the UPCASE versions because the functions used will return a different score based on case of the letters.
Since variable names may be stored in mixed case but functionally there would be no difference for most other code between names of "mpg" "Mpg" "mPG" etc then the singe case for any comparison makes more sense.
Note that I also picked a variable request that might have 2 "correct" choices. To demonstrate that you may still require another manual step.
Note that this will return 0 for the distance for a perfect match but the other values may be quite a bit different.
Experiment with the related functions and read the documentation.
You might try adding a variable such a Really_long_mpg_variable to see how this scoring approach works.
Having dealt with a number of searches matching peoples' names without access the fancier SAS packages, I used an approach of look for an EXACT match first , Upcase(tofind)=Upcase(variable) and only bothered with the score if there was not a match.
You may want to investigate a search for is one a substring of the other as well. See the result for the Really_long_mpg_variable suggested. The INDEX function might be useful for that.
That's great,
May you show a program that find for each var the best matching var?
Do it all for all vars in one step?
proc contents data=sashelp.cars;
ods output variables=variables_List;
run;
Data Search_Fields;
input field $20.;
cards;
Cylin
DriveTrain
CarEngineSize
Horsepower
InvoiceAmnt
Length
MPG_IN_City
MPG_IN_Highway
MSRP
Make
CAR_Model
Origin
CAR_Type
Weight
Wheelbase
;
Run;
%macro Find_Best_Fit_Var(Field);
data search;
set variables_list;
Tofind ="&Field.";
distance_score = compged(upcase(tofind),upcase(variable));
keep tofind variable distance_score label;
run;
proc sql;
create table Want as
select *
from search
order by distance_score
;
quit;
data Want_b;
set Want;
if _N_ = 1 then output;
run;
%mend Find_Best_Fit_Var;
%Find_Best_Fit_Var (Field=CarEngineSize)
%Find_Best_Fit_Var (Field=Cylin)
The data is small, so you could use PROC SQL to get a cartesian product and calculate the distance for each combination, something like:
proc sql ;
create table want as
select variable
,field
,compged(upcase(field),upcase(variable)) as distance_score
from search_fields,variables_list
group by field
having distance_score=min(distance_score)
;
quit ;
For field CAR_Model best candidate is CAR
For field MSRP best candidate is MSRP (Here have full matching)
For field CarEngineSize best candidate is EngineSize
I am skeptical about the usage of these functions in this case. I have never tried using them in your case of variable names, but it seems to me these tools are designed to be used on words or names, in which case it can mitigate the effect of spelling errors, or mitigate the effect of there being 2 different ways to spell a word such as "color" and "colour". But neither of those applies to variable names, which are not names or words; where variable names can be spelled differently because different words or abbreviations or ordering of the words can be used in the variable name. So maybe I'm wrong, but this doesn't seem like the right problem to use these tools on.
What if you want to find the closest match to X145, from a list of variables named X14,X15,X125,X147.X345,X545?
As @PaigeMiller has said, would the use of spelling distance have any utility at all in this case?
As ballarw showed you , there are several function in SAS to do such FUZZ matched. Like : compged() spelldis() complev() ........ But I would like to show my way/function :
proc contents data=sashelp.cars;
ods output variables=a;
run;
data a;
set a(keep=Variable rename=(Variable=a));
run;
Data b;
input b $20.;
cards;
Cylin
DriveTrain
CarEngineSize
Horsepower
InvoiceAmnt
Length
MPG_IN_City
MPG_IN_Highway
MSRP
Make
CAR_Model
Origin
CAR_Type
Weight
Wheelbase
;
Run;
proc fcmp outlib=work.math.func;
function fuzz_match(a $,b $) ;
length new_a new_b str_short str_long $ 200;
new_a=kcompress(a,,'kad');
new_b=kcompress(b,,'kad');
if klength(a)<klength(b) then do;
str_short=new_a;str_long=new_b;
end;
else do;
str_short=new_b;str_long=new_a;
end;
array count{10} _temporary_ (0 0 0 0 0 0 0 0 0 0); /*The count of one string ,two string , three string....*/
array weight{10} _temporary_ (2 4 8 16 32 64 128 256 512 1024); /*The weight of one string ,two string , three string....*/
max_str_len=min(10,klength(a),klength(b)); /*Search the max length of str is 10*/
do len=1 to max_str_len; /*Search one string ,two string , three string .......*/
n=0;
do start=1 to klength(str_short)+1-len; /*Search sub-string in long string*/
if kfind(str_long,strip(ksubstr(str_short,start,len))) then n+1;
end;
count{len}=n;
end;
sum=0;w=0;mean=0;
do k=1 to max_str_len;
if count{k} ne 0 then do;sum+count{k}*weight{k}; w+weight{k};end;
end;
/*Calculated weight mean
if w=0 then mean=0;
else mean=sum/w;
*/
if a=b then sum=constant('exactint');
return (sum); /*return weight sum or weight mean(mean)*/
endsub;
run;
options cmplib=work.math;
proc sql;
create table want as
select a,b,fuzz_match(a,b) as weight
from a, b
group by a
having weight=max(weight);
quit;
The solution provided worked 100% for me.
My task was to find the best candidate using the code and then look with my eyes too verify it found logical candidate and if not then i do it manually using my eyes .
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.