BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Ronein
Onyx | Level 15

Hello

Lets say I have list of column names that I am looking for .

I want to search these col names in SASHELP.CARS data set.

The problem that sometimes the var name I search doesnt exist in sashelp.cars .

Then, my task is - Find the best candidate var name that  has highest matching to var name I am looking for.

For example:

For field CAR_Model best candidate is CAR

For field MSRP  best candidate is MSRP (Here have  full matching)

For field CarEngineSize  best candidate is EngineSize

and so on

what is the way to do it please 

The want data set will have 2 columns- var_search, var_best_candidate

 

proc contents data=sashelp.cars;
ods output variables=variables_List;
run;

Data Search_Fields;
input field $
Cylin
DriveTrain
CarEngineSize
Horsepower
InvoiceAmnt
Length
MPG_IN_City
MPG_IN_Highway
MSRP
Make
CAR_Model
Origin
CAR_Type
Weight
Wheelbase
;
Run;

 

 

 

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

Any time someone asks for "best" solutions I tend to ask "what is the measure used to determine best".

Typically programming code looks at minimizing time of execution, time to program or resources used such as memory, storage space or network traffic. 

 

If you have a data set with the names of the variables such as shown then you might consider use of one of the spelling distance functions such as COMPGED or SPEDIS to get a score for similarity to your search name.

 

An example:

proc contents data=sashelp.cars;
ods output variables=variables_List;
run;


data search; 
   set variables_list;
   Tofind = 'mpg';
   distance_score = compged(upcase(tofind),upcase(variable));
   keep tofind variable distance_score label;
run;

proc sort data=search;
   by distance_score;
run;

The sort would get the typical "closest" matches at the start of the data set.

I use the UPCASE versions because the functions used will return a different score based on case of the letters. 

Since variable names may be stored in mixed case but functionally there would be no difference for most other code between names of "mpg" "Mpg" "mPG" etc then the singe case for any comparison makes more sense. 

Note that I also picked a variable request that might have 2 "correct" choices. To demonstrate that you may still require another manual step.

Note that this will return 0 for the distance for a perfect match but the other values may be quite a bit different.

 

Experiment with the related functions and read the documentation.

 

You might try adding a variable such a Really_long_mpg_variable to see how this scoring approach works.

 

Having dealt with a number of searches matching peoples' names without access the fancier SAS packages, I used an approach of look for an EXACT match first , Upcase(tofind)=Upcase(variable) and only bothered with the score if there was not a match.

You may want to investigate a search for is one a substring of the other as well. See the result for the Really_long_mpg_variable suggested. The INDEX function might be useful for that.

View solution in original post

7 REPLIES 7
ballardw
Super User

Any time someone asks for "best" solutions I tend to ask "what is the measure used to determine best".

Typically programming code looks at minimizing time of execution, time to program or resources used such as memory, storage space or network traffic. 

 

If you have a data set with the names of the variables such as shown then you might consider use of one of the spelling distance functions such as COMPGED or SPEDIS to get a score for similarity to your search name.

 

An example:

proc contents data=sashelp.cars;
ods output variables=variables_List;
run;


data search; 
   set variables_list;
   Tofind = 'mpg';
   distance_score = compged(upcase(tofind),upcase(variable));
   keep tofind variable distance_score label;
run;

proc sort data=search;
   by distance_score;
run;

The sort would get the typical "closest" matches at the start of the data set.

I use the UPCASE versions because the functions used will return a different score based on case of the letters. 

Since variable names may be stored in mixed case but functionally there would be no difference for most other code between names of "mpg" "Mpg" "mPG" etc then the singe case for any comparison makes more sense. 

Note that I also picked a variable request that might have 2 "correct" choices. To demonstrate that you may still require another manual step.

Note that this will return 0 for the distance for a perfect match but the other values may be quite a bit different.

 

Experiment with the related functions and read the documentation.

 

You might try adding a variable such a Really_long_mpg_variable to see how this scoring approach works.

 

Having dealt with a number of searches matching peoples' names without access the fancier SAS packages, I used an approach of look for an EXACT match first , Upcase(tofind)=Upcase(variable) and only bothered with the score if there was not a match.

You may want to investigate a search for is one a substring of the other as well. See the result for the Really_long_mpg_variable suggested. The INDEX function might be useful for that.

Ronein
Onyx | Level 15

That's great,

May you show a program that find for each var the best matching var?

Do it all for all vars in one step?

 

proc contents data=sashelp.cars;
ods output variables=variables_List;
run;

Data Search_Fields;
input field $20.;
cards;
Cylin
DriveTrain
CarEngineSize
Horsepower
InvoiceAmnt
Length
MPG_IN_City
MPG_IN_Highway
MSRP
Make
CAR_Model
Origin
CAR_Type
Weight
Wheelbase
;
Run;

%macro Find_Best_Fit_Var(Field);
data search; 
set variables_list;
Tofind ="&Field.";
distance_score = compged(upcase(tofind),upcase(variable));
keep tofind variable distance_score label;
run;
proc sql;
create table Want as
select  *
from search
order by distance_score
;
quit;
data Want_b;
set Want;
 if _N_ = 1 then output;
run;
%mend Find_Best_Fit_Var;
%Find_Best_Fit_Var (Field=CarEngineSize)
%Find_Best_Fit_Var (Field=Cylin)
Quentin
Super User

The data is small, so you could use PROC SQL to get a cartesian product and calculate the distance for each combination, something like:

 

proc sql ;
  create table want as
  select variable
        ,field
        ,compged(upcase(field),upcase(variable)) as distance_score
  from search_fields,variables_list
  group by field
  having distance_score=min(distance_score)
  ;
quit ;
PaigeMiller
Diamond | Level 26

For field CAR_Model best candidate is CAR

For field MSRP  best candidate is MSRP (Here have  full matching)

For field CarEngineSize  best candidate is EngineSize

 

I am skeptical about the usage of these functions in this case. I have never tried using them in your case of variable names, but it seems to me these tools are designed to be used on words or names, in which case it can mitigate the effect of spelling errors, or mitigate the effect of there being 2 different ways to spell a word such as "color" and "colour". But neither of those applies to variable names, which are not names or words; where variable names can be spelled differently because different words or abbreviations or ordering of the words can be used in the variable name. So maybe I'm wrong, but this doesn't seem like the right problem to use these tools on.

--
Paige Miller
mkeintz
PROC Star

What if you want to find the closest match to X145, from a list of variables named X14,X15,X125,X147.X345,X545?

 

As @PaigeMiller has said, would the use of spelling distance have any utility at all in this case?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Ksharp
Super User

As ballarw showed you , there are several function in SAS to do such FUZZ matched. Like : compged() spelldis() complev() ........ But I would like to show my way/function :

 

proc contents data=sashelp.cars;
ods output variables=a;
run;
data a;
 set a(keep=Variable rename=(Variable=a));
run;

Data b;
input b $20.;
cards;
Cylin
DriveTrain
CarEngineSize
Horsepower
InvoiceAmnt
Length
MPG_IN_City
MPG_IN_Highway
MSRP
Make
CAR_Model
Origin
CAR_Type
Weight
Wheelbase
;
Run;


proc fcmp outlib=work.math.func;
function fuzz_match(a $,b $)  ;
length new_a new_b str_short str_long  $ 200;
new_a=kcompress(a,,'kad');
new_b=kcompress(b,,'kad');
if klength(a)<klength(b) then do;
 str_short=new_a;str_long=new_b;
end;
else do;
  str_short=new_b;str_long=new_a;
end;

array count{10} _temporary_  (0 0 0  0  0  0  0   0    0    0);  /*The count of one string ,two string , three string....*/
array weight{10} _temporary_ (2 4 8 16 32 64 128 256 512 1024);  /*The weight of one string ,two string , three string....*/
max_str_len=min(10,klength(a),klength(b));  /*Search the max length of str is 10*/
do len=1 to max_str_len;      /*Search one string ,two string , three string .......*/
 n=0;
 do start=1 to klength(str_short)+1-len;  /*Search sub-string in  long string*/
  if kfind(str_long,strip(ksubstr(str_short,start,len))) then n+1;
 end;
 count{len}=n; 
end;


sum=0;w=0;mean=0;
do k=1 to max_str_len;
  if count{k} ne 0 then do;sum+count{k}*weight{k}; w+weight{k};end;
end;
/*Calculated weight mean
if w=0 then mean=0;
 else mean=sum/w;
*/
if a=b then sum=constant('exactint');
return (sum); /*return weight sum or weight mean(mean)*/
endsub;
run;

options cmplib=work.math;
proc sql;
create table want as
select a,b,fuzz_match(a,b) as weight
 from  a, b
  group by a
   having weight=max(weight);
quit;


Ksharp_0-1752225169514.png

 

Ronein
Onyx | Level 15

The solution provided worked 100% for me.

My task was to find the best candidate using the code and then look with my eyes too verify it found logical candidate  and if not then i do it manually using my eyes .

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 1096 views
  • 4 likes
  • 6 in conversation