Solved: Re: Search var with best match name

Ronein · Posted 07-10-2025 03:35 AM

Hello

Let's say I have a list of column names that I am looking for.

I want to search for these column names in the SASHELP.CARS data set.

The problem is that sometimes the variable name I search for doesn't exist in sashelp.cars.

Then my task is to find the best candidate: the variable name that most closely matches the one I am looking for.

For example:

For field CAR_Model best candidate is Model

For field MSRP best candidate is MSRP (here there is a full match)

For field CarEngineSize best candidate is EngineSize

and so on

What is the way to do this, please?

The wanted data set will have 2 columns: var_search and var_best_candidate.

proc contents data=sashelp.cars;
ods output variables=variables_List;
run;

Data Search_Fields;
/* :$32. reads names up to 32 characters; a plain $ would truncate at 8 */
input field :$32.;
datalines;
Cylin
DriveTrain
CarEngineSize
Horsepower
InvoiceAmnt
Length
MPG_IN_City
MPG_IN_Highway
MSRP
Make
CAR_Model
Origin
CAR_Type
Weight
Wheelbase
;
Run;

ballardw · Posted 07-10-2025 03:57 AM

Any time someone asks for "best" solutions I tend to ask "what is the measure used to determine best".

Typically programming code looks at minimizing time of execution, time to program or resources used such as memory, storage space or network traffic.

If you have a data set with the names of the variables such as shown then you might consider use of one of the spelling distance functions such as COMPGED or SPEDIS to get a score for similarity to your search name.

An example:

proc contents data=sashelp.cars;
ods output variables=variables_List;
run;


data search; 
   set variables_list;
   Tofind = 'mpg';
   distance_score = compged(upcase(tofind),upcase(variable));
   keep tofind variable distance_score label;
run;

proc sort data=search;
   by distance_score;
run;

The sort would get the typical "closest" matches at the start of the data set.

I use the UPCASE versions because the functions used will return a different score based on case of the letters.

Since variable names may be stored in mixed case, but functionally there is no difference for most other code between names such as "mpg", "Mpg", "mPG" etc., using a single case for any comparison makes more sense.
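
To see the effect of case for yourself, a minimal sketch like the following can help. The two literal names are just illustrative values, not taken from any particular data set; the only claim is that the second score is 0 because the strings are identical after UPCASE, while the first is not.

```sas
/* Sketch: compare COMPGED with and without case normalization. */
data _null_;
   raw_score   = compged('mpg_city', 'MPG_City');                 /* case differs   */
   upper_score = compged(upcase('mpg_city'), upcase('MPG_City')); /* identical text */
   put raw_score= upper_score=;   /* scores are written to the log */
run;
```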

Note that I also picked a variable request that might have 2 "correct" choices, to demonstrate that you may still require another manual step.

Note that this will return 0 for the distance for a perfect match but the other values may be quite a bit different.

Experiment with the related functions and read the documentation.

You might try adding a variable such as Really_long_mpg_variable to see how this scoring approach works.
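
One way to run that experiment without touching any data set is to loop over a few candidate names in a DATA _NULL_ step. This is only a sketch: the candidate list is made up for illustration, and the scores are written to the log so you can compare how COMPGED and SPEDIS treat a long name that merely contains the search text.

```sas
/* Sketch: score a short search name against hypothetical candidates. */
data _null_;
   length candidate $32;
   tofind = 'MPG';
   do candidate = 'MPG_CITY', 'MPG_HIGHWAY', 'REALLY_LONG_MPG_VARIABLE';
      ged_score    = compged(tofind, strip(candidate));   /* generalized edit distance */
      spedis_score = spedis(tofind, strip(candidate));    /* spelling distance         */
      put candidate= ged_score= spedis_score=;
   end;
run;
```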

Having dealt with a number of searches matching people's names without access to the fancier SAS packages, I used an approach of looking for an EXACT match first, Upcase(tofind)=Upcase(variable), and only bothered with the score if there was no match.

You may also want to check whether one name is a substring of the other. See the result for the Really_long_mpg_variable suggested above. The INDEX function might be useful for that.
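
Putting these ideas together for the whole list, here is a rough sketch rather than a definitive solution. It assumes the Search_Fields data set (column FIELD) and the variables_List data set (column VARIABLE) from earlier in the thread. The names scored, is_substring and want, and the choice to rank substring matches ahead of the distance score, are my own assumptions for illustration; adjust the ordering to whatever "best" means for you.

```sas
/* Sketch: score every search name against every variable name. */
proc sql;
   create table scored as
   select s.field    as var_search,
          v.variable as var_best_candidate,
          /* 1 if either name contains the other, else 0 */
          (index(upcase(strip(v.variable)), upcase(strip(s.field))) > 0
           or index(upcase(strip(s.field)), upcase(strip(v.variable))) > 0)
                     as is_substring,
          /* 0 means an exact match after case normalization */
          compged(upcase(strip(s.field)), upcase(strip(v.variable)))
                     as distance_score
   from Search_Fields as s, variables_List as v;
quit;

/* Assumed ranking: substring matches first, then smallest distance. */
proc sort data=scored;
   by var_search descending is_substring distance_score;
run;

/* Keep the top-ranked candidate for each search name. */
data want;
   set scored;
   by var_search;
   if first.var_search;
   keep var_search var_best_candidate;
run;
```

Ties and near-ties are still possible, so it is worth reviewing the scored data set before trusting want.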
