hi there, I'm experienced in creating macros and manipulating arrays etc but not in using %do loops outwith a data step / proc sql so am getting stuck on this.
I have a data set (called masterlist) with 2 variables - varnum is sequential and variable1 has no duplicates - each entry is unique:
varnum variable1
1 CompanyName1
2 CompanyName2
3 CompanyName3
etc
N CompanyNameN
I want to search another table (newtable) to flag all instances of each of the values in variable1 i.e. CompanyName1, CompanyName2 --- CompanyNameN to find how many of each record in variable1 are in newtable (which has a variable companyname). The companyname field in newtable is free text so can be entered with capitals, shortened form (LTD instead of Limited), extra spaces, spelling mistakes etc) etc so it is not always an exact match to what is held in the table masterlist. I am going to do some "fuzzy" matches based on substring and index functions on key parts of the value in variable1. I can do what I need for just one value of variable1, but how do I make the program loop through all values in variable1? In simple terms I want something like this:
do i=1 to (N)
data newtable;
set newtable;
if substr(companyname, 1,3) = variable1(i) or index(companyname,"variable1(i)") > 0 then flag = (i);
run;
end;
Each row in column "flag" would end up with lots of blanks and numbers between 1 and N (the max being the number of entries table masterlist)
and then I ultimately want to do a count of each " fuzzy match" on companyname in newtable
proc summary data=newtable nway missing;
class flag;
output out= etc;
run;
I feel I need to create global variables for varnum(i) and variable1(i) and then feed those into a do loop for the data step?
Any suggestions on how to code this would be great as I have not used %do before or generated multiple global macro variables via a do loop either.
SQL join.
proc sql;
create table fuzzymatch as
select m.varnum, m.variable1, l.companyName as matched_company
from masterlist as m
cross join newtable as l
where substr(l.companyname, 1,3) = m.variable1
or index(l.companyName, m.variable1) >0
order by 1, 2;
quit;
Or use COMPGED in your join - you'll need to play around with the filter and join conditions.
proc sql;
create table fuzzymatch as
select m.varnum, m.variable1, l.companyName as matched_company
from masterlist as m
cross join newtable as l
where compged(m.varnum, l.companyName) < 100
order by 1, 2;
quit;
It's a bit hard to follow this description. I suggest you add some actual sample data to your question (examples of both datasets, as CARDS data steps), and also add the actual code you would use to implement the logic you want for two values of CompanyName. Don't write a %DO loop and don't write a macro. Just post the SAS code that people could run against the sample data to create the output data you would want for CompanyName1 and CompanyName2. That will make it easier for people to understand what you are doing, and perhaps suggest alternatives (like maybe a SQL approach). And if a macro would be useful, then having working non-macro code is the best place to start when designing a macro.
I would be tempted to use the data set with the sequence to create a format so that the value of 1 would be the CompanyName1 value. Then iterated a counter in a data set from 1 to n looking for whatever .
Which of values is shorter? From your description This is likely to almost never be true:
substr(companyname, 1,3) = variable1(i)
Assuming by "variable1(i)" you mean the ith value of variable1 in masterlist. The only time it would be true is when variable1 is exactly 3 letters. Use =: if you mean "begins with"
Also you stated you aren't sure about the quality of the data values. You likely want to make sure that letter case is not contributing to the problem. and use upper or lowercase values only.
Here's my take on this. Obviously completed untested.
data need; set masterlist; fmtname='Seq_to_company'; type='N'; start=varnum; label= upcase(variable1); /* just one of several ways to get the count of values in masterlist not efficient*/ call symputx('numcompanies',put(_n_,best5. -L)); run; proc format cntlin=need; run; data want; set newtable; do i=1 to &numcompanies. ; if Upcase(substr(companyname, 1,3)) =: put(i,seq_to_company. -L) or index(upcase(companyname),strip(put(i,seq_to_company. -L)) ) > 0 then do; flag = (i); output; end;
end; run;
Make a format to do a look up. Loop over the values of the master list.
The above uses upper case every where so you get matches if the original values are "PDQ" in one place and "Pdq" elsewhere.
I also would not be so sure that you only get one match so this can create multiple matches. The above code only writes out the matches. If you want all the "no matches at all" then add an "Else output;" after the end of the Do i= /end; loop.
Strip is used in the Index because otherwise trailing spaces of the formatted value would prevent finding matches.
Or an SQL cartesian join between your masterlist and "newtable" with your "if" as a WHERE.
Note: Use of the same data on Set and Data statements like this means that logic error can corrupt newtable and you completely replace it. Meaning a minor syntax mistake could require rebuilding newtable from scratch.
data newtable; set newtable;
So you have two independent problems here.
The hard one is how to find examples of CompanyName1 (or any of the names) in a free text field.
So decide how you want to do that first.
The second is how to do it for multiple companies. You don't really need either a DO loop or a %DO loop for that since you already have the list in a dataset. So just use a data step to read in the list of company names and generate the code to based on the solution you come up with to your main problem.
SQL join.
proc sql;
create table fuzzymatch as
select m.varnum, m.variable1, l.companyName as matched_company
from masterlist as m
cross join newtable as l
where substr(l.companyname, 1,3) = m.variable1
or index(l.companyName, m.variable1) >0
order by 1, 2;
quit;
Or use COMPGED in your join - you'll need to play around with the filter and join conditions.
proc sql;
create table fuzzymatch as
select m.varnum, m.variable1, l.companyName as matched_company
from masterlist as m
cross join newtable as l
where compged(m.varnum, l.companyName) < 100
order by 1, 2;
quit;
Here's two solutions that I usually use as a starting point. One is SQL based, @FriedEgg solution from a long time ago.
https://communities.sas.com/t5/SAS-Procedures/Name-matching/td-p/82780
And a temporary array solution (looks for exact matches but you can modify it for your list).
https://gist.github.com/statgeek/2f733d27820f43fa37d6ba92c30f22cf
I'm not sure how global macro variables would come into play here and it's not really indicated in your code. And fairly certain it's unnecessary.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.