BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
ScoobieR
Obsidian | Level 7

hi there, I'm experienced in creating macros and manipulating arrays etc but not in using %do loops outwith a data step / proc sql so am getting stuck on this.

 

I have a data set (called masterlist) with 2 variables - varnum is sequential and variable1 has no duplicates - each entry is unique:

varnum  variable1

1              CompanyName1

2              CompanyName2

3              CompanyName3

 

etc

N             CompanyNameN

 

I want to search another table (newtable) to flag all instances of each of the values in variable1 i.e. CompanyName1, CompanyName2 --- CompanyNameN  to find how many of each record in variable1 are in newtable (which has a variable companyname).  The companyname field in newtable is free text so can be entered with capitals, shortened form (LTD instead of Limited), extra spaces, spelling mistakes etc) etc so  it is not always an exact match to what is held in the table masterlist.  I am going to do some "fuzzy" matches based on substring and index functions on key parts of the value in variable1.  I can do what I need for just one value of variable1, but how do I make the program loop through all values in variable1? In simple terms I want something like this:

 

do i=1 to (N)

data newtable;

set newtable;

if substr(companyname, 1,3) = variable1(i) or index(companyname,"variable1(i)") > 0 then flag = (i);

run;

end;

 

Each row in column "flag" would end up with lots of blanks and numbers between 1 and N (the max being the number of entries table masterlist)

and then I ultimately want to do a count of each " fuzzy match" on companyname in newtable

 

proc summary data=newtable nway missing;

class flag;

output out= etc;

run;

 

I feel I need to create global variables for varnum(i) and variable1(i) and then feed those into a do loop for the data step?

 

Any suggestions on how to code this would be great as I have not used %do before or generated multiple global macro variables via a do loop either.

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

SQL join. 

 

proc sql;
create table fuzzymatch as
select m.varnum, m.variable1, l.companyName as matched_company
from masterlist as m
cross join newtable as l
where substr(l.companyname, 1,3) = m.variable1
or index(l.companyName, m.variable1) >0
order by 1, 2;
quit;

Or use COMPGED in your join - you'll need to play around with the filter and join conditions. 

 

proc sql;
create table fuzzymatch as
select m.varnum, m.variable1, l.companyName as matched_company
from masterlist as m
cross join newtable as l
where compged(m.varnum, l.companyName) < 100
order by 1, 2;
quit;

View solution in original post

8 REPLIES 8
Quentin
Super User

It's a bit hard to follow this description. I suggest you add some actual sample data to your question (examples of both datasets, as CARDS data steps), and also add the actual code you would use to implement the logic you want for two values of CompanyName.  Don't write a %DO loop and don't write a macro.  Just post the SAS code that people could run against the sample data to create the output data you would want for CompanyName1 and CompanyName2.  That will make it easier for people to understand what you are doing, and perhaps suggest alternatives (like maybe a SQL approach).  And if a macro would be useful, then having working non-macro code is the best place to start when designing a macro.

 

The Boston Area SAS Users Group (BASUG) is hosting our in person SAS Blowout on Oct 18!
This full-day event in Cambridge, Mass features four presenters from SAS, presenting on a range of SAS 9 programming topics. Pre-registration by Oct 15 is required.
Full details and registration info at https://www.basug.org/events.
ballardw
Super User

I would be tempted to use the data set with the sequence to create a format so that the value of 1 would be the CompanyName1 value. Then iterated a counter in a data set from 1 to n looking for whatever .

 

Which of values is shorter? From your description This is likely to almost never be true:

 substr(companyname, 1,3) = variable1(i)

Assuming by "variable1(i)" you mean the ith value of variable1 in masterlist. The only time it would be true is when variable1 is exactly 3 letters. Use =: if you mean "begins with"

Also you stated you aren't sure about the quality of the data values. You likely want to make sure that letter case is not contributing to the problem. and use upper or lowercase values only.

 

Here's my take on this. Obviously completed untested.

data need;
   set masterlist;
   fmtname='Seq_to_company';
   type='N';
   start=varnum;
   label= upcase(variable1);
   /* just one of several ways to get the count of values in masterlist not efficient*/
   call symputx('numcompanies',put(_n_,best5. -L));
run;

proc format cntlin=need;
run;

data want;
   set newtable;
   do i=1 to &numcompanies. ;
      if Upcase(substr(companyname, 1,3)) =: put(i,seq_to_company. -L) or index(upcase(companyname),strip(put(i,seq_to_company. -L)) ) > 0 then do;
         flag = (i);
         output;
      end;
end; run;

Make a format to do a look up. Loop over the values of the master list.

The above uses upper case every where so you get matches if the original values are "PDQ" in one place and "Pdq" elsewhere.

I also would not be so sure that you only get one match so this can create multiple matches. The above code only writes out the matches.  If you want all the "no matches at all" then add an "Else output;" after the end of the Do i= /end; loop.

 

Strip is used in the Index because otherwise trailing spaces of the formatted value would prevent finding matches.

 

Or an SQL cartesian join between your masterlist and "newtable" with your "if" as a WHERE.

 

Note: Use of the same data on Set and Data statements like this means that logic error can corrupt newtable and you completely replace it. Meaning a minor syntax mistake could require rebuilding newtable from scratch.

data newtable;
   set newtable;
Tom
Super User Tom
Super User

So you have two independent problems here.

The hard one is how to find examples of CompanyName1 (or any of the names) in a free text field.

So decide how you want to do that first.

 

The second is how to do it for multiple companies.  You don't really need either a DO loop or a %DO loop for that since you already have the list in a dataset.  So just use a data step to read in the list of company names and generate the code to based on the solution you come up with to your main problem.

Reeza
Super User

SQL join. 

 

proc sql;
create table fuzzymatch as
select m.varnum, m.variable1, l.companyName as matched_company
from masterlist as m
cross join newtable as l
where substr(l.companyname, 1,3) = m.variable1
or index(l.companyName, m.variable1) >0
order by 1, 2;
quit;

Or use COMPGED in your join - you'll need to play around with the filter and join conditions. 

 

proc sql;
create table fuzzymatch as
select m.varnum, m.variable1, l.companyName as matched_company
from masterlist as m
cross join newtable as l
where compged(m.varnum, l.companyName) < 100
order by 1, 2;
quit;
ScoobieR
Obsidian | Level 7
hi Reeza -

Just wanted to say thank you for your solution - this worked very well. I think I was over complicating this in my head.

Sorry for slow reply- my Mum took ill 4 weeks ago and only getting back on now to say thank you.
Reeza
Super User

Here's two solutions that I usually use as a starting point. One is SQL based, @FriedEgg solution from a long time ago.

 

https://communities.sas.com/t5/SAS-Procedures/Name-matching/td-p/82780

 

And a temporary array solution (looks for exact matches but you can modify it for your list).

 

https://gist.github.com/statgeek/2f733d27820f43fa37d6ba92c30f22cf

 

I'm not sure how global macro variables would come into play here and it's not really indicated in your code. And fairly certain it's unnecessary.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1721 views
  • 9 likes
  • 5 in conversation