BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
kyle234
Obsidian | Level 7

I want to compare the variables name A to Name B. Name a is a concatenation separated by commas of multiple names from an A file and Name B is a concatenation separated by commas of multiple name from a B file. I want to compare them so that if there are any matches between any of the non missing names within NAMEA to any of the non missing names in NAMEB it will output as a match indicated by 1. I'm wondering if an array with a do loop would somehow work. Thank you!

 

Original File

 

 

ID    Name_A          Name B                 Match_Flag

1      MILLIE,MIL     MIL                         1

2     BILL,BOB,,       BOB,,                     1

3     KYLE,,              BILL,BROWN      0

4     ,,                         ,,                             0

 

1 ACCEPTED SOLUTION

Accepted Solutions
s_lassen
Meteorite | Level 14

An array is not necessary. If your data looks like this

data have;
   infile datalines dlm='|' dsd;
   input id Name_a:$25. Name_b:$25.;  
datalines;
1|MILLIE,MIL|MIL   
2|BILL,BOB,|BOB,,
3|KYLE|BILL,BROWN
4|,,|,,
;run;

this code seems to do the trick:

data want;
  set have;
  match_flag=0;
  do _N_=1 to countw(Name_a,',') while(match_flag=0);
    word=scan(Name_a,_N_,',');
    if word=' ' then continue;
    if indexw(trim(Name_b),trim(word),',') then
      match_flag=1;
    end;
  drop word;
run;

View solution in original post

6 REPLIES 6
ballardw
Super User

First step: fix stupid data layout so you only have one "name" per observation.

 

What is the output actually supposed to look like. Your side-side-side of Name_a and Name_b doesn't make any sense.

 

Here is my take, starting with making one Name per observation for use in comparisons:

data work.Name_a;
   infile datalines dlm='|' ;
   input id Name_a:$25. ;  
datalines;
1|MILLIE,MIL    
2|BILL,BOB,
3|KYLE
;
data work.Name_b;
   infile datalines dlm='|' ;
   input Name_b:$25. ;  
datalines;
MIL    
BOB
BILL,BROWN 
;
 
data work.useableName_a;
   set work.name_a;
   length name $ 20. ;
   do i= 1 to countw(name_a,',');
      name=scan(name_a,i,',');
      output;
   end;
run;
data work.useableName_b;
   set work.name_b;
   length name $ 20. ;
   do i= 1 to countw(name_b,',');
      name=scan(name_b,i,',');
      output;
   end;
run;
proc sort data=work.useableName_a;
   by name;
run;
proc sort data=work.useableName_b;
   by name;
run;


data work.combined;
   merge work.useableName_a (in=inA)
         work.useableName_b (in=inB)
   ;
   by name ;
   match = (ina and inb);
run;

proc summary data=work.combined nway;
   class id name_a;
   var match;
   output out=work.want(drop=_:) max=;
run;

If there is any difference of case between any of the values I would suggest UPCASE the Name variable in both of the useablename data sets.

 

Two minor tricks involved:

The data set option IN= creates a variable that is 1/0 valued indicating that the current observation is from the set. So my code creates Match as 1 when observations from both sets have the same value of Name.

The proc summary step is to remove any of the duplicates by selecting the largest value of the match variable.

The output data set from summary by default would include two variables, _type_ and _freq_ that indicate the combinations of the class variables, _type_ , and the number of records for the specific combination of class variables, _freq_. These aren't useful for this project so drop them. The NWAY option means that only the largest value of _type_ combinations of class variables would appear in the output data set.

 

IF your data sets are "large" enough the CLASS statement in summary may use too much memory. If that happens then sort the Combined data by Id and Name_A and use BY instead of Class.

kyle234
Obsidian | Level 7

Thank you so much. So the issue is that I have multiple similar comparisons I need to do. For example FileA has 11 last names and I need to compare to fileB which has 5 last names. I want code that is dynamic and can do all the combinations between them. I was hoping there was a way to put them in one variable so i can just compare the variables also the number or last names in one file seems to always be changing. I was wondering if there is some way to get it to loop through all the combinations between these different variables. 

ballardw
Super User

@kyle234 wrote:

Thank you so much. So the issue is that I have multiple similar comparisons I need to do. For example FileA has 11 last names and I need to compare to fileB which has 5 last names. I want code that is dynamic and can do all the combinations between them. I was hoping there was a way to put them in one variable so i can just compare the variables also the number or last names in one file seems to always be changing. I was wondering if there is some way to get it to loop through all the combinations between these different variables. 


Comparison of multiple values in one variable is way harder and more prone to error. The code I show is dynamic and does loop search through all the values IF the very limited example looks like your data values. If not, then provide an example of each file.

 

Note that when you have 11 items in one variable and 5 in the other that means you will have to do 55 tests. And if you mean by "all combinations" compare all the observations in one set with all the observations of the other set then 100 observations in one and 100 in the other means 10,000 observation combinations that need to be tested.

 

Without better example data can't come up with much else, though perhaps the sort on the FileB could use the NODUPEKEY so duplicates of the names are even checked. (Lots of last names SMITH or GARCIA so no need to check more than once)

Ksharp
Super User
data have;
infile cards truncover expandtabs dlm=' ';
input ID    Name_A  :$40.        Name_B  :$40. ;
cards;
1      MILLIE,MIL     MIL                        
2     BILL,BOB,,       BOB,,                    
3     KYLE,,              BILL,BROWN  
4     ,,                         ,,   
;

data want;
 set have;
 Match_Flag=0;
 do i=1 to countw(name_a,','); 
   if findw(name_b,scan(name_a,i,','),',','it') then do;Match_Flag=1;leave;end;
 end;
 drop i;
run;
s_lassen
Meteorite | Level 14

An array is not necessary. If your data looks like this

data have;
   infile datalines dlm='|' dsd;
   input id Name_a:$25. Name_b:$25.;  
datalines;
1|MILLIE,MIL|MIL   
2|BILL,BOB,|BOB,,
3|KYLE|BILL,BROWN
4|,,|,,
;run;

this code seems to do the trick:

data want;
  set have;
  match_flag=0;
  do _N_=1 to countw(Name_a,',') while(match_flag=0);
    word=scan(Name_a,_N_,',');
    if word=' ' then continue;
    if indexw(trim(Name_b),trim(word),',') then
      match_flag=1;
    end;
  drop word;
run;

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 419 views
  • 4 likes
  • 4 in conversation