BookmarkSubscribeRSS Feed
aljones1816
Fluorite | Level 6

I have a dataset containing a field of codes. Sometimes, the field contains multiple codes in a single row separated by a comma. the variable looks something like this:

 

id_field:

a23g22,hh998

bg884

g9932

gh994,f99g,4jgkf,fgldf

 

I have a second dataset containing a list of IDs without any multiples on the same row, like this:

id:

bg884

g9932

gh994

f99g

4jgkf

fgldf

 

What I need to do is to check for each row in the first dataset whether the variable contains the string on any row of the second dataset's ID variable, then perform some action if there is a match. So for instance, row 4 in the first dataset contains strings matching rows 3-6 of the second. I could just merge the two lists together, except for the fact that dataset one has many cells with multiple values, so they wouldn't join.

 

My intuition was that I could pull the ids from the second dataset into a list:

 

proc sql;
select distinct id into :id_list separated by ','
from all_ids;
quit;

 

And then perhaps perform a do loop to iterate through each value of the list using the "find" function:

 

data want;

set have;

    for each value in id_list do;

        if find(id_field,'id_list value') then do;

            code here that does stuff;

        end;

    end;

run;

 

But I do not know how to structure the syntax to achieve this, or if this is even the ideal solution (both datasets are small so I am not worried about processing time).

 

11 REPLIES 11
novinosrin
Tourmaline | Level 20
data one;
input id_field $35.;
cards;
a23g22,hh998
bg884
g9932
gh994,f99g,4jgkf,fgldf
;
data two;
input id $;
cards;
bg884
g9932
gh994
f99g
4jgkf
;
/*Only matches*/
proc sql;
create table want as
select a.*,b.*
from one a inner join two b
on findw(id_field,strip(id))>0;
quit;
/*Both matches and Non matches*/
proc sql;
create table want as
select a.*,b.*
from one a left join two b
on findw(id_field,strip(id))>0;
quit;
novinosrin
Tourmaline | Level 20
data one;
input id_field $35.;
cards;
a23g22,hh998
bg884
g9932
gh994,f99g,4jgkf,fgldf
;
data two;
input id $;
cards;
bg884
g9932
gh994
f99g
4jgkf
;
data want;
if _n_=1 then do;
 if 0 then set two;
   dcl hash H (dataset:'two') ;
   h.definekey  ("id") ;
   h.definedata ("id") ;
   h.definedone () ;
   dcl hiter hi('h');
 end;
set one;
do while(hi.next()=0);
  if findw(id_field,strip(id))>0 then output;
end;
run;
aljones1816
Fluorite | Level 6

@novinosrin Thanks so much for your help! One issue I have with the left join solution is that my output "want" dataset contains more rows than the input "have" dataset. I think this is because in the rows in the "have" with multiple IDs, there are multiple matches with the second id list. Can you tell me how I can match only once?

novinosrin
Tourmaline | Level 20

Hi @aljones1816   Can you plz post your expected solution or in other words the expected output for the input sample, so we can avoid assumptions

aljones1816
Fluorite | Level 6

@novinosrin I will try. I am new at this so please excuse my poor formatting. 

 

In the "have" dataset, I only want to know if a given row matches one of the ids from dataset two. So I would want the putput to be something like:

 

id_field:                                             id

a23g22,hh998               

bg884                                           bg884

g9932                                           g9932

gh994,f99g,4jgkf,fgldf                  gh994

 

I do not want, for example, the last row to duplicate for every match, like this:

 

id_field:                                             id

a23g22,hh998               

bg884                                           bg884

g9932                                           g9932

gh994,f99g,4jgkf,fgldf                  gh994

gh994,f99g,4jgkf,fgldf                  f99g

gh994,f99g,4jgkf,fgldf                4jgkf

gh994,f99g,4jgkf,fgldf                fgldf

 

 

 

novinosrin
Tourmaline | Level 20

Hi @aljones1816 

 

/*Keep a Match flag*/
data want;
if _n_=1 then do;
 if 0 then set two;
   dcl hash H (dataset:'two') ;
   h.definekey  ("id") ;
   h.definedata ("id") ;
   h.definedone () ;
   dcl hiter hi('h');
 end;
set one;
match=0;
do while(hi.next()=0);
 if findw(id_field,strip(id))>0 then match=1;
end;
if not match then call missing(id);
run;
novinosrin
Tourmaline | Level 20

Hello @aljones1816   An improvement to the Proc SQL to meet your requirement

 

proc sql;
create table want(drop=m) as
select a.*,b.*,monotonic() as m
from one a left join two b
on findw(id_field,strip(id))>0
group by id_field
having max(m)=m;
quit;
novinosrin
Tourmaline | Level 20

Okay @aljones1816   One linear simple and safe method, 

 

data want;
if _n_=1 then do;
 if 0 then set two;
   dcl hash H (dataset:'two') ;
   h.definekey  ("id") ;
   h.definedata ("id") ;
   h.definedone () ;
 end;
set one;
call missing(id);
do _n_=1 to countw(id_field,',');
 if h.find(key:scan(id_field,_n_,','))=0 then leave;
end;
run;
r_behata
Barite | Level 11
data one;
input id_field $35.;
cards;
a23g22,hh998
bg884
g9932
gh994,f99g,4jgkf,fgldf
;
run;
data two;
input id $;
cards;
bg884
g9932
gh994
f99g
4jgkf
;
run;

data want;
if _n_=0 then set one;
	if _n_ =1 then
		do;
			declare hash h(dataset:'one');
			declare hiter iter('h');
			h.definekey('id_field');
			h.definedata(all:'y');
			h.definedone();
		end;
	set two;
	rc=iter.first();



	do while(rc eq 0);
		if find(id_field,id,'t') > 0  then output;
		
		rc=iter.next();
	end;

	keep id;
run;

aljones1816
Fluorite | Level 6

@r_behata Thanks so much for this solution! One thing I need in the output though is that a raw in the "have" dataset only matches once with a row in the id dataset, so that the output "want" doesn't have more rows than the input. Something like this:

 

id_field:                                             id

a23g22,hh998               

bg884                                           bg884

g9932                                           g9932

gh994,f99g,4jgkf,fgldf                  gh994

 

Instead of this:

 

id_field:                                             id

a23g22,hh998               

bg884                                           bg884

g9932                                           g9932

gh994,f99g,4jgkf,fgldf                  gh994

gh994,f99g,4jgkf,fgldf                  f99g

gh994,f99g,4jgkf,fgldf                  4jgkf

gh994,f99g,4jgkf,fgldf                  fgldf

 

Can you help me understand how to achieve that outcome?

hashman
Ammonite | Level 13

@aljones1816;

  1. For each record in ONE, loop through the list of the comma-delimited IDs.
  2. If any ID is in TWO, break the loop, and the ID on which the loop has stopped is your guy.
  3. Otherwise the ID value stopping the loop will be missing, which is what you need. 

In other (SAS) words:

data one ;                                                                                                                                                                                                                                                      
  input id_field $35. ;                                                                                                                                                                                                                                         
  cards ;                                                                                                                                                                                                                                                       
a23g22,hh998                                                                                                                                                                                                                                                    
bg884                                                                                                                                                                                                                                                           
not,in,two                                                                                                                                                                                                                                                      
g9932                                                                                                                                                                                                                                                           
gh994,f99g,4jgkf,fgldf                                                                                                                                                                                                                                          
;                                                                                                                                                                                                                                                               
run ;                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                
data two ;                                                                                                                                                                                                                                                      
  input id $ ;                                                                                                                                                                                                                                                  
  cards ;                                                                                                                                                                                                                                                       
bg884                                                                                                                                                                                                                                                           
g9932                                                                                                                                                                                                                                                           
gh994                                                                                                                                                                                                                                                           
f99g                                                                                                                                                                                                                                                            
4jgkf                                                                                                                                                                                                                                                           
;                                                                                                                                                                                                                                                               
run ;                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                
data want ;                                                                                                                                                                                                                                                     
  if _n_ = 1 then do ;                                                                                                                                                                                                                                          
    dcl hash h (dataset:"two") ;                                                                                                                                                                                                                                
    h.definekey ("id") ;                                                                                                                                                                                                                                        
    h.definedone () ;                                                                                                                                                                                                                                           
  end ;                                                                                                                                                                                                                                                         
  set one two (obs=0) ;                                                                                                                                                                                                                                         
  do _n_ = 1 by 1 until (h.check() = 0 or cmiss (id)) ;                                                                                                                                                                                                         
    id = scan (id_field, _n_) ;                                                                                                                                                                                                                                 
  end ;                                                                                                                                                                                                                                                         
run ;                     

By the nature of the algorithm, the number of output rows will equal that in the input.

 

Kind regards,

Paul D.

hackathon24-white-horiz.png

The 2025 SAS Hackathon Kicks Off on June 11!

Watch the live Hackathon Kickoff to get all the essential information about the SAS Hackathon—including how to join, how to participate, and expert tips for success.

YouTube LinkedIn

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 11 replies
  • 10384 views
  • 3 likes
  • 4 in conversation