I have a dataset containing a field of codes. Sometimes, the field contains multiple codes in a single row separated by a comma. the variable looks something like this:
id_field:
a23g22,hh998
bg884
g9932
gh994,f99g,4jgkf,fgldf
I have a second dataset containing a list of IDs without any multiples on the same row, like this:
id:
bg884
g9932
gh994
f99g
4jgkf
fgldf
What I need to do is to check for each row in the first dataset whether the variable contains the string on any row of the second dataset's ID variable, then perform some action if there is a match. So for instance, row 4 in the first dataset contains strings matching rows 3-6 of the second. I could just merge the two lists together, except for the fact that dataset one has many cells with multiple values, so they wouldn't join.
My intuition was that I could pull the ids from the second dataset into a list:
proc sql;
select distinct id into :id_list separated by ','
from all_ids;
quit;
And then perhaps perform a do loop to iterate through each value of the list using the "find" function:
data want;
set have;
for each value in id_list do;
if find(id_field,'id_list value') then do;
code here that does stuff;
end;
end;
run;
But I do not know how to structure the syntax to achieve this, or if this is even the ideal solution (both datasets are small so I am not worried about processing time).
data one;
input id_field $35.;
cards;
a23g22,hh998
bg884
g9932
gh994,f99g,4jgkf,fgldf
;
data two;
input id $;
cards;
bg884
g9932
gh994
f99g
4jgkf
;
/*Only matches*/
proc sql;
create table want as
select a.*,b.*
from one a inner join two b
on findw(id_field,strip(id))>0;
quit;
/*Both matches and Non matches*/
proc sql;
create table want as
select a.*,b.*
from one a left join two b
on findw(id_field,strip(id))>0;
quit;
data one;
input id_field $35.;
cards;
a23g22,hh998
bg884
g9932
gh994,f99g,4jgkf,fgldf
;
data two;
input id $;
cards;
bg884
g9932
gh994
f99g
4jgkf
;
data want;
if _n_=1 then do;
if 0 then set two;
dcl hash H (dataset:'two') ;
h.definekey ("id") ;
h.definedata ("id") ;
h.definedone () ;
dcl hiter hi('h');
end;
set one;
do while(hi.next()=0);
if findw(id_field,strip(id))>0 then output;
end;
run;
@novinosrin Thanks so much for your help! One issue I have with the left join solution is that my output "want" dataset contains more rows than the input "have" dataset. I think this is because in the rows in the "have" with multiple IDs, there are multiple matches with the second id list. Can you tell me how I can match only once?
Hi @aljones1816 Can you plz post your expected solution or in other words the expected output for the input sample, so we can avoid assumptions
@novinosrin I will try. I am new at this so please excuse my poor formatting.
In the "have" dataset, I only want to know if a given row matches one of the ids from dataset two. So I would want the putput to be something like:
id_field: id
a23g22,hh998
bg884 bg884
g9932 g9932
gh994,f99g,4jgkf,fgldf gh994
I do not want, for example, the last row to duplicate for every match, like this:
id_field: id
a23g22,hh998
bg884 bg884
g9932 g9932
gh994,f99g,4jgkf,fgldf gh994
gh994,f99g,4jgkf,fgldf f99g
gh994,f99g,4jgkf,fgldf 4jgkf
gh994,f99g,4jgkf,fgldf fgldf
Hi @aljones1816
/*Keep a Match flag*/
data want;
if _n_=1 then do;
if 0 then set two;
dcl hash H (dataset:'two') ;
h.definekey ("id") ;
h.definedata ("id") ;
h.definedone () ;
dcl hiter hi('h');
end;
set one;
match=0;
do while(hi.next()=0);
if findw(id_field,strip(id))>0 then match=1;
end;
if not match then call missing(id);
run;
Hello @aljones1816 An improvement to the Proc SQL to meet your requirement
proc sql;
create table want(drop=m) as
select a.*,b.*,monotonic() as m
from one a left join two b
on findw(id_field,strip(id))>0
group by id_field
having max(m)=m;
quit;
Okay @aljones1816 One linear simple and safe method,
data want;
if _n_=1 then do;
if 0 then set two;
dcl hash H (dataset:'two') ;
h.definekey ("id") ;
h.definedata ("id") ;
h.definedone () ;
end;
set one;
call missing(id);
do _n_=1 to countw(id_field,',');
if h.find(key:scan(id_field,_n_,','))=0 then leave;
end;
run;
data one;
input id_field $35.;
cards;
a23g22,hh998
bg884
g9932
gh994,f99g,4jgkf,fgldf
;
run;
data two;
input id $;
cards;
bg884
g9932
gh994
f99g
4jgkf
;
run;
data want;
if _n_=0 then set one;
if _n_ =1 then
do;
declare hash h(dataset:'one');
declare hiter iter('h');
h.definekey('id_field');
h.definedata(all:'y');
h.definedone();
end;
set two;
rc=iter.first();
do while(rc eq 0);
if find(id_field,id,'t') > 0 then output;
rc=iter.next();
end;
keep id;
run;
@r_behata Thanks so much for this solution! One thing I need in the output though is that a raw in the "have" dataset only matches once with a row in the id dataset, so that the output "want" doesn't have more rows than the input. Something like this:
id_field: id
a23g22,hh998
bg884 bg884
g9932 g9932
gh994,f99g,4jgkf,fgldf gh994
Instead of this:
id_field: id
a23g22,hh998
bg884 bg884
g9932 g9932
gh994,f99g,4jgkf,fgldf gh994
gh994,f99g,4jgkf,fgldf f99g
gh994,f99g,4jgkf,fgldf 4jgkf
gh994,f99g,4jgkf,fgldf fgldf
Can you help me understand how to achieve that outcome?
In other (SAS) words:
data one ; input id_field $35. ; cards ; a23g22,hh998 bg884 not,in,two g9932 gh994,f99g,4jgkf,fgldf ; run ; data two ; input id $ ; cards ; bg884 g9932 gh994 f99g 4jgkf ; run ; data want ; if _n_ = 1 then do ; dcl hash h (dataset:"two") ; h.definekey ("id") ; h.definedone () ; end ; set one two (obs=0) ; do _n_ = 1 by 1 until (h.check() = 0 or cmiss (id)) ; id = scan (id_field, _n_) ; end ; run ;
By the nature of the algorithm, the number of output rows will equal that in the input.
Kind regards,
Paul D.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.