Check whether a dataset variable value contains any string from a list

aljones1816 · Posted 11-08-2019 03:52 PM

I have a dataset containing a field of codes. Sometimes, the field contains multiple codes in a single row separated by a comma. the variable looks something like this:

id_field:

a23g22,hh998

bg884

g9932

gh994,f99g,4jgkf,fgldf

I have a second dataset containing a list of IDs without any multiples on the same row, like this:

id:

bg884

g9932

gh994

f99g

4jgkf

fgldf

What I need to do is to check for each row in the first dataset whether the variable contains the string on any row of the second dataset's ID variable, then perform some action if there is a match. So for instance, row 4 in the first dataset contains strings matching rows 3-6 of the second. I could just merge the two lists together, except for the fact that dataset one has many cells with multiple values, so they wouldn't join.

My intuition was that I could pull the ids from the second dataset into a list:

proc sql;
select distinct id into :id_list separated by ','
from all_ids;
quit;

And then perhaps perform a do loop to iterate through each value of the list using the "find" function:

data want;

set have;

for each value in id_list do;

if find(id_field,'id_list value') then do;

code here that does stuff;

end;

run;

But I do not know how to structure the syntax to achieve this, or if this is even the ideal solution (both datasets are small so I am not worried about processing time).

novinosrin · Posted 11-08-2019 04:01 PM

data one;
input id_field $35.;
cards;
a23g22,hh998
bg884
g9932
gh994,f99g,4jgkf,fgldf
;
data two;
input id $;
cards;
bg884
g9932
gh994
f99g
4jgkf
;
/*Only matches*/
proc sql;
create table want as
select a.*,b.*
from one a inner join two b
on findw(id_field,strip(id))>0;
quit;
/*Both matches and Non matches*/
proc sql;
create table want as
select a.*,b.*
from one a left join two b
on findw(id_field,strip(id))>0;
quit;

novinosrin · Posted 11-08-2019 04:07 PM

data one;
input id_field $35.;
cards;
a23g22,hh998
bg884
g9932
gh994,f99g,4jgkf,fgldf
;
data two;
input id $;
cards;
bg884
g9932
gh994
f99g
4jgkf
;
data want;
if _n_=1 then do;
 if 0 then set two;
   dcl hash H (dataset:'two') ;
   h.definekey  ("id") ;
   h.definedata ("id") ;
   h.definedone () ;
   dcl hiter hi('h');
 end;
set one;
do while(hi.next()=0);
  if findw(id_field,strip(id))>0 then output;
end;
run;

aljones1816 · Posted 11-08-2019 04:24 PM

@novinosrin Thanks so much for your help! One issue I have with the left join solution is that my output "want" dataset contains more rows than the input "have" dataset. I think this is because in the rows in the "have" with multiple IDs, there are multiple matches with the second id list. Can you tell me how I can match only once?

novinosrin · Posted 11-08-2019 04:26 PM

Hi @aljones1816 Can you plz post your expected solution or in other words the expected output for the input sample, so we can avoid assumptions

aljones1816 · Posted 11-08-2019 04:32 PM

@novinosrin I will try. I am new at this so please excuse my poor formatting.

In the "have" dataset, I only want to know if a given row matches one of the ids from dataset two. So I would want the putput to be something like:

id_field: id

a23g22,hh998

bg884 bg884

g9932 g9932

gh994,f99g,4jgkf,fgldf gh994

I do not want, for example, the last row to duplicate for every match, like this:

id_field: id

a23g22,hh998

bg884 bg884

g9932 g9932

gh994,f99g,4jgkf,fgldf gh994

gh994,f99g,4jgkf,fgldf f99g

gh994,f99g,4jgkf,fgldf 4jgkf

gh994,f99g,4jgkf,fgldf fgldf

novinosrin · Posted 11-08-2019 05:01 PM

Hi @aljones1816

/*Keep a Match flag*/
data want;
if _n_=1 then do;
 if 0 then set two;
   dcl hash H (dataset:'two') ;
   h.definekey  ("id") ;
   h.definedata ("id") ;
   h.definedone () ;
   dcl hiter hi('h');
 end;
set one;
match=0;
do while(hi.next()=0);
 if findw(id_field,strip(id))>0 then match=1;
end;
if not match then call missing(id);
run;

novinosrin · Posted 11-08-2019 05:13 PM

Hello @aljones1816 An improvement to the Proc SQL to meet your requirement

proc sql;
create table want(drop=m) as
select a.*,b.*,monotonic() as m
from one a left join two b
on findw(id_field,strip(id))>0
group by id_field
having max(m)=m;
quit;

novinosrin · Posted 11-08-2019 05:19 PM

Okay @aljones1816 One linear simple and safe method,

data want;
if _n_=1 then do;
 if 0 then set two;
   dcl hash H (dataset:'two') ;
   h.definekey  ("id") ;
   h.definedata ("id") ;
   h.definedone () ;
 end;
set one;
call missing(id);
do _n_=1 to countw(id_field,',');
 if h.find(key:scan(id_field,_n_,','))=0 then leave;
end;
run;

r_behata · Posted 11-08-2019 04:38 PM

data one;
input id_field $35.;
cards;
a23g22,hh998
bg884
g9932
gh994,f99g,4jgkf,fgldf
;
run;
data two;
input id $;
cards;
bg884
g9932
gh994
f99g
4jgkf
;
run;

data want;
if _n_=0 then set one;
	if _n_ =1 then
		do;
			declare hash h(dataset:'one');
			declare hiter iter('h');
			h.definekey('id_field');
			h.definedata(all:'y');
			h.definedone();
		end;
	set two;
	rc=iter.first();



	do while(rc eq 0);
		if find(id_field,id,'t') > 0  then output;
		
		rc=iter.next();
	end;

	keep id;
run;

aljones1816 · Posted 11-08-2019 04:48 PM

@r_behata Thanks so much for this solution! One thing I need in the output though is that a raw in the "have" dataset only matches once with a row in the id dataset, so that the output "want" doesn't have more rows than the input. Something like this:

id_field: id

a23g22,hh998

bg884 bg884

g9932 g9932

gh994,f99g,4jgkf,fgldf gh994

Instead of this:

id_field: id

a23g22,hh998

bg884 bg884

g9932 g9932

gh994,f99g,4jgkf,fgldf gh994

gh994,f99g,4jgkf,fgldf f99g

gh994,f99g,4jgkf,fgldf 4jgkf

gh994,f99g,4jgkf,fgldf fgldf

Can you help me understand how to achieve that outcome?

hashman · Posted 11-08-2019 10:59 PM

@aljones1816;

For each record in ONE, loop through the list of the comma-delimited IDs.
If any ID is in TWO, break the loop, and the ID on which the loop has stopped is your guy.
Otherwise the ID value stopping the loop will be missing, which is what you need.

In other (SAS) words:

data one ;                                                                                                                                                                                                                                                      
  input id_field $35. ;                                                                                                                                                                                                                                         
  cards ;                                                                                                                                                                                                                                                       
a23g22,hh998                                                                                                                                                                                                                                                    
bg884                                                                                                                                                                                                                                                           
not,in,two                                                                                                                                                                                                                                                      
g9932                                                                                                                                                                                                                                                           
gh994,f99g,4jgkf,fgldf                                                                                                                                                                                                                                          
;                                                                                                                                                                                                                                                               
run ;                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                
data two ;                                                                                                                                                                                                                                                      
  input id $ ;                                                                                                                                                                                                                                                  
  cards ;                                                                                                                                                                                                                                                       
bg884                                                                                                                                                                                                                                                           
g9932                                                                                                                                                                                                                                                           
gh994                                                                                                                                                                                                                                                           
f99g                                                                                                                                                                                                                                                            
4jgkf                                                                                                                                                                                                                                                           
;                                                                                                                                                                                                                                                               
run ;                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                
data want ;                                                                                                                                                                                                                                                     
  if _n_ = 1 then do ;                                                                                                                                                                                                                                          
    dcl hash h (dataset:"two") ;                                                                                                                                                                                                                                
    h.definekey ("id") ;                                                                                                                                                                                                                                        
    h.definedone () ;                                                                                                                                                                                                                                           
  end ;                                                                                                                                                                                                                                                         
  set one two (obs=0) ;                                                                                                                                                                                                                                         
  do _n_ = 1 by 1 until (h.check() = 0 or cmiss (id)) ;                                                                                                                                                                                                         
    id = scan (id_field, _n_) ;                                                                                                                                                                                                                                 
  end ;                                                                                                                                                                                                                                                         
run ;

By the nature of the algorithm, the number of output rows will equal that in the input.

Kind regards,

Paul D.

Check whether a dataset variable value contains any string from a list

Re: Check whether a dataset variable value contains any string from a list

Re: Check whether a dataset variable value contains any string from a list

Re: Check whether a dataset variable value contains any string from a list

Re: Check whether a dataset variable value contains any string from a list

Re: Check whether a dataset variable value contains any string from a list

Re: Check whether a dataset variable value contains any string from a list

Re: Check whether a dataset variable value contains any string from a list

Re: Check whether a dataset variable value contains any string from a list

Re: Check whether a dataset variable value contains any string from a list

Re: Check whether a dataset variable value contains any string from a list

Re: Check whether a dataset variable value contains any string from a list

Registration is open

Registration is open

SAS Training: Just a Click Away