Esteemed Advisors:
I am trying to interleave two datasets with a condition that the resulting dataset contains only observations that can be found in both of the two datasets.
Below is exemplar code to illustrate the problem. If you run this code and inspect dataset interleave2 you will see that for a group of 3 observations where target=1, two came from Random_A and one came from Random_B. Likewise, for a group of three observations where target=2, two came from Random_B and one came from Random_A. All of these observations need to be retained in the desired dataset.
For the group of 3 observations where target=3, all observations came from Random_B only. These are ones that need to be omitted. All observations for a given target that come from a single source dataset are not to be retained in the desired dataset.
The challenge for me (and now for you) is to come up with the code that will interleave Random_A and Random_B such that the resultant dataset that only contains the groups of targets that are present in both datasets.
Hope this makes sense and thanks for taking a look,
Gene
data Random_A (drop=i);
call streaminit(4786);
do i=1 to 100;
Source="A";
Target=rand("Integer",1,100);
ST=catx('/',Source,Target);
output;
end;
data Random_B (drop=i);
call streaminit(6874);
do i=1 to 150;
Source="B";
Target=rand("Integer",1,100);
ST=catx('/',Source,Target);
output;
end;
Proc sort data=Random_A;
by ST;
run;
Proc sort data=Random_B;
by ST;
run;
data interleave1;
set random_A random_B;
by ST;
run;
proc sort data=interleave1 out=interleave2 nounikey;
by target;
run;
If I understood you correctly, for each value of Target you want observations from A and B only if that Target value exists in both.
If so, then try this:
Proc sort data=Random_A equals;
by Target;
run;
Proc sort data=Random_B equals;
by Target;
run;
data interleave1;
do _N_=1 by 1 until(last.Target);
set random_A(in=ina) random_B(in=inb);
by Target;
N_A+ina;
N_B+inb;
end;
do _N_=1 to _N_;
set random_A random_B curobs=curobs1;
by Target;
if N_A and N_B then output;
end;
call missing(N_A,N_B);
run;
Bart
Below one way how this could work.
Interleaving data is slower than concatenating. If you need the result sorted then I'd be doing this after combining the tables. Depending on how many source rows get dropped this also leads to less rows in total that need sorting.
data Random_A (drop=i);
call streaminit(4786);
do i=1 to 100;
Source="A";
Target=rand("Integer",1,100);
ST=catx('/',Source,Target);
output;
end;
data Random_B (drop=i);
call streaminit(6874);
do i=1 to 150;
Source="B";
Target=rand("Integer",1,100);
ST=catx('/',Source,Target);
output;
end;
proc sql;
create view work.common_vals as
select l.target
from Random_A l inner join Random_B r
on l.target=r.target
;
quit;
data inter;
set random_A random_B;
if _n_=1 then
do;
dcl hash h1(dataset:"work.common_vals");
h1.defineKey('Target');
h1.defineDone();
end;
if h1.check()=0;
run;
proc sort data=inter out=want;
by st;
run;
If you have a bigger data volume with repeated values for target in both tables then you could use below SQL alternative to avoid a many:many join that creates a lot of rows.
proc sql;
create view work.common_vals as
select l.target
from (select distinct target from Random_A) l inner join Random_B r
on l.target=r.target
;
quit;
If I understood you correctly, for each value of Target you want observations from A and B only if that Target value exists in both.
If so, then try this:
Proc sort data=Random_A equals;
by Target;
run;
Proc sort data=Random_B equals;
by Target;
run;
data interleave1;
do _N_=1 by 1 until(last.Target);
set random_A(in=ina) random_B(in=inb);
by Target;
N_A+ina;
N_B+inb;
end;
do _N_=1 to _N_;
set random_A random_B curobs=curobs1;
by Target;
if N_A and N_B then output;
end;
call missing(N_A,N_B);
run;
Bart
A small tweak for the sake of efficiency:
set random_A(in=ina keep=target) random_B(in=inb keep=target);
I think it won't help much since the other(second) DoW-loop takes all data in the end, and SAS will likely caches the data in memory to save some I/Os after the first loop. So even if the first DoW-loop "narrows" data, the second will have to do the "missing" I/O.
A minor simplification of @yabwon solution:
proc sort data=Random_A equals;
by target;
run;
proc sort data=Random_B equals;
by target;
run;
data want;
merge random_a (in=ina) random_b (in=inb) ;
by target;
if last.target=1 then do until (last.target);
set random_a random_b;
by target;
if ina=1 and inb=1 then output;
end;
run;
data Random_A (drop=i);
call streaminit(4786);
do i=1 to 100;
Source="A";
Target=rand("Integer",1,100);
ST=catx('/',Source,Target);
output;
end;
run;
data Random_B (drop=i);
call streaminit(6874);
do i=1 to 150;
Source="B";
Target=rand("Integer",1,100);
ST=catx('/',Source,Target);
output;
end;
run;
data temp;
set Random_A Random_B indsname=indsname;
dsn=indsname;
run;
proc sql;
create table want as
select * from temp group by Target having count(distinct dsn)=2;
quit;
If original data order is important, then you can avoid sorting:
data want;
set random_a (in=ina) random_b (in=inb);
if _n_=1 then do;
declare hash found_in_b (dataset:'random_b (keep=target)');
found_in_b.definekey('target');
found_in_b.definedone();
declare hash found_in_a ();
found_in_a.definekey('target');
found_in_a.definedone();
end;
if ina=1 and found_in_a.check()^=0 then found_in_a.add();
if (ina=1 and found_in_b.check()=0)
or
(inb=1 and found_in_a.check()=0);
run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.