
Conditional interleaving of two datasets?

genemroz — Thu, 16 May 2024 04:01:25 GMT

Esteemed Advisors:

I am trying to interleave two datasets with the condition that the resulting dataset contains only observations whose Target value can be found in both of the two datasets.

Below is exemplar code to illustrate the problem. If you run this code and inspect dataset interleave2 you will see that for a group of 3 observations where target=1, two came from Random_A and one came from Random_B. Likewise, for a group of three observations where target=2, two came from Random_B and one came from Random_A. All of these observations need to be retained in the desired dataset.

For the group of 3 observations where target=3, all observations came from Random_B only. These are ones that need to be omitted. All observations for a given target that come from a single source dataset are not to be retained in the desired dataset.

The challenge for me (and now for you) is to come up with code that will interleave Random_A and Random_B such that the resultant dataset contains only the groups of targets that are present in both datasets.

Hope this makes sense and thanks for taking a look,

Gene

data Random_A (drop=i);
  call streaminit(4786);
  do i=1 to 100;
    Source="A";
    Target=rand("Integer",1,100);
    ST=catx('/',Source,Target);
    output;
  end;
run;

data Random_B (drop=i);
  call streaminit(6874);
  do i=1 to 150;
    Source="B";
    Target=rand("Integer",1,100);
    ST=catx('/',Source,Target);
    output;
  end;
run;

Proc sort data=Random_A;
by ST;
run;

Proc sort data=Random_B;
by ST;
run;

data interleave1;
set random_A random_B;
by ST;
run;

proc sort data=interleave1 out=interleave2 nounikey;
by target;
run;

Re: Conditional interleaving of two datasets?

Patrick — Thu, 16 May 2024 05:51:27 GMT

Below is one way this could work.

Interleaving data is slower than concatenating. If you need the result sorted, then I'd do the sort after combining the tables. Depending on how many source rows get dropped, this also leaves fewer rows in total that need sorting.

data Random_A (drop=i);
  call streaminit(4786);
  do i=1 to 100;
    Source="A";
    Target=rand("Integer",1,100);
    ST=catx('/',Source,Target);
    output;
  end;

data Random_B (drop=i);
  call streaminit(6874);
  do i=1 to 150;
    Source="B";
    Target=rand("Integer",1,100);
    ST=catx('/',Source,Target);
    output;
  end;

proc sql;
  create view work.common_vals as
  select l.target
  from Random_A l inner join Random_B r
    on l.target=r.target
  ;
quit;

data inter;
  set random_A random_B;
  if _n_=1 then
    do;
      dcl hash h1(dataset:"work.common_vals");
      h1.defineKey('Target');
      h1.defineDone();
    end;
  if h1.check()=0;
run;

proc sort data=inter out=want;
  by st;
run;

If you have a bigger data volume with repeated values for target in both tables, then you could use the SQL alternative below to avoid a many:many join that creates a lot of rows.

proc sql;
  create view work.common_vals as
  select l.target
  from (select distinct target from Random_A) l inner join Random_B r
    on l.target=r.target
  ;
quit;
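
Another possible variation, sketched here on the assumption that the hash lookup only needs the distinct Target values: the INTERSECT set operator in PROC SQL returns one row per value that occurs in both queries, so no join is involved at all. The view name work.common_vals is reused so that the data step above could load it unchanged.

```sas
/* Sketch: one row per Target value that occurs in both tables.   */
/* INTERSECT (without ALL) removes duplicate rows from the result. */
proc sql;
  create view work.common_vals as
  select target from Random_A
  intersect
  select target from Random_B
  ;
quit;
```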

Re: Conditional interleaving of two datasets?

yabwon — Thu, 16 May 2024 06:30:54 GMT

If I understood you correctly, for each value of Target you want observations from A and B only if that Target value exists in both.

If so, then try this:

Proc sort data=Random_A equals;
by Target;
run;

Proc sort data=Random_B equals;
by Target;
run;


data interleave1;
  do _N_=1 by 1 until(last.Target);
    set random_A(in=ina) random_B(in=inb);
    by Target;
    N_A+ina;
    N_B+inb;
  end;

  do _N_=1 to _N_;
    set random_A random_B curobs=curobs1;
    by Target;
    if N_A and N_B then output;
  end; 

  call missing(N_A,N_B);
run;

Bart

Re: Conditional interleaving of two datasets?

Astounding — Thu, 16 May 2024 11:16:12 GMT

A small tweak for the sake of efficiency:

    set random_A(in=ina keep=target) random_B(in=inb keep=target);
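
For reference, a sketch of the complete step with this tweak applied to the first loop only. It assumes Random_A and Random_B are already sorted by Target, as in the step it modifies. The DROP statement and the omission of the CUROBS= option are additions for this illustration and are not part of the posted code.

```sas
data interleave1;
  /* First pass over the BY group: count contributions from A and B. */
  /* Only Target is read here, since no other variable is needed.   */
  do _N_=1 by 1 until(last.Target);
    set random_A(in=ina keep=target) random_B(in=inb keep=target);
    by Target;
    N_A+ina;
    N_B+inb;
  end;

  /* Second pass over the same BY group: read all variables and */
  /* output only if both data sets contributed to the group.    */
  do _N_=1 to _N_;
    set random_A random_B;
    by Target;
    if N_A and N_B then output;
  end;

  call missing(N_A,N_B);
  drop N_A N_B;  /* assumption: the counters are not wanted in the output */
run;
```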

Re: Conditional interleaving of two datasets?

yabwon — Thu, 16 May 2024 12:00:23 GMT

I don't think it will help much, since the second DoW-loop reads all the data anyway, and SAS will likely cache the data in memory after the first loop to save some I/O. So even if the first DoW-loop "narrows" the data, the second one will have to do the "missing" I/O.

Re: Conditional interleaving of two datasets?

genemroz — Thu, 16 May 2024 19:07:45 GMT

Both solutions proposed by Yabwon and Patrick were successful. I marked Yabwon's as accepted because I'm not familiar with hash objects.
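
One way to confirm that a result meets the requirement is to look for Target groups that contain only one Source value. The query below is a minimal sketch, assuming the result data set is named WANT (substitute interleave1 or whichever name was used). If the result is correct, it should list no rows.

```sas
/* Sketch of a sanity check: list any Target group that does not */
/* contain observations from both sources.                       */
proc sql;
  select target, count(distinct source) as n_sources
  from want
  group by target
  having count(distinct source) < 2
  ;
quit;
```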

Re: Conditional interleaving of two datasets?

Ksharp — Fri, 17 May 2024 01:05:37 GMT

data Random_A (drop=i);
call streaminit(4786);
do i=1 to 100;
Source="A";
Target=rand("Integer",1,100);
ST=catx('/',Source,Target);
output;
end;
run;
data Random_B (drop=i);
call streaminit(6874);
do i=1 to 150;
Source="B";
Target=rand("Integer",1,100);
ST=catx('/',Source,Target);
output;
end;
run;

data temp;
 set Random_A Random_B indsname=indsname;
 dsn=indsname;
run;
proc sql;
create table want as
select * from temp group by Target having count(distinct dsn)=2;
quit;

Re: Conditional interleaving of two datasets?

mkeintz — Fri, 17 May 2024 01:23:48 GMT

If original data order is important, then you can avoid sorting:

data want;
  set random_a (in=ina) random_b (in=inb);

  if _n_=1 then do;
    declare hash found_in_b (dataset:'random_b (keep=target)');
      found_in_b.definekey('target');
      found_in_b.definedone();
    declare hash found_in_a ();
      found_in_a.definekey('target');
      found_in_a.definedone();
  end;

  if ina=1 and found_in_a.check()^=0 then found_in_a.add();

  if (ina=1 and found_in_b.check()=0)
     or 
     (inb=1 and found_in_a.check()=0);
run;

Re: Conditional interleaving of two datasets?

mkeintz — Fri, 17 May 2024 01:46:32 GMT

A minor simplification of @yabwon's solution:

proc sort data=Random_A equals;
  by target;
run;

proc sort data=Random_B equals;
  by target;
run;

data want;
  merge random_a (in=ina) random_b (in=inb) ;
  by target;

  if last.target=1 then do until (last.target);
    set random_a  random_b;
    by target;
    if ina=1 and inb=1 then output;
  end;
run;