Solved: Conditional interleaving of two datasets?

genemroz

Esteemed Advisors:

I am trying to interleave two datasets with a condition that the resulting dataset contains only observations that can be found in both of the two datasets.

Below is exemplar code to illustrate the problem. If you run this code and inspect dataset interleave2 you will see that for a group of 3 observations where target=1, two came from Random_A and one came from Random_B. Likewise, for a group of three observations where target=2, two came from Random_B and one came from Random_A. All of these observations need to be retained in the desired dataset.

For the group of 3 observations where target=3, all observations came from Random_B only. These are ones that need to be omitted. All observations for a given target that come from a single source dataset are not to be retained in the desired dataset.

The challenge for me (and now for you) is to come up with the code that will interleave Random_A and Random_B such that the resultant dataset that only contains the groups of targets that are present in both datasets.

Hope this makes sense and thanks for taking a look,

Gene

data Random_A (drop=i);
call streaminit(4786);
do i=1 to 100;
Source="A";
Target=rand("Integer",1,100);
ST=catx('/',Source,Target);
output;
end;

data Random_B (drop=i);
call streaminit(6874);
do i=1 to 150;
Source="B";
Target=rand("Integer",1,100);
ST=catx('/',Source,Target);
output;
end;

Proc sort data=Random_A;
by ST;
run;

Proc sort data=Random_B;
by ST;
run;

data interleave1;
set random_A random_B;
by ST;
run;

proc sort data=interleave1 out=interleave2 nounikey;
by target;
run;

yabwon

If I understood you correctly, for each value of Target you want observations from A and B only if that Target value exists in both.

If so, then try this:

Proc sort data=Random_A equals;
by Target;
run;

Proc sort data=Random_B equals;
by Target;
run;


data interleave1;
  do _N_=1 by 1 until(last.Target);
    set random_A(in=ina) random_B(in=inb);
    by Target;
    N_A+ina;
    N_B+inb;
  end;

  do _N_=1 to _N_;
    set random_A random_B curobs=curobs1;
    by Target;
    if N_A and N_B then output;
  end; 

  call missing(N_A,N_B);
run;

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

View solution in original post

Patrick

Below one way how this could work.

Interleaving data is slower than concatenating. If you need the result sorted then I'd be doing this after combining the tables. Depending on how many source rows get dropped this also leads to less rows in total that need sorting.

data Random_A (drop=i);
  call streaminit(4786);
  do i=1 to 100;
    Source="A";
    Target=rand("Integer",1,100);
    ST=catx('/',Source,Target);
    output;
  end;

data Random_B (drop=i);
  call streaminit(6874);
  do i=1 to 150;
    Source="B";
    Target=rand("Integer",1,100);
    ST=catx('/',Source,Target);
    output;
  end;

proc sql;
  create view work.common_vals as
  select l.target
  from Random_A l inner join Random_B r
    on l.target=r.target
  ;
quit;

data inter;
  set random_A random_B;
  if _n_=1 then
    do;
      dcl hash h1(dataset:"work.common_vals");
      h1.defineKey('Target');
      h1.defineDone();
    end;
  if h1.check()=0;
run;

proc sort data=inter out=want;
  by st;
run;

If you have a bigger data volume with repeated values for target in both tables then you could use below SQL alternative to avoid a many:many join that creates a lot of rows.

proc sql;
  create view work.common_vals as
  select l.target
  from (select distinct target from Random_A) l inner join Random_B r
    on l.target=r.target
  ;
quit;

yabwon

If I understood you correctly, for each value of Target you want observations from A and B only if that Target value exists in both.

If so, then try this:

Proc sort data=Random_A equals;
by Target;
run;

Proc sort data=Random_B equals;
by Target;
run;


data interleave1;
  do _N_=1 by 1 until(last.Target);
    set random_A(in=ina) random_B(in=inb);
    by Target;
    N_A+ina;
    N_B+inb;
  end;

  do _N_=1 to _N_;
    set random_A random_B curobs=curobs1;
    by Target;
    if N_A and N_B then output;
  end; 

  call missing(N_A,N_B);
run;

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

Astounding

A small tweak for the sake of efficiency:

    set random_A(in=ina keep=target) random_B(in=inb keep=target);

yabwon

I think it won't help much since the other(second) DoW-loop takes all data in the end, and SAS will likely caches the data in memory to save some I/Os after the first loop. So even if the first DoW-loop "narrows" data, the second will have to do the "missing" I/O.

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

genemroz

Both solutions proposed by Yabwon and Patrick were successful. I marked Yabwon's as accepted because I'm not familiar with hash code objects

mkeintz

A minor simplification of @yabwon solution:

proc sort data=Random_A equals;
  by target;
run;

proc sort data=Random_B equals;
  by target;
run;

data want;
  merge random_a (in=ina) random_b (in=inb) ;
  by target;

  if last.target=1 then do until (last.target);
    set random_a  random_b;
    by target;
    if ina=1 and inb=1 then output;
  end;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ksharp

data Random_A (drop=i);
call streaminit(4786);
do i=1 to 100;
Source="A";
Target=rand("Integer",1,100);
ST=catx('/',Source,Target);
output;
end;
run;
data Random_B (drop=i);
call streaminit(6874);
do i=1 to 150;
Source="B";
Target=rand("Integer",1,100);
ST=catx('/',Source,Target);
output;
end;
run;

data temp;
 set Random_A Random_B indsname=indsname;
 dsn=indsname;
run;
proc sql;
create table want as
select * from temp group by Target having count(distinct dsn)=2;
quit;

mkeintz

If original data order is important, then you can avoid sorting:

data want;
  set random_a (in=ina) random_b (in=inb);

  if _n_=1 then do;
    declare hash found_in_b (dataset:'random_b (keep=target)');
      found_in_b.definekey('target');
      found_in_b.definedone();
    declare hash found_in_a ();
      found_in_a.definekey('target');
      found_in_a.definedone();
  end;

  if ina=1 and found_in_a.check()^=0 then found_in_a.add();

  if (ina=1 and found_in_b.check()=0)
     or 
     (inb=1 and found_in_a.check()=0);
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Conditional interleaving of two datasets?

Re: Conditional interleaving of two datasets?

Re: Conditional interleaving of two datasets?

Re: Conditional interleaving of two datasets?

Re: Conditional interleaving of two datasets?

Re: Conditional interleaving of two datasets?

Re: Conditional interleaving of two datasets?

Re: Conditional interleaving of two datasets?

Re: Conditional interleaving of two datasets?

Re: Conditional interleaving of two datasets?

Classroom Training Available!