BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
genemroz
Quartz | Level 8

Esteemed Advisors:

 

I am trying to interleave two datasets with a condition that the resulting dataset contains only observations that can be found in both of the two datasets.  

 

Below is exemplar code to illustrate the problem.  If you run this code and inspect dataset interleave2 you will see that for a group of  3 observations where target=1, two came from Random_A and one came from Random_B.  Likewise, for a group of three observations where target=2, two came from Random_B and one came from Random_A.  All of these observations need to be retained in the desired dataset.

 

For the group of 3 observations where target=3, all observations came from Random_B only.  These are ones that need to be omitted.  All observations for a given target that come from a single source dataset are not to be retained in the desired dataset.

 

The challenge for me (and now for you) is to come up with the code that will interleave Random_A and Random_B such that the resultant dataset that only contains the groups of targets that are present in both datasets.

 

Hope this makes sense and thanks for taking a look,

 

Gene

 

data Random_A (drop=i);
call streaminit(4786);
do i=1 to 100;
Source="A";
Target=rand("Integer",1,100);
ST=catx('/',Source,Target);
output;
end;

data Random_B (drop=i);
call streaminit(6874);
do i=1 to 150;
Source="B";
Target=rand("Integer",1,100);
ST=catx('/',Source,Target);
output;
end;

Proc sort data=Random_A;
by ST;
run;

Proc sort data=Random_B;
by ST;
run;

data interleave1;
set random_A random_B;
by ST;
run;

proc sort data=interleave1 out=interleave2 nounikey;
by target;
run;
1 ACCEPTED SOLUTION

Accepted Solutions
yabwon
Onyx | Level 15

If I understood you correctly, for each value of Target you want observations from A and B only if that Target value exists in both.

If so, then try this:

Proc sort data=Random_A equals;
by Target;
run;

Proc sort data=Random_B equals;
by Target;
run;


data interleave1;
  do _N_=1 by 1 until(last.Target);
    set random_A(in=ina) random_B(in=inb);
    by Target;
    N_A+ina;
    N_B+inb;
  end;

  do _N_=1 to _N_;
    set random_A random_B curobs=curobs1;
    by Target;
    if N_A and N_B then output;
  end; 

  call missing(N_A,N_B);
run;

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



View solution in original post

8 REPLIES 8
Patrick
Opal | Level 21

Below one way how this could work. 

Interleaving data is slower than concatenating. If you need the result sorted then I'd be doing this after combining the tables. Depending on how many source rows get dropped this also leads to less rows in total that need sorting.

data Random_A (drop=i);
  call streaminit(4786);
  do i=1 to 100;
    Source="A";
    Target=rand("Integer",1,100);
    ST=catx('/',Source,Target);
    output;
  end;

data Random_B (drop=i);
  call streaminit(6874);
  do i=1 to 150;
    Source="B";
    Target=rand("Integer",1,100);
    ST=catx('/',Source,Target);
    output;
  end;

proc sql;
  create view work.common_vals as
  select l.target
  from Random_A l inner join Random_B r
    on l.target=r.target
  ;
quit;

data inter;
  set random_A random_B;
  if _n_=1 then
    do;
      dcl hash h1(dataset:"work.common_vals");
      h1.defineKey('Target');
      h1.defineDone();
    end;
  if h1.check()=0;
run;

proc sort data=inter out=want;
  by st;
run;

 

If you have a bigger data volume with repeated values for target in both tables then you could use below SQL alternative to avoid a many:many join that creates a lot of rows.

proc sql;
  create view work.common_vals as
  select l.target
  from (select distinct target from Random_A) l inner join Random_B r
    on l.target=r.target
  ;
quit;

 

yabwon
Onyx | Level 15

If I understood you correctly, for each value of Target you want observations from A and B only if that Target value exists in both.

If so, then try this:

Proc sort data=Random_A equals;
by Target;
run;

Proc sort data=Random_B equals;
by Target;
run;


data interleave1;
  do _N_=1 by 1 until(last.Target);
    set random_A(in=ina) random_B(in=inb);
    by Target;
    N_A+ina;
    N_B+inb;
  end;

  do _N_=1 to _N_;
    set random_A random_B curobs=curobs1;
    by Target;
    if N_A and N_B then output;
  end; 

  call missing(N_A,N_B);
run;

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



Astounding
PROC Star

A small tweak for the sake of efficiency:

    set random_A(in=ina keep=target) random_B(in=inb keep=target);
yabwon
Onyx | Level 15

I think it won't help much since the other(second) DoW-loop takes all data in the end, and SAS will likely caches the data in memory to save some I/Os after the first loop. So even if the first DoW-loop "narrows" data, the second will have to do the "missing" I/O.

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



genemroz
Quartz | Level 8
Both solutions proposed by Yabwon and Patrick were successful. I marked Yabwon's as accepted because I'm not familiar with hash code objects
mkeintz
PROC Star

A minor simplification of @yabwon solution:

 

proc sort data=Random_A equals;
  by target;
run;

proc sort data=Random_B equals;
  by target;
run;

data want;
  merge random_a (in=ina) random_b (in=inb) ;
  by target;

  if last.target=1 then do until (last.target);
    set random_a  random_b;
    by target;
    if ina=1 and inb=1 then output;
  end;
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Ksharp
Super User
data Random_A (drop=i);
call streaminit(4786);
do i=1 to 100;
Source="A";
Target=rand("Integer",1,100);
ST=catx('/',Source,Target);
output;
end;
run;
data Random_B (drop=i);
call streaminit(6874);
do i=1 to 150;
Source="B";
Target=rand("Integer",1,100);
ST=catx('/',Source,Target);
output;
end;
run;

data temp;
 set Random_A Random_B indsname=indsname;
 dsn=indsname;
run;
proc sql;
create table want as
select * from temp group by Target having count(distinct dsn)=2;
quit;
mkeintz
PROC Star

If original data order is important, then you can avoid sorting:

 

data want;
  set random_a (in=ina) random_b (in=inb);

  if _n_=1 then do;
    declare hash found_in_b (dataset:'random_b (keep=target)');
      found_in_b.definekey('target');
      found_in_b.definedone();
    declare hash found_in_a ();
      found_in_a.definekey('target');
      found_in_a.definedone();
  end;

  if ina=1 and found_in_a.check()^=0 then found_in_a.add();

  if (ina=1 and found_in_b.check()=0)
     or 
     (inb=1 and found_in_a.check()=0);
run;

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 518 views
  • 1 like
  • 6 in conversation