Left join using data step

Sheeba · Posted 09-10-2016 08:07 PM

Hi,

I have a situation where i have to do a left join on two datasets. I cannot use proc sql step as i need to do some processing while doing left join. These processing involves creation of new variables on the basis of values in both the datasets.

My left join is little different as I am trying to merge one record from the first dataset to all the matching records in second. The key on which join is made is not a primary key in both the datasets. I have added a sample data from both datasets below.

Actual size of datasets is one million and right dataset is 20k.

I got a sample program from internet but I need some help in optimising the program. Now it loops for 20k times for each record in the first dataset. Is there any way i can subset the second dataset

Is there any alternative way to do this join more efficiently?

Thanks in advance,

First dataset.

data AE;

input SUBJECT 3. @5EVENT $;

datalines;

263 HEADACX

263 HEADACA

263 HEADAC2

263 HEADAC3

263 HEADAC4

263 HEADAC5

263 HEADAC6

264 xxxxxxx

264 xxxxxx1

265 FRACTU0

265 FRACTU1

265 FRACTU2

265 FRACTU3

265 FRACTU4

270 NOTPRES

;

run;

Second dataset

data PATDATA;

input SUBJECT 3. @5TRT_CODE $1.;

datalines;

263 A

263 D

263 X

264 A

265 B

265 Y

275 N

;

run;

program

%macro temp;

DATA alldata0;

SET ae;

DROP _: match; * Drop temporary variables;

match=0;

/*Below I am looping the second dataset completely. is there any way to subset it?*/

DO i=1 TO xnobs;

SET patdata (rename=(subject=_subject)) NOBS=xnobs POINT=i;

if subject=_subject THEN

DO;

match=1;

OUTPUT;

END;

IF match=0 THEN DO; * Output AE record if no match in CM;

CALL MISSING(trt_code);

OUTPUT; END;

RUN;

%mend temp;

%temp;

Regards,

Sheeba Swaminathan

Reeza · Posted 09-10-2016 09:18 PM

Why do think you can't use SQL?

Reeza · Posted 09-10-2016 09:18 PM

Why do think you can't use SQL?

Sheeba · Posted 09-10-2016 09:33 PM

Hi Reeza,

I cannot use sql as I need to create few variables during the join .. Some of the new variables are created by comparing the variables in both the datasets

Also the join conditions ( here if subject =_subject ) are generated dynamically from a dataset ..

Also I need to get two Data sets o

1. Which satisfies these conditions and
2 which does not satisfy the matching condition

In order to get all these flexibility I am planning to move forward with data step .

Regards,
Sheeba

Ksharp · Posted 09-10-2016 10:19 PM

You could use Hash Table instead of Left Join .




data AE;
input SUBJECT 3. @5EVENT $;
datalines;
263 HEADACX
263 HEADACA
263 HEADAC2
263 HEADAC3
263 HEADAC4
263 HEADAC5
263 HEADAC6
264 xxxxxxx
264 xxxxxx1
265 FRACTU0
265 FRACTU1
265 FRACTU2
265 FRACTU3
265 FRACTU4
270 NOTPRES
;
run;
 
data PATDATA;
input SUBJECT 3. @5TRT_CODE $1.;
datalines;
263 A
263 D
263 X
264 A
265 B
265 Y
275 N
;
run;
data want;
 if _n_=1 then do;
  if 0 then set patdata;
  declare hash h(dataset:'patdata',multidata:'y');
  h.definekey('subject');
  h.definedata('trt_code');
  h.definedone();
 end;
set ae;
call missing(trt_code);
rc=h.find();
if rc ne 0 then output;
do while(rc=0);
 output;
 rc=h.find_next();
end;
drop rc;
run;

PGStats · Posted 09-10-2016 11:13 PM

Use an index to read only matching records:

data AE;
input SUBJECT 3. @5EVENT $;
datalines;
263 HEADACX
263 HEADACA
263 HEADAC2
263 HEADAC3
263 HEADAC4
263 HEADAC5
263 HEADAC6
264 xxxxxxx
264 xxxxxx1
265 FRACTU0
265 FRACTU1
265 FRACTU2
265 FRACTU3
265 FRACTU4
270 NOTPRES
;

data PATDATA /*(index=(subject=(subject)))*/;
input SUBJECT 3. @5TRT_CODE $1.;
datalines;
263 A
263 D
263 X
264 A
265 B
265 Y
275 N
;

proc sql;
create index subject on PATDATA(subject);
quit;

data join;
set AE;
reset = 1;
do i = 1 by 1 while (_iorc_ = 0);
    set PATDATA key=subject keyreset=reset;
    if _error_ then call missing(TRT_CODE);
    if i = 1 or not _error_ then output;
    end;
_iorc_ = 0;
_error_ = 0;
drop i;
run;

PG

Sheeba · Posted 09-11-2016 08:00 PM

Hi ksharp,

Thanks a lot for suggestion.

Hash tables are new to me . I will do a research on that .

Regards,
Sheeba

Sheeba · Posted 09-11-2016 08:02 PM

Hi PGstats,

Thanks for suggesting this method .

I will try out this .

Thanks again,
Regards,
Sheeba Swaminathan

KachiM · Posted 09-12-2016 09:47 AM

This solution is using Array. The subject is used as index of the array. This helps to decide
in one probe whether the subject of AE is present in the PATDATA and if so TRT_CODE is harvested.
This method of using array is called KEY INDEXing. Because it decides in one hit, this method
is faster than HASH or other solutions. I wish you run all the methods for joy and compare
run-times.

To be frugal in using the memory, we size the array with say 263:275 from your PATDATA as one argument.
We also need how many times a SUBJECT repeats. You know that 263 repeats 3 times but others are less.
So the ecnomical use of array will be:
array s[263:275, 3] _temporary_;
You may say that 266 to 274 are not present in PATDATA. Array can't be declared in pieces and so we live
with that limitation !

We call 263 as minsub, 275 as maxsub and 3 as maxGroup. For general purpose, these 3 values can be
obtained by scanning the PATDATA data set as:

data _null_;
   retain max minsub maxsub;
   do count = 1 by 1 until(last.subject);
      set patdata end = eof;
      by subject;
      if last.subject then do;
         max = max(max, count);
         minsub = min(minsub, subject);
         maxsub = max(maxsub, subject);
      end;
   end;
if eof then do;
   call symputx('maxGroup', max);
   call symputx('minsub', minsub);
   call symputx('maxsub', maxsub);
end;
run;
%put &minsub;
%put &maxsub;
%put &maxGroup;

The assumption is that the PATDATA, at least be GROUPED by SUBJECT and the AE data set need
not be sorted by SUBJECT.

The solution consists of two do-loops. When _N_ = 1, the first do-loop loads the TRT_CODE
from PATDATA into the array. Remember that if a SUBJECT repeats, COUNT takes care of the
correct cell of the array.

data want;
   array s[&minsub:&maxsub,&maxGroup] $1 _temporary_;
   array c[&minsub:&maxsub] _temporary_;
   if _n_ = 1 then do;
   do until(eof);
      do count = 1 by 1 until(last.subject);
         set patdata end = eof;
         by subject ;
         s[subject, count] = trt_code;
      end;
      c[subject] = count;
   end;
   end;

The second do-loop first ckecks whether the SUBJECT from AE falls between the RANGE(&minsub and &maxsub).
If it doen't fall, the default missing value for TRT_CODE is written to output data set. If it falls
then a check is made on the first cell of the SUBJECT whether it holds a non-missing TRT_CODE. Remember
situation like missing SUBJECTS with 266 to 274, which by default, will hold missing TRT_CODE in the array.

   do until(last);
      set ae end = last;
      trt_code = ' ';
      if  ^(&minsub <= subject <= &maxsub) then output; 
      else if not missing(s[subject,1]) then 
      do i = 1 to c[subject];
         trt_code = s[subject,i];
         /** Your computations go here ..... **/
         output;
      end;
      else output;
   end;
drop i count;
run;

Hope this is an acceptable solution to you. Don't forget feed us your comparison of solutions.

Sheeba · Posted 09-12-2016 11:18 AM

Hi datasp,

Thanks a lot for the suggestion.

Sure .. I will run all these methods ,compare it and will let you know.

Regards,

Sheeba Swaminathan

Left join using data step

Re: Left join using data step

Re: Left join using data step

Re: Left join using data step

Re: Left join using data step

Re: Left join using data step

Re: Left join using data step

Re: Left join using data step

Re: Left join using data step

Re: Left join using data step

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away