09-10-2016 08:07 PM
I have a situation where i have to do a left join on two datasets. I cannot use proc sql step as i need to do some processing while doing left join. These processing involves creation of new variables on the basis of values in both the datasets.
My left join is little different as I am trying to merge one record from the first dataset to all the matching records in second. The key on which join is made is not a primary key in both the datasets. I have added a sample data from both datasets below.
Actual size of datasets is one million and right dataset is 20k.
I got a sample program from internet but I need some help in optimising the program. Now it loops for 20k times for each record in the first dataset. Is there any way i can subset the second dataset
Is there any alternative way to do this join more efficiently?
Thanks in advance,
input SUBJECT 3. @5EVENT $;
input SUBJECT 3. @5TRT_CODE $1.;
DROP _: match; * Drop temporary variables;
/*Below I am looping the second dataset completely. is there any way to subset it?*/
DO i=1 TO xnobs;
SET patdata (rename=(subject=_subject)) NOBS=xnobs POINT=i;
if subject=_subject THEN
IF match=0 THEN DO; * Output AE record if no match in CM;
09-10-2016 09:33 PM
09-10-2016 10:19 PM
You could use Hash Table instead of Left Join . data AE; input SUBJECT 3. @5EVENT $; datalines; 263 HEADACX 263 HEADACA 263 HEADAC2 263 HEADAC3 263 HEADAC4 263 HEADAC5 263 HEADAC6 264 xxxxxxx 264 xxxxxx1 265 FRACTU0 265 FRACTU1 265 FRACTU2 265 FRACTU3 265 FRACTU4 270 NOTPRES ; run; data PATDATA; input SUBJECT 3. @5TRT_CODE $1.; datalines; 263 A 263 D 263 X 264 A 265 B 265 Y 275 N ; run; data want; if _n_=1 then do; if 0 then set patdata; declare hash h(dataset:'patdata',multidata:'y'); h.definekey('subject'); h.definedata('trt_code'); h.definedone(); end; set ae; call missing(trt_code); rc=h.find(); if rc ne 0 then output; do while(rc=0); output; rc=h.find_next(); end; drop rc; run;
09-10-2016 11:13 PM
Use an index to read only matching records:
data AE; input SUBJECT 3. @5EVENT $; datalines; 263 HEADACX 263 HEADACA 263 HEADAC2 263 HEADAC3 263 HEADAC4 263 HEADAC5 263 HEADAC6 264 xxxxxxx 264 xxxxxx1 265 FRACTU0 265 FRACTU1 265 FRACTU2 265 FRACTU3 265 FRACTU4 270 NOTPRES ; data PATDATA /*(index=(subject=(subject)))*/; input SUBJECT 3. @5TRT_CODE $1.; datalines; 263 A 263 D 263 X 264 A 265 B 265 Y 275 N ; proc sql; create index subject on PATDATA(subject); quit; data join; set AE; reset = 1; do i = 1 by 1 while (_iorc_ = 0); set PATDATA key=subject keyreset=reset; if _error_ then call missing(TRT_CODE); if i = 1 or not _error_ then output; end; _iorc_ = 0; _error_ = 0; drop i; run;
09-12-2016 09:47 AM
This solution is using Array. The subject is used as index of the array. This helps to decide
in one probe whether the subject of AE is present in the PATDATA and if so TRT_CODE is harvested.
This method of using array is called KEY INDEXing. Because it decides in one hit, this method
is faster than HASH or other solutions. I wish you run all the methods for joy and compare
To be frugal in using the memory, we size the array with say 263:275 from your PATDATA as one argument.
We also need how many times a SUBJECT repeats. You know that 263 repeats 3 times but others are less.
So the ecnomical use of array will be:
array s[263:275, 3] _temporary_;
You may say that 266 to 274 are not present in PATDATA. Array can't be declared in pieces and so we live
with that limitation !
We call 263 as minsub, 275 as maxsub and 3 as maxGroup. For general purpose, these 3 values can be
obtained by scanning the PATDATA data set as:
data _null_; retain max minsub maxsub; do count = 1 by 1 until(last.subject); set patdata end = eof; by subject; if last.subject then do; max = max(max, count); minsub = min(minsub, subject); maxsub = max(maxsub, subject); end; end; if eof then do; call symputx('maxGroup', max); call symputx('minsub', minsub); call symputx('maxsub', maxsub); end; run; %put &minsub; %put &maxsub; %put &maxGroup;
The assumption is that the PATDATA, at least be GROUPED by SUBJECT and the AE data set need
not be sorted by SUBJECT.
The solution consists of two do-loops. When _N_ = 1, the first do-loop loads the TRT_CODE
from PATDATA into the array. Remember that if a SUBJECT repeats, COUNT takes care of the
correct cell of the array.
data want; array s[&minsub:&maxsub,&maxGroup] $1 _temporary_; array c[&minsub:&maxsub] _temporary_; if _n_ = 1 then do; do until(eof); do count = 1 by 1 until(last.subject); set patdata end = eof; by subject ; s[subject, count] = trt_code; end; c[subject] = count; end; end;
The second do-loop first ckecks whether the SUBJECT from AE falls between the RANGE(&minsub and &maxsub).
If it doen't fall, the default missing value for TRT_CODE is written to output data set. If it falls
then a check is made on the first cell of the SUBJECT whether it holds a non-missing TRT_CODE. Remember
situation like missing SUBJECTS with 266 to 274, which by default, will hold missing TRT_CODE in the array.
do until(last); set ae end = last; trt_code = ' '; if ^(&minsub <= subject <= &maxsub) then output; else if not missing(s[subject,1]) then do i = 1 to c[subject]; trt_code = s[subject,i]; /** Your computations go here ..... **/ output; end; else output; end; drop i count; run;
Hope this is an acceptable solution to you. Don't forget feed us your comparison of solutions.