BookmarkSubscribeRSS Feed
Sheeba
Lapis Lazuli | Level 10

Hi,

 

I have a situation where i have to do a left join on two datasets. I cannot use proc sql step as i need to do some processing while doing left join. These processing involves creation of new variables on the basis of values in both the datasets.

 

My left join is little different as I am trying to merge one record from the first dataset to all the matching records in second.  The key on which join is made is not a primary key in both the datasets. I have added a sample data from both datasets below.

 

Actual size of datasets is one million and right dataset is 20k.

I got a sample program from internet but I need some help in optimising the program. Now it loops for 20k times for each record in the first dataset. Is there any way i can subset the second dataset

 

Is there any alternative way to do this join more efficiently?

 

Thanks in advance,

 

 

First dataset.

data AE;

input SUBJECT 3. @5EVENT $;

datalines;

263 HEADACX

263 HEADACA

263 HEADAC2

263 HEADAC3

263 HEADAC4

263 HEADAC5

263 HEADAC6

264 xxxxxxx

264 xxxxxx1

265 FRACTU0

265 FRACTU1

265 FRACTU2

265 FRACTU3

265 FRACTU4

270 NOTPRES

;

run;

 

Second dataset

data PATDATA;

input SUBJECT 3. @5TRT_CODE $1.;

datalines;

263 A

263 D

263 X

264 A

265 B

265 Y

275 N

;

run;

 

program

%macro temp;

DATA alldata0;

SET ae;

DROP _: match; * Drop temporary variables;

match=0;

/*Below I am looping the second dataset completely. is there any way to subset it?*/

DO i=1 TO xnobs;

SET patdata (rename=(subject=_subject)) NOBS=xnobs POINT=i;

if subject=_subject THEN

DO;

match=1;

OUTPUT;

END;

END;

IF match=0 THEN DO; * Output AE record if no match in CM;

CALL MISSING(trt_code);

OUTPUT; END;

RUN;

%mend temp;

%temp;

 

 

 

Regards,

Sheeba Swaminathan

9 REPLIES 9
Reeza
Super User

Why do think you can't use SQL? 

Reeza
Super User

Why do think you can't use SQL? 

Sheeba
Lapis Lazuli | Level 10
Hi Reeza,

I cannot use sql as I need to create few variables during the join .. Some of the new variables are created by comparing the variables in both the datasets

Also the join conditions ( here if subject =_subject ) are generated dynamically from a dataset ..

Also I need to get two Data sets o

1. Which satisfies these conditions and
2 which does not satisfy the matching condition


In order to get all these flexibility I am planning to move forward with data step .

Regards,
Sheeba
Ksharp
Super User
You could use Hash Table instead of Left Join .




data AE;
input SUBJECT 3. @5EVENT $;
datalines;
263 HEADACX
263 HEADACA
263 HEADAC2
263 HEADAC3
263 HEADAC4
263 HEADAC5
263 HEADAC6
264 xxxxxxx
264 xxxxxx1
265 FRACTU0
265 FRACTU1
265 FRACTU2
265 FRACTU3
265 FRACTU4
270 NOTPRES
;
run;
 
data PATDATA;
input SUBJECT 3. @5TRT_CODE $1.;
datalines;
263 A
263 D
263 X
264 A
265 B
265 Y
275 N
;
run;
data want;
 if _n_=1 then do;
  if 0 then set patdata;
  declare hash h(dataset:'patdata',multidata:'y');
  h.definekey('subject');
  h.definedata('trt_code');
  h.definedone();
 end;
set ae;
call missing(trt_code);
rc=h.find();
if rc ne 0 then output;
do while(rc=0);
 output;
 rc=h.find_next();
end;
drop rc;
run;

PGStats
Opal | Level 21

Use an index to read only matching records:

 

data AE;
input SUBJECT 3. @5EVENT $;
datalines;
263 HEADACX
263 HEADACA
263 HEADAC2
263 HEADAC3
263 HEADAC4
263 HEADAC5
263 HEADAC6
264 xxxxxxx
264 xxxxxx1
265 FRACTU0
265 FRACTU1
265 FRACTU2
265 FRACTU3
265 FRACTU4
270 NOTPRES
;

data PATDATA /*(index=(subject=(subject)))*/;
input SUBJECT 3. @5TRT_CODE $1.;
datalines;
263 A
263 D
263 X
264 A
265 B
265 Y
275 N
;

proc sql;
create index subject on PATDATA(subject);
quit;

data join;
set AE;
reset = 1;
do i = 1 by 1 while (_iorc_ = 0);
    set PATDATA key=subject keyreset=reset;
    if _error_ then call missing(TRT_CODE);
    if i = 1 or not _error_ then output;
    end;
_iorc_ = 0;
_error_ = 0;
drop i;
run;
PG
Sheeba
Lapis Lazuli | Level 10
Hi ksharp,

Thanks a lot for suggestion.

Hash tables are new to me . I will do a research on that .

Regards,
Sheeba
Sheeba
Lapis Lazuli | Level 10
Hi PGstats,

Thanks for suggesting this method .

I will try out this .

Thanks again,
Regards,
Sheeba Swaminathan
KachiM
Rhodochrosite | Level 12

This solution is using Array. The subject is used as index of the array. This helps to decide
in one probe whether the subject of AE is present in the PATDATA and if so TRT_CODE is harvested.
This method of using array is called KEY INDEXing. Because it decides in one hit, this method
is faster than HASH or other solutions. I wish you run all the methods for joy and compare
run-times.

To be frugal in using the memory, we size the array with say 263:275 from your PATDATA as one argument.
We also need how many times a SUBJECT repeats. You know that 263 repeats 3 times but others are less.
So the ecnomical use of array will be:
array s[263:275, 3] _temporary_;
You may say that 266 to 274 are not present in PATDATA. Array can't be declared in pieces and so we live
with that limitation !

We call 263 as minsub, 275 as maxsub and 3 as maxGroup. For general purpose, these 3 values can be
obtained by scanning the PATDATA data set as:

data _null_;
   retain max minsub maxsub;
   do count = 1 by 1 until(last.subject);
      set patdata end = eof;
      by subject;
      if last.subject then do;
         max = max(max, count);
         minsub = min(minsub, subject);
         maxsub = max(maxsub, subject);
      end;
   end;
if eof then do;
   call symputx('maxGroup', max);
   call symputx('minsub', minsub);
   call symputx('maxsub', maxsub);
end;
run;
%put &minsub;
%put &maxsub;
%put &maxGroup;

The assumption is that the PATDATA, at least be GROUPED by SUBJECT and the AE data set need
not be sorted by SUBJECT.

The solution consists of two do-loops. When _N_ = 1, the first do-loop loads the TRT_CODE
from PATDATA into the array. Remember that if a SUBJECT repeats, COUNT takes care of the
correct cell of the array.

data want;
   array s[&minsub:&maxsub,&maxGroup] $1 _temporary_;
   array c[&minsub:&maxsub] _temporary_;
   if _n_ = 1 then do;
   do until(eof);
      do count = 1 by 1 until(last.subject);
         set patdata end = eof;
         by subject ;
         s[subject, count] = trt_code;
      end;
      c[subject] = count;
   end;
   end;

The second do-loop first ckecks whether the SUBJECT from AE falls between the RANGE(&minsub and &maxsub).
If it doen't fall, the default missing value for TRT_CODE is written to output data set. If it falls
then a check is made on the first cell of the SUBJECT whether it holds a non-missing TRT_CODE. Remember
situation like missing SUBJECTS with 266 to 274, which by default, will hold missing TRT_CODE in the array.

 

   do until(last);
      set ae end = last;
      trt_code = ' ';
      if  ^(&minsub <= subject <= &maxsub) then output; 
      else if not missing(s[subject,1]) then 
      do i = 1 to c[subject];
         trt_code = s[subject,i];
         /** Your computations go here ..... **/
         output;
      end;
      else output;
   end;
drop i count;
run;

Hope this is an acceptable solution to you. Don't forget feed us your comparison of solutions.

Sheeba
Lapis Lazuli | Level 10

Hi datasp,

 

Thanks a lot for the suggestion.

 

Sure .. I will run all these methods ,compare it  and will let you know.

 

 

Regards,

Sheeba Swaminathan

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 3704 views
  • 8 likes
  • 5 in conversation