BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Schtroumpfette
Obsidian | Level 7

Hello,

 

I have the following code which runs well on a small sample and I get the cases and controls dataset (cacoset).  However, I am now running it using two very large datasets (cases = 65000 cases) and (controls > 60 million controls) and it is taking forever to run.  For example for the past 20 hours it has only read 500 observations from the created "Controls" dataset as part of the matching process.  

Any advice?

 

Thanks,

Shtroumpfette

 

Code:


options ls=78 ps=60 obs=max;


%Let results = %STR(M:\Case Control\Results);


/*IMPORT REQUIRED DATABASE*/

proc import datafile = 'C:\Users\Shtroump\Final Analysis\controls.csv'
out = pat_sk2;
run;


*copy dataset to keep intact copy;
data pat_sk3;
set pat_sk2;
run;


data Pat_sk3_plusminus;
set pat_sk3;
*figure out start/end of each interval;
index_start = index_date - 30;
index_end = index_date + 30;
run;

 

/*keep variables of interest for now-*/

data pat_sk4;
set Pat_sk3_plusminus;
keep encounter_id hospital_id age patient_sk sex diabetes_event index_date index_start index_end;
run;


/*SPLIT pat_sk4 elligible cases and & possible controls*/

data case_diab control_diab;
set pat_sk4;
if diabetes_event = 1 then output case_diab;
else output control_diab;
run;


data out1.case_diab (compress=yes);
set case_diab;
run;

data out1.contol_diab (compress=yes);
set control_diab;
run;


data caseset (compress=yes);
set case_diab;
run;


data cerner.controloriginal (compress=yes);
set control_diab;
run; /*save this dataset for future use*/

/*create a copy of controls dataset to use n iterative steps below*/
data contset (compress=yes);
set control_diab;
run;

/*Choose 2 matched controls for each case. Matches subject on gender and age hospital_id and index date.
Unique controls (sampling without replacement) are chosen for each case.
Cases cannot be their own controls*/

proc contents data=caseset out = contents noprint; run;


/*Creates macro variable for total # of cases in case data set*/

%global num_obs;

data _null_;
set contents (obs=1);
call symput('num_obs', trim(left(put(nobs,7.))));
run;
%put _global_;

/*assign case observation number to SETID that will be a unique number defiining each case-control risk set. CACO =1 for all cases*/

data caseset_temp;
set caseset;
setid=_N_;
caco=1;
run;

/*options mprint symbolgen mtrace*/

/*Begin macro MATCHED*/

options mprint;
%macro MATCHED;

%do i=1 %to &num_obs;
%let obsno=&i;


/*Loops interatively through case dataset to select one case at a time, then
evaluates all observations in control datset to select potential controls*/


data contset1;
if _N_=1 then set caseset_temp(firstobs=&obsno obs=&obsno
keep = patient_sk encounter_id diabetes_event age sex hospital_id index_start index_end setid
rename = (patient_sk = case_patient_sk
encounter_id = case_encounter_id
diabetes_event = case_diabetes_event
index_start=case_index_start
index_end = case_index_end
age=case_age
sex=case_sex
hospital_id = case_hospital_id
));

set contset;

if abs((<=case_index_end<=index_date <=case_index_end)
and hospital_id=case_hospital_id
and (case_age-2<=age<=case_age+2)
and sex=case_sex
and patient_sk ne case_patient_sk
)
then do;

choose = ranuni(0);
caco=0; /*CACO = 0 for all controls*/
output contset1; /*output controls meeting matching criteria*/
end;

label case_patient_sk = 'Case: Person ID (DUID+PID)';
run;

proc sort data = contset1; by choose; run;

data contset2(drop=choose);
set contset1(obs=2);
run;

%if &obsno=1 %then %do;
data controls;
set contset2;
run;
%end;
%else %do;
data controls;
set controls contset2;
run;
%end;
%end;
%mend matched;
%matched;

/*appends all cases and selected control datasets together*/

data matchset;
set caseset_temp controls;
by setid;
run;

proc sort data = matchset;
by setid descending caco;
run;

/*assigns unique IDNUM to each ase and control in all risk sets*/

data cacoset;
set matchset;
by setid descending caco;
if first.setid then count = 0;
if caco=1 then idnum =setid*10000;
else if caco ne 1 then do;
count+1;
idnum = (setid*10000)+count;
end;
run;

data less2(keep = setid count);
set cacoset;
by setid;
if last.setid and count<2;
run;

proc freq data =less2;
tables count;
title2 'cases with less than 2 controls after matching';
title3 'Frequency of # of controls';
run;

data less2cont;
merge less2(in=in1 drop=count) cacoset;
by setid;
if ini;
run;

proc print data = less2cont noobs n double;
var case_patient_sk diabetes_event patient_sk caco setid idnum sex age;
title2 'Cases with less than 2 controls after matching';
title3 '';
run;

proc sort data = cacoset out = byidentifier nodupkey;
by patient_sk;
run;

/*creating a final permanent dataset with all case-control risk sets*/
data out1.cacoset;
set cacoset(drop=count);
run;

 

1 ACCEPTED SOLUTION

Accepted Solutions
sbxkoenk
SAS Super FREQ

Hello @Schtroumpfette ,

 

No idea if you have found out meanwhile on how to proceed but here's an example on how to proceed with Hash Object Table Look-up in case of being confronted with a Cartesian product!

 

Suppose you have a look-up table 'LOOKUP_TABLE' of 20 000 records.

For every observation in the dataset 'HAVE', you want to scan each and every observation in the lookup table for a possible match. 

You have no key-variable (no by-variable) to merge on.

 

Here's the classical (non-SQL) way of doing this kind of Cartesian product. It takes time!

data work.wanted;
 set work.have;
 do pointer = 1 to 20000;
  set work.lookup_table point=pointer;
   if whatever_condition_is_TRUE then output;
 end;
run;

Here's how to do the same with a hash table look-up. Much faster! Especially for BIG datasets.

/* See: https://support.sas.com/resources/papers/proceedings/proceedings/forum2007/271-2007.pdf */
/* See: https://support.sas.com/resources/papers/proceedings16/10200-2016.pdf                   */

data work.wanted(drop=rc);
  if _N_=1 then do;
   declare hash h(dataset: "work.lookup_table", ordered: "A", multidata: "Y");
   h.definekey   ("key");
   h.definedata  ('var_1','var_2');
   h.definedone();
   call missing(var_1, var_2);
  end; 
  set work.have;
do rc = h.find() by 0 while (rc = 0) ;
 if whatever_condition_is_TRUE then output;
 rc = h.find_next() ;
end ;
run;

You don't have a key variable, but the hash table requires a key-variable. Therefore, make a key variable in both of your datasets and give the key a constant value for every observation, sthg. like:

key='k';

Joining on such a key is the ultimate nxn merge (Cartesian product).

Make sure your datasets 'have' and 'lookup_table' have no variables with the same name. Do a (rename=()) if needed.

 

I hope you can translate this to your own concrete situation.

 

Cheers,

Koen

 

View solution in original post

8 REPLIES 8
Astounding
PROC Star

Looking at the first part of the code, before the macro begins, it's obvious you have never paid much attention to efficiency.  That code reads the same data set multiple times when it could be read once.  So I won't pretend I can fix the macro in a way that would be meaningful based on your current understanding of what the code should look like.  I will point out one section that should be fixed, that contributes to the slowdown.  This  inefficiency may not be obvious:

 

%if &obsno=1 %then %do;
data controls;
set contset2;
run;
%end;
%else %do;
data controls;
set controls contset2;
run;
%end;

The second DATA step looks like it should run quickly, but you have to remember that you are performing this task 65,000 times, with an average of 65,000 observations each time.  That will add up, particularly as the number of observations increases.  So this change won't fix everything by any means, but it will help:

 

%if &obsno=1 %then %do;
   data controls;
      set contset2;
   run;
%end;
%else %do;
   proc append data=contset2 base=controls;
   run;
%end;

One more consideration related to the logic of the code ... is it possible that the same patient appears twice in the data (once in the treatment group and once in the control group)?  That would make a difference in the coding requirements within the macro.  Also, is it acceptable if the same patient gets accepted as a control for more than one treatment patient?

 

Schtroumpfette
Obsidian | Level 7

Thanks Astounding,

 

I am fairly new to macros.  Thank you for your patience 🙂 

 

I will try your trick and see if this helps.

 

The code is based on this: 152-30: SAS® Programs to Select Controls for Matched Case-Control Studies (program 2)

To answer your question, the controls dataset does not have the cases- they were excluded.

Unique controls - without replacement- are chosen for each case.  Cases cannot be their own controls.

 

Thanks so much,

Shtroumpfette

sbxkoenk
SAS Super FREQ

Hello,

 

Use the 'Insert SAS Code' icon when copy/pasting code in your post. 😉 Your code won't lose structure and formatting.

 

Further on, ...

I haven't interpreted your code fully (it's too long!) but it seems to me you need a kind of Cartesian product (for every obs in dataset 1, scan all obs in dataset 2 for a match). 

There exist multiple ways to do this but by far the fastest solution is to do a Hash Object Table Lookup. You can probably cut your processing time by 80 or 90% by using a hash table (in-memory lookup). Hash tables are a great and easy to use tool introduced in SAS 9.

 

Cheers,

Koen

 

Schtroumpfette
Obsidian | Level 7

Thanks sbxkoenk,

 
II will use the insert a SAS code icon in the future.  Thanks for the tip. 
 
Correct: For every obs in dataset 1, scan all obs in dataset 2 for a match. 

 

What is a Hash Object Table Lookup?  It would be great if I can cut the processing time by 80-90%, but how would I do that?  Is this part of the code?  I have never done that, so any help you can provide, it would be great.

 

I am relatively new to SAS  and macros altogether. 

 

Thanks,

Schtroumpfette

sbxkoenk
SAS Super FREQ

Hello @Schtroumpfette ,

 

I would like to provide you with a quick code-example, but don't have time right now. I will check the thread of this discussion again in the coming days and if you haven't sorted it out yet (or nobody else has provided you with an example), I will surely post some hash table lookup code for your specific use case where there is NO key-variable to merge on.

 

Meanwhile you can try to sort it out yourself.

 

Here's an introduction on 'Using the Hash Object'.

Using the Hash Object

https://go.documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.5&docsetId=lrcon&docsetTarget=n1b...

Look at

Example 2: Loading a Data Set and Using the FIND Method to Retrieve Data.

It's not entirely what you are looking for as there is a key variable (by-variable) involved, but you will get ideas.

 

You can also search all boards within these communities.sas.com with keywords like: hash object table lookup SAS. You will probably find numerous examples and good papers on the subject of hash table lookup.

 

Cheers,

Koen

 

Schtroumpfette
Obsidian | Level 7

Thanks so much Koen,

 

I will look into it for sure- I already read a bit about it. I will keep you posted and I would appreciate a code in the coming days if nothing works and no one else responds.

 

True, there is NO key variable to merge on.  

 

Thanks,
Schtroumpfette

sbxkoenk
SAS Super FREQ

Hello @Schtroumpfette ,

 

No idea if you have found out meanwhile on how to proceed but here's an example on how to proceed with Hash Object Table Look-up in case of being confronted with a Cartesian product!

 

Suppose you have a look-up table 'LOOKUP_TABLE' of 20 000 records.

For every observation in the dataset 'HAVE', you want to scan each and every observation in the lookup table for a possible match. 

You have no key-variable (no by-variable) to merge on.

 

Here's the classical (non-SQL) way of doing this kind of Cartesian product. It takes time!

data work.wanted;
 set work.have;
 do pointer = 1 to 20000;
  set work.lookup_table point=pointer;
   if whatever_condition_is_TRUE then output;
 end;
run;

Here's how to do the same with a hash table look-up. Much faster! Especially for BIG datasets.

/* See: https://support.sas.com/resources/papers/proceedings/proceedings/forum2007/271-2007.pdf */
/* See: https://support.sas.com/resources/papers/proceedings16/10200-2016.pdf                   */

data work.wanted(drop=rc);
  if _N_=1 then do;
   declare hash h(dataset: "work.lookup_table", ordered: "A", multidata: "Y");
   h.definekey   ("key");
   h.definedata  ('var_1','var_2');
   h.definedone();
   call missing(var_1, var_2);
  end; 
  set work.have;
do rc = h.find() by 0 while (rc = 0) ;
 if whatever_condition_is_TRUE then output;
 rc = h.find_next() ;
end ;
run;

You don't have a key variable, but the hash table requires a key-variable. Therefore, make a key variable in both of your datasets and give the key a constant value for every observation, sthg. like:

key='k';

Joining on such a key is the ultimate nxn merge (Cartesian product).

Make sure your datasets 'have' and 'lookup_table' have no variables with the same name. Do a (rename=()) if needed.

 

I hope you can translate this to your own concrete situation.

 

Cheers,

Koen

 

Schtroumpfette
Obsidian | Level 7

Thanks Koen,

 

This is very helpful.  I read a lot about hash tables and tried to implement some solutions.  

 

However, I will try what  you proposed.  Thanks for taking the time, a great help!

 

Schtroumpfette

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1778 views
  • 3 likes
  • 3 in conversation