DATA Step, Macro, Functions and more

Hash Table for Case-Control Matching

Reply
New Contributor
Posts: 3

Hash Table for Case-Control Matching

Hi all, I recently started trying to use hash tables for case control matching.  My goal is to do the following.

1) On the first run match cases to two controls very closely.  Ideally I would find all my matched controls in this step but there are two possibilities

     a) I only find 1 control that meets this criteria

     b) I don't find any controls to meet criteria.

2)  For all those cases that fall under a, and b I would like to run the matching again with  a wider tolerance on the matching.

I found some code from SAS Edmonton user group by George Zhu that does step 1.  I tried to modify the code so that it would go through and do step 1 and then for any cases left over go and do step 2. 

However, I have failed in my attempts to do this.

Part of the problem is that the second run is overwriting the results of the first so closer matches are getting over written with looser  matches.  This is because of  the  cases.replace() in the second loop.  Changing the replace to an .add() does not do what I want either because it adds to those that already have 2 matches.

Below is the code.  I would greatly appreciate any help.

Joe

****************************************************************************************

   *****DATA SETUP;

   ***************************************************************************************;

%let ratio=2;

  %let nCase=100;

  %let nControl=100000;

  %let ratio=2;

  * Generate the Cases data set;

  data cases(drop=i);

  retain id age gender;

  length gender $1;

  do i=1 to &nCase.;

  id=i;

  age=floor(ranuni(0)*100);

  gender=ifc(ranuni(0)>0.5,"F","M");

  output;

  end;

  run;

  * Generate the Controls data set;

  data controls(drop=i);

  retain id age gender;

  length gender $1;

  do i=1 to &nControl.;

  id=i;

  age=floor(ranuni(0)*73);

  gender=ifc(ranuni(5)>0.7,"F","M");

  output;

  end;

  run;

  data Controls_H;

  set Controls;

  control_rand=ranuni(0);

  rename age=control_age gender=control_gender id=control_id;

  run;

  proc sort data=controls_H; *scramble the control list for randomness;

  by control_rand;

  run;

  data Cases_H;

  set Cases;

  case_rand=ranuni(0); *for scramble the order of cases;

  rename age=case_age gender=case_gender id=case_id;

  count=0; *for recording number of controls matched;

  run;

   ****************************************************************************************

   *********  (2)   MATCHES   WITH EVEN MORE FUZZY AGE

   ***************************************************************************************;

   *********  NOTE: if does not find age within 5 years I want it to use the 10 year interval;

  * the main data step;

  data _null_;

  if _n_=1 then do;

  set cases_h(obs=1); *make the variables in Cases data set available in the PDV;

  *put the cases in a hash table;

  declare hash cases(dataset:'cases_H',hashexp:15,ordered:'y');

  cases.definekey("case_rand","case_id");

  cases.definedata("case_rand","case_id","count","case_age","case_gender");

  cases.definedone();

  declare hiter hi_cases('cases'); * declare a hash table iterator object;

  declare hash matches(); *declare a hash table for matched cases and controls;

  matches.definekey("case_id","control_id");

  matches.definedata("case_id","control_id");

  matches.definedone();

  *declare a hash table for recording matched controls;

  control_id_hash=case_id;

  declare hash m_control();

  m_control.definekey("control_id_hash");

  m_control.definedone();

  m_control.clear();

  end;

  set controls_h end=eof;

  control_id_hash=control_id; *get current control_id for searching;

  if (m_control.find() ne 0) then do; *not matched to a case yet;

  rc=hi_cases.first(); *search cases table using hash iterator object;

  do while(rc=0);

  if (count<&ratio.  and

  case_gender=control_gender and abs(case_age-control_age)<=5) then do;

  count+1;

  cases.replace();

  matches.add();

  m_control.add();

  leave;

  end;

       if (matches.find() ne 0) then do; ***if case is not found in matched output???**;

  if (count<&ratio.    and

  case_gender=control_gender and abs(case_age-control_age)<=10) then do;

  count+1;

  cases.replace();

  matches.add();

  m_control.add();

  leave;

  end;

  end;

  rc=hi_cases.next();

  end;

  end;

  *check if all the cases have matches (ie, count=&ratio.);

  done=1;

  rc=hi_cases.first();

  do while(rc=0);

  if count<&ratio. then do;

  done=0;

  leave;

  end;

  rc=hi_cases.next();

  end;

  *if all the cases are matched or run out of controls, output the resulting data sets;

  if (done or eof) then do;

  matches.output(dataset:"matches");

  cases.output(dataset:"matched_cases");

  m_control.output(dataset:"matched_controls");

  stop;

  end;

  run;

Respected Advisor
Posts: 4,651

Re: Hash Table for Case-Control Matching

There is a simpler way to do this random case matching, probably not as fast but more flexible than hashing (you could easily set up many selection stages). Look at the following:

%let nCase=100;
%let nControl=100000;
%let ratio=2;

* Generate the Cases data set;
data cases(drop=i);
length gender $1;
do i=1 to &nCase.;
     id=i;
     age=floor(ranuni(0)*100);
     gender=ifc(ranuni(0)>0.5,"F","M");
     output;
     end;
run;

* Generate the Controls data set;
data controls(drop=i);
length gender $1;
do i=1 to &nControl.;
     id=i;
     age=floor(ranuni(0)*73);
     gender=ifc(ranuni(5)>0.7,"F","M");
     output;
     end;
run;

/* Combine cases and controls, calculate selection stage distance (0-5)=0, (6-10)=1, add a random number to provide

random sampling within each selection stage */

proc sql;
create table matches as
select
     c.id as caseId,
     c.gender,
     c.age as caseAge,
     r.id as controlId,
     r.age as controlAge,
     ranuni(4323) as rand,
     (abs(c.age-r.age) > 5) + (abs(c.age-r.age) > 10) as dist
from cases as c left join controls as r
      on c.gender=r.gender and c.age between r.age - 10 and r.age + 10
order by c.id, dist, rand;
quit;

/* Extract first two matches for each case ID */

data sample;
do until (last.caseId);
     set matches; by caseId;
     order + 1;
     if order <= &ratio then output;
     end;
order = 0;
drop rand dist;
run;

PG


PG
New Contributor
Posts: 3

Re: Hash Table for Case-Control Matching

PG,

Thank you for your reply.  One of the reasons I was considering the hash approach is because of the size of the data I will be working with.  I am currently working with approx 45k cases and 800k possible controls.  I will run a few examples with the sql approach and see how it works.

Joe

Respected Advisor
Posts: 4,651

Re: Hash Table for Case-Control Matching

You are likely to require another approach with that many cases, or tighter matching criteria.  Run it with a 100 cases and the full set of controls and check the number of matches per case. That will give you an estimate of the matches table size. - PG

PG
Respected Advisor
Posts: 4,651

Re: Hash Table for Case-Control Matching

There is a simple way to reduce the amount of matching to be done: find all the matches for each combination of gender and age represented in the cases. I.e. two females cases of 48 will necessarily be matched to the same control list, so generate that list only once, put it in a random order, and match the first two controls to the first case and the two next ones to the second female case. In other words, the cases list will shrink to less than 200, making a SQL based solution feasable (I think Smiley Happy)

%let nCase=100;
%let nControl=100000;
%let ratio=2;
%let bestAgeDiff=5;
%let maxAgeDiff=10;

* Generate the Cases data set;
data cases(drop=i);
length gender $1;
do i=1 to &nCase.;
id=i;
age=floor(ranuni(0)*100);
gender=ifc(ranuni(0)>0.5,"F","M");
output;
end;
run;

* Generate the Controls data set;
data controls(drop=i);
length gender $1;
do i=1 to &nControl.;
id=i;
age=floor(ranuni(0)*73);
gender=ifc(ranuni(5)>0.7,"F","M");
output;
end;
run;

proc sort data=cases; by gender age; run;

/* Make a list of all gender-age unique combinations in the cases */

proc sql;
create table uniqueCases as
select unique gender, age
from cases
order by gender, age;
quit;

/* Find all possible matches of the controls to the unique cases. Generate two kinds of matches:

those within &bestAgeDiff years of each other and those within &maxAgeDiff years */

proc sql;
create table matches as
select
     U.gender,
     U.age,
     C.id as controlId,
     (abs(U.age-C.age) > &bestAgeDiff.) + (abs(U.age-C.age) > &maxAgeDiff.) as dist,
     ranuni(4323) as rand
from uniqueCases as U left join controls as C
     on U.gender=C.gender and U.age between C.age - &maxAgeDiff. and C.age + &maxAgeDiff.
order by U.gender, U.age, dist, rand;
quit;

/* Create one case observation for each expected match */

data nCases;
set cases;
do match = 1 to &ratio;
     output;
     end;
run;

/* Pick the first randomly ordered controls for all expected matches */

data sample;
match = 0; /* detects the end of expected matches for a given gender-age group */
merge nCases matches;
by gender age;
if match>0;
drop rand;
run;

PG

PG
New Contributor
Posts: 3

Re: Hash Table for Case-Control Matching

PG,

I used your above sql logic to my actual data.  It ran very fast.  The only problem is that the controls are being matched to more then one case.  My actual data is more complex than the simulated, there are more fields. When I run the code on the simulated data I don't get any duplicate controls.

Respected Advisor
Posts: 4,651

Re: Hash Table for Case-Control Matching

Yes, this is likely to happen if your cases population is not distributed the same way as your control population. Groups that are abundant in the cases but rare in the controls will generate multiple matches. Notice that duplicate matches occur across age groups, not within the same age group. I don't know how to prevent this completely. One way of reducing the likelyhood of duplicates would be to favour exact age matches over approximate ones. To do this, you could try changing variable dist expression to:

(U.age ne C.age) + (abs(U.age-C.age) > &bestAgeDiff.) + (abs(U.age-C.age) > &maxAgeDiff.) as dist,

Best of luck.

PG

PG
Ask a Question
Discussion stats
  • 6 replies
  • 339 views
  • 3 likes
  • 2 in conversation