@stellapersis7 wrote:
Hi all,
I have datasets called cases and controls. I need to match 1:1 from cases and controls using the following variables:
- age and gender should be exactly matched
- for duration, duration of controls should be more than duration of cases.
Hi @stellapersis7,
Are the above two bullet points the only requirements? Consider this simple example with only three cases (durations 1, 3 and 4) and three controls (durations 2, 3 and 5), all with the same age and gender.
Obviously, there are several solutions satisfying your requirements: One is the set {(1, 3), (3, 5)} of (case, control) pairs, highlighted in green in the graph (where the subjects are represented by their "duration" values for simplicity). Other solutions are {(1, 2), (3, 5)}, {(1, 2), (4, 5)} and {(1, 3), (4, 5)} -- but also {(1, 5)}. The latter set contains only one (case, control) pair, as there is no eligible control left for the cases with durations 3 and 4, once the control with the large duration 5 has been "wastefully" assigned to the case with duration 1.
Mathematically, your goal is to find a matching in a bipartite graph. If you want to obtain a set with as many eligible (case, control) pairs as possible, this would be called a maximum (cardinality) matching. The maximum possible cardinality in the above example is 2, so the singleton matching {(1, 5)} is not a maximum matching.
I think (but haven't proved mathematically; I don't know much about graph theory) that the DATA step suggested below (creating dataset WANT) finds a maximum matching. It uses case and control datasets sorted by age, gender and descending duration. Starting with the maximum duration in each age-gender BY-group of the CASES dataset, it randomly selects one of the eligible controls in the CONTROLS dataset (if any). Technically, it temporarily stores the ENROLIDs and durations of one BY-group of the controls in a hash object (using a sequential number _c as the key), which is convenient because a control that has been assigned to a case can be easily deleted in order to avoid duplicate assignments.
Output dataset WANT contains all observations from dataset CASES plus the ENROLID of the assigned control, named ENROLID_CONTROL, and the corresponding DURATION_CONTROL. The latter two variables have missing values if no eligible control was left for the case.
Let me first create sample datasets CASES and CONTROLS with about 1000 cases and 3000 controls. (The purpose of the exclusions via WHERE= dataset options is to include non-matching cases and controls.)
/* Create sample data for demonstration */
data cases(rename=(d=duration_case) where=(age ne 21))
     controls(rename=(d=duration_control) where=(age ne 42));
  call streaminit(27182818);
  do enrolid=1 to 4000;
    age=rand('integer',18,80);
    gender=char('MF',rand('integer',1,2));
    d=rand('integer',1,2000);
    if enrolid<1000 then output cases;
    else output controls;
  end;
run;

proc sort data=cases;
  by age gender descending duration_case;
run;

proc sort data=controls;
  by age gender descending duration_control;
run;
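
/* Match controls to cases */
data want(drop=_:);
  call streaminit(3141592);
  if _n_=1 then do;
    if 0 then set cases; /* define the variables of CASES at compile time */
    dcl hash h(ordered:'a'); /* controls of the current BY-group, keyed by _c */
    h.definekey('_c');
    h.definedata('_c','enrolid_control','duration_control');
    h.definedone();
    dcl hiter hi('h');
  end;
  /* Interleave: within each age-gender BY-group the controls are read before the cases */
  set controls(in=ctrl rename=(enrolid=enrolid_control)) cases(in=case);
  by age gender;
  if ctrl then do; /* store the control under a sequential key (1 = largest duration) */
    if first.gender then _c=1;
    else _c+1;
    h.add();
  end;
  if case then do;
    /* Count the remaining controls whose duration exceeds that of the case */
    _i=0;
    _rc=hi.first();
    do while(_rc=0 & duration_control>duration_case);
      _i+1;
      _rc=hi.next();
    end;
    if _i then do; /* step back to a randomly selected eligible control */
      _r=rand('integer',_i);
      do _j=1 to _r;
        hi.prev();
      end;
    end;
    else call missing(enrolid_control, duration_control);
    output;
    if _i then do; /* delete the assigned control to avoid duplicate assignments */
      _d=_c;
      _rc=hi.prev(); /* move the iterator away from the item to be removed */
      _rc=h.remove(key:_d);
    end;
  end;
  if last.gender then do; /* move the iterator off the hash, then empty it for the next BY-group */
    _rc=hi.first();
    _rc=hi.prev();
    h.clear();
  end;
run;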
I have also written a "reverse" variant of the above program (not shown here), i.e., assigning cases to controls, using input datasets sorted by age, gender and ascending duration, starting the assignments with the smallest DURATION_CONTROL in each age-gender BY-group. With all input datasets I tested, it obtained the exact same number of matches as the above program -- indicating that those numbers might be the maximum possible "cardinalities".
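To compare match counts between program versions, and to confirm that the result respects the matching rules, a few queries on WANT may help. This is only a sketch: it assumes the variable names produced by the DATA step above (ENROLID, ENROLID_CONTROL, DURATION_CASE, DURATION_CONTROL) and that ENROLID_CONTROL is missing for unmatched cases.

```sas
proc sql;
  /* Number of cases and number of cases that received a control */
  select count(*)               as n_cases,
         count(enrolid_control) as n_matched
    from want;

  /* Controls assigned more than once (no rows expected) */
  select enrolid_control, count(*) as n_used
    from want
    where enrolid_control is not missing
    group by enrolid_control
    having count(*) > 1;

  /* Matched pairs violating the duration rule (no rows expected) */
  select enrolid, enrolid_control, duration_case, duration_control
    from want
    where enrolid_control is not missing
      and duration_control <= duration_case;
quit;
```

The first query gives the "cardinality" of the matching; the other two should return no rows if every control is used at most once and every control duration exceeds the duration of its case.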
Please note, however, that results of both versions of the program are somewhat "biased" in a sense: The above version "favors" large case durations (within each age-gender BY-group). Cases with smaller durations may be left unassigned because eligible controls have already been assigned earlier. Similarly, the reverse program version "favors" small control durations. You would have to decide if such "biases" are acceptable for whatever statistical analysis you are planning to perform with the matched case-control pairs.
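One way to get a feeling for this "bias" is to compare the case durations of matched and unmatched cases. The following is a minimal sketch, assuming dataset WANT from the program above; the dataset name WANT_FLAG and the variable MATCHED are made up for illustration.

```sas
/* Flag cases that received a control (MATCHED is a hypothetical helper variable) */
data want_flag;
  set want;
  matched = not missing(enrolid_control);
run;

/* Compare the distribution of DURATION_CASE between unmatched (0) and matched (1) cases */
proc means data=want_flag n mean median min max;
  class matched;
  var duration_case;
run;
```

If the unmatched cases have clearly smaller durations than the matched ones, that would be in line with the effect described above. Note that the summary is taken over all age-gender BY-groups combined.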
In the small example above, the program would always assign control "5" to case "4" and hence leave case "3" unassigned. The "reverse" version of the program would always assign case "1" to control "2" and hence leave control "3" unassigned. Therefore, neither of the two program versions could ever obtain the "green" solution {(1, 3), (3, 5)}. If that is a problem and you want to avoid the "biases" mentioned above and your SAS license (unlike mine) includes SAS/OR or similar modules for optimization, I think you should post your question in the Mathematical Optimization, Discrete-Event Simulation, and OR forum. SAS/OR contains advanced procedures that are suitable for such "graph theoretic" problems.
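To try the small example with the program above, the three cases and three controls could be set up as follows. This is a sketch with made-up ENROLID, AGE and GENDER values; only the durations are taken from the example. Note that it overwrites the sample datasets CASES and CONTROLS.

```sas
/* Hypothetical toy data: ENROLID, AGE and GENDER values are made up. */
/* Caution: this replaces the sample datasets CASES and CONTROLS.     */
data cases;
  input enrolid age gender $ duration_case;
  datalines;
1 50 F 1
2 50 F 3
3 50 F 4
;

data controls;
  input enrolid age gender $ duration_control;
  datalines;
4 50 F 2
5 50 F 3
6 50 F 5
;

proc sort data=cases;
  by age gender descending duration_case;
run;

proc sort data=controls;
  by age gender descending duration_control;
run;

/* Rerun the DATA step that creates WANT, then inspect the result */
proc print data=want;
run;
```

Based on the description above, the case with duration 4 should receive the control with duration 5, the case with duration 3 should remain unassigned, and the case with duration 1 should receive one of the two remaining controls at random.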
EDIT: Unlike my test datasets, your sample data contain a duplicate case ENROLID (27264303). Depending on the rules to be applied to duplicates, the code above may need to be modified a bit in order to handle those cases correctly.
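A quick way to list such duplicates before matching might look like this (a sketch, assuming the case dataset is named CASES and the identifier variable is ENROLID):

```sas
proc sql;
  /* ENROLIDs occurring more than once in CASES */
  select enrolid, count(*) as n_records
    from cases
    group by enrolid
    having count(*) > 1;
quit;
```

The same query against CONTROLS would show whether duplicate ENROLIDs occur there as well.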