Solved: matching cases and controls

stellapersis7 · Posted 06-04-2024 01:17 PM

Hi all,

I have datasets called cases and controls. I need to match 1:1 from cases and controls using the following variables:

age and gender should be exactly matched
for duration, duration of controls should be more than duration of cases.

data WORK.CASES;
infile datalines dsd truncover;
input ENROLID:32. duration_case:32. AGE:32. GENDER:$4.;
label ENROLID="Enrollee ID" AGE="Age of Patient";
datalines;
1096503 141 37 MALE
15101702 107 61 MALE
27264303 36 24 FEMA
27264303 274 24 FEMA
27642301 4 60 MALE
;;;;

data WORK.CONTROLS;
infile datalines dsd truncover;
input ENROLID:32. duration_control:32. AGE:32. GENDER:$4.;
label ENROLID="Enrollee ID" AGE="Age of Patient";
datalines;
68202 700 55 FEMA
125603 714 52 MALE
126003 725 59 FEMA
350802 723 58 FEMA
371401 729 57 MALE
;;;;

I used proc survey select but its allowing for 1:1 of all exactly matched variable. However, I need the duration of controls to be more than cases. Please help.

Output desired:

match	enrolid	age	gender	duration
case	XXXX	5	female	30
control	YYY	5	female	45
case	aaa	23	male	91
control	bbb	23	male	243

Thank you.

FreelanceReinh · Posted 06-10-2024 11:29 AM

@stellapersis7 wrote:

Hi all,

I have datasets called cases and controls. I need to match 1:1 from cases and controls using the following variables:

age and gender should be exactly matched

for duration, duration of controls should be more than duration of cases.

Hi @stellapersis7,

Are the above two bullet points the only requirements? Consider this simple example with only three cases and three controls (all with the same age and gender):

Obviously, there are several solutions satisfying your requirements: One is the set {(1, 3), (3, 5)} of (case, control) pairs, highlighted in green in the graph (where the subjects are represented by their "duration" values for simplicity). Other solutions are {(1, 2), (3, 5)}, {(1, 2), (4, 5)} and [(1, 3), (4, 5)} -- but also {(1, 5)}. The latter set contains only one (case, control) pair, as there is no eligible control left for the cases with durations 3 and 4, once the control with the large duration 5 has been "wastefully" assigned to the case with duration 1.

Mathematically, your goal is to find a matching in a bipartite graph. If you want to obtain a set with as many eligible (case, control) pairs as possible, this would be called a maximum (cardinality) matching. The maximum possible cardinality in the above example is 2, so the singleton matching {(1, 5)} is not a maximum matching.

I think (but haven't proved mathematically; I don't know much about graph theory) that the DATA step suggested below (creating dataset WANT) finds a maximum matching. It uses case and control datasets sorted by age, gender and descending duration. Starting with the maximum duration in each age-gender BY-group of the CASES dataset, it randomly selects one of the eligible controls in the CONTROLS dataset (if any). Technically, it temporarily stores the ENROLIDs and durations of one BY-group of the controls in a hash object (using a sequential number _c as the key), which is convenient because a control that has been assigned to a case can be easily deleted in order to avoid duplicate assignments.

Output dataset WANT contains all observations from dataset CASES plus the ENROLID of the assigned control, named ENROLID_CONTROL, and the corresponding DURATION_CONTROL. The latter two variables have missing values if no matching control was found (anymore).

Let me first create sample datasets CASES and CONTROLS with about 1000 cases and 3000 controls. (The purpose of the exclusions via WHERE= dataset options is to include non-matching cases and controls.)

/* Create sample data for demonstration */

data cases(rename=(d=duration_case) where=(age ne 21))
     controls(rename=(d=duration_control) where=(age ne 42));
call streaminit(27182818);
do enrolid=1 to 4000;
  age=rand('integer',18,80);
  gender=char('MF',rand('integer',1,2));
  d=rand('integer',1,2000);
  if enrolid<1000 then output cases;
  else output controls;
end;
run;

proc sort data=cases;
by age gender descending duration_case;
run;

proc sort data=controls;
by age gender descending duration_control;
run;

/* Match controls to cases */

data want(drop=_:);
call streaminit(3141592);
if _n_=1 then do;
  if 0 then set cases;
  dcl hash h(ordered:'a');
  h.definekey('_c');
  h.definedata('_c','enrolid_control','duration_control');
  h.definedone();
  dcl hiter hi('h');
end;
set controls(in=ctrl rename=(enrolid=enrolid_control)) cases(in=case);
by age gender;
if ctrl then do;
  if first.gender then _c=1;
  else _c+1;
  h.add();
end;
if case then do;
  _i=0;
  _rc=hi.first();
  do while(_rc=0 & duration_control>duration_case);
    _i+1;
    _rc=hi.next();
  end;
  if _i then do;
    _r=rand('integer',_i);
    do _j=1 to _r;
      hi.prev();
    end;
  end;
  else call missing(enrolid_control, duration_control);
  output;
  if _i then do;
    _d=_c;
    _rc=hi.prev();
    _rc=h.remove(key:_d);
  end;
end;
if last.gender then do;
  _rc=hi.first();
  _rc=hi.prev();
  h.clear();
end;
run;

I have also written a "reverse" variant of the above program (not shown here), i.e., assigning cases to controls, using input datasets sorted by age, gender and ascending duration, starting the assignments with the smallest DURATION_CONTROL in each age-gender BY-group. With all input datasets I tested, it obtained the exact same number of matches as the above program -- indicating that those numbers might be the maximum possible "cardinalities".

Please note, however, that results of both versions of the program are somewhat "biased" in a sense: The above version "favors" large case durations (within each age-gender BY-group). Cases with smaller durations may be left unassigned because eligible controls have already been assigned earlier. Similarly, the reverse program version "favors" small control durations. You would have to decide if such "biases" are acceptable for whatever statistical analysis you are planning to perform with the matched case-control pairs.

In the small example above, the program would always assign control "5" to case "4" and hence leave case "3" unassigned. The "reverse" version of the program would always assign case "1" to control "2" and hence leave control "3" unassigned. Therefore, neither of the two program versions could ever obtain the "green" solution {(1, 3), (3, 5)}. If that is a problem and you want to avoid the "biases" mentioned above and your SAS license (unlike mine) includes SAS/OR or similar modules for optimization, I think you should post your question in the Mathematical Optimization, Discrete-Event Simulation, and OR forum. SAS/OR contains advanced procedures that are suitable for such "graph theoretic" problems.

EDIT: Unlike my test datasets, your sample data contain a duplicate case ENROLID (27264303). Depending on the rules to be applied to duplicates, the code above may need to be modified a bit in order to handle those cases correctly.

View solution in original post

ballardw · Posted 06-04-2024 03:42 PM

If you are going to show an example of desired output data then the shown example input data should be possible to create the output. Your shown output has ages of 5 and 23 and neither of your starting data sets has either of those ages. In fact you show zero, none, no matches of age between your control and case examples. So we can't test any code that would be expected to match on age and gender with that data.

Provide data that allows a valid selection. Which should mean one of the sets, likely the Control set, has multiple values that would be valid matches.

Without seeing what you did to select the cases with Surveyselect we can't discuss that approach either.

stellapersis7 · Posted 06-05-2024 12:42 PM

Hi Ballard,

thanks for the reply. The output I have given is to show how the durations should be(i.e duration of control more than cases). I cannot show the exact input as the first few observations I have in both datasets do not have the age/gender/duration variables I can match upon.

I definitely can give you the code for proc survey select.

duration of cases is the pre_overlap_duration and for controls it is day_sga

/* cases */

data case_met ;

set AGE_MET1;

index= pre_overlap_duration||AGE||GENDER;

run;

proc sort data = case_met;

by index;

run;

PROC FREQ DATA= case_met NOPRINT;

TABLES INDEX/LIST MISSING OUT=CASECNT_MET (KEEP=INDEX COUNT

RENAME=(COUNT=CASECNT));

run;

/* controls*/

data ctrl_glp;

set ctrl_1;

index= day_sga||age||GENDER;

run;

proc sort data = ctrl_glp;

by index;

run;

PROC FREQ DATA= ctrl_glp NOPRINT;

TABLES INDEX/LIST MISSING OUT=CTRLCNT (KEEP=INDEX COUNT

RENAME=(COUNT=CTRLCNT));

run;

DATA ALLCOUNT;

MERGE CASECNT (IN=A) CTRLCNT (IN=B);

BY INDEX;

IF CASECNT > 0;

IF A AND NOT B THEN CTRLCNT = 0;

_NSIZE_ = MIN(CASECNT,CTRLCNT);

IF _NSIZE_ GT 0;

RUN;

PROC SQL;

CREATE TABLE ELIGIBLE_CONTROLS AS

SELECT *

FROM ctrl_glp

WHERE INDEX IN (SELECT INDEX FROM ALLCOUNT);

RUN;

PROC SORT DATA = ELIGIBLE_CONTROLS;

BY INDEX;

RUN;

PROC SQL;

CREATE TABLE ELIGIBLE_CASES AS

SELECT *

FROM case_glp

WHERE INDEX IN (SELECT INDEX FROM ALLCOUNT);

RUN;

PROC SORT DATA = ELIGIBLE_CASES;

BY INDEX;

RUN;

PROC SURVEYSELECT DATA = ELIGIBLE_CONTROLS

SAMPSIZE = ALLCOUNT

METHOD = SRS

SEED=499812

OUT=SELECTED_CONTROLS;

STRATA INDEX;

RUN;

PROC SURVEYSELECT DATA = ELIGIBLE_CASES

SAMPSIZE = ALLCOUNT

METHOD = SRS

SEED=499812

OUT=SELECTED_CASES;

STRATA INDEX;

RUN;

DATA CC (KEEP=ENROLID INDEX CCID);

SET SELECTED_CONTROLS (IN=A KEEP=ENROLID INDEX)

SELECTED_CASES (IN=B KEEP=ENROLID INDEX);

IF A THEN CCID = 1; *CONTROLS;

ELSE IF B THEN CCID = 0; *CASES;

RUN;

PROC SORT DATA= CC;

BY INDEX CCID;

RUN;

DATA CC1 (KEEP=ENROLID INDEX CCID MATCHID);

SET CC;

BY INDEX CCID;

LENGTH CTKTR CAKTR IDXID 8 IDA $6 MATCHX $50 MATCHID 8;

ATTRIB MATCHID FORMAT =20.;

RETAIN CTKTR CAKTR IDXID;

IF CCID = 1 THEN CTKTR +1; * COUNTER FOR CONTROLS;

ELSE IF CCID = 0 THEN CAKTR +1; * COUNTER FOR CASES;

IF FIRST.INDEX THEN IDXID +1; * INCREASE INDEX COUNT;

IDA = COMPRESS(SUBSTR(INDEX,4,6),'*'); * RETAIN PART OF INDEX;

IDX= PUT(IDXID,$4.); * COUNTER (CHARACTER);

IF CCID = 1 THEN MATCHX = IDX||IDA||CTKTR; * MATCHID FOR CONTROLS;

ELSE IF CCID = 0 THEN MATCHX = IDX||IDA||CAKTR;* MATCHID FOR CASES;

MATCHX = COMPRESS(MATCHX,'');

MATCHID = INPUT(MATCHX, 20.); * NUMERIC MATCHID;

RUN;

FreelanceReinh · Posted 06-10-2024 11:29 AM

@stellapersis7 wrote:

Hi all,

I have datasets called cases and controls. I need to match 1:1 from cases and controls using the following variables:

age and gender should be exactly matched

for duration, duration of controls should be more than duration of cases.

Hi @stellapersis7,

Are the above two bullet points the only requirements? Consider this simple example with only three cases and three controls (all with the same age and gender):

Obviously, there are several solutions satisfying your requirements: One is the set {(1, 3), (3, 5)} of (case, control) pairs, highlighted in green in the graph (where the subjects are represented by their "duration" values for simplicity). Other solutions are {(1, 2), (3, 5)}, {(1, 2), (4, 5)} and [(1, 3), (4, 5)} -- but also {(1, 5)}. The latter set contains only one (case, control) pair, as there is no eligible control left for the cases with durations 3 and 4, once the control with the large duration 5 has been "wastefully" assigned to the case with duration 1.

Mathematically, your goal is to find a matching in a bipartite graph. If you want to obtain a set with as many eligible (case, control) pairs as possible, this would be called a maximum (cardinality) matching. The maximum possible cardinality in the above example is 2, so the singleton matching {(1, 5)} is not a maximum matching.

I think (but haven't proved mathematically; I don't know much about graph theory) that the DATA step suggested below (creating dataset WANT) finds a maximum matching. It uses case and control datasets sorted by age, gender and descending duration. Starting with the maximum duration in each age-gender BY-group of the CASES dataset, it randomly selects one of the eligible controls in the CONTROLS dataset (if any). Technically, it temporarily stores the ENROLIDs and durations of one BY-group of the controls in a hash object (using a sequential number _c as the key), which is convenient because a control that has been assigned to a case can be easily deleted in order to avoid duplicate assignments.

Output dataset WANT contains all observations from dataset CASES plus the ENROLID of the assigned control, named ENROLID_CONTROL, and the corresponding DURATION_CONTROL. The latter two variables have missing values if no matching control was found (anymore).

Let me first create sample datasets CASES and CONTROLS with about 1000 cases and 3000 controls. (The purpose of the exclusions via WHERE= dataset options is to include non-matching cases and controls.)

/* Create sample data for demonstration */

data cases(rename=(d=duration_case) where=(age ne 21))
     controls(rename=(d=duration_control) where=(age ne 42));
call streaminit(27182818);
do enrolid=1 to 4000;
  age=rand('integer',18,80);
  gender=char('MF',rand('integer',1,2));
  d=rand('integer',1,2000);
  if enrolid<1000 then output cases;
  else output controls;
end;
run;

proc sort data=cases;
by age gender descending duration_case;
run;

proc sort data=controls;
by age gender descending duration_control;
run;

/* Match controls to cases */

data want(drop=_:);
call streaminit(3141592);
if _n_=1 then do;
  if 0 then set cases;
  dcl hash h(ordered:'a');
  h.definekey('_c');
  h.definedata('_c','enrolid_control','duration_control');
  h.definedone();
  dcl hiter hi('h');
end;
set controls(in=ctrl rename=(enrolid=enrolid_control)) cases(in=case);
by age gender;
if ctrl then do;
  if first.gender then _c=1;
  else _c+1;
  h.add();
end;
if case then do;
  _i=0;
  _rc=hi.first();
  do while(_rc=0 & duration_control>duration_case);
    _i+1;
    _rc=hi.next();
  end;
  if _i then do;
    _r=rand('integer',_i);
    do _j=1 to _r;
      hi.prev();
    end;
  end;
  else call missing(enrolid_control, duration_control);
  output;
  if _i then do;
    _d=_c;
    _rc=hi.prev();
    _rc=h.remove(key:_d);
  end;
end;
if last.gender then do;
  _rc=hi.first();
  _rc=hi.prev();
  h.clear();
end;
run;

I have also written a "reverse" variant of the above program (not shown here), i.e., assigning cases to controls, using input datasets sorted by age, gender and ascending duration, starting the assignments with the smallest DURATION_CONTROL in each age-gender BY-group. With all input datasets I tested, it obtained the exact same number of matches as the above program -- indicating that those numbers might be the maximum possible "cardinalities".

Please note, however, that results of both versions of the program are somewhat "biased" in a sense: The above version "favors" large case durations (within each age-gender BY-group). Cases with smaller durations may be left unassigned because eligible controls have already been assigned earlier. Similarly, the reverse program version "favors" small control durations. You would have to decide if such "biases" are acceptable for whatever statistical analysis you are planning to perform with the matched case-control pairs.

In the small example above, the program would always assign control "5" to case "4" and hence leave case "3" unassigned. The "reverse" version of the program would always assign case "1" to control "2" and hence leave control "3" unassigned. Therefore, neither of the two program versions could ever obtain the "green" solution {(1, 3), (3, 5)}. If that is a problem and you want to avoid the "biases" mentioned above and your SAS license (unlike mine) includes SAS/OR or similar modules for optimization, I think you should post your question in the Mathematical Optimization, Discrete-Event Simulation, and OR forum. SAS/OR contains advanced procedures that are suitable for such "graph theoretic" problems.

EDIT: Unlike my test datasets, your sample data contain a duplicate case ENROLID (27264303). Depending on the rules to be applied to duplicates, the code above may need to be modified a bit in order to handle those cases correctly.

stellapersis7 · Posted 06-12-2024 02:55 PM

Thank you, it works!

Ksharp · Posted 06-12-2024 11:42 PM

Very interesting question.

If your data is not big , try this code:

p.s. using data is from Freelance.


data cases(rename=(d=duration_case) where=(age ne 21))
     controls(rename=(d=duration_control) where=(age ne 42));
call streaminit(27182818);
do enrolid=1 to 4000;
  age=rand('integer',18,80);
  gender=char('MF',rand('integer',1,2));
  d=rand('integer',1,2000);
  if enrolid<1000 then output cases;
  else output controls;
end;
run;

data want;
if _n_=1 then do;
 if 0 then set controls(rename=(enrolid=control_enrolid age=control_age gender=control_gender));
 declare hash h(dataset:'controls(rename=(enrolid=control_enrolid age=control_age gender=control_gender))',ordered:'y');
 declare hiter hi('h');
 h.definekey('control_enrolid','control_age','control_gender','duration_control');
 h.definedata('control_enrolid','control_age','control_gender','duration_control');
 h.definedone();
end;
set cases(rename=(enrolid=case_enrolid age=case_age gender=case_gender));
match='case   ';	enrolid=case_enrolid;	age=case_age; gender=case_gender;	duration=duration_case;
output;
rc=hi.first();
do while(rc=0);
 if case_age=control_age and case_gender=control_gender and duration_case<duration_control then do;
   match='control';	enrolid=control_enrolid;	age=control_age; gender=control_gender;	duration=duration_control;
   output;
   found=1;
   leave;
 end;
 rc=hi.next();
end;
if found then do;
rc=hi.next();
rc=h.remove(key:enrolid,key:age,key:gender,key:duration);
end;
keep  match	enrolid	age 	gender	duration ;
run;

matching cases and controls

Re: matching cases and controls

Re: matching cases and controls

Re: matching cases and controls

Re: matching cases and controls

Re: matching cases and controls

Re: matching cases and controls

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!