Hi,
I'm using SAS to match cases/controls on conditions (in this case sex at birth and 10-year age range, which I already defined) for a matched case/control study on comorbidities and COVID. I have about 300 cases and 2000 controls in my dataset, and ideally I would like to match 1-3 controls to each case without reusing controls.
I use proc sql to match, and end up with a table showing every possible match - every permutation, where there's a separate line for every match. About 20,000 lines. However, a single control may be matched to 100 different cases. So it might look like this:
case_id control_id
1 10
1 125
1 1850
1 928
3 125
3 8
3 1276
3 1011
3 10
4 10
4 1011
I'd like to limit it to only use each control once, but to still allow multiple matches per case - and not to delete any cases unless they have ZERO matches. But, when I try to deduplicate, I end up with the vast majority of my cases getting deleted.
Here's my code for matching:
proc sql;
create table controls_id
as select
one.uniqueid as case_id,
two.uniqueid as control_id,
one.age_range as case_agegroup,
two.age_range as control_agegroup,
one.sex_at_birth as case_sab,
two.sex_at_birth as control_sab
from case one, control two
where (one.age_range=two.age_range and one.sex_at_birth = two.sex_at_birth);
And here's my code for deduplicating by control_ID - when I run this, it deletes all instances of the duplicated controls, but (I think because it's deleting randomly, and every single control matches more than one case) takes most of my cases with it.
proc sort data=controls_id nodupkey;
by control_id;
run;
I would like to delete observations with a duplicate control value, but prioritize deletion so that I lose as few cases as possible. Only one case does not have a matching control and each remaining case matches with more than one control.
Would appreciate any help - thank you!!
@mfredrickson42 , I managed to come up with a simple hash-based heuristic to generate some decent matches:
For each control_id
pick among the possible case_id the one with the fewer matches so far
repeat.
data have;
input case_id control_id;
datalines;
1 10
1 125
1 1850
1 928
3 125
3 8
3 1276
3 1011
3 10
4 10
4 1011
;
proc sort data=have; by control_id; run;
data want;
if _n_ = 1 then do;
call missing(case_id);
declare hash h (multidata:"no");
h.definekey("case_id");
h.definedata("count");
h.definedone();
count = 0;
do until(done);
set have(keep=case_id) end=done;
if h.find() ne 0 then h.add();
end;
end;
minCount = 9999;
do until(last.control_id);
set have; by control_id;
h.find();
if count < minCount then do;
minCount = count;
bestCase = case_id;
end;
end;
case_id = bestCase;
output;
count = minCount + 1;
h.replace();
keep case_id control_id;
run;
proc sort data=want; by case_id control_id; run;
proc print data=want noobs; run;
case_id control_id 1 10 1 125 1 928 1 1850 3 8 3 1276 4 1011
Note: You might get slightly more balanced results by replacing if count < minCount by if count <= minCount. Do some testing.
Might not be the simplest way, but could be done with three linear assignments:
data have;
input case_id control_id;
datalines;
1 10
1 125
1 1850
1 928
3 125
3 8
3 1276
3 1011
3 10
4 10
4 1011
;
data links1;
set have;
rename case_id=from control_id=to;
weight = 1;
run;
proc optnet data_links=links1 GRAPH_DIRECTION=DIRECTED;
linear_assignment out=want1;
run;
proc print data=want1; run;
proc sql;
create table links2 as
select * from links1
where to not in (select to from want1);
quit;
proc optnet data_links=links2 GRAPH_DIRECTION=DIRECTED;
linear_assignment out=want2;
run;
proc print data=want2; run;
proc sql;
create table links3 as
select * from links2
where to not in (select to from want2);
quit;
proc optnet data_links=links3 GRAPH_DIRECTION=DIRECTED;
linear_assignment out=want3;
run;
proc print data=want3; run;
data want;
set want1 want2 want3;
rename from=case_id to=control_id;
drop weight;
run;
proc sort data=want; by case_id control_id; run;
proc print data=want noobs; run;
case_id control_id 1 10 1 928 1 1850 3 8 3 125 3 1276 4 1011
This looks really promising, but unfortunately looks like my version of SAS doesn't have the OPTNET option. Any substitutions you might recommend?
I guess another approach to get this done could be based on the hash object, but it is not obvious. I suggest reposting your question with a title like "Hash object for matching cases and controls?" to get the attention of hash gurus.
@mfredrickson42 , I managed to come up with a simple hash-based heuristic to generate some decent matches:
For each control_id
pick among the possible case_id the one with the fewer matches so far
repeat.
data have;
input case_id control_id;
datalines;
1 10
1 125
1 1850
1 928
3 125
3 8
3 1276
3 1011
3 10
4 10
4 1011
;
proc sort data=have; by control_id; run;
data want;
if _n_ = 1 then do;
call missing(case_id);
declare hash h (multidata:"no");
h.definekey("case_id");
h.definedata("count");
h.definedone();
count = 0;
do until(done);
set have(keep=case_id) end=done;
if h.find() ne 0 then h.add();
end;
end;
minCount = 9999;
do until(last.control_id);
set have; by control_id;
h.find();
if count < minCount then do;
minCount = count;
bestCase = case_id;
end;
end;
case_id = bestCase;
output;
count = minCount + 1;
h.replace();
keep case_id control_id;
run;
proc sort data=want; by case_id control_id; run;
proc print data=want noobs; run;
case_id control_id 1 10 1 125 1 928 1 1850 3 8 3 1276 4 1011
Note: You might get slightly more balanced results by replacing if count < minCount by if count <= minCount. Do some testing.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.