Solved: De-duplicating controls in matched case/control

mfredrickson42 · Posted 05-18-2020 07:31 PM

Hi,

I'm using SAS to match cases/controls on conditions (in this case sex at birth and 10-year age range, which I already defined) for a matched case/control study on comorbidities and COVID. I have about 300 cases and 2000 controls in my dataset, and ideally I would like to match 1-3 controls to each case without reusing controls.

I use proc sql to match, and end up with a table showing every possible match - every permutation, where there's a separate line for every match. About 20,000 lines. However, a single control may be matched to 100 different cases. So it might look like this:

case_id control_id

1 10

1 125

1 1850

1 928

3 125

3 8

3 1276

3 1011

3 10

4 10

4 1011

I'd like to limit it to only use each control once, but to still allow multiple matches per case - and not to delete any cases unless they have ZERO matches. But, when I try to deduplicate, I end up with the vast majority of my cases getting deleted.

Here's my code for matching:
proc sql;
create table controls_id
as select
one.uniqueid as case_id,
two.uniqueid as control_id,
one.age_range as case_agegroup,
two.age_range as control_agegroup,
one.sex_at_birth as case_sab,
two.sex_at_birth as control_sab

from case one, control two
where (one.age_range=two.age_range and one.sex_at_birth = two.sex_at_birth);

And here's my code for deduplicating by control_ID - when I run this, it deletes all instances of the duplicated controls, but (I think because it's deleting randomly, and every single control matches more than one case) takes most of my cases with it.

proc sort data=controls_id nodupkey;
by control_id;
run;

I would like to delete observations with a duplicate control value, but prioritize deletion so that I lose as few cases as possible. Only one case does not have a matching control and each remaining case matches with more than one control.

Would appreciate any help - thank you!!

PGStats · Posted 05-20-2020 01:37 AM

@mfredrickson42 , I managed to come up with a simple hash-based heuristic to generate some decent matches:

For each control_id

pick among the possible case_id the one with the fewer matches so far

repeat.

data have;
input case_id  control_id;
datalines;
  1         10
  1        125
  1     1850
  1       928
  3       125
  3          8
  3     1276
  3      1011
  3         10
  4         10
  4      1011
;

proc sort data=have; by control_id; run;

data want;
if _n_ = 1 then do;
    call missing(case_id);
    declare hash h (multidata:"no");
    h.definekey("case_id");
    h.definedata("count");
    h.definedone();
    count = 0;
    do until(done);
        set have(keep=case_id) end=done;
        if h.find() ne 0 then h.add();
        end;
    end;
minCount = 9999;
do until(last.control_id);
    set have; by control_id;
    h.find(); 
    if count < minCount then do;
        minCount = count;
        bestCase = case_id;
        end;
    end;
case_id = bestCase;
output;
count = minCount + 1;
h.replace();
keep case_id control_id;
run;

proc sort data=want; by case_id control_id; run;

proc print data=want noobs; run;

case_id control_id
1 	10
1 	125
1 	928
1 	1850
3 	8
3 	1276
4 	1011

Note: You might get slightly more balanced results by replacing if count < minCount by if count <= minCount. Do some testing.

PG

View solution in original post

PGStats · Posted 05-19-2020 01:20 AM

Might not be the simplest way, but could be done with three linear assignments:

data have;
input case_id  control_id;
datalines;
  1         10
  1        125
  1     1850
  1       928
  3       125
  3          8
  3     1276
  3      1011
  3         10
  4         10
  4      1011
;

data links1;
set have;
rename case_id=from control_id=to;
weight = 1;
run;

proc optnet data_links=links1 GRAPH_DIRECTION=DIRECTED;
linear_assignment out=want1;
run;

proc print data=want1; run;

proc sql;
create table links2 as
select * from links1
where to not in (select to from want1);
quit;

proc optnet data_links=links2 GRAPH_DIRECTION=DIRECTED;
linear_assignment out=want2;
run;

proc print data=want2; run;

proc sql;
create table links3 as
select * from links2
where to not in (select to from want2);
quit;

proc optnet data_links=links3 GRAPH_DIRECTION=DIRECTED;
linear_assignment out=want3;
run;

proc print data=want3; run;

data want;
set want1 want2 want3;
rename from=case_id to=control_id;
drop weight;
run;

proc sort data=want; by case_id control_id; run;

proc print data=want noobs; run;

case_id control_id
1 	10
1 	928
1 	1850
3 	8
3 	125
3 	1276
4 	1011

PG

mfredrickson42 · Posted 05-19-2020 02:16 PM

This looks really promising, but unfortunately looks like my version of SAS doesn't have the OPTNET option. Any substitutions you might recommend?

PGStats · Posted 05-19-2020 02:38 PM

I guess another approach to get this done could be based on the hash object, but it is not obvious. I suggest reposting your question with a title like "Hash object for matching cases and controls?" to get the attention of hash gurus.

PG

PGStats · Posted 05-20-2020 01:37 AM

@mfredrickson42 , I managed to come up with a simple hash-based heuristic to generate some decent matches:

For each control_id

pick among the possible case_id the one with the fewer matches so far

repeat.

data have;
input case_id  control_id;
datalines;
  1         10
  1        125
  1     1850
  1       928
  3       125
  3          8
  3     1276
  3      1011
  3         10
  4         10
  4      1011
;

proc sort data=have; by control_id; run;

data want;
if _n_ = 1 then do;
    call missing(case_id);
    declare hash h (multidata:"no");
    h.definekey("case_id");
    h.definedata("count");
    h.definedone();
    count = 0;
    do until(done);
        set have(keep=case_id) end=done;
        if h.find() ne 0 then h.add();
        end;
    end;
minCount = 9999;
do until(last.control_id);
    set have; by control_id;
    h.find(); 
    if count < minCount then do;
        minCount = count;
        bestCase = case_id;
        end;
    end;
case_id = bestCase;
output;
count = minCount + 1;
h.replace();
keep case_id control_id;
run;

proc sort data=want; by case_id control_id; run;

proc print data=want noobs; run;

case_id control_id
1 	10
1 	125
1 	928
1 	1850
3 	8
3 	1276
4 	1011

Note: You might get slightly more balanced results by replacing if count < minCount by if count <= minCount. Do some testing.

PG

mfredrickson42 · Posted 05-21-2020 03:22 PM

This worked! THANK YOU!

De-duplicating controls in matched case/control

Re: De-duplicating controls in matched case/control

Re: De-duplicating controls in matched case/control

Re: De-duplicating controls in matched case/control

Re: De-duplicating controls in matched case/control

Re: De-duplicating controls in matched case/control

Re: De-duplicating controls in matched case/control

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!