BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
mfredrickson42
Calcite | Level 5

Hi,

 

I'm using SAS to match cases/controls on conditions (in this case sex at birth and 10-year age range, which I already defined) for a matched case/control study on comorbidities and COVID. I have about 300 cases and 2000 controls in my dataset, and ideally I would like to match 1-3 controls to each case without reusing controls.

 

I use proc sql to match, and end up with a table showing every possible match - every permutation, where there's a separate line for every match. About 20,000 lines. However, a single control may be matched to 100 different cases. So it might look like this:

 

case_id           control_id

           1                     10

           1                    125

           1                 1850

           1                   928

           3                   125

           3                      8

           3                 1276

           3                  1011

           3                     10

           4                     10

           4                  1011

 

I'd like to limit it to only use each control once, but to still allow multiple matches per case - and not to delete any cases unless they have ZERO matches. But, when I try to deduplicate, I end up with the vast majority of my cases getting deleted.

 

Here's my code for matching:
proc sql;
create table controls_id
as select
one.uniqueid as case_id,
two.uniqueid as control_id,
one.age_range as case_agegroup,
two.age_range as control_agegroup,
one.sex_at_birth as case_sab,
two.sex_at_birth as control_sab

from case one, control two
where (one.age_range=two.age_range and one.sex_at_birth = two.sex_at_birth);

 

And here's my code for deduplicating by control_ID - when I run this, it deletes all instances of the duplicated controls, but (I think because it's deleting randomly, and every single control matches more than one case) takes most of my cases with it.

 

proc sort data=controls_id nodupkey;
by control_id;
run;

 

I would like to delete observations with a duplicate control value, but prioritize deletion so that I lose as few cases as possible. Only one case does not have a matching control and each remaining case matches with more than one control.

 

Would appreciate any help - thank you!!

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

@mfredrickson42 , I managed to come up with a simple hash-based heuristic to generate some decent matches:

 

 For each control_id

pick among the possible case_id the one with the fewer matches so far

repeat.

 

data have;
input case_id  control_id;
datalines;
  1         10
  1        125
  1     1850
  1       928
  3       125
  3          8
  3     1276
  3      1011
  3         10
  4         10
  4      1011
;

proc sort data=have; by control_id; run;

data want;
if _n_ = 1 then do;
    call missing(case_id);
    declare hash h (multidata:"no");
    h.definekey("case_id");
    h.definedata("count");
    h.definedone();
    count = 0;
    do until(done);
        set have(keep=case_id) end=done;
        if h.find() ne 0 then h.add();
        end;
    end;
minCount = 9999;
do until(last.control_id);
    set have; by control_id;
    h.find(); 
    if count < minCount then do;
        minCount = count;
        bestCase = case_id;
        end;
    end;
case_id = bestCase;
output;
count = minCount + 1;
h.replace();
keep case_id control_id;
run;

proc sort data=want; by case_id control_id; run;

proc print data=want noobs; run;
case_id control_id
1 	10
1 	125
1 	928
1 	1850
3 	8
3 	1276
4 	1011

Note: You might get slightly more balanced results by replacing if count < minCount by if count <= minCount. Do some testing.

PG

View solution in original post

5 REPLIES 5
PGStats
Opal | Level 21

Might not be the simplest way, but could be done with three linear assignments:

 

data have;
input case_id  control_id;
datalines;
  1         10
  1        125
  1     1850
  1       928
  3       125
  3          8
  3     1276
  3      1011
  3         10
  4         10
  4      1011
;

data links1;
set have;
rename case_id=from control_id=to;
weight = 1;
run;

proc optnet data_links=links1 GRAPH_DIRECTION=DIRECTED;
linear_assignment out=want1;
run;

proc print data=want1; run;

proc sql;
create table links2 as
select * from links1
where to not in (select to from want1);
quit;

proc optnet data_links=links2 GRAPH_DIRECTION=DIRECTED;
linear_assignment out=want2;
run;

proc print data=want2; run;

proc sql;
create table links3 as
select * from links2
where to not in (select to from want2);
quit;

proc optnet data_links=links3 GRAPH_DIRECTION=DIRECTED;
linear_assignment out=want3;
run;

proc print data=want3; run;

data want;
set want1 want2 want3;
rename from=case_id to=control_id;
drop weight;
run;

proc sort data=want; by case_id control_id; run;

proc print data=want noobs; run;
case_id control_id
1 	10
1 	928
1 	1850
3 	8
3 	125
3 	1276
4 	1011
PG
mfredrickson42
Calcite | Level 5

This looks really promising, but unfortunately looks like my version of SAS doesn't have the OPTNET option. Any substitutions you might recommend?

PGStats
Opal | Level 21

I guess another approach to get this done could be based on the hash object, but it is not obvious. I suggest reposting your question with a title like "Hash object for matching cases and controls?" to get the attention of hash gurus.

PG
PGStats
Opal | Level 21

@mfredrickson42 , I managed to come up with a simple hash-based heuristic to generate some decent matches:

 

 For each control_id

pick among the possible case_id the one with the fewer matches so far

repeat.

 

data have;
input case_id  control_id;
datalines;
  1         10
  1        125
  1     1850
  1       928
  3       125
  3          8
  3     1276
  3      1011
  3         10
  4         10
  4      1011
;

proc sort data=have; by control_id; run;

data want;
if _n_ = 1 then do;
    call missing(case_id);
    declare hash h (multidata:"no");
    h.definekey("case_id");
    h.definedata("count");
    h.definedone();
    count = 0;
    do until(done);
        set have(keep=case_id) end=done;
        if h.find() ne 0 then h.add();
        end;
    end;
minCount = 9999;
do until(last.control_id);
    set have; by control_id;
    h.find(); 
    if count < minCount then do;
        minCount = count;
        bestCase = case_id;
        end;
    end;
case_id = bestCase;
output;
count = minCount + 1;
h.replace();
keep case_id control_id;
run;

proc sort data=want; by case_id control_id; run;

proc print data=want noobs; run;
case_id control_id
1 	10
1 	125
1 	928
1 	1850
3 	8
3 	1276
4 	1011

Note: You might get slightly more balanced results by replacing if count < minCount by if count <= minCount. Do some testing.

PG
mfredrickson42
Calcite | Level 5
This worked! THANK YOU!

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 724 views
  • 1 like
  • 2 in conversation