Solved: Case Control Matched Population - Avoiding Duplicates

anissak1 · Posted 05-01-2020 02:06 PM

Hello - I've created a case control matched population with clinical trial data. Even if you are not familiar with clinical trials, I think the SAS coding should be relatively simple - please help if you can. 🙂 I have been able to undertake all steps except one.

I have are 602 unique cases. I have a dataset that includes possible matching controls to these cases (68K options as many controls match each possible case). I would like to select one unique control for each of the possible cases. My current code, below, returns a match for each case, but it sometimes gives me a control that has already been selected for another case. If a control has already been selected for a case, I want SAS to "move on" to the next control that is an option and not include any duplicated controls in my final dataset. So in sum, I want 602 cases and 602 controls.

Partial dataset attached.

Thank you for any guidance!

Anissa

data random;
set controls_usubjid;
call streaminit(12345);
random= rand('uniform');
run;

proc sort data=random ; by cases_usubjid random;
run;

data controls_usubjid2 not_enough; set random;
by cases_usubjid ;
retain num;
if first.cases_usubjid then num=1; if num le 1 then do;
output controls_usubjid2;
num=num+1;
end;
if last.cases_usubjid then do;
if num le 2 then output not_enough; end;
run;

FreelanceReinh · Posted 05-04-2020 03:20 PM

Thanks for the clarification and the new problem description.

Try this:

data have;
input case control $ @@;
cards;
1 A 1 B 1 C 2 A 2 D 2 R 3 B 3 F 3 G
4 D 4 A 4 M 4 X 4 H
;

data want(drop=rc);
if _n_=1 then do;
  dcl hash h();
  h.definekey('control');
  h.definedone();
end;
do until(last.case);
  set have;
  by case;
  if rc=. & h.check() then do;
    rc=h.add();
    output;
  end;
end;
run;

The IF condition means:

rc=.: For the current BY group (i.e. case) no control has been selected yet.
h.check(): The control of the current observation has not been selected before (and hence is not found in the hash table).

If both parts of the condition are met, the control is added (rc=h.add()) to the hash table storing the selected controls and the observation is written to dataset WANT (output).

View solution in original post

ChrisNZ · Posted 05-01-2020 11:55 PM

Don't you want this rather?

data CONTROLS_USUBJID2 NOT_ENOUGH; 
  set RANDOM;
  by CASES_USUBJID ;
  if first.CASES_USUBJID then NUM=1; 
  if NUM eq 1 then output CONTROLS_USUBJID2;
  NUM+1;
  if last.CASES_USUBJID & NUM le 2 then output NOT_ENOUGH; 
run;

High-Performance SAS Coding - Third Edition

anissak1 · Posted 05-02-2020 10:23 AM

Thank you for the note and code. Indeed you fixed my code for the "not enough" dataset - I hadn't notice that error because I am obsessing over how to ensure I don't have duplicate controls selected. 🙂

The code you provided still gives me duplicate controls for some cases. Any ideas on how I can fix this? Seems such a simple concept! Like an "if then" clause. But I can't figure how to say in SAS "if you have already picked a certain control as a match, then go on to the next available control". Maybe there is another way to think about the problem?

ChrisNZ · Posted 05-02-2020 07:39 PM

I don't see CONTROL in the code, only CASE. And only the first CASE record will be saved. There should be no duplicate.

High-Performance SAS Coding - Third Edition

anissak1 · Posted 05-02-2020 09:26 PM

Apologies for the confusion. The dataset I created has 602 cases and all possible matching controls. Cases and controls are in the same observation/row. I ordered the controls randomly. Then I selected one control for each case by selecting the first case. The code you sent works well.

What I want to know is, how I can tell SAS not to "pick" a particular control for a case if that control was already selected.

When I run the code as is, I get duplicate controls, even though the cases are unique.

Hopefully that makes more sense. Thank you!

Anissa

ChrisNZ · Posted 05-03-2020 01:00 AM

You have to store the controls as they are saved, and check the new ones against that store.

For 600 values, an array is enough, though a hash table can be used too.

Something like:

data CONTROLS_USUBJID2 NOT_ENOUGH; 
  set RANDOM;
  by CASES_USUBJID ;
  array CONTROLS [602] $20 _temporary_;
  if first.CASES_USUBJID then do;   
    NUM=1; 
    FOUND=0;
  end; 
  if ^FOUND & ^whichc(CONTROL, of CONTROLS[*]) then do;
    output CONTROLS_USUBJID2;
    CONTROLS[whichc(' ', of CONTROLS[*]) = CONTROL;
    FOUND+1;
  end;
  NUM+1;
  if last.CASES_USUBJID & NUM le 2 then output NOT_ENOUGH; 
run;

Not too sure how you want to treat the NUM variable.

High-Performance SAS Coding - Third Edition

anissak1 · Posted 05-03-2020 12:10 PM

Hi Chris - Good morning! Thanks again! I was able to run the code - had to fix one missing bracket. But the resulting dataset still has multiple duplicate controls assigned to particular cases. 😞 See screenshot. Any other ideas? Really appreciate you working through this with me. More than you know!!

Anissa

FreelanceReinh · Posted 05-03-2020 03:00 PM

Hi @anissak1,

I saw your initial post, but I haven't started looking deeper into your or @ChrisNZ's code because, contrary to your own assessment, I don't think that your task is "relatively simple" in general. (You may be lucky and your specific dataset is amenable to a "relatively simple" heuristic.) Instead I read up a bit on the topic of finding a maximum (cardinality) matching in a bipartite graph -- which is exactly the mathematical description of the problem you want to solve. (Regrettably, I skipped graph theory in my study back in the 1990s ...)

The good news is: There are algorithms for this task (see the first of the two Wikipedia links above). It depends on the data, though, if a solution exists at all. In terms of SAS, I'm sure that suitable algorithms are available in one or more SAS/OR procedures. For example, PROC OPTGRAPH should be able to solve an even more difficult problem: to find a maximum matching with an additional optimality criterion. This means you could assign "weights" to each admissible case-control pair (to indicate that one pair "fits better" than another) and the procedure would strive to find a "best" maximum matching in this sense. See the paragraph about the case "|S|<|T|" in Linear Assignment (Matching).

However, I don't have a SAS/OR license (nor any experience with this SAS module). Do you? If so, the expert(s) in the Mathematical Optimization, Discrete-Event Simulation, and OR subforum can most likely help you with code.

Otherwise, it might be an idea to prevent the situation of having multiple case-control pairs to choose from. How did you create your current dataset? There's a SAS/STAT procedure called PSMATCH which creates an equivalent type of dataset from input data containing cases and controls in separate observations with variables containing the characteristics such as your variables agegrp, sex and baseline. I'm not familiar with this procedure, but I saw in the documentation that the relevant MATCH statement has options to specify how many controls (e.g., 1) are to be assigned to each case (depending on the METHOD used). So, perhaps you can apply PROC PSMATCH to your original "unmatched" data and thus obtain a unique ("optimal") control for each case to begin with.

anissak1 · Posted 05-03-2020 08:56 PM

Hi Reinhard - Two additional comments are that 1) I don't need an "ideal" match from a mathematical perspective. I'm choosing exact matches (gender, age, disease type) and already have a dataset wherein these matches are contained. So no "judgment" needs to be applied in selection. 2) I understand that my random ordering of controls should be sufficient in terms of selection criteria.

If this makes any SAS code aha's come to light for you or anyone else in the community, I'd be delighted. But I can also stick to my more limited dataset. This is a secondary analysis for a paper and I can tell from re-running the possible matches after eliminating duplicates that probably not many more possibilities remain.

Thanks again.

Anissa

anissak1 · Posted 05-04-2020 01:28 PM

Hello Reinhard - I posted this message to the wrong board....maybe the below simplification will help you or others with ideas...?!

The selection and ordering is already done, so I just would like to tell SAS to do the following...?! There are 2 simple columns of data. I want a new dataset containing first observation for each value in column 1 selected based on the value in column B not having been selected for a previous observation.

1 A (Selected)

1 B

1 C

2 A (Not selected because A already "used")

2 D (Selected because D has not been "used")

2 R

3 B (Selected because B has not been "used")

3 F

3 G

4 D (Not selected because D has been "used")

4 A (Not selected because A has been "used")

4 M (Selected because M has not been "used)

4 X

4 H

FreelanceReinh · Posted 05-04-2020 03:20 PM

Thanks for the clarification and the new problem description.

Try this:

data have;
input case control $ @@;
cards;
1 A 1 B 1 C 2 A 2 D 2 R 3 B 3 F 3 G
4 D 4 A 4 M 4 X 4 H
;

data want(drop=rc);
if _n_=1 then do;
  dcl hash h();
  h.definekey('control');
  h.definedone();
end;
do until(last.case);
  set have;
  by case;
  if rc=. & h.check() then do;
    rc=h.add();
    output;
  end;
end;
run;

The IF condition means:

rc=.: For the current BY group (i.e. case) no control has been selected yet.
h.check(): The control of the current observation has not been selected before (and hence is not found in the hash table).

If both parts of the condition are met, the control is added (rc=h.add()) to the hash table storing the selected controls and the observation is written to dataset WANT (output).

anissak1 · Posted 05-05-2020 12:54 AM

Hello Reinhard! Thank you. I will try this tomorrow and keep you posted.

anissak1 · Posted 05-05-2020 03:10 PM

Dear Reinhard - A million thanks. I applied your code to my dataset and it worked perfectly.

Absolutely perfectly. I was able to find a total of 579 matching controls for the 602 cases. And no duplicates!

Wow, that was so much better than my manual approach. 🙂

Much appreciation!

Anissa

ChrisNZ · Posted 05-03-2020 07:06 PM

It works for me.

data CONTROLS_USUBJID;
  do CASES_USUBJID=1 to 602;
    do CONTROL=1 to 1000;
      if ranuni(1) > .95 then output;
    end;
  end;
run;

data RANDOM;
  set CONTROLS_USUBJID;
  call streaminit(12345);
  RANDOM= rand('uniform');
run;
 
proc sort data=RANDOM ; 
  by CASES_USUBJID RANDOM;
run;

data CONTROLS_USUBJID2 
     NOT_ENOUGH; 
  set RANDOM;
  by CASES_USUBJID ;
  array CONTROLS [602] _temporary_ (602*-1) ; 
  if first.CASES_USUBJID then do;   
    NUM=1; 
    FOUND=0;
  end; 
  if ^FOUND & ^whichn(CONTROL, of CONTROLS[*]) then do;
    output CONTROLS_USUBJID2;
    CONTROLS[whichn(-1, of CONTROLS[*])] = CONTROL;
    FOUND+1;
  end;
  NUM+1;
  if last.CASES_USUBJID & NUM le 2 then output NOT_ENOUGH; 
run;

proc sql; 
  select CONTROL, count(*) from CONTROLS_USUBJID2 group by CONTROL having count(*)>1;

NOTE: No rows were selected.

High-Performance SAS Coding - Third Edition

Case Control Matched Population - Avoiding Duplicates

Re: Case Control Matched Population - Avoiding Duplicates

Re: Case Control Matched Population - Avoiding Duplicates

Re: Case Control Matched Population - Avoiding Duplicates

Re: Case Control Matched Population - Avoiding Duplicates

Re: Case Control Matched Population - Avoiding Duplicates

Re: Case Control Matched Population - Avoiding Duplicates

Re: Case Control Matched Population - Avoiding Duplicates

Re: Case Control Matched Population - Avoiding Duplicates

Re: Case Control Matched Population - Avoiding Duplicates

Re: Case Control Matched Population - Avoiding Duplicates

Re: Case Control Matched Population - Avoiding Duplicates

Re: Case Control Matched Population - Avoiding Duplicates

Re: Case Control Matched Population - Avoiding Duplicates

Re: Case Control Matched Population - Avoiding Duplicates

Classroom Training Available!