BookmarkSubscribeRSS Feed
Simone98
Calcite | Level 5

Good afternoon, I have a large dataset of 6 million observations (containing both cases and controls), made in this way:

- uniqueid : the unique code of each patient;

- gender

- age

- casecontrol (1 for cases, 0 for controls).

The cases are about 400,000 and i wanted to create a 1:3 matching with the controls, based on gender (same gender) and age (range of plus / minus 3 years).

At the end i want to obtain a dataset like this:

- id_control

- gender_control

- age_control

- id_case

- gender_case

- age_case

 

I found a macro online, but no way to obtain the id_case... if someone can provide me a solution, it will be extremely helpful.

Thank you.

 

6 REPLIES 6
HB
Barite | Level 11 HB
Barite | Level 11
Not sure what you are really trying to accomplish.

Do you have 400,000 cases and 5,600,000 controls?

Are you wanting to create a dataset of about 1.6 million observations in which 25 percent of the records (400,000) are cases and 75 percent of the records (1.2 million) are controls selected to match the cases?

Simone98
Calcite | Level 5
Yes, i have a dataset with 5,600,000 controls and 400,000 cases, and I want to create a 1:3 matching in order to obtain a dataset (with 1,2 million observations) , where each row is made like this:
- id_control
- age_control
- sex_control
- id_case (the id of the case to which the control is matched)
- age_case
- sex_age

Thank you for your help!
Reeza
Super User
Have you looked at PROC PSMATCH?
mkeintz
PROC Star

1:3 matching with the controls, based on gender (same gender) and age (range of plus / minus 3 years).

So for each case record, there is a range of 7 ages that can be acceptable matches.  And vice-versa, each of those control matches can be matched to some case record having 7 possible ages (forgetting for the moment the matches near the upper and lower ages).

 

The problem here is, (assuming you are doing matching without replacement) that there can be distributions of ages that would only satisfy random case-control matching some (perhaps most) of the time, but not all of the time.  That is, some random draws of "matching" control ages for a given case age, could leave an insufficient sample of "matching" control ages for some other case age.   Yet some other random draws from the same data might satisfy your objective of 3 controls per case.

 

For instance, you can have

  1. Two case records with ages 21 (n=1) and 22 (n=1)
  2. Eight control records with ages   18 (n=1), 19 (n=1), 20 (n=1), 23 (n=1), 24 (n=1), and 25 (n=1).  I.e. no 21's or 22's.

You might randomly assign control records with ages 19, 20 and 23 for case age 21.  That would leave only two controls records (ages 24 and 25) as matchable against case age 22.  Yet there are a number of other random draws from this data that would provide 1:3 matches.

 

Now perhaps your data is not so pathologically distributed as to make random draws of matches more than infinitesimally likely to generate incomplete case-control matches

 

But it's possible.  If it happens, I guess you could rerun your random assignment with a different random-number seed.

 

 

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
PRoul
Calcite | Level 5

Hi There,

 

I recently came across the HASH object which made my like much easier. Following is the program using HASH object to match 1:10 cases with controls. We can not use PSMatch, it is completely different than the match we do directly by comparing variables.

 

data controls;

set controls;

matchkey = catx('_', gender_control , age_control);

proc sort data = controls out=controls;

by matchkey;

run;

 

data cases;

set cases;

matchkey = catx('_', gender_case , age_case);

proc sort data = cases out=cases;

by matchkey;

run;

 

data matched_ctrls(keep=id_case gender_case age_case id_control gender_control age_control  );

 

Length id_control  $20 age_control 8 gender_control  $2  matchkey $20;

 

if _N_ = 1 then do;

                declare hash controls(dataset:"controls", multidata: "Y");

                controls.defineKey("matchkey");

                controls.defineData("id_control","age_control","gender_control", "matchkey");

                controls.defineDone();

 

                declare hash used_controls();

                used_controls.defineKey("id_control");

                used_controls.defineDone();

end;

 

do until (last.matchkey);

                set cases;

                by matchkey;

 

                match_count = 0;

                rc = controls.find(key:matchkey);

 

do while(rc = 0 and match_count < 3 );

 

                if used_controls.check(key:id_control) ne 0 then do;

                                output;

                                match_count+1;

                                used_controls.add();

                end;

 

                rc = controls.find_next();

                end;

 

                end;

                run;

 

mkeintz
PROC Star

@PRoul 

I see two issues with your solution:

  1. You are not solving the problem as stated by the OP.  You are doing EXACT matches of AGE between case and control.  But the OP was requesting matches having case_age within a range of case_control.  Assuming the OP would be equally satisfied with a "close" match vs an "exact" match, this program may not fully satisfy the initial request.  Of course, the OP might prefer to exclude close matches.

  2. More importantly, your solution is unlikely to generate anything like a random sample of matches (unlike PSMATCH).  In particular, records near the end of the control dataset will not have the same probability of being selected as records near the start, even though they have the same match values.  That is because.
    1. PROC SORT, by default, does not change the order of records having identical sort keys.
    2. hash objects using the multidata:'Y' option preserves the order of data items having the same key to match the order retrieved from the source dataset.  This means your technique of FIND method followed by a series of FINDNEXT methods will always result in the data items near the start of the of the dataset being the first ones considered.  This is true even though you are sampling without replacement.

Now if your control dataset is in truly random order to begin with (i.e. before the proc sort), you would effectively have a random solution.  But with real datasets, it's pretty heroic to assume they are in random order.

 

You could resolve this problem by randomizing the order of data within each match key as follows:

data controls;
  set controls;
  matchkey = catx('_', gender_control , age_control);
  call streaminit(0598666);
  rnum=rand('uniform');
run;
proc sort data = controls out=controls;
  by matchkey rnum;
run;

followed by your data step with hash code.

 

This would effectively mean that every observation with a given matchkey value would have the same probability of being selected.

 

BTW, for purposes of matching cases and controls, there is no need to sort the CASES file.  It's just wasted resources.

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 2345 views
  • 4 likes
  • 5 in conversation