Good afternoon, I have a large dataset of 6 million observations (containing both cases and controls), made in this way:
- uniqueid : the unique code of each patient;
- gender
- age
- casecontrol (1 for cases, 0 for controls).
The cases are about 400,000 and i wanted to create a 1:3 matching with the controls, based on gender (same gender) and age (range of plus / minus 3 years).
At the end i want to obtain a dataset like this:
- id_control
- gender_control
- age_control
- id_case
- gender_case
- age_case
I found a macro online, but no way to obtain the id_case... if someone can provide me a solution, it will be extremely helpful.
Thank you.
1:3 matching with the controls, based on gender (same gender) and age (range of plus / minus 3 years).
So for each case record, there is a range of 7 ages that can be acceptable matches. And vice-versa, each of those control matches can be matched to some case record having 7 possible ages (forgetting for the moment the matches near the upper and lower ages).
The problem here is, (assuming you are doing matching without replacement) that there can be distributions of ages that would only satisfy random case-control matching some (perhaps most) of the time, but not all of the time. That is, some random draws of "matching" control ages for a given case age, could leave an insufficient sample of "matching" control ages for some other case age. Yet some other random draws from the same data might satisfy your objective of 3 controls per case.
For instance, you can have
You might randomly assign control records with ages 19, 20 and 23 for case age 21. That would leave only two controls records (ages 24 and 25) as matchable against case age 22. Yet there are a number of other random draws from this data that would provide 1:3 matches.
Now perhaps your data is not so pathologically distributed as to make random draws of matches more than infinitesimally likely to generate incomplete case-control matches
But it's possible. If it happens, I guess you could rerun your random assignment with a different random-number seed.
Hi There,
I recently came across the HASH object which made my like much easier. Following is the program using HASH object to match 1:10 cases with controls. We can not use PSMatch, it is completely different than the match we do directly by comparing variables.
data controls;
set controls;
matchkey = catx('_', gender_control , age_control);
proc sort data = controls out=controls;
by matchkey;
run;
data cases;
set cases;
matchkey = catx('_', gender_case , age_case);
proc sort data = cases out=cases;
by matchkey;
run;
data matched_ctrls(keep=id_case gender_case age_case id_control gender_control age_control );
Length id_control $20 age_control 8 gender_control $2 matchkey $20;
if _N_ = 1 then do;
declare hash controls(dataset:"controls", multidata: "Y");
controls.defineKey("matchkey");
controls.defineData("id_control","age_control","gender_control", "matchkey");
controls.defineDone();
declare hash used_controls();
used_controls.defineKey("id_control");
used_controls.defineDone();
end;
do until (last.matchkey);
set cases;
by matchkey;
match_count = 0;
rc = controls.find(key:matchkey);
do while(rc = 0 and match_count < 3 );
if used_controls.check(key:id_control) ne 0 then do;
output;
match_count+1;
used_controls.add();
end;
rc = controls.find_next();
end;
end;
run;
I see two issues with your solution:
Now if your control dataset is in truly random order to begin with (i.e. before the proc sort), you would effectively have a random solution. But with real datasets, it's pretty heroic to assume they are in random order.
You could resolve this problem by randomizing the order of data within each match key as follows:
data controls;
set controls;
matchkey = catx('_', gender_control , age_control);
call streaminit(0598666);
rnum=rand('uniform');
run;
proc sort data = controls out=controls;
by matchkey rnum;
run;
followed by your data step with hash code.
This would effectively mean that every observation with a given matchkey value would have the same probability of being selected.
BTW, for purposes of matching cases and controls, there is no need to sort the CASES file. It's just wasted resources.
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.