Matching of cases and controls 1:3

Simone98 · Posted 10-11-2022 09:43 AM

Good afternoon, I have a large dataset of 6 million observations (containing both cases and controls), made in this way:

- uniqueid : the unique code of each patient;

- gender

- age

- casecontrol (1 for cases, 0 for controls).

The cases are about 400,000 and i wanted to create a 1:3 matching with the controls, based on gender (same gender) and age (range of plus / minus 3 years).

At the end i want to obtain a dataset like this:

- id_control

- gender_control

- age_control

- id_case

- gender_case

- age_case

I found a macro online, but no way to obtain the id_case... if someone can provide me a solution, it will be extremely helpful.

Thank you.

HB · Posted 10-11-2022 05:07 PM

Not sure what you are really trying to accomplish.

Do you have 400,000 cases and 5,600,000 controls?

Are you wanting to create a dataset of about 1.6 million observations in which 25 percent of the records (400,000) are cases and 75 percent of the records (1.2 million) are controls selected to match the cases?

Simone98 · Posted 10-12-2022 01:34 AM

Yes, i have a dataset with 5,600,000 controls and 400,000 cases, and I want to create a 1:3 matching in order to obtain a dataset (with 1,2 million observations) , where each row is made like this:
- id_control
- age_control
- sex_control
- id_case (the id of the case to which the control is matched)
- age_case
- sex_age

Thank you for your help!

Reeza · Posted 10-11-2022 05:17 PM

Have you looked at PROC PSMATCH?

mkeintz · Posted 10-12-2022 02:10 PM

1:3 matching with the controls, based on gender (same gender) and age (range of plus / minus 3 years).

So for each case record, there is a range of 7 ages that can be acceptable matches. And vice-versa, each of those control matches can be matched to some case record having 7 possible ages (forgetting for the moment the matches near the upper and lower ages).

The problem here is, (assuming you are doing matching without replacement) that there can be distributions of ages that would only satisfy random case-control matching some (perhaps most) of the time, but not all of the time. That is, some random draws of "matching" control ages for a given case age, could leave an insufficient sample of "matching" control ages for some other case age. Yet some other random draws from the same data might satisfy your objective of 3 controls per case.

For instance, you can have

Two case records with ages 21 (n=1) and 22 (n=1)
Eight control records with ages 18 (n=1), 19 (n=1), 20 (n=1), 23 (n=1), 24 (n=1), and 25 (n=1). I.e. no 21's or 22's.

You might randomly assign control records with ages 19, 20 and 23 for case age 21. That would leave only two controls records (ages 24 and 25) as matchable against case age 22. Yet there are a number of other random draws from this data that would provide 1:3 matches.

Now perhaps your data is not so pathologically distributed as to make random draws of matches more than infinitesimally likely to generate incomplete case-control matches

But it's possible. If it happens, I guess you could rerun your random assignment with a different random-number seed.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Matching of cases and controls 1:3

Re: Matching of cases and controls 1:3

Re: Matching of cases and controls 1:3

Re: Matching of cases and controls 1:3

Re: Matching of cases and controls 1:3

Matching of cases and controls 1:3

Re: Matching of cases and controls 1:3

Re: Matching of cases and controls 1:3

Re: Matching of cases and controls 1:3

Re: Matching of cases and controls 1:3

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away