1:1 or 1:2 matching using Age or Age and Sex for a large data set

sms1891

Hi experts,

I have a database with 1.2 million cases and 4.9 million controls. I would like do a 1:1 or 1:2 matching based on age alone or age and sex for looking at the risk of cancer as an outcome. I tried using the following code but my program is crashing. How do I handle such a big database and create a 1:1 or 1:2 matching without the program running for long period of time and eventually crashing? Also I need to analyze and adjust for about 8-10 other variables comparing the cases and controls for the risk of cancer (outcome variable). My 7+ million data set has the following variables. Age, Sex, Race, DM (yes/ no 1/0), var1 (1/0), var2 (1/0), var3 (1/0), ...var10 (1/0) and cancer (1/0). From this dataset, cases are defined if DM = 1 and controls if DM = 0. When I analyze the risk of cancer by DM, I think the patient profiles are different, which may bias the result (risk of cancer), therefore I would like to do a matching with at least with age or age and sex both.

proc sql;
create table want as
select a.id as id_case, b.id as id_control , b.age, b.sex,b.year
from case a
,
control b
where a.age = b.age and a.sex=b.sex and a.year=b.year
order by id_case, id_control;

create table freq as
select count(id_case) as count, id_case from want
group by id_case having count >= 3;

create table final as
select a.*
from want a,
freq b
where a.id_case=b.id_case;
quit;

Is it valid to create a small randomized sample of cases and controls and then do a smaller sample of 1:2 matching from these randomized smaller samples? Any advice regarding this would be greatly appreciated.

DrAbhijeetSafai

Is it valid to create a small randomized sample of cases and controls and then do a smaller sample of 1:2 matching from these randomized smaller samples?

In fact that is what I would suggest. Create a dummy dataset like the one which you have with you. Write programme for this small dummy datasets of having some thousand records or so. Run the programme on this dummy dataset. Make modifications with the help of statisticians until you and they feel that the programme is accurate. Once the programme is accurate and is doing what it is intended to do, run the programme on 10% of this big dataset by limiting the number of observations by using obs = option. Only then run the programme for the whole big dataset.

Now when you are running the programme on big dataset and also modifying the programme at the same time, lot of time is getting wasted because to make even a small change and to see if the change has taken effect, you have to wait for long time.

Once the programme if perfect and it is run for the large dataset, relax and grab your coffee and do something else. Mostly the work will be done in due course of time. Even then if the programme is crashing then there is issue in the software.

Thank you.

- Dr. Abhijeet Safai

Dr. Abhijeet Safai
Certified Base and Clinical SAS Programmer
Associate Data Analyst
Actu-Real

1:1 or 1:2 matching using Age or Age and Sex for a large data set

Re: 1:1 or 1:2 matching using Age or Age and Sex for a large data set

SAS Innovate 2025: Call for Content

Click image to register for webinar

Classroom Training Available!