BookmarkSubscribeRSS Feed
sms1891
Quartz | Level 8

Hi experts,

I have a database with 1.2 million cases and 4.9 million controls. I would like do a 1:1 or 1:2 matching based on age alone or age and sex for looking at the risk of cancer as an outcome. I tried using the following code but my program is crashing. How do I handle such a big database and create a 1:1 or 1:2 matching without the program running for long period of time and eventually crashing?  Also I need to analyze and adjust for about 8-10 other variables comparing the cases and controls for the risk of cancer (outcome variable). My 7+ million data set has the following variables. Age, Sex, Race, DM (yes/ no 1/0), var1 (1/0), var2 (1/0), var3 (1/0), ...var10 (1/0) and cancer (1/0). From this dataset, cases are defined if  DM = 1 and controls if DM = 0.  When I analyze the risk of cancer by DM, I think the patient profiles are different, which may bias the result (risk of cancer), therefore I would like to do a matching with at least with age or age and sex both.

proc sql;
create table want as
select a.id as id_case, b.id as id_control , b.age, b.sex,b.year
from case a
,
control b
where a.age = b.age and a.sex=b.sex and a.year=b.year
order by id_case, id_control;

create table freq as
select count(id_case) as count, id_case from want
group by id_case having count >= 3;

create table final as
select a.*
from want a,
freq b
where a.id_case=b.id_case;
quit;

Is it valid to create a small randomized sample of cases and controls and then do a smaller sample of 1:2 matching from these randomized smaller samples? Any advice regarding this would be greatly appreciated.  

1 REPLY 1
DrAbhijeetSafai
Lapis Lazuli | Level 10

Is it valid to create a small randomized sample of cases and controls and then do a smaller sample of 1:2 matching from these randomized smaller samples? 

In fact that is what I would suggest. Create a dummy dataset like the one which you have with you. Write programme for this small dummy datasets of having some thousand records or so. Run the programme on this dummy dataset. Make modifications with the help of statisticians until you and they feel that the programme is accurate. Once the programme is accurate and is doing what it is intended to do, run the programme on 10% of this big dataset by limiting the number of observations by using obs = option. Only then run the programme for the whole big dataset. 

 

Now when you are running the programme on big dataset and also modifying the programme at the same time, lot of time is getting wasted because to make even a small change and to see if the change has taken effect, you have to wait for long time. 

 

Once the programme if perfect and it is run for the large dataset, relax and grab your coffee and do something else. Mostly the work will be done in due course of time. Even then if the programme is crashing then there is issue in the software.

 

Thank you.

 

- Dr. Abhijeet Safai 

Dr. Abhijeet Safai
Certified Base and Clinical SAS Programmer
Associate Data Analyst
Actu-Real

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 1 reply
  • 368 views
  • 3 likes
  • 2 in conversation