cantika
Calcite | Level 5

Dear all,

I need help with programming. I am a new SAS user.

I am matching 2 big databases for a case-control study, with 4 controls for each case, and each control can only be used once.

My case table consists of 65,000 observations with 3 variables: ID, sex and age. My control table consists of 7,000,000 observations and 4 variables: ID, sex, age_hi and age_lo.

  I use the procedure below:

 

proc sql;
  create table caco as
  select a.id as case_id, b.id as control_id,
         a.sex as case_sex, b.sex as control_sex,
         a.age as case_age, b.age as control_lopnr
  from case_table a, control_table b
  where a.sex = b.sex and a.age between b.age_lo and b.age_hi
  order by case_id, control_id;
quit;

 

This part took almost 13 hours, and I have to run it several times since one run only gives me fewer than 400 matched cases. I think it is the randomization process that takes so long, since the only matching variables are sex and age. I used the same
program on the same tables before, but with many more matching criteria, and it took at most a little under 2 hours.

 

I guess I need to use a hash table, but to be honest I don't know how to do it.

 

Can somebody help me? Any help will be appreciated.

11 REPLIES
Doc_Duke
Rhodochrosite | Level 12

The BETWEEN operator requires multiple passes of the "b" data.  You might be able to reduce this time by indexing on the two age fields in b.
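For example, something like this (a sketch, assuming the control table is literally named control_table; whether the SQL optimizer actually uses the index for this join is something you would have to test):

  proc sql;
    /* composite index on the join columns; separate simple indexes on
       age_lo and age_hi are another option worth comparing */
    create index sex_age on control_table (sex, age_lo, age_hi);
  quit;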

There also appears to be some sort of typo in your code

"b. age as control_lopnr"


cantika
Calcite | Level 5

Hi Doc_Duke,

Thanks. Yes, I made a typo when I wrote this message, but not in my program.

I'll try your suggestion. Hope it'll work.

Astounding
PROC Star

cantika,

I don't have full answers, but I do see some issues.

How is it possible that you only got 400 control records for 65,000 case records?  Something doesn't add up here, and making several runs through the data will not change that.  Have you tried running this program using only 1 record from the case data set?

You should be able to get rid of one of the sex variables from the final table, since they have to match.

Your logic will allow the same control record to be selected twice, matching more than one case record.

The biggest question I have concerns age_hi and age_lo.  Do they take on a set range of values, or do they vary with the age on the control record?  If there is one set of buckets that applies to all records, it should be easy to speed this up by creating age_bucket in both data sets.
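If that is the situation, a rough sketch could look like this (the 5-year width and the data set names are only placeholders, and it assumes the lo/hi ranges sit inside one fixed set of non-overlapping bands):

  data case_b;
    set case_table;
    age_bucket = floor(age / 5);      /* same bucketing rule in both tables */
  run;

  data control_b;
    set control_table;
    age_bucket = floor(age_lo / 5);   /* lo and hi assumed to fall in the same band */
  run;

  proc sql;
    create table caco as
    select a.id as case_id, b.id as control_id
    from case_b a, control_b b
    where a.sex = b.sex and a.age_bucket = b.age_bucket;
  quit;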

Good luck.

cantika
Calcite | Level 5

Hi Astounding,

I actually used the same program with many more matching criteria on the same databases, and it worked perfectly fine. It took at most a little under 2 hours, and I found controls for almost all cases: for 70,000 cases I found 69,535 cases with 4 controls each and no duplicate controls, so each control is only used once, and the running time was not long. That is why I think the problem now is that there are only 2 matching criteria, so the probability of a case finding controls is much bigger; in 1 run only around 400 cases get controls, and 1 case can get over 1,000 candidate controls.

The age_hi and age_lo are variables I created in order to do the matching. The condition is that the control's age is within +/- 1 year of the case's age, so if the case's age is 60, the control's age would be in the range 59 to 61.
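For reference, I created them with something like this (simplified; the control table also has an age variable):

  data control_table;
    set control_table;
    age_lo = age - 1;   /* a control of age X matches cases aged X-1 to X+1 */
    age_hi = age + 1;
  run;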

Astounding
PROC Star

cantika,

I'm not sure this would be faster, but you could try it.  It sounds like the control data contains age, and you used it to create age_hi and age_lo.  So this would be another way to check age:

abs(a.age - b.age) <= 1
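Dropped into your original join, that would look something like this (a sketch; the select list is trimmed):

  proc sql;
    create table caco as
    select a.id as case_id, b.id as control_id,
           a.sex as case_sex, a.age as case_age, b.age as control_age
    from case_table a, control_table b
    where a.sex = b.sex and abs(a.age - b.age) <= 1
    order by case_id, control_id;
  quit;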

It still defies the imagination that you are getting no duplicates.  Are you saying that you checked the selected CONTROL_ID values, and each one was only used once?  Sorry, I can't picture that.  There must have been some CONTROL_ID values assigned to multiple CASE_IDs (or else my SQL is too weak to figure this one out!).
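You could check it with something like this against the result table caco from your original query:

  proc sql;
    select control_id, count(distinct case_id) as n_cases
    from caco
    group by control_id
    having calculated n_cases > 1;
  quit;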

Good luck.

TomKari
Onyx | Level 15

Hi, cantika. Your post definitely shows what the problem is. It's the combinatorics.

You have 7,000,000 control records:

- for the sake of simplifying the math, let's assume they split evenly across ages, so you'll have 70,000 records for each year of age (1 to 100);

- two sexes divide that by two, so for every year of age you'll have 35,000 females and 35,000 males;

- and you want the controls whose age equals the case's age, the case's age plus one, or the case's age minus one.

It looks to me like you should be expecting 35,000 times 3, or 105,000, candidate records per case; with your 65,000 cases, that comes to roughly 6.8 billion result records in total.
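The back-of-the-envelope arithmetic, with the even-split assumption, in a trivial data step:

  data _null_;
    per_age_sex = 7000000 / 100 / 2;   /* ~35,000 controls per age and sex */
    per_case    = per_age_sex * 3;     /* ages a-1, a, a+1 -> ~105,000     */
    total_rows  = per_case * 65000;    /* ~6.8 billion candidate pairs     */
    put per_age_sex= comma12. per_case= comma12. total_rows= comma16.;
  run;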

In your other requests, you had more variables, so you had fewer candidate controls per case.

I think that this may explain your terribly slow performance. Now, the question is what do you want to do to reduce the amount of processing?

Tom

cantika
Calcite | Level 5

Hi Tom,

That is exactly my question: what do I have to do to reduce the processing time?

LinusH
Tourmaline | Level 20

I agree with TomKari; you should scale down your control table to at most a tenth of its size to make it easier to test performance improvements.

What kind of system are you running this on?

If you've got lots of memory, raising MEMSIZE and SORTSIZE will reduce swapping to disk by SQL.
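For example (the values are only placeholders; MEMSIZE normally has to be set when SAS is started, while SORTSIZE can be raised in an open session):

  /* at invocation (command line or config file):
       sas -memsize 8G -sortsize 4G                 */

  /* in an open session */
  options sortsize=4G;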

I presume that expanding the join criteria is out of the question?

During testing, try to set some options to monitor the behavior of your query:

options fullstimer msglevel=i;

proc sql _method _tree;
  /* your existing query goes here */
quit;

Good luck!

Data never sleeps
TomKari
Onyx | Level 15

What I was trying to say is that if you insist on matching every case record with around 105,000 control records, you will continue to generate around 6 billion result records, and your process is taking a reasonable amount of time for that magnitude of work.

But I suspect that your real business requirement is not to do that amount of processing. Leaving performance aside, what do you actually need to match your case records to? Once we have that information, I'm sure someone will figure out a way to speed it up.

Tom

jonam
Calcite | Level 5

Hi Cantika,

Having gone through all the discussion, here are my thoughts on how to reduce the processing time.

1. Since your matching key is sex (M, F; not sure if you have other categories as well), you can split both your tables in two by sex (both case and control). You don't need to create the tables physically; creating views would suffice (see the sketch after this list).

2. The next step is to understand age. I'm sure you have an idea of the cardinality of age in the case table and of (age_lo, age_hi) in the control table.

3. After steps 1 and 2, you can split each table into chunks with an equal number of records, for example the 65,000 case observations into 10 x 6,500 and the control observations into 10 x 0.7M.

4. After doing all this you can process the chunks in parallel. The following SUGI paper might help you do that:

http://www2.sas.com/proceedings/sugi29/124-29.pdf

I don't think SAS has a bitmap index, like Oracle does, to help performance on low-cardinality keys.
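A minimal sketch of step 1 (assuming sex is coded 'M'/'F'; the view names are only illustrative):

  proc sql;
    create view case_m    as select * from case_table    where sex = 'M';
    create view case_f    as select * from case_table    where sex = 'F';
    create view control_m as select * from control_table where sex = 'M';
    create view control_f as select * from control_table where sex = 'F';
  quit;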

Hope this helps.

cantika
Calcite | Level 5

Hi Jonam,

Thanks for the help.
