Re: Match case control large data base

cantika · Posted 09-26-2012 05:48 AM

Dear all,
I need help with programming. I am a new sas user.

I am matching 2 big data base for a case control study, with 4 controls for each cases and the control is only can be use once.

My case table consist of 65,000 observations with 3 variables ID, sex and age. My control table consist of 7,000,000 observation and 4 variables ID, sex,age_hi and age_lo.

I use the procedure below:

Proc sql;

Create table caco as

Select a. ID as case_ID, b. ID as control_ID, a. sex as case_sex, b. sex as control_sex,

a.age as case_age, b. age as control_lopnr

from case_table a, control_table b

where (a.sex=b.sex and (a.age between b.age_lo and b.age_hi))

order by case_ID, control_ID;

quit;

This part took almost 13 hours, and I have to do this part several times since 1 run only give me less than 400 controls. I think the randomization process that took
long time, since the match variable is only sex and age. I used the same program for the same tables before, but with many more match criteria and it took me only max
less than 2 hours.

I guess I need to use hash table, but to be honest I don’t know how to do it.

Can somebody help me? Any help will be appreciated.

/cantika

LinusH · Posted 09-26-2012 11:01 AM

13 seems like an awful lot of time, even given the "weak" join criteria.

Where does your data reside? In separate databases?

Optimizing the SQL can be hard (given the join criteria).

Indexing control_table might help, especially if you have the option to store the table in SPDE.

Hashing can occur in SQL in certain circumstances, setting the PROC SQL option BUFFERSIZE higher than 64K might help.

I think even rewriting the query to do a sub-query (on case_table) will be faster than 13 hrs.

I think writing a data step hash step will probably be very efficient, and I'm there are lot of people out there that can give you hints about that.

Data never sleeps

cantika · Posted 09-26-2012 01:52 PM

Hi LinusH,

My data is 2 separate databases. I case db and 1 control DB, and I already reduce the variables with only matching variables.

Thank you for your suggestion.

TomKari · Posted 09-26-2012 01:25 PM

What kind of data management system are your two source tables stored in? If it's a DBMS, are they both in the same database?

I suggest that you cut your test down to only a hundred or so randomly-selected case_table records, as a testbed while you try to improve the speed. With 100 records:

How long does the SQL Select take?

How many result records are created (how many control_table records match each case?)

Also, I notice you've posted the same question under "SAS Procedures". This is going to cause confusion.

You should also follow "Choosing the Earliest and Latest Dates" under SAS Procedures. It's an identical problem, I'm sure the same solution will solve both.

Tom

Message was edited by: Tom Kari

cantika · Posted 09-26-2012 01:46 PM

Hi TomKari,

I actually used the same program for many more matching criterias for the same databases and it worked perfectly fine.And it only took max less than 2 hours. I found almost all controls. For 70,00 cases I found 69,535 case with 4 control each. And the running time is not so long. That is why I think because of the matching criterias is only 2, so that the probability for cases to get the controls is bigger, so in 1 run only around around 400 cases which get controls. In this case, 1 case can get over 1000 controls.

Both case and control data tables are ini SAS format.

TomKari · Posted 09-26-2012 04:56 PM

Cross posted from SAS Procedures:

The performance problem is caused by the combinatorials.

You have 7,000,000 control records:

- for the sake of simplifying the math, let's assume they split evenly among age, so you'll have 70,000 records for each year of age (1 to 100);
- two sexes divides it by two, so for every year of age you'll have 35,000 females, and 35,000 males;
- and you want the controls that have the age of case, the age of case plus one, and the age of case minus one;

It looks to me like you should be expecting 35,000 times 3, or 105,000 records, per case; in your example of 65,000 variables, you should be expecting 6.8 billion result records total.

In your other requests, you had more variables, so you had fewer candidate controls per case.

I think that this may explain your terribly slow performance. Now, the question is what do you want to do to reduce the amount of processing?

Tom

P.S. I suggest you close one of the threads, and have everybody move to the other.

Match case control large data base