07-17-2013 09:21 AM
I have a two datasets - cases and controls. Assume that the dataset 'cases' is a small one with a few hundred patient IDs, while 'controls' is huge dataset with millions of patient IDs. I want to merge cases with controls on say, age variable, such that when a case is merged with 10 controls I want the merging process to stop for that case and move to the next. Is it possible with proc sql or merge (data step)? What would be the modifications in the following code:
create table matches as
select a.caseID, a.age, b.controlID, b.age
from cases as a, controls as b
where a.age = b.age;
merge cases (in=a) controls (in=b);
if a and b;
07-17-2013 11:16 AM
Ok sorry if you received e-mail spam, I fixed something in the post.
input age 2. something $4.;
input age 2. somethingelse $2.;
merge have have2;
if first.age=1 then byobscounter=1;
if byobscounter LE 2 then output;
You won't see a huge gain in efficiency since all the records in the huge table will still be read in the data vector but at least they won't be output and thus won't need to be written on the disk at the end of the merge.
07-17-2013 09:08 PM
Almost like Vincent.
data controls ; input age 2. something $4.; datalines; 23 aded 23 aaaa 23 add 23 agb 23 aef 23 add 23 agb 23 aef 23 add 23 agb 23 aef 23 add 23 agb 23 aef 24 aa 24 tt 24 ll 25 aa 25 rr 25 hh ; run; data cases ; input age 2. somethingelse $2.; datalines; 23 A 24 B run; data matches; merge cases (in=a) controls (in=b); by age; if age ne lag(age) then n=0; n+1; if a and b and n le 10 then output; run;