I have a dataset with potential case-control pairs. I would like to select for every case four controls, without replacement.
Have:
case | control |
1 | 1 |
1 | 2 |
1 | 3 |
1 | 4 |
1 | 5 |
1 | 6 |
2 | 2 |
2 | 4 |
2 | 8 |
2 | 10 |
2 | 12 |
2 | 14 |
3 | 1 |
3 | 4 |
3 | 9 |
3 | 11 |
3 | 13 |
3 | 14 |
3 | 16 |
3 | 20 |
3 | 22 |
Want:
case | control |
1 | 1 |
1 | 2 |
1 | 3 |
1 | 4 |
2 | 8 |
2 | 10 |
2 | 12 |
2 | 14 |
3 | 9 |
3 | 11 |
3 | 13 |
3 | 16 |
Thanks!
Great to hear - please mark it as a solution, then 😊
Yes, Smantha, I did:
proc surveyselect data=cc out=RandOrderCC noprint
method=SRS /* sample w/o replacement seed=123 */
samprate=1 /* proportion of observations in sample */
OUTRANDOM;
run;
That didn't give four controls for every case. I should have N cases * 4 controls. Depending on the seed, I get a different number of controls.
(Also: if I have for case 1 controls 1 2 3 4 and for case 2 I also have control 2, I would like to avoid that the selection procedure uses control 2 for case 2, since I only have 4 controls for case 1. I would like to have as many controls as possible.)
data given;
input case control;
datalines;
1 1
1 2
1 3
1 4
1 5
1 6
2 2
2 4
2 8
2 10
2 12
2 14
3 1
3 4
3 9
3 11
3 13
3 14
3 16
3 20
3 22
;
run;
proc sort nodupkey;
by control;
run;
proc sort;
by case control;
run;
%macro function(dsn,value);
data &dsn;
set given;
if case = &value then output &dsn;
run;
data &dsn;
set &dsn;
if _n_ < 5 then output;
run;
%mend function;
%function(one,1);
%function(two,2);
%function(three,3);
data final;
set one two three;
run;
Dear Venkatamahesh,
your code generates indeed the want table I gave in my original request. However, by sorting on control, you also delete records which are still valid. For example if you use:
datalines;
1 1
1 2
1 3
1 4
1 5
1 6
2 2
2 4
2 8
2 10
2 12
2 14
3 1
3 4
3 9
3 11
3 13
3 14
3 16
3 20
3 22
4 1
4 2
4 3
4 4
4 5
4 6
The fourth case is completely deleted, but case 4 should be matched to control 5 and 6.
data cc;
input case control;
datalines;
1 1
1 2
1 3
1 4
1 5
1 6
2 2
2 4
2 8
2 10
2 12
2 14
3 1
3 4
3 9
3 11
3 13
3 14
3 16
3 20
3 22
4 1
4 2
4 3
4 4
4 5
4 6
;
run;
proc sort; by case control;
run;
data _null_;
set cc nobs=n;
call symputx('nrow',n);
stop;
run;
options symbolgen;
data want;
set cc ;
array test{&nrow.};
retain test1-test&nrow. count;
by case;
if first.case then count=1;
if count <5 then do;
put _all_;
if control not in test then do;
count=count+1;
output;
test[_n_]= control;
end;
end;
drop test1-test&nrow. count;
run;
Hi Smanta, the codes works for a small set of data, but I have 400.000 potential case control matches. The code ran for three days and the computer blocked. The log file was about 177GB. I will need something else I'm afraid.
You can do it in a single datastep, quite simple:
data have;
input case control;
datalines;
1 1
1 2
1 3
1 4
1 5
1 6
2 2
2 4
2 8
2 10
2 12
2 14
3 1
3 4
3 9
3 11
3 13
3 14
3 16
3 20
3 22
;
run;
data want;
do _N_=1 by 1 until(last.case);
set have;
by case;
end;
out=4;
do _N_=_N_ to 1 by -1;
set have;
if rand('uniform')<out/_N_ then do;
output;
out=out-1;
end;
end;
drop out; /* Yes, that's a valid SAS statement... */
run;
Sorry, that didn't work. As said, once I have four cases, the code doesn't generate the correct output.
Sorry, I misunderstood your requirements. I thought you needed a random selection of cases, but it seems it does not have to be. On the other hand you need CONTROL to be unique. Is that correct?
If I have understood that correctly, here is a way that may work:
data want;
set have;
stop;
run;
proc sql;
create unique index control on want(control);
quit;
data want;
if 0 then modify want;
out=4;
do until(last.case);
set have;
by case;
if out=0 then continue;
_iorc_=0;
output;
if _iorc_ then continue; /* unique key restraint violated */
out=out-1;
end;
_error_=0;
run;
This works, thanks.
Great to hear - please mark it as a solution, then 😊
Hello,
I have the same problem and when I run the first step of your code -
data want;
set have;
stop;
run;
I don;t under stand this step as it just reduces my data to zero records?
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.