I have a dataset with potential case-control pairs. I would like to select for every case four controls, without replacement.
Have:
case | control |
1 | 1 |
1 | 2 |
1 | 3 |
1 | 4 |
1 | 5 |
1 | 6 |
2 | 2 |
2 | 4 |
2 | 8 |
2 | 10 |
2 | 12 |
2 | 14 |
3 | 1 |
3 | 4 |
3 | 9 |
3 | 11 |
3 | 13 |
3 | 14 |
3 | 16 |
3 | 20 |
3 | 22 |
Want:
case | control |
1 | 1 |
1 | 2 |
1 | 3 |
1 | 4 |
2 | 8 |
2 | 10 |
2 | 12 |
2 | 14 |
3 | 9 |
3 | 11 |
3 | 13 |
3 | 16 |
Thanks!
Great to hear - please mark it as a solution, then 😊
Yes, Smantha, I did:
proc surveyselect data=cc out=RandOrderCC noprint
method=SRS /* sample w/o replacement seed=123 */
samprate=1 /* proportion of observations in sample */
OUTRANDOM;
run;
That didn't give four controls for every case. I should have N cases * 4 controls. Depending on the seed, I get a different number of controls.
(Also: if I have for case 1 controls 1 2 3 4 and for case 2 I also have control 2, I would like to avoid that the selection procedure uses control 2 for case 2, since I only have 4 controls for case 1. I would like to have as many controls as possible.)
data given;
input case control;
datalines;
1 1
1 2
1 3
1 4
1 5
1 6
2 2
2 4
2 8
2 10
2 12
2 14
3 1
3 4
3 9
3 11
3 13
3 14
3 16
3 20
3 22
;
run;
proc sort nodupkey;
by control;
run;
proc sort;
by case control;
run;
%macro function(dsn,value);
data &dsn;
set given;
if case = &value then output &dsn;
run;
data &dsn;
set &dsn;
if _n_ < 5 then output;
run;
%mend function;
%function(one,1);
%function(two,2);
%function(three,3);
data final;
set one two three;
run;
Dear Venkatamahesh,
your code generates indeed the want table I gave in my original request. However, by sorting on control, you also delete records which are still valid. For example if you use:
datalines;
1 1
1 2
1 3
1 4
1 5
1 6
2 2
2 4
2 8
2 10
2 12
2 14
3 1
3 4
3 9
3 11
3 13
3 14
3 16
3 20
3 22
4 1
4 2
4 3
4 4
4 5
4 6
The fourth case is completely deleted, but case 4 should be matched to control 5 and 6.
data cc;
input case control;
datalines;
1 1
1 2
1 3
1 4
1 5
1 6
2 2
2 4
2 8
2 10
2 12
2 14
3 1
3 4
3 9
3 11
3 13
3 14
3 16
3 20
3 22
4 1
4 2
4 3
4 4
4 5
4 6
;
run;
proc sort; by case control;
run;
data _null_;
set cc nobs=n;
call symputx('nrow',n);
stop;
run;
options symbolgen;
data want;
set cc ;
array test{&nrow.};
retain test1-test&nrow. count;
by case;
if first.case then count=1;
if count <5 then do;
put _all_;
if control not in test then do;
count=count+1;
output;
test[_n_]= control;
end;
end;
drop test1-test&nrow. count;
run;
Hi Smanta, the codes works for a small set of data, but I have 400.000 potential case control matches. The code ran for three days and the computer blocked. The log file was about 177GB. I will need something else I'm afraid.
You can do it in a single datastep, quite simple:
data have;
input case control;
datalines;
1 1
1 2
1 3
1 4
1 5
1 6
2 2
2 4
2 8
2 10
2 12
2 14
3 1
3 4
3 9
3 11
3 13
3 14
3 16
3 20
3 22
;
run;
data want;
do _N_=1 by 1 until(last.case);
set have;
by case;
end;
out=4;
do _N_=_N_ to 1 by -1;
set have;
if rand('uniform')<out/_N_ then do;
output;
out=out-1;
end;
end;
drop out; /* Yes, that's a valid SAS statement... */
run;
Sorry, that didn't work. As said, once I have four cases, the code doesn't generate the correct output.
Sorry, I misunderstood your requirements. I thought you needed a random selection of cases, but it seems it does not have to be. On the other hand you need CONTROL to be unique. Is that correct?
If I have understood that correctly, here is a way that may work:
data want;
set have;
stop;
run;
proc sql;
create unique index control on want(control);
quit;
data want;
if 0 then modify want;
out=4;
do until(last.case);
set have;
by case;
if out=0 then continue;
_iorc_=0;
output;
if _iorc_ then continue; /* unique key restraint violated */
out=out-1;
end;
_error_=0;
run;
This works, thanks.
Great to hear - please mark it as a solution, then 😊
Hello,
I have the same problem and when I run the first step of your code -
data want;
set have;
stop;
run;
I don;t under stand this step as it just reduces my data to zero records?
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.