I have a file with cases and controls (combined file is 15m). I would like to select 2 controls without replacement for each case such that the date for cases is less than the dates for controls. Some controls have missing dates but they should be included for control selection. A sample dataset is attached here.
data have; input id case DateInt :yymmdd. @@; format dateInt date11.; datalines; 1 0 . 2 0 20120103 3 1 20120101 4 0 20120103 5 0 20120101 6 1 20120103 7 0 20120101 8 0 . 9 0 20120103 10 1 20120105 11 0 20120103 12 0 20120103 13 0 20120103 14 0 20120106 15 0 20120107 ;
That is the control selection should be as follows-
(1) For case id 3, the 2 selections without replacement should be from ids 1,2,4,8,9,11,12,13,14,15.
(2) For case id 6, the 2 selections without replacement should be from ids 1,8,10,14,15 and excluding selections in (1) above.
(3) For case id 10, the 2 selections without replacement should be from ids 1,8,14,15 and excluding selections in (1) and (2) above.
Thanks for any tips. I am struggling to write a good code.
So you are saying that a missing DATE should be regarded as later than all known dates, correct?
Now, simple random selection could end up with assignment of the latest control dates to early case dates, leaving no control candidates for assignment to the latest case dates. If so, you could restart as many times as needed until you ended up with a completely valid collection of case-control assignments. A possibly very long process depending on the DATE distribution of CASEs and CONTROLs (assuming at least one valid collection of assignments exists).
A simple approach to this issue could be to start with the latest CASE, then randomly select from qualifying CONTROLS. Repeat for each CASE in reverse chronological order. This would guarantee a valid set of assignments, assuming one exists.
This code does that by sorting the data by descending date (assigning missing date to 31dec2099 for sorting purposes. As a result for each CASE, you can randomly select from CONTROLS that occur earlier in the sorted dataset:
data have;
input id case DateInt :yymmdd. @@;
format dateInt date11.;
datalines;
1 0 . 2 0 20120103 3 1 20120101
4 0 20120103 5 0 20120101 6 1 20120103
7 0 20120101 8 0 . 9 0 20120103
10 1 20120105 11 0 20120103 12 0 20120103
13 0 20120103 14 0 20120106 15 0 20120107
run;
data vneed/ view=vneed;
set have;
if dateint=. then dateint='31dec2099'd;
run;
proc sort data=vneed out=need;
by descending dateint case; /*For tied dates make CONTROLs precede CASEs */
run;
data want (drop=_: i); /* Two obs per CASE, with a new variable CONTROL_ID */
array cntrl_ids {1164000} _temporary_;
set need;
if case=0 then do; /* Update list of eligible controls */
_n+1; /* Count of eligible controls */
cntrl_ids{_n}=id;
end;
if case=1 then do i=1 to 2 while (_n>0); /* Find up to 2 controls */
_ptr=ceil(_n*ranuni(150195860));
control_id=cntrl_ids{_ptr};
output;
/*Remove the assigned control_id from cntrl_ids array*/
control_id=.;
/* Slide the remaining id's to the left */
do _p=_ptr to _n;
cntrl_ids{_p}=cntrl_ids{_p+1};
end;
_n+(-1); /* Decrement count of eligible controls*/
end;
run;
Just make sure that the array CNTRL_IDS is large enough to hold the rolling count of control observations.
Edited Note: I added the "while (_n>0)" to the DO statement to ensure selection of controls only when there are qualifying controls available.
Second edited note. I have changed the proc sort to force CONTROLs to precede CASEs when there are tied dates, thereby making them available for matching.
Also I have added a new reply to perform this task separately by SEX, in a single use of the program.
So you are saying that a missing DATE should be regarded as later than all known dates, correct?
Now, simple random selection could end up with assignment of the latest control dates to early case dates, leaving no control candidates for assignment to the latest case dates. If so, you could restart as many times as needed until you ended up with a completely valid collection of case-control assignments. A possibly very long process depending on the DATE distribution of CASEs and CONTROLs (assuming at least one valid collection of assignments exists).
A simple approach to this issue could be to start with the latest CASE, then randomly select from qualifying CONTROLS. Repeat for each CASE in reverse chronological order. This would guarantee a valid set of assignments, assuming one exists.
This code does that by sorting the data by descending date (assigning missing date to 31dec2099 for sorting purposes. As a result for each CASE, you can randomly select from CONTROLS that occur earlier in the sorted dataset:
data have;
input id case DateInt :yymmdd. @@;
format dateInt date11.;
datalines;
1 0 . 2 0 20120103 3 1 20120101
4 0 20120103 5 0 20120101 6 1 20120103
7 0 20120101 8 0 . 9 0 20120103
10 1 20120105 11 0 20120103 12 0 20120103
13 0 20120103 14 0 20120106 15 0 20120107
run;
data vneed/ view=vneed;
set have;
if dateint=. then dateint='31dec2099'd;
run;
proc sort data=vneed out=need;
by descending dateint case; /*For tied dates make CONTROLs precede CASEs */
run;
data want (drop=_: i); /* Two obs per CASE, with a new variable CONTROL_ID */
array cntrl_ids {1164000} _temporary_;
set need;
if case=0 then do; /* Update list of eligible controls */
_n+1; /* Count of eligible controls */
cntrl_ids{_n}=id;
end;
if case=1 then do i=1 to 2 while (_n>0); /* Find up to 2 controls */
_ptr=ceil(_n*ranuni(150195860));
control_id=cntrl_ids{_ptr};
output;
/*Remove the assigned control_id from cntrl_ids array*/
control_id=.;
/* Slide the remaining id's to the left */
do _p=_ptr to _n;
cntrl_ids{_p}=cntrl_ids{_p+1};
end;
_n+(-1); /* Decrement count of eligible controls*/
end;
run;
Just make sure that the array CNTRL_IDS is large enough to hold the rolling count of control observations.
Edited Note: I added the "while (_n>0)" to the DO statement to ensure selection of controls only when there are qualifying controls available.
Second edited note. I have changed the proc sort to force CONTROLs to precede CASEs when there are tied dates, thereby making them available for matching.
Also I have added a new reply to perform this task separately by SEX, in a single use of the program.
Wow! This is brilliant! Thank you!!!!
I am now told that in addition to the dates condition, the cases have to be selected based on sex as well i.e. controls have to be of the same sex as the cases. How would the above code change? A sample data is given below.
data have; input id case DateInt :yymmdd. sex $ @@; format dateInt date11.; datalines; 1 0 . M 2 0 20120103 M 3 1 20120101 F 4 0 20120103 F 5 0 20120101 M 6 1 20120103 F 7 0 20120101 M 8 0 . F 9 0 20120103 M 10 1 20120105 M 11 0 20120103 F 12 0 20120103 F 13 0 20120103 M 14 0 20120106 M 15 0 20120107 M run;
One could make some changes to this code to do what you are asking in one submission.
But I think it's simpler just to run the program twice, once for each sex.
Just change
data vneed/ view=vneed;
set have;
if dateint=. then dateint='31dec2099'd;
run;
to
data vneed/ view=vneed;
set have;
where sex='M';
if dateint=. then dateint='31dec2099'd;
run;
for males, and make
data wantm;
Then use WHERE SEX='F'; and DATA WANTF; for females.
In your sample data, for ID 6 (case=1, sex='F', date='03jan2012), there is only ONE control with a later date (ID 8 case=0, sex='F', date=.). Is that your expectation? That is how the code currently works.
Or do you also consider matches valid if the control has the same date as the case? In your sample data, that would add two other ID's (11 and 12) that would be valid matches for ID 6.
Thanks so much!
Yes. The matches are considered valid if they have the same dates . Tried editing the code but the matches are not correct.
If same day matches are acceptable, then change
proc sort data=vneed out=need;
by descending dateint descending case; /*For tied dates make CASEs precede CONTROLs*/
run;
to
proc sort data=vneed out=need;
by sex descending dateint case; /*For tied dates make CONTROLSs precede CASEs */
run;
So for tied dates, the CONTROLs are put in the array prior to the CASES - and therefore are available for random selection in matching cases. I will also modify the answer you marked as a solution.
This is a second reply to your request for including matching by sex. It processes each sex in a single use of the program:
data have;
input id case DateInt :yymmdd. sex $ @@;
format dateInt date11.;
datalines;
1 0 . M 2 0 20120103 M 3 1 20120101 F
4 0 20120103 F 5 0 20120101 M 6 1 20120103 F
7 0 20120101 M 8 0 . F 9 0 20120103 M
10 1 20120105 M 11 0 20120103 F 12 0 20120103 F
13 0 20120103 M 14 0 20120106 M 15 0 20120107 M
run;
data vneed/ view=vneed;
set have;
if dateint=. then dateint='31dec2099'd;
run;
proc sort data=vneed out=need;
by sex descending dateint case; /*For tied dates make CONTROLSs precede CASEs */
run;
data want (drop=_: i); /* Two obs per CASE, with a new variable CONTROL_ID */
array cntrl_ids {1164000} _temporary_;
set need;
by sex; /** This starts the new part */
retain _n 0;
if first.sex then do _n=_n to 0 by -1 ;
if _n=0 then leave;
cntrl_ids{_n}=.;
end; /** This ends the new part */
if case=0 then do; /* Update list of eligible controls */
_n+1; /* Count of eligible controls */
cntrl_ids{_n}=id;
end;
if case=1 then do i=1 to 2 while (_n>0); /* Find up to 2 controls */
_ptr=ceil(_n*ranuni(150195860));
control_id=cntrl_ids{_ptr};
output;
/*Remove the assigned control_id from cntrl_ids array*/
control_id=.;
/* Slide the remaining id's to the left */
do _p=_ptr to _n;
cntrl_ids{_p}=cntrl_ids{_p+1};
end;
_n+(-1); /* Decrement count of eligible controls*/
end;
run;
@mkeintz This is awesome! Thank you so, so much for taking the time and trouble for writing the code and helping me with me with this!!!
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9.
Early bird rate extended! Save $200 when you sign up by March 31.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.