Hello,
I have a dataset composed of 1.5 million subjects which are divided into cases and controls. I need SAS coding to create a new dataset composed of all the cases plus four randomly selected controls for each case, matched by date interviewed.
Any help is appreciated and I am also available to provide more details if needed.
Thanks!
You can match cases and 4 controls that were interviewed on the same day (dateInt) randomly this way:
data have;
input id case DateInt :yymmdd. @@;
format dateInt date11.;
datalines;
1 0 20120101 2 0 20120103 3 1 20120101
4 0 20120103 5 0 20120101 6 1 20120103
7 0 20120101 8 0 20120102 9 0 20120103
10 1 20120103 11 0 20120103 12 0 20120103
13 0 20120103 14 0 20120103 15 0 20120103
;
/* Assign a random order to all cases and controls */
data tmp;
call streaminit(98768976);
set have;
rnd = rand("UNIFORM");
run;
proc sort data=tmp; by dateInt case rnd; run;
proc rank data=tmp out=tmpR;
by dateInt case;
var rnd;
run;
/* Match the first case with the first 4 controls, the second case with controls
5 to 8 and so on, within each date. */
proc sql;
create table want as
select ca.dateInt, ca.id as caseId, co.id as controlId
from tmpR as ca inner join tmpR as co
on ca.dateInt=co.dateInt and ca.rnd=ceil(co.rnd/4)
where ca.case and not co.case;
drop table tmp, tmpR;
select * from want;
quit;
PG
You can match cases and 4 controls that were interviewed on the same day (dateInt) randomly this way:
data have;
input id case DateInt :yymmdd. @@;
format dateInt date11.;
datalines;
1 0 20120101 2 0 20120103 3 1 20120101
4 0 20120103 5 0 20120101 6 1 20120103
7 0 20120101 8 0 20120102 9 0 20120103
10 1 20120103 11 0 20120103 12 0 20120103
13 0 20120103 14 0 20120103 15 0 20120103
;
/* Assign a random order to all cases and controls */
data tmp;
call streaminit(98768976);
set have;
rnd = rand("UNIFORM");
run;
proc sort data=tmp; by dateInt case rnd; run;
proc rank data=tmp out=tmpR;
by dateInt case;
var rnd;
run;
/* Match the first case with the first 4 controls, the second case with controls
5 to 8 and so on, within each date. */
proc sql;
create table want as
select ca.dateInt, ca.id as caseId, co.id as controlId
from tmpR as ca inner join tmpR as co
on ca.dateInt=co.dateInt and ca.rnd=ceil(co.rnd/4)
where ca.case and not co.case;
drop table tmp, tmpR;
select * from want;
quit;
PG
PG's approach works for exact date matches. If you want to look for dates 'near' the case-date, do a google search on
nearest neighbor match site:sas.com
for a bunch of related results.
Thank you for the reply. Upon entering the code, I received the following error message: "Column case could not be found in the table/view identified with the correlation name CA" (and another one for the correlation name CO). As I am not conversant in proc sql I am unable to troubleshoot myself and any assistance would be appreciated.
Thanks!
The message means that dataset tmpR doesn't include a variable named case. It should be there if the previous step (proc rank) ran without error. - PG
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.