I have a dataset composed of 2.5 million subjects that contain case (study=1) and controls (study=0) . I need SAS coding to create a new dataset composed of all the cases plus five randomly selected controls for each case, matched by date, sex, and branch of service. Control should only be used once (if already assigned to a case, cannot be assigned to another case). This is my attempt, but not correct...
data tmp;
set exp2.case exp2.control;
rand_num=rand("uniform");
run;
*merge case/ control tables and assign random number;
proc sort data=tmp;
by ssn study rand_num;
run;
* 'study' variable is 1=case 0=control;
proc rank data=tmp out=tempR;
by ssn study;
var rand_num;
run;
proc sql;
create table exp2.match as
select a.ssn as e_ssn, a.service_branch as e_service_branch, a.sex as e_sex, a.service_entry_date as e_service_entry_date, a.study,
b.ssn as ne_ssn, b.service_branch as ne_service_branch, b.sex as ne_sex, b.service_entry_date as ne_service_entry_date, b.study
from tempr as a inner join tempr as b
on( a.service_entry_date between b.service_entry_date and b.service_entry_date_low) and a.service_branch=b.service_branch and
a.sex=b.sex and a.rand_num=ceil(b.rand_num/5)
where a.study and not b.study;
quit;
Data looks like this + 2 million encounters
Obs | SSN | study | service_branch | sex | service_entry_date | service_entry_date_low | rand_num |
---|---|---|---|---|---|---|---|
1 | 1 | 0 | A | MALE | 05/10/2010 | 05/10/2009 | 1 |
2 | 2 | 1 | A | MALE | 11/20/2006 | 1 | |
3 | 3 | 0 | M | MALE | 10/07/2011 | 10/07/2010 | 1 |
4 | 4 | 1 | A | MALE | 06/22/2010 | 1 | |
5 | 5 | 0 | M | MALE | 10/23/2007 | 10/23/2006 | 1 |
6 | 6 | 0 | M | MALE | 07/08/2013 | 07/08/2012 | 1 |
7 | 7 | 0 | F | FEMALE | 01/16/2007 | 01/16/2006 | 1 |
8 | 8 | 1 | A | MALE | 08/03/2006 | 1 | |
9 | 9 | 0 | F | MALE | 01/20/2009 | 01/21/2008 | 1 |
10 | 10 | 0 | A | MALE | 01/06/2009 | 01/07/2008 | 1 |
11 | 11 | 0 | A | MALE | 03/18/2009 | 03/18/2008 | 1 |
12 | 12 | 1 | A | FEMALE | 04/19/2007 | . | 1 |
13 | 13 | 0 | F | MALE | 02/18/2014 | 02/18/2013 | 1 |
14 | 14 | 0 | F | FEMALE | 09/04/2007 | 09/04/2006 | 1 |
15 | 15 | 0 | A | MALE | 08/18/2014 | 08/18/2013 | 1 |
16 | 16 | 0 | N | FEMALE | 11/18/2013 | 11/18/2012 | 1 |
17 | 17 | 1 | N | FEMALE | 07/08/2014 | 1 | |
18 | 18 | 0 | N | MALE | 03/11/2014 | 03/11/2013 | 1 |
19 | 19 | 0 | N | MALE | 09/17/2013 | 09/17/2012 | 1 |
20 | 20 | 0 | N | MALE | 03/13/2014 | 03/13/2013 | 1 |
Why would you need "where a.study and not b.study" if you are using two datasets?
The message "The query as specified involves ordering by an item that doesn't appear in its SELECT clause" is a Note, not an Error message. I also get that note.
The "Sort execution failure" error message is new to me.
You could use the following, without changing the rest :
proc sql;
create table match as
select a.ssn as e_ssn, b.ssn as ne_ssn, rand("uniform") as rnd
from
exp2.enroll_case as a inner join
exp2.control_nd as b
on a.service_entry_date between b.service_entry_date and b.service_entry_date_low and
a.service_branch=b.service_branch and
a.sex=b.sex
order by ne_ssn, rnd;
Show sample of Case data set and Control data set separately to help in visualizing your problem and get you a solution.
Your matching variables are Sex and service_entry_date. The second matching variable is little confusing.
Out of the sample data sets, derive an output(it may be a random) for a minimum of 2 Cases for clarity
I would approach this a bit differently.
I'd keep the case and control datasets separate and apply the rand_num as you have done.
Drop the SORT and RANK. (BTW, RANK is working correctly. SSN is a unique key and it restarts the ranking every time the by level changes.)
Do the SQL, but change to two datasets and drop the rand_num part of the JOIN. Include the case.ssn with the control.ssn in the output dataset and drop the other case variables.. This will get you ALL of the matches for each case, with duplication, so now your task is to de-duplicate. Non-trivial, but doable.
OR, you can use one of the man 1:N matching macros that people have already written. Google search for
sas matching macro
and you will find a number.
Statistically, a 1:5 match isn't that much more informative than a 1:2 match. Depending on your next steps, you could also use all fo the matches (1:1-1:x) and have a perfectly valid analysis. The use of a limited number of matches is valuable if you have to do manual abstraction (e.g. extra labor), but if you are totally working with computer files, what's a few extra cycles to use them all.
I modified the data slightly to create three matches (there was none in your sample data). Here is a hash solution:
data have;
infile datalines missover;
input Obs SSN study service_branch :$1. sex :$6.
(service_entry_date service_entry_date_low) (:mmddyy10.);
format service_entry_date service_entry_date_low yymmdd10.;
datalines;
1 1 0 A MALE 05/10/2010 05/10/2005
2 2 1 A MALE 11/20/2006
3 3 0 M MALE 10/07/2011 10/07/2010
4 4 1 A MALE 06/22/2010
5 5 0 M MALE 10/23/2007 10/23/2006
6 6 0 M MALE 07/08/2013 07/08/2012
7 7 0 F FEMALE 01/16/2007 01/16/2006
8 8 1 A MALE 08/03/2006
9 9 0 F MALE 01/20/2009 01/21/2008
10 10 0 A MALE 01/06/2009 01/07/2005
11 11 0 A MALE 03/18/2009 03/18/2008
12 12 1 N FEMALE 04/19/2007
13 13 0 F MALE 02/18/2014 02/18/2013
14 14 0 F FEMALE 09/04/2007 09/04/2005
15 15 0 A MALE 08/18/2014 08/18/2013
16 16 0 N FEMALE 11/18/2013 11/18/2005
17 17 1 N FEMALE 07/08/2014
18 18 0 N MALE 03/11/2014 03/11/2013
19 19 0 N MALE 09/17/2013 09/17/2012
20 20 0 N MALE 03/13/2014 03/13/2013
;
proc sql;
create table match as
select a.ssn as e_ssn, b.ssn as ne_ssn
from
have as a inner join
have as b
on a.service_entry_date between b.service_entry_date and b.service_entry_date_low and
a.service_branch=b.service_branch and
a.sex=b.sex
where a.study and not b.study
order by ne_ssn, rand("uniform");
create table cases as
select ssn as e_ssn, 0 as count from have where study;
quit;
data mSet;
if _n_ = 1 then do;
declare hash w(dataset:"cases");
w.definekey("e_ssn");
w.definedata("count");
w.definedone();
call missing(e_ssn, count);
end;
assigned = 0;
do until(last.ne_ssn);
set match; by ne_ssn;
if not assigned then do;
if w.find() = 0 then do;
if count < 5 then do;
count + 1;
w.replace();
output;
assigned = 1;
end;
end;
end;
end;
keep e_ssn ne_ssn;
run;
proc print; run;
Thank you @PGStats. I ran the syntax you kindly prepared for me. I am getting an error in the first sql section (see below). I am currently using two tables (table A is cases, b is controls). Both tables have SSN variable, so I'm not clear why I would get an error stating the order variable doesn' appear in the SELECT clause. I apprecitate your help!!
82 proc sql;
83 create table match as
84 select a.ssn as e_ssn, b.ssn as ne_ssn
85 from
86 exp2.enroll_case as a inner join
87 exp2.control_nd as b
88 on a.service_entry_date between b.service_entry_date and b.service_entry_date_low
88 ! and
89 a.service_branch=b.service_branch and
90 a.sex=b.sex
91 where a.study and not b.study
92 order by ne_ssn, rand("uniform");
NOTE: The query as specified involves ordering by an item that doesn't appear in its SELECT
clause.
ERROR: Sort execution failure.
Why would you need "where a.study and not b.study" if you are using two datasets?
The message "The query as specified involves ordering by an item that doesn't appear in its SELECT clause" is a Note, not an Error message. I also get that note.
The "Sort execution failure" error message is new to me.
You could use the following, without changing the rest :
proc sql;
create table match as
select a.ssn as e_ssn, b.ssn as ne_ssn, rand("uniform") as rnd
from
exp2.enroll_case as a inner join
exp2.control_nd as b
on a.service_entry_date between b.service_entry_date and b.service_entry_date_low and
a.service_branch=b.service_branch and
a.sex=b.sex
order by ne_ssn, rnd;
@PGStats Thank you so much for your help. I was delayed in testing, but it worked great!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.