BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
jenim514
Pyrite | Level 9

I have a dataset composed of 2.5 million subjects that contain case (study=1) and controls (study=0) . I need SAS coding to create a new dataset composed of all the cases plus five randomly selected controls for each case, matched by date, sex, and branch of service.   Control should only be used once (if already assigned to a case, cannot be assigned to another case).  This is my attempt, but not correct...

 

data tmp;

set exp2.case exp2.control;

rand_num=rand("uniform");

run;

*merge case/ control tables and assign random number;

 

 

proc sort data=tmp;

by ssn study rand_num;

run;

* 'study' variable is 1=case 0=control;

 

proc rank data=tmp out=tempR;

by ssn study;

var rand_num;

run;

  • THIS CHANGED ALL RAND_NUM to ‘1’ WHICH I DON”T THINK IS CORRECT?

 

proc sql;

create table exp2.match as

select a.ssn as e_ssn, a.service_branch as e_service_branch, a.sex as e_sex, a.service_entry_date as e_service_entry_date, a.study,

b.ssn as ne_ssn, b.service_branch as ne_service_branch, b.sex as ne_sex, b.service_entry_date as ne_service_entry_date, b.study

from tempr as a inner join tempr as b

     on( a.service_entry_date between b.service_entry_date and b.service_entry_date_low) and a.service_branch=b.service_branch and

a.sex=b.sex and a.rand_num=ceil(b.rand_num/5)

where a.study and not b.study;

quit;

 

Data looks like this + 2 million encounters

Obs SSN study service_branch sex service_entry_date service_entry_date_low rand_num
1 1 0 A MALE 05/10/2010 05/10/2009 1
2 2 1 A MALE 11/20/2006   1
3 3 0 M MALE 10/07/2011 10/07/2010 1
4 4 1 A MALE 06/22/2010   1
5 5 0 M MALE 10/23/2007 10/23/2006 1
6 6 0 M MALE 07/08/2013 07/08/2012 1
7 7 0 F FEMALE 01/16/2007 01/16/2006 1
8 8 1 A MALE 08/03/2006   1
9 9 0 F MALE 01/20/2009 01/21/2008 1
10 10 0 A MALE 01/06/2009 01/07/2008 1
11 11 0 A MALE 03/18/2009 03/18/2008 1
12 12 1 A FEMALE 04/19/2007 . 1
13 13 0 F MALE 02/18/2014 02/18/2013 1
14 14 0 F FEMALE 09/04/2007 09/04/2006 1
15 15 0 A MALE 08/18/2014 08/18/2013 1
16 16 0 N FEMALE 11/18/2013 11/18/2012 1
17 17 1 N FEMALE 07/08/2014   1
18 18 0 N MALE 03/11/2014 03/11/2013 1
19 19 0 N MALE 09/17/2013 09/17/2012 1
20 20 0 N MALE 03/13/2014 03/13/2013 1

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

Why would you need "where a.study and not b.study" if you are using two datasets?

 

The message "The query as specified involves ordering by an item that doesn't appear in its SELECT clause" is a Note, not an Error message. I also get that note. 

 

The "Sort execution failure" error message is new to me.

 

You could use the following, without changing the rest :

 

proc sql;
create table match as
select a.ssn as e_ssn, b.ssn as ne_ssn, rand("uniform") as rnd
from 
    exp2.enroll_case as a inner join 
    exp2.control_nd as b
        on  a.service_entry_date between b.service_entry_date and b.service_entry_date_low and 
            a.service_branch=b.service_branch and
            a.sex=b.sex
order by ne_ssn, rnd;
PG

View solution in original post

6 REPLIES 6
KachiM
Rhodochrosite | Level 12

Show sample of Case data set and Control data set separately to help in visualizing your problem and get you a solution.

Your matching variables are Sex and service_entry_date. The second matching variable is little confusing.

 

Out of the sample data sets, derive an output(it may be a random) for a minimum of 2 Cases for  clarity

Doc_Duke
Rhodochrosite | Level 12

I would approach this a bit differently. 

 

I'd keep the case and control datasets separate and apply the rand_num as you have done.

Drop the SORT and RANK.  (BTW, RANK is working correctly.  SSN is a unique key and it restarts the ranking every time the by level changes.)

Do the SQL, but change to two datasets and drop the rand_num part of the JOIN.  Include the case.ssn with the control.ssn in the output dataset and drop the other case variables.. This will get you ALL of the matches for each case, with duplication, so now your task is to de-duplicate.  Non-trivial, but doable.

 

OR, you can use one of the man 1:N matching macros that people have already written.  Google search for

sas matching macro

and you will find a number.

 

Statistically, a 1:5 match isn't that much more informative than a 1:2 match.  Depending on your next steps, you could also use all fo the matches (1:1-1:x) and have a perfectly valid analysis.  The use of a limited number of matches is valuable if you have to do manual abstraction (e.g. extra labor), but if you are totally working with computer files, what's a few extra cycles to use them all.

PGStats
Opal | Level 21

I modified the data slightly to create three matches (there was none in your sample data). Here is a hash solution:

 

data have;
infile datalines missover;
input Obs	SSN	study	service_branch :$1.	sex :$6.	
    (service_entry_date	service_entry_date_low) (:mmddyy10.);
format service_entry_date	service_entry_date_low yymmdd10.;
datalines;
1	1	0	A	MALE	05/10/2010	05/10/2005
2	2	1	A	MALE	11/20/2006
3	3	0	M	MALE	10/07/2011	10/07/2010
4	4	1	A	MALE	06/22/2010
5	5	0	M	MALE	10/23/2007	10/23/2006
6	6	0	M	MALE	07/08/2013	07/08/2012
7	7	0	F	FEMALE	01/16/2007	01/16/2006
8	8	1	A	MALE	08/03/2006
9	9	0	F	MALE	01/20/2009	01/21/2008
10	10	0	A	MALE	01/06/2009	01/07/2005
11	11	0	A	MALE	03/18/2009	03/18/2008
12	12	1	N	FEMALE	04/19/2007
13	13	0	F	MALE	02/18/2014	02/18/2013
14	14	0	F	FEMALE	09/04/2007	09/04/2005
15	15	0	A	MALE	08/18/2014	08/18/2013
16	16	0	N	FEMALE	11/18/2013	11/18/2005
17	17	1	N	FEMALE	07/08/2014
18	18	0	N	MALE	03/11/2014	03/11/2013
19	19	0	N	MALE	09/17/2013	09/17/2012
20	20	0	N	MALE	03/13/2014	03/13/2013
;

proc sql;
create table match as
select a.ssn as e_ssn, b.ssn as ne_ssn
from 
    have as a inner join 
    have as b
        on  a.service_entry_date between b.service_entry_date and b.service_entry_date_low and 
            a.service_branch=b.service_branch and
            a.sex=b.sex
where a.study and not b.study
order by ne_ssn, rand("uniform");
create table cases as
select ssn as e_ssn, 0 as count from have where study;
quit;

data mSet;
if _n_ = 1 then do;
	declare hash w(dataset:"cases");
	w.definekey("e_ssn");
	w.definedata("count");
	w.definedone();
	call missing(e_ssn, count);
	end;
assigned = 0;
do until(last.ne_ssn);
    set match; by ne_ssn;
    if not assigned then do;
        if w.find() = 0 then do;
            if count < 5 then do;
                count + 1;
                w.replace();
                output;
                assigned = 1;
                end;
            end;
        end;
    end;
keep e_ssn ne_ssn;
run;

proc print; run;
PG
jenim514
Pyrite | Level 9

Thank you @PGStats.  I ran the syntax you kindly prepared for me.  I am getting an error in the first sql section (see below).  I am currently using two tables (table A is cases, b is controls).  Both tables have SSN variable, so I'm not clear why I would get an error stating the order variable doesn' appear in the SELECT clause.   I apprecitate your help!!

 

82   proc sql;
83   create table match as
84   select a.ssn as e_ssn, b.ssn as ne_ssn
85   from
86       exp2.enroll_case as a inner join
87       exp2.control_nd as b
88           on  a.service_entry_date between b.service_entry_date and b.service_entry_date_low
88 ! and
89               a.service_branch=b.service_branch and
90               a.sex=b.sex
91   where a.study and not b.study
92   order by ne_ssn, rand("uniform");
NOTE: The query as specified involves ordering by an item that doesn't appear in its SELECT
      clause.
ERROR: Sort execution failure.

PGStats
Opal | Level 21

Why would you need "where a.study and not b.study" if you are using two datasets?

 

The message "The query as specified involves ordering by an item that doesn't appear in its SELECT clause" is a Note, not an Error message. I also get that note. 

 

The "Sort execution failure" error message is new to me.

 

You could use the following, without changing the rest :

 

proc sql;
create table match as
select a.ssn as e_ssn, b.ssn as ne_ssn, rand("uniform") as rnd
from 
    exp2.enroll_case as a inner join 
    exp2.control_nd as b
        on  a.service_entry_date between b.service_entry_date and b.service_entry_date_low and 
            a.service_branch=b.service_branch and
            a.sex=b.sex
order by ne_ssn, rnd;
PG
jenim514
Pyrite | Level 9

@PGStats Thank you so much for your help.  I was delayed in testing, but it worked great!

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 1722 views
  • 1 like
  • 4 in conversation