Solved: match 1:5 unique case controls...

jenim514 · Posted 08-11-2016 03:22 PM

I have a dataset composed of 2.5 million subjects that contain case (study=1) and controls (study=0) . I need SAS coding to create a new dataset composed of all the cases plus five randomly selected controls for each case, matched by date, sex, and branch of service. Control should only be used once (if already assigned to a case, cannot be assigned to another case). This is my attempt, but not correct...

data tmp;

set exp2.case exp2.control;

rand_num=rand("uniform");

run;

*merge case/ control tables and assign random number;

proc sort data=tmp;

by ssn study rand_num;

run;

* 'study' variable is 1=case 0=control;

proc rank data=tmp out=tempR;

by ssn study;

var rand_num;

run;

THIS CHANGED ALL RAND_NUM to ‘1’ WHICH I DON”T THINK IS CORRECT?

proc sql;

create table exp2.match as

select a.ssn as e_ssn, a.service_branch as e_service_branch, a.sex as e_sex, a.service_entry_date as e_service_entry_date, a.study,

b.ssn as ne_ssn, b.service_branch as ne_service_branch, b.sex as ne_sex, b.service_entry_date as ne_service_entry_date, b.study

from tempr as a inner join tempr as b

on( a.service_entry_date between b.service_entry_date and b.service_entry_date_low) and a.service_branch=b.service_branch and

a.sex=b.sex and a.rand_num=ceil(b.rand_num/5)

where a.study and not b.study;

quit;

Data looks like this + 2 million encounters

Obs	SSN	study	service_branch	sex	service_entry_date	service_entry_date_low	rand_num
1	1	0	A	MALE	05/10/2010	05/10/2009	1
2	2	1	A	MALE	11/20/2006		1
3	3	0	M	MALE	10/07/2011	10/07/2010	1
4	4	1	A	MALE	06/22/2010		1
5	5	0	M	MALE	10/23/2007	10/23/2006	1
6	6	0	M	MALE	07/08/2013	07/08/2012	1
7	7	0	F	FEMALE	01/16/2007	01/16/2006	1
8	8	1	A	MALE	08/03/2006		1
9	9	0	F	MALE	01/20/2009	01/21/2008	1
10	10	0	A	MALE	01/06/2009	01/07/2008	1
11	11	0	A	MALE	03/18/2009	03/18/2008	1
12	12	1	A	FEMALE	04/19/2007	.	1
13	13	0	F	MALE	02/18/2014	02/18/2013	1
14	14	0	F	FEMALE	09/04/2007	09/04/2006	1
15	15	0	A	MALE	08/18/2014	08/18/2013	1
16	16	0	N	FEMALE	11/18/2013	11/18/2012	1
17	17	1	N	FEMALE	07/08/2014		1
18	18	0	N	MALE	03/11/2014	03/11/2013	1
19	19	0	N	MALE	09/17/2013	09/17/2012	1
20	20	0	N	MALE	03/13/2014	03/13/2013	1

PGStats · Posted 08-18-2016 03:45 PM

Why would you need "where a.study and not b.study" if you are using two datasets?

The message "The query as specified involves ordering by an item that doesn't appear in its SELECT clause" is a Note, not an Error message. I also get that note.

The "Sort execution failure" error message is new to me.

You could use the following, without changing the rest :

proc sql;
create table match as
select a.ssn as e_ssn, b.ssn as ne_ssn, rand("uniform") as rnd
from 
    exp2.enroll_case as a inner join 
    exp2.control_nd as b
        on  a.service_entry_date between b.service_entry_date and b.service_entry_date_low and 
            a.service_branch=b.service_branch and
            a.sex=b.sex
order by ne_ssn, rnd;

PG

View solution in original post

KachiM · Posted 08-11-2016 04:41 PM

Show sample of Case data set and Control data set separately to help in visualizing your problem and get you a solution.

Your matching variables are Sex and service_entry_date. The second matching variable is little confusing.

Out of the sample data sets, derive an output(it may be a random) for a minimum of 2 Cases for clarity

Doc_Duke · Posted 08-11-2016 11:56 PM

I would approach this a bit differently.

I'd keep the case and control datasets separate and apply the rand_num as you have done.

Drop the SORT and RANK. (BTW, RANK is working correctly. SSN is a unique key and it restarts the ranking every time the by level changes.)

Do the SQL, but change to two datasets and drop the rand_num part of the JOIN. Include the case.ssn with the control.ssn in the output dataset and drop the other case variables.. This will get you ALL of the matches for each case, with duplication, so now your task is to de-duplicate. Non-trivial, but doable.

OR, you can use one of the man 1:N matching macros that people have already written. Google search for

sas matching macro

and you will find a number.

Statistically, a 1:5 match isn't that much more informative than a 1:2 match. Depending on your next steps, you could also use all fo the matches (1:1-1:x) and have a perfectly valid analysis. The use of a limited number of matches is valuable if you have to do manual abstraction (e.g. extra labor), but if you are totally working with computer files, what's a few extra cycles to use them all.

PGStats · Posted 08-13-2016 11:55 PM

I modified the data slightly to create three matches (there was none in your sample data). Here is a hash solution:

data have;
infile datalines missover;
input Obs	SSN	study	service_branch :$1.	sex :$6.	
    (service_entry_date	service_entry_date_low) (:mmddyy10.);
format service_entry_date	service_entry_date_low yymmdd10.;
datalines;
1	1	0	A	MALE	05/10/2010	05/10/2005
2	2	1	A	MALE	11/20/2006
3	3	0	M	MALE	10/07/2011	10/07/2010
4	4	1	A	MALE	06/22/2010
5	5	0	M	MALE	10/23/2007	10/23/2006
6	6	0	M	MALE	07/08/2013	07/08/2012
7	7	0	F	FEMALE	01/16/2007	01/16/2006
8	8	1	A	MALE	08/03/2006
9	9	0	F	MALE	01/20/2009	01/21/2008
10	10	0	A	MALE	01/06/2009	01/07/2005
11	11	0	A	MALE	03/18/2009	03/18/2008
12	12	1	N	FEMALE	04/19/2007
13	13	0	F	MALE	02/18/2014	02/18/2013
14	14	0	F	FEMALE	09/04/2007	09/04/2005
15	15	0	A	MALE	08/18/2014	08/18/2013
16	16	0	N	FEMALE	11/18/2013	11/18/2005
17	17	1	N	FEMALE	07/08/2014
18	18	0	N	MALE	03/11/2014	03/11/2013
19	19	0	N	MALE	09/17/2013	09/17/2012
20	20	0	N	MALE	03/13/2014	03/13/2013
;

proc sql;
create table match as
select a.ssn as e_ssn, b.ssn as ne_ssn
from 
    have as a inner join 
    have as b
        on  a.service_entry_date between b.service_entry_date and b.service_entry_date_low and 
            a.service_branch=b.service_branch and
            a.sex=b.sex
where a.study and not b.study
order by ne_ssn, rand("uniform");
create table cases as
select ssn as e_ssn, 0 as count from have where study;
quit;

data mSet;
if _n_ = 1 then do;
	declare hash w(dataset:"cases");
	w.definekey("e_ssn");
	w.definedata("count");
	w.definedone();
	call missing(e_ssn, count);
	end;
assigned = 0;
do until(last.ne_ssn);
    set match; by ne_ssn;
    if not assigned then do;
        if w.find() = 0 then do;
            if count < 5 then do;
                count + 1;
                w.replace();
                output;
                assigned = 1;
                end;
            end;
        end;
    end;
keep e_ssn ne_ssn;
run;

proc print; run;

PG

jenim514 · Posted 08-18-2016 03:24 PM

Thank you @PGStats. I ran the syntax you kindly prepared for me. I am getting an error in the first sql section (see below). I am currently using two tables (table A is cases, b is controls). Both tables have SSN variable, so I'm not clear why I would get an error stating the order variable doesn' appear in the SELECT clause. I apprecitate your help!!

82   proc sql;
83   create table match as
84   select a.ssn as e_ssn, b.ssn as ne_ssn
85   from
86       exp2.enroll_case as a inner join
87       exp2.control_nd as b
88           on a.service_entry_date between b.service_entry_date and b.service_entry_date_low
88 ! and
89               a.service_branch=b.service_branch and
90               a.sex=b.sex
91   where a.study and not b.study
92   order by ne_ssn, rand("uniform");
NOTE: The query as specified involves ordering by an item that doesn't appear in its SELECT
      clause.
ERROR: Sort execution failure.

PGStats · Posted 08-18-2016 03:45 PM

Why would you need "where a.study and not b.study" if you are using two datasets?

The message "The query as specified involves ordering by an item that doesn't appear in its SELECT clause" is a Note, not an Error message. I also get that note.

The "Sort execution failure" error message is new to me.

You could use the following, without changing the rest :

proc sql;
create table match as
select a.ssn as e_ssn, b.ssn as ne_ssn, rand("uniform") as rnd
from 
    exp2.enroll_case as a inner join 
    exp2.control_nd as b
        on  a.service_entry_date between b.service_entry_date and b.service_entry_date_low and 
            a.service_branch=b.service_branch and
            a.sex=b.sex
order by ne_ssn, rnd;

PG

jenim514 · Posted 09-01-2016 02:56 PM

@PGStats Thank you so much for your help. I was delayed in testing, but it worked great!

match 1:5 unique case controls...

Re: match 1:5 unique case controls...

Re: match 1:5 unique case controls...

Re: match 1:5 unique case controls...

Re: match 1:5 unique case controls...

Re: match 1:5 unique case controls...

Re: match 1:5 unique case controls...

Re: match 1:5 unique case controls...

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away