Solved: Re: Differences between Observations By Group

JJ_83 · Posted 02-25-2021 04:39 PM

I am working with a matched case control study, where there are multiple controls (case = 0) matched to one case (case = 1) by the variable "trimgroupid".

Each observation has it's own ID (uniqueid).

What I want to do is identify the controls that have a date of birth (dob) > 1 year away (+ or -) from it's matched case.

Here is what I have:

data have;
	infile datalines delimiter=',';
	length uniqueid $5 trimgroupid $7;
	format dob mmddyy10.;
	input uniqueid $ trimgroupid $ dob mmddyy10. case;
	datalines;
	X0131,XX01491,03/10/2000,1
	X1831,XX01491,12/05/2002,0
	X3691,XX01491,06/02/2000,0
	X1971,XX04997,02/05/2010,1
	X2611,XX04997,03/13/2011,0
	X4371,XX04997,06/25/2009,0
	X4621,XX04997,01/01/2009,0
	;
run;

And here is what I want:

data want;
	infile datalines delimiter=',';
	length uniqueid $5 trimgroupid $7;
	format dob mmddyy10.;
	input uniqueid $ trimgroupid $ dob mmddyy10. case;
	datalines;
	X1831,XX01491,12/05/2002,0
	X2611,XX04997,03/13/2011,0
	X4621,XX04997,01/01/2009,0
	;
run;

For the trimgroupid value of XX01491 there is 1 control whose dob is > 1 year away from the case it is matched to.

For the trimgroupid value of XX04997 there are 2 controls whose dob is > 1 year away from the case it is matched to.

mkeintz · Posted 02-25-2021 11:05 PM

If the data are already sorted by trimgroupid, then a MERGE statement (with a rename), accompanied by a BY statement will do what you want:

data have;
  infile datalines delimiter=',';
  length uniqueid $5 trimgroupid $7;
  format dob mmddyy10.;
  input uniqueid $ trimgroupid $ dob mmddyy10. case;
datalines;
X0131,XX01491,03/10/2000,1
X1831,XX01491,12/05/2002,0
X3691,XX01491,06/02/2000,0
X1971,XX04997,02/05/2010,1
X2611,XX04997,03/13/2011,0
X4371,XX04997,06/25/2009,0
X4621,XX04997,01/01/2009,0
run;
data want;
  merge have (where=(case=1) rename=(dob=case_dob))
        have (where=(case=0));
  by trimgroupid;
  if intck('year',case_dob,dob,'continuous')>0 or 
     intck('year',dob,case_dob,'continuous')>0 ;
run;

Since the control dob's can be either before or after the matching case dob, the subsetting IF checks on the interval from case_dob to dob, and also from dob to case_dob. Note this relies on the presence of exactly one case per trimgroupid. The program also has the advantage of storing the case_dob value in each of the qualifying controls.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

Shmuel · Posted 02-25-2021 08:07 PM

To simplify the code I used 365 days as a year, neglecting a leap year:

data have;
	infile datalines delimiter=',';
	length uniqueid $5 trimgroupid $7;
	format dob mmddyy10.;
	input uniqueid $ trimgroupid $ dob mmddyy10. case;
	datalines;
	X0131,XX01491,03/10/2000,1
	X1831,XX01491,12/05/2002,0
	X3691,XX01491,06/02/2000,0
	X1971,XX04997,02/05/2010,1
	X2611,XX04997,03/13/2011,0
	X4371,XX04997,06/25/2009,0
	X4621,XX04997,01/01/2009,0
	;
run;
proc sort data=have; by trimgroupid dob; run;
data want;
 set have;
  by trimgroupid dob;
     if dif(dob) ge 365;
run;

JJ_83 · Posted 02-26-2021 03:28 PM

This solution also works, but for my particular dataset the other approach fit better.

Thank you!

mkeintz · Posted 02-25-2021 11:05 PM

If the data are already sorted by trimgroupid, then a MERGE statement (with a rename), accompanied by a BY statement will do what you want:

data have;
  infile datalines delimiter=',';
  length uniqueid $5 trimgroupid $7;
  format dob mmddyy10.;
  input uniqueid $ trimgroupid $ dob mmddyy10. case;
datalines;
X0131,XX01491,03/10/2000,1
X1831,XX01491,12/05/2002,0
X3691,XX01491,06/02/2000,0
X1971,XX04997,02/05/2010,1
X2611,XX04997,03/13/2011,0
X4371,XX04997,06/25/2009,0
X4621,XX04997,01/01/2009,0
run;
data want;
  merge have (where=(case=1) rename=(dob=case_dob))
        have (where=(case=0));
  by trimgroupid;
  if intck('year',case_dob,dob,'continuous')>0 or 
     intck('year',dob,case_dob,'continuous')>0 ;
run;

Since the control dob's can be either before or after the matching case dob, the subsetting IF checks on the interval from case_dob to dob, and also from dob to case_dob. Note this relies on the presence of exactly one case per trimgroupid. The program also has the advantage of storing the case_dob value in each of the qualifying controls.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

JJ_83 · Posted 02-26-2021 03:27 PM

This worked great, thank you!

Differences between Observations By Group

Re: Differences between Observations By Group

Re: Differences between Observations By Group

Re: Differences between Observations By Group

Re: Differences between Observations By Group

Re: Differences between Observations By Group

SAS Innovate 2026 Registration is Open