Hi,
I have a dataset with about 1,300 subjects, many of which have multiple rows because they have multiple observation periods during my eight-year data collection / follow-up period.
data tempfile;
infile datalines truncover;
INPUT
StudyID$2.@+1 count StartDate mmddyy10.@+1 EndDate1 mmddyy10.@+1;
format Startdate EndDate1 mmddyy10.;
datalines;
01 1 02/01/2007 05/01/2008
02 1 06/02/2012 05/24/2013
03 1 05/07/2010 08/19/2010
04 1 06/16/2007 06/21/2008
05 1 04/28/2007 11/07/2008
06 1 03/16/2007 06/20/2007
07 1 04/18/2007 04/19/2007
08 1 05/11/2007 08/03/2007
09 1 08/01/2007 08/28/2007
09 3 02/21/2008 03/05/2008
10 1 02/06/2008 03/12/2010
11 1 01/31/2007 12/26/2008
11 3 02/11/2009 05/08/2009
11 5 05/19/2009 08/14/2009
11 7 09/22/2009 12/18/2009
11 9 01/09/2010 02/03/2010
11 11 02/23/2010 11/19/2010
11 13 12/01/2010 03/18/2011
11 15 05/10/2011 04/23/2013
11 17 08/03/2013 09/27/2013
;
run;
proc print data=tempfile; run;
Please note: The count variable is an old variable left over from a previous coding procedure.
For subjects with more than one observation, I will need to choose which observation to use as my "starting point". In an effort to avoid bias by only looking at everyone's first period or looking at everyone's last period that would allow for adequate follow-up time, I was thinking I could randomly select which observation period to use for analysis. This would mean that I need to include everyone who has only one row and then, for subjects with more than one row, randomly selecting one of their rows to include for analysis. Is there a way to randomly assign ones and zeroes for example, in order to accomplish this? For example:
StudyID count StartDate EndDate1 INCLUDE
01 1 02/01/2007 05/01/2008 1
02 1 06/02/2012 05/24/2013 1
03 1 05/07/2010 08/19/2010 1
04 1 06/16/2007 06/21/2008 1
05 1 04/28/2007 11/07/2008 1
06 1 03/16/2007 06/20/2007 1
07 1 04/18/2007 04/19/2007 1
08 1 05/11/2007 08/03/2007 1
09 1 08/01/2007 08/28/2007 1
09 3 02/21/2008 03/05/2008 0
10 1 02/06/2008 03/12/2010 1
11 1 01/31/2007 12/26/2008 0
11 3 02/11/2009 05/08/2009 0
11 5 05/19/2009 08/14/2009 0
11 7 09/22/2009 12/18/2009 1
11 9 01/09/2010 02/03/2010 0
11 11 02/23/2010 11/19/2010 0
11 13 12/01/2010 03/18/2011 0
11 15 05/10/2011 04/23/2013 0
11 17 08/03/2013 09/27/2013 0
Thank you very much in advance!
Let surveyselect do the job:
proc surveyselect data=tempfile outall out=outfile sampsize=1;
strata studyId;
run;
While you can probably get PROC SURVEYSELECT to do this, it's not super difficult to program yourself. For example:
data temp;
set have;
random_order = ranuni(12345);
run;
proc sort data=temp;
by id random_order;
run;
data want;
set temp;
by id random_order;
include = first.id;
drop random_order;
run;
You can always re-sort the data if you would like at that point. With 1,300 subjects, efficiency won't be a big issue.
Let surveyselect do the job:
proc surveyselect data=tempfile outall out=outfile sampsize=1;
strata studyId;
run;
Assuming the data are sorted by studyid, you can read and count the number of recs for each studyid, generating NRECS. Randomly choose an integer between 1 and NRECS ==>dummy_seq. Then re-read and set dummy=1 for the selected sequence number:
data want (drop=nrecs dummy_seq seq);
do nrecs=1 by 1 until (last.studyid);
set have;
by studyid;
end;
dummy_seq = ceil(nrecs*ranuni(09810987));
do seq=1 to nrecs;
set have;
if seq=dummy_seq then dummy=1;
else dummy=0;
output;
end;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.