Solved: How to randomly select at least one observation per subject

Kels123 · Posted 05-02-2017 01:48 PM

Hi,

I have a dataset with about 1,300 subjects, many of which have multiple rows because they have multiple observation periods during my eight-year data collection / follow-up period.

data tempfile;

infile datalines truncover;

INPUT

StudyID$2.@+1 count StartDate mmddyy10.@+1 EndDate1 mmddyy10.@+1;

format Startdate EndDate1 mmddyy10.;

datalines;

01 1 02/01/2007 05/01/2008

02 1 06/02/2012 05/24/2013

03 1 05/07/2010 08/19/2010

04 1 06/16/2007 06/21/2008

05 1 04/28/2007 11/07/2008

06 1 03/16/2007 06/20/2007

07 1 04/18/2007 04/19/2007

08 1 05/11/2007 08/03/2007

09 1 08/01/2007 08/28/2007

09 3 02/21/2008 03/05/2008

10 1 02/06/2008 03/12/2010

11 1 01/31/2007 12/26/2008

11 3 02/11/2009 05/08/2009

11 5 05/19/2009 08/14/2009

11 7 09/22/2009 12/18/2009

11 9 01/09/2010 02/03/2010

11 11 02/23/2010 11/19/2010

11 13 12/01/2010 03/18/2011

11 15 05/10/2011 04/23/2013

11 17 08/03/2013 09/27/2013

;

run;

proc print data=tempfile; run;

Please note: The count variable is an old variable left over from a previous coding procedure.

For subjects with more than one observation, I will need to choose which observation to use as my "starting point". In an effort to avoid bias by only looking at everyone's first period or looking at everyone's last period that would allow for adequate follow-up time, I was thinking I could randomly select which observation period to use for analysis. This would mean that I need to include everyone who has only one row and then, for subjects with more than one row, randomly selecting one of their rows to include for analysis. Is there a way to randomly assign ones and zeroes for example, in order to accomplish this? For example:

StudyID count StartDate EndDate1 INCLUDE

01 1 02/01/2007 05/01/2008 1

02 1 06/02/2012 05/24/2013 1

03 1 05/07/2010 08/19/2010 1

04 1 06/16/2007 06/21/2008 1

05 1 04/28/2007 11/07/2008 1

06 1 03/16/2007 06/20/2007 1

07 1 04/18/2007 04/19/2007 1

08 1 05/11/2007 08/03/2007 1

09 1 08/01/2007 08/28/2007 1

09 3 02/21/2008 03/05/2008 0

10 1 02/06/2008 03/12/2010 1

11 1 01/31/2007 12/26/2008 0

11 3 02/11/2009 05/08/2009 0

11 5 05/19/2009 08/14/2009 0

11 7 09/22/2009 12/18/2009 1

11 9 01/09/2010 02/03/2010 0

11 11 02/23/2010 11/19/2010 0

11 13 12/01/2010 03/18/2011 0

11 15 05/10/2011 04/23/2013 0

11 17 08/03/2013 09/27/2013 0

Thank you very much in advance!

PGStats · Posted 05-02-2017 03:57 PM

Let surveyselect do the job:

proc surveyselect data=tempfile outall out=outfile sampsize=1;
strata studyId;
run;

PG

View solution in original post

Astounding · Posted 05-02-2017 02:15 PM

While you can probably get PROC SURVEYSELECT to do this, it's not super difficult to program yourself. For example:

data temp;

set have;

random_order = ranuni(12345);

run;

proc sort data=temp;

by id random_order;

run;

data want;

set temp;

by id random_order;

include = first.id;

drop random_order;

run;

You can always re-sort the data if you would like at that point. With 1,300 subjects, efficiency won't be a big issue.

PGStats · Posted 05-02-2017 03:57 PM

Let surveyselect do the job:

proc surveyselect data=tempfile outall out=outfile sampsize=1;
strata studyId;
run;

PG

mkeintz · Posted 05-02-2017 03:59 PM

Assuming the data are sorted by studyid, you can read and count the number of recs for each studyid, generating NRECS. Randomly choose an integer between 1 and NRECS ==>dummy_seq. Then re-read and set dummy=1 for the selected sequence number:

data want (drop=nrecs dummy_seq seq);

do nrecs=1 by 1 until (last.studyid);

set have;

by studyid;

end;

dummy_seq = ceil(nrecs*ranuni(09810987));

do seq=1 to nrecs;

set have;

if seq=dummy_seq then dummy=1;

else dummy=0;

output;

end;

run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

How to randomly select at least one observation per subject

Re: How to randomly select at least one observation per subject

Re: How to randomly select at least one observation per subject

Re: How to randomly select at least one observation per subject

Re: How to randomly select at least one observation per subject

Catch up on SAS Innovate 2026

How to randomly select at least one observation per subject

Re: How to randomly select at least one observation per subject

Re: How to randomly select at least one observation per subject

Re: How to randomly select at least one observation per subject

Re: How to randomly select at least one observation per subject

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away