DATA Step, Macro, Functions and more

How to randomly select at least one observation per subject

Accepted Solution Solved
Reply
Contributor
Posts: 48
Accepted Solution

How to randomly select at least one observation per subject

Hi,

 

I have a dataset with about 1,300 subjects, many of which have multiple rows because they have multiple observation periods during my eight-year data collection / follow-up period. 

 

data tempfile;

infile datalines truncover;

INPUT

StudyID$2.@+1 count StartDate mmddyy10.@+1 EndDate1 mmddyy10.@+1;

format Startdate EndDate1 mmddyy10.;

datalines;

01 1 02/01/2007 05/01/2008

02 1 06/02/2012 05/24/2013

03 1 05/07/2010 08/19/2010

04 1 06/16/2007 06/21/2008

05 1 04/28/2007 11/07/2008

06 1 03/16/2007 06/20/2007

07 1 04/18/2007 04/19/2007

08 1 05/11/2007 08/03/2007

09 1 08/01/2007 08/28/2007

09 3 02/21/2008 03/05/2008

10 1 02/06/2008 03/12/2010

11 1 01/31/2007 12/26/2008

11 3 02/11/2009 05/08/2009

11 5 05/19/2009 08/14/2009

11 7 09/22/2009 12/18/2009

11 9 01/09/2010 02/03/2010

11 11 02/23/2010 11/19/2010

11 13 12/01/2010 03/18/2011

11 15 05/10/2011 04/23/2013

11 17 08/03/2013 09/27/2013

;

run;

proc print data=tempfile; run;

 

Please note: The count variable is an old variable left over from a previous coding procedure. 

 

For subjects with more than one observation, I will need to choose which observation to use as my "starting point". In an effort to avoid bias by only looking at everyone's first period or looking at everyone's last period that would allow for adequate follow-up time, I was thinking I could randomly select which observation period to use for analysis. This would mean that I need to include everyone who has only one row and then, for subjects with more than one row, randomly selecting one of their rows to include for analysis. Is there a way to randomly assign ones and zeroes for example, in order to accomplish this? For example:

 

StudyID count StartDate EndDate1    INCLUDE

01 1 02/01/2007 05/01/2008              1

02 1 06/02/2012 05/24/2013              1

03 1 05/07/2010 08/19/2010              1

04 1 06/16/2007 06/21/2008              1

05 1 04/28/2007 11/07/2008              1

06 1 03/16/2007 06/20/2007              1

07 1 04/18/2007 04/19/2007              1

08 1 05/11/2007 08/03/2007              1

09 1 08/01/2007 08/28/2007              1

09 3 02/21/2008 03/05/2008              0

10 1 02/06/2008 03/12/2010              1

11 1 01/31/2007 12/26/2008              0

11 3 02/11/2009 05/08/2009              0

11 5 05/19/2009 08/14/2009              0

11 7 09/22/2009 12/18/2009              1

11 9 01/09/2010 02/03/2010              0

11 11 02/23/2010 11/19/2010            0

11 13 12/01/2010 03/18/2011            0

11 15 05/10/2011 04/23/2013            0

11 17 08/03/2013 09/27/2013            0

 

Thank you very much in advance!


Accepted Solutions
Solution
‎05-08-2017 11:28 AM
Respected Advisor
Posts: 4,662

Re: How to randomly select at least one observation per subject

Let surveyselect do the job:

 

proc surveyselect data=tempfile outall out=outfile sampsize=1;
strata studyId;
run;
PG

View solution in original post


All Replies
Super User
Posts: 5,099

Re: How to randomly select at least one observation per subject

While you can probably get PROC SURVEYSELECT to do this, it's not super difficult to program yourself.  For example:

 

data temp;

set have;

random_order = ranuni(12345);

run;

 

proc sort data=temp;

by id random_order;

run;

 

data want;

set temp;

by id random_order;

include = first.id;

drop random_order;

run;

 

You can always re-sort the data if you would like at that point.  With 1,300 subjects, efficiency won't be a big issue.

Solution
‎05-08-2017 11:28 AM
Respected Advisor
Posts: 4,662

Re: How to randomly select at least one observation per subject

Let surveyselect do the job:

 

proc surveyselect data=tempfile outall out=outfile sampsize=1;
strata studyId;
run;
PG
Valued Guide
Posts: 797

Re: How to randomly select at least one observation per subject

Assuming the data are sorted by studyid, you can read and count the number of recs for each studyid, generating NRECS. Randomly choose an integer between 1 and NRECS ==>dummy_seq.  Then re-read and set dummy=1 for the selected sequence number:

 

 

data want (drop=nrecs dummy_seq seq);

  do nrecs=1 by 1 until (last.studyid);

    set have;

    by studyid;

  end;

 

  dummy_seq =      ceil(nrecs*ranuni(09810987));

 

  do seq=1 to nrecs;

    set have;

    if seq=dummy_seq then dummy=1;

    else dummy=0;

    output;

  end;

 

run;

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 3 replies
  • 137 views
  • 3 likes
  • 4 in conversation