BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Kels123
Quartz | Level 8

Hi,

 

I have a dataset with about 1,300 subjects, many of which have multiple rows because they have multiple observation periods during my eight-year data collection / follow-up period. 

 

data tempfile;

infile datalines truncover;

INPUT

StudyID$2.@+1 count StartDate mmddyy10.@+1 EndDate1 mmddyy10.@+1;

format Startdate EndDate1 mmddyy10.;

datalines;

01 1 02/01/2007 05/01/2008

02 1 06/02/2012 05/24/2013

03 1 05/07/2010 08/19/2010

04 1 06/16/2007 06/21/2008

05 1 04/28/2007 11/07/2008

06 1 03/16/2007 06/20/2007

07 1 04/18/2007 04/19/2007

08 1 05/11/2007 08/03/2007

09 1 08/01/2007 08/28/2007

09 3 02/21/2008 03/05/2008

10 1 02/06/2008 03/12/2010

11 1 01/31/2007 12/26/2008

11 3 02/11/2009 05/08/2009

11 5 05/19/2009 08/14/2009

11 7 09/22/2009 12/18/2009

11 9 01/09/2010 02/03/2010

11 11 02/23/2010 11/19/2010

11 13 12/01/2010 03/18/2011

11 15 05/10/2011 04/23/2013

11 17 08/03/2013 09/27/2013

;

run;

proc print data=tempfile; run;

 

Please note: The count variable is an old variable left over from a previous coding procedure. 

 

For subjects with more than one observation, I will need to choose which observation to use as my "starting point". In an effort to avoid bias by only looking at everyone's first period or looking at everyone's last period that would allow for adequate follow-up time, I was thinking I could randomly select which observation period to use for analysis. This would mean that I need to include everyone who has only one row and then, for subjects with more than one row, randomly selecting one of their rows to include for analysis. Is there a way to randomly assign ones and zeroes for example, in order to accomplish this? For example:

 

StudyID count StartDate EndDate1    INCLUDE

01 1 02/01/2007 05/01/2008              1

02 1 06/02/2012 05/24/2013              1

03 1 05/07/2010 08/19/2010              1

04 1 06/16/2007 06/21/2008              1

05 1 04/28/2007 11/07/2008              1

06 1 03/16/2007 06/20/2007              1

07 1 04/18/2007 04/19/2007              1

08 1 05/11/2007 08/03/2007              1

09 1 08/01/2007 08/28/2007              1

09 3 02/21/2008 03/05/2008              0

10 1 02/06/2008 03/12/2010              1

11 1 01/31/2007 12/26/2008              0

11 3 02/11/2009 05/08/2009              0

11 5 05/19/2009 08/14/2009              0

11 7 09/22/2009 12/18/2009              1

11 9 01/09/2010 02/03/2010              0

11 11 02/23/2010 11/19/2010            0

11 13 12/01/2010 03/18/2011            0

11 15 05/10/2011 04/23/2013            0

11 17 08/03/2013 09/27/2013            0

 

Thank you very much in advance!

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

Let surveyselect do the job:

 

proc surveyselect data=tempfile outall out=outfile sampsize=1;
strata studyId;
run;
PG

View solution in original post

3 REPLIES 3
Astounding
PROC Star

While you can probably get PROC SURVEYSELECT to do this, it's not super difficult to program yourself.  For example:

 

data temp;

set have;

random_order = ranuni(12345);

run;

 

proc sort data=temp;

by id random_order;

run;

 

data want;

set temp;

by id random_order;

include = first.id;

drop random_order;

run;

 

You can always re-sort the data if you would like at that point.  With 1,300 subjects, efficiency won't be a big issue.

PGStats
Opal | Level 21

Let surveyselect do the job:

 

proc surveyselect data=tempfile outall out=outfile sampsize=1;
strata studyId;
run;
PG
mkeintz
PROC Star

Assuming the data are sorted by studyid, you can read and count the number of recs for each studyid, generating NRECS. Randomly choose an integer between 1 and NRECS ==>dummy_seq.  Then re-read and set dummy=1 for the selected sequence number:

 

 

data want (drop=nrecs dummy_seq seq);

  do nrecs=1 by 1 until (last.studyid);

    set have;

    by studyid;

  end;

 

  dummy_seq =      ceil(nrecs*ranuni(09810987));

 

  do seq=1 to nrecs;

    set have;

    if seq=dummy_seq then dummy=1;

    else dummy=0;

    output;

  end;

 

run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 3 replies
  • 4522 views
  • 3 likes
  • 4 in conversation