jacksonan123
Lapis Lazuli | Level 10

I used the PROC SURVEYSELECT code below to bootstrap the attached data set (1). The data set comprises 24 subjects, each with the same amount of data (e.g., 12 time points, CMTs 1-24 for each time, etc.). When I run the code I do get bootstrapped samples, but not all of the data for a subject is included. The bootstrapped output (2) shows the data for subjects 1, 2, 3 and 5, with only 1 line of data retained for each wsub. What I want is for all of the data for a wsub to be retained during bootstrapping, not just one line. How can I adjust my code to do that?

proc import datafile='/folders/myfolders/bootstrapaptensio_PED/ngroupalla.csv'
            out=newB
            replace;
getnames=yes;
run;

PROC SURVEYSELECT DATA=NEWB METHOD=SRS REP=2 N=24 SEED= 3495 OUT=SAMPLE; RUN;

BOOTSTRAP OUTPUT (2) 

             Replic   wsub       time       dv        cmt        ……...

11120402123162
12501202120161
12120902120161
13001802123161
13601202123161
1550702133192
1510019

DATA SET (1)

 

 

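WSUB  TIME  DV  CMT  AMT       EVID  MDV  WEIGHT  SEX  AGE  AGROUP
1     0     0   1    30000000  1     1    23      1    6    2
1     0     0   2    30000000  1     1    23      1    6    2
1     0     0   3    0         2     1    23      1    6    2
1     0     0   4    0         2     1    23      1    6    2
1     0     0   5    0         2     1    23      1    6    2
1     0     0   6    0         2     1    23      1    6    2
1     0     0   7    0         2     1    23      1    6    2
1     0     0   8    0         2     1    23      1    6    2
1     0     0   9    0         2     1    23      1    6    2
1     0     0   10   0         2     1    23      1    6    2
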
FreelanceReinh
Jade | Level 19

Hi @jacksonan123,

 

If the 24 subjects are the sampling units (and variable wsub is their identifier), you should insert a SAMPLINGUNIT (alias CLUSTER) statement into your PROC SURVEYSELECT step (before the RUN statement):

cluster wsub;

 

Are you sure you want METHOD=SRS? That method samples without replacement. For ordinary bootstrap samples (i.e., drawn with replacement), METHOD=URS would be appropriate.

 

Using N=number of sampling units is typical for bootstrapping, but could be simplified to RATE=1.

jacksonan123
Lapis Lazuli | Level 10
I used the following code, and the output did contain all of the data for each wsub.

/* PROC SURVEYSELECT DATA=NEWB METHOD=urs REP=2 N=24 SEED=3495 OUT=SAMPLE; */
PROC SURVEYSELECT DATA=NEWB METHOD=urs REP=2 rate=1 SEED=3495 OUT=SAMPLE;

cluster wsub;

RUN;

However whether I used N=24 or rate=1 there were only 19 subjects output
(i.e., 1,2,3,5,6,7,8,10,11,13,14,15,16,17,18,19,21,22,24). I manually
checked the output to see if a subject had been replaced by having that
subject appear twice in the output data. I could not find any subjects with
duplicate data. Do you have any idea of why it didn't output N=24 subjects
as requested?




FreelanceReinh
Jade | Level 19

Without the OUTHITS option of the PROC SURVEYSELECT statement, subjects that were selected more than once (note: this is sampling with replacement) are included only once (per replicate) in the output dataset, but the variable NumberHits contains the "multiplicity" (e.g. 2).

 

So, just add OUTHITS to obtain the information about those subjects multiple times in the output dataset.
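
Applied to the step from the question, this might look like the sketch below. It reuses the dataset names (NEWB, SAMPLE) and option values posted earlier in the thread and has not been run against the actual data. Note that OUTHITS is a plain keyword in the PROC SURVEYSELECT statement and takes no equals sign or value.

```sas
/* Bootstrap whole subjects: sample clusters (wsub) with replacement.  */
/* OUTHITS writes a separate copy of a subject's records for each hit. */
proc surveyselect data=newB method=urs rep=2 rate=1 seed=3495
                  out=sample outhits;
   cluster wsub;   /* wsub identifies the sampling unit (subject) */
run;
```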

jacksonan123
Lapis Lazuli | Level 10
I put the outhits= into the code and indeed the number of hits was revealed. You stated that subjects "which were selected more than once (note: sampling with replacement) are only included once in the output dataset, but variable NumberHits contains the 'multiplicity' (e.g. 2)." I need to have even the duplicate subjects included in the output so that I can continue to process the data, since an N=19 would cause an issue in my next analysis of the data. Is there a way to get the subjects with multiplicity to be output into the data set?
FreelanceReinh
Jade | Level 19

Actually, OUTHITS should have produced the desired duplicates.

 

Here's a simplified example:

data have;
input wsub info;
cards;
1 11
1 12
2 21
2 22
3 31
3 32
;

proc surveyselect data=have method=urs rep=2 rate=1 seed=3495 out=want1;
cluster wsub;
run;

The resulting output dataset (WANT1) contains only two distinct subjects per replicate (but this depends on the seed value):

                                    Number
Obs    Replicate    wsub    info     Hits

 1         1          1      11        1
 2         1          1      12        1
 3         1          2      21        2
 4         1          2      22        2
 5         2          2      21        2
 6         2          2      22        2
 7         2          3      31        1
 8         2          3      32        1

Variable NumberHits contains the number of times each subject was selected. In this example it happened that wsub=2 was selected twice in replicate 1 and (by coincidence) also twice in replicate 2. The total number of subjects (including the duplicates) in each of the two bootstrap samples is, of course, 3 (=number of subjects in dataset HAVE), as it should be with RATE=1.
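
One way to confirm this from the aggregated output might be to reduce WANT1 to one row per subject and replicate and then add up NumberHits. This is only a minimal sketch: the dataset names SUBJ and CHECK and the variable names distinct_subjects and total_draws are made up for illustration.

```sas
/* Keep one row per subject within each replicate */
proc sort data=want1 out=subj(keep=Replicate wsub NumberHits) nodupkey;
   by Replicate wsub;
run;

/* Per replicate: number of distinct subjects and total number of draws */
proc means data=subj noprint nway;
   class Replicate;
   var NumberHits;
   output out=check(drop=_type_ _freq_)
          n=distinct_subjects sum=total_draws;
run;

proc print data=check noobs;
run;
```

For the WANT1 listing above, each replicate should show 2 distinct subjects and 3 draws in total. In the same way, a replicate that lists only 19 distinct subjects can still represent 24 draws once the multiplicities are added up.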

 

Now, using the OUTHITS option ...

proc surveyselect data=have method=urs rep=2 rate=1 seed=3495 out=want outhits;
cluster wsub;
run;

... the samples remain unchanged. Only their representation in the output dataset (WANT) is different:

                                    Number
Obs    Replicate    wsub    info     Hits

  1        1          1      11        1
  2        1          1      12        1
  3        1          2      21        2
  4        1          2      22        2
  5        1          2      21        2
  6        1          2      22        2
  7        2          2      21        2
  8        2          2      22        2
  9        2          2      21        2
 10        2          2      22        2
 11        2          3      31        1
 12        2          3      32        1

Records with NumberHits>1 have now been copied NumberHits-1 times. Given that we used REP=2 and each of the three subjects in dataset HAVE had two observations, dataset WANT now has 2*6=12 observations (independent of the seed value). Variable NumberHits contains the multiplicities as before, but unlike dataset WANT1 the new output dataset WANT is not aggregated, and thus NumberHits is actually redundant.
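
If a later analysis step needs to tell the repeated copies of a subject apart, one possible approach is to number them. The sketch below assumes that a variable (here info) uniquely identifies a record within a subject; for real data this might be a combination such as time and cmt. The names WANT_SORTED, WANT_ID and copy are hypothetical.

```sas
/* Sort so that identical records from repeated copies are adjacent */
proc sort data=want out=want_sorted;
   by Replicate wsub info;
run;

/* Number the copies: copy = 1 for the first occurrence of a record, */
/* 2 for its first duplicate, and so on                              */
data want_id;
   set want_sorted;
   by Replicate wsub info;
   if first.info then copy = 0;
   copy + 1;   /* sum statement: copy is retained across observations */
run;
```

The combination of wsub and copy could then serve as a unique subject identifier within each replicate.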
