I have the following PROC SURVEYSELECT code, which I used to bootstrap the attached data set (1). The data set comprises 24 subjects, each with the same amount of data (e.g., 12 time points, CMTs 1-24 at each time point, etc.). When I run the code I do get bootstrapped samples, but not all of the data for a subject is included. The bootstrapped output (2) shows the data for subjects 1, 2, 3 and 5, with only one or two lines of data retained for each wsub. What I want is for all of the data for a wsub to be retained during bootstrapping, not just single lines. How can I adjust my code to do that?
proc import datafile='/folders/myfolders/bootstrapaptensio_PED/ngroupalla.csv'
out=newB
replace;
getnames=yes;
run;
PROC SURVEYSELECT DATA=NEWB METHOD=SRS REP=2 N=24 SEED= 3495 OUT=SAMPLE; RUN;
BOOTSTRAP OUTPUT (2)
Replic | wsub | time | dv | cmt | ...
1 | 1 | 12 | 0 | 4 | 0 | 2 | 1 | 23 | 1 | 6 | 2 |
1 | 2 | 5 | 0 | 12 | 0 | 2 | 1 | 20 | 1 | 6 | 1 |
1 | 2 | 12 | 0 | 9 | 0 | 2 | 1 | 20 | 1 | 6 | 1 |
1 | 3 | 0 | 0 | 18 | 0 | 2 | 1 | 23 | 1 | 6 | 1 |
1 | 3 | 6 | 0 | 12 | 0 | 2 | 1 | 23 | 1 | 6 | 1 |
1 | 5 | 5 | 0 | 7 | 0 | 2 | 1 | 33 | 1 | 9 | 2 |
1 | 5 | 10 | 0 | 19 |
DATA SET (1)
WSUB | TIME | DV | CMT | AMT | EVID | MDV | WEIGHT | SEX | AGE | AGROUP
1 | 0 | 0 | 1 | 30000000 | 1 | 1 | 23 | 1 | 6 | 2
1 | 0 | 0 | 2 | 30000000 | 1 | 1 | 23 | 1 | 6 | 2
1 | 0 | 0 | 3 | 0 | 2 | 1 | 23 | 1 | 6 | 2
1 | 0 | 0 | 4 | 0 | 2 | 1 | 23 | 1 | 6 | 2
1 | 0 | 0 | 5 | 0 | 2 | 1 | 23 | 1 | 6 | 2
1 | 0 | 0 | 6 | 0 | 2 | 1 | 23 | 1 | 6 | 2
1 | 0 | 0 | 7 | 0 | 2 | 1 | 23 | 1 | 6 | 2
1 | 0 | 0 | 8 | 0 | 2 | 1 | 23 | 1 | 6 | 2
1 | 0 | 0 | 9 | 0 | 2 | 1 | 23 | 1 | 6 | 2
1 | 0 | 0 | 10 | 0 | 2 | 1 | 23 | 1 | 6 | 2
Hi @jacksonan123,
If the 24 subjects are the sampling units (and variable wsub is their identifier), you should insert a SAMPLINGUNIT (alias CLUSTER) statement into your PROC SURVEYSELECT step (before the RUN statement):
cluster wsub;
Are you sure you want METHOD=SRS? That method samples without replacement. Ordinary bootstrap samples are drawn with replacement, so METHOD=URS (unrestricted random sampling) would be appropriate.
Using N=number of sampling units is typical for bootstrapping, but could be simplified to RATE=1.
Without the OUTHITS option of the PROC SURVEYSELECT statement, subjects that were selected more than once (note: sampling with replacement) are included only once per replicate in the output dataset, but the variable NumberHits contains their "multiplicity" (e.g. 2).
So, just add OUTHITS to obtain the information about those subjects multiple times in the output dataset.
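Putting these suggestions together, the step from the question might look like the sketch below. It assumes that wsub identifies the subjects in NEWB and that whole subjects should be resampled with replacement; the seed, the REP= value and the output dataset name are simply carried over from the original code.

```sas
/* Sketch: bootstrap whole subjects (assumes wsub is the subject identifier) */
proc surveyselect data=newB
                  method=urs   /* sampling with replacement */
                  rate=1       /* draw as many subjects as NEWB contains */
                  rep=2        /* number of bootstrap replicates */
                  seed=3495
                  out=sample
                  outhits;     /* write a subject's records once per hit */
   cluster wsub;               /* select subjects and keep all their records */
run;
```

With the CLUSTER statement the procedure selects subjects rather than individual records, so every selected wsub should appear in SAMPLE with all of its time points.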
Actually, OUTHITS should have produced the desired duplicates.
Here's a simplified example:
data have;
input wsub info;
cards;
1 11
1 12
2 21
2 22
3 31
3 32
;
proc surveyselect data=have method=urs rep=2 rate=1 seed=3495 out=want1;
cluster wsub;
run;
The resulting output dataset (WANT1) contains only two subjects per replicate (but this depends on the seed value):
Obs | Replicate | wsub | info | NumberHits
1 | 1 | 1 | 11 | 1
2 | 1 | 1 | 12 | 1
3 | 1 | 2 | 21 | 2
4 | 1 | 2 | 22 | 2
5 | 2 | 2 | 21 | 2
6 | 2 | 2 | 22 | 2
7 | 2 | 3 | 31 | 1
8 | 2 | 3 | 32 | 1
Variable NumberHits contains the number of times each subject was selected. In this example it happened that wsub=2 was selected twice in replicate 1 and (coincidentally) also twice in replicate 2. The total number of subjects (including the duplicates) in each of the two bootstrap samples is, of course, 3 (=number of subjects in dataset HAVE), as it should be with rate=1.
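The subject count per replicate can be checked from WANT1 itself. The following is only a minimal sketch, assuming the WANT1 dataset created above; the column name n_selected is made up for illustration. Because NumberHits is repeated on every record of a subject, the records are first reduced to one row per replicate and subject before the hits are summed.

```sas
/* Sketch: total number of selected subjects (with duplicates) per replicate */
proc sql;
   select Replicate,
          sum(NumberHits) as n_selected   /* hypothetical column name */
   from (select distinct Replicate, wsub, NumberHits
         from want1) as s                 /* one row per replicate and subject */
   group by Replicate;
quit;
```

With rate=1 and three subjects in HAVE, n_selected should be 3 in each replicate.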
Now, using the OUTHITS option ...
proc surveyselect data=have method=urs rep=2 rate=1 seed=3495 out=want outhits;
cluster wsub;
run;
... the samples remain unchanged. Only their representation in the output dataset (WANT) is different:
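Obs | Replicate | wsub | info | NumberHits
1 | 1 | 1 | 11 | 1
2 | 1 | 1 | 12 | 1
3 | 1 | 2 | 21 | 2
4 | 1 | 2 | 22 | 2
5 | 1 | 2 | 21 | 2
6 | 1 | 2 | 22 | 2
7 | 2 | 2 | 21 | 2
8 | 2 | 2 | 22 | 2
9 | 2 | 2 | 21 | 2
10 | 2 | 2 | 22 | 2
11 | 2 | 3 | 31 | 1
12 | 2 | 3 | 32 | 1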
Records with NumberHits>1 have now been copied NumberHits-1 times. Given that we used REP=2 and each of the three subjects in dataset HAVE had two observations, dataset WANT now has 2*6=12 observations (independent of the seed value). Variable NumberHits contains the multiplicities as before, but unlike dataset WANT1, the new output dataset WANT is not aggregated, and thus NumberHits is actually redundant.
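A quick way to confirm the record counts in the expanded output is to tabulate WANT by replicate. This is just a sketch under the assumption that WANT was created by the step above; the column names n_records and n_distinct_subjects are hypothetical.

```sas
/* Sketch: records and distinct subjects per replicate in the OUTHITS output */
proc sql;
   select Replicate,
          count(*)             as n_records,            /* includes copies */
          count(distinct wsub) as n_distinct_subjects   /* ignores copies  */
   from want
   group by Replicate;
quit;
```

Here n_records should be 6 in every replicate (three selected subjects with two observations each), whereas n_distinct_subjects depends on the seed.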