Solved: Proc Survey select

jacksonan123 · Posted 12-31-2019 08:51 AM

I have the following PROC SURVEYSELECT code, which I used to bootstrap the attached data set (1). The data set comprises 24 subjects, each with the same amount of data (e.g., 12 time points, CMTs 1-24 for each time, etc.). When I run the code I do get bootstrapped samples, but not all of the data for a subject is included. The bootstrapped output (2) shows the data for subjects 1, 2, 3 and 5, with only one or two lines of data retained for each wsub. What I want is for all of the data for a wsub to be retained during bootstrapping, not just isolated lines. How can I adjust my code to do that?

proc import datafile='/folders/myfolders/bootstrapaptensio_PED/ngroupalla.csv'
            out=newB
            replace;
getnames=yes;
run;

PROC SURVEYSELECT DATA=NEWB METHOD=SRS REP=2 N=24 SEED= 3495 OUT=SAMPLE; RUN;

BOOTSTRAP OUTPUT (2)

Replic wsub time dv cmt ……...

1	1	12	4	0	2	1	23	1	6	2
1	2	5	12	0	2	1	20	1	6	1
1	2	12	9	0	2	1	20	1	6	1
1	3	0	18	0	2	1	23	1	6	1
1	3	6	12	0	2	1	23	1	6	1
1	5	5	7	0	2	1	33	1	9	2
1	5	10	19

DATA SET (1)

WSUB	TIME	DV	CMT	AMT	EVID	MDV	WEIGHT	SEX	AGE	AGROUP
1	0	0	1	30000000	1	1	23	1	6	2
1	0	0	2	30000000	1	1	23	1	6	2
1	0	0	3	0	2	1	23	1	6	2
1	0	0	4	0	2	1	23	1	6	2
1	0	0	5	0	2	1	23	1	6	2
1	0	0	6	0	2	1	23	1	6	2
1	0	0	7	0	2	1	23	1	6	2
1	0	0	8	0	2	1	23	1	6	2
1	0	0	9	0	2	1	23	1	6	2
1	0	0	10	0	2	1	23	1	6	2

FreelanceReinh · Posted 12-31-2019 11:32 AM

Hi @jacksonan123,

If the 24 subjects are the sampling units (and variable wsub is their identifier), you should insert a SAMPLINGUNIT (alias CLUSTER) statement into your PROC SURVEYSELECT step (before the RUN statement):

cluster wsub;

Are you sure you want METHOD=SRS? That method samples without replacement. For common bootstrap samples (i.e. with replacement), METHOD=URS would be appropriate.

Using N=number of sampling units is typical for bootstrapping, but could be simplified to RATE=1.

jacksonan123 · Posted 12-31-2019 12:40 PM

I used the following code and the output did contain all of the data for each wsub.

/* PROC SURVEYSELECT DATA=NEWB METHOD=urs REP=2 N=24 SEED=3495 OUT=SAMPLE; */
PROC SURVEYSELECT DATA=NEWB METHOD=urs REP=2 rate=1 SEED=3495 OUT=SAMPLE;
cluster wsub;
RUN;

However whether I used N=24 or rate=1 there were only 19 subjects output
(i.e., 1,2,3,5,6,7,8,10,11,13,14,15,16,17,18,19,21,22,24). I manually
checked the output to see if a subject had been replaced by having that
subject appear twice in the output data. I could not find any subjects with
duplicate data. Do you have any idea of why it didn't output N=24 subjects
as requested?

FreelanceReinh · Posted 12-31-2019 01:02 PM

Without the OUTHITS option of the PROC SURVEYSELECT statement, subjects that were selected more than once (note: sampling with replacement) are included only once (per replicate) in the output dataset, but variable NumberHits contains the "multiplicity" (e.g. 2).

So, just add OUTHITS to obtain the information about those subjects multiple times in the output dataset.
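To confirm that the 19 distinct subjects really do add up to the 24 requested draws, the hits can be summed per replicate. This is only a sketch: it assumes the output dataset is SAMPLE (created without OUTHITS, as in the step above) and that NumberHits is constant within each subject of a replicate; the column aliases distinct_subjects and total_draws are made up for illustration.

```sas
/* Sketch: one row per subject and replicate, then sum the hits. */
proc sql;
   select Replicate,
          count(*)        as distinct_subjects,  /* e.g. 19 */
          sum(NumberHits) as total_draws         /* should equal the number of subjects requested */
   from (select distinct Replicate, wsub, NumberHits
         from sample) as s
   group by Replicate;
quit;
```

If total_draws matches the number of subjects in the input data for every replicate, the sample size is as requested and only the representation in the output dataset is aggregated.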

jacksonan123 · Posted 12-31-2019 01:30 PM

I put the outhits= into the code and indeed the number of hits was
revealed. You stated that "subjects which were selected more than once
(note: sampling with replacement) are only included once in the output
dataset, but variable NumberHits contains the "multiplicity" (e.g. 2)."
I need to have even the duplicate subjects included in the output so that
I can continue to process the data, since an N=19 would cause an issue in
my next analysis of the data. Is there a way to get the subjects with
multiplicity to be output into the data set?

FreelanceReinh · Posted 12-31-2019 02:34 PM

Actually, OUTHITS should have produced the desired duplicates.

Here's a simplified example:

data have;
input wsub info;
cards;
1 11
1 12
2 21
2 22
3 31
3 32
;

proc surveyselect data=have method=urs rep=2 rate=1 seed=3495 out=want1;
cluster wsub;
run;

The resulting output dataset (WANT1) contains only two distinct subjects per replicate (though this depends on the seed value):

                                    Number
Obs    Replicate    wsub    info     Hits

 1         1          1      11        1
 2         1          1      12        1
 3         1          2      21        2
 4         1          2      22        2
 5         2          2      21        2
 6         2          2      22        2
 7         2          3      31        1
 8         2          3      32        1

Variable NumberHits contains the number of times each subject was selected. In this example, wsub=2 happened to be selected twice in replicate 1 and (coincidentally) twice in replicate 2 as well. The total number of subjects (including the duplicates) in each of the two bootstrap samples is, of course, 3 (the number of subjects in dataset HAVE), as it should be with RATE=1.

Now, using the OUTHITS option ...

proc surveyselect data=have method=urs rep=2 rate=1 seed=3495 out=want outhits;
cluster wsub;
run;

... the samples remain unchanged. Only their representation in the output dataset (WANT) is different:

                                    Number
Obs    Replicate    wsub    info     Hits

  1        1          1      11        1
  2        1          1      12        1
  3        1          2      21        2
  4        1          2      22        2
  5        1          2      21        2
  6        1          2      22        2
  7        2          2      21        2
  8        2          2      22        2
  9        2          2      21        2
 10        2          2      22        2
 11        2          3      31        1
 12        2          3      32        1

Records with NumberHits>1 have now been copied NumberHits-1 times, so each appears NumberHits times. Given that we used REP=2 and each of the three subjects in dataset HAVE has two observations, dataset WANT now has 2*6=12 observations (regardless of the seed value). Variable NumberHits contains the multiplicities as before, but unlike dataset WANT1, the new output dataset WANT is not aggregated, so NumberHits is actually redundant.
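The relationship between the two output datasets can also be illustrated by expanding the aggregated dataset WANT1 in a DATA step. This is a minimal sketch and not a replacement for OUTHITS: the dataset name want_expanded and the variable copy are hypothetical, and the row order differs from that of WANT (the copies of a record are adjacent here rather than grouped by subject).

```sas
/* Sketch: write each record of WANT1 NumberHits times. */
data want_expanded;
   set want1;
   do copy = 1 to NumberHits;   /* copy = 1, 2, ... indexes the repeated draws */
      output;
   end;
run;

/* With the WANT1 listing shown above, this count is 12, as in WANT. */
proc sql;
   select count(*) as n_obs from want_expanded;
quit;
```

One possible use of the extra variable copy is to tell repeated draws of the same wsub apart in later processing, which plain OUTHITS output does not provide directly.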
