Re: Stratified sampling with multiple strata and missing substrata

mgugiu · Posted 11-18-2019 10:18 AM

Hi everyone,

I am using SAS 9.4 and need to select a sample using a complex stratification process that involves several strata and a weight assigned to each observation. I created a fake data file to give you a better sense of how the data are structured (attached).

The data are structured in 5 strata (domain_name), and 4 of the 5 strata have 2 sub-strata (shade). Each stratum is also divided into 9 bands, and the sample must be drawn by stratum, sub-stratum, and band in a specific proportion/number. Finally, each observation has an assigned weight score that must be taken into account: the higher the weight score, the higher the probability of being selected.

So far, I have been able to select only a partial sample: the stratum that has missing sub-strata is never selected, and I need help figuring out how to select observations from that stratum. Below is the code I used.

proc freq data=population noprint;
tables domain_name*shade*band/out= framedist;
run;

data sampsize;
do shade=1 to 2;
do band= 1 to 9;
input nfract @@;
output;
end;
end;
datalines;
0 1 6 38 60 70 35 10 2
0 0 1 7 11 12 6 2 0
0 1 8 49 78 91 46 13 2
0 0 1 9 14 16 8 2 0
0 1 8 51 81 94 48 14 2
0 0 1 9 14 17 8 2 0
0 0 4 27 42 49 25 7 1
0 0 5 28 45 52 26 8 1
0 0 1 5 8 9 5 1 0
;
data sampsize;
set sampsize;
_nsize_ = round(nfract) ;
if 0<nfract<1 then _nsize_=1;
run;

proc print; /*this code outputs the data and sums up the total sample size; I replaced _nsize_ with nfract*/
sum nfract;
run;

proc sort data=sampsize;
by Shade band;
run;

proc sort data=framedist;
by Shade band;
run;

data sampsize;
merge sampsize framedist;
by shade band;
if _nsize_=. then _nsize_=0;
run;

proc surveyselect data=test sampsize =sampsize
seed= 56789 out=sample2 selectall noprint;
strata shade band;
run;

Thank you so much!

FreelanceReinh · Posted 11-18-2019 12:41 PM

Hi @mgugiu and welcome to the SAS Support Communities!

The DATA step creating the SAMPSIZE dataset does not produce any observations with missing SHADE. This is why domain 'Black' is never selected. (Note that the values for 'Black' get SHADE=1, the values for 'Green' all get the wrong SHADE values, and the dangerous note "LOST CARD" is written to the log.) Also, I would strongly recommend including DOMAIN_NAME as a stratum variable. So, a corrected DATA step could look like this:

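data sampsize;
length domain_name $6;
/* Loop order must match the order of the rows under DATALINES. */
do domain_name='Red', 'Blue', 'Purple', 'Black', 'Green';
  do shade=., 1, 2;
    /* 'Black' has no shades (SHADE missing); every other domain needs SHADE=1 or 2. */
    if (domain_name='Black' & shade ne .) | (domain_name ne 'Black' & shade=.) then continue;
    do band= 1 to 9;
      input _nsize_ @@;
      output;
    end;
  end;
end;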
datalines;
0 1 6 38 60 70 35 10 2
0 0 1 7 11 12 6 2 0
0 1 8 49 78 91 46 13 2
0 0 1 9 14 16 8 2 0
0 1 8 51 81 94 48 14 2
0 0 1 9 14 17 8 2 0
0 0 4 27 42 49 25 7 1
0 0 5 28 45 52 26 8 1
0 0 1 5 8 9 5 1 0
;

Note that I read the above integers directly into variable _NSIZE_, thus making the DATA step involving NFRACT redundant.

I'm sure you noticed the minor differences between the totals (both row and column totals) given in the first Excel sheet (of which I saw only the preview -- I don't have Excel installed on my SAS workstation) and the totals computed from the table cells (e.g., the overall total is stated to be 1303, but is in fact 1298 [maybe due to rounding issues if decimals are just not displayed]).

The BY statements in the PROC SORT and MERGE steps would then read

by domain_name shade band;

(Assumption: SHADE is consistently coded 1 for 'Light' and 2 for 'Dark'.)
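
Before running PROC SURVEYSELECT, it may be worth confirming that every stratum in the input data has a matching row in SAMPSIZE, including the 'Black' rows with missing SHADE. Here is a minimal sketch, assuming the input dataset is TEST and that it contains DOMAIN_NAME, SHADE and BAND. The names FRAME_COUNTS, STRATUM_CHECK and PROBLEM are made up for illustration.

```sas
/* Count frame units per stratum; MISSING keeps the SHADE=. rows as a level. */
proc freq data=test noprint;
  tables domain_name*shade*band / missing out=frame_counts(drop=percent);
run;

proc sort data=frame_counts; by domain_name shade band; run;
proc sort data=sampsize;     by domain_name shade band; run;

/* Keep only the strata where the two tables disagree. */
data stratum_check;
  length problem $40;
  merge frame_counts(in=in_frame)
        sampsize(in=in_size keep=domain_name shade band _nsize_);
  by domain_name shade band;
  if in_frame and not in_size then problem='in TEST but not in SAMPSIZE';
  else if in_size and not in_frame and _nsize_ > 0 then problem='in SAMPSIZE but not in TEST';
  else if in_frame and in_size and _nsize_ > count then problem='_NSIZE_ exceeds frame count';
  else delete;
run;

proc print data=stratum_check noobs; run;
```

An empty STRATUM_CHECK suggests the stratum keys line up. Rows flagged here often point to differences in spelling or case of DOMAIN_NAME, or to a different coding of SHADE.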

@mgugiu wrote:

Finally, each observation has an assigned weight score that must be taken into account: the higher the weight score, the higher the probability of being selected.

To implement this, I think you need some sort of PPS sampling (cf. Sample Selection Methods). As an example I use the SIZE statement in the PROC SURVEYSELECT step below, which invokes PPS sampling.

proc surveyselect data=test sampsize=sampsize
seed=56789 out=sample2 selectall /* noprint */;
size weight_score;
strata domain_name shade band;
run;

I've commented out the NOPRINT option so that you can see in the output that "Selection Method" changed from "Simple Random Sampling" to "PPS, Without Replacement."

I leave it to you to test this PROC SURVEYSELECT step on a large input dataset ("TEST") because I can't read Excel files on my workstation, as mentioned above, but I'm pretty sure the result will be much closer to your requirements.

mgugiu · Posted 11-18-2019 03:35 PM

Thank you for your response. I will implement the suggested changes and let you know how they work.
Best regards,

mgugiu · Posted 11-21-2019 05:20 PM

Hi,

The code works, but I get an error message when I use the PPS method with the SIZE statement, which specifies the weight assigned to each observation (see code below).

proc surveyselect data=test sampsize =sampsize
seed= 56789 out=sample2 selectall /*noprint */;
size weight_score;
strata domain_name shade band;
run;

The error message I get is: "For METHOD=PPS, the relative size of each sampling unit must not exceed (1/SAMPSIZE)."

Can someone please explain what exactly this means?

Thank you.

Watts · Posted 11-21-2019 05:28 PM

Usage Note 23759: Cause of error "For METHOD=PPS, the relative size of each sampling unit must not e...
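
The condition in that message can be checked directly. Within each stratum, a unit's relative size is its size measure divided by the stratum total, and under PPS without replacement its selection probability is the stratum sample size times that share, so the share must not exceed 1/_NSIZE_. Here is a rough sketch that lists the offending units, assuming TEST and SAMPSIZE are the datasets from the step above. The names STRATUM_TOTALS, TOO_LARGE, REL_SIZE and MAX_SHARE are invented for illustration.

```sas
proc sql;
  /* Total size measure and number of units per stratum. */
  create table stratum_totals as
  select domain_name, shade, band,
         sum(weight_score) as total_score,
         count(*)          as n_units
  from test
  group by domain_name, shade, band;

  /* Units whose share of the stratum total exceeds 1/_NSIZE_. */
  create table too_large as
  select t.domain_name, t.shade, t.band, t.weight_score,
         s._nsize_, tot.n_units,
         t.weight_score / tot.total_score as rel_size,
         1 / s._nsize_                    as max_share
  from test as t
       inner join stratum_totals as tot
         on  t.domain_name = tot.domain_name
         and t.shade       = tot.shade
         and t.band        = tot.band
       inner join sampsize as s
         on  t.domain_name = s.domain_name
         and t.shade       = s.shade
         and t.band        = s.band
  where s._nsize_ > 0
    and t.weight_score / tot.total_score > 1 / s._nsize_
  order by t.domain_name, t.shade, t.band;
quit;
```

PROC SQL treats missing values as equal in comparisons, so the 'Black' rows with SHADE=. join as expected. The units listed in TOO_LARGE are the likely triggers of the error. The PROC SURVEYSELECT documentation describes options such as CERTSIZE= and MAXSIZE= for handling very large units; whether either fits this design would need to be checked there.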

mgugiu · Posted 11-26-2019 03:21 PM

Hi--thank you for your reply.

I read the SAS explanation on the reason for the error message in the log.

My question is: how should I estimate the weight score per observation so I do not get the same error message?

The current weight score is expressed in integers, which obviously does not work. The challenge I have is to figure out whether the weight score must be adjusted to the size of a specific stratum, all the strata that are used for sample specification, or something else that I am missing.

Thank you.
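
For what it's worth, integer scores are not a problem in themselves. What the check looks at is each score relative to the total of its own stratum (here, one DOMAIN_NAME by SHADE by BAND cell), compared with 1 divided by that stratum's sample size. A toy sketch with invented numbers follows. TOY_OK and TOY_BAD are hypothetical datasets, not taken from the attached file.

```sas
/* One stratum of 4 units, sample size 2: every share must be <= 1/2. */
data toy_ok;    /* shares 0.3 0.3 0.2 0.2, all within the limit */
  input id weight_score @@;
  datalines;
1 3  2 3  3 2  4 2
;

data toy_bad;   /* shares 0.1 0.1 0.1 0.7, unit 4 exceeds 0.5 */
  input id weight_score @@;
  datalines;
1 1  2 1  3 1  4 7
;

proc surveyselect data=toy_ok method=pps sampsize=2 seed=56789 out=sample_ok;
  size weight_score;
run;

/* Expected to stop with the "must not exceed (1/SAMPSIZE)" error. */
proc surveyselect data=toy_bad method=pps sampsize=2 seed=56789 out=sample_bad;
  size weight_score;
run;
```

Both datasets use integer scores; only the second has a unit that dominates its stratum.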
