
## Stratified sampling with multiple strata and missing substrata

Hi everyone,

I am using SAS 9.4 and need to select a sample using a complex stratification process that involves several strata and a weight assigned to each observation. I created a fake data file (attached) to give you a better sense of how the data are structured.

The data are organized into 5 strata (domain_name), and 4 of the 5 strata have 2 sub-strata (shade). Each stratum is further divided into 9 bands, and the sample must be drawn in a specific number for each combination of stratum, sub-stratum, and band. Finally, each observation has been assigned a weight score that must be taken into account: the higher the weight score, the higher the probability of being selected.

So far, I have been able to select only a partial sample: the stratum with missing sub-strata is never selected, and I need help figuring out how to select observations from that stratum. Below is the code I used.

```sas
proc freq data=population noprint;
run;

data sampsize;
  do shade = 1 to 2;
    do band = 1 to 9;
      input nfract @@;
      output;
    end;
  end;
datalines;
0 1 6 38 60 70 35 10 2
0 0 1 7 11 12 6 2 0
0 1 8 49 78 91 46 13 2
0 0 1 9 14 16 8 2 0
0 1 8 51 81 94 48 14 2
0 0 1 9 14 17 8 2 0
0 0 4 27 42 49 25 7 1
0 0 5 28 45 52 26 8 1
0 0 1 5 8 9 5 1 0
;

data sampsize;
  set sampsize;
  _nsize_ = round(nfract);
  if 0 < nfract < 1 then _nsize_ = 1;
run;

proc print; /*this code outputs the data and sums up the total sample size; I replaced _nsize_ with nfract*/
  sum nfract;
run;

proc sort data=sampsize;
run;

proc sort data=framedist;
run;

data sampsize;
  merge sampsize framedist;
  if _nsize_ = . then _nsize_ = 0;
run;

proc surveyselect data=test sampsize=sampsize
                  seed=56789 out=sample2 selectall noprint;
run;
```

Thank you so much!


## Re: Stratified sampling with multiple strata and missing substrata

Hi @mgugiu and welcome to the SAS Support Communities!

The DATA step creating the SAMPSIZE dataset does not produce any observations with missing SHADE. This is why domain 'Black' is never selected. (Note that the values for 'Black' get SHADE=1, the values for 'Green' all get the wrong SHADE values, and the dangerous note "LOST CARD" is written to the log.) Also, I would strongly recommend including DOMAIN_NAME as a stratum variable. So, a corrected DATA step could look like this:

```sas
data sampsize;
  length domain_name $6;
  do domain_name = 'Red', 'Blue', 'Purple', 'Black', 'Green';
    do shade = ., 1, 2;
      /* 'Black' has no sub-strata (SHADE missing); every other domain has SHADE 1 and 2 */
      if domain_name='Black' & shade | domain_name ne 'Black' & shade=. then continue;
      do band = 1 to 9;
        input _nsize_ @@;
        output;
      end;
    end;
  end;
datalines;
0 1 6 38 60 70 35 10 2
0 0 1 7 11 12 6 2 0
0 1 8 49 78 91 46 13 2
0 0 1 9 14 16 8 2 0
0 1 8 51 81 94 48 14 2
0 0 1 9 14 17 8 2 0
0 0 4 27 42 49 25 7 1
0 0 5 28 45 52 26 8 1
0 0 1 5 8 9 5 1 0
;
```

Note that I read the above integers directly into variable _NSIZE_, thus making the DATA step involving NFRACT redundant.

I'm sure you noticed the minor differences between the totals (both row and column totals) given in the first Excel sheet (of which I saw only the preview -- I don't have Excel installed on my SAS workstation) and the totals computed from the table cells (e.g., the overall total is stated to be 1303, but is in fact 1298 [maybe due to rounding issues if decimals are just not displayed]).
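The totals can be recomputed from the SAMPSIZE dataset itself rather than read off the spreadsheet. A minimal sketch, assuming SAMPSIZE was created by the corrected DATA step above (variables DOMAIN_NAME, SHADE, BAND, and _NSIZE_):

```sas
/* Overall total of the requested sample sizes */
proc sql;
  select sum(_nsize_) as total_n
    from sampsize;
quit;

/* Row totals: one sum per domain and shade.
   MISSING keeps the 'Black' rows, whose SHADE is missing. */
proc means data=sampsize sum maxdec=0 missing;
  class domain_name shade;
  var _nsize_;
run;

/* Column totals: one sum per band */
proc means data=sampsize sum maxdec=0;
  class band;
  var _nsize_;
run;
```

With the datalines shown above, the first query should report 1298, which can then be compared against the totals stated in the spreadsheet.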

The BY statements in the PROC SORT and MERGE steps would then read

``by domain_name shade band;``

(Assumption: SHADE is consistently coded 1 for 'Light' and 2 for 'Dark'.)
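Put together, the sort and merge steps might look like the sketch below. It assumes that FRAMEDIST (the dataset from the original post) also contains DOMAIN_NAME, SHADE, and BAND, with SHADE numeric in both datasets; I cannot verify that from here.

```sas
/* Both datasets must be sorted by the same keys before the merge */
proc sort data=sampsize;
  by domain_name shade band;
run;

proc sort data=framedist;
  by domain_name shade band;
run;

data sampsize;
  merge sampsize framedist;
  by domain_name shade band;
  /* Strata that occur only in FRAMEDIST have no requested size: set it to 0 */
  if _nsize_ = . then _nsize_ = 0;
run;
```

Sorting by DOMAIN_NAME puts the domains in alphabetical order ('Black', 'Blue', 'Green', 'Purple', 'Red'), and missing SHADE sorts before 1 and 2. That is fine as long as the input dataset for PROC SURVEYSELECT is sorted the same way.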

@mgugiu wrote:

Finally, each observation has assigned a weight-score which must be taken into account: the highest the weight-score, the higher the probability of being selected.

To implement this, I think you need some sort of PPS sampling (cf. Sample Selection Methods). As an example I use the SIZE statement in the PROC SURVEYSELECT step below, which invokes PPS sampling.

```sas
proc surveyselect data=test sampsize=sampsize
                  seed=56789 out=sample2 selectall /* noprint */;
  size weight_score;
run;
```

I've commented out the NOPRINT option so that you can see in the output that "Selection Method" changed from "Simple Random Sampling" to "PPS, Without Replacement."

I leave it to you to test this PROC SURVEYSELECT step on a large input dataset ("TEST") because I can't read Excel files on my workstation, as mentioned above, but I'm pretty sure the result will be much closer to your requirements.
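For completeness, here is a fuller sketch of that step with the stratification spelled out. The STRATA statement and the preceding sort are my assumptions, based on the BY variables suggested above; they are not taken from the posted code. As far as I know, PROC SURVEYSELECT treats a missing STRATA value (SHADE for 'Black') as a stratum of its own.

```sas
/* The input data must be sorted by the STRATA variables */
proc sort data=test;
  by domain_name shade band;
run;

proc surveyselect data=test
                  method=pps          /* also the default once a SIZE statement is present */
                  sampsize=sampsize   /* one row per stratum, sample size in _NSIZE_ */
                  seed=56789
                  out=sample2
                  selectall;
  strata domain_name shade band;      /* assumed stratum variables */
  size weight_score;                  /* size measure driving the selection probability */
run;
```

The SAMPSIZE dataset needs to contain the same STRATA variables, sorted in the same order as TEST.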


## Re: Stratified sampling with multiple strata and missing substrata

Thank you for your response. I will implement the suggested changes and let you know how they work.
Best regards,

## Re: Stratified sampling with multiple strata and missing substrata

Hi,

The code works, but I get an error message when I use the PPS method to select the samples together with the SIZE statement, which specifies the weight assigned to each observation (see code below).

```sas
proc surveyselect data=test sampsize=sampsize
                  seed=56789 out=sample2 selectall /* noprint */;
  size weight_score;
run;
```

The error message I get is: "For METHOD=PPS, the relative size of each sampling unit must not exceed (1/SAMPSIZE)."

Can someone please explain what exactly this means?

Thank you.
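For readers who run into the same message: with PPS sampling without replacement, a unit's selection probability within its stratum is n × (its size measure) / (total size measure in the stratum), where n is the stratum sample size. That probability cannot exceed 1, so each unit's relative size must not exceed 1/n. The sketch below lists the units that break this rule. It assumes the stratum variables are DOMAIN_NAME, SHADE, and BAND, and that SAMPSIZE holds the stratum sample sizes in _NSIZE_; PPS_CHECK is just a name chosen for illustration.

```sas
proc sql;
  create table pps_check as
  select t.domain_name, t.shade, t.band, t.weight_score, s._nsize_,
         t.weight_score / sum(t.weight_score) as rel_size,   /* share of the stratum total */
         1 / s._nsize_                        as max_rel_size /* limit named in the error  */
    from test as t
         inner join sampsize as s
           on  t.domain_name = s.domain_name
           and t.shade       = s.shade      /* missing SHADE values match each other in PROC SQL */
           and t.band        = s.band
   where s._nsize_ > 0 and t.weight_score > 0
   group by t.domain_name, t.shade, t.band
  having t.weight_score / sum(t.weight_score) > 1 / s._nsize_;
quit;
```

Each row in PPS_CHECK is a unit whose weight score is too large relative to the rest of its stratum for the requested sample size. The log will note that summary statistics were remerged with the original data, which is expected here.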

SAS Employee

## Re: Stratified sampling with multiple strata and missing substrata


## Re: Stratified sampling with multiple strata and missing substrata

I read the SAS explanation on the reason for the error message in the log.

My question is: how should I estimate the weight score per observation so I do not get the same error message?

The current weight score is expressed in integers, which obviously does not work. The challenge I have is to figure out whether the weight score must be adjusted to the size of a specific stratum, all the strata that are used for sample specification, or something else that I am missing.

Thank you.
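On the question of scale, a small made-up example may help: the relative size is the unit's score divided by the total score in its stratum, so integer scores are not the problem in themselves, and rescaling every score by the same factor changes nothing. The dataset TOY and its values are purely illustrative.

```sas
data toy;
  input unit weight_score;
  scaled_score = weight_score / 100;   /* same scores on a different scale */
  datalines;
1 1
2 2
3 3
4 14
;

proc sql;
  /* Relative size = score / stratum total; TOY is treated as a single stratum */
  select unit,
         weight_score,
         weight_score / sum(weight_score) as rel_size,
         scaled_score / sum(scaled_score) as rel_size_scaled
    from toy;
quit;
```

Both relative-size columns come out as 0.05, 0.10, 0.15, and 0.70. With a stratum sample size of 2 the limit is 1/2, so unit 4 would trigger the error on either scale. What matters is how much one unit dominates its stratum for the requested sample size. PROC SURVEYSELECT has options such as CERTSIZE= (certainty selection of very large units) and METHOD=PPS_WR (PPS with replacement) that may be worth checking in the documentation; I have not tested whether either fits this design.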
