mgugiu
Calcite | Level 5

Hi everyone,

 

I am using SAS 9.4 and have to select a sample using a complex stratification process that involves several strata and a weight assigned to each observation. I created a fake data file to give you a better sense of how the data are structured (attached).

 

The data are structured in 5 strata (domain_name), and 4 of the 5 strata have 2 sub-strata (shade). Each stratum is also divided into 9 bands, and the sample must be drawn from each stratum, sub-stratum, and band in a specific proportion/number. Finally, each observation is assigned a weight score that must be taken into account: the higher the weight score, the higher the probability of being selected.

 

So far, I have been able to select only a partial sample: the stratum that has no sub-strata is never selected, and I need help figuring out how to select observations from that stratum. Below is the code I used.

 

proc freq data=population noprint;
tables domain_name*shade*band/out= framedist;
run;

 

data sampsize;
do shade=1 to 2;
do band= 1 to 9;
input nfract @@;
output;
end;
end;
datalines;
0 1 6 38 60 70 35 10 2
0 0 1 7 11 12 6 2 0
0 1 8 49 78 91 46 13 2
0 0 1 9 14 16 8 2 0
0 1 8 51 81 94 48 14 2
0 0 1 9 14 17 8 2 0
0 0 4 27 42 49 25 7 1
0 0 5 28 45 52 26 8 1
0 0 1 5 8 9 5 1 0
;
data sampsize;
set sampsize;
_nsize_ = round(nfract) ;
if 0<nfract<1 then _nsize_=1;
run;

 

proc print; /*this code outputs the data and sums up the total sample size; I replaced _nsize_ with nfract*/
sum nfract;
run;


proc sort data=sampsize;
by Shade band;
run;

proc sort data=framedist;
by Shade band;
run;

 

data sampsize;
merge sampsize framedist;
by shade band;
if _nsize_=. then _nsize_=0;
run;

 

proc surveyselect data=test sampsize =sampsize
seed= 56789 out=sample2 selectall noprint;
strata shade band;
run;

 

Thank you so much!

5 REPLIES 5
FreelanceReinh
Jade | Level 19

Hi @mgugiu and welcome to the SAS Support Communities!

 

The DATA step creating the SAMPSIZE dataset does not contain observations with missing SHADE. This is why domain 'Black' is never selected. (Note that the values for 'Black' get SHADE=1, the values for 'Green' get all wrong SHADE values and the dangerous note "LOST CARD" is written to the log.) Also, I would strongly recommend including DOMAIN_NAME as a stratum variable. So, a corrected DATA step could look like this:

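data sampsize;
length domain_name $6;
do domain_name='Red', 'Blue', 'Purple', 'Black', 'Green';
  do shade=., 1, 2;
    /* 'Black' has no sub-strata (SHADE missing); all other domains use SHADE 1 and 2 only */
    if (domain_name='Black' & shade ne .) | (domain_name ne 'Black' & shade=.) then continue;
    do band= 1 to 9;
      input _nsize_ @@; /* read the sample size for this stratum directly */
      output;
    end;
  end;
end;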
datalines;
0 1 6 38 60 70 35 10 2
0 0 1 7 11 12 6 2 0
0 1 8 49 78 91 46 13 2
0 0 1 9 14 16 8 2 0
0 1 8 51 81 94 48 14 2
0 0 1 9 14 17 8 2 0
0 0 4 27 42 49 25 7 1
0 0 5 28 45 52 26 8 1
0 0 1 5 8 9 5 1 0
;

Note that I read the above integers directly into variable _NSIZE_, thus making the DATA step involving NFRACT redundant.

 

You have probably noticed the minor differences between the row and column totals given in the first Excel sheet and the totals computed from the table cells. For example, the overall total is stated to be 1303 but is in fact 1298, maybe due to rounding if decimals are just not displayed. (I saw only the preview of the sheet because I don't have Excel installed on my SAS workstation.)

 

The BY statements in the PROC SORT and MERGE steps would then read

by domain_name shade band;

(Assumption: SHADE is consistently coded 1 for 'Light' and 2 for 'Dark'.)
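
Put together, the revised sort and merge steps might look like the sketch below. It assumes that FRAMEDIST is the output of the PROC FREQ step in the question and that DOMAIN_NAME has the same spelling, case, and length in both datasets. Note also that PROC FREQ leaves levels with missing SHADE out of the OUT= dataset unless the MISSING option is added to the TABLES statement, so the 'Black' rows may come from SAMPSIZE alone.

```sas
/* Sketch only: dataset and variable names are taken from the question. */
proc sort data=sampsize;
  by domain_name shade band;
run;

proc sort data=framedist;
  by domain_name shade band;
run;

data sampsize;
  merge sampsize framedist;
  by domain_name shade band;
  /* strata present in the frame but without a target size get 0 */
  if _nsize_=. then _nsize_=0;
run;
```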


@mgugiu wrote:

Finally, each observation is assigned a weight score that must be taken into account: the higher the weight score, the higher the probability of being selected.


To implement this, I think you need some sort of PPS sampling (cf. Sample Selection Methods). As an example I use the SIZE statement in the PROC SURVEYSELECT step below, which invokes PPS sampling.

proc surveyselect data=test sampsize=sampsize
seed=56789 out=sample2 selectall /* noprint */;
size weight_score;
strata domain_name shade band;
run;

I've commented out the NOPRINT option so that you can see in the output that "Selection Method" changed from "Simple Random Sampling" to "PPS, Without Replacement."

 

I leave it to you to test this PROC SURVEYSELECT step on a large input dataset ("TEST") because I can't read Excel files on my workstation, as mentioned above, but I'm pretty sure the result will be much closer to your requirements.

mgugiu
Calcite | Level 5
Thank you for your response. I will implement the suggested changes and let you know how they work.
best regards,
mgugiu
Calcite | Level 5

Hi,

 

The code works, but I am getting an error message when I use the PPS method to select the samples with the SIZE statement, which specifies the weight assigned to each observation (see code below).

 

proc surveyselect data=test sampsize =sampsize
seed= 56789 out=sample2 selectall /*noprint */;
size weight_score; 
strata domain_name shade band;
run;

 

The error message I get is: "For METHOD=PPS, the relative size of each sampling unit must not exceed (1/SAMPSIZE)."

 

Can someone please explain what exactly this means?

 

Thank you.

mgugiu
Calcite | Level 5

Hi, thank you for your reply.

 

I read the SAS explanation on the reason for the error message in the log.

 

My question is: how should I estimate the weight score per observation so I do not get the same error message?

 

The current weight score is expressed in integers, which obviously does not work. The challenge I have is to figure out whether the weight score must be adjusted to the size of a specific stratum, all the strata that are used for sample specification, or something else that I am missing. 

 

Thank you.
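
One way to see which strata trigger the message is to check the stated condition directly. The relative size of a unit is its size measure divided by the total size measure of its stratum, so a stratum fails when its largest WEIGHT_SCORE divided by the stratum sum exceeds 1/_NSIZE_. Whether the scores are integers should not matter for this ratio. The following is a minimal sketch, assuming TEST contains DOMAIN_NAME, SHADE, BAND, and a positive WEIGHT_SCORE, and that SAMPSIZE has one row per stratum with _NSIZE_. The names PPS_CHECK, N_UNITS, MAX_REL_SIZE, and VIOLATES_PPS are made up for illustration.

```sas
/* Sketch only: flags strata whose largest unit is "too big" for METHOD=PPS. */
proc sql;
  create table pps_check as
  select t.domain_name, t.shade, t.band,
         s._nsize_,
         count(*) as n_units,
         /* largest relative size within the stratum */
         max(t.weight_score) / sum(t.weight_score) as max_rel_size,
         /* 1 if the condition from the error message is violated */
         (max(t.weight_score) / sum(t.weight_score) > 1 / s._nsize_) as violates_pps
  from test as t inner join sampsize as s
    on  t.domain_name = s.domain_name
    and t.shade       = s.shade   /* missing SHADE matches missing SHADE in PROC SQL */
    and t.band        = s.band
  where s._nsize_ > 0
  group by t.domain_name, t.shade, t.band, s._nsize_
  order by violates_pps desc;
quit;
```

Strata with VIOLATES_PPS=1 contain at least one unit whose share of the stratum total is larger than 1/_NSIZE_. For those strata it may be worth looking at the CERTSIZE= option or at another METHOD= value in the PROC SURVEYSELECT documentation, rather than rescaling the scores, since multiplying all scores in a stratum by a constant leaves the relative sizes unchanged.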
