BookmarkSubscribeRSS Feed
BerryH
Obsidian | Level 7

I have the following example data:

data one;
  input Unit_ID weight @@;
  datalines;
 1  237.18    2 567.89    3  118.50    4   74.38     5 1287.23     6  258.10
 7  325.36    8 218.38    9 1670.80   10  134.71    11 2020.70    12   47.80
13 1183.45   14 330.54   15  780.10   16  895.80    17  620.10    18  420.18
19  979.66   20 810.25   21  670.85   22  314.58    23   87.50    24 1893.40
25  753.30   26 540.65   27 2580.35   28  230.56    29  185.60    30  688.43
31  505.14   32 205.48   33  650.42   34 1348.34    35   30.50    36 2214.80
37  940.35   38 217.85   39  142.90   40  806.90    41  560.72
;

 Now I want to select 12 records based on the weight variable. So, a higher weight should result in a higher probability that this
record ends up in my sample. Once a record is selected, that record should not be selected again.

I've looked into proc surveyselect but I'm unable to figure out if this is the right proc to use, and if applicable, how to use it.

Can anyone help me with this? I would like to use a standard proc that is available in SAS.

Thanks in advance.

Regards,
Berry

19 REPLIES 19
PaigeMiller
Diamond | Level 26

I would think that this is exactly what PROC SURVEYSELECT does well, you can use the options METHOD=PPS and the SIZE statement in order to create samples whose probability is proportional to size (in your case the variable WEIGHT) and without replacement. Do you agree?

 

Please, in the future, do not type text in code boxes. Text goes outside code boxes. Thank you.

--
Paige Miller
BerryH
Obsidian | Level 7

Indeed I expected that proc surveyselect would do the job. However, I ran into this problem when using it: https://support.sas.com/kb/23/759.html 

To summarize, I can select 11 samples, but not 12 because proc surveyselect throws an error. Also, I'm not sure about the proposed workarounds that are given.

 

So perhaps I don't understand what proc surveyselect is supposed to do and/or I'm using it incorrectly. 

 

Regards,

 

Berry

 

PaigeMiller
Diamond | Level 26

So in this case, you already have the answer. 

--
Paige Miller
BerryH
Obsidian | Level 7

The note does not help me in getting the 12 samples because it proposes solutions that I can't use:

 

1) Use minimum replacement (PPS sequential).

I don't want to use replacement, it should be a sample without replacement.

 

2) Adjusting size measures by specifying certsize or maxsize options.

I don't want to adjust size measures. The sizes (weights) in the data are the sizes that should be used.

 

Is there any alternative proc to get the 12 samples that I need?

 

PaigeMiller
Diamond | Level 26

You are now out of my range of experience, as I have never had to face this situation.

--
Paige Miller
ballardw
Super User

AS an absolute minimum when declaring code doesn't do what you want : Include the code.

 

 

 

Select as many as the data allows.

Reduce the data to the not selected.

Select additional records.

 

Note: you will need to spend some time creating sampling weights because the second (or subsequent sets) depend on the prior selections.

 

Or get a bigger initial sample.

BerryH
Obsidian | Level 7

If you read the posts above, you see code that is relevant for my question.

Watts
SAS Employee
In addition to adjusting the size measures or selecting the sample "with minimum replacement," another possible approach in this situation is to include the "large" units with certainty (automatically, with selection probability=1). This is frequently done in practice. You can do this in PROC SURVEYSELECT by using the CERTSIZE= option or the CERTSIZE=P= option.
 
 
SteveDenham
Jade | Level 19

Well, here is something that does, well, something. 

EDIT: It turns out that the "something" is NOT what I said it might be.  See the post by @PGStats for an excellent proof of this.

 

I create a new weight as the product of a uniform random variate and the weight variable in the dataset divided by the sum of all the weights, sort the dataset in descending order by the new variable, and then select the first 12 id numbers in the sorted dataset..

 

data one;
  input Unit_ID weight @@;
  datalines;
 1  237.18    2 567.89    3  118.50    4   74.38     5 1287.23     6  258.10
 7  325.36    8 218.38    9 1670.80   10  134.71    11 2020.70    12   47.80
13 1183.45   14 330.54   15  780.10   16  895.80    17  620.10    18  420.18
19  979.66   20 810.25   21  670.85   22  314.58    23   87.50    24 1893.40
25  753.30   26 540.65   27 2580.35   28  230.56    29  185.60    30  688.43
31  505.14   32 205.48   33  650.42   34 1348.34    35   30.50    36 2214.80
37  940.35   38 217.85   39  142.90   40  806.90    41  560.72
;
data two;
set one;
call streaminit(452021);
ranno1=rand('uniform');
ranno2=ranno1*weight;
run;

proc means data=one noprint;
var weight;
output out=totsamp sum=sum;
run;

data combined;
 if _n_=1 then set totsamp;
 set two;
 drop _type_ _freq_;
 relsize=ranno2/sum;
 run;

proc sort data=combined out=three;
by descending relsize;
run;

data four;
set three;
if _n_<=12;
run;

I am not sure about the optimality of this method at all.  Relsize is a product of the (assumed) probability of selection (=ranno1) and the proportion of the total weight each ID contributes (=weight/sum).  The first 12 are then the most likely IDs to be selected, and the procedure is such that once an ID is selected, it cannot be selected again.  I suppose iteratively reweighting would be better, which would loop through, selecting the ID with the largest relsize, removing it from dataset one, recalculating the total weight, and the proportion of the total weight, multiplying this by the random number, resorting, selecting the ID with the largest relsize under this condition, removing it, and going through this until 12 IDs had been selected.

 

SteveDenham

 

PGStats
Opal | Level 21

About that Something method, @SteveDenham, I was curious....

 

Suppose that we want to choose a sample of one from a set of two units with sampling weights P1 and P2, such that P1+P2 = 1.

Now, if I understand your sampling method, we generate two uniform random numbers X1 and X2 and sample unit 1 if X1*P1 > X2*P2. What we would want is that the probability of choosing unit 1,

 

P(1) = P1/(P1+P2) = P1. 

 

Rearranging terms, the probability P(1) can be expressed

 

P(1) = P( X2/(X1+X2) < P1 )

 

which is the PDF of the random quantity X2/(X1+X2) at P1.

 

Let's look at that PDF for r = X2/(X1+X2), with X1 and X2 uniform.

 

data test;
do i = 1 to 100000;
    x1 = rand("uniform");
    x2 = rand("uniform");
    r = x2 / (x2+x1);
    output;
    end;
keep r;
run;

proc univariate data=test noprint;
cdfplot / vscale=proportion href=.25 .5 .75 vref=.25 .5 .75;
run;

PGStats_0-1617662604855.png

As we can see, there is only one non-trivial value of r where P(1) = P1, and that is when r=0.5, i.e. for equal probability sampling.

So, as intuitively appealing as this method is, it will not generate the expected sampling probabilities.

 

PG
SteveDenham
Jade | Level 19

Thanks @PGStats .  That is a great job.  I'll go back and edit my post to note that the method does not do what I thought it might.

 

SteveDenham

PGStats
Opal | Level 21

Here is an example using the certsize option:

 

%let sampSize=12;

proc sql;
select sum(weight)/&sampSize. into :certSize
from one;
quit;

proc surveyselect data=one out=two seed=868765
    method=pps sampsize=&sampSize. certsize=&certSize.; 
size weight;
id Unit_id;
run;

which selects

PGStats_0-1617650383994.png

Note the presence of variable Certain indicating selection with certainty.

PG
BerryH
Obsidian | Level 7

Thank you for the example. However, I don't understand why it is necessary to use the certsize option. The following code in Python does exactly what I want. I hoped that something similar would exist for SAS.

 

Screenshot 2021-04-05 at 20.47.07.png

 

 

PGStats
Opal | Level 21

I couldn't find a detailed explanation of the selection method in the dataframe.sample documentation. But I suspect that the .sample method does something very similar to surveyselect with the certsize option when required. SAS gives you a bit more flexibility with its option, and the opportunity to realize that your sample will not (can not) exhibit sampling probabilities greater than 1/sampsize for any unit, unless you allow replacement.

PG

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 19 replies
  • 2522 views
  • 13 likes
  • 9 in conversation