Re: Selecting a weighted sample without replacement

BerryH · Posted 04-05-2021 09:24 AM

I have the following example data:

data one;
  input Unit_ID weight @@;
  datalines;
 1  237.18    2 567.89    3  118.50    4   74.38     5 1287.23     6  258.10
 7  325.36    8 218.38    9 1670.80   10  134.71    11 2020.70    12   47.80
13 1183.45   14 330.54   15  780.10   16  895.80    17  620.10    18  420.18
19  979.66   20 810.25   21  670.85   22  314.58    23   87.50    24 1893.40
25  753.30   26 540.65   27 2580.35   28  230.56    29  185.60    30  688.43
31  505.14   32 205.48   33  650.42   34 1348.34    35   30.50    36 2214.80
37  940.35   38 217.85   39  142.90   40  806.90    41  560.72
;

Now I want to select 12 records based on the weight variable. So, a higher weight should result in a higher probability that this
record ends up in my sample. Once a record is selected, that record should not be selected again.

I've looked into proc surveyselect but I'm unable to figure out if this is the right proc to use, and if applicable, how to use it.

Can anyone help me with this? I would like to use a standard proc that is available in SAS.

Thanks in advance.

Regards,
Berry

PaigeMiller · Posted 04-05-2021 09:35 AM

I would think that this is exactly what PROC SURVEYSELECT does well, you can use the options METHOD=PPS and the SIZE statement in order to create samples whose probability is proportional to size (in your case the variable WEIGHT) and without replacement. Do you agree?

Please, in the future, do not type text in code boxes. Text goes outside code boxes. Thank you.

--
Paige Miller

BerryH · Posted 04-05-2021 09:47 AM

Indeed I expected that proc surveyselect would do the job. However, I ran into this problem when using it: https://support.sas.com/kb/23/759.html

To summarize, I can select 11 samples, but not 12 because proc surveyselect throws an error. Also, I'm not sure about the proposed workarounds that are given.

So perhaps I don't understand what proc surveyselect is supposed to do and/or I'm using it incorrectly.

Regards,

Berry

PaigeMiller · Posted 04-05-2021 10:06 AM

So in this case, you already have the answer.

--
Paige Miller

BerryH · Posted 04-05-2021 10:46 AM

The note does not help me in getting the 12 samples because it proposes solutions that I can't use:

1) Use minimum replacement (PPS sequential).

I don't want to use replacement, it should be a sample without replacement.

2) Adjusting size measures by specifying certsize or maxsize options.

I don't want to adjust size measures. The sizes (weights) in the data are the sizes that should be used.

Is there any alternative proc to get the 12 samples that I need?

PaigeMiller · Posted 04-05-2021 11:00 AM

You are now out of my range of experience, as I have never had to face this situation.

--
Paige Miller

ballardw · Posted 04-05-2021 11:08 AM

AS an absolute minimum when declaring code doesn't do what you want : Include the code.

Select as many as the data allows.

Reduce the data to the not selected.

Select additional records.

Note: you will need to spend some time creating sampling weights because the second (or subsequent sets) depend on the prior selections.

Or get a bigger initial sample.

BerryH · Posted 04-06-2021 03:31 AM

If you read the posts above, you see code that is relevant for my question.

Watts · Posted 04-05-2021 11:57 AM

In addition to adjusting the size measures or selecting the sample "with minimum replacement," another possible approach in this situation is to include the "large" units with certainty (automatically, with selection probability=1). This is frequently done in practice. You can do this in PROC SURVEYSELECT by using the CERTSIZE= option or the CERTSIZE=P= option.

SteveDenham · Posted 04-05-2021 02:17 PM

Well, here is something that does, well, something.

EDIT: It turns out that the "something" is NOT what I said it might be. See the post by @PGStats for an excellent proof of this.

I create a new weight as the product of a uniform random variate and the weight variable in the dataset divided by the sum of all the weights, sort the dataset in descending order by the new variable, and then select the first 12 id numbers in the sorted dataset..

data one;
  input Unit_ID weight @@;
  datalines;
 1  237.18    2 567.89    3  118.50    4   74.38     5 1287.23     6  258.10
 7  325.36    8 218.38    9 1670.80   10  134.71    11 2020.70    12   47.80
13 1183.45   14 330.54   15  780.10   16  895.80    17  620.10    18  420.18
19  979.66   20 810.25   21  670.85   22  314.58    23   87.50    24 1893.40
25  753.30   26 540.65   27 2580.35   28  230.56    29  185.60    30  688.43
31  505.14   32 205.48   33  650.42   34 1348.34    35   30.50    36 2214.80
37  940.35   38 217.85   39  142.90   40  806.90    41  560.72
;
data two;
set one;
call streaminit(452021);
ranno1=rand('uniform');
ranno2=ranno1*weight;
run;

proc means data=one noprint;
var weight;
output out=totsamp sum=sum;
run;

data combined;
 if _n_=1 then set totsamp;
 set two;
 drop _type_ _freq_;
 relsize=ranno2/sum;
 run;

proc sort data=combined out=three;
by descending relsize;
run;

data four;
set three;
if _n_<=12;
run;

I am not sure about the optimality of this method at all. Relsize is a product of the (assumed) probability of selection (=ranno1) and the proportion of the total weight each ID contributes (=weight/sum). The first 12 are then the most likely IDs to be selected, and the procedure is such that once an ID is selected, it cannot be selected again. I suppose iteratively reweighting would be better, which would loop through, selecting the ID with the largest relsize, removing it from dataset one, recalculating the total weight, and the proportion of the total weight, multiplying this by the random number, resorting, selecting the ID with the largest relsize under this condition, removing it, and going through this until 12 IDs had been selected.

SteveDenham

PGStats · Posted 04-05-2021 06:58 PM

About that Something method, @SteveDenham, I was curious....

Suppose that we want to choose a sample of one from a set of two units with sampling weights P1 and P2, such that P1+P2 = 1.

Now, if I understand your sampling method, we generate two uniform random numbers X1 and X2 and sample unit 1 if X1*P1 > X2*P2. What we would want is that the probability of choosing unit 1,

P(1) = P1/(P1+P2) = P1.

Rearranging terms, the probability P(1) can be expressed

P(1) = P( X2/(X1+X2) < P1 )

which is the PDF of the random quantity X2/(X1+X2) at P1.

Let's look at that PDF for r = X2/(X1+X2), with X1 and X2 uniform.

data test;
do i = 1 to 100000;
    x1 = rand("uniform");
    x2 = rand("uniform");
    r = x2 / (x2+x1);
    output;
    end;
keep r;
run;

proc univariate data=test noprint;
cdfplot / vscale=proportion href=.25 .5 .75 vref=.25 .5 .75;
run;

As we can see, there is only one non-trivial value of r where P(1) = P1, and that is when r=0.5, i.e. for equal probability sampling.

So, as intuitively appealing as this method is, it will not generate the expected sampling probabilities.

PG

SteveDenham · Posted 04-06-2021 08:11 AM

Thanks @PGStats . That is a great job. I'll go back and edit my post to note that the method does not do what I thought it might.

SteveDenham

PGStats · Posted 04-05-2021 03:22 PM

Here is an example using the certsize option:

%let sampSize=12;

proc sql;
select sum(weight)/&sampSize. into :certSize
from one;
quit;

proc surveyselect data=one out=two seed=868765
    method=pps sampsize=&sampSize. certsize=&certSize.; 
size weight;
id Unit_id;
run;

which selects

Note the presence of variable Certain indicating selection with certainty.

PG

BerryH · Posted 04-06-2021 03:35 AM

Thank you for the example. However, I don't understand why it is necessary to use the certsize option. The following code in Python does exactly what I want. I hoped that something similar would exist for SAS.

PGStats · Posted 04-06-2021 03:08 PM

I couldn't find a detailed explanation of the selection method in the dataframe.sample documentation. But I suspect that the .sample method does something very similar to surveyselect with the certsize option when required. SAS gives you a bit more flexibility with its option, and the opportunity to realize that your sample will not (can not) exhibit sampling probabilities greater than 1/sampsize for any unit, unless you allow replacement.

PG