Well, here is something that does, well, something.
EDIT: It turns out that the "something" is NOT what I said it might be. See the post by @PGStats for an excellent proof of this.
I create a new weight as the product of a uniform random variate and the weight variable in the dataset, divided by the sum of all the weights; sort the dataset in descending order by the new variable; and then select the first 12 ID numbers in the sorted dataset.
data one;
input Unit_ID weight @@;
datalines;
1 237.18 2 567.89 3 118.50 4 74.38 5 1287.23 6 258.10
7 325.36 8 218.38 9 1670.80 10 134.71 11 2020.70 12 47.80
13 1183.45 14 330.54 15 780.10 16 895.80 17 620.10 18 420.18
19 979.66 20 810.25 21 670.85 22 314.58 23 87.50 24 1893.40
25 753.30 26 540.65 27 2580.35 28 230.56 29 185.60 30 688.43
31 505.14 32 205.48 33 650.42 34 1348.34 35 30.50 36 2214.80
37 940.35 38 217.85 39 142.90 40 806.90 41 560.72
;
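data two;
set one;
call streaminit(452021); /* fixed seed so the draws are reproducible */
ranno1=rand('uniform'); /* uniform(0,1) draw for each unit */
ranno2=ranno1*weight; /* draw scaled by the unit's weight */
run;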
proc means data=one noprint;
var weight;
output out=totsamp sum=sum;
run;
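data combined;
if _n_=1 then set totsamp; /* read the one-row total once; its value is retained for every row */
set two;
drop _type_ _freq_;
relsize=ranno2/sum; /* = ranno1 * (weight / total weight) */
run;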
proc sort data=combined out=three;
by descending relsize;
run;
data four;
set three;
if _n_<=12;
run;
I am not at all sure that this method is optimal. Relsize is the product of the (assumed) probability of selection (=ranno1) and the proportion of the total weight that each ID contributes (=weight/sum). The 12 IDs with the largest relsize are then treated as the most likely to be selected, and because each ID appears only once in the sorted dataset, an ID cannot be selected twice. I suppose iteratively reweighting would be better: select the ID with the largest relsize, remove it from dataset one, recalculate the total weight and each remaining ID's proportion of it, multiply that proportion by the random number, re-sort, select the ID with the largest relsize under this condition, remove it, and repeat until 12 IDs have been selected.
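For what it is worth, the iterative idea could be sketched roughly as follows. This is only a sketch under assumptions of mine, not something I have validated as a sampling design. Dividing by the total weight rescales every relsize in a round by the same factor, so reusing the original ranno1 values would just reproduce the ranking from the one-pass sort. The sketch therefore assumes a fresh uniform draw for each remaining ID in every round. The dataset name iterpick, the array names, and the variable pick are made up for illustration, and the array size 41 is hard-coded to match dataset one.

```sas
/* Sketch only: iterative reweighting with a fresh uniform draw per round (assumption). */
data iterpick(keep=pick Unit_ID weight relsize);
  array ids[41]   _temporary_;  /* 41 = number of rows in dataset one */
  array wts[41]   _temporary_;
  array taken[41] _temporary_;  /* 1 once a unit has been selected */

  call streaminit(452021);

  /* Load the IDs and weights into memory */
  do i = 1 to n;
    set one point=i nobs=n;
    ids[i]   = Unit_ID;
    wts[i]   = weight;
    taken[i] = 0;
  end;

  do pick = 1 to 12;
    /* Total weight of the units that are still available */
    total = 0;
    do i = 1 to n;
      if not taken[i] then total = total + wts[i];
    end;

    /* New draw for every remaining unit; keep the largest relsize */
    best = 0;
    bestval = -1;
    do i = 1 to n;
      if not taken[i] then do;
        r = rand('uniform') * wts[i] / total;
        if r > bestval then do;
          bestval = r;
          best = i;
        end;
      end;
    end;

    /* Record the winner and remove it from later rounds */
    taken[best] = 1;
    Unit_ID = ids[best];
    weight  = wts[best];
    relsize = bestval;
    output;
  end;

  stop;  /* required because POINT= never reaches end of file */
run;
```

The result is 12 rows, one per round, with no ID repeated. I make no claim about the inclusion probabilities this produces.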
SteveDenham