I'm going through my notes from SAS Programming 3 and I'm having trouble making sense of the example code.
The goal here is to select random observations without duplicates.
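data subset(drop=ObsLeft SampSize);
   SampSize=10;                           /* observations still needed */
   ObsLeft=TotObs;                        /* observations not yet considered; NOBS= sets TotObs at compile time */
   do while(SampSize>0 and ObsLeft>0);
      PickIt+1;                           /* number of the observation under consideration */
      if ranuni(0)<SampSize/ObsLeft then do;
         ObsPicked=PickIt;
         set orion.orderfact point=PickIt nobs=TotObs;   /* read observation PickIt directly */
         output;
         SampSize=SampSize-1;
      end;
      ObsLeft=ObsLeft-1;
   end;
   stop;                                  /* needed because POINT= never reaches end of file */
run;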
I think I understand why they use ranuni(0)<SampSize/ObsLeft to determine whether an observation is selected. It compares a random number to the ratio of samples still needed to observations remaining, presumably to give each observation the same chance of being selected.
But my question is whether this actually gives each observation the same probability of being selected. If the ratio changes with each iteration, doesn't each observation have a different probability of being selected? I'm not sure that it matters, but the definition given for "simple random sampling" was "equal probability" of being selected. Couldn't you just use a fixed ratio of the initial SampSize/TotObs?
My other question is what prevents this from reaching the end of the data set before outputting the required number of samples. Theoretically, the random number could fail the output condition every time until the loop gets to the end.
The next page in my notes is about PROC SURVEYSELECT, so I'm assuming this example is just academic, but those things were bugging me.
First, I'm going to throw out an opinion. People like to write these complicated DO WHILE loops so that they can say they did it all in one DATA step. To me that's a false goal. The same process could be performed with much more readable code by assigning a random value to each observation, sorting by that value, and then selecting the top N (however many observations you want). That is what I would have used before I learned about PROC SURVEYSELECT. Perhaps the DO WHILE method actually runs faster than what I described, I don't know, but a hand-written loop increases the chance of a coding error. In addition, you suspect that the probabilities are not constant, so there is also the possibility of a math error in the algorithm itself. (To answer your specific question: I haven't tried to prove this. It may be that when you compute the probability of an observation being selected, conditional on what happened to the earlier observations, you do get the correct probability, but I haven't worked through it.)
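A minimal sketch of the "random value, sort, take the top N" approach might look like the following. The data set name `keyed`, the variable `sortKey`, and the seed are made up for illustration; `orion.orderfact` and the sample size of 10 come from the original example.

```sas
/* Step 1: attach a uniform random key to every observation */
data keyed;
   call streaminit(768766);     /* arbitrary seed, only for reproducibility */
   set orion.orderfact;
   sortKey = rand("uniform");   /* hypothetical variable name */
run;

/* Step 2: put the observations in random order */
proc sort data=keyed;
   by sortKey;
run;

/* Step 3: keep the first 10 observations of the shuffled data */
data subset;
   set keyed(obs=10);
   drop sortKey;
run;
```

This takes three steps and a full sort, so it trades speed for readability: every observation gets an independent random key, and the 10 smallest keys define the sample.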
And so I end with another opinion that I have expressed many times before in this forum. SAS has done the work to make PROC SURVEYSELECT provide the desired sample. Not only that, they have tested it, debugged it, and documented it, and the code has been proven in many, many bazillions of real-world applications. You (or your company or university) are also paying actual money for the benefit of SAS's efforts: code that does what you want and that has been tested, debugged, documented, and shown to perform properly in real-world applications. There's no need to design your own algorithm in a complicated DATA step; it's inefficient to do so, and your own code may not give the right answer.
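For comparison, a PROC SURVEYSELECT call for the same task could be as short as the sketch below. The seed value is arbitrary and the output name `subset` simply mirrors the original example; METHOD=SRS requests simple random sampling without replacement.

```sas
/* Draw 10 observations by simple random sampling without replacement */
proc surveyselect data=orion.orderfact
                  out=subset
                  method=srs
                  sampsize=10
                  seed=768766;   /* arbitrary seed, only for reproducibility */
run;
```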
Paige Miller
Second best thing to a mathematical proof, and sometimes more convincing: a simulation. Consider
/* Iterate the sample selection procedure, omitting the reads
   but recording which observations would be selected */
data subset;
   call streaminit(768766);
   do rep = 1 to 100000;
      SampSize=3;
      totObs=10;
      ObsLeft=TotObs;
      pickIt = 0;
      do while(SampSize>0 and ObsLeft>0);
         PickIt+1;
         if rand("uniform") < SampSize/ObsLeft then do;
            ObsPicked=PickIt;
            output;
            SampSize=SampSize-1;
         end;
         ObsLeft=ObsLeft-1;
      end;
   end;
   keep rep obsPicked;
run;

title "Get the size of the samples";
proc sql;
   select sampleSize, count(*) as nbReps
   from
      ( select rep, count(*) as sampleSize from subset
        group by rep )
   group by sampleSize;
quit;

title "Check if every obs has an equal probability of selection";
title2 "Equal proportions test";
proc freq data=subset;
   table obsPicked / chisq;
run;
That's cool. It definitely works, but I'm still not quite sure why.
If the first iteration is 3/10=.3 and it doesn't find a match,
the next iteration is 3/9=.33 and it does find a match,
the next iteration is 2/8=.25 ...and so on.
So the probability doesn't stay the same, but I guess over 100,000 repetitions it evens out?
I suppose it doesn't matter; at the end of the day it's still a random sample.
We did have a mathematical proof on SAS-L many years ago, but I have lost track of the author and the post.
Here's why your reasoning is not quite right.
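When selecting 3 out of 10, the first observation has a 30% chance of being selected.
The second observation needs a more complex formula, because its chance depends on whether or not the first observation was selected: 2/9 if it was (30% of the time) and 3/9 if it was not (70% of the time).
30% * 2/9 + 70% * 3/9 = 30%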
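The same conditioning argument can be carried through for every position, not just the second. As a sketch (the symbols are notation introduced here, not from the course notes): let $n$ be the requested sample size, $N$ the number of observations with $n \le N$, $S_k$ the number of observations selected among the first $k$, and $p_k$ the probability that observation $k$ is selected. Given $S_k$, the loop selects observation $k+1$ with probability $(n - S_k)/(N - k)$, so by induction on $k$:

```latex
\begin{align*}
p_{k+1} &= \mathbb{E}\!\left[\frac{n - S_k}{N - k}\right]
         = \frac{n - \mathbb{E}[S_k]}{N - k}, \\
\mathbb{E}[S_k] &= \sum_{j=1}^{k} p_j = \frac{k\,n}{N}
         \quad \text{(induction hypothesis: } p_1 = \dots = p_k = n/N \text{)}, \\
p_{k+1} &= \frac{n - k\,n/N}{N - k}
         = \frac{n\,(N - k)}{N\,(N - k)}
         = \frac{n}{N}.
\end{align*}
```

With $n = 3$, $N = 10$ and $k = 1$ this gives $(3 - 0.3)/9 = 0.3$, matching the calculation above. Note that this only shows equal selection probabilities for individual observations; it is not the full proof that every possible sample is equally likely. The ratio also explains why the loop cannot run out of data before the sample is complete: once the number still needed equals the number of observations left, the ratio is 1, and since ranuni and rand("uniform") return values strictly between 0 and 1, every remaining observation is then selected.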
This code looks like a variation of Method 3 from SAS Sample 24722: Simple random sample without replacement.
An inline comment in that code explains why each row gets selected with the same probability.
I actually asked exactly the same question many years ago about the referenced Method 3 code and got great answers. I tried to find the discussion but had no luck with my searches.
I think this selection method is usually credited to Fan, Muller, and Rezucha (1962). Their proof is on page 392.
Fan, C. T., Muller, M. E., and Rezucha, I. (1962). "Development of Sampling Plans by Using Sequential (Item by Item) Selection Techniques and Digital Computers." Journal of the American Statistical Association 57: 387-402.