06-10-2015 10:40 AM
Not sure if the title makes much sense, but I'll do my best to explain it better here. I am looking for a way to take a random sample of records from a dataset (say, using proc surveyselect) such that the sample meets a criterion based on the sum of one of the variables in the dataset. Here's an example:
Say you own a fast food hamburger restaurant and have a dataset called ORDERS, which contains a record for each order placed at a register. Variables may include a unique order ID, price, payment method, and number of hamburgers ordered. That last variable is important. Say the manager wants to get a random sample of 100 orders from his restaurant, but he wants to ensure that the random sample contains a certain number of hamburgers ordered for quality purposes. How would he go about doing that?
Basically, I want a random sample of 100 orders where the sum of the number-of-hamburgers variable across the sample falls in a given range. The number of records to sample is fixed ahead of time, but I want to make sure that, say, 200-250 hamburgers are covered. Assume the dataset is large enough that any criterion I set for the sum is achievable.
Thanks for any help you may be able to provide!
06-10-2015 11:01 AM
I would say then that this is not a random sample; it is a purposefully biased sample, and as such you couldn't make inferences from it, which defeats the whole point of sampling.
If the manager really wants a valid sample, it ought to be completely random and have enough data points that the inferences have the desired level of uncertainty. All of this is standard sampling methodology, which you can read about in any textbook and which SAS has tools to help with.
I also cannot think of an easy way to accomplish your stated requirement. I'm sure you could write a loop to do this: if the first sample you select has too few or too many hamburgers sold, you repeat the selection until you get the desired result. But as I said, why bother? This seems relatively meaningless in the context of random sampling.
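A rough sketch of that loop, written in Python rather than SAS purely to show the logic. The function name, the `hamburgers` field, the retry cap, and the made-up ORDERS data are all assumptions for illustration, not anything from the original dataset:

```python
import random

def sample_with_sum_in_range(orders, n, low, high, key="hamburgers",
                             max_tries=10_000, seed=None):
    """Draw simple random samples of size n until the sum of `key`
    falls in [low, high]. Returns (sample, total, attempts)."""
    rng = random.Random(seed)
    for attempt in range(1, max_tries + 1):
        # Simple random sample without replacement.
        sample = rng.sample(orders, n)
        total = sum(order[key] for order in sample)
        if low <= total <= high:
            return sample, total, attempt
    # Give up instead of looping forever if the range is unreachable.
    raise RuntimeError(f"no sample met the criterion in {max_tries} tries")

# Hypothetical ORDERS data: 5000 orders with 0 to 5 hamburgers each.
data_rng = random.Random(0)
orders = [{"order_id": i, "hamburgers": data_rng.randint(0, 5)}
          for i in range(5000)]

sample, total, attempts = sample_with_sum_in_range(
    orders, n=100, low=200, high=250, seed=1)
print(len(sample), total, attempts)
```

Note that the accepted sample is conditioned on the sum, so the caveat above about bias still applies. The retry cap is there because a very narrow range could otherwise keep the loop running indefinitely.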
06-10-2015 11:05 AM
Thanks for the quick response. You are entirely correct: this is not a 'random' sample and it is very biased. For my purposes that will be okay; I just couldn't think of an easier way to explain my issue without getting too far into the data I am actually working with.
I do like the idea of a loop that checks the sum at the end and repeats until the desired result is reached.
Thanks again for the insight.
06-10-2015 11:06 AM
For your purpose I would filter the dataset before feeding it into proc surveyselect. If the 'sample' is going to be as rigidly defined as you describe, then build your population before you do anything else.
06-10-2015 11:26 AM
In surveyselect, the SIZE statement names a numeric variable whose values are used as size measures: each unit's relative size is its value divided by the sum of that variable, so orders with more hamburgers would be more likely to be selected. This only works with certain selection methods, though.
I would actually filter the data so that only orders with one or more hamburgers are considered, and select 100 of those. The question for the boss would be whether he cares about the number of hamburgers or the number of orders with hamburgers.
Imagine the issue of a bus coming through with a football team. One order could potentially have 100 burgers...
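The filter-first idea could look something like the sketch below. It is in Python only to illustrate the steps, and the field names, helper name, and generated data are hypothetical:

```python
import random

def filter_then_sample(records, n, predicate, seed=None):
    """Restrict the population with `predicate`, then draw a simple
    random sample of n records from what is left."""
    eligible = [r for r in records if predicate(r)]
    if len(eligible) < n:
        raise ValueError(f"only {len(eligible)} eligible records, need {n}")
    return random.Random(seed).sample(eligible, n)

# Hypothetical ORDERS data: 2000 orders with 0 to 4 hamburgers each.
gen = random.Random(42)
all_orders = [{"order_id": i, "hamburgers": gen.randint(0, 4)}
              for i in range(2000)]

# Keep only orders with one or more hamburgers, then select 100 of them.
burger_sample = filter_then_sample(
    all_orders, 100, lambda r: r["hamburgers"] >= 1, seed=7)

burger_total = sum(r["hamburgers"] for r in burger_sample)
print(len(burger_sample), burger_total)
```

This guarantees 100 orders that each include a hamburger, but it does not control the total number of hamburgers, which is exactly the "hamburgers or orders with hamburgers" question for the boss.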