Hi....I would like to select a sample of records from a dataset and the total number of records in the dataset can vary and is not fixed. I want to select a sample in such a way that the sample is a true representation of all records when it comes to frequency (number of records) of the Product ID. Sales by differnt Product ID will vary. Do I need to cluster the datewr first and then sample within each cluster. Any suggestions would be greatly appreciated.....Thanks
If you have SAS/STAT, then check Proc Surveyselect, with 'strata' option.
If you don't, there are still ways of doing the random sampling, let us know if that is the case, then someone from this forum can start working on it.
You may need to clarify what you mean when you say "a true representation of all records when it comes to frequency (number of records) of the Product ID".
Do you mean that any abritrary subgroup, say on geography or customer volumn the percent of product ID will be exactly the same as for the sample overall?
How are you using the idea of "cluster" for this project?
Clusters are not normally created/assigned as such a way to gaurantee the same distribution of values as the whole sample but are identified as something that contributes to overall variability in the sample. Think of study involving school students from many schools. A natural cluster would be the school as generally there are differences in makeup of student bodies from school to school.
Assuming the data is such that you have Product_Id and the Product_Id1 and _Id2 your mention are different values then you could use Product_Id as a Strata variable and specify the sample size for each level of the variable. If you have more than a few products involved then you would likely want to use a SAMPSIZE data set. Look at the documentation for the contents as it needs specifically named variables to work properly.
You could easily make a sampsize dataset by using proc freq to generate a data set with the product_id values and overall percentage or Proc SQL. Multiply that percentage by the desired overall sample size to get the number for each product. The number would have to be rounded to an integer.
To use STRATA the input data set would need to be sorted by the strata varaible. The sampsize dataset would need to match all values of the strata variable, attributes such as length and type and sorted in the correct order.
Hi Ballardw....I tried your your suggestions and it did work. The thing that I would also like to do is to have ID_Number appear only once in the overall sample selected without any duplicates of the ID_Number.
Group | _NSIZE_ |
1 | 2 |
2 | 3 |
3 | 9 |
4 | 21 |
5 | 10 |
6 | 25 |
PROC SORT DATA=SUMMARY94;
BY GROUP;
RUN;
PROC SURVEYSELECT DATA=SUMMARY94 N=SUMMARY99C OUT=hsbs3;
STRATA GROUP;
RUN;
@twildone wrote:
Hi Ballardw....I tried your your suggestions and it did work. The thing that I would also like to do is to have ID_Number appear only once in the overall sample selected without any duplicates of the ID_Number.
I think that using your Id_number as a SAMPLINGUNIT variable may help if the id_number does not exist in different groups.
Another option if you are seeing many of these is to slightly over sample and discard "extras". You haven't said wether the selection probabilities are import for this.
There may be some options in SAS 9.4 that allow more fiddling with selections but I don't have 9.4 for testing behavior.
Hi...I did try using the sampling unit statement and did get unusual results.
PROC SURVEYSELECT DATA=SUMMARY94 N=SUMMARY98 OUT=hsbs3;
SAMPLINGUNIT ID_NUMBER;
STRATA PRECLUS;
RUN;
I thought about that and remembered it's likely to duplicate the Id. I was kind of hoping that if the duplicated id values were a small percentage overall it might allow skipping them.
Are you needing the probability or selection / weighting information? IF not you may be able to oversample by a bit and remove duplicate ids.
Another option is to decide which is more important, no duplicate Id or "exact" match of distribution. Perhaps create set with single values for the id and see if the overall distribution remains similar. If so then select a sample from that subset.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.