Re: Sample Selection

twildone · Posted 08-03-2016 12:14 PM

Hi....I would like to select a sample of records from a dataset and the total number of records in the dataset can vary and is not fixed. I want to select a sample in such a way that the sample is a true representation of all records when it comes to frequency (number of records) of the Product ID. Sales by differnt Product ID will vary. Do I need to cluster the datewr first and then sample within each cluster. Any suggestions would be greatly appreciated.....Thanks

Haikuo · Posted 08-03-2016 12:28 PM

If you have SAS/STAT, then check Proc Surveyselect, with 'strata' option.

If you don't, there are still ways of doing the random sampling, let us know if that is the case, then someone from this forum can start working on it.

TMiles · Posted 08-03-2016 01:04 PM

Data whatever;
set whatever;
ran = uniform(0);
if ran <= .00; /* percent you would like to sample */

ballardw · Posted 08-03-2016 01:28 PM

You may need to clarify what you mean when you say "a true representation of all records when it comes to frequency (number of records) of the Product ID".

Do you mean that any abritrary subgroup, say on geography or customer volumn the percent of product ID will be exactly the same as for the sample overall?

How are you using the idea of "cluster" for this project?

Clusters are not normally created/assigned as such a way to gaurantee the same distribution of values as the whole sample but are identified as something that contributes to overall variability in the sample. Think of study involving school students from many schools. A natural cluster would be the school as generally there are differences in makeup of student bodies from school to school.

twildone · Posted 08-03-2016 03:25 PM

Hi Ballardw....what I mean by "True representation" those product ID with a higher customer volume should have a higher probability (maybe proportionate to total volume) of being selected into the sample. For example, if there were only 2 product ID's, Product_ID1 had 100 sales and Product_ID2 had 200 sales, and if I wanted to select a sample size of 50, then I would want 1/3 of the sample (100/300) to be selected from Product_ID1 AND 2/3 of the sample (200/300)to be selected from Product_ID2. Hope this helps.

ballardw · Posted 08-03-2016 04:06 PM

Assuming the data is such that you have Product_Id and the Product_Id1 and _Id2 your mention are different values then you could use Product_Id as a Strata variable and specify the sample size for each level of the variable. If you have more than a few products involved then you would likely want to use a SAMPSIZE data set. Look at the documentation for the contents as it needs specifically named variables to work properly.

You could easily make a sampsize dataset by using proc freq to generate a data set with the product_id values and overall percentage or Proc SQL. Multiply that percentage by the desired overall sample size to get the number for each product. The number would have to be rounded to an integer.

To use STRATA the input data set would need to be sorted by the strata varaible. The sampsize dataset would need to match all values of the strata variable, attributes such as length and type and sorted in the correct order.

twildone · Posted 08-04-2016 01:42 PM

Hi Ballardw....I tried your your suggestions and it did work. The thing that I would also like to do is to have ID_Number appear only once in the overall sample selected without any duplicates of the ID_Number.

Group	_NSIZE_
1	2
2	3
3	9
4	21
5	10
6	25

PROC SORT DATA=SUMMARY94;

BY GROUP;

RUN;

PROC SURVEYSELECT DATA=SUMMARY94 N=SUMMARY99C OUT=hsbs3;

STRATA GROUP;

RUN;

ballardw · Posted 08-04-2016 05:52 PM

@twildone wrote:

Hi Ballardw....I tried your your suggestions and it did work. The thing that I would also like to do is to have ID_Number appear only once in the overall sample selected without any duplicates of the ID_Number.

I think that using your Id_number as a SAMPLINGUNIT variable may help if the id_number does not exist in different groups.

Another option if you are seeing many of these is to slightly over sample and discard "extras". You haven't said wether the selection probabilities are import for this.

There may be some options in SAS 9.4 that allow more fiddling with selections but I don't have 9.4 for testing behavior.

twildone · Posted 08-05-2016 08:51 AM

Hi...I did try using the sampling unit statement and did get unusual results.

PROC SURVEYSELECT DATA=SUMMARY94 N=SUMMARY98 OUT=hsbs3;

SAMPLINGUNIT ID_NUMBER;

STRATA PRECLUS;

RUN;

ballardw · Posted 08-05-2016 01:27 PM

I thought about that and remembered it's likely to duplicate the Id. I was kind of hoping that if the duplicated id values were a small percentage overall it might allow skipping them.

Are you needing the probability or selection / weighting information? IF not you may be able to oversample by a bit and remove duplicate ids.

Another option is to decide which is more important, no duplicate Id or "exact" match of distribution. Perhaps create set with single values for the id and see if the overall distribution remains similar. If so then select a sample from that subset.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away