BookmarkSubscribeRSS Feed
twildone
Pyrite | Level 9

Hi....I would like to select a sample of records from a dataset and the total number of records in the dataset can vary and is not fixed. I want to select a sample in such a way that the sample is a true representation of all records when it comes to frequency (number of records) of the Product ID. Sales by differnt Product ID will vary. Do I need to cluster the datewr first and then sample within each cluster. Any suggestions would be greatly appreciated.....Thanks

9 REPLIES 9
Haikuo
Onyx | Level 15

If you have SAS/STAT, then check Proc Surveyselect, with 'strata' option.

If you don't, there are still ways of doing the random sampling, let us know if that is the case, then someone from this forum can start working on it.

TMiles
Quartz | Level 8
Data whatever;
set whatever;
ran = uniform(0);
if ran <= .00; /* percent you would like to sample */
ballardw
Super User

You may need to clarify what you mean when you say "a true representation of all records when it comes to frequency (number of records) of the Product ID".

 

Do you mean that any abritrary subgroup, say on geography or customer volumn the percent of product ID will be exactly the same as for the sample overall?

 

How are you using the idea of "cluster" for this project?

 

Clusters are not normally created/assigned as such a way to gaurantee the same distribution of values as the whole sample but are identified as something that contributes to overall variability in the sample. Think of study involving school students from many schools. A natural cluster would be the school as generally there are differences in makeup of student bodies from school to school.

twildone
Pyrite | Level 9
Hi Ballardw....what I mean by "True representation" those product ID with a higher customer volume should have a higher probability (maybe proportionate to total volume) of being selected into the sample. For example, if there were only 2 product ID's, Product_ID1 had 100 sales and Product_ID2 had 200 sales, and if I wanted to select a sample size of 50, then I would want 1/3 of the sample (100/300) to be selected from Product_ID1 AND 2/3 of the sample (200/300)to be selected from Product_ID2. Hope this helps.
ballardw
Super User

Assuming the data is such that you have Product_Id and the Product_Id1 and _Id2 your mention are different values then you could use Product_Id as a Strata variable and specify the sample size  for each level of the variable. If you have more than a few products involved then you would likely want to use a SAMPSIZE data set. Look at the documentation for the contents as it needs specifically named variables to work properly.

You could easily make a sampsize dataset by using proc freq to generate a data set with the product_id values and overall percentage or Proc SQL. Multiply that percentage by the desired overall sample size to get the number for each product. The number would have to be rounded to an integer.

To use STRATA the input data set would need to be sorted by the strata varaible. The sampsize dataset would need to match all values of the strata variable, attributes such as length and type and sorted in the correct order.

twildone
Pyrite | Level 9

Hi Ballardw....I tried your your suggestions and it did work. The thing that I would also like to do is to have ID_Number appear only once in the overall sample selected without any duplicates of the ID_Number.

 

 

Group _NSIZE_
1 2
2 3
3 9
4 21
5 10
6 25

 

 

PROC SORT DATA=SUMMARY94;

BY GROUP;

RUN;

PROC SURVEYSELECT DATA=SUMMARY94 N=SUMMARY99C OUT=hsbs3;

STRATA GROUP;

RUN;

ballardw
Super User

@twildone wrote:

Hi Ballardw....I tried your your suggestions and it did work. The thing that I would also like to do is to have ID_Number appear only once in the overall sample selected without any duplicates of the ID_Number.

 

 


I think that using your Id_number as a SAMPLINGUNIT variable may help if the id_number does not exist in different groups.

Another option if you are seeing many of these is to slightly over sample and discard "extras". You haven't said wether the selection probabilities are import for this.

There may be some options in SAS 9.4 that allow more fiddling with selections but I don't have 9.4 for testing behavior.

twildone
Pyrite | Level 9

Hi...I did try using the sampling unit statement and did get unusual results.

 

PROC SURVEYSELECT DATA=SUMMARY94 N=SUMMARY98 OUT=hsbs3;

SAMPLINGUNIT ID_NUMBER;

STRATA PRECLUS;

RUN;

ballardw
Super User

I thought about that and remembered it's likely to duplicate the Id. I was kind of hoping that if the duplicated id values were a small percentage overall it might allow skipping them.

 

Are you needing the probability or selection / weighting information? IF not you may be able to oversample by a bit and remove duplicate ids.

 

Another option is to decide which is more important, no duplicate Id or "exact" match of distribution. Perhaps create set with single values for the id and see if the overall distribution remains similar. If so then select a sample from that subset.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 1388 views
  • 0 likes
  • 4 in conversation