04-30-2018 10:30 PM
I am trying to generate a 1% random sample from my data set, but I only want 1 of each distinct value in a variable. For example, if I have a set that looks like:
Company Name Project Number Joint Company
CompA 553 CompX
CompA 552 CompY
CompB 133 CompZ
In my random sample, I would only want CompA chosen once, not multiple times due to being in multiple rows. So, I need a 1% random sample using variable CompName. In addition, I do not want it to produce a new data set. Instead, I want a new variable created (let's say, "Random"), where a value of 1 means that row was selected for the random sample, and '.' or '0' means it was not selected.
Does that make sense? I have tried both proc surveyselect and proc sql and cannot 1) get distinct values instead of duplicates and 2) generate a new column in the same data set (or even a new data set that has all the original variables in it).
Thank you for your help!
05-01-2018 12:15 AM
Combine cluster sampling with stratified sampling. For example, to get a sample of 10 car models, but never more than one per make:
/* Select 10 makes */ proc surveyselect data=sashelp.cars sampsize=10 outall out=makeSample(rename=selected=selectedMake); samplingunit make; run; /* Select one model from every make */ proc surveyselect data=makeSample sampsize=1 outall out=modelSample(rename=selected=selectedModel); strata make; run; /* Select the selected model for every selected make */ data sample; set modelSample; selected = selectedMake and selectedModel; run;
Example assumes that the data is sorted by make.
05-01-2018 04:43 AM
Follow what @Reeza said.
data have; set sashelp.zipcode; call streaminit(12345678); r=rand('uniform'); run; proc sort data=have ; by STATE r; run; data want; set have; by STATE; if first.STATE then Random=rand('bern',0.1) ; drop r; run;