BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Anotherdream
Quartz | Level 8

Hello everyone.  I have a dataset that has has a variable I would like to do a random sample with replacement from.

The trick is, the dataset has a grouping variable State.  I would like to do a random sample of 10,000 observations on the same variable for EACH state, so that at the end of the day my dataset would have 10,000 * 50 or 500,000 observations.

I don't think surveyselect can do this but if I am incorrect please let me know. I found this code for doing Random sample with replacement, but all of my attempts to get it to run "by groups" fails.

Random Sample with Replacement code... Want to repeat it by groups.

DATA SRS_WR/ ;

SEED=123466;

SamplingProbability=&SAMPLE_SIZE/FRAME_SIZE;

SamplingWeight=1/SamplingProbability;

DO samp_cnt=1 TO &sample_size;

SELECT=CEIL( UNIFORM(SEED) * FRAME_SIZE );

SET ResidualsDists POINT=SELECT NOBS=FRAME_SIZE;

output;

end;

stop;

run;

I could also put this into a macro loop and repeat the code that way, however that seems very inefficient and I imagine there is a way to get it to work using "by groups".  If anyone can provide help it would be greatly appreciated.

Here is an example of my data (very small example).Value

data MetricLocation;

   infile datalines delimiter=',';

   input State $ MEtricDist;

   datalines;                     

CA,506

CA,45

CA,41

CA,32

CA,49

CA,45

CA,38

CA,20

CA,61

CA,64

CA,305

CA,42

CO,506

CO,46

CO,42

CO,33

CO,50

CO,51

CO,62

CO,60

CO,55

CO,49

CO,53

CO,51

;

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

What you want then is a random sequence, not a random sample. This, for instance, works fine

data MetricLocation;

   infile datalines delimiter=',';

   input State $ MEtricDist;

   datalines;                    

CA,506

CA,45

CA,41

CA,32

CA,49

CA,45

CO,506

CO,46

CO,42

CO,33

CO,50

CO,51

;

proc surveyselect data=MetricLocation method=urs out=sample sampsize=10 seed=1764;

strata state;

run;

data rndSample;

call streaminit(86785);

set sample;

do i = 1 to numberHits;

    rnd = rand("UNIFORM");

    output;

    end;

drop NumberHits ExpectedHits SamplingWeight i;

run;

proc sort data=rndSample out=randomSequence(drop=rnd); by State rnd; run;

proc print data=randomSequence; run;

PG

PG

View solution in original post

11 REPLIES 11
Reeza
Super User

I may be missing something but this seems like a simple Surveyselect with a strata statement?

proc surveyselect data = metricLocation out=metricSample method=SRS sampsize=7;

strata state;

run;quit;

Anotherdream
Quartz | Level 8

Hey @Reeza. As always thanks for your help.  I thought the same thing, but if you only have say 10 observations in one subset and you want to do a random sample of "20" (with replacement) then SRS doesn't work sadly.

however I just tried

proc surveyselect data=ResidualsDists
rep=1 seed=69581 outhits
   method=urs sampsize=500 out=SampleSRS;
   strata State;
run;

And it does seem to work.   I am curious if there is a way to do a surveyselect maintaining inherent order however?   If you run the following example, it will repeat the metric value X times, wehre X was the number of times it was sampled.  So in our example if "506" was sampled 88 times, it would repeat 506 88 times, and then it would list 45 X times ( where X is the total number of times out of 500 it was sampled).  I would prefer to not do this but instead actual maintain the order of the sample.

So if the sample went "506" "61" "20" " 64" "64" "506" for california it would be nice to maintain that ordering.  Do you know if Survey select can do this? (I am starting to think more of sampling Multiple X variables that might be correlated).

Thanks again!

ballardw
Super User

Add a record order variable or what ever you want to call it before surveyselect and add it and any other variables you want from to an ID statement;

id order var1 var2;

IF the order variable contains the order you want you may just need to resort after the selection.

I would hope that if you are requesting a 10000 sample size that you have more than 20 records in the strata. If that is an issue you may want to look at SAMPRATE instead of SIZE...

Anotherdream
Quartz | Level 8

Hey Ballardw. Thanks to you as well for your help!  I think more clarification on my actual problem would be helpful, as I tried to simplify it down quite a bit (maybe too much) to answer one part of multiple problems I am having.  By the way your use of the "ID" variable may be exactly what I need, and i'll look into that.

Basically I have 4 datasets which each have a variable, and also have the variable "state".  I want to create a new variable which is a transformation and combination of the 4 underlying variable distributions.  Unfortunately some of the distributions are not normal (one is bi-model, another is non-defined, etc..).

In school I was taught a great way to do this is Monte carlo simulation. Basically pick randomly from EACH of the distributions (once) and then perform your transformation.  The resulting value is 1 'observation' from the distribution that you want.   Repeat this process 10k times, and the resulting distribution is the final variable 'distribution' that you are interested in.

I am struggling with survey select, because it maintains orders of the variables from the input dataset and instead outputs the 'probability' of selection, and number of times selected.  This is not good for what I am trying to do, since smaller values of a variable A could be grouped with smaller values of Variable B for no reason other than the way survey select seems to work (if they happened to be ordered smallest to largest by example before the survey select ran).  I would have then effectively added a "correlation" between the random variable distributions without meaning too.

I wasn't sure how to solve this problem and that is where the question came from.  I tried to simplify it to find a way to do a random sample By Group without surveySelect....

I think the answer to my question is found below (now).

If i want the 4 underlying distributions to be independent of eachother in their samples.

Do the survey select for each of the 4 datasets, and then just randomly order the results of each result set. That way observation 1 from each random sample from the distribution could be added together to get a new "Distribution 5", and they would still be independently generated from their own distributions.

Does that sound reasonable to you?  I didn't really think this was a good solution because you have to do a random sample, and then randomly sort your result set... I was surprised that sas didn't already have a function that would do this for you to be honest.

Let me know if I am not being clear enough!

ballardw
Super User

I've not noticed any order imposed on ID variables. The order of the STRATA variables is sort of required. Another option to consider if order is important would be to use Method=SYS.

Anotherdream
Quartz | Level 8

What I mean is. The output distribution of your variables is deterministic in it's output (but not it's results) when using proc surveyselect.   Even without using strata.   Please see thw following example.

data MetricLocation;

   infile datalines delimiter=',';

   input MEtricDist;

   datalines;                     

506

45

41

;

proc surveyselect data=MetricLocation

rep=1 seed=69581 outhits

   method=urs sampsize=500 out=SampleSRS;

run;

You will note the output dataset has the variable "metricDist" with a value of "506" repeated 145 times (number of hits 145). 45 repeated 196 times, and then 41 159 times.

Now the true order of the simulation could have been 506, 506, 45, 41, 45, 45, 506, 506, etc... But it is ordered by the MetricDist columns original distribution.. I don't want this.  I am still not 100% sure how to fix this problem.  I am working on the ID variable but am not sure I am doing it correctly.

ballardw
Super User

You can reorder the output: Add in a random number from that oversampled data and then sort on strata and that order variable.

PGStats
Opal | Level 21

What you want then is a random sequence, not a random sample. This, for instance, works fine

data MetricLocation;

   infile datalines delimiter=',';

   input State $ MEtricDist;

   datalines;                    

CA,506

CA,45

CA,41

CA,32

CA,49

CA,45

CO,506

CO,46

CO,42

CO,33

CO,50

CO,51

;

proc surveyselect data=MetricLocation method=urs out=sample sampsize=10 seed=1764;

strata state;

run;

data rndSample;

call streaminit(86785);

set sample;

do i = 1 to numberHits;

    rnd = rand("UNIFORM");

    output;

    end;

drop NumberHits ExpectedHits SamplingWeight i;

run;

proc sort data=rndSample out=randomSequence(drop=rnd); by State rnd; run;

proc print data=randomSequence; run;

PG

PG
Anotherdream
Quartz | Level 8

Hello PGStats.  This is exactly what I wanted!  I think I was more surprised that SAS didn't have a built in function to do something like this.  However if you think this is the best way to do it, so be It! I will use this methodology.

Thanks for your time!

Reeza
Super User

You'll probably need to use a different solution then....the Point solution above may work.

Mine would probably be more roundabout - I'd generate a set of random numbers that had the format I was looking for based on my rules and then merge my data in based on a record ID. I have no idea on the efficiency of this, but perhaps the solution you initially posted matches this or perhaps someone can improve on it.

proc freq data=have noprint;

table state/out=stateCNT;

run;

data sample;

set stateCNT;

seed=10;

do i=1 to 500; *number of samples;

record=count*ranuni(seed)+1;

output;

end;

run;

Then merge into records.

Reeza
Super User

I thought the same thing, but if you only have say 10 observations in one subset and you want to do a random sample of "20" (with replacement) then SRS doesn't work sadly.

With Replacement it should work. You can also look at the SELECTALL option:

SELECTALL

requests that PROC SURVEYSELECT select all stratum units when the stratum sample size exceeds the total number of units in the stratum. By default, PROC SURVEYSELECT does not allow you to specify a stratum sample size that is greater than the total number of units in the stratum, unless you are using a with-replacement selection method.

The SELECTALL option is available for the following without-replacement selection methods: METHOD=SRS, METHOD=SYS, METHOD=SEQ, METHOD=PPS, and METHOD=PPS_SAMPFORD.

The SELECTALL option is not available for with-replacement selection methods, with-minimum-replacement methods, or those PPS methods that select two units per stratum.

For the Order you can use either @ballardW solution's or consider looking at the OUTALL option which outputs the entire dataset with a 1/0 to indicate selection. I'm not sure of the order of the data though.

sas-innovate-2024.png

Today is the last day to save with the early bird rate! Register today for just $695 - $100 off the standard rate.

 

Plus, pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 11 replies
  • 3277 views
  • 1 like
  • 4 in conversation