## Stratified Sampling based on multiple variables

Hi All -

I have a dataset that contains account number, balance, limit, and APR. I need to separate out 10% of the population from this dataset. This 10% should be a (stratified) random sample, and the distributions of balance, limit, and APR in the 10% sample and in the remaining 90% should be approximately equal.

I have used PROC SURVEYSELECT to draw a sample stratified on one variable:

proc surveyselect data = dataset out=new_dsn samprate=.1 outall;

strata cust_flag;

run;

Can someone help me do the same thing for several variables?

Thanks

Dhana

## Re: Stratified Sampling based on multiple variables

Why can't you add more variables to the strata statement?

strata balance limit apr;
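Spelled out in full, that would look something like the sketch below. It assumes the input is sorted by the strata variables first, which PROC SURVEYSELECT expects when a STRATA statement is used. The data set name dataset_sorted and the seed value are placeholders of mine, not anything from the original post.

```sas
/* STRATA processing expects the input sorted by the strata variables. */
/* dataset_sorted is a hypothetical name for the sorted copy.          */
proc sort data=dataset out=dataset_sorted;
    by balance limit apr;
run;

/* 10% simple random sample within each balance*limit*apr combination. */
/* SEED= is an arbitrary value, set only so the draw is repeatable.    */
/* OUTALL keeps every row and adds a 0/1 Selected flag.                */
proc surveyselect data=dataset_sorted out=new_dsn
                  method=srs samprate=0.1 seed=12345 outall;
    strata balance limit apr;
run;
```

Each distinct combination of the three values becomes its own stratum, so with continuous variables this can produce a very large number of very small strata.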

## Re: Stratified Sampling based on multiple variables

I tried that, but instead of 10% I got 19% of the population. After seeing that, I am a little confused about how this proc works.

## Re: Stratified Sampling based on multiple variables

I don't know how it works, but I do have a suspicion.  Perhaps the procedure requires every combination of strata variables to be represented in the sample.  If the number of observations fitting into a particular strata combination were 5, the software would still have to select one of them into the sample.  If that applied to every strata combination, you would end up with a 20% sample.  You could check the strata sizes with this sort of program:

proc freq data=have noprint;

tables three*strata*variables / out=counts (keep=count rename=(count=n_observations));

run;

proc freq data=counts;

tables n_observations;

run;

The final table would tell you how many strata combinations have 1 observation in the original data set, how many have 2 observations, etc.
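If that table turns out to be dominated by tiny strata, one possible workaround (a sketch, not something I have tested on your data) is to stratify on binned versions of the variables instead of the raw values. The example below assumes balance, limit and apr are continuous, uses quartile bins as an arbitrary choice, and invents the names have_binned, sampled, and the *_grp variables for illustration.

```sas
/* Bin each continuous variable into 4 groups (values 0-3).       */
/* The *_grp names and groups=4 are assumptions for illustration. */
proc rank data=have out=have_binned groups=4;
    var balance limit apr;
    ranks balance_grp limit_grp apr_grp;
run;

/* SURVEYSELECT expects the data sorted by the strata variables. */
proc sort data=have_binned;
    by balance_grp limit_grp apr_grp;
run;

/* 10% sample within each of the (at most 64) bin combinations. */
proc surveyselect data=have_binned out=sampled
                  method=srs samprate=0.1 seed=12345 outall;
    strata balance_grp limit_grp apr_grp;
run;

/* Compare the distributions for Selected=1 (sample) and Selected=0 (rest). */
proc means data=sampled n mean std min median max;
    class selected;
    var balance limit apr;
run;
```

With at most 64 strata, each one should hold enough observations that selecting a whole number of rows per stratum no longer pushes the overall rate far above 10%. The PROC MEANS step at the end gives a rough check that the sample and the remainder look alike on all three variables.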

Good luck.

## Re: Stratified Sampling based on multiple variables

Could you post the code that generated the 19% sample? I did some experimenting and I get 10% within each combination of strata variables but my trial data is probably too nice.
