Stratified Sampling based on multiple variables

11-07-2012 12:51 PM

Hi All -

I have a dataset that contains account number, balance, limit and apr. I have to separate out 10% of the population from this dataset. This 10% should be a (stratified) random sample, and the distributions of balance, limit and apr should be approximately equal between the 10% sample and the remaining 90%.

I have used PROC SURVEYSELECT to sample the dataset based on one variable:

proc surveyselect data = dataset out=new_dsn samprate=.1 outall;

strata cust_flag;

run;

Can someone help me do the same thing for several variables?

Thanks

Dhana

11-07-2012 12:57 PM

Why can't you add more variables to the strata statement?

strata balance limit apr;

11-07-2012 01:08 PM

I tried that, but instead of 10% I got 19% of the population. After seeing that, I am a little confused about how this proc works.

11-07-2012 01:57 PM

I don't know how it works, but I do have a suspicion. Perhaps the procedure requires every combination of strata variables to be represented in the sample. If the number of observations fitting into a particular strata combination were 5, the software would still have to select one of them into the sample. If that applied to every strata combination, you would end up with a 20% sample. You could check the strata sizes with this sort of program:

proc freq data=have noprint;

tables three*strata*variables / out=counts (keep=count rename=(count=n_observations));

run;

proc freq data=counts;

tables n_observations;

run;

The final table would tell you how many strata combinations have 1 observation in the original data set, how many have 2 observations, etc.
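If most combinations turn out to hold only a handful of observations, one possible workaround is to stratify on coarse bins of the continuous variables instead of on their raw values. This is only a sketch under assumptions not stated in the thread: the data set name have, the simulated values, the choice of three bins per variable, the seed, and the *_grp variable names are all hypothetical.

```sas
/* Hypothetical test data: 10,000 accounts with simulated values */
data have;
    call streaminit(2012);
    do account = 1 to 10000;
        balance = round(rand('uniform') * 10000, 0.01);
        limit   = round(5000 + rand('uniform') * 20000, 100);
        apr     = round(5 + rand('uniform') * 20, 0.01);
        output;
    end;
run;

/* Assign each variable to one of 3 rank-based groups (0, 1, 2) */
proc rank data=have out=binned groups=3;
    var balance limit apr;
    ranks balance_grp limit_grp apr_grp;
run;

/* PROC SURVEYSELECT expects the input sorted by the STRATA variables */
proc sort data=binned;
    by balance_grp limit_grp apr_grp;
run;

/* 10% simple random sample within each combination of groups */
proc surveyselect data=binned out=sampled method=srs
                  samprate=0.1 seed=12345 outall;
    strata balance_grp limit_grp apr_grp;
run;
```

With three groups per variable there are at most 27 strata, far fewer than stratifying on the raw values would give. Each stratum's sample size is still a whole number, so the overall rate may not come out at exactly 10%.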

Good luck.

11-07-2012 04:11 PM

Could you post the code that generated the 19% sample? I did some experimenting and I get 10% within each combination of strata variables but my trial data is probably too nice.
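For anyone wanting to check their own result, a rough sketch along these lines may help. It assumes the output data set is named new_dsn, as in the original post, and that it was created with the OUTALL option, which adds a Selected variable (1 for sampled rows, 0 otherwise). The choice of statistics is just a suggestion.

```sas
/* Overall share of selected (1) versus unselected (0) accounts */
proc freq data=new_dsn;
    tables Selected;
run;

/* Compare balance, limit and apr between the two groups */
proc means data=new_dsn n mean std min median max;
    class Selected;
    var balance limit apr;
run;
```

The first table shows the sampling rate actually achieved. The second gives a quick side-by-side look at whether the 10% and 90% groups have similar distributions.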