11-07-2012 12:51 PM
Hi All -
I have a dataset that contains account number, balance, limit and APR. I need to split off 10% of the population from this dataset. This 10% should be a (stratified) random sample, and the distributions of balance, limit and APR should be approximately equal between the 10% sample and the remaining 90%.
I have used PROC SURVEYSELECT to sample the dataset based on one variable:
proc surveyselect data=dataset out=new_dsn samprate=.1 outall;
run;
Can someone help me do the same thing for several variables?
11-07-2012 01:08 PM
I tried to do the same, but instead of 10% of the population I got 19%. After seeing that, I am a little confused about how this proc works.
11-07-2012 01:57 PM
I don't know how it works, but I do have a suspicion. Perhaps the procedure requires every combination of strata variables to be represented in the sample. If only 5 observations fell into a particular combination of strata, the software would still have to select one of them into the sample. If that applied to every combination, you would end up with a 20% sample. You could check the strata sizes with this sort of program:
proc freq data=have noprint;
   tables three*strata*variables / out=counts (keep=count rename=(count=n_observations));
run;
proc freq data=counts;
   tables n_observations;
run;
The final table would tell you how many strata combinations have 1 observation in the original data set, how many have 2 observations, etc.
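To put numbers on that suspicion, here is a minimal sketch of the arithmetic. It assumes that the procedure multiplies each stratum's size by the sampling rate and rounds up to a whole number of observations, which is how I recall the documentation describing it, but it has not been confirmed in this thread. The data set name rounding and the variables n_selected and effective_rate are made up for illustration.

```sas
/* Sketch only: assumes the stratum sample size is ceil(rate * stratum size). */
data rounding;
   rate = 0.1;                                       /* requested sampling rate       */
   do n_observations = 1 to 20;                      /* hypothetical stratum sizes    */
      n_selected = ceil(rate * n_observations);      /* rounded up, so at least 1     */
      effective_rate = n_selected / n_observations;  /* share actually sampled        */
      output;
   end;
run;

proc print data=rounding noobs;
   var n_observations n_selected effective_rate;
run;
```

Under that assumption, any stratum smaller than 10 observations is sampled at more than 10%, so a data set with many small strata could easily land well above the requested rate overall.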
11-07-2012 04:11 PM
Could you post the code that generated the 19% sample? I did some experimenting and I get 10% within each combination of strata variables but my trial data is probably too nice.
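On the original question of stratifying on several continuous variables at once, one possible approach (not something anyone in this thread has confirmed) is to bin each variable into a few groups first and then use the bins as strata. The sketch below assumes the variables are named balance, limit and apr; the number of groups, the seed and the names ranked, balance_grp, limit_grp and apr_grp are arbitrary choices for illustration.

```sas
/* Bin each continuous variable into quartiles (groups=4 is an arbitrary choice). */
proc rank data=dataset out=ranked groups=4;
   var balance limit apr;
   ranks balance_grp limit_grp apr_grp;
run;

/* PROC SURVEYSELECT expects the input sorted by the STRATA variables. */
proc sort data=ranked;
   by balance_grp limit_grp apr_grp;
run;

/* Draw 10% within each combination of bins.                        */
/* OUTALL keeps every row and adds a Selected flag (1 = in sample). */
proc surveyselect data=ranked out=new_dsn method=srs
                  samprate=0.1 seed=12345 outall;
   strata balance_grp limit_grp apr_grp;
run;

/* Compare the two groups on the original variables. */
proc means data=new_dsn n mean std median;
   class Selected;
   var balance limit apr;
run;
```

Fewer groups per variable means larger strata, which keeps the overall rate closer to 10% but gives coarser matching of the distributions. More groups does the opposite, so the strata sizes are worth checking before settling on a number.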