hi everyone.
i received a portfolio database of clients who received a loan. i know for a given month the total % of default (eg 5.0%), but i do not have the detail for every client.
i would like to generate for every client (row) a random number (0 or 1), with the "1" meaning a default situation, but i want also the total of "1" being the 5% of the total of rows.
thanks in advance for every suggestions.
There are two ways to interpret the "5%" criterion. The first way is to compute R=int(0.05*N) where N is the number of observations in your data set. Data_Null_ provides code to generate an indicator variable that has R selected rows. Notice that if you run PROC FREQ like this
proc freq data=samp;
tables selected;
run;
you will always get the same number of selected rows (R = int(0.05*428) = 22 in _NULL_'s example).
An alternative approach is to say that each observation has a 5% chance of being selected. This means that the number of selected observations is a random value with expected value 0.05*N. If you take this approach, every time you run the following program you get a different number of selected rows. The number of rows is binomially distributed. It's equivalent to using selected=rand("Binomial",0.05) in a DATA step.
proc surveyselect data=sashelp.cars
method=bernoulli rate=.05 out=samp outall;
run;
proc freq data=samp;
tables selected;
run;
If your eventual goal is to do some kind of bootstrap analysis, know that PROC SURVEYSELECT supports the REPS= option, which repeats this selecting process a specified number of times.
The 0,1 will be called selected.
proc surveyselect data=sashelp.cars rate=.05 out=samp outall;
run;
There are two ways to interpret the "5%" criterion. The first way is to compute R=int(0.05*N) where N is the number of observations in your data set. Data_Null_ provides code to generate an indicator variable that has R selected rows. Notice that if you run PROC FREQ like this
proc freq data=samp;
tables selected;
run;
you will always get the same number of selected rows (R = int(0.05*428) = 22 in _NULL_'s example).
An alternative approach is to say that each observation has a 5% chance of being selected. This means that the number of selected observations is a random value with expected value 0.05*N. If you take this approach, every time you run the following program you get a different number of selected rows. The number of rows is binomially distributed. It's equivalent to using selected=rand("Binomial",0.05) in a DATA step.
proc surveyselect data=sashelp.cars
method=bernoulli rate=.05 out=samp outall;
run;
proc freq data=samp;
tables selected;
run;
If your eventual goal is to do some kind of bootstrap analysis, know that PROC SURVEYSELECT supports the REPS= option, which repeats this selecting process a specified number of times.
dear rick, the advice that definitely match my goals is the second! one last doubt, since i saw that the sample size slightly change, this change is based on a confidence interval? thanks a lot!
The number of selected rows will be binomially distributed. The expected number of selected rows (the mean) will be N*p.
The standard deviation will be sqrt(N*p*(1-p)). When N is large, most of the sample sizes will be within three standard deviations of the mean.
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.