hi everyone.
i received a portfolio database of clients who received a loan. i know for a given month the total % of default (eg 5.0%), but i do not have the detail for every client.
i would like to generate for every client (row) a random number (0 or 1), with the "1" meaning a default situation, but i want also the total of "1" being the 5% of the total of rows.
thanks in advance for every suggestions.
There are two ways to interpret the "5%" criterion. The first way is to compute R=int(0.05*N) where N is the number of observations in your data set. Data_Null_ provides code to generate an indicator variable that has R selected rows. Notice that if you run PROC FREQ like this
proc freq data=samp;
tables selected;
run;
you will always get the same number of selected rows (R = int(0.05*428) = 22 in _NULL_'s example).
An alternative approach is to say that each observation has a 5% chance of being selected. This means that the number of selected observations is a random value with expected value 0.05*N. If you take this approach, every time you run the following program you get a different number of selected rows. The number of rows is binomially distributed. It's equivalent to using selected=rand("Binomial",0.05) in a DATA step.
proc surveyselect data=sashelp.cars
method=bernoulli rate=.05 out=samp outall;
run;
proc freq data=samp;
tables selected;
run;
If your eventual goal is to do some kind of bootstrap analysis, know that PROC SURVEYSELECT supports the REPS= option, which repeats this selecting process a specified number of times.
The 0,1 will be called selected.
proc surveyselect data=sashelp.cars rate=.05 out=samp outall;
run;
There are two ways to interpret the "5%" criterion. The first way is to compute R=int(0.05*N) where N is the number of observations in your data set. Data_Null_ provides code to generate an indicator variable that has R selected rows. Notice that if you run PROC FREQ like this
proc freq data=samp;
tables selected;
run;
you will always get the same number of selected rows (R = int(0.05*428) = 22 in _NULL_'s example).
An alternative approach is to say that each observation has a 5% chance of being selected. This means that the number of selected observations is a random value with expected value 0.05*N. If you take this approach, every time you run the following program you get a different number of selected rows. The number of rows is binomially distributed. It's equivalent to using selected=rand("Binomial",0.05) in a DATA step.
proc surveyselect data=sashelp.cars
method=bernoulli rate=.05 out=samp outall;
run;
proc freq data=samp;
tables selected;
run;
If your eventual goal is to do some kind of bootstrap analysis, know that PROC SURVEYSELECT supports the REPS= option, which repeats this selecting process a specified number of times.
dear rick, the advice that definitely match my goals is the second! one last doubt, since i saw that the sample size slightly change, this change is based on a confidence interval? thanks a lot!
The number of selected rows will be binomially distributed. The expected number of selected rows (the mean) will be N*p.
The standard deviation will be sqrt(N*p*(1-p)). When N is large, most of the sample sizes will be within three standard deviations of the mean.
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.