Hello
Rawtbl has one row per customer, with explanatory variables and an indicator of failure in the following 12 months.
I am using the code below to divide the population into in-sample and out-sample.
This code divides the population randomly.
The outcome of this method is that the proportion of failures in the following period is similar in the in-sample and the out-sample.
However, I want to add one more condition to this division:
I want the proportion of customers with a failure in the following period to be equal in the in-sample and the out-sample.
How can I do this?
Please note that the reason I want this is that the Gini coefficient differs significantly between the two samples.
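data Wanted;
set Rawtbl;
random=ranuni(1234); /* uniform random number with a fixed seed */
if random>=0.3 then outsample=0; /* in-sample (about 70%): build the regression model here */
else outsample=1; /* out-sample (about 30%): check the regression model here */
run;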
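To see how far apart the failure rates of the two samples actually are after the random split, a cross-tabulation can help. This is only a sketch: it assumes the failure indicator is a 0/1 variable named Ind_Touch_Failure (the name used in the PROC SURVEYSELECT reply in this thread), so adjust it to the actual variable name in your data.

```sas
/* Sketch: compare the failure rate in the in-sample and the out-sample. */
/* Assumes Wanted contains outsample (0/1) and a 0/1 failure indicator   */
/* named Ind_Touch_Failure; the variable name is an assumption.          */
proc freq data=Wanted;
  /* Row percentages give the failure rate within each value of outsample */
  tables outsample*Ind_Touch_Failure / nocol nopercent;
run;
```

The row percentages in the two rows of the table are the failure rates being compared.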
The method is:
I think you can achieve the same using PROC SURVEYSELECT, but I have never done it that way.
Could you please show the code?
It is much easier to understand with real code. Thanks.
Why don't you take a try at it? I know you can do PROC FREQ, I know you can do PROC SORT, I know you can create random numbers. Show us what you have if it isn't working.
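For readers who want a starting point, here is one possible sketch of the PROC FREQ / PROC SORT / random-number route. It is not necessarily the method the reply had in mind. It assumes a 0/1 failure indicator named Ind_Touch_Failure and a 30% out-sample; the work data sets _rand and _counts and the counter _n are hypothetical names introduced for illustration.

```sas
/* Sketch only: stratified 70/30 split done by hand.             */
/* Assumption: Rawtbl has a 0/1 variable named Ind_Touch_Failure. */

/* 1. Attach a random number to every customer */
data _rand;
  set Rawtbl;
  random = ranuni(1234);
run;

/* 2. Put each failure group in random order */
proc sort data=_rand;
  by Ind_Touch_Failure random;
run;

/* 3. Count the customers in each failure group */
proc freq data=_rand noprint;
  tables Ind_Touch_Failure / out=_counts(keep=Ind_Touch_Failure count);
run;

/* 4. Within each group, flag the first 30% of rows as out-sample */
data Wanted;
  merge _rand _counts;
  by Ind_Touch_Failure;
  if first.Ind_Touch_Failure then _n = 0;
  _n + 1;  /* running row number within the group */
  outsample = (_n <= round(0.3 * count));
  drop _n count;
run;
```

Because the 30% cut is applied separately inside each failure group, the failure rate in the two samples can differ only by rounding.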
Data Rawtbl;
Input CustomerID X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 Ind_Touch_Failure;
Cards;
1 10 20 30 40 50 60 70 80 90 100 0
2 15 20 25 30 35 40 45 50 55 110 1
3 25 30 25 30 35 40 35 50 50 130 0
4 25 30 25 30 35 20 35 50 25 100 0
5 10 30 45 30 35 40 45 50 55 150 1
and so on
;
Run;
This would select 25 records from each stratum of Ind_touch_failure, provided each stratum has at least 25 records. The two levels would then appear in equal proportion in the selected set, i.e. 50% each.
proc sort data=rawtbl;
  by Ind_touch_failure;
run;
proc surveyselect data=rawtbl out=want sampsize=25;
  strata Ind_touch_failure;
run;
If you want a repeatable selection, set the SEED= option; otherwise you will likely get a different set each time you rerun the code.
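For the 70/30 split asked about in the original question, a sampling rate may fit better than a fixed sample size. The following is a sketch under assumptions, not tested against the poster's data: it assumes the failure indicator is named Ind_Touch_Failure, uses a 30% out-sample, and introduces the hypothetical data set names rawtbl_sorted and want_all.

```sas
/* Sketch: stratified 70/30 split with a repeatable seed.         */
/* Assumption: rawtbl has a 0/1 variable named Ind_Touch_Failure. */
proc sort data=rawtbl out=rawtbl_sorted;
  by Ind_Touch_Failure;
run;

/* SAMPRATE= draws about 30% from each stratum.                */
/* OUTALL keeps every input row and adds Selected (1 = drawn). */
/* SEED= makes the selection repeatable.                       */
proc surveyselect data=rawtbl_sorted out=want_all
    method=srs samprate=0.3 seed=1234 outall;
  strata Ind_Touch_Failure;
run;

/* Rename the selection flag to match the original code */
data Wanted;
  set want_all;
  outsample = Selected;  /* 1 = out-sample, 0 = in-sample */
  drop Selected;
run;
```

Since the same rate is applied within each stratum, the proportion of failures should be nearly identical in the two samples, up to rounding of the stratum sample sizes.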
This code didn't work.
Why did you write sampsize=25?
My source data set has 40,000 observations.
1) It was dummy code for dummy data. I explained that it would select 25 records from each stratum.
2) When I posted that, you had not said anything about the size of your data set, so I picked a small number, hoping that it would at least run.
I do not know what you are attempting. I suggest working out by hand a small example of the result you expect from a given example data set.
@Ronein wrote:
Data Rawtbl;
Input CustomerID X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 Ind_Touch_Failure;
Cards;
1 10 20 30 40 50 60 70 80 90 100 0
2 15 20 25 30 35 40 45 50 55 110 1
3 25 30 25 30 35 40 35 50 50 130 0
4 25 30 25 30 35 20 35 50 25 100 0
5 10 30 45 30 35 40 45 50 55 150 1
and so on
;
Run;
So what in this data indicates the "following period"? In fact, what indicates the current period?