Hello
Rawtbl has one row per customer, with explanatory variables and an indicator of failure within 12 months.
I am using this code to divide the population into in-sample and out-sample groups.
The code divides the population at random.
As a result, the proportion of failures in the following period is similar in the in-sample and out-sample groups.
However, I want to add one more condition to this split:
I want the proportion of customers with a failure in the following period to be equal in the in-sample and out-sample groups.
How can I do this?
Please note that the reason I am doing this is that the Gini coefficient differs significantly between the two samples.
data Wanted;
  set Rawtbl;
  random=ranuni(1234);
  if random>=0.3 then outsample=0; /* Build the regression model here (70%) */
  else outsample=1; /* Validate the regression model here (30%) */
run;
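A quick way to compare the failure rate in the two samples is a cross-tabulation. This is only a sketch: it assumes the failure indicator is a 0/1 variable named Ind_Touch_Failure (adjust the name to match your data) and that it was carried through into Wanted.

```sas
/* Row percentages show the share of failures within each sample */
proc freq data=Wanted;
  tables outsample*Ind_Touch_Failure / nocol nopercent;
run;
```

If the row percentages for Ind_Touch_Failure=1 differ noticeably between outsample=0 and outsample=1, the purely random split has not balanced the failure rate.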
The method is:
I think you can achieve the same using PROC SURVEYSELECT, but I have never done it that way.
Could you please show the code?
It is much easier to understand with real code. Thanks.
@Ronein wrote:
Could you please show the code?
It is much easier to understand with real code. Thanks.
Why don't you take a try at it? I know you can do PROC FREQ, I know you can do PROC SORT, I know you can create random numbers. Show us what you have if it isn't working.
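One possible reading of that suggestion, offered as a sketch rather than a definitive solution: attach a random number, sort by the failure flag and the random number, count the rows per flag value with PROC FREQ, and then mark the first 30% of each group as out-sample. The variable name Ind_Touch_Failure, the intermediate data set names (Temp, Counts) and the 0.3 rate are assumptions for illustration.

```sas
/* Step 1: attach a random number to every customer */
data Temp;
  set Rawtbl;
  random = ranuni(1234);
run;

/* Step 2: random order within each level of the failure flag */
proc sort data=Temp;
  by Ind_Touch_Failure random;
run;

/* Step 3: number of customers in each level of the failure flag */
proc freq data=Temp noprint;
  tables Ind_Touch_Failure / out=Counts(keep=Ind_Touch_Failure count);
run;

/* Step 4: the first 30% of each level goes to the out-sample */
data Wanted;
  merge Temp Counts;
  by Ind_Touch_Failure;
  if first.Ind_Touch_Failure then n = 0;
  n + 1;                                   /* running row number within the level */
  outsample = (n <= round(0.3 * count));   /* 1 = out-sample, 0 = in-sample */
  drop n count random;
run;
```

Because the 30% is taken separately within each level, the failure rate in the two samples can differ only by rounding.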
Data Rawtbl;
Input CustomerID X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 Ind_Touch_Failure;
Cards;
1 10 20 30 40 50 60 70 80 90 100 0
2 15 20 25 30 35 40 45 50 55 110 1
3 25 30 25 30 35 40 35 50 50 130 0
4 25 30 25 30 35 20 35 50 25 100 0
5 10 30 45 30 35 40 45 50 55 150 1
and so on
;
Run;
This would select 25 records from each stratum of Ind_touch_failure, provided each stratum has at least 25 records. The two levels would then have equal proportions in the selected set, i.e. 50% each.
proc sort data=rawtbl;
  by Ind_touch_failure;
run;

proc surveyselect data=rawtbl out=want sampsize=25;
  strata Ind_touch_failure;
run;
If you want a repeatable selection, set the SEED= option; otherwise you will likely get a different set each time you rerun the code.
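For a 70/30 split rather than a fixed count per stratum, something along these lines may be closer to the original question. It is a sketch under assumptions: the failure flag is taken to be named Ind_Touch_Failure, Rawtbl_sorted is a hypothetical intermediate data set, and the seed value 1234 is arbitrary. SAMPRATE=0.3 samples 30% within each stratum, and OUTALL keeps every input row and adds a 0/1 variable named Selected.

```sas
/* STRATA requires the input to be sorted by the stratum variable */
proc sort data=Rawtbl out=Rawtbl_sorted;
  by Ind_Touch_Failure;
run;

/* Draw 30% of each stratum; OUTALL keeps all rows and flags the selected ones */
proc surveyselect data=Rawtbl_sorted out=Wanted
    method=srs samprate=0.3 seed=1234 outall;
  strata Ind_Touch_Failure;
run;

/* Selected=1 marks the 30% out-sample, Selected=0 the 70% in-sample */
data Wanted;
  set Wanted;
  outsample = Selected;
  drop Selected;
run;
```

Since the same rate is applied inside each stratum, the proportion of failures should match in the in-sample and out-sample groups up to rounding.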
This code didn't work.
Why did you write sampsize=25?
My source data set has 40,000 observations.
@Ronein wrote:
This code didn't work.
Why did you write sampsize=25?
My source data set has 40,000 observations.
1) Dummy code for dummy data. I explained that it would select 25 records from each stratum.
2) When I posted that, you had not said anything about the size of your data set, so I picked a small number hoping that it would at least run.
I do not know what you are attempting. I think you should work out by hand a small example of the output you expect from a given example data set.
@Ronein wrote:
Data Rawtbl;
Input CustomerID X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 Ind_Touch_Failure;
Cards;
1 10 20 30 40 50 60 70 80 90 100 0
2 15 20 25 30 35 40 45 50 55 110 1
3 25 30 25 30 35 40 35 50 50 130 0
4 25 30 25 30 35 20 35 50 25 100 0
5 10 30 45 30 35 40 45 50 55 150 1
and so on
;
Run;
So what in this data indicates the "following period"? In fact, what indicates the current period?