Ronein
Meteorite | Level 14

Hello

Rawtbl has one row per customer, with explanatory variables and an indicator of failure in the following 12 months.

I am using the code below to divide the population into in-sample and out-sample.

This code divides the population at random.

With this method, the proportion of failures in the following period is similar in the in-sample and the out-sample.

However, I want to add one more condition to this division:

I want the proportion of customers with a failure in the following period to be equal in the in-sample and the out-sample.

How can I do this?

Please note that the reason I am doing this is that the Gini coefficient has a significantly different value in the two samples.

 

data Wanted;
set Rawtbl;
random=ranuni(1234);
if random>=0.3 then outsample=0; /* in-sample: build the regression model here (about 70%) */
else outsample=1; /* out-sample: check the regression model here (about 30%) */
run;
PaigeMiller
Diamond | Level 26

The method is:

 

  1. Determine the proportion of failures in the full data set; let's assume that it is k% (use PROC FREQ)
  2. Assign random numbers to the full data set, and then sort by failure or not and, within each group, by the random number
  3. Take the sorted full data set, and assign the first 70% of the failures (that is, 0.7*k% of the full data set) to insample and the remaining 30% of the failures to outsample, and assign the first 70% of the non-failures (0.7*(100-k)% of the full data set) to insample and the remaining 30% of the non-failures to outsample
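
The three steps above could be sketched in SAS roughly as follows. This is only an illustration, not code from the poster: it assumes the failure flag is a 0/1 variable named Ind_Touch_Failue (the name used in the sample data posted later in this thread) with no missing values, and the data set names scored, counts, and wanted are made up for the example.

```sas
/* Step 1: failure rate in the full data set; OUT= also saves the count per level */
proc freq data=rawtbl;
   tables Ind_Touch_Failue / out=counts(keep=Ind_Touch_Failue count);
run;

/* Step 2: attach a random number, then sort by failure status and random number */
data scored;
   set rawtbl;
   random = ranuni(1234);
run;

proc sort data=scored;
   by Ind_Touch_Failue random;
run;

/* Step 3: within each failure level, the first 70% of rows go in-sample */
data wanted;
   merge scored counts;
   by Ind_Touch_Failue;
   if first.Ind_Touch_Failue then seq = 0;
   seq + 1;                          /* running row number within the level */
   outsample = (seq > 0.7 * count);  /* 0 = in-sample, 1 = out-sample */
   drop seq count random;
run;
```

Because the cut is made separately inside each failure level, both samples end up with the same failure rate, up to rounding of the group sizes.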

I think you can achieve the same using PROC SURVEYSELECT, but I have never done it that way.

--
Paige Miller
Ronein
Meteorite | Level 14

Could you please show the code?

It is much easier to understand with real code. Thanks.

PaigeMiller
Diamond | Level 26

@Ronein wrote:

Could you please show the code?

It is much easier to understand with real code. Thanks.


Why don't you take a try at it? I know you can do PROC FREQ, I know you can do PROC SORT, I know you can create random numbers. Show us what you have if it isn't working.

--
Paige Miller
ballardw
Super User
What does your "failure in following period" variable look like?
PROC SURVEYSELECT would likely work, with your unnamed failure-in-following-period variable as the STRATA variable and a SAMPRATE of 50, or maybe a specific SAMPSIZE. But we need to know what sort of variable holds the information.
Or possibly variables, in which case you may need to create a single variable that SURVEYSELECT can use.

Details matter.
Ronein
Meteorite | Level 14
Data Rawtbl;
Input CustomerID X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 Ind_Touch_Failue;
Cards;
1 10 20 30 40 50 60 70 80 90 100 0
2 15 20 25 30 35 40 45 50 55 110 1
3 25 30 25 30 35 40 35 50 50 130 0
4 25 30 25 30 35 20 35 50 25 100 0
5 10 30 45 30 35 40 45 50 55 150 1
and so on
;
Run;
ballardw
Super User

This would select 25 records from each stratum of Ind_touch_failure, provided there are at least 25 of each. That means the two levels would have "equal proportion" in the selected set, i.e. 50% each.

Proc sort data=rawtbl;
   by Ind_touch_failure;
run;

proc surveyselect data=rawtbl out=want
   sampsize =25;
   strata ind_touch_failure;
run;

If you want a repeatable selection, set the SEED= option; otherwise you will likely get a different set each time you rerun the code.
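
For the 70/30 split asked about at the top of the thread, a rate-based variant may be closer to what is needed than a fixed SAMPSIZE. The sketch below is an illustration under assumptions, not code tested on the real data: it uses SAMPRATE=0.7 to draw 70% of each stratum, SEED= for repeatability, and OUTALL to keep every row with a 0/1 Selected flag. The data set names rawtbl_sorted, split, and wanted are made up.

```sas
/* STRATA requires the input to be sorted by the strata variable */
proc sort data=rawtbl out=rawtbl_sorted;
   by Ind_Touch_Failue;
run;

proc surveyselect data=rawtbl_sorted out=split
   method=srs      /* simple random sampling within each stratum */
   samprate=0.7    /* 70% of each failure level is selected */
   seed=1234       /* fixed seed so reruns give the same split */
   outall;         /* keep all rows; Selected=1 marks the sampled ones */
   strata Ind_Touch_Failue;
run;

data wanted;
   set split;
   outsample = 1 - Selected;   /* Selected=1 -> in-sample (outsample=0) */
run;
```

Since the same rate is applied inside each failure level, the failure proportion in the selected and unselected rows should match, apart from rounding of the stratum sizes.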

Ronein
Meteorite | Level 14
I don't understand what you did here.
The real source data set contains 40,000 rows.
1,500 of them have a failure (Ind_Touch_Failue=1) and 38,500 have no failure (Ind_Touch_Failue=0).
I want to divide the rows into 2 populations:
In-Sample (25% of observations), so 25% of 40,000 is 10,000 rows.
Out-Sample (75% of observations), so 75% of 40,000 is 30,000 rows.
The only issue is that I need to add one more criterion to the division into 2 populations.
I need the proportion of failures in the 2 populations to be equal!
What is the code to do it, please?
The code below doesn't take the requirement of an equal proportion of failures into consideration.

data wanted;
set have;
randomPop=ranuni(1234); /* uniform random number with a fixed seed */
if randomPop>=0.3 then outsample=0;
else outsample=1;
run;
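
With the figures given above, equal failure proportions pin down the counts: 25% of the 1,500 failures is 375 and 25% of the 38,500 non-failures is 9,625, which together make the 10,000 in-sample rows. The remaining 1,125 failures and 28,875 non-failures make the 30,000 out-sample rows, so both samples have a failure rate of 1,500/40,000 = 3.75%. One possible way to request exactly those counts is to pass per-stratum sizes to SAMPSIZE= in PROC SURVEYSELECT. This is a sketch only; the data set names have_sorted and split are made up for the example.

```sas
proc sort data=have out=have_sorted;
   by Ind_Touch_Failue;   /* strata order becomes 0 (no failure), then 1 (failure) */
run;

proc surveyselect data=have_sorted out=split
   method=srs
   sampsize=(9625 375)    /* in-sample rows per stratum, in sorted strata order */
   seed=1234
   outall;                /* Selected=1 -> in-sample, Selected=0 -> out-sample */
   strata Ind_Touch_Failue;
run;

/* Check: row percentages give the failure rate within each sample */
proc freq data=split;
   tables Selected * Ind_Touch_Failue / nocol nopercent;
run;
```

The final PROC FREQ is there so the split can be verified rather than assumed: the row percentage for Ind_Touch_Failue=1 should be the same in both Selected rows.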

Ronein
Meteorite | Level 14

This code didn't work.

Why did you write sampsize=25?

My source data set has 40,000 observations.

ballardw
Super User

@Ronein wrote:

This code didn't work.

Why did you write sampsize=25?

My source data set has 40,000 observations.


1) Dummy code for dummy data. I explained that it would select 25 records from each stratum.

2) When I posted that, you had not said anything about the size of your data set, so I picked a small number hoping that it would at least run.

 

I do not know what you are attempting. I think that you need to consider generating a small example by hand of what you expect from a given example data set.

mkeintz
PROC Star

@Ronein wrote:
Data Rawtbl;
Input CustomerID X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 Ind_Touch_Failue;
Cards;
1 10 20 30 40 50 60 70 80 90 100 0
2 15 20 25 30 35 40 45 50 55 110 1
3 25 30 25 30 35 40 35 50 50 130 0
4 25 30 25 30 35 20 35 50 25 100 0
5 10 30 45 30 35 40 45 50 55 150 1
and so on
;
Run;

So what about this data indicates "following period"? In fact, what indicates the current period?

 

Ronein
Meteorite | Level 14
X1 through X10 hold information from the current period (the base month), and Ind_Touch_Failue indicates whether the customer "touched" failure in the next 12 months (the following period).
