Building models with SAS Enterprise Miner, SAS Factory Miner, SAS Visual Data Mining and Machine Learning or just with programming

Frequent Contributor
Posts: 100

Hi,

I have a population (A+B) of 1,812,507 customers. I would like to take a sample stratified by the variable Decile. I want to keep all the customers in Targeting A and sample the customers in Targeting B so that the distribution of Decile in the Targeting B sample is similar to its distribution in Targeting A. For example, for decile 0 I would like a proportion of 4% in Targeting B; at the moment it is 7% (please see the reports below). I am not sure how to proceed in Enterprise Miner. Your help would be much appreciated.

Many Thanks

Targeting  Decile  Frequency  Percent
A+B        0         107,947       6%
A+B        1         125,467       7%
A+B        2         137,295       8%
A+B        3         148,162       8%
A+B        4         162,287       9%
A+B        5         179,042      10%
A+B        6         202,198      11%
A+B        7         226,884      13%
A+B        8         259,218      14%
A+B        9         264,007      15%
Total              1,812,507

Targeting  Decile  Frequency  Percent
A          0          30,377       4%
A          1          35,011       5%
A          2          44,019       6%
A          3          62,457       9%
A          4          68,468      10%
A          5          75,773      11%
A          6          85,504      12%
A          7          95,146      14%
A          8         103,738      15%
A          9          94,490      14%
Total                694,983

Targeting  Decile  Frequency  Percent
B          0          77,570       7%
B          1          90,456       8%
B          2          93,276       8%
B          3          85,705       8%
B          4          93,819       8%
B          5         103,269       9%
B          6         116,694      10%
B          7         131,738      12%
B          8         155,480      14%
B          9         169,517      15%
Total              1,117,524
SAS Employee
Posts: 231

It would be helpful if you could provide some context on how Decile is being formed, as I'm not sure I understand what you are trying to accomplish with your sampling. The common use of the word Decile refers to 10% groupings of your data, but that would place around 69,500 observations in each of your A deciles and around 111,750 observations in each of your B deciles. Instead, your A decile frequencies range from around 30,400 to 103,700 and your B decile frequencies range from around 77,500 to 169,500.

If the target variable is a class variable, then the sample is stratified on the target variable by default in the Sampling node of SAS Enterprise Miner. Otherwise, random sampling is performed by default. You can also add a stratification variable (Decile, for instance), but as you increase the number of stratification variables, you might find that the sample can be balanced with respect to one stratification variable or another but not with respect to both simultaneously.
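To make the arithmetic behind a decile-matched sample concrete, here is a minimal sketch in Python rather than Enterprise Miner. It assumes the goal is the largest possible sample from B whose Decile proportions match those of A, using the frequencies posted above. The function names `matched_quotas` and `stratified_sample` are hypothetical helpers written for this illustration, not part of any SAS product.

```python
import random
from collections import defaultdict

# Decile frequencies taken from the reports in the question.
a_counts = {0: 30377, 1: 35011, 2: 44019, 3: 62457, 4: 68468,
            5: 75773, 6: 85504, 7: 95146, 8: 103738, 9: 94490}
b_counts = {0: 77570, 1: 90456, 2: 93276, 3: 85705, 4: 93819,
            5: 103269, 6: 116694, 7: 131738, 8: 155480, 9: 169517}


def matched_quotas(target_counts, pool_counts):
    """Per-stratum sample sizes from the pool that follow the target's proportions.

    The scale factor is limited by the stratum where the pool is scarcest
    relative to the target, so no quota exceeds what the pool contains.
    """
    scale = min(pool_counts[s] / target_counts[s] for s in target_counts)
    return {s: min(int(scale * target_counts[s]), pool_counts[s])
            for s in target_counts}


def stratified_sample(records, stratum_of, quotas, seed=12345):
    """Draw quotas[s] records without replacement from each stratum s."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for record in records:
        by_stratum[stratum_of(record)].append(record)
    sample = []
    for stratum, quota in quotas.items():
        sample.extend(rng.sample(by_stratum[stratum], quota))
    return sample


quotas = matched_quotas(a_counts, b_counts)
total = sum(quotas.values())
for decile in sorted(quotas):
    print(decile, quotas[decile], f"{quotas[decile] / total:.1%}")
```

With these frequencies the binding stratum is decile 5, where B has the fewest customers relative to A, so roughly 15% of B has to be dropped to reproduce A's proportions. A smaller sample with the same proportions can be obtained by multiplying every quota by a common fraction.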

It would also be helpful to understand why those percentages need to be balanced. It would not normally be critical for each group to have the same percentage. It is also not clear why A and B need to be modeled together when modeling them separately might produce much better results. Any additional information would help in providing a more detailed response.

Thanks!

Doug
