Predicting Customer Lifetime Value (CLV) with SAS Survival Analysis

In advanced analytics, raw data is rarely model-ready. While algorithms like Gradient Boosting, Logistic Regression, and Neural Networks are powerful, they are limited by the quality of their inputs. Feature engineering bridges this gap, using domain expertise to transform raw variables into high-value "signals."

In this post, we move beyond basic cleaning to demonstrate feature engineering techniques using the SASHELP.HEART dataset. By adding logarithmic transformations, an interaction term, a ratio, and binned indicators, we create a richer feature set that helps models capture relationships that the raw variables alone would miss.

Loading the Data

We start with two LIBNAME statements that point to the directories where the input data and the output tables are stored.

/*# Create LIBNAME Statement */;
libname Hr '/cisviya-export/cisviya/homes/Dee.McKoy@sas.com/Blgdata/data';
libname output '/cisviya-export/cisviya/homes/Dee.McKoy@sas.com/Blgdata/output_new';

Each LIBNAME statement assigns a short library reference (Hr and output) to a directory path, so later steps can refer to tables without repeating the full path. The next step is to complete the feature engineering before running our supervised machine learning models.

data Hr.Heart_Prep;
    set sashelp.heart;
    /****  Log transformation to Stabilize Variance *****/;
    if Cholesterol > 0 then Log_Chol=log(Cholesterol);
    if Weight > 0 then Log_Weight=log(Weight);
    /**** Compound Risk Age vs Cholesterol *****/;
    Age_Chol_Interaction=AgeAtStart * Cholesterol;
    if height > 0 then Weight_Height_Ratio=Weight / height;
    /******** Binary Encoding of Target Variable *******/;
    if Status='Dead' then Death_Event=1;  
    else Death_Event=0;
    where cmiss(AgeAtStart, Cholesterol, Weight, Smoking)=0;
run;

In the above code, we start by creating a new table, Hr.Heart_Prep, to hold the feature-engineered data. Next, we create “Log_Chol” and “Log_Weight”. The log transformation compresses large values, which reduces the influence of extreme outliers and makes the distributions more symmetrical. We then create “Age_Chol_Interaction”, the product of AgeAtStart and Cholesterol, which lets a model capture the possibility that the effect of cholesterol on risk depends on age. AgeAtStart is the age of the participant when they entered the Framingham Heart Study and is a baseline measure for tracking cardiovascular health over time. The next variable is Weight_Height_Ratio, which relates weight to body size and is therefore more informative than weight alone. The final engineered variable is “Death_Event”: it is 1 if the patient is deceased and 0 if the patient is still alive. Finally, the WHERE statement uses the CMISS function, which counts how many of its arguments (numeric or character) are missing; requiring a count of 0 keeps only the rows in which AgeAtStart, Cholesterol, Weight, and Smoking are all present.
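To make two of the steps above concrete, the following sketch first shows how CMISS counts missing arguments on a single toy row, and then compares the skewness of the raw and log-transformed variables. The names a, b, c, and n_missing are hypothetical and used only for illustration; the second step assumes Hr.Heart_Prep was created as shown above.

```sas
/* Toy example: CMISS returns the number of missing arguments */
data _null_;
    a = .;      /* missing numeric value */
    b = 5;      /* non-missing numeric value */
    c = ' ';    /* missing character value */
    n_missing = cmiss(a, b, c);
    put n_missing=;   /* writes n_missing=2 to the log */
run;

/* Compare skewness before and after the log transformation */
proc means data=Hr.Heart_Prep n mean skewness maxdec=3;
    var Cholesterol Log_Chol Weight Log_Weight;
run;
```

If the transformation is doing its job, the skewness of each log variable should be closer to 0 than that of its raw counterpart.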

The figure above shows the first 10 rows of our data. At the end of the dataset we see the new variable that we created, “Death_Event”: it is 1 if the patient is deceased and 0 if the patient is still alive. We need to make one more change: binning the variable “AgeAtStart”. Grouping the individual ages into meaningful categories simplifies the data, can reveal patterns, and improves visualization.

/************ Create Bin Variables for Age  ***************/;

data HR.Heart_AgeBin_Final;
    set HR.HEART_PREP;
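        /* Each comparison returns 1 (true) or 0 (false); AND keeps the bins mutually exclusive */
        IM_bin_Agethr28to39 = (AgeAtStart >= 28 and AgeAtStart < 39);
        IM_bin_Agethr39to50 = (AgeAtStart >= 39 and AgeAtStart < 50);
        IM_bin_Agethr50to61 = (AgeAtStart >= 50 and AgeAtStart < 61);
        /* Reference bin (61 and over), intentionally left commented out */
        *IM_bin_Ageover61    = (AgeAtStart >= 61);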
run;

The above snippet of code shows the bins created for the “Heart_AgeBin_Final” table; AgeAtStart ranges from 28 to 62. We define four age bins, which simplifies the analysis by grouping individual ages into more meaningful categories, but we create indicator variables for only three of them. The last bin (61 and over) is commented out so that it serves as the baseline, or reference group. This avoids the dummy variable trap in regression models, a form of perfect multicollinearity that arises when an indicator for every category is included alongside an intercept. With one category left out, the effect of each remaining category can be estimated relative to the baseline. Now we can start to build the machine learning models: first a gradient boosting model, followed by a logistic regression model.
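A quick way to confirm that the bins behave as intended is to add up the indicators for each row. This is a minimal sketch; the table work.bin_check and the variable Bin_Sum are hypothetical names introduced here for illustration.

```sas
/* Sum the three indicator variables for every row */
data work.bin_check;
    set HR.Heart_AgeBin_Final;
    Bin_Sum = sum(IM_bin_Agethr28to39, IM_bin_Agethr39to50, IM_bin_Agethr50to61);
run;

/* Mutually exclusive bins give only the values 0 and 1 */
proc freq data=work.bin_check;
    tables Bin_Sum;
run;
```

A value of 0 marks rows that fall in the reference group, and a value of 1 marks rows that fall in exactly one of the three bins. Any value above 1 would mean the bin definitions overlap.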

Supervised Machine Learning Model

PROC GRADBOOST

proc gradboost data=HR.Heart_AgeBin_Final ntrees=200 seed=12345;
target Death_Event / level=nominal;
input AgeAtStart Cholesterol Weight height Age_Chol_Interaction
Weight_Height_Ratio / level=interval;
input Sex Chol_Status / level=nominal;
/*************   Output Table of Importance   *****************/;
ods output VariableImportance=output.VarImportance;
run;

The above code uses PROC GRADBOOST to build a gradient boosting model that predicts the categorical outcome Death_Event. It trains 200 trees and sets a seed so that the results are reproducible. The model takes a mix of continuous inputs, including engineered features such as Age_Chol_Interaction, and categorical inputs such as Sex. Finally, it saves the variable importance metrics through ODS OUTPUT, so you can identify the primary drivers behind the model's predictions.

Model Information Summary

The Model Information table outlines the hyperparameters used to train the model. Key settings include 200 trees, a very conservative learning rate of 0.0001, and a subsampling rate of 0.5. The trees are relatively shallow, with a maximum depth of 4 and an average of 15.475 leaves per tree. The table also shows that a Ridge (L2) penalty of 1 was applied for regularization and that a specific seed (12345) was used so the results are reproducible.
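The earlier PROC GRADBOOST call sets only NTREES= and SEED=, so the remaining settings come from whatever values the procedure used in that run. If you want the settings fixed in the code itself, one option is to state them explicitly, as in this sketch. The values are copied from the Model Information description above and are not tuning recommendations.

```sas
proc gradboost data=HR.Heart_AgeBin_Final
        ntrees=200           /* number of trees */
        learningrate=0.0001  /* learning rate as reported in the table */
        samplingrate=0.5     /* fraction of rows sampled for each tree */
        maxdepth=4           /* maximum depth of each tree */
        ridge=1              /* L2 regularization penalty */
        seed=12345;          /* fixed seed for reproducibility */
    target Death_Event / level=nominal;
    input AgeAtStart Cholesterol Weight height Age_Chol_Interaction
          Weight_Height_Ratio / level=interval;
    input Sex Chol_Status / level=nominal;
run;
```

Writing the options out this way documents the configuration alongside the code and makes it easier to change one setting at a time.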

Variable Importance Summary

The Variable Importance table ranks the input features by their contribution to the model. AgeAtStart is by far the most influential, with a relative importance of 1.0000. It is followed by Sex and the engineered Age_Chol_Interaction feature, though their impact is substantially lower. Cholesterol and Chol_Status contribute the least, with the latter showing negligible relative importance. Below, we visualize the variable importance in a bar chart.
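Before plotting, it can help to look at the saved table directly. This sketch assumes output.VarImportance contains the columns Variable and RelativeImportance, which are the two columns the plotting code relies on; work.VarImportance_Sorted is a hypothetical table name.

```sas
/* Rank the features from most to least important */
proc sort data=output.VarImportance out=work.VarImportance_Sorted;
    by descending RelativeImportance;
run;

/* Print the five highest-ranked features */
proc print data=work.VarImportance_Sorted(obs=5) noobs;
    var Variable RelativeImportance;
run;
```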

/*************  Visualization of the Top Predictors ***********/;
proc sgplot data=output.VarImportance;
    title "Predictive Power of Engineered Features";
    hbar Variable / response=RelativeImportance
                   categoryorder=respdesc
                   fillattrs=(color=CX1b75bc);
    xaxis label="Relative Importance (Scaled to 1)";
    yaxis label="Engineered & Raw Features";
run;

The above illustration shows the relative importance of each input variable for predicting the target variable “Death_Event”. “AgeAtStart” has a relative importance of 1, which means it is the top-ranked predictor and the one against which the other variables are scaled. This measures how much the model relies on the variable; it is not a correlation coefficient. The next most important variable is Sex, followed by Age_Chol_Interaction. The next model we look at is logistic regression. Gradient boosting is well suited to pure predictive accuracy, but logistic regression provides odds ratios. An odds ratio quantifies how the odds of death change with each predictor, which gives the direction and size of each effect and can reveal useful trends or patterns in the data.

PROC LOGISTIC

The key difference between PROC LOGISTIC and PROC GRADBOOST is that PROC LOGISTIC requires you to list nominal variables explicitly in a CLASS statement so that it can create the necessary dummy variables. The mathematical foundation of the model is the logit link function.
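In its standard textbook form, the logit link and the resulting odds ratio can be written as follows. The symbols p, x_1 through x_k, and the coefficients β are notation assumed here for illustration.

```latex
\[
\operatorname{logit}(p) = \ln\!\left(\frac{p}{1-p}\right)
  = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,
\qquad p = P(\text{Death\_Event} = 1 \mid x_1, \dots, x_k)
\]
\[
\text{OR}_j = e^{\beta_j}
\]
```

Here OR_j is the factor by which the odds of the event are multiplied when x_j increases by one unit and the other predictors are held fixed.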

The logit link function is a mathematical transformation used in statistics to convert probabilities, which are bounded between 0 and 1, into log-odds, which range from -∞ to +∞. This allows the model to express the log-odds of the event (Death_Event = 1) as a linear function of the predictors.
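The plotting step that follows reads a table named output.Heart_OddsRatios, but the step that creates it is not shown. The sketch below is one way such a table could be produced. The predictor list, the choice of Male as the reference level, and the PARAM=REF coding are assumptions made for illustration, not the original model specification.

```sas
proc logistic data=HR.Heart_AgeBin_Final;
    /* Nominal inputs must be declared in a CLASS statement */
    class Sex (ref='Male') / param=ref;
    /* Model the probability that Death_Event equals 1 */
    model Death_Event(event='1') = AgeAtStart Weight_Height_Ratio Log_Chol Sex;
    /* Save the odds ratio estimates and their confidence limits */
    ods output OddsRatios=output.Heart_OddsRatios;
run;
```

The saved OddsRatios table provides the Effect, OddsRatioEst, LowerCL, and UpperCL columns that the plot uses.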

/* --- Step 2: Custom Visualization of Predictor Impact --- */
proc sgplot data=output.Heart_OddsRatios;
    scatter y=Effect x=OddsRatioEst / xerrorlower=LowerCL xerrorupper=UpperCL
            markerattrs=(symbol=DiamondFilled size=10px);
    refline 1 / axis=x lineattrs=(pattern=ShortDash color=gray);
    xaxis label="Odds Ratio (95% Confidence Limits)" grid;
    yaxis label="Predictor Variables" grid;
    title "Predictor Impact on Death Event (Odds Ratios)";
run;

The logistic regression results (odds ratio plot) provide the essential directional context: they show that AgeAtStart and Weight_Height_Ratio both increase the odds of a Death Event (odds ratios greater than 1), whereas being Female acts as a protective factor (odds ratio less than 1). By combining these two methods, you gain the predictive accuracy of gradient boosting alongside the transparent, actionable insights needed for strategic clinical or business decisions.

Conclusion

By implementing feature engineering techniques that include an interaction term, logarithmic transformations, and ratio calculations, we turned the SASHELP.HEART dataset into a richer feature set for predictive modeling. Gradient boosting optimized predictive accuracy and identified AgeAtStart as the leading feature, while logistic regression provided interpretability through odds ratios. Using the two methods together combines non-linear modeling with statistical transparency and supports data-driven clinical and strategic decisions. With a larger dataset and more patient information, we could improve the model and gain more meaningful insight.

For information:

Find more articles from SAS Global Enablement and Learning here.