
A Data Science Approach to Home Loan Approval: Gradient Boosting and Random Forest in SAS Viya


In this demonstration, we will predict the likelihood of a customer purchasing a home based on property and financial features. The property details include location, size, price, type of home, and amenities. The data, provided through Kaggle, consists of 200,000 home records covering more than 20 countries, and the target variable (decision) is a binary classification (buy/not buy). This demonstration illustrates data preprocessing techniques, data imputation, and two machine learning procedures, PROC GRADBOOST and PROC FOREST, for predictive modeling.

 

 

Libref & Data Loading

 

First, let's create LIBNAME statements to point to our existing dataset (local) and to an output location (out) for any future datasets we create. We will also start a Cloud Analytic Services (CAS) session so that data can be saved and processed in memory on the SAS Viya cloud-enabled platform.

 

libname local '/Dee.McKoy@sas.com/Blgdata/data';
libname out '/Dee.McKoy@sas.com/Blgdata/output';

/* Start CAS session and assign all caslibs as librefs */
cas;
caslib _all_ assign;

 

To gain some insight into the data, let's take some time to clean it and perform some feature engineering. We use the PROC IMPORT procedure to load the CSV file and save it to the out library created in the previous step.

 

proc import
    datafile="/Dee.McKoy@sas.com/Blgdata/data/global_house_purchase_dataset.csv"
    out=out.global_hp_data dbms=csv replace;
    getnames=yes;
run;

 

01_DMcK_nov1.png


 

The image above was created with the PROC PRINT procedure, displaying the first 10 observations, which gives a minimal first look at the data.
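The post does not show that display step itself; a minimal sketch that would produce a similar view (the OBS= dataset option limits the rows printed) is:

proc print data=out.global_hp_data(obs=10);
run;

Next, let's use the PROC CONTENTS procedure to gain more knowledge about the dataset and its structure.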

 

proc contents data=out.global_hp_data; 
run;

 

02_DMcK_nov2.png

 

The above figure lists the variables and their attributes: the variable name, type, column length, and the format and informat of each column. I notice that the city and country columns could be combined into one single column. We also need to change the columns that hold monetary values so that they display as comma-separated currency.

 

 

Data Preprocessing and Data Manipulation

 

/****** Data Preprocessing and Data Manipulation ******/
data local.clean_bank_data;
    set out.global_hp_data;
    /* Display monetary columns as currency */
    format price customer_salary monthly_expenses down_payment loan_amount dollar12.;
    /* Combine city and country into one column */
    Location = catx(', ', city, country);
run;

 

In the above snippet of code, we use a DATA step to create a new table, clean_bank_data, saved to the local library. The first transformation changes the formats of price, customer_salary, monthly_expenses, down_payment, and loan_amount from BEST12. to DOLLAR12., which writes numeric values as currency with a total width of 12 characters. The final preprocessing step for this table is to combine the city and country columns into one column labeled Location. To achieve this, we use the CATX function, which removes leading and trailing blanks from each item and inserts a specified delimiter between the values being combined.
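As a quick illustration of how CATX behaves (a hypothetical one-off step, not part of the pipeline):

data _null_;
    /* CATX strips leading/trailing blanks and joins with the delimiter */
    loc = catx(', ', '  Paris ', 'France');
    put loc=;   /* writes loc=Paris, France to the log */
run;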

 

In this next step of the preprocessing stage, we need to handle data entries so that the relationship between the borrowed amount and the down payment makes logical sense. For example, some entries have a down payment that exceeds the loan amount; it would make little sense for a customer to put down more than the entire loan. We handle this by removing the records whose loan amount is lower than the down payment.

 

03_DMcK_nov3.png

 

/* Remove records where the down payment exceeds the loan amount */
data local.clean_bank_data;
    set local.clean_bank_data;
    if loan_amount < down_payment then delete;
run;

 

Now that the data preprocessing stage is complete, it's time to create some data exploration visualizations. The first will be a scatter plot of housing price versus property square footage.

 

/*House Price vs. Property Size (Scatter Plot)*/
ods graphics / reset width=6.4in height=4.8in;
proc sgplot data=local.clean_bank_data;
title "House Price vs. Property Size (sqft)";
/* Create the scatter plot */
scatter x=property_size_sqft y=price;
/* Add a regression line for trend estimation */
reg x=property_size_sqft y=price / lineattrs=(color=red thickness=2);
xaxis label="Property Size (sqft)";
yaxis label="Price";
run; 

nov4.png

 


 

The above figure is a scatter plot of house price versus property square footage. As property size increases, prices trend upward along the regression line (red), which shows the line of best fit for the relationship between house price and property size.
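To put a number on that visual trend, one could compute the Pearson correlation between the two variables (a supplementary check, not shown in the original post):

proc corr data=local.clean_bank_data pearson;
    var property_size_sqft price;
run;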

 

04_DMcK_nov5.png

 

/* Price Distribution by Country (Box Plot) */
ods graphics / reset width=6.4in height=4.8in;

proc sgplot data=local.clean_bank_data;
title "Distribution of Home Price by Country";
/* Use VBOX to create side-by-side box plots */
vbox price / category=country;
xaxis display=(nolabel);
yaxis label="Price";
run;

 

The above illustration shows the distribution of housing prices within each country. We notice that Singapore has the highest median home price, the United Arab Emirates (UAE) has the second highest, and India has the lowest median home price in the data provided.
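To verify those medians numerically, a quick summary by country (a supplementary check, not shown in the original post):

proc means data=local.clean_bank_data median maxdec=0;
    class country;
    var price;
run;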

 

 

Machine Learning Algorithms

 

The data has been prepared, and we are now ready to start building some machine learning models. The first model we will look at is the gradient boosting model (PROC GRADBOOST). Gradient boosting is an ensemble learning method that builds a strong predictive model by sequentially combining multiple "weak" models, typically decision trees.
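The model code below reads a partitioned table, out.bank_data_part, whose creation is not shown in the post. One common way to add the _PartInd_ indicator is PROC PARTITION; the sketch below assumes the cleaned table has been loaded into a CAS-enabled library (the Viya modeling procedures run on in-memory CAS tables) and assumes a 70/30 train/validation split:

/* Sampled rows get _PartInd_=1 (train); the rest get 0 (validate) */
proc partition data=out.clean_bank_data samppct=70 seed=919 partind;
    output out=out.bank_data_part copyvars=(_all_);
run;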

 

%let int_inputs = 
property_size_sqft price constructed_year previous_owners rooms bathrooms 
garage garden crime_cases_reported legal_cases_on_property customer_salary 
loan_amount loan_tenure_years monthly_expenses down_payment emi_to_income_ratio 
satisfaction_score neighbourhood_rating connectivity_score 
country_code city_code property_type_code furnishing_code;

proc gradboost data=out.bank_data_part;
    partition rolevar=_PartInd_ (validate='0' train='1');
    /* decision is binary (buy / not buy), so it is modeled as a nominal target */
    target decision / level=nominal;
    input &int_inputs.;
    /* Tune tree count, sampling rate, learning rate, and regularization;
       misclassification is the tuning objective for the nominal target */
    autotune tuningparameters=(ntrees samplingrate
        vars_to_try(init=24) learningrate lasso ridge)
        targetevent='1' objective=MISC;
    ods output FitStatistics=out._Gradboost_FitStats_
               VariableImportance=out._Gradboost_VarImp_;
run;

 

PROC GRADBOOST

 

The above %LET statement stores the input variables in a macro variable called "int_inputs" so they can be referenced later in the PROC GRADBOOST procedure. PROC GRADBOOST builds a model from multiple decision trees; a predictive model defines a relationship between the input variables and a target variable. The target variable is decision, a binary target indicating whether a customer is approved for a home loan globally. We use the AUTOTUNE statement to find the best combination of values for the listed options: the number of trees, the sampling rate, the learning rate, and the lasso and ridge regularization parameters. The VARS_TO_TRY option controls how many variables are considered at each split during tree growth, and we initialize it at 24. Finally, we output the variable importance table and the fit statistics.
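Because the ODS OUTPUT statement captured those results as datasets, they can be inspected or reused directly; for example, a quick look at the stored fit statistics:

proc print data=out._Gradboost_FitStats_;
run;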

 

05_DMcK_nov6.png

 

06_McK_nov7.png

 

proc sgplot data=out._Gradboost_VarImp_;
title3 'Variable Importance';
hbar variable / response=importance nostatlabel categoryorder=respdesc;
xaxis label="Percentage of Importance";
run;

 

From the above illustration, we see that satisfaction_score has the strongest relationship to the target variable, followed by legal_cases_on_property and emi_to_income_ratio, while crime_cases_reported has the weakest relationship among the leading predictors.

 

 

PROC FOREST

 

The PROC FOREST procedure creates an ensemble of decision trees to predict a single target of either interval or nominal measurement level. An input variable can have an interval or nominal measurement level. The PROC FOREST procedure ignores any observation from the training data that has a missing target value.

 

proc forest data=out.bank_data_part ntrees=100 maxdepth=30 inbagfraction=0.5
            minleafsize=25;
    partition rolevar=_PartInd_ (validate='0' train='1');
    target decision / level=nominal;
    input &int_inputs.;
    ods output FitStatistics=out._Forest_FitStats_
               VariableImportance=out._Forest_VarImp_;
run;

 

The PROC FOREST statement starts by setting hyperparameters: it grows 100 decision trees (ntrees=100), restricts the maximum depth of each tree to 30 (maxdepth=30), and uses 50% of the training data for the bootstrap sample of each tree (inbagfraction=0.5). To prevent overfitting, the minimum number of observations required to form a leaf is set to 25 (minleafsize=25). The goal is to predict the categorical variable decision, treating it as a nominal (binary) target. The INPUT statement uses the macro variable &int_inputs. to designate the predictor variables. The PARTITION statement is crucial: it uses the _PartInd_ variable to split the data into training and validation sets, ensuring a reliable assessment of the model's performance on unseen data.
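To reuse the trained forest outside this step, the model could be saved as an analytic store and applied to new data with PROC ASTORE. A sketch, assuming a SAVESTATE statement is added to the PROC FOREST step above (the store and input table names here are hypothetical):

/* In the PROC FOREST step above, add: savestate rstore=out.forest_store; */
proc astore;
    score data=out.new_applications rstore=out.forest_store
          out=out.scored_applications copyvars=(_all_);
run;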

 

07_McK_nov8.png

 

From the above table, we notice that the most significant variable by a large margin is satisfaction_score, which has the highest raw importance value and a relative importance of 1.0000. Following it are legal_cases_on_property and crime_cases_reported, the next two most important factors, though their relative importance scores are significantly lower at 0.3495 and 0.1791, respectively. This suggests that while all features contribute, the model relies overwhelmingly on satisfaction_score, followed by these two risk-related variables, with the remaining variables contributing very little to the model's predictive power.

 

08_McK_nov9.png

 

The above illustration shows the relationship between the misclassification rate and the number of trees in the forest model. Initially, as the number of trees increases from 0, all three error rates (training, validation, and out-of-bag, or OOB) drop rapidly, indicating a quick improvement in model performance. The error rates stabilize and converge around 20 trees, all hovering below 0.002. The OOB ASE (dashed orange line) consistently remains the highest, while the training ASE (solid blue line) is the lowest, which is expected. After about 20 trees, adding more trees yields only marginal improvements, suggesting that the model has converged and that 100 trees is sufficient, as the error rate remains consistently low and stable on the validation set.
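The numbers behind this plot are in the fit statistics table captured earlier, which can be inspected directly:

proc print data=out._Forest_FitStats_(obs=10);
run;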

 

 

09_McK_nov10.png

 

The chart shows the variable importance scores derived from the forest model, illustrating which input features contribute most to its predictions. The variable satisfaction_score is overwhelmingly the most important factor, with its bar extending far beyond all others. The second most important variable is legal_cases_on_property, which shows substantial importance but is still less than half that of satisfaction_score. Following this, crime_cases_reported and emi_to_income_ratio have moderate, comparable importance.

 

 

Conclusion

 

In conclusion, leveraging the advanced ensemble methods Gradient Boosting and Random Forest within SAS Viya to predict the likelihood of global property purchase approval provides great insight into the key variables that contribute to a customer's approval odds. A key finding from both models' variable importance analysis is that satisfaction_score is the overwhelming primary driver of the approval decision, followed by risk factors like legal_cases_on_property and crime_cases_reported, underscoring the critical role of non-financial, risk-based signals. Model evaluation confirmed stability, with the Random Forest converging by around 20 trees. The resulting models offer a robust, data-driven framework for lenders; future work should include a definitive model performance comparison and further investigation of the interactions among the top predictive variables.

 

For more information:

 

 

 

Find more articles from SAS Global Enablement and Learning here.
