## A Guide to Logistic Regression in SAS

Started ‎06-11-2019 by
Modified ‎05-03-2022 by
Views 85,729

## What is logistic regression?

Logistic regression is a supervised machine learning classification algorithm that is used to predict the probability of a categorical dependent variable. The dependent variable is a binary variable that contains data coded as 1 (yes/true) or 0 (no/false), used as Binary classifier (not in regression). Logistic regression can make use of large numbers of features including continuous and discrete variables and non-linear features. In Logistic Regression, the Sigmoid (aka Logistic) Function is used.

We want a model that predicts probabilities between 0 and 1, that is, S-shaped. There are lots of S-shaped curves. We use the logistic model: Probability = 1 / [1 +exp (B0 + b1X)] or loge[P/(1-P)] = B0 +B1X. The function on left, loge[P/(1-P)], is called the logistic function.

SAS Trainer Christa Cody presents an overview of logistic regression in this tutorial

#### Building a Logistic Model by using SAS Enterprise Guide

I am using Titanic dataset from Kaggle.com which contains a training and test dataset. Here, we will try to predict the classification — Survived or deceased. Our target variable is ‘survived’. I am using SAS Enterprise guide to analyze this dataset.

``````Setting the library path and importing the dataset using proc import

/* Setting the library path */
%let path=C:\dev\projects\SAS\PRACDATA;
libname PRAC “&path”;
/* Importing dataset using proc import */
proc import datafile = “C:/dev/projects/sas/pracdata/train.csv”
out = PRAC.titanic
dbms = CSV;
run;

Checking the contents of the dataset by using proc contents function

/* Checking the contents of the data*/
proc contents data=work.train;
run; ``````

We have 12 variables. Our target variable is ‘Survived’ which has 1 and 0. 1 for survived and 0 for not survived. Category variables: Cabin, sex, Pclass. Numeric Variables: Passenger ID, SibSp, Parch, Survived, Age and Fare. Text variable: Ticket and Name

#### Checking the frequency of the target variable ‘Survived’ by using proc frequency

``````/* Checking the frequency of the Target Variable Survived */

proc freq data=work.train;

table Survived;

run;``````

We can clearly see that 342 people were survived and 549 people are not survived. A total number of observations = 891.

``````title "Survived vs Gender";
proc sgplot data=prac.titanic pctlevel=group;
vbar sex / group=Survived stat=percent missing;
label Embarked = "Passenger Embarking Port";
run;``````

## Data Visualization

Normally, it is good practice to research with the data by using visualization. I am using `proc sgplot` to visualize the class, `Embark`.

``````title "Analysis of embarkation locations";
proc sgplot data=prac.titanic;
vbar Embarked / datalabel missing;
label Embarked = "Passenger Embarking Port";
run;``````

Nothing unusual can be seen in value distributions. Let’s analyze survived the rate with other variables.

``````title "Survived vs Gender";
proc sgplot data=prac.titanic pctlevel=group;
vbar sex / group=Survived stat=percent missing;
label Embarked = "Passenger Embarking Port";
run;``````

Here, we see a trend that more females survived than males. People travelled in class 3 died the most. Still, there are many ways to visualize the data. I am not going into detail.

#### Checking the missing values by using proc means

``````/* Checking the missing value and Statistics of the dataset */
proc means data=work.train N Nmiss mean std min P1 P5 P10 P25 P50 P75 P90 P95 P99 max;
run;``````

We can see that Age has 177 missing values and no outliers detected.

``````Checking for categorical variables:
title “Frequency tables for categorical variables in the training set”;
proc freq data=PRAC.TITANIC nlevels;
tables Survived; tables Sex; tables Pclass; tables SibSp; tables Parch; tables Embarked; tables Cabin;
run;``````

We have missing value in Age, Embarked and Cabin. We need to fill all missing age instead of dropping the missing rows. One way to filling by using mean age. However, we can check the average age by passenger class using a box plot. In SAS, we need to sort it out of the class and age variable before making it a box plot.

``````/* Sorting out the Pclass and Age for creating boxplot */
proc sort data=work.train out=sorted;
by Pclass descending Age;
run;
title ‘Box Plot for Age vs Class’;
proc boxplot data=sorted;
plot Age*Pclass;
run;``````

We can see the wealthier passengers in the higher classes tend to be older, which makes sense. We’ll use these average age values to impute based on Pclass for Age.

``````/* Imputing Mean value for the age column */
data work.train2;
set work.train;
if age=”.” and Pclass = 1 then age = 37;
else if age = “.” and Pclass = 2 then age = 29;
else if age = “.” and Pclass = 3 then age = 24;
run;
``````

I have dropped the cabin variable as I don’t see it is going to impact our model and filled the missing value in ‘embarked’ using the median. (Selected median due to category variable).

#### Data Partition

Splitting the dataset into training and validation by using the 70:30 ratio. First, I need to sort out the data using `proc sort` and splitting by using `proc surveyselect`.

``````/* Splitting the dataset into traning and validation using 70:30 ratio */
proc sort data = prac.train6 out = train_sorted;
by Survived;
run;
proc surveyselect data = train_sorted out = train_survey outall
samprate = 0.7 seed = 12345;
strata Survived;
run;``````

In order to verify the correct data partition, I am generating a frequency table by using `proc freq`.

``````/* Generating frequency table */
proc freq data = train_survey;
tables Selected*Survived;
run;``````

The Selected variable with the value of 1 will our target observation of the training part. Let us also perform quick set processing in order to leave only the columns that are interesting for us and name variables properly.

#### Building Model

We filled all our missing values and our dataset is ready for building a model. I am now creating a logistic regression model by using `proc logistic`. Logistic regression is perfect for building a model for a binary variable. In our case, the target variable is survived.

``````/* Creating Logistic regression model */
proc logistic data=titanic descending;
where part=1;
class Embarked Parch Pclass Sex SibSp Survived;
model Survived(event=’1') = Age Fare Embarked Parch Pclass Sex SibSp /
selection=stepwise expb stb lackfit;
output out = temp p=new;
store titanic_logistic;
run;``````

Good thing in SAS is that for categorical variables, we don’t need to create a dummy variable. Here we are able to declare all of our category variables in a class.

Hi!
Thank you for this nice guide! I'm trying to do the same but with my own data, but I don't really understand how to do the "where part=1;" under the section "building model". Is "part" a variable? And what does the 1 stand for?

Hey Humlan/Other users following this guide

I Hit the same problem with the "where part=1;" in the "Building Model" section.  I change that line to:

where selected = 1;

and that seemed to fix the problem, and give me the same output as the tutorial.

Great guide 🙂 really enjoyed following it through.

Version history
Last update:
‎05-03-2022 01:29 PM
Updated by:
Contributors
Article Labels
Article Tags