
Predictive Maintenance Classification using SAS

Started ‎07-15-2025 by
Modified ‎07-15-2025 by

Predictive maintenance is revolutionizing how industries approach equipment upkeep, shifting from reactive repairs to proactive interventions. By leveraging data analytics and machine learning, organizations can anticipate potential machine failures before they occur, minimizing downtime, reducing maintenance costs, and extending asset lifespans.

 

To start, we need to understand the data we are working with. A better understanding of the variables can give us direction for the machine learning process, but before that we need to load our data into the SAS environment and store it for later use.

 

 

Loading/Preprocessing Data

 

We load our data by creating a LIBNAME statement that points to a folder (with a subfolder for the output data to be stored). Once the libname is created, we can use PROC IMPORT to bring our data into SAS and store the dataset in the library we created with our LIBNAME statement.

 

libname HV "/innovationlab-export/innovationlab/homes/Dee.McKoy@sas.com/demo/hvac_output";

proc import datafile="/innovationlab-export/innovationlab/homes/Dee.McKoy@sas.com/demo/hvac_set.csv"
        out=HV.hvac
        dbms=csv;
run;

proc print data=HV.hvac (obs=10);

run;

 01_DMcK_July1.png

 


 

We have our data loaded, and now we can start on some preprocessing steps. Before we begin, let's look over the data to get a sense of the direction for our data story. Notice that we have 10 columns: uid (unique identifier), productid, type (quality variant), air temperature, process temperature, rotational speed (rpm), torque, tool wear, target, and failure type. For observations 3 and 4 the air temperature data is missing, so we will impute values to remove the missing data. We also want to convert the temperature data from Kelvin to Fahrenheit.
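The cleaning step that follows references the macro variables &mean_air_temp and &mean_process_temp, which hold the column means used for imputation. These are not created automatically; one way to build them (a sketch, assuming the Kelvin column names used in the cleaning step) is with PROC SQL:

```sas
/* Sketch: store the column means in macro variables for imputation.  */
/* Assumes the variables are named air_temperature_k and              */
/* process_temperature_k, as in the cleaning step that follows.       */
proc sql noprint;
    select mean(air_temperature_k), mean(process_temperature_k)
        into :mean_air_temp trimmed, :mean_process_temp trimmed
        from HV.hvac;
quit;

%put Mean air temperature (K): &mean_air_temp;
%put Mean process temperature (K): &mean_process_temp;
```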

 

/* Correct missingness and convert temperature to Fahrenheit */
data HV.predictive_data_cleaned;
    set HV.hvac;
    /* Impute missing values with the column means */
    if missing(air_temperature_k) then air_temperature_k = &mean_air_temp;
    if missing(process_temperature_k) then process_temperature_k = &mean_process_temp;
    /* Convert from Kelvin to Fahrenheit */
    air_temperature_f = (air_temperature_k - 273.15) * 1.8 + 32;
    process_temperature_f = (process_temperature_k - 273.15) * 1.8 + 32;
    drop air_temperature_k process_temperature_k;
run;

proc print data=HV.predictive_data_cleaned (obs=5);
run;

 

02_DMcK_July2.png

 

From the above code and illustration, we can see that we successfully converted our temperature from Kelvin to Fahrenheit and removed the missing values in the temperature columns. Now, we can look at the descriptive statistics for our selected variables.

 

03_DMcK_July3.png

 

Above, we have the descriptive statistics for our selected variables, providing the mean, standard deviation, minimum, maximum, and median values. We can see that we were able to change our temperature data to Fahrenheit for easier readability. Let's look at a frequency table of some of the categorical variables: type, failure_type, and target.
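A sketch of how these statistics can be produced with PROC MEANS (assuming the variable names used in the cleaned dataset):

```sas
/* Sketch: descriptive statistics for the continuous variables */
proc means data=HV.predictive_data_cleaned mean std min max median;
    var air_temperature_f process_temperature_f
        rotational_speed_rpm torque_nm tool_wear_min;
run;
```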

 

proc freq data=HV.predictive_data_cleaned;
     tables type failure_type target;
run;

 

04_DMcK_July4.png

 

In this table, we see the frequency of the "type" variable. You may be wondering what h, l, and m stand for. They represent the category or classification of the product, distinguishing between different models and sizes of the HVAC (Heating, Ventilation, and Air Conditioning) units. We can observe from the table that type l (low quality variant) models have the highest frequency, followed by medium quality models.

 

05_DMcK_July5.png

 

In this table, we see the failure types and their number of occurrences. Notice that the majority of the HVAC units have no failure, with the second highest count being tool wear failure. Let's visualize the data by creating a bar graph to represent the distribution of failure types.

 

proc sgplot data=HV.predictive_data_cleaned;
   vbar failure_type / datalabel;
   xaxis label='Failure Type';
   yaxis label='Count';
   title 'Distribution of Failure Types';
run;

 

06_DMcK_July6.png

 

In the above figure, we have the number of occurrences for each failure type. We can see that "no failure" is at 9,511 and the next highest count is "tool wear failure".

 

proc freq data=HV.predictive_data_cleaned;
   tables target;
run;

 

07_DMcK_July7.png

 

In this table, we look at our target variable distribution. Failures are represented as 1 and non-failures as 0. For target=0 we have exactly the same count and percentage as the "No Failure" row of the failure type table. Now, it's time to continue visualizing our data.

 

proc sgplot data=HV.predictive_data_cleaned;
   histogram air_temperature_f / fillattrs=(color=lightblue) transparency=0.5;
   density air_temperature_f;
   title 'Distribution of Air Temperature (F)';
   xaxis label='Air Temperature (F)';
   yaxis label='Density';
run;

 

08_DMcK_July8.png

 

In the distribution plot above, we see an approximately normal distribution, where the majority of air temperature data points fall between 80 and 85 degrees Fahrenheit. Such a clean bell-shaped curve would raise suspicion about the quality of the data; real sensor data is rarely this well behaved. Next, we'll create a box plot to look at tool wear in minutes, categorized and grouped by our target variable.

 

proc sgplot data=HV.predictive_data_cleaned;
    vbox tool_wear_min / category=target group=target;
    xaxis label='Failure Status (0=No Failure, 1=Failure)';
    yaxis label='Tool Wear (min)';
    title 'Tool Wear by Machine Failure Status';
run;

 

09_DMcK_July9.png

 

In the above figure, we have box plots that illustrate the relationship between tool wear and the target variable. If target=0 the HVAC has "no failure"; if target=1 the HVAC has failed. The average tool wear is less than 60 minutes for both non-failures and failures. Next, let's show the machine failures associated with each product type. This will give us a better understanding of which variant (low, medium, or high) failed.

 

proc sgplot data=HV.predictive_data_cleaned;
    vbar type / group=target groupdisplay=cluster datalabel;
    xaxis label='Product Type';
    yaxis label='Count';
    title 'Machine Failure by Product Type';
    keylegend / position=bottom;
run;

 

10_DMcK_July10.png

 


In the above figure, we see the target values 0 and 1 (blue and yellow) for the product types h (high quality variant), l (low quality variant), and m (medium quality variant). We see from the illustration that the low-quality variant has the highest count of non-failing HVAC units at 4,832 but also the most failing systems at 244. The next highest product type is the medium quality variant, with 2,810 non-failing units and 152 failing units.

 

Now that we have completed our data cleaning/preprocessing and visualization, we need to split our data into training and test sets so that we can perform machine learning.

 

data HV.train HV.test;
    set HV.predictive_data_cleaned;
    call streaminit(123); * Initialize random number generator for reproducibility;
    rand_num = rand('UNIFORM'); * Generate a uniform random number between 0 and 1;

    if rand_num <= 0.70 then do; * 70% for training data;
        output HV.train;
    end;
    else do; * Remaining 30% for test data;
        output HV.test;
    end;
    drop rand_num; * Drop the temporary random number variable;
run;

* Verify initial split counts for debugging;
proc freq data=HV.train;
    tables target;
    title "1. Target Distribution in Original Training Data (HV.train)";
run;

proc freq data=HV.test;
    tables target;
    title "2. Target Distribution in Test Data (HV.test)";
run;

 

11_DMcK_July11.png

 

From the above illustration, we see the training set separated by the target variable (0=non-failure, 1=failure): about 95% of the observations have target=0 and approximately 5% have target=1, which is very close to the distribution of the entire dataset. In the next step we will fit a model to this data to see if we get meaningful results. We will use a logistic regression model.
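The logistic regression step below trains on a data set named HV.train_balanced, which implies the imbalanced training data was balanced first; that step is not shown here. A minimal sketch of one way to create it, assuming simple oversampling of the minority (failure) class, is:

```sas
/* Sketch: balance the training data by oversampling the failure class. */
/* This is an assumption -- the article references HV.train_balanced    */
/* without showing how it was built.                                    */
data HV.train_balanced;
    set HV.train;
    output;                          /* keep every original row */
    if target = 1 then do i = 1 to 18;
        output;                      /* replicate failures (~5% -> roughly balanced) */
    end;
    drop i;
run;

proc freq data=HV.train_balanced;
    tables target;
    title "Target Distribution in Balanced Training Data";
run;
```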

 

proc logistic data=HV.train_balanced plots(only)=(roc) outmodel=HV.model_logistic_balanced;
    class type (ref='l');
    model target (event='1') = type air_temperature_f process_temperature_f rotational_speed_rpm
       torque_nm tool_wear_min /
       link=logit;
run;

 

In the above code we perform the logistic regression on HV.train_balanced data. It models a binary target variable (where '1' is the event) using type (categorical, 'l' reference) and other continuous variables. The code also generates an ROC curve and saves the trained model as HV.model_logistic_balanced as seen below.

 

12_DMcK_July12.png

 

From the illustration, we see that the x-axis is labeled "1 - Specificity," also known as the False Positive Rate (FPR): the proportion of actual negative cases incorrectly identified as positive. The y-axis represents "Sensitivity," also known as the True Positive Rate (TPR): the proportion of actual positive cases correctly identified as positive. The blue curved line is the ROC curve for the model, showing the trade-off between sensitivity and 1-specificity at various classification thresholds. The diagonal gray line from (0,0) to (1,1) serves as a reference, representing a classifier that performs no better than random chance. A key performance indicator, the Area Under the Curve (AUC), is displayed at the top with a value of 0.6382, suggesting that the model has some ability to differentiate between classes but only moderate predictive power.
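Because the model was saved with OUTMODEL=, it can also be applied to the held-out test set. A sketch (assuming the HV.test data set created earlier) using PROC LOGISTIC's INMODEL= option and SCORE statement:

```sas
/* Sketch: score the held-out test set with the saved logistic model */
proc logistic inmodel=HV.model_logistic_balanced;
    score data=HV.test out=HV.test_scored fitstat;
run;

proc print data=HV.test_scored (obs=5);
run;
```

The FITSTAT option prints fit statistics, including the AUC, for the scored data, giving a fairer picture of out-of-sample performance.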

 

 

Using a Non-linear Model

 

In this section, we will work on improving the model for better predictive power. Our previous logistic regression model had an AUC of 0.6382, as indicated on the ROC curve. Another takeaway from that model is that the relationships in the data may not be linear, so a non-linear model such as a neural network, gradient boosting, or forest model may give better results. In our case we will use the gradient boosting model.

 

Gradient Boosting Model

 

proc gradboost data=HV.predictive_data_cleaned outmodel=HV.gradboost_model;
   input air_temperature_f process_temperature_f
         rotational_speed_rpm tool_wear_min
         torque_nm uid / level=interval;   /* note: uid is a unique row identifier */
   target target / level=nominal;
   output out=HV.gradboost_score_at_runtime;
   ods output FitStatistics=fit_at_runtime;
run;

 

The code above uses PROC GRADBOOST to build a gradient boosting model, a very powerful machine learning method. The dataset HV.predictive_data_cleaned is used to train the model, utilizing several input variables: air_temperature_f, process_temperature_f, rotational_speed_rpm, tool_wear_min, torque_nm, and even uid. The model uses these inputs, all measured at the interval level, to predict a nominal "target" variable. After training, the model is saved as HV.gradboost_model for later use. In addition to creating the model, this code produces useful outputs, such as key performance metrics in the fit_at_runtime dataset and scores that are saved to HV.gradboost_score_at_runtime.

 

13_DMcK_July13.png

 

The Area Under the Curve (AUC) for this model is 1.0000, which indicates a perfect model: at every possible classification threshold, the model achieves 100% sensitivity (true positive rate) and 100% specificity (true negative rate). In simpler terms, the model distinguishes the two classes with no false positives or false negatives. This is an ideal, but usually unrealistic, result in real-world scenarios; here it is likely inflated because the model was trained and scored on the same data and uses uid, a unique row identifier, as an input. The accompanying "Model Fit Statistics" show a "-2 Log L" of 0.870 for "Intercept and Covariates" compared to 20.421 for "Intercept Only", and the "Testing Global Null Hypothesis" table shows very low Pr > ChiSq values for the Likelihood Ratio and Score tests, indicating the model is significant.
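A fairer read of performance comes from scoring data the model has not seen. A sketch (assuming the HV.test split created earlier and PROC GRADBOOST's INMODEL= option for scoring with a saved model):

```sas
/* Sketch: score the held-out test split with the saved gradient boosting model */
proc gradboost data=HV.test inmodel=HV.gradboost_model;
    output out=HV.gradboost_test_scored;
run;

proc print data=HV.gradboost_test_scored (obs=5);
run;
```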

 

14_DMcK_July14.png

 

This table displays the variable importance for a model, indicating which factors contribute most significantly to its predictions. "Tool wear (min)" is identified as the most important variable with an importance score of 5.0829 and a relative importance of 1.0000. Following in importance are "torque (nm)" and "rotational speed (rpm)", while "air temperature (f)" shows the lowest relative importance among the listed variables.
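The variable importance table shown above is part of the default PROC GRADBOOST output. Like the fit statistics, it can be captured to a data set with an ODS OUTPUT statement inside the procedure (a sketch; VariableImportance is the assumed ODS table name):

```sas
/* Sketch: capture the variable importance table to a data set */
proc gradboost data=HV.predictive_data_cleaned outmodel=HV.gradboost_model;
   input air_temperature_f process_temperature_f
         rotational_speed_rpm torque_nm tool_wear_min uid / level=interval;
   target target / level=nominal;
   ods output VariableImportance=HV.var_importance;
run;

proc print data=HV.var_importance;
run;
```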

 

 

Conclusion

 

We can conclude that we successfully created a model, although the logistic regression did not provide promising predictive power. The various steps of the analysis have been shown. Other non-linear machine learning models could be tried to improve fit, and further analysis may include tuning the model hyperparameters to achieve the best possible fit. Keep in mind your results may vary depending on the approach and the machine learning techniques used. For more information, please see the links below:

 

 

 

Find more articles from SAS Global Enablement and Learning here.

