In this article, you will learn how Machine Learning Pipeline Automation (MLPA) in SAS Model Studio utilizes date columns when set as input in the data tab.
The difference between a good model and an excellent one often comes down to how rich your modeling data is. Feature extraction is one of the crucial steps for improving model accuracy.
Did you know that dates are not limited to time series modeling? You can also use features derived from dates in a machine learning pipeline to enrich the modeling data.
Let’s see how you can extract features such as year, month, quarter, weekday, and day from your date column and use them as inputs to your automated machine learning pipeline, instead of rejecting dates or simply using them as raw dates.
Overview
In SAS Model Studio, a date variable is identified as a numeric column that has a date format (for example, MMDDYY10.). By default, the role of such variables is set to Rejected.
However, you can override this default by setting the role to either Input or ID. Once the role of a date variable is overridden, the automated machine learning pipeline inserts a SAS Code node right after the Data node, and this node performs the date feature extraction.
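To make the rule above concrete, here is a minimal Python sketch of the decision logic as described in this article. It is not the code that SAS Model Studio runs: the `Column` class, the function names, and the short list of date format names are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical, partial list of SAS date format names (an assumption for this sketch).
# The pattern requires digits right after the name, so DATETIME20. is not matched.
DATE_FORMAT_PATTERN = re.compile(r"^(MMDDYY|DDMMYY|YYMMDD|DATE)\d*\.$", re.IGNORECASE)


@dataclass
class Column:
    name: str
    type: str                         # "numeric" or "character"
    fmt: str                          # SAS format name, e.g. "MMDDYY10."
    user_role: Optional[str] = None   # role set by the user, if any


def is_date_column(col: Column) -> bool:
    """A date variable is a numeric column that carries a date format."""
    return col.type == "numeric" and bool(DATE_FORMAT_PATTERN.match(col.fmt))


def effective_role(col: Column) -> str:
    """Date columns default to REJECTED unless the user overrides the role."""
    if col.user_role is not None:
        return col.user_role.upper()
    return "REJECTED" if is_date_column(col) else "INPUT"


def needs_date_feature_node(columns: list) -> bool:
    """The extra code node is only needed when a date column is INPUT or ID."""
    return any(
        is_date_column(c) and effective_role(c) in ("INPUT", "ID") for c in columns
    )


columns = [
    Column("Claim_Date", "numeric", "MMDDYY10.", user_role="Input"),
    Column("Effective_Date", "numeric", "DATE9."),
]
print(effective_role(columns[0]), effective_role(columns[1]))
print(needs_date_feature_node(columns))
```

In this sketch, only the overridden column triggers the extra node; a dataset whose date columns all keep the default role would not get one.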
Example
In this example, I am using a housing dataset that has two date variables, “Claim Date” and “Effective Date”, but you can use any dataset whose dates you would like to process.
Once the role of the desired date has been set to either ‘INPUT’ or ‘ID’ and the target has been assigned, we can proceed with generating an automated machine learning pipeline. For this example, I am going to set “Claim_Date” as an input.
Create a new pipeline and select “Automatically generate the pipeline” and set the automation time limit to your desired value. I am going with the default of 15 minutes for this example.
Once the pipeline has finished running successfully, you will notice that there is a SAS Code node right after the Data node (which you won’t see if the date variable is set to rejected).
If you open the code editor of that SAS Code node, it shows the score code that has been generated automatically for you, which extracts new features from the date variable “Claim_Date” and sets the original variable to “Rejected”. This ensures that the pipeline is only using newly generated features and not the original variable to avoid any redundancy.
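For readers who want to see the idea outside of SAS, the following Python sketch mimics what the generated code is described as doing: derive year, quarter, month, weekday, and day from a date column, then mark the original column as rejected. It is an illustrative analogue, not the generated score code. The function names and the derived column names (for example `Claim_Date_year`) are assumptions; the actual node uses names such as `yr_CLA1`, and its weekday numbering may differ from the ISO convention used here.

```python
from datetime import date

FEATURE_NAMES = ("year", "quarter", "month", "weekday", "day")


def extract_date_features(d: date) -> dict:
    """Derive five calendar features from a single date value."""
    return {
        "year": d.year,
        "quarter": (d.month - 1) // 3 + 1,  # months 1-3 -> 1, 4-6 -> 2, ...
        "month": d.month,
        "weekday": d.isoweekday(),          # ISO convention: 1 = Monday, 7 = Sunday (assumption)
        "day": d.day,
    }


def add_date_features(rows: list, date_col: str):
    """Return new rows with derived features, plus a role for each affected column."""
    new_rows = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        for name, value in extract_date_features(row[date_col]).items():
            new_row[f"{date_col}_{name}"] = value
        new_rows.append(new_row)

    # The original date column is rejected so that only the derived features are used.
    roles = {date_col: "REJECTED"}
    roles.update({f"{date_col}_{name}": "INPUT" for name in FEATURE_NAMES})
    return new_rows, roles


rows = [{"Claim_Date": date(2021, 3, 15), "Claim_Amount": 1200.0}]
new_rows, roles = add_date_features(rows, "Claim_Date")
print(new_rows[0])
print(roles)
```

Rejecting the raw date after deriving the features mirrors the redundancy argument above: the model sees the calendar components, not the underlying numeric date value.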
Let’s take a look at the output variables table in the results window. It shows the five new variables that have been generated, and that the original “Claim_Date” variable has been set to the ‘REJECTED’ role.
Lastly, let’s take a look at one of the candidate models and see if any of the new features generated contributed to it.
Below is the result from the Linear Regression model node. The “yr_CLA1” variable, which represents the year extracted from the “Claim_Date” variable, is one of the significant variables in this model.
Note that whether these new date features improve accuracy is highly dependent on the type of dataset you are using. To check, you can always add another pipeline with none of the date variables set as input and compare it against the pipeline that uses the date variables.
Similarly, learn more about extracting features from text variables in this article.
Summary
In this article, I have shown one way dates can be used in the modeling process. Depending on the type of dataset, this technique can improve the predictive performance of your models.
It also covered an example of SAS Model Studio Automated pipeline creation, which uses a date variable to extract new features for subsequent use.
This is awesome. You mention changing the role of the Date variable to 'ID' - will this have the same effect?
This is really nice, I really appreciate your dedication and efforts to make this informative. I have a query: how can we find all the variables derived from claim_date?
@raavichouhan, glad you found this article helpful. If you are looking for the variables derived from claim_date, the SAS Code node results window shows five new derived variables for year, quarter, month, week and day. Their names start with an abbreviation, such as YR_CLA1 (the year extracted from the claim date), as shown in the article.
Now let's say that, for some reason, your data has two claim date variables, CLAIM_DATE_XYZ_123 and CLAIM_DATE_ABC_456. If you set the role of both variables to "input" and both have appropriate date formats, new variables are derived for each of them, and a numeric suffix is added to differentiate between them. However, best practice is to use clearly different variable names to avoid confusion.
Please feel free to ask if you have any additional questions.
@tom_grant good question! Yes, you can set the role to either "INPUT" or "ID" and the date variable will be processed in the same way, as long as the format is correct. You can open the SAS Code node to see the behind-the-scenes code, where it checks for the role and format (the fourth image in the article shows this). Hope this helps.