BookmarkSubscribeRSS Feed

DataPreparation4DataScience = FeatureEngineering + DQ + DataAssembly | Gerhards' article collection

Started ‎04-13-2022 by
Modified ‎04-13-2022 by
Views 962

 

A topic "Close to my heart"

 

 

gsvolba_1-1642976861967.pngBack in 1997 when I was a researcher at the Medical University of Vienna, I wrote the first version of the %MAKELONG and %MAKEWIDE macro to transpose data from clinical trials in the appropriate data structure. At that time, I was not yet aware that I started my journey through the world of "data preparation for data science" (or "data preparation for analytics" how it was called at that time). At the beginning of  2007 my first SAS Press Book "Data Preparation for Analytics" was published, followed by "Data Quality for Analytics" in 2012.

 

 

 

 

Today, 25 years later my #datapreparation4datascience collection contains 18 SASCommunities articles, 12 articles at medium.com, LinkedIn and other media, 14 webinars on Youtube, 3 SAS Press Books, 4 SAS Global Forum papers, 2 ask-the-expert sessions, 2 SAS tips at support.sas.com, 3 SAS Blogs and more than 60 presentations on data science conferences.

 

The purpose of this article it to provide an overview over #datapreparation4datascience and a link collection for the contributions by topic.
However before I start with the overview and the list, I want say that this has not only been my own achievement. I have to thank you, the SAS User Community, for your support!

 

 

The topic of my customers

 

 

I could only develop these topic and write the SAS Press books because I was in conversation with our SAS Customers and SAS Users. The questions you asked me and the challenges you gave me, were often a trigger for new ideas, tips and tricks and solutions. Here are a few examples:

  • In long nights of detecting and fixing data quality problems we found out and discussed that the statement "the data quality is in good shape" needs to be differentiated between technical data quality and data quality for analytics. The fact, that data is fine from an IT perspective because the values are between the lower and upper boundary or the categories are in a predefined list, does not mean that data can be used for machine learning models.
  • In many machine learning projects we worked on how to improve the predictive power of the models. And some of the challenges you presented to me resulted in creative solutions for feature engineering and deriving customer behavior from transactional data.
  • The above mentioned %MAKEWIDE and %MAKELONG macro for transposing the data have been reworked and augmented over the years. They were a frequent, but not the only, challenge how to structure data for being used in analytical models, SAS procedures and SAS Enterprise Miner or SAS Model Studio.
  • Sometimes we had (quite philosophical) discussions about the origin of data and the fact that the final representation of the data not only depends on the values of the analysis subject themselves but also from the intention of the data retrieval system and the process that collects the data. The "story of my old aunt Susanne" was one of the results of these discussion and has been presented at SAS conferences and my academic teaching now for more than 10 years.
  • The discussion "what would have happened, if we had more/better data (or also less/inferior data)" in time series forecasting, was the trigger for many simulation studies, that have been performed to quantify the effect of more missing values or a longer time series.

 

 

#datapreparation4datascience - Digging one level deeper

 

 

Data Science methods have specific requirements on analysis data. Occurrences of missing values have to be investigated and handled, the distribution of the values need to be checked, and a sufficient amount of analysis objects is needed. Data Science methods also require certain data structures like the one-row-per subject structure or the timeseries dataset. In return these methods however also enhance the data preparation work and can also be used to verify and improve data quality and allow you to perform powerful feature engineering.

 

"Data Preparation" and "Data Quality" are therefore much more than just joining tables and checking the possible list of values. Data Preparation for Data Science is a discipline to create and quality check the analytic base tables and to derive the relevant features that shall be used in the analysis. In my publications I categorize my articles into the following three main topics:

 

gsvolba_0-1643285637737.png

 

 

 

1. Feature Engineering

 

 

0 Feature EngineerCorrelation.png

Feature Engineering is an important tool for (supervised) machine learning. Model accuracy and interpretability benefits from variables that precisely describe the behavior of the analysis subjects. In many cases these features are derived from transactional data which are recorded, e.g. over time. In some cases, simple descriptive measures like the mean or the sum provide a very good picture of the analysis subjects.

 

Often it is important to dig deeper to adequately describe the behavior. Analytic methods help to calculate KPIs that measure the trend over time per customer, the accordance with predefined pattern or the correlation of individual customer with the average customer. For powerful analytical models it is however not enough to "replicate" the original data just in another format, e.g. by transposing. You want to describe the behavior of your analysis subjects. You can to this by calculating different analytic measures.

 

 

 

 

 

 

 

 

 

 

2. Data Quality for Analytics

 

 

gsvolba_0-1643283656619.png

Analytical methods have specific requirements on the analysis data. These data quality requirements often go beyond classic requirements of basic reporting and descriptive statistics. Missing values for example can quickly reduce the available set of records with complete data.

 

The data quantity aspect has influence whether statistically significant results can be generated or if certain machine learnings methods can be applied at all. The availability of data in general for the analysis decide whether you are able to perform your analysis or whether you have to reformulate your business questions or postpone the analysis at all.

 

Analytical methods however not only have requirements on the data. These methods can also be used to profile and improve the quality of the data.

 

 

 

2.1 Conceptual considerations

 

These articles discuss the conceptual background and the motivation of "Data Quality for Analytics".

 

 

DQ Options.jpeg

 

 

 

 

 

 

2.2 Profiling and improving data quality

 

Data Science methods not only demand a certain data quality level, they also allow you to profile and improve the quality of your analysis data.

 

General Data Quality Profiling

 

 

Cross Sectional Data

 

 

Time Series Data

 

 

Using Individual Reference Limits for Data Quality Checks

 

 

 

2.3 Simulation case studies

 

Simulation case studies have been performed to quantify the effect of bad data quality on the performance of supervised machine learning models and on time series forecasting models.

 

 

1_GMMFGFRVy3nYm0Q-mMi7Hg.png

 

 

 

 

3. Data Assembly 

 

 

gsvolba_1-1643283697360.png

In order to be able to analyze the data with data science methods, the data needs to be structured in a appropriate form. The one-row-per-subject data structure and the longitudinal data structure are the two most prominent data structures in data science and machine learning. Data Assembly here refers to the exercise of joining tables together, aggregating data across hierarchies like time or product group, transposing data from one structure into another. 

 

However this is not only a technical task of "programming the data management exercise". It also involves business considerations like the selection of the right data sources to answer the business question, the alignment of data at the time axis for predictive modeling, and the discussion which input variables can be used for the analysis.

 

 

 

predmodeling time windows.PNG

 

 

 

 

SAS Global Forum Papers

Folie1.JPG

 

 

 
Version history
Last update:
‎04-13-2022 12:33 PM
Updated by:
Contributors

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Tags