Back in 1997, when I was a researcher at the Medical University of Vienna, I wrote the first version of the %MAKELONG and %MAKEWIDE macros to transpose data from clinical trials into the appropriate data structure. At that time, I was not yet aware that I had started my journey through the world of "data preparation for data science" (or "data preparation for analytics", as it was called at that time). At the beginning of 2007, my first SAS Press book, "Data Preparation for Analytics", was published, followed by "Data Quality for Analytics" in 2012.
Today, 25 years later, my #datapreparation4datascience collection contains 18 SASCommunities articles, 12 articles at medium.com, LinkedIn and other media, 14 webinars on YouTube, 3 SAS Press books, 4 SAS Global Forum papers, 2 ask-the-expert sessions, 2 SAS tips at support.sas.com, 3 SAS blogs and more than 60 presentations at data science conferences.
The purpose of this article is to provide an overview of #datapreparation4datascience and a link collection for the contributions by topic.
However, before I start with the overview and the list, I want to say that this has not only been my own achievement. I have to thank you, the SAS User Community, for your support!
I could only develop these topics and write the SAS Press books because I was in conversation with our SAS customers and SAS users. The questions you asked me and the challenges you gave me were often a trigger for new ideas, tips and tricks, and solutions. Here are a few examples:
Data Science methods have specific requirements on analysis data. Occurrences of missing values have to be investigated and handled, the distribution of the values needs to be checked, and a sufficient number of analysis objects is needed. Data Science methods also require certain data structures, like the one-row-per-subject structure or the time series dataset. In return, however, these methods also enhance the data preparation work: they can be used to verify and improve data quality, and they allow you to perform powerful feature engineering.
"Data Preparation" and "Data Quality" are therefore much more than just joining tables and checking the possible list of values. Data Preparation for Data Science is a discipline to create and quality check the analytic base tables and to derive the relevant features that shall be used in the analysis. In my publications I categorize my articles into the following three main topics:
Feature Engineering is an important tool for (supervised) machine learning. Model accuracy and interpretability benefit from variables that precisely describe the behavior of the analysis subjects. In many cases these features are derived from transactional data that are recorded, for example, over time. In some cases, simple descriptive measures like the mean or the sum provide a very good picture of the analysis subjects.
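As a minimal sketch of such simple descriptive measures, the following Python snippet collapses transactional rows into one record per analysis subject. It is an illustration only, not code from the publications listed here; the data and the names `transactions` and `aggregate_per_subject` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical transactional data: (customer_id, month, amount).
transactions = [
    ("A", 1, 100.0), ("A", 2, 120.0), ("A", 3, 140.0),
    ("B", 1, 80.0), ("B", 2, 60.0),
]

def aggregate_per_subject(rows):
    """Collapse transactional rows into one feature record per subject."""
    amounts = defaultdict(list)
    for customer_id, _month, amount in rows:
        amounts[customer_id].append(amount)
    return {
        cid: {"n": len(vals), "sum": sum(vals), "mean": mean(vals)}
        for cid, vals in amounts.items()
    }

features = aggregate_per_subject(transactions)
for cid, feats in features.items():
    print(cid, feats)
```

Each subject ends up with exactly one record, which is the shape most supervised learning methods expect as input.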
Often it is important to dig deeper to adequately describe the behavior. Analytic methods help to calculate KPIs that measure the trend over time per customer, the accordance with predefined patterns, or the correlation of an individual customer with the average customer. For powerful analytical models, however, it is not enough to "replicate" the original data in just another format, e.g. by transposing. You want to describe the behavior of your analysis subjects. You can do this by calculating different analytic measures.
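Two of the measures mentioned above, the trend over time per customer and the correlation of a customer with the average customer, can be sketched in a few lines of Python. This is one possible reading under stated assumptions: the trend is taken as the least-squares slope, the "average customer" as the per-month mean across all customers (including the customer itself), and all data and names (`usage`, `ols_slope`, `pearson`) are hypothetical.

```python
from math import sqrt

# Hypothetical monthly usage per customer (months 1 to 4); illustrative values only.
months = [1, 2, 3, 4]
usage = {
    "A": [10.0, 12.0, 14.0, 16.0],
    "B": [20.0, 18.0, 16.0, 14.0],
    "C": [5.0, 9.0, 7.0, 11.0],
}

def ols_slope(xs, ys):
    """Least-squares slope of ys on xs, i.e. the linear trend per time unit."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def pearson(a, b):
    """Pearson correlation; returns None if either series is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    if saa == 0 or sbb == 0:
        return None
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return sab / sqrt(saa * sbb)

# "Average customer": mean usage across all customers for each month.
average_customer = [sum(vals) / len(vals) for vals in zip(*usage.values())]

features = {
    cid: {
        "trend": ols_slope(months, vals),
        "corr_with_average": pearson(vals, average_customer),
    }
    for cid, vals in usage.items()
}
for cid, feats in features.items():
    print(cid, feats)
```

Unlike a plain transpose of the monthly values, each derived measure summarizes the shape of the customer's behavior in a single variable.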
Analytical methods have specific requirements on the analysis data. These data quality requirements often go beyond the classic requirements of basic reporting and descriptive statistics. Missing values, for example, can quickly reduce the available set of records with complete data.
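The effect of missing values on the number of complete records can be made concrete with a small Python sketch. The records and the helper names (`missing_share_per_variable`, `complete_cases`) are hypothetical and only serve as an illustration.

```python
# Hypothetical analysis records; None marks a missing value.
records = [
    {"age": 34, "income": 52000, "tenure": 5},
    {"age": None, "income": 48000, "tenure": 3},
    {"age": 45, "income": None, "tenure": 8},
    {"age": 29, "income": 39000, "tenure": None},
    {"age": 51, "income": 61000, "tenure": 12},
]

def missing_share_per_variable(rows):
    """Share of missing values for each variable."""
    variables = rows[0].keys()
    return {
        var: sum(row[var] is None for row in rows) / len(rows)
        for var in variables
    }

def complete_cases(rows):
    """Keep only the records in which no variable is missing."""
    return [row for row in rows if all(v is not None for v in row.values())]

shares = missing_share_per_variable(records)
complete = complete_cases(records)
print(shares)
print(len(complete), "of", len(records), "records are complete")
```

In this toy example each variable is missing in only one of five records, yet just two of the five records remain usable for a method that requires complete data.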
The data quantity aspect influences whether statistically significant results can be generated or whether certain machine learning methods can be applied at all. The general availability of data for the analysis decides whether you are able to perform your analysis or whether you have to reformulate your business questions or postpone the analysis altogether.
However, analytical methods do not only place requirements on the data. These methods can also be used to profile and improve the quality of the data.
These articles discuss the conceptual background and the motivation of "Data Quality for Analytics".
Data Science methods not only demand a certain data quality level; they also allow you to profile and improve the quality of your analysis data.
Simulation case studies have been performed to quantify the effect of bad data quality on the performance of supervised machine learning models and on time series forecasting models.
To analyze data with data science methods, the data need to be structured in an appropriate form. The one-row-per-subject data structure and the longitudinal data structure are the two most prominent data structures in data science and machine learning. Data Assembly here refers to the exercise of joining tables together, aggregating data across hierarchies like time or product group, and transposing data from one structure into another.
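The transposition between the longitudinal structure and the one-row-per-subject structure can be sketched as follows. This is a plain Python illustration of the idea, not a port of the %MAKEWIDE and %MAKELONG macros; the function names, parameters and data are hypothetical, and the sketch assumes integer time points and an ID column whose name does not start with the chosen prefix.

```python
def make_wide(long_rows, id_key, time_key, value_key, prefix):
    """Longitudinal rows -> one row per subject, one column per time point."""
    wide = {}
    for row in long_rows:
        subject = wide.setdefault(row[id_key], {id_key: row[id_key]})
        subject[f"{prefix}{row[time_key]}"] = row[value_key]
    return list(wide.values())

def make_long(wide_rows, id_key, time_key, value_key, prefix):
    """One row per subject -> one row per subject and time point."""
    long_rows = []
    for row in wide_rows:
        for column, value in row.items():
            if column.startswith(prefix):
                time_point = int(column[len(prefix):])
                long_rows.append(
                    {id_key: row[id_key], time_key: time_point, value_key: value}
                )
    return long_rows

# Hypothetical repeated measurements; subject 2 has no second visit.
long_data = [
    {"id": 1, "visit": 1, "weight": 80},
    {"id": 1, "visit": 2, "weight": 78},
    {"id": 2, "visit": 1, "weight": 65},
]

wide_data = make_wide(long_data, "id", "visit", "weight", prefix="weight")
back_to_long = make_long(wide_data, "id", "visit", "weight", prefix="weight")
print(wide_data)
print(back_to_long)
```

Time points without an observation simply produce no column for that subject here; a production version would have to decide how to represent such gaps explicitly.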
However, this is not only a technical task of "programming the data management exercise". It also involves business considerations like the selection of the right data sources to answer the business question, the alignment of data on the time axis for predictive modeling, and the discussion of which input variables can be used for the analysis.
Ask-the-expert Sessions (in German)
Articles
Github
SAS Communities: Article Overview
YouTube Playlists
Gerhard Svolba, April 2022
(18°12'45" - 16°19'26"E)