
What drives and motivates a Data Scientist?

Started ‎01-23-2023
Modified ‎01-24-2023

What drives and motivates a Data Scientist? How did they start their careers? Do they have valuable advice, drawn from their experience, for their peers, newcomers to the job, or even established colleagues? The following series dives into the world of data scientists and their technologies, featuring bright minds from various countries.

 

In this one, we hear from Jean de Villiers, career SAS Data Scientist, Statistician, and Head of Analytics for SAS Ireland. Read on as Jean dives into the job profile of a Data Scientist, and learn why collaboration and communication are just as important as the data and modeling involved.

 

Q1: How did you develop a passion for Data Science?

In 2013, I was selling SAS software for a living while working for a SAS partner. At times I struggled to communicate the value of the SAS software offerings and decided to improve my technical depth by completing a SAS certification. It was during this time, while working through the prescribed training material for the SAS Statistical Business Analyst certification, that I started to grasp the endless possibilities associated with predicting future events and optimizing the decisions that go with them. This was when my passion for Data Science started. It brought me a sense of adventure, similar to how a child feels when they set foot in a theme park. That feeling of adventure, combined with a hunger for more knowledge, was the catalyst in my Data Science transformation. Not long after completing the SAS SBA certification, I applied for an M.Sc. in Business Mathematics and Informatics at North-West University, South Africa, and completed the degree the following year.

 

Q2: What is an important personality trait that helps you excel at Data Science?

Being curious about the world and how things work.

 

Q3: How did you develop the technical skills needed to excel at Data Science? What are your preferred programming languages and why?

I believe that in order to be a well-rounded Data Scientist you need a diverse range of skills and some solid working experience. The majority of my technical skills are built up around SAS and Python as my preferred programming languages. I will provide a short breakdown of the methods I used to build up skills in each.

 


 

SAS

  1. Certification - The nice thing about building up SAS Data Scientist skills is that SAS provides you with a path to success through their Analytics & Data Science certification tracks. The recipe for success is simple: choose a certification, complete the prescribed training material, do a practice exam, and then attempt the certification. This has always been my main method for acquiring SAS skills and one I would highly recommend. The top three SAS certifications I would recommend are:
    1. SAS Certified Data Scientist 
    2. SAS Certified Professional: AI & Machine Learning
    3. SAS Certified Professional: Data Curation for SAS Data Scientist
  2. Hackathons – SAS Hackathons are a great way to improve your SAS skills and collaboration efforts. For a SAS Hackathon, SAS would typically provide your team with a SAS mentor, free access to a leading SAS platform, and enablement resources in the form of virtual learning labs, e-learning training, and more.

Python

  1. O’Reilly Data Science books – I find that the O’Reilly books are great for learning Data Science using Python. They typically provide a good introduction to a topic, solid theoretical knowledge, the necessary instructions to get a practice environment up and running, the required practice data, and a step-by-step approach to applying Python to solve a problem. This is a great way to learn Python for Data Science, as they don’t just focus on the technical skills, but also on educating the reader on how to install, configure, and manage a Python open-source ecosystem.
  2. DataCamp – DataCamp is great for getting to grips with the essential skills and the associated packages needed to develop your Python Data Science skillset. They have a wide variety of courses that help you learn interactively: you watch instructor-led videos, practice and assess what you’ve learned, track your daily and overall course progress, and test your skills by taking on case studies that simulate real-world problems.
  3. Dataquest – Dataquest is another online learning platform that’s great if you prefer a more hands-on approach to learning. The aim of Dataquest is to develop the necessary skills mostly by working through different levels of practice exercises that get incrementally more challenging. This learning approach emulates the process someone goes through while playing video games.

 

Q4: What are some of the most important steps in a Data Science project?

The most important steps in a Data Science project are the following:

  1. Overall purpose of the project: Understanding how the Data Science project fits in with the customer’s overall business strategy and vision - where do they want to go?
  2. Problem formulation through stakeholder engagement: Next, you need to engage stakeholders to understand and refine the business requirements. It’s important that you understand what each of the stakeholders wants from this project - ongoing communication and transparency are important at this stage of the project.
  3. Modelling approach & how the results will be used: If steps 1 & 2 were conducted successfully, one should at this stage understand the overall goal of the project, how it aligns with the greater company strategy and vision, and what’s expected and required by your stakeholders. This information should be factored in when deciding on an analytical modelling approach. For example: if model explainability is more important than model accuracy, you might consider using a modelling technique that’s easier to unpack, such as a logistic regression or decision tree, rather than gradient boosting or a neural network. It’s also important to understand:
    1. What is the scope and timeframe of the data being used?
    2. How will the output of the model be used?
    3. Are the results required in real time, and on what platform/device will they be utilized?
    4. What is the risk/cost associated with a false positive or false negative?
    5. How will the model’s performance be tracked and monitored over time, and what action will be taken when the model starts to degrade?
    6. Is your model scoring in line with what’s considered fair and ethical, and if not, what are the consequences and remedial actions needed to get it on track?
  4. Storytelling: Although this is the last point in this list, it’s probably the most important one. It is extremely important to be able to communicate and share progress and achievements with your stakeholders in a way that resonates with them. Dashboards and visualizations are effective tools for storytelling. Storytelling is the art of connecting with your audience from their perspective, which means structuring the information in a way that allows your audience to consume it best.
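The explainability-versus-accuracy trade-off in step 3 can be illustrated with a small sketch. This is a hypothetical example on synthetic data using scikit-learn (one open-source option, not necessarily the tooling used in the projects described here): the logistic regression exposes a coefficient per feature that is easy to explain to stakeholders, while the gradient boosting ensemble is often more accurate but far harder to unpack.

```python
# Hypothetical sketch: interpretable model vs. more complex model on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Easy to explain: each coefficient shows a feature's direction and strength.
logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Often more accurate, but its hundreds of trees are much harder to unpack.
gbm = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

print("logistic regression accuracy:", logit.score(X_test, y_test))
print("gradient boosting accuracy:  ", gbm.score(X_test, y_test))
print("logistic coefficients:", logit.coef_[0].round(2))
```

If the two accuracies are close, the false-positive/false-negative costs are modest, and stakeholders need to understand the scoring, the simpler model is often the better choice.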

 

Q5: What Data Science trends do you see in 2023?

Here are some trends and my thoughts on them:

  • An accelerated transition to the cloud for businesses – Transitioning to the cloud brings vast opportunities that I believe outweigh the current concerns. Having access to infrastructure, platforms, solutions, connectivity, rapid scaling, security, and a vast array of other business enablers, all virtually from anywhere, is a key differentiator in the short run and essential in the long run.
  • A greater focus on ModelOps – Unfortunately, some Data Science teams have a disproportionate focus on model building and selling the potential business value, versus making sure that the models are actually operationalized and the business value realized. The adoption of a ModelOps approach to the analytics lifecycle promotes the following:
    • Getting a minimum viable model (MVM) developed and operationalized that immediately starts adding value to the business, and then making incremental improvements to the model in subsequent model versions.
    • A reduction in the overall effort and time needed to get models operationalized and used by the business (faster time-to-value).
    • A culture of experimentation that allows teams to fail fast and learn faster.
    • Focusing on the entire analytics lifecycle, including deployment and post-deployment activities such as:
      • model performance-tracking and -monitoring.
      • model recalibration, model retraining, and model rebuilds.
      • Algorithmic fairness that ensures the models don’t behave in a way that could be considered biased or unfair (e.g. when the model predictions systematically or disproportionately disadvantage a group of people).
  • A greater focus on outcome value and less on tool preferences - Data Science teams are becoming more tool-agnostic and more focused on delivering value to the business. A big enabler for this is the adoption of Data Science platforms that are open to a variety of proprietary and open-source programming languages. SAS Viya is a good example of such an inclusive platform embracing this change.
  • A greater focus on ensuring that models are explainable - This is the process of embedding the necessary functionality within the Data Science platform to ensure that model results and the underlying modelling processes can be understood and explained to different audiences (technical and business). This is achievable using:
    • Local interpretability methods for explaining individual predictions such as LIME (local interpretable model-agnostic explanations), Shapley Values, Kernel SHAP, HyperSHAP.
    • Global interpretability methods such as Model Variable Importance (a ranking of the most important variables) and Partial Dependence Plots.
  • Greater automation within Data Science platforms - The adoption of AutoML (automated machine learning) and automated hyperparameter optimization technology that drives rapid model development.
  • Democratization of Data Science - Modern DS platforms are designed with productivity and ease of use in mind, including being more GUI-based (Graphical User Interface). SAS’s Visual Data Mining and Machine Learning (VDMML) platform is a good example of this.
  • A greater use of platform-embedded Artificial Intelligence – Data Science platforms are starting to embed AI into their processes, such as hyperparameter optimization and automated report writing.
