Here are a couple of questions: how do you choose to operate as a data scientist? And more importantly, why does this matter?
Let’s start with the second question. Over and over, we hear that organizations are struggling to get value from analytics. They know that they need the insights from data to support their decision-making. However, the reality is that they have not yet reached a point where this is happening.
Some estimates suggest that over 85% of analytics projects never make it into production. That is, a problem may be identified, and a model developed to address it—but it never becomes part of ‘how we work round here’. Work by Gartner found that data science is often still considered ‘alchemy’ by around 80% of business users. In other words, it is neither understood nor trusted by most people who need to harness its insights.
Developing the story
However, this is only part of the story. In 2021, a paper presented at the IEEE International Conference on Big Data examined the success factors in data science projects. The authors had surveyed data science professionals about the methodologies they used in implementing data science projects. They found that only 25% followed a formal project implementation methodology.
This chimes with my own experience. There are only a few data science project methodologies out there, and they are not widely used. You could argue that the analytical lifecycle used by SAS promotes a more formal approach to analytical project development. It certainly emphasizes the importance of model deployment as a key part of the project. However, it is not a formal methodology.
This may be the result of how data science is taught at university. You tend to start with personal projects, where you do everything yourself. This is fine when you’re learning, but it really doesn’t work in real life. It assumes that you already know the result before you start, so you simply work through from start to finish. In real projects you don’t know the answer ahead of time, so development tends to be more iterative, and often circular. This top-down approach also leaves very little room to collaborate.
Learning from history?
Those with long memories may find this reminiscent of the situation perhaps 20 or 30 years ago in IT implementation. Missed deadlines, massive budget overruns and a general lack of value led to the development of methodologies for project management such as PRINCE. These methodologies are strict, some might even say rigid, but they have proven useful for controlling and managing IT projects effectively over many years. More recently, the adoption of Scrum and Agile has given new and more flexible approaches to software development. These methodologies are widely known and used, and generally considered to improve the management of IT and software projects.
These two facts—the struggle to implement data science projects, and the lack of methodology—may be unconnected. After all, not every project needs formal project management methodology to succeed. However, two other aspects were emphasized by the authors of the IEEE paper. First, the most important factors in project success were being able to describe stakeholders’ needs very precisely, being able to communicate project results to end-users, and teamwork. In other words, communication and interpersonal issues, NOT technical factors.
Second, data scientists who used a formal methodology had a much stronger focus on certain aspects of the project. These included the potential risks and, crucially, the route that was needed to put the model into production. This is, of course, the very point at which many data science projects fail.
Adopting a formal data science project management methodology may or may not be the way to manage every project. I suspect that many do not need to go this far. However, there certainly seems to be room for a greater discussion of how we work as data scientists. For example, how do you share intermediate steps in the development process? How do you communicate your work within larger groups of data scientists, or to end-users? What processes do you have to review and discuss unexpected results?
I think there is a real opportunity here to consider data science work processes as a community. Let’s not get too rigid—but let’s think about how we work, and let’s improve implementation in analytics.