Starting at the End

And avoid a potentially traumatic slip on the finish line of data initiatives

Apr 12, 2023

It's a warm spring afternoon on the English shores in the city of Liverpool. The local football team is playing for the football championship in a crucial game against rival club Chelsea. At the end of the first half, the captain and club legend Steven Gerrard slips when trying to receive a backward pass, resulting in a crucial goal for the away team. Chelsea went on to win the game, and Liverpool eventually lost out on the championship at the finish line.

Today is a big anniversary in the soccer world | Soccer Board — A day to forget for the Liverpool legend

Is there a more excruciating way to fail than this? Let me tell you a story of what I have seen as a data strategist and how what happened in Liverpool can happen to any project.

A poignant anti-pattern

This morning I brewed some fresh coffee, cleaned up my desk, and set out with the intent to write. As I started typing, I quickly realized this would be more than just one post. More than that, this topic will become a consistent thread throughout the Thinking Data publication.

I know some people who always like to read the final chapter of a book before they start. In this way - they say - they can avoid reading a book that would otherwise disappoint them eventually, like what happened on that fateful English spring day. My job is to convince you that we should do the same with our data initiatives - let us look at the final step first, before a single line of code or database query is written. In the Elements of Data Strategy, I attempted to portray the worst-case scenarios of data project failures in a canvas called the "Implementation Maze", with the finish line failure named the "So What Problem". Let me now tell you how this story of failure came to be.

Back in the early days of my consulting career, I saw a project where the team's job was to build a machine learning model, which was part of a growing analytics portfolio. The team spent three months developing a prototype of high accuracy and was proud to present it in front of the client. The data scientists and engineers highlighted their work in preparing the data and training a model, culminating in deploying it and exposing it via API. As they went on through the slide deck, they reached the final step, where their work concluded. They gave an endpoint at http://data.somedomain.com/predict and showed how to query it with (data is fake):

{
   "info": {
      "type": 1,
      "address": {
         "town": "Cheltenham",
         "county": "Gloucestershire",
         "country": "England"
      },
      "tags": ["Sport", "Water polo"]
   },
   "type": "Basic"
}

To get a response like:

{
   "prediction": {
      "label": 1
   }
}
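To make concrete what this asks of a consumer, here is a minimal Python sketch of the call. The URL and payload come from the example above; the POST method, the JSON content type, and the helper names build_predict_request and parse_label are my assumptions, not something the team specified. The actual network call is left commented out because the domain is a placeholder.

```python
import json
import urllib.request

# Endpoint as shown in the slide deck (placeholder domain).
PREDICT_URL = "http://data.somedomain.com/predict"


def build_predict_request(payload: dict) -> urllib.request.Request:
    """Wrap the payload in a JSON request. POST is an assumption."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        PREDICT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_label(response_body: str) -> int:
    """Pull the predicted label out of a response shaped like the one above."""
    return json.loads(response_body)["prediction"]["label"]


# The same fake data as in the example request.
payload = {
    "info": {
        "type": 1,
        "address": {
            "town": "Cheltenham",
            "county": "Gloucestershire",
            "country": "England",
        },
        "tags": ["Sport", "Water polo"],
    },
    "type": "Basic",
}

request = build_predict_request(payload)

# Sending the request would look like this (not run here):
# with urllib.request.urlopen(request) as resp:
#     label = parse_label(resp.read().decode("utf-8"))

# Parsing the sample response instead of a live one.
label = parse_label('{"prediction": {"label": 1}}')
print(label)
```

Even this short script assumes the consumer can write code, handle JSON, and interpret a bare label.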

At this point, the team was anticipating a jubilant response, but what they got in return was more like, "So, what exactly are we supposed to do with this?" It turns out nobody (until then!) had thought about how the machine learning model should be consumed. Now the team had to go back to the drawing board and build a user interface because the end user of their model was nontechnical and had no direct use for the API. This took several more months and led to considerable frustration on all sides.

Two productive patterns

This is one of the worst ways a data science project can fail. How are we to avoid it? Let's illustrate the problem with a diagram:

The wild west is what we need to have a solution for. Here are two productive patterns.

Step One. Sit down - preferably at the start of the engagement - and define who or what will consume the end product. Are there human end-users or other systems and pipelines that need to work with the code? If the former, what are their backgrounds, knowledge, and skills? What are the product requirements (e.g., latency)? If the latter, what are the concrete properties of these systems, and are there potential integration issues?
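One lightweight way to make this step tangible is to write the answers down as a small record before any modeling starts. The sketch below is purely illustrative: the ConsumerProfile class, its fields, and the needs_gui rule are hypothetical names I made up for this post, and a real project would likely track more than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConsumerProfile:
    """Hypothetical record of who or what consumes the end product."""
    name: str
    is_human: bool
    is_technical: bool
    max_latency_ms: Optional[int] = None  # product requirement, if any

    def needs_gui(self) -> bool:
        # Simplifying assumption: non-technical humans need a user interface.
        return self.is_human and not self.is_technical


analyst = ConsumerProfile("client analyst", is_human=True, is_technical=False)
pipeline = ConsumerProfile(
    "nightly batch pipeline", is_human=False, is_technical=True, max_latency_ms=500
)

for consumer in (analyst, pipeline):
    print(consumer.name, "needs a GUI:", consumer.needs_gui())
```

Had the team in the story filled in even one such record, the need for a user interface would have surfaced on day one rather than at the final presentation.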

Step Two. Build an API in any case. Even if, in the previous step, you conclude that you need a GUI, you should still expose the model as an API. Requirements and systems constantly change, and in this way, you contribute to a more modular architecture, and your work can be further integrated into other products and services in the future.
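As a rough sketch of what this layering can look like, the Python below keeps the model behind an API-style function and lets a presentation layer consume the API response rather than the model itself. Everything here is an assumption for illustration: predict is a stand-in that always returns 1, handle_predict and render_for_user are hypothetical names, and the wording attached to each label is invented, since the original project never said what the labels meant.

```python
import json


def predict(features: dict) -> int:
    """Stand-in for the trained model; placeholder logic that always returns 1."""
    return 1


def handle_predict(request_body: str) -> str:
    """API layer: JSON in, JSON out, same shape as the response shown earlier."""
    features = json.loads(request_body)
    return json.dumps({"prediction": {"label": predict(features)}})


# Hypothetical wording for non-technical users; the real meaning of the
# labels is not known from the story.
LABEL_TEXT = {0: "Not recommended", 1: "Recommended"}


def render_for_user(api_response: str) -> str:
    """GUI layer: consumes the API response instead of calling the model."""
    label = json.loads(api_response)["prediction"]["label"]
    return LABEL_TEXT.get(label, f"Unknown label {label}")


response = handle_predict('{"type": "Basic"}')
print(response)
print(render_for_user(response))
```

Because the interface only depends on the API response, another system can call handle_predict directly, and the GUI can be replaced without touching the model.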

Start at the end

There will always be potential for failure. But failure at the finish line can safely be avoided if we study anti-patterns and prepare for the most common scenarios. Several more minutes of concentration and Steven Gerrard would have the Premier League medal to his name. Who wants to keep him company?
