This is the second part of our take on the seven deadly sins of data science. If you haven’t read the first part yet, you can find it here.
4. Never hire a data team that you can’t fire tomorrow
A data science team needs to earn the right to be a part of the business permanently. There are many “zombie” teams of data scientists who are kept in businesses with the expectation that they will eventually find a hidden gem in the data.
In reality, however, it is often the case that the data simply isn’t very useful, at least not at a reasonable cost. That’s why the goal should always be to find out whether the data carries any value, as quickly as possible, which brings us to the next deadly sin.
5. Never incentivise your team to mislead you
The truth is that the incentives of the company and the incentives of the data science team are often misaligned. Pressure from management often creates situations where data scientists have to produce results even where none can be found. This leaves some data science teams incentivised to string along untenable projects rather than admit defeat and look for a new job. As a consequence, the business may not get an accurate picture of what is really going on with its data, because its data scientists are reluctant to jeopardise their jobs by revealing the true situation.
The trick is to incentivise honesty. One solution is to hire contractors on a short-term contract and offer them a sizeable chunk of their day rate even if they conclude that the data isn't good enough to deliver results before the contract ends. This sounds counterintuitive, but it actually makes a lot of sense.
This way, your data scientists know that they will get paid, even if nothing can be done with your data. This also removes the incentive to go on a wild goose chase, and rewards a truly honest assessment of your situation. It’s far better to spend a little bit of money and be sure something can’t be done, than to spend 10x as much on a data science project that can’t generate meaningful insights.
6. Never look at the data without a business application in mind
This refers to what’s known as data “mining” or “snooping”. A lot of data science projects are sold on the idea that if there is enough data, some incredible, hyper-valuable relationships must be hiding in it. The problem is that in any big dataset, many “interesting correlations” appear purely by chance.
To demonstrate this, we ran a totally random simulation with 2000 variables. Upon analysis, we found 80 significant correlations of over 40%. Yet, they were completely coincidental and meaningless!
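The effect is easy to reproduce yourself. Below is a minimal sketch of the same idea, not the original simulation: the dimensions are scaled down (200 variables, 50 observations rather than 2,000 variables) so it runs in a fraction of a second, yet purely random data still throws up dozens of correlations over 40%.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_vars = 50, 200  # scaled down from the article's 2000-variable run

# Purely random, independent columns: no real relationships exist here
data = rng.standard_normal((n_samples, n_vars))

# Pairwise Pearson correlations between all columns
corr = np.corrcoef(data, rowvar=False)

# Count unique off-diagonal pairs whose correlation exceeds 40% in magnitude
upper = np.triu_indices(n_vars, k=1)
spurious = int(np.sum(np.abs(corr[upper]) > 0.4))
print(f"{spurious} correlations over 40% found in pure noise")
```

Every one of those correlations is meaningless by construction, and with only 50 observations per variable, dozens of them clear the 40% bar anyway. More variables and noisier data only make the problem worse.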
The greater the noise and randomness in your data, the higher the likelihood that you’ll strike the Fool’s Gold of a false correlation. That’s why it’s best to start every project with a clear business application in mind, and avoid data snooping and mining altogether.
7. Never use cutting-edge research
Our experience has revealed the dangers of using the latest, most exciting research. Why? Because these types of models have been tested on extremely clean datasets. In academic settings, there is always a team of post-graduates or employed staff behind the scenes who are dedicated to cleaning the data so that the paper gets the best results possible. This is especially the case with natural language processing (NLP) datasets.
It’s easy to assume that most data exists somewhere on the continuum between academic and real-world data, but the reality is that they’re two entirely different species. For an academic piece of research, it takes years of robust, thorough analysis to prove its worth in the real world, and most technologies simply don’t meet this high bar.
It’s best to use “boring” and reliable technology that has proved its value over the last ten years, rather than the most recent, snazzy models.
If you are interested in adopting AI but you are unsure where to start or whether your business is ready for it, our Data Science Readiness Report can get you on the right path. Get in touch with us to find out more.