The shortcomings of data science

Data science suffers from many flaws, well known to practitioners.

Machine learning models are made to be applied, not understood.

Only the simplest models can be interpreted. While getting first insights from linear models is of course best practice, those are put aside in production, when accuracy matters.

How could you explain the “reasoning” behind a tree-based or ensemble model? Any model able to capture feature interactions won’t be explainable. Neural networks are specified by millions of coefficients but surely can’t be described by them. Even feature engineering can hide dark corners.

Can we escape from black boxes?

I never like to call anything a black box. My feeling is that a logistic regression is no less a black box than a k-nn classifier. The latter only has a less straightforward sensitivity analysis. As a data scientist I was trained not to think of SVMs or random forests as alien. Aren’t they natural?
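To make the comparison concrete, here is a minimal sketch of a finite-difference sensitivity analysis applied identically to both models. It assumes scikit-learn and synthetic data; the `sensitivity` helper, the shift `delta=0.5` and `k=15` are illustrative choices of mine, not a standard recipe.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier


def sensitivity(model, X, feature, delta):
    """Mean absolute change in P(class 1) when `feature` is shifted by `delta`."""
    X_shifted = X.copy()
    X_shifted[:, feature] += delta
    before = model.predict_proba(X)[:, 1]
    after = model.predict_proba(X_shifted)[:, 1]
    return float(np.mean(np.abs(after - before)))


# Synthetic data, used only for illustration (hypothetical problem).
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000).fit(X, y),
    "k-nn (k=15)": KNeighborsClassifier(n_neighbors=15).fit(X, y),
}

# One sensitivity value per feature and per model.
sensitivities = {
    name: [sensitivity(model, X, j, delta=0.5) for j in range(X.shape[1])]
    for name, model in models.items()
}
for name, values in sensitivities.items():
    print(name, np.round(values, 3))
```

For the logistic regression these numbers can be read almost directly off the coefficients. For k-nn they have to be measured, and they depend on where in the feature space you measure them. That is what I mean by “less straightforward”.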

In fact, asking for models whose internals are understandable seems the wrong direction to me. Today, understanding via modeling¹ has been replaced by understanding via feature engineering. We may express this as “in cross-validation we trust.”²

The problem is that while I can try to improve my models using my intuition, whether geometric, mathematical or grounded in common sense, I never truly know why they behave as they do. I have to admit all models are black boxes:

What is a better model able to capture that the previous one did not? Why did the model make this decision? How does it “see” the data? Could I even get much better results? Where and why is my tuning working?

Churning through models does not enlighten

In recent years, we have seen a rush towards ever more complicated models³. Yesterday it was random forests. Now boosted trees are all the rage. Many just use XGBoost by default, and they are right, since it works easily. Many have started adding features computed from models to the usual features. And blending outputs from tSNE⁴. And thinking about deep learning from day 1 despite already having clear features.

Did the results improve so much? Not really: we gained a few percentage points in the process. In Kaggle competitions my impression from reading winners' comments is that most of their edge comes from features⁵, carefully controlling over-fitting, ensembling and finally better models. More complicated models provide only incremental gains.

This should not come as a surprise. The unreasonable effectiveness of data over models limits, if not models' usefulness, at least their specific advantage. Today’s bigger datasets unlocked most of today’s improvements in machine learning.

We are missing something

Recent years have been marked by the rise of deep learning. We are very fortunate that neural networks give us easy ways to glance at what they have learned on each layer. Generating images that activate the upper layers provides eye-opening insight into what we actually computed.

Neural networks' generative abilities never cease to amaze, from the so-called deep dreams to text generation from raw characters. They can learn exceptional linear embeddings, and they even lend themselves to visualising image classification models and saliency maps.

Still, the recent findings about adversarial training, persistent trivial errors, or universal adversarial perturbations suggest there is a lot we don’t yet grasp about how our models, even basic ones, understand the data.

We need a torch to understand how models think.

Some researchers gave up trying to explain models' “structure”, their internals, and have focused on explaining their “behaviour”⁶. I have long been enthusiastic about Ayasdi’s work: they exploit topological data analysis tools to shed light on the data’s shape.
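One simple way to probe behaviour without opening the model is permutation importance: shuffle one feature on held-out data and watch how much the score drops. The sketch below is only an illustration, assuming scikit-learn's `permutation_importance` and synthetic data; it is not the method of the work cited above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False the first 3 columns are the informative
# ones and the last 3 are pure noise (illustrative setup, not real data).
X, y = make_classification(n_samples=600, n_features=6, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature in turn on held-out data and record the accuracy drop.
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)

# Features ordered from most to least important.
ranking = result.importances_mean.argsort()[::-1]
for j in ranking:
    print(f"feature {j}: {result.importances_mean[j]:.3f} "
          f"+/- {result.importances_std[j]:.3f}")
```

The same call works unchanged for any fitted estimator. It treats the model purely as a function from inputs to predictions, so it says nothing about the trees themselves.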

We will shortly publish advice for model debugging.

When does model introspection become mainstream?

We are blessed with better tools every year. Does this feel familiar?

from sklearn import *

Standardization is a good force. Data scientists are now given performant implementations of all common algorithms on all data platforms. Worrying about implementation correctness is mostly gone. A lot of effort was put into being able to build ever more complicated pipelines: distributed, real-time…

Still, automatic tools for model debugging are not so common. We will need them:

There is no Moore’s law for machine learning’s power.

We will keep having more data. But most of the time big data will remain just unaggregated data. “Big data science” will not be a silver bullet. It may even be a distraction.

The tools we need

Understand our model’s failure modes. This is still the most insightful debugging tool.
Easy-to-use sensitivity analysis.
Easy-to-use dimensionality reduction visualization.
Model introspection could be done by generating exemplars (of the data, of a specific class...) as neural networks have been doing.
Maybe easy access to visualization tools specific to particular algorithms⁷.
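The first item in the list above is already within reach with standard tooling. Here is a minimal sketch, assuming scikit-learn and its bundled digits dataset; the choice of model and the cut-off of five examples are arbitrary. It finds the most frequent confusion and the mistakes the model was most confident about.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
proba = model.predict_proba(X_test)
pred = model.classes_[proba.argmax(axis=1)]

# Failure mode 1: which pair of classes gets confused most often?
cm = confusion_matrix(y_test, pred, labels=model.classes_)
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)  # keep only the errors
true_idx, pred_idx = np.unravel_index(off_diag.argmax(), off_diag.shape)
print(f"most frequent confusion: true {model.classes_[true_idx]} "
      f"predicted as {model.classes_[pred_idx]} ({off_diag.max()} times)")

# Failure mode 2: the errors made with the highest confidence.
wrong = np.flatnonzero(pred != y_test)
confidence = proba[wrong].max(axis=1)
worst = wrong[np.argsort(confidence)[::-1][:5]]
for i in worst:
    print(f"test row {i}: true={y_test[i]} predicted={pred[i]} "
          f"confidence={proba[i].max():.2f}")
```

Looking at the rows in `worst` one by one (here, plotting the 8x8 images) is usually where the insight comes from. The code only tells you where to look.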

Historically, modeling was often slow to catch empirical “insight”. For instance, Kepler used Tycho Brahe’s astronomical data to devise his laws. His third law may be seen as an early success of blind linear regression! ↩︎
Our clients and managers think we deal with statistical flukes with our (often) formal training in statistics. ↩︎
I don’t count as complicated some techniques that might surprise you. Learning how to do gradient descent with gradient descent is almost natural in my book, for instance. Adversarial training with DCGAN is a great idea and I’ll happily excuse its “complicated” details. To me it is a step in the right direction, towards a non-parametric everything. ↩︎
Using tSNE is in fact a good idea, see e.g. the Otto challenge on Kaggle. Don’t miss learning about tSNE’s internals. ↩︎
Isn’t the premise of neural networks that they can learn a hierarchy of features? Only then come all the tricks, the RNN magic, etc., which are to me less relevant here. See the unreasonable effectiveness of deep learning. ↩︎
Using Visual Analytics to Interpret Predictive Machine Learning Models - Josua Krause et al. ↩︎
AirBnB made a convincing go at random forest interpretation by investigating where they cut variables. ↩︎
