Recently, I wrote about how #BigData and #BigScience differ, taking almost opposite approaches to looking at data. Needless to say, I remain skeptical about the varying quality of what's being said and written about data, big or not. As a matter of fact, my main concern is about what one can infer, or pretend to infer, from that data. Data help us think about the world, yes. Yet they aren't the whole story. Reading posts on the Internet and the skyrocketing amount of new material about it, one must honestly ask oneself: is Data, especially since it became Big, an object of knowledge by itself?

In this post, I want to discuss the difference between covariations and correlations. In a context of data-driven decisions (a concept I've read about in the two books I mentioned last time), failing to distinguish covariations from correlations might lead to unexpected consequences, to say the least. The least harmful being, probably, to remain ignorant after all...

In my previous post mentioned above, I cited this sequence of tweets:

Here is the image of the original tweet:

The image of the original tweet.

Talking to strangers, and telling them they are wrong. What else is the Internet about?...

(xkcd: Duty Calls)

Anyway. As days passed, I couldn't help but keep thinking about this "fitting" problem. I think I have a (natural? normal? scientific?) reflex telling me that data isn't the whole story, but just one means, among others, to climb the ladder and stand on the shoulders of giants. The existing story we build upon, with the help of data, is the knowledge and understanding we have of our world, and the history of the discoveries that led to its current state. And that knowledge is based on correlations. Correlations that were observed, checked, verified, and understood (if I may play on words: in computer science, we would say these correlations were entirely 'debugged', since debugging is understanding).

But correlations and covariations don't have the same meaning! Simply stated, a covariation is the observation that when one parameter varies, another does as well, and vice versa. Covariations are (I love this rule from mathematics:) necessary but not sufficient to make correlations. Covariations are merely a hint about something happening under the hood. Covariations can have various 'shapes' or, in other words, can be represented graphically with various figures. The shape of that figure is certainly an excellent hint about the underlying phenomenon, but it is not the explanation by itself. On the other hand, understanding means giving a cause, or an explanation, to a covariation. While the study of covariations is full of lessons, it usually isn't enough to reach an explanation. And it is not a matter of quantity. Correlations live in a different space. 'Data-points fitting' isn't equal to understanding (obvious, isn't it? Or not?). Stated simply, a correlation joins the corpus of knowledge, while a covariation joins the corpus of observations.
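To make the point concrete, here is a minimal sketch, under assumptions of my own: two made-up quantities, `a` and `b`, are each driven by a third, hidden one, and neither is computed from the other. The function measures a covariation in the sense used here (statistics textbooks call this number Pearson's r). All names and values are hypothetical and only serve the illustration.

```python
import random

def covariation(xs, ys):
    """Normalized sample covariance of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical hidden driver: it shapes both a and b,
# but a never enters the computation of b, nor b that of a.
hidden = [rng.uniform(0, 10) for _ in range(1000)]
a = [h + rng.gauss(0, 1) for h in hidden]
b = [2 * h + rng.gauss(0, 1) for h in hidden]

r_ab = covariation(a, b)
print(f"covariation of a and b: {r_ab:.2f}")
```

The number comes out high, and yet nothing in it says that the explanation sits in `hidden`. Finding that out is the part the data points do not do for us.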

What amazes me most in this journey into BigData, and in the dozens of articles about it in every corner of the Internet, is – again – the very weak presence of words such as 'understanding', 'knowledge', 'research', 'Nature', etc. They are utterly dominated by the presence of 'insights', 'obvious', 'noise', 'pattern discovery', and also 'revolutionary', 'potential'; words that belong a lot more to marketing than to, well, science. <note>This little game about the number of occurrences of words in BigData articles should prompt me one day to perform a semantic and quantitative analysis of them... with BigData tools, of course!</note>
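The counting half of that little game could start as small as the sketch below. It is only an illustration: `count_terms` is a hypothetical helper, the sample sentence is made up rather than quoted from any article, and it handles single words only (a phrase like 'pattern discovery' would need more care).

```python
import re
from collections import Counter

def count_terms(text, terms):
    """Count how often each single-word term occurs in text, ignoring case."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return {term: counts[term] for term in terms}

# Made-up sample sentence, standing in for a real article.
sample = "Revolutionary insights! Our insights reveal the potential of your data."
tally = count_terms(sample, ["insights", "potential", "understanding", "knowledge"])
print(tally)
```

Counting words is, of course, itself only a collection of observations. What the tally means is another matter.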

Recently, I stumbled upon a truly excellent website that illustrates very well the general considerations outlined above. It is entitled A Visual Introduction to Machine Learning. (Machine Learning, for those who aren't really immersed in BigData, is one of the key techniques for manipulating the data. See the detailed Wikipedia entry about it.) The article is really well crafted (even if it doesn't fully work on Safari – prefer Chrome or Firefox). Please, to follow what's next in this post, read it (~10 min) and come back. I'll wait.

In the meantime, here is a small visual interlude, with the first image of an exoplanet. Are you seeing a large white-blueish dot and a small red dot too? How do you know they are not only dots? And what about the fundamental process of crafting meaning by placing, in a spatially structured manner, variations of colours in a limited rectangular 2D space, also known as an 'image'? How could this process even make sense to you? Isn't an image already a graphical representation of a lot of data?

Truly knowing how the electromagnetic fields of light combine to form constructive fringes that lead to a measurement of the spatial coherence along a line projected into a plane would already change forever your vision of what an image is.

Image Credits: E.S.O.

OK, back to our business. If you have just read the article, you probably have an idea of where I am heading with this post.

The article beautifully exemplifies the use of a Machine Learning technique. In this particular example, it seemingly allows one to classify members of a dataset into one of two categories: a home is either in New York or in San Francisco. We have 7 different types of data points. Literally: 'elevation', 'year built', 'bathrooms', 'bedrooms', 'price', 'square feet', 'price per sqft'.
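For readers who would rather see the mechanics in code, here is a minimal sketch of the simplest possible classifier of this kind: a single cut on one of the seven quantities ('elevation' here). It is an assumption of mine about the smallest building block of such a technique, not the article's own code. The function `best_threshold` is hypothetical, and the numbers are invented toy values, not taken from the article's dataset.

```python
def best_threshold(values, labels):
    """Return (threshold, accuracy) for the rule 'value > threshold means SF'.

    Every observed value is tried as a cut; the first one reaching
    the highest accuracy is kept.
    """
    best = (None, 0.0)
    for t in sorted(set(values)):
        correct = sum((v > t) == (lab == "SF") for v, lab in zip(values, labels))
        acc = correct / len(values)
        if acc > best[1]:
            best = (t, acc)
    return best

# Invented toy values, for illustration only.
elevation = [2, 5, 8, 10, 12, 40, 55, 73, 90, 6]
city = ["NY", "NY", "NY", "NY", "NY", "SF", "SF", "SF", "SF", "SF"]

threshold, accuracy = best_threshold(elevation, city)
print(threshold, accuracy)
```

The cut is found purely mechanically: the loop has no idea what an elevation or a city is, and the low-lying San Francisco home in the toy list is simply counted as a miss.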

Before saying anything about it, the immediate question that obviously should have struck you as well is: why not simply obtain the geographical coordinates of these houses?!! Given the problem they set out to solve, that would be the immediate and logical question to raise. (We note that the goal seems to change a little bit between the introduction – 'distinguish homes in New York from homes in San Francisco' – and the first section – 'determine whether a home is in San Francisco or in New York' – which is not really the same question. Anyway.)

But okay, that's an example. And examples are often a little bit silly, for the sake of demonstration, and they rarely demonstrate intelligence, but rather skills.

What this example beautifully illustrates is that machines are powerful, but are not smart. And those who pretend here and there that "BigData will revolutionise the way we think the man or the world" are probably seeking power rather than intelligence...

Here is a list of problematic points that the article does not even touch on:

  • How were the data types chosen?
  • Are the data types relevant to the question? Is there any other relevant quantity that could help solve the problem? (okay, okay...)
  • Are the data types enough to answer the question?
  • How were these data points obtained? Measured? Any error associated with them?
  • Are there any statistical biases? Instrumental ones? Data isn't just numbers, you know...
  • Were the data points taken all at the same time? How? By how many different people? Were there some outliers?
  • How do we know that the distributions of the points of each type can be compared? Are all these types meaningful to the question?
  • How do we know that all points have the same weight?
  • How do we know the problem is 'solved'? 
  • Actually, is the problem well- or ill-posed?

OK, there are more than that, but enough.

There is an obvious conclusion to all of this, though I am never quite sure that I didn't just miss something obvious myself. The conclusion I would formulate seems obvious to me, but if it is obvious to other people too, why do we (I) never hear it?

Conclusion: A data analysis does not lead to data science, even less to science pure and simple. And when you see 'exciting' data-scientist positions in companies that list a number of technologies you have to master before applying, simply be aware that science is probably everything but these required technical skills.