I am currently reading the book "Data-ism" by Steve Lohr. Something struck me in the first chapter. Steve explains (first chapter!) that BigData follows the principle "Measure everything first, ask questions later" (we could even say "find questions to ask later"). Boom. In one sentence, this could summarise what BigData "is". Funnily enough, a French-language book I am reading in parallel (a bit pompously subtitled "BigData: think man and world differently") says the exact same thing. And here too, to my surprise again...
Science, and more specifically Big Science (which is not a new buzzword, but rather a term coined by historians), is just the plain opposite: the questions it explores have been asked for as long as humans have been humans. And we are still struggling to grab all the data about them. When I say struggling, I mean it. Not only is the data difficult to obtain in itself (the tools required are so expensive that only states or large organisations can afford them), but the right to obtain it is also the result of fierce competition and a process lasting months.
I propose here to quickly illustrate the fantastic contrast between the data in BigData and the data in a Big Science such as modern observational astrophysics. They are as different as Day and Night. And you'll see that there is a key reason for that difference.
Let's start with some context. For a long time, the use of the word 'data' was probably concentrated in places like universities, research laboratories, probably some governmental agencies etc. Then came the fantastic revolution that is the Internet, and the enormous increase in the amount of data collected and processed today is one of its by-products. Not only does BigData induce important technology shifts, but as quantity becomes a quality, it induces totally new ways of considering and using this data. One could even say that, because it is too large to be seen (read: grasped at a glance), new ways of thinking arise.
I think that BigData can be... seen quite simply as the result of a combination of successive technical advances. First, the fundamental step of interconnecting all computers (the Internet and the www). Second, storage becoming a quasi-unlimited and extremely cheap resource (remember the first day of Gmail, with 1 GB free, April 1st 2004?). Sounds normal today, but it was not in 2004! And finally, the mobile revolution, where everybody lives with a computer-data-sensor-communicator all the time (2007, the iPhone). Sensor is key, here.
With all these technological advances combined, the amount of data produced by everybody started to grow exponentially. The scale of that amount created new problems, unseen before, in the collection, transmission, storage, organisation and structure, mining, sharing and analysis of that data. Most of that is now covered by the generic term 'Cloud' (although some people keep reminding us that there is no such thing as a cloud...). Hence the new tools developed for it: for instance MongoDB for databases (a non-relational DB), or Hadoop, a Java framework that allows working with thousands of nodes and petabytes of data. Ok. #BigDeal.
It is apparently such new and exciting stuff for many people in the IT industry that they look as if they have discovered a toy so large they could not have dreamed of it before. The world, or more precisely (and this is key) the vision they have of the (economic) world, starts to be quantifiable. It's all about startups, algorithms, data "intelligence", companies being reshaped to "accept data", or re-organised around data, or data marketing, GAFA (Google, Apple, Facebook, Amazon) etc. Not to mention the bazillions of dollars it drives (more on that later).
Measure everything first, ask questions later. How strong the contrast is with #BigScience!
Astrophysics is a Big Science because of the size and scale of the tools it requires to carry out even its first mission: exploration. Telescopes, observatories, satellites, instruments, arrays of gigantic antennas (thanks to the Photo Ambassadors of E.S.O. for sharing the amazing pictures that make this site beautiful) etc. are large and expensive tools requiring very specialised and trained people from many different disciplines. Astrophysics is also known to produce bazillions of bytes of data. <note>Astrophysics is not the science producing the largest amount of data, however. That title probably still belongs to ... CERN, where the web was created. For another post. </note>
For the sake of comparison with the amount of data an iPhone can take (dozens of GB per month), and how easy it is to share it with various services, let's briefly outline the process a single astronomer has to go through to obtain his/her data, taking the example of major world-class observatories. It goes as follows. Every 6 months, a "Call for proposals" is opened. Proposals are very specialised forms to be prepared by a scientific team. A proposal must contain (and this is absolutely key) a meaningful combination of, first and foremost, science motivation (is the subject worth the effort?), operational conditions (are you in the right place at the right moment, with the right observing conditions? Think about coordinates, brightness, the phase of the phenomenon you observe, moon phase, latitude etc.) and technical capabilities (are the telescope, its instrument and the requested sophisticated configuration the right ones to possibly answer the scientific question you want to ask, assuming this question is valid?)...
It is hard. You usually need an advanced university degree to reach that level. Simply because it is very hard to ask sensible questions.
Let's assume you have this combination, and you managed to write it down in a very precise yet concise way and... on time. Your proposal is reviewed by a panel of 'experts' who judge all proposals and rank them (a necessarily imperfect choice, and tensions and conflicts arise regularly, but nobody has proposed a better way so far). A threshold is set, and the number of nights above that threshold (assuming there are no conflicts of targets, dates or configurations between proposals) is compared to the number of nights available. And well, there are only about 180 nights in 6 months, once special / technical nights are accounted for. So only the top proposals are granted. At that stage you still have 0 bytes of data.
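To make that arithmetic concrete, here is a minimal Python sketch of the allocation step. It is only an illustration under my own assumptions: the proposal names, grades and night counts are invented, the "higher grade is better" convention is hypothetical, and real panels also have to deal with conflicts of targets, dates and configurations, which this ignores.

```python
from dataclasses import dataclass

NIGHTS_AVAILABLE = 180  # roughly one 6-month period, as quoted above


@dataclass
class Proposal:
    name: str     # hypothetical identifier
    grade: float  # panel grade, higher is better (assumed convention)
    nights: int   # nights requested


def allocate(proposals, nights_available=NIGHTS_AVAILABLE):
    """Grant proposals in order of grade until the nights run out."""
    granted, remaining = [], nights_available
    for p in sorted(proposals, key=lambda p: p.grade, reverse=True):
        if p.nights > remaining:
            break  # the threshold: everything ranked below is rejected
        granted.append(p)
        remaining -= p.nights
    return granted


def requested_over_available(proposals, nights_available=NIGHTS_AVAILABLE):
    """Ratio of nights requested to nights available."""
    return sum(p.nights for p in proposals) / nights_available


# Invented example: 6 proposals asking for 540 nights in total.
proposals = [
    Proposal("A", 4.5, 60),
    Proposal("B", 4.1, 90),
    Proposal("C", 3.8, 120),
    Proposal("D", 3.2, 100),
    Proposal("E", 2.9, 80),
    Proposal("F", 2.0, 90),
]

granted = allocate(proposals)
print([p.name for p in granted])            # ['A', 'B']
print(requested_over_available(proposals))  # 3.0
```

With three times more nights requested than available, only two of these six made-up proposals get through, and the four others leave with nothing.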
Let's assume that, given a pressure factor between 3 and 20, your proposal is granted. Woohoo! Congratulations. Between 3 and 9 months after that day (i.e. between 6 and 12 months after you submitted your proposal), you travel to the observatory. There, you are welcomed by support astronomers who will guide you through the complex process of preparing your observations with dedicated custom software, giving you the latest details about the instrument, the observatory, the constraints, the news etc. Assuming the observatory, telescope and instrument are all running well (that's far from guaranteed in small observatories), you finally cross your fingers for the wind to remain low, the humidity not too high and, more importantly, that there will be no clouds. And if by bad luck clouds prevent you from obtaining a single bit of data, too bad for you! Thanks for coming, but other people are waiting in line for the coming nights. Please, come back next year. If your new proposal is accepted.
If all goes well (and yes, most of the time it all goes pretty well, fortunately), you work through the night, manipulating complex opto-mechanical instruments with dedicated software to obtain... raw data. That is, data full of imperfections, full of problems, more or less manageable. Once back home, you'll have to work, sometimes for entire weeks, to transform this raw data into usable and true scientific data. That's it. Now, the work of thinking about your scientific question can continue...
Isn't the contrast amazing? The scientific data in this case is just extremely expensive in terms of energy, effort, risk of failure, people involved, and time spent preparing and justifying it! Day and Night.
In the end, to me, this difference all comes from the difference of approach I mentioned at the beginning: BigScience has tons of open questions. But they are very, very hard to answer. And they require very sophisticated tooling and observational performance just to scratch the surface of the question. BigData is flowing through our devices. And yet we look for questions to ask with it. But what questions? New business questions? Some call "revolutions" things that are only innovations, or more simply progress in a field...
I may be a bit too simplistic here. There are indeed very important domains for humans (such as health, quite obviously, perhaps too obviously?) that would benefit from a "measure first, think later" approach (that's the first example in Steve Lohr's book). So the key difference is not so much the volume of data, its variety or its velocity (the 3V). BigScience is accustomed to at least the first one.
No, what struck me most, when reading things about BigData or DataScience, is the absence of two words: knowledge and understanding. It seems that BigData doesn't work to increase knowledge. I do not mean detecting "patterns" (by which some are so fascinated) in highly noisy data. I mean reproducible knowledge, gained through the understanding of the underlying phenomena. You know, science...
Calling "science" something that does not focus on knowledge and understanding is a bit problematic to me. The rush to the new gold era of BigData and DataScience (which is real, with skyrocketing investments in it) will all appear slightly artificial to usual (academic) "scientists". For sure, if they embrace the business side of the force, scientists leaving academia have definite experience in thinking about data, and hence a critical opinion about it.
Talking about thinking... (ok, these are only Tweets).
"Fitting" isn't meaningful by itself. (Click on the image link – pic.twitter.com... - to see what I mean).
When it's beautiful, it tends to be over-interpreted. I wish everyone could follow Edward Tufte's courses, or read his absolutely stunning and brilliant books... See what I mean?
Thinking is slow. Thinking right is very slow. Business is fast. Decision making must be fast, especially(?) in business. Bringing the two together is an interesting challenge!
As a matter of fact, doesn't all that make perfect sense? It may be a great stroke of luck that BigData mostly contains very noisy, hard-to-reproduce, barely meaningful "patterns". That's the only way thinking with its 3V (Volume, Variety, Velocity) is humanly possible. Just imagine that amount of data, at that rate, but as meaningful as what BigScience Data can produce?... That's an interesting question for the next post!