A Brief History of Data: Part 3 - Big Data
What is the history of Big Data? This article, the third in a series on the history of data, traces its evolution over time. We recommend first reading our previous article on the history of analytics.
The three V's
Let's start by defining what Big Data is. It is most commonly defined by the "3 Vs":
- Volume: With Big Data, data is collected in large quantities. Quantity here means the number of bytes, and Big Data typically starts at the terabyte scale.
- Variety: Big Data involves data that is diverse in nature: numerical data, images, video, text or sound. It can be structured (numerical data, for example) or unstructured (text, image, sound, video), and each type requires different processing.
- Velocity: Big Data also means an increase in the capacity to process large volumes quickly. Older machines could process 1 TB of data, but it would take several years. Big Data implies being able to process that terabyte in a few minutes, or even seconds.
Increasingly, people speak of the "5 Vs", which add two further characteristics:
- Value: The data collected should have intrinsic value. This is not true of all data: some has little or no use, and collecting it brings little benefit.
- Veracity: This is synonymous with reliability. Collecting a lot of inaccurate data is of little use and can lead to significant errors in analyses and predictions.
Chronology of Big Data
Although the term Big Data first appeared only in 1997, in a publication of the ACM (Association for Computing Machinery), the sharp increase in the volume of data began as early as the 1970s. From then on, each increase in storage capacity was directly followed by an increase in the amount of data stored, one leading to the other. This was the beginning of the data race.
Since the 2000s
The arrival of parallelism in the 2000s, with multi-core, multi-processor machines at the scale of a single computer and clusters of networked machines at a larger scale, made it possible to break complex calculations down into several calculations performed separately. In this type of architecture, each component can work independently: this is the "shared nothing" principle.
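As a rough illustration of breaking one calculation into several that run separately, here is a minimal Python sketch. The function names and the sum-of-squares task are hypothetical, chosen only for the example. Threads stand in for independent workers to keep it short; on a real cluster, each partition would be handled by a separate process or machine.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(values, n_partitions):
    """Cut the input into n_partitions contiguous, non-overlapping slices."""
    size, remainder = divmod(len(values), n_partitions)
    partitions, start = [], 0
    for i in range(n_partitions):
        # The first `remainder` partitions take one extra element.
        end = start + size + (1 if i < remainder else 0)
        partitions.append(values[start:end])
        start = end
    return partitions


def partial_sum_of_squares(partition):
    """Work done by one worker: it only ever sees its own partition."""
    return sum(x * x for x in partition)


def parallel_sum_of_squares(values, n_workers=4):
    """Split the work, compute each piece separately, then combine."""
    partitions = split_into_partitions(values, n_workers)
    # Threads are an assumption made for brevity; they stand in for nodes.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(partial_sum_of_squares, partitions))
    # Only the small partial results are brought back together.
    return sum(partial_results)
```

Each worker receives its own slice and returns a single number, so no worker depends on another worker's data. That independence is the core of the "shared nothing" idea.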
It was also during this period that two fundamental elements emerged to enable the development of Big Data. On the one hand, NoSQL relaxes some of the constraints of SQL, making it possible to query larger volumes of data more quickly. On the other hand, the storage architecture was completely rethought, with systems such as:
- The data lake, where data is stored across many clusters in raw form so that it can be written quickly.
- Cloud computing, which handles this over the network by offering on-demand services on shared resources.
- DFS (distributed file systems), where large files are stored across several machines.
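To make the idea of a file stored across several machines more concrete, here is a toy Python sketch. Everything in it is an assumption for illustration: the node names, the tiny block size, the round-robin placement and the use of two copies per block are not taken from any particular distributed file system.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(n_blocks, nodes, replicas=2):
    """Assign each block to `replicas` distinct nodes, round-robin."""
    if replicas > len(nodes):
        raise ValueError("not enough nodes for the requested replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replicas)]
        for block_id in range(n_blocks)
    }


def write_file(data, nodes, block_size=4, replicas=2):
    """Store every block on the nodes chosen by place_blocks."""
    blocks = split_into_blocks(data, block_size)
    placement = place_blocks(len(blocks), nodes, replicas)
    storage = {node: {} for node in nodes}  # node name -> {block_id: bytes}
    for block_id, holders in placement.items():
        for node in holders:
            storage[node][block_id] = blocks[block_id]
    return storage, placement


def read_file(storage, placement, unavailable=frozenset()):
    """Reassemble the file, skipping nodes listed as unavailable."""
    parts = []
    for block_id in sorted(placement):
        # Take the first holder that is still reachable.
        holder = next(n for n in placement[block_id] if n not in unavailable)
        parts.append(storage[holder][block_id])
    return b"".join(parts)
```

With three hypothetical nodes, `write_file(b"hello big data!", ["node-a", "node-b", "node-c"])` spreads the blocks over all of them, and `read_file` can still rebuild the file when one node is marked unavailable, because every block has a second copy elsewhere.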
Supercomputers began to appear around 2005. In France, the most important include those of Météo-France and the CEA (the French Alternative Energies and Atomic Energy Commission), as well as those of other research centers.
In 2010, Eric Schmidt, then CEO of Google, stated that as much data had been produced in 2009-2010 as from the birth of the Earth until 2003.
For now, the amount of data collected continues to grow each year. Today, digital data is estimated to account for 3 to 4% of global greenhouse gas emissions.