Philipp Rebsamen, 21.10.2019
Big Data is not just a synonym for lots of data. It is better understood as a concept: finding insights by leveraging existing or newly generated data sources. Part one of this small series of blog posts covers the basic question behind every Big Data project: what data do I have?
Data is usually generated by three different sources:
Machine-generated data is by far the biggest source of data today, and its share will keep growing as more and more devices become “smart” and connected to the Internet of Things. This data comes from sensors, cameras or log files of devices, vehicles or industrial machinery.
Then there is human-generated data, which refers to all the data that people actively publish themselves. This includes social media data such as status updates and tweets, but also web searches, emails, texts, videos and pictures. Human-generated data is usually highly unstructured and thus has no well-defined format or file type.
The last group is organizational data, which consists of records generated by business applications and operations. Most of the time this type of data is already well structured (for example transaction records or logs) and can be leveraged directly.
What makes data “big”?
Now that we know where our data comes from, how do we know whether we are in fact dealing with a “big” data project? We commonly refer to the following three characteristics, also known as the “3Vs”: volume, velocity and variety.
The first is volume. The sheer amount of data you will face defines a Big Data project. It is now very common to have terabytes or petabytes of data in an enterprise environment, often growing at an exponential rate. This requires a careful evaluation of the underlying system architecture, so that it can support your expected storage requirements.
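To get a feeling for what exponential growth means for storage planning, here is a minimal back-of-the-envelope sketch. The starting size (50 TB), the growth rate (40 % per year) and the function name `project_storage_tb` are made-up assumptions for illustration, not figures from a real project.

```python
def project_storage_tb(initial_tb: float, annual_growth: float, years: int) -> list[float]:
    """Project storage needs with compound annual growth.

    Index 0 of the returned list is today, index 1 is one year from now, etc.
    """
    return [initial_tb * (1 + annual_growth) ** year for year in range(years + 1)]


# Hypothetical numbers: 50 TB today, growing by 40 % per year, over 5 years.
projection = project_storage_tb(initial_tb=50.0, annual_growth=0.4, years=5)

for year, terabytes in enumerate(projection):
    print(f"year {year}: {terabytes:,.1f} TB")
```

Even with these modest assumptions the required capacity grows more than fivefold within five years, which is why the storage architecture should be evaluated against the projected size rather than the current one.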
The second is velocity. In the era of Big Data, the interval between two samples is constantly decreasing. Just a couple of years ago a daily health update might have seemed perfectly sufficient; nowadays we are looking at processing data in real time or near real time. The high velocity of incoming data streams defines a Big Data project.
The third is variety. Earlier in this post we looked at where data originates and how diverse the range of generated data can be. To generate meaningful insights you will often face the challenge of collecting data from a multitude of different sources in different formats. This variety defines a Big Data project.
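As a small illustration of what this variety can look like in practice, the following sketch brings one example from each of the three data sources into a common record shape. The field names (`ts`, `temp_c`, `date`, `amount`), the sample values and the target record layout are all hypothetical and chosen only for this example.

```python
import csv
import io
import json


def from_sensor_json(text: str) -> dict:
    """Machine-generated: a single JSON sensor reading (assumed fields: ts, temp_c)."""
    reading = json.loads(text)
    return {"source": "machine", "timestamp": reading["ts"], "value": str(reading["temp_c"])}


def from_transactions_csv(text: str) -> list[dict]:
    """Organizational: CSV transaction records (assumed columns: date, amount)."""
    rows = csv.DictReader(io.StringIO(text))
    return [{"source": "organizational", "timestamp": row["date"], "value": row["amount"]} for row in rows]


def from_post(text: str, timestamp: str) -> dict:
    """Human-generated: free text without structure, so it is stored as-is."""
    return {"source": "human", "timestamp": timestamp, "value": text}


# Made-up sample inputs, one per data source.
sensor_json = '{"ts": "2019-10-21T08:00:00Z", "temp_c": 21.5}'
transactions_csv = "date,amount\n2019-10-21,19.90\n2019-10-20,5.00\n"
post_text = "Loving the new coffee machine at work!"

records = [
    from_sensor_json(sensor_json),
    *from_transactions_csv(transactions_csv),
    from_post(post_text, "2019-10-21T09:15:00Z"),
]

for record in records:
    print(record)
```

Each source needs its own small parser, but once everything shares the same keys the records can be handled together. A real project would of course deal with many more formats and with far messier input.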
Now that you know where your data is coming from and that it satisfies the “3V” criteria of Big Data, what’s next? Read on in part two of this series, where we will look at ways to ingest, process and store the data.