Introduction to Big Data and 5Vs

Big Data deals with massive amounts of unstructured or semi-structured data that keeps growing continuously, and with collecting and processing that data into a structured, usable form. In simple words, Big Data refers to extremely large and complex datasets that cannot be easily processed or managed with traditional data processing tools.

Characteristics of Big Data


These datasets are traditionally characterized by three Vs:

1. Volume: Big data involves a massive amount of information. This could be terabytes, petabytes, or even exabytes of data, far beyond what traditional databases can handle. Volume refers to the sheer amount of data, which is growing at a very fast pace day by day.
 
The amount of data generated by humans, machines, and their interactions on social media alone is massive. Researchers had predicted that 40 zettabytes (40,000 exabytes) of data would be generated by 2020, roughly a 300-fold increase over 2005.
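
To get a feel for these scales, here is a small Python sketch added for this tutorial (it assumes decimal units, where 1 zettabyte = 1,000 exabytes = 10^21 bytes) that converts the 40-zettabyte figure into smaller units:

# Illustrative only: decimal data-size units.
BYTES_PER_TB = 10 ** 12
BYTES_PER_EB = 10 ** 18
BYTES_PER_ZB = 10 ** 21

predicted_bytes = 40 * BYTES_PER_ZB       # the 40 ZB figure mentioned above
print(predicted_bytes / BYTES_PER_EB)     # 40000.0 exabytes
print(predicted_bytes / BYTES_PER_TB)     # 40,000,000,000.0 terabytes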

2. Velocity: Data is generated at an unprecedented speed. Social media posts, online transactions, sensor data, and other sources continuously produce new information that needs to be processed quickly. Velocity is defined as the pace at which different sources generate the data every day. This flow of data is massive and continuous. 
 
Facebook, for example, has reported 1.03 billion daily active users on mobile, an increase of 22% year over year. This shows how fast the number of users on social media is growing and how fast data is being generated every day. If you are able to handle this velocity, you can generate insights and make decisions based on real-time data.
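
A minimal sketch of what handling velocity looks like in code is shown below. The event source here is simulated in plain Python for this tutorial and is not tied to any particular product; in a real system the events would arrive from a source such as a message queue, and the processing would keep up with them in near real time instead of storing everything first.

import time
import random
from collections import Counter

def event_stream():
    """Simulated source of continuously arriving events (posts, transactions, sensor readings)."""
    while True:
        yield {"user_id": random.randint(1, 1000), "ts": time.time()}

# Keep a running count of events per second as they arrive.
events_per_second = Counter()
for i, event in enumerate(event_stream()):
    events_per_second[int(event["ts"])] += 1
    if i >= 10_000:          # stop the demo after 10,000 events
        break

print(events_per_second.most_common(3))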

3. Variety: Since many different sources contribute to Big Data, the types of data they generate differ as well. Big data comes in various formats: structured data (like database tables), semi-structured data (like XML or JSON files), and unstructured data (like text documents, images, and videos). Hence, a wide variety of data is generated every day.
 
Earlier, data mostly came from spreadsheets and databases; now it also arrives as images, audio, video, sensor readings, and more. This variety of mostly unstructured data creates problems in capturing, storing, mining, and analyzing it.
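
The sketch below illustrates the three broad categories with plain Python. The file names are hypothetical placeholders; the point is only that each category needs a different way of reading and interpreting the data.

import csv
import json

# Structured: fixed rows and columns with a known schema (hypothetical file).
with open("sales.csv", newline="") as f:
    structured_rows = list(csv.DictReader(f))

# Semi-structured: nested fields whose shape can vary from record to record.
with open("events.json") as f:
    semi_structured = json.load(f)

# Unstructured: free text (or images, audio, video) with no schema at all.
with open("review.txt", encoding="utf-8") as f:
    unstructured_text = f.read()

print(len(structured_rows), len(semi_structured), len(unstructured_text.split()))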

In addition to these three Vs, some definitions also include additional characteristics like: 
 
4. Veracity: Refers to the quality and accuracy of the data. Big data may come from diverse sources, and ensuring its accuracy can be a challenge. Veracity therefore describes the doubt or uncertainty about the available data caused by inconsistency and incompleteness. In a typical dataset, some values may be missing, and others may be hard to accept, for example a minimum value recorded as 15000 where such a figure is clearly impossible. This inconsistency and incompleteness is veracity.

Available data can sometimes be messy and may be difficult to trust. With many forms of big data, quality and accuracy are difficult to control, as with Twitter posts full of hashtags, abbreviations, typos, and colloquial speech. The sheer volume is often the reason behind the lack of quality and accuracy in the data.
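
A simple way to see veracity in practice is to run basic quality checks over incoming records. The sketch below (with illustrative records and thresholds, not from any real dataset) flags the two problems described above: missing values and values that fall outside a plausible range.

# Illustrative records: one clean, one with a missing value, one with an impossible value.
records = [
    {"id": 1, "minimum": 12, "maximum": 31},
    {"id": 2, "minimum": None, "maximum": 29},
    {"id": 3, "minimum": 15000, "maximum": 28},
]

def quality_issues(record, low=-100, high=100):
    """Return a list of veracity problems found in a single record."""
    issues = []
    for field in ("minimum", "maximum"):
        value = record.get(field)
        if value is None:
            issues.append(f"{field} is missing")
        elif not (low <= value <= high):
            issues.append(f"{field}={value} is outside the plausible range")
    return issues

for record in records:
    print(record["id"], quality_issues(record))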

5. Value: Extracting meaningful insights and value from big data is a primary goal. This involves analyzing and interpreting the data to make informed decisions.

To handle big data, specialized tools and technologies have been developed, such as distributed computing frameworks (like Apache Hadoop), NoSQL databases, and various data processing and analytics tools. Big data analytics involves extracting patterns, trends, and insights from these massive datasets to support decision-making processes in various industries and fields.
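
The classic introductory example for distributed frameworks such as Hadoop is word count. The sketch below shows the map, shuffle, and reduce steps on a single machine in plain Python; it is only meant to convey the idea, since a real Hadoop job would distribute these steps across a cluster.

from collections import defaultdict

documents = ["big data needs big tools", "data has value"]

# Map: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all pairs by their key (the word).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # e.g. {'big': 2, 'data': 2, 'needs': 1, ...}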


Happy Exploring!
