What Is BIG DATA? Why?

Simply put, BIG DATA means a large amount/volume of structured and unstructured data, too much to be kept/stored on one machine. The 3Vs that describe Big Data were first proposed by analyst Doug Laney in 2001, and in 2012 Gartner (an American research and advisory firm) built its updated definition of Big Data around them.

1. Volume - The amount of data that is generated and stored.
2. Velocity - The speed at which the data is generated and processed (that is, how quickly Big Data must be analysed).
3. Variety - The data comes in different types: structured and unstructured.

Adding two more characteristics gives what are called the 5Vs:

4. Variability - Inconsistency of the data

5. Veracity - Uncertainty of the data

_________________________________________________________

WHY?

Imagine for a moment that you own a clothing shop. If you can track all the past data about your business, customers, and trends, you can apply analytics to it to get business ideas and maximize your profit.

Simply put, big data analytics helps you harness data and use it to find new patterns and opportunities, which lead to more efficient business moves, operations, and services, and to happier customers.

_________________________________________________________

METHODS/TERMS

An RDBMS (Relational Database Management System) faces several difficulties when handling Big Data. Most of the data arrives in semi-structured or unstructured formats from many different sources, and traditional databases handle such data poorly. Scalability was another main problem, since all the tuples/records of a relation had to be stored on one machine. These limitations of RDBMS led to the emergence of new technologies.

NoSQL - NoSQL does not mean 'no SQL'; it means 'not only SQL'. It refers to non-relational, distributed data stores.

Hadoop - Launched in 2006 as an open source project under Apache, it is a platform for distributed storage and processing of large data sets.

HDFS - Hadoop Distributed File System. It provides scalable, fault-tolerant, cost-efficient storage for Big Data.

HBase - It's a distributed, non-relational database. It can be used when we want real-time read/write access to our data (Big Data).

MapReduce - It's the main processing model of Hadoop. 'Map' is the step where individual input elements are broken down into key/value pairs. 'Reduce' is the step that takes the output of 'Map' as its input and combines those pairs into a smaller set of tuples.
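
To make the 'Map' and 'Reduce' steps concrete, here is a minimal sketch in plain Python. It is not Hadoop code and does not use any Hadoop API; it only imitates the idea on one machine with a word count. The function names (map_phase, shuffle, reduce_phase, word_count) are made up for this illustration, and the grouping step in the middle is something a real framework would do for you across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: break one input record into (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key (a real framework does this between Map and Reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single, smaller tuple."""
    return key, sum(values)


def word_count(lines):
    """Run Map over every line, group by key, then Reduce each group."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data is everywhere"]))
```

Each line is mapped independently of the others, which is why this step could be spread over many machines; only the Reduce step needs to see all the values for a given key together.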

YARN (Yet Another Resource Negotiator) - It's a core part of Hadoop that manages cluster resources and schedules jobs.

Spark - It's a cluster computing system. It can run on Hadoop, standalone, or in the cloud, and it can read from data sources such as HBase and HDFS.

Hive - Hive provides data summarization, querying, and analysis on top of Hadoop.

Kafka - It's an open source stream-processing platform written in Java and Scala.

_________________________________________________________

To Be Continued.....

(All of the methods/frameworks mentioned above, and more, will be explained individually in upcoming blog posts.)

Supun Tennakoon
