
Wednesday, January 25, 2017

Big Data Challenges and how Hadoop came into existence

Big Data, as the term suggests, refers to data sets so large that they create bottlenecks in:
  • Storage
  • Transfer
  • Sharing
  • Analysis
  • Processing
  • Visualization
  • Security 
Big data is not just about size:
– It finds insights in complex, noisy, heterogeneous, longitudinal, and voluminous data
– It aims to answer questions that were previously unanswerable

In the traditional approach, we use a data warehouse to store data (OLTP/OLAP) in a structured format, process it, perform data mining, and build reports for higher-level analysis.
This approach works fine for applications that process volumes of data small enough to be accommodated by standard database servers, or within the limits of the processor handling the data. But when it comes to huge, ever-growing volumes of data, this traditional approach breaks down.

This is where Big Data brought distributed systems into the picture, most notably through the MapReduce model.
For example, one machine with 4 I/O channels, each running at 100 MB/s, can read 1 terabyte of data in approximately 42 minutes.
But with a distributed system of 100 such machines, each with 4 channels at 100 MB/s, the same data can be read in about 25 seconds.
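The arithmetic above can be checked with a quick back-of-envelope script (illustrative only; real throughput depends on disks, network, and replication):

```python
DATA_MB = 1_000_000   # 1 TB expressed in MB
CHANNEL_SPEED = 100   # MB/s per I/O channel
CHANNELS = 4          # I/O channels per machine

def read_time_minutes(machines: int) -> float:
    """Time to read the full data set when it is split evenly across machines."""
    throughput = machines * CHANNELS * CHANNEL_SPEED  # aggregate MB/s
    return DATA_MB / throughput / 60                  # minutes

print(round(read_time_minutes(1), 1))        # 41.7 (minutes, one machine)
print(round(read_time_minutes(100) * 60, 1)) # 25.0 (seconds, 100 machines)
```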

To exploit distributed systems, the MapReduce algorithm was used. It divides a job into small tasks, assigns them to many computers (a cluster), and collects the results from them, which, when integrated, form the output data set.
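The divide-assign-collect flow can be sketched with the classic word-count example. This is a toy, single-process simulation of the map, shuffle, and reduce phases, not a real Hadoop job:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in an input split
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values collected for each key
    return {key: sum(values) for key, values in groups.items()}

# Two "splits", as if stored on two different machines
splits = ["big data big insight", "data data everywhere"]
pairs = [p for doc in splits for p in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 3, 'insight': 1, 'everywhere': 1}
```

In a real cluster each split is mapped on the machine that stores it, and only the shuffled intermediate pairs travel over the network.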



Using the above approach, Doug Cutting and his team developed an open-source project called HADOOP. Written in Java, it allows distributed processing of large data sets across clusters of computers using simple programming models. The Hadoop framework runs in an environment that provides distributed storage and computation across clusters of computers, and it is designed to scale from a single server to thousands of machines, each offering local computation and storage.

Challenges with Hadoop:

Hadoop is not suitable for On-Line Transaction Processing (OLTP) workloads, where structured data is accessed randomly, as in a relational database. Nor is it suitable for On-Line Analytical Processing (OLAP) or Decision Support System workloads, where structured data is accessed sequentially to generate reports that provide business intelligence. As of Hadoop version 2.6, files cannot be updated in place, though appends are possible.

To proceed further and understand the Hadoop architecture, please read my next blog.





What is Big Data ? Is it Only Hadoop ?

Big Data, the new buzzword in today's technology, is gaining importance due to its high rewards. A systematic, focused approach to adopting Big Data lets one derive maximum value from it and harness its power.

It is essentially a new framework, or system, for gaining insight from existing, diverse forms of data, increasing researchers' and analysts' ability to get more out of existing systems.

As BG Univ says, "Big data is about the application of new tools to do MORE analytic on MORE data for More people."

Lifecycle of data can be defined as :


People often confuse Big Data and Hadoop as the same thing. They are not: Big Data is not only Hadoop.

Big Data is not a single tool or technique. It is a platform, or framework, with various components: data warehouses (providing OLAP/historical data), real-time data systems, and Hadoop (providing insight into structured, semi-structured, and unstructured data).

Examples of Big Data include traffic data, flight data, and search-engine data.

Thus Big Data involves huge volume, high velocity, and an extensive variety of data. The data can be of three types:

a) Structured data: relational data.
b) Semi-structured data: XML data.
c) Unstructured data: Word documents, PDFs, text, media logs.
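The three types differ in how much structure a program can rely on when reading them. A small sketch (the field names and sample values here are made up for illustration):

```python
import xml.etree.ElementTree as ET

# Structured: fixed schema, like a row in a relational table
row = {"id": 1, "carrier": "XY", "delay_min": 12}
print(row["delay_min"])  # 12 -- every row is guaranteed to have this column

# Semi-structured: self-describing tags; schema may vary per record
xml_record = ET.fromstring("<flight><carrier>XY</carrier><delay>12</delay></flight>")
print(xml_record.find("delay").text)  # '12' -- structure comes from the tags

# Unstructured: free text; any structure must be inferred at read time
log_line = "2017-01-25 10:32:01 WARN flight XY12 delayed"
print("delayed" in log_line)  # True -- found only by searching the raw text
```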

Big Data can be characterized by the 3 V's:

1) Velocity -> batch and real-time processing
2) Variety -> structured, semi-structured, unstructured, and polymorphic data
3) Volume -> terabytes to petabytes


Big Data strains existing traditional systems for many reasons: as data grows, so do its complexity, security demands, maintenance effort, and processing time. Big Data therefore brings distributed processing into the picture, using multiple machines and disks in parallel.

There are various tools and technologies on the market from different vendors, including IBM and Microsoft, to handle big data. A few of them are:


1) NoSQL Big Data systems are designed to provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored. They allow massive computations to be run inexpensively and efficiently, which makes operational big data workloads easier to manage, cheaper, and faster to implement. MongoDB is one example.
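The operational model of a document store can be illustrated with a toy in-memory "collection": documents in the same collection need not share a schema. This is purely illustrative; a real deployment would use a database such as MongoDB through a driver:

```python
# Toy in-memory document "collection" mimicking a NoSQL store (illustrative only)
collection = []

def insert(doc):
    """Store a document; no schema is enforced."""
    collection.append(doc)

def find(**query):
    """Return all documents whose fields match every key=value in the query."""
    return [d for d in collection if all(d.get(k) == v for k, v in query.items())]

# Two documents with different fields coexist in one collection
insert({"user": "a", "clicks": 10})
insert({"user": "b", "clicks": 3, "referrer": "search"})

print(find(user="b"))  # [{'user': 'b', 'clicks': 3, 'referrer': 'search'}]
```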

2) MPP and MapReduce systems provide analytical capabilities for complex analysis over large volumes of data. Hadoop, Hive, Pig, and Impala are built on them.
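The pattern that MPP databases and MapReduce share is scan-in-parallel, then merge: each node computes a partial result over only its own partition. A minimal sketch using threads as stand-in "nodes" (the partitions and values are made up):

```python
from concurrent.futures import ThreadPoolExecutor

# Three data partitions, as if stored on three different nodes
partitions = [[3, 1, 4], [1, 5, 9], [2, 6, 5]]

def partial_sum(partition):
    # Each node aggregates its own partition independently
    return sum(partition)

# Partial results are computed in parallel, then merged by a coordinator
with ThreadPoolExecutor(max_workers=3) as pool:
    total = sum(pool.map(partial_sum, partitions))

print(total)  # 36
```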

3) Storage (HDFS, i.e. Hadoop Distributed File System)
4) Servers (Google App Engine)

There are major challenges with Big Data. Read my next blog to understand them.