Tuesday, January 24, 2017

What are the differences between Hive, Impala & Pig?


Comparing Impala to Hive & Pig

Similarities:
  • Queries expressed in high-level languages
  • Alternatives to writing MapReduce code
  • Used to analyze data stored on a Hadoop cluster

Differences:

Impala  

It was created based on Google's Dremel paper.
1) It is an interactive, SQL-like query engine that runs on top of the Hadoop Distributed File System (HDFS).
2) It is an open-source, massively parallel processing (MPP) query engine that runs on top of clustered systems like Apache Hadoop.
3) MPP-style parallel databases have a relational model and are well suited to processing structured and semi-structured data. Because of this architecture, Impala avoids the overheads of MapReduce jobs (job setup and creation, slot assignment, split creation, map generation, etc.), which enables low latency.
4) It offers lower latency and processing time for queries, at the cost of reduced scalability and stability.
5) Impala supports high-performance UDFs (user-defined functions) written in C++, and can also reuse some Java-based Hive UDFs.
6) Impala does not return column overflows as NULL, so that customers can distinguish between NULL data and overflow conditions similar to how they do so with traditional database systems.
7) Impala does not store or interpret timestamps using the local timezone, to avoid undesired results from unexpected time zone issues. Timestamps are stored and interpreted relative to UTC.
8) Impala utilizes the Apache Sentry authorization framework for Security, which provides fine-grained role-based access control to protect data against unauthorized access or tampering.
9) It can query data stored in HDFS or HBase tables.
10) Uses a subset of SQL-92 and does not support stored procedures.
11) The Impala TIMESTAMP type can represent dates ranging from 1400-01-01 to 9999-12-31.  
12) With Impala, you can query the following file formats: Parquet, Avro, RCFile, SequenceFile, and unstructured text (see the sketch after this list).
13) Impala shares the metastore with Hive.
14) Impala can return results in milliseconds under low-load conditions, and it is a valid choice when heavy parallel batch SQL execution is not required.
15) Impala is an MPP-style engine, so each query starts an executor on every node of the cluster. This delivers the best performance for a single query, but total throughput degrades heavily under high concurrency; in such systems you should limit the number of parallel queries to a fairly low value (around 10).
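
To make the points above about file formats, the shared metastore, and UTC timestamps concrete, here is a minimal sketch of what this looks like in impala-shell. The table name web_logs and its columns are hypothetical, used only for illustration.

    -- Create a Parquet-backed table; its definition goes into the metastore shared with Hive.
    CREATE TABLE web_logs (
      user_id    BIGINT,
      url        STRING,
      event_time TIMESTAMP    -- stored and interpreted relative to UTC
    )
    STORED AS PARQUET;

    -- Interactive, low-latency aggregation straight against the HDFS data.
    SELECT url, COUNT(*) AS hits
    FROM web_logs
    WHERE event_time >= '2017-01-01 00:00:00'   -- literal in one of Impala's accepted TIMESTAMP formats
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10;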


Although widely used, it still has some cons:

1) Impala can't handle complex data types (ARRAY, MAP, or STRUCT).
2) Impala is not fault tolerant. For example, if a query fails part-way through, you have to start it all over again.
3) Does not support parameters in scripts.
4) Impala does not currently support many HiveQL statements, such as ANALYZE TABLE (the Impala equivalent is COMPUTE STATS), DESCRIBE COLUMN, DESCRIBE DATABASE, EXPORT TABLE, IMPORT TABLE, and more (see the sketch after this list).
5) Impala does not implicitly cast between string and numeric or Boolean types. Always use CAST() for these conversions.
6) Impala does perform implicit casts among the numeric types, when going from a smaller or less precise type to a larger or more precise one. For example, Impala will implicitly convert a SMALLINT to a BIGINT or FLOAT, but to convert from DOUBLE to FLOAT or INT to TINYINT requires a call to CAST() in the query.
7) Impala does perform implicit casts from string to timestamp. Impala has a restricted set of literal formats for the TIMESTAMP data type and the from_unixtime() format string.
8) Impala is not currently supported by YARN.
9) Impala is not the best choice for batch execution or heavily parallel SQL execution.
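
A quick sketch of the statistics and casting points above, as they would look in Impala; the sales table and its columns are hypothetical, for illustration only.

    -- Impala's equivalent of Hive's ANALYZE TABLE:
    COMPUTE STATS sales;

    -- String-to-numeric conversion must be explicit:
    SELECT CAST(amount_str AS DOUBLE) AS amount FROM sales;

    -- Implicit widening (e.g. SMALLINT -> BIGINT) happens automatically,
    -- but narrowing needs an explicit CAST():
    SELECT CAST(total_cents AS INT) AS total FROM sales;   -- BIGINT column narrowed to INT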




Hive 

It is a component of the Hortonworks Data Platform (HDP).
1) Hive provides a SQL-like interface to data stored in Hadoop clusters. 
2) It translates SQL queries into MapReduce/Tez/Spark jobs and executes them on the cluster to implement batch-oriented processing, so it is best suited for ETL and long-running queries.
3) It is used by data analysts for completely structured data.
4) Supports complex data types like ARRAY and STRUCT, custom file formats, the DATE data type, and XML and JSON functions (see the sketch after this list).
5) It is fault tolerant. For example, if you run a query as Hive MapReduce and a DataNode goes down while the query is running, the output is still produced because the MapReduce tasks are rescheduled on other nodes.
6) Supports parameters, which come in handy while writing Hive scripts.
7) It is supported by YARN, so you can manage resources for MapReduce or any other applications that YARN supports.
8) Hive runs on top of the MapReduce/Tez framework, which requests resources based on the amount of data to process. On large clusters this gives much better concurrency for "small" queries, as each one requests only a small amount of execution resources, so more queries can run in parallel.
9) The Hive component included in CDH 5.1 and higher includes Sentry-enabled security with GRANT, REVOKE, and CREATE/DROP ROLE statements. Earlier Hive releases had a privilege system with GRANT and REVOKE statements that was primarily intended to prevent accidental deletion of data, rather than to protect against malicious users.
10) Uses a subset of SQL-92 and does not support stored procedures.
11) Hive TIMESTAMP type can represent dates ranging from 0000-01-01 to 9999-12-31. 
12) Hive supports several file formats: text file, SequenceFile, RCFile, Avro, ORC, Parquet, and custom INPUTFORMAT and OUTPUTFORMAT.
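
As a rough illustration of the complex types, file formats, parameters, and Sentry statements listed above, here is a small HiveQL sketch; all table, column, and role names are hypothetical.

    -- ORC-backed table with complex types and a DATE column.
    CREATE TABLE employees (
      name    STRING,
      skills  ARRAY<STRING>,
      address STRUCT<city:STRING, zip:STRING>,
      hired   DATE
    )
    STORED AS ORC;

    -- JSON helper function on a raw staging table.
    SELECT get_json_object(json_col, '$.city') FROM raw_events;

    -- Parameterized script, invoked e.g. as: hive --hivevar dt=2017-01-24 -f report.hql
    SELECT name, skills[0] AS top_skill, address.city
    FROM employees
    WHERE hired = DATE '${hivevar:dt}';

    -- Sentry-style authorization (CDH 5.1 and higher):
    CREATE ROLE analyst;
    GRANT SELECT ON TABLE employees TO ROLE analyst;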



But the cons are significant as well:

1) Since Hive uses MapReduce to access data in Hadoop clusters, the per-job overheads result in high query latency (see the sketch after this list).
2) Lower performance, especially for table joins.
3) Limited query optimization compared with traditional MPP or RDBMS engines.
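
One way to see where the latency comes from is Hive's EXPLAIN statement, which shows the stages a query compiles into; with the MapReduce engine, each stage is a separate job that pays the startup overhead described above. The orders and customers tables below are hypothetical.

    -- Inspect the plan for a join plus aggregation; on MapReduce this typically
    -- compiles into several chained jobs.
    EXPLAIN
    SELECT c.country, COUNT(*) AS order_cnt
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    GROUP BY c.country;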



 Pig

Pig is a scripting platform with a focus on data flows. It has two parts:
a) A language for processing data, called Pig Latin.

b) A set of evaluation mechanisms for evaluating a Pig Latin program. Current evaluation mechanisms include (a) local evaluation in a single JVM, and (b) evaluation by translation into one or more MapReduce jobs executed using Hadoop.

1) Pig can process data of any format; common formats such as tab-delimited text files are supported via built-in capabilities. A user can add support for a file format by writing a function that parses the bytes of a file into objects in Pig's data model, and vice versa.
2) Pig's data model is similar to the relational data model.
3) In Pig, tables are called bags. Pig also has a "map" data type, which is useful in representing semi-structured data, e.g. JSON or XML.
4) It can combine multiple data sets via operations such as join, union, or cogroup, or split a single data set into multiple ones using an operation called split.
5) It is a procedural data-flow language, mostly used by researchers and programmers.
6) Pig is fault tolerant.
7) Pig supports "maps" of (key, value) pairs, where retrieving the value associated with a given key is an efficient operation. Maps provide a convenient way to represent semi-structured data, where the set of non-null fields varies from record to record. Maps are helpful when processing JSON, XML, and sparse relational data (i.e., tables with a lot of null values).
 
 

