Apache Hive is open-source data warehouse software designed to read, write, and manage large datasets stored in the Apache Hadoop Distributed File System (HDFS), one component of the larger Hadoop ecosystem.
With comprehensive documentation and continuous updates, Apache Hive continues to make data processing easier and more accessible.
The History of Apache Hive
Apache Hive is an open-source project conceived by co-creators Joydeep Sen Sarma and Ashish Thusoo during their time at Facebook. Hive began as a subproject of Apache Hadoop but has since graduated to become a top-level project of its own. As the limitations of Hadoop and MapReduce jobs grew, and as data volumes climbed from tens of GB per day in 2006 to 1 TB per day and then 15 TB per day within a few years, the engineers at Facebook could no longer run their complex jobs with ease, which gave way to the creation of Hive.
Apache Hive was created to achieve two goals. First, it provided an SQL-based declarative language that also allowed engineers to plug in their own scripts and programs when SQL did not suffice, enabling most of the engineering world (already skilled in SQL) to use Hive with minimal disruption or retraining compared to other tools.
Second, it provided a centralized, Hadoop-based metadata store of all the datasets in the organization. While originally developed within the walls of Facebook, Apache Hive is used and developed by other companies such as Netflix. Amazon maintains a software fork of Apache Hive included in Amazon Elastic MapReduce on Amazon Web Services.
How does Apache Hive software work?
HiveServer2 accepts incoming requests from users and applications, creates an execution plan, and automatically generates a YARN job to process SQL queries. The YARN job may be generated as a MapReduce, Tez, or Spark workload.
This job then runs as a distributed application in Hadoop. Once the SQL query has been processed, the results are either returned to the end user or application, or written back to HDFS.
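As an illustration (the table and column names here are hypothetical), a query like the following submitted through HiveServer2 would be compiled into an execution plan and run on the cluster:

```sql
-- Hypothetical example: HiveServer2 plans this query and submits it to YARN
-- as a MapReduce, Tez, or Spark job, depending on configuration.
SELECT page, COUNT(*) AS views
FROM page_visits                     -- assumed table backed by files in HDFS
WHERE visit_date = '2023-01-01'
GROUP BY page;
```

The user simply writes SQL; the translation into a distributed Hadoop job happens behind the scenes.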
The Hive Metastore uses a relational database such as PostgreSQL or MySQL to persist this metadata, and HiveServer2 retrieves table structure from it as part of its query planning. In some cases, applications may also interrogate the metastore as part of their own processing.
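For example, metadata commands like the following are answered from the metastore's relational backing database rather than by scanning HDFS itself (the table name is hypothetical):

```sql
SHOW TABLES;                      -- lists tables registered in the metastore
DESCRIBE FORMATTED page_visits;   -- columns, storage format, HDFS location, owner, etc.
```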
Hive workloads are then executed in YARN, the Hadoop resource manager, which provides a processing environment capable of running Hadoop jobs. This environment consists of distributed memory and CPU drawn from the various worker nodes in the Hadoop cluster.
YARN attempts to use HDFS metadata to ensure processing is deployed where the needed data resides, while Hive auto-generates the code for SQL queries as MapReduce, Tez, or Spark jobs.
Although Hive originally leveraged only MapReduce, most Cloudera Hadoop deployments have Hive configured to use MapReduce, or occasionally Spark, while Hortonworks (HDP) deployments typically use Tez as the execution engine.
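The execution engine is controlled by a Hive configuration property, so a deployment (or even an individual session) can switch between engines, assuming the chosen engine is installed on the cluster:

```sql
-- Valid values are mr (MapReduce), tez, or spark; the engine must be
-- installed and configured on the cluster for the setting to take effect.
SET hive.execution.engine=tez;
```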
Apache Hive vs. Apache Spark
An analytics framework designed to process high volumes of data across a variety of datasets, Apache Spark provides a powerful user interface capable of supporting a number of languages, from R to Python.
Hive provides an abstraction layer that represents the data as tables with rows, columns, and data types, which can be queried and analyzed using an SQL interface called HiveQL. Apache Hive supports ACID transactions with Hive LLAP. Transactions guarantee consistent views of the data in an environment where multiple users and processes access the data at the same time for Create, Read, Update, and Delete (CRUD) operations.
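As a sketch of what this looks like in HiveQL (the table name and columns are hypothetical), an ACID table must be stored as ORC and flagged as transactional, after which UPDATE and DELETE work alongside the usual INSERT:

```sql
-- Transactional (ACID) tables in Hive must be stored as ORC.
CREATE TABLE customers (
  id   INT,
  name STRING,
  city STRING
)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

INSERT INTO customers VALUES (1, 'Ada', 'London');
UPDATE customers SET city = 'Paris' WHERE id = 1;   -- allowed only on ACID tables
DELETE FROM customers WHERE id = 1;
```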
Databricks offers Delta Lake, which is similar to Hive LLAP in that it provides ACID transactional guarantees, but it offers several other benefits that help with performance and reliability when accessing the data. Spark SQL is Apache Spark's module for interacting with structured data represented as tables with rows, columns, and data types.
Spark SQL is SQL:2003 compliant and uses Apache Spark as the distributed engine to process the data. In addition to the Spark SQL interface, a DataFrames API can be used to interact with the data from Java, Scala, Python, and R. Spark SQL is similar to HiveQL.
Both use ANSI SQL syntax, and the majority of Hive functions will run on Databricks. This includes Hive functions for date/time conversion and parsing, collections, string manipulation, mathematical operations, and conditional functions.
There are some functions specific to Hive that would need to be converted to their Spark SQL equivalents, or that don't exist in Spark SQL on Databricks. However, you can expect all ANSI SQL syntax in HiveQL to work with Spark SQL on Databricks.
This includes ANSI SQL aggregate and analytical functions. Hive is optimized for the Optimized Row Columnar (ORC) file format and also supports Parquet. Databricks is optimized for Parquet and Delta but also supports ORC. We generally recommend using Delta, which uses open-source Parquet as its file format.
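To make the difference concrete (table names are hypothetical), the same table might be declared as ORC in Hive and as Delta on Databricks:

```sql
-- Hive: ORC is the format Hive is optimized for.
CREATE TABLE events_hive (id INT, payload STRING)
STORED AS ORC;

-- Databricks: Delta (Parquet data files plus a transaction log)
-- is the recommended format.
CREATE TABLE events_dbx (id INT, payload STRING)
USING DELTA;
```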
Tags: apache hadoop, apache hive, hadoop hdfs, hadoop hive