Top 40+ Most Asked Apache Hive Interview Questions and Answers

Following is a list of the most frequently asked Apache Hive interview questions and answers.

1) What is Apache Hive? / Explain Apache Hive in short.

Apache Hive is a data warehousing tool built on top of the Hadoop Distributed File System (HDFS). It runs SQL-like queries called HQL (Hive Query Language), which are internally converted into MapReduce jobs. Hive was developed by Facebook and is used for querying and analyzing data stored in HDFS. It is an open-source tool that facilitates programmers in analyzing large data sets on Hadoop. It also supports Data Definition Language (DDL), Data Manipulation Language (DML), and user-defined functions.

Note: Although Hive is not a database, it gives you a logical abstraction over databases and tables.


2) When can we use Hive?

We can use Hive in the following conditions:

  • When we have to make data warehouse applications.
  • When we have to deal with static data instead of dynamic data.
  • When we have to maintain a large data set.
  • When we use queries instead of scripting.
  • When the application can tolerate high latency (high response time).

3) Name some applications that Hive supports.

Hive supports client applications written in Java, PHP, Python, C++, and Ruby.


4) Can you rename a table in Hive?

Yes, we can rename a table in Hive by using the following command:
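
ALTER TABLE old_table_name RENAME TO new_table_name;

Here, old_table_name and new_table_name are placeholders for the current and the new table name.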


5) What are the different types of tables available in Hive?

There are two types of tables available in Hive:

  • Managed tables
  • External tables

6) What is the difference between external and managed tables in Hive?

There are two types of tables available in Hive: external tables and managed tables. An external table gives Hive access to the data, but Hive manages only the table's schema (metadata), not the data itself; dropping an external table removes the metadata but leaves the underlying data in place. A managed table gives Hive control of both the schema and the data, so dropping a managed table also deletes the data.
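
A minimal sketch (the table name, columns, and HDFS path are hypothetical): an external table is created with the EXTERNAL keyword and typically points at data that already exists in HDFS.

CREATE EXTERNAL TABLE web_logs (
  ip   STRING,
  url  STRING,
  hits INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/web_logs';

-- Dropping the table removes only Hive's metadata; the files under /data/web_logs remain.
DROP TABLE web_logs;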


7) What are the different modes of Hive?

Depending on the size of the data and the number of data nodes in Hadoop, Hive can be operated in the following two modes:

  • Local mode
  • MapReduce mode

8) When should we use MapReduce mode in Hive?

In Hive, the MapReduce mode is used in the following conditions:

  • When the query has to be executed on a large amount of data and run in a parallel way.
  • When Hadoop has multiple data nodes and the data is distributed across the different nodes.
  • When large data sets have to be processed and better performance is needed.

9) Can we use Hive for OLTP systems? / Is Hive suitable to be used for OLTP systems? Why?

No. Because Hive does not provide insert and update operations at the row level, it is not suitable for OLTP systems.


10) What are the most important components of Hive Architecture?

The most important components of Hive Architecture are:

  • User Interface
  • Compiler
  • Metastore
  • Driver
  • Execution Engine

11) Where does the data of a Hive table get stored?

By default, the data of a Hive table is stored in an HDFS directory - /user/hive/warehouse. We can adjust it by setting the desired directory in the configuration parameter hive.metastore.warehouse.dir in hive-site.xml.


12) What are the three main parts that Hive is composed of?

Hive contains the following three main parts:

  • Hive Clients
  • Hive Services
  • Hive Storage and Computing

13) Is it possible to change the default location of a managed table in Hive?

Yes, it is possible to change the default location of a managed table in Hive by using the LOCATION '<hdfs_path>' clause.
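
For example (a minimal sketch; the table name and HDFS path are hypothetical):

CREATE TABLE employees (
  id INT,
  name STRING
)
LOCATION '/user/analytics/hive/employees';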


14) What do you understand by a Hive Metastore?

A Hive Metastore is a relational database used to store the Metadata of Hive partitions, tables, databases, etc.


15) What is the difference between local and remote Metastores in Hive?

Local Metastore: The local metastore service runs in the same Java Virtual Machine (JVM) as the Hive service and connects to a metastore database running in a separate process.

Remote Metastore: The remote metastore service runs in its own JVM, separate from the Hive service's JVM, and Hive connects to it using the Thrift protocol.


16) What types of databases does Hive support?

Hive supports two types of databases:

  • Derby database: Hive uses the embedded Derby database by default; it is meant for single-user metadata storage (for example, during testing).
  • MySQL database: Hive uses a MySQL database when the metadata has to be shared by multiple users.

17) Is it possible that multiple users use one Metastore?

Yes, multiple users can use one metastore when Hive is configured with a shared (remote) metastore backed by a database such as MySQL. Only the default embedded Derby metastore is limited to a single user session at a time.


18) Why does Hive not store metadata information in HDFS?

Hive does not store metadata information in HDFS; instead, it stores it in an RDBMS, the metastore. HDFS read/write operations are time-consuming, so Hive uses an RDBMS to achieve the low latency required for metadata operations.


19) What are the three different modes in which we can operate Hive?

The three modes in which we can operate Hive are local mode, pseudo-distributed mode, and fully distributed mode (matching the modes in which the underlying Hadoop cluster can run).


20) What is a partition in Hive?

In Hive, a partition is used to group similar data together based on a column or partition key. Hive organizes tables into partitions; in other words, a partition corresponds to a sub-directory inside the table directory. Each table can have one or more partition keys to identify a particular partition.


21) Why is partitioning used in Hive?

Partitioning is used in Hive to reduce query latency. Instead of scanning the entire table, Hive scans only the relevant partitions and the corresponding data.


22) What do you understand by dynamic partitioning, and when is it used?

In dynamic partitioning, the values of the partition columns are determined at runtime, while the data is being loaded into the Hive table, rather than being specified beforehand (a minimal example is sketched after the list below).

Dynamic partitioning is used in the following cases:

  • When loading data from an existing non-partitioned table into a partitioned table; this improves sampling and thus decreases query latency.
  • When we do not know all the values of the partitions beforehand, since finding these partition values manually in a huge dataset is a tedious task.
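
A minimal sketch of a dynamic-partition load (the table and column names are hypothetical; the two SET properties are standard Hive configuration parameters):

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE sales_partitioned PARTITION (country)
SELECT id, amount, country
FROM sales_raw;

-- Hive creates one partition per distinct value of country found at runtime.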

23) What are the three Hive collection data types?

The three main Hive collection data types are:

  • ARRAY
  • MAP
  • STRUCT

24) When should we use SORT BY instead of ORDER BY?

We should use SORT BY instead of ORDER BY when we have to sort huge datasets. The reason is that the SORT BY clause sorts the data using multiple reducers, while the ORDER BY sorts all of the data together using a single reducer.

Hence, if you use the ORDER BY clause on a very large input, it takes a long time to execute. So, in this case, SORT BY is preferred over ORDER BY.
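
For illustration (transactions and amount are hypothetical table and column names):

-- Total ordering, executed by a single reducer:
SELECT * FROM transactions ORDER BY amount DESC;

-- Per-reducer ordering, executed by multiple reducers; faster on huge data:
SELECT * FROM transactions SORT BY amount DESC;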


25) Which data type in Hive is used to store date information?

The TIMESTAMP data type in Hive is used to store date and time information in the java.sql.Timestamp format.


26) What is a Hive variable? Why is it used?

A Hive variable is a variable created in the Hive environment that Hive scripts can reference. It is used to pass some values to the Hive queries when we start executing queries.
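
For example (a minimal sketch; min_salary and the employees table are hypothetical), a variable can be set in the hivevar namespace and referenced inside a query:

SET hivevar:min_salary=3000;
SELECT * FROM employees WHERE salary >= ${hivevar:min_salary};

The same variable can also be passed from the command line, e.g. hive --hivevar min_salary=3000 -f query.hql.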


27) Is it possible to run a Unix shell command from Hive? Give an example to demonstrate.

Yes, we can run a Unix shell command from Hive by using the ! mark just before the command.

For example, !pwd; at the Hive prompt can be used to print the current working directory.


28) Is it possible to execute Hive queries from a script file?

Yes, we can execute Hive queries from a script file with the help of the source command. For example: hive> source /path/queryfile.hql;


29) Is it possible to delete the DBPROPERTY in Hive? / How can you delete the DBPROPERTY in Hive?

No, it is not possible to delete a DBPROPERTY in Hive; there is no command to remove a property once it has been set (its value can only be overwritten).


30) What is a .HIVERC file?

The .HIVERC is a file that contains a list of commands that need to be run when the Command Line Interface (CLI) is launched.


31) What do you understand by schema on read?

The schema is validated against the data while reading the data, and it is not enforced while writing the data; that is why it is called schema on read.


32) How can you check if a specific partition exists in Hive?

We should use the following command to check if a specific partition exists in Hive:
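
SHOW PARTITIONS table_name PARTITION (partitioned_column = 'partition_value');

Here, table_name, partitioned_column, and partition_value are placeholders; the command lists the partition only if it exists.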


33) What do you understand by bucketing in Hive? Why do we need a bucket?

In Hive, bucketing is the concept of breaking data down into ranges known as buckets. Bucketing is mainly a data-organizing technique. It is similar to partitioning, with the added functionality that it divides large datasets into more manageable parts known as buckets. Dividing data into buckets gives extra structure to the data that can be used for more efficient queries. The bucket a row belongs to is determined by the hash value of one or more columns in the dataset (see the sketch after the list below).

There are two main reasons for performing bucketing to a partition:

  • We perform bucketing on a partition because a map-side join requires the data belonging to a unique join key to be present in the same bucket.
  • Bucketing facilitates us to decrease the query time, and it also makes the sampling process more efficient.
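
A minimal sketch of a bucketed (and partitioned) table, with hypothetical table and column names:

CREATE TABLE orders_bucketed (
  order_id INT,
  amount   DOUBLE
)
PARTITIONED BY (order_date STRING)
CLUSTERED BY (order_id) INTO 32 BUCKETS;

-- On older Hive versions, bucketed inserts also require:
-- SET hive.enforce.bucketing = true;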

34) How can you list all databases that begin with the letter 'C'?

We can list all databases that begin with the letter 'C' by using the following command:
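
SHOW DATABASES LIKE 'c*';

Hive stores database names in lowercase, so the lowercase pattern 'c*' matches all databases that begin with the letter 'C'.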


35) How does Hive distribute the rows into buckets?

Hive distributes the rows into buckets by using the following formula:

hash_function(bucketing_column) mod num_buckets

The hash_function depends on the column data type. For an integer column, the hash_function is simply the value of the column itself:

hash_function(int_type_column) = value of int_type_column


36) What do you understand by indexing, and why do we need it?

Indexing in Hive is a query optimization technique that is mainly used to speed up access to a column or a set of columns in a Hive database. With an index, the Hive database system does not need to read all the rows of a table to retrieve the data that has been selected. That's why we use indexing.
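
A sketch, applicable only to Hive versions that still support indexes (the feature was removed in Hive 3.0); the index, table, and column names are hypothetical:

CREATE INDEX idx_customer_id
ON TABLE orders (customer_id)
AS 'COMPACT'
WITH DEFERRED REBUILD;

-- Build (or rebuild) the index data:
ALTER INDEX idx_customer_id ON orders REBUILD;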


37) Which Java class is used to handle the input record encoding into files that store Hive tables?

The following Java class is used to handle the input record encoding into files that store Hive tables:
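
By default, for text-based Hive tables, this is the org.apache.hadoop.mapred.TextInputFormat class.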


38) Which Java class is used to handle the output record encoding into Hive query files?

The following Java class is used to handle the output record encoding into Hive query files:
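
By default, for text-based Hive tables, this is the org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat class.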


39) What is the use of HCatalog in Hive?

In Hive, HCatalog is used to share data structures with external systems. It gives users of other Hadoop tools (such as Pig and MapReduce) access to the Hive metastore so that they can easily read and write data in Hive's data warehouse.


40) What are the key differences between Hive and HBase?

Both Hive and HBase are Apache tools used for Big Data, but there are some differences between them. A list of key differences between Hive and HBase is given below:

Hive | HBase
Hive is a query engine. | HBase is a data storage system, mainly for unstructured data.
Hive allows most of the SQL queries. | HBase does not allow SQL queries.
Hive is mainly used for batch processing. | HBase is mainly used for transactional processing.
Hive does not process data in real time. | HBase processes data in real time.
Hive is used for analytical queries. | HBase is used for real-time querying.
Hive runs on top of MapReduce. | HBase runs on top of HDFS (Hadoop Distributed File System).
Hive is not a full database; it is a data warehouse framework. | HBase is a NoSQL database.
Hive provides SQL features for Spark/Hadoop data. | HBase is used to store and process Hadoop data in real time.
Hive has a schema model. | HBase is schema-free.
Hive is made for high-latency operations. | HBase is made for low-latency operations.
Hive is not suited for real-time querying. | HBase is used for real-time querying of Big Data.

41) What do you understand by a Hive variable? What is its usage?

Hive variables are variables created in the Hive environment that can be referenced by Hive scripts. They are used to pass values to Hive queries when the queries start executing.


42) What do you understand by ObjectInspector functionality in Hive?

In Hive, the ObjectInspector functionality is used to analyze the structure of individual columns and the internal structure of row objects. It gives us access to complex objects that can be stored in multiple formats in Hive.


43) What is UDF in Hive?

In Hive, a UDF (user-defined function) is a function written by the user, typically in Java, to provide functionality that is not covered by the existing built-in Hive functions.
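
Once the UDF has been compiled and packaged into a JAR, it can be registered and used from HiveQL roughly as follows (the JAR path, class name, function name, and table are hypothetical):

ADD JAR /path/to/my-udfs.jar;
CREATE TEMPORARY FUNCTION to_upper_case AS 'com.example.hive.udf.ToUpperCase';

SELECT to_upper_case(name) FROM customers;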


44) What are the different types of joins in Hive? Explain with example.

There are mainly 4 different types of joins in Hive:

  • JOIN
  • LEFT OUTER JOIN
  • RIGHT OUTER JOIN
  • FULL OUTER JOIN

Example:

To understand it well, let's consider two tables named "CUSTOMERS" and "ORDERS" respectively.

Table 1: CUSTOMERS

ID | NAME   | AGE | ADDRESS  | SALARY
1  | Alex   | 21  | New York | 2000.00
2  | Aryan  | 22  | Delhi    | 3000.00
3  | Neetu  | 23  | Lucknow  | 3500.00
4  | Raj    | 24  | Kanpur   | 2600.00
5  | Priya  | 25  | Ludhiana | 3200.00
6  | Robert | 26  | London   | 4000.00
7  | Julia  | 27  | Paris    | 2700.00

Table 2: ORDERS

OID | DATE                | CUSTOMER_ID | AMOUNT
102 | 2009-10-08 00:00:00 | 3           | 3000
100 | 2009-10-08 00:00:00 | 3           | 1500
101 | 2009-11-20 00:00:00 | 2           | 1560
103 | 2008-05-20 00:00:00 | 4           | 2060

Now, see the different join operations:

JOIN

The Hive JOIN clause is used to combine and retrieve records from multiple tables. It works like an inner join in SQL. In Hive, the JOIN condition is usually expressed using the tables' primary and foreign keys.

Use the following query to demonstrate JOIN on the CUSTOMERS and ORDERS tables:
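
SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);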

After the successful execution of the query, you get the following result:

ID | NAME  | AGE | AMOUNT
3  | Neetu | 23  | 3000
3  | Neetu | 23  | 1500
2  | Aryan | 22  | 1560
4  | Raj   | 24  | 2060

LEFT OUTER JOIN

The LEFT OUTER JOIN in Hive returns all the rows from the left table, even if there are no matches in the right table. This means, if the ON clause matches zero records in the right table, the JOIN still returns a row in the result with a NULL in each column from the right table.

In other words, we can say that a LEFT OUTER JOIN returns all the values from the left table, plus the matched values from the right table and NULL in the case of no matching JOIN predicate.

Use the following query to demonstrate LEFT OUTER JOIN on the CUSTOMERS and ORDERS tables:
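
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
LEFT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);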

After the successful execution of the query, you get the following result:

ID | NAME   | AMOUNT | DATE
1  | Alex   | NULL   | NULL
2  | Aryan  | 1560   | 2009-11-20 00:00:00
3  | Neetu  | 3000   | 2009-10-08 00:00:00
3  | Neetu  | 1500   | 2009-10-08 00:00:00
4  | Raj    | 2060   | 2008-05-20 00:00:00
5  | Priya  | NULL   | NULL
6  | Robert | NULL   | NULL
7  | Julia  | NULL   | NULL

RIGHT OUTER JOIN

The RIGHT OUTER JOIN in Hive returns all the rows from the right table, even if there are no matches in the left table. It is the simple opposite of LEFT OUTER JOIN. If the ON clause matches zero records in the left table, the RIGHT OUTER JOIN still returns a row with NULL in each column from the left table.

In other words, we can say that a RIGHT OUTER JOIN returns all the values from the right table, plus the matched values from the left table and NULL in case of no matching join predicate.

Use the following query to demonstrate RIGHT OUTER JOIN on the CUSTOMERS and ORDERS tables:
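
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
RIGHT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);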

After the successful execution of the query, you get the following result:

ID | NAME  | AMOUNT | DATE
3  | Neetu | 3000   | 2009-10-08 00:00:00
3  | Neetu | 1500   | 2009-10-08 00:00:00
2  | Aryan | 1560   | 2009-11-20 00:00:00
4  | Raj   | 2060   | 2008-05-20 00:00:00

FULL OUTER JOIN

The FULL OUTER JOIN in Hive combines the records of both the left and the right outer tables that satisfy the JOIN condition. The result table contains all the records from both the tables or fills in NULL values for missing matches on either side.

Use the following query to demonstrate FULL OUTER JOIN on the CUSTOMERS and ORDERS tables:
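
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
FULL OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);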

After the successful execution of the query, you get the following result:

ID | NAME   | AMOUNT | DATE
1  | Alex   | NULL   | NULL
2  | Aryan  | 1560   | 2009-11-20 00:00:00
3  | Neetu  | 3000   | 2009-10-08 00:00:00
3  | Neetu  | 1500   | 2009-10-08 00:00:00
4  | Raj    | 2060   | 2008-05-20 00:00:00
5  | Priya  | NULL   | NULL
6  | Robert | NULL   | NULL
7  | Julia  | NULL   | NULL



