Top 40+ Most Asked Apache Hive Interview Questions and AnswersFollowing is a list of most frequently asked Apache Hive interview questions and answers. 1) What is Apache Hive? / Explain Apache Hive in short.Apache Hive is a Data warehousing tool developed over the Hadoop Distributed File System (HDFS). It runs SQL like queries called HQL (Hive Query Language), which gets internally converted to map reduce jobs. Hive is developed by Facebook and used for querying and analysis of data that is stored in HDFS. Hive is an open-source tool or software that facilitates programmers to analyze large data sets on Hadoop. It also supports Data Definition Language (DDL), Data Manipulation Language (DML) and user-defined functions. Note: Although Hive is not a database, it gives you logical abstraction over the databases and the tables.2) When can we use Hive?We can use Hive in the following conditions:
3) Name some applications that Hive supports.Hive supports all those client applications written in Java, PHP, Python, C, and Ruby programming languages. 4) Can you rename a table in Hive?Yes, we can rename a table in Hive by using the following command: 5) What are the different types of tables available in Hive?There are two types of tables available in Hive:
6) What is the difference between external and managed tables in Hive?There are two types of tables available in Hive, external tables and managed tables. Here, external tables are used to give data control to Hive but not control of a schema. On the other hand, the managed tables give both schema and data control to Hive. 7) What are the different modes of Hive?According to the size of data nodes in Hadoop, Hive can be operated in the following two modes:
8) When should we use Map reduce mode in Hive?In Hive, the Map reduce mode is used in the following conditions:
9) Can we use Hive for OLTP systems? / Is Hive suitable to be used for OLTP systems? Why?No. Because Hive does not provide insert and update at the row level, it is not suitable for the OLTP system. 10) What are the most important components of Hive Architecture?The most important components of Hive Architecture are:
11) Where does the data of a Hive table get stored?By default, the data of a Hive table is stored in an HDFS directory - /user/hive/warehouse. We can adjust it by setting the desired directory in the configuration parameter hive.metastore.warehouse.dir in hive-site.xml. 12) What are the three main parts that Hive is composed of?Hive contains the following three main parts:
13) Is it possible to change the default location of a managed table in Hive?Yes, it is possible to change the default location of a managed table in Hive by using the LOCATION '<hdfs_path>' clause. 14) What do you understand by a Hive Metastore?A Hive Metastore is a relational database used to store the Metadata of Hive partitions, tables, databases, etc. 15) What is the difference between local and remote Metastores in Hive?Local Metastores: Local metastores run on the same Java Virtual Machine (JVM) as the Hive service. Remote Metastore: The Remote metastores run on a separate, distinct JVM as the Hive service. 16) What types of databases does Hive support?Hive supports two types of databases:
17) Is it possible that multiple users use one Metastore?No, Hive doesn't support metastore sharing so, multiple users can not use one Metastore. 18) Why does Hive not store metadata information in HDFS?Hive does not store metadata information in HDFS. Instead, it uses RDBMS. Hive stores metadata information in the metastore, and to achieve low latency, it uses RDBMS. Because HDFS read/write operations are time-consuming processes. 19) What are the three different modes in which we can operate Hive?The three modes we can operate Hive are local mode, distributed mode, and pseudo-distributed mode. 20) What is a partition in Hive?In Hive, a partition is used to group similar data types together based on column or partition key. Hive organizes tables into partitions. In other words, we can say that partition is used to create a sub-directory in the table directory. Each table can have one or more partition keys to identify a particular partition. 21) Why is partitioning used in Hive?Partitioning is used in Hive to reduce the query latency. Instead of scanning the entire tables, it scans only the relevant partitions and corresponding datasets. 22) What do you understand by dynamic partitioning, and when is it used?A partitioning is called dynamic partitioning while loading the data into the Hive table. In other words, we can say that dynamic partitioning values for partition columns in the runtime. Dynamic partitioning is used in the following cases:
23) What are the three Hive collection data types?The three main Hive collection data types are:
24) When should we use SORT BY instead of ORDER BY?We should use SORT BY instead of ORDER BY when we have to sort huge datasets. The reason is that the SORT BY clause sorts the data using multiple reducers, while the ORDER BY sorts all of the data together using a single reducer. Hence, if you use the ORDER BY clause, it will take a lot of time to execute many inputs. So, in this case, SORT BY is preferred over ORDER BY. 25) Which data type in Hive is used to store data information?The TIMESTAMP data type in Hive is used to store all data information in the java.sql.timestamp format. 26) What is a Hive variable? Why is it used?A Hive variable is a variable created in the Hive environment that Hive scripts can reference. It is used to pass some values to the Hive queries when we start executing queries. 27) Is it possible to run a Unix shell command from Hive? Give an example to demonstrate.Yes, we can run a Unix shell command from Hive by using the ! mark just before the command. For example, !pwd at hive prompt can be used to list the current directory. 28) Is it possible to execute Hive queries from a script file?Yes, we can execute Hive queries from a script file with the help of a source command. For example - Hive> source /path/queryfile.hql 29) Is it possible to delete the DBPROPERTY in Hive? / How can you delete the DBPROPERTY in Hive?It is not possible to delete the DBPROPERTY in Hive because there is no proper way to delete the DBPROPERTY. 30) What is a .HIVERC file?The .HIVERC is a file that contains a list of commands that need to be run when the Command Line Input (CLI) is initiated. 31) What do you understand by schema on read?The schema is validated with the data while reading the data and not enforced while writing the data, and that's why it is called schema on read. 32) How can you check if a specific partition exists in Hive?We should use the following command to check if a specific partition exists in Hive: 33) What do you understand by bucketing in Hive? Why do we need a bucket?In Hive, bucketing is the concept of breaking data down into ranges, which are known as buckets. Bucketing is mainly a data organizing technique. It is similar to partitioning in Hive with an added functionality that it divides large datasets into more manageable parts known as buckets. The partitioning into buckets can give extra structure to the data to use for more efficient queries. The range for a bucket is determined by the hash value of one or more columns in the dataset. There are two main reasons for performing bucketing to a partition:
34) How can you list all databases that began with the letter 'C'?We can list all databases that began with the letter 'C' by using the following command: 35) How Hive distributes the rows into buckets?Hive distributes the rows into buckets by using the following formula: The hash_function depends on the column data type. Although, hash_function for integer data type will be: 36) What do you understand by indexing, and why do we need it?Indexing in Hive is a Hive query optimization technique, and it is mainly used to speed up the access of a column or set of columns in a Hive database. With the use of the index, the Hive database system does not need to read all rows in the table, especially that one has selected. That's why we use indexing. 37) Which Java class is used to handle the input record encoding into files that store Hive tables?The following Java class is used to handle the input record encoding into files that store Hive tables: 38) Which Java class is used to handle the output record encoding into Hive query files?The following Java class is used to handle the output record encoding into Hive query files: 39) What is the use of Hcatalog in Hive?In Hive, Hcatalog is used to share data structures with external systems. It provides access to Hive metastore to the users of other tools on Hadoop so that they can easily read and write data to Hive's data warehouse. 40) What are the key differences between Hive and HBase?Both Hive and HBase are incredible Apache tools, and both are used for Big Data, but there are some differences between them. A list of key differences between Hive and HBase:
41) What do you understand by a Hive variable? What is its usage?Hive variables are created in the Hive environment that can be referenced by Hive scripts. These variables are used for passing some values to the hive queries when the query starts executing. 42) What do you understand by ObjectInspector functionality in Hive?In Hive, the ObjectInspector functionality is used to analyze the structure of individual columns and the internal structure of the row objects. It facilitates us to get access to complex objects which can be stored in multiple formats in Hive. 43) What is UDF in Hive?In Hive, UDF is a user-designed function created with a Java program to address a specific function not part of the existing Hive functions. 44) What are the different types of joins in Hive? Explain with example.There are mainly 4 different types of joins in Hive:
Example: To understand it well, let's consider two tables named "CUSTOMERS" and "ORDERS" respectively. Table 1: CUSTOMERS
Table 1: ORDERS
Now, see the different join operations: JOINThe Hive JOIN clause is used to combine and retrieve the records from multiple tables. It is very similar to Outer Join in SQL. In Hive, a JOIN condition is to be raised using the tables' primary keys and foreign keys. Use the following query to demonstrate JOIN on the CUSTOMERS and ORDERS tables: After the successful execution of the query, you get to the following result:
LEFT OUTER JOINThe LEFT OUTER JOIN in Hive returns all the rows from the left table, even if there are no matches in the right table. This means, if the ON clause matches zero records in the right table, the JOIN still returns a row in the result with a NULL in each column from the right table. In other words, we can say that a LEFT OUTER JOIN returns all the values from the left table, plus the matched values from the right table and NULL in the case of no matching JOIN predicate. Use the following query to demonstrate LEFT OUTER JOIN on the CUSTOMERS and ORDERS tables: After the successful execution of the query, you get the following result:
RIGHT OUTER JOINThe RIGHT OUTER JOIN in Hive returns all the rows from the right table, even if there are no matches in the left table. It is the simple opposite of LEFT OUTER JOIN. If the ON clause matches zero records in the left table, the RIGHT OUTER JOIN still returns a row with NULL in each column from the left table. In other words, we can say that a RIGHT OUTER JOIN returns all the values from the right table, plus the matched values from the left table and NULL in case of no matching join predicate. Use the following query to demonstrate RIGHT OUTER JOIN on the CUSTOMERS and ORDERS tables: After the successful execution of the query, you get the following result:
FULL OUTER JOINThe FULL OUTER JOIN in Hive combines the records of both the left and the right outer tables that satisfy the JOIN condition. The result table contains all the records from both the tables or fills in NULL values for missing matches on either side. Use the following query to demonstrate FULL OUTER JOIN on the CUSTOMERS and ORDERS tables: After the successful execution of the query, you get the following result:
|