Top 30+ Most Asked Azure Data Factory Interview Questions and Answers
1) What is Azure Data Factory? / What do you understand by Azure Data Factory?
Azure Data Factory (ADF) is a cloud-based, fully managed data-integration ETL service that automates the movement and transformation of data. It lets users create and schedule data-driven workflows in the cloud, called pipelines, that can retrieve data from disparate data stores and then process and transform that data using compute services such as HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning.
It is called Data Factory because it works like a factory that runs equipment to transform raw materials into finished goods. In the same way, Azure Data Factory uses services to collect raw data and transform it into ready-to-use information.
Using Azure Data Factory, we can create data-driven workflows that move data between on-premises and cloud data stores, and we can process and transform that data with Data Flows. We can execute our data processing either on an Azure-based cloud service or in our own self-hosted compute environment, such as SSIS, SQL Server, or Oracle.
2) What is the requirement of Azure Data Factory? What is the purpose of the ADF Service?
The Azure Data Factory (ADF) service is mainly used to automate retrieving and copying data between relational and non-relational data sources, whether they are hosted in the cloud or in a local datacenter.
We can understand it well by the following scenario:
Nowadays, we have to deal with huge amounts of data, and this data comes from many different sources. Suppose we have to move this data to the cloud; there are certain things we must take care of.
The data from different sources can arrive in different formats, and each source transfers or channels its data in a different way. When we bring this data to the cloud or to a particular store, we must ensure that it is well managed: we have to transform it and remove the unnecessary parts.
We could do this task in a traditional data warehouse, but there are certain disadvantages. We often have to use custom applications to handle each of these steps individually, which is time-consuming, and integrating all of these sources is a tedious job. We need a way to automate this process and create proper workflows, and that is exactly why Azure Data Factory is required. It organizes the complete process in a more manageable way and serves as the data-ingestion (ETL/ELT) tool in most Big Data solutions.
3) What are the different components used in Azure Data Factory?
Azure Data Factory consists of several components. The most important ones are as follows:
- Pipelines: logical groupings of activities that together perform a unit of work.
- Activities: the individual processing steps within a pipeline, such as data movement, transformation, or control steps.
- Datasets: named references to the data that activities use as inputs and outputs.
- Linked services: connection definitions for external resources such as data stores and compute services.
- Data Flows: visually designed data transformation logic executed on managed Spark clusters.
- Integration Runtimes: the compute infrastructure used to run activities across different network environments.
- Triggers: units of processing that determine when a pipeline execution starts.
4) What do you understand by the integration runtime?
The integration runtime is the compute infrastructure used by Azure Data Factory to provide some data integration capabilities across various network environments.
Following are the three types of integration runtimes:
- Azure Integration Runtime: fully managed compute in Azure, used to copy data between cloud data stores and to dispatch transformation activities to cloud compute services.
- Self-hosted Integration Runtime: installed on an on-premises machine or on a VM inside a private network, used to reach data stores behind a firewall.
- Azure-SSIS Integration Runtime: a fully managed cluster of Azure VMs dedicated to running SSIS packages.
5) What is the limit on the number of integration runtimes in Azure Data Factory?
There is no limit on the number of integration runtime instances for Azure Data Factory. But there is a limit on the number of VM cores that the integration runtime can use per subscription for SSIS package execution.
6) What are the key differences between Azure Data Lake and Azure Data Warehouse?
Azure Data Lake and Azure Data Warehouse are both widely used to store big data, but they are not synonymous, and we can't use them interchangeably. Azure Data Lake is a huge pool of raw data. Azure Data Warehouse, on the other hand, is a repository for structured and filtered data that has already been processed for a specific purpose.
Following is a list of key differences between Azure Data Lake and Azure Data Warehouse:
- A Data Lake stores raw data in its native format; a Data Warehouse stores processed, structured data.
- A Data Lake follows schema-on-read (the schema is applied when the data is used); a Data Warehouse follows schema-on-write (the schema is defined before the data is loaded).
- A Data Lake is typically used by data scientists and data engineers; a Data Warehouse mainly serves business users and analysts.
- Data Lake storage is comparatively cheap and highly flexible; a Data Warehouse is more expensive and harder to change once its structure is fixed.
7) What is Blob Storage in Azure? What are its important features?
In Microsoft Azure, Blob Storage is one of the most fundamental components. It is mainly used to store large volumes of unstructured data such as text or binary data. Enterprises use it to expose data publicly to the world or to store application data privately. It is most commonly used for streaming audio and video, and for storing data for backup, disaster recovery, and analysis. It can also be used to build Data Lakes for analytics.
Following is a list of some primary applications of Blob Storage:
- Serving images or documents directly to a browser.
- Storing files for distributed access.
- Streaming video and audio.
- Writing to log files.
- Storing data for backup and restore, disaster recovery, and archiving.
- Storing data for analysis by an on-premises or Azure-hosted service.
8) What is the key difference between the Dataset and Linked Service in Azure Data Factory?
A dataset is a reference to the data inside the data store described by a linked service. For example, when we copy data from a SQL Server instance, the dataset specifies the name of the table that contains the target data, or the query that returns data from different tables.
A linked service is a description of the connection to a data store. For the same SQL Server example, the linked service contains the name of the SQL Server instance and the credentials used to connect to that instance.
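To make the distinction concrete, here is a sketch of the two JSON definitions written as Python dicts. The names ("OnPremSqlLinkedService", "dbo.Customers") and the connection string are hypothetical; the point is that the connection details live on the linked service, while the dataset only points at data inside that store.

```python
# Illustrative ADF JSON definitions expressed as Python dicts.
# All names and the connection string are hypothetical examples.

linked_service = {
    "name": "OnPremSqlLinkedService",
    "properties": {
        "type": "SqlServer",
        "typeProperties": {
            # Connection details (server, credentials) belong here.
            "connectionString": "Server=myserver;Database=mydb;",
        },
    },
}

dataset = {
    "name": "CustomersTable",
    "properties": {
        "type": "SqlServerTable",
        # The dataset references the store via the linked service...
        "linkedServiceName": {
            "referenceName": "OnPremSqlLinkedService",
            "type": "LinkedServiceReference",
        },
        # ...and names the specific table inside that store.
        "typeProperties": {"tableName": "dbo.Customers"},
    },
}
```

Notice that the dataset never repeats the connection details; it only holds a reference to the linked service plus the table name.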
9) What are the key differences between Data Lake Storage and Blob Storage?
Following is the list of key differences between Data Lake Storage and Blob Storage:
- Purpose: Data Lake Storage is optimized for big-data analytics workloads; Blob Storage is general-purpose object storage for any kind of unstructured data.
- Structure: Data Lake Storage provides a hierarchical file system with directories; Blob Storage uses a flat namespace of containers and blobs.
- Security: Data Lake Storage supports POSIX-like ACLs at the file and directory level in addition to Azure RBAC; Blob Storage relies on RBAC and container- or blob-level access.
- Compatibility: Data Lake Storage is designed to work with Hadoop-compatible analytics engines.
10) How many types of triggers are supported by Azure Data Factory?
Following are the three types of triggers that Azure Data Factory supports:
- Schedule trigger: executes a pipeline on a wall-clock schedule.
- Tumbling window trigger: executes a pipeline over contiguous, fixed-size, non-overlapping time intervals, with support for backfilling past periods.
- Event-based trigger: executes a pipeline in response to an event, such as the arrival or deletion of a blob in Azure Blob Storage.
11) What steps are used to create an ETL process in Azure Data Factory?
Suppose we want to retrieve some data from an Azure SQL database, process it, and store the processed data in Azure Data Lake Store.
Following are the steps we have to follow to create such an ETL process:
1. Create a linked service for the source data store (the Azure SQL database).
2. Create a linked service for the destination data store (Azure Data Lake Store).
3. Create datasets for the source and the destination.
4. Create a pipeline with a Copy activity (and, if transformation is needed, a Data Flow activity).
5. Add a trigger to schedule the pipeline.
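At the heart of such a pipeline is the Copy activity. Below is a minimal sketch of its JSON definition, written as a Python dict; the dataset names are hypothetical, and each dataset would reference one of the two linked services.

```python
# Sketch of a Copy activity moving data from Azure SQL to Data Lake.
# Dataset names ("AzureSqlDataset", "DataLakeDataset") are hypothetical.

copy_activity = {
    "name": "CopySqlToDataLake",
    "type": "Copy",
    # Input dataset: points at the Azure SQL source table/query.
    "inputs": [{"referenceName": "AzureSqlDataset",
                "type": "DatasetReference"}],
    # Output dataset: points at the Data Lake destination folder.
    "outputs": [{"referenceName": "DataLakeDataset",
                 "type": "DatasetReference"}],
    "typeProperties": {
        "source": {"type": "AzureSqlSource"},
        "sink": {"type": "AzureDataLakeStoreSink"},
    },
}
```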
12) How can you schedule a pipeline?
There are two ways to schedule a pipeline:
- Using a schedule trigger, which invokes the pipeline on a wall-clock schedule.
- Using a tumbling window trigger, which runs the pipeline over periodic, fixed-size time intervals while retaining state.
13) What are the key differences between Azure HDInsight and Azure Data Lake Analytics?
Following is the list of key differences between Azure HDInsight and Azure Data Lake Analytics:
- Azure HDInsight is a Platform-as-a-Service offering: we provision and configure clusters (Hadoop, Spark, Hive, and so on) ourselves, which gives us more flexibility.
- Azure Data Lake Analytics is a Software-as-a-Service offering: we simply submit U-SQL jobs, and the service provisions and scales the required compute on demand.
- With HDInsight, we can process data using any framework the cluster supports; with Data Lake Analytics, processing is done through U-SQL, which is less flexible but much simpler to operate.
14) What are the top-level concepts of Azure Data Factory?
Following is a list of the four basic top-level concepts of Azure Data Factory:
- Pipeline: a logical grouping of activities that together perform a unit of work.
- Activities: the processing steps within a pipeline, covering data movement, data transformation, and control flow.
- Datasets: named views of data that point to the data used by activities as inputs and outputs.
- Linked services: the connection information that Azure Data Factory needs to connect to external resources such as data stores and compute services.
15) What are the different rich cross-platform SDKs for advanced users in Azure Data Factory?
Azure Data Factory V2 provides a rich set of SDKs that we can use to write, manage, and monitor pipelines from our favorite IDE. Some popular cross-platform options for advanced users are as follows:
- .NET SDK
- Python SDK
- PowerShell
- Documented REST APIs
16) How can you pass parameters to a pipeline run in Azure Data Factory?
In Azure Data Factory, parameters are first-class, top-level concepts. We can pass parameters to a pipeline run by defining pipeline-level parameters and then passing arguments when executing the run, either on demand or from a trigger.
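The sketch below shows a pipeline with one parameter and how run-time arguments interact with its default value. The pipeline name is hypothetical, and `resolve_run_parameters` is our own illustration of the merge ADF performs at run start, not an ADF API.

```python
# Pipeline JSON (as a Python dict) declaring one parameter with a
# default value. "LoadTablePipeline" and the values are hypothetical.

pipeline = {
    "name": "LoadTablePipeline",
    "properties": {
        "parameters": {
            "tableName": {"type": "String",
                          "defaultValue": "dbo.Customers"},
        },
    },
}

def resolve_run_parameters(pipeline, arguments):
    """Illustrative only: return the parameter values a run would see.
    Arguments passed at run time win; otherwise defaultValue applies."""
    params = pipeline["properties"]["parameters"]
    return {
        name: arguments.get(name, spec.get("defaultValue"))
        for name, spec in params.items()
    }

resolve_run_parameters(pipeline, {})
# -> {'tableName': 'dbo.Customers'}  (default applies)
resolve_run_parameters(pipeline, {"tableName": "dbo.Orders"})
# -> {'tableName': 'dbo.Orders'}    (argument overrides default)
```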
17) What are the key differences between the Mapping data flow and Wrangling data flow transformation activities in Azure Data Factory?
In Azure Data Factory, the main difference between the Mapping data flow and the Wrangling data flow transformation activities is as follows:
The Mapping data flow activity is a visually designed data transformation activity that facilitates users to design graphical data transformation logic. It doesn't need the users to be expert developers. It is executed as an activity within the ADF pipeline on an ADF fully managed scaled-out Spark cluster.
On the other hand, the Wrangling data flow activity is a code-free data preparation activity. It integrates with Power Query Online to make the Power Query M functions available for data wrangling using Spark execution.
18) Is it possible to define default values for the pipeline parameters?
Yes, we can easily define default values for the parameters in the pipelines.
19) How can you access the data using the other 80 Dataset types in Azure Data Factory?
Azure Data Factory's mapping data flow feature natively supports Azure SQL Database, Azure Synapse Analytics (SQL Data Warehouse), delimited text files from Azure Blob Storage, and Azure Data Lake Storage as sources and sinks. For data from any of the other connectors, we first use a Copy activity to stage the data into one of those supported stores, and then execute a Data Flow activity to transform it.
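This staging pattern can be sketched as a two-activity pipeline: a Copy activity lands the data from an arbitrary connector into Blob storage, then an Execute Data Flow activity transforms it. All names below are hypothetical.

```python
# Sketch of the stage-then-transform pattern as pipeline JSON
# (Python dict). Activity and dataset names are hypothetical.

staging_pipeline = {
    "name": "StageThenTransform",
    "properties": {
        "activities": [
            {
                # Step 1: copy from any connector into Blob storage.
                "name": "StageToBlob",
                "type": "Copy",
                "inputs": [{"referenceName": "AnyConnectorDataset",
                            "type": "DatasetReference"}],
                "outputs": [{"referenceName": "StagingBlobDataset",
                             "type": "DatasetReference"}],
            },
            {
                # Step 2: transform the staged data with a data flow,
                # but only after the staging copy succeeds.
                "name": "TransformStaged",
                "type": "ExecuteDataFlow",
                "dependsOn": [{"activity": "StageToBlob",
                               "dependencyConditions": ["Succeeded"]}],
            },
        ]
    },
}
```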
20) Can an activity in a pipeline consume arguments that are passed to a pipeline run?
Yes. Every activity within the pipeline can consume the parameter value that is passed to the pipeline run by using the @parameter construct.
21) What do we need to execute an SSIS package in Azure Data Factory?
We have to create an SSIS IR, and an SSISDB catalog hosted in Azure SQL Database or Azure SQL Managed Instance to execute an SSIS package in Azure Data Factory.
22) Can the output property of activity be consumed in another activity?
Yes. The output property of an activity can be consumed in a subsequent activity with the @activity construct.
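For example, a Set Variable activity can record the row count reported by a previous Copy activity. "CopyStep" and "copiedRows" below are hypothetical names; `rowsCopied` is a field of the Copy activity's output.

```python
# Illustrative Set Variable activity (JSON as a Python dict) reading
# a previous Copy activity's output via the @activity construct.
# "CopyStep" and the variable name are hypothetical.

set_variable_activity = {
    "name": "RecordRowCount",
    "type": "SetVariable",
    # Only run after the Copy activity succeeds, so output exists.
    "dependsOn": [
        {"activity": "CopyStep", "dependencyConditions": ["Succeeded"]}
    ],
    "typeProperties": {
        "variableName": "copiedRows",
        "value": {
            "value": "@string(activity('CopyStep').output.rowsCopied)",
            "type": "Expression",
        },
    },
}
```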
23) What types of compute environments are supported by Azure Data Factory to execute the transform activities?
Following are the two types of compute environments supported by Azure Data Factory to execute the transform activities:
- On-demand compute environment: a fully managed environment created by Azure Data Factory itself; the service provisions the compute before a job is submitted and removes it when the job completes.
- Bring-your-own compute environment: an existing compute environment (for example, an HDInsight cluster) that we register in Azure Data Factory as a linked service and manage ourselves.
24) How can you handle the null values in an activity output in Azure Data Factory?
We can handle the null values gracefully in an activity output in Azure Data Factory by using the @coalesce construct in the expressions.
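The @coalesce expression returns the first non-null argument, so a null field in an activity's output can be replaced by a fallback value. The snippet below mimics that semantics in plain Python for illustration (Python's `None` plays the role of ADF's null; the output shape is hypothetical).

```python
# Python illustration of @coalesce semantics: return the first
# argument that is not null (None here stands in for ADF's null).

def first_non_null(*values):
    for value in values:
        if value is not None:
            return value
    return None

# A (hypothetical) lookup output where the expected field is missing:
activity_output = {"firstRow": {}}

# Equivalent in spirit to @coalesce(activity(...).output.firstRow.count, 0)
value = first_non_null(activity_output["firstRow"].get("count"), 0)
# -> 0, because the field was null and the fallback applies
```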
25) Is the knowledge of coding required for Azure Data Factory?
No, deep coding knowledge is not required for Azure Data Factory. It provides 90+ built-in connectors and lets us transform data using mapping data flow activities without programming skills or Spark cluster knowledge. It also helps us create workflows very quickly.
26) Which Azure Data Factory version is used to create data flows?
We can use the Azure Data Factory V2 version to create data flows.
27) What are the two levels of security in Azure Data Lake Storage Gen2 (ADLS Gen2)?
Azure Data Lake Storage Gen2 provides an access control model that supports two levels of security: Azure role-based access control (Azure RBAC) and POSIX-like access control lists (ACLs).
28) Is Azure Data Factory an ETL tool?
Yes, Azure Data Factory is one of the leading tools available for ETL processes. ETL stands for Extract, Transform, and Load: a data-integration process that combines data from multiple sources into a single, consistent store and loads it into a data warehouse or other target system. Azure Data Factory stands out because it simplifies the entire data-migration process without requiring us to write any complex algorithms.
29) What do you understand by Azure Table Storage?
Azure Table Storage is a service that lets users store structured NoSQL data in the cloud, providing a key/attribute store with a schemaless design. It is quick and cost-effective storage for modern applications.
30) How can we monitor the execution of a pipeline executed under the Debug mode?
We can monitor the execution of a pipeline run in Debug mode by checking the Output tab of the pipeline canvas in the Azure Data Factory user interface.
31) What are the steps involved in the ETL process?
The ETL process stands for Extract, Transform, and Load. There are mainly four steps involved in the ETL process:
1. Connect and collect: connect to all the required data sources and move the data to a centralized location for processing.
2. Transform: transform the collected data using compute services such as HDInsight Hadoop, Spark, or Data Flows.
3. Publish: load the transformed data into a target such as Azure Data Warehouse, Azure SQL Database, or Azure Cosmos DB.
4. Monitor: track the scheduled activities and pipelines for success and failure rates.
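The four steps above can be made concrete with a toy ETL run in plain Python; the in-memory `source_rows` list and `sink` list are stand-ins for real source and target stores.

```python
# A toy ETL run illustrating extract -> transform -> load -> monitor.
# The source rows and the sink list are in-memory stand-ins.

source_rows = [
    {"name": " Alice ", "amount": "10"},
    {"name": "Bob", "amount": None},   # incomplete record
]

def extract(rows):
    # 1. Connect and collect: pull raw records from the source.
    return list(rows)

def transform(rows):
    # 2. Transform: clean up types and drop incomplete records.
    cleaned = []
    for row in rows:
        if row["amount"] is None:
            continue
        cleaned.append({"name": row["name"].strip(),
                        "amount": int(row["amount"])})
    return cleaned

sink = []

def load(rows):
    # 3. Publish: write the transformed rows to the target store.
    sink.extend(rows)
    return len(rows)

loaded = load(transform(extract(source_rows)))
# 4. Monitor: check how many rows made it through.
print(loaded)  # 1 (Bob's incomplete record was dropped)
```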
32) Which integration runtime should we use to copy data from an on-premises SQL Server instance using Azure Data Factory?
To copy data from an on-premises SQL Server instance using Azure Data Factory, we must have installed the self-hosted integration runtime on the on-premises machine where the SQL Server Instance is hosted.
33) What changes can we see regarding data flows from private preview to limited public preview?
Following is a list of some important changes we can see regarding data flows from private preview to limited public preview: