Apache Airflow in Python | Airflow Python Operator
In this tutorial, we will learn about Apache Airflow and its operators. We will look at the operators Airflow offers, but our primary aim is to explore the Python operator and how we can use it. Before diving deep into this topic, let's understand the basic concepts of Airflow and why it is popular.
What are Data Pipelines?
Data pipelines consist of multiple tasks or actions that must be executed to get the desired result. For example, suppose we want to build a live weather dashboard that forecasts the weather for the upcoming week. Implementing this dashboard involves several distinct tasks.
Together, these tasks form the pipeline. Moreover, they must be performed in a specific order.
What is Apache Airflow?
Apache Airflow is a tool widely used in data engineering. It is a workflow engine that schedules and runs complex data pipelines. It ensures that each task in a data pipeline is executed in the correct order and gets the resources it requires. It also provides a user interface for monitoring pipelines and troubleshooting issues.
Airflow uses the DAG (Directed Acyclic Graph), a collection of all tasks users want to run. These tasks are organized in such a way that the relationship and dependencies are maintained. The structure of the DAG (tasks and their dependencies) is specified as code in Python scripts.
Representing a data pipeline as a DAG makes the relationships between tasks apparent. Each node represents a task, and each directed edge represents a dependency between tasks.
For example, if an edge points from task X to task Y, task X must finish before task Y can begin. This direction makes it a directed graph, and it is acyclic because the dependencies must never loop back on themselves.
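The ordering rule described above can be sketched with Python's standard graphlib module (Python 3.9 or later). This is not how Airflow schedules tasks internally; it is only a small illustration of how directed edges fix the run order. The task names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical task names; each key maps a task to the set of tasks
# that must finish before it can start.
dependencies = {
    "clean_data": {"fetch_forecast"},
    "push_to_dashboard": {"clean_data"},
}

# static_order() yields the tasks so that every dependency comes first.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)

# Adding an edge that loops back creates a cycle, so no valid order exists.
cyclic = dict(dependencies, fetch_forecast={"push_to_dashboard"})
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

The second half shows why the graph has to be acyclic: once a dependency loops back, there is no order in which the tasks could run.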
Installation of Airflow
It is important to create the dags folder in the airflow directory; this is where we will define our DAGs. Now open the web browser and visit http://localhost:8080/admin/ to see the Airflow dashboard.
Python Operator in Apache Airflow
There are multiple operators in Apache Airflow, such as BashOperator, PythonOperator, EmailOperator, MySqlOperator, etc. An operator defines a single task in a workflow, and Airflow provides many operators for different kinds of tasks.
In this section, we will discuss the Python operator.
Defining DAG Argument
We need to pass one dictionary of arguments to each DAG. Airflow applies these default arguments to every task in that DAG.
Let's understand the following example -
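A default-argument dictionary of this kind might look like the sketch below. The keys are standard task arguments in Airflow, but the values (owner name, dates, retry settings) are placeholders chosen for illustration, not values from the original tutorial.

```python
from datetime import datetime, timedelta

# Illustrative values only; the owner name and dates are placeholders.
default_args = {
    "owner": "airflow",                   # shown as the owner in the UI
    "depends_on_past": False,             # do not wait on the previous run of a task
    "start_date": datetime(2024, 1, 1),   # first date the DAG is scheduled for
    "retries": 1,                         # retry a failed task once
    "retry_delay": timedelta(minutes=5),  # wait five minutes before retrying
}
```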
Defining the Python Function
Now, we will define a Python function that prints the string passed to it as an argument. The Python operator will later call this function.
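A minimal version of such a function is sketched below. The function name and the sample message are hypothetical. Returning the string is optional and is done here only so that the result is easy to check.

```python
def print_message(message):
    """Print the given string and return it so the result is easy to check."""
    print(message)
    return message


# Called directly here; in the DAG, the Python operator makes this call.
result = print_message("Hello from Airflow")
```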
The next step is to create the DAG object and pass the dag_id. The dag_id is the name of the DAG and must be unique. Then pass the argument dictionary that we defined earlier, and add a description and a schedule_interval. Airflow will run the DAG repeatedly at the specified interval. Let's see the following example.
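Creating the DAG object could look roughly like this. The sketch assumes an Airflow 2.x installation, and the dag_id, description, and dates are hypothetical. Newer Airflow releases rename schedule_interval to schedule, so the keyword may need adjusting for the installed version.

```python
from datetime import datetime, timedelta

from airflow import DAG  # assumes Apache Airflow 2.x is installed

# Placeholder default arguments, applied to every task in the DAG.
default_args = {"owner": "airflow", "start_date": datetime(2024, 1, 1)}

dag = DAG(
    dag_id="python_operator_demo",        # hypothetical name; must be unique
    default_args=default_args,
    description="Print a message with the PythonOperator",
    schedule_interval=timedelta(days=1),  # run once a day
    catchup=False,                        # do not backfill runs for past dates
)
```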
Defining the Task
In this workflow, we define only one task.
We pass the task_id to the PythonOperator object; this is the name shown on the node in the Graph view of our DAG. In the python_callable argument, we pass the function that we want to execute, and in op_kwargs we pass its parameter values as a dictionary. Finally, we pass the DAG object to which we want to link this task.
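Put together, the task definition might be sketched as follows. It assumes Airflow 2.x import paths. The dag_id, function name, and message are hypothetical, while the task_id "print" matches the node name used in this tutorial.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path


def print_message(message):
    """Callable executed by the task; prints the string it receives."""
    print(message)


# Compact DAG so that the example is self-contained.
dag = DAG(
    dag_id="python_operator_demo",  # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
)

print_task = PythonOperator(
    task_id="print",                              # name shown on the graph node
    python_callable=print_message,                # function to execute
    op_kwargs={"message": "Hello from Airflow"},  # keyword arguments for it
    dag=dag,                                      # DAG this task belongs to
)
```

The keys of op_kwargs must match the parameter names of the callable, because Airflow passes them as keyword arguments.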
Run the DAG
Now, refresh the Airflow dashboard; it will show the DAG in the list. Each task in the workflow appears as a separate box. Click on the DAG and wait until the border of the box turns dark green, indicating that the task has completed successfully.
Click on the node "print" to get more details about this step, and then click on Logs to see the output.
What are the Variables in Apache Airflow?
As we discussed, Airflow can be used to create and manage complex workflows, and we can run multiple workflows at the same time. Several workflows may use the same database or the same file path. Now suppose we change the directory path where files are saved, or change the database configuration. In that case, we don't want to update each DAG individually.
With Airflow, we can create variables in which we store data and retrieve it at runtime from multiple DAGs. If we want to make a change, we edit the variable once, and our workflows are good to go.
How to create Variables?
To create a variable, open the Airflow dashboard, click on Admin in the top menu, and then click on Variables.
Click on the Create button to open the form for a new variable. Enter the key and the value, then submit.
Now, we will create a DAG that calculates the word count of the text data in a file. The DAG can import and read the newly created variable. Let's understand the following example.
Now we will define the function that reads the variable, uses it to load the text data, and calculates the word count.
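One way to write such a function is sketched below, assuming the variable stores the path of a text file. The variable key "data_path" and the function names are hypothetical. The counting logic is kept in its own function so that it can be checked without an Airflow installation.

```python
import os
import tempfile


def count_words(path):
    """Return the number of whitespace-separated words in the file at path."""
    with open(path, encoding="utf-8") as handle:
        return len(handle.read().split())


def word_count_task():
    """Task callable: look up the file path in an Airflow Variable, then count."""
    # Imported here so that count_words can be used and tested without Airflow.
    from airflow.models import Variable

    path = Variable.get("data_path")  # "data_path" is a hypothetical variable key
    total = count_words(path)
    print(f"Word count: {total}")
    return total


# Quick check of the counting logic on a temporary file; no Airflow needed.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Airflow runs tasks\nin dependency order")
sample_count = count_words(tmp.name)
os.remove(tmp.name)
print(sample_count)
```

Reading the variable inside the callable, rather than at the top of the DAG file, means the lookup happens when the task runs and not every time Airflow parses the file.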
Now, the steps are the same as above: we define the DAG and the task, and our workflow is ready to run.
When we run the DAG, it will show the word count. We can also edit the variable whenever we want, and all the DAGs that use it pick up the change.
In this tutorial, we have discussed the Python operator and variables in Apache Airflow. We have also covered the basic concepts of Apache Airflow and its installation.