Apache Airflow

Apache Airflow is a widely used tool for managing and automating tasks and their workflows, and it is primarily used for scheduling those tasks. Suppose you are working as a data engineer or an analyst and need to repeat a task that takes the same effort and time on every run, such as extracting, loading, or transforming data for a regular analytical report. With Apache Airflow you can automate such tasks by defining them as a workflow and scheduling that workflow to run at a regular interval.

Additionally, Airflow lets you easily automate time-consuming and repetitive tasks. It is written primarily in Python, which gives it broad integration and backend support, and it offers a rich UI to identify, monitor, and debug any issues that may arise over time or across environments. This makes Apache Airflow an efficient tool for serving such tasks with ease.

Before proceeding with the installation and usage of Apache Airflow, let's first discuss some terms that are central to the tool.

What is a DAG?

DAG stands for Directed Acyclic Graph. It is the heart of Apache Airflow and can be defined as a series of tasks that you want to run as part of your workflow. A workflow might include generic tasks such as extracting data with SQL queries, performing some calculations in Python, and then publishing the result in the form of tables. In Airflow, each of these steps is written as an individual task in a DAG. The main purpose of the DAG is to define the relationships and dependencies between tasks, for example that data must be loaded before it is processed, so that the scripts run in a single, well-defined order.

A DAG is written in Python, saved as a file with a .py extension, and used heavily for orchestration together with the tool's configuration. When a DAG runs, its file is executed so that the tasks can be processed automatically under a specified schedule. The schedule is defined with a cron expression, which can express intervals in terms of minutes, hours, days, or weeks.
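As an illustration, a minimal DAG file might look like the following sketch; the DAG id, schedule, and task names here are only placeholders and not part of any particular project:

  # my_first_dag.py - a minimal, illustrative DAG (all names are placeholders)
  from datetime import datetime
  from airflow import DAG
  from airflow.operators.bash import BashOperator

  with DAG(
      dag_id="example_daily_report",      # hypothetical DAG name
      start_date=datetime(2022, 1, 1),
      schedule_interval="0 6 * * *",      # cron expression: every day at 06:00
      catchup=False,
  ) as dag:
      extract = BashOperator(task_id="extract_data",
                             bash_command="echo 'extracting data...'")
      transform = BashOperator(task_id="transform_data",
                               bash_command="echo 'transforming data...'")
      extract >> transform                # extract must finish before transform runs

Here the >> operator expresses the dependency between the two tasks, so Airflow knows to run them in order.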

Thus, after learning about DAGs, it is time to install Apache Airflow so that we can use it when required. See the installation steps below for your reference.

Installation

Now that we have discussed Airflow at length, let's get hands-on experience by installing it and using it to improve our workflows. Consider the below steps for installing Apache Airflow.

  1. The first step for installing Airflow is to have a version control system like Git. If you don't have it, consider downloading it before installing Airflow.
  2. After installing Git, create a repository on GitHub for your Airflow project. Once you have done this, clone the repository to your local environment using its Git web URL.
  3. Now, open a terminal in your local environment and navigate to your directory, e.g. cd /path/to/my_airflow_directory.
  4. Once you are in the required directory, set up a pipenv environment with a specific Python version and install Flask and Airflow into it. These installations are important because Airflow depends on them to run.
  5. Next, it is good practice to pin the versions of all installations, which can be done using the following command in the terminal.
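For example, assuming a pipenv-based setup, the environment and a pinned Airflow version could be installed along these lines (the Python and Airflow version numbers are only placeholders; pick the ones that match your system):

  pipenv install --python 3.8
  pipenv run pip install "apache-airflow==2.3.0" \
      --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.3.0/constraints-3.8.txt"

The constraint file is the mechanism the Airflow documentation recommends for pinning compatible dependency versions.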

The above command installs the specific pinned versions that satisfy all of Airflow's requirements and dependencies.

The next step is to specify a location on your local system as AIRFLOW_HOME, the directory where Airflow keeps its configuration and metadata. If you do not specify it, Airflow falls back to its default location (~/airflow). To set it through a .env file, type the following command.
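For example, assuming you want AIRFLOW_HOME to point at the current project directory (pipenv loads the .env file automatically), the command could be:

  echo "AIRFLOW_HOME=${PWD}" >> .env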

As mentioned earlier, Airflow requires a database backend to run, and for a simple local setup you can opt for the SQLite database. Before using it, you need to initialize the database, which can be done using the below command.
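With Airflow 2.x, the initialization command is typically:

  airflow db init

(Older 1.x releases used airflow initdb instead.)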

Now, the final task is to run Airflow and make use of it.

Additionally, Airflow offers a rich web UI so that you can monitor and observe your DAGs. To start the server and view the web UI, run the below command.
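For example, on the default port:

  airflow webserver --port 8080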

Also, we need to start the scheduler using the following command.
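In another terminal (so that the web server keeps running), the scheduler is started with:

  airflow scheduler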

We are now ready to view the contents offered by the web UI of Apache Airflow. Just navigate to localhost in your browser, as shown below:
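With the default settings, the UI is served at:

  http://localhost:8080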

CLI commands offered by Airflow for DAGs

Now that we have installed and set up Airflow, let's see some of the most commonly used CLI commands for working with DAGs.

To test a single task on a given date, such as the sleep task from Airflow's example tutorial DAG, you can use the below command.
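Assuming the tutorial example DAG that ships with Airflow (which defines a task called sleep), the test command looks like:

  airflow tasks test tutorial sleep 2015-06-01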

To list the tasks in a DAG, you can use the below command.
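For example, using the same tutorial DAG as the DAG id:

  airflow tasks list tutorial
  airflow tasks list tutorial --tree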

To pause or unpause the execution of a DAG, use the below command.
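With the tutorial DAG as the example DAG id:

  airflow dags pause tutorial
  airflow dags unpause tutorial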

To run the tasks of a DAG for a range of past dates, known as a backfill, you can use the following command.
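For example, backfilling the tutorial DAG over one week in the past:

  airflow dags backfill tutorial --start-date 2015-06-01 --end-date 2015-06-07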

Summary

This article is designed to be a complete introduction to get you up and running with Airflow and creating your first DAG. In this tutorial, you learned about the setup and configuration of Apache Airflow, and you also came across the basic CLI commands that support the workflow of using DAGs in Airflow.

