
AutoML Workflow

What is Automated Machine Learning?

Automated Machine Learning, or AutoML, is a branch of machine learning that automates the application of machine learning to real-world problems. It makes machine learning tasks easier and more accessible, both for machine learning experts and researchers and for those with less expertise in data science. AutoML provides various tools and practices to automate the selection and fine-tuning of machine learning models.

The main aim of Automated Machine Learning is to make machine learning easier and more approachable for enthusiasts and for those exploring data science, so that they can build and deploy more efficient models. AutoML helps reduce the time and effort needed to build effective machine learning models.

Working of AutoML

AutoML tools simplify the machine learning process and its tasks, from exploring and manipulating data sets to deploying the machine learning model. In the traditional machine learning process, each stage of developing a model is carried out separately. AutoML, in contrast, automatically searches over machine learning algorithms and selects the best model with optimized settings.

The process of choosing and implementing the models in AutoML is done with two different concepts:

  • Neural Architecture Search: It helps in designing neural networks by building and exploring new architectures for the AutoML models.
  • Transfer Learning: This technique reuses existing architectures and their learned knowledge on new models and problems to which they are well suited.

Python provides several AutoML libraries that automate the machine learning process. Such a library can be installed in Python using pip.
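For example, AutoKeras, the AutoML library used in the program later in this article, can be installed with this command:

pip install autokeras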

The workflow of traditional machine learning starts with identifying the problem statement, then preprocessing the data (including data cleaning, feature engineering, etc.), training on the dataset and choosing the best model, and finally predicting the outcomes and visualizing them. This is a long, iterative process that takes a great deal of time, requiring multiple experiments and iterations to reach the optimal model and solution.

The workflow of AutoML starts with the collection of data and its preprocessing. It then explores the data and chooses the algorithm that best fits the relationship between the target variable and its attributes.

Let's study the different steps involved in AutoML. These include:

  1. Data Loading
  2. Data Preprocessing
  3. Feature Engineering
  4. Model Selection
  5. Model Training
  6. Hyperparameter Tuning
  7. Deployment

Data Loading

This is the first step in AutoML: loading and reading the data in a suitable, supported format and then analyzing it to check whether it can be used for further processing. This step is also called data ingestion. It includes data exploration, checking for null values in the dataset, and making sure the data can be used for machine learning tasks.

It is important to note that many AutoML tools can only be used when there is enough labeled data. As a result, this stage also ensures that one has adequate data to train a strong model.

Data Preprocessing

Data preprocessing is the second step of the AutoML process. It involves converting the raw data into a clean format. Data preparation or preprocessing includes techniques such as removing duplicate records, checking for null values and replacing them with suitable values, and scaling and normalizing the data. This step ensures the data quality needed for model building.
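As a minimal illustration (not taken from this article), such preprocessing could be done with pandas and scikit-learn; the DataFrame and column names below are assumptions:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Illustrative data: the column names and values are assumptions for this sketch.
df = pd.DataFrame({"age": [25, 32, None, 32, 47],
                   "salary": [50000, 60000, 55000, 60000, None]})

df = df.drop_duplicates()                   # remove duplicate rows
df = df.fillna(df.mean(numeric_only=True))  # replace null values with the column mean
df[["age", "salary"]] = MinMaxScaler().fit_transform(df[["age", "salary"]])  # scale to [0, 1]
print(df)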

Feature Engineering

This step includes the selection of the features that are used for building the model. It covers how features are extracted and processed from the data, along with sampling and shuffling.

The feature engineering or data engineering process can be done manually or automatically with the help of deep learning techniques, which extract features from the data set automatically.

Data sampling is the process of splitting the dataset into separate fragments: training and testing data. AutoML randomly selects a portion of the data set to use as training data.

Data shuffling rearranges the records of the original data into a random order before training.
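A minimal sketch of sampling and shuffling with scikit-learn (an illustration, not this article's code), using the Iris dataset as assumed example data:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Shuffle the records and randomly sample 80% for training and 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)
print(len(X_train), "training samples,", len(X_test), "testing samples")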

Model Selection

This is the fourth step, in which AutoML chooses the best model among various candidates for model building and training.

Some models perform better on a specific dataset or for specific objectives, such as binary classification or time series prediction. With numerous candidate models available, it is crucial to determine what you require from your dataset and what your purposes are. AutoML tools select the appropriate model automatically; for this, some systems employ a cutting-edge technique called neural architecture search.
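As a hedged sketch of automated model selection, Auto-Sklearn (described in the tools section below) searches over candidate models and reports how they rank; the time budgets here are arbitrary assumptions:

import autosklearn.classification
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Let auto-sklearn try many models within the (illustrative) time budget.
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,   # total search budget in seconds
    per_run_time_limit=30)         # budget per candidate model
automl.fit(X_train, y_train)
print(automl.leaderboard())        # ranking of the models that were tried
print(automl.score(X_test, y_test))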

Model Training

The next step is to train the model. There are numerous machine learning models, each with its own set of hyperparameters; examples include linear regression, decision trees, random forests, neural networks, and deep neural networks.

Different models are trained on the data, and the one with the highest accuracy is selected for further refining or tuning and, finally, for deployment.
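A minimal sketch (not from this article) of training several candidate models and keeping the one with the highest cross-validated accuracy, which is what an AutoML system does automatically:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(),
    "random_forest": RandomForestClassifier(),
}

# Score each candidate with 5-fold cross-validation and keep the best one.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(scores)
print("Best model:", best)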

Hyperparameter Tuning

The hyperparameters need to be tuned for better model performance. This is called hyperparameter optimization. The AutoML system generates predictions for various hyperparameter configurations and selects the best one.
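A small sketch of hyperparameter optimization using scikit-learn's GridSearchCV (the parameter grid below is an arbitrary example, not the grid an AutoML tool would necessarily use):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Try every combination in a small, illustrative grid and keep the best setting.
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)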

Deployment

Once a model has been built and tuned, deploying it can be challenging, particularly in large-scale systems that typically require extensive data engineering work.

An AutoML system, on the other hand, can establish a machine learning pipeline straightforwardly by leveraging built-in knowledge about how to deploy the model to various systems and contexts.
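For example, an AutoKeras classifier such as the one trained in the program below can be exported as a standard Keras model and saved for deployment; this is a sketch that assumes a fitted classifier named clf, as in that program:

import autokeras as ak
import tensorflow as tf

# `clf` is assumed to be a fitted AutoKeras classifier (see the program below).
model = clf.export_model()                  # convert the AutoKeras pipeline to a Keras model
model.save("best_model", save_format="tf")  # write it to disk as a TensorFlow SavedModel

# The saved model can later be reloaded for serving predictions.
reloaded = tf.keras.models.load_model("best_model", custom_objects=ak.CUSTOM_OBJECTS)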

Tools in AutoML

Various tools are used to automate the process of machine learning. These are:

  1. Auto-Sklearn: This AutoML tool is an open-source package built on the scikit-learn (sklearn) library. The framework searches for the optimal model architecture, preprocessing procedure, and hyperparameters using Bayesian optimization and meta-learning.
  2. Google AutoML: It provides various AutoML services for tasks like image recognition, natural language processing, and data analysis. Google also provides AutoML Tables, an AutoML tool used for structured data. It helps users build and deploy models for applications such as regression and time series forecasting.
  3. H2O.ai: This company offers H2O Driverless AI, an AutoML platform that automates the machine learning workflow, including data preprocessing, feature engineering, and hyperparameter tuning. It can be used for both labeled and unlabeled data.
  4. Databricks AutoML: Databricks AutoML is an AutoML application that simplifies building machine learning models on large datasets. It can handle a broad spectrum of tasks and provides a simple, interactive environment for model development and evaluation.
  5. Auto-PyTorch: Auto-PyTorch is an open-source AutoML package that automates the development of deep learning models. It uses Bayesian optimization to find the optimal model architecture, allowing its users to focus on high-level problem formulation.
  6. AutoKeras: It is an open-source AutoML package based on Keras and TensorFlow. It offers a simple interface that simplifies the procedure of creating deep learning models. AutoKeras provides, among other things, image classification, regression, and text classification. To determine the best neural network architecture and hyperparameters for a given dataset, it employs neural architecture search (NAS).

Let's understand the implementation of AutoML using AutoKeras.

Program 1: A program to implement an AutoML tool for predicting flowers from the flowers dataset.

1. Importing libraries and dataset

Code:
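A minimal sketch of this step, assuming the publicly available TensorFlow flower_photos dataset:

# Import the libraries and download the flowers dataset (3670 images in 5 classes).
import autokeras as ak
import tensorflow as tf

dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
data_dir = tf.keras.utils.get_file("flower_photos", origin=dataset_url, untar=True)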

2. Splitting the dataset

Code:
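A sketch consistent with the split sizes reported in the output below; the image size of 200 x 200, the batch size, and the seed are assumptions:

# Create the training split (hold out 20% of the images).
train_data = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(200, 200),   # fixed image size used for training and prediction
    batch_size=40)

# Create the validation split (25% of the images), also used as test data later.
test_data = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.25,
    subset="validation",
    seed=123,
    image_size=(200, 200),
    batch_size=40)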

Output:

Found 3670 files belonging to 5 classes.
Using 2936 files for training.
Found 3670 files belonging to 5 classes.
Using 917 files for validation.

Explanation:

We have split the data set into training and validation data and set a fixed image size to use for training and predictions. We held out 20% of the data when building the training set and used 25% of the data as validation data. There are a total of 3670 files, of which 2936 are used for training and 917 for validation.

3. Building and training of the model

Code:
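A sketch of this step with AutoKeras; the single trial (max_trials=1) and the name clf are assumptions consistent with the log below:

# Build an AutoKeras image classifier, let it search for an architecture,
# and then train the best model for 8 epochs.
clf = ak.ImageClassifier(overwrite=True, max_trials=1)
clf.fit(train_data, epochs=8)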

Output:

Trial 1 Complete [00h 54m 52s]
val_loss: 0.40751439332962036

Best val_loss So Far: 0.40751439332962036
Total elapsed time: 00h 54m 52s
INFO:tensorflow:Oracle triggered exit
Epoch 1/8
74/74 [==============================] - 534s 7s/step - loss: 0.9473 - accuracy: 0.4101
Epoch 2/8
74/74 [==============================] - 479s 6s/step - loss: 0.3521 - accuracy: 0.6104
Epoch 3/8
74/74 [==============================] - 519s 7s/step - loss: 0.2737 - accuracy: 0.7296
Epoch 4/8
74/74 [==============================] - 485s 7s/step - loss: 0.1841 - accuracy: 0.8535
Epoch 5/8
74/74 [==============================] - 485s 7s/step - loss: 0.1091 - accuracy: 0.9363
Epoch 6/8
74/74 [==============================] - 484s 7s/step - loss: 0.0769 - accuracy: 0.9656
Epoch 7/8
74/74 [==============================] - 453s 6s/step - loss: 0.0742 - accuracy: 0.9700
Epoch 8/8
74/74 [==============================] - 484s 7s/step - loss: 0.0593 - accuracy: 0.9796
INFO:tensorflow:Assets written to: .\image_classifier\best_model\asset

Explanation:

We have trained the model for 8 epochs using the AutoKeras ImageClassifier, which makes training such a large dataset easy and fast.

4. Evaluation of Model

Code:
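A sketch of the evaluation step, reusing the classifier clf and the held-out split from the earlier steps:

# Evaluate the best model found by AutoKeras on the held-out data.
print(clf.evaluate(test_data))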

Output:

Explanation:

Here, we have evaluated the model on the test data using the AutoKeras ImageClassifier.

5. Predictions from the model

  • Loading an image

Code:
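A sketch consistent with the output below, assuming PIL; the image path is a placeholder for any sample image from the downloaded dataset:

from PIL import Image

# Load one sample image and inspect its format, size, and mode.
img = Image.open("flower_photos/dandelion/sample.jpg")  # placeholder path to a sample image
print(img.format, img.size, img.mode)

# Resize it to the fixed size used during training; the format becomes None after resizing.
img = img.resize((200, 200))
print(img.format, img.size, img.mode)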

Output:

JPEG (320, 240) RGB
None (200, 200) RGB

Explanation:

We have loaded a sample image from the dataset and then resized it to the size we set above for training. The path passed to Image.open() should be the path of the sample image.

  • Predicting

Code:
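A sketch of the prediction step, converting the resized PIL image to a batch of one array and passing it to clf.predict:

import numpy as np

# Add a batch dimension so the array has shape (1, 200, 200, 3), then predict the class.
img_array = np.expand_dims(np.array(img), axis=0)
print(clf.predict(img_array))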

Output:

1/1 [==============================] - 0s 219ms/step
1/1 [==============================] - 0s 57ms/step
[['dandelion']]

Explanation:

Finally, we have used the predict function to classify the image, and the model predicts that it is a dandelion.

Benefits of AutoML

There are various benefits to using AutoML for building and deploying machine learning models. These include:

  • It saves a great deal of time, since it reduces or removes the manual trial and error involved in creating a machine learning model.
  • AutoML does not require much machine learning expertise, so it is accessible to a wide range of users.
  • AutoML can be used on large datasets and complex machine learning tasks, resulting in more accurate models.
  • It helps decrease bias in machine learning models by automating feature engineering and model selection.

Drawbacks of AutoML

  • Although AutoML often produces highly accurate models, they sometimes do not meet the specific requirements of a problem.
  • AutoML can generate complex models that are not easy to interpret, which makes it more challenging to explain the predictions.
  • AutoML tools can be costly for large machine learning projects.
  • Overfitting is also possible in AutoML, resulting in poor performance on unseen data.





