Weka Data Mining

Weka contains a collection of visualization tools and algorithms for data analysis and predictive modelling, together with graphical user interfaces for easy access to these functions. The original non-Java version of Weka was a Tcl/Tk front-end to (mostly third-party) modelling algorithms implemented in other programming languages, plus data preprocessing utilities in C and a makefile-based system for running machine learning experiments.

This original version was primarily designed as a tool for analyzing data from agricultural domains. The more recent, fully Java-based version (Weka 3), whose development started in 1997, is now used in many different application areas, particularly for education and research. Weka's advantages include:

Free availability under the GNU General Public License.
Portability, since it is fully implemented in the Java programming language and thus runs on almost any modern computing platform.
A comprehensive collection of data preprocessing and modelling techniques.
Ease of use due to its graphical user interfaces.

Weka supports several standard data mining tasks: data preprocessing, clustering, classification, regression, visualization, and feature selection. Input to Weka is expected to be formatted according to the Attribute-Relation File Format (ARFF), in a file with the .arff extension.

All of Weka's techniques are predicated on the assumption that the data is available as one flat file or relation, where each data point is described by a fixed number of attributes (normally numeric or nominal, although some other attribute types are also supported). Weka provides access to SQL databases using Java Database Connectivity and can process the result returned by a database query. It also provides access to deep learning through Deeplearning4j.

Weka is not capable of multi-relational data mining, but there is separate software for converting a collection of linked database tables into a single table suitable for processing with Weka. Another important area currently not covered by the algorithms included in the Weka distribution is sequence modelling.

History of Weka

In 1993, the University of Waikato in New Zealand began the development of the original version of Weka, which became a mix of Tcl/Tk, C, and makefiles.
In 1997, the decision was made to redevelop Weka from scratch in Java, including implementations of the modelling algorithms.
In 2005, Weka received the SIGKDD Data Mining and Knowledge Discovery Service Award.
In 2006, Pentaho Corporation acquired an exclusive licence to use Weka for business intelligence. It forms the data mining and predictive analytics component of the Pentaho business intelligence suite. Hitachi Vantara has since acquired Pentaho, and Weka now underpins the PMI (Plugin for Machine Intelligence) open-source component.

Features of Weka

Weka offers the following features:

1. Preprocess

Preprocessing is a crucial task in data mining. Because most data is raw, it may contain empty or duplicate values, garbage values, outliers, or extra columns, or follow inconsistent naming conventions. All of these degrade the results.

To make data cleaner and more complete, WEKA offers a comprehensive set of options under the filter category, covering both supervised and unsupervised operations. Some of the preprocessing operations are:

ReplaceMissingWithUserConstant: to replace empty or null values with a user-supplied constant.
ReservoirSample: to generate a random subset of sample data.
NominalToBinary: to convert the data from nominal to binary.
RemovePercentage: to remove a given percentage of data.
RemoveRange: to remove a given range of data.
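
To make the effect of two of these filters concrete, here is a minimal plain-Python sketch of the idea behind ReplaceMissingWithUserConstant and RemovePercentage. It does not use Weka's API. The function names, the use of None to mark a missing value, the sample rows, and the choice to drop rows from the start of the dataset are all assumptions made for illustration.

```python
def replace_missing(rows, constant):
    """Return a copy of rows with every None cell replaced by `constant`."""
    return [[constant if cell is None else cell for cell in row] for row in rows]


def remove_percentage(rows, percentage):
    """Drop the first `percentage` percent of rows and keep the remainder.

    Assumption: rows are removed from the start of the dataset.
    """
    cut = int(len(rows) * percentage / 100)
    return rows[cut:]


# Hypothetical sample data; None marks a missing value.
rows = [
    [5.1, 3.5, "setosa"],
    [None, 3.0, "setosa"],
    [6.2, None, "virginica"],
    [5.9, 3.0, "virginica"],
]

cleaned = replace_missing(rows, 0.0)
kept = remove_percentage(cleaned, 50)
print(cleaned)
print(kept)
```

Both helpers return new lists, so the original rows stay untouched and the steps can be chained in any order.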

2. Classify

Classification is one of the essential functions in machine learning, where we assign classes or categories to items. The classic examples of classification are: declaring a brain tumour as "malignant" or "benign" or assigning an email to a "spam" or "not_spam" class.

After selecting the desired classifier, we choose how it will be evaluated by picking one of the test options. Some of the options are:

Use training set: the classifier is tested on the same data it was trained on.
Supplied test set: the classifier is evaluated on a separate test set.
Cross-validation Folds: the classifier is assessed by cross-validation using the given number of folds.
Percentage split: the classifier is trained on the given percentage of the data and tested on the remainder.

Beyond these, further test options are available, such as Preserve order for % split and Output source code.
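
The difference between a percentage split and cross-validation can be sketched in a few lines of plain Python. This is an illustration of the idea only, not Weka's implementation: the function names are hypothetical, and the sketch keeps the original instance order with no shuffling or stratification, which a real tool may apply.

```python
def percentage_split(n_instances, train_percent):
    """Return (train_indices, test_indices), keeping the original order."""
    cut = int(n_instances * train_percent / 100)
    indices = list(range(n_instances))
    return indices[:cut], indices[cut:]


def cross_validation_folds(n_instances, n_folds):
    """Yield (train_indices, test_indices) once per fold.

    Instance i is assigned to fold i % n_folds, so every instance is
    used for testing exactly once across all folds.
    """
    indices = list(range(n_instances))
    for fold in range(n_folds):
        test = [i for i in indices if i % n_folds == fold]
        train = [i for i in indices if i % n_folds != fold]
        yield train, test


split_train, split_test = percentage_split(10, 66)
print(split_train, split_test)

folds = list(cross_validation_folds(10, 5))
for fold_train, fold_test in folds:
    print(fold_test)
```

With a percentage split, each instance is used either for training or for testing. With cross-validation, every instance is tested once and used for training in all other folds, which is why it is usually preferred on small datasets.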

3. Cluster

In clustering, a dataset is arranged into groups (clusters) based on similarity. The items within a cluster are similar to one another but different from the items in other clusters. Examples of clustering include identifying customers with similar behaviours and organizing regions according to homogeneous land use.
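
As a rough illustration of grouping by similarity, the sketch below runs k-means (Lloyd's algorithm) on one-dimensional values in plain Python. The text above does not name a specific algorithm, so k-means is used here only as a familiar example. The function name, the starting centroids, and the sample values are assumptions.

```python
def k_means_1d(values, centroids, max_iterations=100):
    """Cluster 1-D values with Lloyd's algorithm from the given starting centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Assignment step: attach every value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Update step: move each centroid to the mean of its cluster.
        updated = [
            sum(cluster) / len(cluster) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
        if updated == centroids:
            break  # centroids stopped moving, so the assignment is stable
        centroids = updated
    return centroids, clusters


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
final_centroids, final_clusters = k_means_1d(values, centroids=[1.0, 12.0])
print(final_centroids)
print(final_clusters)
```

The result depends on the starting centroids, which is why practical implementations often try several initializations.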

4. Associate

Association rules highlight associations and correlations between the items of a dataset. In short, a rule is an if-then statement that expresses how likely one set of items is to occur given another. A classic example is the connection between sales of milk and bread.

In this category, the tool provides the Apriori, FilteredAssociator, and FPGrowth algorithms for association rule mining.
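
The two measures that usually describe a rule such as "milk => bread" are support and confidence. The plain-Python sketch below computes both for a tiny, made-up set of shopping baskets. It only illustrates the definitions and is not how Apriori or FPGrowth search for rules; the function names and the data are assumptions.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate how often the consequent appears when the antecedent does."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical shopping baskets.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread"},
]

rule_support = support(baskets, {"milk", "bread"})
rule_confidence = confidence(baskets, {"milk"}, {"bread"})
print(rule_support, rule_confidence)
```

Here two of the four baskets contain both items, and two of the three baskets with milk also contain bread, so the rule holds in two thirds of the cases where it applies.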

5. Select Attributes

Datasets often contain many attributes, but several of them may not be significantly valuable. Therefore, removing the unnecessary attributes and keeping the relevant ones is very important for building a good model.

Weka offers many attribute evaluators, together with search methods such as BestFirst, GreedyStepwise, and Ranker.
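
One way to picture the pairing of an evaluator with a ranking search is to score each attribute by information gain and sort the scores. The plain-Python sketch below does this for nominal attributes. It is an illustration under stated assumptions, not Weka's code: the function names and the four sample rows are made up, and the class label is assumed to be the last column.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute_index, class_index=-1):
    """Reduction in class entropy after splitting on one nominal attribute."""
    labels = [row[class_index] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute_index], []).append(row[class_index])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_attributes(rows, names):
    """Return (name, gain) pairs for the non-class attributes, best first."""
    scored = [(name, information_gain(rows, i)) for i, name in enumerate(names)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical rows: outlook, windy, play (class).
rows = [
    ("sunny", "TRUE", "no"),
    ("sunny", "FALSE", "no"),
    ("overcast", "TRUE", "yes"),
    ("overcast", "FALSE", "yes"),
]

ranking = rank_attributes(rows, ["outlook", "windy"])
print(ranking)
```

In this toy data, outlook determines the class completely while windy carries no information, so outlook is ranked first and windy would be the first candidate for removal.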

6. Visualize

In the visualize tab, different plot matrices and graphs are available to show the trends and errors identified by the model.

Requirements and Installation of Weka

We can install WEKA on Windows, macOS, and Linux. The latest stable versions of Weka require Java 8 or above.

After WEKA is launched, five options are available in the Applications category, including the following:

The Explorer is the central panel where most data mining tasks are performed. We will further explore this panel in upcoming sections.
The tool provides an Experimenter panel, in which we can design and run experiments.
WEKA provides the KnowledgeFlow panel. It provides an interface to drag and drop components, connect them to form a knowledge flow, and analyze the data and results.
The Simple CLI panel provides command-line access to WEKA. For example, to run the ZeroR classifier on an ARFF file, we enter the following command:

java weka.classifiers.rules.ZeroR -t iris.arff
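
ZeroR is a baseline classifier that ignores all attributes and always predicts the most frequent class. The plain-Python sketch below shows that idea on a handful of made-up labels. It is not Weka's implementation, and the function names and data are assumptions.

```python
from collections import Counter


def zero_r_fit(labels):
    """Return the most frequent class label in the training data."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(predicted_label, labels):
    """Fraction of instances whose true label equals the constant prediction."""
    return sum(1 for label in labels if label == predicted_label) / len(labels)


# Hypothetical class labels of a training set.
training_labels = ["yes", "yes", "no", "yes", "no"]

majority = zero_r_fit(training_labels)
baseline_accuracy = accuracy(majority, training_labels)
print(majority, baseline_accuracy)
```

The accuracy of such a baseline is a useful reference point: a more elaborate classifier that does not beat it has learned little from the attributes.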

Weka Datatypes and Format of Data

WEKA's native ARFF format supports five attribute datatypes: Numeric (Integer and Real), Nominal, String, Date, and Relational. By default, WEKA uses the ARFF format. ARFF (Attribute-Relation File Format) is an ASCII format that describes a list of instances sharing a set of attributes. Every ARFF file has two sections: header and data.

The header section declares the relation name and the attributes with their types.
The data section contains one comma-separated row of values per instance, in the order in which the attributes were declared.

It is important to note that the header declarations (@relation and @attribute) and the data declaration (@data) are case-insensitive.

Let's look at the format with a weather forecast dataset:

@relation weather

@attribute outlook {sunny,overcast,rainy}
@attribute temperature {hot,mild,cool}
@attribute humidity {high,normal}
@attribute windy {TRUE,FALSE}
@attribute play {yes,no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,yes
overcast,hot,high,TRUE,yes
overcast,cool,normal,TRUE,yes
rainy,cool,normal,FALSE,no
rainy,cool,normal,TRUE,no
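
To show how the header and data sections fit together, here is a minimal plain-Python reader for a small subset of ARFF. It is a sketch, not a full parser and not part of Weka: it assumes unquoted values, dense (non-sparse) rows, and one declaration per line, and the function name and sample text are made up for illustration.

```python
def parse_arff(text):
    """Parse a small ARFF document into (relation, attributes, rows).

    Only a subset of the format is handled: no quoted values and no
    sparse rows. Blank lines and comment lines starting with % are skipped.
    """
    relation = None
    attributes = []  # (name, type) pairs in declaration order
    rows = []
    in_data = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("%"):
            continue
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
            continue
        keyword = line.split(None, 1)[0].lower()  # declarations are case-insensitive
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, attribute_type = line.split(None, 2)
            attributes.append((name, attribute_type))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


SAMPLE = """\
% A tiny, hypothetical dataset
@relation weather
@attribute outlook {sunny,overcast,rainy}
@ATTRIBUTE windy {TRUE,FALSE}
@attribute play {yes,no}

@data
sunny,FALSE,no
overcast,TRUE,yes
"""

relation, attributes, data_rows = parse_arff(SAMPLE)
print(relation)
print(attributes)
print(data_rows)
```

Each data row lines up with the attribute list by position, which is why the order of the @attribute declarations matters.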

Besides ARFF, the tool supports different file formats such as CSV, JSON, and XRFF.

Loading of Data in Weka

WEKA allows you to load data from four types of sources:

The local file system
A public URL
A database query
A generator of artificial data for running models

Once data is loaded, the next step is to preprocess it. For this purpose, we can choose any suitable filter. Every filter comes with default settings, which can be changed by clicking on the filter's name.

If one of the attributes, such as sepallength, contains errors or outliers, we can remove or update it from the Attributes section.

Types of Algorithms by Weka

WEKA provides many algorithms for machine learning tasks. The algorithms are divided into several groups according to their core nature, and they are available in WEKA's Explorer. Let's look at those groups:

bayes: algorithms based on Bayes' theorem, such as Naive Bayes
functions: algorithms that estimate a function, such as Linear Regression
lazy: algorithms that use lazy learning, such as KStar and LWL
meta: algorithms that use or combine multiple other algorithms, such as Stacking and Bagging
misc: miscellaneous algorithms that do not fit any of the other categories
rules: algorithms that learn rules, such as OneR and ZeroR
trees: algorithms that build decision trees, such as J48 and RandomForest

Each algorithm has configuration parameters such as batchSize and debug. Some parameters are common to all algorithms, while others are specific to one. They can be edited once the algorithm has been selected.

Weka Extension Packages

In version 3.7.2, a package manager was added to allow easier installation of extension packages. Some functionality that was included in Weka before this version has since been moved into such extension packages. This change also makes it easier for others to contribute extensions to Weka and to maintain the software, since the modular architecture allows independent updates of the Weka core and individual extensions.
