Elasticsearch Analysis

Analysis is the process of converting text, such as the body of an email, into tokens or terms. These tokens are added to the inverted index so that they can be searched. Text is analyzed when a document is indexed, and the text of a search query is analyzed in the same way so that it can be matched against the indexed terms. The analysis module consists of analyzers, tokenizers, character filters, and token filters.

For example -

Sentence: "A quick white cat jumped over the brown dog."

Tokens: [quick, white, cat, jump, brown, dog]

Analysis is performed by an analyzer, which can be either a built-in analyzer or a custom analyzer. Custom analyzers are defined per index, in the index settings. If no analyzer is defined, the built-in analyzers, tokenizers, token filters, and character filters are registered with the analysis module by default. Analysis is performed with the help of -

Analyzer
Tokenizer
Token Filter
Character Filter
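
To make the roles of these parts concrete, here is a minimal Python sketch that chains a character filter, a tokenizer, and token filters. It is only an illustration, not how Elasticsearch implements analysis: the function names, the stop word list, and the naive suffix stripping are all assumptions made for this example.

```python
import re

# Hypothetical stop word list for this sketch; not Elasticsearch's built-in list.
STOP_WORDS = {"a", "the", "over"}

def char_filter(text):
    """Character filter: rewrite the raw text before it is tokenized."""
    return text.replace("&", "and")

def tokenizer(text):
    """Tokenizer: split the text into word tokens."""
    return re.findall(r"\w+", text)

def token_filters(tokens):
    """Token filters: lowercase, drop stop words, strip a trailing 'ed' (naive stemming)."""
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok in STOP_WORDS:
            continue
        if tok.endswith("ed") and len(tok) > 4:
            tok = tok[:-2]
        out.append(tok)
    return out

def analyze(text):
    """Analyzer: character filter, then tokenizer, then token filters."""
    return token_filters(tokenizer(char_filter(text)))

print(analyze("A quick white cat jumped over the brown dog."))
```

The order matters: character filters see the raw string, the tokenizer produces the token stream, and token filters only ever see tokens.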

We will discuss each of them in detail. Before that, let's start with a simple example.

Example -

Let's take a simple example in which we use the standard analyzer to analyze text and convert it into tokens. It is the default analyzer, used when nothing else is specified. It breaks the sentence into tokens (words) based on grammar.

POST http://localhost:9200/_analyze
{
   "analyzer": "standard",
   "text": "It is best elasticsearch tutorial."
}

Response

{
   "tokens": [
      {
         "token": "it",
         "start_offset": 0,
         "end_offset": 2,
         "type": "<ALPHANUM>",
         "position": 0
      },
      {
         "token": "is",
         "start_offset": 3,
         "end_offset": 5,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "best",
         "start_offset": 6,
         "end_offset": 10,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "elasticsearch",
         "start_offset": 11,
         "end_offset": 24,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "tutorial",
         "start_offset": 25,
         "end_offset": 33,
         "type": "<ALPHANUM>",
         "position": 4
      }
   ]
}
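
The fields in such a response can be read as follows: start_offset and end_offset are character indexes into the original text, and position is the ordinal of the token in the stream. The Python sketch below reproduces these fields with a simple regular expression. It is an approximation for illustration only; the real standard analyzer follows Unicode text segmentation rules, and the function name is hypothetical.

```python
import re

def simple_analyze(text):
    """Approximate the response fields: lowercase word tokens with offsets and positions."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),  # index of the first character
            "end_offset": match.end(),      # index just past the last character
            "position": position,           # ordinal of the token in the stream
        })
    return tokens

for tok in simple_analyze("It is best elasticsearch tutorial."):
    print(tok)
```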

Configure the Standard analyzer

The standard analyzer can be configured to suit our requirements, and other analyzers can be configured in the same way. The following example shows how.

First, create an index whose analysis settings configure the standard analyzer.
Set max_token_length to the desired value; here we set it to 7.
We also name the configured analyzer my_english_analyzer; this is the name used when analyzing text. The stopwords setting is set to _english_ so that English stop words are removed.
Look at the example given below -

PUT http://localhost:9200/analysis_example
{
   "settings": {
      "analysis": {
         "analyzer": {
            "my_english_analyzer": {
               "type": "standard",
               "max_token_length": 7,
               "stopwords": "_english_"
            }
         }
      }
   }
}

Response

{
	"acknowledged": true,
	"shards_acknowledged": false,
	"index": "analysis_example"
}

Analysing text after configuring standard analyzer

After creating the index with the configured analyzer, we can apply the analyzer to some text. In the following example, provide the index name (analysis_example) in the request URL, followed by the _analyze endpoint, and provide the analyzer name and the text string in the request body.

POST http://localhost:9200/analysis_example/_analyze
{
   "analyzer": "my_english_analyzer",
   "text": "It is best elasticsearch tutorial."
}
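
If you prefer to issue this request from code, the following Python sketch builds it with the standard library. The helper name build_analyze_request and the base_url default are assumptions for this example; the sketch only constructs the request and does not send it, so it runs without a live node.

```python
import json
import urllib.request

def build_analyze_request(text, analyzer, index=None, base_url="http://localhost:9200"):
    """Build (but do not send) a POST request for the _analyze API."""
    # With an index name the analyzer is resolved from that index's settings.
    path = f"/{index}/_analyze" if index else "/_analyze"
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_analyze_request(
    "It is best elasticsearch tutorial.",
    "my_english_analyzer",
    index="analysis_example",
)
print(req.get_method(), req.full_url)
# To send it against a running node: urllib.request.urlopen(req).read()
```

Leaving index as None targets the cluster-level endpoint, which is enough for built-in analyzers.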

Note that elasticsearch and tutorial have 13 and 8 characters respectively, which exceeds the max_token_length of 7 set in the previous request, so each of them is split at 7-character intervals. The words it and is are missing from the response because the analyzer was configured with the English stop words.

Response

{
   "tokens": [
      {
         "token": "best",
         "start_offset": 6,
         "end_offset": 10,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "elastic",
         "start_offset": 11,
         "end_offset": 18,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "search",
         "start_offset": 18,
         "end_offset": 24,
         "type": "<ALPHANUM>",
         "position": 4
      },
      {
         "token": "tutoria",
         "start_offset": 25,
         "end_offset": 32,
         "type": "<ALPHANUM>",
         "position": 5
      },
      {
         "token": "l",
         "start_offset": 32,
         "end_offset": 33,
         "type": "<ALPHANUM>",
         "position": 6
      }
   ]
}
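
The splitting rule can be sketched in a few lines of Python. This is an approximation under stated assumptions, not Elasticsearch's implementation: the function name is hypothetical, words are found with a simple regular expression, and STOP_WORDS holds only the two English stop words that occur in this text.

```python
import re

# Assumed subset of the English stop words, just enough for this text.
STOP_WORDS = {"it", "is"}

def analyze_with_limit(text, max_token_length=7):
    """Split over-long words at max_token_length intervals and drop stop words."""
    tokens = []
    position = 0
    for match in re.finditer(r"\w+", text):
        word = match.group().lower()
        for i in range(0, len(word), max_token_length):
            piece = word[i:i + max_token_length]
            start = match.start() + i
            if piece not in STOP_WORDS:
                tokens.append({
                    "token": piece,
                    "start_offset": start,
                    "end_offset": start + len(piece),
                    "position": position,
                })
            # A removed stop word still consumes a position.
            position += 1
    return tokens

for tok in analyze_with_limit("It is best elasticsearch tutorial."):
    print(tok)
```

Because removed stop words still consume a position, the first token that survives in this text has position 2 rather than 0.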

Different types of Analyzers

There are several analyzers, each with different functionality, and they serve different objectives. Below is a list of built-in analyzers with their descriptions -

Sr.No	Analyzer	Description
1	Standard Analyzer (standard)	The default analyzer when none is specified. Its max_token_length and stopwords settings can be modified; by default, max_token_length is 255 and the stopwords list is empty.
2	Simple Analyzer (simple)	It is built on the lowercase tokenizer.
3	Whitespace Analyzer (whitespace)	It is built on the whitespace tokenizer.
4	Stop Analyzer (stop)	In this analyzer, stopwords and stopwords_path can be configured. By default, stopwords is set to the English stop words; stopwords_path holds the path to a text file containing stop words.
5	Keyword Analyzer (keyword)	The keyword analyzer emits the whole stream as a single token. For example, it can be used for zip codes.
6	Pattern Analyzer (pattern)	As the name specifies, this analyzer works with regular expressions. Its settings include pattern, lowercase, flags, and stopwords.
7	Language Analyzer	The language analyzers handle text in specific languages such as Hindi, English, Dutch, Arabic, etc.
8	Snowball Analyzer (snowball)	The snowball analyzer uses the standard tokenizer along with the standard filter, snowball filter, lowercase filter, and stop filter.
9	Custom Analyzer (custom)	As the name specifies, this analyzer is used to build a customized analyzer from a tokenizer and optional token filters and char filters. Its settings include filter, char_filter, tokenizer, and position_increment_gap.
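
As an illustration of the custom analyzer row, the Python sketch below builds the index settings for one. The analyzer name my_custom_analyzer is made up for this example; html_strip, standard, lowercase, and stop are built-in components. The printed JSON is the kind of body you would send when creating an index.

```python
import json

# Index settings for a custom analyzer; "my_custom_analyzer" is a hypothetical name.
settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {
                    "type": "custom",
                    "char_filter": ["html_strip"],    # runs first, on the raw text
                    "tokenizer": "standard",          # exactly one tokenizer
                    "filter": ["lowercase", "stop"],  # token filters, applied in order
                }
            }
        }
    }
}

print(json.dumps(settings, indent=3))
```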

Example of Keyword analyzer

As discussed, the keyword analyzer treats the whole stream as a single token. Look at the example below. It does not require an index: we only need to specify the analyzer name in the analyzer field and the text string in the text field, and send the request to the _analyze endpoint.

POST http://localhost:9200/_analyze
{
   "analyzer": "keyword",
   "text": "It is best elasticsearch tutorial."
}

Response

{
   "tokens": [
      {
         "token": "It is best elasticsearch tutorial.",
         "start_offset": 0,
         "end_offset": 34,
         "type": "word",
         "position": 0
      }
   ]
}

Tokenizers

In Elasticsearch, a tokenizer generates tokens from text, typically by splitting it at whitespace or punctuation. Elasticsearch provides built-in tokenizers, which can also be used in custom analyzers. Below is a table of tokenizers with their descriptions -

Sr.No	Tokenizer	Description
1	Standard tokenizer (standard)	The standard tokenizer is a grammar-based tokenizer. Its max_token_length setting can be configured.
2	Edge NGram tokenizer (edge_ngram)	This tokenizer supports settings such as min_gram, max_gram, and token_chars.
3	Keyword tokenizer (keyword)	The keyword tokenizer emits the whole input as a single token. We can set the buffer_size for this tokenizer.
4	Letter tokenizer (letter)	The letter tokenizer captures a word until it encounters a character that is not a letter. Non-letter characters are discarded.
5	Lowercase tokenizer (lowercase)	The lowercase tokenizer works like the letter tokenizer. The only difference is that it converts the tokens to lowercase after creating them.
6	NGram tokenizer (ngram)	This tokenizer supports the min_gram, max_gram, and token_chars settings. The default value of min_gram is 1, and of max_gram is 2.
7	Whitespace tokenizer (whitespace)	It divides the text at whitespace.
8	Pattern tokenizer (pattern)	Similar to the pattern analyzer, it uses a regular expression as the token separator. We can set the pattern, group, and flags settings for this tokenizer.
9	Classic tokenizer (classic)	The classic tokenizer is a grammar-based tokenizer that breaks the text into tokens. We can set max_token_length for this tokenizer.
10	UAX Email URL Tokenizer (uax_url_email)	This tokenizer works like the standard tokenizer, except that it treats each email address and URL as a single token.
11	Path hierarchy tokenizer (path_hierarchy)	This tokenizer generates every ancestor path of the input path. Its settings are replacement (defaults to the delimiter), buffer_size (default is 1024), reverse (default is false), delimiter (default is /), and skip (default is 0).
12	Thai tokenizer (thai)	The Thai tokenizer is designed for the Thai language and uses a built-in Thai segmentation algorithm.
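
To give a feel for what some of these tokenizers emit, here is a rough Python sketch of the edge n-gram, n-gram, and path hierarchy ideas. The function names are hypothetical and the code only mimics the behaviour for simple inputs: it ignores settings such as token_chars, reverse, and skip, and path_hierarchy assumes an input that begins with the delimiter.

```python
def edge_ngrams(word, min_gram=1, max_gram=2):
    """Prefixes of the word whose length is between min_gram and max_gram."""
    return [word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1)]

def ngrams(word, min_gram=1, max_gram=2):
    """All substrings whose length is between min_gram and max_gram."""
    return [
        word[i:i + n]
        for i in range(len(word))
        for n in range(min_gram, max_gram + 1)
        if i + n <= len(word)
    ]

def path_hierarchy(path, delimiter="/"):
    """Every ancestor path of the input, from shortest to longest."""
    parts = path.strip(delimiter).split(delimiter)
    return [delimiter + delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]

print(edge_ngrams("quick"))
print(ngrams("fox"))
print(path_hierarchy("/one/two/three"))
```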

Now, let's look at an example of how a tokenizer works in Elasticsearch. In the following example, the tokenizer breaks the text into tokens whenever it finds a non-letter character, and it also converts all tokens to lowercase. See the example below -

Lowercase Tokenizer Example

POST http://localhost:9200/_analyze
{
   "tokenizer": "lowercase",
   "text": "It Was My BIRTHDAY 2 Days Ago"
}

Response

{
   "tokens": [
      {
         "token": "it",
         "start_offset": 0,
         "end_offset": 2,
         "type": "word",
         "position": 0
      },
      {
         "token": "was",
         "start_offset": 3,
         "end_offset": 6,
         "type": "word",
         "position": 1
      },
      {
         "token": "my",
         "start_offset": 7,
         "end_offset": 9,
         "type": "word",
         "position": 2
      },
      {
         "token": "birthday",
         "start_offset": 10,
         "end_offset": 18,
         "type": "word",
         "position": 3
      },
      {
         "token": "days",
         "start_offset": 21,
         "end_offset": 25,
         "type": "word",
         "position": 4
      },
      {
         "token": "ago",
         "start_offset": 26,
         "end_offset": 29,
         "type": "word",
         "position": 5
      }
   ]
}

Note that in the above response, all uppercase letters are converted to lowercase, and Elasticsearch does not emit the numeric (non-letter) value, i.e., 2. The offsets still account for it: birthday ends at offset 18 and days starts at offset 21, which leaves room for the space before the 2, the character itself, and the space after it.
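
The same behaviour can be approximated in Python with a letters-only regular expression. This is a sketch for illustration, with a hypothetical function name; it is not the tokenizer's actual implementation.

```python
import re

def lowercase_tokenizer(text):
    """Split on any non-letter character and lowercase each token."""
    tokens = []
    # [^\W\d_]+ matches runs of letters only (word characters minus digits and underscore).
    for position, match in enumerate(re.finditer(r"[^\W\d_]+", text)):
        tokens.append((match.group().lower(), match.start(), match.end(), position))
    return tokens

for token in lowercase_tokenizer("It Was My BIRTHDAY 2 Days Ago"):
    print(token)
```

Each tuple holds the token, its start offset, its end offset, and its position. The digit never becomes a token, so positions stay consecutive while the offsets jump over it.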

Classic Tokenizer Example

In this example, we send the request to the _analyze endpoint and specify the tokenizer name in the "tokenizer" field. Finally, we provide the text string to be broken into tokens.

POST http://localhost:9200/_analyze
{
   "tokenizer": "classic",
   "text": "It Was My BIRTHDAY 2 Days Ago"
}

Response

In the response below, the text stream is divided into tokens of type <ALPHANUM>, which means that both alphabetic and numeric values are kept. Unlike the lowercase tokenizer, the classic tokenizer does not convert tokens to lowercase, and the numeric value also appears in the response.

{
   "tokens": [
      {
         "token": "It",
         "start_offset": 0,
         "end_offset": 2,
         "type": "<ALPHANUM>",
         "position": 0
      },
      {
         "token": "Was",
         "start_offset": 3,
         "end_offset": 6,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "My",
         "start_offset": 7,
         "end_offset": 9,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "BIRTHDAY",
         "start_offset": 10,
         "end_offset": 18,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "2",
         "start_offset": 19,
         "end_offset": 20,
         "type": "<ALPHANUM>",
         "position": 4
      },
      {
         "token": "Days",
         "start_offset": 21,
         "end_offset": 25,
         "type": "<ALPHANUM>",
         "position": 5
      },
      {
         "token": "Ago",
         "start_offset": 26,
         "end_offset": 29,
         "type": "<ALPHANUM>",
         "position": 6
      }
   ]
}

Similarly, we can get the result for any other tokenizer by passing its name in the tokenizer field and the text in the text field.

Token Filters

Elasticsearch offers built-in token filters, which receive the token stream from a tokenizer.
These filters can add, modify, or delete tokens in that stream.
Several of them, such as the lowercase, stop, and snowball filters, have already appeared above as parts of the built-in analyzers.
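
The three kinds of change (add, modify, delete) can be sketched in Python as small functions applied in order. The function names, the stop word set, and the synonym table are assumptions made for this illustration; they only mimic the idea behind the built-in lowercase, stop, and synonym filters.

```python
def lowercase_filter(tokens):
    """Modify: lowercase every token."""
    return [t.lower() for t in tokens]

def stop_filter(tokens, stop_words=frozenset({"the", "is", "a"})):
    """Delete: drop tokens found in the stop word set."""
    return [t for t in tokens if t not in stop_words]

def synonym_filter(tokens, synonyms=None):
    """Add: emit an extra token after each word that has a synonym."""
    synonyms = synonyms or {"quick": "fast"}
    out = []
    for t in tokens:
        out.append(t)
        if t in synonyms:
            out.append(synonyms[t])
    return out

def apply_filters(tokens, filters):
    """Run the token stream through each filter in order."""
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens

tokens = ["The", "Quick", "fox"]
print(apply_filters(tokens, [lowercase_filter, stop_filter, synonym_filter]))
```

Order matters here too: running stop_filter before lowercase_filter would leave "The" in the stream, because the stop word set is lowercase.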

Character Filters

Character filters process the text before it reaches the tokenizer.
A character filter looks for a specified pattern, special character, or HTML tag and then either deletes it or replaces it with an appropriate string, for example replacing & with and, or deleting HTML markup tags.
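
Both kinds of character filter can be sketched in Python as plain string rewrites. The function names are hypothetical, and the tag-stripping regular expression is a simplification rather than a full HTML parser.

```python
import re

def html_strip(text):
    """Delete HTML markup tags (a simple regex, not a full HTML parser)."""
    return re.sub(r"<[^>]+>", "", text)

def mapping_filter(text, mappings=None):
    """Replace each mapped character or string with its substitute."""
    mappings = mappings or {"&": "and"}
    for old, new in mappings.items():
        text = text.replace(old, new)
    return text

raw = "<b>Tom</b> & Jerry"
print(mapping_filter(html_strip(raw)))
```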
