Elasticsearch is composed of modules, each responsible for part of its functionality. We will discuss several of these modules in this chapter. Module settings are of two types: static and dynamic.
- Static settings must be configured before a node starts.
- They are set at the node level and must be set on every relevant node.
- We can configure static settings in the config file (elasticsearch.yml) before starting Elasticsearch.
- We can also set them on the command line or as environment variables when starting a node.
- For a change to take effect, we have to update every affected node in the cluster.
- Dynamic settings can be changed while Elasticsearch is running.
- We can update them on a live cluster with the cluster update settings API.
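As a minimal sketch of a dynamic update, the snippet below only builds the HTTP method, path, and JSON body for the cluster update settings API and sends nothing over the network. The endpoint `/_cluster/settings`, the `persistent` and `transient` scopes, and the example setting `cluster.routing.allocation.enable` come from the Elasticsearch REST API rather than from the text above. The helper name `build_cluster_settings_request` is hypothetical.

```python
import json


def build_cluster_settings_request(settings, persistent=True):
    """Return (method, path, body) for a cluster update settings call.

    Hypothetical helper: it only prepares the request, it does not send it.
    """
    if not settings:
        raise ValueError("at least one setting is required")
    # Persistent settings survive a full cluster restart; transient ones do not.
    scope = "persistent" if persistent else "transient"
    body = json.dumps({scope: settings}, sort_keys=True)
    return "PUT", "/_cluster/settings", body


if __name__ == "__main__":
    method, path, body = build_cluster_settings_request(
        {"cluster.routing.allocation.enable": "primaries"}, persistent=False
    )
    print(method, path)
    print(body)
```

Any HTTP client can send the resulting request to a node. A static setting, by contrast, would have to be written to elasticsearch.yml on each node before that node starts.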
| Module | Description |
|---|---|
| Cluster-level routing and shard allocation | Provides the settings that control when, where, and how shards are allocated to nodes. |
| Discovery | Discovers the cluster and maintains the state of all nodes in it. Nodes discover each other and form the cluster. |
| Gateway | Just as the discovery module maintains the state of the nodes, the gateway module maintains the state of the cluster. It manages the shard data across a full cluster restart. |
| HTTP | Manages the communication between the Elasticsearch API and HTTP clients. |
| Indices | Maintains the settings that are set globally for all indices. |
| Network | Controls the default network settings. By default, Elasticsearch binds to localhost. |
| Node client | Starts a node that joins the cluster but cannot hold data. |
| Plugins | Enhance the basic Elasticsearch functionality in a custom manner. |
| Painless | A scripting language designed for Elasticsearch to be as secure as possible. |
| Scripting | Enables the user to write scripts that evaluate custom expressions. |
| Snapshot or Restore | Creates snapshots of the entire cluster or of individual indices in a remote repository. It is used for data backup. |
| Thread pools | A node holds several thread pools, which help to manage thread memory consumption within the node. |
| Transport | The transport layer is used for internal communication between the nodes of the cluster. The transport networking layer needs to be configured. |
| Tribe nodes | Joins multiple clusters and acts as a federated client across them. |
| Cross-cluster search | Executes a search request on multiple clusters without joining them. Like the tribe node, it acts as a federated client. |
We will now discuss some of them in detail.
Shard Allocation and Cluster-Level Routing
Cluster-level settings decide how shards are allocated to nodes and when shards are reallocated to rebalance the cluster. The following settings control shard allocation.
Cluster Level Shard Allocation
The following settings control cluster-level shard allocation, listed with their possible values and descriptions:
| Setting | Value | Description |
|---|---|---|
| cluster.routing.allocation.enable | all | Default value. Allows shard allocation for all kinds of shards. |
| | none | Does not allow allocation of any shard. |
| | primaries | Allows shard allocation only for primary shards. |
| | new_primaries | Allows shard allocation only for the primary shards of new indices. |
| cluster.routing.allocation.node_concurrent_recoveries | Numeric (default 2) | Limits the number of concurrent shard recoveries on a node. |
| cluster.routing.allocation.node_initial_primaries_recoveries | Numeric (default 4) | Limits the number of initial primary recoveries that run in parallel. |
| cluster.routing.allocation.same_shard.host | Boolean (default false) | When enabled, prevents multiple copies of the same shard from being allocated on the same physical host. |
| indices.recovery.concurrent_streams | Numeric (default 3) | Controls the number of network streams opened per node when a shard is recovered from a peer shard. |
| indices.recovery.concurrent_small_file_streams | Numeric (default 2) | Controls the number of streams opened per node for small files (under 5 MB) during shard recovery. |
| cluster.routing.rebalance.enable | all | Allows balancing for all kinds of shards. |
| | none | Does not allow any kind of shard balancing. |
| | primaries | Allows shard balancing only for primary shards. |
| | replicas | Allows shard balancing only for replica shards. |
| cluster.routing.allocation.cluster_concurrent_rebalance | Numeric (default 2) | Limits the number of concurrent shard rebalancing operations in the cluster. |
| cluster.routing.allocation.balance.shard | Float (default 0.45f) | Defines the weight factor for the total number of shards allocated on each node. |
| cluster.routing.allocation.balance.index | Float (default 0.55f) | Defines the weight factor for the number of shards per index allocated on a specific node. |
| cluster.routing.allocation.balance.threshold | Non-negative float (default 1.0f) | The minimum optimization value of operations that should be performed. |
| cluster.routing.allocation.allow_rebalance | always | Default value. Always allows rebalancing. |
| | indices_primaries_active | Allows rebalancing only when all primary shards in the cluster are allocated. |
| | indices_all_active | Allows rebalancing only when all primary and replica shards are allocated. |
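The effect of the four allocation values can be sketched as a small predicate. This is a simplified illustration of what the descriptions say, not the actual allocation decider inside Elasticsearch. The function name `allocation_allowed` and its parameters are hypothetical, and the value name `all` is taken from the Elasticsearch documentation.

```python
def allocation_allowed(enable, is_primary, is_new_index):
    """Return True if a shard may be allocated under the given enable value.

    Simplified model of the values all, none, primaries and new_primaries.
    """
    if enable == "all":
        return True                          # every kind of shard
    if enable == "none":
        return False                         # no shard at all
    if enable == "primaries":
        return is_primary                    # primary shards only
    if enable == "new_primaries":
        return is_primary and is_new_index   # primaries of new indices only
    raise ValueError(f"unknown allocation value: {enable!r}")


if __name__ == "__main__":
    for value in ("all", "none", "primaries", "new_primaries"):
        print(value, allocation_allowed(value, is_primary=True, is_new_index=False))
```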
Disk-based Shard Allocation
Next, we look at disk-based shard allocation. The following settings control it, listed with their possible values and descriptions:
| Setting | Value | Description |
|---|---|---|
| cluster.routing.allocation.disk.threshold_enabled | Boolean (default true) | Enables or disables the disk allocation decider. |
| cluster.routing.allocation.disk.watermark.low | String (default 85%) | The low watermark for disk usage. Once a node's disk usage exceeds this point, no further shards are allocated to that node. |
| cluster.routing.allocation.disk.watermark.high | String (default 90%) | The high watermark for disk usage. When a node's disk usage exceeds this point, Elasticsearch tries to relocate shards from that node to another one. |
| cluster.info.update.interval | String (default 30s) | The interval between disk usage checks. |
| cluster.routing.allocation.disk.include_relocations | Boolean (default true) | Decides whether shards that are currently being relocated are counted when disk usage is calculated. |
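The two disk thresholds can be illustrated with a short sketch. It assumes percentage values such as the defaults 85% and 90% and reduces the decision to three outcomes. The real decider handles more cases, so treat the function `disk_decision` and its return labels as hypothetical.

```python
def parse_percent(value):
    """Convert a string such as '85%' to the float 85.0."""
    if not value.endswith("%"):
        raise ValueError(f"expected a percentage, got {value!r}")
    return float(value[:-1])


def disk_decision(used_percent, low="85%", high="90%", enabled=True):
    """Classify a node by its disk usage (simplified model).

    'allocate'      -> new shards may be placed on the node
    'no_new_shards' -> above the lower threshold, no further shards
    'relocate'      -> above the upper threshold, shards should move away
    """
    if not enabled:
        return "allocate"  # the disk check is switched off
    if used_percent > parse_percent(high):
        return "relocate"
    if used_percent > parse_percent(low):
        return "no_new_shards"
    return "allocate"


if __name__ == "__main__":
    for usage in (60.0, 87.5, 95.0):
        print(usage, disk_decision(usage))
```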
Discovery
The discovery module discovers the cluster and maintains the state of all nodes in it. Whenever a node is added to or removed from the cluster, the cluster state changes. The cluster name setting (cluster.name) creates a logical separation between clusters.
The following discovery modules are available; most of them use APIs provided by a cloud vendor:
- Google Compute Engine discovery
- Azure discovery
- Zen discovery
- EC2 discovery
Gateway
This module maintains the cluster state and the shard data across a full cluster restart. The following are some static settings of the gateway module, with their possible values and descriptions:
| Setting | Value | Description |
|---|---|---|
| gateway.expected_nodes | Numeric (default 0) | The number of nodes expected in the cluster before the recovery of local shards begins. |
| gateway.expected_master_nodes | Numeric (default 0) | The number of master nodes expected in the cluster before recovery begins. |
| gateway.expected_data_nodes | Numeric (default 0) | The number of data nodes expected in the cluster before recovery begins. |
| | | Indicates the interval between disk usage checks. |
| gateway.recover_after_time | Time value | The time that the recovery process waits before it starts, regardless of the number of nodes that have joined the cluster. |
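How the expected-node counts and the waiting time interact can be sketched as follows. This is a deliberately simplified model of the descriptions above: recovery starts as soon as every expected count is reached, or once the waiting time has elapsed. The class `GatewaySettings`, the function `recovery_may_start`, and the 300-second waiting time are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class GatewaySettings:
    expected_nodes: int = 0
    expected_master_nodes: int = 0
    expected_data_nodes: int = 0
    recover_after_seconds: float = 300.0  # assumed value, for illustration only


def recovery_may_start(settings, nodes, master_nodes, data_nodes, waited_seconds):
    """Return True if recovery may begin under this simplified model."""
    expectations_met = (
        nodes >= settings.expected_nodes
        and master_nodes >= settings.expected_master_nodes
        and data_nodes >= settings.expected_data_nodes
    )
    # After the waiting time, recovery starts regardless of the node counts.
    return expectations_met or waited_seconds >= settings.recover_after_seconds


if __name__ == "__main__":
    settings = GatewaySettings(expected_nodes=3, recover_after_seconds=60)
    print(recovery_may_start(settings, 2, 1, 2, waited_seconds=10))
    print(recovery_may_start(settings, 3, 1, 2, waited_seconds=0))
```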
HTTP
- The HTTP module manages the communication between the Elasticsearch APIs and HTTP clients.
- This module can be disabled and enabled again as required. We can disable it by setting its enabled value (http.enabled) to false.
- Several settings control this module. They are configured in the elasticsearch.yml file.
The following are the HTTP settings with their descriptions:
| Setting | Description |
|---|---|
| http.port | The HTTP port used to access Elasticsearch. The default port is 9200, taken from the default range 9200-9300. |
| http.bind_host | The host address that the HTTP service binds to. |
| http.publish_port | The port that HTTP clients use to reach the node. It is useful when the node is behind a firewall. |
| http.publish_host | Similar to http.bind_host, but this is the host address that HTTP clients use to reach the node. |
| http.max_content_length | The maximum size of the content in an HTTP request. The default is 100mb. |
| http.max_initial_line_length | The maximum size of the URL. The default is 4kb. |
| http.max_header_size | The maximum size of the HTTP headers. The default is 8kb. |
| http.compression | Enables or disables support for compression. The default is false. |
| http.pipelining | Enables or disables HTTP pipelining. |
| http.pipelining.max_events | Limits the number of events that are queued before the HTTP connection is closed. |
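To make the size limits concrete, the sketch below parses size strings such as 100mb and 8kb and checks a request against the content and header limits. The setting names `http.max_content_length` and `http.max_header_size` are taken from the Elasticsearch documentation. The helpers `parse_size` and `check_request` are hypothetical and only model the idea of a limit check.

```python
UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}

# Defaults as stated in the text; the keys are the documented setting names.
DEFAULT_LIMITS = {
    "http.max_content_length": "100mb",
    "http.max_header_size": "8kb",
}


def parse_size(text):
    """Convert a size string such as '100mb' to a number of bytes."""
    text = text.strip().lower()
    # Check the two-letter suffixes first so that 'kb' is not read as 'b'.
    for suffix in ("kb", "mb", "gb", "b"):
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * UNITS[suffix]
    raise ValueError(f"unrecognised size: {text!r}")


def check_request(content_bytes, header_bytes, limits=DEFAULT_LIMITS):
    """Return a list of the limits that a request would exceed."""
    problems = []
    if content_bytes > parse_size(limits["http.max_content_length"]):
        problems.append("content too large")
    if header_bytes > parse_size(limits["http.max_header_size"]):
        problems.append("header too large")
    return problems


if __name__ == "__main__":
    print(check_request(content_bytes=5 * 1024 ** 2, header_bytes=512))
    print(check_request(content_bytes=200 * 1024 ** 2, header_bytes=512))
```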
Indices
This module maintains settings that are set globally for every index. The settings discussed here mainly relate to memory usage:
Circuit Breaker
- Elasticsearch has several circuit breakers.
- Circuit breakers prevent operations from causing an OutOfMemoryError.
- The indices.breaker.total.limit setting limits the total amount of JVM heap that the breakers allow operations to use.
- By default, this limit is 70% of the JVM heap.
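The idea behind a circuit breaker can be shown with a toy model: before an operation allocates memory, it reserves an estimate, and the breaker refuses the reservation if the total would exceed the limit. The class names and the byte-counting scheme are assumptions for illustration; only the 70% default comes from the text.

```python
class CircuitBreakingError(Exception):
    """Raised when a reservation would exceed the breaker limit."""


class MemoryBreaker:
    def __init__(self, heap_bytes, limit_percent=70):
        # Integer arithmetic avoids floating-point rounding of the limit.
        self.limit_bytes = heap_bytes * limit_percent // 100
        self.used_bytes = 0

    def reserve(self, estimate_bytes):
        """Reserve memory for an operation, or fail before it runs."""
        if self.used_bytes + estimate_bytes > self.limit_bytes:
            raise CircuitBreakingError(
                f"would use {self.used_bytes + estimate_bytes} bytes, "
                f"limit is {self.limit_bytes}"
            )
        self.used_bytes += estimate_bytes

    def release(self, estimate_bytes):
        """Give memory back once the operation has finished."""
        self.used_bytes = max(0, self.used_bytes - estimate_bytes)


if __name__ == "__main__":
    breaker = MemoryBreaker(heap_bytes=1000)
    breaker.reserve(600)
    try:
        breaker.reserve(200)
    except CircuitBreakingError as error:
        print("rejected:", error)
```

Rejecting the request up front is the point of the design: a failed request is recoverable, while an OutOfMemoryError can take the whole node down.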
Fielddata Cache
- Fielddata is used mainly when aggregating on a field.
- Enough memory must be available to load it.
- The indices.fielddata.cache.size setting controls the amount of memory used for the fielddata cache.
Node Query Cache
- The node query cache caches the results of queries.
- It uses an LRU (Least Recently Used) eviction policy.
- All shards on a node share one query cache.
- The indices.queries.cache.size setting controls the memory size of this cache.
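LRU eviction can be illustrated with a few lines of Python. This sketch limits the cache by number of entries rather than by memory size, which is a simplification, and the class `LRUCache` is a generic illustration, not the cache implementation used by Elasticsearch.

```python
from collections import OrderedDict


class LRUCache:
    """A small least-recently-used cache limited by entry count."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        # A hit makes the entry the most recently used one.
        self._entries.move_to_end(key)
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            # Evict the least recently used entry (the oldest position).
            self._entries.popitem(last=False)


if __name__ == "__main__":
    cache = LRUCache(capacity=2)
    cache.put("query-a", 1)
    cache.put("query-b", 2)
    cache.get("query-a")      # query-a is now the most recently used
    cache.put("query-c", 3)   # evicts query-b
    print(cache.get("query-b"), cache.get("query-a"), cache.get("query-c"))
```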
Indexing Buffer
- The indexing buffer stores newly indexed documents.
- When the buffer fills up, its documents are flushed to a segment on disk.
- The indices.memory.index_buffer_size setting controls the amount of heap allocated to this buffer.
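The buffer-then-flush behaviour can be modelled in a few lines. The sketch measures each document by the size of its JSON encoding and keeps flushed batches in a list. Both choices, and the class `IndexingBuffer` itself, are assumptions made for illustration rather than a description of the internals.

```python
import json


class IndexingBuffer:
    """Collects documents and flushes them once a byte budget is reached."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self._docs = []
        self._size = 0
        self.flushed_batches = []  # stands in for the flush target

    def add(self, doc):
        self._docs.append(doc)
        self._size += len(json.dumps(doc).encode("utf-8"))
        if self._size >= self.max_bytes:
            self.flush()

    def flush(self):
        """Move all buffered documents out and reset the buffer."""
        if self._docs:
            self.flushed_batches.append(self._docs)
        self._docs = []
        self._size = 0

    @property
    def pending(self):
        return len(self._docs)


if __name__ == "__main__":
    buffer = IndexingBuffer(max_bytes=40)
    for doc_id in range(1, 8):
        buffer.add({"id": doc_id})
    print(len(buffer.flushed_batches), buffer.pending)
```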
Shard Request Cache
- The shard request cache holds the local search results for each shard.
- By default, it can cache the results of search requests.
- Elasticsearch allows us to enable and disable the cache.
- We can enable the cache when creating an index. The cache can also be disabled for a request by sending a URL parameter.
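Both ways of controlling the cache can be sketched as request builders. The index setting `index.requests.cache.enable` and the URL parameter `request_cache` are taken from the Elasticsearch documentation, while the index name `my_index` and the helper functions are hypothetical. Nothing is sent over the network.

```python
import json
from urllib.parse import urlencode


def create_index_request(index, cache_enabled):
    """Build a create-index request that sets the request cache flag."""
    body = {"settings": {"index.requests.cache.enable": cache_enabled}}
    return "PUT", f"/{index}", json.dumps(body)


def search_path(index, use_cache):
    """Build a search path that enables or disables the cache per request."""
    query = urlencode({"request_cache": "true" if use_cache else "false"})
    return f"/{index}/_search?{query}"


if __name__ == "__main__":
    print(create_index_request("my_index", cache_enabled=True))
    print(search_path("my_index", use_cache=False))
```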
Indices Recovery
- It limits the resources used during the recovery process. Several settings, each with a default value, control these resources.
TTL Interval
TTL stands for time to live. The TTL interval defines how long a document lives before it is deleted. Dynamic settings control this process.
Node
Each node can be a data node or not. A node is a data node if the node.data setting has the value true. We can change this property by changing the node.data setting.