Elasticsearch best practices

Author: Vaidyanathan, Praveen
Supported Versions: 9.12 to 10.4

Introduction

API Gateway uses Elasticsearch as its primary data store for persisting different types of data such as APIs, Policies, and Applications, in addition to runtime events and metrics. This document provides basic guidelines for configuring and managing Elasticsearch. At different points in the document, we provide configuration and tuning details specific to API Gateway. Although the information in this document will enable you to get started with the basic configurations, it is important that you refer to the official Elasticsearch documentation for completeness.

Elasticsearch Basics

Document

A document is the basic unit of information that can be stored and retrieved. Documents are managed in JSON format. For example, the basic information of an API instance can be viewed as a document.

Type

A type is a collection of documents that share common attributes. For example, the "API" type defines the common attributes of APIs.

Index

An index is a collection of documents that are grouped as types. Up to Elasticsearch 5.6.4, an index can hold multiple types. From 6.x, each index can have only a single type.

Shard

An index can be divided into one or more shards. A shard is a data partition. Partitioning is needed because the index (that is, its number of documents) might not fit on a single node. The index is therefore partitioned and stored across the nodes in a cluster, which helps it scale horizontally. Partitioning also speeds up searches, because a search can run in parallel across the shards of an index. Each document belongs to one primary shard and to the replicas of that shard.

Replica

A replica is a copy of a shard. Replicas help to recover data after a shard loss or machine crash, which improves data availability. They also serve concurrent reads, which helps to scale out searches.

Nodes

A node is an instance of the Elasticsearch process. A collection of nodes is called a cluster. There are several node types, but the two most important are:

Master Node

There is only one active master node in a cluster at any time. The master node is responsible for creating or deleting an index, tracking which nodes are part of the cluster, and deciding which shards to allocate to which node. One master is elected from all master-eligible nodes (node.master=true), and all nodes are master-eligible by default.

Data Node

Data nodes hold data and perform data-related operations such as CRUD, search, and aggregations. By default, all nodes are data nodes (node.data=true).

Read / Write Semantics

For a write (for example, an update or delete), the request is routed to the primary shard. The primary validates the data, writes the changes, and replicates them to the replica shards. Once replication has completed successfully, the acknowledgement is sent back to the caller.

For a read, the request is sent to all nodes that can serve it in the "scatter" phase. The results are consolidated in the "gather" phase and sent back to the caller. If there are multiple replicas, requests are distributed among the shard copies in round-robin fashion.

Refer to the Elasticsearch read/write model for more details.

API Gateway Data Types

API Gateway data can be broadly classified into 4 types.

1. Core Configurations

This includes APIs, Applications, Policies, Plans, Packages, Administration Settings, Security Configurations (Keystores/Truststores) and Tokens (OAuth/API Keys). By default, this data is stored in the Elasticsearch instance embedded in API Gateway (the InternalDataStore).

2. Runtime Transactions

This includes the runtime transaction events and metrics data. By default, API Gateway stores this data in Elasticsearch (InternalDataStore), but this can be changed so that API Gateway emits the data to other destinations (such as an external Elasticsearch, RDBMS, or DES).

3. Application Logs (from 10.3)

API Providers can configure the application logs to be stored in Elasticsearch (InternalDataStore) but this can be modified. Refer to the product documentation for more details.

4. Audit Logs (from 10.2)

API Gateway, by default, stores the audit logs in Elasticsearch (InternalDataStore) but this can be modified. Refer to the product documentation for more details.

Separation of Data

It is possible to separate the Core Configuration store from the stores for other data (such as Runtime Transactions, Application Logs, and Audit Logs). This is recommended when there is a large volume of runtime transaction data and you want to scale the Elasticsearch used for runtime transactions independently of the default store. The separation is possible because API Gateway lets API providers configure external destinations to which it emits Runtime Transactions, Application Logs, and Audit Logs. Refer to the product documentation to configure the external destination store.

Elasticsearch Configurations

Using External Elasticsearch

The API Gateway installation bundles a default Elasticsearch (InternalDataStore), and by default all data is written to it. However, you can configure API Gateway to use an external Elasticsearch instead.

Note: API Gateway 10.1 and lower versions can only be configured with Elasticsearch 2.3.2. Starting with API Gateway 10.2, you can configure a wider range of Elasticsearch versions.

The following table shows the Elasticsearch support matrix for API Gateway.

API Gateway | Elasticsearch          | Comments
9.12        | 2.3.2                  |
10.0        | 2.3.2                  |
10.1        | 2.3.2                  |
10.2        | >= 2.3.2 and <= 5.6.4  | InternalDataStore is 5.6.4 but API Gateway can work with versions below 5.6.4
10.3        | >= 2.3.2 and <= 5.6.4  | InternalDataStore is 5.6.4 but API Gateway can work with versions below 5.6.4

For 10.1, please refer to this article.

For 10.2 & 10.3, please refer to product documentation, API Gateway Configuration Guide or Configuring External Elasticsearch for API Gateway 10.2.

Elasticsearch HTTP Client Configurations

API Gateway provides options to control how it communicates with Elasticsearch. Refer to the product documentation, API Gateway Configuration Guide, for more details.

Elasticsearch Server Configurations

API Gateway uses the Elasticsearch default configurations for its InternalDataStore. These defaults are generally a good starting point. The following are the default configurations, which you can find in InternalDataStore\config\elasticsearch.yml.

Default Configuration

cluster.name: SAG_EventDataStore
node.name: XXXXX1540445984673
path.logs: C:\APIGW\InternalDataStore/logs
network.host: 0.0.0.0

http.port: 9240

discovery.zen.ping.unicast.hosts: ["localhost:9340"]
transport.tcp.port: 9340
path.repo: ['C:\APIGW\InternalDataStore/archives']

discovery.zen.minimum_master_nodes: 1

Important Recommendations

Data location

Elasticsearch lets you configure the locations where your data and logs are stored. Make sure that the locations you specify are accessible and have enough disk space. It is also important to monitor the data location and keep it tidy by planning an effective data retention strategy. The following configuration lets you change the defaults - https://www.elastic.co/guide/en/elasticsearch/reference/current/path-settings.html 

Data location

path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch

Network

Elasticsearch, by default, binds to the loopback address. It is important to change that for production deployment.  https://www.elastic.co/guide/en/elasticsearch/reference/current/network.host.html 

Sizing Strategy

The sizing strategy for Elasticsearch depends on the volume of data that you plan to store. Generally, the Core Configurations (explained in the Data Types section above) do not grow much at runtime (except for consumer applications and security tokens), so your initial sizing should depend on the number of APIs, configured policies, estimated number of applications/subscriptions, and estimated number of security tokens.

The Runtime Transactions data has the potential to grow large. Runtime transactions and metrics are stored when Logging and Monitoring policies are enabled. You can base your sizing strategy on the following parameters:

  • Transactions per second
  • Average size of the payload (request + response)
  • Data Retention strategy (purging interval)

The size of Application logs and Audit logs depends entirely on your log configurations. You have to use your best judgment here.

The disk space allocation for all the above will also depend on your sharding and replication strategy (see below) as enabling these will distribute your data across different Elasticsearch nodes. 
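
As a rough illustration of how the sizing parameters above combine, the following Python sketch multiplies throughput, payload size, retention period, and the number of shard copies. The function name, the sample numbers, and the formula itself are assumptions made for illustration; it ignores index overhead, compression, and metrics data, so treat the result as a ballpark figure only.

```python
SECONDS_PER_DAY = 86_400


def estimate_storage_gb(tps: float, avg_payload_kb: float,
                        retention_days: int, replicas: int) -> float:
    """Ballpark disk need for runtime transactions, in GB (assumed formula)."""
    # Raw volume kept at any time: rate x size x retention window.
    raw_kb = tps * avg_payload_kb * retention_days * SECONDS_PER_DAY
    # Every replica stores a full copy of each primary shard.
    total_kb = raw_kb * (1 + replicas)
    return total_kb / (1024 ** 2)


if __name__ == "__main__":
    # Hypothetical load: 10 TPS, 5 KB per request + response,
    # 30-day retention, 1 replica per shard.
    print(f"{estimate_storage_gb(10, 5, 30, 1):.1f} GB")
```

With these hypothetical inputs the estimate comes to roughly 247 GB across the cluster, which you would then divide over the available nodes.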

Sharding & Replication Strategy

An Elasticsearch index may store a large amount of data. When dealing with a growing data set, it is important to plan your sharding and replication strategy to meet your demands of high availability and performance throughput. Read the basics of sharding and replication at Shards & Replicas.

Important: Although there are no official guidelines from Elasticsearch on sharding and replication strategy, the sections below provide general guidance based on best practices for performance throughput and high availability. Treat them as a guideline only. The actual strategy should depend on factors such as the level of throughput needed, availability needs, memory and disk constraints of the machines where the nodes are running, the number of Elasticsearch nodes in the cluster, and so on.

The sharding strategy should primarily depend on the growth of your data set and the performance throughput you need. The fewer the shards, the larger each shard becomes and the more computation resources are needed on the nodes that hold them. On the other hand, the more shards there are, the greater the potential performance overhead, because the query results from all the shards have to be consolidated. Hence you need to strike a balance when choosing the number of shards per index. Additionally, this is a quote from https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster

"The number of shards you can hold on a node will be proportional to the amount of heap you have available, but there is no fixed limit enforced by Elasticsearch. A good rule-of-thumb is to ensure you keep the number of shards per node below 20 to 25 per GB heap it has configured. A node with a 30GB heap should therefore have a maximum of 600-750 shards, but the further below this limit you can keep it the better. This will generally help the cluster stay in good health"

Since the shard size has an impact on reallocation (in case of failover) and reindexing (if needed), the general recommendation is to keep the shard size between 30 and 50 GB. So if you believe that your index might grow up to 600 GB of data, you can define the number of shards as follows, assuming 3 Elasticsearch nodes, each running with a 4 GB heap, and 2 replicas per shard.

  • Number of shards (30 GB/shard) - 20 primary (with replicas it is 60)
  • No of shards per GB of heap - 5 (max recommended 20-25)
  • No of shards per node - 20 (max recommended for a node with 4 GB heap is 80-100)
  • No of Elasticsearch nodes - 3 
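
The arithmetic behind the example above can be sketched in Python as follows. The function and parameter names are hypothetical; the inputs (600 GB index, 30 GB per shard, 2 replicas, 3 nodes, 4 GB heap per node) are the ones assumed in the example.

```python
import math


def shard_plan(index_size_gb: float, shard_size_gb: float, replicas: int,
               nodes: int, heap_gb_per_node: float) -> dict:
    """Derive shard counts from the target shard size and cluster layout."""
    primaries = math.ceil(index_size_gb / shard_size_gb)
    total_shards = primaries * (1 + replicas)      # primaries plus all replica copies
    shards_per_node = math.ceil(total_shards / nodes)
    return {
        "primaries": primaries,
        "total_shards": total_shards,
        "shards_per_node": shards_per_node,
        "shards_per_gb_heap": shards_per_node / heap_gb_per_node,
        # Rule of thumb quoted above: at most 20 to 25 shards per GB of heap.
        "max_shards_per_node": (20 * heap_gb_per_node, 25 * heap_gb_per_node),
    }


if __name__ == "__main__":
    for key, value in shard_plan(600, 30, 2, 3, 4).items():
        print(f"{key}: {value}")
```

Re-running the function with a larger shard size (for example 50 GB) shows how the number of primaries drops while the per-node count stays well under the rule-of-thumb limit.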

As your data grows, you can keep adding more nodes and Elasticsearch will do the rebalancing of the shards across the new nodes keeping the number of total shards constant.   

For replication, the general recommendation for a production cluster is to have 2 replicas per shard for handling failover. 

API Gateway ships 7 indices by default. They are 

  • gateway_default (store for Core Configurations)
  • gateway_default_analytics (store for runtime transactions)
  • gateway_default_dashboard
  • gateway_default_license (store for license details)
  • gateway_default_audit (store for audit logs)
  • gateway_default_cache (store for cache statistics) 
  • gateway_default_log  (store for application logs)

For sharding and replication of the above indices, API Gateway relies on the Elasticsearch defaults (5 shards per index and 1 replica per shard). You can specify the number of shards at the index level, so plan the shard count for each index according to how fast you expect its data set to grow. For example, if you estimate large volumes of runtime transactions, increase the number of shards for that index appropriately. For the others, you can keep the default value (5 shards per index). For replication, it is better to stick to the general recommendation of 2 replicas per shard irrespective of the index.

Note: The number of replicas can be changed at a later point in time after the initial planning, but changing the number of shards is not trivial. Elasticsearch recommends planning the correct number of shards in advance.
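
As an illustration of changing the replica count later, the following Python sketch builds a request for the Elasticsearch index settings endpoint (PUT <index>/_settings with index.number_of_replicas). The helper name is hypothetical, and the host and port are assumptions taken from the default InternalDataStore configuration shown earlier (http.port 9240); adjust them and add authentication as your deployment requires.

```python
import json
import urllib.request


def build_replica_update(base_url: str, index: str,
                         replicas: int) -> urllib.request.Request:
    """Build (but do not send) a request that sets the replica count of an index."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Assumed endpoint and index; nothing is sent to the cluster here.
    req = build_replica_update("http://localhost:9240",
                               "gateway_default_analytics", 2)
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To apply the change, pass the request to urllib.request.urlopen(req).
```

The request is only constructed and printed, so you can review it before applying it to a live cluster.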

Heap and Memory Configurations

JVM configuration changes are made in the config/jvm.options file. Take a look at https://www.elastic.co/guide/en/elasticsearch/reference/current/jvm-options.html.

The default heap configuration for Elasticsearch is 1 GB. More often than not, this needs to be changed in a production setup. More guidance on heap and main memory configuration can be found at https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html and https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html 

Cluster Strategy & Configurations

Recommendations

  • A stable master is important for the stability of the cluster. If your cluster grows large (> 8 nodes) or you see stability issues, it is recommended to host dedicated master nodes, that is, node.master=true, node.data=false, node.ingest=false.
  • To avoid a split-brain issue, it is recommended to use a "quorum" for electing a master: quorum = (number of master-eligible nodes / 2) + 1. At least 3 master-eligible nodes should participate in the election to avoid split-brain, which for 3 nodes means discovery.zen.minimum_master_nodes=2.
  • By default, a "write" decision uses a "quorum": quorum = ((primary + replica) / 2) + 1. By default, each index has 5 primary shards and 1 replica. The most you can usefully have is n-1 replicas, where n is the number of nodes, which protects the data even in the worst case. However, more replicas mean more work per write, because the data is sent to every replica (although in parallel), so a high replica count hurts write-heavy applications. Replicas also speed up concurrent reads by sharing the load, so a low replica count hurts read-heavy applications. 2 replicas is a reasonable starting value for clusters of up to 5 nodes. The number of replicas can be changed at any time.
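
The master quorum formula above uses integer division. A minimal Python sketch, with a hypothetical function name, makes the resulting values of discovery.zen.minimum_master_nodes explicit:

```python
def master_quorum(master_eligible_nodes: int) -> int:
    """quorum = (number of master-eligible nodes / 2) + 1, with integer division."""
    if master_eligible_nodes < 1:
        raise ValueError("a cluster needs at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        q = master_quorum(n)
        # n - q is the number of master-eligible nodes that can be lost
        # while a master can still be elected.
        print(f"{n} master-eligible node(s): minimum_master_nodes={q}, "
              f"tolerates {n - q} failure(s)")
```

Note that with 2 master-eligible nodes the quorum is 2, so losing either node blocks master election, which is why at least 3 are suggested.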

Securing Elasticsearch

Refer to Securing Elasticsearch for API Gateway 10.2 for more information on securing Elasticsearch.

Data Management

You can use API Gateway's Backup utilities to periodically back up the data in API Gateway.  You may either adopt a similar strategy for all types of data or plan different strategies for different data. For example, you can adopt data backup strategy for your core configurations whereas you can adopt archive and purge for your transactions, audit and application logs based on your data retention period. 

Backup and restore

When you perform a backup, API Gateway performs only an incremental backup of your data. The API Gateway backup and restore command-line utility helps you take a complete or partial backup of API Gateway and restore it. Refer to Back up and restore of API Gateway assets and Periodical Data backup for more details.

Every time you back up your data, Elasticsearch creates a new snapshot that is incremental to the previous snapshot. If you perform a daily backup, you should also decide how many snapshots to retain, since in most practical cases it is sufficient to keep the backups of the last few days. In such cases, it is good practice to clean up the older snapshots. Use the following commands to perform the cleanup.

Clear the older snapshots

cd <apigatewayInstallationLocation>\IntegrationServer\instances\default\bin

To list all the existing backups

apigatewayUtil<.bat|.sh> list backup

All the created backups are listed in chronological order, with the oldest backup listed first.

Example:

./apigatewayUtil.sh list backup

Backups available in default are

default-2019-may-27-15-18-3-618000000

default-2019-may-27-15-18-38-752000000

default-2019-may-27-15-18-48-435000000

default-2019-may-27-15-21-43-506000000

To delete a backup

apigatewayUtil<.bat|.sh> delete backup -name <backupName>

Example:

./apigatewayUtil.sh delete backup -name default-2019-may-27-15-18-3-618000000
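
To keep only the last few backups, the list and delete commands above can be combined in a small script. The following Python sketch only prepares the delete commands; it relies on the listing being in chronological order (oldest first) as described above. The function names and the retention count are hypothetical, and the script prints the commands rather than running them so that you can review them first.

```python
def backups_to_delete(listed_oldest_first: list, keep: int) -> list:
    """Return the backup names to remove, retaining the newest `keep` entries."""
    if keep < 0:
        raise ValueError("keep must not be negative")
    cutoff = max(len(listed_oldest_first) - keep, 0)
    return listed_oldest_first[:cutoff]


def delete_commands(names: list) -> list:
    """Format one delete command per backup, using the syntax shown above."""
    return [f"./apigatewayUtil.sh delete backup -name {name}" for name in names]


if __name__ == "__main__":
    # Backup names copied from the example listing above.
    listed = [
        "default-2019-may-27-15-18-3-618000000",
        "default-2019-may-27-15-18-38-752000000",
        "default-2019-may-27-15-18-48-435000000",
        "default-2019-may-27-15-21-43-506000000",
    ]
    for command in delete_commands(backups_to_delete(listed, keep=2)):
        print(command)
```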

 

Archive and Purge

It is important to make sure you have an estimate of the data that you might store depending on your data retention strategy to help manage the data.  Typically after your planned data retention period, you will have 2 options 

  • Purge old data
  • Archive and purge the data 

API Gateway provides an option to archive transaction, audit, or application log data from the UI. For automation, you can use the Administration REST APIs (APIGatewayAdministration.json).

Dynamic configuration updates

Elasticsearch APIs can be used to update most of the configurations dynamically.  You use the Cluster settings API to perform the configuration changes. Please refer to https://www.elastic.co/guide/en/elasticsearch/guide/current/_changing_settings_dynamically.html and https://www.elastic.co/guide/en/elasticsearch/reference/2.4/cluster-update-settings.html  for more details.

You can find the list of all configurations that can be dynamically updated at https://www.elastic.co/guide/en/elasticsearch/reference/2.4/modules.html#modules
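
As an illustration of a dynamic update, the following Python sketch builds a request for the Cluster settings API (PUT _cluster/settings) that changes discovery.zen.minimum_master_nodes as a transient setting. The helper name is hypothetical, and the host and port are assumptions based on the default InternalDataStore configuration (http.port 9240). Transient settings are lost on a full cluster restart, whereas persistent settings survive it.

```python
import json
import urllib.request


def build_cluster_settings_update(base_url: str, settings: dict,
                                  persistent: bool = False) -> urllib.request.Request:
    """Build (but do not send) a Cluster settings API update request."""
    scope = "persistent" if persistent else "transient"
    body = json.dumps({scope: settings}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_cluster/settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Assumed endpoint; nothing is sent to the cluster here.
    req = build_cluster_settings_update(
        "http://localhost:9240",
        {"discovery.zen.minimum_master_nodes": 2},
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To apply the change, pass the request to urllib.request.urlopen(req).
```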

Monitoring

No Elasticsearch monitoring plugins are shipped with API Gateway, but many good plugins are available on the web. X-Pack is a commercial monitoring solution from Elastic, the company that created Elasticsearch. Good open-source alternatives are

  1. Elastic-HQ
  2. Elasticsearch-kopf
  3. Elasticsearch-head

Deployment Across Data Centres

Elasticsearch does not encourage deploying a single cluster across data centres, but has proposed a few alternatives.

Refer to https://www.elastic.co/blog/clustering_across_multiple_data_centers.