Collecting Runtime Monitoring Data for Prometheus with the Command Central REST API

When you deploy a solution that includes Command Central to a Kubernetes cluster, you can use the Command Central REST API to monitor the runtimes in the solution that Command Central manages. The Command Central REST API collects metrics, such as the product status and other KPIs, and sends those metrics to Prometheus. When a container is deployed to the Kubernetes cluster, Prometheus consumes the metrics that it obtains from Command Central. In release 10.3 and higher, the Command Central REST API also provides KPI monitoring data in integer format.

Command Central REST API Endpoint

When you deploy a solution that contains Command Central to a Kubernetes cluster, the Prometheus plugin registers a Command Central monitoring endpoint and consumes metrics for all runtimes in a solution through the single Command Central endpoint. Use the following format to register the Command Central monitoring endpoint:

metrics_path: /cce/monitoring/prometheus
static_configs:
  - targets: ['cc:<port>']
    labels:
      customer: '<customerId>'
      solution: '<solutionId>'
      stage: '<stageId>'

The following example registers a Command Central monitoring endpoint for the solution named "solution1" in the "dev" stage, for the customer with ID "sag", with Command Central listening on port 8090:

metrics_path: /cce/monitoring/prometheus
static_configs:
  - targets: ['cc:8090']
    labels:
      customer: 'sag'
      solution: 'solution1'
      stage: 'dev'

After registering the Command Central monitoring endpoint, the Command Central REST API uses one of the following requests to collect data from the managed runtimes:

GET /cce/monitoring/prometheus/
GET /cce/monitoring/prometheus/<nodeAlias>/<runtimeComponentId>

Where <nodeAlias> is the alias of the node of the run-time component and <runtimeComponentId> is the ID of the run-time component.
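
As an illustration of the two request forms above, the following Python sketch builds the request path for either all runtimes or a single run-time component. The function name monitoring_path is hypothetical and not part of Command Central. The sketch covers only the path; the host, port, and any authentication are outside its scope.

```python
from urllib.parse import quote

# Base path of the Command Central monitoring endpoint, as given in the text.
BASE_PATH = "/cce/monitoring/prometheus"


def monitoring_path(node_alias=None, runtime_component_id=None):
    """Return the request path for all runtimes or for one run-time component.

    Hypothetical helper for illustration only.
    """
    if node_alias is None and runtime_component_id is None:
        # Form 1: metrics for every managed runtime.
        return BASE_PATH + "/"
    if node_alias is None or runtime_component_id is None:
        raise ValueError("node_alias and runtime_component_id must be given together")
    # Form 2: metrics for a single run-time component.
    # Percent-encode each segment so unusual characters cannot break the path.
    return "{}/{}/{}".format(
        BASE_PATH,
        quote(node_alias, safe=""),
        quote(runtime_component_id, safe=""),
    )


if __name__ == "__main__":
    print(monitoring_path())
    print(monitoring_path("local", "OSGI-CCE"))
```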

The Command Central REST API returns the collected monitoring data in the following format:

<kpi_id>{nodeAlias="<nodeAlias>",runtimeComponentId="<runtimeComponentId>",productId="<productId>"} <kpiValue> <lastUpdatedTimestamp>

Where <kpi_id> is the ID of the KPI that Command Central is monitoring, <nodeAlias> is the alias of the node of the run-time component, <runtimeComponentId> is the ID of the run-time component, and <productId> is the ID of the product instance.

The <kpiValue> is an integer number and the <lastUpdatedTimestamp> is a UNIX timestamp, in milliseconds, of the time when the monitoring data was last updated.

The following example shows output data for the JVM CPU load, memory usage, and the number of threads for the product with ID "CCE" (that is, Command Central itself):

sag_cpu{nodeAlias="local",runtimeComponentId="OSGI-CCE",productId="CCE"} 2 1530853721742
sag_memory{nodeAlias="local",runtimeComponentId="OSGI-CCE",productId="CCE"} 278792 1530853721742
sag_threads{nodeAlias="local",runtimeComponentId="OSGI-CCE",productId="CCE"} 180 1530853721742

The following example shows output KPI data for the "NUMRealmServer" product instance of the run-time component with ID "Universal-Messaging-localhost":

sag_um_kpi_fanoutBacklog{nodeAlias="local",runtimeComponentId="Universal-Messaging-localhost",productId="NUMRealmServer"} 1 1530853721742
sag_um_kpi_jvmMemoryUsed{nodeAlias="local",runtimeComponentId="Universal-Messaging-localhost",productId="NUMRealmServer"} 191.57 1530853721742
sag_um_kpi_queuedTasks{nodeAlias="local",runtimeComponentId="Universal-Messaging-localhost",productId="NUMRealmServer"} 0 1530853721742
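
To make the output format concrete, here is a minimal Python sketch that parses one line of the monitoring output into its KPI ID, labels, value, and timestamp. The names Sample and parse_sample are hypothetical and not part of Command Central. The sketch reads the value as a float because one of the samples above is not a whole number, and it treats the timestamp as optional.

```python
import re
from typing import NamedTuple, Optional

# <kpi_id>{<labels>} <kpiValue> [<lastUpdatedTimestamp>]
LINE_RE = re.compile(
    r'^(?P<kpi_id>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'\{(?P<labels>[^}]*)\}\s+'
    r'(?P<value>\S+)'
    r'(?:\s+(?P<timestamp>\d+))?\s*$'
)
# One name="value" pair inside the braces.
LABEL_RE = re.compile(r'\s*(\w+)="([^"]*)"\s*,?')


class Sample(NamedTuple):
    kpi_id: str
    labels: dict
    value: float
    timestamp: Optional[int]  # UNIX time as printed in the output, if present


def parse_sample(line):
    """Parse one output line; raise ValueError if it does not match the format."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized sample line: {!r}".format(line))
    labels = dict(LABEL_RE.findall(match.group("labels")))
    timestamp = match.group("timestamp")
    return Sample(
        kpi_id=match.group("kpi_id"),
        labels=labels,
        value=float(match.group("value")),
        timestamp=int(timestamp) if timestamp else None,
    )


if __name__ == "__main__":
    line = ('sag_cpu{nodeAlias="local",runtimeComponentId="OSGI-CCE",'
            'productId="CCE"} 2 1530853721742')
    print(parse_sample(line))
```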

Runtime Statuses

The GET /cce/monitoring/prometheus request also returns the run-time status data in the following format:

sag_runtime_status{nodeAlias="<nodeAlias>",runtimeComponentId="<runtimeComponentId>",productId="<productId>"} <status>

Where <nodeAlias> is the alias of the node of the run-time component, <runtimeComponentId> is the ID of the run-time component, and <productId> is the ID of the product instance.

The <status> value is an integer number. The following table maps the integer values of <status> (as shown in the output for Prometheus) to the run-time statuses supported by the monitored run-time components:

Integer Status    Runtime Status
-10               FAILED
-5                UNRESPONSIVE
-3                ERROR
0                 STOPPED
3                 STARTING
2                 STOPPING
5                 NOT_READY
10                ONLINE
11                ONLINE_MASTER
9                 ONLINE_SLAVE
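
A consumer of the sag_runtime_status metric could translate the integer back into a status name along the following lines. This Python sketch is illustrative only: RuntimeStatus and status_name are hypothetical names, and the sketch assumes that the first three table entries are the negative values -10, -5, and -3.

```python
from enum import IntEnum


class RuntimeStatus(IntEnum):
    """Integer-to-status mapping taken from the table above.

    Assumption: FAILED, UNRESPONSIVE, and ERROR carry negative values.
    """
    FAILED = -10
    UNRESPONSIVE = -5
    ERROR = -3
    STOPPED = 0
    STOPPING = 2
    STARTING = 3
    NOT_READY = 5
    ONLINE_SLAVE = 9
    ONLINE = 10
    ONLINE_MASTER = 11


def status_name(value):
    """Return the run-time status name for an integer, or UNKNOWN(<value>)."""
    try:
        return RuntimeStatus(int(value)).name
    except ValueError:
        # The value is not listed in the table (or is not a number).
        return "UNKNOWN({})".format(value)


if __name__ == "__main__":
    for raw in (10, -5, 7):
        print(raw, status_name(raw))
```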