Diagnostics tool for Apama in Cumulocity IoT
Diagnostic tools can help to resolve uncertainty about a set of symptoms, or help avoid failures, by providing reassurance and organizing the symptoms into something more understandable and manageable. They can also empower users to start analyzing issues themselves. With that in mind, we have made some of the Apama diagnostics information available as a zip file that can be downloaded from the Streaming Analytics home page.
Importance of Diagnostics tools
Previously, if someone encountered issues, the starting point for analyzing them was either the Apama-ctrl microservice logs in Cumulocity IoT Cloud or the logs generated by the Diagnostic Collector in Cumulocity IoT Edge, and they then had to make deductions from those files. Now, diagnostics information from the Apama correlator is gathered and can be downloaded as a zip file via the web applications. This helps when developing, debugging and deploying applications. The link is shown at the bottom of the Streaming Analytics home page (see Figure 1).
Figure 1: Link for Diagnostic zips in Streaming Analytics UI.
Overview of Diagnostics tool
The diagnostics information in the file named diagnostic-overview<timestamp>.zip (from the diagnostics link in Figure 1) includes the following information:
- The microservice log file contents, if available, including a record of the correlator's startup logging and the last hour of logging, up to a maximum of 20,000 lines (this may require the "microservice hosting feature" in the subscribed applications).
- Apama-internal diagnostics information (similar to the engine_watch and engine_inspect command-line tools available in Apama).
- A copy of all EPL applications, Smart Rules and Analytics Builder models.
- A copy of any alarms that the Apama-ctrl microservice has raised.
- CPU profiling (over a duration of 5 seconds).
- Some information from the environment (tenant details, environment variables).
- Version numbers of the components.
The diagnostics information in the file named diagnostic-enhanced<timestamp>.zip (from the enhanced link in Figure 1) includes all of the above. In addition, it includes the results of more expensive requests that may significantly slow down the correlator, such as EPL memory profiler snapshots, the contents of queues and CPU usage. In the first instance, use the overview zip, and download the enhanced zip only if this further information is required.
Because generating diagnostics can consume significant resources, these endpoints are restricted to prevent misuse: you must have READ permission for "CEP management" to access them.
Figure 2: Contents of Diagnostic Overview zip file.
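To work with these files programmatically instead of unpacking the archive by hand, a small helper can read entries straight out of the zip. This is a minimal sketch using only Python's standard `zipfile` and `json` modules; the function names are hypothetical, and it assumes you have already downloaded a diagnostics zip and that entries such as correlator/inspect.json contain JSON, as their names suggest.

```python
import json
import zipfile


def list_entries(zip_source):
    """Return the names of all files stored in a diagnostics zip.

    zip_source may be a file path or a seekable binary file object.
    """
    with zipfile.ZipFile(zip_source) as zf:
        return zf.namelist()


def load_json_entry(zip_source, entry_name):
    """Parse one JSON entry (for example correlator/inspect.json) without extracting the zip."""
    with zipfile.ZipFile(zip_source) as zf:
        with zf.open(entry_name) as fh:
            # json.load accepts the binary stream returned by ZipFile.open
            return json.load(fh)
```

Pass the path of your downloaded zip and an entry name taken from the `list_entries` output.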
The following scenarios show where these diagnostics help to identify problems:
Consider a situation where some injected EPL is generating excessive requests (maybe in a tight loop). This might block the output queue of the Apama correlator. To diagnose the cause, look at the correlator/inspect.json output in the diagnostics zip file, in particular the 'queueSize' property in the receivers section.
Example: a correlator status line and a JSON fragment of correlator/inspect.json for a full output queue:
Correlator Status: sm=52 nctx=11 ls=239 rq=0 iq=20000 oq=11024 icq=20000 lcn="main" lcq=20000 lct=79.9 rx=20105 tx=158739 rt=129 nc=5 vm=3832844 pm=571796 runq=0 si=0.0 so=0.0 srn="CumulocityIoTGenericChain" srq=10000
The srn (slowest receiver name) and srq (slowest receiver queue size) fields show that the queue for the "CumulocityIoTGenericChain" receiver is full (srq=10000). You can then check which monitors or models are responsible. For details on the status fields, see the "Descriptions of correlator status log fields" section in the Apama documentation.
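Reading status lines by eye gets tedious when a log contains many of them. The sketch below is a hypothetical helper, not an Apama tool: it parses the key=value fields of a status line such as the one above into a Python dict, so the slowest-receiver fields (srn, srq) and the output queue (oq) can be checked directly. It assumes every value is either a bare token or a double-quoted string, as in the example line.

```python
import re

# key=value pairs; the value is either a "quoted string" or a bare token
_FIELD = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_status_line(line):
    """Turn a 'Correlator Status:' log line into a dict of field -> value.

    Quoted values lose their quotes; numeric values become int or float.
    """
    fields = {}
    for key, raw in _FIELD.findall(line):
        if raw.startswith('"'):
            fields[key] = raw.strip('"')
            continue
        try:
            fields[key] = int(raw)
        except ValueError:
            try:
                fields[key] = float(raw)
            except ValueError:
                fields[key] = raw
    return fields


def slowest_receiver(fields):
    """Return (name, queue size) of the slowest receiver from the srn/srq fields."""
    return fields.get("srn"), fields.get("srq")


line = ('Correlator Status: sm=52 nctx=11 ls=239 rq=0 iq=20000 oq=11024 icq=20000 '
        'lcn="main" lcq=20000 lct=79.9 rx=20105 tx=158739 rt=129 nc=5 vm=3832844 '
        'pm=571796 runq=0 si=0.0 so=0.0 srn="CumulocityIoTGenericChain" srq=10000')
status = parse_status_line(line)
print(status["oq"], slowest_receiver(status))
# → 11024 ('CumulocityIoTGenericChain', 10000)
```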
- If the output queue is not full, it is possible that an EPL application is in a tight loop, consuming CPU and not processing new events. This can cause the input queue to fill up, delaying the events processed by Analytics Builder and raising 'CEP queue full' alarms in Cumulocity IoT. Analyze the cpuProfile.csv output in the diagnostic overview zip file, especially the monitor name and CPU time. The data collected by the profiler may also help you identify other bottlenecks. For details on the CPU profiler in the Apama correlator, see the "Using the CPU profiler" section in the Apama documentation.
- Consider a situation where the memory consumption of the Apama correlator increases over time. This might indicate a memory leak, which you can check by looking at the eplMemoryProfile.csv output in the enhanced diagnostics zip file. The EPL memory profiler shows memory usage that can be directly attributed to individual monitor instances, which makes it helpful for tracking down leaks. For details on the EPL memory profiler, see "Using the EPL memory profiler" in the Apama documentation.
- If there is a listener leak in some injected EPL and you want to identify which listener in which monitor is responsible, look at correlator/inspect.json in the overview diagnostics zip file. In particular, look at the 'eventTemplates' for the event types or, for a monitor spawn leak, at the number of subMonitors.
- To get a snapshot of the Apama correlator, that is, to see all the Smart Rules, EPL applications and Analytics Builder models it contains, look respectively at service/smartrule/smartrules.json, eplfiles.json and analyticsbuilder.json (paths relative to the root of the overview zip file).
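Several of the scenarios above come down to finding which monitor accounts for most of a resource. As an illustration for cpuProfile.csv, the sketch below sums CPU time per monitor and sorts the totals, largest first. The column names Monitor and CPU time are assumptions, not taken from the Apama documentation: check the header row of your own file and pass the real names through the hypothetical monitor_col and cpu_col parameters. It also assumes a single header row followed by data rows with numeric CPU time values.

```python
import csv
import io
from collections import defaultdict


def cpu_time_by_monitor(csv_text, monitor_col="Monitor", cpu_col="CPU time"):
    """Total the CPU time per monitor and return (monitor, total) pairs, largest first.

    monitor_col and cpu_col are hypothetical defaults; adjust them to match
    the header row of your cpuProfile.csv.
    """
    totals = defaultdict(float)
    # skipinitialspace tolerates "a, b" style separators in header and data rows
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        totals[row[monitor_col].strip()] += float(row[cpu_col])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

For a file extracted from the zip, read it with `open("cpuProfile.csv", newline="")` and pass the text in; the first few pairs point to the monitors worth inspecting first.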
Also, if the memory usage of the Apama-ctrl microservice exceeds 90% of the maximum memory permitted for the microservice container, this diagnostic information is saved as a zip file in the Cumulocity IoT inventory and an audit log message containing the URL of the stored file is written. The file can be accessed in Admin → Management → Files repository.
Figure 3: Alarm when Apama-ctrl microservice is using more than 90% of the maximum memory permitted.
The diagnostics tool for Apama in Cumulocity IoT captures diagnostics information that helps when you are experiencing problems or debugging EPL applications. The zip file is also useful to provide to Support when you file a support ticket.
The diagnostics zip files contain much more information that is useful for debugging problems. For further details, refer to the documentation.