UPDATE (17/10/2019): there is now a community-developed tool that can help with analyzing correlator log files – see the GitHub project ApamaCommunity/apama-log-analyzer (a Python 3 script for analyzing Apama correlator log files and extracting useful diagnostic information).
Correlator status lines allow you to monitor the runtime operational status of a running correlator. By default, the correlator logs a status line every 5 seconds while it is running. You can change this interval by specifying the --logQueueSizePeriod option when you start the correlator.
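For example, to log status lines every 10 seconds instead (the 10-second period is purely illustrative; add any other options your installation requires): correlator --logQueueSizePeriod 10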
WHERE TO LOOK FOR STATUS LOG LINES
The default behavior when a correlator is run from the command line is for the correlator log (which includes the status information) to be written to standard output. When the correlator is run from Designer, status lines are shown in the Console view.
FORMAT OF CORRELATOR STATUS LINES
When logging at INFO level, the status lines are in the format shown below. The meaning of each field is described in the table in the Meaning of fields section.
Correlator Status: sm=11 nctx=1 ls=60 rq=0 eq=0 iq=0 oq=0 icq=0 lcn="<none>" lcq=0 lct=0.0 rx=5 tx=17 rt=31 nc=2 vm=325556 pm=81068 runq=0 si=915.3 so=0.0 srn="<none>" srq=0
HOW TO REDIRECT LOGS TO FILE
To send correlator logs to a file, specify the -f file or --logfile file option, where file is the name of the file that you want to use as the log file.
If you specify the name of an existing file, the correlator overwrites it on Windows or appends to it on UNIX. To avoid overwriting an existing log, include ${START_TIME} in the log file name when you start the correlator.
For example: --logfile correlator_${START_TIME}.log
This opens a log file with a name something like correlator_2016-10-06_16:51:33.log
In UNIX shells, you need to escape the dollar sign when starting the correlator, for example: correlator -l license --logfile correlator_\${START_TIME}.log
HOW STATUS LOG LINES CAN BE USEFUL
Status lines are useful for diagnosing correlator busy scenarios when the correlator is connected to external applications. They are also helpful in identifying memory leaks, listener leaks, and similar problems. For example:
- If "pm" increases continuously, it may be an indication of a memory leak.
- An increase in "ls" may indicate a listener leak (see the EPL sketch below).
- If "runq" is consistently more than 0 and latency is a problem, the application is not keeping up; you should reduce the amount of work being done in the correlator or add more/faster CPUs to your configuration.
In this way, you can assess the performance of applications based on the values of the fields shown in status lines.
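To make the listener-leak case concrete, here is a minimal, hypothetical EPL sketch (the Tick event and the monitor name are invented for this illustration): every incoming Tick sets up another on all listener that is never satisfied or quit, so the "ls" figure in the status lines climbs steadily even though "sm" stays flat.

```
// Hypothetical event type, invented for this illustration
event Tick {
    string symbol;
    float price;
}

monitor ListenerLeakExample {
    action onload() {
        on all Tick() as t {
            // A new listener is set up for every Tick that arrives and is
            // never satisfied or quit, so the "ls" count in the status
            // lines grows steadily while "sm" stays constant.
            on all Tick(symbol = t.symbol) as latest {
                log "Latest price for " + latest.symbol + ": "
                    + latest.price.toString() at INFO;
            }
        }
    }
}
```

If the inner listener really is needed, a common fix is to keep a handle to it (listener l := on all ...;) and call l.quit() before creating a replacement, so that at most one listener per symbol is active at a time.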
MEANING OF FIELDS
The table below describes the meaning of the fields in correlator status lines, with a few examples of how you can diagnose issues from them:
Field | Description | Units | Examples | Comments |
---|---|---|---|---|
sm | Number of sub monitors/monitor instances | Number of monitor instances | If there is a consistent increase in this number, it is an indication that monitors are being spawned but not terminating. This will eventually lead to an out of memory condition. | This number should usually be steady |
nctx | Number of contexts in the correlator (the main context plus any user-defined contexts) | Number of contexts | | This number should usually be steady |
ls | Number of listeners (This is the sum of listeners in all contexts. This includes each on statement and each stream source template) | Number of listeners | If “ls” is increasing but “sm” isn’t, the culprit is probably custom EPL that is creating listeners that are never satisfied or quit. | This number should usually be steady |
rq | Route queue (Sum of the number of routed events on the input queues of all contexts) | Number of events | If this number is largish (>50) there is probably a very lengthy loop (possibly infinite) routing events. | This number is typically 0 or near 0 |
eq | Enqueue queue (number of events on the enqueued events queue) | Number of events | | This number is typically 0 or near 0 |
iq | Input queue (Sum of the number of entries on the input queues of all contexts) | Number of events | Every context has an input queue with a maximum size of 20,000 events, and it is perfectly normal to see this number in the hundreds. However, if this number is steadily increasing, the system is not keeping up and may eventually block senders: either the correlator simply cannot process the events at the rate they arrive, or there is back pressure from the output queue. | This number is typically 0 or near 0; if it is growing, it indicates contexts are not keeping up with incoming events. |
oq | Output queue (The number of events waiting on the correlator’s output queue to be dispatched to any registered receivers) | Number of events | The output queue is fed by the “emit” or “send to” keyword. There may be multiple processes attached as consumers to the output queue. If any one of them cannot consume events at the rate the correlator is emitting them, the queue will back up and this number will start getting large. By default, it has a limit of 10,000 events per receiver. If this limit is reached, the correlator will, by default, stop processing events and wait for the output queue to drain. | This number is typically 0 or near 0, if it is growing then it indicates one or more slow receivers. |
icq | The total number of enqueued events on the input queues of all public contexts | Number of events | This will be a subset of the events included in the iq count. | This number is typically 0 or near 0; if it is growing, it indicates public contexts are not keeping up. |
lcn | Name of the slowest context | N/A | | |
lcq | For the context identified by lcn, this is the number of entries on its queue | Number of events | | |
lct | For the context identified by lcn, this is the time difference between its current logical time and the most recent time tick added to its input queue | Seconds | | |
rx | Number of events the correlator has received from external sources since the correlator was started | Number of events | | This number only increases as events are received; it can never go down. |
tx | Number of events the correlator has sent to receivers since the correlator was started. This number includes duplicate events sent to multiple receivers | Number of events | For example, if you inject EPL code that emits an event and five receivers are subscribed to channels that publish that event, the tx count goes up by five: although there was one event, it was sent five times. | This number normally increases as events are emitted |
rt | Route total (total number of events that have been routed across all contexts since the correlator was started) | Number of events | | This number normally increases as events are routed |
nc | Number of consumers (the number of components connected to receive emitted events) | Number of connections | | This number should usually be steady |
vm | Virtual memory | kilobytes | | This number should usually be steady, though it may vary slightly |
pm | Physical memory | kilobytes | | This number should usually be steady, though it may vary slightly |
runq | Run queue. Number of contexts that have work to do but are not currently running | Number of contexts | If this number increases, it indicates a performance issue and the application is not keeping up | This number is typically 0 or near 0 |
si | The rate at which pages are being read from swap space | Pages/second | If "si" is non-zero, the host is running out of memory and will not perform well. This severely affects latency and may prevent status lines from being logged regularly every 5 seconds. | This number is typically 0 or near 0 |
so | The rate at which pages are being written to swap space | Pages/second | If "so" is non-zero, the host is running out of memory and will not perform well. This severely affects latency and may prevent status lines from being logged regularly every 5 seconds. | This number is typically 0 or near 0 |
srn | Slowest receiver name (The receiver with the largest number of incoming events waiting to be processed) | N/A | ||
srq | Slowest receiver queue size (For the receiver with the largest number of incoming events waiting to be processed, this is the number of events that are waiting) | Number of events |
Apama implements flow control for its queues. The input queue (one per context) has a maximum size of 20,000 events and the output queue (one per receiver) has a maximum size of 10,000 events. If a receiver is slow, the output queue will grow until it is full and then block contexts that are sending to it. If contexts are not keeping up (or are blocked), the input queue will grow until it reaches 20,000 events and then block senders.
Use the --qdisconnect command line flag to prevent the correlator from consuming an unbounded amount of memory and failing with an out-of-memory error in the case where a receiver is slow and events continue to be sent in.
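For example (the log file name here is just an illustration): correlator --qdisconnect --logfile correlator_${START_TIME}.log. As the flag name suggests, a persistently slow receiver is disconnected rather than having events queued for it without bound.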
CONCLUSION
The correlator status log lines are the first place you should look if you are experiencing problems (throughput, latency, memory use, high CPU, etc.) with Apama, and they give great insight into the internal status of the correlator. Please use the forum if you have any questions.