Here in the Apama R&D team we’re not only passionate about delivering great products but we also love to find creative ways to empower our users to quickly analyze and solve any problems in the applications you’re building on top of our platform. In this post we’re going to introduce a non-product log analysis script which we’re contributing to the Apama community and hope you’ll find really useful!
How did this tool get created? Well any time a user gets in touch to ask Software AG for help the very first thing we do is look at the correlator log files (or in case of Cumulocity deployments, the apama-ctrl microservice log which also includes the correlator log output). These logs are always a treasure trove of valuable clues , containing everything from the exact versions of the product and OS environment, timezone (rather important for correlating across machines!), warn/error messages that often point to the cause of the problem, to – perhaps most valuable of all – the Correlator status:
lines logged every 5 seconds with a concise summary of essential correlator KPIs such as events sent/received, input queue size and memory use. However, it’s not always easy for a human to look through a a set of correlator log files which might represents days, weeks or months of operation, and sort the wheat from the chaff or (to switch metaphor) find those golden nuggets that will help you identify what went wrong in a complex user application. Enter the Apama Log Analyzer tool – which is now available to everyone in our community of Apama users.
The Log Analyzer is a simple script that just needs an installation of Python 3 to run (such as can be found in Apama 10.3+). You point it at a single log file or a list/directory/.zip of correlator log files, and it and generates a directory of analysis output including: an HTML file with some graphs, .csv files you can open in a spreadsheet for deeper analysis, summaries of the error/warn messages, receiver/slow receiver connections events and more. We’ll take a look at just a few of the features in this post; for more details see the GitHub - ApamaCommunity/apama-log-analyzer: Python 3 script for analyzing Apama correlator log files and extracting useful diagnostic information readme.
After running the tool, the first thing you’ll want to look at is the overview.html file which contains some nice graphs showing event rates, memory use, queue sizes etc. These graphs can be zoomed and panned interactively.
Of course this is no replacement for having a fully configurable production monitoring and dashboarding solution such as Prometheus (see Using Apama with Prometheus and Grafana), but it’s great for environments where such monitoring isn’t available, and for getting some quick graphs showing the most important info without any configuration effort.
One of the other really useful graphs is the one showing resident/physical memory use. Note especially the black dots/lines which are a handy visualization of times when the correlator’s memory was swapped out/in by the operating system . Swapping is a common cause of timeouts and performance reductions, since if the OS is focusing on other applications running on the same machine to the detriment of the Apama process it won’t be a surprise if things slow down a little (it might be time to invest in more memory or putting things on separate boxes).
Eye-catching as the charts are, remember that there’s some great stuff in the textual sections of the overview.html too. There’s a useful overview of the versions/timezone/event rates for each correlator which might highlight some difference between your machines/runs that you weren’t aware of (e.g. different timezone, OS version, etc), and also an “ if you need help ” section containing essential questions and some data detected from the log files which you can copy+paste into an email and if you need to ask for help analyzing your logs. Note if you get in touch with Software AG for support we’d still prefer you to provide the original log files rather than only the analyzer output if possible, since although we’ll be running the analyzer on those files there may be extra information in the logs that’s important for solving your problem.
Something that’s worth bearing in mind: when you start a correlator it logs what we call the “ startup stanza ” which is a few messages at log level ##### containing super-useful info about the version/environment/parameters it’s running with. On older Apama releases when you rotate the correlator log the new log won’t contain this startup information, which means you’ll get a seriously under-powered analysis from this tool if you run it only on the latest log where you had the problem. So if you possibly can, be sure to keep that all-important startup log file and make sure it’s one of the files you pass to the analysis tool (even if you don’t bother including other logs from between the startup time and the problem you’re analyzing).
The other feature of the analyzer’s output that’s really worth noting is the summarization of error and warn messages . You’ll find this in logged_error.txt/logged_warnings.txt. Whether it’s error/warns from the Apama correlator itself or those logged by your EPL application, these log messages often hold the key to unlocking the mystery of why the application didn’t behave as expected, but in practice some logs contain so many unimportant repeated warn/error messages (perhaps from over-enthusiastic application logging) that it’s difficult to find the ones that really matter. The error/warn summarization normalizes the warn/error messages by stripping out changeable strings such as numbers (timestamps/ids etc) and lists the messages sorted by message string (rather than time) so that related messages will appear next to each other. The analysis also indicates how many times each (normalized) message was logged, and in which time range – to help you correlate the messages to the application problems you are investigating. Here’s an example:
9 warnings in correlator1
3 warnings in correlator2
1 warnings in correlator3
1 warnings in correlator4
Summary of warnings, sorted by normalized message, with number of occurrences of that message indicated by '<OCCURRENCES>x':
--- 1x: 2019-07-30 18:04:30.123 WARN [53100:DistMemStore] - <com.acme.MyJavaClass> This is a #1 unique warning message
in correlator1 line 83
--- 9x: ABC warning for process ___ took ___ seconds
6x from Tue 2019-07-30 18:04:31 to 18:04:36 (=0:00:05) in correlator1
1x at Thu 2019-08-01 18:04:31 in correlator2
1x at Tue 2019-01-01 18:04:31 in correlator3
1x at Fri 2019-02-01 18:04:31 in correlator4
Examples from correlator1:
line 85 : 2019-07-30 18:04:31.567 WARN [53100:DistMemStore] - ABC warning for process 12345 took 0.555 seconds
line 86 : 2019-07-30 18:04:32.567 WARN [53100:DistMemStore] - ABC warning for process 456 took 9.10 seconds
line 87 : 2019-07-30 18:04:33.567 WARN [53100:DistMemStore] - ABC warning for process 456 took 9.12 seconds
line 88 : 2019-07-30 18:04:34.567 WARN [53100:DistMemStore] - ABC warning for process 456 took 29.10 seconds
line 89 : 2019-07-30 18:04:35.567 WARN [53100:DistMemStore] - ABC warning for process 456 took 9999.10.123.123 seconds
line 90 : 2019-07-30 18:04:36.567 WARN [53100:DistMemStore] - ABC warning for process 456 took 29.10 seconds
Examples from correlator2:
line 85 : 2019-08-01 18:04:31.567 WARN [53100:DistMemStore] - ABC warning for process 91011 took 0.1 seconds
Examples from correlator3:
line 83 : 2019-01-01 18:04:31.567 WARN [53100:DistMemStore] - ABC warning for process 91011 took 0.1111 seconds
Examples from correlator4:
line 83 : 2019-02-01 18:04:31.567 WARN [53100:DistMemStore] - ABC warning for process 91011 took 0.1112 seconds
--- 2x: Receiver engine_receiveSLOW (component ID ___/___ ___) is slow (have approx. ___ kb of messages outstanding). Next event to be sent is MyEvent of approx. ___ bytes on channel 'mychannel'
2x from Tue 2019-07-30 18:09:20 to 18:09:22 (=0:00:02) in correlator1
Examples from correlator1:
line 92 : 2019-07-30 18:09:20.116 WARN [139627924645632] - Receiver engine_receiveSLOW (component ID 6758468452563427684/6758749927540138340 [0x7efd74000a00]) is slow (have approx. 10240 kb of messages outstanding). Next event to be sent is MyEvent of approx. 14 bytes on channel 'mychannel'
line 93 : 2019-07-30 18:09:22.116 WARN [139627924645632] - Receiver engine_receiveSLOW (component ID 6758468452563427684/11111111111 [007FFd74000a00]) is slow (have approx. 10240 kb of messages outstanding). Next event to be sent is MyEvent of approx. 22 bytes on channel 'mychannel'
--- 2x: Took ___s for an invocation of Writing log line
2x from Thu 2019-08-01 18:04:35 to 18:05:31 (=0:00:56) in correlator2
Examples from correlator2:
line 87 : 2019-08-01 18:04:35.567 WARN [29164] - Took 15.487733s for an invocation of Writing log line
line 88 : 2019-08-01 18:05:31.567 WARN [29164] - Took 12.345532s for an invocation of Writing log line
There’s lot more to explore, but we hope this post has whetted your appetite for what the tool can do. It’s part of the ApamaCommunity so we’d welcome feature suggestions, bug reports or pull requests on the GitHub project – get involved!
To download the latest release (and check back for updates) see Releases · ApamaCommunity/apama-log-analyzer · GitHub