Debugging performance problems in Apama

We have seen some performance problems this week at our client.  The symptoms were:

  • Initially everything works fine
  • By the time the system has been running for 2.5 hours, the correlator is unresponsive and cannot process any more data.

To fix this, I first needed to identify the cause of the problem.  I initially looked at the correlator log, in particular at the status lines:

2015-05-11 13:52:33.360 INFO  [140130488678144] - Correlator Status: sm=31 nctx=3 ls=35670 rq=166 eq=0 iq=20210 oq=0 icq=20210 lcn="main" lcq=20210 lct=2938.2 rx=35115 tx=2433 rt=1252267 nc=2 vm=69800372 runq=0 si=0.0 so=0.0 srn="<none>" srq=0

2015-05-11 13:54:33.360 INFO  [140130488678144] - Correlator Status: sm=31 nctx=3 ls=35670 rq=533 eq=0 iq=20210 oq=0 icq=20210 lcn="main" lcq=20210 lct=3058.2 rx=35115 tx=2433 rt=1266824 nc=2 vm=69931444 runq=0 si=0.0 so=0.0 srn="<none>" srq=0

OK, what does this mean?  We see that:

  • The number of listeners (ls) is high, at 35670: this may be necessary (after all, with two listeners per open order and roughly 15k orders, that accounts for nearly all of them).
  • The input queues are full: both iq (the input queue) and icq (the input context queue) are at 20210.
  • The number of routed events is huge (1266824 after 2.5 hours) when we’ve received only around 35k events.  This means we’re routing an average of 36 events for each event received, which seems excessive.
  • Routed events are backed up: the route queue (rq) is greater than zero in every status line.
  • The output queue (oq) is empty, so we’re not backed up by a slow receiver.

This was enough to make me confident in searching for all route statements in the code.  The code base here is small enough for that to be a viable approach.  I found a lot of them in IdManager.mon:

      FindCarrierShipmentOrders findShipmentOrdersByCarrier;

      on all FindCarrierShipmentOrders(): findShipmentOrdersByCarrier {
            route CorrespondingShipmentsByCarrier(
                  idAdmin.carrierShipmentOrders(findShipmentOrdersByCarrier.carrierReference));
      }

So we’ve got a request-response pattern, and it looks like there are too many requests, flooding the system.  Where are we routing these?  (I also added some log statements to see which ones were called most frequently, so before each route statement there was a simple line such as log "IdManager line 55";.)

 

This led me to the OrderStatusStore:

action updateCarrierStatus(CarrierStatus carrierStatus) {
      integer tmpRes;
      log "Updating order status entry with carrier information for: " + orderLine.orderLineId() at INFO;
      route ChangeCarrierReference(carrierStatus.shipmentReference, carrierStatus.carrierId);
      route FindCarrierShipmentOrders(carrierStatus.carrierId);

      CorrespondingShipmentsByCarrier shipments;
      on CorrespondingShipmentsByCarrier(): shipments {
            ShipmentId shipmentReference;
            for shipmentReference in shipments.matchingShipmentIds {
                  route FindCorrespondingOrderLines(shipmentReference.getShipmentId());
OK – now we’re getting somewhere.  We’re routing a request, iterating through the response and sending another request each time.

 

This was the only place we were making these requests, so it was easy for me to refactor the IdManager into an event object with actions.  Instead of routing FindCarrierShipmentOrders, we had:

        CorrespondingShipmentsByCarrier shipments := idManager.findCarrierShipmentOrders(FindCarrierShipmentOrders(carrierStatus.carrierId));
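
The refactored IdManager might look something like the following.  This is a sketch only: the idAdmin field and the exact shape of the event are assumptions based on the routed version shown earlier.

```
// Sketch only: field names are assumptions, not the actual project code.
event IdManager {
      IdAdmin idAdmin;

      // Synchronous replacement for the routed request/response pair:
      // the caller gets its answer back immediately, instead of routing
      // a FindCarrierShipmentOrders event and setting up a listener for
      // the routed reply.
      action findCarrierShipmentOrders(FindCarrierShipmentOrders request)
                  returns CorrespondingShipmentsByCarrier {
            return CorrespondingShipmentsByCarrier(
                  idAdmin.carrierShipmentOrders(request.carrierReference));
      }
}
```

Because it is a plain action call, nothing goes through the route queue at all, and the listener count no longer grows with the number of requests.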

We installed this on Wednesday, and found that the situation improved: after 5 hours, the system was still working OK.  However, when I tested this morning, I found we were back to problems: no data showing in the GUI.  I didn’t have any logs available, so I had to use other tools.  First, I used engine_watch: this gets much the same information as the status lines, and gave me:

Uptime (ms)                68950346
# Contexts                 3
# Monitors                 18
# Sub-monitors             29
# Java Apps                0
# Listeners                127515
# Sub-listeners            127515
# Event Types              158
Input Queue                19824
# Received Events          111723
Route Queue                1
# Routed Events            155021
# Consumers                2
Output Queue               0
# Created Output Events    49561
# Sent Output Events       49561
# Processed Events         343712
Most backed up context     main
Backed up context queue    0
Slowest receiver           <none>
Slowest receiver queue     0

(there were many more lines like this)

This tells us that the route queue is now a steady 1, and Processed Events was increasing only gradually (2 per 20-second poll), but everything was saturated.  We’re no longer routing too many events, but something slow is still going on.  The next step (and one I should probably have taken earlier!) was to use the profiler.  The commands here are:

engine_management -r "profiling on"

<wait a while>

engine_management -r "profiling get" > profile.csv

The profiling output is listed in descending order of CPU usage, which is what I wanted (additional columns are hidden here for space reasons):

Context Id: 1    Context Name: main
Location: com.apamax.jnj.rta.OrderStatusStore:com.apamax.jnj.rta.ManagementActions.correspondingOrderLineIDs
Filename and line number: /opt/softwareag/Apama/ME250_RTA/objects/IdManagementActions.mon:55
Cumulative time: 22301    CPU time: 22301

Context Id: 1    Context Name: main
Location: com.apamax.jnj.rta.ManagementActions.correspondingOrderLineIDs
Filename and line number: /opt/softwareag/Apama/ME250_RTA/objects/IdManagementActions.mon:55
Cumulative time: 703    CPU time: 703

Context Id: 1    Context Name: main
Location: com.apamax.jnj.rta.OrderStatusStore:com.apamax.jnj.rta.IdManager.findCorrespondingOrderLines
Filename and line number: /opt/softwareag/Apama/ME250_RTA/monitors/IdManager.mon:39
Cumulative time: 22380    CPU time: 79

Times are measured in correlator ticks.  It looks like the action starting at line 55 in IdManagementActions.mon is using 96% of the processing time: that’s a good candidate for optimisation.  Let’s look at the code:

 

action correspondingOrderLineIDs(string shipment) returns sequence<OrderLineId> {
    string currentOrderLineId;
    sequence<OrderLineId> matches := new sequence<OrderLineId>;
    for currentOrderLineId in orderLineIds.keys() {
        if ((orderLineIds[currentOrderLineId].orderCompany = shipmentIds[shipment].orderCompany) and
            (orderLineIds[currentOrderLineId].orderType = shipmentIds[shipment].orderType) and
            (orderLineIds[currentOrderLineId].orderNumber = shipmentIds[shipment].orderNumber)) then {
            matches.append(orderLineIds[currentOrderLineId]);
        }
    }
    return matches;
}

 

That for loop in the middle looks like the culprit: it scans every order line on every lookup.  To refactor this, I looked at the data model and found that we could clean it up by using a two-way dictionary: that way, the lookup is much cheaper, though there’s more work to be done on item addition, and it uses more memory.  We now have, for example:

    dictionary<string, OrderKey> shipmentsByCarrier;
    dictionary<OrderKey, string> carrierByShipment;

    // OrderKey is Order Company, Type, Number; this is a unique identifier for a
    // particular order and can be used to link order line and shipment records.
    // Need a record of which order lines belong to a particular key, with and
    // without known shipment IDs.
    // The inner dictionary is used as a set.
    dictionary<OrderKey, dictionary<string, boolean> > orderLines;

    /**
     * Returns a sequence with all the matching warehouse order status for a given carrier shipment
     */
    action getOrderLinesForCarrierShipment(string carrierShipment) returns sequence<string> {
        OrderKey shipment := shipmentsByCarrier.getOrDefault(carrierShipment);
        return orderLines.getOrDefault(shipment).keys();
    }

So the action still returns a sequence of strings, but:

  • We only make one call, to getOrderLinesForCarrierShipment, instead of getting all shipments for that carrier and then all order lines associated with them.
  • We look up the shipment key (order company, order number, order type), which is the same as the order key, in the dictionary mapping carrier shipments to orders.
  • We return all known order lines on that order.

By moving the effort to data addition, and ensuring that we have the links set up correctly, we make the lookup much cheaper.
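
That addition-time bookkeeping might look like this.  A sketch only: addShipment and addOrderLine are hypothetical helper actions, not the actual project code; the dictionaries are those declared above.

```
// Sketch only: hypothetical helpers that keep the two-way dictionaries
// in step whenever a shipment or order line is added.
action addShipment(string carrierShipment, OrderKey key) {
    shipmentsByCarrier[carrierShipment] := key;
    carrierByShipment[key] := carrierShipment;
    // Ensure there is an (initially empty) set of order lines for this key.
    if not orderLines.hasKey(key) then {
        orderLines[key] := new dictionary<string, boolean>;
    }
}

action addOrderLine(OrderKey key, string orderLineId) {
    // The inner dictionary is used as a set, so the value is a dummy.
    orderLines.getOrAddDefault(key)[orderLineId] := true;
}
```

The extra writes happen once per item added, while the saving is on every lookup, so this trade pays off whenever lookups outnumber additions.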

OK, I think I’ve solved the problem: now to test it.  I took a data set (11,000 events) and sent it in to the correlator with the old code, profiling it; I then profiled the new code over the same data set.  Here’s what I found:

Old:

Location: com.apamax.jnj.rta.OrderStatusStore:com.apamax.jnj.rta.ManagementActions.correspondingOrderLineIDs
Filename and line number: C:\projects\jnj-rta\ME250_RTA\objects\IdManagementActions.mon:55
Cumulative time: 1338    CPU time: 1338

Location: Snapshot
Cumulative time: 1    CPU time: 100

Location: com.apamax.jnj.rta.OrderStatusStore:com.apamax.jnj.rta.IdManager.findCorrespondingOrderLines
Filename and line number: C:\projects\jnj-rta\ME250_RTA\monitors\IdManager.mon:39
Cumulative time: 1410    CPU time: 72

New:

Location: Snapshot
Cumulative time: 0    CPU time: 42

Location: com.apamax.jnj.rta.CarrierIssuesManager:;GC-mark
Cumulative time: 33    CPU time: 33

Location: com.apamax.jnj.rta.CarrierStatus:Event
Cumulative time: 15    CPU time: 14

Location: com.apamax.jnj.rta.CarrierIssuesManager:com.apamax.jnj.rta.CarrierIssuesManager.processShipmentData
Filename and line number: C:\projects\jnj-rta\ME250_RTA\monitors\CarrierIssuesManager.mon:85
Cumulative time: 32    CPU time: 11

So we see that, in the old version, we’re spending 1338 ticks in correspondingOrderLineIDs.  In the new version, the biggest cost in our own code is somewhere completely different, and it spends only 11 ticks of CPU time.

In summary:

  • If you have an unresponsive application but no crash, start with the correlator status lines (or engine_watch; I’d recommend the -t -v flags to get a usable CSV file).  This often tells you what type of problem it is: input, output, or internal.
  • For a performance problem, the profiler is very useful; it will tell you where resources are being used.
  • If you’re doing frequent lookups, use a dictionary instead of iterating over a sequence.
  • Find a reproducible data set and run a profile test before and after your changes: this is where Subversion is your friend (you can update to an old revision, forcing your application to run "old" code, if you’ve forgotten to profile with a known data set before making changes).
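
As a minimal illustration of the dictionary-versus-sequence point, here is a sketch; the Order event and its fields are hypothetical, used only to show the two lookup shapes side by side.

```
// Hypothetical event, for illustration only.
event Order {
    string id;
}

monitor LookupExample {
    sequence<Order> allOrders;             // scanning this is O(n) per lookup
    dictionary<string, Order> ordersById;  // this lookup is O(1)

    action findOrderSlow(string id) returns Order {
        Order o;
        for o in allOrders {
            if o.id = id then { return o; }
        }
        return new Order;
    }

    action findOrderFast(string id) returns Order {
        return ordersById.getOrDefault(id);
    }

    action onload() {}
}
```

With thousands of items and frequent lookups, the difference between these two shapes is exactly what the profiler surfaced above.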