We have been seeing out of memory issues on AIX.
The issue for us has been heap fragmentation rather than exhaustion.
You need to use the IBM tools to analyze your heapdumps (or IBM support can help with this) to tell which situation you have. If you hit an out-of-memory error while only part of the heap is actually in use, you have a fragmentation problem rather than exhaustion.
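As a rough first check before reaching for the full IBM tooling, you can look at the "% free" figure the IBM JVM reports in its verbose GC output around the time of the failure. This is only a sketch: the exact log format varies by JVM level, and the sample line and 50% threshold below are illustrative assumptions, not IBM-documented values.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: pull the reported "% free" figure out of an IBM verbosegc
// line. A high free percentage at the point of an OutOfMemoryError
// suggests fragmentation; a low one suggests genuine exhaustion.
public class GcLineCheck {
    private static final Pattern FREE = Pattern.compile("(\\d+)% free");

    // Returns the reported free percentage, or -1 if the line has none.
    static int freePercent(String verboseGcLine) {
        Matcher m = FREE.matcher(verboseGcLine);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        // Hypothetical verbosegc line; real output differs by JVM level.
        String line = "<GC(42): freed 1048576 bytes, 78% free (83886080/107374182)>";
        int pct = freePercent(line);
        System.out.println(pct >= 50
            ? "OOM with " + pct + "% free: likely fragmentation"
            : "low free space at OOM: likely exhaustion");
    }
}
```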
Unlike the Sun JVM, the IBM 1.4.2 JVM adds a category of immovable heap objects they call “dosed”. Any object referenced from method-local (stack) storage is dosed, and therefore cannot be moved during a heap compaction cycle. The Integration Server uses a lot of method-local references.
Depending on your thread count, you can end up with a large number of dosed objects scattered across the heap while using only, say, 20% of the total. The first relatively large object that comes along can’t find enough contiguous space in the heap. A garbage collection is triggered, then a compaction. None of those dosed objects can be moved during the compaction cycle, so there is still not enough contiguous space. Game over. You could say that in this situation, the large object is the victim rather than the perpetrator.
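A toy model makes the arithmetic concrete. This is not Integration Server code, just an illustration: treat the heap as fixed-size slots, pin one slot in every five (80% free overall), and see how small the largest contiguous run still is.

```java
// Toy model of a heap where pinned ("dosed") slots cannot be moved
// by compaction. Most of the heap is free, yet the largest
// contiguous free run is tiny, so one big allocation still fails.
public class FragmentationDemo {
    // Longest run of consecutive free (unpinned) slots.
    static int largestContiguousFree(boolean[] pinned) {
        int best = 0, run = 0;
        for (boolean p : pinned) {
            run = p ? 0 : run + 1;
            best = Math.max(best, run);
        }
        return best;
    }

    // Total number of free slots, ignoring contiguity.
    static int totalFree(boolean[] pinned) {
        int free = 0;
        for (boolean p : pinned) if (!p) free++;
        return free;
    }

    public static void main(String[] args) {
        // 100 slots with a dosed object in every 5th slot.
        boolean[] heap = new boolean[100];
        for (int i = 0; i < heap.length; i += 5) heap[i] = true;

        System.out.println("free slots: " + totalFree(heap));              // 80
        System.out.println("largest run: " + largestContiguousFree(heap)); // 4
        // An allocation needing 10 contiguous slots fails despite 80% free.
    }
}
```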
The dosed-object situation is a calculated tradeoff IBM made in order to get higher performance by eliminating a layer of heap object reference indirection. An IBM Java lab guy explained to me that at the time 1.4.2 was being designed, the workloads the hardware was supporting made this a good tradeoff. It’s not working so well now, and IBM has returned to a more “Sun-like” heap management approach for Java 1.5. If only we could get webMethods support on IBM Java 1.5…
In the meantime, the best approach we have found so far is to recognize that threads running inside the Integration Server are generally a poor place to queue workload, and to get aggressive about controlling thread count on any Integration Server. I’ve reduced our max server threads significantly.
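For reference, the server thread pool bounds live in server.cnf (or can be set through the Administrator UI). The values below are purely illustrative, and you should verify the exact property names against your IS version before changing anything:

```properties
# Illustrative only -- tune for your own workload and IS version.
watt.server.threadPool=200
watt.server.threadPoolMin=50
```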
The challenge is that by simply reducing the max server threads setting, there is a potential to starve high-priority work with high-volume, low-priority work. So we must effectively throttle the workload requests at their source. This means taking a look at trigger throttle settings, concurrency settings for individual triggers, scheduler tasks, JDBC pool sizes, reverse-invoke connections, and so on. Ultimately, you may find that more servers are required for high-priority work, or that a client simply must be prepared to wait.
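The principle of throttling at the source, rather than queuing inside the server, can be sketched generically with a bounded permit pool. This is not an IS API, just an illustration: the pool size and timeout are assumptions, and the point is that a flood of low-priority requests is rejected at the door instead of tying up server threads.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Generic sketch of source-side throttling: low-priority callers get
// a bounded number of concurrent slots. When the pool is exhausted,
// the caller is told to wait or retry instead of queuing work inside
// the server, leaving threads available for high-priority traffic.
public class SourceThrottle {
    private final Semaphore permits;

    SourceThrottle(int maxConcurrent) {
        permits = new Semaphore(maxConcurrent);
    }

    // Runs the work if a slot is free within the timeout; otherwise
    // returns false so the caller backs off at the source.
    boolean trySubmit(Runnable work, long timeoutMs) throws InterruptedException {
        if (!permits.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS)) {
            return false; // throttled: no server-side queuing
        }
        try {
            work.run();
            return true;
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SourceThrottle throttle = new SourceThrottle(2);
        boolean accepted = throttle.trySubmit(
            () -> System.out.println("low-priority work ran"), 100);
        System.out.println("accepted: " + accepted);
    }
}
```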