Performance Issues

Having some performance challenges with what should be a manageable load for our infrastructure. I’ve put in a support request for this but was curious if others might have seen this type of behavior. Here are the details:

The end result: During load testing the Integration Server appears to pause at different intervals while processing the documents. It starts off handling the load with no problems (actually very fast) but degrades over time. This slowdown causes the queues in the Broker to back up, and eventually the required throughput is not met.

What I have observed: The slowdown is not processor bound or IO bound. Resources are not fully utilized. In fact, during the slowdown or pause the processor utilization actually drops by 20%. The IS server is not thread starved; it is only using about 10 - 15% of the available threads. During these pauses it appears that certain thread activity is suspended or even goes away but then resumes.

What I have tried: Tweaking trigger refill levels, number of concurrent threads for trigger processing, database connection pools, audit settings, JVM memory sizes, Solaris file descriptors.

I have also enabled verbose GC debugging on the JVM. I do not see any major collections going on during the run, but there are a lot of minor collections. Should I try tweaking the young generation size settings? Could that cause this pause?

-Server: Sparc 4(1.2Ghz) x 8GB, SAN Storage for Broker

-Integration: JDBC Adapter Notification, MQ Adapter, Mainframe Adapter. The integration consists of one put to an mq queue, a listener for responses, a mainframe transaction invoke and several broker publishes.

Any ideas are appreciated, thanks.

markg
http://darth.homelinux.net

Mark,

I faced this problem once, and I would like to point you to where I found issues during my load testing:

  1. The load testing tool itself (JUnit) was increasing the overall time, but it was a setup problem;
  2. The database (Oracle) was increasing the overall time also, and it was used to generate the transaction logs;
  3. We had IO problems while saving lots of files at the same time;
  4. The web server (Apache) was being overwhelmed by (and stopped responding to) the number of requests sent by the delivery service into IS, which was consuming all available threads.

I think your scenario is a little different from the one where I had this problem, but in my case the performance issue wasn’t exactly in the wM platform, and the behavior was very similar.
Maybe these tips won’t help you at all… But this was the experience I had.

HTH,
Maldonado

Some more information from this morning’s test:

Looking at the GC verbose output, the memory utilization gradually increases during the load testing with minor collections running constantly. The memory allocated gradually reaches the max for the JVM and then a major collection occurs. The memory allocation then goes back to the original size. The performance degradation appears to follow this path, with processing gradually slowing as the JVM max is approached and then improving after the full GC is run.

Increasing the JVM size seems to just prolong how long it takes before the degradation occurs but it still occurs.

Could this be problems in the pipelines of these flows? Or is this expected behavior of the JVM and webMethods for high volumes?

markg
http://darth.homelinux.net

Hello,
I don’t think you can do too much administration-wise to help your situation besides, like you said, increasing the upper bound of your memory. Your system is probably doing well to work as fast as possible to complete tasks with the available memory. If you introduced some aggressive gc cycle to reclaim unwanted memory, you might waste time that would otherwise go to actually completing a task. However, you may want to run a service that does a gc at some interval shorter than your perceived memory max-out time frame. Good day.

Yemi Bedu

You can create a simple Java service which will do a gc periodically…this won’t solve your problem but actually you’ll free considerable memory space.

Conventional wisdom from Sun and others is not to try and run a System.gc() call to try and nudge the gc along. However I tried it just for the sake of trying it. It made no difference on performance or rather throughput. The problem is still occurring. During the load test I observed the first set of data going through in good time, the second set follows and it goes okay but a little slower, the third set a little slower, and then by the fourth set, significant queueing begins.
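For anyone who wants to repeat that experiment, here is a minimal sketch in plain Java of what a periodic GC request looks like, with heap usage logged before and after. It is not a webMethods Java service (those take an IData pipeline), the class name PeriodicGc and the one-second interval are made up for the demo, and it uses current Java syntax rather than anything 1.3.1-specific.

```java
import java.util.Timer;
import java.util.TimerTask;

/** Illustrative only: requests a GC on a timer and logs heap usage. */
class PeriodicGc {

    /** Heap currently in use, in bytes. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Requests a collection and returns the change in used heap (can be zero or negative). */
    static long requestGc() {
        long before = usedHeapBytes();
        System.gc(); // only a hint; the JVM is free to ignore it
        return before - usedHeapBytes();
    }

    public static void main(String[] args) throws InterruptedException {
        Timer timer = new Timer(true); // daemon thread, will not keep the JVM alive
        timer.schedule(new TimerTask() {
            @Override
            public void run() {
                long freedKb = requestGc() / 1024;
                System.out.println("freed ~" + freedKb + " KB, used "
                        + usedHeapBytes() / 1024 + " KB");
            }
        }, 0L, 1_000L); // hypothetical interval, shortened for the demo
        Thread.sleep(3_500L); // let a few iterations run, then exit
    }
}
```

Logging the before and after numbers at least shows whether the call frees anything, which is a separate question from whether throughput improves.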

We are running about 2000 per minute, which it seems to handle okay for the first few minutes. If we run 1400 per minute then it is able to keep up without queueing. The strange thing is that the 2000 per minute does not take a lot more processor to run. It just gradually starts getting slower. The processor utilization actually starts dropping during this slowdown. It seems like the Integration Server just pauses and then resumes.

Thanks for all the suggestions so far.

markg
http://darth.homelinux.net

Hello,
Yeah, it is definitely not an issue with the processor; that is why I suggested the gc call, as I knew you had the cycles. It is on the object creation side, where objects are created and stay in “scope” the whole time. Only when you reach your threshold, with a lot of waste lying around, do some of those things get swept away. I know that IS has a Manager for some data that runs periodic sweeps. I saw this link:
http://advantage.webmethods.com/cgi-bin/advantage/main.jsp?w=0&targChanId=knowledgebase&oid=1611814149

It does not deal directly with your issue, but it may help you think about some things you have not explained yet. So …
Do you have caching enabled for some services in this test set? Do you save or load information with files (not db) in this set? You may then need to set a smaller sweep interval (about 65 seconds), and/or you may want to see (depending on the OS) whether you have a lot of file descriptors open at the same time (maybe more than 1024). Not too sure about your environment, so you can fill in a little more info at any time. Thanks for listening. Good day.

Yemi Bedu

Hello,
You may also want to read the solution for this thread:

http://advantage.webmethods.com/cgi-bin/advantage/main.jsp?w=0&targChanId=knowledgebase&oid=1611719573

Again, it does not claim to be the same problem as yours, but the resolution seems on par with what you are trying to achieve. Good day.

Yemi Bedu

Mark,

This is an interesting one (they are always interesting when they are happening to someone else).

From the tests done so far, it seems that there is a correlation between your observed throughput degradation and the size of the heap or the time spent doing GC on a larger heap.

I assume that you are using a Sun JVM. What version are you using and what command line args are being passed when the JVM is started?

Also, is it possible to use other GC analysis tools to perhaps show something that you haven’t noticed? For example what are the average pause times for minor and major GC’s? How much memory is being freed on average?
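If no dedicated tool is at hand, those averages can be pulled out of the verbose GC output with a few lines of code. The sketch below assumes the log lines have the shape `[GC 325407K->83000K(776768K), 0.2300771 secs]` and `[Full GC ...]`; the exact format varies by JVM version and flags, so the regular expression may need adjusting. The class name GcLogStats and the sample lines in main are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: averages pause time and memory freed from verbose GC lines. */
class GcLogStats {
    // Assumed line shape: "[GC 325407K->83000K(776768K), 0.2300771 secs]"
    private static final Pattern LINE = Pattern.compile(
            "\\[(Full GC|GC) (\\d+)K->(\\d+)K\\((\\d+)K\\), ([0-9.]+) secs\\]");

    int minorCount;
    int majorCount;
    double minorPauseSecs;
    double majorPauseSecs;
    long minorFreedKb;
    long majorFreedKb;

    /** Adds one log line to the totals; lines that do not match are ignored. */
    void add(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return;
        }
        long freedKb = Long.parseLong(m.group(2)) - Long.parseLong(m.group(3));
        double pause = Double.parseDouble(m.group(5));
        if (m.group(1).equals("Full GC")) {
            majorCount++;
            majorPauseSecs += pause;
            majorFreedKb += freedKb;
        } else {
            minorCount++;
            minorPauseSecs += pause;
            minorFreedKb += freedKb;
        }
    }

    double avgMinorPauseSecs() {
        return minorCount == 0 ? 0.0 : minorPauseSecs / minorCount;
    }

    double avgMajorPauseSecs() {
        return majorCount == 0 ? 0.0 : majorPauseSecs / majorCount;
    }

    public static void main(String[] args) {
        GcLogStats stats = new GcLogStats();
        // Made-up sample lines; in practice, feed in the real log line by line.
        stats.add("[GC 325407K->83000K(776768K), 0.2300771 secs]");
        stats.add("[Full GC 267628K->83769K(776768K), 1.8479984 secs]");
        System.out.println("minor: " + stats.minorCount + " collections, avg pause "
                + stats.avgMinorPauseSecs() + " s, freed " + stats.minorFreedKb + " KB");
        System.out.println("major: " + stats.majorCount + " collections, avg pause "
                + stats.avgMajorPauseSecs() + " s, freed " + stats.majorFreedKb + " KB");
    }
}
```

Comparing the minor-collection averages from the first few minutes of a run with those from the slow period would show whether the collections themselves are getting more expensive.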

The fact that a major GC occurs before performance improves again, but not when minor GC’s occur should tell us something, but I’m not sure exactly what. Do you have access to a real JVM geek at your shop? Could Sun be persuaded to cough one up? (assuming they are the JVM vendor as well as the HW/OS vendor).

And not to get too narrowly focused, is there any way that you could be encountering network contention once your throughput reaches a certain level? Is it possible that MQ has some sort of memory leak that is impacting IS? Stretches, I know, but don’t want to put those blinders on.

BTW, thanks for confirming my long-running assertion that scheduled System.gc() calls do nothing to help the situation. That old wives’ tale gets recycled here far too often. :wink:

Mark

BTW, what happens when you switch JVM vendors? While you don’t want to do that lightly, you might learn something interesting from running a test or two on BEA JRockit or a different version of the one you are using.

-M

The JVM is an older one, Sun’s 1.3.1 with no patches, we are still on 6.0.1 sp2. We are upgrading webMethods to 6.5 early next year. I haven’t really ruled out anything as a possibility at this point. I have been focused on the JVM just because of the perceived pause in processing that occurs.

The pause times on the minor collections stay pretty consistent even during the bad performance. It is really weird. The actual integration works like this:
  1. Every 3 minutes 1500 records are inserted into an Oracle table that we have a JDBC notification on.
  2. They get picked up by a trigger and the first flow service, which does some mapping into a common format, and are then published again.
  3. The next trigger puts them into an MQ queue on another server.
  4. An MQ Listener picks up the response from the MQ server, does some formatting and then publishes back to the broker.
  5. Another trigger picks up the messages, invokes a mainframe transaction and updates an Oracle database. End of integration.

On the first insertion of 1500, total processing time is just over 1 minute to complete the entire process. The 2nd insertion is about the same. On the 3rd it goes to about two minutes, and then the 4th does not finish prior to the next insertion. At that point it never catches up. The CPU utilization runs at about 50 - 60% during the 1st and 2nd iterations and then drops to 30 - 45% during the 3rd and 4th when it is having problems.

The reason I believe the performance problem is isolated to the Integration Server and not any of the other resources is when the performance issue occurs the very first flow service that processes the inbound documents slows down significantly. Where it normally drains the 1500 from the queue in less than a minute, it now takes over 3. That flow service is not dependent on any external resource.

As for command line args to the JVM, I’m using Bound Threads since I’m running on Solaris, which generally gives better performance.

JAVA_MEMSET="-server -ms${JAVA_MIN_MEM} -mx${JAVA_MAX_MEM} -native -XX:-UseBoundThreads -XX:-UseLWPSynchronization -XX:+UseThreadPriorities"

I like the idea of changing the JVM version and even the vendor. That might help isolate the problem. Plus 1.4.2 has some improvements in GC options. I guess after that I can “unplug” individual flow services and see if there is an individual service or external resource that could be causing the issue. Since all of the flow services are decoupled from each other, I’m hoping we can find out whether one is the offender or it’s not the flow services at all.

The JVM guru is me, which isn’t saying a whole lot. Sun may be an option, although they probably won’t provide support for the end-of-life JVM version we are using.

markg
http://darth.homelinux.net

I forgot to mention we do have Wiley here. We normally use it for WebSphere but we do have some webMethods agents. I’m going to hook it up and see if it helps debug the problem. Has anyone ever used Wiley with webMethods? Good results?

markg
http://darth.homelinux.net

Hey Mark,
We’ve used Wiley with webMethods. It’s quite useful when all else is proving inadequate, once you get around to setting it up.

On the garbage collection issue, there were a bunch of new options/algorithms with anything over 1.4.1, try digging into those perhaps as a way to cut down those monster pauses (because garbage collection means all threads in all CPUs paused while it collects). So definitely look at whether you can go to 1.4.2.

Some quick google searches:
http://java.sun.com/docs/hotspot/gc1.4.2/
http://www.javaworld.com/javaworld/jw-03-2003/jw-0307-j2segc.html

If you’ve got long running services, perhaps check that they’re not leaving stuff in the pipeline any longer than necessary, as this will decrease the amount of easily freed up memory until the top level service finishes executing. This might be forcing it to do a “deep” garbage collect unnecessarily. It doesn’t take too many copies of large arrays, or of the same data held as document->XMLnode->XMLString->bytes, left lying around.
Check out what’s left in the pipeline of your top level service that isn’t listed as an output for what you should track down and clean up.
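As a rough illustration of that cleanup, here is a sketch that uses a plain java.util.Map as a stand-in for the pipeline. A real IS service would work against the pipeline's IData object (or simply drop variables in the flow editor), so the class PipelineCleanup, the method dropUndeclared and the key names are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative only: a Map stands in for the IS pipeline. */
class PipelineCleanup {

    /**
     * Removes every entry that is not a declared output, so the values it
     * referenced become eligible for collection. Returns the dropped keys.
     */
    static Set<String> dropUndeclared(Map<String, Object> pipeline,
                                      Set<String> declaredOutputs) {
        Set<String> dropped = new HashSet<>(pipeline.keySet());
        dropped.removeAll(declaredOutputs);
        pipeline.keySet().removeAll(dropped); // removing keys removes the entries
        return dropped;
    }

    public static void main(String[] args) {
        Map<String, Object> pipeline = new HashMap<>();
        // The same data held in several intermediate forms, as in the post above.
        pipeline.put("document", new Object());
        pipeline.put("xmlString", "<order/>");
        pipeline.put("bytes", new byte[1024]);
        pipeline.put("status", "OK");

        Set<String> declared = new HashSet<>(Arrays.asList("status"));
        Set<String> dropped = dropUndeclared(pipeline, declared);
        System.out.println("dropped: " + dropped);
        System.out.println("left in pipeline: " + pipeline.keySet());
    }
}
```

The list of dropped keys doubles as the list of things to track down: anything that shows up there was still being held when the top level service finished.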

Hope that helps,
Nathan Lee

Just an update. I’ve decided to hold off (with client concurrence) on figuring out this issue. We are in the process of our 6.5 migration/upgrade. Since our regression testing will include load testing, I decided not to do this twice in a 3 month timeframe.

I was able to determine that some sequencing of events or flows is likely causing the issue. I was able to extract some of the flow logic and run some more performance testing. I was able to get around 60k per hour without changing any settings, including the number of processing trigger threads. That’s a lot more than the 24k per hour with the entire group of flows. I’ll have to do the process of elimination to figure out which flow or combination of flows is causing our little rest break.
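To put those figures next to the load profile described earlier in the thread (1500 records every 3 minutes), here is a small worked calculation. It assumes the 24k and 60k per hour numbers were measured against that same profile, which the thread does not confirm; the class and method names are made up.

```java
/** Illustrative only: compares arrival rate with measured throughput. */
class BacklogEstimate {

    /** Records arriving per hour when batchSize records land every intervalMinutes. */
    static long perHour(long batchSize, long intervalMinutes) {
        return batchSize * 60 / intervalMinutes;
    }

    /** Records still queued after the given hours; never negative. */
    static long backlogAfter(long arrivalPerHour, long capacityPerHour, long hours) {
        return Math.max(0L, (arrivalPerHour - capacityPerHour) * hours);
    }

    public static void main(String[] args) {
        long arrival = perHour(1500, 3); // 1500 records every 3 minutes
        System.out.println("arrival rate per hour: " + arrival);
        System.out.println("backlog after 1h at 24k/h: " + backlogAfter(arrival, 24_000, 1));
        System.out.println("backlog after 1h at 60k/h: " + backlogAfter(arrival, 60_000, 1));
    }
}
```

Under that assumption the arrival rate works out to 30,000 per hour, so 24k per hour falls behind by about 6,000 records every hour while 60k per hour keeps up with room to spare.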

Thanks to all for all of the suggestions. I’ll update this again in about 2 to 3 months.

markg
http://darth.homelinux.net

griffima wrote: "I forgot to mention we do have Wiley here"
Mark,
We are planning to use Wiley with webMethods. I know this is an old thread, but I was curious to know how it works with webMethods. Do you also have any input on how to configure Wiley for IS? We did change the server.sh file, but it keeps throwing errors about classes not found, etc. I guess I am missing something here.

-sri