Audit mechanism

Hi all,

I have a question. I skim through the Logging and Monitoring doc for 6.1 and couldn’t find the answer.

When we set the audit logging to a database with:
watt.server.auditDBSize=20
watt.server.auditDir=audit
watt.server.auditLog=perSvc
watt.server.auditMaxPool=10
watt.server.auditMinPool=1
watt.server.auditStore=database
watt.server.auditThreshold=100
watt.server.auditlog.error=true
watt.server.auditlog.gd=true
watt.server.auditlog.session=true

Setup the JDBC, db etc…

Does the audit goes to the DB synchronously everytime the service runs (if we set that way) or it is stored in a file format under …/audit/data directory and get send to the database using some internal Audit management asynchrounously?

My concern is performance, if we have a remote database to do service audit logging, would it affect the performance severly?

In our case, it seems like performance becomes really poor even we have set the service to log on top level only and only audit if it’s success or error.

It seems like every top level service (most of them are just empty wrappers) calls waste a lot of time in auditing. Sometimes it can take up to 2000 - 3000 ms for just auditing process. While the actual services that do the work are generally quite fast.

Any comments and thoughts?

Thanks

Kurt

Hi Kurt,

Find below WebMethods overview regarding Logging and Auditing in document “webMethods_Logging_and_Monitoring_Guide_6.1.5.pdf”. You will probably find in Bold response to your question :


[FONT=PalatinoLinotype-Roman][SIZE=2]

By default, Integration Server stores logging data in flat files, but webMethods highly recommends that you configure Integration Server to store the data in a database instead. Typically, you would only use flat file storage if you lack the resources to purchase, install, and manage a database.

[/size][/font]
[FONT=PalatinoLinotype-Roman][SIZE=2]

You [/size][/font][FONT=PalatinoLinotype-Italic][SIZE=2]must [/size][/font]configure Integration Server to store logging data in a database if you want to use Monitor. You must use Monitor if you want to do the tasks listed below; you cannot do them from Integration Server Administrator.
[FONT=PalatinoLinotype-Roman][SIZE=2]

[/size][/font][FONT=Wingdings-Regular][SIZE=3][COLOR=#818181]

�� [/color][/size][/font][FONT=PalatinoLinotype-Roman][SIZE=2]Monitor and control Modeler processes and Workflow workflows at runtime.

[/size][/font][FONT=Wingdings-Regular][SIZE=3][COLOR=#818181]

�� [/color][/size][/font]View document logging data, process logging data, workflow logging data, and 4.x
activation logging data
[FONT=PalatinoLinotype-Roman][SIZE=2].

[/size][/font][FONT=Wingdings-Regular][SIZE=3][COLOR=#818181]

�� [/color][/size][/font]Resubmit services, documents, processes, and 4.x activations[FONT=PalatinoLinotype-Roman][SIZE=2].
Monitor also offers these additional benefits:

[/size][/font][FONT=Wingdings-Regular][SIZE=3][COLOR=#818181]

�� [/color][/size][/font]Links related logging data in its display (for example, for any process or process step, you can see all relevant service, error, and user-defined message entries). Integration Server Administrator does not link related data for you; you must look through the individual logs for related data yourself.
[FONT=PalatinoLinotype-Roman][SIZE=2]

[/size][/font][FONT=Wingdings-Regular][SIZE=3][COLOR=#818181]

�� [/color][/size][/font]Lets you view implicit processes. An implicit process is a sequence of service invocations joined by publishing of documents that was not explicitly modeled in Modeler. You cannot view such processes from Integration Server Administrator.

[FONT=PalatinoLinotype-Roman][SIZE=2][FONT=PalatinoLinotype-Roman][SIZE=2]

Another benefit of storing logging data in a database is that doing so improves logging performance. When you use a database, multiple logging threads can write data from Integration Server’s temporary store to the database simultaneously. When you use flat files, only one logging thread at a time can write data to each file.

[FONT=PalatinoLinotype-Roman][SIZE=2]

Also, when you store logging data in a database, you can use facilities in your database to
control the length of time the data is retained. When you store logging data in flat files, you must delete the files manually when you no longer need them.
If you decide to store logging data in a database, you must store both the IS Core AuditLog and the Process Audit Log to a database. You cannot store one log in a database and the other in flat files.



Auditing services and processes are not mandatory however it adds-on features that i think are very important from a governance perspective.

Regards,
[FONT=PalatinoLinotype-Roman][SIZE=2]Djibril.

[/size][/font][/SIZE][/FONT][/SIZE][/FONT][/SIZE][/FONT]

I generally don’t recommend auditing your services except for error only on the top level services and even then only if you plan on having the resubmit capability. Performance does suffer a great deal. If you add on save pipeline to that it gets even worse. It does run in the background but it still performs pretty bad. You can play around with the batch size but I’ve found this still in general with large integrations to perform bad.

Error only with pipeline can provide value in being able to resubmit the service. But if you are not planning on using that then auditing is really a lot of wasted overhead.

Hi All,

Thanks for the response.

Djibril. thanks for your response.

What I want to understand is how webMethods actually proces them internally. Somebody told me that even when we setup the audit to the db, as what I’m doing now, the audit data is actually stored as a file, managed by some sort of AuditManager and get submitted to the DB. I’m not convinced. I thought, as in Logging and Monitoring docos, the audit data is sent synchronously (maybe multi-threaded) if we setup as audit log to DB. If it runs in the background, when and how does it run?

From the bolded part in your explaination, it seems like IS stores the audit data temporarily somewhere and then write to the DB simultanously (or asynchronously ?)
My questions then probably where does it get stored temporarily? If it stores it as flat file, doesn’t it contradict the statement in the docos that says if it uses flat-file can only use single thread?

Can anybody confirm that?

I agree with griffima, that most of the time, it is useless to audit, except if we see the benefits from governance perspectives.

My finding is that audit log to a DB takes a lot more time than publishing a message. I’m just having a thought to disable the audit for success and replace that with publishing a message and only audit if it’s error. We have auditing/error-handling/monitoring IS that talks to a centralized DB anyway. So, in this case we can trace back if a service runs succesful or not (the management happy :wink: ), and not sacrificing the performance too much. The disadavantage are; that we need another doc format and the audit doc add extra traffics, also put more work for the audit/monitoring IS.

Any thoughts?

The performance of using auditing is affected most by the following setting:

watt.server.auditGuaranteed

Assuming you have the audit subsystem configured to write to a database (watt.server.auditStore=database), then setting guaranteed auditing equal to true causes all audit events to be written to a file in the file system (by default /IntegrationServer/audit/data) before being asynchronously written to the database. Setting this to false causes all audit events to be written to an in-memory (read volatile) store before being asynchronously written to the database.

To fully understand the performance issues with the audit subsystem, some explanation of the producer / daemon thread pools is needed…

The Producer Pool
A thread from the producer thread pool is used any time the IS needs to write an audit record. The job of the producer thread is to write the audit record to the intermediate store - either file system (guaranteed auditing on) or memory (guaranteed auditing off). While these threads perform their work, the calling service is in a blocked state. Any time a service needs to write an audit record, it must wait for an available thread in the producer pool.

The Daemon Pool
Threads in the daemon pool are simply responsible for “draining” the intermediate store and writing the entries to the database. This is done in the background, but due to synchronization issues (which I’ll explain in a moment) can have a negative impact on the performance of your services. Draining of the audit store use to be a big problem in 6.0.1 in an even moderately high volume environment - the store would fill up faster than it could be drained. Fortunately in 6.1 a new parameter was introduced:

watt.server.auditFetchSize

This controls how many audit records each daemon pulls from the intermediate audit store. The default is 10 (in 6.0.1 it used to only pull one at a time).

Controlling the Pool Size
The min / max threads that are contained within the Producer and Daemon Pools is controlled by the following settings:

watt.server.auditMinPool
watt.server.auditMaxPool

These settings control the size of both pools - so you cannot have a Producer Pool sized differently from your Daemon pool. However, each is a separate pool. So if Min is 1 and Max is 10 (the defaults), each pool has a minimum of 1 thread and maximum of 10 (they don’t max out at 5 each). If you are ever curious to see how many threads are actually getting used, you can view the audit producer threads in IS Administrator under System Threads. They will show up as AuditProducer0-AuditProducerN. Note that a Producer thread will only show up if there was ever enough concurrent demand for it and once created will not go away until the IS is restarted (i.e. the pool will not be brought back down to the minimum size).

The Synchronization Issue
Okay, now to the real issue with guaranteed auditing. To summarize, we have a Producer Pool which is used by services to write audit events to the intermediate store. We have an intermediate store which is either in memory or in the file system. We have a daemon pool which is responsible for draining the intermediate store and writing the entries to a database.

Let’s suppose we have an IS with guaranteed auditing enabled, 250 server threads and a producer / daemon pool max size of 30. Furthermore, lets suppose that all 250 threads are trying to concurrently write audit records.

Now let’s look at the synchronization that is going on:

  1. The 250 service threads will all be in competition for the 30 producer threads.
  2. The 30 producer threads are all in competition trying to write to a single file in the file system - only one thread can write at a time. (In actuality, there can be multiple files, but there is a log file against which all threads are synchronized).
  3. The 30 daemon threads are also in competition with the producer threads since they are also trying to modify (in this case by removing entries from) the intermediate audit store. This is how a daemon thread can actually negatively impact the performance of a service - even though it would seem that the work they are doing is 100% asyncronous from the execution of your service, there is underlying synchronization going on that can negatively impact the speed of your code.

Increasing the size of the pools may or may not help. Your service may be able to get a producer thread from the pool more quickly, but it will most likely have to wait longer to write its entry to the intermediate store since you have just increased the contention against that file.

Turning guaranteed auditing off, will give you vastly superior performance. In my testing it is almost as fast as having no auditing at all. I’m not 100% sure if this eliminates the synchronization issue (since I don’t know what internal mechanism is being used) or just lessens it due to the raw speed of memory versus disk i/o. But if you make the switch, be sure to monitor the java heap, since all of your audit records are now in a memory store.

However, you need to decide if you can live without the audit records in the event of a server failure. If you are just gathering statistics, then I would definitly turn off guaranteed auditing. If you are relying on the audit records for recovery / resubmission, then you may want to think twice about turning off guaranteed auditing. If you have a requirement for both, then I would consider setting up separate IS instances that are specifically tuned based upon SLA’s and/or auditing requirements.

You may also want to consider turning off logging of “unnecessary” events. Session logging may be a good example of this. Session logging can be turned off by setting watt.server.AuditLog.session=false. This will help eliminate unneeded contention within the audit subsystem.

Lastly, make sure you are only auditing what is necessary. In particular, only log pipelines on error - and only if you plan on actually resubmitting using the logged pipeline. Only log top level services and only if you care about gathering statistics about them. Never log the start of a service - success and error should suffice. For Process Models, make sure that “Enable Resubmission” is only checked where absolutely needed. Most importantly - have solid standards around the proper use of auditing and ensure that all of your developers understand them. You can run some simple queries against the audit tables to see if there are unnecessary things being audited (i.e. audit your auditing)…

Hope that helps.

Most comprehensive explanation. That gives me light to move ahead.

Thank you.

With IS 6.1 SP2, webMethods has implemented a new feature called synchronous auditing that bypasses the local audit stores. It can be implemented by adding / setting the new watt.server.auditSync property to true.


However, if watt.server.auditSync=true, then I have the following questions for webMethods (I’m opening a ticket to get them answered and will post the reply):

1) Does the audit producer pool size (i.e. watt.server.auditMaxPool) govern how many concurrent connections will be used to log data synchronously to the database? If not what governs this besides the JDBC pool settings for IS Core Audit and Process Audit?
2) Assuming the answer to question (1) is “yes”, then if there are more threads attempting to write than producers, will the threads block until a producer thread becomes available? If not, what will happen?
3) I’m assuming that watt.server.auditMaxFetchSize and watt.server.auditGuaranteed are not relevant when auditSync is true. Is this a correct assumption?
4) Does watt.server.auditRetryCount still have any effect? (i.e. if I lose connectivity to my database for a few seconds, can this be used to retry the logging of the event to the database?) I realize this does not give me “guaranteed auditing”, but would theoretically give me better QoS than no retries.
5) Are there any plans to have the audit subsystem write to the local store in the event that the database is unavailable and then drain that store when the database becomes available and revert to the synchronous mechanism? (Much in the same way that the Broker Outbound Store works?) In effect, this would give you guaranteed auditing with synchronous auditing (it would only go asynchronous in the event of a database outage and would then honor the auditGuaranteed setting to determine if it should queue up the events in memory (false) or to a file (true))?

Regardless of the answers to these questions, this new setting provides even more flexibility in configuring the audit subsystem and should eliminate most of the local synchronization issues that currently plague the audit subsystem with guaranteed auditing enabled. It does not, however, provide for guaranteed auditing in the event that the database becomes unavailable.

Hi,
In most of the implementations, till now we have been usign the property setting of WhenTopLevelService and log on errors, include pipeline on errors without much of problem. But there is one thing i would like to make a point. Suppose that there are two services, a and b and a is invoking b, there is a constaint that you cannot log the pipline of service b eventhough u set the audit property to log on always, and remove the top level service thing. So there is no need as such to enable auditing for a service which is a child service, apart from knowign the status of the service instance completion, whcih can as well be done from the service usage of the IS admin page

Experts, correct me i m wrong any where

Regards,
Pradeep

The follows quoted are very helpful explanation on the producer/daemon pool.
Is there any document explaining how the server draws the threads
from the pool for the large amount of services invoked simultaneously?
It must be helpful to estimate the time performance.

Thanks!

--------->

Not really sure I understand the question, but if there is no producer thread available, the service thread will wait until it can get one (i.e. it will block). Review my comments on synchronization.

If you want to understand the performance impact of blocking, then you will have to test that out in your environment. If you are just interested in testing out the impact of clients blocking when waiting for a thread from the producer pool, then I would highly recommend turning off guaranteed auditing so that you can be sure that you are not waiting for a file lock or i/o.

Here is a simple test plan that you might follow (I’ll assume that auditMaxPool is set to at least the default of 10 and that the machine you are testing on has 10 or fewer cpu’s):

  1. Start with 1 thread and baseline your performance - I would run that thread with at least 100 iterations.
  2. Now run concurrent threads equivalent to the number of cpus on your box and note the performance (again run each thread for 100 iterations - or whatever number you did in your single-threaded baseline). You should see that you are scaling linearly (or at least near linearly) to this point. If not, then I would say there is some other blocking going on in your code (and if you are just trying to test the impact of blocking, then you should probably just be executing an empty flow service with auditing turned on).
  3. Next run a test with 10 concurrent threads (100 iterations each) and again note your performance. At this point, there should be no blocking going on while waiting for a producer thread, since the number of service threads is not greater than the total number of producer threads. Assuming you don’t have a 10 cpu machine, you should notice that the response times are not as good any more.
  4. Now change auditMinPool and auditMaxPool to 1, cycle your IS and re-run steps 1-3. This should give you a good idea how long a service thread will have to wait for a producer thread - all other things being equal.

Oh, BTW, any time you run a test like this, make sure you are the only one executing code on the box (and I don’t just mean the IS instance).

Thanks for the protocol! I would like to try it.

The previous question is on the thread increment of the producer/daemon pool.

For example, assuming you set auditMinPool=5, auditMaxPool=70,
you have 5 threads ready in the producer/daemon pool when you
started your server.
Now you are going to execute 1000 threads concurrently, and the
5 threads in producer/daemon pool are not enough. According to the
document, the server will increase the producer/daemon threads
until to 70 (auditMaxPool=70) to meet your demand.

The question is:
how many threads (increment) will be added to the producer/daemon pool each time (each step)?
In other words, does the server increase the producer/daemon threads
in the pool one step by one step (with the increment) based on its own
estimation, or increase the threads from 5 to 70 at one dash?

Thanks!

---->

Don’t 100% know the answer to that for sure. My best guess is that it would be 1 at a time. This guess is based upon the fact that the pool never shrinks. So, I ask myself, if I was designing it, would I build in functionality to allow the pool to increment by more than one? No. You’re only going to take the hit one time - no need for added complexity. Plus, there should not be much latency in adding a thread to the pool - you’re not waiting for the thread to do it’s work, just for it to be added to the pool. I guess in the end, I don’t see that it matters much…

[FONT=Arial]Here are the questions re-stated and the [COLOR=Blue]answers[COLOR=Black] from webMethods…[/color]

[FONT=Verdana][SIZE=2][/COLOR]1) Does the audit producer pool size (i.e. watt.server.auditMaxPool) govern how many concurrent connections will be used to log data synchronously to the database? If not what governs this besides the JDBC pool settings for IS Core Audit and Process Audit?
Yes[/size][/font]

[SIZE=2][/size][/FONT] 2) Assuming the answer to question (1) is “yes”, then if there are more threads attempting to write than producers, will the threads block until a producer thread becomes available? If not, what will happen?
Yes, It will wait till the thread becomes available.

3) I’m assuming that watt.server.auditMaxFetchSize and watt.server.auditGuaranteed are not relevant when auditSync is true. Is this a correct assumption?
[COLOR=#0000ff]Yes, your assumption is correct.[/color]

[FONT=Arial][SIZE=2][/size][/font] 4) Does watt.server.auditRetryCount still have any effect? (i.e. if I lose connectivity to my database for a few seconds, can this be used to retry the logging of the event to the database?) I realize this does not give me “guaranteed auditing”, but would theoretically give me better QoS than no retries.
Yes, this has nothing to do with temporary store but writing to a configured store. Basically it will retry for the specified number of attempts.

5) Are there any plans to have the audit subsystem write to the local store in the event that the database is unavailable and then drain that store when the database becomes available and revert to the synchronous mechanism? (Much in the same way that the Broker Outbound Store works?) In effect, this would give you guaranteed auditing with synchronous auditing (it would only go asynchronous in the event of a database outage and would then honor the auditGuaranteed setting to determine if it should queue up the events in memory (false) or to a file (true))?
The Feature is currently unavailable and we might consider for future releases.

One is the wM documentation and the other is a performance case.

  1. wM mentioned that “The number of audit daemons that can run concurrently
    is equal to the number of audit producers that can run concurrently…”

This documentation doesn’t mention when/how the number of audit producers can run concurrently.

  1. There is a case reported that when a service is invoked by 1000 clients, for example, each of the first several (10 for example) invocations take longer processing time than the others. The processing was set concurrent. This always happens for the
    first service execution (first lot of 1000 invocations) after the server
    was started/restarted.
    Thereafter the processing time for each invocation (of 2nd, 3rd, …, lot of 1000 invocations) is always
    the samething (lower time consumption), unless the server is restarted.

If the producer/audit pool is filled up to the auditMaxPool one by one,
as you supposed,
it may account for more time consumed for the first 10 invocations.
That’s why we want to konw how and when the audit
producers reach the max.

By the way, the processing time for the first 10 processes is about
7-10 seconds, while the processing time for the later processes is about 2 seconds.

Thanks!

---->

  1. “The number of audit daemons that can run concurrently
    is equal to the number of audit producers that can run concurrently…”
    is simply another way of stating that the settings that control the min / max pool sizes for each of these pools are shared. You’re trying to read too much into this. If max is 10, then you can have 10 audit daemons AND 10 audit producers running concurrently. They are two separate pools the min / max sizes of which are controlled by two “shared” properties.

  2. It sounds like the test that you are running to look at audit producer performance contains other code (I’m basing this on your statement that “the processing time for the later processes is about 2 seconds”). If you want to test audit producer performance, then you need to do your testing against a service that contains NO code (i.e. an empty service that just has the desired level of auditing enabled). Otherwise you are adding a lot more variables into the equation. Even with the empty service test, there could still be some other initialization going on besides the producer pool that may account for latency. In my experience with doing load tests, I have never seen any latency due to the establishment of the producer pool. There is definitely some other source of latency in your code. I think you are on the right track though in assuming that it is some sort of initialization going on.