We've run into some trouble while publishing documents to the broker.
First off, the IS loses the connection to the broker (at least we think so).
We’re starting to see the following entries in the server log:
The broker and the IS are on the same network (no FW in between), so I'm a bit puzzled.
Second - publishing suddenly takes time, more than 30s per publish (even when delayUntilServiceSuccess=false), which makes the server get overloaded due to the increasing number of threads.
This happens very sporadically (it began a few weeks ago) and I still haven't found any reason for it… yet.
I'm not sure how to address this problem or even where to look for clues.
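For reference, this is roughly how the publish is invoked from a Java service (a simplified sketch - the document type and fields are just placeholders, the real service does a bit more):

import com.wm.app.b2b.server.Service;
import com.wm.data.IData;
import com.wm.data.IDataCursor;
import com.wm.data.IDataFactory;
import com.wm.data.IDataUtil;

public static final void publishOrder(IData pipeline) throws Exception {
    // Build the document to publish (structure is just an example)
    IData doc = IDataFactory.create();
    IDataCursor dc = doc.getCursor();
    IDataUtil.put(dc, "orderId", "12345");
    dc.destroy();

    // Input for the built-in pub.publish:publish service
    IData input = IDataFactory.create();
    IDataCursor ic = input.getCursor();
    IDataUtil.put(ic, "documentTypeName", "orders.docs:Order"); // placeholder doc type
    IDataUtil.put(ic, "document", doc);
    IDataUtil.put(ic, "delayUntilServiceSuccess", "false");
    ic.destroy();

    Service.doInvoke("pub.publish", "publish", input);
}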
Any ideas anyone?
Not much activity here… but I'll post the solution anyway.
After spending many, many hours diagnosing the network without any success, we started to look elsewhere - and finally we found the cause.
The problem was not a network problem, but a SAN problem.
The part of the SAN where the broker stored its documents sometimes got exhausted, which caused the broker to stall and close all the connections.
The logging and error handling in the broker should really be improved (I've added a request in brainpower). If it can't read/write something to disk, it should throw some really nasty exception you shouldn't be able to miss.
It's great that you have solved your problem. I came across this passage in the IS Administration Guide before. Hope it is helpful to you.
Setting the Capacity of the Outbound Document Store
By default, the outbound document store can contain a maximum of 500,000 documents. After the outbound document store reaches capacity, the server “blocks” or “pauses” any threads that are executing services that publish documents. The threads remain blocked until the server begins draining the outbound document store.
The watt.server.control.maxPersist server parameter determines the capacity of the outbound document store. If you plan to bring the Broker down for an extended time period, consider editing this parameter to lower the capacity of the outbound document store. If you keep the outbound document store at the default capacity, and the Broker becomes unavailable, it is possible that storing outbound documents could exhaust memory and cause the server to fail. If the outbound document store has a lower capacity, the server will block threads instead of continuing to use memory by storing documents.
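For example, lowering the capacity is just a matter of changing that extended setting (the value here is only an illustration - pick whatever your environment can actually hold):

watt.server.control.maxPersist=100000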
The outbound document store stays with the IS, not the broker.
The outbound document store contains guaranteed documents published by the server when the configured Broker is not available. After the connection to the Broker is reestablished, the server sends the documents in the outbound document store to the Broker.
Our problem was not related to memory or the number of stored documents, but to the fact that the connection went up and down…
When the IS can't talk to the broker, every publish gets “stuck” until the connection is broken (which causes a huge peak in the thread count).
When the connection is broken - everything works well, the documents are placed in the outbound document store.
When a new connection is established (IS<->BROKER) the broker must drain the documents from the outbound document store. This takes some time and during that period it takes approx. 5-10s to publish a document - which again causes the thread count to rise very quickly.
So - to summarize:
No reply from broker: Slow publish until the connection is really broken.
Broker is draining: Slow publish until “drain” is finished.
Assuming the environment can handle a lot of documents in the outbound document store, it’s actually better to be disconnected for a while than having the connection go up and down (makes the system very unstable due to thread thrashing).
We never applied any fixes - we only tweaked the configuration a bit.
Here are our settings
watt.server.brokerTransport.dur=60
watt.server.brokerTransport.max=60
watt.server.brokerTransport.ret=3
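(If I remember the IS Administrator's guide correctly, these are the keep-alive settings for the IS<->broker connection - idle duration, maximum wait and number of retries - so a dead connection gets detected and dropped a lot faster instead of every publish hanging.)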
You can never guarantee that the broker will be available, and if you need 100% uptime you might want to think about some sort of clustering to keep the downtime to a minimum.
Which version of the IS are you using?
In our new 8.2 environment we skipped native publishing and used JMS instead, which gives you a lot more flexibility and control (and a bit more overhead when it comes to configuring it). We set up an IS cluster and a broker cluster with JNDI failover and used LDAP as the directory service. If a broker goes down, the client (IS) just tries the next JNDI provider in the failover list.
However - there were a lot of bugs - it seems the JMS implementation was far from mature, and we ended up reporting loads of bugs to SAG and spent months on load testing before everything finally worked as it should. So, I wouldn't recommend going the JMS route unless you are on IS 8.2.
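For anyone curious, the publishing path in that kind of setup looks roughly like this as plain JMS/JNDI code (just a sketch - the provider URLs, connection factory and destination names below are made-up examples, not our real configuration):

import java.util.Hashtable;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Destination;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.naming.Context;
import javax.naming.InitialContext;

public class JmsPublishSketch {
    public static void main(String[] args) throws Exception {
        // Look up the administered objects via LDAP-backed JNDI.
        // Listing several provider URLs gives simple failover: if the first
        // one can't be reached, the next one in the list is tried.
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL,
                "ldap://ldap1.example.com:389/o=jms ldap://ldap2.example.com:389/o=jms");
        Context ctx = new InitialContext(env);

        ConnectionFactory cf = (ConnectionFactory) ctx.lookup("cn=brokerConnectionFactory");
        Destination dest = (Destination) ctx.lookup("cn=orderTopic");

        // Standard JMS publish.
        Connection con = cf.createConnection();
        Session session = con.createSession(false, Session.AUTO_ACKNOWLEDGE);
        MessageProducer producer = session.createProducer(dest);
        producer.send(session.createTextMessage("<order id=\"12345\"/>"));

        producer.close();
        session.close();
        con.close();
        ctx.close();
    }
}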