Hi all - For the last year or so, we’ve had problems on and off with Trading Networks document deliveries not getting out. On our IS/TN 4.6 server, TN deliveries sometimes stay in the PENDING state – the only reliable fix is a server restart. Does anyone have any suggestions?
Here’s an example: I restarted my server last night. This morning, I found 75 (and the number is growing) document delivery tasks stuck in the PENDING state. The earliest of the PENDING deliveries was from 6:25 AM.
My ‘Server > Service Usage’ screen shows me these two tasks have been running since 6:10 in the morning:
– pub.client:http (1) 403 4/3/03 6:10 AM - -
– wm.tn.transport:primaryHttps (1) 286 4/3/03 6:10 AM - -
So it looks like these two tasks hung at 6:10 and caused deliveries after them (6:25 and onwards) to get stuck in the PENDING state.
Funnily enough, the delivery at 6:10 was successful – it got an HTTP 200 from the remote server and its delivery status is DONE. However, its tasks are still hanging around.
Has anyone else come across such a problem? We spoke to webMethods and they gave us TNS FIX 45. However, this patch seems to break delivery retries completely – so we haven’t installed it on the server mentioned above.
We recently experienced a similar problem, but (unfortunately for you) found that TN SP1 and Fix 45 fixed it.
When we experienced the problem, we found it helped to manually stop all pending tasks (through TN Console tasks) and then reprocess each task in turn. Some time later the problem would occur again and we would have to repeat the above.
This is pretty much as described in Fix 45 :
- (Trax:1-9VN0D) If one of the running task is hung, all
subsequent tasks get stuck in PENDING status.
This problem only occurred on the DEV instance (thank goodness), not TEST and PROD, even though the other systems are at the same version of TN.
If you have these same symptoms you might want to check why FIX 45 caused you more problems.
Kevin - The problem we found with FIX 45 was: if (and only if) a delivery failed, subsequent retries of that delivery got an empty bizdoc.
Did you see anything like this happen with FIX 45?
I’ve found something very interesting regarding the hung delivery from this morning. The delivery previous to it (delivered last night) still seems to have its TCP/IP connection in the ESTABLISHED state at the OS level of our Linux server (as shown by the ‘netstat -vn’ command).
Both deliveries - the one last night and the one this morning - were successfully delivered according to TN. So I wonder why TN is not cleaning up the TCP/IP socket from last night.
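As an aside, for anyone who wants to watch for these leaked sockets without eyeballing netstat each time, here is a minimal sketch that filters ‘netstat -vn’-style output for ESTABLISHED connections to a given remote port. The addresses and ports in the sample are made up; any pair that shows up run after run is a candidate for a leaked delivery connection.

```python
def stuck_connections(netstat_output, remote_port):
    """Return (local, remote) address pairs of ESTABLISHED TCP sockets
    whose remote endpoint is on remote_port.

    Parses `netstat -vn`-style lines; a connection that appears in
    run after run of this filter is a candidate leaked socket.
    """
    hits = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Typical line: tcp 0 0 10.0.0.5:42120 192.168.1.9:443 ESTABLISHED
        if (len(fields) >= 6 and fields[0].startswith("tcp")
                and fields[5] == "ESTABLISHED"):
            local, remote = fields[3], fields[4]
            if remote.rsplit(":", 1)[-1] == str(remote_port):
                hits.append((local, remote))
    return hits

# Made-up sample output for illustration:
sample = """\
tcp        0      0 10.0.0.5:42120   192.168.1.9:443   ESTABLISHED
tcp        0      0 10.0.0.5:42121   192.168.1.9:443   TIME_WAIT
tcp        0      0 10.0.0.5:55310   172.16.2.2:5555   ESTABLISHED
"""
print(stuck_connections(sample, 443))
```

In practice you’d feed it live netstat output and diff successive runs to spot connections that never go away.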
The TN PENDING task problem has been a haunting issue for us too.
We could not do manual task stops, as the pending task count was very large.
At one instance, I also had the same observation regarding the pub.client:http and wm.tn.transport:primaryHttps.
I would like to list my observations in case you find something interesting.
FYI, we are using WMTN 4.6 on a Windows 2000 server…using TN reliable delivery.
1) The pub.client:http and wm.tn.transport:primaryHttps services were hung. The pending task count was around 130.
2) I restarted the WM server. As expected, the reliable delivery thread started clearing these tasks.
3) To my surprise, Service Usage still showed these two services as hung. Meanwhile, all 130 tasks got cleared, but at the same time all the new tasks went into the PENDING state.
TN was still holding on to the TCP/IP socket.
4) A complete machine restart, however, cleared all the PENDING tasks.
An additional observation: the time when these two services got hung coincided with the time when one of our partner systems was shut down abruptly. Maybe that was when TN attempted to deliver documents.
In any case, we would expect TN to not hold on to these TCP/IP sockets.
I wouldn’t say that this is the only scenario when TN delivery tasks would go in a pending state.
Apart from this, there were numerous other occasions when TN delivery tasks went into the PENDING state.
All these issues were pointed out to WM support, and in response we got TN SP1 about a month later, in Dec ’02.
We have still seen a few occurrences of this problem even after installing TN SP1.
As you said, the only way out is a restart!
Hi Sindhu - Thank you for sharing your experience.
> This time coincided with the time when one of our
> partner system was shut down abruptly
Our experience today mirrors yours - I contacted the partner whose server (it’s webMethods as well - a SAP BC server) had triggered the situation today. He found that:
- His server was “comatose” most of today.
- He restarted the server a few hours ago by rebooting its Windows machine.
However, netstat on my server still showed the connection as ESTABLISHED (3 hours past his server’s restart).
In other words, WM should be terminating the TCP/IP connection (especially since the problem delivery had received an HTTP 200/OK from the ‘comatose’ server). But WM isn’t doing so.
So I shared this with my webMethods support person, and he’s going to push the following point with PD: “If the connection has become idle, the application should destroy the existing outbound connection.”
It’s all painful - I had to spend the best part of today working on this bug.
Hope that they give a fix for this very soon.
If you manage to get rid of this problem, do share on this forum.
all the best!
> If you manage to get rid of this problem, do share on this forum.
Thanks Sindhu - I sure will.
It’s not a permanent solution, but we’ve found that if you do a nightly restart of the WM server, the PENDINGs seem to happen a lot less. Hopefully support will find an actual fix that works.
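For what it’s worth, a nightly bounce can be automated with cron. The script path below is purely hypothetical (IS 4.6 ships bin/server.sh to start the server, but a clean stop typically goes through the Administrator, so you’d wrap your own restart logic in a script):

```
# crontab entry: restart IS at 03:00 every night
# (path and script name are assumptions -- supply your own wrapper)
0 3 * * * /opt/wm/bin/restart_is.sh >> /var/log/is_restart.log 2>&1
```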
I think I have seen the problem with the empty bizdocs showing up in services. In our case, this was not in a delivery task but when the processing rule was invoking a service it would get empty bizdocs. This started happening when we installed TN SP1. We were told by webMethods that this only happens when the service is called asynchronously and with large documents.
They gave us TN Fix 51 and that solved the problem.
We also have seen delivery tasks in PENDING status, but fix 45 solved it for us. The only difference in our case is that we have a custom delivery task in which we use pub.client:http to send the document. And as our custom delivery service calls wm.tn.doc:view as a first step, we do not get empty bizdocs.
That is the workaround: create a custom delivery service that calls wm.tn.doc:view with the bizdoc ID. For some weird reason, even in environments where the empty bizdocs show up, the wm.tn.doc:view service runs fine.
Thanks Mike - The server restart seems to clean up the hung connections that cause the deliveries behind them to pile up in the PENDING state.
FIX 45 seems to be really a workaround, not a fix, since it allows deliveries to “bypass” the hung connections (which is good) but does not fix the root problem (the connections hanging).
Rupinder - it’s interesting that FIX 51 solved your problem with empty bizdocs in processing rules. We ran across this problem 2 years back (TN 4, maybe TN 3.6 (?) - I’m not sure). WM could not fix it, so we’ve run all our processing rules synchronously since then. It’s sad if it took WM 2 years to fix async processing rules. And it’s weird how wm.tn.doc:view makes the empty bizdocs problem disappear.
I’ve asked my SR person about using FIX 51 to fix our empty bizdoc problem in delivery services. He’s going to check with PD. In the meantime he recommended I set my watt.net.timeout parameter to time out network connections after 5 minutes (it’s set to 0 right now)… I’m in two minds about this since I can’t duplicate this problem in testing.
See this mail from Pam - a WM customer on another mailing list - PENDING deliveries is indeed a ‘hot’ issue for them as well.
Can anyone share values they’re using for the watt.net.timeout server.cnf parameter ?
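For reference, the parameter is just a key/value line in server.cnf (in the IS config directory). The value 300 below is the one WM support suggested to me; per their “5 minutes” remark the units appear to be seconds, but confirm that against your release before relying on it:

```
watt.net.timeout=300
```

A value of 0 (our current setting) means connections never time out, which is exactly what leaves the hung sockets around.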
Our problem was mainly related to the content of large documents not showing up in an asynchronous call. The regular documents were going fine. So I guess webMethods fixed it for regular documents, but it showed up again with large documents. And it only happens if you install TN SP1. So this is indeed a new feature…
I had traced this document and found that the culprit was the wm.tn.doc:getContentPartData service which is used by delivery services indirectly through wm.tn.doc:getDeliveryContent. But wm.tn.doc:view does not use that service to retrieve the doc contents. That is why it works.
We have recently been having a problem - that may have some similar roots.
In some cases, we have partner servers deliver content to our hub via GD.
This has been going on for a few months. Only recently have we seen some transactions hang.
We see the invoke on the HUB complete successfully, the content is delivered, the subsequent synchronous calls complete, and a 200 OK is logged on the HUB audit log.
The PARTNER, however, does not receive the response from the HUB. The GD transaction does not complete successfully.
Snooping the TCP conversation indicates that the TCP conversation has not been completed. Looks like the TCP connection remains open.
We have one client who posts one transaction at a time (fairly small data size). They only post the next trans if they receive a 200 ok from the partner server. The process works fine for hours at a stretch - and then decides to hang!
We have another client who posts one large batch of transactions. This now always seems to hang - when sending files of similar size (approx 1.5 M). This used to work fine.
We have tried changing the HUB keep-alive timeout setting (30 s) and rebuilding the repository - to no avail.
Any thoughts would be appreciated.
> Snooping the TCP conversation indicates that the TCP
> conversation has not been completed. Looks like the TCP
> connection remains open.
That’s what happened here as well. In fact, the TCP/IP connection stayed in the ESTABLISHED state (as shown by ‘netstat -vn’) even after our remote partner claimed to have rebooted their server (a SAP BC server). I tried using tcpdump, but it logged no packets.
WM support suggested setting watt.net.timeout=300. I tried that, but it caused a problem. It’s like this: our TN DB goes offline each night for backup. Since watt.net.timeout was 0 (don’t time out), we never had to reconnect manually. But after we set the timeout to 300, the TN-DB connection stays kaput after Oracle comes back up, forcing a manual server restart to reinitialize it.
> I had traced this document and found that the culprit was
> the wm.tn.doc:getContentPartData service which is used
> by delivery services indirectly through
> wm.tn.doc:getDeliveryContent. But wm.tn.doc:view does
> not use that service to retrieve the doc contents. That is
> why it works.
Thanks Rupinder. FIX 51 may still apply to our empty bizdoc problem with delivery services.
Also, WM recently released TNS FIX 41 for a race-condition bug in duplicate detection. This suggests WM have done some risky things under the hood for performance, right? Does anyone know if 6.0.1 is a rewrite?
> WM support suggested setting watt.net.timeout=300. I tried
> that, but it caused a problem. It’s like this: our TN DB
> goes offline each night for backup. So far, since
> watt.net.timeout was 0 (don’t timeout), we never had to
> reconnect manually. But after we set timeout to 300, the
> TN-DB stays kaput after Oracle came back up, forcing a
> manual server restart to reinit it.
You don’t have to restart the server; you just have to open up the database adapter and connect to a database. This restores the database connectivity!
> you just have to open up the database adapter and
> connect to a database.
Can you tell me how that is done in webMethods Trading Networks 4.6? Does this operation go through the Merant proxy?
Normally, webMethods shuts down the database adapter if a timeout is specified and the server goes down. All we do is manually connect from the database adapter in webMethods for one DB, and all the DB connections come back up.
We’ve experienced similar errors with PENDING. Our scenario is this:

If a doc fails, we have a scheduled service that retries it in 2 hours. At that point the status is updated to PENDING (from FAILED), and when the doc is resent it is marked as DONE.

However, it seems the task status is stored in more than one place. If you look at the Task Status tab in the analysis screen of the TN Console, the status is still PENDING; but if we open that record up, the task status is DONE. So the document was sent on the second retry (during the scheduled resend of failed documents), yet it’s listed as PENDING in the deliveryjob table’s task status and DONE elsewhere (I’ve asked wM about this, but we still haven’t got an answer as to where this other task status is stored).

So when we restart our server, it looks at the field that contains PENDING (in the db) and resends a document that has already been sent. Anybody else with this problem?
Hi Brian -
> If you look at the Task status tab in the analysis screen of
> the TN Console we see that the status is still PENDING
I’ve seen caching bugs in the tasks pane in TN Console. Usually a “File > Restore Session” operation in TN Console fixes it – do you still get this discrepancy after a “Restore Session”?
Currently, we still have FIX 45 in testing. However, our frequency of PENDING requests has gone down in production since we modified the backend to not send out documents that were failing validation in IS. My theory-of-the-week is that the failed validation of these documents was stressing IS, and that was causing the PENDING errors.
Has anyone ever processed the response from a post to a client? What I am doing: wM 6.01 routes my XML to TN, which in turn delivers the XML to the client. The client then sets the response (HTTP request/response) with XML data. Is there a way for me to get at that response and process that XML data? I have defined the doc type and a processing rule to act on the doc type the client sends in the response, but for some reason TN has no entry for the response.