0074 0074 WAIT Timeout stops RPC server from working

We have EntireX 7.3 installed on HP-UX 11i, with a Broker and 3 Natural RPC servers running. We use the Java wrapper ACI to make calls from our web application to our Natural subprograms. We sometimes receive error class 0074, error code 0074 “WAIT Timeout”. I realize that this is a normal error to receive if the call takes too long. However, what seems to be happening is that once we receive the WAIT Timeout error, that instance of the RPC server stops responding. The UNIX process appears to still be out there, but we cannot make calls to that RPC server. This is a serious problem, because once all of our running RPC server instances have been “killed” in this manner, all users are prevented from accessing the system.

Does anyone have any insight into this problem?

Eddie


Hi Eddie,

I think it is actually the other way round: for some reason the RPC servers stop handling requests from clients, and this results in the 0074 0074 on the client side.

You probably need to turn on tracing/logging in the RPC servers to see what is happening.

OK, we’ve got the RPC trace log turned up. I can see that when the subprogram was called, we were getting the following error:

*** Execute callnat… RPS541N1 at 11:44:09
M RPS541N1 1110 NAT1101 The specified maximum page count has been exceeded.
0 SERVER: RPC100

However, I’m not sure whether this error occurs just before the WAIT Timeout is received, or whether it is unrelated.

The programmer has since modified the program. They stated that there was a REPEAT loop that was never being escaped.

I’m going to try to reproduce this with a simple program that loops continuously.

Eddie

If the program loops continuously, it can cause a fatal error. Global exceptions in Natural (CPU limit exceeded, call limit, loop limit, etc.) may cause the Natural RPC Server to terminate: to the batch Natural process, the RPC process is largely just another Natural program, and it is governed by most of the same rules as any other Natural program.

And since the Natural RPC Server (unlike the EntireX RPC Servers) does not have any dynamic start/stop/restart of servers, terminating the process terminates the server.

The specific condition you are encountering is that the Broker has not heard from the server within the WAIT timeout period, causing Broker to return a 0074/0074 message to the client. However, since your server process is still running (looping uselessly), it has not terminated yet. If you compare the time of the NAT1101 error with the time the CALLNAT was started, you will likely find that the difference exceeded the wait timeout. I would guess that the WAIT timeout was exceeded while the REPEAT loop was running and that the NAT1101 was reported some time after.
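The timestamp comparison suggested here can be sketched in a few lines of Python. This is only an illustration: `exceeded_wait` is a hypothetical helper, 11:44:09 is the CALLNAT time from the trace posted earlier, and both the NAT1101 time and the 60-second wait timeout are made-up values, since neither appears in the thread.

```python
from datetime import datetime, timedelta


def exceeded_wait(callnat_start: str, error_time: str, wait_seconds: int) -> bool:
    """Return True if the gap between two HH:MM:SS log stamps exceeds the wait timeout."""
    fmt = "%H:%M:%S"
    start = datetime.strptime(callnat_start, fmt)
    end = datetime.strptime(error_time, fmt)
    if end < start:
        # The log stamps carry no date, so assume the error fell on the next day.
        end += timedelta(days=1)
    return (end - start).total_seconds() > wait_seconds


# Assumed values: NAT1101 logged at 11:46:30, Broker wait timeout of 60 seconds.
print(exceeded_wait("11:44:09", "11:46:30", 60))  # True (141 s elapsed)
```

If the result is True for the real values, the client would already have received its 0074 0074 by the time the server logged NAT1101.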

From what you report, it would appear that the RPC Server was able to continue following the NAT1101 message. If you want to free up the RPC Server sooner, set your maximum limits (CPU, database calls, page count, loop limit, etc.) lower so that the limit error is raised sooner.

Thanks Doug. It seems like exactly what you describe was happening. The subprogram loops continuously, and we receive the 0074 0074 on the client. Meanwhile, the RPC server process is still out there looping. Eventually, it does seem to cause a fatal error, as after a while we get 0007 0007 Service not registered; the RPC server is gone at this point.
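For anyone handling these two errors on the client side, the distinction described in this thread could be captured roughly as below. This is a sketch only: `describe_broker_error` is a hypothetical helper, not part of the EntireX API, and it covers only the two class/code pairs discussed here.

```python
def describe_broker_error(error_class: str, error_code: str) -> str:
    """Map the two Broker errors from this thread to what they implied here."""
    key = (error_class, error_code)
    if key == ("0074", "0074"):
        # Broker got no reply in time; the server process may still be running.
        return "WAIT timeout: no reply within the wait time, server may still be busy"
    if key == ("0007", "0007"):
        # No server is registered for the service any more.
        return "Service not registered: the RPC server is gone and must be restarted"
    return "Other error: check the Broker and RPC server logs"


print(describe_broker_error("0074", "0074"))
print(describe_broker_error("0007", "0007"))
```

Treating the two cases differently matters because a 0074 0074 may clear on its own once the server hits one of its limits, while a 0007 0007 will not.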

Should we just call this one ‘programmer error’?