We are exploring using API Gateway to send alerts (emails and gateway transaction events) whenever performance conditions are violated. In our research, we identified the “Monitor Service Performance” policy, which almost meets our requirements.
The key configuration we are interested in is the Fault Count, which indicates the number of faults returned within the current interval. The challenge is that API Gateway treats any response with an HTTP status code greater than or equal to 400 as a fault, whereas our service implementation does not consider a 404 status code an error.
Is there a way to exclude the 404 status code from the Fault Count of the “Monitor Service Performance” policy? That would let us set up the alert system without generating unnecessary notifications for 404 responses, which our service implementation does not treat as errors.
4xx codes indicate consumer-side issues (bad request, unknown resource, duplicate entry, etc.). Do you really want to monitor metrics that reflect API consumer behavior rather than your own performance?
If what you want is to monitor your server (and backend) performance, you should focus on 5xx errors.
Yes, we do need to monitor certain 4xx error codes. For example, we depend on a provider service that occasionally takes too long to respond, sometimes up to 5 seconds. In these cases, our gateway layer returns a 408 Timeout error, which is an important signal for us to be aware of.
Additionally, if we notice that a consumer is calling our gateway endpoint improperly, resulting in 400 Bad Request errors, we need to communicate this to them. Our goal is not to limit the consumer’s rate, since that would fall outside our agreed SLA; rather, we want to make the consumer aware of the problematic behavior so they can address it on their end.
In this case you could use a custom extension in the response processing stage. That would work with an on-premises gateway, but from an architectural standpoint it isn’t ideal.
Alternatively, you could send the traffic data to an external destination (for instance Elasticsearch or Kafka) and apply the analytics logic there. That gives you full flexibility to analyze your API traffic, detect patterns, and issue notifications when needed.
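To illustrate the external-analytics option, here is a minimal Python sketch of per-interval logic you might run on the consuming side. It assumes a hypothetical, simplified TrafficEvent record (consumer identifier plus HTTP status); the actual transaction records API Gateway publishes to Elasticsearch or Kafka have their own schema, and the thresholds are placeholders. The function evaluate_interval is illustrative only: it reproduces the “status >= 400 is a fault” rule but lets you exclude 404, watch specific codes such as 408, and count 400s per consumer so you know whom to contact.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TrafficEvent:
    # Hypothetical, simplified shape of one transaction record as it might
    # arrive from an external sink (e.g. a Kafka topic or Elasticsearch index).
    consumer: str
    status: int


def evaluate_interval(
    events: Iterable[TrafficEvent],
    excluded: frozenset[int] = frozenset({404}),
    fault_threshold: int = 10,
    status_thresholds: dict[int, int] | None = None,
    per_consumer_400_threshold: int = 5,
) -> list[str]:
    """Return alert messages for one monitoring interval (illustrative only)."""
    status_thresholds = status_thresholds or {}
    faults = 0
    by_status: Counter[int] = Counter()
    bad_requests: Counter[str] = Counter()

    for event in events:
        # Same rule as the gateway (status >= 400 is a fault), minus exclusions.
        if event.status >= 400 and event.status not in excluded:
            faults += 1
        by_status[event.status] += 1
        # Track 400s per consumer so the offending consumer can be notified.
        if event.status == 400:
            bad_requests[event.consumer] += 1

    alerts: list[str] = []
    if faults >= fault_threshold:
        alerts.append(f"fault count {faults} >= {fault_threshold}")
    # Individually watched codes (e.g. 408 from a slow provider).
    for status, limit in sorted(status_thresholds.items()):
        if by_status[status] >= limit:
            alerts.append(f"status {status} seen {by_status[status]} times")
    for consumer, count in sorted(bad_requests.items()):
        if count >= per_consumer_400_threshold:
            alerts.append(f"consumer {consumer}: {count} bad requests")
    return alerts
```

For example, with twenty 404s from one consumer, six 400s from another and three 408s in the same interval, calling evaluate_interval(events, fault_threshold=10, status_thresholds={408: 3}) leaves the fault count at 9 because the 404s are excluded, so only the 408 alert and the per-consumer 400 alert fire. Passing excluded=frozenset() restores the gateway’s behavior and the fault alert triggers with a count of 29.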
How 404 should be used is a frequently discussed topic on the web; there is plenty of material to peruse.
But strictly speaking, 404 is an error.
4xx codes tell a user agent (UA) that it did something wrong: the request it constructed isn’t proper, and it shouldn’t try again without at least some modification.
For your service implementations, whether 404 is appropriate depends on the semantics of the request.
If the URL requests a specific resource, such as /widgets/2345, and 2345 doesn’t exist, 404 is the right thing to return. Why is the caller asking for something that doesn’t exist? Where did they get 2345 from? From a monitoring perspective, you likely want to know about these.
If the URL requests a “search” result, such as /widgets, /widgets?id=2345 or /widgets?name="foo*" (assuming support for wildcards), then 404 is not the right thing to return; it should be 200 with an empty set. The fact that a query (searching a set) is logically intended to return just one entry (via the id parameter) is largely immaterial. All queries for widgets should return a (possibly empty) set.
Of course, the details of the resources being exposed matter. If you can share more about your scenarios, there may be additional guidelines to consider.
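A minimal Python sketch of the distinction above, assuming a hypothetical in-memory widget store (WIDGETS) and a simplified handle_get function that returns a (status, JSON body) pair; neither belongs to any real framework, and the wildcard is written without the quotes for simplicity (name=foo*). A lookup of a specific resource that doesn’t exist yields 404, while a search over the collection that matches nothing yields 200 with an empty list.

```python
from __future__ import annotations

import json
from fnmatch import fnmatchcase
from urllib.parse import parse_qs, urlsplit

# Hypothetical in-memory widget store, keyed by id.
WIDGETS = {
    "1001": {"id": "1001", "name": "foo-bar"},
    "1002": {"id": "1002", "name": "baz"},
}


def handle_get(url: str) -> tuple[int, str]:
    """Return (HTTP status, JSON body) for a GET on the widgets API."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]

    # /widgets/{id}: a specific resource, so a missing id is a genuine 404.
    if len(segments) == 2 and segments[0] == "widgets":
        widget = WIDGETS.get(segments[1])
        if widget is None:
            return 404, json.dumps({"error": "widget not found"})
        return 200, json.dumps(widget)

    # /widgets?...: a search over the collection, so no match is an empty 200.
    if segments == ["widgets"]:
        query = parse_qs(parts.query)
        results = list(WIDGETS.values())
        if "id" in query:
            results = [w for w in results if w["id"] in query["id"]]
        if "name" in query:
            # Shell-style wildcard matching, e.g. name=foo*
            pattern = query["name"][0]
            results = [w for w in results if fnmatchcase(w["name"], pattern)]
        return 200, json.dumps(results)

    # Any other path does not identify a known resource.
    return 404, json.dumps({"error": "unknown path"})
```

With this split, /widgets/2345 returns 404 (worth surfacing in monitoring), whereas /widgets?id=2345 returns 200 with [], so only genuinely wrong requests count toward a fault metric.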