Dear Cloud API, your fault line is showing

Most APIs are like hospital gowns. They seem to provide good coverage, until you turn around.

I am talking about the dreadful state of fault reporting in remote APIs, from Twitter to Cloud interfaces. They are badly described in the interface documentation and the implementations often don’t even conform to what little is documented.

If, when reading a specification, you get the impression that the “normal” part of the specification is the result of hours of whiteboard debate but that the section that describes the faults is a stream-of-consciousness late-night dump that no-one reviewed, well… you’re most likely right. And this is not only the case for standard-by-committee kind of specifications. Even when the specification is written to match the behavior of an existing implementation, error handling is often incorrectly and incompletely described. In part because developers may not even know what their application returns in all error conditions.

After learning the lessons of SOAP-RPC, programmers are now more willing to acknowledge and understand the on-the-wire messages received and produced. But when it comes to faults, there is still a tendency to throw their hands in the air, write to the application log and then let the stack do whatever it does when an unhandled exception occurs, on-the-wire compliance be damned. If that means sending an HTML error message in response to a request for a JSON payload, so be it. After all, it’s just a fault.

But even if fault messages represent only 0.001% of the messages your application sends, they still represent 85% of those that the client-side developers will look at.

Client developers can’t even reverse-engineer the fault behavior by hitting a reference implementation (whether official or de-facto) the way they do with regular messages. That’s because while you can generate response messages for any successful request, you don’t know what error conditions to simulate. You can’t tell your Cloud provider “please bring down your user account database for five minutes so I can see what faults you really send me when that happens”. Also, when testing against a live application you may get a different fault behavior depending on the time of day. A late-night coder (or a daytime coder in another time zone) might never see the various faults emitted when the application (like Twitter) is over capacity. And yet these will be quite common at peak time (when the coder is busy with his day job… or sleeping).

All these reasons make it even more important to carefully (and accurately) document fault behavior.

The move to REST makes matters even worse, in part because it removes SOAP faults. There's nothing magical about SOAP faults, but at least they force you to think about providing an information payload inside your fault message. Many REST APIs replace that with HTTP error codes, often accompanied by a one-line description whose relationship to the semantics of the application is sometimes unclear. Either it's a standard error code, which by definition is very generic, or it's an application-defined code, at which point it most likely overlaps with one or more standard codes and you don't know when to expect one or the other. Either way, too much faith is put in the HTTP code versus the payload of the error. Let's be realistic. There are very few things most applications can do automatically in response to a fault. Mainly:

  • Ask the user to re-enter credentials (if it’s an authentication/permission issue)
  • Retry (immediately or after some time)
  • Report a problem and fail

So make sure that your HTTP errors support this simple decision tree. Beyond that point, listing a panoply of application-specific error codes looks like an attempt to look “RESTful” by overdoing it. In most cases, application-specific error codes are too detailed for most automated processing and not detailed enough to help the developer understand and correct the issue. I am not against using them but what matters most is the payload data that comes along.
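
To make that decision tree concrete, here is a minimal client-side sketch in Python. The status-code groupings and the helper name are illustrative assumptions of mine, not prescriptions from any particular API:

```python
import time

import requests  # third-party HTTP client, used purely for illustration

# Illustrative mapping of HTTP status codes onto the three actions above.
# The groupings are assumptions about a typical API, not a standard.
REAUTH_CODES = {401, 403}                     # ask the user to re-enter credentials
RETRY_CODES = {408, 429, 500, 502, 503, 504}  # retry, immediately or after some time

def call_with_fault_handling(url, credentials, max_retries=3):
    """Hypothetical helper: issue a GET and walk the decision tree on faults."""
    for attempt in range(max_retries):
        response = requests.get(url, auth=credentials)
        if response.ok:
            return response.json()
        if response.status_code in REAUTH_CODES:
            raise PermissionError("Ask the user to re-enter credentials")
        if response.status_code in RETRY_CODES:
            # honor Retry-After if the server sent one (assumed to be in seconds)
            delay = int(response.headers.get("Retry-After", 2 ** attempt))
            time.sleep(delay)
            continue
        # anything else: report the problem and fail
        raise RuntimeError(f"Request failed ({response.status_code}): {response.text}")
    raise RuntimeError(f"Request to {url} still failing after {max_retries} attempts")
```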

On that aspect, implementations generally fail in one of two extremes. Some of them tell you nothing. For example the payload is a string that just repeats what the documentation says about the error code. Others dump the kitchen sink on you and you get a full stack trace of where the error occurred in the server implementation. The former is justified as a security precaution. The latter as a way to help you debug. More likely, they both just reflect laziness.

In the ideal world, you'd get a detailed error payload telling you exactly which of the input parameters the application choked on and why. Not just vague words like "invalid". Is parameter "foo" invalid for syntactical reasons? Is it invalid because it is inconsistent with another parameter value in the request? Is it invalid because it doesn't match the state on the server side? Realistically, implementations often can't spend too many CPU cycles analyzing errors and generating such detailed reports. That's fine, but then they can include a link to a wiki or a knowledge base where more details are available about the error, its common causes and the workarounds.
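
For illustration, a fault payload along those lines might look like the following sketch. Every field name and the knowledge-base URL are invented; the point is that the payload names the offending parameter, says why it was rejected, and links to more detail:

```python
import json

# Hypothetical detailed fault payload; all field names and the URL are invented.
fault = {
    "code": "InvalidParameterValue",
    "http_status": 400,
    "message": "Parameter 'volume_size' is inconsistent with the selected instance type.",
    "parameter": "volume_size",
    "reason": "inconsistent_with_request",  # vs. "syntax_error" or "conflicts_with_server_state"
    "more_info": "https://example.com/kb/InvalidParameterValue",  # common causes and workarounds
}

print(json.dumps(fault, indent=2))
```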

Your API should document all messages accurately and comprehensively. Faults are messages too.



9 Responses to Dear Cloud API, your fault line is showing

  1. Obviously I agree with this: the faults are the most interesting part of the API. Compared to the "success" response, errors by quantity make up the largest proportion of the distinct responses you can get from a service. Because each individual error is rare, they are also tied to the client-side code that doesn't get tested enough.

    0. Service APIs must provide some machine-parseable response. SOAP faults aren't that bad a design to start from.

    1. Service APIs must document all known errors, with parseable examples.

    2. Mock service endpoints should be provided to simulate failures like asking for some machines and getting your card declined, or asking for 500 machines and getting back 12.
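
    Purely as a toy sketch of what such a mock endpoint could look like (the simulation paths and fault payloads below are invented for illustration, using only Python's standard library):

    ```python
    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Toy mock endpoint: the simulation paths and fault shapes are invented.
    class MockFaultHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path == "/instances?simulate=card_declined":
                self._send_fault(402, "PaymentDeclined", "The card on file was declined.")
            elif self.path == "/instances?simulate=partial_allocation":
                self._send_json(200, {"requested": 500, "allocated": 12})
            else:
                self._send_fault(404, "NotFound", "Unknown simulation scenario.")

        def _send_fault(self, status, code, message):
            self._send_json(status, {"code": code, "message": message})

        def _send_json(self, status, payload):
            body = json.dumps(payload).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("localhost", 8080), MockFaultHandler).serve_forever()
    ```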

    There’s some interesting politics here, because failure mode handling becomes the barrier to moving between apparently identical API implementations; just because Eucalyptus implements the EC2 API doesn’t mean it fails in the same way.

  2. Thanks for the comment Steve. Interesting point on the “fault behavior as a lock-in mechanism”, I hadn’t thought of that. Now you’re going to have me suspect nefarious intent where I only saw laziness before…

    Reading your comment, I remembered the Cloud Tools Manifesto that you wrote a while ago and which contains much of what I describe here. I am pretty sure a lot of my thinking on this traces back to when I read that entry on your blog, and the practical confirmations I’ve seen since.

  3. Tom Maguire

    I absolutely agree that failure modes and behavior are a point of tight coupling… some of the most insidious kind…

  4. One problem here is that it is really hard to specify in advance what the failure modes will be, as they are often implementation issues: if you change the back end, the errors change. What would be good is for people to list their existing errors, and to have an error format in which errors have some unique ID (like a URI), so that other people implementing the API can re-use those errors or insert new ones. The original SOAP Fault, not the 1.2 version, is a good starting point; now we need a JsonFault that is similar.
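
    No such "JsonFault" standard exists today; purely to illustrate the idea of URI-based error IDs, modeled loosely on the SOAP 1.1 faultcode/faultstring/detail structure, a payload might look like this (field names and URI invented):

    ```python
    import json

    # Illustrative only: no "JsonFault" standard exists; names loosely mirror SOAP 1.1.
    json_fault = {
        "faultcode": "https://api.example.com/faults/QuotaExceeded",  # unique, re-usable error ID
        "faultstring": "The requested number of instances exceeds your quota.",
        "detail": {"requested": 500, "quota": 20},
    }

    print(json.dumps(json_fault, indent=2))
    ```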

  5. Error responses should always carry a message body and that message body should be in the media-type expected by the client (JSON, HTML, XML, CSV, etc.).

    The error message body is an important _addition_ to the HTTP response code and can carry any interesting data API designers consider important, including a link to another resource that can carry even more details (whether a generic help message or detailed real-time logging data).

    In a recent blog post (http://amundsen.com/blog/archives/1054) I discussed a generic version of an error message that can be easily adapted to any media-type.

    This same approach is covered in better detail in the “RESTful Web Services Cookbook” by Subbu Allamaraju.

  6. “Error responses should always carry a message body and that message body should be in the media-type expected by the client (JSON, HTML, XML, CSV, etc.).”

    That's a big ask. If you've got a web-server passing a request to an app-server, and the app-server dies, then how can the web-server have any idea what format that URL call is expected to return?

    What if the web service offers a response in either XML or JSON, depending on a parameter, and the client specifies an invalid parameter?

    And that's assuming the client request even makes it to the requested web-server. What about the network being down at the client end?

    I would say that, as a client calling a web service, you need to first cater for all the HTTP errors, then a generic "does the response conform to an expected format?" check, and then any specified error response structure.
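
    A minimal sketch of that layered approach, assuming a JSON API and using invented field names:

    ```python
    import requests  # illustrative; any HTTP client would do

    def interpret_response(response):
        """Hypothetical layered handling: HTTP errors first, then format, then error structure."""
        # Layer 1: plain HTTP-level failures, whatever the body looks like
        if response.status_code >= 500:
            return ("server_error", None)

        # Layer 2: does the response conform to the expected format at all?
        try:
            body = response.json()
        except ValueError:
            return ("unexpected_format", response.text[:200])

        # Layer 3: the API's own documented error structure (field name assumed)
        if response.status_code >= 400:
            return ("api_fault", body.get("code", "unknown"))

        return ("ok", body)
    ```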

  7. Thanks for the good comment Gary. Though I would say that if the network is down at the client side there is nothing the API spec or the server implementation can do to help you. But I guess the underlying point you are making is that the client needs to be ready to handle not-in-the-API-spec faults anyway. Which is indeed true.

    Interestingly, soon after Mike posted his comment he and I had a little back-and-forth on Twitter that was along the same lines as your comment. See: me, him, me, him.

    As you can see, his solution to the unclear/unsuitable media type is to use a link header w/ rel=”error”.

  8. Gary/William:

    Making additional error information available to clients is not a big ask. Sure, sometimes it is impossible for the server to supply that info (e.g. the server doesn’t get the request, intermediary strips it out, the server dies) and sometimes the client will ignore the response (e.g. client doesn’t recognize the format, client dies, etc.). But these are exception cases that are handled in the normal course of things. HTTP is, by design, an unreliable protocol.

    I used the word “should” (not “must”) and, if it makes folks feel better, I’d be happy to adjust that to “Whenever reasonably possible servers should…” as it doesn’t change the state of affairs at all.

    Now, I will grant you some Web frameworks make it hard for developers to inject meaningful content into error response bodies, but that’s a matter for the framework folks, not the protocol.

    Finally, as William already pointed out, using a Link header with rel="error" is a fine way to sidestep lots of payload issues and still provide clients additional helpful information.
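
    For example, a client could follow such a link when the error body itself is not usable. A minimal sketch (the URL is invented; requests' built-in parsing of the Link header is used for convenience):

    ```python
    import requests  # illustrative client-side sketch; the URL below is invented

    response = requests.get("https://api.example.com/instances/42")
    if not response.ok:
        # requests parses the Link header into a dict keyed by rel
        error_link = response.links.get("error", {}).get("url")
        if error_link:
            details = requests.get(error_link, headers={"Accept": "application/json"})
            print(details.json())
    ```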

  9. Pingback: William Vambenepe — Cloud APIs are like military parades