I was having a conversation with a co-worker, where I complained that the biggest problem with some of the code we’re working on is that it was over-engineered. I reminded him of Gall’s Law:
“A complex system that works is invariably found to have evolved from a simple system that worked. The inverse proposition also appears to be true: A complex system designed from scratch never works and cannot be made to work. You have to start over, beginning with a working simple system.”
He asked me a question about client-server interaction, and rather than shooting off a 140 character reply it seems to make sense to answer the question more fully here.
To summarize, we are building an application that communicates to a remote server using JSON over HTTP. The remote system is built on top of Drupal. (Honestly it’s not how I would have done it, because there seems to be too many moving parts already–but the folks who built the server assure me that Drupal is God, and who am I to complain?)
Now on an error there are several things that can go wrong. The remote server can return a 4xx or a 5xx HTTP error. It can return a JSON error. Or it could return an HTML formatted error (I guess they have some flag turned on which translates all remote errors into human-readable error messages.) The client is hardened against all of these return possibilities, of course, and is also hardened against a network failure or against a server being unresponsive.
So here’s the question: how should the client respond to these wide variety of errors?
On the server side, of course, there needs to be logging. We even do some device-level logging, and a mechanism is in place on our iPhone client to transmit that log in the background to the server if there is a problem. Because clearly from a developer’s perspective a 404 is not a 501 is not a 505 is not a “server is unresponsive” is not an “illegal character in the API” is not a “network down” error. So clearly we need to know what is going on so we can fix it.
But what does the end user need to know?
In my mind, handling this is very simple: what can the user affect?
In the case of a mobile device, the user can retry again later, or he can make sure he’s got good network reception. Period.
So in the event of a failure, the client’s behavior should be simple: if a connection fails, then test to see if the network is up. If the network is down, tell the user “I’m unable to talk to the remote server because your network is down.” And if the network is up, tell the user “Sorry, a problem occurred while talking to the server.”
And that’s it.
If the network is down, the user can try to turn it on: he can take the device out of airplane mode, or move to where he has reception, or turn on the WiFi. But it’s up to the user–we just have to tell him “Um, the network is down.” And let him decide how he wants to handle the problem–or even if he wants to handle the problem: it could be he clicked on our application while at 38,000 feet over the Atlantic.
If we can’t hit the server, but the network is up, then there is nothing the user can do about it. All we can do as software developers is make sure whomever is maintaining the servers that they should investigate the outage. All we can tell the user, however, is “sorry, no dice; try again later.”
And hope to hell our servers are reliable enough we don’t drive users away.
Making it more complex is silly. It doesn’t serve the user who may be stressed out by messages he can do nothing about anyway.