Thursday, September 9, 2010

Holes in Application Architecture - Hidden single points of failure

Most enterprise applications and almost all customer-facing web applications have high availability as a stated or implicit requirement. Ensuring high availability means, among other things, building in redundancy for those parts of the system that can become single points of failure.

While defining the deployment and application architectures of highly available systems, one needs to take both an outside-in and an inside-out view of the system to identify all possible single points of failure. Often, while defining the deployment architecture, there is a tendency to take just an outside-in view of the system. The same goes for defining the aspects of the application architecture related to high availability. This approach takes care of building redundancy for the "obvious" single points of failure such as the web server, app server and database, and many times that is good enough. Sometimes though, due to the nature of the application, components of the application interact with external entities and services which are deployed separately in their own deployment units. To ensure true high availability, redundancy needs to be built for such entities and services as well. Identifying them needs an inside-out view, without which they can easily be missed.

In this post, I use the context of a web application that is accessible over the internet to highlight a few such hidden single points of failure.

A highly available web application is typically deployed in a setup where there are four vertical tiers:


  • Tier 1 is a load balancer (a hardware one quite often). 
  • Tier 2 is a web server farm with the web servers also acting as reverse proxies. These web servers are also configured to fail over to another application server if the one that a request is sent to fails to respond.
  • Tier 3 is a cluster of application servers.
  • Tier 4 is a set of database servers (at least 2, in either active-active or active-passive mode).
  • In case the application has filesystem dependencies (a video sharing site, for example), the filesystem is mounted from network storage on the individual web server and/or application server nodes.

This setup takes care of all the obvious single points of failure. What it does not take care of are the not so obvious ones. Some of these are:

Outbound HTTP calls
Applications sometimes need to make calls to other URLs on the internet. These could be calls to fetch RSS feed data, or web service calls to third-party services such as spam detection services or Feedburner. Quite often, security guidelines mandate that applications cannot call internet URLs directly and must have their requests proxied through proxy servers. In such a case it's important to build in redundancy for the proxy servers, and applications should be coded to resend a request through another proxy server if one fails.
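
As a rough illustration of resending a request through another proxy, here is a minimal Python sketch using only the standard library. The proxy addresses, the function names and the opener_factory parameter are my own assumptions for the example, not part of any particular setup. It treats an HTTP error status as proof that the proxy path worked, which may be too lenient if your proxies report their own failures as 502/503 responses.

```python
import urllib.error
import urllib.request

# Hypothetical proxy addresses; substitute the real redundant proxies.
PROXIES = [
    "http://proxy1.example.internal:3128",
    "http://proxy2.example.internal:3128",
]


def build_opener_for(proxy_url):
    """Return an opener that routes http and https requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_via_proxies(url, proxies=PROXIES, timeout=5, opener_factory=build_opener_for):
    """Try each proxy in order and return the body from the first one that works."""
    last_error = None
    for proxy_url in proxies:
        opener = opener_factory(proxy_url)
        try:
            with opener.open(url, timeout=timeout) as response:
                return response.read()
        except urllib.error.HTTPError:
            # An HTTP status came back, so the request got through; do not retry.
            raise
        except (urllib.error.URLError, OSError) as exc:
            # Proxy unreachable or timed out: remember the error, try the next one.
            last_error = exc
    raise ConnectionError(f"all proxies failed for {url}") from last_error
```

The opener_factory parameter exists only so the failover loop can be exercised without real proxies; in production code the default would be used.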

Single Sign On
Some applications delegate authentication to a separate SSO solution (e.g. Siteminder, CAS). Allowing a user to log in to an application via SSO usually involves the following sequence of events:
a) The application redirects the user to a login page hosted by the SSO application.
b) If the login is successful, the user's next request presents a token to the application, asking to be allowed access. This token is generated by the SSO application.
c) The application makes a "behind the scenes" call to the SSO application to validate the token.
The third step is a potential single point of failure. Assuming the SSO solution has multiple clustered instances for redundancy, the application should be coded to contact a different SSO instance in case the first doesn't respond.
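
The token validation step could be sketched roughly as follows. This is not the API of Siteminder or CAS: the check callable is a hypothetical stand-in for whatever back-channel call the SSO product provides, and SSOUnavailable and validate_token are names I made up for the example. The point it tries to show is that only "instance did not respond" triggers failover, while a definite "token is invalid" answer is returned as is.

```python
from typing import Callable, Sequence


class SSOUnavailable(Exception):
    """Raised when no SSO instance could be reached."""


def validate_token(
    token: str,
    instances: Sequence[str],
    check: Callable[[str, str], bool],
) -> bool:
    """Ask each SSO instance in turn whether the token is valid.

    check(instance, token) is a hypothetical wrapper around the real
    back-channel call. It should return True or False for a definite answer
    and raise OSError (for example a timeout or refused connection) when the
    instance cannot be reached.
    """
    errors = []
    for instance in instances:
        try:
            # A definite answer from any instance ends the loop.
            return check(instance, token)
        except OSError as exc:
            # This instance is not responding; note it and try the next one.
            errors.append(f"{instance}: {exc}")
    raise SSOUnavailable("; ".join(errors) or "no SSO instances configured")
```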

Antivirus checks
Some applications which allow users to upload binary content (video, audio, images, etc.) may need to run the uploaded content through an antivirus check and quarantine the file if it is found to be infected. If the AV check is done via a service invocation on a remote process (which is usually the case), the application code should take care of invoking the service on another instance if the first instance is not responding.
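
A minimal sketch of that flow, under my own assumptions: scan is a hypothetical callable wrapping the remote AV service (it returns True for infected, False for clean, and raises OSError when the instance does not respond), and quarantine is simply a separate directory. What to do when no instance responds is a policy decision; here the function just raises.

```python
import shutil
from pathlib import Path


def scan_upload(path, scanners, scan, quarantine_dir):
    """Scan an uploaded file, failing over between AV instances.

    Returns the final location of the file: the original path if it is clean,
    or its new path inside quarantine_dir if it was found infected.
    """
    path = Path(path)
    for scanner in scanners:
        try:
            infected = scan(scanner, path)
        except OSError:
            # This AV instance is not responding; try the next one.
            continue
        if infected:
            target = Path(quarantine_dir) / path.name
            shutil.move(str(path), str(target))
            return target
        return path
    raise RuntimeError("no antivirus instance responded; upload left unscanned")
```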

Emails
Some applications need to send emails to various kinds of recipients. This means talking to an SMTP server either directly or via a relay (e.g. a sendmail relay on Linux). In either case, assuming redundancy exists for the SMTP server or relay, the application needs to be coded to send the email to another instance if the first instance doesn't respond.
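
For email, a sketch using Python's standard smtplib might look like the following. The host names, send_with_failover and the smtp_factory parameter are assumptions for illustration. Note that this simple version also moves on to the next server when the first one rejects the message outright, which may or may not be what you want.

```python
import smtplib
from email.message import EmailMessage

# Hypothetical SMTP servers or relays; substitute the real redundant instances.
SMTP_HOSTS = [("smtp1.example.internal", 25), ("smtp2.example.internal", 25)]


def send_with_failover(message, hosts=SMTP_HOSTS, timeout=10, smtp_factory=smtplib.SMTP):
    """Send message through the first SMTP instance that accepts it.

    Returns the host name that was used. smtp_factory defaults to
    smtplib.SMTP and is a parameter only so the loop can be tested offline.
    """
    last_error = None
    for host, port in hosts:
        try:
            with smtp_factory(host, port, timeout=timeout) as server:
                server.send_message(message)
            return host
        except (OSError, smtplib.SMTPException) as exc:
            # Connection refused, timeout or SMTP-level failure: try the next host.
            last_error = exc
    raise ConnectionError("all SMTP servers failed") from last_error
```

A message for it can be built with EmailMessage, setting the Subject, From and To headers and calling set_content() with the body before passing it to send_with_failover.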

These are some examples from my experience. I am sure there are a few more. Thoughts on what those could be?
