Monday, September 13, 2010

Subversion - Managing exclusions in commit hooks

Subversion is now a widely used version control system. It is easy to set up, supports branching and merging almost effortlessly, and has a few good client tools (like TortoiseSVN) and plugins for major IDEs that make life very easy for users.
Repositories
Subversion stores content in what are termed repositories. One can visualize a repository as the base folder inside which various code lines exist. These can be the trunk (the main code line), branches and tags (a tag usually corresponds to a release).
Hooks
Subversion also has a concept of hooks, which are programs (usually scripts) that are invoked at specific events. There are hooks for events such as start-commit, pre-commit and post-commit. Hooks can be used for a variety of purposes, e.g. checking for a proper commit comment, running static analysis tools and failing commits on violations (this is usually better done in CI but sometimes might need to be done in a hook), code format checking, etc.


Activating and Deactivating hooks for specific artifacts
Hooks exist at the level of a repository. What this means is that if a pre-commit hook exists, it will be invoked on every commit transaction to any code line within the repository. Sometimes though, certain code lines, projects or files need to be excluded from the checks built into the hooks. In other words, one might need to "activate" or "deactivate" hooks for certain commits. This "activation" or "deactivation" is not available out of the box in SVN and needs to be coded and built in. Here's an example of how one can achieve this. The example below uses a bash shell script, but a hook can be written in any language.


Let's take an example of a pre-commit hook that needs to be deactivated for a few branches, one project and any file in a specific branch with a certain word in its name. A pre-commit hook is invoked as part of a commit transaction just before the transaction is committed and becomes a revision. If the hook executes successfully and exits with a status of 0, the transaction commits and a new revision gets created. If it aborts or exits with a non-zero status, the transaction is rolled back by the SVN server.
A pre-commit hook is called by the SVN server with two arguments: the first is the absolute path to the repository and the second is the transaction id of the commit transaction in question. With this information in hand, the script can find the changes that are part of the transaction using the svnlook command:
svnlook changed -t $2 $1 ($1 is the repos path and $2 is the transaction id)
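For a hypothetical transaction, the output of this command looks something like the sample below (the paths are made up for illustration). The first column is the action (A for added, U for updated, D for deleted) and the second is the path that changed, which is what the hook below relies on:
------------------------------svnlook changed (sample output)------------------------------
U   trunk/src/Service.java
A   branch1/src/Helper.java
D   Project1/docs/notes.txt
--------------------------------------------------------------------------------------------------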

With this information now available, the hook can be made to ignore certain files, projects etc from its processing.
The example here uses a flat text file where each exclusion is listed in a line. An example exclusion file could be:
------------------------------exclusions.txt------------------------------------
branch1
branch2
Project1
branch3/.*FileC.java
--------------------------------------------------------------------------------------------------


The hook can then be written as:
--------------------------------------pre-commit--------------------------------------------
#!/bin/bash
REPOS=$1
TXN=$2
SVNLOOK=/usr/bin/svnlook    # adjust if svnlook is installed elsewhere
# Hooks run with an empty environment and no predictable working directory,
# so refer to the exclusion file by an absolute path (here: the hooks directory).
exclusionFile="${REPOS}/hooks/exclusions.txt"

# Returns 1 if the path matches any pattern in the exclusion file, 0 otherwise.
# Note: =~ does an unanchored match, so a pattern matches anywhere in the path.
isExcluded()
{
    local path=$1
    local exclude=0

    while read -r line
    do
        [ -z "${line}" ] && continue    # skip blank lines
        if [[ ${path} =~ ${line} ]]
        then
            exclude=1
        fi
    done < "${exclusionFile}"
    return ${exclude}
}

# Paths changed in this transaction (ignoring deletes)
CHANGED=`${SVNLOOK} changed -t "${TXN}" "${REPOS}" | /bin/grep -v "^D" | /bin/awk '{print $2}'`

for artifact in ${CHANGED}
do
    isExcluded "${artifact}"
    if [ $? -eq 0 ]
    then
        # The artifact is not excluded - do the actual checks here.
        # Exit with a non-zero status to reject the commit on a violation.
        :
    fi
done

exit 0
------------------------------------------------------------------------------------------------
The benefit of taking this approach is that exclusions can be configured and maintained easily without any change to the hook itself.

Thursday, September 9, 2010

Holes in Application Architecture - Hidden single points of failure

Most enterprise applications and almost all customer facing web applications have high availability as a stated or implicit requirement. Ensuring high availability, among other things, also means building in redundancy for those parts of the system which can become single points of failure.

While defining the deployment and application architectures of highly available systems, one needs to take both an outside-in and an inside-out view of the system to identify all possible single points of failure. Many times, while defining the deployment architecture, there is a tendency to take just an outside-in view of the system. The same goes for defining the application architecture aspects related to high availability. This approach takes care of building redundancy for the "obvious" single points of failure such as the web server, app server and database, and many times that is good enough. Sometimes though, due to the nature of the application, components of the application interact with external entities and services which are deployed separately in their own deployment units. To ensure true high availability, redundancy needs to be built for such entities and services as well. Identifying them needs an inside-out view, without which they might simply be missed.

In this post, I use the context of a web application that is accessible over the internet to highlight a few such hidden single points of failure.

A highly available web application is typically deployed in a setup where there are four vertical tiers:


  • Tier 1 is a load balancer (a hardware one quite often).
  • Tier 2 is a web server farm with the web servers also acting as reverse proxies. These web servers are also configured to support failover to another application server if the one that a request is sent to fails to respond.
  • Tier 3 is a cluster of application servers.
  • Tier 4 is a set of database servers (at least 2, in either active-active or active-passive mode).
  • In case the application has filesystem dependencies (a video sharing site for example), the filesystem is mounted from network storage on the individual web server and/or application server nodes.

This setup takes care of all the obvious single points of failure. What it does not take care of are the not so obvious ones. Some of these are:

Outbound http calls
Applications sometimes need to make calls to other URLs on the internet. These could be calls to get RSS feed data, or web service calls to third party services such as spam detection services or services such as Feedburner, or a host of other things for that matter. Quite often, security guidelines mandate that applications cannot directly make calls to internet URLs and must have the requests proxied through proxy servers. In such a case it's important to build in redundancy for the proxy servers, and applications should be coded to resend requests through another proxy server if one fails.
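As an illustration, here is a minimal sketch (in Python, with hypothetical proxy addresses and feed URL) of an outbound call that fails over to a second forward proxy when the first one is unreachable:
------------------------------proxy-failover.py (illustrative sketch)------------------------------
# A minimal sketch, not production code. The proxy addresses and the feed URL
# are hypothetical placeholders.
import urllib.request

PROXIES = ["http://proxy1.internal:3128", "http://proxy2.internal:3128"]

def fetch_via_proxies(url, timeout=5):
    last_error = None
    for proxy in PROXIES:
        try:
            # Route the request through this forward proxy
            opener = urllib.request.build_opener(
                urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
            return opener.open(url, timeout=timeout).read()
        except Exception as exc:
            # Proxy down or unreachable - try the next one
            last_error = exc
    raise last_error

# feed_xml = fetch_via_proxies("http://example.com/feeds/posts.rss")
--------------------------------------------------------------------------------------------------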

Single Sign On
Some applications delegate authentication to a separate SSO solution (e.g. Siteminder, CAS). Allowing a user to log in to an application via SSO usually means the following sequence of events:
a) redirecting the user to a login page hosted by the SSO application
b) If the login is successful, the user request then presents a token, generated by the SSO application, to the application, requesting access
c) Making a "behind the scenes" call to the SSO app to validate the token
The third step is a potential single point of failure. Assuming SSO has multiple instances for redundancy and the instances are clustered, the application should be coded correctly to contact a different SSO app instance in case the first doesn't respond.
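A sketch of that third step with failover is below. The instance host names, the /validate endpoint and the shape of the response are assumptions; the real call depends on the SSO product in use.
------------------------------sso-validate.py (illustrative sketch)------------------------------
# Minimal sketch: validate an SSO token against whichever SSO instance responds.
import urllib.parse
import urllib.request

SSO_INSTANCES = ["https://sso1.internal", "https://sso2.internal"]

def validate_token(token, timeout=3):
    query = urllib.parse.urlencode({"token": token})
    for base in SSO_INSTANCES:
        try:
            response = urllib.request.urlopen(base + "/validate?" + query,
                                              timeout=timeout)
            return response.read()  # hand the SSO response back to the caller
        except Exception:
            continue  # this instance is not responding - try the next one
    raise RuntimeError("no SSO instance reachable")
--------------------------------------------------------------------------------------------------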

Antivirus checks
Some applications which allow users to upload binary content (videos, audio, images, etc.) may need to run the uploaded content through an anti-virus check and quarantine the file if it is found to be infected. If the AV check is done via a service invocation on a remote process (which is usually the case), the application code should take care of invoking the service on another instance if the first instance is not responding.

Emails
Some applications need to send emails to various kinds of recipients. This means talking to an SMTP server either directly or via a relay (e.g. a sendmail relay on Linux). In either case, assuming redundancy exists for the SMTP server or relay, the applications need to be coded to send the email content to another instance if the first instance doesn't respond.
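A minimal sketch of that failover, assuming two relays with hypothetical host names:
------------------------------send-mail.py (illustrative sketch)------------------------------
# Minimal sketch: deliver a message through whichever relay accepts the connection.
import smtplib

SMTP_HOSTS = ["relay1.internal", "relay2.internal"]

def send_mail(sender, recipients, message):
    for host in SMTP_HOSTS:
        try:
            server = smtplib.SMTP(host, 25, timeout=5)
            try:
                server.sendmail(sender, recipients, message)
                return  # delivered to this relay
            finally:
                server.quit()
        except Exception:
            continue  # relay down - try the next one
    raise RuntimeError("no SMTP server or relay reachable")
--------------------------------------------------------------------------------------------------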

These are some examples from my experience. I am sure there are a few more. Thoughts on what those could be?

Thursday, May 27, 2010

Searchability of widgets

Almost a year back, I was working as an Architect on building a Social Media platform, as part of which we had to build a widget creation platform. The widgets in question here are "web page widgets" and not desktop widgets. The specific kind of widgets here were termed "Content Aggregation widgets". These widgets were meant to display, in various visual forms, content fetched and aggregated from various sources on the web. The content could be Flickr photostream RSS feeds containing metadata about images, YouTube channel RSS feeds, blog RSS feeds, etc. In fact, the content could be any RSS or Atom feed from pretty much any source.

Examples of requirements could be:
1. A client is running a video campaign for a product on YouTube and wants to create a widget that would display all those videos along with some metadata (number of views, average rating etc.)
2. A client would like to aggregate content from a few blogs and show the aggregated content in a widget.

In both the cases above, widgets can be used as a very powerful medium of content syndication.

The functional requirements were pretty clear and we set off on our journey of creating the "content aggregation widgets" platform.

An "unusual" requirement then cropped up which said that content within the widgets should be searchable by search engines. What it meant was that if the owner of abc.com came to our platform, created a widget and stuck it into a page on abc.com, search engines should be able to find that page on abc.com if content within the widget was being searched for.

Now this was tricky. 

Widgets, in a technical sense, are small chunks of HTML code (either <embed> or <object> tags or both) which can be added to any web page (by editing the page's source, of course). The chunk of widget code usually points to a JavaScript or Flash file. When the web page is next loaded, this chunk of code causes the JavaScript or Flash code to execute and create a part of the web page. The important thing to understand here is that the content is created not on the server side but on the client side.

So why was this "searchability" thing tricky? 

Simply because search engines, when scanning pages for keywords, are only able to search the HTTP response corresponding to a request made for the page. In other words, search engines can only look into the equivalent of "view source", i.e. server side generated content. Since widget code executes on the client side (i.e. the browser), the content generated by a widget is just not seen by a search engine.

Now that pretty much means that such widgets cannot be searchable in the way the requirement above wanted. While that is true, there is a workaround to this technical limitation. Before that, here's how the architecture of the solution looked before the searchability requirement came in. For the sake of understanding, "vendor" refers to my company that hosts the platform, "client" is the company which uses the widget platform and creates and embeds widgets in its pages, "outside world" is... well, the outside world, and "end user" is someone who views pages served from the client's domain.
[Diagram: widget platform architecture before the searchability requirement]

After the "searchability" requirement came in, we had to implement a workaround: for widgets where searchability was a requirement, the widget platform, in addition to generating the widget embed code, also generated another chunk of server side code. The language of that code could be Java, Perl, PHP etc. The additional server side code would make a call to a REST API exposed by the vendor platform, something like www.vendor.com/api/getkeywords/[widgetID]. This API would return an XML document containing keywords related to the widget's content. The server side code, after obtaining this XML, would parse out the keywords and insert an HTML meta tag with the keywords in it (something like <meta name="keywords" content="string1, string2" />).
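A sketch of what that generated server side chunk could look like is below. Python is used here only for illustration (the real chunk would have been Java, Perl or PHP), and the XML response structure (a flat list of <keyword> elements) is an assumption; the actual format of the API response may differ.
------------------------------widget-keywords.py (illustrative sketch)------------------------------
# Fetch the keywords for a widget from the vendor's REST API and build the
# corresponding HTML meta tag for the embedding page.
import urllib.request
import xml.etree.ElementTree as ET

def keywords_meta_tag(widget_id):
    url = "http://www.vendor.com/api/getkeywords/%s" % widget_id
    xml_data = urllib.request.urlopen(url, timeout=5).read()
    keywords = [el.text for el in ET.fromstring(xml_data).iter("keyword")]
    return '<meta name="keywords" content="%s" />' % ", ".join(keywords)

# The page that embeds the widget would emit keywords_meta_tag("12345")
# inside its <head> section when it is rendered on the server.
--------------------------------------------------------------------------------------------------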


The diagram below depicts this change:


This enabled us to make the widgets searchable to some extent, though adding meta tags with keywords is only one of the many things one can do to make a web page search engine friendly.


The funny thing, though, was that this was less of a technical challenge and more of a "political" one. Our view technology for widgets was Flash, and questions were raised about the searchability of Flash content. Suggestions were made that if the widgets had been done in HTML/JavaScript they would somehow be more searchable. It took some time to convince the client that any content created on the client side is not searchable, period. The technology creating that content doesn't matter.