Thursday, May 27, 2010

Searchability of widgets

Almost a year back, I was working as an architect on a social media platform, as part of which we had to build a widget creation platform. The widgets in question here are "web page widgets", not desktop widgets. The specific kind of widgets we built were termed "Content Aggregation widgets". These widgets were meant to display, in various visual forms, content fetched and aggregated from various sources on the web. The content could be Flickr photostream RSS feeds containing metadata about images, YouTube channel RSS feeds, blog RSS feeds etc. In fact, the content could be any RSS or Atom feed from pretty much any source.

Examples of requirements:
1. A client is running a video campaign for a product on YouTube and wants to create a widget that displays all those videos along with some metadata (number of views, average rating etc.).
2. A client would like to aggregate content from a few blogs and show the aggregated content in a widget.

In both cases above, widgets serve as a very powerful medium for content syndication.

The functional requirements were pretty clear and we set off on our journey of creating the "content aggregation widgets" platform.

An "unusual" requirement then cropped up: content within the widgets should be searchable by search engines. In other words, if the owner of abc.com came to our platform, created a widget and embedded it in a page on abc.com, search engines should be able to find that page on abc.com when someone searched for content shown in the widget.

Now this was tricky. 

In a technical sense, widgets are small chunks of HTML code (typically <script>, <embed> or <object> tags) which can be added to any web page (by editing the page's source, of course). The chunk of widget code usually points to a JavaScript or Flash file. The next time the web page loads, this chunk of code makes the JavaScript or Flash code execute and render a part of the web page. The important thing to understand here is that the content is created not on the server side but on the client side.

So why was this "searchability" thing tricky? 

Simply because search engines, when searching pages for keywords, are only able to search the HTTP response returned for a request made for the page. In other words, search engines can only look into the equivalent of "view source", i.e. server-side generated content. Since widget code executes on the client side (i.e. in the browser), the content generated by a widget is simply not seen by a search engine.
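To make the point concrete, here is a minimal Python sketch of what a crawler that does not execute scripts gets to see. It is only an illustration of the argument above, not how any particular search engine works. The sample page, the widget script URL and the names `VisibleTextExtractor` and `crawler_view` are all made up for this example.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text from the raw markup only, without running any scripts."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def crawler_view(html_source):
    """Return the text present in the HTTP response body itself."""
    parser = VisibleTextExtractor()
    parser.feed(html_source)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical client page: the widget is just an empty container plus a
# script reference. Its content only appears after the browser runs the script.
PAGE = """<html><head><title>abc.com product page</title></head>
<body><h1>Our new product</h1>
<div id="widget"></div>
<script src="http://www.vendor.com/widgets/123.js"></script>
</body></html>"""

print(crawler_view(PAGE))
```

The extracted text contains the page title and heading but nothing from the widget, because the widget's content never appears in the server's response.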

Now that pretty much means that such widgets cannot be searchable in the way the requirement above demanded. While that is true, there is a workaround for this technical limitation. Before getting to it, here's what the architecture of the solution looked like before the searchability requirement came in. For the sake of understanding, "vendor" refers to my company, which hosts the platform; "client" is the company which uses the widget platform and creates and embeds widgets in its pages; "outside world" is, well, the outside world; and "end user" is someone who views pages served from the client's domain.
[Diagram: architecture before the searchability requirement]



After the "searchability" requirement came in, we had to implement a workaround. For widgets where searchability was a requirement, the widget platform, in addition to generating the widget embed code, also generated another chunk of server-side code. The language of that code could be Java, Perl, PHP etc. The additional server-side code would make a call to a REST API exposed by the vendor platform, something like www.vendor.com/api/getkeywords/[widgetID]. This API would return an XML document containing keywords related to the widget's content. The server-side code, after obtaining this XML, would parse out the keywords and insert an HTML meta tag containing them into the page (something like <meta name="keywords" content="string1, string2" />).
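A rough Python sketch of what such generated server-side code might look like follows. It is an illustration under assumptions, not the code our platform actually produced. The endpoint follows the example URL above, with an http:// scheme assumed. The XML layout (a <keywords> root with <keyword> children) and the function names are hypothetical.

```python
import html
import xml.etree.ElementTree as ET
from typing import List
from urllib.request import urlopen

# Assumed endpoint pattern, modeled on the example URL in the text.
API_URL = "http://www.vendor.com/api/getkeywords/{widget_id}"


def fetch_keywords_xml(widget_id: str, timeout: float = 5.0) -> str:
    """Call the vendor's REST API and return the raw XML response."""
    with urlopen(API_URL.format(widget_id=widget_id), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_keywords(xml_text: str) -> List[str]:
    """Extract keywords, assuming <keywords><keyword>...</keyword></keywords>."""
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iter("keyword")
        if el.text and el.text.strip()
    ]


def build_meta_tag(keywords: List[str]) -> str:
    """Build the meta tag that the client's page template emits in <head>."""
    content = html.escape(", ".join(keywords), quote=True)
    return '<meta name="keywords" content="{}" />'.format(content)


def meta_tag_for_widget(widget_id: str) -> str:
    """Fetch, parse and render in one step. This performs a network call."""
    return build_meta_tag(parse_keywords(fetch_keywords_xml(widget_id)))
```

The client's page template would call something like `meta_tag_for_widget` while rendering the page on the server, so the keywords are part of the HTTP response that a search engine reads. In practice the result would probably be cached rather than fetched on every page request.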


The diagram below depicts this change:


This enabled us to make the widgets searchable to some extent, though adding meta tags with keywords is only one of the many things that one can do to make a web page search engine friendly.


The funny thing, though, was that this was less of a technical and more of a "political" challenge. Our view technology for widgets was Flash, and questions were raised about the searchability of Flash content. Suggestions were made that if the widgets had been done in HTML/JavaScript they would somehow be more searchable. It took some time to convince the client that any content created on the client side is not searchable, period. The technology creating that content doesn't matter.