Category Archives: Google

Joining Google

Next Monday, I will start at Google, in the Cloud Platform team.

I’ve been watching that platform, and especially Google App Engine (GAE), since it started in 2008. It shaped my thoughts on Cloud Computing and on the tension between PaaS and IaaS. In my first post about GAE, 4.5 years ago, I wrote about that tension:

History is rarely kind to promoters of radical departures. The software industry is especially fond of layering the new on top of the old (a practice that has been enabled by the constant increase in underlying computing capacity). If you are wondering why your command prompt, shell terminal or text editor opens with a default width of 80 characters, take a trip back to 1928, when IBM defined its 80-columns punch card format. Will Google beat the odds or be forced to be more accommodating of existing code?

This debate (which I later characterized as “backward-compatible vs. forward-compatible”) is still ongoing. App Engine has grown a lot and shed its early limitations (I had a lot of fun trying to engineer around them in the early days). Google’s Cloud Platform today is also a lot more than App Engine, with Cloud Storage, Compute Engine, etc. It’s much more welcoming to existing applications.

The core question remains, however. How far, and how quickly will we move from the abstractions inherited from seeing the physical server as the natural unit of computation? What benefits will we derive from this transformation and will they make it worthwhile? Where’s the next point of equilibrium in the storm provoked by these shifts:

  • IT management technology was ripe for a change, applying to itself the automation capabilities that it had brought to other domains.
  • Software platforms were ripe for a change, as we keep discovering all the Web can be, all the data we can handle, and how best to take advantage of both.
  • The business of IT was ripe for a change, having grown too important to escape scrutiny of its inefficiency and sluggishness.

These three transformations didn’t have to take place at the same time. But they are, which leaves us with a fascinating multi-variable equation to optimize. I believe Google is the right place to crack this nut.

This is my view today, looking at the larger Cloud environment and observing Google’s Compute Platform from the outside. In a week’s time, I’ll be looking at it from the inside. October me may scoff at the naïveté of September me; or not. Either way, I’m looking forward to it.

7 Comments

Filed under Cloud Computing, Everything, Google, Google App Engine, Google Cloud Platform, People, Uncategorized, Utility computing

Google Compute Engine, the compete engine

Google is going to give Amazon AWS a run for its money. It’s the right move for Google and great news for everyone.

But that wasn’t plan A. Google was way ahead of everybody with a PaaS solution, Google App Engine, which was the embodiment of “forward compatibility” (rather than “backward compatibility”). I’m pretty sure that the plan, when they launched GAE in 2008, didn’t include “and in 2012 we’ll start offering raw VMs”. But GAE (and PaaS in general), while it made some inroads, failed to generate the level of adoption that many of us expected. Google smartly understood that they had to adjust.

“2012 will be the year of PaaS” returns 2,510 search results on Google, while “2012 will be the year of IaaS” returns only 2 results, both of which relate to a quote by Randy Bias which actually expresses quite a different feeling when read in full: “2012 will be the year of IaaS cloud failures”. We all got it wrong about the inexorable rise of PaaS in 2012.

But saying that, in 2012, IaaS still dominates PaaS, while not wrong, is an oversimplification.

At a more fine-grained level, Google Compute Engine is just another proof that the distinction between IaaS and PaaS was always artificial. The idea that you deploy your applications either at the IaaS or at the PaaS level was a fallacy. There is a continuum of application services, including VMs, various forms of storage, various levels of routing, various flavors of code hosting, various API-centric utility functions, etc. You can call one end of the spectrum “IaaS” and the other end “PaaS”, but most Cloud applications live in the continuum, not at either end. Amazon started from the left and moved to the right, Google is doing the opposite. Amazon’s initial approach was more successful at generating adoption. But it’s still early in the game.

As a side note, this is going to be a challenge for the Cloud Foundry ecosystem. To play in that league, Cloud Foundry has to either find a way to cover the full IaaS-to-PaaS continuum or it needs to efficiently integrate with more IaaS-centric Cloud frameworks. That will be a technical challenge, and also a political one. Or Cloud Foundry needs to define a separate space for itself. For example in Clouds which are centered around a strong SaaS offering and mainly work at higher levels of abstraction.

A few more thoughts:

  • If people still had lingering doubts about whether Google is serious about being a Cloud provider, the addition of Google Compute Engine (and, earlier, Google Cloud Storage) should put those to rest.
  • Here comes yet-another-IaaS API. And potentially a major one.
  • It’s quite a testament to what Linux has achieved that Google Compute Engine is Linux-only and nobody even bats an eye.
  • In the end, this may well turn into a battle of marketplaces more than a battle of Cloud environments. Just like in mobile.

2 Comments

Filed under Amazon, Cloud Computing, Everything, Google, Google App Engine, Google Cloud Platform, Utility computing

GAE Traffic Splitting

Interesting addition to the Google App Engine (GAE) platform in release 1.6.3:  Traffic Splitting lets you run several versions of your application (using a DNS sub-domain for each version) and choose to direct a certain percentage of requests to a specific version. This lets you, among other things, slowly phase in your updates and test the result on a small set of users.

That’s nice, but until I read the documentation for the feature I had assumed (and hoped) it was something else.

Rather than using traffic splitting to test different versions of my app (something which the platform now makes convenient but which I could have implemented on my own), it would be nice if that mechanism could be used to test updates to the GAE platform itself. As described in “Come for the PaaS Functional Model, stay for the Cloud Operational Model“, it’s wishful thinking to assume that changes to the PaaS platform (an update applied by Google admins) cannot have a negative effect on your application. In other words, “When NoOps meets Murphy’s Law, my money is on Murphy“.

What would be nice is if Google could give application owners advanced warning of a platform change and let them use the Traffic Splitting feature to direct a portion of the incoming requests to application instances running on the new platform. And also a way to include the platform version in all log messages.
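
In the meantime, the best we can do is tag our own logs with the version information GAE does expose. Here is a minimal sketch in Python (CURRENT_VERSION_ID is the application version the runtime puts in the environment; as noted above, there is no equivalent identifier for the platform version):

import logging
import os

# Sketch only: stamp every log line with the version identifier GAE exposes.
# CURRENT_VERSION_ID is the *application* version; there is no equivalent
# environment variable for the underlying platform version.
APP_VERSION = os.environ.get("CURRENT_VERSION_ID", "unknown")

logging.basicConfig(format="%(asctime)s [app-version=" + APP_VERSION + "] %(levelname)s %(message)s")
logging.warning("handling request on application version %s", APP_VERSION)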

Here’s the issue as I described it in the aforementioned “Cloud Operational Model” post:

In other words, if a patch or update is worth testing in a staging environment if you were to apply it on-premise, what makes you think that it’s less likely to cause a problem if it’s the Cloud provider who rolls it out? Sure, in most cases it will work just fine and you can sing the praise of “NoOps”. Until the day when things go wrong, your users are affected and you’re taken completely off-guard. Good luck debugging that problem, when you don’t even know that an infrastructure change is being rolled out and when it might not even have been rolled out uniformly across all instances of your application.

How is that handled in your provider’s Operational Model? Do you have visibility into the change schedule? Do you have the option to test your application on the new infrastructure or to at least influence in any way how and when the change gets rolled out to your instances?

Hopefully, the addition of Traffic Splitting to Google App Engine is a step towards addressing that issue.

2 Comments

Filed under Application Mgmt, Automation, Cloud Computing, DevOps, Everything, Google, Google App Engine, Utility computing

The war on RSS

If the lords of the Internet have their way, the days of RSS are numbered.

Apple

John Gruber was right, when pointing to Dan Frakes’ review of the Mail app in Mountain Lion, to highlight the fact that the application drops support for RSS (he calls it an “interesting omission”, which is both correct and understated). It is indeed the most interesting aspect of the review, even though it’s buried at the bottom of the article, along with the mention that RSS support appears to also be removed from Safari.

[side note: here is the correct link for the Safari information; Dan Frakes’ article mistakenly points to a staging server only available to MacWorld employees.]

It’s not just John Gruber and I who think that’s significant. The disappearance of RSS is pretty much the topic of every comment on the two MacWorld articles (for Mail and Safari). That’s heartening. It’s going to take a lot of agitation to reverse the trend for RSS.

The Mountain Lion setback, assuming it’s not reversed before the OS ships, is just the last of many blows to RSS.

Twitter

Every Twitter profile used to display an RSS icon with the URL of a feed containing the user’s tweets. It’s gone. Don’t assume that’s just the result of a minimalist design because (a) the design is not minimalist and (b) the feed URL is also gone from the page metadata.

The RSS feeds still exist (mine is http://twitter.com/statuses/user_timeline/18518601.rss) but to find them you have to know the userid of the user. In other words, knowing that my Twitter username is @vambenepe is not sufficient, you have to know that the userid for @vambenepe is 18518601. Which is not something that you can find on my profile page. Unless, that is, you are willing to wade through the HTML source and look for this element:

<div data-user-id="18518601" data-screen-name="vambenepe">

If you know the Twitter API you can retrieve the RSS URL that way, but neither that nor the HTML source method is usable for most people.
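
For the determined nerd, here is a rough Python sketch of that workaround, using the data-user-id attribute and the feed URL pattern shown above (none of which Twitter documents or promises to keep around):

import re
import urllib.request

def twitter_rss_url(username):
    # Scrape the numeric userid from the profile page (the data-user-id
    # attribute shown above) and build the feed URL from it. Fragile by
    # design: none of this is documented and Twitter can break it anytime.
    html = urllib.request.urlopen("https://twitter.com/" + username).read().decode("utf-8")
    match = re.search(r'data-user-id="(\d+)"', html)
    if match is None:
        raise ValueError("no data-user-id found on the profile page")
    return "http://twitter.com/statuses/user_timeline/%s.rss" % match.group(1)

# twitter_rss_url("vambenepe") should return
# http://twitter.com/statuses/user_timeline/18518601.rss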

That’s too bad. Before I signed up for Twitter, I simply subscribed to the RSS feeds of a few Twitter users. It got me hooked. Obviously, Twitter doesn’t see much value in this anymore. I suspect that they may even see a negative value, a leak in their monetization strategy.

[Updated on 2013/3/1: Unsurprisingly, Twitter is pulling the plug on RSS/Atom entirely.]

Firefox

It used to be that if any page advertised an RSS feed in its metadata, Firefox would show an RSS icon in the address bar to call your attention to it and let you subscribe in your favorite newsreader. At some point, between Firefox 3 and Firefox 10, this disappeared. Now, you have to launch the “view page info” pop-up and click on “feeds” to see them listed. Or look for “subscribe to this page” in the “bookmarks” menu. Neither is hard, but the discoverability of the feeds is diminished. That’s especially unfortunate for sites that don’t look like blogs but go the extra mile of offering relevant feeds, since those are now even harder to discover.

Google

Google has done a lot for RSS, but as a result it has put itself in a position to kill it, either accidentally or on purpose. Google Reader is a nice tool, but, just like there has not been any new webmail after GMail, there hasn’t been any new hosted feed reader after Google Reader.

If Google closed GMail (or removed email support from it), email would survive as a communication mechanism (removing email from GMail is hard to imagine today, but keep in mind that Google’s survival doesn’t require GMail but they appear to consider it a matter of life or death for Google+ to succeed). If, on the other hand, Google closed Reader, would RSS survive? Doubtful. And Google has already tweaked Reader to benefit Google+. Not, for now, in a way that harms its RSS support. But whatever Google+ needs from Reader, Google+ will get.

[Updated 2013/3/13: Adios Google Reader. But I’m now a Google employee and won’t comment further.]

As far as the Chrome browser is concerned, I can’t find a way to have it acknowledge the presence of feeds in a page at all. Unlike Firefox, not even “view page info” shows them; it appears that the only way is to look for the feed URLs in the HTML source.
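
If you’re willing to do the browser’s old job yourself, a few lines of Python can dig the feeds out of a page’s metadata. A quick sketch (assuming the page declares its feeds with the usual <link rel="alternate"> elements):

from html.parser import HTMLParser
import urllib.request

class FeedFinder(HTMLParser):
    # Collect <link rel="alternate" type=".../rss+xml or .../atom+xml"> entries.
    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == "link" and a.get("rel", "").lower() == "alternate"
                and a.get("type", "").endswith(("rss+xml", "atom+xml"))):
            self.feeds.append(a.get("href"))

def find_feeds(url):
    finder = FeedFinder()
    finder.feed(urllib.request.urlopen(url).read().decode("utf-8", "replace"))
    return finder.feeds

# find_feeds("https://stage.vambenepe.com/") would list this blog's feeds.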

Facebook

I don’t use Facebook, but for the benefit of this blog post I did some actual research and logged into my account there. I looked for a feed on a friend’s page. None in sight. Unlike Twitter, who started with a very open philosophy, I’m guessing Facebook never supported feeds so it’s probably not a regression in their case. Just a confirmation that no help should be expected from that side.

[update: in fact, Facebook used to offer RSS and killed it too.]

Not looking good for RSS

The good news is that there’s at least one thing that Facebook, Apple, Twitter and (to a lesser extent so far) Google seem to agree on. The bad news is that it’s that RSS, one of the beacons of openness on the internet, is the enemy.

[side note: The RSS/Atom question is irrelevant in this context and I purposely didn’t mention Atom to not confuse things. If anyone who’s shunning RSS tells you that if it wasn’t for the RSS/Atom confusion they’d be happy to use a standard syndication format, they’re pulling your leg; same thing if they say that syndication is “too hard for users”.]

70 Comments

Filed under Apple, Big picture, Everything, Facebook, Google, Protocols, Social networks, Specs, Standards, Twitter

On resisting shiny objects

The previous post (compiling responses to my question on Twitter about why there seems to be more PaaS activity in public Clouds than private Clouds) is actually a slightly-edited repost of something I first posted on Google+.

I posted it as “public” which, Google+ says, means “Visible to anyone (public on the web)”. Except it isn’t. If I go to the link above in another browser (not logged to my Google account) I get nothing but an invitation to join Google+. How AOLy. How Facebooky. How non-Googly.

Maybe I’m doing something wrong. Or there’s a huge bug in Google+. Or I don’t understand what “public on the web” means. In any case, this is not what I need. I want to be able to point people on Twitter who saw my question (and, in some cases, responded to it) to a compilation of answers. Whether they use Google+ or not.

So I copy/pasted the compilation to my blog.

Then I realized that this is obviously what I should have done in the first place:

  • It’s truly public.
  • It brings activity to my somewhat-neglected blog.
  • My blog is about IT management and Cloud, which is the topic at hand, my Google+ stream is about… nothing really so far.
  • The terms of use (for me as the writer and for my readers) are mine.
  • I can format the way I want (human-readable text that acts as a link as opposed to having to show the URL, what a concept!).
  • I know it will be around and available in an open format (you’re probably reading this in an RSS/Atom reader, aren’t you?)
  • There is no ad and never will be any.
  • I get the HTTP log if I care to see the traffic to the page.
  • Commenters can use pseudonyms!

It hurts to admit it, but the thought process (or lack thereof) that led me to initially use Google+ goes along the lines of “I have this Google+ account that I’m not really using and I am sympathetic to Google+ (for reasons explained by James Fallows plus a genuine appreciation of the technical task) and, hey, here is something that could go on it so let’s put it there”. As opposed to what it should have been: “I have this piece of text full of links that I want to share, what would be the best place to do it?”, which screams “blog!” as the answer.

I consider myself generally pretty good at resisting shiny objects, but obviously I still need to work on it. I’m back to my previous opinion on Google+: it’s nice and well-built but right now I don’t really have a use for it.

I used to say “I haven’t found a use for it” but why should I search for one?

1 Comment

Filed under Big picture, Everything, Google, Off-topic, Portability, Social networks, Twitter

Backward-looking API ToS vs. forward-looking API usage

From the developer’s perspective, most terms of service (ToS) documents for public APIs contain only constraints and no real permission. That’s because they generally don’t come with any guarantee about the future. At best, they tell you what you were authorized to do yesterday. Since the version you’re looking at may already be outdated, it tells you nothing about what you can do now or tomorrow. That perspective is often ignored when debating a change in ToS. The new version may forbid you from doing something that was not forbidden before, but it’s not actually taking anything away from you. The only thing it could take away is the right to do something from now on, and you were never granted any such right over that future period in the first place.

For example, the Twitter API ToS:

“The Rules will evolve along with our ecosystem as developers continue to innovate and find new, creative ways to use the Twitter API, so please check back periodically to see the most current version.”

and more specifically

“Twitter may update or modify the Twitter API, Rules, and other terms and conditions, including the Display Guidelines, from time to time its sole discretion by posting the changes on this site or by otherwise notifying you (such notice may be via email). You acknowledge that these updates and modifications may adversely affect how your Service accesses or communicates with the Twitter API. If any change is unacceptable to you, your only recourse is to terminate this agreement by ceasing all use of the Twitter API and Twitter Content. Your continued access or use of the Twitter API or any Twitter Content will constitute binding acceptance of the change.”

Changes in the terms of service can have a major effect on consuming applications, going as far as making the whole purpose of the application impossible. For example, any Twitter app is at the mercy of the next corporate strategy change decided at Twitter HQ. That can happen without any change whatsoever to the API from a technical perspective.

If building your business on a proprietary API is sharecropping, then changes in ToS are the “droit de cuissage” (a.k.a “droit du seigneur“) that comes with it. Be ready for it to be exercised at any point.

Even in the cases where the ToS contain a guarantee over time, there is often enough language to allow the provider to wiggle out of it. Case in point, the recent decommissioning of the Google Translate API.

In the API ToS, Google includes a generous three years deprecation window:

“If Google in its discretion chooses to cease providing the current version of the Service whether through discontinuation of the Service or by upgrading the Service to a newer version, the current version of the Service will be deprecated and become a Deprecated Version of the Service. Google will issue an announcement if the current version of the Service will be deprecated. For a period of 3 years after an announcement (the “Deprecation Period”), Google will use commercially reasonable efforts to continue to operate the Deprecated Version of the Service and to respond to problems with the Deprecated Version of the Service deemed by Google in its discretion to be critical. During the Deprecation Period, no new features will be added to the Deprecated Version of the Service.”

But further down, Google reserves the right to do away with the Deprecation Period for several reasons, among which:

“- providing the Deprecated Version of the Service could create a substantial economic burden on Google as determined by Google in its reasonable good faith judgment; or
– providing the Deprecated Version of the Service could create a security risk or material technical burden upon Google as determined by Google in its reasonable good faith judgment.”

A “substantial economic burden” or a “material technical burden”. I am not a lawyer, but good luck going to court against Google arguing that maintaining the API doesn’t provide any such burden if Google claims it does (if I was a Google lawyer, I’d ask “if there’s no such burden, why don’t you provide it yourself?”).

And that’s exactly what Google is claiming in the case of the Translate API. There’s a reason why the words “substantial economic burden” are included in this statement at the very top of the home page for the Google Translate API.

“Important: The Google Translate API has been officially deprecated as of May 26, 2011. Due to the substantial economic burden caused by extensive abuse, the number of requests you may make per day will be limited and the API will be shut off completely on December 1, 2011. For website translations, we encourage you to use the Google Translate Element.”

[For more information about Google’s translation services and what might really be behind this change, I recommend reading this interesting analysis. Whether or not the author is guessing right, it’s an interesting perspective on how the marriage between Search and Advertising turns into a love triangle (with Translation as the third party) when you take a worldwide perspective.]

Protection against such changes (changes in ToS or the API going away entirely) would be a nice value-add for the companies that offer services around managing API usage. I vaguely remember an earlier announcement about Apigee providing a way to invoke the Twitter API that was sheltered from Twitter’s throttling (presumably via some agreement between Apigee and Twitter). Maybe Apigee can also negotiate long-term bidirectional ToS agreements with API providers on behalf of its customers. If I was building a business on top of a third-party service like Twitter, I’d probably value that kind of guarantee even more than the cool technical tools Apigee provides around API usage.

What’s more important in an API? Whether it’s REST or SOAP or whether you can count on using it for some time in the future?

Comments Off on Backward-looking API ToS vs. forward-looking API usage

Filed under API, Business, Everything, Google, Protocols

Nice incremental progress in Google App Engine SDK 1.4

When Google released version 1.3.8 of the Google App Engine SDK in October, they introduced an instance console, showing you how many instances are serving your application and some basic metrics about these instances. I wrote a blog post to consider the implications of providing this level of visibility to application administrators. It also pointed out some shortcomings of this first version of the console.

The most glaring problem was that the console showed an “average latency” which was just a straight average of the latencies of all the instances, independently of the traffic they see. Which is a meaningless number.

Today, Google released an update to the SDK (1.4), and along with it some minor updates to the instance console. Except that, as you can see below, the screen capture in their announcement happens to show three instances that have processed exactly the same number of messages. Which means that we can’t tell whether they have fixed the “unweighted average” problem or not. Is this just by chance? Google, WTF? (which stands for “what’s the formula?”, of course).

I decided it was worth spending a few minutes to find the answer. I don’t have any app currently in use on GAE, but it doesn’t take much work to generate enough load to wake up one of my old apps and get it to spin a couple of instances. Here is the resulting instance console:

If you run the numbers, you can see that they’ve fixed that issue; the average latency is now weighted based on instance traffic. Thanks Google for listening.
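
For those who like to see the arithmetic, here is the difference between the two formulas on made-up numbers (not actual GAE data):

# Three fictional instances with very different traffic levels.
instances = [
    {"requests": 100, "avg_latency_ms": 50},
    {"requests": 10,  "avg_latency_ms": 500},
    {"requests": 1,   "avg_latency_ms": 2000},
]

# The old (meaningless) number: a straight average across instances.
unweighted = sum(i["avg_latency_ms"] for i in instances) / len(instances)

# The fixed number: weighted by how many requests each instance served.
weighted = (sum(i["requests"] * i["avg_latency_ms"] for i in instances)
            / sum(i["requests"] for i in instances))

print("unweighted: %.1f ms" % unweighted)  # 850.0 ms
print("weighted:   %.1f ms" % weighted)    # 108.1 ms, what users actually experience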

Apparently, not all the updates have trickled down to my version of the instance console. The “requests”, “errors” and “age” columns are missing. I assume they’re on their way. Seeing the age of the instances, especially, is a nice addition, one of those I requested in my blog.

In the grand scheme of things, these minor updates to the console (which remains quite basic) are not the big news. The major announcement with SDK 1.4 is that the dreaded 30 seconds limit on execution time has been lifted for background tasks (those from Task Queue and Cron). It’s now a much more manageable 10 minutes. This doesn’t apply to the execution of Web requests served by your app.

Google App Engine has been under criticism recently, and that 30-second limit (along with reliability issues) figured prominently in the complaints. Assuming the reliability issues are also coming under control, this update will go a long way towards addressing them.

Just so you realize how lucky you are if you are just now starting with Google App Engine, here are the kind of hoops you had to jump through, in the early days, to process any task that took a significant amount of time. This was done a year before the Cron and Task Queue features were added to GAE.

Another nice addition with SDK 1.4 is that you can now retrieve the source code of your application from Google’s servers. Of course you should never need that if you are rigorous and well-organized… Presumably this is only for Python since in the Java case Google’s servers never see the source code.

The steady progress of the GAE SDK continues.

1 Comment

Filed under Application Mgmt, Cloud Computing, Everything, Google, Google App Engine, IT Systems Mgmt, Manageability, Middleware, PaaS, Utility computing

URL shorteners and privacy: The Good, the Bad and the Cookie

The table below compares various URL shorteners based on how much they value service performance and the privacy of their users.

Here is the short version of the reading guide: a URL shortener which gives a high priority to reliability, performance and privacy will use a 301 (“Moved Permanently”) response code, will not use cache control headers and will not use cookies. A URL shortener which gives high priority to its own ability to monetize its traffic by tracking users will deviate on one or more of these points.

Here is how a few of the most popular shorteners perform by this measure (red is bad).

For the long version (and an explanation of how I came to create this table) read below the table.

Service name        | Cookie   | Status code | Caching limitations
t.co (Twitter)      |          | 301         | 5 min
bit.ly              | tracking | 301         |
tinyurl.com         |          | 301         |
goo.gl (Google)     |          | 301         | 24h
wp.me (WordPress)   |          | 301         |
snurl.com           |          | 301         | 10h
fb.me (Facebook)    | (*)      | 301         |
twurl.nl            | tracking | 301         |
is.gd               |          |             |
ping.fm             |          | 301         |
p.ly                | tracking | 301         | no caching
ff.im               | tracking | 301         | (**)
u.nu                |          | 301         |
tiny.cc             | tracking | 301         |
snipurl.com         |          | 301         | 10h
chkit.in            | tracking | 301         |
ur1.ca              |          | 302         | no caching
digs.by             |          | 302         | no caching

Notes:

(*) Facebook’s service, fb.me, tries to set a cookie but its content is “locale=en_US” and cannot be used for identification. In addition, it sets the domain to “.facebook.com” in the Set-Cookie directive but since the response comes from another domain (fb.me) the cookie is actually never returned by the browser and therefore useless. It looks like this is a leftover configuration setting copied from the normal facebook.com servers. Defying all expectations, Facebook comes out as one of the most privacy-friendly URL shorteners.

(**) ff.im limits the cache to being “private”, which means that your browser can cache the result but a shared proxy (e.g. your company’s proxy) should not cache it, forcing each user behind that proxy to resolve the URL once. I magnanimously did not ding them for this, even though it’s sub-optimal.

Now for the longer explanation

Despite the potential it offers to stretch out our tweets, I wasn’t too impressed when I learned of Twitter’s plan to roll out (and mandate) its own URL shortening service. My fundamental issue is that URL shortening is made necessary by an arbitrary decision on Twitter’s part (the 140 character limit and the fact that URLs count toward it) and that it would be entirely within their power to make these abominations unneeded. Or, at least, much more rarely needed (when tinyurl.com came out, the main use case was to insert a very long URL in an email without having problems with carriage returns, not to turn third-world countries into purveyors of silly domain names).

Beyond this fundamental issue, my main concerns about Twitter’s t.co mechanism are that it reduces privacy and it demands that you break the HTTP specification.

From a privacy perspective, the issue is that anyone who clicks on these links tells Twitter where they are going. And Twitter can collect and correlate these actions. The easiest way for them (or any other URL shortener) to do this is to use cookies. Cookies aren’t often used as part of redirections, but technically nothing prevents them. So I wanted to see if Twitter used them.

[Side note: in practice there are ways to track your browser without using identifying cookies, not to mention simply using the IP address which works quite well on people who browse from home. Still, identifying cookies are the preferred method.]

From a specification conformance perspective, the problem is that Twitter announced that they would modify the Terms of Service of their API to prevent you from replacing the short URL with the real location once you’ve resolved it the first time (as of this writing they apparently haven’t yet made the ToS change). That behavior would be in violation of the HTTP specification if the redirection used status code 301 (“Moved Permanently”) which states that “any future references to this resource SHOULD use one of the returned URIs” and “clients with link editing capabilities ought to automatically re-link references to the Request-URI to one or more of the new references returned by the server“. So I wanted to see whether t.co indeed returns a 301 (and asks us to violate the spec) or if they use a Temporary Redirect (302 or the new 307) in which case the specification would not be violated but other problems would arise (for example, search engines would not give you PageRank karma for such a link).

The other (spec-compliant) way to force a 301 to call back home once in a while is the (strange but legal) practice of using cache control headers on permanent redirections. So I also wanted to see how t.co behaves on that front.

And then I decided to also test a few other services, which is how the table above came to be.
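
For the record, here is the kind of minimal check I ran, sketched in Python (the example short link is made up):

import http.client
from urllib.parse import urlparse

def inspect_shortener(short_url):
    # Fetch the short URL without following the redirect and report the
    # status code, any cookie-setting attempt and any caching restriction.
    parts = urlparse(short_url)
    conn_class = (http.client.HTTPSConnection if parts.scheme == "https"
                  else http.client.HTTPConnection)
    conn = conn_class(parts.netloc)
    conn.request("GET", parts.path or "/")
    resp = conn.getresponse()
    return {
        "status": resp.status,                              # 301 vs 302/307
        "cookie": resp.getheader("Set-Cookie"),             # tracking?
        "caching": resp.getheader("Cache-Control") or resp.getheader("Expires"),
    }

# Example with a made-up short link:
# print(inspect_shortener("http://bit.ly/abc123"))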

Comments Off on URL shorteners and privacy: The Good, the Bad and the Cookie

Filed under Everything, Facebook, Google, Protocols, Security, Social networks, Tech, Testing, Twitter

Integration patterns for social data: the Open Social Data Bus

The previous entry, “Don’t tell Facebook what you like, tell Twitter”, used Twitter and Facebook as examples to illustrate a general point about the integration of social profile data. Unfortunately, the examples may have overshadowed the larger point. In the post, I didn’t consider Twitter as a social network but as a message conduit. Most people on the other hand think of Twitter as a social network (after all, which Twitterer is not watching his/her follower count?) and could come away with the impression that I was just saying that Twitter is a better social network than Facebook. It wasn’t my point.

The main point is about defining the right integration pattern for social data: is it a “message bus” pattern or a “shared database” pattern? For readers who haven’t had the joy of dealing with integration architecture and enterprise integration patterns, here is a one-paragraph primer:

The expense report application in a company needs to be in sync with the data in the HR system, so that an expense report can be sent to the right manager for review/approval. Implementing such application integration in an efficient, resilient and flexible way is hard. Battle-tested approaches (high-level “patterns”) have emerged that have been successful, in the right context. Architects have learned that 99% of the time they are better off asking themselves which of these enterprise integration patterns is right for their problem, rather than trying to invent a new approach. Two of the most common basic patterns are the “shared database” pattern and the “message bus” pattern. In the “shared database” pattern, all the applications read and write to the same repository. In the “message bus” pattern, applications post messages on a shared channel (the “bus”) and also listen on the channel for messages from other applications that they are interested in. It’s similar to a radio channel of the kind used by police and ham radio operators.

(diagrams by Hohpe/Woolf, under cc license)
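
For readers who prefer code to diagrams, here is a toy illustration of the two patterns in Python (a real bus adds security, persistence and scale; this only shows the shape of the interaction):

# "Shared database" pattern: every application reads and writes one store.
shared_db = {}
shared_db["employee:42:manager"] = "employee:7"   # the HR system writes
manager = shared_db["employee:42:manager"]        # the expense app reads

# "Message bus" pattern: applications publish to a channel and subscribe to
# the messages they care about, without knowing about each other.
subscribers = []

def subscribe(handler):
    subscribers.append(handler)

def publish(message):
    for handler in subscribers:
        handler(message)

subscribe(lambda msg: print("expense app saw:", msg))
publish({"event": "manager-changed", "employee": 42, "manager": 7})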

Facebook wants your social data to be shared across sites and applications using the “shared database” pattern, in which Facebook is the central database (and also the primary application). What I described in the previous post was the use of a “message bus” pattern (in which Twitter was used as the bus).

A bus has the following advantages when applied to the problem of sharing social data:

  • All applications have equal access
  • The applications are loosely-coupled, meaning that changing one doesn’t break the others
  • If applications only communicate via the bus, you get to observe the data shared about you
  • It can scale well

There are lots of interesting considerations about how to build and operate such a bus: security, scalability, access protocols, payload format, etc. But they are secondary to the choice of the integration pattern. For the sake of illustration, Twitter’s approach to security is OAuth, their scalable architecture is described here, the access protocols here and the payload format here. Reasonable alternatives exist for all these functions.

It’s hard for me to imagine the content of the messages on this bus not resembling RDF-like subject/verb/object triplets, in which the subject is implicit (the user attached to the message). The verbs could be simple strings or represented by URIs and have an associated taxonomy. And as in RDF, the objects should be either URIs or simple values (mostly strings, of a limited size, be it 140 characters or something else). Possible examples (the subject is implicit, the verb is in square brackets):

[say] I just had coffeecake for breakfast
[like] http://www.hobees.com/
[location] http://www.hobees.com/redwood.html
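
To make this concrete, here is a small Python sketch of how such messages could be represented and serialized into the bracketed form above (the names are illustrative, not a proposed standard):

from collections import namedtuple

# Subject is implicit (the authenticated user), verb is a short tag or URI,
# object is a URI or a short string.
SocialMessage = namedtuple("SocialMessage", ["verb", "obj"])

def to_message(msg):
    # Serialize into the bracketed form used in the examples above.
    return "[%s] %s" % (msg.verb, msg.obj)

print(to_message(SocialMessage("say", "I just had coffeecake for breakfast")))
print(to_message(SocialMessage("like", "http://www.hobees.com/")))
print(to_message(SocialMessage("location", "http://www.hobees.com/redwood.html")))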

I still think Twitter is the most practical implementation of the Open Social Data Bus, for reasons I listed before:

  • It’s here today
  • It’s open and makes no pretense of (often violated) “privacy settings”
  • It can scale (give or take some growing pains and some still-drastic quota restrictions)
  • It has a delegated authorization model (though not quite as fine-grained as I’d like)
  • It already has a large ecosystem of provider/consumer applications
  • Humans look at the messages, ensuring that any integration of personal data will remain at a human scale and therefore controllable
  • It has proven to be a very successful environment for semantic tags to emerge spontaneously
  • It is persisted by many actors, including Google, Bing and the Library of Congress
  • Did I mention that it’s here today?

I remember discussions, in the early-to-mid-nineties, about whether the Internet, this quirky but fast-growing network, would turn into the expected global “information superhighway” or whether a superior one would have to emerge. This might seem like a silly discussion today but it wasn’t so obvious at the time. Wondering whether Twitter will turn out to be the Open Social Data Bus will seem just as silly in 15 years, though I don’t know if it will be deemed silly because the answer was obviously “no” or obviously “yes”…

The tension between Twitter as an infrastructure provider and Twitter as a competitor in the Twitter app marketplace is well-known. The company understands that what makes them different from other social networks is the ecosystem of applications that was enabled by this “message bus” pattern. Which is why, even as they announced that they were going to create their own applications to tap into the stream, they took pains to explain that they would be calling the same interfaces as everybody else.

On the other hand, Twitter obviously also needs to worry about making money.  If their service becomes a low-level service, invisible to users (almost like DNS), then who is going to pay for the operations? Especially since the expectations on Twitter are currently so high that a “normal” rate of profit on operating such an infrastructure would be a huge letdown for investors. But this is not a post about the business prospects and strategic challenges of Twitter. It’s about allowing integration of social profile data in a way that benefits users.

I’d be fine with some other Open Social Data Bus implementation taking over and serving this need, as long as it fulfills the key requirements of being equally open to all applications and allowing individuals to control what gets posted about them. There are other avenues if Twitter cannot (or doesn’t want to) play this role. As the DNS example shows, it doesn’t necessarily have to be operated by a single operator. And there are a variety of funding models for such essential infrastructure (see “who funds root name server operations?” in the DNS root name servers FAQ). Alternatively, applications might be charged based on how much data they get from the bus.

Corporate support can take different forms. From wireless frequencies to wi-fi networks to DNS to supporting Firefox, Google has shown a willingness to support the development and operation of the internet infrastructure, confident that they’ll be in the best position to benefit from it. Especially if the alternative is what Pete Cashmore describes as “Google’s nightmare”.

You could even think of this service eventually falling under the “common carrier” model, with the corresponding legal constraints. Especially in societies that are more privacy-aware.

I don’t know what the right business/operating model is for the Open Social Data Bus. What I know is that it’s how I want my social profile data to flow between applications.

[UPDATED 2010/5/20: Some supporting evidence for my recollection of “discussions, in the early-to-mid-nineties, about whether the Internet, this quirky but fast-growing network, would turn into the expected global ‘information superhighway’ or whether a superior one would have to emerge”:

Gates’s 286-page book [The Road Ahead, 1995] mentions the World Wide Web on only four of its pages, and portrays the Internet as a subset of a much larger “Information Superhighway.” The Internet, wrote Gates, is one of “the important precursors of the information highway,” along with PCs, CD-ROMs, phone networks, and cable systems, but “none represents the actual information highway. … today’s Internet is not the information highway I imagine, although you can think of it as the beginning of the highway.”]

4 Comments

Filed under Everything, Facebook, Google, Social networks, Tech, Twitter

Don’t tell Facebook what you like, tell Twitter

There seems to be a lot to like technically about the announcements at Facebook’s f8 conference, especially for a Semantic Web aficionado. But I won’t have anything to do with it as a user. Along with the usual “your privacy is our toy” subtext, I really don’t like the lack of data portability. “Web 2.0” is starting to look a lot like “AOL 2.0”. Here is a better way to do it.

Taking the new “like” button as a simple example, I’d much rather tell Twitter what I like than Facebook. A simple #like hashtag in a tweet can be used to express positive feelings for what the tweet describes. Here is a quick list of the many advantages of this approach over the newly-introduced Facebook “like” feature.

It’s public

Your tweets are available to all. Your Facebook profile can still consume them, so if you think Facebook does the best job at organizing this information about you and your friends you can still go there to view the results. But other applications and networks can tap into the same data, so you can also benefit from innovation coming out of companies which do not want to be Facebook sharecroppers.

It’s publicly public

By which I mean that there is no pretense of privacy and no nasty surprise when trust is violated. Which is going to happen again and again. Especially when it’s not just a matter of displaying data but also of inferring new information based on the raw data collected. At which point it’s almost impossible to segregate access to the derived information based on the privacy settings of the individual data pieces. On Twitter, it’s all public, we all know it from the start, and as such we’re not fooled into sharing more than we should. See the fallacy of privacy settings.

It works on all things

Rather than only being on a web page, you can use a #like hashtag to describe any URI (dereferenceable or not) or even plain text. Just like RDF allows the value of an attribute to be either a URI or a scalar value (string, number…). For example, you can express that you like a quote or a verse of a poem by including them directly in the tweet. It’s not as identifiable as something that has a URI, but it can still be part of your profile. And smart consumers of this data might still be able to do some processing on it (e.g. recognizing it as a line from a song).

It can still be 1-click

You don’t necessarily have to copy/paste a URL (or text) into Twitter. A web site can still do this for you, as long as it has your permission to post on your behalf. With that approach, it looks exactly like the Facebook “like” button to the user. You don’t have to be an active Twitter user, you just need a Twitter account. No need for a Twitter client or to visit the Twitter web site if you don’t want to. It’s also OK if you have zero followers, Twitter is just a technical conduit in this approach.

It can evolve

The success of Twitter is also the success of self-organization as illustrated by the emergence of @replies, #hashtags and RT, directly from the users. Rather than having Facebook decide what verbs make sense to allow users to express their thoughts on the Web, let people decide and see what verbs emerge (e.g. to describe what you like, dislike, are curious about, are considering buying, etc). The only thing we need is an understanding that the hashtag qualifies the user’s attitude towards what’s described by the rest of the tweet. Or maybe hashtags should not be reused for this, maybe we need a new breed, “semtags” (semantic tags), with a different syntax, e.g. “^like”. This way you can semtag a hashtag, e.g. “^like #nyc” might replace “I ♥ NY” on Twitter feeds (and tee shirts). It can be as simple or as complex as needed, based on what sticks in the real world. Nerds like me will try to qualify it (e.g. “^!like” for “I don’t like”) and might even come up with ontologies (^love subClassOf ^like). These experiments will probably fail and that’s fine. Evolution thrives on failures.
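
For illustration only, here is a toy parser for this hypothetical semtag syntax in Python (everything about it, from the “^” prefix to the verb names, is made up):

import re

# "^verb object", with an optional "!" for negation (e.g. "^!like").
SEMTAG = re.compile(r"\^(!?)(\w+)\s+(.+)")

def parse_semtag(tweet):
    match = SEMTAG.match(tweet.strip())
    if match is None:
        return None
    negated, verb, obj = match.groups()
    return {"verb": verb, "negated": bool(negated), "object": obj}

print(parse_semtag("^like #nyc"))
print(parse_semtag("^!like http://example.com/some-page"))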

It is transparent

Even if you let a site write these messages on your Twitter feed, you can see exactly what goes on. There is no secret channel as with Facebook. The fact that it goes on your Twitter timeline acts as a validation, ensuring that only relevant, human-readable messages get added to your profile. Which is the only way in which we can maintain control of our profile information. If sites start to send too much information or opaque information you’ll see it. And so will your followers. This will put pressure on sites to make the posted data sparse and meaningful, because they know that their users won’t want to scare away their followers with social spam. See, for example, how the outcries over foursquare spam seem to have forced a clean-up (or at least so it looks to me, but maybe it’s just because I’ve unfollowed the spammers). Keeping social profiling on a human scale is a feature, not a bug.

It is persisted in many places

Who do you think is more likely to be around in 20 years, Facebook or the Library of Congress? Tweets are archived in many places, including Twitter itself, of course, but also Google, Bing and the Library of Congress. Plus, it’s very easy for you to set up a system to save all your tweets. Even if Twitter disappears, all the data in your profile that was built from your tweets will still be around. And if Google, Bing and the Library of Congress all go dark before Facebook, well that’s fine because the profile data from your tweets can be there too.

In effect, you should think of Facebook as a repository and Twitter as a stream. Don’t publish directly to one repository. Publish to a stream and benefit from all the repositories and other consumers that tap into it. It’s a well-known enterprise integration pattern (message bus), but it’s not just good for enterprise applications.

In fact, more than Twitter itself it’s this pattern that I want to encourage. Twitter is just the most obvious implementation, at this time, of a profile data bus. It already has almost everything we need (though a more fine-grained authorization model, or a delegated authorization model, would make me more likely to allow sites to tweet on my behalf). What matters is the switch from social networks owning data to you owning your data and social networks competing on how much value they can deliver to you based on the data. For example, LinkedIn might be the best for work connections, Facebook for personal connections, Google for brute search/retrieval of information, etc. I don’t want to maintain different profile data and privacy settings for each of them. I have one global privacy setting, which controls what I share with the world. Based on this, I want these sites to compete on the value they provide to me. It may not be what Facebook wants, but it’s what works best for us.

If you like this proposal, you know what you have to do. Go ahead and tweet:

^like https://stage.vambenepe.com/archives/1464

Or just retweet it.

[UPDATED 2010/5/6: See the next post for some clarifications.]

10 Comments

Filed under Everything, Facebook, Google, Mashup, RDF, Semantic tech, Social networks, Twitter

The fallacy of privacy settings

Another round of “update your Facebook privacy settings right now” messages recently swept through Twitter and blogs. As also happened a few months ago, when Facebook last modified some privacy settings to better accommodate their business goals. This is borderline silly. So, once and for all, here is the rule:

Don’t put anything on any social network that you don’t want to be made public.

Don’t count on your privacy settings on the site to keep your “private” data out of the public eye. Here are the many ways in which they can fail (using Facebook as a stand-in for all the other social networks, this is not specific to Facebook):

  • You make a mistake when configuring the privacy settings
  • Facebook changes the privacy mechanisms on you during one of their privacy policy updates
  • Facebook has a security flaw that bypasses access control
  • One of your friends who has access to your private data accidentally/stupidly/maliciously shares it more widely
  • A Facebook application to which you grant access betrays your trust in accessing the data and exposing it
  • A Facebook application gets hacked
  • A Facebook application retains your data in its cache
  • Your account (or one of your friends’ account) gets hacked
  • Anonymized data that Facebook shares with researchers gets correlated back to real users
  • Some legal action (not necessarily related to you personally) results in a large amount of Facebook data (including yours) seized and exported for legal review
  • Facebook loses some backup media
  • Facebook gets acquired (or it goes out of business and its assets are sold to the highest bidder)
  • Facebook (or whoever runs their hardware) disposes of hardware without properly wiping it
  • [Added 2012/3/8] Your employer or school demands that you hand over your account password (or “friend” a monitor)
  • Etc…

All in all, you should not think of these privacy settings as locks protecting your data. Think of them as simply a “do not disturb” sign (or a necktie…) hanging on the knob of an unlocked door. I am not advising against using privacy settings, just against counting on them to work reliably. If you’d rather your work colleagues don’t see your holiday pictures, then set your privacy settings so they can’t see them. But if it would really bother you if they saw them, then don’t post the pictures on Facebook at all. Think of it like keeping a photo in your wallet. You get to choose who you show it to, until the day you forget your wallet in the office bathroom, or at a party, and someone opens it to find the owner. You already know this instinctively, which is why you probably wouldn’t carry photos in your wallet that shouldn’t be shown publicly. It’s the same on Facebook.

This is what was so disturbing about the Buzz/GMail privacy fiasco. It took data (your list of GMail contacts) that was not created for the purpose of sharing it with anyone, and turned this into profile data in a social network. People who signed up for GMail didn’t sign up for a social network, they signed up for a Web-based email. What Google wants, on the other hand, is a large social network like Facebook, so it tried to make GMail into one by auto-following GMail contacts in your Buzz profile. It’s as if your insurance company suddenly decided it wanted to enter the social networking business and announced one day that you were now “friends” with all their customers who share the same medical condition. And will you please log in and update your privacy settings if you have a problem with that, you backward-looking, privacy-hugging, profit-dissipating idiot.

On the other hand, that’s one thing I like about Twitter. By and large (except for the few people who lock their accounts) almost all the information you put in Twitter is expected to be public. There is no misrepresentation, confusion or surprise. I don’t consider this lack of configurable privacy as a sign that Twitter doesn’t respect the privacy of its users. To the contrary, I almost see this as the most privacy-friendly approach: make it clear that everything is public. Because it is anyway.

One could almost make a counter-intuitive case that providing privacy settings is anti-privacy because it gives an unwarranted sense of security and nudges users towards providing more private data than they otherwise would. At least if the policy settings are not contractual (can you sue Facebook for changing its privacy terms on you?). At least it’s been working that way so far for Facebook, intentionally or not, as illustrated by all the articles that stress the importance of setting our privacy settings right (implicit message: it’s ok to put private information as long as you set your privacy settings).

Yes you should have clear privacy settings. But the place to store them is in your brain and the place to enforce them is by controlling what your fingers do before data gets on Facebook. Facebook and similar networks can only leak data that they posses. A lot of that data comes from you directly uploading it. And that’s the point where you have control. After this, you really don’t. Other data comes from tracking and analyzing your activities and connections, without explicit data upload from you. That’s a lot harder for you to control (you rarely even get asked for your privacy preferences on this data), but that’s out of scope for this blog entry.

Just like banks that are too big to fail are too big to exist, data that is too sensitive to leak from Facebook is too sensitive to be on Facebook.

5 Comments

Filed under Everything, Facebook, Google, Off-topic, Security, Social networks, Twitter

PaaS as a satisfying and success-ready hobbyist platform

I don’t know anyone in Silicon Valley who can code and doesn’t fantasize about writing an accidental killer app. One that gets designed during a long layover in DEN and implemented in a rainy weekend (El Nino is my VC). One that was only supposed to meet the needs of a few friends and is used by half of the world a few months/years later.

I am not talking about seasoned entrepreneurs, who have a network, discipline, resources and enough experience to know that it takes a lot more than a cool idea. Rather, about programming hobbyists (who may or may not be programmers in their day jobs).

By definition, hobbyists only do things that are satisfying. In the rarefied air of Silicon Valley, it also helps if there is a conceivable “upside” to dream about. Platform as a Service (PaaS, e.g. Google App Engine) provides both to software-oriented hobbyists. And makes it very cheap (borderline free), which doesn’t hurt.

Satisfying

In a well-crafted PaaS environment, the development process and the result are both satisfying. I am not a Google shill, but GAE is a fair example. The barrier to entry is very low (the download is less than 10MB and contains all you need to get started). In an hour you have an application running locally. In an hour and 5 minutes you have it deployed and accessible on the web for all. And yet this ease of bootstrap does not come at the cost of too many longer-term limitations (now that the environment has grown a bit from the original limitations and provides scheduled and background jobs). Unlike Yahoo Pipes, for which the first impression is “nifty!” and the second is “gimme a textual representation of my pipe now!”.
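
For reference, this is roughly all the code a GAE “hello world” required at the time, give or take the exact framework version (app.yaml, not shown, maps URLs to this handler):

# hello.py -- more or less the whole application.
import webapp2

class MainPage(webapp2.RequestHandler):
    def get(self):
        # Serve a plain-text greeting; routing, scaling, logging and the
        # container are all the platform's problem, not yours.
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.write('Hello, World!')

app = webapp2.WSGIApplication([('/', MainPage)], debug=True)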

Beyond the easy ramp-up, the main source of satisfaction developing in a PaaS environment is that you spend 99% of your time working on the application. Not the OS, not the firewall, not the application container, not the database. Not to mention having to deal with your co-lo provider or the leased line for the servers in the basement. If you are a hobbyist with only a few spare hours per week, that’s a make or break deal. It also means that you have a fighting chance of developing a secure application because you are responsible for a much smaller surface of attack.

Eons ago (in computing time), Visual Basic was the name of the game for these people. More recent was the rise of PHP. It dramatically lowered the barrier to adoption and provided a quick route to a working web application. I know several non-professional developers (e.g. web designers) who are scared of any “normal” programming language but happily write PHP (often of equivalent complexity BTW). Combine this with the wide availability of ISP-managed PHP environments and you get close to what GAE gives you. At the risk of adding to the annoying trend of retroactively cloudifying everything, I think of ISP-hosted PHP as the first generation of PaaS. But it is focused on “show what’s in the DB” scenarios rather than service-centric / mash-up / web 2.0 integration. And even for DB-centric scenarios, by and large PHP coders don’t want to think too much about model and queries (and it shows). I think Google decided to go with Python rather than the easy route of aping the hosted PHP environments in large part to avoid hitting such ceilings down the road. Not surprisingly, PHP support is currently the most requested GAE feature, ahead of Perl and Ruby. Let’s see if Google tries to get the PHP community on board or prefers to stay clear of such PaaS legacy (already!).

Ready for success

In the unlikely event that your application catches fire and sees wide adoption (which is not impossible, especially if well integrated in a social network), what are you going to do about it? Keeping in mind two constraints: first, this is a part-time hobby of yours. Second, don’t dream of riches. We are talking about an influx of facebookers or twitterers here, with no intention to pay for anything. But click they will. If you were going to answer: “I get funding and hire a real IT staff” then think again. You most likely won’t get funding for your toy app without revenue potential. And even if you do, by the time you have it it’s too late and people have moved on because you were not there when the spotlight was on you.

With a PaaS-based application you have a fighting chance. If the spike is short enough, you may not even hit the limit of the free quota. If it does, you have the choice of whether you are willing to pay to support the extra traffic or not. No change in code required (though it may be advised anyway, if your app wasn’t architected for efficient scaling – PaaS doesn’t entirely take this off your hands).

That “sudden spike” story is a commonly-invoked use case for EC2. And it’s probably true for a start-up with an IT staff (of at least one full-time person). But despite Amazon’s efforts (and those of other providers such as RightScale) this type of scaling is something you have to architect for, and putting it in place takes away from the time you spend coding application features. It also means that you are responsible for more infrastructure (OS and application container at least). Not to mention that IaaS providers don’t usually offer free resources for limited usage, the way Google does (I suspect 99% of GAE apps never get over this quota). Even if a small EC2 instance is not very expensive (though it adds up over time if you keep it up for that occasional user), the difference between “free” and “cheap” is significant. As an application provider you’d like this not to be the case for your users, but as a consumer of infrastructure services you’re on the other side of the deal, aren’t you?

There is a reason why suburbia-bound SUVs are advertised crossing mountain streams. The “I could if I wanted to” line has appeal. For the software hobbyist, knowing that your application won’t crash if it happened to meet success (even if only for a couple of days, e.g. the Slashdot effect) is a good feeling (“I could if *they* wanted to”). In truth this occasion is rare (and likely to end up like this), but you are ready for the eventuality. And if there are enough such hobbyists, then statistically some will encounter it.

The provider’s upside

That last point brings up the topic of the PaaS provider’s upside in this. I have read several critical comments arguing that no company will rewrite their application for GAE (true) and that no start-up will write their new code for it either because of the risk of lock-in (also true: “being bought by Google” is not a bad outcome but “has to be bought by Google” is a bad exit strategy). But I think this misses the point of casting such a wide net and starting with creating a great tool for hobbyists.

After all, Google has made a great business out of monetizing millions of small sites, none of which makes much money by itself. At the very least this can grow the web and, symbiotically, Google. With two possible upsides:

  • Some of these hobbyist applications may actually take off and Google becomes their natural partner/godfather (including managing their user accounts). For example, wouldn’t it be nice for Google if Craigslist or Twitter was running on GAE?
  • The platform eventually evolves into something that makes sense for start-ups, SMBs and/or enterprises to use. Google works out the kinks with less demanding users first.

Interesting times

Two closing thoughts, which I’ll leave undeveloped for now:

  • There is an especially good synergy between mobile apps and PaaS. Once you get past the restaurant tip calculators, many mobile apps need a server-hosted sidekick to do the heavy lifting of gathering/storing/transforming data. As a hobbyist, you want to spend most of your time making your mobile app cool. Which leaves even less time for administering server components. On the server side, you are even less likely to want to deal with anything but application logic. PaaS is especially attractive in these scenarios. Google and Microsoft have to navigate these waters carefully, but look for some synergy/integration stories around GAE + Android and Azure + Windows Mobile respectively. It’s not clear what Apple’s story is here or whether they think they need one. If it surfaces as an issue then we have yet another reason to restart the “Apple buys Adobe” rumor. Or maybe Sanjiva will get a middle-of-the-night call from Steve Jobs…
  • A platform to build/run your application is one thing. A way to reach users is another (arguably much more critical): mobile app stores (especially Apple’s of course), Facebook and next-generation app stores. But this goes beyond the scope of this post.

Just to be clear, I am not in any way suggesting that PaaS is only for hobbyists. I am just saying that right now it is a great tool for them, the best way for an individual programmer to have fun and have an impact. This doesn’t take away from the value that PaaS will eventually deliver to larger organizations.

[UPDATED 2009/10/4: Microsoft Azure apparently supports PHP.]

5 Comments

Filed under Cloud Computing, Everything, Google, Google App Engine, Implementation, Mashup, Microsoft, PaaS, Utility computing, WSO2

/me thinks Google Wave looks like IRC

If you’re not yet seasick with all the reviews of Google Wave, here are a few additional thoughts.

My mental picture for a Wave is that of an IRC channel on which each message is an edit to an XML doc. And where the IRC server (or a bot, like Zakim) keeps a log of all messages. I think it’s the use of bots in Wave as in IRC that pushed me towards this view. The character-per-character update reminded me of the arguments about the comparative values of the Unix “talk” command and IRC. And if the IRC comparison holds water, hang on for the upcoming bot wars. BTW, doesn’t this Wave Federation Protocol look like an ideal opportunity to resurrect the IRC bot attack code that leveraged server splits?

Leaving IRC aside, the other obvious lens through which to look at Wave is the good old WS/REST debate. Let’s brace ourselves for the “is Wave RESTful” analyses that are sure to follow. I’ll note, tongue in cheek, that an alternative (to XMPP) way to implement a Wave could be provided by the WS specifications currently being worked on in the W3C Web Services Resource Access working group: send a succession of WS-RT “Put” messages to a WS-Eventing event sink that, in turn, acts as an event source. Or formalize the sink/source combination more cleanly as a broker from WS-BrokeredNotification. Finally a non-management use case for these specifications! Good luck doing character-by-character updates over this, but I am not sure that this is the most fundamental part of Wave anyway (though it makes for a good demo).

Nick Gall is right to separate the “technology showcase” aspect from the “killer app” aspect. The demo is very nice but it takes more than cool technology to change years of habits and social conventions, supported by hundreds of tools. So I am not sure how much of a killer app this collaboration demo is, however nice. On the other hand, I can see how the underlying framework (or at least the techniques used to create it) could quickly spread. I need more time looking at the federation protocol to decide what I think about it. This blog entry clearly describes the three Ps (product, platform, protocol) and some of the history.

As far as how this may relate to systems management, I don’t see too much alignment from a modeling perspective. What really matters in IT models are the relationships between the entities, and Wave puts a lot more focus on the content of each wave than on its relationships with others. At least for now. The underlying synchronization techniques, on the other hand, seem more readily applicable. The Rasmussen brothers previously created Google Maps, which I found very inspiring from an IT management point of view. Years later the IT management industry still hasn’t caught up with them.

2 Comments

Filed under Everything, Google, JavaScript, Mashup

Cloudman to the rescue?

GMail had a hiccup last night. Last month it had a more serious outage and so did AWS.

In the movie:
Superman: Easy, miss. I’ve got you.
Lois Lane: You’ve got me? Who’s got you?

In the IT world:
Cloud provider: Easy, CIO. I’ve got you.
CIO: You’ve got me? Who’s got you?

With apologies to Werner Vogels.

Comments Off on Cloudman to the rescue?

Filed under Amazon, Application Mgmt, Cloud Computing, Everything, Google, IT Systems Mgmt, Utility computing

What you’ve been spared (aka blog drafts boneyard #1)

I try to keep posts on this blog relevant to the general topic of IT management. Less than 10% of messages are in the “off-topic” category and even those are usually somewhat related to computer technology (mostly rants against the misuse of Flash and against the stupid ways in which US Social Security numbers are used). What this means in practice is that off-topic drafts are often abandoned when I realize that they are not relevant enough to make the cut. My “drafts” folder is a boneyard of such entries. Today, I am relaxing my standards and subjecting you to a list of them (they are still computer-related). Hopefully, either you find at least some of them interesting, or you come out with a renewed appreciation of what you’ve been spared over the years. Since they are all in one post, it’s easy to skip them altogether without being too tempted to hit the “unsubscribe” button, for those who really only want to read about IT management (at least from me).

Here is a list of the topics covered below:

  • Messing with a blogger’s head
  • Google search suggestions
  • Google to navigate rather than search
  • What is a computer
  • Is this a site or a feed

Messing with a blogger’s head

I recently looked at the HTTP logs for this site. Maybe I am the last blogger to realize this, but it looks like the online blog readers (e.g. Google Reader, Bloglines…) tell you how many subscribers they have for your feed. They do this through the user-agent HTTP header, which gets logged. It looks something like this:

Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 102 subscribers; feed-id=…)

Of course that’s only on a per-feed basis, so you need to add all the feeds (Atom and the different RSS versions) to get a total. Still, it’s a lot more visibility than I had before.
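
If you want to automate that addition, a few lines of log scraping will do. Here is a minimal sketch, assuming a local access log file and the user-agent layout shown above (the file name and the regular expression are my assumptions; adjust them to whatever your server actually records):

import re

# Hypothetical log location; point this at your own access log.
LOG_FILE = "access.log"

# Matches user-agent strings of the form shown above:
# Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 102 subscribers; feed-id=...)
PATTERN = re.compile(r"(\d+) subscribers; feed-id=([\w.-]+)")

subscribers = {}
for line in open(LOG_FILE):
  match = PATTERN.search(line)
  if match:
    # Keep the latest count reported for each feed (Atom, RSS 2.0, etc.)
    subscribers[match.group(2)] = int(match.group(1))

print "Total subscribers across all feeds:", sum(subscribers.values())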

My first thought was “hey, some people are reading, better watch what I write”. But I quickly discarded that in favor of a more intriguing idea: if bloggers use this data, how hard would it be to mess with their heads? After all, this is not verifiable. Anyone can send HTTP requests with any user-agent they want. I can pick a blog and start sending HTTP GET requests on its feeds with a user agent that pretends to be “Feedfetcher-Google”. And I can set the “subscribers” number to anything I want. To avoid suspicion, I could slowly pump it up, to look like a realistic increase.
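
For what it’s worth, faking that signal takes only a handful of lines. A minimal sketch, with a made-up target URL, subscriber count and feed-id:

import urllib2

# Both the target feed URL and the claimed count are made up for illustration.
feed_url = "http://example.com/feed/"
claimed_subscribers = 250

request = urllib2.Request(feed_url)
# Pretend to be Google's feed fetcher reporting an inflated subscriber count.
request.add_header("User-Agent",
                   "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; "
                   + str(claimed_subscribers) + " subscribers; feed-id=1234567890)")
urllib2.urlopen(request).read()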

Of course, an alert blogger would probably smell a rat if the number of subscribers shot up while the number of incoming links and comments didn’t change, if the site still didn’t show up near the top of Google searches, or if the technorati “authority” didn’t change. Etc. There are plenty of ways to reality-test this. But people have an amazing ability to suspend disbelief when they like what they see, however logic-defying. If you don’t believe me, I have a pile of mortgage-backed securities to sell you.

This stat-pumping experiment could be done as a practical joke. It could be done out of meanness.  It could be done as an unethical and pointless sociological study (how many subscribers does it take for someone to go buy a Porsche on the assumption that the traffic will eventually turn into $$$, how does the impression of popularity change the writing on the blog…). It could even be done as a fraud (guaranteed increase in your subscription numbers if you sign up for my blog marketing service or you get your money back: just check your logs to see the results… – of course you could also generate fake users to create real subscriptions). It hits bloggers where they are the most vulnerable: the ego.

If you are thinking of doing this as a way to be nice to someone who needs encouragement, it will probably backfire. Before you proceed, listen to act two of this radio show (description: “A group called Improv Everywhere decides that an unknown band, Ghosts of Pasha, playing their first ever tour in New York, ought to think they’re a smash hit. So they study the band’s music and then crowd the performance, pretending to be hard-core fans. Improv Everywhere just wants to make the band happy — to give them the best day of their lives. But the band doesn’t see it that way.”)

Google search suggestions

When you enter a Google search query (on google.com or in the Firefox search bar), as soon as you’ve typed a few characters it proposes to complete your search terms (BTW, it’s not just Google, it is now a well-known extension to OpenSearch, but Google pioneered it, at least according to the spec). Something about this just doesn’t sound right. If you think you know what I am looking for, why not propose the most likely answers rather than trying to complete my search request? If you get it right, then I’ll stop typing and I’ll click. Plus, Google already concentrates viewers on a small set of pages for each search query; with this feature, won’t they compound this by concentrating people on a smaller set of queries, further shrinking the Web?
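
For reference, here is roughly what the suggestion plumbing looks like from the client side: the browser sends the partial query to a suggestion URL and, as far as I can tell, gets back a JSON array of the form [query, [completions], [descriptions], [urls]]. In this sketch the endpoint and its “client=firefox” parameter are my assumptions about the public interface, not something documented in this post:

import json
import urllib
import urllib2

# Assumed endpoint used by Firefox-style search suggestions.
url = ("http://suggestqueries.google.com/complete/search?client=firefox&q="
       + urllib.quote("python"))
response = json.loads(urllib2.urlopen(url).read())

# OpenSearch suggestions format: [query, [completions], [descriptions], [urls]]
query, completions = response[0], response[1]
for suggestion in completions:
  print suggestion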

Since Google feels free to give me plenty of unsolicited suggestions, here is mine to them. If you are going to hand-hold people as they write their queries, provide suggestions that disambiguate rather than suggestions that overly constrain. For example, if I type “python”, I get these suggestions:

“python tutorial”, “python list”, “python string”, “python ide”, “python download”, “python for loop”, “python datetime”, “python re”, “python time”, “python os”, all clearly about the programming language. Wouldn’t it be more useful to detect algorithmically that results from searching on “python” fall into three largely disjoint groups, to detect a common word in each group and to ask the user to qualify their “python” request with either “programming”, “snake” or “monty”? Rather than the simpler but, in my opinion, less valuable approach of showing the most popular search queries that start with “python”?

On the other hand, this “most popular” feature has one benefit: it provides plenty of fodder for pop psychology, as I found out when I tried to ask Google why they provide these search suggestions. As soon as I typed “why”, I got suggestions including “why men cheat” and “why did I get married”.

The part I like about all this is the meta-meta aspect. Google doesn’t only suggest what you might want to read based on your search, they even suggest what you might want to search on. What’s the next meta level? Suggesting that you want to do a web search when you’re not even thinking of doing one? You can bet they will if they can. What a butler indeed.

Google to navigate rather than search

Still on the topic of Google, but a positive comment this time. It struck me one day that pretty much every single bookmark I have in Firefox is for an Oracle-internal site, not the public Web. After thinking about it for a minute, I realized the reason: Google doesn’t index the Oracle intranet. When I find a good page there, I can’t be sure I’ll be able to find it again easily, so I bookmark it. On the Web, on the other hand, why bother bookmarking? I pretty much know I can find any page again from my Firefox search bar.

Most of the time, when I use Google, it’s not to find a new page. It’s to get back to a specific page. Case in point: when I want to look something up in the XPath spec (which I have done a few times lately in the context of CMDBf). I know it’s on the W3C web site; I could go there and navigate to the page in a few clicks. I also have a copy of it on my disk; I could open my file explorer and get it from there. But instead I just type “xpath” in Google. Again, I am not really “searching” (trying to find information about XPath), I am just navigating (finding my way back to the spec).

So I started a post to share this brilliant insight, at which point I saw (using Google in “search” mode for once) that Robin Cannon has already perfectly described it.

So I’ll just add a few thoughts to complement what Robin wrote:

  • I am sure the implications in terms of advertising have long been studied by Google (I would guess that people who use Google for navigation are a lot less likely to click on ads than those who are actually searching).
  • AOL had to die for the “AOL keyword” to live.
  • There are serious privacy aspects to letting Google know what you’re up to all the time (but I am not logged into Google, I clean up my cookie jar relatively often and, at least at work, I am behind a large enough firewall to have a mostly anonymized IP).
  • Somewhat ironically, there are potential security benefits. For example, the HP employee credit union is called “Addison Avenue credit union”. Googling for “addison avenue” gets you right there. If you mistype the name and ask for “adison avenue”, you get a suggestion that maybe you meant “addison avenue”, along with a list of links related to “madison avenue”. That’s enough data to realize and correct your mistake. On the other hand, directly typing adisonavenue.com into the navigation bar could have taken you to a spoof site (in reality it takes you to a link farm, not quite as bad, but you never know what it will turn into tomorrow).

BTW, am I the only one who doesn’t know what 2 of the top 3 “Google Fastest Rising Search Terms 2007” relate to (from the list in Robin’s post)?

What is a computer

It started with this New Scientist article: Ten weirdest computers. With all these examples, how do we define what a computer is? Fundamentally, it’s a physical system that can process data. Meaning that you can define a logical data model that can be mapped to the physical characteristics of the system. And the system is such that it (through the laws of physics) changes in such a way that after a time its new physical configuration represents data that corresponds to a calculation that took place on your original data. You get the resulting data by measuring physical characteristics of the system (not necessarily the same physical characteristics that you controlled to represent the input data) and deriving the result data from it. In short, to use a computer:

  • Step 1: you create a system that represents your input data
  • Step 2: you let the laws of physics “do their thing” on the system
  • Step 3: you measure the system to derive your output data

For example, take a spring scale and a bunch of 1kg weights. That’s a computer. At least it can add (within a given range). To calculate “4+8” you put four 1kg weights on the scale, then you put eight more, then you read the number next to the needle and it should tell you “12”. This is an example in which the physical characteristics that you use to provide input data (putting weights on the scale) is different in nature from the physical characteristics that you measure to get the output (the position of the needle, which is really a way to measure the compression of the spring in the scale).
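
Just to make the three steps concrete, here is a toy software model of that scale-as-adder (entirely my own illustration, obviously not how you’d build the real thing):

class SpringScale(object):
  """A toy 'computer' that adds by accumulating 1kg weights."""
  def __init__(self, capacity_kg=20):
    self.capacity_kg = capacity_kg
    self.load_kg = 0

  def put_weights(self, count):
    # Step 1: encode the input data in the physical state of the system
    if self.load_kg + count > self.capacity_kg:
      raise ValueError("outside the computable range of this scale")
    self.load_kg += count
    # Step 2 (the spring compressing under the load) is implicit here

  def read_needle(self):
    # Step 3: measure the system to get the output data
    return self.load_kg

scale = SpringScale()
scale.put_weights(4)
scale.put_weights(8)
print scale.read_needle()   # 12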

Based on this, we can ask the next (and more practically useful) question: what makes a *good* computer? It has the following characteristics:

  • easy to set up
  • easy to measure results at needed precision level
  • not too many side effects (e.g. energy consumption)
  • fast and versatile (planting a pine tree seed and waiting for a pine cone to come out in order to calculate a Fibonacci sequence is a little too slow and too specialized)
  • able to process large amounts of data (that’s where the mechanical scale doesn’t… scale).

On that last topic, there are two ways to process large amounts of data. The way used by current computers is to process little at once but very fast and in a way that makes it very easy to use the output of one operation as input to the next one. The alternative would be to compute a large problem in one go of the physical system. For example, maybe one day we’ll know how to represent a mathematical problem in DNA form, such that we know that the solution to the problem corresponds to the DNA sequence most useful to a bacterium in a given environment, e.g. most likely to resist a given antibiotic. Setting up the computation system, in this case, would be engineering the antibiotic that selects for the problem’s solution. You can put that antibiotic in your Petri dish (or in the food of your 1000 cows, now that’s a “computer farm”), wait for a few days, then sequence the DNA of the bacteria that’s in the dish (or in your cows’ “output” matter, think of it as a “core dump”).

You can think of it as the RISC versus CISC debate, except with many more orders of magnitude in difference between the alternatives.

It is also interesting to note that networks and storage mechanisms (the other two constitutive elements of a data center, along with computers) can be thought of in a very similar way. If step 2 doesn’t change the data and can be made to last long enough, you have a storage system (e.g. engrave text on stone, store the stone for a few thousand years, read the text from the stone). If instead of being far apart in time the locations in which you perform steps 1 and 3 are far apart in space (with step 2 still not changing the data), then you have a networking system.

Is this a site or a feed

Like 99% of the blogs out there, this site is just an HTML rendition of an RSS (or Atom) feed. Isn’t it a little silly to have millions of Web sites (visited by humans) whose structure is dictated by a machine-to-machine protocol? It is especially ironic on a site like mine, which occasionally talks about data models and protocols (and on which you would therefore expect the difference between the two to be understood). But no. Every time a new release of CMDBf comes out, for example, I create a new post with an updated version of the pseudo-algorithm for performing a graph query. Rather than having one page that gets updated (with potentially a “history” feature to access older versions).

As much as I’d like to blame the limitations of WordPress, I think it’s more a sign of my laziness. There are plenty of WordPress extensions that I have never considered. Or I could move to Drupal. The key question is, is there a way to get a site that is more useful as a unit (“show me what information William provides on his site”), while keeping the value of the feed (“tell me when William adds new content”) and not adding to my workload?

Comments Off on What you’ve been spared (aka blog drafts boneyard #1)

Filed under Everything, Google, Off-topic

Moving towards utility/cloud computing standards?

This Forbes article (via John) channels 3Tera’s Bert Armijo’s call for standardization of utility computing. He calls it “Open Cloud” and it would “allow a company’s IT systems to be shared between different cloud computing services and moved freely between them“. Bert talks a bit more about it on his blog and, while he doesn’t reference the Forbes interview (too modest?), he points to Cloudscape as the vision.

A few early thoughts on all this:

  • No offense to Forbes but I wouldn’t read too much into the article. Being Forbes, they get quotes from a list of well-known people/companies (Google and Amazon spokespeople, Forrester analyst, Nick Carr). But these quotes all address the generic idea of utility computing standards, not the specifics of Bert’s project.
  • Saying that “several small cloud-computing firms including Elastra and Rightscale are already on board with 3Tera’s standards group” is ambiguous. Are they on-board with specific goals and a candidate specification? Or are they on board with the general idea that it might be time to talk about some kind of standard in the general area of utility computing?
  • IEEE and W3C are listed as possible hosts for the effort, but they don’t seem like a very good match for this area. I would have thought of DMTF, OASIS or even OGF first. On the face of it, DMTF might be the best place but I fear that companies like 3Tera, Rightscale and Elastra would be eaten alive by the board member companies there. It would be almost impossible for them to drive their vision to completion, unlike what they can do in an OASIS working group.
  • A new consortium might be an option, but a risky and expensive one. I have sometimes wondered (after seeing sad episodes of well-meaning and capable start-ups being ripped apart by entrenched large vendors in standards groups) why VCs don’t play a more active role in standards. Standards sound like the kind of thing VCs should be helping their companies with. VC firms are pretty used to working together, jointly investing in companies. Creating a new standard consortium might be too hard for 3Tera, but if the VCs behind 3Tera, Elastra and Rightscale got together and looked at the utility computing companies in their portfolios, it might make sense to join forces on some well-scoped standardization effort that may not otherwise be given a chance in existing groups.
  • I hope Bert will look into the history of DCML, a similar effort (it was about data center automation, which utility computing is not that far from once you peel away the glossy pictures) spearheaded by a few best-of-breed companies but ignored by the big boys. It didn’t really take off. If it had, utility computing standards might now be built as an update/extension of that specification. Of course DCML started as a new consortium and ended as an OASIS “member section” (a glorified working group), so this puts a grain of salt on my “create a new consortium and/or OASIS group” suggestion above.
  • The effort can’t afford to be disconnected from other standards in the virtualization and IT management domains. How does the effort relate to OVF? To WS-Management? To existing modeling frameworks? That’s the main draw towards DMTF as a host.
  • What’s the open source side of this effort? As John mentions during the latest Redmonk/Willis IT management podcast (starting around minute 24), there needs to be an open source side to this. Actually, John thinks all you need is the open source side. Coté brings up Eucalyptus. BTW, if you want an existing combination of standards and open source, have a look at CDDLM (standard) and SmartFrog (implementation, now with EC2/S3 deployment).
  • There seems to be some solid technical raw material to start from. 3Tera’s ADL, combined with Elastra’s ECML/EDML, presumably captures a fair amount of field expertise already. But when you think of them as a starting point to standardization, the mindset needs to switch from “what does my product need to work” to “what will the market adopt that also helps my product to work”.
  • One big question (at least from my perspective) is that of the line between infrastructure and applications. Call me biased, but I think this effort should focus on the infrastructure layer. And provide hooks to allow application-level automation to drive it.
  • The other question is with regards to the management aspect of the resulting system and the role management plays in whatever standard specification comes out of Bert’s effort.

Bottom line: I applaud Bert’s efforts but I couldn’t sleep well tonight if I didn’t also warn him that “there be dragons”.

And for those who haven’t seen it yet, here is a very good document on the topic (but it is focused on big vendors, not on how smaller companies can play the standards game).

[UPDATED 2008/6/30: A couple hours after posting this, I see that Coté has just published a blog post that elaborates on his view of cloud standards. As an addition to the podcast I mentioned earlier.]

[UPDATED 2008/7/2: If you read this in your feed viewer (rather than directly on vambenepe.com) and you don’t see the comments, you should go have a look. There are many clarifications and some additional insight from the best authorities on the topic. Thanks a lot to all the commenters.]

20 Comments

Filed under Amazon, Automation, Business, DMTF, Everything, Google, Google App Engine, Grid, HP, IBM, IT Systems Mgmt, Mgmt integration, Modeling, OVF, Portability, Specs, Standards, Utility computing, Virtualization

Some breathing room for Google App Engine requests

As promised to Felix, here is the code that shows how to give extra breathing room to Google App Engine (GAE) requests that may otherwise be killed for taking too long to complete. The approach is similar to the one previously described. But rather than trying to emulate a long-running process, I am simply allowing a request to spread its work over a handful of invocations, thus getting several 9-second slots (since this seems to be how much time GAE gives you per request right now).

If all your requests need this then you are going to run into the same “high CPU requests have a small quota, and if you exceed this quota, your app will be temporarily disabled” problem seen in the previous experiment. But if 90% of your requests complete in a normal time and only 10% of the requests need more time, then this approach can help prevent your users from getting an error for 1 out of every 10 requests. And you should fly under the radar of the GAE resource cop.

The way it works is simply that if your request is interrupted for having run too long the client gets a redirect to a new instance of the same handler. Because the code saves its results incrementally in the datastore, the new instance can build on the work of the previous one.

This specific example retrieves the ubuntu-8.04-server-i386.jigdo file (98K) from a handful of Ubuntu mirrors and returns the average/min/max download times (without checking whether the transfer was successful or not). I also had to add a 1-second sleep after each fetch in order to trigger the DeadlineExceededError, because the fetch operations go too quickly when running on GAE rather than on my machine (I guess Google has better connectivity than my mediocre AT&T-provided DSL line, who would have thought).

#!/usr/bin/env python
#
# Copyright 2008 William Vambenepe
#

import wsgiref.handlers
import os
import logging
import time

from google.appengine.ext import db
from google.appengine.ext.webapp import template
from google.appengine.ext import webapp
from google.appengine.api import urlfetch
from google.appengine.runtime import DeadlineExceededError

targetUrls = ["http://mirror.anl.gov/pub/ubuntu-iso/CDs/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ubuntu.mirror.ac.za/ubuntu-release/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://mirrors.cytanet.com.cy/linux/ubuntu/releases/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ftp.kaist.ac.kr/pub/ubuntu-cd/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ftp.itu.edu.tr/Mirror/Ubuntu/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ftp.belnet.be/mirror/ubuntu.com/releases/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ubuntu-releases.sh.cvut.cz/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ftp.crihan.fr/releases/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ftp.uni-kl.de/pub/linux/ubuntu.iso/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ftp.duth.gr/pub/ubuntu-releases/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://no.releases.ubuntu.com/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://neacm.fe.up.pt/pub/ubuntu-releases/hardy/ubuntu-8.04-server-i386.jigdo"]

class MeasurementSet(db.Model):
  iteration = db.IntegerProperty()
  measurements = db.ListProperty(float)

class MainHandler(webapp.RequestHandler):
  def get(self):
    try:
      key = self.request.get("key")
      set = MeasurementSet.get(key)
      if (set == None):
        raise ValueError
      set.iteration = set.iteration + 1
      set.put()
      logging.debug("Resuming existing set, with key " + str(key))
    except:
      set = MeasurementSet()
      set.iteration = 1
      set.measurements = []
      set.put()
      logging.debug("Starting new set, with key " + str(set.key()))
    try:
      # Dereference remaining URLs
      for target in targetUrls[len(set.measurements):]:
        startTime = time.time()
        urlfetch.fetch(target)
        timeElapsed = time.time() - startTime
        time.sleep(1)
        logging.debug(target + " dereferenced in " + str(timeElapsed) + " sec")
        set.measurements.append(timeElapsed)
        set.put()
      # We're done dereferencing URLs, let's publish the results
      longestIndex = 0
      shortestIndex = 0
      totalTime = set.measurements[0]
      for i in range(1, len(targetUrls)):
        totalTime = totalTime + set.measurements[i]
        if set.measurements[i] < set.measurements[shortestIndex]:
          shortestIndex = i
        elif set.measurements[i] > set.measurements[longestIndex]:
          longestIndex = i
      template_values = {"iterations": set.iteration,
                         "longestTime": set.measurements[longestIndex],
                         "longestTarget": targetUrls[longestIndex],
                         "shortestTime": set.measurements[shortestIndex],
                         "shortestTarget": targetUrls[shortestIndex],
                         "average": totalTime/len(targetUrls)}
      path = os.path.join(os.path.dirname(__file__), "steps.html")
      self.response.out.write(template.render(path, template_values))
      logging.debug("Set with key " + str(set.key()) + " has returned")
    except DeadlineExceededError:
      logging.debug("Set with key " + str(set.key())
                    + " interrupted during iteration "+ str(set.iteration)
                    + " with " + str(len(set.measurements)) + " URLs retrieved")
      self.redirect("/steps?key=" + str(set.key()))
      logging.debug("Set with key " + str(set.key()) + " sent redirection")

def main():
  application = webapp.WSGIApplication([("/steps", MainHandler)], debug=True)
  wsgiref.handlers.CGIHandler().run(application)

if __name__ == "__main__":
  main()

I can’t guarantee I will keep it available, but at the time of this writing the application is deployed here if you want to give it a spin. A typical run produces this kind of log:

06-13 01:44AM 36.814 /steps
XX.XX.XX.XX - - [13/06/2008:01:44:45 -0700] "GET /steps HTTP/1.1" 302 0 - -
  D 06-13 01:44AM 36.847
    Starting new set, with key agN2YnByFAsSDk1lYXN1cmVtZW50U2V0GBoM
  D 06-13 01:44AM 37.870
    http://mirror.anl.gov/pub/ubuntu-iso/CDs/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.022078037262 sec
  D 06-13 01:44AM 38.913
    http://ubuntu.mirror.ac.za/ubuntu-release/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0184168815613 sec
  D 06-13 01:44AM 39.962
    http://mirrors.cytanet.com.cy/linux/ubuntu/releases/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0166189670563 sec
  D 06-13 01:44AM 41.12
    http://ftp.kaist.ac.kr/pub/ubuntu-cd/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0205371379852 sec
  D 06-13 01:44AM 42.103
    http://ftp.itu.edu.tr/Mirror/Ubuntu/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0197179317474 sec
  D 06-13 01:44AM 43.146
    http://ftp.belnet.be/mirror/ubuntu.com/releases/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0171189308167 sec
  D 06-13 01:44AM 44.215
    http://ubuntu-releases.sh.cvut.cz/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0160200595856 sec
  D 06-13 01:44AM 45.256
    http://ftp.crihan.fr/releases/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.015625 sec
  D 06-13 01:44AM 45.805
    Set with key agN2YnByFAsSDk1lYXN1cmVtZW50U2V0GBoM interrupted during iteration 1 with 8 URLs retrieved
  D 06-13 01:44AM 45.806
    Set with key agN2YnByFAsSDk1lYXN1cmVtZW50U2V0GBoM sent redirection
  W 06-13 01:44AM 45.808
    This request used a high amount of CPU, and was roughly 28.5 times over the average request CPU limit.
    High CPU requests have a small quota, and if you exceed this quota, your app will be temporarily disabled.

Followed by:

06-13 01:44AM 46.72 /steps?key=agN2YnByFAsSDk1lYXN1cmVtZW50U2V0GBoM
XX.XX.XX.XX - - [13/06/2008:01:44:50 -0700] "GET /steps?key=agN2YnByFAsSDk1lYXN1cmVtZW50U2V0GBoM HTTP/1.1" 200 472
  D 06-13 01:44AM 46.110
    Resuming existing set, with key agN2YnByFAsSDk1lYXN1cmVtZW50U2V0GBoM
  D 06-13 01:44AM 47.128
    http://ftp.uni-kl.de/pub/linux/ubuntu.iso/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.016991853714 sec
  D 06-13 01:44AM 48.177
    http://ftp.duth.gr/pub/ubuntu-releases/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0238039493561 sec
  D 06-13 01:44AM 49.318
    http://no.releases.ubuntu.com/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0177929401398 sec
  D 06-13 01:44AM 50.378
    http://neacm.fe.up.pt/pub/ubuntu-releases/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0226020812988 sec
  D 06-13 01:44AM 50.410
    Set with key agN2YnByFAsSDk1lYXN1cmVtZW50U2V0GBoM has returned
  W 06-13 01:44AM 50.413
  This request used a high amount of CPU, and was roughly 13.4 times over the average request CPU limit.
  High CPU requests have a small quota, and if you exceed this quota, your app will be temporarily disabled.

I believe we can optimize the performance by taking advantage of the fact that successive requests are likely (but not guaranteed) to hit the same instance, allowing global variables to be re-used rather than always going to the datastore. My code is a proof of concept, not an optimized implementation.
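
Here is roughly what that optimization could look like, sketched against the code above (the cache dictionary and helper functions are my addition, not part of the deployed app; since GAE gives no guarantee that two requests land on the same instance, the datastore remains the source of truth):

# Module-level cache, shared by successive requests that happen to land on
# the same instance (not guaranteed, so the datastore stays authoritative).
measurement_cache = {}

def load_set(key):
  # Try the in-process cache first, fall back to the datastore.
  set = measurement_cache.get(key)
  if set is None:
    set = MeasurementSet.get(key)
  return set

def save_set(set):
  # Always persist, but also keep a local copy for a possible successor request.
  set.put()
  measurement_cache[str(set.key())] = set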

Of course, the alternative is to drive things from the client, using JavaScript HTTP requests (rather than HTTP redirect) to repeat the HTTP request until the work has been completed. The list of pros and cons of each approach is left as an exercise to the reader.

[UPDATED 2008/6/13: Added log output. Removed handling of “OverQuotaError” which was not useful since, unlike “DeadlineExceededError”, quotas are not per-request. As a result, splitting the work over multiple requests doesn’t help. Slowing down a request might help, at which point the approach above might come in handy to prevent this slowdown from triggering “DeadlineExceededError”.]

[UPDATED 2008/6/30: Steve Jones provides an interesting analysis of the cut-off time for GAE. Confirms that it’s mainly based on wall-clock time rather than CPU time. And that you can sometimes go just over 9 seconds but never up to 10 seconds, which is consistent with my (much less detailed and rigorous) observations.]

2 Comments

Filed under Everything, Google, Google App Engine, Implementation, Utility computing

Emulating a long-running process (and a scheduler) in Google App Engine

As previously described, Google App Engine (GAE) doesn’t support long running processes. Each process lives in the context of an HTTP request handler and needs to complete within a few seconds. If you’re trying to get extra CPU cycles for some task then Amazon EC2, not GAE, is the right tool (including the option to get high-CPU instances for the CPU-intensive tasks).

More surprising is the fact that GAE doesn’t offer a scheduler. Your app can only get invoked when someone sends it an HTTP request and you can’t ask GAE to generate a canned request every so often (crontab-style). That seems both limiting and arbitrary. In fact, I would be surprised if GAE didn’t soon add support for this.

In the meantime, your best bet is to get an account on a separate server that lets you schedule jobs, at which point you can drive your GAE application from that external scheduler (through HTTP requests to your GAE app). But just for the intellectual exercise, how would one meet the need while staying entirely within the confines of the Google-provided infrastructure?

  • The most obvious option is to piggyback on HTTP requests from your visitors. But:
    • this assumes that you consistently get visitors at a frequency greater than your scheduler’s interval,
    • since you can’t launch sub-processes in GAE, this delays your responses to the visitor,
    • more worrisome, if your scheduled task takes more than a few seconds this means your application might be interrupted by GAE before you respond to the visitor, resulting in a failed request from their perspective.
  • You can try to improve a bit on this by doing this processing not as part of the main request from your visitor but rather by putting in the response HTML some JavaScript that will asynchronously send you HTTP requests in the background (typically not visible to the user). This way, a given visitor will give you repeated invocations for as long as the page is open in the browser. And you can set the invocation interval. You can even create some kind of server-controlled auto-modulation of the interval (increasing it as your number of concurrent visitors increases) so that you don’t eat all your Google-allocated incoming HTTP quota with these XMLHttpRequest invocations. This would probably be a very workable way to do it in practice even though:
    • it only works if your application has visitors who use web browsers, not if it is only consumed by programs (e.g. through RSS feeds or other XML formats),
    • it puts the burden on your visitors who may or may not appreciate it, assuming they realize it is happening (how would you feel if your real estate agent had to borrow your cell phone to arrange home visits for you and their other customers?).
  • While GAE doesn’t offer a scheduler, another Google service, Google Reader, offers one of sorts (see the sketch just below this list). If you register a feed there, Google’s Feedfetcher will retrieve it once in a while (based on my logs, it happens approximately every hour for each of the two feeds for this blog). You can create multiple URLs that all map to the same handler and return some dummy RSS. If you register these feeds with Google Reader, they’ll get pulled once in a while. Of course there is no guarantee that the pulling of the different feeds will be nicely spread out, but if you register enough of them you should manage to get invoked with a frequency compatible with your desired scheduler’s frequency.
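
A minimal sketch of that last option (the URLs, handler and RSS skeleton below are mine, not from an actual deployment): several paths mapped to a single handler that does its “scheduled” work and returns a dummy feed.

import wsgiref.handlers
from google.appengine.ext import webapp

DUMMY_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
<title>Scheduler tick</title>
<link>http://myGaeApp.appspot.com/</link>
<description>Dummy feed whose only purpose is to attract Feedfetcher requests</description>
</channel></rss>"""

class TickHandler(webapp.RequestHandler):
  def get(self):
    # This is where the "scheduled" work goes (keep it well under the request deadline)
    self.response.headers["Content-Type"] = "application/rss+xml"
    self.response.out.write(DUMMY_RSS)

def main():
  # Register /tick1 through /tick10 as separate feeds in Google Reader; each
  # subscription gets pulled on its own schedule, which approximates a cron.
  application = webapp.WSGIApplication(
      [("/tick" + str(i), TickHandler) for i in range(1, 11)], debug=True)
  wsgiref.handlers.CGIHandler().run(application)

if __name__ == "__main__":
  main()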

That’s all nice, but it doesn’t entirely live within the GAE application. It depends on either the visitors or Google Reader. Can we do this entirely within GAE?

The idea is that since a GAE app can only execute within an HTTP request handler, which only runs for a few seconds, you can emulate a long-running process by automatically starting a successor request when the previous one is killed. This is made possible by two characteristics of the GAE runtime:

  • When an HTTP request is canceled on the client side, the request execution on the server is permitted to continue (until it returns or GAE kills it for having run too long).
  • When GAE kills a request for having run too long, it does it through an exception that you have a chance to handle (at least for a few seconds, until you get killed for good), which is when you initiate the HTTP request that spawns the successor process.

If you’ve watched (or played) Rugby, this is equivalent to passing the ball to a teammate during that short interval between when you’re tackled and when you hit the ground (I have no idea whether the analogy also applies to Rugby’s weird cousin called American Football).

In practice, all you have to do is structure your long running task like this:

import logging
from google.appengine.ext import db
from google.appengine.ext import webapp
from google.appengine.api import urlfetch

# Imports and the StopExec kill-switch model are included here for completeness;
# StopExec is explained below.
class StopExec(db.Model):
  pass

class StartHandler(webapp.RequestHandler):
  def get(self):
    if (StopExec.all().count() == 0):
      try:
        id = int(self.request.get("id"))
        logging.debug("Request " + str(id) + " is starting its work.")
        # This is where you do your work
      finally:
        logging.debug("Request " + str(id) + " has been stopped.")
        # Save state to the datastore as needed
        logging.debug("Launching successor request with id=" + str(id+1))
        res = urlfetch.fetch("http://myGaeApp.appspot.com/start?id=" + str(id+1))

Once you have deployed this app, just point your browser to http://myGaeApp.appspot.com/start?id=0 (assuming of course that your GAE app is called “myGaeApp”) and the long-running process is started. You can hit the “stop” button on your browser and turn off your computer, the process (or more exactly the succession of processes) has a life of its own entirely within the GAE infrastructure.

The “if (StopExec.all().count() == 0)” statement is my way of keeping control over the beast (if only Dr. Frankenstein had as much foresight). StopExec is an entity type in the datastore for my app. If I want to kill this self-replicating process, I just need to create an entity of this type and the process will stop replicating. Without this, the only way to stop it would be to delete the whole application through the GAE dashboard. In general, using the datastore as shared memory is the way to communicate with this emulation of a long-running process.
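 
If you want a kill switch that is more convenient than the dashboard, a small handler along these lines (my addition, not part of the original snippet) can create that StopExec entity on demand:

# Assumes the imports and the StopExec model from the snippet above,
# plus a ("/stop", StopHandler) entry in the WSGIApplication route list.
class StopHandler(webapp.RequestHandler):
  def get(self):
    # A single StopExec entity makes the next StartHandler invocation skip its
    # work, which breaks the chain of successor requests.
    StopExec().put()
    self.response.out.write("The long-running process will stop at the next hand-off.")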

A scheduler is an obvious example of a long-running process that could be implemented that way. But there are other examples. The only constraint is that your long-running process should expect to be interrupted (approximately every 9 seconds based on what I have seen so far). It will then re-start as part of a new instance of the same request handler class. You can communicate state between one instance and its successor either via the request parameters (like the “id” integer that I pass in the URL) or by writing to the datastore (in the “finally” clause) and reading from it (at the beginning of your task execution).

By the way, you can’t really test such a system using the toolkit Google provides for local testing, because that toolkit behaves very differently from the real GAE infrastructure in the way it controls long-running processes. You have to run it in the real GAE environment.

Does it work? For a while. The first time I launched it, it worked for almost 30 minutes (that’s a lot of 9 second-long processes). But I started to notice these worrisome warnings in the logs: “This request used a high amount of CPU, and was roughly 21.7 times over the average request CPU limit. High CPU requests have a small quota, and if you exceed this quota, your app will be temporarily disabled.”

And indeed, after 30 minutes of happiness my app was disabled for a bit.

My quota figures on the dashboard actually looked pretty good. This was not a very busy application.

CPU Used 175.81 of 199608.00 Gigacycles (0%)
Data Sent 0.00 of 2048.00 Megabytes (0%)
Data Received 0.00 of 2048.00 Megabytes (0%)
Emails Sent 0.00 of 2000.00 Emails (0%)
Megabytes Stored 0.05 of 500.00 Megabytes (0%)

But the warning in the logs points to some other restriction. Google doesn’t mind if you use a given number of CPU cycles through a lot of small requests, but it complains if you use the same number of cycles through a few longer requests. Which is not really captured in the “understanding application quotas” page. I also question whether my long requests actually consume more CPU than normal (shorter) requests. I stripped the application down to the point where the “this is where you do your work” part was doing nothing. The only actual work, in the “finally” clause, was to open an HTTP connection and wait for it to return (which never happens) until the GAE runtime kills the request completely. Hard to see how this would actually use much CPU. Yet, same warning. The warning text is probably not very reflective of the actual algorithm that flags my request as a hog.

What this means is that no matter how small and slim the task is, the last line (with the urlfetch.fetch() call) by itself is enough to get my request identified as a hog. Which means that eventually the app is going to get disabled. Which is silly, really, because by the time fetch() gets called nothing useful is happening in this request (the work has transitioned to the successor request) and I’d be happy to have it killed as soon as the successor has been spawned. But GAE doesn’t give you a way to set a client-side timeout on outgoing HTTP requests. Neither can you configure the GAE cop to kill you early so that you don’t enter the territory of “this request used a high amount of CPU”.

I am pretty confident that the ability to set client-side HTTP timeout will be added to the urlfetch API. Even Google’s documentation acknowledges this limitation: “Note: Since your application must respond to the user’s request within several seconds, a URL fetch action to a slow remote server may cause your application to return a server error to the user. There is currently no way to specify a time limit to the URL fetch action.” Of course, by the time they fix this they may also have added a real scheduler…

In the meantime, this was a fun exploration of the GAE environment. It makes it clear to me that this environment is still a toy. But a very interesting and promising one.

[UPDATED 2009/28: Looks like a real GAE scheduler is coming.]

15 Comments

Filed under Brain teaser, Everything, Google, Google App Engine, Implementation, Testing, Utility computing

Google App Engine: less is more

“If you have a stove, a saucepan and a bottle of cold water, how can you make boiling water?”

If you ask a mathematician this question, they’ll think about it for a while, and finally tell you to pour the water in the saucepan, light up the stove and put the saucepan on it until the water boils. Makes sense. Then ask them a slightly different question: “if you have a stove and a saucepan filled with cold water, how can you make boiling water?”. They’ll look at you and ask “can I also have a bottle?” If you agree to that request they’ll triumphantly announce: “pour the water from the saucepan into the bottle and we are back to the previous problem, which is already solved.”

In addition to making fun of mathematicians, this is a good illustration of the “fake machine” approach to utility computing embodied by Amazon’s EC2. There is plenty of practical value in emulating physical machines (either in your data center, using VMWare/Xen/OVM or at a utility provider’s site, e.g. EC2). They are all rooted in the fact that there is a huge amount of code written with the assumption that it is running on an identified physical machine (or set of machines), and you want to keep using that code. This will remain true for many many years to come, but is it the future of utility computing?

Google’s App Engine is a clear break from this set of assumptions. From this perspective, the App Engine is more interesting for what it doesn’t provide than for what it provides. As the description of the Sandbox explains:

“An App Engine application runs on many web servers simultaneously. Any web request can go to any web server, and multiple requests from the same user may be handled by different web servers. Distribution across multiple web servers is how App Engine ensures your application stays available while serving many simultaneous users [not to mention that this is also how they keep their costs low — William]. To allow App Engine to distribute your application in this way, the application runs in a restricted ‘sandbox’ environment.”

The page then goes on to succinctly list the limitations of the sandbox (no filesystem, limited networking, no threads, no long-lived requests, no low-level OS functions). The limitations are better described and commented upon here, but even that article misses one major limitation, mentioned here: the lack of a scheduler/cron.

Rather than a feature-by-feature comparison between the App Engine and EC2 (which Amazon would win handily at this point), what is interesting is to compare the underlying philosophies. Even with Amazon EC2, you don’t get every single feature your local hardware can deliver. For example, in its initial release EC2 didn’t offer a filesystem, only a storage-as-a-service interface (S3 and then SimpleDB). But Amazon worked hard to fix this as quickly as possible in order to appear as similar to a physical infrastructure as possible. In this entry, announcing persistent storage for EC2, Amazon’s CTO takes pains to highlight this achievement:

“Persistent storage for Amazon EC2 will be offered in the form of storage volumes which you can mount into your EC2 instance as a raw block storage device. It basically looks like an unformatted hard disk. Once you have the volume mounted for the first time you can format it with any file system you want or if you have advanced applications such as high-end database engines, you could use it directly.”

and

“And the great thing is it that it is all done with using standard technologies such that you can use this with any kind of application, middleware or any infrastructure software, whether it is legacy or brand new.”

Amazon works hard to hide (from the application code) the fact that the infrastructure is a huge, shared, distributed system. The beauty (and business value) of their offering is that while the legacy code thinks it is running in a good old data center, the paying customer derives benefits from the fact that this is not the case (e.g. fast/easy/cheap provisioning and reduced management responsibilities).

Google, on the other hand, embraces the change in underlying infrastructure and requires your code to use new abstractions that are optimized for that infrastructure.

To use an automotive analogy, Amazon is offering car drivers a switch to a gas/electric hybrid that refuels at today’s gas stations, while Google is pushing for a direct jump to hydrogen fuel cells.

History is rarely kind to promoters of radical departures. The software industry is especially fond of layering the new on top of the old (a practice that has been enabled by the constant increase in underlying computing capacity). If you are wondering why your command prompt, shell terminal or text editor opens with a default width of 80 characters, take a trip back to 1928, when IBM defined its 80-columns punch card format. Will Google beat the odds or be forced to be more accommodating of existing code?

It’s not the idea of moving to a more abstracted development framework that worries me about Google’s offering (JEE, Spring and Ruby on Rails show that developers want this move anyway, for productivity reasons, even if there is no change in the underlying infrastructure to further motivate it). It’s the fact that by defining their offering at the level of this framework (as opposed to one level below, like Amazon), Google puts itself in the position of having to select the right framework. Sure, they can support more than one. But the speed of evolution in that area of the software industry shows that it’s not mature enough (yet?) for any party to guess where application frameworks are going. Community experimentation has been driving application frameworks, and Google App Engine can’t support this. It can only select and freeze a few frameworks.

Time will tell which approach works best, whether they should exist side by side or whether they slowly merge into a “best of both worlds” offering (Amazon already offers many features, like snapshots, that aim for this “best of both worlds”). Unmanaged code (e.g. C/C++ compiled programs) and managed code (JVM or CLR) have been coexisting for a while now. Traditional applications and utility-enabled applications may do so in the future. For all I know, Google may decide that it makes business sense for them too to offer a Xen-based solution like EC2 and Amazon may decide to offer a more abstracted utility computing environment along the lines of the App Engine. But at this point, I am glad that the leaders in utility computing have taken different paths as this will allow the whole industry to experiment and progress more quickly.

The comparison is somewhat blurred by the fact that the Google offering has not reached the same maturity level as Amazon’s. It has restrictions that are not directly related to the requirements of the underlying infrastructure. For example, I don’t see how the distributed infrastructure prevents the existence of a scheduling service for background jobs. I expect this to be fixed soon. Also, Amazon has a full commercial offering, with a price list and an ecosystem of tools, while Google only offers a very limited beta environment for which you can’t buy extra capacity (but this too is changing).

2 Comments

Filed under Amazon, Everything, Google, Google App Engine, OVM, Portability, Tech, Utility computing, Virtualization, VMware