Archive for the ‘Google App Engine’ Category

Post

GAE Traffic Splitting

In Application Mgmt,Automation,Cloud Computing,DevOps,Everything,Google,Google App Engine,Utility computing on February 28, 2012 by @vambenepe

Interesting addition to the Google App Engine (GAE) platform in release 1.6.3:  Traffic Splitting lets you run several versions of your application (using a DNS sub-domain for each version) and choose to direct a certain percentage of requests to a specific version. This lets you, among other things, slowly phase in your updates and test the result on a small set of users.

That’s nice, but until I read the documentation for the feature I had assumed (and hoped) it was something else.

Rather than using traffic splitting to test different versions of my app (something which the platform now makes convenient but which I could have implemented on my own), it would be nice if that mechanism could be used to test updates to the GAE platform itself. As described in “Come for the PaaS Functional Model, stay for the Cloud Operational Model“, it’s wishful thinking to assume that changes to the PaaS platform (an update applied by Google admins) cannot have a negative effect on your application. In other words, “When NoOps meets Murphy’s Law, my money is on Murphy“.

What would be nice is if Google could give application owners advanced warning of a platform change and let them use the Traffic Splitting feature to direct a portion of the incoming requests to application instances running on the new platform. And also a way to include the platform version in all log messages.

Here’s the issue as I described it in the aforementioned “Cloud Operational Model” post:

In other words, if a patch or update is worth testing in a staging environment if you were to apply it on-premise, what makes you think that it’s less likely to cause a problem if it’s the Cloud provider who rolls it out? Sure, in most cases it will work just fine and you can sing the praise of “NoOps”. Until the day when things go wrong, your users are affected and you’re taken completely off-guard. Good luck debugging that problem, when you don’t even know that an infrastructure change is being rolled out and when it might not even have been rolled out uniformly across all instances of your application.

How is that handled in your provider’s Operational Model? Do you have visibility into the change schedule? Do you have the option to test your application on the new infrastructure or to at least influence in any way how and when the change gets rolled out to your instances?

Hopefully, the addition of Traffic Splitting to Google App Engine is a step towards addressing that issue.

Post

Nice incremental progress in Google App Engine SDK 1.4

In Application Mgmt,Cloud Computing,Everything,Google,Google App Engine,IT Systems Mgmt,Manageability,Middleware,PaaS,Utility computing on December 2, 2010 by @vambenepe

When Google released version 1.3.8 of the Google App Engine SDK in October, they introduced an instance console, showing you how many instances are serving your application and some basic metrics about these instances. I wrote a blog to consider the implications of providing this level of visibility to application administrators. It also pointed out some shortcomings of this first version of the console.

The most glaring problem was that the console showed an “average latency” which was just a straight average of the latencies of all the instances, independently of the traffic they see. Which is a meaningless number.

Today, Google released an update to the SDK (1.4), and along with it some minor updates to the instance console. Except that, as you can see below, the screen capture in their announcement happens to show three instances that have processed exactly the same number of messages. Which means that we can’t tell whether they have fixed the “unweighted average” problem or not. Is this just by chance? Google, WTF? (which stands for “what’s the formula?”, of course).

I decided it was worth spending a few minutes to find the answer. I don’t have any app currently in use on GAE, but it doesn’t take much work to generate enough load to wake up one of my old apps and get it to spin a couple of instances. Here is the resulting console instance:

If you run the numbers, you can see that they’ve fixed that issue; the average latency is now weighted based on instance traffic. Thanks Google for listening.

Apparently, not all the updates have trickled down to my version of the instance console. The “requests”, “errors” and “age” columns are missing. I assume they’re on their way. Seeing the age of the instances, especially, is a nice addition, one of those I requested in my blog.

In the grand scheme of things, these minor updates to the console (which remains quite basic) are not the big news. The major announcement with SDK 1.4 is that the dreaded 30 seconds limit on execution time has been lifted for background tasks (those from Task Queue and Cron). It’s now a much more manageable 10 minutes. This doesn’t apply to the execution of Web requests served by your app.

Google App Engine has been under criticism recently, and that 30-second limit (along with reliability issues) figured prominently in the complains. Assuming the reliability issues are also coming under control, this update will go a long way towards addressing these issues.

Just so you realize how lucky you are if you are just now starting with Google App Engine, here are the kind of hoops you had to jump through, in the early days, to process any task that took a significant amount of time. This was done a year before the Cron and Task Queue features were added to GAE.

Another nice addition with SDK 1.4 is that you can now retrieve the source code of your application from Google’s servers. Of course you should never need that if you are rigorous and well-organized… Presumably this is only for Python since in the Java case Google’s servers never see the source code.

The steady progress of the GAE SDK continues.

Post

Lifting the curtain on PaaS Cloud infrastructure (can you handle the truth?)

In Application Mgmt,Automation,Cloud Computing,Everything,Google App Engine,IT Systems Mgmt,Manageability,Mgmt integration,Middleware,PaaS,Utility computing on October 19, 2010 by @vambenepe

The promise of PaaS is that application owners don’t need to worry about the infrastructure that powers the application. They just provide application artifacts (e.g. WAR files) and everything else is taken care of. Backups. Scaling. Infrastructure patching. Network configuration. Geographic distribution. Etc. All these headaches are gone. Just pick from a menu of quality of service options (and the corresponding price list). Make your choice and forget about it.

In theory.

In practice no abstraction is leak-proof and the abstractions provided by PaaS environments are even more porous than average. The first goal of PaaS providers should be to shore them up, in order to deliver on the PaaS value proposition of simplification. But at some point you also have to acknowledge that there are some irreducible leaks and take pragmatic steps to help application administrators deal with them. The worst thing you can do is have application owners suffer from a leaky abstraction and refuse to even acknowledge it because it breaks your nice mental model.

Google App Engine (GAE) gives us a nice and simple example. When you first deploy an application on GAE, it is deployed as just one instance. As traffic increases, a second instance comes up to handle the load. Then a third. If traffic decreases, one instance may disappear. Or one of them may just go away for no reason (that you’re aware of).

It would be nice if you could deploy your application on what looks like a single, infinitely scalable, machine and not ever have to worry about horizontal scale-out. But that’s just not possible (at a reasonable cost) so Google doesn’t try particularly hard to hide the fact that many instances can be involved. You can choose to ignore that fact and your application will still work. But you’ll notice that some requests take a lot more time to complete than others (which is typically the case for the first request to hit a new instance). And some requests will find an empty local cache even though your application has had uninterrupted traffic. If you choose to live with the “one infinitely scalable machine” simplification, these are inexplicable and unpredictable events.

Last week, as part of the release of the GAE SDK 1.3.8, Google went one step further in acknowledging that several instances can serve your application, and helping you deal with it. They now give you a console (pictured below) which shows the instances currently serving your application.

I am very glad that they added this console, because it clearly puts on the table the question of how much your PaaS provider should open the kimono. What’s the right amount of visibility, somewhere between “one infinitely scalable computer” and giving you fan speeds and CPU temperature?

I don’t know what the answer is, but unfortunately I am pretty sure this console is not it. It is supposed to be useful “in debugging your application and also understanding its performance characteristics“. Hmm, how so exactly? Not only is this console very simple, it’s almost useless. Let me enumerate the ways.

Misleading

Actually it’s worse than useless, it’s misleading. As we can see on the screen shot, two of the instances saw no traffic during the collection period (which, BTW, we don’t know the length of), while the third one did all the work. At the top, we see an “average latency” value. Averaging latency across instances is meaningless if you don’t weight it properly. In this case, all the requests went to the instance that had an average latency of 1709ms, but apparently the overall average latency of the application is 569.7ms (yes, that’s 1709/3). Swell.

No instance identification

What happens when the console is refreshed? Maybe there will only be two instances. How do I know which one went away? Or say there are still three, how do I know these are the same three? For all I know it could be one old instance and two new ones. The single most important data point (from the application administrator’s perspective) is when a new instance comes up. I have no way, in this UI, to know reliably when that happens: no instance identification, no indication of the age of an instance.

Average memory

So we get the average memory per instance. What are we supposed to do with that information? What’s a good number, what’s a bad number? How much memory is available? Is my app memory-bound, CPU-bound or IO-bound on this instance?

Configuration management

As I have described before, change and configuration management in a PaaS setting is a thorny problem. This console doesn’t tackle it. Nowhere does it say which version of the GAE platform each instance is running. Google announces GAE SDK releases (the bits you download), but these releases are mostly made of new platform features, so they imply a corresponding update to Google’s servers. That can’t happen instantly, there must be some kind of roll-out (whether the instances can be hot-patched or need to be recycled). Which means that the instances of my application are transitioned from one platform version to another (and presumably that at a given point in time all the instances of my application may not be using the same platform version). Maybe that’s the source of my problem. Wouldn’t it be nice if I knew which platform version an instance runs? Wouldn’t it be nice if my log files included that? Wouldn’t it be nice if I could request an app to run on a specific platform version for debugging purpose? Sure, in theory all the upgrades are backward-compatible, so it “shouldn’t matter”. But as explained above, “the worst thing you can do is have application owners suffer from a leaky abstraction and refuse to even acknowledge it“.

OK, so the instance monitoring console Google just rolled out is seriously lacking. As is too often the case with IT monitoring systems, it reports what is convenient to collect, not what is useful. I’m sure they’ll fix it over time. What this console does well (and really the main point of this blog) is illustrate the challenge of how much information about the underlying infrastructure should be surfaced.

Surface too little and you leave application administrators powerless. Surface more data but no control and you’ll leave them frustrated. Surface some controls (e.g. a way to configure the scaling out strategy) and you’ve taken away some of the PaaS simplicity and also added constraints to your infrastructure management strategy, making it potentially less efficient. If you go down that route, you can end up with the other flavor of PaaS, the IaaS-based PaaS in which you have an automated way to create a deployment but what you hand back to the application administrator is a set of VMs to manage.

That IaaS-centric PaaS is a well-understood beast, to which many existing tools and management practices can be applied. The “pure PaaS” approach pioneered by GAE is much more of a terra incognita from a management perspective. I don’t know, for example, whether exposing the platform version of each instance, as described above, is a good idea. How leaky is the “platform upgrades are always backward-compatible” assumption? Google, and others, are experimenting with the right abstraction level, APIs, tools, and processes to expose to application administrators. That’s how we’ll find out.

Post

Analyzing the VMforce announcement

In Application Mgmt,BPM,Cloud Computing,Everything,Google App Engine,Middleware,PaaS,Portability,Spring,Tech,Virtualization,VMforce,VMware on April 27, 2010 by @vambenepe

Let’s start with the disclosures: by most interpretations I work for a competitor to what Salesforce.com and VMWare are trying to do with VMforce. And all I know about VMforce is what I read in a few authoritative blogs by VMWare’s Steve Herrod, VMWare/SpringSource’s Rod Johnson and Salesforce’s Anshu Sharma. So no hard feeling if you jump off right now.

Overall, I like what I see. Let me put it this way. I am now a lot more likely to write an application on force.com than I was last week. How could this not be a good thing for SalesForce, me and others like me?

On the other hand, this is also not the major announcement that the “VMforce is coming” drum-roll had tried to make us expect. If you fell for it, then I guess you can be disappointed. I didn’t and I’m not (Phil Wainewright fell for it and yet isn’t disappointed, asserting that “VMforce.com redefines the PaaS landscape” for reasons not entirely clear to me even after reading his article).

The new thing is that force.com now supports an additional runtime, in addition to Apex. That new runtime uses the Java language, with the constraint that it is used via the Spring framework. Which is familiar territory to many developers. That’s it. That’s the VMforce announcement for all practical purposes from a user’s perspective. It’s a great step forward for force.com which was hampered by the non-standard nature of Apex, but it’s just a new runtime. All the other benefits that Anshu Sharma lists in his blog (search, reporting, mobile, integration, BPM, IdM, administration) are not new. They are the platform services that force.com offers to application writers, whether they use Apex or the new Java/Spring runtime.

It’s important to realize that there are two main parts to a full PaaS platform like force.com or Google App Engine. First there are application runtimes (Apex and now Java for force.com, Python and Java for GAE). They are language-dependent and you can have several of them to support different programming languages. Second are the platform services (reports, mobile, BPM, IdM etc for force.com as we saw above, mostly IdM for Google at this point) which are mostly language agnostic (beyond a library used to access them). I think of data storage (e.g. mySQL, force.com database, Google DataStore) as part of the runtime, but it’s on the edge of the grey zone. A third category is made of actual application services (e.g. the CRM web services out of SalesForce.com or the application services out of Google Apps) which I tend not to consider part of PaaS but again there are gray zones between application support services and application services. E.g. how domain-specific does your rule engine have to be before it moves from one category to the other?

As Umit Yalcinalp (who works for SalesForce) told me on Twitter “regardless of the runtime the devs using the Force.com db will get the same platform benefits, chatter, workflow, analytics”. What I called the platform services above. Which, really, is where most of the PaaS value lies anyway. A language runtime is just a starting point.

So where are VMWare and SpringSource in this picture? Well, from the point of view of the user nowhere, really. SalesForce could have built this platform themselves, using the Spring framework on top of Tomcat, WebLogic, JBoss… Itself running on any OS they want. With or without a hypervisor. These are all implementation details and are SalesForce’s problem, not ours as application developers.

It so happens that they have chosen to run this as a partnership with VMWare/SpringSource which makes a lot of sense from a portfolio/expertise perspective, of course. But this choice is not visible to the application developer making use of this platform. And it shouldn’t be. That’s the whole point of PaaS after all, that we don’t have to care.

But VMWare and SpringSource really want us to know that they are there, so Rod Johnson leads by lifting the curtain and explaining that:

“VMforce uses the Force.com physical infrastructure to run vSphere with a special customized vCloud layer that allows for seamless scaling and management. Above this layer VMforce runs SpringSource tc Server instances that provide the execution environment for the enterprise applications that run on VMforce.”

[Side note: notice what's missing? The operating system. It's there of course, most likely some Linux distribution but Rod glances over it, maybe because it's a missing link in VMWare's "we have all the pieces" story; unlike Oracle who can provide one or, even better, do without.  Just saying...]

VMWare wants us to know they are under the covers because of course they have much larger aspirations than to be a provider to SalesForce. They want to use this as a proof point to sell their SpringSource+VMWare stack in other settings, such as private clouds and other public cloud providers (modulo whatever exclusivity period may be in their contract with SalesForce). And VMforce, if it works well when it launches, is a great validation for this strategy. It’s natural that they want people to know that they are behind the curtain and can be called on to replicate this elsewhere.

But let’s be clear about what part they can replicate. It’s the Java/Spring language runtime and its underlying infrastructure. Not the platform services that are part of the SalesForce platform. Not an IdM solution, not a rules engine, not a business process engine, etc. We can expect that they are hard at work trying to fill these gaps, as the RabbitMQ acquisition illustrates, but for now all this comes from force.com and isn’t directly replicable. Which means that applications that use them aren’t quite so portable.

In his post, Steve Herrod quickly moves past the VMforce announcement to focus on the SpringSource+VMWare infrastructure part, the one he hopes to see multiplied everywhere. The key promise, from the developers’ perspective, is application portability. And while the use of Java+Spring definitely helps a lot in terms of code portability I see some promises in terms of data portability that will warrant scrutiny when VMforce actually rolls out: “you should be able to extract the code from the cloud it currently runs in and move it, along with its data, to another cloud choice”.

It sounds very nice, but the underlying issues are:

  • Does the code change depending on whether I am talking to a local relational DB in my private cloud or whether I am on VMforce and using the force.com database?
  • If it doesn’t then the application is portable, but an extra service i still needed to actually move the data from one cloud to the other (can this be done in-flight? what downtime is needed?)
  • What about the other VMforce.com services (chatter, workflow, analytics…)? If I use them in my code can I keep using them once I migrate out of VMforce to a private cloud? Are they remotely invocable? Does the code change? And if I want to completely sever my links with SalesForce, can I find alternative implementations of these application platform services in my private cloud? Or from another public cloud provider? The answer to these is probably no, which means that you are only portable out of VMforce if you restrain yourself from using much of the value of the platform. It’s not even clear whether you can completely restrain yourself from using it, e.g. can you run on force.com without using their IdM system?

All these are hard questions. I am not blaming anyone for not answering them today since no-one does. But we shouldn’t sweep them under the rug. I am sure VMWare is working on finding workable compromises but I doubt it will be as simple, clean and portable as Steve Herrod implies. It’s funny  how Steve and Anshu’s posts seem to reinforce and congratulate one another, until you realize that they are in large part talking about very different things. Anshu’s is almost entirely about the force.com application platform services (sprinkled with some weird Facebook envy), Steve’s is entirely about the application runtime and its infrastructure.

One thing that I am surprised not to see mentioned is the management aspect of the platform, especially considering the investment that SpringSource made in Hyperic. I can only assume that work is under way on this and that we’ll hear about it soon. One aspect of the management story that concerns me a bit is the lack of acknowledgment of the challenges of configuration management in a PaaS setting. Especially when I read Steve Herrod asserting that the VMWare/SpringSource PaaS platform is going to free us from the burden of “handling code modifications that may be required as the middleware versions change”. There seems to be a misconception that because the application administrators are not the ones doing the infrastructure updates they don’t need to worry about the impact of these updates on their application. Is Steve implying that the first release of the VMWare/SpringSource PaaS stack is going to be so perfect that the hypervisor, guest OS and app server will never have to be patched and versioned? If that’s not the case, then why are those patches suddenly less likely impact the application code? In fact the situation is even worse as the application administrator does not know which hypervisor/OS/middleware patches are being applied and when. They can’t test against the new version ahead of time for validation and they can’t make sure the change is scheduled during a non-critical period for their business. I wrote an entire blog post on this issue six months ago and it’s a little bit disheartening to see the issue flatly denied and ignored. Management is not just monitoring.

Here is another intriguing comment in Steve’s entry: “one of the key differentiators with EC2 based PaaS will be the efficiencies for the many-app model. Customers are frustrated with the need to buy a whole VM as the minimum service unit for their applications. Our PaaS will provide fine-grained resource separation”. I had to read it twice when I realized that the VMWare CTO was telling us that splitting a physical machine into VMs is not a good enough way to share its resources and that you really need middleware-level multi-tenancy. But who can disagree that a GAE-like architecture can support more low-traffic applications on the same server than anything based on VM-based sharing? Which (along with deep pockets) puts Google in position to offer free hosting for low-traffic applications, a great way to build adoption.

These are very early days in the history of PaaS. VMWare, like the rest of us, will need to tackle all these issues one by one. In the meantime, this is an interesting announcement and a noticeable milestone. Let’s just keep our eyes open on the incremental nature of progress and the long list of remaining issues.

[UPDATED 2010/4/29: See the follow-up post, PaaS portability challenges and the VMforce example.]

[UPDATED 2010/6/9: This entry points out how the OS level is a gap in VMWare's portfolio. They took a step in addressing this today, by partnering with Novell to offer SUSE support.]

yalcinalp

Post

Is Business Process Execution the killer app for PaaS?

In Application Mgmt,BPEL,BPM,Business Process,Cloud Computing,Everything,Google App Engine,Middleware,PaaS,Portability,Tech,Utility computing on February 10, 2010 by @vambenepe

Have you noticed the slow build-up of business process engines available “as a service”? Force.com recently introduced a “Visual Process Manager”. Amazon is looking for product managers to help customers “securely compos[e] processes using capabilities from all parts of their organization as well as those outside their organization, including existing legacy applications, long-running activities, human interactions, cloud services, or even complex processes provided by business partners”. I’ve read somewhere (can’t find a link right now) that WSO2 was planning to make its Business Process Server available as a Cloud service. I haven’t tracked Azure very closely, but I expect AppFabric to soon support a BizTalk-like process engine. And I wouldn’t be surprised if VMWare decided to make an acquisition in the area of business process execution.

Attacking PaaS from the business process angle is counter-intuitive. Rather, isn’t the obvious low-hanging fruit for PaaS a simple synchronous HTTP request handler (e.g. a servlet or its Python, Ruby, etc equivalent)? Which is what Google App Engine (GAE) and Heroku mainly provide. GAE almost defined PaaS as a category in the same way that Amazon EC2 defined IaaS. The expectation that a CGI or servlet-like container naturally precedes a business process engine is also reinforced by the history of middleware stacks. Simple HTTP request-response is the first thing that gets defined (the first version of the servlet package was java.servlet.* since it even predates javax), the first thing that gets standardized (JSR 53: servlet 2.3 and JSP 1.2) and the first thing that gets widely commoditized (e.g. Apache Tomcat). Rather than a core part of the middleware stack, business process engines (BPEL and the like) are typically thought of as a more “advanced” or “enterprise” capability, one that come later, as part of the extended middleware stack.

But nothing says it has to be that way. If you think about it a bit longer, there are some reasons why business process execution might actually be a more logical beach head for PaaS  than simple HTTP request handlers.

1) Small contract

Architecturally, the contract between a business process engine and the deployed entities (process definitions) is much smaller than the contract of a GAE-style HTTP handler. Those GAE contracts include an entire programming language and lots of libraries. A BPEL container, on the other hand, has a simple contract. It’s documented in one specification (plus a few dependencies) and offers basic activities like routing logic, message correlation, simple data manipulation, compensation handlers and service invocation. You may not think of BPEL as “simple” but would you rather implement a BPEL engine or a complete Python interpreter along with most of the core libraries? I thought so. That’s what I mean by a simpler (narrower) contract. And BPEL is just one example, I suspect some PaaS platforms will take a more bare-bone approach (e.g. no “scopes”).

Just like “good fences make good neighbors”, small contracts make good Cloud services. When your container only interprets a business process definition (typically an XML document), you don’t need to worry about intercepting/preventing all the nasty low-level APIs (e.g. unfettered network access, filesystem reads, OS calls…) that are not acceptable in a PaaS situation. But that is what Google had to do in the process of pairing down a general-purpose programming language to fit into the constraints of a PaaS container. There is no intrinsic reason why a synchronous HTTP request handler has to have access to image-manipulation libraries and a business process handler doesn’t. But the use cases tend to push you in that direction and the expectations have been set. As a result, a business process engine is architecturally a better candidate for being delivered as a Cloud service.

2) Major differentiator over IaaS-based solutions

Practically speaking, it is pretty easy today to get a (synchronous) Web app framework up and running “in the Cloud”. Provisioning a Django, PHP, RoR or Tomcat (plus the Java framework of your choice) stack on EC2 is a well-traveled path. Even auto-scaling these things is pretty well understood. I am the first one to scream that “here is an AMI of our server stack” is *not* the same as PaaS, but truth be told many people are happy enough with it. As a result, the benefit of going from a “web app on IaaS” situation to GAE-like situation is not perceived as very compelling. I suspect the realization may hit later, but for now people are happy to trade the simplified administration and extra scalability of PaaS for the ability to keep their current framework (MySQL and all) unchanged.

There is no fundamental reason why you can’t run a business process engine on top of an IaaS-provisioned infrastructure. It’s just that you are mostly on your own at this point. Even if you find an existing public AMI that meets your needs, I doubt you’ll find a well-tested way to manage, backup and auto-scale this system (marrying IaaS-level invocations with container-level and DB-level tasks). Or if you do it will probably cost you. In that “new frontier” context, a true PaaS alternative to the “build it on top of IaaS” approach is a lot more compelling than if all you need is yet another RoR-on-EC2 system.

When deciding whether to walk back to your hotel after dinner or take a cab, you don’t just consider the distance. How familiar you are with the neighborhood and how safe it appears are also important parameters.

3) There is an existing market

This may not be obvious to people who come to PaaS from a Web application framework perspective, but there is a large market for business process engines in enterprise integration scenarios. Whether it’s Oracle Fusion Middleware, Microsoft BizTalk, webMethods (now Software AG) or others, this is a very common and useful tool in the enterprise computing toolbox. If this is the market you are after (rather than creating Facebook apps or the next Twitter), then you have to address this need. Not to mention that business processes engines are often used for partner integration scenarios (which makes hosting in a public Cloud a natural choice).

Conclusion

In the end, both synchronous and asynchronous execution engines are useful, as are other core services like storage (here is my proposed list of PaaS container types). I just wanted to bring some attention to business process execution because I think PaaS is the context in which its profile will rise to higher prominence. I also anticipate that this rise will lead to some very interesting progress and innovation in the way these processes are defined, deployed and managed. We haven’t yet seen, in this area, the relentless evolutionary pressure that has shaped today’s synchronous Web application frameworks. Fun times ahead.

[UPDATED 2010/2/18: More information about Salesforce.com's Visual Process Manager.]

Post

Backward-compatible vs. forward-compatible: a tale of two clouds

In Amazon,Application Mgmt,Cloud Computing,Everything,Google App Engine,IT Systems Mgmt,Utility computing,Virtualization on January 7, 2010 by @vambenepe

There is the Cloud that provides value by requiring as few changes as possible. And there is the Cloud that provides value by raising the abstraction and operation level. The backward-compatible Cloud versus the forward-compatible Cloud.

The main selling point of the backward-compatible Cloud is that you can take your existing applications, tools, configurations, customizations, processes etc and transition them more or less as they are. It’s what allowed hypervisors to spread so quickly in the enterprise.

The main selling point of the forward-compatible Cloud is that you are more productive and focused. Fewer configuration items to worry about, fewer stack components to install/monitor/update, you can focus on your application and your business goals. You develop and manage at the level of application concepts, not systems. Bottom line, you write and deploy applications more quickly, cheaply and reliably.

To a large extent this maps to the distinction between IaaS and PaaS, but it’s not that simple. For example, a PaaS that endeavors to be a complete JEE environment is mainly aiming for the backward-compatible value proposition. On the other hand, EC2 spot instances, while part of the IaaS layer, are of the forward-compatible kind: not meant to run your current applications unchanged, but rather to give you ways to create applications that better align with your business goals.

Part of the confusion is that it’s sometimes unclear whether a given environment is aiming for forward-compatibility (and voluntary simplification) or whether its goal is backward-compatibility but it hasn’t yet achieved it. Take EC2 for example. At first it didn’t look much like a traditional datacenter, beyond the ability to create hosts. Then we got fixed IP, EBS, boot from EBS, etc and it got more and more realistic to run applications unchanged. But not quite, as this recent complaint by Hoff illustrates. He wants a lot more control on the network setup so he can deploy existing n-tier applications that have specific network topology/config requirements without re-engineering them.  It’s a perfectly reasonable request, in the context of the backward-compatible Cloud value proposition. But one that will never be granted by a Cloud that aims for forward-compatibility.

Similarly, the forward-compatible Cloud doesn’t always successfully abstract away lower-level concerns. It’s one thing to say you don’t have to worry about backup and security but it means that you now have to make sure that your Cloud provider handles them at an acceptable level. And even on technical grounds, abstractions still leak. Take Google App Engine, for example. In theory you only deal with requests and not even think about the servers that process them (you have no idea how many servers are used). That’s nice, but once a while your Java application gets a DeadlineExceededException. That’s because the GAE platform had to start using a new JVM to serve this request (for example, your traffic is growing or the JVM previously used went down) and it took too long for the application to load in the new JVM, resulting in this loading request being killed. So you, as the developer, have to take special steps to mitigate a problem that originates at a lower level of the stack than you’re supposed to be concerned about.

All in all, the distinction between backward-compatible and forward-compatible Clouds is not a classification (most Cloud environments are a mix). Rather, it’s another mental axis on which to project your Cloud plans. It’s another way to think about the benefits that you expect from your use of the Cloud. Both providers and consumers should understand what they are aiming for on that axis. Hopefully this can help prevent shout matches of the “it’s a bug, no it’s a feature” variety.

[UPDATED 2010/3/4: Apparently, Steve Ballmer thinks along the same lines. Though the way he sees it, Azure is forward-compatible, while Amazon is backward-compatible: "I think Amazon has done a nice job of helping you take the server-based programming model - the programming model of yesterday, that is not scale-agnostic - and then bringing it into the cloud. On the other hand, what we're trying to do with Azure is let you write a different kind of application."]

[UPDATED 2010/3/5: I now have the quasi-proof that indeed Steve Ballmer stole the idea from my blog. Look at this entry in my HTTP log. This visitor came the evening before Steve's "Cloud" talk at the University of Washington. I guess I am not the only one to procrastinate until the 11th hour when I have a deadline. Every piece of information in this log entry points at Steve Ballmer. How can it be anyone other than him?

131.107.0.71 - - [03/Mar/2010:23:51:52 -0800] “GET /archives/1198 HTTP/1.0″ 200 4820 “http://www.bing.com/search?q=Brilliant+Cloud+Insight” “Mozilla/1.22 (compatible; MSIE 2.0; Microsoft Bob)”

(in case you are not fluent in the syntax and semantics of HTTP log files, this is a joke)]

Post

Expanding on “twitter with a brain”

In Automation,Everything,Google App Engine,Implementation,Mashup,PaaS,Portability,Protocols,Security,Social networks,Twitter,Yahoo on November 30, 2009 by @vambenepe

Chuck Shotton recently made a compelling case (“Twitter with a Brain“) for Twitter tools to allow the user to change the protocol endpoint. That is, instead of always going to twitter.com, you can tell your Twitter client to send all requests to myTweetInterceptor.me.com. Why would you do this? You should read his blog entry, but in short his point is that the intermediary can add all kinds of new features that neither the Twitter client nor Twitter itself support. As always in computer science, a new level of decoupling adds opportunities for extensions (and breakage too, of course).

I fully agree with what he writes and I would very much like to see his call to action answered. In fact, I want more than what he is asking for. So here is my call to action:

1) It’s not just Twitter

Why just Twitter? This should be true for any client using any protocol. Why not also the APIs for the various Google and Yahoo services? The APIs for the other social networks beyond Twitter? For shopping sites like Amazon and EBay? Etc. And of course to all the various Cloud providers out there. Just because I am using the Amazon EC2 API it doesn’t mean I necessarily want the requests to go straight to Amazon. Client tools should always make the endpoint configurable, period.

2) It’s not just the clients, it’s also (and especially) the third party sites

Chuck’s examples are about features that the Twitter clients could provide but don’t, so an intermediary would be an easy way to hack support for them (others presumably include modifying the client – if open source -, writing a plug-in for it – it there is such mechanism -, or running a network interceptor on the local client – unless the protocol is encrypted-).

That’s nice and I’d love to see this, but the big deal for me is less with clients and more with third party sites. You know, all these sites that ask for your Twitter login/password. Or those that ask for your GMail/Yahoo account info to retrieve a list of your contacts. I never grant these requests, but I would consider it if they allowed me to tell them what endpoint URL to use. For example, rather than using my Twitter login to go straight to twitter.com, they would use a login/password that I create and talk to twitterIntermediary.vambenepe.com. The requests would be in the exact same shape as what they send today to Twitter, just directed to another URL. There, I could have a proxy that only allows some requests (e.g. “update twitter background image” but not “send update”) and forwards them using my real Twitter credentials. Or, for email accounts, I could have a proxy that allows requests that read my address book but not those that read my mails. The goal here is not to add features, it is to delegate trust in a fine-grained (and audited) manner. This, to me, is the burning need, rather than a 3rd place to implement Twitter lists.

I would probably write these proxies using a PaaS platform like the Google App Engine. Or maybe even Yahoo Pipes. I have long struggled to think of use cases for which Yahoo Pipes hits the sweetspot, and this may well be it. Especially if people write modules to handle specific APIs (e.g. a “Twitter API” module that shows all operations and lets you enable/disable them one by one in a pipe). The one thing missing would be a way for a pipe to keep a log of its invocations, for auditing.

You want access to my email and social network accounts? Give me the ability to filter you requests and you’ll get access. If it’s blind trust you want, I am afraid I have a very limited supply.

[Note: I wanted to add this as a comment on Chuck's blog, but he doesn't seem to allow them: "go start your own blog and/or shut up and eat your vegetables" is his recommendation. Since I already have my own blog, I guess I don't have to eat my vegetables if I don't want to. I just hope my kids don't learn about this rule or they'll be blogging in no time.]

[UPDATED 2009/11/30: WRT to Chuck's request, it looks like it's being done already. But no luck with the third party sites so far, which is what I most want to see.]

Post

Cloud platform patching conundrum: PaaS has it much worse than IaaS and SaaS

In Application Mgmt,Cloud Computing,Everything,Google App Engine,Governance,ITIL,Manageability,Mgmt integration,PaaS,SaaS,Utility computing,Virtualization on October 15, 2009 by @vambenepe

The potential user impact of changes (e.g. patches or config changes) made on the Cloud infrastructure (by the Cloud provider) is a sore point in the Cloud value proposition (see Hoff’s take for example). You have no control over patching/config actions taken by the provider, any of which could potentially affect you. In a traditional data center, you can test the various changes on specific applications; you don’t have to apply them at the same time on all servers; and you can even decide to skip some infrastructure patches not relevant to your application (“if it aint’ broken…”). Not so in a Cloud environment, where you may not even know about a change until after the fact. And you have no control over the timing and the roll-out of the patch, so that some of your instances may be running on patched nodes and others may not (good luck with troubleshooting that).

Unfortunately, this is even worse for PaaS than IaaS. Simply because you seat on a lot more infrastructure that is opaque to you. In a IaaS environment, the only thing that can change is the hardware (rarely a cause of problem) and the hypervisor (or equivalent Cloud OS). In a PaaS environment, it’s all that plus whatever flavor of OS and application container is used. Depending on how streamlined this all is (just enough OS/AS versus a traditional deployment), that’s potentially a lot of code and configuration. Troubleshooting is also somewhat easier in a IaaS setup because the error logs are localized (or localizable) to a specific instance. Not necessarily so with PaaS (and even if you could localize the error, you couldn’t guarantee that your troubleshooting test runs on the same node anyway).

In a way, PaaS is squeezed between IaaS and SaaS on this. IaaS gets away with a manageable problem because the opaque infrastructure is not too thick. For SaaS it’s manageable too because the consumer is typically either a human (who is a lot more resilient to change) or a very simple and well-understood interface (e.g. IMAP or some Web services). Contrast this with PaaS where the contract is that of an application container (e.g. JEE, RoR, Django).There are all kinds of subtle behaviors (e.g, timing/ordering issues) that are not part of the contract and can surface after a patch: for example, a bug in the application that was never found because before the patch things always happened in a certain order that the application implicitly – and erroneously – relied on. That’s exactly why you always test your key applications today even if the OS/AS patch should, in theory, not change anything for the application. And it’s not just patches that can do that. For example, network upgrades can introduce timing changes that surface new issues in the application.

And it goes both ways. Just like you can be hurt by the Cloud provider patching things, you can be hurt by them not patching things. What if there is an obscure bug in their infrastructure that only affects your application. First you have to convince them to troubleshoot with you. Then you have to convince them to produce (or get their software vendor to produce) and deploy a patch.

So what are the solutions? Is PaaS doomed to never go beyond hobbyists? Of course not. The possible solutions are:

  • Write a bug-free and high-performance PaaS infrastructure from the start, one that never needs to be changed in any way. How hard could it be? ;-)
  • More realistically, narrowly define container types to reduce both the contract and the size of the underlying implementation of each instance. For example, rather than deploying a full JEE+SOA container componentize the application so that each component can deploy in a small container (e.g. a servlet engine, a process management engine, a rule engine, etc). As a result, the interface exposed by each container type can be more easily and fully tested. And because each instance is slimmer, it requires fewer patches over time.
  • PaaS providers may give their users some amount of visibility and control over this. For example, by announcing upgrades ahead of time, providing updated nodes to test on early and allowing users to specify “freeze” periods where nothing changes (unless an urgent security patch is needed, presumably). Time for a Cloud “refresh” in ITIL/ITSM-land?
  • The PaaS providers may also be able to facilitate debugging of infrastructure-related problem. For example by stamping the logs with a version ID for the infrastructure on the node that generated the log entry. And the ability to request that a test runs on a node with the same version. Keeping in mind that in a SOA / Composite world, the root cause of a problem found on one node may be a configuration change on a different node…

Some closing notes:

  • Another incarnation of this problem is likely to show up in the form of PaaS certification. We should not assume that just because you use a PaaS you are the developer of the application. Why can’t I license an ISV app that runs on GAE? But then, what does the ISV certify against? A given PaaS provider, e.g. Google? A given version of the PaaS infrastructure (if there is such a thing… Google advertises versions of the GAE SDK, but not of the actual GAE runtime)? Or maybe a given PaaS software stack, e.g. the Oracle/Microsoft/IBM/VMWare/JBoss/etc, meaning that any Cloud provider who uses this software stack is certified?
  • I have only discussed here changes to the underlying platform that do not change the contract (or at least only introduce backward-compatible changes, i.e. add APIs but don’t remove any). The matter of non-compatible platform updates (and version coexistence) is also a whole other ball of wax, one that comes with echoes of SOA governance discussions (because in PaaS we are talking about pure software contracts, not hardware or hardware-like contracts). Another area in which PaaS has larger challenges than IaaS.
  • Finally, for an illustration of how a highly focused and specialized container cuts down on the need for config changes, look at this photo from earlier today during the presentation of JRockit Virtual Edition at Oracle Open World. This slide shows (in font size 3, don’t worry you’re not supposed to be able to read), the list of configuration files present on a normal Linux instance, versus a stripped-down (“JeOS”) Linux, versus JRockit VE.


By the way, JRockit VE is very interesting and the environment today is much more favorable than when BEA first did it, but that’s a topic for another post.

[UPDATED 2009/10/22: For more on this (in an EC2-centric context) see section 4 ("service problem resolution") of this IBM paper. It ends with "another possible direction is to develop new mechanisms or APIs to enable cloud users to directly and automatically query and correlate application level events with lower level hardware information to better identify the root cause of the problem".]

[UPDATES 2012/4/1: An example of a PaaS platform update which didn't go well.]

Post

PaaS as a satisfying and success-ready hobbyist platform

In Cloud Computing,Everything,Google,Google App Engine,Implementation,Mashup,Microsoft,PaaS,Utility computing,WSO2 on September 30, 2009 by @vambenepe

I don’t know anyone in Silicon Valley who can code and doesn’t fantasize about writing an accidental killer app. One that gets designed during a long layover in DEN and implemented in a rainy weekend (El Nino is my VC). One that was only supposed to meet the needs of a few friends and is used by half of the world a few months/years later.

I am not talking about seasoned entrepreneurs, who have a network, discipline, resources and enough experience to know that it takes a lot more than a cool idea. Rather, about programming hobbyists (who may of may not be programmers in their day jobs),

By definition, hobbyists only do things that are satisfying. In the rarefied air of Silicon Valley, it also helps if there is a conceivable “upside” to dream about. Platform as a Service (PaaS, e.g. Google App Engine) provides both to software-oriented hobbyist. And make it very cheap (borderline free), which doesn’t hurt.

Satisfying

In a well-crafted PaaS environment, the development process and the result are both satisfying. I am not a Google shrill, but GAE is a fair example. The barrier to entry is very low (the download is less than 10MB and contains all you need to get started). In an hour you have an application running locally. In an hour and 5 minutes you have it deployed and accessible on the web for all. And yet this ease of bootstrap does not come at the cost of too many longer-term limitations (now that the environment has gown a bit from the original limitations and provides scheduled and background jobs). Unlike Yahoo Pipes, for which the first impression is “nifty!” and the second is “gimme a textual representation of my pipe now!”.

Beyond the easy ramp-up, the main source of satisfaction developing in a PaaS environment is that you spend 99% of your time working on the application. Not the OS, not the firewall, not the application container, not the database. Not to mention having to deal with your co-lo provider or the leased line for the servers in the basement. If you are a hobbyist with only a few spare hours per week, that’s a make or break deal. It also means that you have a fighting chance of developing a secure application because you are responsible for a much smaller surface of attack.

Eons ago (in computing time), Visual Basic was the name of the game for these people. More recent was the rise of PHP. It dramatically lowered the barrier to adoption and provided a quick route to a working web application. I know several non-professional developers (e.g. web designers) who are scared of any “normal” programming language but happily write PHP (often of equivalent complexity BTW). Combine this with the wide availability of ISP-managed PHP environments and you get close to what GAE gives you. At the risk of adding to the annoying trend of retroactively cloudifying everything, I think of ISP-hosted PHP as the first generation of PaaS. But it is focused on “show what’s in the DB” scenarios rather than service-centric / mash-up / web 2.0 integration. And even for DB-centric scenarios, by and large PHP coders don’t want to think too much about model and queries (and it shows). I think Google decided to go with Python rather than the easy route of aping the hosted PHP environments in large part to avoid hitting such ceilings down the road. Not surprisingly, PHP support is currently the most requested GAE feature, ahead of Perl and Ruby. Lets see if Google tries to get the PHP community on board or prefers to stay clear of such PaaS legacy (already!).

Ready for success

In the unlikely event that your application catches fire and sees wide adoption (which is not impossible, especially if well integrated in a social network), what are you going to do about it? Keeping in mind two constraints: first, this is a part-time hobby of yours. Second, don’t dream of riches. We are talking about an influx of facebookers or twitterers here, with no intention to pay for anything. But click they will. If you were going to answer: “I get funding and hire a real IT staff” then think again. You most likely won’t get funding for your toy app without revenue potential. And even if you do, by the time you have it it’s too late and people have moved on because you were not there when the spotlight was on you.

With a PaaS-based application you have a fighting chance. If the spike is short enough, you may not even hit the limit of the free quota. If it does, you have the choice of whether you are willing to pay to support the extra traffic or not. No change in code required (though it may be advised anyway, if your app wasn’t architected for efficient scaling – PaaS doesn’t entirely take this off your hands).

That “sudden spike” story is a commonly-invoked use case for EC2. And it’s probably true for a start-up with an IT staff (of at least one full-time person). But despite Amazon’s efforts (and other providers such as RightScale) this type of scaling is something you have to architect for and putting it in place takes away from the time you spend coding application features. It also means that you are responsible for more infrastructure (OS and application container at least). Not to mention that IaaS providers don’t usually offer free resources for limited usage, the way Google does (I suspect 99% of GAE apps never get over this quota). Even if a small EC2 instance is not very expensive (though it adds up over time if you keep it up for that occasional user), the difference between “free” and “cheap” is significant. As an application provider you’d like for this not to be the case with your users, but as a consumer of infrastructure service you’re on the other side of the deal, aren’t you?

There is a reason why suburbia-bound SUVs are advertised crossing mountain streams. The “I could if I wanted to” line has appeal. For the software hobbyist, knowing that your application won’t crash if it happened to meet success (even if only for a couple of days, e.g. the Slashdot effect) is a good feeling (“I could if *they* wanted to”). In truth this occasion is rare (and likely to end up like this), but you are ready for the eventuality. And if there are enough such hobbyists, then statistically some will encounter it.

The provider’s upside

That last point brings the topic of the PaaS provider’s upside in this. I have read several critical comments arguing that no company will rewrite their application for GAE (true) and that no start-up will write their new code for it either because of the risk of lock-in (also true: “being bought by Google” is not a bad outcome but “has to be bought by Google” is a bad exit strategy). But I think this misses the point of casting such a wide nest and starting with creating a great tool for hobbyists.

After all, Google has made a great business monetizing millions of small sites none of which makes much money by itself. At the very least least this can grow the web and, symbiotically, Google. With  two possible upsides:

  • Some of these hobbyist applications may actually take off and Google becomes their natural partner/godfather (including managing their user accounts). For example, wouldn’t it be nice for Google if Craigslist or Twitter was running on GAE?
  • The platform eventually evolves into something that makes sense for start-ups, SMBs and/or enterprises to use. Google works out the kinks with less demanding users first.

Interesting times

Two closing thoughts, which I’ll leave undeveloped for now:

  • There is an especially good synergy between mobile apps and PaaS. Once you get past the restaurant tip calculators, many mobile apps need a server-hosted sidekick to do the heavy lifting of gathering/storing/transforming data. As a hobbyist, you want to spend most of your time making you mobile app cool. Which leaves even less time for administrating server components. On the server side, you are even less likely to want to deal with anything but application logic. PaaS is especially attractive in these scenarios. Google and Microsoft have to navigate these waters carefully but look for some synergy/integration stories around GAE + Android and Azure + Windows Mobile respectively. Not clear what Apple’s story is here or if they think they need one. If it surfaces as an issue then we have yet another reason to restart the “Apple buys Adobe” rumor. Or maybe Sanjiva will get a middle-of-the-night call from Steve Jobs…
  • A platform to build/run your application is one thing. A way to reach users is another (arguably much more critical). Things like mobile app stores (especially Apple’s of course), Facebook and next generation app stores. But this goes  beyond the scope of this post.

Just to be clear, I am not in any way suggesting that PaaS is only for hobbyists. I am just saying that right now it is a great tool for them, the best way for an individual programmer to have fun and have an impact. This doesn’t take away from the value that PaaS will eventually deliver to larger organizations.

[UPDATED 2009/10/4: Microsoft Azure apparently supports PHP.]

Post

Look Ma, no hypervisor!

In Application Mgmt,Cloud Computing,Everything,Google App Engine,Implementation,IT Systems Mgmt,Middleware,Utility computing,Virtualization,VMware,XenSource on September 21, 2009 by @vambenepe

Encouraged by hypervisor vendors, the confusion between virtualization and Cloud Computing is rampant. In the industry, the term “virtualization” (and its corollary, “virtual machine”) is used in so many different ways that it has lost all usefulness. For a recent example, read the introduction of this SNIA/OGF white paper (on Cloud Storage) which asserts that “the new technology underlying this is the system virtual machine that allows multiple instances of an operating system and associated applications to run on single physical machine. Delivering this over the network, on demand, is termed Infrastructure as a Service (IaaS)”.

In fact, even IaaS-type Cloud services don’t imply the use of hypervisors.

We need to decouple the Cloud interface/contract (e.g. “what are the types of resources that can I provision on demand? hosts, app servers, storage capacity, app services…”) from the underlying implementation (e.g. “are hypervisors used by the Cloud provider?”). At the risk of spelling out things that may be obvious to many readers of this blog, here is a simplified matrix of Cloud Computing systems, designed to illustrate that all combinations of interface and implementation are possible and in many cases even reasonable.

IaaS interface PaaS interface
Hypervisor used Yes! (see #1) Yes! (see #2)
Hypervisor not used Yes! (see #3) Yes! (see #4)

#1: IaaS interface, hypervisor-based implementation

This is a very common approach these days, both in public Clouds (EC2, Rackspace and presumably at some point the VMWare vCloud Express service providers) and private Clouds (Citrix, Sun, Oracle, Eucalyptus, VMWare…). Basically, you take a bunch of servers, put hypervisors on all of them and make VMs running on these hypervisors available to the Cloud customers.

But despite its predominance, this is not the only path to a Cloud, not even to an IaaS (e.g. “x86 hosts on demand”) Cloud. The following three other scenarios are all valid too.

#2: PaaS interface, hypervisor-based implementation

This is the road SpringSource has been on, first with Cloud Foundry (using AWS EC2 which is based on the Xen hypervisor) and presumably soon on top of VMWare.

#3: IaaS interface, no hypervisor in the implementation

Let’s remember that the utility computing vision (before the term fell in desuetude in favor of “cloud”) has been around before x86 hypervisors were so common. Take Loudcloud as an illustration. They were building what is now called a “public Cloud” starting back in 1999 and not using any hypervisor. Just bare metal provisioning and advanced provisioning automation software. Then they sold the hosting part to EDS (now HP) and only kept the software, under the name Opsware (now HP too, incidentally). That software was meant to create what we now call a “private Cloud”. See this old DCML announcement as one example of the Opsware vision. And no hypervisor was harmed in the making of this movie.

At the current point in time, the hardware (e.g. multiple cores, shared memory) and software (hypervisors, legacy apps) environment is such that hypervisor-based solutions seem to have an edge over those based on automated provisioning/configuration alone. But these things tend to change quickly in our industry… Especially if you factor in non-technical considerations like compliance, fear of data leakage and the risk of having the hardware underlying your application seized because of an investigation involving another tenant…

And this is not going into finner techno-philosophical points about the different types of hypervisors. Not to mention mainframe LPARs… One could build a hypervisor-free IaaS solution on these.

To some extent, you may even put the “pwned” machines (in a botnet) in this “IaaS with no hypervisor” category (with the small difference that what’s being made available is an x86 with an OS, typically Windows, already installed). If you factor out externalities (like the FBI breaking down your front door at 6:00AM) this approach has claims as the most cost-effective form of Cloud computing available today… Solaris zones are another example of possible foundation for a hypervisor-free IaaS-like offering (here too, with an OS rather than a “raw host” as the interface).

#4: PaaS interface, no hypervisor in the implementation

In the public sphere, this corresponds to Google App Engine.

In the private sphere, several companies have built it themselves on top of WebLogic, by adding some level of “on-demand” application provisioning in order to streamline the relationship between the IT group running the servers and the business groups who want to deploy applications on them. Something that one should ideally be able to buy rather than build.

Waiting for the question to become irrelevant

Like most deeply-ingrained confusions, the conflation of virtualization and Cloud Computing won’t be dispelled as much as made irrelevant. The four categories enumerated in this post are a point-in-time view of a continuously evolving system. What may start today as a bundle of a hypervisor, an OS and an app server may become a somewhat monolithic “PaaS engine” over time as the components are more tightly integrated. That “engine” may have memory isolation mechanisms that look a lot like a hypervisor. But it may not be able to host a generic OS. In the same way that whales don’t have fingers and toes and yet they are still very much apparent in their skeleton.

[UPDATED 2009/10/8: A real-life example of #3! On-demand servers via bare metal provisioning (via Sam). No hypervisor in the picture. See also here.]

[UPDATED 2009/12/29: Another non-hypervisor Cloud provider! NewServers. Here is their API. And a Q&A.]

Post

Toolkits to wrap and bridge Cloud management protocols

In Amazon,API,Automation,Cloud Computing,Everything,Google App Engine,Implementation,IT Systems Mgmt,Manageability,Mgmt integration,Open source,Protocols,Utility computing on September 11, 2009 by @vambenepe

Cloud development toolkits like Libcloud (for Python) and jcloud (for Java) have been around for some time, but over the last two months they have been joined by several other open source contenders. They all claim to abstract the on-the-wire Cloud management protocols sufficiently to let you access different Clouds via the same code; while at the same time providing objects in your programming language of choice and saving you the trouble of dealing with on-the-wire messages. By focusing on interoperability, they slot themselves below the larger role of a “Cloud broker” (which also deals with tasks like transfer and choice). Here is the list, starting with the more recent contenders:

DeltaCloud shares the same goal of translating between different Cloud management protocols but they present their own interface as yet another Cloud REST API/protocol rather than a language-specific toolkit. More along the lines of what UCI is trying to do (not sure what’s up with that project, I recorded my skepticism earlier and am still waiting to be pleasantly surprised).

Of course there are also programming toolkits that are specific to one Cloud provider. They are language-specific wrappers around one Cloud management protocol. AWS protocols (EC2, S3, etc…) represent the most common case, for example amazon-ec2 (a Ruby Gem), Power-EC2Dream (in C# which gives it the tantalizing advantage of being invokable via PowerShell) and typica (for Java). For Clouds beyond AWS, check out the various RightScale Ruby Gems.

The main point of this entry was to list the cross-Cloud development toolkits in the bullet list above. But if you’re in the mood for some pontification you can keep reading.

For some reason, what used to be called “protocols” is often called “APIs” in Cloud settings. Witness the Sun Cloud “API” or the vCloud “API” which only define XML formats for on-the-wire messages. I have never heard of CIM/XML over HTTP, WSDM or WS-Management being referred as APIs though they occupy a very similar place. They are usually considered “protocols”.

It’s a just question of definition whether an on-the-wire protocol (rather than a language-specific set of objects/methods) qualifies as an “Application Programming Interface”. It’s not an “interface” in the Java sense of the term. But I can “program” against it so it could go either way. On this blog I have gone along with the “API” term because that seemed widely used, though in verbal conversations I have tended to stick to “protocol”. One problem with “API” is that it pushes you towards mixing the “what” and the “how” and not respecting the protocol/model dichotomy.

Where is becomes relevant is when you start to see language-specific APIs for Cloud control pop-up as listed above. You now have two classes of things called “API” and it gets a bit confusing. Is it time to bring back the “protocol” term for on-the-wire definitions?

As a developer, whether you’re better off eating your Cloud noodles using chopsticks (on-the-wire protocol definitions) or a fork (language-specific APIs) is an important decision that will stay with you and may come back to bit you (e.g. when the interfaces are versioned). There is a place for both of course, but if we are to learn anything from WS-* it’s that we went way too far in the “give me a java stub” direction. Which doesn’t mean there is no room for them, but be careful how far from the wire semantics you get. It become even trickier when your stub tries not jsut to bridge between XML and Java but also to smooth out the differences between several on-the-wire protocols, as the toolkits above do. The hope, of course, is that there will eventually be enough standardization of on-the-wire protocols to make this a moot point.

Post

Are these your files? I found them on my cloud

In Amazon,Cloud Computing,Everything,Google App Engine,Security,Utility computing,Virtualization,Xen on September 2, 2009 by @vambenepe

Drip drip drip… Is this the sound of your cloud leaking?

It can happen in different ways. See for example this recent research paper, titled “Hey, You, Get Off of My Cloud: Exploring Information Leakage in Third-Party Compute Clouds”. It’s a nice read, especially if you find side channels interesting (I came up with one recently, in a different context).

In the first part of the paper, the authors show how to get your EC2 instance co-located (i.e. running in in the same hypervisor) with the instance you are targeting (the one you want to spy on). Once this is achieved, they describe side channel attacks to glean information from this situation.

This paper got me thinking. I noticed that it does not mention trying to go after disk blocks and memory. I don’t know if they didn’t try or they tried and were defeated.

For disk blocks (the most obvious attack vector), Amazon is no dummy and their “proprietary  disk  virtualization  layer  automatically  wipes every block of storage used by  the customer, and guarantees  that one customer’s data  is never exposed to another” as explained in the AWS Security Whitepaper. In fact, they are so confident of this that they don’t even bother forbidding block-based recovery attempts in the AWS customer agreement (they seem mostly concerned about attacks that are not specific to hypervisor environments, like port scanning or network-based DOS). I took this as an invitation to verify their claims, so I launched a few Linux/ext3 and Windows/NTFS instances, attached a couple of EBS volumes to them and ran off-the-shelf file recovery tools. Sure enough, nothing was found on  /dev/sda2 (the empty 150GB partition of local storage that comes with each instance) or on the EBS volumes. They are not bluffing.

On the other hand, there were plenty of recoverable files on /dev/sda1. Here is what a Foremost scan returned on two instances (both of them created from public Fedora AMIs).

The first one:

Finish: Tue Sep  1 05:04:52 2009

5640 FILES EXTRACTED

jpg:= 14
gif:= 670
htm:= 1183
exe:= 2
png:= 3771
------------------------------------------------------------------

And the second one:

Finish: Wed Sep  2 00:32:16 2009

17236 FILES EXTRACTED

jpg:= 236
gif:= 2313
rif:= 11
htm:= 4886
zip:= 182
exe:= 6
png:= 9594
pdf:= 8
------------------------------------------------------------------

These are blocks in the AMI itself, not blocks that were left on the volumes on which the AMI was installed. In other words, all instances built from the same AMI will provide the exact same recoverable files. The C: drive of the Windows instance also had some recoverable files. Not surprisingly they were Windows setup files.

I don’t see this as an AWS flaw. They do a great job providing cleanly wiped raw volumes and it’s the responsibility of the AMI creator not to snapshot recoverable blocks. I am just not sure that everyone out there who makes AMIs available is aware of this. My simple Foremost scans above only looked for the default file types known out of the box by Foremost. I suspect that if I added support for .pem files (used by AWS to store private keys) there may well be a few such files recoverable in some of the publicly accessible AMIs…

Again, kudos to Amazon, but I also wonder if this feature opens a possible DOS approach on AWS: it doesn’t cost me much to create a 1TB EBS volume and to destroy it seconds later. But for Amazon, that’s a lot of blocks to wipe. I wonder how many such instantaneous create/delete actions on large EBS volumes it would take to put a large chunk of AWS storage capacity in the “unavailable – pending wipe” state… That’s assuming that they proactively wipe all the physical blocks. If instead the wipe is virtual (their virtualization layer returns zero as the value for any free block, no matter what the physical value of the block) then this attack wouldn’t work. Or maybe they keep track of the blocks that were written and only wipe these.

Then there is the RAM. The AWS security paper tells us that the physical RAM is kept separated between instances (presumably they don’t use ballooning or the more ambitious Xen Transcendent Memory). But they don’t say anything about what happens when a new instance gets hold of the RAM of a terminated instance.

Amazon probably makes sure the RAM is reset, as the disk blocks are. But what about your private Cloud infrastructure? While the prospect of such Cloud leakage is most terrifying in a public cloud scenario (anyone could make use of it to go after you), in practice I suspect that these attack vectors are currently a lot more exploitable in the various “private clouds” out there. And that for many of these private clouds you don’t need to resort to the exotic side channels described in the “get off of my cloud” paper. Amazon has been around the block (no pun intended) a few times, but not all the private cloud frameworks out there have.

One possible conclusion is that you want to make sure that your cloud vendor does more than writing scripts to orchestrate invocations of the hypervisor APIs. They need to understand the storage, computing and networking infrastructure in details. There is a messy physical world under your clean shinny virtual world. They need to know how to think about security at the system level.

Another one is that this is a mostly an issue for hypervisor-based utility computing and a possible trump card for higher level of virtualization, e.g. PaaS. The attacks described in the paper (as well as block-based file recovery) would not work on Google App Engine. What does co-residency mean in a world where subsequent requests to the same application could hit any machine (though in practice it’s unlikely to be so random)? You don’t get “deployed” to the same host as your intended victim. At best you happen to have a few requests executed while a few requests of your target run on the same physical machine. It’s a lot harder to exploit. More importantly, the attack surface is much more restrained. No direct memory access, no low-level scheduler data, no filesystem… The OS to hardware interface that hypervisors emulate was meant to let the OS control the hardware. The GAE interface/SDK, on the other hand, was meant to give the application just enough capabilities to perform its task, in a way that is as removed from the hardware as possible. Of course there is still an underlying physical reality in the GAE case and there are sure to be some leaks there too. But the small attack surface makes them a lot harder to exploit.

[UPDATED 2009/9/8: Amazon just improved the ability to smoothly update your access certificates. So hopefully any such certificate found on recoverable blocks in an AMI will be out of data and unusable.]

[UPDATED 2009/9/24: Some good security practices that help protect you against block analysis and many other forms of attack.]

[UPDATED 2009/10/15: At Oracle Open World this week, I was assured by an Amazon AWS employee that the DOS scenario I describe in this post would not be a problem for them. But no technical detail as to why that is. Also, you get billed a minimum of one hour for each EBS volume you provision, so that attack would not be as cheap as I thought (unless you use a stolen credit card).]

Post

Reality check on Cloud portability

In Amazon,Application Mgmt,Articles,Cloud Computing,Everything,Google App Engine,HP,IT Systems Mgmt,Mgmt integration,People,Portability,Specs,Standards,Utility computing on April 10, 2009 by @vambenepe

SD Times recently published an interesting article about “cloud interoperability”. It has some well-informed opinions. But, like all Cloud-related discussions, it also suffers from mixing a bunch of things. The word “interoperability” is alternatively applied to the Cloud infrastructure services (in which case this “interoperability” is a way to provide application “portability”) and to the Cloud-hosted applications themselves.

Application-level interoperability (“look, my GAE-hosted app successfully sent an HTTP request to an Azure-hosted app, open the champagne”) is not very new or exciting anymore and is often used as an interoperability smokescreen (hello Salesforce.com). Many of these interop concerns are long solved and the others (like authentication and data migration) need to be solved in ways that don’t care whether the application is hosted in your Silicon Valley garage or near the Columbia river.

Cloud infrastructure compatibility (in other words application portability) is the more interesting discussion. I keep reading that it is needed (“no vendor lock-in, not ever again”) for enterprises to move to the Cloud. Being a natural-born cynic, I always ask myself whether those asking for it are naive (sometimes) or have ulterior motives (e.g. trying to catch-up with Amazon by entangling them in the standards net – some of my fellow cynics see the Open Cloud Manifesto as just this).

Because the reality is that, Manifesto or no Manifesto, you are not going to get application portability across IaaS-type Cloud providers. At least for production applications. Sorry. As a consolation prize, you may get some runtime portability such that we’ll be shown nice demos of prototype apps moving from one provider to another (either as applications or as virtual machines). Clap clap until you realize that they left behind their monitoring capabilities, or that their configuration rules don’t validate anything anymore. And that your printer ran out of red ink when printing the latest compliance report. Oops.

Maybe I am biased because they are both my friends and ex-colleagues, but the HP guys make the most sense in the SD Times article. Tim Hall has it right when he suggests “that the industry should focus on specific problems that it is going to solve around deployment and standardized monitoring”. And the other HP Tim, Mr. van Ash, rightly points out that we should “stop promising miracles”, which Forrester’s Jeffrey Hammond echoes, saying that there is a difference between a standard and “plug-and-play in reality”.

Tim Hall uses SQL as an example of a realistic common baseline. J2EE would be another one. They provide a good reality check. Standards are always supposed to prevent vendor lock-in. And there is a need for some of that, of course. But look at the track records. How many applications do you know that are certified and supported on any SQL database, any Unix operating system and any J2EE app server? And yet, standardizing queries on relational data and standardizing an enterprise-class runtime environment for one programming language are pretty constrained scopes in the grand scheme of things. At least compared to all the aspects that you need to standardize to provide real Cloud portability (security, monitoring, provisioning, configuration, language runtime and/or OS, data storage/retrieval, network configuration, integration with local apps, metering/billing, etc). And we’re supposed to put together a nice bundle of standards that will guarantee drag-and-drop portability across all these concerns? In how many lifetimes? By then, Cloud computing will have been replaced by the next big thing (galaxycomputing.com is still available BTW).

Not to mention that this standardization comes hand in hand with constraints on what you can do. That’s why I read Amazon’s Adam Selipsky’s comment that allowing customers to do “whatever they want” is vital as a way to say “get real” to requests for application portability, while allowing him to sound helpful rather than obstructionist.

This doesn’t mean that these standards are not useful. They make application portability possible if not free. They make for much improved productivity through generic tools and reusable developer knowledge. We still need all this.

Here is the best that can realistically happen in the “application portability across IaaS providers” area for at least 10 years:

  • a set of partial standards for small parts of the Cloud computing domain (see list above), many of which already exist.
  • a set of RightScale-like tools that do a lot of the grunt work of mapping/hiding/transforming between providers, with various degrees of success.
  • the need for application providers to certify their applications on Cloud providers one by one anyway and to provide cloning/migration as a feature of the application rather than an infrastructure-level task.

That’s assuming that IaaS providers become a major business, that there remains a difference between service providers and software providers. The other option is that the whole Cloud excitment goes back to SaaS only, that application creators are also hosting providers, that the only resource you get in a “utility” fashion is the application itself. At which point application portability is not a concern anymore and we go back to “only” worrying about data portability and application interoperability, an easier problem and one on which we have come a long way already. If this is what comes to pass then the challenge of Cloud portability may well be one of the main reasons. Along with the lack of revenue/margin potential for many of the actors in an IaaS world, as my CEO is fond of pointing out.

[UPDATED 2009/4/22: F5's Lori MacVittie provides a very nice illustration of the same point, in her explanation of why OVF is not a cloud portability silver bullet.]

[UPDATED 2009/6/1: Soon after posting this entry I was contacted by people at SD Times about turning it into a "guest view" article in the June issue. It has just been published. It's also in the paper version.]

Post

Long-running processes on Google App Engine: it finally works

In Cloud Computing,Everything,Google App Engine,Implementation,Tech,Utility computing on February 13, 2009 by @vambenepe

I am probably taking things a bit too personally, but I feel like I just successfully guilt-tripped the Google App Engine (GAE) team. Just last night I was complaining that they were teasing me (supporting urllib2, but an older version, without timeout support). And tonight I noticed a new post on their blog, announcing the end of these “High CPU Requests” that have been the bane of my GAE experience.

The reason why I was looking for timeout support in the first place is to avoid generating these dreaded “High CPU Requests” that quickly result in your application being disabled. It’s all explained here and  here.

But now that they are gone, I don’t need timeouts anymore. Around 11:00PM Pacific (on Thursday 2/12) I restarted my application. All it does is create 5 new simple entities in the datastore, one every second (it sleeps for one second between each entry). Then it spawns its successor, which will do the same thing, ad vitam aeternam. The application’s name, “rere”, is short for “request relay”, the pattern used to emulate a long running process. The default page for the app (available here) just returns a list of the 30 last entities created. The point is to illustrate that a single original request spawned an ever-lasting computing task on GAE.

Here is the code:

#!/usr/bin/env python
#
# Copyright 2009 William Vambenepe
#

import wsgiref.handlers
import os
import logging
import time

from google.appengine.ext import db
from google.appengine.ext.webapp import template
from google.appengine.ext import webapp
from google.appengine.api import urlfetch

numberOfHeartbeatsViewed = 30
secondsDurationOfTaskWait = 1
numberOfTasksPerRequest = 5

class HeartBeat(db.Model):
  requestId = db.IntegerProperty()
  date = db.DateTimeProperty(auto_now_add=True)

# The mere existence of an instance of this class in the DB means that the relay has to stop.
class StopExec(db.Model):
  date = db.DateTimeProperty(auto_now_add=True)

class MainHandler(webapp.RequestHandler):
  def get(self):
    hbs = HeartBeat.all().order("-date").fetch(numberOfHeartbeatsViewed)
    template_values = {"hbs": hbs}
    path = os.path.join(os.path.dirname(__file__), "index.html")
    self.response.out.write(template.render(path, template_values))

class StartHandler(webapp.RequestHandler):
  def get(self):
    if (StopExec.all().count() == 0):
      try:
        id = int(self.request.get("id"))
      except (TypeError, ValueError):
        id = 0
      try:
        logging.debug("Request " + str(id) + " launching background task.")
        loopCount = 0
        while(loopCount < numberOfTasksPerRequest):
          hb = HeartBeat()
          hb.requestId = id
          hb.put()
          logging.debug("Request " + str(id) + " saved heartbeat #" + str(loopCount))
          time.sleep(secondsDurationOfTaskWait)
          loopCount = loopCount+1
      finally:
        logging.debug("Launching successor request with id=" + str(id+1))

# This silly back and forth between the two URLs is because of
# "App cannot fetch the same URL as the one used for the request" error.
        if (self.request.url.find("start2") == -1):
          urlfetch.fetch("http://localhost/start2?id=" + str(id+1))
        else:
          urlfetch.fetch("http://localhost/start?id=" + str(id+1))
        logging.debug("Request " + str(id) + " completed")

def main():
  application = webapp.WSGIApplication([("/", MainHandler), ("/start", StartHandler), ("/start2", StartHandler)], debug=True)
  wsgiref.handlers.CGIHandler().run(application)

if __name__ == "__main__":
  main()

One thing I had to change from the earlier version (written using version 1.1.0 of the GAE SDK) is that urlfetch now returns an error if your app tries to invoke itself at the same URL (“App cannot fetch the same URL as the one used for the request”). So I have to alternate between http://localhost/start and http://localhost/start2, both of which are mapped to the same handler. This was added sometimes between SDK 1.1.0 and SDK 1.1.9. If it is aimed at preventing the kind of batton-passing that I am doing, it is pretty unefficient considering how easy it is to circumvent.

It is now 1:02AM Pacific the next day (Friday 2/13) and the process is still progressing, based on the single HTTP request I sent to it at 11:00PM the previous evening. The result page currently returns:

  • From request # 1056, with date Fri, 13 Feb 2009 09:02:15 +0000
  • From request # 1056, with date Fri, 13 Feb 2009 09:02:14 +0000
  • From request # 1056, with date Fri, 13 Feb 2009 09:02:13 +0000
  • From request # 1056, with date Fri, 13 Feb 2009 09:02:12 +0000
  • From request # 1056, with date Fri, 13 Feb 2009 09:02:11 +0000
  • From request # 1055, with date Fri, 13 Feb 2009 09:02:10 +0000
  • From request # 1055, with date Fri, 13 Feb 2009 09:02:09 +0000
  • From request # 1055, with date Fri, 13 Feb 2009 09:02:08 +0000
  • From request # 1055, with date Fri, 13 Feb 2009 09:02:07 +0000
  • From request # 1055, with date Fri, 13 Feb 2009 09:02:06 +0000
  • From request # 1054, with date Fri, 13 Feb 2009 09:02:05 +0000
  • From request # 1054, with date Fri, 13 Feb 2009 09:02:04 +0000
  • From request # 1054, with date Fri, 13 Feb 2009 09:02:03 +0000
  • From request # 1054, with date Fri, 13 Feb 2009 09:02:02 +0000
  • From request # 1054, with date Fri, 13 Feb 2009 09:02:01 +0000
  • From request # 1053, with date Fri, 13 Feb 2009 09:01:59 +0000
  • From request # 1053, with date Fri, 13 Feb 2009 09:01:58 +0000

Which shows that 1056 successive requests have participated in the relay (the last one just happened, at 09:02:15 UTC which is 1:02AM Pacific).

Hopefully it will still be running when I wake up tomorrow.

[UPDATED 2009/2/13, 9:08AM Pacific: It's alive!

  • From request # 6411, with date Fri, 13 Feb 2009 17:08:45 +0000
  • From request # 6410, with date Fri, 13 Feb 2009 17:08:44 +0000
  • From request # 6410, with date Fri, 13 Feb 2009 17:08:43 +0000
  • From request # 6410, with date Fri, 13 Feb 2009 17:08:42 +0000
  • From request # 6410, with date Fri, 13 Feb 2009 17:08:41 +0000
  • From request # 6410, with date Fri, 13 Feb 2009 17:08:40 +0000
  • From request # 6409, with date Fri, 13 Feb 2009 17:08:39 +0000

BTW, the code provided uses localhost to run on my local machine. The version uploaded to Google of course replaces this with rere.appspot.com.]

[UPDATED 2009/5/1: For some reason this entry is attracting a lot of comment spam, so I am disabling comments. Contact me if you'd like to comment.]

Post

Google App Engine is teasing me

In Cloud Computing,Everything,Google App Engine,Implementation,Tech,Utility computing on February 11, 2009 by @vambenepe

Version 1.1.9 of the Google App Engine (GAE) SDK was released earlier this week. The first item in the announcement covers the big news, that developers can “use the Python standard libraries urllib, urllib2 or httplib to make HTTP requests”. My first thought reading this was that I was finally going to be able to use timeouts on outgoing HTTP requests. I care about this because my earlier attempt to emulate a long-running process in GAE was stymied by the GAE quota system, something I think I can work around if I can timeout a request once it has spawned its successor (more precisely, once it has spawned the successor of the incoming request that created the new outgoing request).

I got the new SDK last evening and moved the code from using urlfetch to using urllib2 (with timeout). On my local machine it seems to work, but the quota system (that I am trying to finesse) doesn’t run in the SDK. So the only real test happens once you deploy the app in the Google environment. Which is when I realized that GAE uses Python 2.5.2 and that the timeout parameter in urllib2 (and httplib) came with 2.6. Slap.

I was especially disappointed because the links for urllib2 and httplib in the GAE 1.1.9 SDK announcement take us to the Python 2.6.1 documentation. The timeout parameter is right there at the top of these pages, staring at me. It would be more accurate for this announcement to point to the 2.5.2 documentation (here it is for urllib2 and httplib).

It doesn’t really matter of course because this is just a toy project. And a real scheduler seems to be in the works (see this pre-announcement and this work in progress). I just do this as a fun way to get a glimpse of what it takes to turn existing infrastructure into a *aaS product, something that is going on in different ways in many places these days. Linux wasn’t created to run on something else than hardware. Xen wasn’t created to support EC2. Python wasn’t created to support GAE.

And GAE wasn’t created to support long-running processes. But I haven’t given up.

Post

Now I know why GAE has been killing me

In Everything,Google App Engine,Implementation,Tech,Utility computing on December 16, 2008 by @vambenepe

I haven’t written about Google App Engine lately because I haven’t spent too much time using it. And the time I did spend was mostly consumed fighting JavaScript for some client-side aspects that have nothing to do with the fact that the server side is on GAE. I have only touched JavaScript occasionally in the 12 years since writing this game, and I am pretty rusted. Funny how Python came back to me almost instantly but JavaScript just doesn’t.

I am writing a quick post on GAE today because the Google team just published a blog entry that explains why my attempts to create a long-running process in GAE resulted in my application being disabled for having too many “high CPU requests”, a concept that was never documented in the quota system.

Google is now offering a “quota detailed dashboard” and, as this screen capture shows, it tells us that you only get 60 “high CPU requests” and that they replenish at a rate of 2 per minutes. So at least now I know.

What is still not clear is why a request that is doing nothing but waiting for a URLfetch to come back is considered by Google to use too much CPU…

Post

Moving towards utility/cloud computing standards?

In Amazon,Automation,Business,DMTF,Everything,Google,Google App Engine,Grid,HP,IBM,IT Systems Mgmt,Mgmt integration,Modeling,OVF,Portability,Specs,Standards,Utility computing,Virtualization on June 30, 2008 by @vambenepe

This Forbes article (via John) channels 3Tera’s Bert Armijo’s call for standardization of utility computing. He calls it “Open Cloud” and it would “allow a company’s IT systems to be shared between different cloud computing services and moved freely between them“. Bert talks a bit more about it on his blog and, while he doesn’t reference the Forbes interview (too modest?), he points to Cloudscape as the vision.

A few early thoughts on all this:

  • No offense to Forbes but I wouldn’t read too much into the article. Being Forbes, they get quotes from a list of well-known people/companies (Google and Amazon spokespeople, Forrester analyst, Nick Carr). But these quotes all address the generic idea of utility computing standards, not the specifics of Bert’s project.
  • Saying that “several small cloud-computing firms including Elastra and Rightscale are already on board with 3Tera’s standards group” is ambiguous. Are they on-board with specific goals and a candidate specification? Or are they on board with the general idea that it might be time to talk about some kind of standard in the general area of utility computing?
  • IEEE and W3C are listed as possible hosts for the effort, but they don’t seem like a very good match for this area. I would have thought of DMTF, OASIS or even OGF first. On the face of it, DMTF might be the best place but I fear that companies like 3Tera, Rightscale and Elastra would be eaten alive by the board member companies there. It would be almost impossible for them to drive their vision to completion, unlike what they can do in an OASIS working group.
  • A new consortium might be an option, but a risky and expensive one. I have sometimes wondered (after seeing sad episodes of well-meaning and capable start-ups being ripped apart by entrenched large vendors in standards groups) why VCs don’t play a more active role in standards. Standards sound like the kind of thing VCs should be helping their companies with. VC firms are pretty used to working together, jointly investing in companies. Creating a new standard consortium might be too hard for 3Tera, but if the VCs behind 3Tera, Elastra and Rightscale got together and looked at the utility computing companies in their portfolios, it might make sense to join forces on some well-scoped standardization effort that may not otherwise be given a chance in existing groups.
  • I hope Bert will look into the history of DCML, a similar effort (it was about data center automation, which utility computing is not that far from once you peel away the glossy pictures) spearheaded by a few best-of-bread companies but ignored by the big boys. It didn’t really take off. If it had, utility computing standards might now be built as an update/extension of that specification. Of course DCML started as a new consortium and ended as an OASIS “member section” (a glorified working group), so this puts a grain of salt on my “create a new consortium and/or OASIS group” suggestion above.
  • The effort can’t afford to be disconnected from other standards in the virtualization and IT management domains. How does the effort relate to OVF? To WS-Management? To existing modeling frameworks? That’s the main draw towards DMTF as a host.
  • What’s the open source side of this effort? As John mentions during the latest Redmonk/Willis IT management podcast (starting around minute 24), there needs to a open source side to this. Actually, John thinks all you need is the open source side. Coté brings up Eucalyptus. BTW, if you want an existing combination of standards and open source, have a look at CDDLM (standard) and SmartFrog (implementation, now with EC2/S3 deployment)
  • There seems to be some solid technical raw material to start from. 3Tera’s ADL, combined with Elastra’s ECML/EDML, presumably captures a fair amount of field expertise already. But when you think of them as a starting point to standardization, the mindset needs to switch from “what does my product need to work” to “what will the market adopt that also helps my product to work”.
  • One big question (at least from my perspective) is that of the line between infrastructure and applications. Call me biased, but I think this effort should focus on the infrastructure layer. And provide hooks to allow application-level automation to drive it.
  • The other question is with regards to the management aspect of the resulting system and the role management plays in whatever standard specification comes out of Bert’s effort.

Bottom line: I applaud Bert’s efforts but I couldn’t sleep well tonight if I didn’t also warn him that “there be dragons”.

And for those who haven’t seen it yet, here is a very good document on the topic (but it is focused on big vendors, not on how smaller companies can play the standards game).

[UPDATED 2008/6/30: A couple hours after posting this, I see that Coté has just published a blog post that elaborates on his view of cloud standards. As an addition to the podcast I mentioned earlier.]

[UPDATED 2008/7/2: If you read this in your feed viewer (rather than directly on vambenepe.com) and you don't see the comments, you should go have a look. There are many clarifications and some additional insight from the best authorities on the topic. Thanks a lot to all the commenters.]

Post

Some breathing room for Google App Engine requests

In Everything,Google,Google App Engine,Implementation,Utility computing on June 13, 2008 by @vambenepe

As promised to Felix here is the code that shows how to give extra breathing room to Google App Engine (GAE) requests that may otherwise be killed for taking too long to complete. The approach is similar to the one previously described. But rather than trying to emulate a long-running process, I am simply allowing a request to spread its work over a handful of invocations, thus getting several 9 seconds slots (since this seems to be how much time GAE gives you per request right now).

If all your requests need this then you are going to run into the same “high CPU requests have a small quota, and if you exceed this quota, your app will be temporarily disabled” problem seen in the previous experiment. But if 90% of your requests complete in a normal time and only 10% of the requests need more time, then this approach can help prevent your users from getting an error for 1 out of every 10 requests. And you should fly under the radar of the GAE resource cop.

The way it works is simply that if your request is interrupted for having run too long the client gets a redirect to a new instance of the same handler. Because the code saves its results incrementally in the datastore, the new instance can build on the work of the previous one.

This specific example retrieves the ubuntu-8.04-server-i386.jigdo file (98K) from a handful of Ubuntu mirrors and returns the average/min/max download times (without checking if the transfer was successful or not). I also had to add a 1 second sleep after each fetch in order to trigger the DeadlineExceededError because the fetch operations go too quickly when running on GAE rather than my machine (I guess Google has better connectivity than my mediocre AT&T-provided DSL line, who would have thought).

#!/usr/bin/env python
#
# Copyright 2008 William Vambenepe
#

import wsgiref.handlers
import os
import logging
import time

from google.appengine.ext import db
from google.appengine.ext.webapp import template
from google.appengine.ext import webapp
from google.appengine.api import urlfetch
from google.appengine.runtime import DeadlineExceededError

targetUrls = ["http://mirror.anl.gov/pub/ubuntu-iso/CDs/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ubuntu.mirror.ac.za/ubuntu-release/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://mirrors.cytanet.com.cy/linux/ubuntu/releases/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ftp.kaist.ac.kr/pub/ubuntu-cd/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ftp.itu.edu.tr/Mirror/Ubuntu/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ftp.belnet.be/mirror/ubuntu.com/releases/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ubuntu-releases.sh.cvut.cz/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ftp.crihan.fr/releases/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ftp.uni-kl.de/pub/linux/ubuntu.iso/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ftp.duth.gr/pub/ubuntu-releases/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://no.releases.ubuntu.com/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://neacm.fe.up.pt/pub/ubuntu-releases/hardy/ubuntu-8.04-server-i386.jigdo"]

class MeasurementSet(db.Model):
  iteration = db.IntegerProperty()
  measurements = db.ListProperty(float)

class MainHandler(webapp.RequestHandler):
  def get(self):
    try:
      key = self.request.get("key")
      set = MeasurementSet.get(key)
      if (set == None):
        raise ValueError
      set.iteration = set.iteration + 1
      set.put()
      logging.debug("Resuming existing set, with key " + str(key))
    except:
      set = MeasurementSet()
      set.iteration = 1
      set.measurements = []
      set.put()
      logging.debug("Starting new set, with key " + str(set.key()))
    try:
      # Dereference remaining URLs
      for target in targetUrls[len(set.measurements):]:
        startTime = time.time()
        urlfetch.fetch(target)
        timeElapsed = time.time() - startTime
        time.sleep(1)
        logging.debug(target + " dereferenced in " + str(timeElapsed) + " sec")
        set.measurements.append(timeElapsed)
        set.put()
      # We're done dereferencing URLs, let's publish the results
      longestIndex = 0
      shortestIndex = 0
      totalTime = set.measurements[0]
      for i in range(1, len(targetUrls)):
        totalTime = totalTime + set.measurements[i]
        if set.measurements[i] < set.measurements[shortestIndex]:
          shortestIndex = i
        elif set.measurements[i] > set.measurements[longestIndex]:
          longestIndex = i
      template_values = {"iterations": set.iteration,
                         "longestTime": set.measurements[longestIndex],
                         "longestTarget": targetUrls[longestIndex],
                         "shortestTime": set.measurements[shortestIndex],
                         "shortestTarget": targetUrls[shortestIndex],
                         "average": totalTime/len(targetUrls)}
      path = os.path.join(os.path.dirname(__file__), "steps.html")
      self.response.out.write(template.render(path, template_values))
      logging.debug("Set with key " + str(set.key()) + " has returned")
    except DeadlineExceededError:
      logging.debug("Set with key " + str(set.key())
                    + " interrupted during iteration "+ str(set.iteration)
                    + " with " + str(len(set.measurements)) + " URLs retrieved")
      self.redirect("/steps?key=" + str(set.key()))
      logging.debug("Set with key " + str(set.key()) + " sent redirection")

def main():
  application = webapp.WSGIApplication([("/steps", MainHandler)], debug=True)
  wsgiref.handlers.CGIHandler().run(application)

if __name__ == "__main__":
  main()

I can’t guarantee I will keep it available, but at the time of this writing the application is deployed here if you want to give it a spin. A typical run produces this kind of log:

06-13 01:44AM 36.814 /steps
XX.XX.XX.XX - - [13/06/2008:01:44:45 -0700] "GET /steps HTTP/1.1" 302 0 - -
  D 06-13 01:44AM 36.847
    Starting new set, with key agN2YnByFAsSDk1lYXN1cmVtZW50U2V0GBoM
  D 06-13 01:44AM 37.870
    http://mirror.anl.gov/pub/ubuntu-iso/CDs/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.022078037262 sec
  D 06-13 01:44AM 38.913
    http://ubuntu.mirror.ac.za/ubuntu-release/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0184168815613 sec
  D 06-13 01:44AM 39.962
    http://mirrors.cytanet.com.cy/linux/ubuntu/releases/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0166189670563 sec
  D 06-13 01:44AM 41.12
    http://ftp.kaist.ac.kr/pub/ubuntu-cd/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0205371379852 sec
  D 06-13 01:44AM 42.103
    http://ftp.itu.edu.tr/Mirror/Ubuntu/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0197179317474 sec
  D 06-13 01:44AM 43.146
    http://ftp.belnet.be/mirror/ubuntu.com/releases/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0171189308167 sec
  D 06-13 01:44AM 44.215
    http://ubuntu-releases.sh.cvut.cz/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0160200595856 sec
  D 06-13 01:44AM 45.256
    http://ftp.crihan.fr/releases/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.015625 sec
  D 06-13 01:44AM 45.805
    Set with key agN2YnByFAsSDk1lYXN1cmVtZW50U2V0GBoM interrupted during iteration 1 with 8 URLs retrieved
  D 06-13 01:44AM 45.806
    Set with key agN2YnByFAsSDk1lYXN1cmVtZW50U2V0GBoM sent redirection
  W 06-13 01:44AM 45.808
    This request used a high amount of CPU, and was roughly 28.5 times over the average request CPU limit.
    High CPU requests have a small quota, and if you exceed this quota, your app will be temporarily disabled.

Followed by:

06-13 01:44AM 46.72 /steps?key=agN2YnByFAsSDk1lYXN1cmVtZW50U2V0GBoM
XX.XX.XX.XX - - [13/06/2008:01:44:50 -0700] "GET /steps?key=agN2YnByFAsSDk1lYXN1cmVtZW50U2V0GBoM HTTP/1.1" 200 472
  D 06-13 01:44AM 46.110
    Resuming existing set, with key agN2YnByFAsSDk1lYXN1cmVtZW50U2V0GBoM
  D 06-13 01:44AM 47.128
    http://ftp.uni-kl.de/pub/linux/ubuntu.iso/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.016991853714 sec
  D 06-13 01:44AM 48.177
    http://ftp.duth.gr/pub/ubuntu-releases/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0238039493561 sec
  D 06-13 01:44AM 49.318
    http://no.releases.ubuntu.com/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0177929401398 sec
  D 06-13 01:44AM 50.378
    http://neacm.fe.up.pt/pub/ubuntu-releases/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0226020812988 sec
  D 06-13 01:44AM 50.410
    Set with key agN2YnByFAsSDk1lYXN1cmVtZW50U2V0GBoM has returned
  W 06-13 01:44AM 50.413
  This request used a high amount of CPU, and was roughly 13.4 times over the average request CPU limit.
  High CPU requests have a small quota, and if you exceed this quota, your app will be temporarily disabled.

I believe we can optimize the performance by taking advantage of the fact that successive requests are likely (but not guaranteed) to hit the same instance, allowing global variables to be re-used rather than always going to the datastore. My code is a proof of concept, not an optimized implementation.

Of course, the alternative is to drive things from the client, using JavaScript HTTP requests (rather than HTTP redirect) to repeat the HTTP request until the work has been completed. The list of pros and cons of each approach is left as an exercise to the reader.

[UPDATED 2008/6/13: Added log output. Removed handling of "OverQuotaError" which was not useful since, unlike "DeadlineExceededError", quotas are not per-request. As a result, splitting the work over multiple requests doesn't help. Slowing down a request might help, at which point the approach above might come in handy to prevent this slowdown from triggering "DeadlineExceededError".]

[UPDATED 2008/6/30: Steve Jones provides an interesting analysis of the cut-off time for GAE. Confirms that it's mainly based on wall-clock time rather than CPU time. And that you can sometimes go just over 9 seconds but never up to 10 seconds, which is consistent with my (much less detailed and rigorous) observations.]

Post

Emulating a long-running process (and a scheduler) in Google App Engine

In Brain teaser,Everything,Google,Google App Engine,Implementation,Testing,Utility computing on June 6, 2008 by @vambenepe

As previously described, Google App Engine (GAE) doesn’t support long running processes. Each process lives in the context of an HTTP request handler and needs to complete within a few seconds. If you’re trying to get extra CPU cycles for some task then Amazon EC2, not GAE, is the right tool (including the option to get high-CPU instances for the CPU-intensive tasks).

More surprising is the fact that GAE doesn’t offer a scheduler. Your app can only get invoked when someone sends it an HTTP request and you can’t ask GAE to generate a canned request every so often (crontab-style). That seems both limiting and arbitrary. In fact, I would be surprised if GAE didn’t soon add support for this.

In the meantime, your best bet is to get an account on a separate server that lets you schedule jobs, at which point you can drive your GAE application from that external scheduler (through HTTP requests to your GAE app). But just for the intellectual exercise, how would one meet the need while staying entirely within the confines of the Google-provided infrastructure?

  • The most obvious option is to piggyback on HTTP requests from your visitors. But:
    • this assumes that you consistently get visitors at a frequency greater than your scheduler’s interval,
    • since you can’t launch sub-processes in GAE, this delays your responses to the visitor,
    • more worrisome, if your scheduled task takes more than a few seconds this means your application might be interrupted by GAE before you respond to the visitor, resulting in a failed request from their perspective.
  • You can try to improve a bit on this by doing this processing not as part of the main request from your visitor but rather by putting in the response HTML some JavaScript that will asynchronously send you HTTP requests in the background (typically not visible to the user). This way, a given visitor will give you repeated invocations for as long as the page is open in the browser. And you can set the invocation interval. You can even create some kind of server-controlled auto-modulation of the interval (increasing it as your number of concurrent visitors increases) so that you don’t eat all your Google-allocated incoming HTTP quota with these XMLHttpRequest invocations. This would probably be a very workable way to do it in practice even though:
    • it only works if your application has visitors who use web browsers, not if it only consumed by programs (e.g. through RSS feeds or other XML format),
    • it puts the burden on your visitors who may or may not appreciate it, assuming they realize it is happening (how would you feel if your real estate agent had to borrow your cell phone to arrange home visits for you and their other customers?).
  • While GAE doesn’t offer a scheduler, another Google service, Google Reader, offers one of sorts. If you register a feed there, Google’s FeedReader will retrieve it once a while (based on my logs, it happens approximately every hour for each of the two feeds for this blog). You can create multiple URLs that all map to the same handler and return some dummy RSS. If you register these feeds with Google Reader, they’ll get pulled once a while. Of course there is no guarantee that the pulling of the different feeds will be nicely spread out, but if you register enough of them you should manage to get invoked with a frequency compatible with you desired scheduler’s frequency.

That’s all nice, but it doesn’t entirely live within the GAE application. It depends on either the visitors or Google Reader. Can we do this entirely within GAE?

The idea is that since a GAE app can only executes within an HTTP request handler, which only runs for a few seconds, you can emulate a long-running process by automatically starting a successor request when the previous one is killed. This is made possible by two characteristics of the GAE runtime:

  • When an HTTP request is canceled on the client side, the request execution on the server is permitted to continue (until it returns or GAE kills it for having run too long).
  • When GAE kills a request for having run too long, it does it through an exception that you have a chance to handle (at least for a few seconds, until you get killed for good), which is when you initiate the HTTP request that spawns the successor process.

If you’ve watched (or played) Rugby, this is equivalent to passing the ball to a teammate during that short interval between when you’re tackled and when you hit the ground (I have no idea whether the analogy also applies to Rugby’s weird cousin called American Football).

In practice, all you have to do is structure your long running task like this:

class StartHandler(webapp.RequestHandler):
  def get(self):
    if (StopExec.all().count() == 0):
      try:
        id = int(self.request.get("id"))
        logging.debug("Request " + str(id) + " is starting its work.")
        # This is where you do your work
      finally:
        logging.debug("Request " + str(id) + " has been stopped.")
        # Save state to the datastore as needed
        logging.debug("Launching successor request with id=" + str(id+1))
        res = urlfetch.fetch("http://myGaeApp.appspot.com/start?id=" + str(id+1))

Once you have deployed this app, just point your browser to http://myGaeApp.appspot.com/start?id=0 (assuming of course that your GAE app is called “myGaeApp”) and the long-running process is started. You can hit the “stop” button on your browser and turn off your computer, the process (or more exactly the succession of processes) has a life of its own entirely within the GAE infrastructure.

The “if (StopExec.all().count() == 0)” statement is my way of keeping control over the beast (if only Dr. Frankenstein had as much foresight). StopExec is an entity type in the datastore for my app. If I want to kill this self-replicating process, I just need to create an entity of this type and the process will stop replicating. Without this, the only way to stop it would be to delete the whole application through the GAE dashboard. In general, using the datastore as shared memory is the way to communicate with this emulation of a long-running process.

A scheduler is an obvious example of a long-running process that could be implemented that way. But there are other examples. The only constraint is that your long-running process should expect to be interrupted (approximately every 9 seconds based on what I have seen so far). It will then re-start as part of a new instance of the same request handler class. You can communicate state between one instance and its successor either via the request parameters (like the “id” integer that I pass in the URL) or by writing to the datastore (in the “finally” clause) and reading from it (at the beginning of your task execution).

By the way, you can’t really test such a system using the toolkit Google provides for local testing, because that toolkit behaves very differently from the real GAE infrastructure in the way it controls long-running processes. You have to run it in the real GAE environment.

Does it work? For a while. The first time I launched it, it worked for almost 30 minutes (that’s a lot of 9 second-long processes). But I started to notice these worrisome warnings in the logs: “This request used a high amount of CPU, and was roughly 21.7 times over the average request CPU limit. High CPU requests have a small quota, and if you exceed this quota, your app will be temporarily disabled.”

And indeed, after 30 minutes of happiness my app was disabled for a bit.

My quota figures on the dashboard actually looked pretty good. This was not a very busy application.

CPU Used 175.81 of 199608.00 Gigacycles (0%)
Data Sent 0.00 of 2048.00 Megabytes (0%)
Data Received 0.00 of 2048.00 Megabytes (0%)
Emails Sent 0.00 of 2000.00 Emails (0%)
Megabytes Stored 0.05 of 500.00 Megabytes (0%)

But the warning in the logs points to some other restriction. Google doesn’t mind if you use a given number of CPU cycles through a lot of small requests, but it complains if you use the same number of cycles through a few longer requests. Which is not really captured in the “understanding application quotas” page. I also question whether my long requests actually consume more CPU than normal (shorter) requests. I stripped the application down to the point where the “this is where you do your work” part was doing nothing. The only actual work, in the “finally” clause, was to opens an HTTP connection and wait for it to return (which never happens) until the GAE runtime kills the request completely. Hard to see how this would actually use much CPU. Yet, same warning. The warning text is probably not very reflective of the actual algorithm that flags my request as a hog.

What this means is that no matter how small and slim the task is, the last line (with the urlfetch.fetch() call) by itself is enough to get my request identified as a hog. Which means that eventually the app is going to get disabled. Which is silly really because by that the time fetch() gets called nothing useful is happening in this request (the work has transitioned to the successor request) and I’d be happy to have it killed as soon as the successor has been spawned. But GAE doesn’t give you a way to set client-side timeout on outgoing HTTP requests. Neither can you configure the GAE cop to kill you early so that you don’t enter the territory of “this request used a high amount of CPU”.

I am pretty confident that the ability to set client-side HTTP timeout will be added to the urlfetch API. Even Google’s documentation acknowledges this limitation: “Note: Since your application must respond to the user’s request within several seconds, a URL fetch action to a slow remote server may cause your application to return a server error to the user. There is currently no way to specify a time limit to the URL fetch action.” Of course, by the time they fix this they may also have added a real scheduler…

In the meantime, this was a fun exploration of the GAE environment. It makes it clear to me that this environment is still a toy. But a very interesting and promising one.

[UPDATED 2009/28: Looks like a real GAE scheduler is coming.]

Post

Google App Engine: less is more

In Amazon,Everything,Google,Google App Engine,OVM,Portability,Tech,Utility computing,Virtualization,VMware on May 31, 2008 by @vambenepe

“If you have a stove, a saucepan and a bottle of cold water, how can you make boiling water?”

If you ask this question to a mathematician, they’ll think about it a while, and finally tell you to pour the water in the saucepan, light up the stove and put the saucepan on it until the water boils. Makes sense. Then ask them a slightly different question: “if you have a stove and a saucepan filled with cold water, how can you make boiling water?”. They’ll look at you and ask “can I also have a bottle”? If you agree to that request they’ll triumphantly announce: “pour the water from the saucepan into the bottle and we are back to the previous problem, which is already solved.”

In addition to making fun of mathematicians, this is a good illustration of the “fake machine” approach to utility computing embodied by Amazon’s EC2. There is plenty of practical value in emulating physical machines (either in your data center, using VMWare/Xen/OVM or at a utility provider’s site, e.g. EC2). They are all rooted in the fact that there is a huge amount of code written with the assumption that it is running on an identified physical machine (or set of machines), and you want to keep using that code. This will remain true for many many years to come, but is it the future of utility computing?

Google’s App Engine is a clear break from this set of assumptions. From this perspective, the App Engine is more interesting for what it doesn’t provide than for what it provides. As the description of the Sandbox explains:

“An App Engine application runs on many web servers simultaneously. Any web request can go to any web server, and multiple requests from the same user may be handled by different web servers. Distribution across multiple web servers is how App Engine ensures your application stays available while serving many simultaneous users [not to mention that this is also how they keep their costs low -- William]. To allow App Engine to distribute your application in this way, the application runs in a restricted ‘sandbox’ environment.”

The page then goes on to succinctly list the limitations of the sandbox (no filesystem, limited networking, no threads, no long-lived requests, no low-level OS functions). The limitations are better described and commented upon here but even that article misses one major limitation, mentioned here: the lack of scheduler/cron.

Rather than a feature-by-feature comparison between the App Engine and EC2 (which Amazon would won handily at this point), what is interesting is to compare the underlying philosophies. Even with Amazon EC2, you don’t get every single feature your local hardware can deliver. For example, in its initial release EC2 didn’t offer a filesystem, only a storage-as-a-service interface (S3 and then SimpleDB). But Amazon worked hard to fix this as quickly as possible in order to be appear as similar to a physical infrastructure as possible. In this entry, announcing persistent storage for EC2, Amazon’s CTO takes pain to highlight this achievement:

“Persistent storage for Amazon EC2 will be offered in the form of storage volumes which you can mount into your EC2 instance as a raw block storage device. It basically looks like an unformatted hard disk. Once you have the volume mounted for the first time you can format it with any file system you want or if you have advanced applications such as high-end database engines, you could use it directly.”

and

“And the great thing is it that it is all done with using standard technologies such that you can use this with any kind of application, middleware or any infrastructure software, whether it is legacy or brand new.”

Amazon works hard to hide (from the application code) the fact that the infrastructure is a huge, shared, distributed system. The beauty (and business value) of their offering is that while the legacy code thinks it is running in a good old data center, the paying customer derives benefits from the fact that this is not the case (e.g. fast/easy/cheap provisioning and reduced management responsibilities).

Google, on the other hand, embraces the change in underlying infrastructure and requires your code to use new abstractions that are optimized for that infrastructure.

To use an automotive analogy, Amazon is offering car drivers to switch to a gas/electric hybrid that refuels in today’s gas stations while Google is pushing for a direct jump to hydrogen fuel cells.

History is rarely kind to promoters of radical departures. The software industry is especially fond of layering the new on top of the old (a practice that has been enabled by the constant increase in underlying computing capacity). If you are wondering why your command prompt, shell terminal or text editor opens with a default width of 80 characters, take a trip back to 1928, when IBM defined its 80-columns punch card format. Will Google beat the odds or be forced to be more accommodating of existing code?

It’s not the idea of moving to a more abstracted development framework that worries me about Google’s offering (JEE, Spring and Ruby on Rails show that developers want this move anyway, for productivity reasons, even if there is no change in the underlying infrastructure to further motivate it). It’s the fact that by defining their offering at the level of this framework (as opposed to one level below, like Amazon), Google puts itself in the position of having to select the right framework. Sure, they can support more than one. But the speed of evolution in that area of the software industry shows that it’s not mature enough (yet?) for any party to guess where application frameworks are going. Community experimentation has been driving application frameworks, and Google App Engine can’t support this. It can only select and freeze a few framework.

Time will tell which approach works best, whether they should exist side by side or whether they slowly merge into a “best of both worlds” offering (Amazon already offers many features, like snapshots, that aim for this “best of both worlds”). Unmanaged code (e.g. C/C++ compiled programs) and managed code (JVM or CLR) have been coexisting for a while now. Traditional applications and utility-enabled applications may do so in the future. For all I know, Google may decide that it makes business sense for them too to offer a Xen-based solution like EC2 and Amazon may decide to offer a more abstracted utility computing environment along the lines of the App Engine. But at this point, I am glad that the leaders in utility computing have taken different paths as this will allow the whole industry to experiment and progress more quickly.

The comparison is somewhat blurred by the fact that the Google offering has not reached the same maturity level as Amazon’s. It has restrictions that are not directly related to the requirements of the underlying infrastructure. For example, I don’t see how the distributed infrastructure prevents the existence of a scheduling service for background jobs. I expect this to be fixed soon. Also, Amazon has a full commercial offering, with a price list and an ecosystem of tools, why Google only offers a very limited beta environment for which you can’t buy extra capacity (but this too is changing).