Cloud platform patching conundrum: PaaS has it much worse than IaaS and SaaS

The potential user impact of changes (e.g. patches or config changes) made on the Cloud infrastructure (by the Cloud provider) is a sore point in the Cloud value proposition (see Hoff’s take for example). You have no control over patching/config actions taken by the provider, any of which could potentially affect you. In a traditional data center, you can test the various changes on specific applications; you don’t have to apply them at the same time on all servers; and you can even decide to skip some infrastructure patches not relevant to your application (“if it ain’t broke…”). Not so in a Cloud environment, where you may not even know about a change until after the fact. You also have no control over the timing and roll-out of the patch, so some of your instances may be running on patched nodes while others are not (good luck troubleshooting that).

Unfortunately, this is even worse for PaaS than IaaS, simply because you sit on a lot more infrastructure that is opaque to you. In an IaaS environment, the only things that can change are the hardware (rarely a cause of problems) and the hypervisor (or equivalent Cloud OS). In a PaaS environment, it’s all that plus whatever flavor of OS and application container is used. Depending on how streamlined this all is (just enough OS/AS versus a traditional deployment), that’s potentially a lot of code and configuration. Troubleshooting is also somewhat easier in an IaaS setup because the error logs are localized (or localizable) to a specific instance. Not necessarily so with PaaS (and even if you could localize the error, you couldn’t guarantee that your troubleshooting test runs on the same node anyway).

In a way, PaaS is squeezed between IaaS and SaaS on this. IaaS gets away with a manageable problem because the opaque infrastructure is not too thick. For SaaS it’s manageable too because the consumer is typically either a human (who is a lot more resilient to change) or a very simple and well-understood interface (e.g. IMAP or some Web services). Contrast this with PaaS where the contract is that of an application container (e.g. JEE, RoR, Django). There are all kinds of subtle behaviors (e.g. timing/ordering issues) that are not part of the contract and can surface after a patch: for example, a bug in the application that was never found because, before the patch, things always happened in a certain order that the application implicitly – and erroneously – relied on. That’s exactly why you always test your key applications today even if the OS/AS patch should, in theory, not change anything for the application. And it’s not just patches that can do that. For example, network upgrades can introduce timing changes that surface new issues in the application.
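
To make the ordering point concrete, here is a minimal Python sketch of such a latent bug. Everything in it is hypothetical (the event names, the functions, the numbers); it only illustrates code that silently relies on an arrival order the container never promised, next to a version that does not.

```python
# Hypothetical example: an application totals orders using a tax rate that is
# delivered as an event. Nothing in the container contract says the tax_rate
# event arrives first; it just always did before the patch.

def fragile_total(events):
    """Implicitly assumes the 'tax_rate' event precedes every 'order' event."""
    tax_rate = 0.0
    total = 0.0
    for kind, value in events:
        if kind == "tax_rate":
            tax_rate = value
        elif kind == "order":
            # Orders seen before the rate arrives are silently taxed at 0.
            total += value * (1 + tax_rate)
    return total


def robust_total(events):
    """Makes no assumption about arrival order: applies the rate at the end."""
    tax_rate = None
    amounts = []
    for kind, value in events:
        if kind == "tax_rate":
            tax_rate = value
        elif kind == "order":
            amounts.append(value)
    if tax_rate is None:
        raise ValueError("no tax_rate event received")
    return sum(amount * (1 + tax_rate) for amount in amounts)


# Same events, two delivery orders (e.g. before and after a platform patch).
before_patch = [("tax_rate", 0.5), ("order", 100.0), ("order", 20.0)]
after_patch = [("order", 100.0), ("tax_rate", 0.5), ("order", 20.0)]

print(fragile_total(before_patch), fragile_total(after_patch))
print(robust_total(before_patch), robust_total(after_patch))
```

The fragile version gives different totals for the two orderings without raising any error, which is the kind of problem that only shows up once the platform underneath changes its timing.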

And it goes both ways. Just like you can be hurt by the Cloud provider patching things, you can be hurt by them not patching things. What if there is an obscure bug in their infrastructure that only affects your application? First you have to convince them to troubleshoot with you. Then you have to convince them to produce (or get their software vendor to produce) and deploy a patch.

So what are the solutions? Is PaaS doomed to never go beyond hobbyists? Of course not. The possible solutions are:

  • Write a bug-free and high-performance PaaS infrastructure from the start, one that never needs to be changed in any way. How hard could it be? ;-)
  • More realistically, narrowly define container types to reduce both the contract and the size of the underlying implementation of each instance. For example, rather than deploying a full JEE+SOA container, componentize the application so that each component can deploy in a small container (e.g. a servlet engine, a process management engine, a rule engine, etc.). As a result, the interface exposed by each container type can be more easily and fully tested. And because each instance is slimmer, it requires fewer patches over time.
  • PaaS providers may give their users some amount of visibility and control over this. For example, by announcing upgrades ahead of time, providing updated nodes to test on early and allowing users to specify “freeze” periods where nothing changes (unless an urgent security patch is needed, presumably). Time for a Cloud “refresh” in ITIL/ITSM-land?
  • The PaaS providers may also be able to facilitate debugging of infrastructure-related problems, for example by stamping the logs with a version ID for the infrastructure on the node that generated the log entry, and by letting users request that a test run on a node with the same version. Keep in mind that in a SOA / Composite world, the root cause of a problem found on one node may be a configuration change on a different node…
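
As an illustration of the log-stamping idea in the last bullet, here is a rough Python sketch using the standard `logging` module. It is not how any actual PaaS does it: the node names, the version strings and the `infra=` log format are all made up for the example, and a real provider would inject the version on its side rather than leave it to the application.

```python
import io
import logging
from collections import Counter


class InfraVersionFilter(logging.Filter):
    """Stamp every record with a (hypothetical) infrastructure version ID."""

    def __init__(self, infra_version):
        super().__init__()
        self.infra_version = infra_version

    def filter(self, record):
        record.infra_version = self.infra_version
        return True  # never drop the record, only annotate it


def make_node_logger(node_name, infra_version, stream):
    """Build a logger standing in for one node of the platform."""
    logger = logging.getLogger(f"paas.{node_name}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "infra=%(infra_version)s %(levelname)s %(name)s %(message)s"))
    handler.addFilter(InfraVersionFilter(infra_version))
    logger.addHandler(handler)
    return logger


def errors_by_infra_version(log_text):
    """Count ERROR lines per infrastructure version found in the logs."""
    counts = Counter()
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].startswith("infra=") and fields[1] == "ERROR":
            counts[fields[0][len("infra="):]] += 1
    return counts


# Two nodes, one already patched and one not (version IDs are invented).
stream = io.StringIO()
patched = make_node_logger("node-a", "2009.10-p3", stream)
unpatched = make_node_logger("node-b", "2009.10-p2", stream)
patched.error("request timed out")
patched.error("request timed out")
unpatched.info("request served")

print(errors_by_infra_version(stream.getvalue()))
```

With the version on every line, a user can at least see that the errors cluster on one infrastructure version, and knows which version to ask for when requesting a test node.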

Some closing notes:

  • Another incarnation of this problem is likely to show up in the form of PaaS certification. We should not assume that just because you use a PaaS you are the developer of the application. Why can’t I license an ISV app that runs on GAE? But then, what does the ISV certify against? A given PaaS provider, e.g. Google? A given version of the PaaS infrastructure (if there is such a thing… Google advertises versions of the GAE SDK, but not of the actual GAE runtime)? Or maybe a given PaaS software stack, e.g. the Oracle/Microsoft/IBM/VMware/JBoss/etc. stack, meaning that any Cloud provider who uses this software stack is certified?
  • I have only discussed here changes to the underlying platform that do not change the contract (or at least only introduce backward-compatible changes, i.e. add APIs but don’t remove any). The matter of non-compatible platform updates (and version coexistence) is a whole other ball of wax, one that comes with echoes of SOA governance discussions (because in PaaS we are talking about pure software contracts, not hardware or hardware-like contracts). It is another area in which PaaS has larger challenges than IaaS.
  • Finally, for an illustration of how a highly focused and specialized container cuts down on the need for config changes, look at this photo from earlier today during the presentation of JRockit Virtual Edition at Oracle Open World. This slide shows (in font size 3, don’t worry, you’re not supposed to be able to read it) the list of configuration files present on a normal Linux instance, versus a stripped-down (“JeOS”) Linux, versus JRockit VE.


By the way, JRockit VE is very interesting and the environment today is much more favorable than when BEA first did it, but that’s a topic for another post.

[UPDATED 2009/10/22: For more on this (in an EC2-centric context) see section 4 (“service problem resolution”) of this IBM paper. It ends with “another possible direction is to develop new mechanisms or APIs to enable cloud users to directly and automatically query and correlate application level events with lower level hardware information to better identify the root cause of the problem”.]

[UPDATED 2012/4/1: An example of a PaaS platform update which didn’t go well.]

9 Responses to Cloud platform patching conundrum: PaaS has it much worse than IaaS and SaaS

  1. Interesting read as usual, but in reality I’m not seeing this being a problem. Sure we’re not seeing millions of applications hosted on “cloud platform services” (can we drop the “aaS” already please?) but vendors like Google are already doing a lot of what you talk about.

    Google App Engine for example is fairly careful about its releases such that they’re in a pretty good state by the time they see the light of day. Updates are incremental so developers can deal with issues one by one rather than bundled together in a large release. It exposes only a narrow view of both Python and Java APIs so as not to give developers enough rope to hang themselves. I *love* this – so many failures are caused because of developers straying from the garden path and many of these “opportunities” have been taken away. There is versioning (though I’m not sure it’s really used) and all their nodes are the same so it’s overwhelmingly unlikely that you’ll run into configuration related problems.

    Salesforce is more from a software background so they do “seasonal” releases… that’s good for stuffy enterprise change management monkeys but not really fitting with the “cloud way” of change management – small, incremental changes on an ongoing basis. One advantage of the Salesforce way is that you can just patch the specific bugs and nothing more (ala Debian Security Advisories) but that doesn’t give you a way to introduce new features without having releases. Google Apps is also worth a look in this regard as they allow you to stop pre-release features from filtering through, in which case they are battle tested by the time you invariably see them.

    Sam

  2. Would it not be better just to offload the actual (Java) computation work onto an appliance that has very few such configuration artifacts, like the Azul Systems Compute Appliance – http://www.azulsystems.com/? If you still need native OS integration, it is possible, though at the cost of a network roundtrip (to and from the appliance). That is better than not being able to do it at all.

    By the way how many of those files visualized above are actually likely to be under change management?

  3. Hi Sam,

    I hope you’re right and I am overly worried. But you haven’t convinced me yet. You write that

    “Google App Engine for example is fairly careful about its releases such that they’re in a pretty good state by the time they see the light of day.”

    That’s nice, but you know Oracle and its competitors are also “fairly careful” (actually more than that) about patches and updates. Does this mean that when the customer gets them they should deploy them immediately on their production systems? Not necessarily. If it’s mission critical, they’ll test on a replicated environment, they’ll schedule the change and they’ll have a rollback strategy. That’s not because Oracle is sloppy. Most of the time they’d be fine blindly applying the patch. But “most of the time” does not cut it for mission-critical systems. The issue I describe is in the context of such systems (maybe I should have been more explicit), which are the systems for which so much config management technology/processes have been created.

    For these systems to run in PaaS environments, this issue needs to be addressed. As usual, it will be through a mix of technology and processes.

  4. Pingback: links for 2009-10-22 « On IT-business alignment and related things

  5. Getting educated via your posts, William. Thankful for this as we still consider ourselves a PaaS player, first and foremost, and we think that our releases are pretty much tops with help from the different open source dev communities.

    Of course, personally, I think the threat of clouds like Amazon creeping up on incorporating PaaS features is definitely real, but then again, I think our guys are way ahead for now.

    Thanks.
    Alain
    G2iX

  6. Pingback: William Vambenepe — Desirable technical characteristics of PaaS

  7. Pingback: William Vambenepe — Analyzing the VMforce announcement

  8. Pingback: William Vambenepe — Lifting the curtain on PaaS Cloud infrastructure (can you handle the truth?)

  9. Pingback: Come for the PaaS Functional Model, stay for the Cloud Operational Model « Cloud Comedy, Cloud Tragedy