Category Archives: Everything

Cloud platform patching conundrum: PaaS has it much worse than IaaS and SaaS

The potential user impact of changes (e.g. patches or config changes) made on the Cloud infrastructure (by the Cloud provider) is a sore point in the Cloud value proposition (see Hoff’s take for example). You have no control over patching/config actions taken by the provider, any of which could potentially affect you. In a traditional data center, you can test the various changes on specific applications; you don’t have to apply them at the same time on all servers; and you can even decide to skip some infrastructure patches not relevant to your application (“if it aint’ broken…”). Not so in a Cloud environment, where you may not even know about a change until after the fact. And you have no control over the timing and the roll-out of the patch, so that some of your instances may be running on patched nodes and others may not (good luck with troubleshooting that).

Unfortunately, this is even worse for PaaS than IaaS. Simply because you seat on a lot more infrastructure that is opaque to you. In a IaaS environment, the only thing that can change is the hardware (rarely a cause of problem) and the hypervisor (or equivalent Cloud OS). In a PaaS environment, it’s all that plus whatever flavor of OS and application container is used. Depending on how streamlined this all is (just enough OS/AS versus a traditional deployment), that’s potentially a lot of code and configuration. Troubleshooting is also somewhat easier in a IaaS setup because the error logs are localized (or localizable) to a specific instance. Not necessarily so with PaaS (and even if you could localize the error, you couldn’t guarantee that your troubleshooting test runs on the same node anyway).

In a way, PaaS is squeezed between IaaS and SaaS on this. IaaS gets away with a manageable problem because the opaque infrastructure is not too thick. For SaaS it’s manageable too because the consumer is typically either a human (who is a lot more resilient to change) or a very simple and well-understood interface (e.g. IMAP or some Web services). Contrast this with PaaS where the contract is that of an application container (e.g. JEE, RoR, Django).There are all kinds of subtle behaviors (e.g, timing/ordering issues) that are not part of the contract and can surface after a patch: for example, a bug in the application that was never found because before the patch things always happened in a certain order that the application implicitly – and erroneously – relied on. That’s exactly why you always test your key applications today even if the OS/AS patch should, in theory, not change anything for the application. And it’s not just patches that can do that. For example, network upgrades can introduce timing changes that surface new issues in the application.

And it goes both ways. Just like you can be hurt by the Cloud provider patching things, you can be hurt by them not patching things. What if there is an obscure bug in their infrastructure that only affects your application. First you have to convince them to troubleshoot with you. Then you have to convince them to produce (or get their software vendor to produce) and deploy a patch.

So what are the solutions? Is PaaS doomed to never go beyond hobbyists? Of course not. The possible solutions are:

  • Write a bug-free and high-performance PaaS infrastructure from the start, one that never needs to be changed in any way. How hard could it be? ;-)
  • More realistically, narrowly define container types to reduce both the contract and the size of the underlying implementation of each instance. For example, rather than deploying a full JEE+SOA container componentize the application so that each component can deploy in a small container (e.g. a servlet engine, a process management engine, a rule engine, etc). As a result, the interface exposed by each container type can be more easily and fully tested. And because each instance is slimmer, it requires fewer patches over time.
  • PaaS providers may give their users some amount of visibility and control over this. For example, by announcing upgrades ahead of time, providing updated nodes to test on early and allowing users to specify “freeze” periods where nothing changes (unless an urgent security patch is needed, presumably). Time for a Cloud “refresh” in ITIL/ITSM-land?
  • The PaaS providers may also be able to facilitate debugging of infrastructure-related problem. For example by stamping the logs with a version ID for the infrastructure on the node that generated the log entry. And the ability to request that a test runs on a node with the same version. Keeping in mind that in a SOA / Composite world, the root cause of a problem found on one node may be a configuration change on a different node…

Some closing notes:

  • Another incarnation of this problem is likely to show up in the form of PaaS certification. We should not assume that just because you use a PaaS you are the developer of the application. Why can’t I license an ISV app that runs on GAE? But then, what does the ISV certify against? A given PaaS provider, e.g. Google? A given version of the PaaS infrastructure (if there is such a thing… Google advertises versions of the GAE SDK, but not of the actual GAE runtime)? Or maybe a given PaaS software stack, e.g. the Oracle/Microsoft/IBM/VMWare/JBoss/etc, meaning that any Cloud provider who uses this software stack is certified?
  • I have only discussed here changes to the underlying platform that do not change the contract (or at least only introduce backward-compatible changes, i.e. add APIs but don’t remove any). The matter of non-compatible platform updates (and version coexistence) is also a whole other ball of wax, one that comes with echoes of SOA governance discussions (because in PaaS we are talking about pure software contracts, not hardware or hardware-like contracts). Another area in which PaaS has larger challenges than IaaS.
  • Finally, for an illustration of how a highly focused and specialized container cuts down on the need for config changes, look at this photo from earlier today during the presentation of JRockit Virtual Edition at Oracle Open World. This slide shows (in font size 3, don’t worry you’re not supposed to be able to read), the list of configuration files present on a normal Linux instance, versus a stripped-down (“JeOS”) Linux, versus JRockit VE.


By the way, JRockit VE is very interesting and the environment today is much more favorable than when BEA first did it, but that’s a topic for another post.

[UPDATED 2009/10/22: For more on this (in an EC2-centric context) see section 4 (“service problem resolution”) of this IBM paper. It ends with “another possible direction is to develop new mechanisms or APIs to enable cloud users to directly and automatically query and correlate application level events with lower level hardware information to better identify the root cause of the problem”.]

[UPDATES 2012/4/1: An example of a PaaS platform update which didn’t go well.]

9 Comments

Filed under Application Mgmt, Cloud Computing, Everything, Google App Engine, Governance, ITIL, Manageability, Mgmt integration, PaaS, SaaS, Utility computing, Virtualization

PaaS as a satisfying and success-ready hobbyist platform

I don’t know anyone in Silicon Valley who can code and doesn’t fantasize about writing an accidental killer app. One that gets designed during a long layover in DEN and implemented in a rainy weekend (El Nino is my VC). One that was only supposed to meet the needs of a few friends and is used by half of the world a few months/years later.

I am not talking about seasoned entrepreneurs, who have a network, discipline, resources and enough experience to know that it takes a lot more than a cool idea. Rather, about programming hobbyists (who may of may not be programmers in their day jobs),

By definition, hobbyists only do things that are satisfying. In the rarefied air of Silicon Valley, it also helps if there is a conceivable “upside” to dream about. Platform as a Service (PaaS, e.g. Google App Engine) provides both to software-oriented hobbyist. And make it very cheap (borderline free), which doesn’t hurt.

Satisfying

In a well-crafted PaaS environment, the development process and the result are both satisfying. I am not a Google shrill, but GAE is a fair example. The barrier to entry is very low (the download is less than 10MB and contains all you need to get started). In an hour you have an application running locally. In an hour and 5 minutes you have it deployed and accessible on the web for all. And yet this ease of bootstrap does not come at the cost of too many longer-term limitations (now that the environment has gown a bit from the original limitations and provides scheduled and background jobs). Unlike Yahoo Pipes, for which the first impression is “nifty!” and the second is “gimme a textual representation of my pipe now!”.

Beyond the easy ramp-up, the main source of satisfaction developing in a PaaS environment is that you spend 99% of your time working on the application. Not the OS, not the firewall, not the application container, not the database. Not to mention having to deal with your co-lo provider or the leased line for the servers in the basement. If you are a hobbyist with only a few spare hours per week, that’s a make or break deal. It also means that you have a fighting chance of developing a secure application because you are responsible for a much smaller surface of attack.

Eons ago (in computing time), Visual Basic was the name of the game for these people. More recent was the rise of PHP. It dramatically lowered the barrier to adoption and provided a quick route to a working web application. I know several non-professional developers (e.g. web designers) who are scared of any “normal” programming language but happily write PHP (often of equivalent complexity BTW). Combine this with the wide availability of ISP-managed PHP environments and you get close to what GAE gives you. At the risk of adding to the annoying trend of retroactively cloudifying everything, I think of ISP-hosted PHP as the first generation of PaaS. But it is focused on “show what’s in the DB” scenarios rather than service-centric / mash-up / web 2.0 integration. And even for DB-centric scenarios, by and large PHP coders don’t want to think too much about model and queries (and it shows). I think Google decided to go with Python rather than the easy route of aping the hosted PHP environments in large part to avoid hitting such ceilings down the road. Not surprisingly, PHP support is currently the most requested GAE feature, ahead of Perl and Ruby. Lets see if Google tries to get the PHP community on board or prefers to stay clear of such PaaS legacy (already!).

Ready for success

In the unlikely event that your application catches fire and sees wide adoption (which is not impossible, especially if well integrated in a social network), what are you going to do about it? Keeping in mind two constraints: first, this is a part-time hobby of yours. Second, don’t dream of riches. We are talking about an influx of facebookers or twitterers here, with no intention to pay for anything. But click they will. If you were going to answer: “I get funding and hire a real IT staff” then think again. You most likely won’t get funding for your toy app without revenue potential. And even if you do, by the time you have it it’s too late and people have moved on because you were not there when the spotlight was on you.

With a PaaS-based application you have a fighting chance. If the spike is short enough, you may not even hit the limit of the free quota. If it does, you have the choice of whether you are willing to pay to support the extra traffic or not. No change in code required (though it may be advised anyway, if your app wasn’t architected for efficient scaling – PaaS doesn’t entirely take this off your hands).

That “sudden spike” story is a commonly-invoked use case for EC2. And it’s probably true for a start-up with an IT staff (of at least one full-time person). But despite Amazon’s efforts (and other providers such as RightScale) this type of scaling is something you have to architect for and putting it in place takes away from the time you spend coding application features. It also means that you are responsible for more infrastructure (OS and application container at least). Not to mention that IaaS providers don’t usually offer free resources for limited usage, the way Google does (I suspect 99% of GAE apps never get over this quota). Even if a small EC2 instance is not very expensive (though it adds up over time if you keep it up for that occasional user), the difference between “free” and “cheap” is significant. As an application provider you’d like for this not to be the case with your users, but as a consumer of infrastructure service you’re on the other side of the deal, aren’t you?

There is a reason why suburbia-bound SUVs are advertised crossing mountain streams. The “I could if I wanted to” line has appeal. For the software hobbyist, knowing that your application won’t crash if it happened to meet success (even if only for a couple of days, e.g. the Slashdot effect) is a good feeling (“I could if *they* wanted to”). In truth this occasion is rare (and likely to end up like this), but you are ready for the eventuality. And if there are enough such hobbyists, then statistically some will encounter it.

The provider’s upside

That last point brings the topic of the PaaS provider’s upside in this. I have read several critical comments arguing that no company will rewrite their application for GAE (true) and that no start-up will write their new code for it either because of the risk of lock-in (also true: “being bought by Google” is not a bad outcome but “has to be bought by Google” is a bad exit strategy). But I think this misses the point of casting such a wide nest and starting with creating a great tool for hobbyists.

After all, Google has made a great business monetizing millions of small sites none of which makes much money by itself. At the very least least this can grow the web and, symbiotically, Google. With  two possible upsides:

  • Some of these hobbyist applications may actually take off and Google becomes their natural partner/godfather (including managing their user accounts). For example, wouldn’t it be nice for Google if Craigslist or Twitter was running on GAE?
  • The platform eventually evolves into something that makes sense for start-ups, SMBs and/or enterprises to use. Google works out the kinks with less demanding users first.

Interesting times

Two closing thoughts, which I’ll leave undeveloped for now:

  • There is an especially good synergy between mobile apps and PaaS. Once you get past the restaurant tip calculators, many mobile apps need a server-hosted sidekick to do the heavy lifting of gathering/storing/transforming data. As a hobbyist, you want to spend most of your time making you mobile app cool. Which leaves even less time for administrating server components. On the server side, you are even less likely to want to deal with anything but application logic. PaaS is especially attractive in these scenarios. Google and Microsoft have to navigate these waters carefully but look for some synergy/integration stories around GAE + Android and Azure + Windows Mobile respectively. Not clear what Apple’s story is here or if they think they need one. If it surfaces as an issue then we have yet another reason to restart the “Apple buys Adobe” rumor. Or maybe Sanjiva will get a middle-of-the-night call from Steve Jobs…
  • A platform to build/run your application is one thing. A way to reach users is another (arguably much more critical). Things like mobile app stores (especially Apple’s of course), Facebook and next generation app stores. But this goes  beyond the scope of this post.

Just to be clear, I am not in any way suggesting that PaaS is only for hobbyists. I am just saying that right now it is a great tool for them, the best way for an individual programmer to have fun and have an impact. This doesn’t take away from the value that PaaS will eventually deliver to larger organizations.

[UPDATED 2009/10/4: Microsoft Azure apparently supports PHP.]

5 Comments

Filed under Cloud Computing, Everything, Google, Google App Engine, Implementation, Mashup, Microsoft, PaaS, Utility computing, WSO2

The future (2006 version), has arrived

Remember 2006? Things were starting to fall into place for IT management integration and automation:

  • SDD was already on its way to cleanly describe/package/manage the lifecycle of simple and composite applications alike,
  • the first version of SML came out to capture all the relevant constraints of complex and composite systems and open the door to “desired-state management”,
  • the CMDBf effort was started to seamlessly integrate all sources of configuration and provide a bird-eye view of your entire IT infrastructure, and
  • the WSDM/WS-Management convergence/reconciliation was announced and promised to free management consoles from supporting many resource discovery, collection and control mechanisms and from having platform/library dependencies between the manager and its targets.

It looked like we were a year or two from standardization on all these and another year or two from shipping implementations. Things were looking good.

Good news: the schedule was respected. SDD, SML and CMDBf are now all standards (at OASIS, W3C and DMTF respectively). And today the Eclipse COSMOS project announced the release of COSMOS 1.1 which implements them all. The WSDM/WS-Management convergence is the only one that didn’t quite go according to the plan but it is about to come out as a standard too (in a pared-down form).

Bad news: nobody cares. We’ve moved on to “private clouds”.

Having been involved with these specifications in various degrees (a little bit on SDD, a fair amount on SML and a lot on CMDBf and WSDM/WS-Management) I am not as detached as my sarcastic tone may suggest. But as they say in action movies, “don’t let sentiments get in the way of the mission”.

There is still a chance to reuse parts of this stack (e.g. the CMDBf query language) and there are lessons to learn from our errors. The over-promising, the technical misjudgments, the political bickering, the lack of concrete customer validation, etc. To some extent this work was also victim of collateral damages from the excesses of WS-* (I am looking at you WS-Addressing). We also failed to notice the rise of the hypervisor in our peripheral vision.

I tried to capture some important lessons in this post-mortem. For the edification of the cloud generation. I also see a pendulum in action. Where we over-engineered I now see some under-engineering (overly granular interaction models, overemphasis on the virtual machine as the unit of everything, simplistic constraint models, underestimation of config/patching issues…). Things will come around and may eventually look familiar (suggested exercise: compare PubSubHubBub with WS-Notification).

As long as each iteration gets us closer to the goal things are good.

See you in 2012. Same place, same day, same time.

3 Comments

Filed under Application Mgmt, Automation, Cloud Computing, CMDB, CMDB Federation, CMDBf, Desired State, Everything, IT Systems Mgmt, Manageability, Mgmt integration, Modeling, Protocols, SML, Specs, Standards, Utility computing, WS-Management

On Twitter

I created the @vambenepe Twitter account a while ago to reserve the username. Yesterday I posted three tweets, so I guess I am now “on Twitter”, in case anybody cares. We’ll see where this goes. @jamesurquhart gave me a kind (but intimidating) welcome and @Beaker hasn’t called me a “jackass” yet, so things are looking good. BTW, is it just me or has Cisco assembled a top-notch good cop / bad cop team? I hope I manage my blog-to-twitter expansion as well as they did.

The Cloud stuff is where the fun is, but if this Twitter thing is going to be of any use for real work I need to find who to follow in the IT management, application management and systems modeling areas. Any suggestion beyond @cote, @MouthOfOpenNMS, @dmcclure, @puppetmasterd and @theitskeptic (I feel like I am just Twitterifying my blogroll)?

And even then, finding people to follow seems to be the easy part. It took me about 20 minutes last night to realize that I am not going to read all the tweets (and I currently only follow 18 people). Worst case I’ll just track the direct mentions of my handle and some occasional hastags during interesting announcements. And scan the rest once a week. I assume that’s what the Twitter natives like @cote do as well (I seeded my list by picking names I recognized from his 1,130-long follow list). Advice?

The other issue is the 140 characters limit of course, but this should be easier to get used to. In the Apple/Palm tweet last night (about how this might show us what enforcement options standard bodies have) I wanted to invoke Stalin’s dismissive “The Pope! How many divisions has he got?” quote by replacing the pope with the USB Implementers Forum (USB-IF). But no room left unless I sacrificed the image of a working group chair breaking the knee cap of an offending implementer (which, as an ex-WG chair myself, I see some upside to).

Is it bad form to post multi-part tweets? How about, say, 50 parts? I need a protocol to guarantee delivery and order on top of the Twitter API. Maybe REST-* can help me… ;-)

I also wanted to ping Andy Updegrove with the hope that he’d comment on the USB-IF letter (he has looked at the iPhone before, but not this specific issue) for an authoritative opinion. But he doesn’t seem to be on Twitter. The nerve!

And then there is the “follower” thing, which I guess I am now supposed to start obsessing about (folks, if I don’t have a hundred followers by week end the kitten gets it).

In the real world, there are a few people who return my emails and occasionally agree to have lunch with me, but that’s a far cry from calling them “followers”. Even my wife would spit her coffee if I referred to her as my “follower”. But on Twitter, I just posted three tweets yesterday and I already feel like a religious guru with my 24 “followers”.

Jokes aside (on the cult-leader overtones of the word “follower”), the fact that these people are identified is a nice improvement over blog subscribers (who, to me, are just an occasional number within the user-agent field in my Apache httpd logs), at least until they comment/email. Nice to “see” you.

One more step in the slippery slope towards total egomania. Blog > Twitter > Live webcam of the inside of my stomach.

Comments Off on On Twitter

Filed under Everything, Media, Off-topic, Social networks, Twitter

Thoughts on the “Simple Cloud API”

PHP developers with Cloud aspirations rejoice! Zend has announced a PHP toolkit (called the Simple Cloud API project) to abstract and access application-level Cloud services. This is not just YACA (yet another Cloud API), as there are interesting differences between this and all the other Cloud toolkits out there.

First it’s PHP, which was not covered by the existing toolkits. Considering how many web applications are written in PHP (including the one that serves this very blog) this may seem strange, until you realize that most Cloud toolkits out there are focused on provisioning/managing low-level compute resources of the IaaS kind. Something that is far out of PHP’s sweetspot and much more practically handled with Java, Python, Ruby or some .NET language accessible via PowerShell.

Which takes us to the second, and arguably most interesting, characteristic of this toolkit: it is focused on application-level Cloud services (files, documents and queues for now) rather than infrastructure-level. In other word, it’s the first (to my knowledge) PaaS toolkit.

I also notice that Zend has gotten endorsements from IBM, Microsoft, Nirvanix, Rackspace and GoGrid. The first two especially seem to have impressed InfoWorld. Let’s keep in mind that at this point all we are talking about are canned quotes in a press release. Which rank only above politician campaign promises as predictor of behavior. In any case that can’t be the full extent of IBM and Microsoft’s response to the VMWare/Cisco push on IaaS standards. But it may suggest that their response will move the battlefield to include PaaS, which would be a smart move.

Now for a few more acerbic comments:

  • It has “simple” in its name, like SOAP (as Pete Lacey famously lampooned). In the long term this tends to negatively correlate with simplicity, just like the presence of “democratic” in the official name of a country does not bode well for actual democracy.
  • Please, don’t shorten “Simple Cloud API” to SCA which is already claimed in a (potentially) closely related field.
  • Reuven Cohen is technically correct to see it as “a way to create other higher level programmatic API interfaces such as REST or SOAP using an easy, yet portable PHP programming environment”. But pay attention to how many turtles are on this pile: the native provider API, the adapter to the “simple cloud API”, the SOAP or REST remote API and the consuming application’s native API. How much real isolation are you getting when you build your house on such a wobbly foundation

[UPDATE: Comments from someone in the know:  a programmer working on adding Azure support for this Simple Cloud API project.]

2 Comments

Filed under Application Mgmt, Cloud Computing, Everything, IBM, Manageability, Mgmt integration, Microsoft, Middleware, Open source, Portability, Utility computing

Look Ma, no hypervisor!

Encouraged by hypervisor vendors, the confusion between virtualization and Cloud Computing is rampant. In the industry, the term “virtualization” (and its corollary, “virtual machine”) is used in so many different ways that it has lost all usefulness. For a recent example, read the introduction of this SNIA/OGF white paper (on Cloud Storage) which asserts that “the new technology underlying this is the system virtual machine that allows multiple instances of an operating system and associated applications to run on single physical machine. Delivering this over the network, on demand, is termed Infrastructure as a Service (IaaS)”.

In fact, even IaaS-type Cloud services don’t imply the use of hypervisors.

We need to decouple the Cloud interface/contract (e.g. “what are the types of resources that can I provision on demand? hosts, app servers, storage capacity, app services…”) from the underlying implementation (e.g. “are hypervisors used by the Cloud provider?”). At the risk of spelling out things that may be obvious to many readers of this blog, here is a simplified matrix of Cloud Computing systems, designed to illustrate that all combinations of interface and implementation are possible and in many cases even reasonable.

IaaS interface PaaS interface
Hypervisor used Yes! (see #1) Yes! (see #2)
Hypervisor not used Yes! (see #3) Yes! (see #4)

#1: IaaS interface, hypervisor-based implementation

This is a very common approach these days, both in public Clouds (EC2, Rackspace and presumably at some point the VMWare vCloud Express service providers) and private Clouds (Citrix, Sun, Oracle, Eucalyptus, VMWare…). Basically, you take a bunch of servers, put hypervisors on all of them and make VMs running on these hypervisors available to the Cloud customers.

But despite its predominance, this is not the only path to a Cloud, not even to an IaaS (e.g. “x86 hosts on demand”) Cloud. The following three other scenarios are all valid too.

#2: PaaS interface, hypervisor-based implementation

This is the road SpringSource has been on, first with Cloud Foundry (using AWS EC2 which is based on the Xen hypervisor) and presumably soon on top of VMWare.

#3: IaaS interface, no hypervisor in the implementation

Let’s remember that the utility computing vision (before the term fell in desuetude in favor of “cloud”) has been around before x86 hypervisors were so common. Take Loudcloud as an illustration. They were building what is now called a “public Cloud” starting back in 1999 and not using any hypervisor. Just bare metal provisioning and advanced provisioning automation software. Then they sold the hosting part to EDS (now HP) and only kept the software, under the name Opsware (now HP too, incidentally). That software was meant to create what we now call a “private Cloud”. See this old DCML announcement as one example of the Opsware vision. And no hypervisor was harmed in the making of this movie.

At the current point in time, the hardware (e.g. multiple cores, shared memory) and software (hypervisors, legacy apps) environment is such that hypervisor-based solutions seem to have an edge over those based on automated provisioning/configuration alone. But these things tend to change quickly in our industry… Especially if you factor in non-technical considerations like compliance, fear of data leakage and the risk of having the hardware underlying your application seized because of an investigation involving another tenant…

And this is not going into finner techno-philosophical points about the different types of hypervisors. Not to mention mainframe LPARs… One could build a hypervisor-free IaaS solution on these.

To some extent, you may even put the “pwned” machines (in a botnet) in this “IaaS with no hypervisor” category (with the small difference that what’s being made available is an x86 with an OS, typically Windows, already installed). If you factor out externalities (like the FBI breaking down your front door at 6:00AM) this approach has claims as the most cost-effective form of Cloud computing available today… Solaris zones are another example of possible foundation for a hypervisor-free IaaS-like offering (here too, with an OS rather than a “raw host” as the interface).

#4: PaaS interface, no hypervisor in the implementation

In the public sphere, this corresponds to Google App Engine.

In the private sphere, several companies have built it themselves on top of WebLogic, by adding some level of “on-demand” application provisioning in order to streamline the relationship between the IT group running the servers and the business groups who want to deploy applications on them. Something that one should ideally be able to buy rather than build.

Waiting for the question to become irrelevant

Like most deeply-ingrained confusions, the conflation of virtualization and Cloud Computing won’t be dispelled as much as made irrelevant. The four categories enumerated in this post are a point-in-time view of a continuously evolving system. What may start today as a bundle of a hypervisor, an OS and an app server may become a somewhat monolithic “PaaS engine” over time as the components are more tightly integrated. That “engine” may have memory isolation mechanisms that look a lot like a hypervisor. But it may not be able to host a generic OS. In the same way that whales don’t have fingers and toes and yet they are still very much apparent in their skeleton.

[UPDATED 2009/10/8: A real-life example of #3! On-demand servers via bare metal provisioning (via Sam). No hypervisor in the picture. See also here.]

[UPDATED 2009/12/29: Another non-hypervisor Cloud provider! NewServers. Here is their API. And a Q&A.]

3 Comments

Filed under Application Mgmt, Cloud Computing, Everything, Google App Engine, Implementation, IT Systems Mgmt, Middleware, Utility computing, Virtualization, VMware, XenSource

REST-*: good specs, bad branding?

In an earlier post, I argued for standardization of some basic REST-inspired mechanisms for the narrow goal of supporting control interfaces for different forms of Cloud Computing. As I was doing so, I noticed the first report of something called REST-*, introduced by RedHat’s Mark Little and I ended my post by wondering whether we were talking about the same thing or not.

Now that more information has emerged it seems pretty clear that we are not.

Mark Little understands transactions very well. No argument. He is not happy with some aspects of how they are supported over SOAP. Fine. He thinks it can be done better (at least for 80% of the cases and with lower barriers to entry) directly on top of HTTP (no envelope). Fine. He would like this to be standardized so that middleware stacks can interoperate. Fine. Same applies for pub/sub and p2p messaging, the other initial project out of the REST-* effort. All good.

Where it all goes wrong is the attempt to get on the REST bandwagon. REST is not the only proper way to write distributed applications. It’s a good way to do it for a specific (through arguably very large) set of distributed applications. One that may not include financial trading or RFID-enabled inventory tracking. More specifically, REST might not be the appropriate approach for all parts of all distributed applications. Working on smoothly connecting the REST and non-REST parts is interesting. Working on forcing the non-REST parts under the REST mantle less so.

By REST here I mean REST-the-architectural-style (narrowly defined), not REST-the-brand (much more broadly defined). Even if your work does not fall under the umbrella of REST-the-architectural-style, you may choose to position it under REST-the-brand as a pragmatic calculation (like a police department might pragmatically include a plasma TV in the “terrorism preparation” accounting category). In the “pros” category, positioning it as REST gives you instantaneous press coverage. In the “cons” category, it gives you instantaneous twitter coverage (of the kind that Steve Jones reports). All in all, it seems like a bad bargain to me if you want to get things done. But Bill Burke (who works with Mark on this) has chosen to accept it: “I really don’t care in the end if any of the architectural principles of Roy’s thesis are broken as long these requirements are met”. As a side note, the REST-* announcement puts this comment by Bill on Roy’s blog in context…

In any case, the way the proposed umbrella organization is shaping up is also giving me concerns. Less about some nefarious intent than about a certain tone-deafness regarding how it comes across. I am not talking about details such as the REST-* moniker, the fact that http://rest-star.org is just a facade that redirects to http://www.jboss.org/reststar or the fact that their blog feed uses RSS rather than Atom (way to get the REST crowd on your side). Rather I am thinking of statements like “Red Hat, as the founder of REST-*, gets a permanent seat on the board. All other board members must be elected by the overall membership once a year”. Which suggests (probably incorrectly) more arrogance than even Microsoft and IBM combined were able to muster when setting up WS-I (modulo the Sun snub). Speaking of Sun, if the JCP (and Sun’s position in it) is the model that RedHat has in mind it might be helpful to point out to them that Sun invented the language after all…

All in all, the specifications Mark and team have in mind may make perfect sense, but they way they are going about it leaves me highly skeptical.

[UPDATE 2009/9/17: More REST-* skepticism. But it looks like Mark and Bill are taking it in stride, acknowledging a less-than-optimal execution and trying to fix things. I doubt this specific initiative can be salvaged, but I think a lot of the goals are good and need to be realized.  Though my intuition is that it is more likely to get done in a piecemeal fashion, distributed between specialized communities (e.g. the Cloud people, the messaging/AMQP people…) who take on, in a very practical way, the portions most relevant to their needs. Whether all the pieces then get pulled together in one place with a nice bow is not important right now.]

[UPDATED 2009/9/18: Changes!]

1 Comment

Filed under Everything, JBoss, Middleware, Protocols, REST, Specs, Standards

Monitoringforge.org vs. the ghost of openmanagement.org

There is a new site to promote open source “IT monitoring and network management” solutions. It appears to be an initiative from GroundWork. There are many very interesting open source projects in this area and anything that helps lower the cost of finding and filtering them is good. But when the GroundWork VP of marketing says, in an interview, that “this is the first attempt to really create a neighborhood for all these projects to be represented in that is truly neutral” it brings back to mind the Open Management Consortium, announced in May 2006 and apparently defunct (the WayBack Machine has not been able to snapshot the site since February 2008). Back in the days it billed itself as a way “to help advance the promotion, adoption, development and integration of open source systems /network management software”, which sounds pretty similar.

What happened to it and what will keep monitoringforge.org from meeting a similar fate?

2 Comments

Filed under Everything, IT Systems Mgmt, Open source

Cloud Data Management Interface (CDMI) draft released

Have you developed “Cloud API fatigue” from seeing too many IaaS “Cloud APIs” lately? Are you starting to wonder how many different ways there can possibly be to launch a virtual machine via an HTTP POST? Are you wondering why everybody else seems to equate Cloud computing with on-demand server instances?

If yes, then CDMI will come as a breath of fresh air. This specification (just a draft at this point) is a rare example of a different beast. Coming out of SNIA, it endeavors to standardize the way storage resources are managed and accessed in a Cloud environment. They call this DaaS (Data storage as a Service).

The specification has two components (which may benefit from being separated in two specifications at some point). One (called “control paths”) is an interface to manage a data storage service. That interface is expected to work across many forms of data storage from block storage (like AWS EBS) to filesystems (e.g. NFS) to object stores with a CRUD interface (similar to the WebDAV volumes of the Sun API). It also mentions a “simple table space storage” storage form, but that part is pretty fuzzy.

The second component of CDMI (called “data paths”) only applies to the CRUD object store and it describes a RESTful interface for accessing it. This figure from the specification does a good job of illustrating the two different APIs in the specification (and the different types of storage envisioned).

One of the most interesting sections in the document describes the way in which the authors envision the ability to export the storage resources provisioned/managed through CDMI to other Cloud APIs. They illustrate it in an example involving OCCI (see also this joint white paper). This is very interesting and another sign that we need a shared RESTful resource control framework for Cloud computing as a first layer of standardization. One of the reasons I used to justify this claim two weeks ago was that “there will not be one API that provides control of [all the different forms of Cloud Computing], but they can share a base protocol that will make life a lot easier for developers. These Clouds won’t be isolated, developers will use them as a continuum.” One week later, this draft specification illustrates the point very well.

[As a somewhat related side note, this interesting post about what it takes to provide a large-scale resilient data service (the Google App Engine data store). And more about the Google File System in general.]

1 Comment

Filed under Cloud Computing, Everything, Protocols, REST, Specs, Standards, Utility computing, Virtualization

Toolkits to wrap and bridge Cloud management protocols

Cloud development toolkits like Libcloud (for Python) and jcloud (for Java) have been around for some time, but over the last two months they have been joined by several other open source contenders. They all claim to abstract the on-the-wire Cloud management protocols sufficiently to let you access different Clouds via the same code; while at the same time providing objects in your programming language of choice and saving you the trouble of dealing with on-the-wire messages. By focusing on interoperability, they slot themselves below the larger role of a “Cloud broker” (which also deals with tasks like transfer and choice). Here is the list, starting with the more recent contenders:

DeltaCloud shares the same goal of translating between different Cloud management protocols but they present their own interface as yet another Cloud REST API/protocol rather than a language-specific toolkit. More along the lines of what UCI is trying to do (not sure what’s up with that project, I recorded my skepticism earlier and am still waiting to be pleasantly surprised).

Of course there are also programming toolkits that are specific to one Cloud provider. They are language-specific wrappers around one Cloud management protocol. AWS protocols (EC2, S3, etc…) represent the most common case, for example amazon-ec2 (a Ruby Gem), Power-EC2Dream (in C# which gives it the tantalizing advantage of being invokable via PowerShell) and typica (for Java). For Clouds beyond AWS, check out the various RightScale Ruby Gems.

The main point of this entry was to list the cross-Cloud development toolkits in the bullet list above. But if you’re in the mood for some pontification you can keep reading.

For some reason, what used to be called “protocols” is often called “APIs” in Cloud settings. Witness the Sun Cloud “API” or the vCloud “API” which only define XML formats for on-the-wire messages. I have never heard of CIM/XML over HTTP, WSDM or WS-Management being referred as APIs though they occupy a very similar place. They are usually considered “protocols”.

It’s a just question of definition whether an on-the-wire protocol (rather than a language-specific set of objects/methods) qualifies as an “Application Programming Interface”. It’s not an “interface” in the Java sense of the term. But I can “program” against it so it could go either way. On this blog I have gone along with the “API” term because that seemed widely used, though in verbal conversations I have tended to stick to “protocol”. One problem with “API” is that it pushes you towards mixing the “what” and the “how” and not respecting the protocol/model dichotomy.

Where is becomes relevant is when you start to see language-specific APIs for Cloud control pop-up as listed above. You now have two classes of things called “API” and it gets a bit confusing. Is it time to bring back the “protocol” term for on-the-wire definitions?

As a developer, whether you’re better off eating your Cloud noodles using chopsticks (on-the-wire protocol definitions) or a fork (language-specific APIs) is an important decision that will stay with you and may come back to bit you (e.g. when the interfaces are versioned). There is a place for both of course, but if we are to learn anything from WS-* it’s that we went way too far in the “give me a java stub” direction. Which doesn’t mean there is no room for them, but be careful how far from the wire semantics you get. It become even trickier when your stub tries not jsut to bridge between XML and Java but also to smooth out the differences between several on-the-wire protocols, as the toolkits above do. The hope, of course, is that there will eventually be enough standardization of on-the-wire protocols to make this a moot point.

2 Comments

Filed under Amazon, API, Automation, Cloud Computing, Everything, Google App Engine, Implementation, IT Systems Mgmt, Manageability, Mgmt integration, Open source, Protocols, Utility computing

Separating model from protocol in Cloud APIs

What happened to the separation between the model and the protocol in management APIs? For all the arguments we had in the design of WSDM and WS-Management, this was one fundamental concept that took little discussion before everyone agreed: that the protocol (the interaction model and the on-the-wire shape of the messages used) should be defined in a way that is agnostic to the type of resource being managed (computers, elevators or toasters — the perennial silly example). To this end, WSDM took pains to release MUWS (Management Using Web Services) and MOWS (Management Of Web Services) as two different specifications.

Contrast that to the different Cloud APIs (there is a new one released every other day). If they have one thing in common it is that they happily ignore this principle and tackle protocol concerns alongside the resource model. Here are my guesses as to why that is:

1) It’s a land grad

The goal is not to produce the best long-term API, it’s to be out early, to stake your claim and to gain leverage, so that you can steer the final standard close to your implementation. Editorial niceties like properly factoring the specification are not major concerns, there will be plenty of time for this during the standardization process. In fact, leaving such improvements for the standardization phase is a nice way to make it look like the group is not just rubberstamping, while not changing much that actually impacts your implementation. The good old “give them something insignificant to argue about” trick. It works BTW.

As an example of how rushed some of these submissions can be, did you notice that what VMWare submitted to DMTF this week is the vCloud API Specification v0.8 (a 7-page document that is simply a list of operations), not the accompanying vCloud API programming guide v0.8 which is ten times longer and is the real specification, the place where the operation semantics, payload formats and protocol considerations are actually described and without which the previous document cannot possibly be implemented. Presumably the VMWare team was pressed to release on time for a VMWorld announcement and they came up with this to be able to submit without finishing all the needed editorial work. I assume this will follow soon and in the meantime the DMTF members will retrieve the programming guide from the VMWare site in order to make sense of what was submitted to them.

This kind of rush is not rare in the history of specification submission, even those that have been in the work for a long time . For example, the initial CBE submission by IBM had “IBM Confidential” all over the specification and a mention that one should retrieve the most up to date version from the “Autonomic Computing Problem Determination Offering Team Notes Database” (presumably non-IBMers were supposed to break into the server).

If lack of time is the main reason why all these APIs do not factor out the protocol aspects then I have no problem, there is plenty of time to address it. But I suspect that there may be other reasons, that some may see it as a feature rather than a bug. For example:

2) Anything but WS-*

SOAP-based interfaces (WS-* or WS-DeathStar) have a bad rap and doing anything in the opposite way is a crowd pleaser (well, in the blogosphere at least). Modularity and composition of specifications is a major driving force behind the WS-* work, therefore it is bad and we should make all specifications of the new REST order stand-alone.

3) Keep it simple

A more benevolent way to put it is the concern to keep things simple. If you factor specifications out you put on the developer the burden of assembling the complete documentation, plus you introduce versioning issues between the parts. One API document that fully describes the contract is simpler.

4) We don’t need no stinking’ protocol, we have HTTP

Isn’t this the protocol? Through the magic of REST, all that’s needed is a resource model, right? But if you look in the specifications you see sections about authentication, fault handling, long-lived operations, enumeration of long result sets, etc… Things that have nothing to do with the resource model.

So what?

Why is this confluence of model and protocol in one specification bad? If nothing else, the “keep it simple” argument (#3) above has plenty of merits, doesn’t it? Aren’t WSDM and WS-Management just over-engineered?

They may be, but not because they offer this separation. Consider the following practical benefits of separating the protocol from the model:

1) We can at least agree on one part

Thanks to the “REST is the new black” attitude in Cloud circles, there are lots of commonalities between these various Cloud APIs. Especially the more recent ones, those that I think of as “second generation” APIs: vCloud, Sun API, GoGrid and OCCI (Amazon EC2 is the main “1st generation” Cloud API, back when people weren’t too self-conscious about not just using HTTP but really “doing REST”). As an example of convergence between second generation specifications, see for example, how vCloud and the Sun API both use “202 Accepted” and a dedicated “status” resource to handle long-lived operations. More comparisons here.

Where they differ on such protocol matters, it wouldn’t be hard to modify one’s implementation to use an alternative approach. Things become a lot more sensitive when you touch the resource model, which reflects the actual capabilities of the Cloud management infrastructure. How much flexibility in the network setup? What kind of application provisioning? What affinity/anti-affinity control level? Can I get block-level storage? Etc. Having to implement the other guy’s interface in these matters is not just a matter of glue code, it’s a major product feature. As a result, the resource model is a much more strategic control point than the protocol. Would you rather dictate the terms of a contract or the color of the ink in which it is printed?

That being the case, I suspect that there could be relatively quick and painless agreement on that first layer of the Cloud API: a set of protocol considerations, based on HTTP and REST, that provide a resource control framework with support for security, events, long-running operations, faults, many-as-one semantics, enumeration, etc. Or rather, that if there is to be a “quick and painless” agreement on anything related to Cloud computing standards it can only be on something that is limited to protocol concerns. It doesn’t have to be long and complex. It doesn’t have to be factored in 8 different specifications like WS-* did. It can be just one specification. Keep it simple, ignore all use cases that aren’t related to Cloud Computing. In the end, please call it MUR (Management Using REST)… ;-)

2) Many Clouds, one protocol to rule them all

Whichever Cloud taxonomy strikes your fancy (I am so disappointed that SADIST-PIMP hasn’t caught on), it’s pretty clear that there will not be one kind of Cloud. There will be at least some IaaS, some PaaS and plenty of SaaS. There will not be one API that provides control of them all, but they can share a base protocol that will make life a lot easier for developers. These Clouds won’t be isolated, developers will use them as a continuum.

3) Not just one access model

As much as it makes sense to start from simple and mostly synchronous operations, there will be many different interaction models for Cloud Computing. In addition to the base operations, we may get more of a desired-state/blueprint interaction pattern, based on the same resource model. Or, somewhere in-between, some kind of stored execution flow where modules are passed around rather than individual operations. Also, as the level of automation increases you may want a base framework that is more event-friendly for rapid close-loop management. And there are other considerations involved (like resource monitoring, policies…) not currently covered by these specifications but that can surely reuse the protocol aspects. By factoring out the resource model, you make it possible for these other interaction patterns to emerge in a compatible way.

The current Cloud APIs are not far away from this clean factoring. It would be an easy task to extract protocol considerations as a separate document, in large part due to the fact that REST prevents you from burying the resource model inside convoluted operation semantics. To some extent it’s just a partitioning issue, but the same can be said of many intractable and bloody armed conflicts around the world… Good fences make good neighbors in the world of IT specs too.

[UPDATE: Soon after this entry went to “press” (meaning soon after I pressed the publish button), I noticed this report of a “REST-*” proposal by Mark Little of RedHat/JBOSS. I will reserve judgment until Mark has blogged about it or I have seen some other authoritative description. We may be talking about the same thing here. Or maybe not. The REST-* name surprises me a bit as I would expect opponents of such a proposal to name it just this way. We’ll see.]

[UPDATE 2009/9/6: Apparently I am something like the 26th person to think of the “one protocol/API to rule them all” sentence. We geeks have such a shallow set of shared cultural references it’s scary at times.]

[UPDATED 2009/11/12: Lori MacVittie has a very nice follow-up on this, with examples and interesting analogies. Check it out.]

8 Comments

Filed under API, Automation, Cloud Computing, Everything, IT Systems Mgmt, Manageability, Mgmt integration, Modeling, Protocols, REST, Specs, Standards, Utility computing

VMWare publishes (and submits) vCloud API

VMWare published its vCloud API yesterday (it was previously only available to a few partners) and submitted it to the DMTF, as had been previously announced. So much for my speculations involving IBM.

It may be time to update the Cloud API comparison. After a very quick first pass, vCloud looks quite similar to the Sun Cloud API (that’s a compliment). For example, they both handle long-lived operations via a “202 Accepted” complemented by a resource that represents the progress (“status” for Sun, “task” for vCloud). A very visible (but not critical) difference is the use of JSON (Sun) versus XML (vCloud).

As expected, OVF/OVA is central to vCloud. More once I have read the whole specification.

In any case, things are going to get interesting in the DMTF Cloud incubator. I there a path to adoption?Assuming that Amazon keeps sitting it out, what will the other Cloud vendors with an API (Rackspace, GoGrid, Sun…) do? I doubt they ever had plans/aspirations to own or even drive the standard, but how much are they willing to let VMWare do it? How much does Citrix/Xen want to steer standards versus simply implement them in the context of the Xen Cloud project? What about OGF/OCCI with which the DMTF is supposedly collaborating?How much support is VMWare going to receive from its service provider partners? How much traction does VMWare have with Cisco, HP (server division) and IBM on this? What are the plans at Oracle and Microsoft? Speaking of Microsoft, maybe it will at some point want its standard strategy playbook back. At least when VMWare is done using it.

5 Comments

Filed under API, Application Mgmt, Automation, Cloud Computing, DMTF, Everything, IT Systems Mgmt, Mgmt integration, Protocols, REST, Specs, Standards, Utility computing, Virtualization, VMware

Are these your files? I found them on my cloud

Drip drip drip… Is this the sound of your cloud leaking?

It can happen in different ways. See for example this recent research paper, titled “Hey, You, Get Off of My Cloud: Exploring Information Leakage in Third-Party Compute Clouds”. It’s a nice read, especially if you find side channels interesting (I came up with one recently, in a different context).

In the first part of the paper, the authors show how to get your EC2 instance co-located (i.e. running in in the same hypervisor) with the instance you are targeting (the one you want to spy on). Once this is achieved, they describe side channel attacks to glean information from this situation.

This paper got me thinking. I noticed that it does not mention trying to go after disk blocks and memory. I don’t know if they didn’t try or they tried and were defeated.

For disk blocks (the most obvious attack vector), Amazon is no dummy and their “proprietary  disk  virtualization  layer  automatically  wipes every block of storage used by  the customer, and guarantees  that one customer’s data  is never exposed to another” as explained in the AWS Security Whitepaper. In fact, they are so confident of this that they don’t even bother forbidding block-based recovery attempts in the AWS customer agreement (they seem mostly concerned about attacks that are not specific to hypervisor environments, like port scanning or network-based DOS). I took this as an invitation to verify their claims, so I launched a few Linux/ext3 and Windows/NTFS instances, attached a couple of EBS volumes to them and ran off-the-shelf file recovery tools. Sure enough, nothing was found on  /dev/sda2 (the empty 150GB partition of local storage that comes with each instance) or on the EBS volumes. They are not bluffing.

On the other hand, there were plenty of recoverable files on /dev/sda1. Here is what a Foremost scan returned on two instances (both of them created from public Fedora AMIs).

The first one:

Finish: Tue Sep  1 05:04:52 2009

5640 FILES EXTRACTED

jpg:= 14
gif:= 670
htm:= 1183
exe:= 2
png:= 3771
------------------------------------------------------------------

And the second one:

Finish: Wed Sep  2 00:32:16 2009

17236 FILES EXTRACTED

jpg:= 236
gif:= 2313
rif:= 11
htm:= 4886
zip:= 182
exe:= 6
png:= 9594
pdf:= 8
------------------------------------------------------------------

These are blocks in the AMI itself, not blocks that were left on the volumes on which the AMI was installed. In other words, all instances built from the same AMI will provide the exact same recoverable files. The C: drive of the Windows instance also had some recoverable files. Not surprisingly they were Windows setup files.

I don’t see this as an AWS flaw. They do a great job providing cleanly wiped raw volumes and it’s the responsibility of the AMI creator not to snapshot recoverable blocks. I am just not sure that everyone out there who makes AMIs available is aware of this. My simple Foremost scans above only looked for the default file types known out of the box by Foremost. I suspect that if I added support for .pem files (used by AWS to store private keys) there may well be a few such files recoverable in some of the publicly accessible AMIs…

Again, kudos to Amazon, but I also wonder if this feature opens a possible DOS approach on AWS: it doesn’t cost me much to create a 1TB EBS volume and to destroy it seconds later. But for Amazon, that’s a lot of blocks to wipe. I wonder how many such instantaneous create/delete actions on large EBS volumes it would take to put a large chunk of AWS storage capacity in the “unavailable – pending wipe” state… That’s assuming that they proactively wipe all the physical blocks. If instead the wipe is virtual (their virtualization layer returns zero as the value for any free block, no matter what the physical value of the block) then this attack wouldn’t work. Or maybe they keep track of the blocks that were written and only wipe these.

Then there is the RAM. The AWS security paper tells us that the physical RAM is kept separated between instances (presumably they don’t use ballooning or the more ambitious Xen Transcendent Memory). But they don’t say anything about what happens when a new instance gets hold of the RAM of a terminated instance.

Amazon probably makes sure the RAM is reset, as the disk blocks are. But what about your private Cloud infrastructure? While the prospect of such Cloud leakage is most terrifying in a public cloud scenario (anyone could make use of it to go after you), in practice I suspect that these attack vectors are currently a lot more exploitable in the various “private clouds” out there. And that for many of these private clouds you don’t need to resort to the exotic side channels described in the “get off of my cloud” paper. Amazon has been around the block (no pun intended) a few times, but not all the private cloud frameworks out there have.

One possible conclusion is that you want to make sure that your cloud vendor does more than writing scripts to orchestrate invocations of the hypervisor APIs. They need to understand the storage, computing and networking infrastructure in details. There is a messy physical world under your clean shinny virtual world. They need to know how to think about security at the system level.

Another one is that this is a mostly an issue for hypervisor-based utility computing and a possible trump card for higher level of virtualization, e.g. PaaS. The attacks described in the paper (as well as block-based file recovery) would not work on Google App Engine. What does co-residency mean in a world where subsequent requests to the same application could hit any machine (though in practice it’s unlikely to be so random)? You don’t get “deployed” to the same host as your intended victim. At best you happen to have a few requests executed while a few requests of your target run on the same physical machine. It’s a lot harder to exploit. More importantly, the attack surface is much more restrained. No direct memory access, no low-level scheduler data, no filesystem… The OS to hardware interface that hypervisors emulate was meant to let the OS control the hardware. The GAE interface/SDK, on the other hand, was meant to give the application just enough capabilities to perform its task, in a way that is as removed from the hardware as possible. Of course there is still an underlying physical reality in the GAE case and there are sure to be some leaks there too. But the small attack surface makes them a lot harder to exploit.

[UPDATED 2009/9/8: Amazon just improved the ability to smoothly update your access certificates. So hopefully any such certificate found on recoverable blocks in an AMI will be out of data and unusable.]

[UPDATED 2009/9/24: Some good security practices that help protect you against block analysis and many other forms of attack.]

[UPDATED 2009/10/15: At Oracle Open World this week, I was assured by an Amazon AWS employee that the DOS scenario I describe in this post would not be a problem for them. But no technical detail as to why that is. Also, you get billed a minimum of one hour for each EBS volume you provision, so that attack would not be as cheap as I thought (unless you use a stolen credit card).]

4 Comments

Filed under Amazon, Cloud Computing, Everything, Google App Engine, Security, Utility computing, Virtualization, Xen

Symptoms Autonomic Framework submission to OASIS: CBE meets ITIL?

IBM, Fujitsu and CA have recently proposed a charter for a new OASIS technical committee, called the Symptoms Autonomic Framework (SAF) TC. Including a specification candidate and other submitted documents, listed here.

For context, you need to remember the Common Base Event (CBE) specification that IBM has shopped around for a long time, initially hand in hand with Cisco. As always, the Cover Pages offer the best references on this saga. CBE was submitted to WSDM and came out (in a much-emaciated form) as the WSDM Event Format (WEF) in WSDM 1.1 part 2.

Because so many parts of CBE were left on the floor of the WSDM editing room and because WSDM itself saw little adoption, I have always been expecting IBM to bring CBE back in some form. When I heard of SAF, my instinct was that this was it.

Not so. SAF is meant to sit on top of an event system like CBE. It turns selected events/situations and other data points into symptoms and tells you what to do next. Its focus is on roles, process and knowledge bases. Not on the event format. The operations and payloads defined are not for exchanging events, they are for exchanging “symptoms”, “syndromes”, “prescriptions” and “protocols”.

As the terms show, the specification espouses the medical dialect (even “protocol” is meant to be understand in the medical sense, not as in “HTTP” or “FTP”). While I have been guilty of a similar analogy myself, I also think that if there is one area from which we don’t want to learn in terms of automation, system integration and proper use of IT in general, it’s the medical field. So let’s be careful not to push the analogy too far (section 8.1 of the SAF specification is a fun read, but not necessarily very compelling).

BTW, since when do we use terms strongly associated with one company in the name of standards group (“autonomic”)?

More fundamentally, the main question is what the chances of success of this effort are. Its a huge endeavor (“enabling interoperable diagnosis and treatment of complex systems”) and it tries to structure activities that have been going on for a long time and in many different ways. No-one will adopt this structure for its own sake, so the question is what practical benefits can be derived from this level of standardization. For example, how reliably can incoming events be mapped in practice to symptoms, how efficiently can symptoms be matched to protocols (in typical IBM fashion there seems to be a big  “XPath is my hammer” assumption lurking), etc…

The discussion on the charter is currently open in OASIS if you want to weigh in.

4 Comments

Filed under Automation, Everything, IBM, IT Systems Mgmt, ITIL, Mgmt integration, Specs, Standards

So long Oslo

This post is a farewell to Microsoft’s Oslo project. Not primarily because the name Oslo will be phased out at the next PDC as Doug Purdy reports. More importantly because Doug’s post puts Oslo so far on my back burner that I am unlikely to ever get back to it.

I have mostly been tracking Oslo (the modeling part of it, i.e. the “D” and then “M” stacks and associated tools) from the point of view of its applicability to IT management and Microsoft’s DSI effort. That made sense at first and I found some encouragements in this direction by (see the partial transcript of David Chappell’s video). But from the start there were also many signs of a very different vision for Oslo, as a software development tool with only a peripheral relationship to IT management.

As time went by it became clear that that vision was the real one and Doug’s announcement that Microsoft will “merge the Data Programmability team (EDM, EF, Astoria, XML, ADO.NET, and tools/designers) and the Oslo team (Quadrant, Repository, M) together” is the final confirmation.

M is still a very interesting piece of technology and I’d like to keep track of it, but it’s now too far removed from my focus area for me to justify paying much attention to it in the foreseeable future. Good luck to the team.

In the meantime, this brings back the question of what’s the current modeling story in the System Center team. SDM? SML? Some M-based replacement? Back to CIM? I am just asking because they had been a lot more public and proactive on this than other management vendors, but this has stopped abruptly.

[UPDATED 2009/8/28: Doug responded with enough of a teaser (“some things coming out of incubation in the management space”) to give me a justification to keep at least an eye on Oslo. At the very least I’ll keep monitoring his blog, waiting for this DSI link to emerge.]

3 Comments

Filed under Everything, Microsoft, Modeling, Oslo

Thoughts on VMWare, SpringSource and PaaS

I am late to the party for  commenting on the upstream and downstream acquisitions involving SpringSource. I was away on vacations, but Rod Johnson obviously didn’t have too many holiday plans of his own in August.

First came the acquisition by VMWare. Then the acquisition of Cloud Foundry and the launch of its SpringSource reincarnation.

You’ve all read a lot about this already, so I’ll limit myself to a few bullet-point comments.

  • I was wrong to think at the time of the Hyperic acquisition that SpringSource would focus on app-centric management, BTM and transaction tracing more than Cloud computing and automation.
  • This move by VMWare helps me make some of the points I have been trying to make internally about Cloud computing.
  • This is a step in the progress from “fake machines” to true “virtual machines” (note to self: I may have to stop referring to “fake machines” as “VMWare-style virtual machines”).
  • Savio Rodrigues makes some interesting points, especially on the difference between a framework and a runtime.
  • Many people have hypervisors, a management console and middleware bits. If you are industry-darlings VMWare/SpringSource people seem more willing to assume that you can put them all in a bag, shake it and out comes a PaaS platform than if you are boring old Oracle, Microsoft or IBM. Fine. But let’s see how the (very real) potential gets delivered. Kudos to Adrian Colyer for taking a shot at describing it in a reasonable way, though there is still a fair amount of hand-waving… and already a drift towards the “I don’t need no cluster, I have a hypervisor and everything is a VM” reflex.
  • The “what does it mean for RedHat” angle seems to miss the point to me and be a byproduct of over-focusing on the “open source” aspect which is not all that relevant. This is more about Oracle, Microsoft and IBM than RedHat in my view.
  • Won’t it be fun when Cisco, VMWare and BMC are all one company and little SpringSource calls the shots from within (I have seen this happen more than once during my days at HP Software)?
  • I have no opinion on the question of whether VMWare over-paid or not. I’ll tell you in two years… :-)

Comments Off on Thoughts on VMWare, SpringSource and PaaS

Filed under Application Mgmt, Cloud Computing, Everything, Middleware, Spring, Utility computing, Virtualization, VMware

REST in practice for IT and Cloud management (part 2: configuration management)

What benefits does REST provide for configuration management (in traditional data centers and in Clouds)?

Part 1 of the “REST in practice for IT and Cloud management” investigation looked at Cloud APIs from leading IaaS providers. It examined how RESTful they are and what concrete benefits derive from their RESTfulness. In part 2 we will now look at the configuration management domain. Even though it’s less trendy, it is just as useful, if not more, in understanding the practical value of REST for IT management. Plus, as long as Cloud deployments are mainly of the IaaS kind, you are still left with the problem of managing the configuration of everything that runs of top the virtual machines (OS, middleware, DB, applications…). Or, if you are a glass-half-full person, here is another way to look at it: the great thing about IaaS (and host virtualization in general) is that you can choose to keep your existing infrastructure, applications and management tools (including configuration management) largely unchanged.

At first blush, REST is ideally suited to configuration management.

The RESTful Cloud APIs have no problem retrieving resource descriptions, but they seem somewhat hesitant in the way they deal with resource-specific actions. Tim Bray described one of the challenges in his well-considered Slow REST post. And indeed, applying REST to these “do something that may take some time and not result exactly in what was requested” scenarios is a lot less straightforward than when you’re just doing document/data retrieval. In contrast you’d think that applying REST to the task of retrieving configuration data from a CMDB or other configuration store would be a no-brainer. Especially in the IT management world, where we already have explicit resource models and a rich set of relationships defined. Let’s give each resource a URI that responds to HTTP GET requests, let’s turn the associations into hyperlinks in the resource presentation, let’s mint a MIME type to represent this format and we are out of the office in time for a 4:00PM tennis game when all the courts are available (hopefully our tennis partners are as bright as us and can get out early too). This “work smarter not harder” approach would allow us to present this list of benefits in our weekly progress report:

-1- A URI-based scheme makes the protocol independent of the resource topology, unlike today’s data stores that usually struggle to represent relationships between stores.

-2- It is simpler to code against than CIM-over-HTTP or WS-Management. It is cross-platform, unlike WMI or JMX.

-3- It makes it trivial to browse the configuration data from a Web browser (the resources themselves could provide an HTML representation based on content-type negotiation, or a simple transformation could generate it for the Web browser).

-4- You get REST-induced caching and scalability.

In the shower after the tennis game, it becomes apparent that benefit #4 is largely irrelevant for IT management use cases. That the browser in #3 would not be all that useful beyond simple use cases. That #2 is good for karma but developers will demand a library that hides this benefit anyway. And that the boss is going to say that he doesn’t care about #1either because his product is “the single source of truth” so it needs to import from the other configuration store, not reference them.

Even if we ignore the boss (once again) it only  leaves #1 as a practical benefit. Surprise, that’s also the aspect that came out on top of the analysis in part 1 (see “the API doesn’t constrain the design of the URI space” highlight, reinforced by Mark’s excellent comment on the role of hypertext). Clearly, there is something useful for IT management in this “hypermedia” thing. This will largely be the topic of part 3.

There are also quite a few things that this RESTification of the configuration management store doesn’t solve:

-1- The ability to query: “show me all the WebLogic instances that run on a Windows host and don’t have patch xyz applied”. You don’t have much of a CMDB if you can’t answer this. For an analogy, remember (or imagine) a pre-1995 Web with no search engine, where you can only navigate by starting from your browser home page and clicking through static links step by step, or through bookmarks.

-2- The ability to retrieve the configuration change history and to compare configurations across resources (or to a reference configuration).

This is not to say that these two features cannot be built on top of a RESTful IT resource model. Just that they are the real meat of configuration management (rather than a simple resource-by-resource configuration browser) and that your brilliant re-architecture hasn’t really helped in addressing them. Does a RESTful foundation make these features harder to build? Not necessarily, but there are some tricky aspects to take care of:

-1- In hypermedia systems, the links are usually part of the resource representation, not resources of their own. In IT management, relationships/associations can have their own lifecycle and configuration properties.

-2- Be careful that you can really maintain the address of a resource. It’s one thing to make sure that a UUID gets maintained as a resource configuration changes, it’s another to ensure that a dereferenceable URI remains unchanged. For example, the admin server of a cluster may move over time from one node to another.

More fundamentally, the ability to deal with multiple resources at the same time and/or to use the model at different levels of granularity is often a challenge. Either you make your protocol more complex to account for this or your pollute your resource model (with a bunch of arbitrary “groups”, implicit or explicit).

We saw this in the Cloud APIs too. It typically goes something like this: you can address an individual server (called “foo”) by sending requests to http://Cloudprovider.com/server/foo. Drop the “foo” part of the URL and now you can address all the servers, for example to retrieve their configuration or possibly to reboot them. This gives me a way of dealing with multiple resources at time, but only along the lines pre-defined by the API. What if I want to deal only with the servers that host nodes of a given cluster. Sorry, not possible. What if the servers have different hosts in their URIs (remember, “the API doesn’t constrain the design of the URI space”)? Oops.

WS-Management, in the SOAP world, takes this one step further with Selectors, through which you can embed some kind of query, the result of which is what you are addressing in your message. Or, if all you want to do is GET, you can model you entire datacenter as one giant virtual XML doc (a document which is never assembled in practice) and use WSRF/WSDM’s “QueryExpression” or WS-Management’s “FragmentTransfer” to the same effect. BTW, I have issues with the details of how these mechanisms work (and I have described an alternative under the motto “if you are going to suffer with WS-Addressing, at least get some value out of it”).

These are all non-RESTful atrocities to a RESTafarian, but in my mind the Cloud REST API reviewed in part 1 have open Pandora’s box by allowing less-qualified URIs to address all instances of a class. I expect you’ll soon see more precise query parameters in these URIs and they’ll look a lot like WS-Management Selectors (e.g. http://Cloudprovider.com/server?OS=Linux&CPUType=X86). Want to take bets about when a Cloud API URI format with an embedded regex first arrives?

When you need this, my gut feeling is that you are better off not worrying too much about trying to look RESTful. There is no shame to using an RPC pattern in the right circumstances. Don’t be the stupid skier who ends up crashing in a tree because he is just too cool for the using snowplow position.

One of the most common reasons to deal with multiple resources together is to run queries such as the “show me all the WebLogic instances that run on a Windows host and don’t have patch xyz applied” example above. Such a query mechanism recently became a DMTF standard, it’s called CMDBf. It is SOAP-based and doesn’t attempt to have anything to do with REST. Not that it didn’t cross the mind of a bunch of people, lead by Michael Coté when CMDBf first emerged (read the comments too). But as James Governor rightly predicted in the first comment, Coté heard “dick” from us on this (I represented HP in CMDBf and ended up being an editor of the specification, focusing on the “query” part). I don’t remember reading the entry back then but I must have since I have been a long time Coté fan. I must have dismissed the idea so quickly that it didn’t even register with my memory. Well, it’s 2009 now, CMDBf v1 is a DMTF standard and guess what? I, and many other SOAP-the-world-till-it-shines alumni, are looking a lot more seriously into what’s in this REST thing (thus this series of posts for me). BTW in this piece Coté also correctly predicted that CMDBf would be “more about CMDB interoperation than federation” but that didn’t take as much foresight (it was pretty obvious to me from the start).

Frankly I am still not sure that there is much benefit from REST in what CMDBf does, which is mostly a query interface. Yes the CMDBf query and its response go over SOAP. Yes in this case SOAP is mostly a useless wrapper since none of the implementations will likely support any WS-* SOAP header (other than paying the WS-Addressing tax). Sure we could remove it and send plain XML over HTTP. Or replace the SOAP wrapper with an Atom wrapper. Would it be anymore RESTful? Not one bit.

And I don’t see how to make it more RESTful. There are plenty of things in the periphery the query operation that can be made RESTful, along the lines of what I described above. REST could make the discovery/reconciliation tasks of the CMDB more efficient. The CMDBf query result format could be improved so that from the returned elements I can navigate my way among resources by following hyperlinks. But the query operation itself looks fundamentally RPCish to me, just like my interaction with the Google search page is really an RPC call that happens to return a Web page full of hyperlinks. In a way, this query (whether Google or CMDBf) can at best be the transition point from RPC to REST. It can return results that open a world of RESTful requests to you, but the query invocation itself is not RESTful. And that’s OK.

In part 3 (now available), I will try to synthesize the lessons from the Cloud APIs (part 1) and configuration management (this post) and extract specific guidance to get the best of what REST has to offer in future IT management protocols. Just so you can plan ahead, in part 4 I will reform the US health care system and in part 5 I will provide a practical roadmap for global nuclear disarmament. Suggestions for part 6 are accepted.

11 Comments

Filed under API, Application Mgmt, Automation, Cloud Computing, CMDB, CMDB Federation, CMDBf, DMTF, Everything, IT Systems Mgmt, Mashup, Mgmt integration, Modeling, REST, SOAP, SOAP header, Specs, Standards, Utility computing

Cloud catalog catalyst or cloud catalog cataclysm?

Like librarians, we IT wonks tend to like things cataloged. To date, the last instance of this has been SOA governance and its various registries and repositories. With UDDI limping along as some kind of organizing standard for the effort. One issue I have with UDDI  is that its technical awkwardness is preventing us from learning from its failure to realize its ambitious goals (“e-business heaven”). It would be too easy to attribute the UDDI disappointment to UDDI. Rather, it should be laid at the feet of unreasonable initial expectations.

The SOA governance saga is still ongoing, now away from the spotlight and mostly from an implementation perspective rather than a standard perspective (by the way, what’s up with GIF?). Instead, the spotlight has turned to Cloud computing and that’s what we are supposedly going to control through cataloging next.

Earlier this year, I commented on the release of an ITSM catalog product for Cloud computing (though I was addressing the convergence of ITSM and Cloud computing more than catalogs per se).

More recently, Lori MacVittie related SOA governance to the need for Cloud catalogs. She makes some good points, but I also see some familiar-looking “irrational exuberance”. The idea of dynamically discovering and invoking a Cloud service reminds me too much of the initial “yellow pages” scenarios for UDDI (which quickly got dropped in favor of a more modest internal governance focus).

I am not convinced by the reason Lori gives for why things are different this time around (“one of the interesting things virtualization brings to the table that SOA did not is the ability to abstract management of services”). She argues that SOA governance only gave you access to the operational WSDL of a Web service, while Cloud catalogs will give you access to their management API. But if your service is an IT service, then your so-called management API (launch/configure/control VMs) is really its operational interface. The real management interface is the one Amazon uses under the cover and they are not going to expose it to you anymore than your bank is going to expose its application server administration console to you (if they do, move your money somewhere else before someone does it for you).

After all, isn’t SOA governance pretty close to a SaaS catalog which is itself a small part of the overall Cloud (IaaS+PaaS+SaaS) catalog question? If we still haven’t succeeded in the smaller scope, what are the odds of striking gold quickly in the larger effort?

Some analysts take a more pragmatic view, involving active brokers rather than simply a new DNS record type. I am doubtful about these brokers (0.2 probability, as Gartner would put it) but at least this moves the question onto business terms (leverage, control) rather than technical terms. Which is where the battle will be fought.

When it comes to Cloud catalogs, I think they are needed (if only for the categorization of Cloud services that they require) but will only play a supporting role, if any, in any move towards dynamic Cloud provisioning. As with SOA governance it’s as an internal tool, supported by strong processes, that they will be most useful.

Throughout human history, catalogs have been substitutes for control more often than instruments of control. Think of astronomy, zoology and… nephology for example. What kind will IT Cloud catalogs be?

2 Comments

Filed under Application Mgmt, Automation, Business, Cloud Computing, Everything, Governance, Manageability, Mgmt integration, Portability, Specs, Utility computing

A small step for SCA, a giant leap for BSM

In a very short post, Khanderao Kand describes how configuration properties for BPEL processes in Oracle SOA Suite 11G are attached to SCA components. Here is the example he provides:

<component name="myBPELServiecComponent">
  ...
  <property name="bpel.config.inMemoryOptimization">true</property>
</component>

It doesn’t look like much. But it’s an major step for application-driven IT management (and eventually BSM).

Take a SCA component. Follow the SCA-defined component-to-composite and service-to-reference relationships upwards and eventually you’ll get to top level application services that have a decent chance of mapping well to business-relevant activities (e.g. order processing). Which means that the metrics of these services (e.g. availability, response time) are likely to be meaningful and important to the line of business. Follow the same SCA relationships downward and you’ll end up (in a SCA-based infrastructure like Oracle SOA Suite 11G), with target components that are meaningful to the IT administrator. Which means that their metrics and configuration settings (like “inMemoryOptimization”) are tracked and controlled by IT. You now have a direct string of connections between this configuration setting and a business relevant metric. You can navigate the connection in both directions: downward/reactive (“my service just went down, what changed in the infrastructure”) versus upward/proactive (“my service is always slow, what can I do to optimize the execution”).

Of course these examples are over-simplistic (and the title of this post is a bit too lyrical, on account of this). Following these SCA relationships in brute-force fashion will yield tens of thousands of low-level configuration settings for any top-level service, with widely differing importance and impact (not to mention that they interact). You need rules to make sense of this. Plus, configuration-based models are a complement to runtime transaction discovery, not a replacement (unless your model of the application includes every single line of code). But it’s not that often that you can see a missing link snap into place that clearly.

What this shows is the emergence of a common set of entities between the developer’s model and the IT admin model. And if the application was developed correctly, some of the entities in the developer’s model correspond to entities in the mental model of the application user and the line of business manager. SCA is the skeleton for this. Attaching configuration to SCA components puts muscle on the bone.

The road to BSM is paved with small improvements in the semantic alignment between IT infrastructure and application services. A couple of years ago, I tried to explain why SCA is very relevant for IT management. Now we can see it.

4 Comments

Filed under Application Mgmt, BPEL, BSM, Business, Business Process, Everything, IT Systems Mgmt, Mgmt integration, Middleware, Modeling, Oracle, SCA, Standards

Anthology of blog posts about protocols and data formats

I just finished reading or re-reading a half-dozen great short texts about data formats and protocols, in the XML/RDF space.

I started with this “do we need WADL” post by Joe Gregorio (since the previous entry made me go back to WADL which is used by Rackspace). Under the guise of a Q&A about WADL, Joe’s post disposes of the notion that IDL-based code generation is any good (of course the reference on this topic is Steve’s Alpine paper, but Joe very elegantly captures the gist of it a few sentences). He then explains what it really take to specify a protocol (hint: it’s not just a syntax). This is about WSDL and XSD as much as WADL.

When I reached the point in Joe’s Q&A where he discusses whether one should ever create a new protocol, I remembered a post on this very topic from Tim Bray, which I easily Googled back to life. Two of them actually, one about why you shouldn’t do it and the other about how to do it since he knows his advice will be ignored. There are so many lessons in these that I won’t even attempt to summarize.

Tim’s second piece then delivered me to this excellent article about the various facets of RDF. It’s six years old but still true. Though if it was written today I expect it would add “graph query language” and possibly even “constraint language” as facets of RDF.

While I am at it, I should add to the list this to this bird-eye view of all the XML obstacles that pedestrians run into (I have highlighted this entry in a previous post).

These are all very well written articles by people who think very clearly about the domain. None of them technically taught me anything I didn’t know before, but they definitely helped me clarify my thoughts (and find the words to explain them to others).

We’re not artists. We’re not scientists. We’re not mathematicians. But there is some beauty in computer protocol design too. These writings are museum pieces, in the “lasting/worthwhile” sense of the term (not the “old/outdated” sense that it often has in the computing world). Don’t rush to read them, they are all several years old and have aged very well. Wait until you have the time to read them carefully.

I didn’t set out to create a best-of compilation of writings about protocols and data formats. I just happened to run into these great entries in a 30 minutes period and I was impressed by how much “above average” they all are. Is it luck? Does the topic of computer protocols naturally attract good thinkers and writers? Am I just in a good mood tonight? Who knows.

There must be others, possibly even better. Elliotte Rusty Harold occasionally surfaces one through his not-so-daily “quote of the day“. Suggestions for more articles of this caliber are welcome. A thousand monkeys may not be able to produce Hamlet, but a thousand bloggers may come close to an equivalent of Feynman’s lectures.

[UPDATED 2010/11/12: Over a year later, here’s an addition to this anthology. Stu’s succinct and beautiful explanation of the underlying issues with partial resource update (and REST in general).]

1 Comment

Filed under Articles, Everything, Protocols, RDF, Specs, Standards