William Vambenepe's blog

IT management in a changing IT world


Archive for the 'Virtualization' Category

19 Feb 2010

HP has submitted a specification to the DMTF Cloud incubator

by William (@vambenepe on Twitter)

When I lamented, in a previous post, that I couldn’t tell you about recent submissions to the DMTF Cloud incubator, one of those I had in mind was a submission from HP. I can now write this, because the author of the specification, Nigel Cook, has recently blogged about it. Unfortunately he isn’t publishing the specification itself, just an announcement that it was submitted. Hopefully he is currently going through the long approval process to make the submitted document public (been there, done that, I know it takes time).

In the blog, Nigel makes a good argument for the need to go beyond a hypervisor-centric view of Cloud computing. Even at the IaaS layer there are cases of automated-but-not-virtualized deployment that have all the characteristics of Cloud computing and need to be supported by Cloud management APIs. Not to mention OS-level isolation like Solaris Containers.

Nigel also offers a spirited defense of SOAP-based protocols. I don’t necessarily agree with all his points (“one could easily map the web service definition I described to REST if that was important” suggests a “it’s just SOAP without the wrapper” view of REST), but I am glad he is launching this debate. We need to discuss this rather than assume that REST is the obvious answer. Remember, a few years ago SOAP was just as obvious an answer to any protocol question. It may well be that indeed REST comes out ahead of this discussion, but the process will force us to be explicit about what benefits of REST we are trying to achieve and will allow us to be practical in the way we approach it.

20 Jan 2010

Generalizing the Cloud vs. SOA Governance debate

by William (@vambenepe on Twitter)

There have been some interesting discussions recently about the relationship between Cloud management and SOA management/governance (run-time and design-time). My only regret is that they are a bit too focused on determining winners and losers rather than defining what victory looks like (a bit like arguing whether the smartphone is the triumph of the phone over the computer or of the computer over the phone instead of discussing what makes a good smartphone).

To define victory, we need to answer this seemingly simple question: in what ways is the relationship between a VM and its hypervisor different from the relationship between two communicating applications?

More generally, there are three broad categories of relationships between the “active” elements of an IT system (by “active” I am excluding configuration, organization, management and security artifacts, like patch, department, ticket and user, respectively, to concentrate instead on the elements that are on the invocation path at runtime). We need to understand if/how/why these categories differ in how we manage them:

  • Deployment relationships: a machine (or VM) in a physical host (or hypervisor), a JEE application in an application server, a business process in a process engine, etc…
  • Infrastructure dependency relationships (other than containment): from an application to the DB that persists its data, from an application tier to web server that fronts it, from a batch job to the scheduler that launches it, etc…
  • Application dependency relationships: from an application to a web service it invokes, from a mash-up to an Atom feed it pulls, from a portal to a remote portlet, etc…

In the old days, the lines between these categories seemed pretty clear and we rarely even thought of them in the same terms. They were created and managed in different ways, by different people, at different times. Some were established as part of a process, others in a more ad-hoc way. Some took place by walking around with a CD, others via a console, others via a centralized repository. Some of these relationships were inventoried in spreadsheets, others on white boards, some in CMDBs, others just in code and in someone’s head. Some involved senior IT staff, others were up to developers and others were left to whoever was manning the controls when stuff broke.

It was a bit like the relationships you have with the taxi that takes you to the airport, the TSA agent who scans you and the pilot who flies you to your destination. You know they are all involved in your travel, but they are very distinct in how you experience and approach them.

It all changes with the Cloud (used as shorthand for virtualization, management automation, on-demand provisioning, 3rd-party hosting, metered usage, etc…). The advent of the hypervisor is the most obvious source of change: relationships that were mostly static become dynamic; also, where you used to manage just the parts (the host and the OS, often even mixed as one), you now manage not just the parts but the relationship between them (the deployment of a VM in a hypervisor). But it’s not just hypervisors. It’s frameworks, APIs, models, protocols, tools. Put them all together and you realize that:

  • the IT resources involved in all three categories of relationships can all be thought of as services being consumed (an “X86+ethernet emulation” service exposed by the hypervisor, a “JEE-compatible platform” service exposed by the application server, an “RDB service” exposed by the database, a Web service exposed via SOAP or XML/JSON over HTTP, etc…),
  • they can also be set up as services, by simply sending a request to the API of the service provider,
  • not only can they be set up as services, they are also invoked as such, via well-documented (and often standard) interfaces,
  • they can also all be managed in a similar service-centric way, via performance metrics, SLAs, policies, etc,
  • your orchestration code may have to deal with all three categories, (e.g. an application slowdown might be addressed either by modifying its application dependencies, reconfiguring its infrastructure or initiating a new deployment),
  • the relationships in all these categories now have the potential to cross organization boundaries and involve external providers, possibly with usage-based billing,
  • as a result of all this, your IT automation system really needs a simple, consistent, standard way to handle all these relationships. Automation works best when you’ve simplified and standardized the environment to which it is applied.
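To make the list above concrete, here is a minimal sketch of what a single, service-centric representation of all three relationship categories could look like. Everything in it (the class names, the fields, the sample resource names and the remediation wording) is a hypothetical illustration, not something defined by any Cloud API or standard.

```python
from dataclasses import dataclass
from enum import Enum


class RelationshipKind(Enum):
    DEPLOYMENT = "deployment"
    INFRASTRUCTURE_DEPENDENCY = "infrastructure dependency"
    APPLICATION_DEPENDENCY = "application dependency"


@dataclass(frozen=True)
class ServiceRelationship:
    """One consumer/provider link, whatever its category (hypothetical model)."""
    consumer: str
    provider: str
    service: str              # what the provider exposes to the consumer
    kind: RelationshipKind
    external: bool = False    # True if the link crosses an organization boundary


def remediation_options(app, relationships):
    """List one candidate action per relationship for a slowdown of `app`."""
    actions = {
        RelationshipKind.DEPLOYMENT: "initiate a new deployment on",
        RelationshipKind.INFRASTRUCTURE_DEPENDENCY: "reconfigure",
        RelationshipKind.APPLICATION_DEPENDENCY: "modify the dependency on",
    }
    return [
        f"{actions[r.kind]} {r.provider}"
        for r in relationships
        if r.consumer == app
    ]


# Hypothetical inventory: the same structure holds all three categories.
relationships = [
    ServiceRelationship("order-app", "hypervisor-7", "X86+ethernet emulation",
                        RelationshipKind.DEPLOYMENT),
    ServiceRelationship("order-app", "orders-db", "RDB service",
                        RelationshipKind.INFRASTRUCTURE_DEPENDENCY),
    ServiceRelationship("order-app", "pricing-ws", "Web service over SOAP",
                        RelationshipKind.APPLICATION_DEPENDENCY, external=True),
    ServiceRelationship("billing-app", "orders-db", "RDB service",
                        RelationshipKind.INFRASTRUCTURE_DEPENDENCY),
]

options = remediation_options("order-app", relationships)
```

The point of the sketch is only that orchestration code can iterate over one list and one type, instead of consulting a spreadsheet for one category, a CMDB for another and someone’s head for the third.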

If you’re a SOA person, your mental model for this is SOA++ and you pull out your SOA management and governance (config and runtime) tools. If you are in the WS-* obedience of SOA, you go back to WS-Management, try to see what it would take to slap a WSDL on a hypervisor and start dreaming of OVF over MTOM/XOP. If you’re into middleware modeling you might start to have visions of SCA models that extend all the way down to the hardware, or at least of getting SCA and OSGi to ally and conquer the world. If you’re a CMDB person, you may tell yourself that now is the time for the CMDB to do what you’ve been pretending it was doing all along and actually extend all the way into the application. Then you may have that “single source of truth” on which the automation code can reliably work. Or if you see the world through the “Cloud API” goggles, then this “consistent and standard” way to manage relationships at all three layers looks like what your Cloud API of choice will eventually do, as it grows from IaaS to PaaS and SaaS.

Your background may shape your reference model for this unified service-centric approach to IT management, but the bottom line is that we’d all like a nice, clear conceptual model to bridge and unify Cloud (provisioning and containment), application configuration and SOA relationships. A model in which we have services/containers with well-defined operational contracts (and on-demand provisioning interfaces). Consumers/components with well-defined requirements. APIs to connect the two, with predictable results (both in functional and non-functional terms). Policies and SLAs to fine-tune the quality of service. A management framework that monitors these policies and SLAs. A common security infrastructure that gets out of the way. A metering/billing framework that spans all these interactions. All this while keeping out of sight all the resource-specific work needed behind the scenes, so that the automation code can look as Zen as a Japanese garden.

It doesn’t mean that there won’t be separations, roles, processes. We may still want to partition the IT management tasks, but we should first have a chance to rejigger what’s in each category. It might, for example, make sense to handle provider relationships in a consistent way whether they are “deployment relationships” (e.g. EC2 or your private IaaS Cloud) or “application dependency relationships” (e.g. SOA, internal or external). On the other hand, some of the relationships currently lumped in the “infrastructure dependency relationships” category because they are “config files stuff” may find different homes depending on whether they remain low-level and resource-specific or they are absorbed in a higher-level platform contract. Any fracture in the management of this overall IT infrastructure should be voluntary, based on legal, financial or human requirements. And not based on protocol, model, security and tool disconnect, on legacy approaches, on myopic metering, that we later rationalize as “the way we’d want things to be anyway because that’s what we are used to”.

In the application configuration management universe, there is a planetary collision scheduled between the hypervisor-centric view of the world (where virtual disk formats wrap themselves in OVF, then something like OVA to address, at least at launch time, application and infrastructure dependency relationships) and the application-model view of the world (SOA, SCA, Microsoft Oslo at least as it was initially defined, various application frameworks…). Microsoft Azure will have an answer, VMWare/SpringSource will have one, Oracle will too (though I can’t talk about it), Amazon might (especially as it keeps adding to its PaaS portfolio) or it might let its ecosystem sort it out, IBM probably has Rational, WebSphere and Tivoli distinguished engineers locked into a room, discussing and over-engineering it at this very minute, etc.

There is a lot at stake, and it would be nice if this was driven (industry-wide or at least within each of the contenders) by a clear understanding of what we are aiming for rather than a race to cobble together partial solutions based on existing control points and products (e.g. the hypervisor-centric party).

[UPDATED 2010/1/25: For an illustration of my statement that "if you’re a SOA person, your mental model for this is SOA++", see Joe McKendrick's "SOA's Seven Greatest Mysteries Unveiled" (bullet #6: "When you get right down to it, cloud is the acquisition or provisioning of reusable services that cross enterprise walls. (...)  They are service oriented architecture, and they rely on SOA-based principles to function.")]

07 Jan 2010

Backward-compatible vs. forward-compatible: a tale of two clouds

by William (@vambenepe on Twitter)

There is the Cloud that provides value by requiring as few changes as possible. And there is the Cloud that provides value by raising the abstraction and operation level. The backward-compatible Cloud versus the forward-compatible Cloud.

The main selling point of the backward-compatible Cloud is that you can take your existing applications, tools, configurations, customizations, processes etc and transition them more or less as they are. It’s what allowed hypervisors to spread so quickly in the enterprise.

The main selling point of the forward-compatible Cloud is that you are more productive and focused. Fewer configuration items to worry about, fewer stack components to install/monitor/update, you can focus on your application and your business goals. You develop and manage at the level of application concepts, not systems. Bottom line, you write and deploy applications more quickly, cheaply and reliably.

To a large extent this maps to the distinction between IaaS and PaaS, but it’s not that simple. For example, a PaaS that endeavors to be a complete JEE environment is mainly aiming for the backward-compatible value proposition. On the other hand, EC2 spot instances, while part of the IaaS layer, are of the forward-compatible kind: not meant to run your current applications unchanged, but rather to give you ways to create applications that better align with your business goals.

Part of the confusion is that it’s sometimes unclear whether a given environment is aiming for forward-compatibility (and voluntary simplification) or whether its goal is backward-compatibility but it hasn’t yet achieved it. Take EC2 for example. At first it didn’t look much like a traditional datacenter, beyond the ability to create hosts. Then we got fixed IP, EBS, boot from EBS, etc and it got more and more realistic to run applications unchanged. But not quite, as this recent complaint by Hoff illustrates. He wants a lot more control over the network setup so he can deploy existing n-tier applications that have specific network topology/config requirements without re-engineering them. It’s a perfectly reasonable request, in the context of the backward-compatible Cloud value proposition. But one that will never be granted by a Cloud that aims for forward-compatibility.

Similarly, the forward-compatible Cloud doesn’t always successfully abstract away lower-level concerns. It’s one thing to say you don’t have to worry about backup and security but it means that you now have to make sure that your Cloud provider handles them at an acceptable level. And even on technical grounds, abstractions still leak. Take Google App Engine, for example. In theory you only deal with requests and don’t even think about the servers that process them (you have no idea how many servers are used). That’s nice, but once in a while your Java application gets a DeadlineExceededException. That’s because the GAE platform had to start using a new JVM to serve this request (for example, your traffic is growing or the JVM previously used went down) and it took too long for the application to load in the new JVM, resulting in this loading request being killed. So you, as the developer, have to take special steps to mitigate a problem that originates at a lower level of the stack than you’re supposed to be concerned about.
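As an illustration of the kind of “special steps” this can involve, here is a minimal sketch of one possible mitigation: deferring expensive initialization so that the first request served by a fresh runtime does not pay for everything up front. This is a generic pattern written in Python for brevity; it does not use any App Engine API, and the names (LazyResource, build_report_templates, handle_request) are made up for the example. Whether it helps depends on the assumption that startup time is dominated by eager initialization.

```python
class LazyResource:
    """Defers an expensive setup step until the resource is first used.

    Note: this simple version is not thread-safe.
    """

    def __init__(self, factory):
        self._factory = factory
        self._value = None
        self._loaded = False

    def get(self):
        if not self._loaded:
            self._value = self._factory()
            self._loaded = True
        return self._value


setup_calls = []  # records each time the expensive setup actually runs


def build_report_templates():
    # Stand-in for slow work (parsing configuration, warming caches, ...).
    setup_calls.append("templates")
    return {"monthly": "<monthly report template>"}


templates = LazyResource(build_report_templates)


def handle_request(path):
    # Only requests that need the templates trigger the setup cost.
    if path == "/report":
        return templates.get()["monthly"]
    return "ok"


first = handle_request("/health")    # served without running the setup
second = handle_request("/report")   # setup runs here, once
third = handle_request("/report")    # reuses the already-built value
```

The trade-off is that the cost does not disappear, it just moves to the first request that needs the resource, which is exactly the kind of lower-level concern the platform was supposed to hide.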

All in all, the distinction between backward-compatible and forward-compatible Clouds is not a classification (most Cloud environments are a mix). Rather, it’s another mental axis on which to project your Cloud plans. It’s another way to think about the benefits that you expect from your use of the Cloud. Both providers and consumers should understand what they are aiming for on that axis. Hopefully this can help prevent shouting matches of the “it’s a bug, no it’s a feature” variety.

[UPDATED 2010/3/4: Apparently, Steve Ballmer thinks along the same lines. Though the way he sees it, Azure is forward-compatible, while Amazon is backward-compatible: "I think Amazon has done a nice job of helping you take the server-based programming model - the programming model of yesterday, that is not scale-agnostic - and then bringing it into the cloud. On the other hand, what we're trying to do with Azure is let you write a different kind of application."]

[UPDATED 2010/3/5: I now have the quasi-proof that indeed Steve Ballmer stole the idea from my blog. Look at this entry in my HTTP log. This visitor came the evening before Steve's "Cloud" talk at the University of Washington. I guess I am not the only one to procrastinate until the 11th hour when I have a deadline. Every piece of information in this log entry points at Steve Ballmer. How can it be anyone other than him?

131.107.0.71 - - [03/Mar/2010:23:51:52 -0800] “GET /archives/1198 HTTP/1.0″ 200 4820 “http://www.bing.com/search?q=Brilliant+Cloud+Insight” “Mozilla/1.22 (compatible; MSIE 2.0; Microsoft Bob)”

Hi Steve!]

25 Nov 2009

Can your hypervisor radio for air support?

by William (@vambenepe on Twitter)

As I was reading about Microsoft Azure recently, a military analogy came to my mind. Hypervisors are tanks. Application development and runtime platforms compose the air force.

Tanks (and more generally the mechanization of ground forces) transformed war in the 20th century. They multiplied the fighting capabilities of individuals and changed the way war was fought. A traditional army didn’t stand a chance against a mechanized one. More importantly, a mechanized army that used the new tools with the old mindset didn’t stand a chance against a similarly equipped army that had rethought its strategy to take advantage of the new capabilities. Consider France at the beginning of WWII, where tanks were just cannons on wheels, spread evenly along the front line to support ground troops. Contrast this with how Germany, as part of the Blitzkrieg, used tanks and radios to create highly mobile – and yet coordinated – units that caused havoc in the linear French defense.

Exercise for the reader who wants to push the analogy further:

  • Describe how Blitzkrieg-style mobility of troops (based on tanks and motorized troop transports) compares to Live Migration of virtual machines.
  • Describe how the use of radios by these troops compares to the use of monitoring and control protocols to frame IT management actions.

Tanks (hypervisor) were a game-changer in a world of foot soldiers (dedicated servers).

But no matter how good your tanks are, you are at a disadvantage if the other party achieves air superiority. A less sophisticated/numerous ground force that benefits from strong air support is likely to prevail over a stronger ground force with no such support. That’s what came to my mind as I read about how Azure plans to cover the IaaS layer, but in the context of an application-and-data-centric approach. Where hypervisors are not left to fend for themselves based on the limited view of the horizon from the periscope of their turrets but rather orchestrated, supported (and even deployed) from the air, from the application platform.

[Image: C-130 tank airdrop]

(Yes, I am referring to the Azure vision as it was presented at PDC09, not necessarily the currently available bits.)

Does your Cloud vendor/provider need an air force?

Exercise for the reader who wants to push the analogy to the stratosphere:

  • Describe how business logic/process, business transaction management and business intelligence are equivalent to satellites, surveying the battlefield and providing actionable intelligence.

The new Cloud stack (“military-cloud complex” version):

[Image: cloud-military-stack]

[Note: I have no expertise in military history (or strategy) beyond high school classes about WWI and WWII, plus a couple of history books and a few war movies. My goal here is less to be accurate on military concerns (though I hope to be) than to draw an analogy which may be meaningful to fellow IT management geeks who share my level of (in)expertise in military matters. This is just yet another way in which I try to explain that, for Clouds as for plain old IT management, "it's the application, stupid".]

19 Nov 2009

Review of Fujitsu’s IaaS Cloud API submission to DMTF

by William (@vambenepe on Twitter)

Things are heating up in the DMTF Cloud incubator. Back in September, VMWare submitted its vCloud API (or rather a “reader’s digest” version of it) to the group. Last week, the group released a white paper titled “Interoperable Clouds”. And a second submission, from Fujitsu, was made last week and publicly announced today.

The Fujitsu submission is called an “API design”. What this means is that it doesn’t tell you anything about what things look like on the wire. It could materialize as another “XML over HTTP” protocol (with or without SOAP wrapper), but it could just as well be implemented as a binary RPC protocol. It’s really more of a sketch of a resource model than a remote API. The only invocation-related aspect of the document is that it defines explicit operations on various resources (though not their inputs and outputs). This suggests that the most obvious mapping would be to some XML/HTTP RPC protocol (SOAPy or not). In that sense, it stands out a bit from the more recent Cloud API proposals that take a “RESTful” rather than RPC approach. But in these days of enthusiastic REST-washing I am pretty sure a determined designer could produce a RESTful-looking (but contorted) set of resources that would channel the operations in the specification as HTTP-like verbs on these resources.

Since there are few protocol aspects to this “API design”, if we are to compare it to other “Cloud APIs”, it’s really the resource model that’s worth evaluating. The obvious comparison is to the EC2 model as it provides a pretty similar set of infrastructure resources (it’s entirely focused on the IaaS layer). It lacks EC2 capabilities around availability, security and monitoring. But it adds to the EC2 resource model the notions of VDC (“virtual data center”, a container of IaaS resources), VSYS (see below) and a lightly-defined EFM (Extended Function Module) concept which intends to encompass all kinds of network/security appliances (and presumably makes up for the lack of security groups).

The heart of the specification is the VSYS and its accompanying VSYS Descriptor. We are encouraged to think of the VSYS Descriptor as an extension of OVF that lets you specify this kind of environment:

Example content for a VSYS Descriptor


By forcing the initial VSYS instance to be based on a VSYS Descriptor, but then allowing the VSYS to drift away from the descriptor via direct management actions, the specification takes a middle-of-the-road approach to the “model-based versus procedural” debate. Disciples of the procedural approach will presumably start from a very generic and unconstrained VSYS Descriptor and, from there, script their way to happiness. Model geeks will look for a way to keep the system configuration in sync with a VSYS Descriptor.

How this will work is completely undefined. There is supposed to be a getVSYSConfiguration() operation which “returns the configuration information on the VSYS” but there is no format/content proposed for the response payload. Is this supposed to return every single config file, every setting (OS, MW, application) on all the servers in the VSYS? Surely not. But what then is it supposed to return? The specification defines five VSYS attributes (VSYSID, creator, createTime, description and baseDescriptor) so I know what getVSYSAttributes() returns. But leaving getVSYSConfiguration() undefined is like handing someone an airplane maintenance manual that simply reads “put the right part in the right place”. A similar feature is also left as an exercise to the reader in the section that sketches an “external configuration service”. We are provided with a URL convention to address the service, but zero information about the format and content of the configuration instructions provided to the VServer.
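To show what is pinned down and what is not, here is a minimal sketch. The five attribute names come from the specification as described above; their types are assumptions. The configuration payload is entirely hypothetical: the flat key/value shape, the sample keys and the configuration_drift() helper are invented for illustration, precisely because the submission does not define what getVSYSConfiguration() returns.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class VSYSAttributes:
    """The five VSYS attributes named in the submission (types are assumed)."""
    VSYSID: str
    creator: str
    createTime: datetime
    description: str
    baseDescriptor: str


def configuration_drift(descriptor, observed):
    """Return {setting: (expected, actual)} for every setting that differs.

    Both arguments are flat dicts of settings. This payload shape is a
    HYPOTHETICAL stand-in for the response format the specification omits.
    """
    keys = descriptor.keys() | observed.keys()
    return {
        key: (descriptor.get(key), observed.get(key))
        for key in sorted(keys)
        if descriptor.get(key) != observed.get(key)
    }


vsys = VSYSAttributes(
    VSYSID="vsys-001",
    creator="alice",
    createTime=datetime(2009, 11, 19, 9, 0),
    description="three-tier test system",
    baseDescriptor="three-tier-template",
)

# What the descriptor asked for versus what scripting later turned it into.
from_descriptor = {"web.servers": 2, "db.size": "small"}
from_running_system = {"web.servers": 3, "db.size": "small", "efm.firewall": "on"}

drift = configuration_drift(from_descriptor, from_running_system)
```

Even this toy version forces a decision about granularity (which settings count as “configuration”), which is the question the model geeks will need answered before they can keep a VSYS in sync with its descriptor.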

EC2 has a keypair access mechanism for Linux instances and a clumsy password-retrieval system for Windows instances. The Fujitsu proposal adopts the lowest common denominator (actually the greatest common divisor, but that’s a lost rhetorical cause): random password generation/retrieval for everyone.

I also noticed the statement that a VServer must be “implemented as a virtual machine” which is an unnecessary constraint/assumption. The opposite statement is later made for EFMs, which “can be implemented in various ways (e.g. run on virtual machines or not)”, so I don’t want to read too much into the “hypervisor-required” VServer statement which probably just needs an editorial clean-up.

From a political perspective this specification looks more like a case of “can I play with you? I brought some marbles” than a more aggressive “listen everybody, we’re playing soccer now and I am the captain”. In other words, this may be not so much an attempt to shape the outcome of the incubator as an effort to contribute to its work and position Fujitsu as a respected member whose participation needs to be acknowledged.

While this is an alternative submission to the vCloud API, I don’t think VMWare will feel very challenged by it. The specification’s core (VSYS Descriptor) intends to build on OVF, which should be music to VMWare’s ears (it’s the model, not the protocol, which is strategic). And it is light enough on technical details that it will be pretty easy for vCloud to claim that it, indeed, aligns with the intent of this “design”.

All in all, it is good to see companies take the time to write down what they expect out of the DMTF work. And it’s refreshing to see genuine single-company contributions rather than pre-negotiated documents by a clique. Whether they look more like implementable specifications or position papers, they all provide good input to the DMTF Cloud incubator.

15 Oct 2009

Cloud platform patching conundrum: PaaS has it much worse than IaaS and SaaS

by William (@vambenepe on Twitter)

The potential user impact of changes (e.g. patches or config changes) made on the Cloud infrastructure (by the Cloud provider) is a sore point in the Cloud value proposition (see Hoff’s take for example). You have no control over patching/config actions taken by the provider, any of which could potentially affect you. In a traditional data center, you can test the various changes on specific applications; you don’t have to apply them at the same time on all servers; and you can even decide to skip some infrastructure patches not relevant to your application (“if it ain’t broken…”). Not so in a Cloud environment, where you may not even know about a change until after the fact. And you have no control over the timing and the roll-out of the patch, so that some of your instances may be running on patched nodes and others may not (good luck with troubleshooting that).

Unfortunately, this is even worse for PaaS than IaaS. Simply because you sit on a lot more infrastructure that is opaque to you. In an IaaS environment, the only thing that can change is the hardware (rarely a cause of problems) and the hypervisor (or equivalent Cloud OS). In a PaaS environment, it’s all that plus whatever flavor of OS and application container is used. Depending on how streamlined this all is (just enough OS/AS versus a traditional deployment), that’s potentially a lot of code and configuration. Troubleshooting is also somewhat easier in an IaaS setup because the error logs are localized (or localizable) to a specific instance. Not necessarily so with PaaS (and even if you could localize the error, you couldn’t guarantee that your troubleshooting test runs on the same node anyway).

In a way, PaaS is squeezed between IaaS and SaaS on this. IaaS gets away with a manageable problem because the opaque infrastructure is not too thick. For SaaS it’s manageable too because the consumer is typically either a human (who is a lot more resilient to change) or a very simple and well-understood interface (e.g. IMAP or some Web services). Contrast this with PaaS where the contract is that of an application container (e.g. JEE, RoR, Django). There are all kinds of subtle behaviors (e.g. timing/ordering issues) that are not part of the contract and can surface after a patch: for example, a bug in the application that was never found because before the patch things always happened in a certain order that the application implicitly – and erroneously – relied on. That’s exactly why you always test your key applications today even if the OS/AS patch should, in theory, not change anything for the application. And it’s not just patches that can do that. For example, network upgrades can introduce timing changes that surface new issues in the application.
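A minimal sketch of such a latent ordering bug, with made-up event names: the first function only works if a “config” event happens to arrive before any “data” event, which no contract guarantees; the second buffers data until the configuration is known.

```python
def apply_events_fragile(events):
    """Implicitly assumes 'config' always arrives before any 'data' event."""
    multiplier = None
    total = 0
    for kind, value in events:
        if kind == "config":
            multiplier = value
        else:
            # Raises TypeError if no 'config' event has been seen yet.
            total += value * multiplier
    return total


def apply_events_robust(events):
    """Buffers data until the configuration is known; order no longer matters."""
    multiplier = None
    pending = []
    total = 0
    for kind, value in events:
        if kind == "config":
            multiplier = value
            total += sum(v * multiplier for v in pending)
            pending.clear()
        elif multiplier is None:
            pending.append(value)
        else:
            total += value * multiplier
    return total


# The order every test run has seen so far ...
usual_order = [("config", 2), ("data", 3), ("data", 4)]
# ... and the order that shows up after an infrastructure change.
after_patch = [("data", 3), ("config", 2), ("data", 4)]

try:
    apply_events_fragile(after_patch)
    fragile_failed = False
except TypeError:
    fragile_failed = True
```

Both functions return the same result on usual_order, which is why the bug can survive years of testing and only surface when something below the application changes the timing.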

And it goes both ways. Just like you can be hurt by the Cloud provider patching things, you can be hurt by them not patching things. What if there is an obscure bug in their infrastructure that only affects your application? First you have to convince them to troubleshoot with you. Then you have to convince them to produce (or get their software vendor to produce) and deploy a patch.

So what are the solutions? Is PaaS doomed to never go beyond hobbyists? Of course not. The possible solutions are:

  • Write a bug-free and high-performance PaaS infrastructure from the start, one that never needs to be changed in any way. How hard could it be? ;-)
  • More realistically, narrowly define container types to reduce both the contract and the size of the underlying implementation of each instance. For example, rather than deploying a full JEE+SOA container, componentize the application so that each component can deploy in a small container (e.g. a servlet engine, a process management engine, a rule engine, etc). As a result, the interface exposed by each container type can be more easily and fully tested. And because each instance is slimmer, it requires fewer patches over time.
  • PaaS providers may give their users some amount of visibility and control over this. For example, by announcing upgrades ahead of time, providing updated nodes to test on early and allowing users to specify “freeze” periods where nothing changes (unless an urgent security patch is needed, presumably). Time for a Cloud “refresh” in ITIL/ITSM-land?
  • The PaaS providers may also be able to facilitate debugging of infrastructure-related problems. For example by stamping the logs with a version ID for the infrastructure on the node that generated the log entry. And the ability to request that a test runs on a node with the same version. Keeping in mind that in a SOA / Composite world, the root cause of a problem found on one node may be a configuration change on a different node…
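The log-stamping idea in the last bullet can be sketched with Python’s standard logging module. The version string, the logger name and the log format are assumptions made up for the example; a real provider would inject its own identifier on each node.

```python
import io
import logging

# Hypothetical identifier for the infrastructure build running on this node.
INFRA_VERSION = "platform-2009.10.15-r3"


class InfraVersionFilter(logging.Filter):
    """Attaches the node's infrastructure version to every log record."""

    def __init__(self, version):
        super().__init__()
        self.version = version

    def filter(self, record):
        record.infra_version = self.version
        return True  # never drops a record, only annotates it


def make_logger(stream, version):
    logger = logging.getLogger("paas.app")
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s [infra=%(infra_version)s] %(message)s")
    )
    handler.addFilter(InfraVersionFilter(version))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


buffer = io.StringIO()
log = make_logger(buffer, INFRA_VERSION)
log.warning("request took 12s")
stamped_line = buffer.getvalue()
```

With the version in every entry, a user comparing logs from a patched and an unpatched node at least knows which is which, and has something precise to quote when asking for a test run on “a node with the same version”.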

Some closing notes:

  • Another incarnation of this problem is likely to show up in the form of PaaS certification. We should not assume that just because you use a PaaS you are the developer of the application. Why can’t I license an ISV app that runs on GAE? But then, what does the ISV certify against? A given PaaS provider, e.g. Google? A given version of the PaaS infrastructure (if there is such a thing… Google advertises versions of the GAE SDK, but not of the actual GAE runtime)? Or maybe a given PaaS software stack, e.g. the Oracle/Microsoft/IBM/VMWare/JBoss/etc, meaning that any Cloud provider who uses this software stack is certified?
  • I have only discussed here changes to the underlying platform that do not change the contract (or at least only introduce backward-compatible changes, i.e. add APIs but don’t remove any). The matter of non-compatible platform updates (and version coexistence) is also a whole other ball of wax, one that comes with echoes of SOA governance discussions (because in PaaS we are talking about pure software contracts, not hardware or hardware-like contracts). Another area in which PaaS has larger challenges than IaaS.
  • Finally, for an illustration of how a highly focused and specialized container cuts down on the need for config changes, look at this photo from earlier today during the presentation of JRockit Virtual Edition at Oracle Open World. This slide shows (in font size 3, don’t worry you’re not supposed to be able to read), the list of configuration files present on a normal Linux instance, versus a stripped-down (“JeOS”) Linux, versus JRockit VE.


By the way, JRockit VE is very interesting and the environment today is much more favorable than when BEA first did it, but that’s a topic for another post.

[UPDATED 2009/10/22: For more on this (in an EC2-centric context) see section 4 ("service problem resolution") of this IBM paper. It ends with "another possible direction is to develop new mechanisms or APIs to enable cloud users to directly and automatically query and correlate application level events with lower level hardware information to better identify the root cause of the problem".]

21
Sep
2009

Look Ma, no hypervisor!

by William (@vambenepe on Twitter)

Encouraged by hypervisor vendors, the confusion between virtualization and Cloud Computing is rampant. In the industry, the term “virtualization” (and its corollary, “virtual machine”) is used in so many different ways that it has lost all usefulness. For a recent example, read the introduction of this SNIA/OGF white paper (on Cloud Storage) which asserts that “the new technology underlying this is the system virtual machine that allows multiple instances of an operating system and associated applications to run on single physical machine. Delivering this over the network, on demand, is termed Infrastructure as a Service (IaaS)”.

In fact, even IaaS-type Cloud services don’t imply the use of hypervisors.

We need to decouple the Cloud interface/contract (e.g. “what are the types of resources that I can provision on demand? hosts, app servers, storage capacity, app services…”) from the underlying implementation (e.g. “are hypervisors used by the Cloud provider?”). At the risk of spelling out things that may be obvious to many readers of this blog, here is a simplified matrix of Cloud Computing systems, designed to illustrate that all combinations of interface and implementation are possible and in many cases even reasonable.

                    | IaaS interface | PaaS interface
Hypervisor used     | Yes! (see #1)  | Yes! (see #2)
Hypervisor not used | Yes! (see #3)  | Yes! (see #4)

#1: IaaS interface, hypervisor-based implementation

This is a very common approach these days, both in public Clouds (EC2, Rackspace and presumably at some point the VMWare vCloud Express service providers) and private Clouds (Citrix, Sun, Oracle, Eucalyptus, VMWare…). Basically, you take a bunch of servers, put hypervisors on all of them and make VMs running on these hypervisors available to the Cloud customers.

But despite its predominance, this is not the only path to a Cloud, not even to an IaaS (e.g. “x86 hosts on demand”) Cloud. The following three other scenarios are all valid too.

#2: PaaS interface, hypervisor-based implementation

This is the road SpringSource has been on, first with Cloud Foundry (using AWS EC2 which is based on the Xen hypervisor) and presumably soon on top of VMWare.

#3: IaaS interface, no hypervisor in the implementation

Let’s remember that the utility computing vision (before the term fell into desuetude in favor of “cloud”) was around before x86 hypervisors were so common. Take Loudcloud as an illustration. They were building what is now called a “public Cloud” starting back in 1999 and not using any hypervisor. Just bare metal provisioning and advanced provisioning automation software. Then they sold the hosting part to EDS (now HP) and only kept the software, under the name Opsware (now HP too, incidentally). That software was meant to create what we now call a “private Cloud”. See this old DCML announcement as one example of the Opsware vision. And no hypervisor was harmed in the making of this movie.

At the current point in time, the hardware (e.g. multiple cores, shared memory) and software (hypervisors, legacy apps) environment is such that hypervisor-based solutions seem to have an edge over those based on automated provisioning/configuration alone. But these things tend to change quickly in our industry… Especially if you factor in non-technical considerations like compliance, fear of data leakage and the risk of having the hardware underlying your application seized because of an investigation involving another tenant…

And this is without going into finer techno-philosophical points about the different types of hypervisors. Not to mention mainframe LPARs… One could build a hypervisor-free IaaS solution on these.

To some extent, you may even put the “pwned” machines (in a botnet) in this “IaaS with no hypervisor” category (with the small difference that what’s being made available is an x86 with an OS, typically Windows, already installed). If you factor out externalities (like the FBI breaking down your front door at 6:00AM) this approach has claims as the most cost-effective form of Cloud computing available today… Solaris zones are another example of possible foundation for a hypervisor-free IaaS-like offering (here too, with an OS rather than a “raw host” as the interface).

#4: PaaS interface, no hypervisor in the implementation

In the public sphere, this corresponds to Google App Engine.

In the private sphere, several companies have built it themselves on top of WebLogic, by adding some level of “on-demand” application provisioning in order to streamline the relationship between the IT group running the servers and the business groups who want to deploy applications on them. Something that one should ideally be able to buy rather than build.

Waiting for the question to become irrelevant

Like most deeply-ingrained confusions, the conflation of virtualization and Cloud Computing won’t be dispelled as much as made irrelevant. The four categories enumerated in this post are a point-in-time view of a continuously evolving system. What may start today as a bundle of a hypervisor, an OS and an app server may become a somewhat monolithic “PaaS engine” over time as the components are more tightly integrated. That “engine” may have memory isolation mechanisms that look a lot like a hypervisor. But it may not be able to host a generic OS. In the same way, whales don’t have fingers and toes, and yet these are still very much apparent in their skeleton.

[UPDATED 2009/10/8: A real-life example of #3! On-demand servers via bare metal provisioning (via Sam). No hypervisor in the picture. See also here.]

[UPDATED 2009/12/29: Another non-hypervisor Cloud provider! NewServers. Here is their API. And a Q&A.]

15
Sep
2009

Cloud Data Management Interface (CDMI) draft released

by William (@vambenepe on Twitter)

Have you developed “Cloud API fatigue” from seeing too many IaaS “Cloud APIs” lately? Are you starting to wonder how many different ways there can possibly be to launch a virtual machine via an HTTP POST? Are you wondering why everybody else seems to equate Cloud computing with on-demand server instances?

If yes, then CDMI will come as a breath of fresh air. This specification (just a draft at this point) is a rare example of a different beast. Coming out of SNIA, it endeavors to standardize the way storage resources are managed and accessed in a Cloud environment. They call this DaaS (Data storage as a Service).

The specification has two components (which may benefit from being separated into two specifications at some point). One (called “control paths”) is an interface to manage a data storage service. That interface is expected to work across many forms of data storage from block storage (like AWS EBS) to filesystems (e.g. NFS) to object stores with a CRUD interface (similar to the WebDAV volumes of the Sun API). It also mentions a “simple table space storage” form, but that part is pretty fuzzy.

The second component of CDMI (called “data paths”) only applies to the CRUD object store and it describes a RESTful interface for accessing it. This figure from the specification does a good job of illustrating the two different APIs in the specification (and the different types of storage envisioned).

One of the most interesting sections in the document describes the way in which the authors envision the ability to export the storage resources provisioned/managed through CDMI to other Cloud APIs. They illustrate it in an example involving OCCI (see also this joint white paper). This is very interesting and another sign that we need a shared RESTful resource control framework for Cloud computing as a first layer of standardization. One of the reasons I used to justify this claim two weeks ago was that “there will not be one API that provides control of [all the different forms of Cloud Computing], but they can share a base protocol that will make life a lot easier for developers. These Clouds won’t be isolated, developers will use them as a continuum.” One week later, this draft specification illustrates the point very well.

[As a somewhat related side note, this interesting post about what it takes to provide a large-scale resilient data service (the Google App Engine data store). And more about the Google File System in general.]

02
Sep
2009

VMWare publishes (and submits) vCloud API

by William (@vambenepe on Twitter)

VMWare published its vCloud API yesterday (it was previously only available to a few partners) and submitted it to the DMTF, as had been previously announced. So much for my speculations involving IBM.

It may be time to update the Cloud API comparison. After a very quick first pass, vCloud looks quite similar to the Sun Cloud API (that’s a compliment). For example, they both handle long-lived operations via a “202 Accepted” complemented by a resource that represents the progress (“status” for Sun, “task” for vCloud). A very visible (but not critical) difference is the use of JSON (Sun) versus XML (vCloud).
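
The shared “202 Accepted” plus progress-resource pattern can be sketched as follows. This is not code from either specification: the `progress_url` field, the `state` values and the injected `post`/`fetch` callables are all assumptions made for illustration, and each API conveys the progress resource in its own way.

```python
import time

TERMINAL_STATES = ("success", "error")  # assumed state names, not from either spec


def wait_for_task(fetch, task_url, interval=1.0, max_polls=60, sleep=time.sleep):
    """Poll the progress resource until it reports a terminal state."""
    for _ in range(max_polls):
        task = fetch(task_url)
        if task["state"] in TERMINAL_STATES:
            return task
        sleep(interval)
    raise TimeoutError(f"{task_url} still running after {max_polls} polls")


def start_operation(post, fetch, url, body, **poll_options):
    """Submit a long-lived request, then follow its progress resource."""
    status, reply = post(url, body)
    if status != 202:
        raise RuntimeError(f"expected 202 Accepted, got {status}")
    # "progress_url" is a hypothetical field standing in for Sun's "status"
    # resource or vCloud's "task" resource.
    return wait_for_task(fetch, reply["progress_url"], **poll_options)


class FakeCloud:
    """In-memory stand-in for a Cloud API, so the sketch runs without a network."""

    def __init__(self, polls_until_done):
        self.remaining = polls_until_done

    def post(self, url, body):
        return 202, {"progress_url": url + "/tasks/1"}

    def fetch(self, url):
        if self.remaining > 0:
            self.remaining -= 1
            return {"state": "running"}
        return {"state": "success"}


cloud = FakeCloud(polls_until_done=2)
naps = []  # records each requested sleep instead of actually sleeping
result = start_operation(
    cloud.post, cloud.fetch, "/vms", {"name": "web01"},
    interval=0.5, sleep=naps.append,
)
```

Passing the transport in as plain callables keeps the polling logic identical whether the representation is JSON (Sun) or XML (vCloud); only the parsing behind `fetch` would differ.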

As expected, OVF/OVA is central to vCloud. More once I have read the whole specification.

In any case, things are going to get interesting in the DMTF Cloud incubator. Is there a path to adoption? Assuming that Amazon keeps sitting it out, what will the other Cloud vendors with an API (Rackspace, GoGrid, Sun…) do? I doubt they ever had plans/aspirations to own or even drive the standard, but how much are they willing to let VMWare do it? How much does Citrix/Xen want to steer standards versus simply implement them in the context of the Xen Cloud project? What about OGF/OCCI with which the DMTF is supposedly collaborating? How much support is VMWare going to receive from its service provider partners? How much traction does VMWare have with Cisco, HP (server division) and IBM on this? What are the plans at Oracle and Microsoft? Speaking of Microsoft, maybe it will at some point want its standard strategy playbook back. At least when VMWare is done using it.

02
Sep
2009

Are these your files? I found them on my cloud

by William (@vambenepe on Twitter)

Drip drip drip… Is this the sound of your cloud leaking?

It can happen in different ways. See for example this recent research paper, titled “Hey, You, Get Off of My Cloud: Exploring Information Leakage in Third-Party Compute Clouds”. It’s a nice read, especially if you find side channels interesting (I came up with one recently, in a different context).

In the first part of the paper, the authors show how to get your EC2 instance co-located (i.e. running in the same hypervisor) with the instance you are targeting (the one you want to spy on). Once this is achieved, they describe side channel attacks to glean information from this situation.

This paper got me thinking. I noticed that it does not mention trying to go after disk blocks and memory. I don’t know if they didn’t try or they tried and were defeated.

For disk blocks (the most obvious attack vector), Amazon is no dummy and their “proprietary disk virtualization layer automatically wipes every block of storage used by the customer, and guarantees that one customer’s data is never exposed to another” as explained in the AWS Security Whitepaper. In fact, they are so confident of this that they don’t even bother forbidding block-based recovery attempts in the AWS customer agreement (they seem mostly concerned about attacks that are not specific to hypervisor environments, like port scanning or network-based DOS). I took this as an invitation to verify their claims, so I launched a few Linux/ext3 and Windows/NTFS instances, attached a couple of EBS volumes to them and ran off-the-shelf file recovery tools. Sure enough, nothing was found on /dev/sda2 (the empty 150GB partition of local storage that comes with each instance) or on the EBS volumes. They are not bluffing.

On the other hand, there were plenty of recoverable files on /dev/sda1. Here is what a Foremost scan returned on two instances (both of them created from public Fedora AMIs).

The first one:

Finish: Tue Sep  1 05:04:52 2009

5640 FILES EXTRACTED

jpg:= 14
gif:= 670
htm:= 1183
exe:= 2
png:= 3771
------------------------------------------------------------------

And the second one:

Finish: Wed Sep  2 00:32:16 2009

17236 FILES EXTRACTED

jpg:= 236
gif:= 2313
rif:= 11
htm:= 4886
zip:= 182
exe:= 6
png:= 9594
pdf:= 8
------------------------------------------------------------------

These are blocks in the AMI itself, not blocks that were left on the volumes on which the AMI was installed. In other words, all instances built from the same AMI will provide the exact same recoverable files. The C: drive of the Windows instance also had some recoverable files. Not surprisingly they were Windows setup files.

I don’t see this as an AWS flaw. They do a great job providing cleanly wiped raw volumes and it’s the responsibility of the AMI creator not to snapshot recoverable blocks. I am just not sure that everyone out there who makes AMIs available is aware of this. My simple Foremost scans above only looked for the default file types known out of the box by Foremost. I suspect that if I added support for .pem files (used by AWS to store private keys) there may well be a few such files recoverable in some of the publicly accessible AMIs…
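
For AMI creators who want to check their own images before publishing them, here is a rough sketch of the kind of scan hinted at above. It is not what Foremost does internally; it simply searches a raw image for a PEM header. The marker string is an assumption about how such private key files typically begin, and the demo runs against a small temporary file rather than a real device.

```python
import os
import tempfile

# Assumed header of a PEM-encoded RSA private key; other key types use
# different "BEGIN ..." lines and would need their own markers.
PEM_MARKER = b"-----BEGIN RSA PRIVATE KEY-----"


def find_markers(path, marker=PEM_MARKER, chunk_size=1 << 20):
    """Return the byte offset of every occurrence of `marker` in a raw image."""
    offsets = []
    overlap = len(marker) - 1
    with open(path, "rb") as image:
        tail = b""
        base = 0  # file offset of the first byte of `tail + chunk`
        while True:
            chunk = image.read(chunk_size)
            if not chunk:
                break
            window = tail + chunk
            start = 0
            while True:
                hit = window.find(marker, start)
                if hit == -1:
                    break
                offsets.append(base + hit)
                start = hit + 1
            # Carry over the last len(marker) - 1 bytes so that a marker
            # split across two chunks is still found, without double counting.
            tail = window[-overlap:] if overlap else b""
            base += len(window) - len(tail)
    return offsets


# Demo on a tiny fake "image"; the small chunk size forces the markers
# to straddle chunk boundaries.
with tempfile.NamedTemporaryFile(delete=False) as fake_image:
    fake_image.write(b"\0" * 100 + PEM_MARKER + b"x" * 50 + PEM_MARKER)
found = find_markers(fake_image.name, chunk_size=64)
os.unlink(fake_image.name)
```

Any hit on an image you are about to share means the key material is still present in the blocks, whether or not the file still shows up in the filesystem.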

Again, kudos to Amazon, but I also wonder if this feature opens a possible DOS approach on AWS: it doesn’t cost me much to create a 1TB EBS volume and to destroy it seconds later. But for Amazon, that’s a lot of blocks to wipe. I wonder how many such instantaneous create/delete actions on large EBS volumes it would take to put a large chunk of AWS storage capacity in the “unavailable – pending wipe” state… That’s assuming that they proactively wipe all the physical blocks. If instead the wipe is virtual (their virtualization layer returns zero as the value for any free block, no matter what the physical value of the block) then this attack wouldn’t work. Or maybe they keep track of the blocks that were written and only wipe these.
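
The last two alternatives (a virtual wipe, and tracking which blocks were written) can be illustrated with a toy model. This is purely a sketch of the idea under made-up names (`ZeroOnReadVolume`, `blocks_to_wipe_on_delete`); nothing here is known about how Amazon actually implements its disk virtualization layer.

```python
BLOCK_SIZE = 4096


class ZeroOnReadVolume:
    """Toy volume carved out of a shared physical pool that is never scrubbed."""

    def __init__(self, physical_pool, first_block, block_count):
        self._pool = physical_pool  # shared list of blocks, may hold stale data
        self._first = first_block
        self._count = block_count
        self._written = set()  # blocks this tenant has written so far

    def _check(self, index):
        if not 0 <= index < self._count:
            raise IndexError(f"block {index} is outside the volume")

    def write(self, index, data):
        self._check(index)
        if len(data) != BLOCK_SIZE:
            raise ValueError("writes must be exactly one block long")
        self._pool[self._first + index] = bytes(data)
        self._written.add(index)

    def read(self, index):
        self._check(index)
        if index not in self._written:
            # Virtual wipe: never expose what a previous tenant left behind.
            return bytes(BLOCK_SIZE)
        return self._pool[self._first + index]

    def blocks_to_wipe_on_delete(self):
        """Only blocks that were actually written need a physical wipe."""
        return len(self._written)


# Demo: the pool starts full of another tenant's stale data.
pool = [b"S" * BLOCK_SIZE for _ in range(8)]
volume = ZeroOnReadVolume(pool, first_block=2, block_count=4)
fresh = volume.read(0)              # stale bytes are masked by zeros
volume.write(1, b"A" * BLOCK_SIZE)
```

Under either scheme, creating and immediately destroying a large empty volume costs the provider almost nothing, which is why the DOS scenario above would fall flat.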

Then there is the RAM. The AWS security paper tells us that the physical RAM is kept separated between instances (presumably they don’t use ballooning or the more ambitious Xen Transcendent Memory). But they don’t say anything about what happens when a new instance gets hold of the RAM of a terminated instance.

Amazon probably makes sure the RAM is reset, as the disk blocks are. But what about your private Cloud infrastructure? While the prospect of such Cloud leakage is most terrifying in a public cloud scenario (anyone could make use of it to go after you), in practice I suspect that these attack vectors are currently a lot more exploitable in the various “private clouds” out there. And that for many of these private clouds you don’t need to resort to the exotic side channels described in the “get off of my cloud” paper. Amazon has been around the block (no pun intended) a few times, but not all the private cloud frameworks out there have.

One possible conclusion is that you want to make sure that your cloud vendor does more than writing scripts to orchestrate invocations of the hypervisor APIs. They need to understand the storage, computing and networking infrastructure in detail. There is a messy physical world under your clean shiny virtual world. They need to know how to think about security at the system level.

Another one is that this is mostly an issue for hypervisor-based utility computing and a possible trump card for higher levels of virtualization, e.g. PaaS. The attacks described in the paper (as well as block-based file recovery) would not work on Google App Engine. What does co-residency mean in a world where subsequent requests to the same application could hit any machine (though in practice it’s unlikely to be so random)? You don’t get “deployed” to the same host as your intended victim. At best you happen to have a few requests executed while a few requests of your target run on the same physical machine. It’s a lot harder to exploit. More importantly, the attack surface is much more restrained. No direct memory access, no low-level scheduler data, no filesystem… The OS to hardware interface that hypervisors emulate was meant to let the OS control the hardware. The GAE interface/SDK, on the other hand, was meant to give the application just enough capabilities to perform its task, in a way that is as removed from the hardware as possible. Of course there is still an underlying physical reality in the GAE case and there are sure to be some leaks there too. But the small attack surface makes them a lot harder to exploit.

[UPDATED 2009/9/8: Amazon just improved the ability to smoothly update your access certificates. So hopefully any such certificate found on recoverable blocks in an AMI will be out of date and unusable.]

[UPDATED 2009/9/24: Some good security practices that help protect you against block analysis and many other forms of attack.]

[UPDATED 2009/10/15: At Oracle Open World this week, I was assured by an Amazon AWS employee that the DOS scenario I describe in this post would not be a problem for them. But no technical detail as to why that is. Also, you get billed a minimum of one hour for each EBS volume you provision, so that attack would not be as cheap as I thought (unless you use a stolen credit card).]
