Category Archives: IT Systems Mgmt
Oracle Enterprise Manager Cloud Control 12c, the new version of Oracle’s IT management product came out a few weeks ago, during Open World (video highlights of the launch). That release was known internally for a while as “NG” for “next generation” because the updates it contains are far more numerous and profound than your average release. The design for “NG” (now “12c”) started long before Enterprise Manager Grid Control 11g, the previous major release, shipped. The underlying framework has been drastically improved, from the modeling capabilities, to extensibility, scalability, incident management and, most visibly, UI.
If you’re not an existing EM user then those framework upgrades won’t be as visible to you as the feature upgrades and additions. And there are plenty of those as well, from database management to application management and configuration management. The most visible addition is the all-new self-service portal through which an EM-driven private Cloud can be made available. This supports IaaS-level services (individual VMs or assemblies composed of multiple coordinated VMs) and DBaaS services (we’ve also announced and demonstrated upcoming PaaS services). And it’s not just about delivering these services via lifecycle automation, a lot of work has also gone into supporting the business and organizational aspects of delivering services in a private Cloud: quotas, chargeback, cost centers, maintenance plans, etc…
EM Cloud Control is the first Oracle product with the “12c” suffix. You probably guessed it, the “c” stands for “Cloud”. If you consider the central role that IT management software plays in Cloud Computing I think it’s appropriate for EM to lead the way. And there’s a lot more “c” on the way.
Many (short and focused) demo videos are available. For more information, see the product marketing page, the more technical overview of capabilities or the even more technical product documentation. Or you can just download the product (or, for production systems, get it on eDelivery).
If you missed the launch at Open World, EM12c introduction events are taking place all over the world in November and December. They start today, November 3rd, in Athens, Riga and Beijing.
Note to anyone who still cares about IaaS standards: the DMTF has published a work in progress.
There was a lot of interest in the topic in 2009 and 2010. Some heated debates took place during Cloud conferences and a few symposiums were organized to try to coordinate various standard efforts. The DMTF started an “incubator” on the topic. Many companies brought submissions to the table, in various levels of maturity: VMware, Fujitsu, HP, Telefonica, Oracle and RedHat. IBM and Microsoft might also have submitted something, I can’t remember for sure.
The DMTF has been chugging along. The incubator turned into a working group. Unfortunately (but unsurprisingly), it limited itself to the usual suspects (and not all the independent Cloud experts out there) and kept the process confidential. But this week it partially lifted the curtain by publishing two work-in-progress documents.
They can be found at http://dmtf.org/standards/cloud but if you read this after March 2012 they won’t be there anymore, as DMTF likes to “expire” its work-in-progress documents. The two docs are:
- Cloud Infrastructure Management Interface (CIMI) Model and REST Interface Specification, and
- Cloud Infrastructure Management Interface – Common Information Model (CIMI-CIM) Specification
The first one is the interesting one, and the one you should read if you want to see where the DMTF is going. It’s a RESTful specification (at the cost of some contortions, e.g. section 184.108.40.206.1). It supports both JSON and XML (bad idea). It plans to use RelaxNG instead of XSD (good idea). And also CIM/MOF (not a joke, see the second document for proof). The specification is pretty ambitious (it covers not just lifecycle operations but also monitoring and events) and well written, especially for a work in progress (props to Gil Pilz).
I am surprised by how little reaction there has been to this publication considering how hotly debated the topic used to be. Why is that?
A cynic would attribute this to people having given up on DMTF providing a Cloud API that has any chance of wide adoption (the adjoining CIM document sure won’t help reassure DMTF skeptics).
To the contrary, an optimist will see this low-key publication as a sign that the passions have cooled, that the trusted providers of enterprise software are sitting at the same table and forging consensus, and that the industry is happy to defer to them.
More likely, I think people have, by now, enough Cloud experience to understand that standardizing IaaS APIs is a minor part of the problem of interoperability (not to mention the even harder goal of portability). The serialization and plumbing aspects don’t matter much, and if they do to you then there are some good libraries that provide mappings for your favorite language. What matters is the diversity of resources and services exposed by Cloud providers. Those choices strongly shape the design of your application, much more than the choice between JSON and XML for the control API. And nobody is, at the moment, in position to standardize these services.
So congrats to the DMTF Cloud Working Group for the milestone, and please get the API finalized. Hopefully it will at least achieve the goal of narrowing down the plumbing choices to three (AWS, OpenStack and DMTF). But that’s not going to solve the hard problem.
My colleagues Ashwin Karkala and Govinda Sambamurthy have written a book about modeling and managing business services using the current version of Enterprise Manager Grid Control (11g R1). Nobody would have been better qualified for this task since they built a lot of the features they describe. I acted as a technical reviewer for this book and very much enjoyed reading it in the process.
Whether you are a current EM user who wants to make sure you know and use the BSM features or someone just considering EM for that task, this is the book you want.
The full title is Oracle Enterprise Manager Grid Control 11g R1: Business Service Management.
As a bonus feature, and for a limited time only, if you purchase this book over the next 48 hours you get to follow the authors, @ashwinkarkala and @govindars on Twitter at no extra cost! A $2,000 value (at least).
As the name suggests, “Yoga” lets you practice some contortions that would strain a run-of-the-mill REST programmer. Basically, you can use a request like
to retrieve the id, name and birthday of all members of a softball team, rather than having to retrieve the team roaster and then do a GET on each and every team member to retrieve their name and birthday (and lots of other information you don’t care about).
Where have I seen this before? That use case came up over and over again when we were using SOAP Web services for resource management. I have personally crafted support for it a few times. Using this blog to support my memory, here is the list of SOAP-related management efforts listed in the “post-mortem on the previous IT management revolution”:
Each one of them supports this “partial access” use case: WS-Management has :
Each one of them supports this “partial access” use case: WS-Management has SelectorSet, WSRF has ResourceProperties, CMDBf has ContentSelector, WSRA has Fragments, etc.
Years ago, I also created the XMLFrag SOAP header to attack a more general version of this problem. There may be something to salvage in all this for people willing to break REST orthodoxy (with the full knowledge of what they gain and what they loose).
I’m not being sarcastic when I ask “where have I seen this before”. The problem hasn’t gone away just because we failed to solve it in a pragmatic way with SOAP. If the industry is moving towards HTTP+JSON then we’ll need to solve it again on that ground and it’s no surprise if the solution looks similar.
I have a sense of what’s coming next. XPath-for-JSON-over-the-wire. See, getting individual properties is nice, but sometimes you want more. You want to select only the members of the team who are above 14 years old. Or you just want to count these members rather than retrieve specific information about them individually. Or you just want a list of all the cities they live in. Etc.
But even though we want this, I am not convinced (anymore) that we need it.
What I know we need is better support for graph queries. Kingsley Idehen once provided a good explanation of why that is and how SPARQL and XML query languages (or now JSON query languages) complement one another (wouldn’t that be a nice trifecta: RDF/OWL’s precise modeling, JSON’s friendly syntax and SPARQL’s graph support – but I digress).
Going back to partial resource access, the last feature is the biggie: a fine-grained mechanism to update resource properties. That one is extra-hard.
Did you enjoy the first version of IBM’s Cloud Computing Reference Architecture? Did you even get certified on it? Then rejoice, because there’s a new version. IBM recently submitted the IBM Cloud Computing Reference Architecture 2.0 to The Open Group.
I’m a bit out of practice reading this kind of IBMese (let’s just say that The Open Group was the right place to submit it) but I would never let my readers down. So, even though these box-within-a-box-within-a-box diagrams (see section 2) give me flashbacks to the days of OGF and WSRF, I soldiered on.
I didn’t understand the goal of the document enough to give you a fair summary, but I can share some thoughts.
It starts by talking a lot about SOA. I initially thought this was to make the point that Glen Daniels articulated very well in this tweet:
Yup, correct SOA patterns (loose coupling, dyn refs, coarse interfaces…) are exactly what you need for cloud apps. You knew this.
But no. Rather than Glen’s astute remark, IBM’s point is one meta-level lower. It’s that “Cloud solutions are SOA solutions”. Which I have a harder time parsing. If you though “service” was overloaded before…
While some of the IBM authors are SOA experts, others apparently come from a Telco background so we get OSS/BSS analogies next.
By that point, I’ve learned that Cloud is like SOA except when it’s like Telco (but there’s probably another reference architecture somewhere that explains that Telco is SOA, so it all adds up).
One thing that chagrined me was that even though this document is very high-level it still manages to go down into implementatin technologies long enough to assert, wrongly, that virtualization is required for Cloud solutions. Another Cloud canard repeated here is the IaaS/PaaS/SaaS segmentation of the Cloud world, to which IBM adds a BPaaS (Business Process as a Service) layer for good measure (for my take on how Cloud relates to SOA, and how I dislike the IaaS/PaaS/SaaS pyramid, see this write-up of the presentation I gave at last year’s Cloud Connect, especially the 3rd picture).
It gets a lot better if you persevere to page 29, where the “Architecture Principles” finally get introduced (if had been asked to edit the paper, I would have only kept the last 6 pages). They are:
- Design for Cloud-scale Efficiencies: When realizing cloud characteristics such as elasticity, self-service access, and flexible sourcing, the cloud design is strictly oriented to high cloud scale efficiencies and short time-to-delivery/time-to-change. (“Efficiency Principle”)
- Support Lean Service Management: The Common Cloud Management Platform fosters lean and lightweight service management policies, processes, and technologies. (“Lightweightness Principle”)
- Identify and Leverage Commonalities: All commonalities are identified and leveraged in cloud service design. (“Economies-of-scale principle”)
- Define and Manage generically along the Lifecycle of Cloud Services: Be generic across I/P/S/BPaaS & provide ‘exploitation’ mechanism to support various cloud services using a shared, common management platform (“Genericity”).
Each principle gets a nickname, thanks to which IBM can refer to this list as the ELEG principles (Efficiency, Lightweightness, Economies-of-scale, Genericity). It also spells GLEE, but apparently that’s wasn’t the prefered sequence.
The first principle is hard to disagree with. The second also rings true, including its dings on ITIL (but the irony of IBM exhaulting “Lightweightness” is hard to ignore). The third and fourth principles (by that time I had lost too many brain cells to understand how they differ) really scared me. While I can understand the motivation, they elicited a vision of zombies in blue suits (presumably undead IBM Distinguish Engineers and Fellows) staggering towards me: “frameworks… we want frameworks…”.
There you go. If you want more information (and, more importantly, unbiased information) go read the Reference Architecture yourself. I am not involved in The Open Group, and I have no idea what it plans to do with it (and if it has received other submissions of the same type). Though I wouldn’t be surprised if I see, in 5 years, some panic sales rep asking an internal mailing list “The customer RPF asks for a mapping of our solution to the Open Group Cloud Reference Architecture and apparently IBM has 94 slides about it, what do I do? Has anyone heard about this Reference Architecture? This is urgent.”
Urgent things are long in the making.
I have a new definition for Cloud Computing. No, really.
Many discussions attempted to define Cloud Computing from the perspective of the consumer. To the point where asking “what’s a Cloud” has become a private joke for “let’s waste some time”. Eventually, people settled on the NIST set of definitions either because they like them (probability 0.1), they got tired of arguing (probability 0.4) or they want to sell to the government (probability 0.5).
Well, I have another one. Mine is a definition from the perspective of the Cloud provider (or the creator of Cloud-enablement software). And it’s a simple one.
A Cloud is a computing environment in which the runtime infrastructure and the management infrastructure are indistinguishable.
Ask engineers at Google App Engine to separate their code between the runtime part and the management part. They might not even understand the question.
For companies (like Oracle, where I work) that have a runtime division (Fusion Middleware for us) and a management division (Enterprise Manager), both of which ship products, it’s a challenge.
For companies which only offer one or the other, it’s a huge challenge.
For engineers who have to put it all together, it’s a great time to be in business.
The previous post (“Amazon proves that REST doesn’t matter for Cloud APIs”) attracted some interesting comments on the blog itself, on Hacker News and in a response post by Mike Pearce (where I assume the photo is supposed to represent me being an AWS fanboy). I failed to promptly follow-up on it and address the response, then the holidays came. But Mark Little was kind enough to pick the entry up for discussion on InfoQ yesterday which brought new readers and motivated me to write a follow-up.
Mark did a very good job at summarizing my point and he understood that I wasn’t talking about the value (or lack of value) of REST in general. Just about whether it is useful and important in the very narrow field of Cloud APIs. In that context at least, what seems to matter most is simplicity. And REST is not intrinsically simpler.
It isn’t a controversial statement in most places that RPC is easier than REST for developers performing simple tasks. But on the blogosphere I guess it needs to be argued.
Method calls is how normal developers write normal code. Doing it over the wire is the smallest change needed to invoke a remote API. The complexity with RPC has never been conceptual, it’s been in the plumbing. How do I serialize my method call and send it over? CORBA, RMI and SOAP tried to address that, none of them fully succeeded in keeping it simple and yet generic enough for the Internet. XML-RPC somehow (and unfortunately) got passed over in the process.
So what did AWS do? They pretty much solved that problem by using parameters in the URL as a dead-simple way to pass function parameters. And you get the response as an XML doc. In effect, it’s one-half of XML-RPC. Amazon did not invent this pattern. And the mechanism has some shortcomings. But it’s a pragmatic approach. You get the conceptual simplicity of RPC, without the need to agree on an RPC framework that tries to address way more than what you need. Good deal.
So, when Mike asks “Does the fact that AWS use their own implementation of an API instead of a standard like, oh, I don’t know, REST, frustrate developers who really don’t want to have to learn another method of communicating with AWS?” and goes on to answer “Yes”, I scratch my head. I’ve met many developers struggling to understand REST. I’ve never met a developer intimidated by RPC. As to the claim that REST is a “standard”, I’d like to read the spec. Please don’t point me to a PhD dissertation.
That being said, I am very aware that simplicity can come back to bite you, when it’s not just simple but simplistic and the task at hand demands more. Andrew Wahbe hit the nail on the head in a comment on my original post:
Exposing an API for a unique service offered by a single vendor is not going to get much benefit from being RESTful.
Revisit the issue when you are trying to get a single client to work across a wide range of cloud APIs offered by different vendors; I’m willing to bet that REST would help a lot there. If this never happens — the industry decides that a custom client for each Cloud API is sufficient (e.g. not enough offerings on the market, or whatever), then REST may never be needed.
Andrew has the right perspective. The usage patterns for Cloud APIs may evolve to the point where the benefits of following the rules of REST become compelling. I just don’t think we’re there and frankly I am not holding my breath. There are lots of functional improvements needed in Cloud services before the burning issue becomes one of orchestrating between Cloud providers. And while a shared RESTful API would be the easiest to orchestrate, a shared RPC API will still be very reasonably manageable. The issue will mostly be one of shared semantics more than protocol.
Mike’s second retort was that it was illogical for me to say that software developers are mostly isolated from REST because they use Cloud libraries. Aren’t these libraries written by developers? What about these, he asks. Well, one of them, Boto‘s Mitch Garnaat left a comment:
Good post. The vast majority of AWS (or any cloud provider’s) users never see the API. They interact through language libraries or via web-based client apps. So, the only people who really care are RESTafarians, and library developers (like me).
Perhaps it’s possible to have an API that’s so bad it prevents people from using it but the AWS Query API is no where near that bad. It’s fairly consistent and pretty easy to code to. It’s just not REST.
Yup. If REST is the goal, then this API doesn’t reach it. If usefulness is the goal, then it does just fine.
Mike’s third retort was to take issue with that statement I made:
The Rackspace people are technically right when they point out the benefits of their API compared to Amazon’s. But it’s a rounding error compared to the innovation, pragmatism and frequency of iteration that distinguishes the services provided by Amazon. It’s the content that matters.
Mike thinks that
If Rackspace are ‘technically’ right, then they’re right. There’s no gray area. Morally, they’re also right and mentally, physically and spiritually, they’re right.
Sure. They’re technically, mentally, physically and spiritually right. They may even be legally, ethically, metaphysically and scientifically right. Amazon is only practically right.
This is not a ding on Rackspace. They’ll have to compete with Amazon on service (and price), not on API, as they well know and as they are doing. But they are racing against a fast horse.
More generally, the debate about how much the technical merits of an API matters (beyond the point where it gets the job done) is a recurring one. I am talking as a recovering over-engineer.
In a post almost a year ago, James Watters declared that it matters. Mitch Garnaat weighed on the other side: “given how few people use the raw API we probably spend too much time worrying about details“, “maybe we worry too much about aesthetics“, “I still wonder whether we obsess over the details of the API’s a bit too much“ (in case you can’t tell, I’m a big fan of Mitch).
Speaking of people I admire, Shlomo Swidler (“in general, only library developers use the raw HTTP. Everyone else uses a library“) and Joe Arnold (“library integration (fog / jclouds / libcloud) is more important for new #IaaS providers than an API“) make the right point. Rather than spending hours obsessing about the finer points of your API, spend the time writing love letters to Mitch and Adrian so they support you in their libraries (also, allocate less of your design time to RESTfulness and more to the less glamorous subject of error handling).
OK, I’ll pile on two more expert testimonies. Righscale’s Thorsten von Eicken (“the API itself is more a programming exercise than a fundamental issue, it’s the semantics of the resources behind the API that really matter“) and F5′s Lori MacVittie (“the World Doesn’t Care About APIs“).
Bottom line, I see APIs a bit like military parades. Soldiers know better than to walk in tight formation, wearing bright colors and to the sound of fanfare into the battlefield. So why are parade exercises so prevalent in all armies? My guess is that they are used to impress potential enemies, reassure citizens and reflect on the strength of the country’s leaders. But military parades are also a way to ensure internal discipline. You may not need to use parade moves on the battlefield, but the fact that the unit is disciplined enough to perform them means they are also disciplined enough for the tasks that matter. Let’s focus on that angle for Cloud APIs. If your RPC API is consistent enough that its underlying model could be used as the basis for a REST API, you’re probably fine. You don’t need the drum rolls, stiff steps and the silly hats. And no need to salute either.
When Google released version 1.3.8 of the Google App Engine SDK in October, they introduced an instance console, showing you how many instances are serving your application and some basic metrics about these instances. I wrote a blog to consider the implications of providing this level of visibility to application administrators. It also pointed out some shortcomings of this first version of the console.
The most glaring problem was that the console showed an “average latency” which was just a straight average of the latencies of all the instances, independently of the traffic they see. Which is a meaningless number.
Today, Google released an update to the SDK (1.4), and along with it some minor updates to the instance console. Except that, as you can see below, the screen capture in their announcement happens to show three instances that have processed exactly the same number of messages. Which means that we can’t tell whether they have fixed the “unweighted average” problem or not. Is this just by chance? Google, WTF? (which stands for “what’s the formula?”, of course).
I decided it was worth spending a few minutes to find the answer. I don’t have any app currently in use on GAE, but it doesn’t take much work to generate enough load to wake up one of my old apps and get it to spin a couple of instances. Here is the resulting console instance:
If you run the numbers, you can see that they’ve fixed that issue; the average latency is now weighted based on instance traffic. Thanks Google for listening.
Apparently, not all the updates have trickled down to my version of the instance console. The “requests”, “errors” and “age” columns are missing. I assume they’re on their way. Seeing the age of the instances, especially, is a nice addition, one of those I requested in my blog.
In the grand scheme of things, these minor updates to the console (which remains quite basic) are not the big news. The major announcement with SDK 1.4 is that the dreaded 30 seconds limit on execution time has been lifted for background tasks (those from Task Queue and Cron). It’s now a much more manageable 10 minutes. This doesn’t apply to the execution of Web requests served by your app.
Google App Engine has been under criticism recently, and that 30-second limit (along with reliability issues) figured prominently in the complains. Assuming the reliability issues are also coming under control, this update will go a long way towards addressing these issues.
Just so you realize how lucky you are if you are just now starting with Google App Engine, here are the kind of hoops you had to jump through, in the early days, to process any task that took a significant amount of time. This was done a year before the Cron and Task Queue features were added to GAE.
Another nice addition with SDK 1.4 is that you can now retrieve the source code of your application from Google’s servers. Of course you should never need that if you are rigorous and well-organized… Presumably this is only for Python since in the Java case Google’s servers never see the source code.
The steady progress of the GAE SDK continues.
Alex Scordellis has a good blog post about how to handle partial PUT in REST. It starts by explaining why partial PUT is needed in the first place. And then (including in the comments) it runs into the issues this brings and proposes some solutions.
I have bad news. There are many more issues.
Let’s pick a simple example. What does it mean if an element is not present in a partial update? Is it an explicit omission, intended to represent the need to remove this element in the representation? Or does it mean “don’t change its current value”. If the latter, then how do I do removal? Do I need partial DELETE like I have partial PUT? Hopefully not, but then I have to have a mechanism to remove elements as part of a PUT. Empty value? That doesn’t necessarily mean the same thing as an absent element. Nil value? And how do I handle this with JSON?
And how do you deal with repeating elements? If you PUT an element of that type, is it an addition or a replacement? If replacement, which one(s) are you replacing? Or do you force me to PUT the entire list? No matter how long it is? Even if it increases the risk of concurrency issues?
Lots of similar issues. These two are just off the top of my head, memories from hours locked in a room with my HP, IBM, Intel and Microsoft accomplices.
You know what you end up with? You end up with this. Partial Put in WS-RT. I can hear you scream from here.
I am the ghost of dead partial update mechanisms, coming back to haunt you…
As much as WS-* was criticized for re-inventing HTTP, what we see here is HTTP people re-inventing partial resource update mechanisms like those in WSDM, WS-Management and WS-ResourceTransfer. Which is fine, I am in no way advocating that they should re-use these specs.
But let’s realize that while a lot of the complexity in WS-* was unnecessary, some of it actually was a reflection of the complexity of the task at hand. And that complexity doesn’t go away because you get rid of a SOAP envelope and of stupid WS-Addressing headers.
The good news is that we’ve made a lot of the mistakes already and we’ve learned some lessons (see this technical rant, this post-mortem or this experiment). The bad news is that there are plenty of new mistakes waiting to be made.
Good luck. I mean it sincerely.
It’s all in the title of the post. An elevator pitch short enough for a 1-story ride. A description for business people. People who don’t want to hear about models, virtualization, blueprints and devops. But people who also don’t want to be insulted with vague claims about “business/IT alignment” and “agility”.
The focus is on repeatability. Repeatability saves work and allows new approaches. I’ve found spreadsheets (and “super-spreadsheets”, i.e. more advanced BI tools) to be a good analogy for business people. Compared to analysts furiously operating calculators, spreadsheets save work and prevent errors. But beyond these cost savings, they allow you to do things you wouldn’t even try to do without them. It’s not just the same process, done faster and cheaper. It’s a more mature way of running your business.
Same with the “Cloud” style of IT management.
The promise of PaaS is that application owners don’t need to worry about the infrastructure that powers the application. They just provide application artifacts (e.g. WAR files) and everything else is taken care of. Backups. Scaling. Infrastructure patching. Network configuration. Geographic distribution. Etc. All these headaches are gone. Just pick from a menu of quality of service options (and the corresponding price list). Make your choice and forget about it.
In practice no abstraction is leak-proof and the abstractions provided by PaaS environments are even more porous than average. The first goal of PaaS providers should be to shore them up, in order to deliver on the PaaS value proposition of simplification. But at some point you also have to acknowledge that there are some irreducible leaks and take pragmatic steps to help application administrators deal with them. The worst thing you can do is have application owners suffer from a leaky abstraction and refuse to even acknowledge it because it breaks your nice mental model.
Google App Engine (GAE) gives us a nice and simple example. When you first deploy an application on GAE, it is deployed as just one instance. As traffic increases, a second instance comes up to handle the load. Then a third. If traffic decreases, one instance may disappear. Or one of them may just go away for no reason (that you’re aware of).
It would be nice if you could deploy your application on what looks like a single, infinitely scalable, machine and not ever have to worry about horizontal scale-out. But that’s just not possible (at a reasonable cost) so Google doesn’t try particularly hard to hide the fact that many instances can be involved. You can choose to ignore that fact and your application will still work. But you’ll notice that some requests take a lot more time to complete than others (which is typically the case for the first request to hit a new instance). And some requests will find an empty local cache even though your application has had uninterrupted traffic. If you choose to live with the “one infinitely scalable machine” simplification, these are inexplicable and unpredictable events.
Last week, as part of the release of the GAE SDK 1.3.8, Google went one step further in acknowledging that several instances can serve your application, and helping you deal with it. They now give you a console (pictured below) which shows the instances currently serving your application.
I am very glad that they added this console, because it clearly puts on the table the question of how much your PaaS provider should open the kimono. What’s the right amount of visibility, somewhere between “one infinitely scalable computer” and giving you fan speeds and CPU temperature?
I don’t know what the answer is, but unfortunately I am pretty sure this console is not it. It is supposed to be useful “in debugging your application and also understanding its performance characteristics“. Hmm, how so exactly? Not only is this console very simple, it’s almost useless. Let me enumerate the ways.
Actually it’s worse than useless, it’s misleading. As we can see on the screen shot, two of the instances saw no traffic during the collection period (which, BTW, we don’t know the length of), while the third one did all the work. At the top, we see an “average latency” value. Averaging latency across instances is meaningless if you don’t weight it properly. In this case, all the requests went to the instance that had an average latency of 1709ms, but apparently the overall average latency of the application is 569.7ms (yes, that’s 1709/3). Swell.
No instance identification
What happens when the console is refreshed? Maybe there will only be two instances. How do I know which one went away? Or say there are still three, how do I know these are the same three? For all I know it could be one old instance and two new ones. The single most important data point (from the application administrator’s perspective) is when a new instance comes up. I have no way, in this UI, to know reliably when that happens: no instance identification, no indication of the age of an instance.
So we get the average memory per instance. What are we supposed to do with that information? What’s a good number, what’s a bad number? How much memory is available? Is my app memory-bound, CPU-bound or IO-bound on this instance?
As I have described before, change and configuration management in a PaaS setting is a thorny problem. This console doesn’t tackle it. Nowhere does it say which version of the GAE platform each instance is running. Google announces GAE SDK releases (the bits you download), but these releases are mostly made of new platform features, so they imply a corresponding update to Google’s servers. That can’t happen instantly, there must be some kind of roll-out (whether the instances can be hot-patched or need to be recycled). Which means that the instances of my application are transitioned from one platform version to another (and presumably that at a given point in time all the instances of my application may not be using the same platform version). Maybe that’s the source of my problem. Wouldn’t it be nice if I knew which platform version an instance runs? Wouldn’t it be nice if my log files included that? Wouldn’t it be nice if I could request an app to run on a specific platform version for debugging purpose? Sure, in theory all the upgrades are backward-compatible, so it “shouldn’t matter”. But as explained above, “the worst thing you can do is have application owners suffer from a leaky abstraction and refuse to even acknowledge it“.
OK, so the instance monitoring console Google just rolled out is seriously lacking. As is too often the case with IT monitoring systems, it reports what is convenient to collect, not what is useful. I’m sure they’ll fix it over time. What this console does well (and really the main point of this blog) is illustrate the challenge of how much information about the underlying infrastructure should be surfaced.
Surface too little and you leave application administrators powerless. Surface more data but no control and you’ll leave them frustrated. Surface some controls (e.g. a way to configure the scaling out strategy) and you’ve taken away some of the PaaS simplicity and also added constraints to your infrastructure management strategy, making it potentially less efficient. If you go down that route, you can end up with the other flavor of PaaS, the IaaS-based PaaS in which you have an automated way to create a deployment but what you hand back to the application administrator is a set of VMs to manage.
That IaaS-centric PaaS is a well-understood beast, to which many existing tools and management practices can be applied. The “pure PaaS” approach pioneered by GAE is much more of a terra incognita from a management perspective. I don’t know, for example, whether exposing the platform version of each instance, as described above, is a good idea. How leaky is the “platform upgrades are always backward-compatible” assumption? Google, and others, are experimenting with the right abstraction level, APIs, tools, and processes to expose to application administrators. That’s how we’ll find out.
A bicycle is a convenient way to go buy cigarettes. Until one day you realize that buying cigarettes is a bad idea. At which point you throw away your bicycle.
Sounds illogical? Well, that’s pretty much what the industry has done with service descriptions. It went this way: people used WSDL (and stub generation tools built around it) to build distributed applications that shared too much. Some people eventually realized that was a bad approach. So they threw out the whole idea of a service description. And now, in the age of APIs, we are no more advanced than we were 15 years ago in terms of documenting application contracts. It’s tragic.
The main fallacies involved in this stagnation are:
- Assuming that service descriptions are meant to auto-generate all-encompassing program stubs,
- Looking for the One True Description for a given service,
- Automatically validating messages based on the service description.
I’ll leave the first one aside, it’s been widely covered. Let’s drill in a bit into the other two.
There is NOT One True Description for a given service
Many years ago, in the same galaxy where we live today (only a few miles from here, actually), was a development team which had to implement a web service for a specific WSDL. They fed the WSDL to their SOAP stack. This was back in the days when WSDL interoperability was a “promise” in the “political campaign” sense of the term so of course it didn’t work. As a result, they gave up on their SOAP stack and implemented the service as a servlet. Which, for a team new to XML, meant a long delay and countless bugs. I’ll always remember the look on the dev lead’s face when I showed him how 2 minutes and a text editor were all you needed to turn the offending WSDL in to a completely equivalent WSDL (from the point of view of on-the-wire messages) that their toolkit would accept.
(I forgot what the exact issue was, maybe having operations with different exchange patterns within the same PortType; or maybe it used an XSD construct not supported by the toolkit, and it was just a matter of removing this constraint and moving it from schema to code. In any case something that could easily be changed by editing the WSDL and the consumer of the service wouldn’t need to know anything about it.)
A service description is not the literal word of God. That’s true no matter where you get it from (unless it’s hand-delivered by an angel, I guess). Just because adding “?wsdl” to the URL of a Web service returns an XML document doesn’t mean it’s The One True Description for that service. It’s just the most convenient one to generate for the app server on which the service is deployed.
One of the things that most hurts XML as an on-the-wire format is XSD. But not in the sense that “XSD is bad”. Sure, it has plenty of warts, but what really hurts XML is not XSD per se as much as the all-too-common assumption that if you use XML you need to have an XSD for it (see fat-bottomed specs, the key message of which I believe is still true even though SML and SML-IF are now dead).
I’ve had several conversations like this one:
- The best part about using JSON and REST was that we didn’t have to deal with XSD.
- So what do you use as your service contract?
- Nothing. Just a human-readable wiki page.
- If you don’t need a service contract, why did you feel like you had to write an XSD when you were doing XML? Why not have a similar wiki page describing the XML format?
It’s perfectly fine to have service descriptions that are optimized to meet a specific need rather than automatically focusing on syntax validation. Not all consumers of a service contract need to be using the same description. It’s ok to have different service descriptions for different users and/or purposes. Which takes us to the next fallacy. What are service descriptions for if not syntax validation?
A service description does NOT mean you have to validate messages
As helpful as “validation” may seem as a concept, it often boils down to rejecting messages that could otherwise be successfully processed. Which doesn’t sound quite as useful, does it?
There are many other ways in which service descriptions could be useful, but they have been largely neglected because of the focus on syntactic validation and stub generation. Leaving aside development use cases and looking at my area of focus (application management), here are a few use cases for service descriptions:
Creating test messages (aka “synthetic transactions”)
A common practice in application management is to send test messages at regular intervals (often from various locations, e.g. branch offices) to measure the availability and response time of an application from the consumer’s perspective. If a WSDL is available for the service, we use this to generate the skeleton of the test message, and let the admin fill in appropriate values. Rather than a WSDL we’d much rather have a “ready-to-use” (possibly after admin review) test message that would be provided as part of the service description. Especially as it would be defined by the application creator, who presumably knows a lot more about that makes a safe and yet relevant message to send to the application as a ping.
Attaching policies and SLAs
One of the things that WSDLs are often used for, beyond syntax validation and stub generation, is to attach policies and SLAs. For that purpose, you really don’t need the XSD message definition that makes up so much of the WSDL. You really just need a way to identify operations on which to attach policies and SLAs. We could use a much simpler description language than WSDL for this. But if you throw away the very notion of a description language, you’ve thrown away the baby (a classification of the requests processed by the service) along with the bathwater (a syntax validation mechanism).
Governance / versioning
One benefit of having a service description document is that you can see when it changes. Even if you reduce this to a simple binary value (did it change since I last checked, y/n) there’s value in this. Even better if you can introspect the description document to see which requests are affected by the change. And whether the change is backward-compatible. Offering the “before” XSD and the “after” XSD is almost useless for automatic processing. It’s unlikely that some automated XSD inspection can tell me whether I can keep using my previous messages or I need to update them. A simple machine-readable declaration of that fact would be a lot more useful.
I just listed three, but there are other application management use cases, like governance/auditing, that need a service description.
In the SOAP world, we usually make do with WSDL for these tasks, not because it’s the best tool (all we really need is a way to classify requests in “buckets” – call them “operations” if you want – based on the content of the message) but because WSDL is the only understanding that is shared between the caller and the application.
By now some of you may have already drafted in your head the comment you are going to post explaining why this is not a problem if people just use REST. And it’s true that with REST there is a default categorization of incoming messages. A simple matrix with the various verbs as columns (GET, POST…) and the various resource types as rows. Each message can be unambiguously placed in one cell of this matrix, so I don’t need a service description document to have a request classification on which I can attach SLAs and policies. Granted, but keep these three things in mind:
- This default categorization by verb and resource type can be a quite granular. Typically you wouldn’t have that many different policies on your application. So someone who understands the application still needs to group the invocations into message categories at the right level of granularity.
- This matrix is only meaningful for the subset of “RESTful” apps that are truly… RESTful. Not for all the apps that use REST where it’s an easy mental mapping but then define resource types called “operations” or “actions” that are just a REST veneer over RPC.
- Even if using REST was a silver bullet that eliminated the need for service definitions, as an application management vendor I don’t get to pick the applications I manage. I have to have a solution for what customers actually do. If I restricted myself to only managing RESTful applications, I’d shrink my addressable market by a few orders of magnitude. I don’t have an MBA, but it sounds like a bad idea.
This is not a SOAP versus REST post. This is not a XML versus JSON post. This is not a WSDL versus WADL post. This is just a post lamenting the fact that the industry seems to have either boxed service definitions into a very limited use case, or given up on them altogether. It I wasn’t recovering from standards burnout, I’d look into a versatile mechanism for describing application services in a way that is geared towards message classification more than validation.
Some organizations just have “systems administrators” in charges of their applications. Others call out an “application administrator” role but it is usually overloaded: it doesn’t separate the application platform administrator from the true application administrator. The first one is in charge of the application runtime infrastructure (e.g. the application server, SOA tools, MDM, IdM, message bus, etc). The second is in charge of the applications themselves (e.g. Java applications and the various artifacts that are used to customize the middleware stack to serve the application).
In effect, I am describing something close to the split between the DBA and the application administrators. The first step is to turn this duo (app admin, DBA) into a triplet (app admin, platform admin, DBA). That would be progress, but such a triplet is not actually what I am really after as it is too strongly tied to a traditional 3-tier architecture. What we really need is a first-order separation between the application administrator and the infrastructure administrators (not the plural). And then, if needed, a second-order split between a handful of different infrastructure administrators, one of which may be a DBA (or a DBA++, having expanded to all data storage services, not just relational), another of which may be an application platform administrator.
There are two reasons for the current unfortunate amalgam of the “application administrator” and “application platform administrator” roles. A bad one and a good one.
The bad reason is a shortcomings of the majority of middleware products. While they generally do a good job on performance, reliability and developer productivity, they generally do a poor job at providing a clean separation of the performance/administration functions that are relevant to the runtime and those that are relevant to the deployed applications. Their usual role definitions are more structured along the lines of what actions you can perform rather than on what entities you can perform them. From a runtime perspective, the applications are not well isolated from one another either, which means that in real life you have to consider the entire system (the middleware and all deployed applications) if you want to make changes in a safe way.
The good reason for the current lack of separation between application administrators and middleware administrators is that middleware products have generally done a good job of supporting development innovation and optimization. Frameworks appear and evolve to respond to the challenges encountered by developers. Knobs and dials are exposed which allow heavy customization of the runtime to meet the performance and feature needs of a specific application. With developers driving what middleware is used and how it is used, it’s a natural consequence that the middleware is managed in tight correlation with how the application is managed.
Just like there is tension between DBAs and the “application people” (application administrators and/or developers), there is an inherent tension in the split I am advocating between application management and application platform management. The tension flows from the previous paragraph (the “good reason” for the current amalgam): a split between application administrators and application platform administrators would have the downside of dampening application platform innovation. Or rather it redirects it, in a mutation not unlike the move from artisans to industry. Rather than focusing on highly-specialized frameworks and highly-tuned runtimes, the application platform innovation is redirected towards the goals of extreme cost efficiency, high reliability, consistent security and scalability-by-default. These become the main objectives of the application platform administrator. In that perspective, the focus of the application architect and the application administrator needs to switch from taking advantage of the customizability of the runtime to optimize local-node performance towards taking advantage of the dynamism of the application platform to optimize for scalability and economy.
Innovation in terms of new frameworks and programming models takes a hit in that model, but there are ways to compensate. The services offered by the platform can be at different levels of generality. The more generic ones can be used to host innovative application frameworks and tools. For example, a highly-specialized service like an identity management system is hard to use for another purpose, but on the other hand a JVM can be used to host not just business applications but also platform-like things like Hadoop. They can run in the “application space” until they are mature enough to be incorporated in the “application platform space” and become the responsibility of the application platform administrator.
The need to keep a door open for innovation is part of why, as much as I believe in PaaS, I don’t think IaaS is going away anytime soon. Not only do we need VMs for backward-looking legacy apps, we also need polyvalent platforms, like a VM, for forward-looking purposes, to allow developers to influence platform innovation, based on their needs and ideas.
Forget the guillotine, maybe I should carry an axe around. That may help get the point across, that I want to slice application administrators in two, head to toe. PaaS is not a question of runtime. It’s a question of administrative roles.
From ancient Mesopotamia to, more recently, Holland, Switzerland, Japan, Singapore and Korea, the success of many societies has been in part credited to their lack of natural resources. The theory being that it motivated them to rely on human capital, commerce and innovation rather than resource extraction. This approach eventually put them ahead of their better-endowed neighbors.
A similar dynamic may well propel Microsoft ahead in PaaS (Platform as a Service): IaaS with Windows is so painful that it may force Microsoft to focus on PaaS. The motivation is strong to “go up the stack” when the alternative is to cultivate the arid land of Windows-based IaaS.
I should disclose that I work for one of Microsoft’s main competitors, Oracle (though this blog only represents personal opinions), and that I am not an expert Windows system administrator. But I have enough experience to have seen some of the many reasons why Windows feels like a much less IaaS-friendly environment than Linux: e.g. the lack of SSH, the cumbersomeness of RDP, the constraints of the Windows license enforcement system, the Windows update mechanism, the immaturity of scripting, the difficulty of managing Windows from non-Windows machines (despite WS-Management), etc. For a simple illustration, go to EC2 and compare, between a Windows AMI and a Linux AMI, the steps (and time) needed to get from selecting an image to the point where you’re logged in and in control of a VM. And if you think that’s bad, things get even worse when we’re not just talking about a few long-lived Windows server instances in the Cloud but a highly dynamic environment in which all steps have to be automated and repeatable.
I am not saying that there aren’t ways around all this, just like it’s not impossible to grow grapes in Holland. It’s just usually not worth the effort. This recent post by RighScale illustrates both how hard it is but also that it is possible if you’re determined. The question is what benefits you get from Windows guests in IaaS and whether they justify the extra work. And also the additional license fee (while many of the issues are technical, others stem more from Microsoft’s refusal to acknowledge that the OS is a commodity). [Side note: this discussion is about Windows as a guest OS and not about the comparative virtues of Hyper-V, Xen-based hypervisors and VMWare.]
Under the DSI banner, Microsoft has been working for a while on improving the management/automation infrastructure for Windows, with tools like PowerShell (which I like a lot). These efforts pre-date the Cloud wave but definitely help Windows try to hold it own on the IaaS battleground. Still, it’s an uphill battle compared with Linux. So it makes perfect sense for Microsoft to move the battle to PaaS.
Just like commerce and innovation will, in the long term, bring more prosperity than focusing on mining and agriculture, PaaS will, in the long term, yield more benefits than IaaS. Even though it’s harder at first. That’s the good news for Microsoft.
On the other hand, lack of natural resources is not a guarantee of success either (as many poor desertic countries can testify) and Microsoft will have to fight to be successful in PaaS. But the work on Azure and many research efforts, like the “next-generation programming model for the cloud” (codename “Orleans”) that Mary Jo Foley revealed today, indicate that they are taking it very seriously. Their approach is not restricted by a VM-centric vision, which is often tempting for hypervisor and OS vendors. Microsoft’s move to PaaS is also facilitated by the fact that, while system administration and automation may not be a strength, development tools and application platforms are.
The forward-compatible Cloud will soon overshadow the backward-compatible Cloud and I expect Microsoft to play a role in it. They have to.
I’ve been tracking the modeling technology previously known as “Microsoft Oslo” with a sympathetic eye for the almost three years since it’s been introduced. I look at it from the perspective of model-driven IT management but the news hadn’t been good on that front lately (except for Douglas Purdy’s encouraging hint).
The prospects got even bleaker today, at least according to the usually-well-informed Mary Jo Foley, who writes: “Multiple contacts of mine are telling me that Microsoft has decided to shelve Quadrant and ‘refocus’ M.” Is “M” the end of the SDM/SML/M model-driven management approach at Microsoft? Or is the “refocus” a hint that M is returning “home” to address IT management use cases? Time (or Doug) will tell…
While we’re talking about Microsoft and IT automation, I have one piece of free advice for the Microsofties: people *really* want to SSH into Windows servers. Here’s how I know. This blog rarely talks about Microsoft but over the course of two successive weekends over a year ago I toyed with ways to remotely manage Windows machines using publicly documented protocols. In effect, showing what to send on the wire (from Linux or any platform) to leverage the SOAP-based management capabilities in recent versions of Windows. To my surprise, these posts (1, 2, 3) still draw a disproportionate amount of traffic. And whenever I look at my httpd logs, I can count on seeing search engine queries related to “windows native ssh” or similar keywords.
If heterogeneous Cloud is something Microsoft cares about they need to better leverage the potential of the PowerShell Remoting Protocol. They can release open-source Python, Java and Ruby client-side libraries. Alternatively, they can drastically simplify the protocol, rather than its current “binary over SOAP” (you read this right) incarnation. Because the poor Kridek who is looking for the “WSDL for WinRM / Remote Powershell” is in for a nasty surprise if he finds it and thinks he’ll get a ready-to-use stub out of it.
That being said, a brave developer willing to suck it up and create such a Python/Ruby/Java library would probably make some people very grateful.
Oracle recently published a Cloud management API on OTN and also submitted a subset of the API to the new DMTF Cloud Management working group. The OTN specification, titled “Oracle Cloud Resource Model API”, is available here. In typical DMTF fashion, the DMTF-submitted specification is not publicly available (if you have a DMTF account and are a member of the right group you can find it here). It is titled the “Oracle Cloud Elemental Resource Model” and is essentially the same as the OTN version, minus sections 9.2, 9.4, 9.6, 9.8, 9.9 and 9.10 (I’ll explain below why these sections have been removed from the DMTF submission). Here is also a slideset that was recently used to present the submitted specification at a DMTF meeting.
So why two documents? Because they serve different purposes. The Elemental Resource Model, submitted to DMTF, represents the technical foundation for the IaaS layer. It’s not all of IaaS, just its core. You can think of its scope as that of the base EC2 service (boot a VM from an image, attach a volume, connect to a network). It’s the part that appears in all the various IaaS APIs out there, and that looks very similar, in its model, across all of them. It’s the part that’s ripe for a simple standard, hopefully free of much of the drama of a more open-ended and speculative effort. A standard that can come out quickly and provide interoperability right out of the gate (for the simple use cases it supports), not after years of plugfests and profiles. This is the narrow scope I described in an earlier rant about Cloud standards:
I understand the pain of customers today who just want to have a bit more flexibility and portability within the limited scope of the VM/Volume/IP offering. If we really want to do a standard today, fine. Let’s do a very small and pragmatic standard that addresses this. Just a subset of the EC2 API. Don’t attempt to standardize the virtual disk format. Don’t worry about application-level features inside the VM. Don’t sweat the REST or SOA purity aspects of the interface too much either. Don’t stress about scalability of the management API and batching of actions. Just make it simple and provide a reference implementation. A few HTTP messages to provision, attach, update and delete VMs, volumes and IPs. That would be fine. Anything else (and more is indeed needed) would be vendor extensions for now.
Of course IaaS goes beyond the scope of the Elemental Resource Model. We’ll need load balancing. We’ll need tunneling to the private datacenter. We’ll need low-latency sub-networks. We’ll need the ability to map multi-tier applications to different security zones. Etc. Some Cloud platforms support some of these (e.g. Amazon has an answer to all but the last one), but there is a lot more divergence (both in the “what” and the “how”) between the various Cloud APIs on this. That part of IaaS is not ready for standardization.
Then there are the extensions that attempt to make the IaaS APIs more application-aware. These too exist in some Cloud APIs (e.g. vCloud vApp) but not others. They haven’t naturally converged between implementations. They haven’t seen nearly as much usage in the industry as the base IaaS features. It would be a mistake to overreach in the initial phase of IaaS standardization and try to tackle these questions. It would not just delay the availability of a standard for the base IaaS use cases, it would put its emergence and adoption in jeopardy.
This is why Oracle withheld these application-aware aspects from the DMTF submission, though we are sharing them in the specification published on OTN. We want to expose them and get feedback. We’re open to collaborating on them, maybe even in the scope of a standard group if that’s the best way to ensure an open IP framework for the work. But it shouldn’t make the upcoming DMTF IaaS specification more complex and speculative than it needs to be, so we are keeping them as separate extensions. Not to mention that DMTF as an organization has a lot more infrastructure expertise than middleware and application expertise.
Again, the “Elemental Resource Model” specification submitted to DMTF is the same as the “Oracle Cloud Resource Model API” on OTN except that it has a different license (a license grant to DMTF instead of the usual OTN license) and is missing some resources in the list of resource types (section 9).
Both specifications share the exact same protocol aspects. It’s pretty cleanly RESTful and uses a JSON serialization. The credit for the nice RESTful protocol goes to the folks who created the original Sun Cloud API as this is pretty much what the Oracle Cloud API adopted in its entirety. Tim Bray described the genesis and design philosophy of the Sun Cloud API last year. He also described his role and explained that “most of the heavy lifting was done by Craig McClanahan with guidance from Lew Tucker“. It’s a shame that the Oracle specification fails to credit the Sun team and I kick myself for not noticing this in my reviews. This heritage was noted from the get go in the slides and is, in my mind, a selling point for the specification. When I reviewed the main Cloud APIs available last summer (the first part in a “REST in practice for IT and Cloud management” series), I liked Sun’s protocol design the best.
The resource model, while still based on the Sun Cloud API, has seen many more changes. That’s where our tireless editor, Jack Yu, with help from Mark Carlson, has spent most of the countless hours he devoted to the specification. I won’t do a point to point comparison of the Sun model and the Oracle model, but in general most of the changes and additions are motivated by use cases that are more heavily tilted towards private clouds and compatibility with existing application infrastructure. For example, the semantics of a Zone have been relaxed to allow a private Cloud administrator to choose how to partition the Cloud (by location is an obvious option, but it could also by security zone or by organizational ownership, as heretic as this may sound to Cloud purists).
The most important differences between the DMTF and OTN versions relate to the support for assemblies, which are groups of VMs that jointly participate in the delivery of a composite application. This goes hand-in-hand with the recently-released Oracle Virtual Assembly Builder, a framework for creating, packing, deploying and configuring multi-tier applications. To support this approach, the Cloud Resource Model (but not the Elemental Model, as explained above) adds resource types such as AssemblyTemplate, AssemblyInstance and ScalabilityGroup.
So what now? The DMTF working group has received a large number of IaaS APIs as submissions (though not the one that matters most or the one that may well soon matter a lot too). If all goes well it will succeed in delivering a simple and useful standard for the base IaaS use cases, and we’ll be down to a somewhat manageable triplet (EC2, RackSpace/OpenStack and DMTF) of IaaS specifications. If not (either because the DMTF group tries to bite too much or because it succumbs to infighting) then DMTF will be out of the game entirely and it will be between EC2, OpenStack and a bunch of private specifications. It will be the reign of toolkits/library/brokers and hell on earth for all those who think that such a bridging approach is as good as a standard. And for this reason it will have to coalesce at some point.
As far as the more application-centric approach to hypervisor-based Cloud, well, the interesting things are really just starting. Let’s experiment. And let’s talk.
Bernd Harzog recently wrote a blog entry to examine whether “the CMDB [is] irrelevant in a Virtual and Cloud based world“. If I can paraphrase, his conclusion is that there will be something that looks like a CMDB but the current CMDB products are ill-equipped to fulfill that function. Here are the main reasons he gives for this prognostic:
- A whole new class of data gets created by the virtualization platform – specifically how the virtualization platform itself is configured in support of the guests and the applications that run on the guest.
- A whole new set of relationships between the elements in this data get created – specifically new relationships between hosts, hypervisors, guests, virtual networks and virtual storage get created that existing CMDB’s were not built to handle.
- New information gets created at a very rapid rate. Hundreds of new guests can get provisioned in time periods much too short to allow for the traditional Extract, Transform and Load processes that feed CMDB’s to be able to keep up.
- The environment can change at a rate that existing CMDB’s cannot keep up with. Something as simple as vMotion events can create thousands of configuration changes in a few minutes, something that the entire CMDB architecture is simply not designed to keep up with.
- Having portions of IT assets running in a public cloud introduces significant data collection challenges. Leading edge APM vendors like New Relic and AppDynamics have produced APM products that allow these products to collect the data that they need in a cloud friendly way. However, we are still a long way away from having a generic ability to collect the configuration data underlying a cloud based IT infrastructure – notwithstanding the fact that many current cloud vendors would not make this data available to their customers in the first place.
- The scope of the CMDB needs to expand beyond just asset and configuration data and incorporate Infrastructure Performance, Applications Performance and Service assurance information in order to be relevant in the virtualization and cloud based worlds.
I wanted to expand on some of these points.
New model elements for Cloud (bullets #1 and #2)
These first bullets are not the killers. Sure, the current CMDBs were designed before the rise of virtualized environment, but they are usually built on a solid modeling foundation that can easily be extend with new resources classes. I don’t think that extending the model to describe VM, VNets, Volumes, hypervisors and their relationships to the physical infrastructure is the real challenge.
New approach to “discovery” (bullets #3 and #4)
This, on the other hand is much more of a “dinosaurs meet meteorite” kind of historical event. A large part of the value provided by current CMDBs is their ability to automate resource discovery. This is often achieved via polling/scanning (at the hardware level) and heuristics/templates (directory names, port numbers, packet inspection, bird entrails…) for application discovery. It’s imprecise but often good enough in static environments (and when it fails, the CMDB complements the automatic discovery with a reconciliation process to let the admin clean things up). And it used to be all you could get anyway so there wasn’t much point complaining about the limitations. The crown jewel of many of today’s big CMDBs can often be traced back to smart start-ups specialized in application discovery/mapping, like Appilog (now HP, by way of Mercury) and nLayers (now EMC). And more recently the purchase of Tideway by BMC (ironically – but unsurprisingly – often cast in Cloud terms).
But this is not going to cut it in “the Cloud” (by which I really mean in a highly automated IT environment). As Bernd Harzog explains, the rate of change can completely overwhelm such discovery heuristics (plus, some of the network scans they sometimes use will get you in trouble in public clouds). And more importantly, there now is a better way. Why discover when you can ask? If resources are created via API calls, there are also API calls to find out which resources exist and how they are configured. This goes beyond the resources accessible via IaaS APIs, like what VMWare, EC2 and OVM let you retrieve. This “don’t guess, ask” approach to discovery needs to also apply at the application level. Rather than guessing what software is installed via packet inspection or filesystem spelunking, we need application-aware discovery that retrieves the application and configuration and dependencies from the application itself (or its underlying framework). And builds a model in which the connections between application entities are expressed in terms of the configuration settings that drive them rather than the side effects by which they can be noticed.
If I can borrow the words of Lew Cirne:
“All solutions built in the pre-cloud era are modeled on jvms (or their equivalent), hosts and ports, rather than the logical application running in a more fluid environment. If the solution identifies a web application by host/port or some other infrastructural id, then you cannot effectively manage it in a cloud environment, since the app will move and grow, and your management system (that is, everything offered by the Big 4, as well as all infrastructure management companies that pay lip service to the application) will provide nearly-useless visibility and extraordinarily high TCO.”
I don’t agree with everything in Lew Cirne’s post, but this diagnostic is correct and well worded. He later adds:
“So application management becomes the strategic center or gravity for the client of a public or private cloud, and infrastructure-centric tools (even ones that claim to be cloudy) take on a lesser role.”
Which is also very true even if counter-intuitive for those who think that
cloud = virtualization (in the “fake machine” interpretation of virtualization)
Embracing such a VM-centric view naturally raises the profile of infrastructure management compared to application management, which is a fallacy in Cloud computing.
Drawing the line between Cloud infrastructure management and application management (bullet #5)
This is another key change that traditional CMDBs are going to have a hard time with. In a Big-4 CMDB, you’re after the mythical “single source of truth”. Even in a federated CMDB (which doesn’t really exist anyway), you’re trying to have a unified logical (if not physical) repository of information. There is an assumption that you want to manage everything from one place, so you can see all the inter-dependencies, across all layers of the stack (even if individual users may have a scope that is limited by permissions). Not so with public Clouds and even, I would argue, any private Cloud that is more than just a “cloud” sticker slapped on an old infrastructure. The fact that there is a clean line between the infrastructure model and the application model is not a limitation. It is empowering. Even if your Cloud provider was willing to expose a detailed view of the underlying infrastructure you should resist the temptation to accept. Despite the fact that it might be handy in the short term and provide an interesting perspective, it is self-defeating in the long term from the perspective of realizing the productivity improvements promised by the Cloud. These improvements require that the infrastructure administrator be freed from application-specific issues and focus on meeting the contract of the platform. And that the application administrator be freed from infrastructure-level concerns (while at the same time being empowered to diagnose application-level concerns). This doesn’t mean that the application and infrastructure models should be disconnected. There is a contract and both models (infrastructure and consumption) should represent it in the same way. It draws a line, albeit one with some width.
Blurring the line between configuration and monitoring (bullet #6)
This is another shortcoming of current CMDBs, but one that I think is more easily addressed. The “contract” between the Cloud infrastructure and the consuming application materializes itself in a mix of configuration settings, administrative capabilities and monitoring data. This contract is not just represented by the configuration-centric Cloud API that immediately comes to mind. It also includes the management capabilities and monitoring points of the resulting instances/runtimes.
Whether all these considerations mean that traditional CMDBs are doomed in the Cloud as Bernd Harzog posits, I don’t know. In this post, BMC’s Kia Behnia acknowledges the importance of application management, though it’s not clear that he agrees with their primacy. I am also waiting to see whether the application management portfolio he has assembled can really maps to the new methods of application discovery and management.
But these are resourceful organization, with plenty of smart people (as I can testify: in the end of my HP tenure I worked with the very sharp CMDB team that came from the Mercury acquisition). And let’s keep in mind that customers also value the continuity of support of their environment. Most of them will be dealing with a mix of old-style and Cloud applications and they’ll be looking for a unified management approach. This helps CMDB incumbents. If you doubt the power to continuity, take a minute to realize that the entire value proposition of hypervisor-style virtualization is centered around it. It’s the value of backward-compatibility versus forward-compatibility. in addition, CMDBs are evolving into CMS and are a lot more than configuration repositories. They are an important supporting tool for IT management processes. Whether, and how, these processes apply to “the Cloud” is a topic for another post. In the meantime, read what the IT Skeptic and Rodrigo Flores have to say.
I wouldn’t be so quick to count the Big-4 out, even though I work every day towards that goal, building Oracle’s application and middleware management capabilities in conjunction with my colleagues focused on infrastructure management.
If the topic of application-centric management in the age of Cloud is of interest to you (and it must be if you’ve read this long entry all the way to the end), You might also find this previous entry relevant: “Generalizing the Cloud vs. SOA Governance debate“.
The VMforce announcement is a great step for SalesForce, in large part because it lets them address a recurring concern about the force.com PaaS offering: the lack of portability of Apex applications. Now they can be written using Java and Spring instead. A great illustration of how painful this issue was for SalesForce is to see the contortions that Peter Coffee goes through just to acknowledge it: “On the downside, a project might be delayed by debates—some in good faith, others driven by vendor FUD—over the perception of platform lock-in. Political barriers, far more than technical barriers, have likely delayed many organizations’ return on the advantages of the cloud”. The issue is not lock-in it’s the potential delays that may come from the perception of lock-in. Poetic.
Similarly, portability between clouds is also a big theme in Steve Herrod’s blog covering VMforce as illustrated by the figure below. The message is that “write once run anywhere” is coming to the Cloud.
Because this is such a big part of the VMforce value proposition, both from the SalesForce and the VMWare/SpringSource side (as well as for PaaS in general), it’s worth looking at the portability aspect in more details. At least to the extent that we can do so based on this pre-announcement (VMforce is not open for developers yet). And while I am taking VMforce as an example, all the considerations below apply to any enterprise PaaS offering. VMforce just happens to be one of the brave pioneers, willing to take a first step into the jungle.
Beyond the use of Java as a programming language and Spring as a framework, the portability also comes from the supporting tools. This is something I did not cover in my initial analysis of VMforce but that Michael Cote covers well on his blog and Carl Brooks in his comment. Unlike the more general considerations in my previous post, these matters of tooling are hard to discuss until the tools are actually out. We can describe what they “could”, “should” and “would” do all day long, but in the end we need to look at the application in practice and see what, if anything, needs to change when I redirect my deployment target from one cloud to the other. As SalesForce’s Umit Yalcinalp commented, “the details are going to be forthcoming in the coming months and it is too early to speculate”.
So rather than speculating on what VMforce tooling will do, I’ll describe what portability questions any PaaS platform would have to address (or explicitly decline to address).
That’s the easiest to address. Thanks to Java, the runtime portability problem for the core language is pretty much solved. Still, moving applications around require changes to way the application communicates with its infrastructure. Can your libraries and frameworks for data access and identity, for example, successfully encapsulate and hide the different kinds of data/identity stores behind them? Even when the stores are functionally equivalent (e.g. SQL, LDAP), they may have operational differences that matter for an enterprise application. Especially if the database is delivered (and paid for) as a service. I may well design my application differently depending on whether I am charged by the amount of data in the DB, by the number of requests to the DB, by the quantity of app-to-DB traffic or by the total processing time of my requests in the DB. Apparently force.com considers the number of “database objects” in its pricing plans and going over 200 pushes you from the “Enterprise” version to the more expensive “Unlimited” version. If I run against my local relational database I don’t think twice about having 201 “database objects”. But if I run in force.com and I otherwise can live within the limits of the “Enterprise” version I’d probably be tempted to slightly alter my data model to fit under 200 objects. The example is borderline silly, but the underlying truth is that not all differences in application infrastructure can be automatically encapsulated by libraries.
While code portability is a solvable problem for a reasonably large set of use cases, things get hairier for the more demanding applications. A large part of the PaaS value proposition is contingent on the willingness to give up some low-level optimizations. This, and harder portability in some cases, may just have to be part of the cost of running demanding applications in a PaaS environment. Or just keep these off PaaS for now. This is part of the backward-compatible versus forward compatible Cloud dilemma.
I have covered data portability in the previous entry, in response to Steve Herrod’s comment that “you should be able to extract the code from the cloud it currently runs in and move it, along with its data, to another cloud choice”. Your data in the force.com database can already be moved somewhere else… as long as you’re willing to write code to get it and perform any needed transformation. In theory, any data that you can read is data that you can move (thus fulfilling Steve’s promise). The question is at what cost. Presumably Steve is referring to data migration tools that VMWare will build (or acquire) and make part of its cloud enablement platform. Another way in which VMWare is trying to assemble a more complete middleware portfolio (see Oracle ODI for an example of a complete data integration offering, which goes far beyond ETL).
There is a subtle difference between the intrinsic portability of Java (which will run in any JRE, modulo JDK version) and the extrinsic portability of data which can in theory be moved anywhere but each place you move it to may require a different process. A car and an oak armoire are both “portable”, but one is designed for moving while the other will only move if you bring a truck and two strong guys.
Application service portability
I covered this in my previous entry and Bob Warfield summarized it as “take advantage of all those juicy services and it will be hard to back out of that platform, Java or no Java”. He is referring to all the platform services (search, reporting, mobile, integration, BPM, IdM, administration) that make a large part of the force.com value proposition. They won’t be waiting for you in your private cloud (though some may be remotely invocable, depending on how SalesForce wants to play its cards). Applications that depend on them will have to be changed, at least until we have standards interfaces for all these services (don’t hold your breath).
Even if you can seamlessly migrate your application and your data from your internal servers to force.com, what do you think is going to happen to your management console, especially if it uses operating system agents? These agents are not coming along for the ride, that’s for sure. Are you going to tell your administrators that rather than having a centralized configuration/monitoring/event console they are going to have to look at cute “monitoring” web pages for each application? And all the transaction tracing, event correlation, configuration policy and end-user monitoring features they were relying on are unfortunate victims of the relentless march of progress? Good luck with that sale.
VMWare’s answer will probably be that they will eventually provide you with all the management capabilities that you need. And it’s a fair one, along the lines of the “Application-to-Disk Management” message at the recent launch of Oracle Enterprise Manager 11G. With the difference that EM is not the only way to manage a top-to-bottom Oracle stack, just the one that we think is the best. BMC and HP aren’t locked out.
VMWare and SpringSource (+Hyperic) could indeed theoretically assemble a full-fledged management solution. But this doesn’t happen overnight, even with acquisitions as I know from experience both at HP Software and currently at Oracle. Integration (of management domains across the stack, of acquired application management products, of support data/services from oracle.com) is one of the main advances in Enterprise Manager 11G and it took work.
And even then, this leads to the next logical question. If you can move from cloud to cloud but you are forced to use VMWare development, deployment and management tools, haven’t you traded one lock-in problem for another?
Not to mention that your portability between clouds, if it depends on VMWare tools, is limited to VMWare-powered clouds (private or public). In effect, there are now three levels of portability:
- not portable (only runs on VMforce)
- portable to any cloud (public or private) built using VMWare infrastructure
- portable to any Java/Spring Cloud platform
Is your application portable the way cash is portable, or the way a gift card is portable (across stores of a retail chain)?
If this reminds you of the java portability debates of the early days of Enterprise Java that’s no surprise. Remember, we’re replaying the tape.
Oracle has a busy week in store for people who are interested in application management. Today, the company announced:
- Oracle Virtual Assembly Builder, to package and easily deploy virtualized composite applications. It’s an application-aware (via metadata) set of VM disk images. It comes with a graphical builder tool.
- Oracle WebLogic Suite Virtualization Option (not the most Twitter-friendly name, so if you see me tweet about “WebLogic Virtual” or “WLV” that’s what I mean), an optimized version of WebLogic Server which runs on JRockit Virtual Edition, itself on top of OVM. Notice what’s missing? The OS. If you think you’ll miss it, you may be suffering from learned helplessness. Seek help.
Later this week, Oracle will announce Oracle Enterprise Manager 11g. I am not going to steal the thunder a couple of days before the announcement, but I can safely say that a large chunk of the new features relate to application management.