Category Archives: Utility computing

September 14, 2008

Dell is the best friend of Cloud Computing

Dell took quite a beating last month for (unsuccessfully) trying to trademark the term “Cloud Computing”. This has earned them a reputation as a clown in the Cloud Computing community.

I think it’s unfair. In my experience, the most compelling arguments for Cloud Computing come from Dell. Dell doesn’t make the move to Cloud Computing simply desirable, it makes it indispensable.

How? Not with its “Dell Cloud Computing Solutions” consultants. Not with its XS23 Cloud Server.

With a laptop. The Latitude D420. More specifically, the D420 that I am writing on right now.

I have been using laptops as my primary work machine for over 10 years. This one is by far the worst in terms of stability.

For months, I grappled with undiagnosable crashes. A motherboard replacement fixed those (I think). But the machine still fails to hibernate 20% of the time (sometimes even fresh out of a reboot). And the docking/undocking process is still a roll of the dice. It only works more or less reliably if the laptop is hibernated (but going to hibernation itself is not reliable, see above). If the machine is either turned on or in stand-by, all bets are off. And I am not talking about ending up with a messed up screen resolution. I consider that a successful docking. I am talking about blank screens (laptop and monitor), an unresponsive machine and eventually a hard reboot. By now, the colleagues sitting in the nearby offices must have learned quite a few French swear words.

And please don’t blame Windows XP. It’s not perfect but I’ve had some rock-solid Windows XP laptops that could go through dozens of hibernate/wake-up cycles and not need a reboot until some OS security patch had to be installed. The NC6400 that I left behind when I quit HP was such an example. More stable than my home Linux laptop.

Anytime my Dell crashes, I risk losing data in whatever files were open at the time. I’ve become pretty good at rebuilding a corrupted Thunderbird profile and importing the old emails and filters. I’ve learned to appreciate Firefox’s practice of regularly creating a backup copy of the bookmarks. I know how to set up auto-save in any application that has the feature. My left hand does the “Ctrl-S” motion on my pillow a hundred times each night.

But above all, I have come to realize how good life will be when all my data, configuration and preferences are in the Cloud. When all my emails, documents, bookmarks, contacts, RSS subscriptions, calendar items are safely removed from this productivity-preventing machine. When recovering from another temperamental bout from this enemy (that I still carry home every day) will only be a matter of logging back onto whatever SaaS application I was using.

Dell has made me a true believer in Cloud Computing.

The first draft of this entry was written (on the aforementioned Linux laptop) during the 13 minutes it takes for the chkdsk.exe process to scan an 80GB hard drive after yet another crash.

Filed under Everything, SaaS, Utility computing

September 10, 2008

Oslo, blog posts and my crystal ball

There is more and more information coming out about Oslo in anticipation of the Microsoft PDC in October.

David Chappell recorded a video about it last month. More recently Doug Purdy and Don Box each posted a short description of Oslo. Don describes the goal of Oslo as “simplify the process of developing, deploying, and managing software”. But when he lists ancestor technologies to illustrate that “Microsoft has been moving in this direction for over a decade now”, they are all about development, not management: COM type libraries, .NET metadata attributes, XAML. Interesting that neither SDM nor SML gets a mention. Neither did SCA by the way, but I wasn’t really expecting that one… :-)

Maybe I am the only one looking for an SDM/SML echo here, just because I came to hear of Oslo through the DSI angle. Am I wrong to see Oslo as an enabler for DSI? This eWeek article doesn’t have anything to do with IT management. Reading it, Oslo is all about allowing people to write code through drag and drop. Yawn. And Don Box endorses the article.

Maybe it’s just me (an IT management guy more than a software development guy) but I don’t care so much about how the application model is created. I care a lot more about what it allows you to do in terms of IT management. Please don’t make me pull out the often-quoted figure about the percentage of IT budget spent on operations versus development/licensing. The eWeek piece fails to excite me, but fortunately David Chappell’s video interview is a lot more aligned with my thinking, so I still hold hopes for Oslo as an IT management enabler. Here is my approximate transcript of an example that David provides (at around 4:20) in the video:

“If someone comes to you and says I’ve got this business process and the SLA is not being met, what do you do? You’ve got to trace this through the right business process and the right application that supports that part of the process and find the machine it runs on and maybe look at the workflow that implements it and maybe look at the services that it provides. This involves talking to business analysts, or the IT pros or the architect or the developer, all of whom have their own view of the world, their own tools, their own perspective. The repository provides a common place to store all this stuff, to link it all together, and with a visual editor to have a common tool that lets you actually go through and answer these kinds of questions.”

Now you’re talking.

And if Oslo is not the new blood of DSI, then what is? The DSI story is getting dated, SML is fading in our memories and of the three parts that supposedly compose DSI (“virtualized infrastructure, design for operations, and knowledge-driven management”), only virtualization is actually represented on the list of technologies on the DSI home page. Has DSI turned into just allowing System Center to manage a hypervisor? I still hold hopes that the Oslo data is going to spice things up there. It would be good for the industry at large, not just Microsoft.

I won’t be at the PDC but it will be interesting to see what filters out of these sessions. The first session in the list adds management of hybrid application systems (hybrid as in “cloud/on-premise combination” or “software+services” as Microsoft calls it), to the long “can do” list for Oslo. Impressive, if there is some meat behind the abstract. I think this task is often overlooked in discussions around management aspects of Cloud computing (see “the new, interesting thing is going to be the IT infrastructure to manage your usage of utility computing services as well as their interactions with your in-house software” in this previous entry).

Yes, I am reading way too much into session abstracts, but while I am at it I can’t help noticing that there is a lot of SQL and very little XML/XSD/XPath mentioned there. Even though one of the presenters is Gudge, the only person I have ever met who fully understands XSD (actually even he doesn’t, I’ve seen him in the WS-I days have to refer to… his book).

Even though I am sure we’ll be told that SML can be built on top of Oslo, the SQL orientation won’t make that so easy (I want to see how to build XSD+Schematron validation on top of a relational store using Oslo’s drag and drop development tool). And it puts Microsoft on a different architectural direction from IBM, who, as far as I can tell, thinks that the world is a big XML document. Neither is the most appropriate for IT management models. I prefer a graph model and associated graph queries along the lines of SPARQL or CMDBf.
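To make "a graph model and associated graph queries" a bit more concrete, here is a minimal sketch in plain Python. It is neither SPARQL nor CMDBf, just a toy set of triples with a pattern match in the same spirit. Every item name and relationship (`supported_by`, `uses`, `runs_on`) is made up for the illustration.

```python
# Toy IT-management model stored as (subject, relationship, object) triples.
# All names below are hypothetical, chosen only to illustrate the idea.
TRIPLES = [
    ("order-process", "supported_by", "order-app"),
    ("order-app", "runs_on", "server-12"),
    ("billing-app", "runs_on", "server-12"),
    ("order-app", "uses", "orders-db"),
    ("orders-db", "runs_on", "server-40"),
]


def match(pattern, triples=TRIPLES):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))]


def servers_behind(process):
    """Walk supported_by, then any chain of uses, collecting runs_on hosts."""
    servers = set()
    seen = set()
    pending = [app for _, _, app in match((process, "supported_by", None))]
    while pending:
        item = pending.pop()
        if item in seen:
            continue  # guard against cycles in the model
        seen.add(item)
        servers.update(host for _, _, host in match((item, "runs_on", None)))
        pending.extend(dep for _, _, dep in match((item, "uses", None)))
    return sorted(servers)


print(servers_behind("order-process"))
```

The point of the sketch is that the question "which machines sit behind this business process?" is a traversal over relationships, which is awkward to express against either a set of relational tables or one big XML document.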

But that’s just late-night idle speculation on my part (aka “blogging”). Let’s see what comes out in October.

[UPDATED 2008/9/10: Interesting timing. Microsoft is joining OMG, home of UML and BPMN. Coming next: a submission of a “new version” of UML and BPMN that happens to contain the extensions and tweaks that Microsoft made to them in the process of implementing Oslo. This, BTW, is the final nail in the SML coffin (SML isn’t even mentioned in the press release).]

Filed under Application Mgmt, CMDBf, Conference, Desired State, Everything, Graph query, IT Systems Mgmt, Mgmt integration, Microsoft, Middleware, Modeling, Oslo, Query, SaaS, SCA, SML, SPARQL, Specs, Tech, Trade show, Utility computing, Virtualization

August 20, 2008

It’s party time again for the tinkerers

Around 1995 and 1996, if you knew how to set up an HTTP server on a Solaris box, hand-write a few HTML pages and create a simple CGI script to save the content of a form into a file (extra credit if you remembered to append to the file rather than overwriting it every time), then you were a world-class web designer. At least in my neck of the woods, which wasn’t Silicon Valley at the time. These people were self-trained, of course. I made some side money back then, creating a few web sites with just these limited skills. I am sure there were already people who had really thought about web design and could create useful and attractive sites (rather than simply functional ones). But all twelve of them were busy elsewhere and I would guess that none of them spoke French anyway. They were not my competition in Paris, when talking, for example, to a large French bank who wanted to create a web site to hire college students. My only competition was a bunch of Photoshop clowns whose idea of web design was to create a brochure in Photoshop/Framemaker and make the whole web page one big JPEG file.

Compare this to utility computing (aka clouds) today. Any Linux sysadmin who has, over the last year, made the effort to read and experiment with cloud computing (typically Amazon EC2), to survey available tools and to write a few scripts to tie them together is now an IT rock star, a potential catalyst for operations as a competitive advantage.

Just like self-taught HTML dilettantes didn’t keep control of the web design playground for long, early cloud adopters among sysadmins won’t enjoy their differentiation forever. But I would guess that they do today. Does anyone have statistics in terms of valuation for such skills on the job market?

Of course the Photoshop crowd eventually got their Frontpage, Dreamweaver, etc to let them claim that they could create web sites. These tools were pretty bad at first because they tried to make things look familiar to graphic designers (image maps galore!). They slowly got better.

The same thing is likely to happen in utility computing. Traditional IT management tools will soon get cloud features. Like the HTML WYSIWYG tools, they’ll probably tend to be too influenced by current IT management concepts and methods. For example, all the ITIL cheerleaders out there are probably going to bend cloud features to fit ITIL rather than the other way around. Even though utility computing might well invalidate some pretty fundamental assumptions/requirements of parts of ITIL.

The productivity increases created by utility computing are probably large enough that even these tools will provide great value. And they’ll improve. In the same way that the Web was a major enough improvement that even poorly designed web sites were way ahead of the alternatives.

Today, you obviously can’t make a living as an “HTML in notepad” developer. You must either be a real graphic designer and use tools to turn your designs into Web artifacts or be deep in Web technologies. Or both. Similarly, you soon won’t be providing much value if you just know how to start and provision EC2 instances. You’ll need to either be a real IT admin who can manage the utility resources as part of a larger system (like the applications) or be a hard-core utility computing expert who tackles hard problems like optimizing your resource consumption across cloud providers or securing and ensuring the compliance of your distributed IT system.

But for now, the party is raging and the dress code is still pretty lax.

Filed under Everything, IT Systems Mgmt, Utility computing

August 1, 2008

Grid cloudification #2

On a recent drive to work, I heard another echo of the Grid world in the context of Cloud computing: I was listening to the Cloud Cafe podcast with Enomaly’s Reuven Cohen when he mentioned (near the 27 minute mark) that they use Ganglia for monitoring their environment.

I am familiar with Ganglia from some HP Labs projects around PlanetLab that I was involved in. Ganglia is used quite a lot for monitoring in the PlanetLab environment.

So Ganglia is one. Is any other project/tool/product coming from the Grid/HPC efforts of the last 10 years now used by the cool Cloud kids? Globus? SmartFrog? Platform? Condor? Others?

A few seconds later in the podcast, Reuven provides this juicy quote: “is the cloud an excuse for bad code”. But that’s a topic for another post.

Filed under Everything, Grid, IT Systems Mgmt, Manageability, Utility computing

July 30, 2008

Grid cloudification

Grid computing is moulting and, to no surprise, the new skin has “cloud” written all over it.

That’s one way to interpret the announcement today that HP, Intel and Yahoo are going to launch a compute cloud. Seeing Intel and HP work together on this is no surprise. Back at HP I had some involvement with the collaboration between HP Labs and Intel on PlanetLab.

I have only read the Gigaom article and Steve’s, so this post is not an analysis of the announcement. Just a few questions that come to mind. They can be most concisely expressed by trying to understand the difference with Amazon’s EC2. The quotes below all come from the Gigaom article.

“six physical locations” -> Amazon has availability zones, including the choice of three geographies.

“between 1,000 and 4,000 mostly Intel cores” -> According to this well-publicized story, Amazon can deliver 5,000 servers (each linked to at least one physical core) to one customer without breaking a sweat.

“We want, unlike other partnerships including Google and IBM’s where the lower-level stacks are not provided in a open manner to the world, open access to all levels of the hardware” -> The quote seems to conveniently avoid comparison with EC2 which provides a much lower abstraction level: virtual machines with mountable raw block storage devices. How much lower can you go without handing out access cards to physically walk into the datacenter? Access to the BMC on the motherboard? Access to some internal bus? Remote-controlled little robots that will slide cards in and out of a chassis?

“researchers will be able to access the cloud through a proposal process later this year” -> EC2 offers pay-as-you-go, which tends to be a good driver for people to use the infrastructure efficiently. And of course someone can always give researchers a grant in the form of EC2 rent money.

Just to be clear, I am not belittling the announcement because for one thing I haven’t read much about it and for another I probably know many of the HP Labs people involved and they are part of the “mucho sapiens” branch of “homo sapiens”. I know they wouldn’t bother putting this out if it was nothing more than giving researchers some free EC2 time.

But these are the questions I’ll be trying to answer for myself as I read more about this project.

[UPDATED 2008/9/19: Russ Daniels (who was HP Software CTO when I was at HP and is now CTO of Cloud Services Strategy) comments on the announcement.]

Filed under Amazon, Everything, Grid, HP, Manageability, Tech, Utility computing, Virtualization, Yahoo

July 24, 2008

Cloud Computing trivia

A few silly trivia questions for everyone out there who has drunk the Kloud-Aid.

Q) When was the cloudcomputing.com domain registered?

A) February 28, 2007. Yes, less than a year and a half ago it could have been yours for 10 bucks. A nice reminder of how quickly the buzzword took over. For comparison, utilitycomputing.com was registered in July 2002 and gridcomputing.com in February 2000. By the way, fogcomputing.com got snapped up a month ago today and is currently parked…

Q) Who owns cloudcomputing.com?

A) Dell. Ironically, one of the companies that has the most to lose from it… Of course they don’t see it that way and they redirect that domain to a dell.com page that explains all they have to offer in this area.

Q) Where does the name come from?

A) According to Wikipedia, “the term cloud computing derives from the common depiction in most technology architecture diagrams, of the Internet or IP availability, using an illustration of a cloud”. OK, then are databases now called Cylinder Computing?

Q) How does one make money in Cloud Computing?

A) By registering the domain name and re-selling it at the peak of the hype. CylinderComputing.com is still available…

[UPDATED 2008/8/3: For the record, that last answer was supposed to be a joke. It seemed pretty obvious at the time, but one week later the news comes out that Dell is trying to get a trademark on the term “cloud computing”… More analysis here.]

Filed under Everything, Utility computing

July 21, 2008

Animoto is no infrastructure flexibility benchmark

I have nothing against Animoto. From what I know about them (mostly from John’s podcast with Brad Jefferson) they built their system, using EC2, in a very smart way.

But I do have something against their story being used to set the benchmark for infrastructure flexibility. For those who haven’t heard it five times already, the summary of “their story” is ramping up from 50 to 5000 machines in a week (according to the podcast). Or from 50 to 3500 (according to this AWS blog entry). Whatever. If I auto-generate my load (which is mostly what they did when they decided to auto-create a custom video for each new user) I too can create the need for thousands of machines.

This was probably a good business decision for Animoto. They got plenty of visibility at a low cost. Plus the extra publicity from being an EC2 success story (I for one would never have heard of them through their other channels). Good for them. Good for Amazon who made it possible. And who got a poster child out of it. Good for the facebookers who got to waste another 30 seconds of their time straining their eyes. Everyone is happy, no animal got hurt in the process, hurray.

That’s all good but it doesn’t mean that from now on any utility computing solution needs to support ramping up by a factor of 100 in a week. What if Animoto had been STD’ed (slashdotted, technoratied and dugg) at the same time as the Facebook burst, resulting in the need for 50,000 servers? Would 1,000 X be the new benchmark? What if a few of the sites that target the “lonely guy” demographic decided to use Animoto for… ok let’s not go there.

There are three types of user requirements. The Animoto use case is clearly not in the first category but I am not convinced it’s in the third one either.

1. The “pulled out of thin air” requirements that someone makes up on the fly to justify a feature that they’ve already decided needs to be there. Most frequently encountered in standards working groups.
2. The “it happened” requirements that assume that because something happened sometime somewhere it needs to be supported all the time everywhere.
3. The “it makes business sense” requirements that include a cost-value analysis. The kind that comes not from asking “would you like this” to a customer but rather “how much more would you pay for this” or “what other feature would you trade for this”.

When cloud computing succeeds (i.e. when you stop hearing about it all the time and, hopefully, we go back to calling it “utility computing”), it will be because the third category of requirements will have been identified and met. Best exemplified by the attitude of Tarus (from OpenNMS) in the latest Redmonk podcast (paraphrased): sure we’ll customize OpenNMS for cloud environments; as soon as someone pays us to do it.

Filed under Amazon, Business, CMDB Federation, Everything, Mgmt integration, Specs, Tech, Utility computing

June 30, 2008

Moving towards utility/cloud computing standards?

This Forbes article (via John) channels 3Tera’s Bert Armijo’s call for standardization of utility computing. He calls it “Open Cloud” and it would “allow a company’s IT systems to be shared between different cloud computing services and moved freely between them”. Bert talks a bit more about it on his blog and, while he doesn’t reference the Forbes interview (too modest?), he points to Cloudscape as the vision.

A few early thoughts on all this:

Bottom line: I applaud Bert’s efforts but I couldn’t sleep well tonight if I didn’t also warn him that “there be dragons”.

And for those who haven’t seen it yet, here is a very good document on the topic (but it is focused on big vendors, not on how smaller companies can play the standards game).

[UPDATED 2008/6/30: A couple of hours after posting this, I see that Coté has just published a blog post that elaborates on his view of cloud standards, as an addition to the podcast I mentioned earlier.]

[UPDATED 2008/7/2: If you read this in your feed viewer (rather than directly on vambenepe.com) and you don’t see the comments, you should go have a look. There are many clarifications and some additional insight from the best authorities on the topic. Thanks a lot to all the commenters.]

Filed under Amazon, Automation, Business, DMTF, Everything, Google, Google App Engine, Grid, HP, IBM, IT Systems Mgmt, Mgmt integration, Modeling, OVF, Portability, Specs, Standards, Utility computing, Virtualization

June 18, 2008

SaaS management: it’s MUWS and MOWS all over again

One of the most repetitive tasks when I was evangelizing WSDM was to explain the difference between the MUWS and MOWS specifications (the sum of which composes the entire WSDM body of work). MUWS (management using web services) describes how to use Web services to expose manageability capabilities of potentially any resource (a server, an application, a toaster…). MOWS (management of web services) defines a monitoring and control model for resources that are Web services themselves (so you can measure the number of messages received for example).

I ended up sounding like a cow when I was presenting. A retarded cow even, since my French accent forced me to say it slowly so people could hear the difference.

In retrospect, we should not have tried to tackle both in the same group. And not just because my dignity was bruised. It was a distraction inside the working group, and a source of confusion outside of it. We should have focused on MUWS (as WS-Management did) and possibly created a protocol-independent monitoring/control model for Web services separately. Something that, BTW, is still missing today.

I am being reminded of this MUWS vs. MOWS state of affairs these days, when the topics of SaaS and IT management meet, often under the term “SaaS management”. By that, some people mean “delivering IT management as a hosted service, rather than running the management software in the same datacenter as the application”. Others mean “managing, using an on-premise deployment of the management software, a business application that is being delivered as a service (e.g. Oracle CRM On Demand), along with other local IT resources”. The latter is what I was talking about in this post. And sometimes it’s both at the same time (the business application is delivered as a service along with a hosted management console for status/issues/requests…). Not to mention the extra dimension of providing IT management to the administrators in charge of running a multi-tenant application in a SaaS scenario (instead of meeting the needs of their customers’ administrators).

All of these scenarios are valid. So far, we don’t have good names for them. And the MUWS/MOWS experience shows that good names matter. IMaaS (IT Management as a Service) and MoSaaS (Management of Software as a Service) won’t cut it.

[UPDATED 2008/6/23: This seems to be an example of MoSaaS (or rather MoIaaS) delivered through IMaaS. I am subjecting you to such an awful-sounding sentence as a way to drive home the need for better names. The real value of course will come when these capabilities are delivered alongside (and integrated with) all your IT management capabilities. John has a nice analysis that lets some air out of the fluff.]

Filed under Application Mgmt, Everything, IT Systems Mgmt, SaaS, Standards, Utility computing

June 13, 2008

Some breathing room for Google App Engine requests

As promised to Felix, here is the code that shows how to give extra breathing room to Google App Engine (GAE) requests that may otherwise be killed for taking too long to complete. The approach is similar to the one previously described. But rather than trying to emulate a long-running process, I am simply allowing a request to spread its work over a handful of invocations, thus getting several 9-second slots (since this seems to be how much time GAE gives you per request right now).

If all your requests need this then you are going to run into the same “high CPU requests have a small quota, and if you exceed this quota, your app will be temporarily disabled” problem seen in the previous experiment. But if 90% of your requests complete in a normal time and only 10% of the requests need more time, then this approach can help prevent your users from getting an error for 1 out of every 10 requests. And you should fly under the radar of the GAE resource cop.

The way it works is simple: if your request is interrupted for having run too long, the client gets a redirect to a new instance of the same handler. Because the code saves its results incrementally in the datastore, the new instance can build on the work of the previous one.

This specific example retrieves the ubuntu-8.04-server-i386.jigdo file (98K) from a handful of Ubuntu mirrors and returns the average/min/max download times (without checking if the transfer was successful or not). I also had to add a 1 second sleep after each fetch in order to trigger the DeadlineExceededError because the fetch operations go too quickly when running on GAE rather than my machine (I guess Google has better connectivity than my mediocre AT&T-provided DSL line, who would have thought).

#!/usr/bin/env python
#
# Copyright 2008 William Vambenepe
#

import wsgiref.handlers
import os
import logging
import time

from google.appengine.ext import db
from google.appengine.ext.webapp import template
from google.appengine.ext import webapp
from google.appengine.api import urlfetch
from google.appengine.runtime import DeadlineExceededError

targetUrls = ["http://mirror.anl.gov/pub/ubuntu-iso/CDs/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ubuntu.mirror.ac.za/ubuntu-release/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://mirrors.cytanet.com.cy/linux/ubuntu/releases/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ftp.kaist.ac.kr/pub/ubuntu-cd/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ftp.itu.edu.tr/Mirror/Ubuntu/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ftp.belnet.be/mirror/ubuntu.com/releases/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ubuntu-releases.sh.cvut.cz/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ftp.crihan.fr/releases/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ftp.uni-kl.de/pub/linux/ubuntu.iso/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ftp.duth.gr/pub/ubuntu-releases/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://no.releases.ubuntu.com/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://neacm.fe.up.pt/pub/ubuntu-releases/hardy/ubuntu-8.04-server-i386.jigdo"]

class MeasurementSet(db.Model):
  iteration = db.IntegerProperty()
  measurements = db.ListProperty(float)

class MainHandler(webapp.RequestHandler):
  def get(self):
    try:
      key = self.request.get("key")
      set = MeasurementSet.get(key)
      if set is None:
        raise ValueError
      set.iteration = set.iteration + 1
      set.put()
      logging.debug("Resuming existing set, with key " + str(key))
    except:
      # No key, an invalid key or an unknown set: start a new measurement set
      set = MeasurementSet()
      set.iteration = 1
      set.measurements = []
      set.put()
      logging.debug("Starting new set, with key " + str(set.key()))
    try:
      # Dereference remaining URLs
      for target in targetUrls[len(set.measurements):]:
        startTime = time.time()
        urlfetch.fetch(target)
        timeElapsed = time.time() - startTime
        time.sleep(1)
        logging.debug(target + " dereferenced in " + str(timeElapsed) + " sec")
        set.measurements.append(timeElapsed)
        set.put()
      # We're done dereferencing URLs, let's publish the results
      longestIndex = 0
      shortestIndex = 0
      totalTime = set.measurements[0]
      for i in range(1, len(targetUrls)):
        totalTime = totalTime + set.measurements[i]
        if set.measurements[i] < set.measurements[shortestIndex]:
          shortestIndex = i
        elif set.measurements[i] > set.measurements[longestIndex]:
          longestIndex = i
      template_values = {"iterations": set.iteration,
                         "longestTime": set.measurements[longestIndex],
                         "longestTarget": targetUrls[longestIndex],
                         "shortestTime": set.measurements[shortestIndex],
                         "shortestTarget": targetUrls[shortestIndex],
                         "average": totalTime/len(targetUrls)}
      path = os.path.join(os.path.dirname(__file__), "steps.html")
      self.response.out.write(template.render(path, template_values))
      logging.debug("Set with key " + str(set.key()) + " has returned")
    except DeadlineExceededError:
      logging.debug("Set with key " + str(set.key())
                    + " interrupted during iteration "+ str(set.iteration)
                    + " with " + str(len(set.measurements)) + " URLs retrieved")
      self.redirect("/steps?key=" + str(set.key()))
      logging.debug("Set with key " + str(set.key()) + " sent redirection")

def main():
  application = webapp.WSGIApplication([("/steps", MainHandler)], debug=True)
  wsgiref.handlers.CGIHandler().run(application)

if __name__ == "__main__":
  main()

I can’t guarantee I will keep it available, but at the time of this writing the application is deployed here if you want to give it a spin. A typical run produces this kind of log:

06-13 01:44AM 36.814 /steps
XX.XX.XX.XX - - [13/06/2008:01:44:45 -0700] "GET /steps HTTP/1.1" 302 0 - -
  D 06-13 01:44AM 36.847
    Starting new set, with key agN2YnByFAsSDk1lYXN1cmVtZW50U2V0GBoM
  D 06-13 01:44AM 37.870
    http://mirror.anl.gov/pub/ubuntu-iso/CDs/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.022078037262 sec
  D 06-13 01:44AM 38.913
    http://ubuntu.mirror.ac.za/ubuntu-release/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0184168815613 sec
  D 06-13 01:44AM 39.962
    http://mirrors.cytanet.com.cy/linux/ubuntu/releases/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0166189670563 sec
  D 06-13 01:44AM 41.12
    http://ftp.kaist.ac.kr/pub/ubuntu-cd/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0205371379852 sec
  D 06-13 01:44AM 42.103
    http://ftp.itu.edu.tr/Mirror/Ubuntu/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0197179317474 sec
  D 06-13 01:44AM 43.146
    http://ftp.belnet.be/mirror/ubuntu.com/releases/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0171189308167 sec
  D 06-13 01:44AM 44.215
    http://ubuntu-releases.sh.cvut.cz/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0160200595856 sec
  D 06-13 01:44AM 45.256
    http://ftp.crihan.fr/releases/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.015625 sec
  D 06-13 01:44AM 45.805
    Set with key agN2YnByFAsSDk1lYXN1cmVtZW50U2V0GBoM interrupted during iteration 1 with 8 URLs retrieved
  D 06-13 01:44AM 45.806
    Set with key agN2YnByFAsSDk1lYXN1cmVtZW50U2V0GBoM sent redirection
  W 06-13 01:44AM 45.808
    This request used a high amount of CPU, and was roughly 28.5 times over the average request CPU limit.
    High CPU requests have a small quota, and if you exceed this quota, your app will be temporarily disabled.

Followed by:

06-13 01:44AM 46.72 /steps?key=agN2YnByFAsSDk1lYXN1cmVtZW50U2V0GBoM
XX.XX.XX.XX - - [13/06/2008:01:44:50 -0700] "GET /steps?key=agN2YnByFAsSDk1lYXN1cmVtZW50U2V0GBoM HTTP/1.1" 200 472
  D 06-13 01:44AM 46.110
    Resuming existing set, with key agN2YnByFAsSDk1lYXN1cmVtZW50U2V0GBoM
  D 06-13 01:44AM 47.128
    http://ftp.uni-kl.de/pub/linux/ubuntu.iso/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.016991853714 sec
  D 06-13 01:44AM 48.177
    http://ftp.duth.gr/pub/ubuntu-releases/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0238039493561 sec
  D 06-13 01:44AM 49.318
    http://no.releases.ubuntu.com/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0177929401398 sec
  D 06-13 01:44AM 50.378
    http://neacm.fe.up.pt/pub/ubuntu-releases/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0226020812988 sec
  D 06-13 01:44AM 50.410
    Set with key agN2YnByFAsSDk1lYXN1cmVtZW50U2V0GBoM has returned
  W 06-13 01:44AM 50.413
    This request used a high amount of CPU, and was roughly 13.4 times over the average request CPU limit.
    High CPU requests have a small quota, and if you exceed this quota, your app will be temporarily disabled.

I believe we can optimize the performance by taking advantage of the fact that successive requests are likely (but not guaranteed) to hit the same instance, allowing global variables to be re-used rather than always going to the datastore. My code is a proof of concept, not an optimized implementation.
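A minimal sketch of that caching idea, with a plain dict standing in for the datastore. The names `FAKE_DATASTORE`, `_instance_cache` and `load_set` are hypothetical and are not part of my application or of the GAE API. A module-level variable survives between requests served by the same instance, so it can act as a best-effort cache in front of the datastore:

```python
# Hypothetical stand-in for the datastore: key -> stored state.
FAKE_DATASTORE = {"set-1": {"iteration": 1, "measurements": 8}}
datastore_reads = 0

# Module-level cache: it lives as long as this instance keeps the module loaded.
_instance_cache = {}

def load_set(key):
    """Return the state for key, reading the datastore only on a cache miss."""
    global datastore_reads
    if key in _instance_cache:
        return _instance_cache[key]
    datastore_reads += 1
    state = FAKE_DATASTORE[key]
    _instance_cache[key] = state
    return state

first = load_set("set-1")   # cache miss: reads the (fake) datastore
second = load_set("set-1")  # cache hit, as long as the same instance answers
```

Since the next request may land on a different instance, the datastore has to remain the source of truth. The cache only saves a read when you get lucky.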

Of course, the alternative is to drive things from the client, using JavaScript HTTP requests (rather than HTTP redirect) to repeat the HTTP request until the work has been completed. The list of pros and cons of each approach is left as an exercise to the reader.
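To make the client-driven alternative concrete, here is a rough sketch of the loop, written in Python rather than JavaScript to stay in the language of the rest of this post. `drive_until_done` and `FakeServer` are hypothetical names, and the fake server simply pretends that the work completes on the third request:

```python
def drive_until_done(send_request, max_requests=100):
    """Call send_request(key) until the server reports that the work is done.

    send_request is a hypothetical callable returning a (done, key) tuple.
    Returns the number of requests that were needed.
    """
    key = None
    for attempt in range(1, max_requests + 1):
        done, key = send_request(key)
        if done:
            return attempt
    raise RuntimeError("work not completed after %d requests" % max_requests)

class FakeServer(object):
    """Stand-in for the GAE handler: completes the work on the third call."""
    def __init__(self):
        self.calls = 0

    def __call__(self, key):
        self.calls += 1
        return (self.calls >= 3, key or "some-key")

requests_needed = drive_until_done(FakeServer())
```

The key is handed back to the server on each call, which plays the same role as the "key" parameter in the redirect URL.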

[UPDATED 2008/6/13: Added log output. Removed handling of “OverQuotaError” which was not useful since, unlike “DeadlineExceededError”, quotas are not per-request. As a result, splitting the work over multiple requests doesn’t help. Slowing down a request might help, at which point the approach above might come in handy to prevent this slowdown from triggering “DeadlineExceededError”.]

[UPDATED 2008/6/30: Steve Jones provides an interesting analysis of the cut-off time for GAE. Confirms that it’s mainly based on wall-clock time rather than CPU time. And that you can sometimes go just over 9 seconds but never up to 10 seconds, which is consistent with my (much less detailed and rigorous) observations.]

2 Comments

Filed under Everything, Google, Google App Engine, Implementation, Utility computing

June 6, 2008

Emulating a long-running process (and a scheduler) in Google App Engine

As previously described, Google App Engine (GAE) doesn’t support long running processes. Each process lives in the context of an HTTP request handler and needs to complete within a few seconds. If you’re trying to get extra CPU cycles for some task then Amazon EC2, not GAE, is the right tool (including the option to get high-CPU instances for the CPU-intensive tasks).

More surprising is the fact that GAE doesn’t offer a scheduler. Your app can only get invoked when someone sends it an HTTP request and you can’t ask GAE to generate a canned request every so often (crontab-style). That seems both limiting and arbitrary. In fact, I would be surprised if GAE didn’t soon add support for this.

In the meantime, your best bet is to get an account on a separate server that lets you schedule jobs, at which point you can drive your GAE application from that external scheduler (through HTTP requests to your GAE app). But just for the intellectual exercise, how would one meet the need while staying entirely within the confines of the Google-provided infrastructure?

The most obvious option is to piggyback on HTTP requests from your visitors. But:
- this assumes that you consistently get visitors at a frequency greater than your scheduler’s interval,
- since you can’t launch sub-processes in GAE, this delays your responses to the visitor,
- more worrisome, if your scheduled task takes more than a few seconds this means your application might be interrupted by GAE before you respond to the visitor, resulting in a failed request from their perspective.
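For illustration, the piggyback check itself can be as small as the sketch below. `run_if_due` is a hypothetical helper, and in a real GAE app the `last_run` timestamp would have to be read from and written to the datastore, since nothing else is shared between requests:

```python
import time

def run_if_due(last_run, interval_sec, task, now=None):
    """Run task if at least interval_sec has elapsed since last_run.

    Returns the timestamp to store as the new last_run.
    """
    now = time.time() if now is None else now
    if now - last_run >= interval_sec:
        task()
        return now
    return last_run

runs = []
last = run_if_due(0, 60, lambda: runs.append("ran"), now=30)     # too early
last = run_if_due(last, 60, lambda: runs.append("ran"), now=90)  # due: runs
```

The three problems listed above still apply: the task only runs when a visitor shows up, and it runs on that visitor's time.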
You can try to improve a bit on this by doing this processing not as part of the main request from your visitor but rather by putting in the response HTML some JavaScript that will asynchronously send you HTTP requests in the background (typically not visible to the user). This way, a given visitor will give you repeated invocations for as long as the page is open in the browser. And you can set the invocation interval. You can even create some kind of server-controlled auto-modulation of the interval (increasing it as your number of concurrent visitors increases) so that you don’t eat all your Google-allocated incoming HTTP quota with these XMLHttpRequest invocations. This would probably be a very workable way to do it in practice even though:
- it only works if your application has visitors who use web browsers, not if it is only consumed by programs (e.g. through RSS feeds or other XML formats),
- it puts the burden on your visitors who may or may not appreciate it, assuming they realize it is happening (how would you feel if your real estate agent had to borrow your cell phone to arrange home visits for you and their other customers?).
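The server-controlled modulation mentioned above could be sketched as follows, assuming the server has some estimate of the number of concurrent visitors. The function name and its default values are made up for the example. Each browser is told to wait longer as the crowd grows, so that the total invocation rate stays roughly constant:

```python
def polling_interval(concurrent_visitors, target_period_sec=60, min_interval_sec=5):
    """Seconds each browser should wait between background requests.

    With N visitors each polling every N * target_period_sec seconds, the
    app as a whole gets invoked about once per target_period_sec.
    """
    visitors = max(1, concurrent_visitors)
    return max(min_interval_sec, target_period_sec * visitors)

alone = polling_interval(1)    # a single visitor polls every 60 seconds
crowd = polling_interval(10)   # ten visitors each poll every 600 seconds
```

The server would return this value with each response and the JavaScript would use it to schedule its next request.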
While GAE doesn’t offer a scheduler, another Google service, Google Reader, offers one of sorts. If you register a feed there, Google’s FeedReader will retrieve it once in a while (based on my logs, it happens approximately every hour for each of the two feeds for this blog). You can create multiple URLs that all map to the same handler and return some dummy RSS. If you register these feeds with Google Reader, they’ll get pulled once in a while. Of course there is no guarantee that the pulling of the different feeds will be nicely spread out, but if you register enough of them you should manage to get invoked with a frequency compatible with your desired scheduler’s frequency.
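As a rough sketch of the two pieces involved: a dummy RSS document that any of the feed URLs could return, and the arithmetic for how many feeds to register. The `/feed/N` URL pattern and both function names are hypothetical, and the one-hour pull interval is only what I observed in my logs, not a documented guarantee:

```python
import math
from xml.etree import ElementTree as ET

def dummy_rss(feed_id):
    """Build a minimal RSS 2.0 document whose only purpose is to be fetched."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Scheduler feed %d" % feed_id
    ET.SubElement(channel, "link").text = (
        "http://myGaeApp.appspot.com/feed/%d" % feed_id)
    ET.SubElement(channel, "description").text = "Dummy feed used as a trigger"
    return ET.tostring(rss, encoding="unicode")

def feeds_needed(desired_interval_min, pull_interval_min=60):
    """Feeds to register if pulls were evenly spread (they may not be)."""
    return int(math.ceil(float(pull_interval_min) / desired_interval_min))

feed_xml = dummy_rss(1)
count = feeds_needed(5)   # one invocation every 5 minutes or so
```

Because the pulls are not guaranteed to be evenly spread, the computed count is a lower bound. Registering more feeds than that is the safe choice.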

That’s all nice, but it doesn’t entirely live within the GAE application. It depends on either the visitors or Google Reader. Can we do this entirely within GAE?

The idea is that since a GAE app can only execute within an HTTP request handler, which only runs for a few seconds, you can emulate a long-running process by automatically starting a successor request when the previous one is killed. This is made possible by two characteristics of the GAE runtime:

- When an HTTP request is canceled on the client side, the request execution on the server is permitted to continue (until it returns or GAE kills it for having run too long).
- When GAE kills a request for having run too long, it does it through an exception that you have a chance to handle (at least for a few seconds, until you get killed for good), which is when you initiate the HTTP request that spawns the successor process.

If you’ve watched (or played) Rugby, this is equivalent to passing the ball to a teammate during that short interval between when you’re tackled and when you hit the ground (I have no idea whether the analogy also applies to Rugby’s weird cousin called American Football).

In practice, all you have to do is structure your long running task like this:

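import logging
from google.appengine.api import urlfetch
from google.appengine.ext import webapp

class StartHandler(webapp.RequestHandler):
  def get(self):
    # StopExec is a datastore entity type: the chain stops once one exists
    if (StopExec.all().count() == 0):
      # Parse the id before "try" so that it is always defined in "finally"
      id = int(self.request.get("id"))
      try:
        logging.debug("Request " + str(id) + " is starting its work.")
        # This is where you do your work
      finally:
        # Reached on normal return and when GAE interrupts the request
        logging.debug("Request " + str(id) + " has been stopped.")
        # Save state to the datastore as needed
        logging.debug("Launching successor request with id=" + str(id+1))
        res = urlfetch.fetch("http://myGaeApp.appspot.com/start?id=" + str(id+1))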

Once you have deployed this app, just point your browser to http://myGaeApp.appspot.com/start?id=0 (assuming of course that your GAE app is called “myGaeApp”) and the long-running process is started. You can hit the “stop” button on your browser and turn off your computer: the process (or more exactly the succession of processes) has a life of its own entirely within the GAE infrastructure.

The “if (StopExec.all().count() == 0)” statement is my way of keeping control over the beast (if only Dr. Frankenstein had as much foresight). StopExec is an entity type in the datastore for my app. If I want to kill this self-replicating process, I just need to create an entity of this type and the process will stop replicating. Without this, the only way to stop it would be to delete the whole application through the GAE dashboard. In general, using the datastore as shared memory is the way to communicate with this emulation of a long-running process.
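The kill switch is easier to see in a small simulation that runs outside GAE. In the sketch below a dict stands in for the datastore, a loop stands in for the chain of successor requests, and the `"stop_exec"` flag plays the role of a StopExec entity. All the names are hypothetical:

```python
def run_chain(store, work, max_links=1000):
    """Simulate the succession of requests until the stop flag is set.

    store is a dict standing in for the datastore. Each loop iteration
    stands in for one request handing over to its successor.
    """
    request_id = 0
    while request_id < max_links and not store.get("stop_exec"):
        work(request_id, store)
        request_id += 1
    return request_id

def work(request_id, store):
    store["last_id"] = request_id
    if request_id == 4:
        # Simulates someone creating a StopExec entity from the outside.
        store["stop_exec"] = True

store = {}
links_run = run_chain(store, work)
```

Requests 0 to 4 do their work, and the request that would have had id 5 sees the flag and never starts. The `max_links` bound is there for the same reason as the flag: a self-replicating loop should always have a second way to stop.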

A scheduler is an obvious example of a long-running process that could be implemented that way. But there are other examples. The only constraint is that your long-running process should expect to be interrupted (approximately every 9 seconds based on what I have seen so far). It will then re-start as part of a new instance of the same request handler class. You can communicate state between one instance and its successor either via the request parameters (like the “id” integer that I pass in the URL) or by writing to the datastore (in the “finally” clause) and reading from it (at the beginning of your task execution).
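One way to structure work that expects to be interrupted is to checkpoint after each small unit and to stop voluntarily before the time budget runs out. Note that this is a cooperative variant, not what my code above does (which waits for GAE to raise its exception). The sketch uses a dict as a stand-in for the datastore and a fake clock so that it runs anywhere, and every name in it is hypothetical:

```python
import time

def process_with_budget(items, state, budget_sec, handle, clock=time.time):
    """Process items from state["next"] on, until done or out of time.

    Returns True when every item has been handled, False when a successor
    needs to pick up from the checkpoint stored in state.
    """
    deadline = clock() + budget_sec
    while state["next"] < len(items):
        if clock() >= deadline:
            return False
        handle(items[state["next"]])
        state["next"] += 1   # checkpoint after each item
    return True

class FakeClock(object):
    """Clock that advances by one 'second' every time it is read."""
    def __init__(self):
        self.now = 0

    def __call__(self):
        value = self.now
        self.now += 1
        return value

items = ["a", "b", "c", "d", "e"]
state = {"next": 0}          # stand-in for state kept in the datastore
processed = []
clock = FakeClock()
runs = 0
done = False
while not done:              # each pass stands in for one request
    done = process_with_budget(items, state, 3, processed.append, clock=clock)
    runs += 1
```

With a budget of 3 ticks per request, the five items take three requests, and each request resumes exactly where the previous one stopped.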

By the way, you can’t really test such a system using the toolkit Google provides for local testing, because that toolkit behaves very differently from the real GAE infrastructure in the way it controls long-running processes. You have to run it in the real GAE environment.

Does it work? For a while. The first time I launched it, it worked for almost 30 minutes (that’s a lot of 9 second-long processes). But I started to notice these worrisome warnings in the logs: “This request used a high amount of CPU, and was roughly 21.7 times over the average request CPU limit. High CPU requests have a small quota, and if you exceed this quota, your app will be temporarily disabled.”

And indeed, after 30 minutes of happiness my app was disabled for a bit.

My quota figures on the dashboard actually looked pretty good. This was not a very busy application.

CPU Used 175.81 of 199608.00 Gigacycles (0%)
Data Sent 0.00 of 2048.00 Megabytes (0%)
Data Received 0.00 of 2048.00 Megabytes (0%)
Emails Sent 0.00 of 2000.00 Emails (0%)
Megabytes Stored 0.05 of 500.00 Megabytes (0%)

But the warning in the logs points to some other restriction. Google doesn’t mind if you use a given number of CPU cycles through a lot of small requests, but it complains if you use the same number of cycles through a few longer requests. Which is not really captured in the “understanding application quotas” page. I also question whether my long requests actually consume more CPU than normal (shorter) requests. I stripped the application down to the point where the “this is where you do your work” part was doing nothing. The only actual work, in the “finally” clause, was to open an HTTP connection and wait for it to return (which never happens) until the GAE runtime kills the request completely. Hard to see how this would actually use much CPU. Yet, same warning. The warning text is probably not very reflective of the actual algorithm that flags my request as a hog.

What this means is that no matter how small and slim the task is, the last line (with the urlfetch.fetch() call) by itself is enough to get my request identified as a hog. Which means that eventually the app is going to get disabled. Which is silly really because by the time fetch() gets called nothing useful is happening in this request (the work has transitioned to the successor request) and I’d be happy to have it killed as soon as the successor has been spawned. But GAE doesn’t give you a way to set a client-side timeout on outgoing HTTP requests. Neither can you configure the GAE cop to kill you early so that you don’t enter the territory of “this request used a high amount of CPU”.

I am pretty confident that the ability to set a client-side HTTP timeout will be added to the urlfetch API. Even Google’s documentation acknowledges this limitation: “Note: Since your application must respond to the user’s request within several seconds, a URL fetch action to a slow remote server may cause your application to return a server error to the user. There is currently no way to specify a time limit to the URL fetch action.” Of course, by the time they fix this they may also have added a real scheduler…
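To show what I mean by a client-side timeout, here is a sketch using the standard library of a regular (non-GAE) Python runtime, where `urlopen` accepts a `timeout` argument. This is not the GAE urlfetch API. `spawn_successor` and the two fake openers are hypothetical and only there so that the example runs without a network:

```python
import socket
import urllib.request

def spawn_successor(url, timeout_sec=1.0, opener=urllib.request.urlopen):
    """Send the successor request and give up waiting after timeout_sec.

    Returns True if a response came back in time, False otherwise. Either
    way the caller gets control back quickly instead of hanging.
    """
    try:
        opener(url, timeout=timeout_sec).close()
        return True
    except OSError:
        # URLError and socket timeouts are both subclasses of OSError.
        return False

class _FakeResponse(object):
    def close(self):
        pass

def fast_opener(url, timeout):
    return _FakeResponse()

def hanging_opener(url, timeout):
    # Stands in for a successor that never answers within the timeout.
    raise socket.timeout("timed out")

answered = spawn_successor("http://myGaeApp.appspot.com/start?id=1",
                           opener=fast_opener)
gave_up = spawn_successor("http://myGaeApp.appspot.com/start?id=1",
                          opener=hanging_opener)
```

In the self-replicating handler, giving up is the desired outcome: once the request has been sent the successor is on its way, and the predecessor has no reason to wait for the response.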

In the meantime, this was a fun exploration of the GAE environment. It makes it clear to me that this environment is still a toy. But a very interesting and promising one.

[UPDATED 2009/28: Looks like a real GAE scheduler is coming.]

Filed under Brain teaser, Everything, Google, Google App Engine, Implementation, Testing, Utility computing

May 31, 2008

Google App Engine: less is more

“If you have a stove, a saucepan and a bottle of cold water, how can you make boiling water?”

If you ask this question to a mathematician, they’ll think about it a while, and finally tell you to pour the water in the saucepan, light up the stove and put the saucepan on it until the water boils. Makes sense. Then ask them a slightly different question: “if you have a stove and a saucepan filled with cold water, how can you make boiling water?”. They’ll look at you and ask “can I also have a bottle”? If you agree to that request they’ll triumphantly announce: “pour the water from the saucepan into the bottle and we are back to the previous problem, which is already solved.”

In addition to making fun of mathematicians, this is a good illustration of the “fake machine” approach to utility computing embodied by Amazon’s EC2. There is plenty of practical value in emulating physical machines (either in your data center, using VMware/Xen/OVM or at a utility provider’s site, e.g. EC2). They are all rooted in the fact that there is a huge amount of code written with the assumption that it is running on an identified physical machine (or set of machines), and you want to keep using that code. This will remain true for many many years to come, but is it the future of utility computing?

Google’s App Engine is a clear break from this set of assumptions. From this perspective, the App Engine is more interesting for what it doesn’t provide than for what it provides. As the description of the Sandbox explains:

“An App Engine application runs on many web servers simultaneously. Any web request can go to any web server, and multiple requests from the same user may be handled by different web servers. Distribution across multiple web servers is how App Engine ensures your application stays available while serving many simultaneous users [not to mention that this is also how they keep their costs low — William]. To allow App Engine to distribute your application in this way, the application runs in a restricted ‘sandbox’ environment.”

The page then goes on to succinctly list the limitations of the sandbox (no filesystem, limited networking, no threads, no long-lived requests, no low-level OS functions). The limitations are better described and commented upon here but even that article misses one major limitation, mentioned here: the lack of scheduler/cron.

Rather than a feature-by-feature comparison between the App Engine and EC2 (which Amazon would win handily at this point), what is interesting is to compare the underlying philosophies. Even with Amazon EC2, you don’t get every single feature your local hardware can deliver. For example, in its initial release EC2 didn’t offer a persistent filesystem, only a storage-as-a-service interface (S3 and then SimpleDB). But Amazon worked hard to fix this as quickly as possible in order to appear as similar to a physical infrastructure as possible. In this entry, announcing persistent storage for EC2, Amazon’s CTO takes pains to highlight this achievement:

“Persistent storage for Amazon EC2 will be offered in the form of storage volumes which you can mount into your EC2 instance as a raw block storage device. It basically looks like an unformatted hard disk. Once you have the volume mounted for the first time you can format it with any file system you want or if you have advanced applications such as high-end database engines, you could use it directly.”

and

“And the great thing is it that it is all done with using standard technologies such that you can use this with any kind of application, middleware or any infrastructure software, whether it is legacy or brand new.”

Amazon works hard to hide (from the application code) the fact that the infrastructure is a huge, shared, distributed system. The beauty (and business value) of their offering is that while the legacy code thinks it is running in a good old data center, the paying customer derives benefits from the fact that this is not the case (e.g. fast/easy/cheap provisioning and reduced management responsibilities).

Google, on the other hand, embraces the change in underlying infrastructure and requires your code to use new abstractions that are optimized for that infrastructure.

To use an automotive analogy, Amazon is offering car drivers a switch to a gas/electric hybrid that refuels in today’s gas stations while Google is pushing for a direct jump to hydrogen fuel cells.

History is rarely kind to promoters of radical departures. The software industry is especially fond of layering the new on top of the old (a practice that has been enabled by the constant increase in underlying computing capacity). If you are wondering why your command prompt, shell terminal or text editor opens with a default width of 80 characters, take a trip back to 1928, when IBM defined its 80-column punch card format. Will Google beat the odds or be forced to be more accommodating of existing code?

It’s not the idea of moving to a more abstracted development framework that worries me about Google’s offering (JEE, Spring and Ruby on Rails show that developers want this move anyway, for productivity reasons, even if there is no change in the underlying infrastructure to further motivate it). It’s the fact that by defining their offering at the level of this framework (as opposed to one level below, like Amazon), Google puts itself in the position of having to select the right framework. Sure, they can support more than one. But the speed of evolution in that area of the software industry shows that it’s not mature enough (yet?) for any party to guess where application frameworks are going. Community experimentation has been driving application frameworks, and Google App Engine can’t support this. It can only select and freeze a few frameworks.

Time will tell which approach works best, whether they should exist side by side or whether they slowly merge into a “best of both worlds” offering (Amazon already offers many features, like snapshots, that aim for this “best of both worlds”). Unmanaged code (e.g. C/C++ compiled programs) and managed code (JVM or CLR) have been coexisting for a while now. Traditional applications and utility-enabled applications may do so in the future. For all I know, Google may decide that it makes business sense for them too to offer a Xen-based solution like EC2 and Amazon may decide to offer a more abstracted utility computing environment along the lines of the App Engine. But at this point, I am glad that the leaders in utility computing have taken different paths as this will allow the whole industry to experiment and progress more quickly.

The comparison is somewhat blurred by the fact that the Google offering has not reached the same maturity level as Amazon’s. It has restrictions that are not directly related to the requirements of the underlying infrastructure. For example, I don’t see how the distributed infrastructure prevents the existence of a scheduling service for background jobs. I expect this to be fixed soon. Also, Amazon has a full commercial offering, with a price list and an ecosystem of tools, while Google only offers a very limited beta environment for which you can’t buy extra capacity (but this too is changing).

2 Comments

Filed under Amazon, Everything, Google, Google App Engine, OVM, Portability, Tech, Utility computing, Virtualization, VMware

March 31, 2008

Where will you be when the Semantic Web gets Grid’ed?

I see the tide rising for semantic technologies. On the other hand, I wonder if they don’t need to fail in order to succeed.

Let’s use the Grid effort as an example. By “Grid effort” I mean the work that took place in and around OGF (or GGF as it was known before its merger w/ EGA). That community, mostly made of researchers and academics, was defining “utility computing” and creating related technology (e.g. OGSA, OGSI, GridFTP, JSDL, SAGA as specs, Globus and Platform as implementations) when Amazon was still a bookstore. There was an expectation that, as large-scale, flexible, distributed computing became a more pressing need for the industry at large, the Grid vision and technology would find their way into the broader market. That’s probably why IBM (and to a lesser extent HP) invested in the effort. Instead, what we are seeing is a new approach to utility computing (marketed as “cloud computing”), delivered by Amazon and others. It addresses utility computing with a different technology than Grid. With x86 virtualization as a catalyst, “cloud computing” delivers flexible, large-scale computing capabilities in a way that, to the users, looks a lot like their current environment. They still have servers with operating systems and applications on them. It’s not as elegant and optimized as service factories, service references (GSR), service handles (GSH), etc. but it maps a lot better to administrators’ skills and tools (and to running the current code unchanged). Incremental changes with quick ROI beat paradigm shifts 9 times out of 10.

Is this indicative of what is going to happen with semantic technologies? Let’s break it down chronologically:

- Trailblazers (often faced with larger/harder problems than the rest of us) come up with a vision and a different way to think about what computers can do (e.g. the “computers -> compute grid” transition).
- They develop innovative technology, with a strong theoretical underpinning (OGSA-BES and those listed above).
- There are some successful deployments, but the adoption is mostly limited to a few niches. It is seen as too complex and too different from current practices for broad adoption.
- Outsiders use incremental technology to deliver 80% of the vision with 20% of the complexity. Hype and adoption ensue.

If we are lucky, the end result will look more like the nicely abstracted utility computing vision than the “did you patch your EC2 Xen images today” cloud computing landscape. But that’s a necessary step that Grid computing failed to leapfrog.

Semantic web technologies can easily be mapped to the first three bullets. Replace “computers -> compute grid” with “documents/data -> information” in the first one. Fill in RDF, RDFS, OWL (with all its flavors), SPARQL etc. as counterparts to OGSA-BES and friends in the second. For the third, consider life sciences and defense as niche markets in which semantic technologies are seeing practical adoption. What form will bullet #4 take for semantic technology (e.g. who is going to be the EC2 of semantic technology)? Or is this where it diverges from Grid and instead gets adopted in its “original” form?

1 Comment

Filed under Everything, Grid, HP, IBM, RDF, Research, Semantic tech, Specs, Standards, Tech, Utility computing, Virtualization

March 27, 2008

Amazon to the rescue

In his 15 Ways to Tell Its Not Cloud Computing post, James Governor asserts that:

“If you know where the machines are… its not a cloud.”

I took issue with this in a comment on his post.

And today, Amazon EC2 makes me feel smug:

“Availability Zones give you additional control of where your EC2 instances are run. We use a two level model which consists of geographic regions broken down into logical zones.”

Here are more details on how it works. And Amazon’s feature guide for availability zones.

2 Comments

Filed under Everything, IT Systems Mgmt, Utility computing

March 25, 2008

Elastra and data center configuration formats

I heard tonight for the first time of a company called Elastra. It sounds like they are trying to address a variation of the data center automation use cases covered by Opsware (now HP) and Bladelogic (now BMC). Elastra seems to be in an awareness-building phase and as far as I can tell it’s working (since I heard about them). They got to me through John’s blog. They are also using the more conventional PR channel (and in that context they follow all the cheesy conventions: you get to “unlock the value”, with “the leading provider” who gives you “a new product that revolutionizes…” etc, all before the end of the first paragraph). And while I am making fun of the PR-talk I can’t help zeroing in on this quote from the CEO, who “wanted to pick up where utility computing left off – to go beyond the VM and toward virtualizing complex applications that span many machines and networks”. Does he feel the need to narrowly redefine “utility computing” (who knew that all that time “utility computing” was just referring to a single hypervisor?) as a way to justify the need for the new “cloud” buzzword (you’ll notice that I haven’t quite given up yet, this post is in the “utility computing” category and I still do not have a “cloud” category)?

The implied difference with Opsware and Bladelogic seems to be that while these incumbents (hey Bladelogic, how does it feel to be an “incumbent”?) automate data center management tasks in old boring data centers, Elastra does it in clouds. More specifically “public and private compute clouds”. I think I know roughly what a public cloud is supposed to be (e.g. EC2), but a private cloud? How is that different from a data center? Is a private cloud a data center that has the Elastra management software deployed? In that case, how is automating private clouds with Elastra different from automating data centers with Elastra? Basically it sounds like they don’t want to be seen as competing with Opsware and Bladelogic so they try to redefine the category. Which makes it easier to claim (see above) to be “the leading provider of software for designing, deploying, and managing applications in public and private compute clouds” without having the discovery or change management capabilities of Opsware (or anywhere near the same number of customers).

John seems impressed by their “public cloud” capabilities (I don’t think he has actually tested them yet though) and I trust him on that. Knowing the complexities of internal data centers, I am a lot more doubtful of the “private cloud” claims (at least if I interpret them correctly).

Anyway, I am getting carried away with some easy nitpicking on the PR-talk, but in truth it uses a pretty standard level of obfuscation/hype for this type of press release. Sad, I know.

The interesting thing (and the reason I started this blog entry in the first place) is that they seem to have created structures to capture system design (ECML) and deployment (EDML) rules. From John’s blog:

“At the core of Elastra’s architecture are the system design specifications called ECML and EDML. ECML is an XML markup language to specify a cloud design (i.e., multiple system design of firewalls, load balancers, app servers, db servers, etc…). The EDML markup provides the provisioning instructions.”

John generously adds “Elastra seems to be the first to have designed their autonomics into a standards language” which seems to assume that anything in XML is a standard. Leaving the “standard” debate aside, an XML format does tend to improve interoperability and that’s a good thing.

So where are the specifications for these ECML and EDML formats? I would be very interested in reading them, but they don’t appear to be available anywhere. Maybe that would be a good first step towards making them industry standards.

I would be especially interested in comparing this to what the SML/CML effort is going after. Here are some propositions that need to be validated or disproved. Comparing SML/CML to ECML/EDML could help shed light on them:

- SML/CML encompasses important and useful datacenter automation use cases.
- Some level of standardization of cross-domain system design/deployment/management is needed.
- SML/CML will be too late.
- SML/CML will try to do too many things at once.

You can perform the same exercise with OVF. Why isn’t OVF based on SML? If you look at the benefits that could theoretically be derived from that approach (hardware, VM, network and application configuration all in the same metamodel) it tells you about all that is attractive about SML. At the same time, if you look at the fact that OVF is happening while CML doesn’t seem to go anywhere, it tells you that the “from the very top all the way down to the very bottom” approach that SML is going after is very difficult to pull off. Especially with so many cooks in the kitchen.

And BTW, what is the relationship between ECML/EDML and OVF? I’d like to find out where the Elastra specifications land in all this. In the worst case, they are just an XML rendering of the internals of the Elastra application, mixing all domains of the IT stack. The OOXML of data center automation if you want. In the best case, it is a supple connective tissue that links stiffer domain-specific formats.

[UPDATED 2008/3/26: Elastra’s “introduction to elastic programing” white paper has a few words about the relationship between OVF and EDML: “EDML builds on the foundation laid by Open Virtual Machine Format (OVF) and extends that language’s capabilities to specify ways in which applications are deployed onto a Virtual Machine system”. Encouraging, if still vague.]

[UPDATED 2008/3/31: A week ago I hadn’t heard of Elastra and now I learn that I had been tracking the blog of its lead-architect-to-be all along! Maybe Stu will one day explain what a “private cloud” is. His description of his new company seems to confirm my impression that they are really focused (for now at least) on “public clouds” and not the Opsware-like “private clouds” automation capabilities. Maybe the “private clouds” are just in the business plan (and marketing literature) to be able to show a huge potential market to VCs so they pony up the funds. Or maybe they really plan to go after this too. Being able to seamlessly integrate both (for mixed deployments) is the holy grail, I am just dubious that focusing on this rather than doing one or the other “right” is the best starting point for a new company. My guess is that despite the “private cloud” talk, they are really focusing on “public clouds” for now. That’s what I would do anyway.]

[UPDATED on 2008/6/25: Stephen O’Grady has an interesting post about the role of standards in Cloud computing. But he only looks at it from the perspective of possible standardization of the interfaces used by today’s Cloud providers. A full analysis also needs to include the role, in Cloud Computing, of standards (app runtime standards, IT management standards, system modeling standards, etc…) that started before Cloud computing was big. Not everything in Cloud computing is new. And even less is new about how it will be used. Especially if, as I expect, utility computing and on-premise computing are going to become more and more intertwined, resulting in the need to manage them as a whole. If my app is deployed at Amazon, why doesn’t it (and its hosts) show up in my CMDB and in my monitoring panel? As Coté recently wrote, “as the use of cloud computing for an extension of data centers evolves, you could see a stronger linking between Hyperic’s main product, HQ and something like Cloud Status”.]

Filed under Automation, CML, Everything, IT Systems Mgmt, OVF, SML, Tech, Utility computing, Virtualization

February 15, 2008

Fog Computing

As happened with Salesforce.com a couple of years ago, Amazon S3 is having serious problems serving its customers today. Like Salesforce.com at the time, Amazon is criticized for not being transparent enough about it.

Right now, “cloud computing” is also “fog computing”. There is very little visibility (if any) into the infrastructure that is being consumed as a service. Part of this is a feature (a key reason for using these services is freedom from low-level administration) but part of it is a defect.

The clamor for Amazon to provide more updates about the outage on the AWS blog is a bit misplaced in that sense. Sure, that kind of visibility (“well folks, it was bring-your-hamster-to-work day at the Amazon data center today and turns out they love chewing cables. Our bad. The local animal refuge is sending us all their cats to help deal with the mess. Stay tuned”) gives a warm fuzzy (!) feeling but that’s not very actionable.

It’s not a matter for Amazon of giving access to its entire management stack (even in view-only mode) to its customers. It’s a matter of extracting customer-facing metrics that are relevant and exposing them in a way that can be consumed by the customer’s IT management tools. So they can be integrated in the overall IT decisions. And it’s not just monitoring even though that’s a good start. Saying “I don’t want to know how you run the service, all I care is what you do for me”, only takes you so far in enterprise computing. This opacity is a great way to hide single points of failure:

I predict (as usual, no date) that we will see companies that thought they were hedging their bets by using two different SaaS providers only to realize, on the day Amazon goes down again, that both SaaS providers were hosting on Amazon EC2 (or equivalent). Or, on the day a BT building catches fire, that both SaaS providers had their data centers there.

Just another version of “for diversification, I had a high yield fund and a low risk fund. I didn’t really read the prospectus. Who would have guessed that they were both loaded with mortgage debt?”
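
To make the hidden single point of failure concrete, here is a minimal sketch of the check a customer could run if providers disclosed where they host. The provider names and the hosting data are made up for illustration; no provider is known to expose this information in this form.

```python
from collections import defaultdict

# Hypothetical disclosures: the infrastructure each SaaS provider runs on.
provider_hosting = {
    "saas-provider-a": {"Amazon EC2"},
    "saas-provider-b": {"Amazon EC2", "own data center"},
    "saas-provider-c": {"BT building"},
}


def shared_dependencies(hosting):
    """Return each piece of infrastructure used by more than one provider."""
    users = defaultdict(set)
    for provider, infrastructures in hosting.items():
        for infrastructure in infrastructures:
            users[infrastructure].add(provider)
    # Keep only the infrastructure that two or more providers depend on.
    return {
        infrastructure: sorted(providers)
        for infrastructure, providers in users.items()
        if len(providers) > 1
    }


print(shared_dependencies(provider_hosting))
```

With this sample data, the two providers picked "for diversification" turn out to share Amazon EC2. The check is trivial; the hard part is getting providers to publish the input.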

More about IT management in a utility computing world in a previous entry.

[UPDATED: Things have improved a bit since writing this. Amazon now has a status panel. But it’s still limited to monitoring. Today it’s Google App Engine who is taking the heat.]

Filed under Everything, Governance, IT Systems Mgmt, Utility computing

January 27, 2008

IT management in a world of utility IT

A cynic might call it “could computing” rather than “cloud computing”. What if you could get rid of your data center? What if you could pay only for what you use? What if you could ramp up your capacity on the fly? We’ve been hearing these promising pitches for a while now and recently the intensity has increased, fueled by some real advances.

As an IT management architect who is unfortunately unlikely to be in a position to retire anytime soon (donations accepted for the send-William-to-retirement-on-a-beach fund) it forces me to wonder what IT management would look like in a world in which utility computing is a common reality.

First, these utility computing providers themselves will need plenty of IT management, if not necessarily the exact same kind that is being sold to enterprises today. You still need provisioning (automated of course). You definitely need access measuring and billing. Disaster recovery. You still have to deal with change planning, asset management and maybe portfolio management. You need processes and tools to support them. Of course you still have to monitor, manage SLAs, and pinpoint problems and opportunities for improvement. Etc. Are all of these a source of competitive advantage? Google is well-known for writing its infrastructure software (and of course also its applications) in house but there is no reason it should be that way, especially as the industry matures. Even when your business is to run a data center, not all aspects of IT management provide competitive differentiation. It is also very unclear at this point what the mix will be of utility providers that offer raw infrastructure (like EC2/S3) versus applications (like CRM as a service), a difference that may change the scope of what they would consider their crown jewels.

An important variable in determining the market for IT management software directed at utility providers is the number of these providers. Will there be a handful or hundreds? Many people seem to assume a small number, but my intuition goes the other way. The two main reasons for being only a handful would be regulation and infrastructure limitations. But, unlike with today’s utilities, I don’t see either taking place for utility computing (unless you assume that the network infrastructure is going to get vertically integrated in the utility data center offering). The more independent utility computing providers there are, the more it makes sense for them to pool resources (either explicitly through projects like the Collaborative Software Initiative or implicitly by buying from the same set of vendors) which creates a market for IT management products for utility providers. And conversely, the more of a market offering there is for the software and hardware building blocks of a utility computing provider, the lower the economies of scale (e.g. in software development costs) that would tend to concentrate the industry.

Oracle for one is already selling to utility providers (SaaS-type more than EC2-type at this point) with solutions that address scalability, SLA and multi-tenancy. Those solutions go beyond the scope of this article (they include not just IT management software but also databases and applications) but Oracle Enterprise Manager for IT management is also part of the solution. According to this Aberdeen report the company is doing very well in that market.

The other side of the equation is the IT management software that is needed by the consumers of utility computing. Network management becomes even more important. Identity/security management. Desktop management of some sort (depending on whether and what kind of desktop virtualization you use). And, as Microsoft reminds us with S+S, you will most likely still be running some software on-premises that needs to be managed (Carr agrees). The new, interesting thing is going to be the IT infrastructure to manage your usage of utility computing services as well as their interactions with your in-house software. Which sounds eerily familiar. In the early days of WSMF, one of the scenarios we were attempting to address (arguably ahead of the times) was service management across business partners (that is, the protocols and models were supposed to allow companies to expose some amount of manageability along with the operational services, so that service consumers would be able to optimize their IT management decisions by taking into account management aspects of the consumed services). You can see this in the fact that the WSMF-WSM specification (that I co-authored and edited many years ago at HP) contains a model of a “conversation” that represents a “set of related messages exchanged with other Web services” (a decentralized view of a BPEL instance, one that represents just one service’s view of its participation in the instance). Well, replace “business partner” with “SaaS provider” and you’re in a very similar situation. If my business application calls a mix of internal services, SaaS-type services and possibly some business partner services, managing SLAs and doing impact/root cause analysis works a lot better if you get some management information from these other services. Whether it is offered by the service owner directly, by a proxy/adapter that you put on your end or by a neutral third party in charge of measuring/enforcing SLAs.
There are aspects of this that are “regular” SOA management challenges (i.e. that apply whenever you compose services, whether you host them yourself or not) and there are aspects (security, billing, SLA, compliance, selection of partners, negotiation) that are handled differently in the situation where the service is consumed from a third party. But by and large, it remains a problem of management integration in a world of composed, orchestrated and/or distributed applications. Which is where it connects with my day job at Oracle.
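
As a rough illustration of why management information from the consumed services helps, here is a sketch that assumes each service in the mix reports an availability figure. The service names and numbers are invented, and the model is deliberately naive: it assumes the application needs every service and that failures are independent.

```python
from math import prod

# Hypothetical availability figures. In practice they might come from the
# service owner, a local proxy/adapter, or a neutral third party.
reported_availability = {
    "internal-billing": 0.999,
    "saas-crm": 0.995,
    "partner-shipping": 0.99,
}


def composite_availability(services):
    """Availability of an application that needs every listed service,
    under the simplifying assumption that failures are independent."""
    return prod(services.values())


def weakest_link(services):
    """Name of the service with the lowest reported availability."""
    return min(services, key=services.get)


print(f"composite: {composite_availability(reported_availability):.4f}")
print(f"weakest link: {weakest_link(reported_availability)}")
```

Without the figures from the SaaS and partner services, the only term you can fill in is the one for the service you host yourself.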

Depending on the usage type and the level of industry standardization, switching from one utility computing provider to the other may be relatively painless and easy (modify some registry entries or some policy or even let it happen automatically based on automated policies triggered by a price change for example) or a major task (transferring huge amounts of data, translating virtual machines from one VM format to another, performing in-depth security analysis…). Market realities will impact the IT tools that get developed and the available IT tools will in return shape the market.
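
The “painless” end of that spectrum, an automated policy triggered by a price change, could look something like the sketch below. The `Offer` type, the provider names, the prices and the 10% threshold are all hypothetical; a real policy would also have to weigh the migration costs listed above.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    provider: str
    price_per_hour: float


def choose_provider(current, offers, min_saving=0.10):
    """Stay with the current provider unless the cheapest offer undercuts
    its price by at least min_saving (expressed as a fraction)."""
    best = min(offers, key=lambda offer: offer.price_per_hour)
    current_price = next(
        offer.price_per_hour for offer in offers if offer.provider == current
    )
    if best.price_per_hour <= current_price * (1 - min_saving):
        return best.provider
    return current


offers = [
    Offer("provider-a", 0.10),
    Offer("provider-b", 0.08),
    Offer("provider-c", 0.095),
]
print(choose_provider("provider-a", offers))
```

The threshold is there so that a small price difference does not trigger a switch, which only makes sense when switching itself is cheap.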

Another intriguing opportunity, if you assume a mix of on-premises computing and utility-based computing, is that of selling back your spare capacity on the grid. That too would require plenty of supporting IT management software for provisioning, securing, monitoring and policing (coming soon to an SEC filing: “our business was hurt by weak sales of our flagship Pepsi cola drink, partially offset by revenue from renting computing power from our data center to the Coca-Cola Company to handle their exploding ERP application volume”). I believe my neighbors with solar panels on their roofs are able to run their electric meters backward and sell power to PG&E when they generate more than they use. But I’ll stop here with the electric grid analogy because it is already overused. I haven’t read Carr’s book so the comment may be unfair, but based on extracts he posted and reviews he seems to have a hard time letting go of that analogy. It does a good job of making the initial point but gets tiresome after a while. Having personally experienced the Silicon Valley summer rolling black-outs, I very much hope the economics of utility computing won’t be as warped. For example, I hope that the telcos will only act as technical, not commercial intermediaries. One of the many problems in California is that consumers don’t buy from the producers but from a distributor (PG&E in the Bay Area) who sells at a fixed price and then has to buy at pretty much any price from the producers and brokers who made a killing manipulating the supply during these summers. Utility computing is another area in which economics and technology are intrinsically and dynamically linked in a way that makes predictions very difficult.

For those not yet bored of this topic (or in search of a more insightful analysis), Redmonk’s Coté has taken a crack at that same question, but unlike me he stays clear of any amateurish attempt at an economic analysis. You may also want to read Ian Foster’s analysis (interweaving pieces of technology, standards, economy, marketing, computer history and even some movie trivia) on how these “clouds” line up with the “grids” that he and others have been working on for a while now. Some will see his post as a welcome reminder that the only thing really new in “cloud” computing is the name and others will say that the other new thing is that it is actually happening in a way that matters to more than a few academics and that Ian is just trying to hitch his jalopy to the express train that’s passing him. For once I am in the “less cynical” camp on this and I think a lot of the “traditional” Grid work is still very relevant. Did I hear “EC2 components for SmartFrog”?

[UPDATED 2008/6/30: For a comparison of “cloud” and “grid”, see here.]

[UPDATED 2008/9/22: More on the Cloud vs. Grid debate: a paper critical of Grid (in the OGF sense of the term) efforts and Ian Foster’s reply (read the comments too).]

Filed under Business, Everything, IT Systems Mgmt, Utility computing, Virtualization