Google App Engine | Cloud Comedy, Cloud Tragedy

As previously described, Google App Engine (GAE) doesn’t support long running processes. Each process lives in the context of an HTTP request handler and needs to complete within a few seconds. If you’re trying to get extra CPU cycles for some task then Amazon EC2, not GAE, is the right tool (including the option to get high-CPU instances for the CPU-intensive tasks).

More surprising is the fact that GAE doesn’t offer a scheduler. Your app can only get invoked when someone sends it an HTTP request and you can’t ask GAE to generate a canned request every so often (crontab-style). That seems both limiting and arbitrary. In fact, I would be surprised if GAE didn’t soon add support for this.

In the meantime, your best bet is to get an account on a separate server that lets you schedule jobs, at which point you can drive your GAE application from that external scheduler (through HTTP requests to your GAE app). But just for the intellectual exercise, how would one meet the need while staying entirely within the confines of the Google-provided infrastructure?

The most obvious option is to piggyback on HTTP requests from your visitors. But:
- this assumes that you consistently get visitors at a frequency greater than your scheduler’s interval,
- since you can’t launch sub-processes in GAE, this delays your responses to the visitor,
- more worrisome, if your scheduled task takes more than a few seconds this means your application might be interrupted by GAE before you respond to the visitor, resulting in a failed request from their perspective.
You can try to improve a bit on this by doing this processing not as part of the main request from your visitor but rather by putting in the response HTML some JavaScript that will asynchronously send you HTTP requests in the background (typically not visible to the user). This way, a given visitor will give you repeated invocations for as long as the page is open in the browser. And you can set the invocation interval. You can even create some kind of server-controlled auto-modulation of the interval (increasing it as your number of concurrent visitors increases) so that you don’t eat all your Google-allocated incoming HTTP quota with these XMLHttpRequest invocations. This would probably be a very workable way to do it in practice even though:
- it only works if your application has visitors who use web browsers, not if it only consumed by programs (e.g. through RSS feeds or other XML format),
- it puts the burden on your visitors who may or may not appreciate it, assuming they realize it is happening (how would you feel if your real estate agent had to borrow your cell phone to arrange home visits for you and their other customers?).
While GAE doesn’t offer a scheduler, another Google service, Google Reader, offers one of sorts. If you register a feed there, Google’s FeedReader will retrieve it once a while (based on my logs, it happens approximately every hour for each of the two feeds for this blog). You can create multiple URLs that all map to the same handler and return some dummy RSS. If you register these feeds with Google Reader, they’ll get pulled once a while. Of course there is no guarantee that the pulling of the different feeds will be nicely spread out, but if you register enough of them you should manage to get invoked with a frequency compatible with you desired scheduler’s frequency.

That’s all nice, but it doesn’t entirely live within the GAE application. It depends on either the visitors or Google Reader. Can we do this entirely within GAE?

The idea is that since a GAE app can only executes within an HTTP request handler, which only runs for a few seconds, you can emulate a long-running process by automatically starting a successor request when the previous one is killed. This is made possible by two characteristics of the GAE runtime:

When an HTTP request is canceled on the client side, the request execution on the server is permitted to continue (until it returns or GAE kills it for having run too long).
When GAE kills a request for having run too long, it does it through an exception that you have a chance to handle (at least for a few seconds, until you get killed for good), which is when you initiate the HTTP request that spawns the successor process.

If you’ve watched (or played) Rugby, this is equivalent to passing the ball to a teammate during that short interval between when you’re tackled and when you hit the ground (I have no idea whether the analogy also applies to Rugby’s weird cousin called American Football).

In practice, all you have to do is structure your long running task like this:

class StartHandler(webapp.RequestHandler):
  def get(self):
    if (StopExec.all().count() == 0):
      try:
        id = int(self.request.get("id"))
        logging.debug("Request " + str(id) + " is starting its work.")
        # This is where you do your work
      finally:
        logging.debug("Request " + str(id) + " has been stopped.")
        # Save state to the datastore as needed
        logging.debug("Launching successor request with id=" + str(id+1))
        res = urlfetch.fetch("http://myGaeApp.appspot.com/start?id=" + str(id+1))

Once you have deployed this app, just point your browser to http://myGaeApp.appspot.com/start?id=0 (assuming of course that your GAE app is called “myGaeApp”) and the long-running process is started. You can hit the “stop” button on your browser and turn off your computer, the process (or more exactly the succession of processes) has a life of its own entirely within the GAE infrastructure.

The “if (StopExec.all().count() == 0)” statement is my way of keeping control over the beast (if only Dr. Frankenstein had as much foresight). StopExec is an entity type in the datastore for my app. If I want to kill this self-replicating process, I just need to create an entity of this type and the process will stop replicating. Without this, the only way to stop it would be to delete the whole application through the GAE dashboard. In general, using the datastore as shared memory is the way to communicate with this emulation of a long-running process.

A scheduler is an obvious example of a long-running process that could be implemented that way. But there are other examples. The only constraint is that your long-running process should expect to be interrupted (approximately every 9 seconds based on what I have seen so far). It will then re-start as part of a new instance of the same request handler class. You can communicate state between one instance and its successor either via the request parameters (like the “id” integer that I pass in the URL) or by writing to the datastore (in the “finally” clause) and reading from it (at the beginning of your task execution).

By the way, you can’t really test such a system using the toolkit Google provides for local testing, because that toolkit behaves very differently from the real GAE infrastructure in the way it controls long-running processes. You have to run it in the real GAE environment.

Does it work? For a while. The first time I launched it, it worked for almost 30 minutes (that’s a lot of 9 second-long processes). But I started to notice these worrisome warnings in the logs: “This request used a high amount of CPU, and was roughly 21.7 times over the average request CPU limit. High CPU requests have a small quota, and if you exceed this quota, your app will be temporarily disabled.”

And indeed, after 30 minutes of happiness my app was disabled for a bit.

My quota figures on the dashboard actually looked pretty good. This was not a very busy application.

CPU Used 175.81 of 199608.00 Gigacycles (0%)
Data Sent 0.00 of 2048.00 Megabytes (0%)
Data Received 0.00 of 2048.00 Megabytes (0%)
Emails Sent 0.00 of 2000.00 Emails (0%)
Megabytes Stored 0.05 of 500.00 Megabytes (0%)

But the warning in the logs points to some other restriction. Google doesn’t mind if you use a given number of CPU cycles through a lot of small requests, but it complains if you use the same number of cycles through a few longer requests. Which is not really captured in the “understanding application quotas” page. I also question whether my long requests actually consume more CPU than normal (shorter) requests. I stripped the application down to the point where the “this is where you do your work” part was doing nothing. The only actual work, in the “finally” clause, was to opens an HTTP connection and wait for it to return (which never happens) until the GAE runtime kills the request completely. Hard to see how this would actually use much CPU. Yet, same warning. The warning text is probably not very reflective of the actual algorithm that flags my request as a hog.

What this means is that no matter how small and slim the task is, the last line (with the urlfetch.fetch() call) by itself is enough to get my request identified as a hog. Which means that eventually the app is going to get disabled. Which is silly really because by that the time fetch() gets called nothing useful is happening in this request (the work has transitioned to the successor request) and I’d be happy to have it killed as soon as the successor has been spawned. But GAE doesn’t give you a way to set client-side timeout on outgoing HTTP requests. Neither can you configure the GAE cop to kill you early so that you don’t enter the territory of “this request used a high amount of CPU”.

I am pretty confident that the ability to set client-side HTTP timeout will be added to the urlfetch API. Even Google’s documentation acknowledges this limitation: “Note: Since your application must respond to the user’s request within several seconds, a URL fetch action to a slow remote server may cause your application to return a server error to the user. There is currently no way to specify a time limit to the URL fetch action.” Of course, by the time they fix this they may also have added a real scheduler…

In the meantime, this was a fun exploration of the GAE environment. It makes it clear to me that this environment is still a toy. But a very interesting and promising one.

[UPDATED 2009/28: Looks like a real GAE scheduler is coming.]

“If you have a stove, a saucepan and a bottle of cold water, how can you make boiling water?”

If you ask this question to a mathematician, they’ll think about it a while, and finally tell you to pour the water in the saucepan, light up the stove and put the saucepan on it until the water boils. Makes sense. Then ask them a slightly different question: “if you have a stove and a saucepan filled with cold water, how can you make boiling water?”. They’ll look at you and ask “can I also have a bottle”? If you agree to that request they’ll triumphantly announce: “pour the water from the saucepan into the bottle and we are back to the previous problem, which is already solved.”

In addition to making fun of mathematicians, this is a good illustration of the “fake machine” approach to utility computing embodied by Amazon’s EC2. There is plenty of practical value in emulating physical machines (either in your data center, using VMWare/Xen/OVM or at a utility provider’s site, e.g. EC2). They are all rooted in the fact that there is a huge amount of code written with the assumption that it is running on an identified physical machine (or set of machines), and you want to keep using that code. This will remain true for many many years to come, but is it the future of utility computing?

Google’s App Engine is a clear break from this set of assumptions. From this perspective, the App Engine is more interesting for what it doesn’t provide than for what it provides. As the description of the Sandbox explains:

“An App Engine application runs on many web servers simultaneously. Any web request can go to any web server, and multiple requests from the same user may be handled by different web servers. Distribution across multiple web servers is how App Engine ensures your application stays available while serving many simultaneous users [not to mention that this is also how they keep their costs low — William]. To allow App Engine to distribute your application in this way, the application runs in a restricted ‘sandbox’ environment.”

The page then goes on to succinctly list the limitations of the sandbox (no filesystem, limited networking, no threads, no long-lived requests, no low-level OS functions). The limitations are better described and commented upon here but even that article misses one major limitation, mentioned here: the lack of scheduler/cron.

Rather than a feature-by-feature comparison between the App Engine and EC2 (which Amazon would won handily at this point), what is interesting is to compare the underlying philosophies. Even with Amazon EC2, you don’t get every single feature your local hardware can deliver. For example, in its initial release EC2 didn’t offer a filesystem, only a storage-as-a-service interface (S3 and then SimpleDB). But Amazon worked hard to fix this as quickly as possible in order to be appear as similar to a physical infrastructure as possible. In this entry, announcing persistent storage for EC2, Amazon’s CTO takes pain to highlight this achievement:

“Persistent storage for Amazon EC2 will be offered in the form of storage volumes which you can mount into your EC2 instance as a raw block storage device. It basically looks like an unformatted hard disk. Once you have the volume mounted for the first time you can format it with any file system you want or if you have advanced applications such as high-end database engines, you could use it directly.”

and

“And the great thing is it that it is all done with using standard technologies such that you can use this with any kind of application, middleware or any infrastructure software, whether it is legacy or brand new.”

Amazon works hard to hide (from the application code) the fact that the infrastructure is a huge, shared, distributed system. The beauty (and business value) of their offering is that while the legacy code thinks it is running in a good old data center, the paying customer derives benefits from the fact that this is not the case (e.g. fast/easy/cheap provisioning and reduced management responsibilities).

Google, on the other hand, embraces the change in underlying infrastructure and requires your code to use new abstractions that are optimized for that infrastructure.

To use an automotive analogy, Amazon is offering car drivers to switch to a gas/electric hybrid that refuels in today’s gas stations while Google is pushing for a direct jump to hydrogen fuel cells.

History is rarely kind to promoters of radical departures. The software industry is especially fond of layering the new on top of the old (a practice that has been enabled by the constant increase in underlying computing capacity). If you are wondering why your command prompt, shell terminal or text editor opens with a default width of 80 characters, take a trip back to 1928, when IBM defined its 80-columns punch card format. Will Google beat the odds or be forced to be more accommodating of existing code?

It’s not the idea of moving to a more abstracted development framework that worries me about Google’s offering (JEE, Spring and Ruby on Rails show that developers want this move anyway, for productivity reasons, even if there is no change in the underlying infrastructure to further motivate it). It’s the fact that by defining their offering at the level of this framework (as opposed to one level below, like Amazon), Google puts itself in the position of having to select the right framework. Sure, they can support more than one. But the speed of evolution in that area of the software industry shows that it’s not mature enough (yet?) for any party to guess where application frameworks are going. Community experimentation has been driving application frameworks, and Google App Engine can’t support this. It can only select and freeze a few framework.

Time will tell which approach works best, whether they should exist side by side or whether they slowly merge into a “best of both worlds” offering (Amazon already offers many features, like snapshots, that aim for this “best of both worlds”). Unmanaged code (e.g. C/C++ compiled programs) and managed code (JVM or CLR) have been coexisting for a while now. Traditional applications and utility-enabled applications may do so in the future. For all I know, Google may decide that it makes business sense for them too to offer a Xen-based solution like EC2 and Amazon may decide to offer a more abstracted utility computing environment along the lines of the App Engine. But at this point, I am glad that the leaders in utility computing have taken different paths as this will allow the whole industry to experiment and progress more quickly.

The comparison is somewhat blurred by the fact that the Google offering has not reached the same maturity level as Amazon’s. It has restrictions that are not directly related to the requirements of the underlying infrastructure. For example, I don’t see how the distributed infrastructure prevents the existence of a scheduling service for background jobs. I expect this to be fixed soon. Also, Amazon has a full commercial offering, with a price list and an ecosystem of tools, why Google only offers a very limited beta environment for which you can’t buy extra capacity (but this too is changing).

Category Archives: Google App Engine

Emulating a long-running process (and a scheduler) in Google App Engine

Google App Engine: less is more