Emulating a long-running process (and a scheduler) in Google App Engine

As previously described, Google App Engine (GAE) doesn’t support long-running processes. Each process lives in the context of an HTTP request handler and needs to complete within a few seconds. If you’re trying to get extra CPU cycles for some task then Amazon EC2, not GAE, is the right tool (including the option to get high-CPU instances for the CPU-intensive tasks).

More surprising is the fact that GAE doesn’t offer a scheduler. Your app can only get invoked when someone sends it an HTTP request and you can’t ask GAE to generate a canned request every so often (crontab-style). That seems both limiting and arbitrary. In fact, I would be surprised if GAE didn’t soon add support for this.

In the meantime, your best bet is to get an account on a separate server that lets you schedule jobs, at which point you can drive your GAE application from that external scheduler (through HTTP requests to your GAE app). But just for the intellectual exercise, how would one meet the need while staying entirely within the confines of the Google-provided infrastructure?

  • The most obvious option is to piggyback on HTTP requests from your visitors. But:
    • this assumes that you consistently get visitors at a frequency greater than your scheduler’s interval,
    • since you can’t launch sub-processes in GAE, this delays your responses to the visitor,
    • more worrisome, if your scheduled task takes more than a few seconds this means your application might be interrupted by GAE before you respond to the visitor, resulting in a failed request from their perspective.
  • You can try to improve a bit on this by doing this processing not as part of the main request from your visitor but rather by putting in the response HTML some JavaScript that will asynchronously send you HTTP requests in the background (typically not visible to the user). This way, a given visitor will give you repeated invocations for as long as the page is open in the browser. And you can set the invocation interval. You can even create some kind of server-controlled auto-modulation of the interval (increasing it as your number of concurrent visitors increases) so that you don’t eat all your Google-allocated incoming HTTP quota with these XMLHttpRequest invocations. This would probably be a very workable way to do it in practice even though:
    • it only works if your application has visitors who use web browsers, not if it is only consumed by programs (e.g. through RSS feeds or other XML formats),
    • it puts the burden on your visitors who may or may not appreciate it, assuming they realize it is happening (how would you feel if your real estate agent had to borrow your cell phone to arrange home visits for you and their other customers?).
  • While GAE doesn’t offer a scheduler, another Google service, Google Reader, offers one of sorts. If you register a feed there, Google’s FeedReader will retrieve it once in a while (based on my logs, it happens approximately every hour for each of the two feeds for this blog). You can create multiple URLs that all map to the same handler and return some dummy RSS. If you register these feeds with Google Reader, they’ll get pulled once in a while. Of course there is no guarantee that the pulling of the different feeds will be nicely spread out, but if you register enough of them you should manage to get invoked with a frequency compatible with your desired scheduler’s frequency.
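The server-controlled auto-modulation mentioned in the JavaScript option can be sketched in a few lines. This is only an illustration in plain Python, not code from my app: the function name and the "multiply the interval by the number of open pages" rule are assumptions, chosen so that N browsers together produce roughly one invocation per desired interval.

```python
def polling_interval_seconds(desired_interval, concurrent_visitors, minimum=1.0):
    """Interval (in seconds) each browser should wait between background requests.

    Hypothetical rule: if N pages are open and each one polls every
    N * desired_interval seconds, the server is invoked about once per
    desired_interval overall, whatever the number of visitors.
    """
    visitors = max(1, concurrent_visitors)  # treat "no known visitor" as one
    return max(minimum, desired_interval * visitors)


# One open page polls every minute; ten open pages each poll every ten minutes.
print(polling_interval_seconds(60, 1))
print(polling_interval_seconds(60, 10))
```

The server would return this value with each response and the page's JavaScript would use it as the delay before its next XMLHttpRequest.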

That’s all nice, but it doesn’t entirely live within the GAE application. It depends on either the visitors or Google Reader. Can we do this entirely within GAE?

The idea is that since a GAE app can only execute within an HTTP request handler, which only runs for a few seconds, you can emulate a long-running process by automatically starting a successor request when the previous one is killed. This is made possible by two characteristics of the GAE runtime:

  • When an HTTP request is canceled on the client side, the request execution on the server is permitted to continue (until it returns or GAE kills it for having run too long).
  • When GAE kills a request for having run too long, it does it through an exception that you have a chance to handle (at least for a few seconds, until you get killed for good), which is when you initiate the HTTP request that spawns the successor process.

If you’ve watched (or played) Rugby, this is equivalent to passing the ball to a teammate during that short interval between when you’re tackled and when you hit the ground (I have no idea whether the analogy also applies to Rugby’s weird cousin called American Football).

In practice, all you have to do is structure your long-running task like this:

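import logging

from google.appengine.api import urlfetch
from google.appengine.ext import db, webapp

class StopExec(db.Model):
  """Kill switch (assumed minimal definition): one entity of this type stops the chain."""

class StartHandler(webapp.RequestHandler):
  def get(self):
    if (StopExec.all().count() == 0):
      # Parse the id outside the try block so the finally clause can rely on it
      id = int(self.request.get("id"))
      try:
        logging.debug("Request " + str(id) + " is starting its work.")
        # This is where you do your work
      finally:
        # Runs when the work returns or when GAE interrupts it with an exception
        logging.debug("Request " + str(id) + " has been stopped.")
        # Save state to the datastore as needed
        logging.debug("Launching successor request with id=" + str(id+1))
        res = urlfetch.fetch("http://myGaeApp.appspot.com/start?id=" + str(id+1))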

Once you have deployed this app, just point your browser to http://myGaeApp.appspot.com/start?id=0 (assuming of course that your GAE app is called “myGaeApp”) and the long-running process is started. You can hit the “stop” button on your browser and turn off your computer: the process (or more exactly the succession of processes) has a life of its own entirely within the GAE infrastructure.

The “if (StopExec.all().count() == 0)” statement is my way of keeping control over the beast (if only Dr. Frankenstein had as much foresight). StopExec is an entity type in the datastore for my app. If I want to kill this self-replicating process, I just need to create an entity of this type and the process will stop replicating. Without this, the only way to stop it would be to delete the whole application through the GAE dashboard. In general, using the datastore as shared memory is the way to communicate with this emulation of a long-running process.

A scheduler is an obvious example of a long-running process that could be implemented that way. But there are other examples. The only constraint is that your long-running process should expect to be interrupted (approximately every 9 seconds based on what I have seen so far). It will then re-start as part of a new instance of the same request handler class. You can communicate state between one instance and its successor either via the request parameters (like the “id” integer that I pass in the URL) or by writing to the datastore (in the “finally” clause) and reading from it (at the beginning of your task execution).
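To make the hand-off concrete without needing the GAE runtime, here is a small simulation in plain Python. Everything in it is a stand-in that I made up for illustration: a dict plays the datastore, DeadlineExceeded plays the exception GAE raises when it interrupts a request, and a fixed number of work steps replaces the 9 seconds. The task (summing the integers below a goal) is arbitrary; the point is only that each "request" restores state, works until it is interrupted, saves state in the finally clause and names its successor.

```python
class DeadlineExceeded(Exception):
    """Stand-in for the exception raised when a request has run too long."""


def run_request(request_id, datastore, steps_before_kill):
    """Simulate one request. Return the successor's id, or None to end the chain."""
    if datastore.get("stop_exec"):
        return None  # kill switch is set: do not replicate

    # Read the state left behind by the predecessor (if any).
    position = datastore.get("position", 0)
    total = datastore.get("total", 0)
    successor = None
    try:
        for _ in range(steps_before_kill):
            if position >= datastore["goal"]:
                datastore["stop_exec"] = True  # work is done: stop the chain
                return None
            total += position
            position += 1
        raise DeadlineExceeded()  # simulated interruption by the runtime
    except DeadlineExceeded:
        successor = request_id + 1  # pass the ball before hitting the ground
    finally:
        # Save state for the successor, whether we finished or were interrupted.
        datastore["position"] = position
        datastore["total"] = total
    return successor


def run_chain(goal, steps_before_kill):
    """Drive the succession of requests; return (result, number of requests)."""
    datastore = {"goal": goal}
    request_id, requests_run = 0, 0
    while request_id is not None:
        requests_run += 1
        request_id = run_request(request_id, datastore, steps_before_kill)
    return datastore["total"], requests_run


print(run_chain(goal=10, steps_before_kill=3))  # prints (45, 4)
```

In the real thing the loop in run_chain doesn't exist: each request launches its own successor over HTTP, which is exactly what makes the chain independent of any client.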

By the way, you can’t really test such a system using the toolkit Google provides for local testing, because that toolkit behaves very differently from the real GAE infrastructure in the way it controls long-running processes. You have to run it in the real GAE environment.

Does it work? For a while. The first time I launched it, it worked for almost 30 minutes (that’s a lot of 9-second-long processes). But I started to notice these worrisome warnings in the logs: “This request used a high amount of CPU, and was roughly 21.7 times over the average request CPU limit. High CPU requests have a small quota, and if you exceed this quota, your app will be temporarily disabled.”

And indeed, after 30 minutes of happiness my app was disabled for a bit.

My quota figures on the dashboard actually looked pretty good. This was not a very busy application.

CPU Used 175.81 of 199608.00 Gigacycles (0%)
Data Sent 0.00 of 2048.00 Megabytes (0%)
Data Received 0.00 of 2048.00 Megabytes (0%)
Emails Sent 0.00 of 2000.00 Emails (0%)
Megabytes Stored 0.05 of 500.00 Megabytes (0%)

But the warning in the logs points to some other restriction. Google doesn’t mind if you use a given number of CPU cycles through a lot of small requests, but it complains if you use the same number of cycles through a few longer requests. Which is not really captured in the “understanding application quotas” page. I also question whether my long requests actually consume more CPU than normal (shorter) requests. I stripped the application down to the point where the “this is where you do your work” part was doing nothing. The only actual work, in the “finally” clause, was to open an HTTP connection and wait for it to return (which never happens) until the GAE runtime kills the request completely. Hard to see how this would actually use much CPU. Yet, same warning. The warning text is probably not very reflective of the actual algorithm that flags my request as a hog.

What this means is that no matter how small and slim the task is, the last line (with the urlfetch.fetch() call) by itself is enough to get my request identified as a hog. Which means that eventually the app is going to get disabled. Which is silly really because by the time fetch() gets called nothing useful is happening in this request (the work has transitioned to the successor request) and I’d be happy to have it killed as soon as the successor has been spawned. But GAE doesn’t give you a way to set a client-side timeout on outgoing HTTP requests. Neither can you configure the GAE cop to kill you early so that you don’t enter the territory of “this request used a high amount of CPU”.

I am pretty confident that the ability to set a client-side HTTP timeout will be added to the urlfetch API. Even Google’s documentation acknowledges this limitation: “Note: Since your application must respond to the user’s request within several seconds, a URL fetch action to a slow remote server may cause your application to return a server error to the user. There is currently no way to specify a time limit to the URL fetch action.” Of course, by the time they fix this they may also have added a real scheduler…
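For comparison, this is roughly what I am asking for, shown with the Python standard library outside of GAE (so this is not the urlfetch API, and the names SlowHandler, spawn_and_forget and run_demo are mine). A local server that takes 2 seconds to answer stands in for the successor request; the caller gives up after half a second instead of hanging until someone kills it.

```python
import http.server
import socket
import threading
import time
import urllib.error
import urllib.request


class SlowHandler(http.server.BaseHTTPRequestHandler):
    """Stand-in for the successor request: it takes a long time to answer."""

    def do_GET(self):
        time.sleep(2)  # pretend to be busy with the real work
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"done")
        except OSError:
            pass  # the caller has already hung up, which is the whole point

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def spawn_and_forget(url, timeout_seconds):
    """Send the request, but stop waiting for the response after timeout_seconds."""
    try:
        urllib.request.urlopen(url, timeout=timeout_seconds).read()
        return "completed"
    except (socket.timeout, urllib.error.URLError):
        return "timed out"


def run_demo():
    # Port 0 lets the operating system pick a free port on the local machine.
    server = http.server.HTTPServer(("127.0.0.1", 0), SlowHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/start?id=1" % server.server_address[1]
    return spawn_and_forget(url, timeout_seconds=0.5)


if __name__ == "__main__":
    print(run_demo())
```

The server keeps working after the caller has given up, which matches the first GAE characteristic listed earlier: canceling the request on the client side does not stop its execution on the server.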

In the meantime, this was a fun exploration of the GAE environment. It makes it clear to me that this environment is still a toy. But a very interesting and promising one.

[UPDATED 2009/28: Looks like a real GAE scheduler is coming.]


15 Responses to Emulating a long-running process (and a scheduler) in Google App Engine

  1. I’ve been trying to tackle this problem myself! Your solution is quite interesting, I’m wondering if a modification to it might work – instead of using urlfetch to grab another one and letting the originals be killed, what if you used a simple location redirect, so you’d have to have your browser open the whole time, but you could watch your progress in the url bar and can easily stop the process by closing the tab. Hmm….

    Also, you can delete a whole application? I can’t figure out how to do this, when I try to hit the delete button it pops a message that says it can’t delete the default version. I would very much love an easy way to delete an app, or at least delete its datastore.

  2. Yes and in fact I wrote another prototype that does just that. This time I was going after another use case. Rather than trying to simulate a long-running process, I was simply trying to solve the problem of having too many user requests fail because they take too long to execute. By using a redirect in the same manner you suggest, I allowed the request to spread its work over several invocations thus getting several 9-second slots (since this seems to be how much time GAE gives you per request right now). In theory it could go on for ever, but in practice, as I’ve reported in the blog post above, you get killed after a few minutes. But if you have an app that usually returns within one invocation but runs out of time say 10% of the time, this approach can make it a lot more robust and is potentially sustainable (you don’t encounter the wrath of the GAE resource cop if most of your requests complete fast enough). It’s a bit like giving kids w/ ADHD extra time for school exams, except this time you give requests with special needs extra CPU time to complete.

    Since you inquired about this, I’ll post the code for that prototype when I get home tonight.

    [UPDATED 2008/6/13: Done.]

  3. Pingback: William Vambenepe’s blog » Blog Archive » Some breathing room for Google App Engine requests

  4. I am also trying to a long time running app. Is it possible to fetch the app by itself to make it running in a long time?

  5. Yiqiang,

    Not sure I understand the question. If this is what you are asking, yes it is possible for an app to call itself. The problem is that once this is done the calling instance of the app has to hang and wait for the response to come back. It can’t just call itself and let the old instance die. The old instance dies eventually (killed by the GAE environment after approx 9 seconds) but it is marked as having used too much CPU (even though in fact it is a clock time metric, not a CPU time metric). And eventually your app gets killed for that. That’s what I described in the earlier blog entry (http://stage.vambenepe.com/archives/207)

  6. Andrew Bilyk

    Hello. Please help me with the following problem. I need to implement functionality for testing some URLs (currently getting the response code only) and the problem is that I need to run this test every 10-15 seconds. Can I do this with Google App Engine somehow? Maybe GAE has some services? I have read somewhere about Cron and Comet GAE services but haven’t found any examples.
    Thank you.

  7. Sorry Andrew, I don’t know of a GAE-provided cron service. There are such services out there that you can use to invoke your GAE app once in a while though.

  8. Andrew Bilyk

    Thank you for the answer. Just to be sure, I can’t implement the described functionality with the standard GAE developer tools. Am I right?

  9. Yes. That’s the result of my attempt: there are some things you can use for a short period but nothing that’s functional in any practical way.

  10. Pingback: Mashup.se » Google App Engine, del 4: Begränsningar och hur man tar sig runt dem

  11. Pingback: Google App Engine limitations, and how to get around them | Digitalistic - Mashup or die trying

  12. Pingback: William Vambenepe’s blog » Blog Archive » Now I know why GAE has been killing me

  13. Pingback: William Vambenepe’s blog » Blog Archive » Google App Engine is teasing me

  14. Pingback: William Vambenepe’s blog » Blog Archive » Long-running processes on Google App Engine: it finally works

  15. Pingback: William Vambenepe — PaaS as a satisfying and success-ready hobbyist plaform