Some breathing room for Google App Engine requests

As promised to Felix here is the code that shows how to give extra breathing room to Google App Engine (GAE) requests that may otherwise be killed for taking too long to complete. The approach is similar to the one previously described. But rather than trying to emulate a long-running process, I am simply allowing a request to spread its work over a handful of invocations, thus getting several 9 seconds slots (since this seems to be how much time GAE gives you per request right now).

If all your requests need this then you are going to run into the same “high CPU requests have a small quota, and if you exceed this quota, your app will be temporarily disabled” problem seen in the previous experiment. But if 90% of your requests complete in a normal time and only 10% of the requests need more time, then this approach can help prevent your users from getting an error for 1 out of every 10 requests. And you should fly under the radar of the GAE resource cop.

The way it works is simply that if your request is interrupted for having run too long the client gets a redirect to a new instance of the same handler. Because the code saves its results incrementally in the datastore, the new instance can build on the work of the previous one.

This specific example retrieves the ubuntu-8.04-server-i386.jigdo file (98K) from a handful of Ubuntu mirrors and returns the average/min/max download times (without checking if the transfer was successful or not). I also had to add a 1 second sleep after each fetch in order to trigger the DeadlineExceededError because the fetch operations go too quickly when running on GAE rather than my machine (I guess Google has better connectivity than my mediocre AT&T-provided DSL line, who would have thought).

#!/usr/bin/env python
#
# Copyright 2008 William Vambenepe
#

import wsgiref.handlers
import os
import logging
import time

from google.appengine.ext import db
from google.appengine.ext.webapp import template
from google.appengine.ext import webapp
from google.appengine.api import urlfetch
from google.appengine.runtime import DeadlineExceededError

targetUrls = ["http://mirror.anl.gov/pub/ubuntu-iso/CDs/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ubuntu.mirror.ac.za/ubuntu-release/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://mirrors.cytanet.com.cy/linux/ubuntu/releases/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ftp.kaist.ac.kr/pub/ubuntu-cd/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ftp.itu.edu.tr/Mirror/Ubuntu/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ftp.belnet.be/mirror/ubuntu.com/releases/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ubuntu-releases.sh.cvut.cz/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ftp.crihan.fr/releases/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ftp.uni-kl.de/pub/linux/ubuntu.iso/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://ftp.duth.gr/pub/ubuntu-releases/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://no.releases.ubuntu.com/hardy/ubuntu-8.04-server-i386.jigdo",
              "http://neacm.fe.up.pt/pub/ubuntu-releases/hardy/ubuntu-8.04-server-i386.jigdo"]

class MeasurementSet(db.Model):
  iteration = db.IntegerProperty()
  measurements = db.ListProperty(float)

class MainHandler(webapp.RequestHandler):
  def get(self):
    try:
      key = self.request.get("key")
      set = MeasurementSet.get(key)
      if (set == None):
        raise ValueError
      set.iteration = set.iteration + 1
      set.put()
      logging.debug("Resuming existing set, with key " + str(key))
    except:
      set = MeasurementSet()
      set.iteration = 1
      set.measurements = []
      set.put()
      logging.debug("Starting new set, with key " + str(set.key()))
    try:
      # Dereference remaining URLs
      for target in targetUrls[len(set.measurements):]:
        startTime = time.time()
        urlfetch.fetch(target)
        timeElapsed = time.time() - startTime
        time.sleep(1)
        logging.debug(target + " dereferenced in " + str(timeElapsed) + " sec")
        set.measurements.append(timeElapsed)
        set.put()
      # We're done dereferencing URLs, let's publish the results
      longestIndex = 0
      shortestIndex = 0
      totalTime = set.measurements[0]
      for i in range(1, len(targetUrls)):
        totalTime = totalTime + set.measurements[i]
        if set.measurements[i] < set.measurements[shortestIndex]:
          shortestIndex = i
        elif set.measurements[i] > set.measurements[longestIndex]:
          longestIndex = i
      template_values = {"iterations": set.iteration,
                         "longestTime": set.measurements[longestIndex],
                         "longestTarget": targetUrls[longestIndex],
                         "shortestTime": set.measurements[shortestIndex],
                         "shortestTarget": targetUrls[shortestIndex],
                         "average": totalTime/len(targetUrls)}
      path = os.path.join(os.path.dirname(__file__), "steps.html")
      self.response.out.write(template.render(path, template_values))
      logging.debug("Set with key " + str(set.key()) + " has returned")
    except DeadlineExceededError:
      logging.debug("Set with key " + str(set.key())
                    + " interrupted during iteration "+ str(set.iteration)
                    + " with " + str(len(set.measurements)) + " URLs retrieved")
      self.redirect("/steps?key=" + str(set.key()))
      logging.debug("Set with key " + str(set.key()) + " sent redirection")

def main():
  application = webapp.WSGIApplication([("/steps", MainHandler)], debug=True)
  wsgiref.handlers.CGIHandler().run(application)

if __name__ == "__main__":
  main()

I can’t guarantee I will keep it available, but at the time of this writing the application is deployed here if you want to give it a spin. A typical run produces this kind of log:

06-13 01:44AM 36.814 /steps
XX.XX.XX.XX - - [13/06/2008:01:44:45 -0700] "GET /steps HTTP/1.1" 302 0 - -
  D 06-13 01:44AM 36.847
    Starting new set, with key agN2YnByFAsSDk1lYXN1cmVtZW50U2V0GBoM
  D 06-13 01:44AM 37.870
    http://mirror.anl.gov/pub/ubuntu-iso/CDs/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.022078037262 sec
  D 06-13 01:44AM 38.913
    http://ubuntu.mirror.ac.za/ubuntu-release/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0184168815613 sec
  D 06-13 01:44AM 39.962
    http://mirrors.cytanet.com.cy/linux/ubuntu/releases/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0166189670563 sec
  D 06-13 01:44AM 41.12
    http://ftp.kaist.ac.kr/pub/ubuntu-cd/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0205371379852 sec
  D 06-13 01:44AM 42.103
    http://ftp.itu.edu.tr/Mirror/Ubuntu/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0197179317474 sec
  D 06-13 01:44AM 43.146
    http://ftp.belnet.be/mirror/ubuntu.com/releases/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0171189308167 sec
  D 06-13 01:44AM 44.215
    http://ubuntu-releases.sh.cvut.cz/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0160200595856 sec
  D 06-13 01:44AM 45.256
    http://ftp.crihan.fr/releases/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.015625 sec
  D 06-13 01:44AM 45.805
    Set with key agN2YnByFAsSDk1lYXN1cmVtZW50U2V0GBoM interrupted during iteration 1 with 8 URLs retrieved
  D 06-13 01:44AM 45.806
    Set with key agN2YnByFAsSDk1lYXN1cmVtZW50U2V0GBoM sent redirection
  W 06-13 01:44AM 45.808
    This request used a high amount of CPU, and was roughly 28.5 times over the average request CPU limit.
    High CPU requests have a small quota, and if you exceed this quota, your app will be temporarily disabled.

Followed by:

06-13 01:44AM 46.72 /steps?key=agN2YnByFAsSDk1lYXN1cmVtZW50U2V0GBoM
XX.XX.XX.XX - - [13/06/2008:01:44:50 -0700] "GET /steps?key=agN2YnByFAsSDk1lYXN1cmVtZW50U2V0GBoM HTTP/1.1" 200 472
  D 06-13 01:44AM 46.110
    Resuming existing set, with key agN2YnByFAsSDk1lYXN1cmVtZW50U2V0GBoM
  D 06-13 01:44AM 47.128
    http://ftp.uni-kl.de/pub/linux/ubuntu.iso/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.016991853714 sec
  D 06-13 01:44AM 48.177
    http://ftp.duth.gr/pub/ubuntu-releases/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0238039493561 sec
  D 06-13 01:44AM 49.318
    http://no.releases.ubuntu.com/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0177929401398 sec
  D 06-13 01:44AM 50.378
    http://neacm.fe.up.pt/pub/ubuntu-releases/hardy/ubuntu-8.04-server-i386.jigdo dereferenced in 0.0226020812988 sec
  D 06-13 01:44AM 50.410
    Set with key agN2YnByFAsSDk1lYXN1cmVtZW50U2V0GBoM has returned
  W 06-13 01:44AM 50.413
  This request used a high amount of CPU, and was roughly 13.4 times over the average request CPU limit.
  High CPU requests have a small quota, and if you exceed this quota, your app will be temporarily disabled.

I believe we can optimize the performance by taking advantage of the fact that successive requests are likely (but not guaranteed) to hit the same instance, allowing global variables to be re-used rather than always going to the datastore. My code is a proof of concept, not an optimized implementation.

Of course, the alternative is to drive things from the client, using JavaScript HTTP requests (rather than HTTP redirect) to repeat the HTTP request until the work has been completed. The list of pros and cons of each approach is left as an exercise to the reader.

[UPDATED 2008/6/13: Added log output. Removed handling of “OverQuotaError” which was not useful since, unlike “DeadlineExceededError”, quotas are not per-request. As a result, splitting the work over multiple requests doesn’t help. Slowing down a request might help, at which point the approach above might come in handy to prevent this slowdown from triggering “DeadlineExceededError”.]

[UPDATED 2008/6/30: Steve Jones provides an interesting analysis of the cut-off time for GAE. Confirms that it’s mainly based on wall-clock time rather than CPU time. And that you can sometimes go just over 9 seconds but never up to 10 seconds, which is consistent with my (much less detailed and rigorous) observations.]

2 Comments

Filed under Everything, Google, Google App Engine, Implementation, Utility computing

2 Responses to Some breathing room for Google App Engine requests

  1. Very nice! Nice usage of the data store to keep track. I had a much simpler/lazier plan in mind using just query strings, but the data store is probably safer and more extensible for additional state. Too bad GAE doesn’t have a queueing system like AWS does.

  2. Pingback: William Vambenepe — Nice incremental progress in Google App Engine SDK 1.4