Category Archives: Brain teaser

June 6, 2008

Emulating a long-running process (and a scheduler) in Google App Engine

As previously described, Google App Engine (GAE) doesn’t support long running processes. Each process lives in the context of an HTTP request handler and needs to complete within a few seconds. If you’re trying to get extra CPU cycles for some task then Amazon EC2, not GAE, is the right tool (including the option to get high-CPU instances for the CPU-intensive tasks).

More surprising is the fact that GAE doesn’t offer a scheduler. Your app can only get invoked when someone sends it an HTTP request and you can’t ask GAE to generate a canned request every so often (crontab-style). That seems both limiting and arbitrary. In fact, I would be surprised if GAE didn’t soon add support for this.

In the meantime, your best bet is to get an account on a separate server that lets you schedule jobs, at which point you can drive your GAE application from that external scheduler (through HTTP requests to your GAE app). But just for the intellectual exercise, how would one meet the need while staying entirely within the confines of the Google-provided infrastructure?

The most obvious option is to piggyback on HTTP requests from your visitors. But:
- this assumes that you consistently get visitors at a frequency greater than your scheduler’s interval,
- since you can’t launch sub-processes in GAE, this delays your responses to the visitor,
- more worrisome, if your scheduled task takes more than a few seconds this means your application might be interrupted by GAE before you respond to the visitor, resulting in a failed request from their perspective.
You can try to improve a bit on this by doing this processing not as part of the main request from your visitor but rather by putting in the response HTML some JavaScript that will asynchronously send you HTTP requests in the background (typically not visible to the user). This way, a given visitor will give you repeated invocations for as long as the page is open in the browser. And you can set the invocation interval. You can even create some kind of server-controlled auto-modulation of the interval (increasing it as your number of concurrent visitors increases) so that you don’t eat all your Google-allocated incoming HTTP quota with these XMLHttpRequest invocations. This would probably be a very workable way to do it in practice even though:
- it only works if your application has visitors who use web browsers, not if it only consumed by programs (e.g. through RSS feeds or other XML format),
- it puts the burden on your visitors who may or may not appreciate it, assuming they realize it is happening (how would you feel if your real estate agent had to borrow your cell phone to arrange home visits for you and their other customers?).
While GAE doesn’t offer a scheduler, another Google service, Google Reader, offers one of sorts. If you register a feed there, Google’s FeedReader will retrieve it once a while (based on my logs, it happens approximately every hour for each of the two feeds for this blog). You can create multiple URLs that all map to the same handler and return some dummy RSS. If you register these feeds with Google Reader, they’ll get pulled once a while. Of course there is no guarantee that the pulling of the different feeds will be nicely spread out, but if you register enough of them you should manage to get invoked with a frequency compatible with you desired scheduler’s frequency.

That’s all nice, but it doesn’t entirely live within the GAE application. It depends on either the visitors or Google Reader. Can we do this entirely within GAE?

The idea is that since a GAE app can only executes within an HTTP request handler, which only runs for a few seconds, you can emulate a long-running process by automatically starting a successor request when the previous one is killed. This is made possible by two characteristics of the GAE runtime:

When an HTTP request is canceled on the client side, the request execution on the server is permitted to continue (until it returns or GAE kills it for having run too long).
When GAE kills a request for having run too long, it does it through an exception that you have a chance to handle (at least for a few seconds, until you get killed for good), which is when you initiate the HTTP request that spawns the successor process.

If you’ve watched (or played) Rugby, this is equivalent to passing the ball to a teammate during that short interval between when you’re tackled and when you hit the ground (I have no idea whether the analogy also applies to Rugby’s weird cousin called American Football).

In practice, all you have to do is structure your long running task like this:

class StartHandler(webapp.RequestHandler):
  def get(self):
    if (StopExec.all().count() == 0):
      try:
        id = int(self.request.get("id"))
        logging.debug("Request " + str(id) + " is starting its work.")
        # This is where you do your work
      finally:
        logging.debug("Request " + str(id) + " has been stopped.")
        # Save state to the datastore as needed
        logging.debug("Launching successor request with id=" + str(id+1))
        res = urlfetch.fetch("http://myGaeApp.appspot.com/start?id=" + str(id+1))

Once you have deployed this app, just point your browser to http://myGaeApp.appspot.com/start?id=0 (assuming of course that your GAE app is called “myGaeApp”) and the long-running process is started. You can hit the “stop” button on your browser and turn off your computer, the process (or more exactly the succession of processes) has a life of its own entirely within the GAE infrastructure.

The “if (StopExec.all().count() == 0)” statement is my way of keeping control over the beast (if only Dr. Frankenstein had as much foresight). StopExec is an entity type in the datastore for my app. If I want to kill this self-replicating process, I just need to create an entity of this type and the process will stop replicating. Without this, the only way to stop it would be to delete the whole application through the GAE dashboard. In general, using the datastore as shared memory is the way to communicate with this emulation of a long-running process.

A scheduler is an obvious example of a long-running process that could be implemented that way. But there are other examples. The only constraint is that your long-running process should expect to be interrupted (approximately every 9 seconds based on what I have seen so far). It will then re-start as part of a new instance of the same request handler class. You can communicate state between one instance and its successor either via the request parameters (like the “id” integer that I pass in the URL) or by writing to the datastore (in the “finally” clause) and reading from it (at the beginning of your task execution).

By the way, you can’t really test such a system using the toolkit Google provides for local testing, because that toolkit behaves very differently from the real GAE infrastructure in the way it controls long-running processes. You have to run it in the real GAE environment.

Does it work? For a while. The first time I launched it, it worked for almost 30 minutes (that’s a lot of 9 second-long processes). But I started to notice these worrisome warnings in the logs: “This request used a high amount of CPU, and was roughly 21.7 times over the average request CPU limit. High CPU requests have a small quota, and if you exceed this quota, your app will be temporarily disabled.”

And indeed, after 30 minutes of happiness my app was disabled for a bit.

My quota figures on the dashboard actually looked pretty good. This was not a very busy application.

CPU Used 175.81 of 199608.00 Gigacycles (0%)
Data Sent 0.00 of 2048.00 Megabytes (0%)
Data Received 0.00 of 2048.00 Megabytes (0%)
Emails Sent 0.00 of 2000.00 Emails (0%)
Megabytes Stored 0.05 of 500.00 Megabytes (0%)

But the warning in the logs points to some other restriction. Google doesn’t mind if you use a given number of CPU cycles through a lot of small requests, but it complains if you use the same number of cycles through a few longer requests. Which is not really captured in the “understanding application quotas” page. I also question whether my long requests actually consume more CPU than normal (shorter) requests. I stripped the application down to the point where the “this is where you do your work” part was doing nothing. The only actual work, in the “finally” clause, was to opens an HTTP connection and wait for it to return (which never happens) until the GAE runtime kills the request completely. Hard to see how this would actually use much CPU. Yet, same warning. The warning text is probably not very reflective of the actual algorithm that flags my request as a hog.

What this means is that no matter how small and slim the task is, the last line (with the urlfetch.fetch() call) by itself is enough to get my request identified as a hog. Which means that eventually the app is going to get disabled. Which is silly really because by that the time fetch() gets called nothing useful is happening in this request (the work has transitioned to the successor request) and I’d be happy to have it killed as soon as the successor has been spawned. But GAE doesn’t give you a way to set client-side timeout on outgoing HTTP requests. Neither can you configure the GAE cop to kill you early so that you don’t enter the territory of “this request used a high amount of CPU”.

I am pretty confident that the ability to set client-side HTTP timeout will be added to the urlfetch API. Even Google’s documentation acknowledges this limitation: “Note: Since your application must respond to the user’s request within several seconds, a URL fetch action to a slow remote server may cause your application to return a server error to the user. There is currently no way to specify a time limit to the URL fetch action.” Of course, by the time they fix this they may also have added a real scheduler…

In the meantime, this was a fun exploration of the GAE environment. It makes it clear to me that this environment is still a toy. But a very interesting and promising one.

[UPDATED 2009/28: Looks like a real GAE scheduler is coming.]

15 Comments

Filed under Brain teaser, Everything, Google, Google App Engine, Implementation, Testing, Utility computing

June 4, 2008

Small brain teaser: my work phone number

My work phone number is a typical US 10 digits number. In addition:

a) My office is in the same area code as Stanford University.
b) The area code appears twice in my phone number
c) The number of the beast doesn’t appear in my phone number.
d) An even number can only be in an even-numbered position if the value of that number is also its position (the leftmost digit is in position 1; 0 is an even number).
e) The number of occurrences of non-zero numbers is always less than the value of the number.
f) The answer uses as few different numbers as possible to meet all these conditions. For example, if 123-123-1231 and 123-412-3412 both met all the constraints above (which they obviously don’t), then the answer would be 123-123-1231 because it only uses the numbers 1, 2 and 3, while the other uses an additional number (4).

Asking your Oracle-employed brother-in-law to look me up in the employee phone book is considered cheating…

[UPDATED 2009/1/23: Clarified last bullet with an example, based on reader feedback.]

[UPDATED 2012/10/1: I have now left Oracle, so this is not my number anymore. You can still try to solve the puzzle and email me the answer for verification.]

Comments Off on Small brain teaser: my work phone number

Filed under Brain teaser, Everything, Off-topic

March 26, 2008

XPath brain teasers: graph queries in XPath 1.0

Consider this piece of XML, in which the <g> elements represent groups that the people are part of (groups can have several members and people can be members of several groups). For example, “paul” is a member of groups 2, 3 and 4.

<doc>
  <person name="alan"><g>1</g><g>2</g><g>4</g></person>
  <person name="marc"><g>1</g><g>2</g><g>3</g></person>
  <person name="paul"><g>2</g><g>3</g><g>4</g></person>
  <person name="ivan"><g>2</g><g>4</g></person>
  <person name="eric"><g>4</g></person>
</doc>

This is essentially a graph structure, represented as a tree because of the constraints of XML.

Using a graph query language like SPARQL, answering questions such as “which groups contain alan, paul and ivan” would be trivial. In SPARQL that would be something like:

SELECT ?group
WHERE {
  [ ns:hasName "alan" ] ns:partOf ?group .
  [ ns:hasName "paul" ] ns:partOf ?group .
  [ ns:hasName "ivan" ] ns:partOf ?group . }

In the CMDBf query language, another graph query language, it would be more verbose but just as straightforward to express:

<query>
  <itemTemplate id="alan">
    <recordConstraint>
      <propertyValue namespace="http://example.com/people" localName="name">
        <equal>alan</equal>
      </propertyValue>
    </recordConstraint>
  </itemTemplate>
  <itemTemplate id="paul">
    <recordConstraint>
      <propertyValue namespace="http://example.com/people" localName="name">
        <equal>paul</equal>
      </propertyValue>
    </recordConstraint>
  </itemTemplate>
  <itemTemplate id="ivan">
    <recordConstraint>
      <propertyValue namespace="http://example.com/people" localName="name">
        <equal>ivan</equal>
      </propertyValue>
    </recordConstraint>
  </itemTemplate>
  <itemTemplate id="group"/>
  <relationshipTemplate id="alan-in-group">
    <recordConstraint>
      <recordType namespace="http://example.com/people" localName="partOf"/>
    </recordConstraint>
    <sourceTemplate ref="alan"/>
    <targetTemplate ref="group"/>
  </relationshipTemplate>
  <relationshipTemplate id="paul-in-group">
    <recordConstraint>
      <recordType namespace="http://example.com/people" localName="partOf"/>
    </recordConstraint>
    <sourceTemplate ref="paul"/>
    <targetTemplate ref="group"/>
  </relationshipTemplate>
  <relationshipTemplate id="ivan-in-group">
    <recordConstraint>
      <recordType namespace="http://example.com/people" localName="partOf"/>
    </recordConstraint>
    <sourceTemplate ref="ivan"/>
    <targetTemplate ref="group"/>
  </relationshipTemplate>
</query>

But using the right tool for the job is just no fun. How can we answer this question using XPath 1.0? Your first response might be “this is the wrong XML format”. And yes, we could switch things around and make people children of groups rather than the contrary, as in:

<invertedDoc>
  <group number="1"><p>alan</p><p>marc</p></group>
  <group number="2"><p>alan</p><p>marc</p><p>paul</p></group>
  <group number="3"><p>marc</p><p>paul</p></group>
  <group number="4"><p>alan</p><p>paul</p><p>ivan</p><p>eric</p></group>
</invertedDoc>

That would make the “is there a group that contains alan, paul and ivan” question very easy to answer in XPath 1.0, but then I would ask you “which persons are part of groups 1, 2 and 4” and you’d be back to the same problem. You won’t get off the hook that easily.

So, XPath brain teaser #1 is: how to answer “which groups contain alan, paul and ivan” using XPath 1.0 on the first XML document (<doc>, not <invertedDoc>)?

The answer is:

/doc/person/g[../@name="alan" and text()=/doc/person/g[../@name="paul"
  and text()=/doc/person/g[../@name="ivan"]]]

Which returns:

<g>2</g>
<g>4</g>

It doesn’t look like much, but go through it carefully and you’ll see that we have somewhat of a recursive loop (as close as XPath can get to recursion). With these loops, we go through the entire document n^m times, where n is the number of <people> elements and m is the number of names that we need to look for in each group (3 in the present case: alan, paul an ivan). In our simple example, that’s 5^3=125. Not very efficient for a query that could, with the right language, be answered in one pass through the document (I am assuming a basic XPath engine, not one that may be able pre-analyze the query and optimize its execution).

Which takes us to XPath brain teaser #2: can you find a way to answer that same question with fewer passes through the doc?

There is an answer, but it requires the document to adopt a convention to make all group IDs multiples of 10. 1 stays 1, 2 becomes 10, 3 becomes 100, etc.

The document that we are querying against now looks like this:

<?xml version="1.0" encoding="iso-8859-1"?>
<doc>
  <person name="alan"><g>1</g><g>10</g><g>1000</g></person>
  <person name="marc"><g>1</g><g>10</g><g>100</g></person>
  <person name="paul"><g>10</g><g>100</g><g>1000</g></person>
  <person name="ivan"><g>10</g><g>1000</g></person>
  <person name="eric"><g>1000</g></person>
</doc>

On this document, the following XPath:

sum(((/doc/person[@name="alan"]) | (/doc/person[@name="paul"])
  | (/doc/person[@name="ivan"]) )/g)

returns: 3131

Which is the answer to our question. It doesn’t look like it? Well, here is the key to decode this answer: every “3” digit that appears in this number represents a group that contains all three required members (alan, paul and ivan). In this example, we have a “3” in the “thousands” position (so group 1000 qualifies) and a “3” in the “tens” position (so group 10 qualifies).

How do we get the 3131 result? In that XPath statement, the processor simply picks out the <person> elements that correspond to alan, paul and ivan. Then it simply adds up the value of all the <g> elements contained in all these selected <person> elements. And that’s our 3131.

The transformation of group values from n to 10^(n-1) is what allows us to turn a recursive loop into a simple addition of group values. Each column in the running sum keeps track of the number of people who are in the group that corresponds to that column (the “units” column corresponds to group 1, the “tens” column corresponds to group 10, the “hundreds” column corresponds to group 100, etc). This is why we had to turn the group IDs to multiples of 10.

Does this approach meet our goal of requiring fewer passes through the document than the XPath that is the solution to brain teaser #1? Yes, because we only scan the content of the <people> elements we are interested in (and we only scan each of them once). We don’t care how many groups there are. So we go from n^m passes through the entire document to m passes (one for each <person> element that we need to locate). In our example, it means 125 versus 3.

One potential gotcha is that we are assuming that a given group only appears once inside a given <person> element. Which seems logical. But what if the maintainer of the document is sloppy and we suspect that he may sometimes add a group inside a <person> element without first checking whether that <person> element already contains that group? We can protect ourselves against this by filtering out the redundant <g> elements inside a <person>. To do so, we replace replace:

sum(((/doc/person[@name="alan"]) | (/doc/person[@name="paul"])
  | (/doc/person[@name="ivan"]) )/g)

with:

sum(((/doc/person[@name="alan"]) | (/doc/person[@name="paul"])
  | (/doc/person[@name="ivan"]) )/g[not(text()=preceding-sibling::g)])

The [not(text()=preceding-sibling::g)] part removes <g> elements that have a preceding sibling with the same value. At little processing cost.

If you don’t like the looks of this “3131” result, you can add a simple transformation into the XPath to turn it into 1010, which can be interpreted as the sum of the numbers corresponding to all the groups that satisfy our request (again, groups 1000 and 10 in this case):

translate(sum(((/doc/person[@name="alan"]) | (/doc/person[@name="paul"])
  | (/doc/person[@name="ivan"]) )/g), "123456789", "001000000")

Returns: 1010.

If you are still not satisfied, we can actually extract the <g> elements (basically the same result as in the XPath statement that corresponds to brain teaser #1), but at the cost of a bit more work for the XPath processor: instead of calculating the 3131 result once, you do it once for each group that alan is a member of (why alan? it doesn’t matter, pick paul or ivan if you want). The corresponding XPath is:

/doc/person[@name="alan"]/g[floor(sum(((/doc/person[@name="alan"])
  | (/doc/person[@name="paul"])
  | (/doc/person[@name="ivan"]) )/g) div text()) mod 10 = 3]

Which returns:

<g>10</g>
<g>1000</g>

And here too, if you are concerned that the same group may appear more than once inside the <person name=”alan”> element and you don’t want that to appear in the result, you can remove the <g> elements that have a preceding sibling with the same value (you have to remove them twice, once in the sum calculation and once in the selection of the <g> elements for display, which is why [not(text()=preceding-sibling::g)] appears twice below):

/doc/person[@name="alan"]/g[floor(sum(((/doc/person[@name="alan"])
  | (/doc/person[@name="paul"])
  | (/doc/person[@name="ivan"]))/g[not(text()=preceding-sibling::g)])
  div text()) mod 10 = 3][not(text()=preceding-sibling::g)]

BTW, a practical advantage of presenting the result as a set of element nodes rather than as a number is that many interactive XPath engines (including many on-line ones as well as JDeveloper 10.1.3.2) aren’t happy with resulting nodesets in which the nodes are not element nodes. Of course XPath APIs don’t have that problem.

We have already acknowledged one limitation of our approach, the need to transform the XML doc (by turning “2” into “10”, “3” into “100”, etc). Now comes XPath brain teaser 3: what are the other limitations of this approach?

The first one is obvious (and doesn’t’ have much to do with XPath per se): what happens when there is a carry-over in the computation of the sum() function? Bad stuff is the answer. Basically, we can’t have this. Which means that since our calculations take place in base 10 (the only one XPath supports) we are limited to a maximum number of 9 persons in a group. We can look for groups that contain alan, paul and ivan, but not for those that contain all 15 members of a rugby team.

The second limitation requires a bit more XPath wonkery. Or rather IEEE 754 wonkery since numbers in XPath are defined as using the IEEE 754 double-precision (64-bit) format. Which has a 52 bits mantissa. The format normalizes the mantissa such that it only has one significant bit before the decimal. And since that bit can only be “1” it is ignored in the representation, which means we actually get 53 bits worth of precision. I would have thought that this would give us 16 significant digits in decimal form, but when I test this by converting 9999999999999999 into 64-bit representation I get 0100001101000001110000110111100100110111111000001000000000000000 or 4341C37937E08000 in hex which gets turned back into the decimal value 10000000000000000. Looks like we can only count on 15 digits worth of precision for a decimal integer in XPath.

What does it mean for our application? It means that we can only track 15 groups in our sum(). So if the document has more than 15 different groups we are out of luck. In the spirit of a “glass half full”, let’s count (no pun intended) ourselves lucky that XPath chose double precision (64-bit) and not single precision (32-bit)…

It would be nice if we could free ourselves of the constraint of having group IDs be multiples of 10. Maybe we can turn them into multiples of 10 as we go, by calculating 10^(n-1) whenever we hit such an ID? The first problem with this is that XPath does not have an exponentiation (^) operator. But this one is surmountable, because we don’t need a generic exponentiation operator, we just need to be able to calculate 10^n for n ranging from 0 to 14 (remember, we are limited to 15 digits of precision). We can simply seed our XPath with an enumerated result list. Sure it’s ugly, but by now it should be clear that we are far removed from any practical application anyway (practically-minded people would have long moved to another query language or at least to version 2.0 of XPath). If you’re still reading you must admit to yourself that your inner geek is intrigued by this attempt to push XPath where it was never meant to go. Our poor man exponentiation function looks like this:

substring-before(substring-after("A0:1 A1:10 A2:100 A3:1000 A4:10000
  A5:100000 A6:1000000 A7:10000000 A8:100000000 A9:1000000000
  A10:10000000000 A11:100000000000 A12:1000000000000
  A13:10000000000000 A14:100000000000000", concat("A", 12, ":")), " ")

When you execute this XPath (on whatever document), it returns: “1000000000000”. Replace the 12 with any other integer between 0 and 14 and the XPath will return 10 to the power of your integer. So in effect, we have emulated the exponentiation function for all needed values.

Unfortunately, this doesn’t take us very far. It would be tempting to plug this ad-hoc exponentiation function in our precedent XPath (at the place where we retrieve the value of the <g> element, as in:

sum(substring-before(substring-after("A1:1 A2:10 A3:100 A4:1000
  A5:10000 A6:100000 A7:1000000 A8:10000000 A9:100000000
  A10:1000000000 A11:10000000000 A12:100000000000
  A13:1000000000000 A14:10000000000000 A15:100000000000000",
  concat("A", ((/doc/person[@name="alan"])
  | (/doc/person[@name="paul"])
  | (/doc/person[@name="ivan"]) )/g, ":")), " "))

And to hope that our 3131 result pops out again. But this is not to be.

There are two problems. First, this is not valid XPath because the sum() function can only apply to a nodeset, not strings (or numbers for that matter). Second, even if sum() was more forgiving what we are sending to it is not several strings. It’s one string. That’s because the insertion of the ((/doc/person[@name=”alan”]) | (/doc/person[@name=”paul”]) | (/doc/person[@name=”ivan”]) )/g nodeset as an operand to a function that expects a string (in this case, our ad-hoc exponentiation function) doesn’t generate a set of text nodes that contain the result of running the function on all nodes in the nodeset. Rather, it generates the result of the evaluation of the function on the one string that corresponds to the string-value for the nodeset (which is the string value of its first node). Feel free to re-read this slowly.

You can’t modify nodesets in XPath, just integers and strings. Once you’ve turned your nodeset into another object, you’re out of the loop. Literally.

Sorry to end with a downer. At least I hope this entertained you, helped you better understand XPath or illuminated the difference between a graph query language and a tree query language.

[UPDATED 2008/3/27: For more XPath fun, Dare Obasanjo provides a guided walk through some tricky aspects of the XPath syntax. Unlike me, his focus is on understanding the syntax, not abusing it… ;-)]

11 Comments

Filed under Brain teaser, CMDBf, Everything, Graph query, SPARQL, XPath