Consider this piece of XML, in which the <g> elements represent groups that the people are part of (groups can have several members and people can be members of several groups). For example, “paul” is a member of groups 2, 3 and 4.
<doc> <person name="alan"><g>1</g><g>2</g><g>4</g></person> <person name="marc"><g>1</g><g>2</g><g>3</g></person> <person name="paul"><g>2</g><g>3</g><g>4</g></person> <person name="ivan"><g>2</g><g>4</g></person> <person name="eric"><g>4</g></person> </doc>
This is essentially a graph structure, represented as a tree because of the constraints of XML.
Using a graph query language like SPARQL, answering questions such as “which groups contain alan, paul and ivan” would be trivial. In SPARQL that would be something like:
SELECT ?group WHERE { [ ns:hasName "alan" ] ns:partOf ?group . [ ns:hasName "paul" ] ns:partOf ?group . [ ns:hasName "ivan" ] ns:partOf ?group . }
In the CMDBf query language, another graph query language, it would be more verbose but just as straightforward to express:
<query> <itemTemplate id="alan"> <recordConstraint> <propertyValue namespace="http://example.com/people" localName="name"> <equal>alan</equal> </propertyValue> </recordConstraint> </itemTemplate> <itemTemplate id="paul"> <recordConstraint> <propertyValue namespace="http://example.com/people" localName="name"> <equal>paul</equal> </propertyValue> </recordConstraint> </itemTemplate> <itemTemplate id="ivan"> <recordConstraint> <propertyValue namespace="http://example.com/people" localName="name"> <equal>ivan</equal> </propertyValue> </recordConstraint> </itemTemplate> <itemTemplate id="group"/> <relationshipTemplate id="alan-in-group"> <recordConstraint> <recordType namespace="http://example.com/people" localName="partOf"/> </recordConstraint> <sourceTemplate ref="alan"/> <targetTemplate ref="group"/> </relationshipTemplate> <relationshipTemplate id="paul-in-group"> <recordConstraint> <recordType namespace="http://example.com/people" localName="partOf"/> </recordConstraint> <sourceTemplate ref="paul"/> <targetTemplate ref="group"/> </relationshipTemplate> <relationshipTemplate id="ivan-in-group"> <recordConstraint> <recordType namespace="http://example.com/people" localName="partOf"/> </recordConstraint> <sourceTemplate ref="ivan"/> <targetTemplate ref="group"/> </relationshipTemplate> </query>
But using the right tool for the job is just no fun. How can we answer this question using XPath 1.0? Your first response might be “this is the wrong XML format”. And yes, we could switch things around and make people children of groups rather than the contrary, as in:
<invertedDoc> <group number="1"><p>alan</p><p>marc</p></group> <group number="2"><p>alan</p><p>marc</p><p>paul</p></group> <group number="3"><p>marc</p><p>paul</p></group> <group number="4"><p>alan</p><p>paul</p><p>ivan</p><p>eric</p></group> </invertedDoc>
That would make the “is there a group that contains alan, paul and ivan” question very easy to answer in XPath 1.0, but then I would ask you “which persons are part of groups 1, 2 and 4” and you’d be back to the same problem. You won’t get off the hook that easily.
So, XPath brain teaser #1 is: how to answer “which groups contain alan, paul and ivan” using XPath 1.0 on the first XML document (<doc>, not <invertedDoc>)?
The answer is:
/doc/person/g[../@name="alan" and text()=/doc/person/g[../@name="paul" and text()=/doc/person/g[../@name="ivan"]]]
Which returns:
<g>2</g> <g>4</g>
It doesn’t look like much, but go through it carefully and you’ll see that we have somewhat of a recursive loop (as close as XPath can get to recursion). With these loops, we go through the entire document n^m times, where n is the number of <people> elements and m is the number of names that we need to look for in each group (3 in the present case: alan, paul an ivan). In our simple example, that’s 5^3=125. Not very efficient for a query that could, with the right language, be answered in one pass through the document (I am assuming a basic XPath engine, not one that may be able pre-analyze the query and optimize its execution).
Which takes us to XPath brain teaser #2: can you find a way to answer that same question with fewer passes through the doc?
There is an answer, but it requires the document to adopt a convention to make all group IDs multiples of 10. 1 stays 1, 2 becomes 10, 3 becomes 100, etc.
The document that we are querying against now looks like this:
<?xml version="1.0" encoding="iso-8859-1"?> <doc> <person name="alan"><g>1</g><g>10</g><g>1000</g></person> <person name="marc"><g>1</g><g>10</g><g>100</g></person> <person name="paul"><g>10</g><g>100</g><g>1000</g></person> <person name="ivan"><g>10</g><g>1000</g></person> <person name="eric"><g>1000</g></person> </doc>
On this document, the following XPath:
sum(((/doc/person[@name="alan"]) | (/doc/person[@name="paul"]) | (/doc/person[@name="ivan"]) )/g)
returns: 3131
Which is the answer to our question. It doesn’t look like it? Well, here is the key to decode this answer: every “3” digit that appears in this number represents a group that contains all three required members (alan, paul and ivan). In this example, we have a “3” in the “thousands” position (so group 1000 qualifies) and a “3” in the “tens” position (so group 10 qualifies).
How do we get the 3131 result? In that XPath statement, the processor simply picks out the <person> elements that correspond to alan, paul and ivan. Then it simply adds up the value of all the <g> elements contained in all these selected <person> elements. And that’s our 3131.
The transformation of group values from n to 10^(n-1) is what allows us to turn a recursive loop into a simple addition of group values. Each column in the running sum keeps track of the number of people who are in the group that corresponds to that column (the “units” column corresponds to group 1, the “tens” column corresponds to group 10, the “hundreds” column corresponds to group 100, etc). This is why we had to turn the group IDs to multiples of 10.
Does this approach meet our goal of requiring fewer passes through the document than the XPath that is the solution to brain teaser #1? Yes, because we only scan the content of the <people> elements we are interested in (and we only scan each of them once). We don’t care how many groups there are. So we go from n^m passes through the entire document to m passes (one for each <person> element that we need to locate). In our example, it means 125 versus 3.
One potential gotcha is that we are assuming that a given group only appears once inside a given <person> element. Which seems logical. But what if the maintainer of the document is sloppy and we suspect that he may sometimes add a group inside a <person> element without first checking whether that <person> element already contains that group? We can protect ourselves against this by filtering out the redundant <g> elements inside a <person>. To do so, we replace replace:
sum(((/doc/person[@name="alan"]) | (/doc/person[@name="paul"]) | (/doc/person[@name="ivan"]) )/g)
with:
sum(((/doc/person[@name="alan"]) | (/doc/person[@name="paul"]) | (/doc/person[@name="ivan"]) )/g[not(text()=preceding-sibling::g)])
The [not(text()=preceding-sibling::g)] part removes <g> elements that have a preceding sibling with the same value. At little processing cost.
If you don’t like the looks of this “3131” result, you can add a simple transformation into the XPath to turn it into 1010, which can be interpreted as the sum of the numbers corresponding to all the groups that satisfy our request (again, groups 1000 and 10 in this case):
translate(sum(((/doc/person[@name="alan"]) | (/doc/person[@name="paul"]) | (/doc/person[@name="ivan"]) )/g), "123456789", "001000000")
Returns: 1010.
If you are still not satisfied, we can actually extract the <g> elements (basically the same result as in the XPath statement that corresponds to brain teaser #1), but at the cost of a bit more work for the XPath processor: instead of calculating the 3131 result once, you do it once for each group that alan is a member of (why alan? it doesn’t matter, pick paul or ivan if you want). The corresponding XPath is:
/doc/person[@name="alan"]/g[floor(sum(((/doc/person[@name="alan"]) | (/doc/person[@name="paul"]) | (/doc/person[@name="ivan"]) )/g) div text()) mod 10 = 3]
Which returns:
<g>10</g> <g>1000</g>
And here too, if you are concerned that the same group may appear more than once inside the <person name=”alan”> element and you don’t want that to appear in the result, you can remove the <g> elements that have a preceding sibling with the same value (you have to remove them twice, once in the sum calculation and once in the selection of the <g> elements for display, which is why [not(text()=preceding-sibling::g)] appears twice below):
/doc/person[@name="alan"]/g[floor(sum(((/doc/person[@name="alan"]) | (/doc/person[@name="paul"]) | (/doc/person[@name="ivan"]))/g[not(text()=preceding-sibling::g)]) div text()) mod 10 = 3][not(text()=preceding-sibling::g)]
BTW, a practical advantage of presenting the result as a set of element nodes rather than as a number is that many interactive XPath engines (including many on-line ones as well as JDeveloper 10.1.3.2) aren’t happy with resulting nodesets in which the nodes are not element nodes. Of course XPath APIs don’t have that problem.
We have already acknowledged one limitation of our approach, the need to transform the XML doc (by turning “2” into “10”, “3” into “100”, etc). Now comes XPath brain teaser 3: what are the other limitations of this approach?
The first one is obvious (and doesn’t’ have much to do with XPath per se): what happens when there is a carry-over in the computation of the sum() function? Bad stuff is the answer. Basically, we can’t have this. Which means that since our calculations take place in base 10 (the only one XPath supports) we are limited to a maximum number of 9 persons in a group. We can look for groups that contain alan, paul and ivan, but not for those that contain all 15 members of a rugby team.
The second limitation requires a bit more XPath wonkery. Or rather IEEE 754 wonkery since numbers in XPath are defined as using the IEEE 754 double-precision (64-bit) format. Which has a 52 bits mantissa. The format normalizes the mantissa such that it only has one significant bit before the decimal. And since that bit can only be “1” it is ignored in the representation, which means we actually get 53 bits worth of precision. I would have thought that this would give us 16 significant digits in decimal form, but when I test this by converting 9999999999999999 into 64-bit representation I get 0100001101000001110000110111100100110111111000001000000000000000 or 4341C37937E08000 in hex which gets turned back into the decimal value 10000000000000000. Looks like we can only count on 15 digits worth of precision for a decimal integer in XPath.
What does it mean for our application? It means that we can only track 15 groups in our sum(). So if the document has more than 15 different groups we are out of luck. In the spirit of a “glass half full”, let’s count (no pun intended) ourselves lucky that XPath chose double precision (64-bit) and not single precision (32-bit)…
It would be nice if we could free ourselves of the constraint of having group IDs be multiples of 10. Maybe we can turn them into multiples of 10 as we go, by calculating 10^(n-1) whenever we hit such an ID? The first problem with this is that XPath does not have an exponentiation (^) operator. But this one is surmountable, because we don’t need a generic exponentiation operator, we just need to be able to calculate 10^n for n ranging from 0 to 14 (remember, we are limited to 15 digits of precision). We can simply seed our XPath with an enumerated result list. Sure it’s ugly, but by now it should be clear that we are far removed from any practical application anyway (practically-minded people would have long moved to another query language or at least to version 2.0 of XPath). If you’re still reading you must admit to yourself that your inner geek is intrigued by this attempt to push XPath where it was never meant to go. Our poor man exponentiation function looks like this:
substring-before(substring-after("A0:1 A1:10 A2:100 A3:1000 A4:10000 A5:100000 A6:1000000 A7:10000000 A8:100000000 A9:1000000000 A10:10000000000 A11:100000000000 A12:1000000000000 A13:10000000000000 A14:100000000000000", concat("A", 12, ":")), " ")
When you execute this XPath (on whatever document), it returns: “1000000000000”. Replace the 12 with any other integer between 0 and 14 and the XPath will return 10 to the power of your integer. So in effect, we have emulated the exponentiation function for all needed values.
Unfortunately, this doesn’t take us very far. It would be tempting to plug this ad-hoc exponentiation function in our precedent XPath (at the place where we retrieve the value of the <g> element, as in:
sum(substring-before(substring-after("A1:1 A2:10 A3:100 A4:1000 A5:10000 A6:100000 A7:1000000 A8:10000000 A9:100000000 A10:1000000000 A11:10000000000 A12:100000000000 A13:1000000000000 A14:10000000000000 A15:100000000000000", concat("A", ((/doc/person[@name="alan"]) | (/doc/person[@name="paul"]) | (/doc/person[@name="ivan"]) )/g, ":")), " "))
And to hope that our 3131 result pops out again. But this is not to be.
There are two problems. First, this is not valid XPath because the sum() function can only apply to a nodeset, not strings (or numbers for that matter). Second, even if sum() was more forgiving what we are sending to it is not several strings. It’s one string. That’s because the insertion of the ((/doc/person[@name=”alan”]) | (/doc/person[@name=”paul”]) | (/doc/person[@name=”ivan”]) )/g nodeset as an operand to a function that expects a string (in this case, our ad-hoc exponentiation function) doesn’t generate a set of text nodes that contain the result of running the function on all nodes in the nodeset. Rather, it generates the result of the evaluation of the function on the one string that corresponds to the string-value for the nodeset (which is the string value of its first node). Feel free to re-read this slowly.
You can’t modify nodesets in XPath, just integers and strings. Once you’ve turned your nodeset into another object, you’re out of the loop. Literally.
Sorry to end with a downer. At least I hope this entertained you, helped you better understand XPath or illuminated the difference between a graph query language and a tree query language.
[UPDATED 2008/3/27: For more XPath fun, Dare Obasanjo provides a guided walk through some tricky aspects of the XPath syntax. Unlike me, his focus is on understanding the syntax, not abusing it… ;-)]
Pingback: William Vambenepe’s blog » Blog Archive » How not to re-use XML technologies