William Vambenepe's blog

IT management in a changing IT world

garth brooks ringtonesfree mudvayne ringtonesdevil wears prada ringtonefree rascal flatts ringtones

Archive for the 'SPARQL' Category

20
May
2008

I have seen the future of CMDBf

by William Vambenepe

I got a sneak peak at CMDBf v2 today.

I am calling it v2 based on the assumption that the one being currently standardized in DMTF will end up being called 1.0 (because it’s the first one out of DMTF) or 1.1 (to prevent confusion with the submitted version).

At the Semantic Technology Conference, David Booth from HP presented his work (along with his partner, Steve Battle from HP Labs) to provide a SPARQL front-end to HP’s Universal CMDB (the engine under what was the Mercury MAM product). Here are the slides.

The mapping from SPARQL to TQL (the native query interface for UCMDB) was made pretty easy by the fact that TQL is a graph-oriented query language. How much harder would it be to similarly transform a CMDBf (v1) query interface into a SPARQL query interface (and vice-versa)? Not much. The only added difficulty would come from the CMDBf XPath constraints. TQL has a property value mechanism that is very similar to CMDBf’s “propertyValue” constraint and maps well to SPARQL functions. The introduction of XPath as a constraint language in CMDBf makes things harder. It could be handled by adding XPath support to the SPARQL engine using function extensibility. Or by turning the entire XML into RDF and emulating XPath in SPARQL. But in either case, you’ll have impedence mismatch at some point because concepts such as element order that exist in XPath have no native equivalent in RDF.

The use of XPath in selectors on the other hand is not a problem. HP’s prototype uses Gloze (available as a Jena package) to turn the XML returned by UCMDB into RDF. An XSLT transform could turn that same XML into a CMDBf-valid XML response instead and that XSLT could easily handle the XPath selectors from the query request. This is another reason why constraints and selectors should remain separate in CMDBf (fortunately the specification is back to doing this properly).

Here is why I call this prototype CMDBf v2: The CMDBf effort (v1 or 1.1), in its current form of re-inventing a graph query, can succeed. Let’s assume the working group strikes a reasonable balance between completeness and complexity, and vendors choose to compete on innovation and execution rather than lock-in (insert cynical comment here). CMDBf may then end up being supported by the main CMDB vendors. It wouldn’t provide federation capabilities, but having a common CMDB query interface supported by the Big Four would help with management integration. And yet, while the value would be real, it would only provide a little help to solve a larger problem:

  • As a technology limited to IT systems management, it would be unlikely to see widely available tools (e.g. user consoles and language-specific libraries).
  • It wouldn’t get the kind of robustness and interoperability that comes from wide adoption. While pretty similar, there might be some minor differences in the various implementations. Once your implementation has been tweaked to work with the implementations from the Big Four, you’ll call it done. Just like SNMP, another technology that is specific to IT systems management (see it happen here).
  • Even if it works perfectly at the query level, it will just hasten the time when developers run into the real problem, model interoperability. CMDBf doesn’t help at all with this. In fact, it makes it harder by hard-coding some dependencies on an XML back-end (the XPath constraints).

In the long run, IT management has to become more automated and integrated. That’s a given. The way it happens may or may not go through CMDB-like configuration stores. But if it does, we’ll have to eventually move beyond CMDBf (v1) towards something that addresses the three requirements above. And federation. I don’t know if it will be called CMDBf v2, and/or if it will come from the DMTF (by then, the CMDBf brand might be an asset or a liability depending on developer experience with the specification). But I strongly suspect (”probability 0.8″ as a Gartner analyst might put it) that it will use semantic technologies. Because the real, hard, underlying problem is a problem of semantic integration. In that sense, David and Steve’s prototype is a sneak peek at what will come after CMDBf v1/1.1.

Pretty much since the beginning of CMDBf I have been pushing for it to ideally embrace SPARQL (with no success) or to at least stay close to it conceptually in order to make the eventual mapping/evolution smooth (with a bit more success). This includes pushing for a topological query language, trying to keep XML idiosyncrasies at bay and keeping constraints and selectors cleanly separated. Rather than working within the CMDBf group, David took the alternative approach of simply doing it. Hopefully this will help convince people of the value of re-using semantic web technology for IT systems management. Yes semantic technologies have been designed for a much more general use case. But the use cases that CMDB systems address are a subset of the use cases addressed by semantic technologies. It’s hard for domain experts to see their domain as just a subset of a larger problem, but this is the case here. Isn’t HTTP serving the IT management community better than a systems management-specific alternative would?

By the way, there is no inferencing taking place in the HP prototype. We are just talking about re-using an existing, well though-through graph query language. Sure OWL inferencing and some rules could be seamless layered on top of this. But this is in no way required to do (better) what CMDBf v1 tries to do.

And then there is the “federation” question. Who do you trust more to deliver this? A bunch of IT system management architects in DMTF or the web and query experts at W3C, HP Labs etc who designed and implemented SPARQL over many years? BTW, it sounds like SPQARL federation was discussed at WWW 2008, based on these meeting notes (search for “federation”).

26
Mar
2008

XPath brain teasers: graph queries in XPath 1.0

by William Vambenepe

Consider this piece of XML, in which the <g> elements represent groups that the people are part of (groups can have several members and people can be members of several groups). For example, “paul” is a member of groups 2, 3 and 4.

<doc>
  <person name="alan"><g>1</g><g>2</g><g>4</g></person>
  <person name="marc"><g>1</g><g>2</g><g>3</g></person>
  <person name="paul"><g>2</g><g>3</g><g>4</g></person>
  <person name="ivan"><g>2</g><g>4</g></person>
  <person name="eric"><g>4</g></person>
</doc>

This is essentially a graph structure, represented as a tree because of the constraints of XML.


Using a graph query language like SPARQL, answering questions such as “which groups contain alan, paul and ivan” would be trivial. In SPARQL that would be something like:

SELECT ?group
WHERE {
  [ ns:hasName "alan" ] ns:partOf ?group .
  [ ns:hasName "paul" ] ns:partOf ?group .
  [ ns:hasName "ivan" ] ns:partOf ?group . }

In the CMDBf query language, another graph query language, it would be more verbose but just as straightforward to express:

<query>
  <itemTemplate id="alan">
    <recordConstraint>
      <propertyValue namespace="http://example.com/people" localName="name">
        <equal>alan</equal>
      </propertyValue>
    </recordConstraint>
  </itemTemplate>
  <itemTemplate id="paul">
    <recordConstraint>
      <propertyValue namespace="http://example.com/people" localName="name">
        <equal>paul</equal>
      </propertyValue>
    </recordConstraint>
  </itemTemplate>
  <itemTemplate id="ivan">
    <recordConstraint>
      <propertyValue namespace="http://example.com/people" localName="name">
        <equal>ivan</equal>
      </propertyValue>
    </recordConstraint>
  </itemTemplate>
  <itemTemplate id="group"/>
  <relationshipTemplate id="alan-in-group">
    <recordConstraint>
      <recordType namespace="http://example.com/people" localName="partOf"/>
    </recordConstraint>
    <sourceTemplate ref="alan"/>
    <targetTemplate ref="group"/>
  </relationshipTemplate>
  <relationshipTemplate id="paul-in-group">
    <recordConstraint>
      <recordType namespace="http://example.com/people" localName="partOf"/>
    </recordConstraint>
    <sourceTemplate ref="paul"/>
    <targetTemplate ref="group"/>
  </relationshipTemplate>
  <relationshipTemplate id="ivan-in-group">
    <recordConstraint>
      <recordType namespace="http://example.com/people" localName="partOf"/>
    </recordConstraint>
    <sourceTemplate ref="ivan"/>
    <targetTemplate ref="group"/>
  </relationshipTemplate>
</query>

But using the right tool for the job is just no fun. How can we answer this question using XPath 1.0? Your first response might be “this is the wrong XML format”. And yes, we could switch things around and make people children of groups rather than the contrary, as in:

<invertedDoc>
  <group number="1"><p>alan</p><p>marc</p></group>
  <group number="2"><p>alan</p><p>marc</p><p>paul</p></group>
  <group number="3"><p>marc</p><p>paul</p></group>
  <group number="4"><p>alan</p><p>paul</p><p>ivan</p><p>eric</p></group>
</invertedDoc>

That would make the “is there a group that contains alan, paul and ivan” question very easy to answer in XPath 1.0, but then I would ask you “which persons are part of groups 1, 2 and 4″ and you’d be back to the same problem. You won’t get off the hook that easily.

So, XPath brain teaser #1 is: how to answer “which groups contain alan, paul and ivan” using XPath 1.0 on the first XML document (<doc>, not <invertedDoc>)?

The answer is:

/doc/person/g[../@name="alan" and text()=/doc/person/g[../@name="paul"
  and text()=/doc/person/g[../@name="ivan"]]]

Which returns:

<g>2</g>
<g>4</g>

It doesn’t look like much, but go through it carefully and you’ll see that we have somewhat of a recursive loop (as close as XPath can get to recursion). With these loops, we go through the entire document n^m times, where n is the number of <people> elements and m is the number of names that we need to look for in each group (3 in the present case: alan, paul an ivan). In our simple example, that’s 5^3=125. Not very efficient for a query that could, with the right language, be answered in one pass through the document (I am assuming a basic XPath engine, not one that may be able pre-analyze the query and optimize its execution).

Which takes us to XPath brain teaser #2: can you find a way to answer that same question with fewer passes through the doc?

There is an answer, but it requires the document to adopt a convention to make all group IDs multiples of 10. 1 stays 1, 2 becomes 10, 3 becomes 100, etc.

The document that we are querying against now looks like this:

<?xml version="1.0" encoding="iso-8859-1"?>
<doc>
  <person name="alan"><g>1</g><g>10</g><g>1000</g></person>
  <person name="marc"><g>1</g><g>10</g><g>100</g></person>
  <person name="paul"><g>10</g><g>100</g><g>1000</g></person>
  <person name="ivan"><g>10</g><g>1000</g></person>
  <person name="eric"><g>1000</g></person>
</doc>

On this document, the following XPath:

sum(((/doc/person[@name="alan"]) | (/doc/person[@name="paul"])
  | (/doc/person[@name="ivan"]) )/g)

returns: 3131

Which is the answer to our question. It doesn’t look like it? Well, here is the key to decode this answer: every “3″ digit that appears in this number represents a group that contains all three required members (alan, paul and ivan). In this example, we have a “3″ in the “thousands” position (so group 1000 qualifies) and a “3″ in the “tens” position (so group 10 qualifies).

How do we get the 3131 result? In that XPath statement, the processor simply picks out the <person> elements that correspond to alan, paul and ivan. Then it simply adds up the value of all the <g> elements contained in all these selected <person> elements. And that’s our 3131.

The transformation of group values from n to 10^(n-1) is what allows us to turn a recursive loop into a simple addition of group values. Each column in the running sum keeps track of the number of people who are in the group that corresponds to that column (the “units” column corresponds to group 1, the “tens” column corresponds to group 10, the “hundreds” column corresponds to group 100, etc). This is why we had to turn the group IDs to multiples of 10.

Does this approach meet our goal of requiring fewer passes through the document than the XPath that is the solution to brain teaser #1? Yes, because we only scan the content of the <people> elements we are interested in (and we only scan each of them once). We don’t care how many groups there are. So we go from n^m passes through the entire document to m passes (one for each <person> element that we need to locate). In our example, it means 125 versus 3.

One potential gotcha is that we are assuming that a given group only appears once inside a given <person> element. Which seems logical. But what if the maintainer of the document is sloppy and we suspect that he may sometimes add a group inside a <person> element without first checking whether that <person> element already contains that group? We can protect ourselves against this by filtering out the redundant <g> elements inside a <person>. To do so, we replace replace:

sum(((/doc/person[@name="alan"]) | (/doc/person[@name="paul"])
  | (/doc/person[@name="ivan"]) )/g)

with:

sum(((/doc/person[@name="alan"]) | (/doc/person[@name="paul"])
  | (/doc/person[@name="ivan"]) )/g[not(text()=preceding-sibling::g)])

The [not(text()=preceding-sibling::g)] part removes <g> elements that have a preceding sibling with the same value. At little processing cost.

If you don’t like the looks of this “3131″ result, you can add a simple transformation into the XPath to turn it into 1010, which can be interpreted as the sum of the numbers corresponding to all the groups that satisfy our request (again, groups 1000 and 10 in this case):

translate(sum(((/doc/person[@name="alan"]) | (/doc/person[@name="paul"])
  | (/doc/person[@name="ivan"]) )/g), "123456789", "001000000")

Returns: 1010.

If you are still not satisfied, we can actually extract the <g> elements (basically the same result as in the XPath statement that corresponds to brain teaser #1), but at the cost of a bit more work for the XPath processor: instead of calculating the 3131 result once, you do it once for each group that alan is a member of (why alan? it doesn’t matter, pick paul or ivan if you want). The corresponding XPath is:

/doc/person[@name="alan"]/g[floor(sum(((/doc/person[@name="alan"])
  | (/doc/person[@name="paul"])
  | (/doc/person[@name="ivan"]) )/g) div text()) mod 10 = 3]

Which returns:

<g>10</g>
<g>1000</g>

And here too, if you are concerned that the same group may appear more than once inside the <person name=”alan”> element and you don’t want that to appear in the result, you can remove the <g> elements that have a preceding sibling with the same value (you have to remove them twice, once in the sum calculation and once in the selection of the <g> elements for display, which is why [not(text()=preceding-sibling::g)] appears twice below):

/doc/person[@name="alan"]/g[floor(sum(((/doc/person[@name="alan"])
  | (/doc/person[@name="paul"])
  | (/doc/person[@name="ivan"]))/g[not(text()=preceding-sibling::g)])
  div text()) mod 10 = 3][not(text()=preceding-sibling::g)]

BTW, a practical advantage of presenting the result as a set of element nodes rather than as a number is that many interactive XPath engines (including many on-line ones as well as JDeveloper 10.1.3.2) aren’t happy with resulting nodesets in which the nodes are not element nodes. Of course XPath APIs don’t have that problem.

We have already acknowledged one limitation of our approach, the need to transform the XML doc (by turning “2″ into “10″, “3″ into “100″, etc). Now comes XPath brain teaser 3: what are the other limitations of this approach?

The first one is obvious (and doesn’t’ have much to do with XPath per se): what happens when there is a carry-over in the computation of the sum() function? Bad stuff is the answer. Basically, we can’t have this. Which means that since our calculations take place in base 10 (the only one XPath supports) we are limited to a maximum number of 9 persons in a group. We can look for groups that contain alan, paul and ivan, but not for those that contain all 15 members of a rugby team.

The second limitation requires a bit more XPath wonkery. Or rather IEEE 754 wonkery since numbers in XPath are defined as using the IEEE 754 double-precision (64-bit) format. Which has a 52 bits mantissa. The format normalizes the mantissa such that it only has one significant bit before the decimal. And since that bit can only be “1″ it is ignored in the representation, which means we actually get 53 bits worth of precision. I would have thought that this would give us 16 significant digits in decimal form, but when I test this by converting 9999999999999999 into 64-bit representation I get 0100001101000001110000110111100100110111111000001000000000000000 or 4341C37937E08000 in hex which gets turned back into the decimal value 10000000000000000. Looks like we can only count on 15 digits worth of precision for a decimal integer in XPath.

What does it mean for our application? It means that we can only track 15 groups in our sum(). So if the document has more than 15 different groups we are out of luck. In the spirit of a “glass half full”, let’s count (no pun intended) ourselves lucky that XPath chose double precision (64-bit) and not single precision (32-bit)…

It would be nice if we could free ourselves of the constraint of having group IDs be multiples of 10. Maybe we can turn them into multiples of 10 as we go, by calculating 10^(n-1) whenever we hit such an ID? The first problem with this is that XPath does not have an exponentiation (^) operator. But this one is surmountable, because we don’t need a generic exponentiation operator, we just need to be able to calculate 10^n for n ranging from 0 to 14 (remember, we are limited to 15 digits of precision). We can simply seed our XPath with an enumerated result list. Sure it’s ugly, but by now it should be clear that we are far removed from any practical application anyway (practically-minded people would have long moved to another query language or at least to version 2.0 of XPath). If you’re still reading you must admit to yourself that your inner geek is intrigued by this attempt to push XPath where it was never meant to go. Our poor man exponentiation function looks like this:

substring-before(substring-after("A0:1 A1:10 A2:100 A3:1000 A4:10000
  A5:100000 A6:1000000 A7:10000000 A8:100000000 A9:1000000000
  A10:10000000000 A11:100000000000 A12:1000000000000
  A13:10000000000000 A14:100000000000000", concat("A", 12, ":")), " ")

When you execute this XPath (on whatever document), it returns: “1000000000000″. Replace the 12 with any other integer between 0 and 14 and the XPath will return 10 to the power of your integer. So in effect, we have emulated the exponentiation function for all needed values.

Unfortunately, this doesn’t take us very far. It would be tempting to plug this ad-hoc exponentiation function in our precedent XPath (at the place where we retrieve the value of the <g> element, as in:

sum(substring-before(substring-after("A1:1 A2:10 A3:100 A4:1000
  A5:10000 A6:100000 A7:1000000 A8:10000000 A9:100000000
  A10:1000000000 A11:10000000000 A12:100000000000
  A13:1000000000000 A14:10000000000000 A15:100000000000000",
  concat("A", ((/doc/person[@name="alan"])
  | (/doc/person[@name="paul"])
  | (/doc/person[@name="ivan"]) )/g, ":")), " "))

And to hope that our 3131 result pops out again. But this is not to be.

There are two problems. First, this is not valid XPath because the sum() function can only apply to a nodeset, not strings (or numbers for that matter). Second, even if sum() was more forgiving what we are sending to it is not several strings. It’s one string. That’s because the insertion of the ((/doc/person[@name="alan"]) | (/doc/person[@name="paul"]) | (/doc/person[@name="ivan"]) )/g nodeset as an operand to a function that expects a string (in this case, our ad-hoc exponentiation function) doesn’t generate a set of text nodes that contain the result of running the function on all nodes in the nodeset. Rather, it generates the result of the evaluation of the function on the one string that corresponds to the string-value for the nodeset (which is the string value of its first node). Feel free to re-read this slowly.

You can’t modify nodesets in XPath, just integers and strings. Once you’ve turned your nodeset into another object, you’re out of the loop. Literally.

Sorry to end with a downer. At least I hope this entertained you, helped you better understand XPath or illuminated the difference between a graph query language and a tree query language.

[UPDATED 2008/3/27: For more XPath fun, Dare Obasanjo provides a guided walk through some tricky aspects of the XPath syntax. Unlike me, his focus is on understanding the syntax, not abusing it... ;-)]

04
Mar
2008

Of graphs and trees: Kingsley Idehen to the rescue

by William Vambenepe

I just read the transcript of Jon Udell’s podcast interview of Kingsley Idehen. It’s almost two years old but it contains something that I have tried (and mostly failed) to explain for a while now, so maybe borrowing someone else’s words (and credibility) would help.

Kingsley says:

“A graph model, ideally, will allow you to explore almost all the comprehensible dimensions of the nodes in that network. So you can traverse that network in a myriad of different ways and it will give you much more flexibility than if you’re confined to a tree, in effect, the difference between XQuery and SPARQL. I always see the difference between these two things as this. If you visualize nodes on a network, SPARQL is going to get you to the right node. Your journey to what you want is facilitated by SPARQL, and then XQuery can then take you deeper into this one node, which has specific data that the graph traversal is taking you to.”

Nicely said, especially considering that this is not a prepared statement but a transcript of a (presumably) unscripted interview.

He later provides an example:

“Let’s take a microformat as an example. HCard, or an hCalendar, is a well-formed format. In a sense, it’s XML. You can locate the hCard in question, so if you had a collection of individuals who had full files on the network in the repository, it could be a graph of a social network or a group of people. Now, through that graph you could ultimately locate common interests. And eventually you may want to set up calendars but if the format of the calendar itself is well formed, with XQuery you can search a location, with XPath it’s even more specific. Here you simply want to get to a node in the content and to get a value. Because the content is well formed you can traverse within the content, but XQuery doesn’t help you find that content as effectively because in effect XQuery is really all about a hierarchical model.”

Here is one way to translate this to the IT management domain. Replace hCard with an XML-formated configuration record. Replace the graph of social relationships with a graph of IT-relevant relationships (dependency, ownership, connections, containment…). Rather than attempt to XQuery across an entire CMDB (or, even worse, an entire CMDB federation), use a graph query (ideally SPARQL) to find the items of interest and then use XPath/XQuery to drill into the content of the resulting records. The graph query language in CMDBf is an attempt to do that, but it has to constantly battle attempts to impose a tree-based view of the world.

This also helps illustrate why SPARQL is superior to the CMDBf query language. It’s not just that it’s a better graph query language, one that has received much more review and validation by people more experienced in graph theory and queries, and one that is already widely implemented. It also does something that CMDBf doesn’t attempt to do: it lets you navigate the graph based on the semantics appropriate for the task at hand (dependency relationships, governance rules, distributed performance management…), something that CMDBf cannot do. There is more to classification than simply class inheritance. I think this is what Kingsley refers to when he says “in a myriad of different ways” in the quote above.

Here is a way to summarize the larger point (that tree and graph views are complementary):

Me Tarzan, you Jena

Where Tarzan (appropriately) represents the ability to navigate trees and Jane/Jena represents the ability to navigate graphs (Jena, from HP Labs, is the leading open source RDF/OWL/SPARQL framework). As in the movie, they complement each other (to the point of saving one another’s life and falling in love, but I don’t ask quite that much of SPARQL and XQuery).

On a related topic, I recently saw some interesting news from TopQuadrant. Based on explicit requests from the majority of their customers, they have added capabilities to their TopBraid Composer product to better make use of the RDF/OWL support in the Oracle database. TopQuadrant is at the forefront of many semantic web applications and the fact that they see Oracle being heavily used by their customers is an interesting external validation.

[UPDATED 2008/03/05: more related news! The W3C RDB2RDF incubator group has started is life at W3C, chaired by my colleague Ashok Malhotra, to work on mappings between RDF/OWL and relational data.]

15
Jan
2008

SPARQL is a W3C Recommendation

by William Vambenepe

SPARQL is now a W3C Recommendation (which is how W3C calls its approved standard specifications). Congratulations to those who made it happen, including my esteemed ex-colleagues at HP Labs Bristol. Just on time for the DMTF CMDBf working group to consider it as a candidate for its query language… :-)

And just below that SPARQL announcement we see a notice that the SML working group has released a third set of working drafts (SML, SML-IF). Just on time for the DMTF to be reminded of the goodness of open access to developing standards… :-)