Category Archives: Query

Big Data in the Cloud at Google I/O

Last week was a great party for the entire Google developer family, including Google Cloud Platform. And within the Cloud Platform, Big Data processing services. Which is where my focus has been in the almost two years I’ve been at Google.

It started with a bang, when our fearless leader Urs unveiled Cloud Dataflow in the keynote. Supported by a very timely demo (streaming analytics for a World Cup game) by my colleague Eric.

After the keynote, we had three live sessions:

In “Big Data, the Cloud Way“, I gave an overview of the main large-scale data processing services on Google Cloud:

  • Cloud Pub/Sub, a newly-announced service which provides reliable, many-to-many, asynchronous messaging,
  • the aforementioned Cloud Dataflow, to implement data processing pipelines which can run either in streaming or batch mode,
  • BigQuery, an existing service for large-scale SQL-based data processing at interactive speed, and
  • support for Hadoop and Spark, making it very easy to deploy and use them “the Cloud Way”, well integrated with other storage and processing services of Google Cloud Platform.

The next day, in “The Dawn of Fast Data“, Marwa and Reuven described Cloud Dataflow in a lot more details, including code samples. They showed how to easily construct a streaming pipeline which keeps a constantly-updated lookup table of most popular Twitter hashtags for a given prefix. They also explained how Cloud Dataflow builds on over a decade of data processing innovation at Google to optimize processing pipelines and free users from the burden of deploying, configuring, tuning and managing the needed infrastructure. Just like Cloud Pub/Sub and BigQuery do for event handling and SQL analytics, respectively.

Later that afternoon, Felipe and Jordan showed how to build predictive models in “Predicting the future with the Google Cloud Platform“.

We had also prepared some recorded short presentations. To learn more about how easy and efficient it is to use Hadoop and Spark on Google Cloud Platform, you should listen to Dennis in “Open Source Data Analytics“. To learn more about block storage options (including SSD, both local and remote), listen to Jay in “Optimizing disk I/O in the cloud“.

It was gratifying to see well-informed people recognize the importance of these announcement and partners understand how this will benefit their customers. As well as some good press coverage.

It’s liberating to now be able to talk freely about recent progress on our quest to equip Google Cloud users with easy to use data processing tools. Everyone can benefit from Google’s experience making developers productive while efficiently processing data at large scale. With great power comes great productivity.

1 Comment

Filed under Big Data, BigQuery, Cloud Computing, Cloud Dataflow, Everything, Google Cloud Platform, Implementation, Open source, Query, Spark, Tech, Utility computing

REST + RDF, finally a practical solution?

The W3C has recently approved the creation of the Linked Data Platform (LDP) Working Group. The charter contains its official marching orders. Its co-chair Erik Wilde shared his thoughts on the endeavor.

This is good. Back in 2009, I concluded a series of three blog posts on “REST in practice for IT and Cloud management” with:

I hereby conclude my “REST in practice for IT and Cloud management” series, with the intent to eventually start a “Linked Data in practice for IT and Cloud management” series.

I never wrote that later part, because my work took me away from that pursuit and there wasn’t much point writing down ideas which I hadn’t  put to the test. But if this W3C working group is successful, they will give us just that.

That’s a big “if” though. Religious debates and paralyzing disconnects between theorists and practitioners are all-too-common in tech, but REST and Semantic Web (of which RDF is the foundation) are especially vulnerable. Bringing these two together and trying to tame both sets of daemons at the same time is a daring proposition.

On the other hand, there is already a fair amount of relevant real-life experience (e.g. data.gov.uk – read Jeni Tennison on the choice of Linked Data). Plus, Erik is a great pick to lead this effort (I haven’t met his co-chair, IBM’s Arnaud Le Hors). And maybe REST and RDF have reached the mythical point where even the trolls are tired and practicality can prevail. One can always dream.

Here are a few different ways to think about this work:

The “REST doesn’t provide interoperability” perspective

RESTful API authors too often think they can make the economy of a metamodel. Or that a format (like XML or JSON) can be used as a metamodel. Or they punt the problem towards defining a multitude of MIME types. This will never buy you interoperability. Stu explained it before. Most problems I see addressed via RESTful APIs, in the IT/Cloud management realm, are modeling problems first and only secondarily protocol/interaction problems. And their failures are failures of modeling. LDP should bring modeling discipline to REST-land.

The “RDF was too much, too soon” perspective

The RDF stack is mired in complexity. By the time people outside of academia had formed a set of modeling requirements that cried for RDF, the Semantic Web community was already deep in the weeds and had overloaded its basic metamodel with enough classification and inference technology to bury its core value as a simple graph-oriented and web-friendly metamodel. What XSD-fever was to making XML seem overly complex, OWL-fever was to RDF. Tenfold.

Everything that the LDP working group is trying to achieve can be achieved today with existing Semantic Web technologies. Technically speaking, no new work is needed. But only a handful of people understand these technologies enough to know what to use and what to ignore, and as such this application doesn’t have a chance to materialize. Which is why the LDP group is needed. But there’s a reason why its starting point document is called a “profile”. No new technology is needed. Only clarity and agreement.

For the record, I like OWL. It may be the technology that most influenced the way I think about modeling. But the predominance of RDFS and OWL (along with ugly serializations) in Semantic Web discussions kept RDF safely out of sight of those in industry who could have used it. Who knows what would have happened if a graph query language (SPARQL) had been prioritized ahead of inference technology (OWL)?

The Cloud API perspective

The scope of the LDP group is much larger than Cloud APIs, but my interest in it is mostly grounded in Cloud API use cases. And I see no reason why the requirements of Cloud APIs would not be 100% relevant to this effort.

What does this mean for the Cloud API debate? Nothing in the short term, but if this group succeeds, the result will probably be the best technical foundation for large parts of the Cloud management landscape. Which doesn’t mean it will be adopted, of course. The LDP timeline calls for completion in 2014. Who knows what the actual end date will be and what the Cloud API situation will be at that point. AWS APIs might be entrenched de-facto standards, or people may be accustomed to using several APIs (via libraries that abstract them away). Or maybe the industry will be clamoring for reunification and LDP will arrive just on time to underpin it. Though the track record is not good for such “reunifications”.

The “ghost of WS-*” perspective

Look at the 16 “technical issues” in the LCD working group charter. I can map each one to the relevant WS-* specification. E.g. see this as it relates to #8. As I’ve argued many times on this blog, the problems that WSMF/WSDM/WS-Mgmt/WS-RA and friends addressed didn’t go away with the demise of these specifications. Here is yet another attempt to tackle them.

The standards politics perspective

Another “fun” part of WS-*, beyond the joy of wrangling with XSD and dealing with multiple versions of foundational specifications, was the politics. Which mostly articulated around IBM and Microsoft. Well, guess what the primary competition to LDP is? OData, from Microsoft. I don’t know what the dynamics will be this time around, Microsoft and IBM alone don’t command nearly as much influence over the Cloud infrastructure landscape as they did over the XML middleware standardization effort.

And that’s just the corporate politics. The politics between standards organizations (and those who make their living in them) can be just as hairy; you can expect that DMTF will fight W3C, and any other organization which steps up, for control of the “Cloud management” stack. Not to mention the usual coo-petition between de facto and de jure organizations.

The “I told you so” perspective

When CMDBf started, I lobbied hard to base it on RDF. I explained that you could use it as just a graph-based metamodel, that you could  ignore the ontology and inference part of the stack. Which is pretty much what LDP is doing today. But I failed to convince the group, so we created our own metamodel (at least we explicitly defined one) and our own graph query language and that became CMDBf v1. Of course it was also SOAP-based.

KISS and markup

In closing, I’ll just make a plea for practicality to drive this effort. It’s OK to break REST orthodoxy. And not everything needs to be in RDF either. An overarching graph model is needed, but the detailed description of the nodes can very well remain in JSON, XML, or whatever format does the job for that node type.

All the best to LDP.

4 Comments

Filed under API, Cloud Computing, CMDB Federation, CMDBf, DMTF, Everything, Graph query, IBM, Linked Data, Microsoft, Modeling, Protocols, Query, RDF, REST, Semantic tech, SPARQL, Specs, Standards, Utility computing, W3C

Yoga framework for REST-like partial resource access

A tweet by Stefan Tilkov brought Yoga to my attention, “a framework for supporting REST-like URI requests with field selectors”.

As the name suggests, “Yoga” lets you practice some contortions that would strain a run-of-the-mill REST programmer. Basically, you can use a request like

GET /teams/4234.json?selector=:(members:(id,name,birthday)

to retrieve the id, name and birthday of all members of a softball team, rather than having to retrieve the team roaster and then do a GET on each and every team member to retrieve their name and birthday (and lots of other information you don’t care about).

Where have I seen this before? That use case came up over and over again when we were using SOAP Web services for resource management. I have personally crafted support for it a few times. Using this blog to support my memory, here is the list of SOAP-related management efforts listed in the “post-mortem on the previous IT management revolution”:

WSMF, WS-Manageability, WSDM, OGSI, WSRF, WS-Management, WS-ResourceTransfer, WSRA, WS-ResourceCatalog, CMDBf

Each one of them supports this “partial access” use case: WS-Management has :

WSMF, WS-Manageability, WSDM, OGSI, WSRF, WS-Management, WS-ResourceTransfer, WSRA, WS-ResourceCatalog, CMDBf

Each one of them supports this “partial access” use case: WS-Management has SelectorSet, WSRF has ResourceProperties, CMDBf has ContentSelector, WSRA has Fragments, etc.

Years ago, I also created the XMLFrag SOAP header to attack a more general version of this problem. There may be something to salvage in all this for people willing to break REST orthodoxy (with the full knowledge of what they gain and what they loose).

I’m not being sarcastic when I ask “where have I seen this before”. The problem hasn’t gone away just because we failed to solve it in a pragmatic way with SOAP. If the industry is moving towards HTTP+JSON then we’ll need to solve it again on that ground and it’s no surprise if the solution looks similar.

I have a sense of what’s coming next. XPath-for-JSON-over-the-wire. See, getting individual properties is nice, but sometimes you want more. You want to select only the members of the team who are above 14 years old. Or you just want to count these members rather than retrieve specific information about them individually. Or you just want a list of all the cities they live in. Etc.

But even though we want this, I am not convinced (anymore) that we need it.

What I know we need is better support for graph queries. Kingsley Idehen once provided a good explanation of why that is and how SPARQL and XML query languages (or now JSON query languages) complement one another (wouldn’t that be a nice trifecta: RDF/OWL’s precise modeling, JSON’s friendly syntax and SPARQL’s graph support – but I digress).

Going back to partial resource access, the last feature is the biggie: a fine-grained mechanism to update resource properties. That one is extra-hard.

5 Comments

Filed under API, CMDBf, Everything, Graph query, IT Systems Mgmt, Manageability, Mgmt integration, Modeling, Protocols, Query, REST, SOAP, SOAP header, Specs, Standards, Web services, WS-Management, WS-ResourceCatalog, WS-ResourceTransfer, WS-Transfer, XMLFrag, XPath

Square peg, REST hole

For all its goodness, REST sometimes feels like trying to fit a square peg in the proverbial round hole. Some interaction patterns just don’t lend themselves well to the REST approach. Here are a few examples, taken from the field of IT/Cloud management.

Long-lived operations. You can’t just hang on for a synchronous response. Tim Bray best described the situation, which he called Slow REST. Do you create an “action in progress” resource?

Query: how do you query for “all the instances of app foo deployed in a container that has patch 1234 installed” in a to-each-resource-its-own-URL world? I’ve seen proposals that create a “query” resource and build it up incrementally by POSTing constraints to it. Very RESTful. Very impractical too.

Events: the process of creating and managing subscriptions maps well to the resource-oriented RESTful approach. It’s when you consider event delivery mechanisms that things get nasty. You quickly end up worrying a lot more about firewalls and the cost of keeping HTTP connections open than about RESTful purity.

Enumeration: what if your resource state is a very long document and you’d rather retrieve it in increments? A basic GET is not going to cut it. You either have to improve on GET or, once again, create a specifically crafted resource (an enumeration context) to serve as a crutch for your protocol.

Filtering: take that same resource with a very long representation. Say you just want a small piece of it (e.g. one XML element). How do you retrieve just that piece?

Collections: it’s hard to manage many resources as one when they each have their own control endpoint. It’s especially infuriating when the URLs look like http://myCloud.com/resources/XXX where XXX, the only variable part, is a resource Id and you know – you just know – that there is one application processing all your messages and yet you can’t send it a unique message and tell it to apply the same request to a list of resources.

The afterlife: how do you retrieve data about a resource once it’s gone? Which is what a DELETE does to it. Except just because it’s been removed operationally doesn’t mean you have no interest in retrieving data about it.

I am not saying that these patterns cannot be supported in a RESTful way. In fact, the problem is that they can. A crafty engineer can come up with carefully-defined resources that would support all such usages. But at the cost of polluting the resource model with artifacts that have little to do with the business at hand and a lot more with the limitations of the access mechanism.

Now if we move from trying to do things in “the REST way” to doing them in “a way that is as simple as possible and uses HTTP smartly where appropriate” then we’re in a better situation as we don’t have to contort ourselves. It doesn’t mean that the problems above go away. Events, for example, are challenging to support even outside of any REST constraint. It just means we’re not tying one hand behind our back.

The risk of course is to loose out on many of the important benefits of REST (simplicity, robustness of links, flexibility…). Which is why it’s not a matter of using REST or not but a matter of using ideas from REST in a practical way.

With WS-*, on the other hand, we get a square peg to fit in a square hole. The problem there is that the peg is twice as wide as the hole…

37 Comments

Filed under Cloud Computing, Everything, Implementation, Modeling, Protocols, Query, REST, Utility computing

“Federationing”

I am glad to see that, as it inches towards standardization in the DMTF, the CMDBf specification is getting more visibility. Forrester’s Glenn O’Donnell recently wrote very positively about it on his blog, presenting it as a key enabler for a federation of MDRs (Management Data Repositories, a term introduced by the CMDBf specification so don’t look for it in ITIL). He argues this is the only way (rather than a single data store) to fulfill the ITIL-defined role of a CMS. Rob England (the IT Skeptic) has also shared his thoughts about CMDBf and they were noticeably less enthusiastic, to say the least. While Glenn calls the specification “profound”, Rob calls it “the most over-hyped vendor marketing smokescreen ever”. There is plenty of room in between them, which is where I sit. As I explained before, it does have real value (as a query language/protocol for system integration) but is nowhere near providing “federation” capabilities.

I am happy to see Glenn approve of CMDBf and I agree with him that accurate specialized MDRs are more useful than a single store that attempts to capture all the relevant data. As Glenn puts it, “pockets of the truth are far superior to unified ambiguity”. But I wasn’t very comfortable with the tone of his article, which seemed to almost encourage the proliferation of these MDRs. Maybe he was just trying to present a clean break with the “one big CMDB” approach and overreached. Or maybe I am just not reading properly.

Because while I agree that the answer is not “one and only one store” I also don’t want to loose the value of having as much unification of the IT model as possible. Both at the data level (i.e. same metamodel/model, consistent retention/roll-up policies…) and the access level (i.e. in the same physical store, with shared access control, accessible using a well-known DSL for data manipulation…). Metamodel transformation and model bridging are costly (in accuracy, maintenance, reliability). If your CMS does more than just support a  “model navigation” GUI it may then need to run large queries that go across several portions of your IT model, including multiple different domains (e.g. a compliance rule kicked-off at the app level based on the type of data it manipulates that ends up having to look at the physical location of the servers running the hypervisors for the virtual machines that power the app). Through such global queries you can apply configuration rules, do impact analysis, event correlation, provide context to your transaction tracing, etc. No consolidation means no such queries (or a very limited subset). Considering the current state of federation, there is a lot more that you can do with your CMS if you have a very small number of MDRs rather than a sea of “federated” MDRs. This is why, as Oracle acquires IT management companies, we deliberately integrate their repositories with Enterprise Manager.

[UPDATED 2009/4/8: More, along the same line, from Glenn and his co-author Carlos Casanova available here. And my CMDBf partner-in-crime Van Wiles also responded to Glenn, bringing a BMC perspective.]

1 Comment

Filed under Application Mgmt, Automation, CMDB, CMDB Federation, CMDBf, Everything, IT Systems Mgmt, ITIL, Mgmt integration, Modeling, Query, Specs, Standards

WS Resource Access working group starting at W3C

Things went quiet for a while, but the W3C Web Services Resource Access Working Group has finally taken life, as was announced last week. It’s a well-know PR trick to announce bad news on a Friday such that it goes undetected, is it a coincidence that W3C picked a Friday for this announcement?

As you can tell by this last remark, I have no trouble containing my enthusiasm about this new group. Which should not come as a surprise to regular readers of this blog (see this, this, this and this, chronologically).

The most obvious potential pushback against this effort is the questionable architectural need to redo over SOAP what can be done over simple HTTP. Along the lines of Erik Wilde’s “HTTP over SOAP over HTTP” post. But I don’t expect too much noise about this aspect, because even on the blogosphere people eventually get tired of repeating the same arguments. If some really wanted to put up a fight against this, it would have been done when the group was first announced, not now. That resource modeling party is over.

While I understand the “WS-Transfer is just HTTP over SOAP over HTTP” argument, this is not my problem with this group. For one thing, this group is not really about WS-Transfer, it’s about WS-ResourceTransfer (WS-RT) which adds fine-grained resource access on top of WS-Transfer. Which is not something that HTTP gives you out of the box. You may argue that this is not needed (just model your addressable resources in a fine-grained way and use “hypermedia” to navigate between them) but I don’t really buy this. At least not in the context of IT management models, which is where the whole thing started. You may be able to architect an IT management system in such RESTful way, but even if you can it’s too far away from current IT modeling practices to be practical in many scenarios (unfortunately, as it would be a great complement to an RDF-based IT model). On the other hand, I am not convinced that this fine-grained access needs to go beyond “read” (i.e. no need for “fine-grained write”).

The next concern along that “HTTP over SOAP over HTTP” line of thought might then be why build this on top of SOAP rather than on top of HTTP. I don’t really buy this one either. SOAP, through the SOAP processing model (mainly the use of headers, something that WS-RT unfortunately butchers) is better suited than HTTP for such extensions. And enough of them have already been defined that you may want to piggyback on. The main problem with SOAP is the WS-Addressing tumor that grew on it (first I thoughts it was just a wart, but then it metastatized). WS-RT is affected by it, but it’s not intrinsic to WS-RT.

Finally, it would be a little hard for me to reject SOAP-based resources access altogether, having been associated with many such systems: WSMF, WSDM/WSRF, WS-Management and even WS-RT in its pre-submission days (and my pre-Oracle days). Not that I have signed away my rights to change my mind.

So my problem with WS-RAWG is not a fundamental architectural problem. It’s not even a problem with the defects in the current version of WS-RT. They are fixable and the alternative specifications aren’t beauty queens either.

Rather, my concerns are focused on the impact on the interoperability landscape.

When WS-RT started (when I was involved in it), it was as part of a convergence effort between HP, IBM, Intel and Microsoft. With the plan to use this to unify the competing WS-Management and WSDM/WSRF stacks. Sure it was also an opportunity to improve things a bit, but 90% of the value came from the convergence/unification aspect, not technical improvements.

With three of the four companies having given up on this, it isn’t much of a convergence anymore. Rather then paring-down the number of conflicting options that developers have to chose from (a choice that usually results in “I won’t pick either sine there is no consensus, I’ll just do it my own way”), this effort is going to increase it. One more candidate. WS-Management is not going to go away, and it’s pretty likely that in W3C WS-RT will move further away from it.

Not to mention the fact that CMDBf (and its SOAP-based graph-oriented query protocol) has since emerged and is progressing towards standardization. At this point, my (notoriously buggy) crystal ball shows a mix of WS-management and CMDBf taking the prize overall. With WS-Management used to access individual resources and CMDBf used to access any kind of overall system view. Which, as a side note, means that DMTF has really taken this game over (at least in the IT management domain) from W3C and OASIS. Not that W3C really wanted to be part of the game in the first place…

11 Comments

Filed under CMDBf, DMTF, Everything, HP, IBM, IT Systems Mgmt, Manageability, Mgmt integration, Microsoft, Query, REST, SOAP, SOAP header, Specs, Standards, W3C, WS-Management, WS-ResourceTransfer, WS-Transfer

Here be (XML) dragons

Spoiler alert: if you like to learn things the hard way, don’t follow this link. It points to a clear description of all the problems, frustrations, disillusions and “ah ah!” moments that are ahead of you as you start to use XML and grow into an expert.

If, on the other hand, you like to be fully prepared and informed when you choose a technology and if you don’t mind sacrificing some adventure and excitement in the process, then you owe it to yourself to read Erik Wilde and Robert Glushko’s XML Fever article. Even if you already consider yourself an XML expert. Especially if you do.

I knew I would like it when I read this in the introduction:

Advanced strains of XML fever often take hold after exposure to the proliferation of more complex and esoteric XML-based technologies layered on top of it. These advanced diseases are harder to catch, but they are also harder to remedy because people who have caught these advanced strains tend to congregate with others with the same diseases and they are continually reinfecting each other.

Oh yes they do. And they speak with such authority that they infect others around them. People who don’t even understand these “more complex and esoteric XML-based technologies” end up being convinced of their magical properties and the need to use them.

I am not going to attempt to summarize the article because it is too tightly packed with great content to be summarized without being butchered. The “tree trauma” section alone could probably save the world billions of dollars in lost productivity if it was widely read.  I’ll just quote a few sections to motivate you to go read the whole thing.

Tree tremors. Whereas tree trauma (discussed earlier) is a basic strain of XML fever caused by the various flavors of trees in XML technologies, tree tremors are a more serious condition afflicting victims trying to manage data in XML that is not inherently tree-structured. The most common causes are data models requiring nontree graph structures and document models needing overlapping structures. In both cases, mapping these models to XML’s tree model results in XML structures that cannot conveniently represent the application-level model.

(…)

The choice of schema languages, however, is more often determined by available tool support and acquired habits than by a thorough analysis of what would be the most appropriate language.

(…)

Triple shock. While RDF itself is simple, large datasets easily contain millions of triples (for truly large datasets this can go up to billions), and managing and querying such a big dataset can become a considerable challenge. If the schema of these large datasets is simple, but ontology overkill has set in and it has been reformulated as an ontology, handling this dataset may become considerably harder, without any immediate benefit.

This is true not just for RDF (a graph model that can be serialized in XML) but for any non-tree model that can be serialized in XML (which is to say any model one can think of). Including every graph model.

Maybe it would help if the article stated more clearly that it’s ok to serialize such a model as XML (e.g. for transmission) as long as you don’t process it (at the application level) as XML. As long as it gets accessed using an API and concepts that are aligned with the semantics of the model.

Imagine that you are receiving an RDF dataset over the wire. You could (if your app runs on the network card rather than in CPU) process it as a bunch of electrical impulses, but that wouldn’t be very convenient. You could process it as a bunch of bits, but that’s still hard. You could process it as a character stream but that’s not that much better. You could process it as XML but that’s still no great. Or you could process it as RDF triplets and be home on time to have dinner with your family. It’s not the fact that it is represented as XML at some point that’s the problem, it’s the fact that your application processes it as XML. Said in another way, just because it makes sense to store it or to send it over the network in XML doesn’t mean that you have to process it as XML in your application.

There is at least one more problem (not covered by the article) that people will eventually run into. You’d think that XML technologies are a consistent and complementary set. Not true. The lack of consistency is illustrated by the “tree trauma” section of the article. But there is also a complementarity problem, in the sense that there are large gaps between the specifications, as anyone who has tried to serialize an XPath nodeset has found out.

As the article points out, all this doesn’t mean that XML is bad or useless. XML technologies can be very useful, but for not for all tasks.

3 Comments

Filed under Everything, Graph query, Modeling, Query, RDF, Specs, Standards, Tech, XPath, XQuery

CMIS, APP, Zen-SOAP and WS-KitchenSink: some data points

The recent release of an early draft of a content management specification (CMIS, for Content Management Interoperability Services) provides an interesting perspective on not just SOAP-versus-REST but also Zen-SOAP versus WS-KitchenSink.

I know little about content management and I have no comment about the specification from that respect. Others have better informed opinions on that aspect.

What is of interest to me, and where I have some experience, is the way the spec-defined operations are bound to underlying protocols. Here is the way the specification is structured: Part I describes the data model and the operations exposed by all the services. Part II comes in two flavors: a REST binding (based on APP, the Atom Publishing Protocol) and a Web services binding (based on SOAP).

This is the first time, to my knowledge, that someone (who presumably isn’t a participant in the SOAP/REST religious war but simply wants to get something done) describes two ways to achieve a real-life task, using either APP or SOAP. I expect that this will attract a lot of attention and provide data in the SOAP versus REST debate.

But this is not what I want to write about. I’ll just point out that the REST binding specification somehow is twice as long as the SOAP binding specification, which I find intriguing but not necessarily meaningful (things are looking good for your bet Sanjiva).

What really caught my attention is how SOAP is used in CMIS. You can hardly tell it’s SOAP. CMIS just defines XML messages to be used as payload for requests and responses. You would be excused for forgetting halfway through your implementation that you’re supposed to wrap those in a SOAP envelope. Headers are a no-show. The specification says it uses SOAP faults but it actually goes out of its way to avoid the existing elements for fault code and fault message and instead invent its own. The only SOAP feature it really uses is MTOM.

Except for the MTOM part, this reminds me of what SOAP was at the beginning of the decade, before any header had been defined (other than those used as illustration in the SOAP specification itself). I want to call it Zen-SOAP, by opposition to the WS-KitchenSink approach in which even simple, synchronous, clear-text, request-response SOAP exchanges somehow get saddled with a half dozen WS-Addressing headers before they’ve even left the gate (did I mention that I don’t like WS-Addressing?).

Another comedian in the WS-KitchenSink theater troupe is the WS-Transfer stack and especially WS-ResourceTransfer (WS-RT). Unless I read too much into this draft of CMIS, its content is devastating in two ways for WS-ResourceTransfer: in one fell swoop it shows that the specification is mostly useless and it destroys the argument that WS-ResourceTransfer needs to be stand-alone as opposed to just a part of WS-Management.

In “who needs XPath fragment-level PUT?”, I tried to make the case that the use of XPath in WS-RT to do fine-grained updates is a case of over-engineering. That there is no real need for it. Still, in that article I try to think of cases where the feature might be justified. I came up with two and I wrote that “one is if the resource actually is a document (as opposed to having its state represented by a document). For example, a wiki page”. But I dismissed it because wiki-land is REST country. I didn’t think of it at the time, but there is an “enterprise” version of wiki, a world in which, presumably, SOAP is well-regarded: Content Management Systems. Surely, if there is a domain that needs a fine-grained SOAP-based document editing protocol it’s the CMS world.

Today’s release of CMIS demolishes this use case with two punches to the guts:

  • They do have a query language, but it is SQL-based, not XPath-based.
  • The query is only used for reads, not for updates. Updates are done through specialized operations (addObjectToFolder, moveObject, updateProperties, createRelationship…).

This goes beyond not using a generic fine-grained update mechanism. It also goes against using any generic GET/SET operation. The blow reaches all the way to WS-Transfer. For all this, CMIS comes out a much simpler specification and it also frees itself from the web of dependencies (on specifications at different stages of standardization) that has plagued specifications that use WS-Transfer and will plague WS-Federation for using WS-RT.

It will be interesting to see what happens when the WS-* architects and Microsoft and IBM get hold of the CMIS specification and of its authors in their companies. I am especially worried about the fate of the IBM CMIS authors. The recent news about Oslo show that the XML people at Microsoft are a lot more willing to put the XML tools back in the box when needed.

In truth, the CMIS authors do appear to need some help from the SOAP experts in their companies, if only to fix the way they use SOAP faults and to help the poor soul who put this comment in the WSDL:

<!– had to use include – .net wsdl.exe code generator doesn’t seem to like imports on the schema –>

But they might be getting more “suggestions” than they bargained for. In the same way that the WS-Federation folks were going on their own merry way until it was “suggested” to them by someone (who probably had an agenda) to use WS-RT. I’ll try to keep an eye on how CMIS evolves.

In the meantime, I find in CMIS data points that reinforce my opinion that WS-Transfer should be absorbed by WS-Management, WS-MeX and WS-Federation should return to defining their own operations and WS-RT should be left to die (or, for a more positive spin, be used as inspiration in the next version of WS-Management).

[UPDATED 2008/10/02: Roy Fielding doesn’t like the so-called-RESTful binding. Sam Ruby cautiously defends it. Links via Billy Cripe.]

[UPDATED 2009/5/1: For some reason this entry is attracting a lot of comment spam, so I am disabling comments. Contact me if you’d like to comment.]

4 Comments

Filed under Everything, IBM, Microsoft, Query, REST, SOAP, SOAP header, Specs, Standards, Tech, WS-Management, WS-ResourceTransfer, WS-Transfer, XPath

Oslo, blog posts and my crystal ball

There is more and more information coming out about Oslo in anticipation of the Microsoft PDC in October.

David Chappell recorded a video about it last month. More recently Doug Purdy and Don Box each posted a short description of Oslo. Don describes the goal of Oslo as “simplify the process of developing, deploying, and managing software”. But when he lists ancestor technologies to illustrate that “Microsoft has been moving in this direction for over a decade now”, they are all about development, not management: COM type libraries, .NET metadata attributes, XAML. Interesting that neither SDM nor SML gets a mention. Neither did SCA by the way, but I wasn’t really expecting that one… :-)

Maybe the I am the only one looking for a SDM/SML echo here, just because I came to hear of Oslo through the DSI angle. Am I wrong to see Oslo as an enabler for DSI? This eWeek article doesn’t have anything to do with IT management. Reading it, Oslo is all about allowing people to write code through drag and drop. Yawn. And Don Box endorses the article.

Maybe it’s just me (an IT management guy more than a software development guy) but I don’t care so much about how the application model is created. I care a lot more about what it allows you to do in terms of IT management. Please don’t make me pull out the often-quoted figure about the percentage of IT budget spent on operations versus development/licensing. The eWeek piece fails to excite me, but fortunately David Chappell’s video interview is a lot more aligned with my thinking, so I still hold hopes for Oslo as an IT management enabler. Here is my approximate transcript of an example that David provides (at around 4:20) in the video:

“If someone comes to you and says i’ve got this business process and the SLA is not being met, what do you do? You’ve got to trace this through the right business process and the right application that supports that part of the process and find the machine it runs on and maybe look at the workflow that implements it and maybe look at the services that it provides. This involves talking to business analysts, or the IT pros or the architect or the developer, all of whom have their own view of the world, their own tools, their own prospective. The repository provides a common place to store all this stuff, to link it all together, and with a visual editor to have a common tool that lets you actually go through and answer this kind of questions.”

Now you’re talking.

And if Oslo is not the new blood of DSI, then what is? The DSI story is getting dated, SML is fading in our memories and of the three parts that supposedly compose DSI (“virtualized infrastructure, design for operations, and knowledge-driven management”), only virtualization is actually represented on the list of technologies on the DSI home page. Has DSI turned into just allowing System Center to manage a hypervisor? I still hold hopes that the Oslo data is going to spice things up there. It would be good for the industry at large, not just Microsoft.

I won’t be at the PDC but it will be interesting to see what filters out of these sessions. The first session in the list adds management of hybrid application systems (hybrid as in “cloud/on-premise combination” or “software+services” as Microsoft calls it), to the long “can do” list for Oslo. Impressive, if there is some meat behind the abstract. I think this task is often overlooked in discussions around management aspects of Cloud computing (see “the new, interesting thing is going to be the IT infrastructure to manage your usage of utility computing services as well as their interactions with your in-house software” in this previous entry).

Yes, I am reading way too much into session abstracts, but while I am at it I can’t help noticing that there is a lot of SQL and very little XML/XSD/XPath mentioned there. Even though one of the presenters is Gudge, the only person I have ever met who fully understands XSD (actually even he doesn’t, I’ve seen him in the WS-I days have to refer to… his book).

Even though I am sure we’ll be told that SML can be built on top of Oslo, the SQL orientation won’t make that so easy (I want to see how to build XSD+Schematron validation on top of a relational store using Oslo’s drag and drop development tool). And it puts Microsoft on a different architectural direction from IBM, who, as far as I can tell, thinks that the world is a big XML document. Neither is the most appropriate for IT management models. I prefer a graph model and associated graph queries along the lines of SPARQL or CMDBf.

But that’s just late-night idle speculations on my part (aka “blogging”). Let’s see what comes out in October.

[UPDATED 2008/9/10: Interesting timing. Microsoft is joining OMG, home of UML and BPMN. Coming next: a submission of a “new version” of UML and BPMN that happens to contain the extensions and tweaks that Microsoft made to them in the process of implementing Oslo. This, BTW, is the final nail in the SML coffin (SML isn’t even mentioned in the press release).]

3 Comments

Filed under Application Mgmt, CMDBf, Conference, Desired State, Everything, Graph query, IT Systems Mgmt, Mgmt integration, Microsoft, Middleware, Modeling, Oslo, Query, SaaS, SCA, SML, SPARQL, Specs, Tech, Trade show, Utility computing, Virtualization

I have seen the future of CMDBf

I got a sneak peak at CMDBf v2 today.

I am calling it v2 based on the assumption that the one being currently standardized in DMTF will end up being called 1.0 (because it’s the first one out of DMTF) or 1.1 (to prevent confusion with the submitted version).

At the Semantic Technology Conference, David Booth from HP presented his work (along with his partner, Steve Battle from HP Labs) to provide a SPARQL front-end to HP’s Universal CMDB (the engine under what was the Mercury MAM product). Here are the slides.

The mapping from SPARQL to TQL (the native query interface for UCMDB) was made pretty easy by the fact that TQL is a graph-oriented query language. How much harder would it be to similarly transform a CMDBf (v1) query interface into a SPARQL query interface (and vice-versa)? Not much. The only added difficulty would come from the CMDBf XPath constraints. TQL has a property value mechanism that is very similar to CMDBf’s “propertyValue” constraint and maps well to SPARQL functions. The introduction of XPath as a constraint language in CMDBf makes things harder. It could be handled by adding XPath support to the SPARQL engine using function extensibility. Or by turning the entire XML into RDF and emulating XPath in SPARQL. But in either case, you’ll have impedence mismatch at some point because concepts such as element order that exist in XPath have no native equivalent in RDF.

The use of XPath in selectors on the other hand is not a problem. HP’s prototype uses Gloze (available as a Jena package) to turn the XML returned by UCMDB into RDF. An XSLT transform could turn that same XML into a CMDBf-valid XML response instead and that XSLT could easily handle the XPath selectors from the query request. This is another reason why constraints and selectors should remain separate in CMDBf (fortunately the specification is back to doing this properly).

Here is why I call this prototype CMDBf v2: The CMDBf effort (v1 or 1.1), in its current form of re-inventing a graph query, can succeed. Let’s assume the working group strikes a reasonable balance between completeness and complexity, and vendors choose to compete on innovation and execution rather than lock-in (insert cynical comment here). CMDBf may then end up being supported by the main CMDB vendors. It wouldn’t provide federation capabilities, but having a common CMDB query interface supported by the Big Four would help with management integration. And yet, while the value would be real, it would only provide a little help to solve a larger problem:

  • As a technology limited to IT systems management, it would be unlikely to see widely available tools (e.g. user consoles and language-specific libraries).
  • It wouldn’t get the kind of robustness and interoperability that comes from wide adoption. While pretty similar, there might be some minor differences in the various implementations. Once your implementation has been tweaked to work with the implementations from the Big Four, you’ll call it done. Just like SNMP, another technology that is specific to IT systems management (see it happen here).
  • Even if it works perfectly at the query level, it will just hasten the time when developers run into the real problem, model interoperability. CMDBf doesn’t help at all with this. In fact, it makes it harder by hard-coding some dependencies on an XML back-end (the XPath constraints).

In the long run, IT management has to become more automated and integrated. That’s a given. The way it happens may or may not go through CMDB-like configuration stores. But if it does, we’ll have to eventually move beyond CMDBf (v1) towards something that addresses the three requirements above. And federation. I don’t know if it will be called CMDBf v2, and/or if it will come from the DMTF (by then, the CMDBf brand might be an asset or a liability depending on developer experience with the specification). But I strongly suspect (“probability 0.8” as a Gartner analyst might put it) that it will use semantic technologies. Because the real, hard, underlying problem is a problem of semantic integration. In that sense, David and Steve’s prototype is a sneak peek at what will come after CMDBf v1/1.1.

Pretty much since the beginning of CMDBf I have been pushing for it to ideally embrace SPARQL (with no success) or to at least stay close to it conceptually in order to make the eventual mapping/evolution smooth (with a bit more success). This includes pushing for a topological query language, trying to keep XML idiosyncrasies at bay and keeping constraints and selectors cleanly separated. Rather than working within the CMDBf group, David took the alternative approach of simply doing it. Hopefully this will help convince people of the value of re-using semantic web technology for IT systems management. Yes semantic technologies have been designed for a much more general use case. But the use cases that CMDB systems address are a subset of the use cases addressed by semantic technologies. It’s hard for domain experts to see their domain as just a subset of a larger problem, but this is the case here. Isn’t HTTP serving the IT management community better than a systems management-specific alternative would?

By the way, there is no inferencing taking place in the HP prototype. We are just talking about re-using an existing, well though-through graph query language. Sure OWL inferencing and some rules could be seamless layered on top of this. But this is in no way required to do (better) what CMDBf v1 tries to do.

And then there is the “federation” question. Who do you trust more to deliver this? A bunch of IT system management architects in DMTF or the web and query experts at W3C, HP Labs etc who designed and implemented SPARQL over many years? BTW, it sounds like SPQARL federation was discussed at WWW 2008, based on these meeting notes (search for “federation”).

2 Comments

Filed under Automation, CMDB, CMDB Federation, CMDBf, Conference, DMTF, Everything, Graph query, HP, IT Systems Mgmt, Query, RDF, Semantic tech, SPARQL, Standards, W3C, XPath

Of graphs and trees: Kingsley Idehen to the rescue

I just read the transcript of Jon Udell’s podcast interview of Kingsley Idehen. It’s almost two years old but it contains something that I have tried (and mostly failed) to explain for a while now, so maybe borrowing someone else’s words (and credibility) would help.

Kingsley says:

“A graph model, ideally, will allow you to explore almost all the comprehensible dimensions of the nodes in that network. So you can traverse that network in a myriad of different ways and it will give you much more flexibility than if you’re confined to a tree, in effect, the difference between XQuery and SPARQL. I always see the difference between these two things as this. If you visualize nodes on a network, SPARQL is going to get you to the right node. Your journey to what you want is facilitated by SPARQL, and then XQuery can then take you deeper into this one node, which has specific data that the graph traversal is taking you to.”

Nicely said, especially considering that this is not a prepared statement but a transcript of a (presumably) unscripted interview.

He later provides an example:

“Let’s take a microformat as an example. HCard, or an hCalendar, is a well-formed format. In a sense, it’s XML. You can locate the hCard in question, so if you had a collection of individuals who had full files on the network in the repository, it could be a graph of a social network or a group of people. Now, through that graph you could ultimately locate common interests. And eventually you may want to set up calendars but if the format of the calendar itself is well formed, with XQuery you can search a location, with XPath it’s even more specific. Here you simply want to get to a node in the content and to get a value. Because the content is well formed you can traverse within the content, but XQuery doesn’t help you find that content as effectively because in effect XQuery is really all about a hierarchical model.”

Here is one way to translate this to the IT management domain. Replace hCard with an XML-formated configuration record. Replace the graph of social relationships with a graph of IT-relevant relationships (dependency, ownership, connections, containment…). Rather than attempt to XQuery across an entire CMDB (or, even worse, an entire CMDB federation), use a graph query (ideally SPARQL) to find the items of interest and then use XPath/XQuery to drill into the content of the resulting records. The graph query language in CMDBf is an attempt to do that, but it has to constantly battle attempts to impose a tree-based view of the world.

This also helps illustrate why SPARQL is superior to the CMDBf query language. It’s not just that it’s a better graph query language, one that has received much more review and validation by people more experienced in graph theory and queries, and one that is already widely implemented. It also does something that CMDBf doesn’t attempt to do: it lets you navigate the graph based on the semantics appropriate for the task at hand (dependency relationships, governance rules, distributed performance management…), something that CMDBf cannot do. There is more to classification than simply class inheritance. I think this is what Kingsley refers to when he says “in a myriad of different ways” in the quote above.

Here is a way to summarize the larger point (that tree and graph views are complementary):

Me Tarzan, you Jena

Where Tarzan (appropriately) represents the ability to navigate trees and Jane/Jena represents the ability to navigate graphs (Jena, from HP Labs, is the leading open source RDF/OWL/SPARQL framework). As in the movie, they complement each other (to the point of saving one another’s life and falling in love, but I don’t ask quite that much of SPARQL and XQuery).

On a related topic, I recently saw some interesting news from TopQuadrant. Based on explicit requests from the majority of their customers, they have added capabilities to their TopBraid Composer product to better make use of the RDF/OWL support in the Oracle database. TopQuadrant is at the forefront of many semantic web applications and the fact that they see Oracle being heavily used by their customers is an interesting external validation.

[UPDATED 2008/03/05: more related news! The W3C RDB2RDF incubator group has started is life at W3C, chaired by my colleague Ashok Malhotra, to work on mappings between RDF/OWL and relational data.]

1 Comment

Filed under CMDB Federation, CMDBf, Everything, Graph query, Query, RDF, SPARQL, Standards, W3C, XPath, XQuery

An interesting business process query language

While doing some research on the different ways to probe and squeeze business process definitions to extract insight relevant for IT management I ran into this very interesting paper: Querying Business Processes. It defines a query language (called BP-QL) to query process definitions. Not much in common with CMDB Federation at first sight, and CMDBf was not on my mind at the time. Until I looked at the description of the query language that the researchers came up with. It is strikingly similar to the CMDBf query language. This is not very surprising since both are graph-based query languages that rely on patterns (where the patterns mix topological aspects with constraints on node/link properties).

CMDBf is more complete in some respects. It supports properties on the relationships, not just the items. The “depthLimit” element provide more control than BP-QL’s double-headed edges. BP-QL has its own extra features, including support for joins (something we discussed in CMDBf and that could be added to the specification) and negation at the graph level (e.g. A and B are not connected by any relationship of type “foo”, which may be useful but one should remember that CMDB discovery is rarely guaranteed to be comprehensive so an open-world approach is often preferable).

Assuming a suitable CMDB model for business processes, a CMDBf-compliant CMDB should cover many of the simpler use cases addressed by BP-QL. And reciprocally, the more advanced features in BP-QL are not really specific to business process definitions (even though that’s the scope of the paper) and could well be applied to CMDBf. I was also very interested by the BP-QL “compact representation” and the implementation choices. I hadn’t heard of Active XML before, something to look into especially if, as the paper hints, it does a better job than XQuery at dealing with idrefs. And Active XML introduces some interesting federation (or at least distribution) capabilities that are not currently exploited by BP-QL but which I find intriguing and which reinforce the parallel with the declared goal of CMDBf.

Is this similarity between the query languages just an interesting pattern to notice? Or is there more to it? The parallel between BP-QL and CMDBf invites the question of whether one should model business processes in a CMDB. And if so, is a business process represented by just one CI or do you break it down into a model similar to the one the BP-QL query language works on? You would need to go that far if you wanted to use queries to the CMDB to answer questions such as those handled by the BP-QL engine. And by doing this in the context of a CMDB that contains a lot more than just process definitions, you’d be able to enrich the queries with considerations from other domains, such as application or host topology. Modeling business process steps/activities may seem like very fine-grained modeling for a CMDB, but isn’t this part of the sales pitch for federated CMDBs, that participants in the federation can provide different levels of granularity? Of course, CMDB federation might never work out. If it does work and if we use it that way, we are not talking about just supporting change management processes (which are more likely to take place at the level of the overall process definition than the individual step) but rather about management integration for a wide variety of use cases. If that means we need to drop the term CMDB along the way (and leave it for the sole usage of the IT process people), I am more than happy to oblige.

[UPDATE on 2008/01/11: Prof. Milo pointed me to this follow-up paper that proposes a similar looking query language except that this time it is targeted at monitoring process instances rather than analyzing process definitions. And the monitoring runs as a set of BPEL processes within the monitored BPEL engine. Her group is doing some very interesting work.]

2 Comments

Filed under Business Process, CMDB Federation, CMDBf, Everything, Graph query, Mgmt integration, Query, Research

Illustrative algorithm for CMDBf 1.0 Query operation

When I posted an algorithm for the server side implementation of a CMDBf Query call for version 0.95 of the specification, the interoperability testing session based on that version was over and I was pretty sure no-one but those of us who participated in that session would write an implementation of 0.95. But I published the algorithm anyway since I thought it was helpful to anyone who wanted to understand the specification in depth, even if they were not implementing it. Now that 1.o is out, there is a much higher probability of people implementing the specification, so I figured it would be worth updating the algorithm to take into account the changes from 0.95 to 1.0. So here it is.

One caveat. This algorithm assumes that the query request does not make use of the xpathExpression element because, as I have explained in my review of CMDBf 1.0, I don’t think interoperability is achievable on this feature in the current state of the specification.

As a note of caution, the previous version of the algorithm was backed by my implementation of CMDBf 0.95 for the interoperability testing, so I felt pretty confident about it. For this version of the algorithm I have not written a corresponding implementation and I have not done interoperability testing with anyone, it’s just based on my reading of the specification. The handling of depthLimit in particular is a little tricky and needs to be validated by implementation (what with creating a bunch of dummy item and relationship templates with temporary names and later going back to the original template names), please let me know if you find it flawed.

And, as previously, this is in no way an optimal implementation strategy. It is the most direct and obvious set of steps that I can come up with to implement the Query call in a way that exactly conforms to the specification. There are lots of ways to make this go faster, such as the ones I mentioned in a previous post (e.g. breaking out of loops once an instance has been removed, or not recalculating L1 and L2 over and over again for relationships in the same working set that share a source/target) plus new ones such as being smarter than my brute-force approach to handling depthLimit (in step 2).

All this said, here is the algorithm:

1) for each itemTemplate, calculate the set of all items (including relationships since they are a subclass of item) that obey the instanceIdConstraint and recordConstraint elements in the template (if present). Call this the working set for the itemTemplate.
2) for each relationshipTemplate RT that has a depthLimit element:

2.1) for i ranging from 1 to the value of maxIntermediateItems for RT:

2.1.1) create an itemTemplate that is an exact copy of the itemTemplate referenced by RT’s sourceTemplate, except that it has a new, unique temporary id (keep a record linking that new id to the id of the original source itemTemplate).
2.1.2) create an itemTemplate that is an exact copy of the itemTemplate referenced by RT’s targetTemplate, except that it has a new, unique, temporary id (keep a record linking that new id to the id of the original target itemTemplate).
2.1.3) for j ranging 1 from i:

2.1.3.1) create an itemTemplate that is an exact copy of the itemTemplate referenced by RT’s intermediateItemTemplate, except that it has a new, unique, temporary id (keep a record linking that new id to the id of the original intermediary itemTemplate).
2.1.3.2) create a relationshipTemplate that is an exact copy of RT, except that its source is the itemTemplate created in the previous iteration of the current loop (or the itemTemplate created in step 2.1.1 if j=1), its target is the itemTemplate created in the previous step and it has a new, unique, temporary id (keep a record linking that new id to RT’s id).

2.1.4) create a relationshipTemplate that is an exact copy of RT, except that its source is the last itemTemplate created in the 2.1.3 loop, its target is the itemTemplate created in 2.1.2 and it has a new, unique, temporary id (keep a record linking that new id to RT’s id).

3) for each relationshipTemplate calculate the set of all relationships that obey the instanceIdConstraint and recordConstraint elements in the template (if present). Call this the working set for the relationshipTemplate.
4) set need_to_loop = true
5) while (need_to_loop == true)

5.1) set need_to_loop = false
5.2) for each relationshipTemplate RT

5.2.1) let ITsource be the itemTemplate that is referenced as sourceTemplate by RT. Calculate the set of all items (including relationships since they are a subclass of item) that obey at least one of the instanceIdConstraint elements in ITsource (assuming there is at least one such element) and all the recordConstraint elements in ITsource. Call this the working set for ITsource.
5.2.2) let ITtarget be the itemTemplate that is referenced as targetTemplate by RT. Calculate the set of all items (including relationships since they are a subclass of item) that obey at least one of the instanceIdConstraint elements in ITtarget (assuming there is at least one such element) and all the recordConstraint elements in ITtarget. Call this the working set for ITtarget.
5.2.3) for each relationship R in the working set for RT

5.2.3.1) if the source of R is not in the working set for ITsource, then remove R from the RT working set
5.2.3.2) if the target of R is not in the working set for ITtarget, then remove R from the RT working set
5.2.3.3) if RT has a source/@minimum or a source/@maximum attribute

5.2.3.3.1) find the list L1 of all relationships in the working set for RT that have the same source as R
5.2.3.3.2) if RT has source/@minimum and the cardinality of L1 is less than this minimum then remove all relationships in L1 from the RT working set
5.2.3.3.3) if RT has source/@maximum and the cardinality of L1 is more than this maximum then remove all relationships in L1 from the RT working set

5.2.3.4) if RT has a target/@minimum or a target/@maximum attribute

5.2.3.4.1) find the list L2 of all relationships in the working set for RT that have the same target as R
5.2.3.4.2) if RT has target/@minimum and the cardinality of L2 is less than this minimum then remove all relationships in L2 from the RT working set
5.2.3.4.3) if RT has target/@maximum and the cardinality of L2 is more than this maximum then remove all relationships in L2 from the RT working set

5.3) for each itemTemplate IT:

5.3.1) let sourceRTset be the set of all relationshipTemplates that references IT as its sourceTemplate
5.3.2) let targetRTset be the set of all relationshipTemplates that references IT as its targetTemplate
5.3.3) for each item I in the IT working set

5.3.3.1) for each relationshipTemplate sourceRT in sourceRTset, if there is no relationship in the working set for sourceRT that uses I as its source, remove I from the IT working set and set need_to_loop to true
5.3.3.2) for each relationshipTemplate targetRT in targetRTset, if there is no relationship in the working set for targetRT that uses I as its source, remove I from the IT working set and set need_to_loop to true

6) process the eventual contentSelector elements and/or the @suppressFromResult attributes on the templates that have matching items/relationships in the response to remove or pair down items and relationships as requested
7) package the resulting items and relationships in a way that conforms to the CMDBf response message format (including putting each item in the <nodes> element with the appropriate @templateId attribute and putting each relationship in the <edges> element with the appropriate @templateId).
8) replace all the temporary template ids (from step 2) that appear in templateId attributes in the response with the original ids of the items and template based on the records that were kept in step 2.

Just to clarify things, what I do in step 2 is simply make explicit all the itemTemplates and relationshipTemplates that are made implicit by the depthLimit element, so that we can provide with a simpler algorithm after that assumes that all relationshipTemplate correspond to direct relationships (no intermediary). And in step 8 I hide the fact that this took place.

[UPDATED 2009/5/1: For some reason this entry is attracting a lot of comment spam, so I am disabling comments. Contact me if you’d like to comment.]

8 Comments

Filed under CMDB, CMDB Federation, CMDBf, Graph query, IT Systems Mgmt, Pseudo-algorithm, Query, Specs, Standards, Tech

Review of the CMDBf specification version 1.0

Having read the recently released CMDBf 1.0 specification over the weekend, I see several improvements since 0.95, including:

  • the introduction of depthLimit
  • the lastModified metadata element
  • the ability to specify more than one instanceId in a template
  • the ability to advertise what parts of the specification you implement
  • the definition of faults

But while 1.0 is more complete than 0.95, I think it makes it harder to achieve interoperability. Here are the main friction points for interop:

New role for XPath

The xpathExpression element (which replaces xpath1Selector) changes in two very important ways. First, rather than being limited to XPath 1.0, it now also allows XPath 2.0. Support for this is a lot harder to achieve for people who don’t use XML as the backend format for their data. Considering the current state of adoption of XPath 2.0 and the low level of XML complexity exposed by most CMDB models, I don’t think it was opportune to bring this into CMDBf yet. And my guess is that most implementations will stay away from this. But there is a second change, less obvious but even more problematic. XPath is not just another constraint mechanism for a CMDBf template anymore, one that returns a boolean result indicating whether the instance meets the constraint or not, as it used to be in 0.95. It is now an alternative selection and filtering mechanism that lives in parallel to all the other elements in a template (and can’t mix with them). Overall, I think this change goes too far in the direction of turning a shared agreement to exchange data in XML into an assumption that the internal data models are all based on XML. And the killer with regards to interoperability is that the specification says nothing about how the resulting node sets are serialized in the response. There may be a serialization for the XPath 2.0 model, but there is no such thing for XPath 1.0 and I don’t see in the current state of the specification how two implementations have any chance to interoperate when using this feature.

Introduction of linkDepth

As I mentioned earlier, linkDepth is a very useful addition (even though it pales in comparison to the inferencing capabilities that could have been derived from basing CMDBf on RDF). But it is also a complicated feature. The intermediateItemTemplate attribute is a good re-use of the existing plumbing, but it needs at least a detailed example. I trust that the group will generate one once they’ve caught their breath from putting out the specification.

Service capability metadata

There is a new section (#6) to provide ways to describe what CMDBf features an implementation supports. But it is a very granular representation. Basically, for every feature you can describe if you support it or not. So someone may describe that they support everything inside propertyValue, except for the “like” operator. And someone else might support all the operators but not the caseSensitive modifier. That might be ok for human consumptions, but automated scenarios rely on pre-programmed queries and that is made very hard by all the possible combinations of options. What we need is a few well-defined profiles that people implement fully. Starting of course with a profile that rules out xpathExpression.

Record metadata

This new version introduces metadata on records. While recordId and lastModified are probably well understood and interoperably usable I am a bit more dubious about whether baselineId and snapshotId are going to be interoperable across vendors based on their limited description in the specification. The nice thing is that this metadata can not only be returned but also searched on. Well, at least that’s the intent. But this goes through the recordMetadata attribute on propertyValue which, while present in the pseudo-schema, is missing in the XSD…

The contentSelector element

This new element is more flexible that the propertySubsetDirective element that it replaces. In addition to specifying what properties you want returned it also allows you to specify that you only want certain record types and/or that you only want the record(s) that were used to satisfy constraints in the template. Those are nice additions, but the way the second part is implemented (through the use of the matchedRecords attribute) seems to assume that only one record in the instance was used to match all the constraints in the template. This is not necessarily the case, an instance can be selected by having different records match the different constraints in the template as long as it has at least one matching record per constraint (line 765 says “the item satisfies all the constraints”, not “a record of the item satisfies all the constraints” and you can also see this in the example in section 4.2 where the records mentioned on lines 637 and 639 don’t have to be the same). So do you return all records that have a role in matching the template, or only those (if there is any) that matches all the constraints on their own as the text seems to imply? And if several record combinations inside an instance can be used to match the constraints in a template, do I return all of them or can I just pick any subset that matches? Also, how can I say that I want all records that established the template match, independently of their type? There doesn’t seem to be a way to do this, or is it by putting a contentSelector element with no child element and the matchedRecords attribute set to false? There won’t be much interoperability on this feature until all this is clarified.

Relationships as items

A major change between 0.95 and 1.0 is that now a relationship can match an itemTemplate. For example, if you ask for all items that were modified during the last 24 hours you will get all the items and all the relationships that meet that criteria while in the previous version you’d have to explicitly request the relationships with a relationshipTemplate if you wanted to get them too). There is a good case to be made for either view and the one that works best largely depends on your backend implementation technology (RDF, objects, SQL, CIM…). But the important thing is for the spec to be clear and on this point I think the change wasn’t made explicit enough in the query section of the specification. If Van hadn’t called my attention to this on his blog, I would have missed this important change.

Security boilerplate

There is a person at IBM (probably located in a well-stoked underground bunker in upstate NY) who has instilled the fear of god in all IBM employees (at least all those who author publicly available specifications) and forces them to include a boilerplate “security considerations” section everywhere. I have co-authored several documents with IBM employees and it never fails, even thought it doesn’t add anything useful to the specification. You should see the look of fear on the face of the IBM employees when someone else suggests doing without it. We somehow managed to sneak one such slimmer specification past the IBMers with CMDBf 0.95 but I see that this has been “corrected” in 1.0. I hope that whatever painful punishment Scott, Jacob, Andrew and Mark (or their families and pets) were subjected to in the process by the IBM security ogre wasn’t too cruel. Sure, this doesn’t really impact interoperability, but now that I don’t work for a company that makes money from ink anymore, I have even less patience for this bloating.

OK, that’s enough back seat driving for now. Hopefully the standards group that will take over the specification will address all these questions. In the context of the entire specification, these are pretty small issues and mostly easy to fix. And the CMDBf group can go on to address the hard issues of federation (including security-related issues that abound in this field if one really wants to tackle them). The current specification is a useful graph-oriented query language that is a good match for CMDB data. But it’s really just a query language (plus a simple registration system).

[UPDATE: while updating the CMDBf query algorithm, I noticed another small error: maxIntermediateItems is an attribute in the pseudo-schema but an element in the schema. Something else to fix in the next version.]

3 Comments

Filed under CMDB, CMDB Federation, CMDBf, Everything, Graph query, IT Systems Mgmt, ITIL, Query, Specs, Standards, Tech

CMDBf 1.0 specification released

The CMDBf committee has just released version 1.0 of the specification. Van Wiles has an overview of the changes between 0.95 and 1.0. I left HP soon after 0.95 was released and that’s when my participation in CMDBf ended, so Van’s summary is very useful to me. The changes he lists are not surprising and some of them already existed in draft form before 0.95 publication. I need to spend some intimate time with the specification to review the changes to the template mechanism in more details. Some of the changes have the potential to make the specification quite a bit harder to implement. This is especially the case for the introduction of “depthLimit” (but it’s probably a needed feature anyway). And the fact that relationships can now match item selectors will make things either easier or harder to implement, depending on your implementation choice (e.g. straight-to-SQL/XML or through an OO or RDF model). Congrats to the group. We should soon hear about submission for standardization.

Comments Off on CMDBf 1.0 specification released

Filed under CMDB Federation, CMDBf, Everything, IT Systems Mgmt, Query, Specs, Standards, Tech

Tutorial and pseudo-algorithm for CMDBF Query operation

[UPDATE: an updated version of the algorithm that complies to version 1.0 of the specification (and not 0.95 as in this post) is now available]

The CMDBF Query operation (section 4 of the CMDBF specification) quickly becomes very intuitive to use, but it can first look a little strange to people used to SQL-style queries, because of its graph-based nature. I hope the normative text and the examples in the spec mitigate this. But just in case, here is some additional information that doesn’t belong in the spec but can be useful. Think of it as a very first draft of a primer/tutorial about that Query interface.

The easiest way to think about this interface is to think graphically. Imagine that you’re not writing your query in XML, but instead creating it in a GUI. You want to find all Windows XP machines that are owned by someone in the marketing department. First you create a circle in the GUI that represents the machine (through a Visio-like drag-and-drop into the query composer window). You right-click on that circle and in the right-click menu you select the option called “type” and you set it to “computerSystem”. Then you right-click to select “add property constraint”. You enter “OS_Version” as the property name (if your GUI tools is any good it will present you with a list to choose from, based on the previously selected type) and “WindowsXP” as the value. Already you’ve created a query that selects all Windows XP machines. Not a bad start.

But you only want the machines that are owned by marketing people. So go ahead and create another circle to represent a person. Use similar right-click actions to set the type to “person” and to set the “department” property to “marketing”.

If you submit the query at this point, you’ll get a list of all Windows XP machines and all people in the marketing department. So, rather than reducing the result (by removing machines not owned by marketing people), you have expanded it (by adding records for the marketing people). Oops.

The next step is to create a relationship that constrains the machine to belong to someone in marketing. This is done very simply in our handy GUI tool by drawing an arrow that goes from the “person” circle to the “machine” circle. This requires that there be a relationship between the two. If we want to further ensure that this relationship is of type “owns” (and not “uses” for example), we do this by right-clicking on the arrow (like we do on circles), selecting “type” and setting its value to “owns”.

If we run the query now, we get the list of all Windows XP machines owned by marketing people. We also get the list of the marketing people who own these machines and we get the relationships between people and machines. So we now have want we wanted. But maybe a little more. Maybe we only care about the list of machines, we don’t want to retrieve all the data about the marketing people. As long as we know the machines are owned by marketing people, that’s all that we care about. We don’t need to know what specific person owns what specific machine. We could simply ignore the people and relationships in the response. Or we could enrich the query to specify that the people and relationship need not be returned (but they are still part of the query in the sense that they limit what machines get returned).We do this by right-clicking on the “person” circle, and selecting the “suppress” option. Similarly, select the “suppress” option on the relationship (the arrow) too.

This query will return a list of all Windows XP machines that are owned by someone in marketing, and nothing else.

Here is what the query could look like graphically (here I assume that the GUI tools represents the fact that the arrow and the “person” circle have the “suppress” option selected on them by turning their solid lines into dotted lines and their text into italics):

The most intuitive way to think about what happens when the query gets processed is that the program looks for all instances of the patterns described by the query. In other words, it tries to superimpose the requested graph everywhere on the graph of available data and selects all the instances where the requested graph “fits”.

What does the GUI tool do behind the scene to turn this query into the proper XML, as described by the spec?

For each circle, it creates an <itemTemplate> element. For each arrow, it creates a <relationshipTemplate> element and sets its <source> and <target> elements to the right item templates. For each constraint on a circle or arrow (i.e. when we set the type or when we set the value or a give property) it creates the appropriate selector and embeds it in the <itemTemplate> or <relationshipTemplate> that corresponds to this circle or arrow. Finally, it sets the @dropDirective attribute to “true” on all the <itemTemplate> and <relationshipTemplate> elements that corresponds to circles and arrows on which the “suppress” option was selected.

Here is what the resulting query looks like in XML:

<query xmlns="http://schemas.cmdbf.org/0-9-5/datamodel">
  <itemTemplate id="machine">
    <propertyValueSelector namespace="http://example.com/computerModel" localName="OS_Version">
      <equal>Windows XP</equal>
    </propertyValueSelector>
  <recordTypeSelector namespace="http://example.com/computerModel" localName="computerSystem"/>
  </itemTemplate>
  <itemTemplate id="person" dropDirective="true">
    <propertyValueSelector namespace="http://example.com/peopleModel" localName="department">
      <equal>marketing</equal>
    </propertyValueSelector>
    <recordTypeSelector namespace="http://example.com/peopleModel" localName="person"/>
  </itemTemplate>
  <relationshipTemplate id="administers" dropDirective="true">
    <recordTypeSelector namespace="http://example.com/computerModel" localName="owns"/>
    <source ref="person"/>
    <target ref="machine"/>
  </relationshipTemplate>
</query>

Note: like all query language, the actual query depends of course on the underlying model. In this example, I assumed that the OS version is represented as a property of the machine. More commonly, the OS will be a node of its own that has a relationship with the machine. So you’d have another circle for the OS. With a property constraint on that circle (version=”WindowsXP”) and a line representing a “runs” relationship between the machine circle and the OS circle. Similarly, “marketing” could be a node of its own that people have a relationship with, rather than just a property of each person. None of this changes the logic behind the Query operation.

Now, this is nice for the user of the query, but what about the poor developer who gets the 50-pages spec thrown on his/her desk and has 2 weeks to make sense of it and implement the server side of the query? I’ve said above that the program “tries to superimpose the requested graph everywhere on the graph of available data and selects all the instances where the requested graph fits” but that’s a lot easier to write as a sentence than to implement. So here is a pseudo-algorithm to help.

1) for each itemTemplate calculate the set of all items that obey all the selectors in the template. Call this the working set for the itemTemplate.
2) for each relationshipTemplate calculate the set of all relationships that obey all the selectors in the template. Call this the working set for the relationshipTemplate.
3) set need_to_loop = true
4) while (need_to_loop == true)

4.1) set need_to_loop = false
4.2) for each relationshipTemplate RT

4.2.1) let ITsource be the itemTemplate that is referenced as source by RT
4.2.2) let ITtarget be the itemTemplate that is referenced as target by RT
4.2.3) for each relationship R in the working set for RT

4.2.3.1) if the source of R is not in the working set for ITsource, then remove R from the RT working set
4.2.3.2) if the target of R is not in the working set for ITtarget, then remove R from the RT working set
4.2.3.3) if RT has a source/@minimum or a source/@maximum attribute

4.2.3.3.1) find the list L1 of all relationships in the working set for RT that have the same source as R
4.2.3.3.2) if RT has source/@minimum and the cardinality of L1 is less than this minimum then remove all relationships in L1 from the RT working set
4.2.3.3.3) if RT has source/@maximum and the cardinality of L1 is more than this maximum then remove all relationships in L1 from the RT working set

4.2.3.4) if RT has a target/@minimum or a target/@maximum attribute

4.2.3.4.1) find the list L2 of all relationships in the working set for RT that have the same target as R
4.2.3.4.2) if RT has target/@minimum and the cardinality of L2 is less than this minimum then remove all relationships in L2 from the RT working set
4.2.3.4.3) if RT has target/@maximum and the cardinality of L2 is more than this maximum then remove all relationships in L2 from the RT working set

4.3) for each itemTemplate IT

4.3.1) let sourceRTset be the set of all relationshipTemplates that references IT as its source
4.3.2) let targetRTset be the set of all relationshipTemplates that references IT as its target
4.3.3) for each item I in the IT working set

4.3.3.1) for each relationshipTemplate sourceRT in sourceRTset, if there is no relationship in the working set for sourceRT that uses I as its source, remove I from the IT working set and set need_to_loop to true
4.3.3.2) for each relationshipTemplate targetRT in targetRTset, if there is no relationship in the working set for targetRT that uses I as its source, remove I from the IT working set and set need_to_loop to true

5) process all directives (dropDirective or propertySubsetDirective) to remove or pair down items and relationships as requested
6) package the resulting items and relationships in a way that conforms to the CMDBF response message format (including putting each item in the <nodes> element with the appropriate @templateId attribute and putting each relationship in the <edges> element with the appropriate @templateId).

There are all kinds of optimizations possible here (like breaking out of loops once an instance has been removed, or not recalculating L1 and L2 over and over again for relationships in the same working set that share a source/target), but this is the most basic form or the algorithm. The goal is to illuminate the spec, not to provide an optimal implementation strategy.

In this post, I have focused on describing and illustrating the topological aspects of the query language. The other concepts that come into play are the Selector and the Directive mechanisms. But these are a lot more familiar to people used to SQL and I think they are sufficiently explained in the spec. So I have assumed here (in steps 1, 2 and 5 of the pseudo-algorithm) that they are well understood.

4 Comments

Filed under CMDB Federation, CMDBf, Everything, Graph query, Pseudo-algorithm, Query, Specs, Standards, Tutorial