Monday, June 12, 2006

A Caching EntityResolver for the collection() function

XSLT 2.0 allows you to write a "standalone stylesheet", that is, a transform with no input XML. Instead, you provide the name of the template where execution should begin. The stylesheet can then use the document() or doc() functions to process individual files, and the collection() function to process directories of XML. I use the standalone XSLT model a lot when there are multiple input files and a single output file, for example when generating a report or creating an index page.

I recently ran into a problem, though: a few thousand of my several thousand input files referenced entity files hosted at w3.org. This only became apparent when the transforms started failing for no obvious reason. It turns out that the network admins where I work had turned off the proxy, so each and every entity file was being fetched directly from the W3C - where making 500 requests in 10 minutes means you get a 24-hour ban...

In a standard transform, catching and caching requests for entity files is easy - you write a custom EntityResolver and tell the XMLReader to use it. For the collection() function, however, it's a lot more complicated. In Saxon 8.7.1, the URI passed to the collection() function is resolved by the StandardCollectionURIResolver class. You can supply your own resolver by calling setCollectionURIResolver() on the Configuration, and then do what you like within that class. To create a CollectionURIResolver that caches entity files, it's pretty much a case of copying what's in StandardCollectionURIResolver and then modifying it to suit your needs.
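For the simple case, a caching EntityResolver might look something like the sketch below. This is not the code that went into Kernow, just a minimal illustration using the standard SAX interfaces. The class name CachingEntityResolver and the Fetcher hook are my own hypothetical names, and the cache is a plain in-memory map keyed on the system ID, so it assumes the entity files are small and don't change during a run.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical sketch: fetches each external entity once, then serves it from memory. */
class CachingEntityResolver implements EntityResolver {

    /** Hypothetical hook so the network fetch can be replaced (e.g. in tests). */
    interface Fetcher {
        byte[] fetch(String systemId) throws IOException;
    }

    // Assumption: entities are small enough to hold in memory for the whole run.
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();
    private final Fetcher fetcher;

    /** Default: fetch the entity from the URL given by its system ID. */
    CachingEntityResolver() {
        this(CachingEntityResolver::fetchFromUrl);
    }

    CachingEntityResolver(Fetcher fetcher) {
        this.fetcher = fetcher;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId)
            throws SAXException, IOException {
        if (systemId == null) {
            return null; // let the parser fall back to its default behaviour
        }
        byte[] bytes = cache.get(systemId);
        if (bytes == null) {
            // First request for this entity: fetch it and remember the bytes.
            bytes = fetcher.fetch(systemId);
            cache.put(systemId, bytes);
        }
        // Hand the parser a fresh stream over the cached bytes each time.
        InputSource source = new InputSource(new ByteArrayInputStream(bytes));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }

    private static byte[] fetchFromUrl(String systemId) throws IOException {
        try (InputStream in = URI.create(systemId).toURL().openStream()) {
            return in.readAllBytes(); // requires Java 9 or later
        }
    }
}
```

In a standard transform you would then pass an instance to setEntityResolver() on the XMLReader before parsing. Sharing one instance across all the input files is what keeps the number of requests down to one per distinct entity file.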

I've added this functionality to the next release of Kernow, so that standalone transforms and XQueries have the option of a caching EntityResolver. I get the impression it's early days here, and that this will become a common enough requirement that a setter for an EntityResolver will be exposed on the XMLReader used by the collection() function, making this kind of task much easier in the future.
