Location: http://xmlsoft.org/catalog.html
Libxml home page: http://xmlsoft.org/
Mailing-list archive: http://mail.gnome.org/archives/xml/
Version: $Revision: 1.5 $
What is a catalog? Basically it's a lookup mechanism used when an entity (a file or a remote resource) references another entity. The catalog lookup is inserted between the moment the reference is recognized by the software (XML parser, stylesheet processing, or even images referenced for inclusion in a rendering) and the time where loading that resource is actually started.
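It is basically used for 3 things:
1. mapping "logical" names, the public identifiers, to more concrete names usable for download. For example, it can associate the public identifier "-//OASIS//DTD DocBook XML V4.1.2//EN" of the DocBook 4.1.2 XML DTD with the actual URL where it can be downloaded, http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd
2. remapping a given URL to another one, like an HTTP indirection saying that "http://www.oasis-open.org/committes/tr.xsl" should really be looked at "http://www.oasis-open.org/committes/entity/stylesheets/base/tr.xsl"
3. providing a local cache mechanism, so that entities associated with public identifiers or remote resources can be loaded from local copies.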
Libxml, as of 2.4.3, implements 2 kinds of catalogs: SGML catalogs and XML Catalogs.
In a normal environment libxml will by default check for the presence of a catalog in /etc/xml/catalog, and assuming it has been correctly populated, the processing is completely transparent to the document user. To take a concrete example, suppose you are authoring a DocBook document that starts with the following DOCTYPE definition:
<?xml version='1.0'?>
<!DOCTYPE book PUBLIC "-//Norman Walsh//DTD DocBk XML V3.1.4//EN"
  "http://nwalsh.com/docbook/xml/3.1.4/db3xml.dtd">
When validating the document with libxml, the catalog will be automatically consulted to look up the public identifier "-//Norman Walsh//DTD DocBk XML V3.1.4//EN" and the system identifier "http://nwalsh.com/docbook/xml/3.1.4/db3xml.dtd". If these entities have been installed on your system and the catalogs actually point to them, libxml will fetch them from the local disk.
Note: do not use this DOCTYPE in real documents. It refers to a really old version, but it is fine as an example.
Libxml will check the catalog each time it is requested to load an entity; this includes DTDs, external parsed entities, stylesheets, etc. If your system is correctly configured, the whole authoring phase and processing should use only local files, while your document stays portable because it uses the canonical public and system IDs referencing the remote document.
Here are a couple of fragments from XML Catalogs used in libxml's early regression tests in test/catalogs:
<?xml version="1.0"?>
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN"
  "http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//OASIS//DTD DocBook XML V4.1.2//EN"
    uri="http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd"/>
...
This is the beginning of a catalog for DocBook 4.1.2. XML Catalogs are
written in XML, and there is a specific namespace for catalog elements,
"urn:oasis:names:tc:entity:xmlns:xml:catalog". The first entry in this
catalog is a public mapping: it associates a Public Identifier with a URI.
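As a rough illustration of what a public mapping amounts to, the following C sketch looks up a Public Identifier in a small table and returns the associated URI. It is not libxml code: the struct public_entry type and the lookup_public() helper are hypothetical names introduced for this example, and the comparison is a plain exact string match.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One public entry: a Public Identifier and the URI it maps to. */
struct public_entry {
    const char *public_id;
    const char *uri;
};

/*
 * Hypothetical helper, not part of libxml: return the URI associated with
 * public_id, or NULL when no entry matches. The comparison is an exact
 * string match; no normalization of the identifier is attempted here.
 */
const char *lookup_public(const struct public_entry *entries, size_t count,
                          const char *public_id)
{
    assert(entries != NULL || count == 0);
    assert(public_id != NULL);

    for (size_t i = 0; i < count; i++) {
        if (strcmp(entries[i].public_id, public_id) == 0)
            return entries[i].uri;
    }
    return NULL;
}
```

With a table holding the DocBook 4.1.2 entry shown above, looking up "-//OASIS//DTD DocBook XML V4.1.2//EN" would return the docbookx.dtd URL, and any other identifier would return NULL.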
... <rewriteSystem systemIdStartString="http://www.oasis-open.org/docbook/" rewritePrefix="file:///usr/share/xml/docbook/"/> ...
A rewriteSystem is a very powerful instruction: it says that
any URI starting with a given prefix should be looked up at another URI,
constructed by replacing the prefix with a new one. In effect this acts like
a cache system for a full area of the Web. In practice it is extremely useful
with a file prefix if you have installed a copy of those resources on your
local system.
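The prefix replacement described here can be sketched in a few lines of C. This is only an illustration of the idea, not libxml's implementation, and rewrite_uri() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Illustrative only, not libxml's implementation.
 * If uri starts with start, return a newly allocated string in which that
 * prefix is replaced by prefix; otherwise return NULL.
 * The caller owns the result and must free() it.
 */
char *rewrite_uri(const char *uri, const char *start, const char *prefix)
{
    assert(uri != NULL && start != NULL && prefix != NULL);

    size_t start_len = strlen(start);
    if (strncmp(uri, start, start_len) != 0)
        return NULL;                        /* the rule does not apply */

    const char *rest = uri + start_len;     /* part after the matched prefix */
    size_t prefix_len = strlen(prefix);
    size_t rest_len = strlen(rest);

    char *out = malloc(prefix_len + rest_len + 1);
    if (out == NULL)
        return NULL;                        /* allocation failure */

    memcpy(out, prefix, prefix_len);
    memcpy(out + prefix_len, rest, rest_len + 1);   /* copies the NUL too */
    return out;
}
```

Called with the systemIdStartString and rewritePrefix values from the fragment above, the DocBook 4.1.2 DTD URL would become a file URI under /usr/share/xml/docbook/.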
...
<delegatePublic publicIdStartString="-//OASIS//DTD XML Catalog //"
  catalog="file:///usr/share/xml/docbook.xml"/>
<delegatePublic publicIdStartString="-//OASIS//ENTITIES DocBook XML"
  catalog="file:///usr/share/xml/docbook.xml"/>
<delegatePublic publicIdStartString="-//OASIS//DTD DocBook XML"
  catalog="file:///usr/share/xml/docbook.xml"/>
<delegateSystem systemIdStartString="http://www.oasis-open.org/docbook/"
  catalog="file:///usr/share/xml/docbook.xml"/>
<delegateURI uriStartString="http://www.oasis-open.org/docbook/"
  catalog="file:///usr/share/xml/docbook.xml"/>
...
Delegation is the core feature that allows building a tree of catalogs,
which is easier to maintain than a single catalog. Based on Public Identifier,
System Identifier or URI prefixes, it instructs the catalog software to look up
entries in another resource. The set of entries presented should be sufficient
to redirect the resolution of all DocBook references to the specific catalog in
/usr/share/xml/docbook.xml; this one in turn could delegate all
references for DocBook 4.1.2 to a specific catalog installed at the same time
as the DocBook resources on the local machine.
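A minimal C sketch of how delegate entries might be selected for a given identifier is shown here. It is an illustration, not libxml's code: struct delegate and select_delegates() are hypothetical names. The longest-prefix-first ordering follows the rule given in the XML Catalogs specification for matching delegate entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* One delegate entry: a start string and the catalog to consult. */
struct delegate {
    const char *start;    /* publicIdStartString, systemIdStartString or uriStartString */
    const char *catalog;  /* catalog to look into when the prefix matches */
};

/* qsort comparator: entries with longer start strings sort first. */
static int by_longest_prefix(const void *a, const void *b)
{
    size_t la = strlen(((const struct delegate *)a)->start);
    size_t lb = strlen(((const struct delegate *)b)->start);
    return (la < lb) - (la > lb);
}

/*
 * Hypothetical helper: copy into out every entry whose start string is a
 * prefix of id, ordered longest prefix first, and return how many matched.
 * out must have room for count entries. Entries with start strings of equal
 * length end up in an unspecified relative order, since qsort is not stable.
 */
size_t select_delegates(const struct delegate *entries, size_t count,
                        const char *id, struct delegate *out)
{
    assert(entries != NULL && id != NULL && out != NULL);

    size_t matched = 0;
    for (size_t i = 0; i < count; i++) {
        if (strncmp(id, entries[i].start, strlen(entries[i].start)) == 0)
            out[matched++] = entries[i];
    }
    qsort(out, matched, sizeof *out, by_longest_prefix);
    return matched;
}
```

The catalogs of the returned entries would then be consulted in order, which is what lets a top-level catalog hand whole families of identifiers over to more specific catalogs.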
The user can change the default catalog behaviour by redirecting queries
to their own set of catalogs. This can be done by setting the
XML_CATALOG_FILES environment variable to a list of catalogs; an
empty one should deactivate loading the default /etc/xml/catalog catalog.
More options are likely to be provided in the future.
Setting the XML_DEBUG_CATALOG environment variable will
make libxml output debugging information for each catalog operation, for
example:
orchis:~/XML -> xmllint --memory --noout test/ent2
warning: failed to load external entity "title.xml"
orchis:~/XML -> export XML_DEBUG_CATALOG=
orchis:~/XML -> xmllint --memory --noout test/ent2
Failed to parse catalog /etc/xml/catalog
Failed to parse catalog /etc/xml/catalog
warning: failed to load external entity "title.xml"
Catalogs cleanup
orchis:~/XML ->
The test/ent2 file references an entity; running the parser from memory makes
the base URI unavailable, so the "title.xml" entity cannot be loaded.
Setting the debug environment variable shows that an attempt is
made to load /etc/xml/catalog, but since it is not present the
resolution fails.
The most advanced way to debug XML catalog processing, however, is to use the xmlcatalog command shipped with libxml2. It can load catalogs and make resolution queries to show what is going on. It is also used for the regression tests:
orchis:~/XML -> ./xmlcatalog test/catalogs/docbook.xml "-//OASIS//DTD DocBook XML V4.1.2//EN"
http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd
orchis:~/XML ->
To debug what is going on, adding one -v flag increases the verbosity level to indicate the processing done (adding a second flag also indicates which elements are recognized at parsing):
orchis:~/XML -> ./xmlcatalog -v test/catalogs/docbook.xml "-//OASIS//DTD DocBook XML V4.1.2//EN"
Parsing catalog test/catalogs/docbook.xml's content
Found public match -//OASIS//DTD DocBook XML V4.1.2//EN
http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd
Catalogs cleanup
orchis:~/XML ->
A shell interface is also available to debug and process multiple queries (and for regression tests):
orchis:~/XML -> ./xmlcatalog -shell test/catalogs/docbook.xml "-//OASIS//DTD DocBook XML V4.1.2//EN"
> help
Commands available:
public PublicID: make a PUBLIC identifier lookup
system SystemID: make a SYSTEM identifier lookup
resolve PublicID SystemID: do a full resolver lookup
add 'type' 'orig' 'replace' : add an entry
del 'values' : remove values
dump: print the current catalog state
debug: increase the verbosity level
quiet: decrease the verbosity level
exit: quit the shell
> public "-//OASIS//DTD DocBook XML V4.1.2//EN"
http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd
> quit
orchis:~/XML ->
This should be sufficient for most debugging purposes; it was actually used heavily to debug the XML Catalog implementation itself.
Basically XML Catalogs are XML files, so you can either use XML tools to manage them or use xmlcatalog for this. The basic step is to create a catalog; the --create option provides this facility:
orchis:~/XML -> ./xmlcatalog --create tst.xml
<?xml version="1.0"?>
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN"
  "http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog"/>
orchis:~/XML ->
By default xmlcatalog does not overwrite the original catalog and writes the
result to the standard output; this can be overridden using the --noout
option. The --add option adds entries to the catalog:
orchis:~/XML -> ./xmlcatalog --noout --create --add "public" "-//OASIS//DTD DocBook XML V4.1.2//EN" http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd tst.xml
orchis:~/XML -> cat tst.xml
<?xml version="1.0"?>
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN"
  "http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//OASIS//DTD DocBook XML V4.1.2//EN"
    uri="http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd"/>
</catalog>
orchis:~/XML ->
The --add option always takes 3 parameters, even though some of
the XML Catalog constructs (like nextCatalog) have only a single
argument; in that case just pass an empty string as the third parameter, it will be ignored.
Similarly, the --del option removes matching entries from the
catalog:
orchis:~/XML -> ./xmlcatalog --del "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" tst.xml
<?xml version="1.0"?>
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN"
  "http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog"/>
orchis:~/XML ->
The catalog is now empty. Note that the matching of --del is
exact, and that it would have worked in a similar fashion with the Public ID
string.
This is rudimentary but should be sufficient to manage a catalog tree of resources that is not too complex.
First, as for every other module of libxml, there is an automatically generated API page for catalog support.
The header for the catalog interfaces should be included as:
#include <libxml/catalog.h>
The API is deliberately kept very simple. First, it is not obvious that applications really need access to it, since catalog lookup is the default behaviour of libxml. (Note: it is possible to completely override the libxml default catalog by using xmlSetExternalEntityLoader to plug in an application-specific resolver.)
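Basically libxml supports 2 catalog lists:
1. the default one, global and shared by the whole application;
2. a per-document catalog, built if the document uses the oasis-xml-catalog
PIs to specify its own catalog list. It is
associated with the parser context and destroyed when the parsing context
is destroyed.
The document one will be used first if it exists.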
xmlInitializeCatalog(), xmlLoadCatalog() and xmlLoadCatalogs() should be used at startup to initialize the catalog. If the catalog should be initialized with specific values, xmlLoadCatalog() or xmlLoadCatalogs() should be called before xmlInitializeCatalog(), which would otherwise do a default initialization first.
The xmlCatalogAddLocal() call is used by the parser to grow the document's own catalog list if needed.
The XML Catalog spec requires the possibility to select default preferences between public and system delegation; xmlCatalogSetDefaultPrefer() allows this. xmlCatalogSetDefaults() and xmlCatalogGetDefaults() control whether XML Catalogs resolution should be forbidden, allowed for the global catalog, for the document catalog, or for both; the default is to allow both.
And of course xmlCatalogSetDebug() enables the generation of debug messages (through the xmlGenericError() mechanism).
xmlCatalogResolve(), xmlCatalogResolveSystem(), xmlCatalogResolvePublic() and xmlCatalogResolveURI() are relatively self-explanatory if you have read the XML Catalog specification: they correspond to the section 7 algorithms. They should also work, with simplified semantics, if you have loaded an SGML catalog.
xmlCatalogLocalResolve() and xmlCatalogLocalResolveURI() are the same but operate on the document catalog list.
xmlCatalogCleanup() frees the global catalog; xmlCatalogFreeLocal() is the per-document equivalent.
xmlCatalogAdd() and xmlCatalogRemove() are used to dynamically modify the first catalog in the global list, and xmlCatalogDump() dumps a catalog's state. Those routines are primarily designed for xmlcatalog; I'm not sure that exposing more complex interfaces (like navigation ones) would be really useful.
xmlParseCatalogFile() is the function used to load XML Catalog files. It is similar to xmlParseFile() except that it bypasses all catalog lookups; it is provided because this functionality may be useful for client tools.
Since the catalog tree is built progressively, some care has been taken to avoid trouble in multithreaded environments, but without a test-and-set routine accessible from C this can't be fully guaranteed. The best approach is therefore to use xmlGetExternalEntityLoader to obtain the current entity loader and then set the entity loader to a routine in your own code that does the synchronization.
The XML Catalog specification is relatively recent so there isn't much literature to point at:
If you have suggestions for corrections or additions, simply contact me:
$Id: catalog.html,v 1.5 2001/09/03 16:11:47 jfleck Exp $