Level: Intermediate
James R. Fuller (jim.fuller@webcomposite.com), Technical Director, FlameDigital Limited & Webcomposite s.r.o.
24 Jun 2008
Since October 2005, the W3C XML Processing Model Working Group (WG) has collaborated on a Working Draft (WD) specification titled "XProc: An XML Pipeline Language." With early implementations starting to appear and a second Last Call by the W3C WG anticipated (paving the way to a W3C draft recommendation), it has become clear that the XProc specification effort has picked up pace over the past 12 months. Discover what XProc is today and where it is heading, get the back story on some of the more contentious issues, and even run through a few examples.
XProc is a markup language that describes processing pipelines composed of discrete
steps that apply operations on XML documents. If a specification's importance is
related to the quality of individuals working on it, then XProc is significant, indeed.
The W3C XML Processing Model WG is packed with pragmatic XML practitioners and
superstars as well as grizzled veterans of past XML-related efforts: Erik Bruchez,
Andrew Fang, Paul Grosso, Rui Lopes, Murray Maloney, Alex Milowski, Michael
Sperberg-McQueen, Jeni Tennison, Henry Thompson, Richard Tobin, Alessandro
Vernet, Norman Walsh (Chair), and Mohamed Zergaoui, to name a few.
Frequently used acronyms
- DTD: Document Type Definition
- HTTP: Hypertext Transfer Protocol
- W3C: World Wide Web Consortium
- XML: Extensible Markup Language
- XSL: Extensible Stylesheet Language
- XSLT: Extensible Stylesheet Language Transformations
XProc is not the first W3C attempt to establish an XML processing pipelines standard. In 2002, as part of the XML Processing Model Workshop, there was the "XML Pipeline Definition Language Submission," submitted by Sun
Microsystems, Alis Technologies, Arbortext, Cisco Systems, Fujitsu, Markup Technology,
and Oracle. This submission was published on 28 February 2002 as "XML Pipeline
Definition Language Version 1.0."
In 2004, a W3C Note attempted to set out requirements for an XML processing model: "XML
Processing Model Requirements," W3C Working Group Note 05 April 2004. In 2005,
another W3C member submission was proposed: "XML Pipeline Language (XPL) Version
1.0" (draft), submitted by Orbeon, Inc., on 11 March and published on 11 April.
Use cases
XProc’s goal is to promote an interoperable and standard approach to the
processing of XML documents. The requirements behind this goal were formally set out in a group
of use cases (see Resources), some of which I list below:
- Apply a sequence of operations to XML documents.
- Parse XML, validate it against a schema, and then apply an XSLT
transformation.
- Combine multiple XML documents (document aggregation).
- Interact with Web services.
- Use metadata retrieval.
I haven't seen any specific studies citing the need for XProc, so I here proffer a few of my
own unabashedly biased opinions:
- XProc’s declarative format, combined with the simplicity of thinking in terms of
pipelines, will mean that non-technical people can be involved in writing
and maintaining processing workflows.
- XProc, in many configurations, is amenable to streaming, whereas other
approaches to control XML processes are not (for example, XSLT).
- XProc steps focus on performing specific operations, which over time
should experience greater optimization (in an XProc processor used by many)
versus one-off code that you or I write (used by few).
- XProc's standard step library and extensibility mechanisms position
XProc to be an all-encompassing solution.
- Structured data (such as XProc markup) is typically easier to reuse than structured
code.
- One of XProc's inspirations is UNIX® pipelines, which hopefully all can agree is a good thing!
Not surprisingly, XProc will probably gain considerable favor amongst those groups who
work with and generate XML documents. You can also imagine that people with business
workflows and XML documents flowing through them might be excited by the possibility
of modeling their workflows with XProc pipelines, and then running them on their XML
documents.
The XProc vocabulary
XProc comprises a small vocabulary divided into three categories:
core elements, ancillary elements, and a standard step library. The core elements
provide modern computing language constructs, such as conditional and iterative
processing and try/catch error mechanisms:
- <p:for-each>: Iterative processing statement
- <p:choose>: Case logic statement (similar to XSLT <xsl:choose>)
- <p:group>: Groups a series of steps into a named sub-pipeline
- <p:try>: Provides a try/catch mechanism to handle dynamic errors
- <p:viewport>: Applies a sub-pipeline process to subtrees contained in a single XML document
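As a rough illustration of how two of these core elements nest, the sketch below runs a sub-pipeline over each matching subtree and guards it with a try/catch. The element name section is a hypothetical choice for the example, and the syntax follows the style of the listings later in this article; a processor that implements a later version of the specification may require additional attributes (such as a version attribute on the top-level element).

```xml
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc">
  <!-- Run the sub-pipeline once for each <section> subtree (hypothetical element name) -->
  <p:viewport match="section">
    <p:try>
      <p:group>
        <!-- May raise a dynamic error, for example if an included resource is missing -->
        <p:xinclude/>
      </p:group>
      <p:catch>
        <!-- On failure, pass the subtree through unchanged -->
        <p:identity/>
      </p:catch>
    </p:try>
  </p:viewport>
</p:pipeline>
```

The intent is that a failure inside one subtree does not abort the whole pipeline: the catch branch simply copies that subtree as it was.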
The elements used in the declaration and definition of steps provide the basis for XProc extensibility and reusability:
- <p:library>: Contains step declarations to provide reusable step libraries
- <p:declare-step>: Defines a step and its functional signature, typically in a <p:library> element
- <p:import>: Brings in through a Uniform Resource Identifier (URI) any declared pipelines or library to the current pipeline
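A minimal sketch of a step library may make these three elements more concrete. The namespace URI, the step name ex:strip-comments, and the file name steps.xpl are all assumptions made up for this example.

```xml
<!-- steps.xpl: a hypothetical step library -->
<p:library xmlns:p="http://www.w3.org/ns/xproc"
           xmlns:ex="http://example.org/ns/steps">
  <!-- Declares the signature (one input, one output) and the body of a reusable step -->
  <p:declare-step type="ex:strip-comments">
    <p:input port="source"/>
    <p:output port="result"/>
    <!-- The body: delete every comment node from the source document -->
    <p:delete match="comment()"/>
  </p:declare-step>
</p:library>
```

Under these assumptions, another pipeline could bring the library in with <p:import href="steps.xpl"/> and then invoke the step as <ex:strip-comments/>, provided it declares the same ex namespace.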
XProc ancillary elements are mainly children nodes of XProc steps and handle
tasks such as step bindings, making it easy to configure a step. These elements
consist of:
- Inputs and outputs: These elements define ports that can bind to the inputs or outputs of other steps and define the flow of XML documents. In addition, you can define XML documents inline (directly in the XProc document) or bring in documents through an external URI.
- Options: Options are the primary mechanism for configuring steps, set with the <p:with-option> element or as a name-value attribute on the step instance. Note that options are part of the functional signature of a step, and their names are invariant.
- Variables: Variables are used with compound steps and define XPath variables for use within a compound step sub-pipeline.
- Parameters: Unlike options and variables, parameters have names that are computed at run time and are not related to any functional signature, as defined by <p:declare-step>.
Perhaps the most significant aspect of XProc is the 30-40 steps defined in a standard
XProc library, which are split into a set of required and optional steps.
The real power of XProc is embodied in its standard library of required and optional steps, which
perform a wide variety of tasks, such as:
- XSLT, XQuery, XInclude processing
- Schema validation (DTD, RelaxNG, Schematron, XML schema)
- XML update operations, such as inserting or deleting XML elements and attributes
- XML storage and retrieval
- Wrap, unwrap, escape, and unescape XML
- HTTP requests
- Execute native commands
Here is a brief overview of each step contained in the XProc standard library:
- Required steps:
  - <p:add-attribute>: Add an attribute to a set of matching elements.
  - <p:add-xml-base>: Add or correct xml:base attributes on elements.
  - <p:compare>: Compare two documents for equivalence.
  - <p:count>: Count the number of documents in source input.
  - <p:delete>: Delete items specified by a match pattern from the source input.
  - <p:directory-list>: Enumerate the directory listing into the result output.
  - <p:error>: Generate a dynamic error.
  - <p:escape-markup>: Escape source input.
  - <p:http-request>: Interact with resources identified by Internationalized Resource Identifiers (IRIs) over HTTP.
  - <p:identity>: Make an exact copy of an input source to the result output.
  - <p:insert>: Insert an XML selection into the source input.
  - <p:label-elements>: Create a label for each matched element, and store the value of the label in an attribute.
  - <p:load>: Load an XML resource that an IRI specifies and provide it as result output.
  - <p:make-absolute-uris>: Make the value of an element or attribute in the source input an absolute IRI value in the result output.
  - <p:namespace-rename>: Rename the namespace declarations.
  - <p:pack>: Merge two document sequences.
  - <p:parameters>: Make available a set of parameters as a c:param-set XML document in the result output.
  - <p:rename>: Rename elements, attributes, or processing instructions.
  - <p:replace>: Replace matching elements.
  - <p:set-attributes>: Set attributes on matching elements.
  - <p:sink>: Accept source input and generate no result output.
  - <p:split-sequence>: Divide a single sequence into two.
  - <p:store>: Store a serialized version of its source input to a URI.
  - <p:string-replace>: Perform string replacement on the source input.
  - <p:unescape-markup>: Unescape the source input.
  - <p:unwrap>: Replace matched elements with their children.
  - <p:wrap>: Wrap matching nodes in the source document with a new parent element.
  - <p:wrap-sequence>: Produce a new sequence of documents.
  - <p:xinclude>: Apply XInclude processing to the input source.
  - <p:xslt>: Apply an XSLT version 1.0 or XSLT version 2.0 style sheet to the input source.
- Optional steps:
  - <p:exec>: Apply an external command to the input source.
  - <p:hash>: Generate a message digest or a digital fingerprint for some value.
  - <p:uuid>: Generate a Universally Unique Identifier (UUID).
  - <p:validate-with-relax-ng>: Validate the input XML with RelaxNG schema.
  - <p:validate-with-schematron>: Validate the input XML with Schematron schema.
  - <p:validate-with-xml-schema>: Validate the input XML with XML schema.
  - <p:www-form-urldecode>: Decode the x-www-form-urlencoded string into a set of XProc parameters.
  - <p:www-form-urlencode>: Encode a set of XProc parameter values as an x-www-form-urlencoded string.
  - <p:xquery>: Apply an XQuery version 1.0 query.
  - <p:xsl-formatter>: Render an XSL version 1.1 document (as in XSL-FO).
It is easy to create new steps from existing pipelines. If you want, you can even
create third-party libraries with extension steps that augment the XProc
processor itself.
Note: As the specification process is ongoing, the standard library is one
area that continues to experience a bit of volatility. I suggest referring to
up-to-date definitions in the current WD (see Resources)
for specific details.
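To give a feel for how library steps combine, here is a small sketch that chains four of the required steps. The element name note and the attribute name processed are hypothetical, and because the step library is still changing, the option names shown (match, attribute-name, attribute-value) are an assumption that should be checked against the WD you are working with.

```xml
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc">
  <!-- 1. Resolve any XInclude references in the source document -->
  <p:xinclude/>
  <!-- 2. Remove every comment node -->
  <p:delete match="comment()"/>
  <!-- 3. Replace each <note> element (hypothetical name) with its children -->
  <p:unwrap match="note"/>
  <!-- 4. Mark the root element as processed -->
  <p:add-attribute match="/*" attribute-name="processed" attribute-value="true"/>
</p:pipeline>
```

No ports are bound explicitly here; each step reads the primary output of the step before it, and the last step feeds the result of the pipeline.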
Example pipelines
Listing 1 illustrates an XProc pipeline with a single step that
applies an XSLT operation to an XML document.
Listing 1. Simple implicit pipeline
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc" name="xslt-example">
  <!-- The source document arrives on the pipeline's implicit source port -->
  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="mystylesheet.xslt"/>
    </p:input>
  </p:xslt>
</p:pipeline>
XProc pipelines accept zero or more XML documents as their input and produce zero
or more XML documents as output. The XProc code in Listing 1 consists of a
<p:pipeline> top-level element, a
<p:xslt> step, and not much else. An
XML document that comes into the standard input of the XProc processor is
handed off to the <p:xslt> step, which then applies
an XSLT process using mystylesheet.xslt (where mystylesheet is
defined by the <p:input>/<p:document>
element) on the XML document.
Because the pipeline has only a single step, that step's results are placed onto the result port for
the entire pipeline, which (incidentally) typically outputs the XML document to
standard output. Figure 1 shows this process, outlining how
the XML document flows from the source port to the result port.
Figure 1. Logic flow for a simple pipeline
These connections between ports are known as step bindings, and they control
the flow of XML document processing. These bindings can be implicitly or explicitly
defined. In the Listing 1 example, bindings were implicit, with process flow dictated by XProc's natural defaulting mechanisms.
Listing 2 shows a functionally equivalent pipeline with explicit
step bindings.
Listing 2. Simple explicit pipeline
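<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" name="xslt-example">
  <!-- Ports that <p:pipeline> would otherwise declare implicitly -->
  <p:input port="my-source" primary="true" sequence="false"/>
  <!-- Parameter input port; the parameters port of <p:xslt> binds to it by default -->
  <p:input port="parameters" kind="parameter"/>
  <p:output port="my-result" primary="true" sequence="false">
    <p:pipe step="step1" port="result"/>
  </p:output>
  <p:xslt name="step1">
    <!-- Explicit binding: read the document from the top-level input port -->
    <p:input port="source">
      <p:pipe step="xslt-example" port="my-source"/>
    </p:input>
    <p:input port="stylesheet">
      <p:document href="mystylesheet.xslt"/>
    </p:input>
  </p:xslt>
</p:declare-step>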
In Listing 1, I used <p:pipeline>,
which implicitly declared a source input and result output port. Using
<p:declare-step> now means that I have to
explicitly define these ports as well as declare step bindings between sequential
sibling steps. These bindings and ports are summarized below:
- Top-level my-source input port will receive any standard input.
- Top-level my-result output port will receive the results of the step1
result port and place them on the standard output.
- The step1 source input is bound to the my-source input port.
It's difficult to illustrate step bindings between steps using a pipeline with a single step; so,
I created a nontrivial example in which I show several XProc steps and logic
structures. Listing 3 presents a more representative XProc
example containing multiple steps along with some conditional logic steps.
Listing 3. Complex pipeline
<p:pipeline name="mypipeline" type="myexample" xmlns:p="http://www.w3.org/ns/xproc">
  <!-- step1: expand XInclude references -->
  <p:xinclude name="step1"/>
  <!-- step2: choose a schema based on the version attribute of the root element -->
  <p:choose name="step2">
    <p:when test="/*[@version &lt; 2.0]">
      <!-- subpipeline -->
      <p:validate-with-xml-schema name="step2a1">
        <p:input port="schema">
          <p:document href="newer-schema.xsd"/>
        </p:input>
      </p:validate-with-xml-schema>
    </p:when>
    <p:otherwise>
      <!-- subpipeline -->
      <p:validate-with-xml-schema name="step2b1">
        <p:input port="schema">
          <p:document href="older-schema.xsd"/>
        </p:input>
      </p:validate-with-xml-schema>
    </p:otherwise>
  </p:choose>
  <!-- step3: process each div element in turn -->
  <p:for-each name="step3">
    <p:iteration-source select="//div"/>
    <!-- subpipeline; p:option with a value attribute is the syntax of earlier drafts (see "Current status") -->
    <p:string-replace name="step3a1">
      <p:option name="match" value="//span[@class='css1']"/>
      <p:option name="replace" value=""/>
    </p:string-replace>
  </p:for-each>
  <!-- step4: wrap the resulting sequence in a single document element -->
  <p:wrap-sequence name="step4">
    <p:option name="wrapper" value="document"/>
  </p:wrap-sequence>
</p:pipeline>
This pipeline roughly translates to the following:
- Read XML documents from standard input.
- Apply XInclude processing (step1).
- Choose (step2) between using a newer (step2a1) or older (step2b1) schema and validate.
- Extract each (step3) HTML <div> element, applying a string replace operation (step3a1).
- Wrap up (step4) the final sequence of <div> elements with a <document> element.
- Output the XML documents to standard output.
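Listing 3 relies entirely on implicit bindings between sibling steps. For comparison, the following sketch binds two sibling steps explicitly with <p:pipe>, in the manner of Listing 2. The step names main, expand, and clean are hypothetical, and the two steps were chosen only because they need no stylesheet or schema.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" name="main">
  <p:input port="source"/>
  <!-- The pipeline result is whatever the last step (clean) produces -->
  <p:output port="result">
    <p:pipe step="clean" port="result"/>
  </p:output>
  <p:xinclude name="expand">
    <!-- Read from the top-level source port -->
    <p:input port="source">
      <p:pipe step="main" port="source"/>
    </p:input>
  </p:xinclude>
  <p:delete name="clean" match="comment()">
    <!-- Read from the result port of the preceding sibling step -->
    <p:input port="source">
      <p:pipe step="expand" port="result"/>
    </p:input>
  </p:delete>
</p:declare-step>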
Steps can have other input or output ports defined that work with non-XML
documents, but only XML documents (as in XML infoset) can flow
between primary input and output ports.
I used all three kinds of XProc steps in this
example. Step types are represented as rectangles in the workflow diagram that
Figure 2 shows.
Figure 2. Logic flow for a complex pipeline
The p:pipeline compound step
The largest rectangle represents the whole
<p:pipeline>, which can itself be invoked
as a step. Because it contains a sub-pipeline, it is a compound step.
The p:choose multi-container step
The <p:choose> step has two sub-pipelines,
which makes it a multi-container step. It chooses which sub-pipeline to follow
based on the evaluation of a <p:when> XPath
expression.
The p:for-each compound step
The <p:for-each> step contains a single sub-pipeline consisting of one step, so it's a compound step.
Atomic steps
Most XProc steps are atomic steps. These steps apply a specific operation
to an XML document, examples of which are
<p:xinclude>, <p:validate->,
<p:string-replace>, and
<p:wrap>.
Considerations in the development of XProc
Throughout the XProc specification process, the WG had to navigate several issues:
Current status
As with any W3C specification, XProc underwent a lot of changes in terms of syntax
and semantics (see Resources). The latest WD, dated
1 May 2008, is evolutionary as it irons out the details from
the churn of the previous draft's decision to allow both version 1 and version 2 of
XPath and XSLT.
The current WD is also revolutionary in that the spec was completely rewritten to
disentangle the various notions that the <p:option>
element had come to carry. For example, <p:option> is now
used only in the functional signature of a step (only in
<p:declare-step>), with
<p:with-option> used in the step instance itself,
when setting an option's value. In addition, the <p:variable/>
element was added to hold computed values, especially for use with compound
steps.
The most interesting change with options, parameters, and variables is the removal
of the value attribute; values are now the result of evaluating an XPath expression
defined by a select attribute. Listing 4 illustrates how this
works, using p:with-option as an example.
Listing 4. Example of p:with-option use with an XPath expression
<ex:someStep>
  <p:with-option name="some-option-name" select="'some value'"/>
</ex:someStep>
Defining simple values for options with an XPath expression becomes tedious (for example, because of the need to nest double (") and single (') quotation marks). The syntactic shorthand
(see Listing 5) was therefore retained as the preferred method, whereby options can be defined
with name-value attributes on the step element itself.
Listing 5. Setting an option with syntactic shorthand
<ex:stepType option-name="some value"/>
All these changes make the specification much clearer, but at the expense of a larger XProc vocabulary
(<p:variable>, <p:with-option>,
<p:with-param>, and so on).
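To see the enlarged vocabulary side by side, the sketch below uses a variable, a parameter, and an option in one pipeline. It is an illustration under assumptions rather than a definitive rendering of the current WD: the variable name doc-version, the parameter name source-version, and the attribute name checked are hypothetical, and mystylesheet.xslt is borrowed from Listing 1.

```xml
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc">
  <p:group>
    <!-- Variable: a value computed from the document flowing into the group -->
    <p:variable name="doc-version" select="string(/*/@version)"/>
    <p:xslt>
      <p:input port="stylesheet">
        <p:document href="mystylesheet.xslt"/>
      </p:input>
      <!-- Parameter: its name (hypothetical here) is not part of the step's signature -->
      <p:with-param name="source-version" select="$doc-version"/>
    </p:xslt>
  </p:group>
  <!-- Options: two set with the attribute shorthand, one with an XPath expression -->
  <p:add-attribute match="/*" attribute-name="checked">
    <p:with-option name="attribute-value" select="'yes'"/>
  </p:add-attribute>
</p:pipeline>
```

The quoted 'yes' inside the select attribute shows the nested quotation marks that make the shorthand attractive for simple values.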
One last significant change to mention is that you now must use
<p:declare-step/> consistently to declare new
steps. This change adds a slight cognitive load to users who now have to think
about an instance of a step versus its declaration (in a library or pipeline).
Overloading too many concepts onto a single step element
in XProc might have proved constraining. I think that the
WG's decision to split up these concepts now was a pragmatic one.
Summary
It's important for XML technologists to remind themselves that some families and
phyla of developers do not work with XML. When someone from
these groups asks, "Why do I need XProc?" my first response is usually that XProc is designed to be
platform neutral, meaning that XProc can run everywhere a compliant XProc
processor can run. However, if you already work with XML documents and
technologies, XProc is probably something you have emulated with other
approaches (XSLT, Apache Ant, Apache Cocoon site maps, Jelly, and so on), and you
will be happy to see the arrival of XProc processors.
Note: No implementations of XProc are production-ready, but several are in
development. See Resources for more information.
With XML seeping into every aspect and tier of computing, a single, easy-to-understand
processing approach like XProc to orchestrate one's expanding XML ecosystem
might prove to be a disruptive technology. The XProc standard library, combined with the
extensibility of writing your own third-party step libraries, provides a powerful
facade over existing and future XML processors. Thus, rather than fit your workflows around the vagaries of any specific technology, you can now
honestly define your processing as a series of operations on XML documents.
XProc is expected to go into a second Last Call sometime in the coming months, which
seems to indicate that you might see a W3C recommendation before the end
of 2008.
Resources
Learn
Get products and technologies
- List of XProc implementations: See XProc implementations currently undergoing development.
- Smallx: Explore Smallx. It is, according to its developers, "a library and set of tools that is being developed to process XML infosets. It has two distinct features in that the infoset implementation allows streaming of documents and that processing of infosets can be accomplished using a concept called pipelines."
- Sxpipe: Try Simple XML Pipelines (sxpipe) to build a simple processing model for XML documents and choose the order in which components are evaluated.
- Apache Ant: Get more information about and download this Java™-based build tool.
- Apache Cocoon: Download and get more information on this Web development framework built around the concepts of separation of concerns.
- Jelly: Get the tool to turn your XML code into executable files.
- IBM trial software for product evaluation: Build your next project with trial software available for download directly from developerWorks, including application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
About the author
Jim Fuller has been a professional developer for 15 years, working with several
blue-chip software companies in both his native USA and the UK. He has co-written a few
technology-related books and regularly speaks and writes articles focusing on XML
technologies. He is a founding committee member for XML Prague and was in the gang responsible for EXSLT. He spends his free time playing with XML databases and XQuery. Jim is technical
director for a few companies (FlameDigital, Webcomposite s.r.o.) and can be reached at jim.fuller@webcomposite.com.