Canonical XML

This document defines a subset of XML called canonical XML. The intended use of canonical XML is in testing XML processors, as a representation of the result of parsing an XML document.

Every well-formed XML document has a unique structurally equivalent canonical XML document. Any two structurally equivalent XML documents have the same canonical XML document, byte for byte. Canonicalizing an XML document requires only information that an XML processor is required to make available to an application.

A canonical XML document conforms to the following grammar:

CanonXML    ::= Pi* element Pi*
element     ::= Stag (Datachar | Pi | element)* Etag
Stag        ::= '<'  Name Atts '>'
Etag        ::= '</' Name '>'
Pi          ::= '<?' Name ' ' (((Char - S) Char*)? - (Char* '?>' Char*)) '?>'
Atts        ::= (' ' Name '=' '"' Datachar* '"')*
Datachar    ::= '&amp;' | '&lt;' | '&gt;' | '&quot;'
                 | '&#9;' | '&#10;' | '&#13;'
                 | (Char - ('&' | '<' | '>' | '"' | #x9 | #xA | #xD))
Name        ::= (see XML spec)
Char        ::= (see XML spec)
S           ::= (see XML spec)
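
As an illustration of the Datachar production, the following is a minimal Python sketch of the escaping it implies. The names ESCAPES and escape_datachars are hypothetical and are not defined by this document. The sketch assumes the input string already contains only characters matching Char.

```python
# Hypothetical helper illustrating the Datachar production.
# Assumes every character of the input already matches Char in the XML spec.
ESCAPES = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "\t": "&#9;",   # #x9
    "\n": "&#10;",  # #xA
    "\r": "&#13;",  # #xD
}


def escape_datachars(text: str) -> str:
    """Return text rewritten as a sequence of Datachar tokens."""
    # Each character is handled independently, so an '&' introduced by one
    # replacement is never escaped a second time.
    return "".join(ESCAPES.get(ch, ch) for ch in text)
```

The same escaping applies to character data and to attribute values, since both productions are built from Datachar.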

Within a start-tag, attributes appear in lexicographical order of their names (in Unicode bit order).
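
The ordering rule can be sketched in Python as follows. This sketch assumes that "Unicode bit order" can be taken as ordering by code point, which is how Python compares strings. The function name canonical_atts is hypothetical, and the attribute values are assumed to be escaped already as Datachar sequences.

```python
# Hypothetical helper illustrating the Atts production with sorted attributes.
# Assumption: "Unicode bit order" is read as code point order, which is how
# Python's built-in string comparison behaves.
def canonical_atts(attrs: dict) -> str:
    """Serialize attributes sorted by name; values must already be escaped."""
    return "".join(
        ' %s="%s"' % (name, attrs[name])
        for name in sorted(attrs)
    )
```

For example, an attribute named Z sorts before one named a, because upper-case Latin letters have lower code points than lower-case ones.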

A canonical XML document is encoded in UTF-8.

Ignorable white space is considered significant and is treated equivalently to data.
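
The rules above can be combined into a rough sketch of a canonicalizer built on Python's standard xml.sax module. The class name CanonicalWriter and the function canonicalize are hypothetical. The sketch is not a conformance tool: it reports only what the SAX ContentHandler interface exposes, and it again assumes code point order for attribute names.

```python
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import escape

# Extra replacements beyond the '&', '<' and '>' that escape() always handles.
EXTRA = {'"': "&quot;", "\t": "&#9;", "\n": "&#10;", "\r": "&#13;"}


class CanonicalWriter(ContentHandler):
    """Hypothetical handler that collects canonical XML as string fragments."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def startElement(self, name, attrs):
        self.parts.append("<" + name)
        # Assumption: code point order stands in for "Unicode bit order".
        for key in sorted(attrs.getNames()):
            value = escape(attrs.getValue(key), EXTRA)
            self.parts.append(' %s="%s"' % (key, value))
        self.parts.append(">")

    def endElement(self, name):
        self.parts.append("</%s>" % name)

    def characters(self, content):
        self.parts.append(escape(content, EXTRA))

    def ignorableWhitespace(self, whitespace):
        # Ignorable white space is written out exactly like data.
        self.parts.append(escape(whitespace, EXTRA))

    def processingInstruction(self, target, data):
        self.parts.append("<?%s %s?>" % (target, data))


def canonicalize(document: bytes) -> bytes:
    """Parse document and return its canonical form encoded in UTF-8."""
    handler = CanonicalWriter()
    xml.sax.parseString(document, handler)
    return "".join(handler.parts).encode("utf-8")
```

Two inputs that differ only in attribute order, quoting style, or the use of empty-element tags should then yield the same bytes, which is how the output would be compared when testing a processor.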

James Clark