June 5, 2008

xmldiff patches XML files by sending just the changes

Author: Ben Martin

The GNU diff and patch utilities let you compare files to generate a patch that describes the changes between them. You can then apply the patch file on that machine or another. You might think to use diff and patch on XML files, since they are just text files -- and depending on your application, diff and patch might serve your needs well. However, details such as the order of attributes within an element tag are not significant in XML, so tools that understand the XML standard make it much easier to see the real differences and to send XML-aware "patch" files.
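To make the attribute-order point concrete, here is a minimal Python sketch using only the standard library. The two element strings are hypothetical examples, not taken from the gutenprint files; they differ in attribute order and quote style, so a line-based diff would report a change even though an XML parser sees the same attributes.

```python
import xml.etree.ElementTree as ET

# Two hypothetical serializations of the same element; only the attribute
# order and the quote style differ.
old_text = '<constraint sense="true" id="c1"><printer>p</printer></constraint>'
new_text = "<constraint id='c1' sense='true'><printer>p</printer></constraint>"

old_elem = ET.fromstring(old_text)
new_elem = ET.fromstring(new_text)

# A text-based tool such as GNU diff compares the raw characters.
text_differs = old_text != new_text

# An XML parser exposes attributes as an unordered mapping.
attrs_equal = old_elem.attrib == new_elem.attrib

print("text differs:", text_differs)
print("attributes equal:", attrs_equal)
```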

Xmldiff is a tool that can show you the differences between two XML files, taking into account changes that are purely syntactic or are not significant according to the XML specification. One of the patch formats that Xmldiff can generate is an XUpdate XML document that succinctly describes the changes between two XML files.

In this article, I'll use xmldiff to generate an XUpdate patch, and the Perl module XML::XUpdate::LibXML to apply this patch to an XML file. With these two tools you can update XML files on remote machines by sending only the smaller XUpdate files that describe the changes and then using XML::XUpdate::LibXML to apply them on the remote machine.

You can find packages for xmldiff for Fedora Core 6 and older, in Ubuntu Hardy universe, and as a 1-Click install for openSUSE 10.3. Packages for XML::XUpdate::LibXML are not available in mainstream repositories. For this article I'll build both from source using xmldiff 0.6.8 and the latest XML::XUpdate::LibXML from CPAN on a 64-bit Fedora 8 machine.

Xmldiff is written in Python and uses Python to install itself. I found that attempting to run the recommended install command failed with the SyntaxError shown below. To fix this, simply move the __future__ declaration above the __revision__ one. With this minor change, followed by executing setup.py install as root, you should have a working installed xmldiff.

$ python setup.py install
File "setup.py", line 23
from __future__ import nested_scopes
SyntaxError: from __future__ imports must occur at the beginning of the file

$ edit setup.py
from __future__ import nested_scopes
__revision__ = '$Id: setup.py,v 1.20 2005-01-12 14:21:47 syt Exp $'

$ sudo python setup.py install
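The rule behind that SyntaxError can be checked without touching setup.py. The sketch below is an illustration only: the two source strings are shortened stand-ins for the relevant lines of setup.py, and the helper name compiles is made up for this example. It uses the built-in compile() to show that a __future__ import after another statement is rejected, while the reordered version is accepted.

```python
# Shortened stand-ins for the two relevant lines of setup.py (assumption:
# the shipped file has the __revision__ assignment first, as the article says).
broken = "__revision__ = '$Id$'\nfrom __future__ import nested_scopes\n"
fixed = "from __future__ import nested_scopes\n__revision__ = '$Id$'\n"


def compiles(source):
    """Return True if the source text compiles, False on a SyntaxError."""
    try:
        compile(source, "setup.py", "exec")
        return True
    except SyntaxError as err:
        print("SyntaxError:", err.msg)
        return False


print("original order compiles:", compiles(broken))
print("reordered file compiles:", compiles(fixed))
```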

Installing XML::XUpdate::LibXML through CPAN is shown below. You might have to install the perl-CPAN package if you have not used CPAN before. When you first use CPAN, upon executing the install command shown below you will be prompted for some configuration information for your CPAN setup. Most prompts when setting up CPAN have usable defaults you can just accept.

On a Fedora system, XML::LibXML is already offered as an RPM package (perl-XML-LibXML) installable through yum. Below I install XML::LibXML from Fedora's repository and then install XML::XUpdate::LibXML using CPAN, because it is not available from Fedora. For me the CPAN install also pulled in XML-LibXML-Iterator and XML-NodeFilter in order to satisfy dependencies.

# yum install perl-CPAN
# yum install perl-XML-LibXML
# perl -MCPAN -e 'install XML::XUpdate::LibXML'
Are you ready for manual configuration? [yes]
Your ftp_proxy? http://daiin.example.com:3128
Your http_proxy? [http://daiin.example.com:3128/]
---- Unsatisfied dependencies detected during [P/PH/PHISH/XML-LibXML-Iterator-1.04.tar.gz] -----
Shall I follow them and prepend them to the queue
of modules we are processing right now? [yes]


For testing purposes I used some XML files from the gutenprint-foomatic package, starting with the 106KB gutenprint-ijs.5.0-stp_magentagamma-1.xml. I show the normal patch and XUpdate below so you can see the size of the change and reproduce the modified file with GNU patch if desired. Notice that the xmldiff command takes almost 10 seconds on this 106KB file.

$ diff -Nuar gutenprint-ijs.5.0-stp_magentagamma-1-original.xml gutenprint-ijs.5.0-stp_magentagamma-1-new.xml
--- gutenprint-ijs.5.0-stp_magentagamma-1-original.xml 2008-05-23 14:21:09.744407642 +1000
+++ gutenprint-ijs.5.0-stp_magentagamma-1-new.xml 2008-05-23 14:22:04.445410809 +1000
@@ -2566,6 +2566,11 @@
<printer>printer/Epson-Stylus_Color_1520</printer><!-- gutenprint name: escp2-1520 -->
+ <constraint>
+ <driver>my new driver</driver>
+ <printer>printer/OpenHardware</printer>
+ <arg_defval>7</arg_defval>
+ </constraint>

$ time xmldiff -x \
gutenprint-ijs.5.0-stp_magentagamma-1-original.xml \
<?xml version="1.0"?>
<xupdate:modifications version="1.0" xmlns:xupdate="http://www.xmldb.org/xupdate">
<xupdate:insert-after select="/option[1]/constraints[1]/constraint[510]" > <xupdate:element name="constraint"> <driver>
my new driver

real 0m9.489s
user 0m2.784s
sys 0m0.560s
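An XUpdate document is itself XML, so it can be inspected with any XML parser before it is applied. The following Python sketch lists the operations in an XUpdate document together with the node each one selects. The XUpdate text here is a small hand-written example in the same shape as the output above, not the actual xmldiff output, and list_operations is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

XUPDATE_NS = "http://www.xmldb.org/xupdate"

# Hand-written XUpdate document for illustration only; it is not the
# output that xmldiff produced for the gutenprint file.
xupdate_text = """<?xml version="1.0"?>
<xupdate:modifications version="1.0" xmlns:xupdate="http://www.xmldb.org/xupdate">
  <xupdate:insert-after select="/option[1]/constraints[1]/constraint[2]">
    <xupdate:element name="constraint">
      <driver>my new driver</driver>
    </xupdate:element>
  </xupdate:insert-after>
</xupdate:modifications>
"""


def list_operations(text):
    """Return (operation, select) pairs for the top-level XUpdate commands."""
    root = ET.fromstring(text)
    prefix = "{" + XUPDATE_NS + "}"
    operations = []
    for child in root:
        # ElementTree expands prefixes, so tags look like "{namespace}name".
        if child.tag.startswith(prefix):
            operations.append((child.tag[len(prefix):], child.get("select")))
    return operations


for name, select in list_operations(xupdate_text):
    print(name, "->", select)
```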

The script below, apply-xupdate.pl, is a minimal program I created using the XML::XUpdate::LibXML Perl module to apply an XUpdate document to an existing XML file. The first parameter is the old XML file that you wish to patch, and the second is the XUpdate document to apply. The patched XML document is written to standard output.


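#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
use XML::XUpdate::LibXML;

# Usage: apply-xupdate.pl original.xml changes.xupdate >patched.xml
my ($infilename, $xupdatefilename) = @ARGV;
die "usage: $0 input.xml changes.xupdate\n" unless defined $xupdatefilename;

print STDERR "XUpdate applying script version 1.0\n";
print STDERR "input:$infilename\n";
print STDERR "xupdate:$xupdatefilename\n";

# Parse the XML file to be patched.
my $parser = XML::LibXML->new();
open my $fh, '<', $infilename or die "cannot open $infilename: $!";
binmode $fh;
my $doc = $parser->parse_fh($fh);
close $fh;

# Parse the XUpdate document that describes the changes.
open $fh, '<', $xupdatefilename or die "cannot open $xupdatefilename: $!";
binmode $fh;
my $xupdatedoc = $parser->parse_fh($fh);
close $fh;

# Apply the XUpdate commands to the document in place; process() is the
# entry point shown in the XML::XUpdate::LibXML synopsis.
my $xup = XML::XUpdate::LibXML->new();
$xup->process($doc->getDocumentElement(), $xupdatedoc);

# Write the patched document to standard output.
print STDOUT $doc->toString();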

The invocation of the apply-xupdate.pl script is shown below, along with a GNU diff of the patched file against the new XML file. Notice the slight, meaningless differences in white space and attribute quoting between the XML file that was used to generate the XUpdate and the file produced by applying the XUpdate to the original XML file.

$ apply-xupdate.pl \
gutenprint-ijs.5.0-stp_magentagamma-1-original.xml \
gutenprint-ijs.5.0-stp_magentagamma-1.xupdate >newdoc.xml
$ diff -Nuar /tmp/gutenprint-ijs.5.0-stp_magentagamma-1-new.xml newdoc.xml
- <constraint sense='true'>
+ <constraint sense="true">
<printer>printer/Epson-Stylus_Color_1520</printer><!-- gutenprint name: escp2-1520 -->
- </constraint>
- <constraint>
- <driver>my new driver</driver>
- <printer>printer/OpenHardware</printer>
- <arg_defval>7</arg_defval>
- </constraint>
+ </constraint> <constraint> <driver>
+my new driver
+ </driver>
+ <printer>
+ </printer>
+ <arg_defval>
+ </arg_defval>
+ </constraint>
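One way to confirm that such differences really are only cosmetic is to compare the parsed trees instead of the text. The Python sketch below is one possible check, not part of xmldiff or XML::XUpdate::LibXML: same_ignoring_whitespace is a hypothetical helper, the two documents are cut-down stand-ins for the files above, and it assumes that leading and trailing white space in text nodes carries no meaning for your data.

```python
import xml.etree.ElementTree as ET


def same_ignoring_whitespace(a, b):
    """Compare two elements, ignoring white space around text content."""
    def norm(text):
        return (text or "").strip()

    if a.tag != b.tag or a.attrib != b.attrib:
        return False
    if norm(a.text) != norm(b.text) or norm(a.tail) != norm(b.tail):
        return False
    if len(a) != len(b):
        return False
    # Children must match pairwise and in the same order.
    return all(same_ignoring_whitespace(x, y) for x, y in zip(a, b))


# Cut-down stand-ins: single quotes and compact layout versus double
# quotes and the extra line breaks seen after applying the XUpdate.
expected = ET.fromstring(
    "<constraint sense='true'><driver>my new driver</driver></constraint>")
patched = ET.fromstring(
    '<constraint sense="true"> <driver>\nmy new driver\n  </driver>\n</constraint>')

print(same_ignoring_whitespace(expected, patched))
```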

Wrap up

Xmldiff doesn't consider XML namespaces when generating a diff. If the same namespace is used with different prefixes in two files, then the nodes will be considered different even though they are semantically identical.
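The Python sketch below illustrates why two such files are semantically identical. The documents and the namespace URI urn:example:printing are made up for the example. A namespace-aware parser such as Python's ElementTree expands each prefix to its URI, so both documents yield the same element names.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents: the namespace URI is the same, only the prefix differs.
doc_a = ET.fromstring(
    '<a:option xmlns:a="urn:example:printing"><a:driver/></a:option>')
doc_b = ET.fromstring(
    '<b:option xmlns:b="urn:example:printing"><b:driver/></b:option>')

# ElementTree stores tags as "{namespace-uri}local-name", so the prefix
# chosen in the source text no longer matters.
tags_a = [elem.tag for elem in doc_a.iter()]
tags_b = [elem.tag for elem in doc_b.iter()]

print(tags_a)
print("same element names:", tags_a == tags_b)
```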

The README file mentions that the time to generate a diff with xmldiff can become prohibitive on very large XML files. To test that, I attempted to run xmldiff on the 8.4MB file gutenprint-ijs.5.0-pagesize.xml. I killed the xmldiff process after 15 minutes on a 2.4GHz Intel Q6600 quad core machine. However, there may be cases where spending the time to generate an XUpdate for a large XML file is still desirable -- for example, when you can send a patch over an expensive or slow network to a mobile device.
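If you script the diff step, a time limit saves you from killing a long-running process by hand. The Python sketch below is one possible wrapper, not something xmldiff provides: run_with_time_limit is a hypothetical helper, and SLOW is a stand-in command that merely sleeps so the example runs without xmldiff installed.

```python
import subprocess
import sys


def run_with_time_limit(command, seconds):
    """Run command; return its stdout as text, or None if the limit is hit."""
    try:
        result = subprocess.run(command, capture_output=True, text=True,
                                timeout=seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising this exception.
        return None
    return result.stdout


# Stand-in for a slow diff: a child Python process that sleeps for 5 seconds.
SLOW = [sys.executable, "-c", "import time; time.sleep(5)"]

output = run_with_time_limit(SLOW, 0.5)
print("gave up" if output is None else output)
```

For the real tool, the command list would be the same xmldiff -x invocation used earlier with the two file names, and the limit would be whatever delay you are willing to accept before falling back to sending the whole file.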

