Order-Independent Tree Comparison Using X-Diff

I am interested in the testing of regressions of a program that I am modifying. What is unique about this regression test is that I want to check for differences after modifications of the internal data structures using XML or JSON. As an example, suppose I am working on a C# compiler. The compiler works by parsing C# programs, then constructs an abstract syntax tree. Let’s assume that I’ve written some simple C# programs that are valid, and the compiler before modifications is correct. I now modify the compiler to use a new parser, and now want to check for regressions in the data structures after the parser. To test for regressions, I compare the XML or JSON representations of the abstract syntax tree of a program, before and after modifications to the compiler.

X-Diff

What is it?

X-Diff is a program that determines the set of changes between two XML documents. X-Diff assumes an unordered tree comparison model, whereby order of elements is not important (i.e., the trees are isomorphic). The command-line program takes two input file names containing XML (or JSON, in the C# port/extension).

Who wrote it?

http://pages.cs.wisc.edu/~yuanwang/xdiff.html

Wang, Yuan, David J. DeWitt, and Jin-Yi Cai. “X-Diff: An effective change detection algorithm for XML documents.” Data Engineering, 2003. Proceedings. 19th International Conference on. IEEE, 2003. DOI: 10.1109/ICDE.2003.1260818.

C++ and Java source.

How does it work?

Note: It is difficult to relate the code to the algorithms described in the paper, with many special case code percolated throughout. Below is an overview of what the code does.

  1. Input two XML files with XParser.parse(). The code intermingles parsing with attribute analysis, so it can be somewhat difficult to understand. In Java, the Xerces XML parser is used (http://xerces.apache.org/). In the C# port, the Microsoft “LINQ to XML” library is used instead. “LINQ to XML” does not provide a SAX-like interface for parsing, so I’ve added a pre-order tree walker (XParser.preorder) to XParser as a drop-in substitute. During parsing, a hash value is computed for each node in the XML document. The value is recursively defined: hash(N) = DES(N) + \sum^c[hash(c)]^2\text{, where c = child of N}. Notes: (a) the hash value for node N is order independent because addition is commutative. (b) I don’t understand why the authors square the hash value–except when a node has a single child. I don’t know why. (c) The Data Encryption Standard (DES) is used to obtain the hash value of tags, attributes, and text: DES(N). It isn’t clear why DES is used as opposed to Cyclic Redundency Check (CRC).
  2. In a top-down manner, each node in one tree is matched to a node is the second tree. The result of is a list of pairs that are matched.
    1. xdiff(node1, node2)–node1 and node2 are two nodes which are to be matched. The result is a matching, defined by table matched.
      1. If the hash value of child c1 of node1 and c2 of node2 are equal, they are removed from further comparison, and placed in matched += (c1, c2).
      2. If there is no match for c1, c2, then c1 and c2 are added to unmatched1, unmatched2, respectively.
      3. The sets unmatched1 and unmatched2 are now examined: Find an optimal matching of nodes in unmatched, matchList(unmatched1, unmatched2). Note: matchList is a recursive function defined below.
    2. matchList(unmatched1, unmatched2)–unmatched lists are of unmatched nodes in the two trees.
      1. For all n1 in unmatched1
        1. For all n2 in unmatched2
          1. Compute the optimal “distance” d = distance(n1, n2). Note: distance() is a recursive function that finds the optimal number of changes to convert node n1 to node n2. Distance should also compute the cost of creating a tree (i.e., distance(null, n2)) or deleting a tree (i.e., distance(n1, null)).
      2. Find the optimal bipartite complete matching by picking the list of weighted pairs (n1, n2) such that n1 is in unmatched1 (or null), n2 is in unmatched2 (or null): searchNCC().
      3. For each pair in optimal match, xdiff(p1, p2).

Port to C#

I have ported the code to C#; it is available in a git repository (https://bitbucket.org/ken_domino/xdiff).