I am interested in the testing of regressions of a program that I am modifying. What is unique about this regression test is that I want to check for differences after modifications of the internal data structures using XML or JSON. As an example, suppose I am working on a C# compiler. The compiler works by parsing C# programs, then constructs an abstract syntax tree. Let’s assume that I’ve written some simple C# programs that are valid, and the compiler before modifications is correct. I now modify the compiler to use a new parser, and now want to check for regressions in the data structures after the parser. To test for regressions, I compare the XML or JSON representations of the abstract syntax tree of a program, before and after modifications to the compiler.
X-Diff
What is it?
X-Diff is a program that determines the set of changes between two XML documents. X-Diff assumes an unordered tree comparison model, whereby order of elements is not important (i.e., the trees are isomorphic). The command-line program takes two input file names containing XML (or JSON, in the C# port/extension).
Who wrote it?
http://pages.cs.wisc.edu/~yuanwang/xdiff.html
Wang, Yuan, David J. DeWitt, and Jin-Yi Cai. “X-Diff: An effective change detection algorithm for XML documents.” Data Engineering, 2003. Proceedings. 19th International Conference on. IEEE, 2003. DOI: 10.1109/ICDE.2003.1260818.
C++ and Java source.
How does it work?
Note: It is difficult to relate the code to the algorithms described in the paper, with many special case code percolated throughout. Below is an overview of what the code does.
- Input two XML files with XParser.parse(). The code intermingles parsing with attribute analysis, so it can be somewhat difficult to understand. In Java, the Xerces XML parser is used (http://xerces.apache.org/). In the C# port, the Microsoft “LINQ to XML” library is used instead. “LINQ to XML” does not provide a SAX-like interface for parsing, so I’ve added a pre-order tree walker (XParser.preorder) to XParser as a drop-in substitute. During parsing, a hash value is computed for each node in the XML document. The value is recursively defined:
. Notes: (a) the hash value for node N is order independent because addition is commutative. (b) I don’t understand why the authors square the hash value–except when a node has a single child. I don’t know why. (c) The Data Encryption Standard (DES) is used to obtain the hash value of tags, attributes, and text:
. It isn’t clear why DES is used as opposed to Cyclic Redundency Check (CRC).
- In a top-down manner, each node in one tree is matched to a node is the second tree. The result of is a list of pairs that are matched.
- xdiff(node1, node2)–node1 and node2 are two nodes which are to be matched. The result is a matching, defined by table matched.
- If the hash value of child c1 of node1 and c2 of node2 are equal, they are removed from further comparison, and placed in matched += (c1, c2).
- If there is no match for c1, c2, then c1 and c2 are added to unmatched1, unmatched2, respectively.
- The sets unmatched1 and unmatched2 are now examined:Â Find an optimal matching of nodes in unmatched, matchList(unmatched1, unmatched2). Note: matchList is a recursive function defined below.
- matchList(unmatched1, unmatched2)–unmatched lists are of unmatched nodes in the two trees.
- For all n1 in unmatched1
- For all n2 in unmatched2
- Compute the optimal “distance” d = distance(n1, n2). Note: distance() is a recursive function that finds the optimal number of changes to convert node n1 to node n2. Distance should also compute the cost of creating a tree (i.e., distance(null, n2)) or deleting a tree (i.e., distance(n1, null)).
- For all n2 in unmatched2
- Find the optimal bipartite complete matching by picking the list of weighted pairs (n1, n2) such that n1 is in unmatched1 (or null), n2 is in unmatched2 (or null):Â searchNCC().
- For each pair in optimal match, xdiff(p1, p2).
- For all n1 in unmatched1
- xdiff(node1, node2)–node1 and node2 are two nodes which are to be matched. The result is a matching, defined by table matched.
Port to C#
I have ported the code to C#; it is available in a git repository (https://bitbucket.org/ken_domino/xdiff).