{"id":1410,"date":"2016-01-26T15:52:54","date_gmt":"2016-01-26T20:52:54","guid":{"rendered":"http:\/\/codinggorilla.domemtech.com\/?p=1410"},"modified":"2016-01-26T17:42:23","modified_gmt":"2016-01-26T22:42:23","slug":"order-independent-tree-comparison-using-x-diff","status":"publish","type":"post","link":"http:\/\/165.227.223.229\/index.php\/2016\/01\/26\/order-independent-tree-comparison-using-x-diff\/","title":{"rendered":"Order-Independent Tree Comparison Using X-Diff"},"content":{"rendered":"<p>I am interested in the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Regression_testing\" target=\"_blank\">testing of regressions<\/a> of a program that I am modifying. What is unique about this regression test is that I want to check for differences after modifications of the internal data structures using XML or JSON. As an example, suppose I am working on a C# compiler. The compiler works by parsing C# programs, then constructs an abstract syntax tree. Let&#8217;s assume that I&#8217;ve written some simple C# programs that are valid, and the compiler before modifications is correct. I now modify the compiler to use a new parser, and now want to check for regressions in the data structures after the parser. To test for regressions, I compare the XML or JSON representations of the abstract syntax tree of a program, before and after modifications to the compiler.<\/p>\n<h1>X-Diff<\/h1>\n<h2>What is it?<\/h2>\n<p>X-Diff is a program that determines the set of changes between two XML documents. X-Diff assumes an unordered tree comparison model, whereby order of elements is not important (i.e., the trees are <a href=\"https:\/\/en.wikipedia.org\/wiki\/Graph_isomorphism_problem\" target=\"_blank\">isomorphic<\/a>). The command-line program takes two input file names containing XML (or JSON, in the C# port\/extension).<\/p>\n<h2>Who wrote it?<\/h2>\n<p><a href=\"http:\/\/pages.cs.wisc.edu\/~yuanwang\/xdiff.html\" target=\"_blank\">http:\/\/pages.cs.wisc.edu\/~yuanwang\/xdiff.html<\/a><\/p>\n<p>Wang, Yuan, David J. DeWitt, and Jin-Yi Cai. &#8220;X-Diff: An effective change detection algorithm for XML documents.&#8221; <i>Data Engineering, 2003. Proceedings. 19th International Conference on<\/i>. IEEE, 2003. DOI: <a class=\"ng-binding\" href=\"http:\/\/dx.doi.org\/10.1109\/ICDE.2003.1260818\" target=\"_blank\">10.1109\/ICDE.2003.1260818<\/a>.<\/p>\n<p>C++ and Java source.<\/p>\n<h2>How does it work?<\/h2>\n<p><em>Note: It is difficult to relate the code to the <\/em><em>algorithms described in the paper, with many special case code percolated throughout<\/em>. Below is an overview of what the code does.<\/p>\n<ol>\n<li>Input two XML files with <em>XParser.parse()<\/em>. The code intermingles parsing with attribute analysis, so it can be somewhat difficult to understand. In Java, the Xerces XML parser is used (<a href=\"http:\/\/xerces.apache.org\/\" target=\"_blank\">http:\/\/xerces.apache.org\/<\/a>). In the C# port, the Microsoft &#8220;LINQ to XML&#8221; library is used instead. &#8220;LINQ to XML&#8221; does not provide a SAX-like interface for parsing, so I&#8217;ve added a pre-order tree walker (<em>XParser.preorder<\/em>) to XParser as a drop-in substitute. During parsing, a hash\u00c2\u00a0value is computed for each node in the XML document. The value is recursively defined: <img class=\"mathtex-equation-editor\" src=\"http:\/\/chart.apis.google.com\/chart?cht=tx&amp;chl=hash(N)%20%3D%20DES(N)%20%2B%20%5Csum%5Ec%5Bhash(c)%5D%5E2%5Ctext%7B%2C%20where%20c%20%3D%20child%20of%20N%7D\" alt=\"hash(N) = DES(N) + \\sum^c[hash(c)]^2\\text{, where c = child of N}\" align=\"absmiddle\" \/>.\u00c2\u00a0Notes: (a) the hash value for node N is order independent because addition is commutative. (b) I don&#8217;t understand why the authors square the hash value&#8211;except when a node has a single child. I don&#8217;t know why. (c) The <a href=\"https:\/\/en.wikipedia.org\/wiki\/Data_Encryption_Standard\" target=\"_blank\">Data Encryption Standard (DES)<\/a>\u00c2\u00a0is used to obtain the hash value of tags, attributes, and text: <img class=\"mathtex-equation-editor\" src=\"http:\/\/chart.apis.google.com\/chart?cht=tx&amp;chl=DES(N)\" alt=\"DES(N)\" align=\"absmiddle\" \/>. It isn&#8217;t clear why DES is used as opposed to <a href=\"https:\/\/en.wikipedia.org\/wiki\/Cyclic_redundancy_check\" target=\"_blank\">Cyclic Redundency Check (CRC).<\/a><\/li>\n<li>In a top-down manner, each node in one tree is matched to a node is the second tree. The result of is a list of pairs that are matched.\n<ol>\n<li>xdiff(node1, node2)&#8211;node1 and node2 are two nodes which are to be matched. The result is a matching, defined by table matched.\n<ol>\n<li>If the hash value of child c1 of node1 and c2 of node2 are equal, they are removed from further comparison, and placed in matched += (c1, c2).<\/li>\n<li>If there is no match for c1, c2, then c1 and c2 are added to unmatched1, unmatched2, respectively.<\/li>\n<li>The sets unmatched1 and unmatched2 are now examined:\u00c2\u00a0Find an optimal matching of nodes in unmatched, matchList(unmatched1, unmatched2). Note: matchList is a recursive function defined below.<\/li>\n<\/ol>\n<\/li>\n<li>matchList(unmatched1, unmatched2)&#8211;unmatched lists are of unmatched nodes in the two trees.\n<ol>\n<li>For all n1 in unmatched1\n<ol>\n<li>For all n2 in unmatched2\n<ol>\n<li>Compute the optimal &#8220;distance&#8221; d = distance(n1, n2). Note: distance() is a recursive function that finds the optimal number of changes to convert node n1 to node n2. Distance should also compute the cost of creating a tree (i.e., distance(null, n2)) or deleting a tree (i.e., distance(n1, null)).<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<\/li>\n<li>Find the optimal bipartite complete matching by picking the list of weighted pairs (n1, n2) such that n1 is in unmatched1 (or null), n2 is in unmatched2 (or null):\u00c2\u00a0searchNCC().<\/li>\n<li>For each pair in optimal match, xdiff(p1, p2).<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<h2>Port to C#<\/h2>\n<p>I have ported the code to C#; it is available in a git repository (<a href=\"https:\/\/bitbucket.org\/ken_domino\/xdiff\" target=\"_blank\">https:\/\/bitbucket.org\/ken_domino\/xdiff<\/a>).<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I am interested in the testing of regressions of a program that I am modifying. What is unique about this regression test is that I want to check for differences after modifications of the internal data structures using XML or JSON. As an example, suppose I am working on a C# compiler. The compiler works &hellip; <\/p>\n<p class=\"link-more\"><a href=\"http:\/\/165.227.223.229\/index.php\/2016\/01\/26\/order-independent-tree-comparison-using-x-diff\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Order-Independent Tree Comparison Using X-Diff&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[],"tags":[],"_links":{"self":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/posts\/1410"}],"collection":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/comments?post=1410"}],"version-history":[{"count":0,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/posts\/1410\/revisions"}],"wp:attachment":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/media?parent=1410"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/categories?post=1410"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/tags?post=1410"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}