{"id":1613,"date":"2018-11-09T21:39:09","date_gmt":"2018-11-10T02:39:09","guid":{"rendered":"http:\/\/codinggorilla.domemtech.com\/?p=1613"},"modified":"2018-11-11T10:14:19","modified_gmt":"2018-11-11T15:14:19","slug":"porting-re2-j-to-c","status":"publish","type":"post","link":"http:\/\/165.227.223.229\/index.php\/2018\/11\/09\/porting-re2-j-to-c\/","title":{"rendered":"Porting Re2\/j to C#"},"content":{"rendered":"<p>For these last few weeks, I&#8217;ve been trying to grapple with the problem of <a href=\"https:\/\/en.wikipedia.org\/wiki\/Platform_Invocation_Services\">p\/invoke<\/a>&#8211;the nasty but must-use feature in C#. While one could write these declarations out by hand, some libraries are too large, and change too often, so people use p\/invoke generators, like <a href=\"http:\/\/swig.org\/\">SWIG<\/a>. However, the is no generator that is easy to use\u00c2\u00a0or generates 100% correct C# declarations. So, as every software developer does,\u00c2\u00a0so I go to re-invent the wheel.<\/p>\n<p><!--more--><\/p>\n<p>Part of my solution involves using regular expressions&#8211;which requires a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Parsing#Computer_languages\">recognizer<\/a>. Rather than write my own regular expression engine&#8211;as I don&#8217;t want to spend all day writing one&#8211;I decided to look for one.<\/p>\n<p>If you try to Google for &#8220;regular expression engine source code&#8221;, you will find a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Comparison_of_regular_expression_engines\">great many implementations<\/a>. But, hardly any I would consider an easy read, in the evening beside your dog at the fireplace. Some say the\u00c2\u00a0<a href=\"https:\/\/www.cs.princeton.edu\/courses\/archive\/spr09\/cos333\/beautiful.html\">implementation by Pike and Kernighan<\/a>\u00c2\u00a0is a nice and clean implementation, but that code really isn&#8217;t a regular expression engine. It&#8217;s a\u00c2\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Glob_(programming)\">glob pattern<\/a> matcher. And, if you try to extend it to include <a href=\"https:\/\/en.wikipedia.org\/wiki\/Regular_expression#Basic_concepts\">regular expression operators<\/a> like groups (parenthesized patterns), sets (denoted by square brackets), and Boolean Or operators, you end up with a\u00c2\u00a0<a href=\"http:\/\/stevehanov.ca\/blog\/index.php?id=90\">big, fat, pile of crap<\/a>.<\/p>\n<p>Surprisingly, virtually all regular expression engines are implemented as NFA&#8217;s with backtracking. As I really didn&#8217;t want to go through <a href=\"https:\/\/en.wikipedia.org\/wiki\/Thompson%27s_construction\">Thompson&#8217;s Construction<\/a>, and re-implement something I did years ago, I decided to try a nice implementation of Thompson&#8217;s original NFA algorithm by <a href=\"https:\/\/swtch.com\/~rsc\/regexp\/regexp1.html\">Cox<\/a>. The <a href=\"https:\/\/github.com\/google\/re2\">original code is in C++<\/a>, then\u00c2\u00a0<a href=\"https:\/\/github.com\/google\/re2j\">ported to Java<\/a> a few years ago. Starting with that, I\u00c2\u00a0<a href=\"https:\/\/github.com\/kaby76\/re2cs\">ported it to C#<\/a>\u00c2\u00a0over the last two days. Does it work? Yes!<\/p>\n<h3>How does it work?<\/h3>\n<p>RE2 works by constructing an <a href=\"https:\/\/en.wikipedia.org\/wiki\/Nondeterministic_finite_automaton\">NFA<\/a>, then, as explained by Wikipedia, <a href=\"https:\/\/en.wikipedia.org\/wiki\/Nondeterministic_finite_automaton#Implementation\">keeps a set data structure of all the possible states that the recognizer is in<\/a>. (Surprisingly, Wiki does not describe the backtracking method, used by most RE recognizers!)<\/p>\n<p>Let&#8217;s take Cox&#8217;s example (<em>pattern = &#8220;abab|abbb&#8221;, text = &#8220;abbb&#8221;<\/em>).\u00c2\u00a0Debugging <a href=\"https:\/\/github.com\/kaby76\/re2cs\/blob\/master\/program\/Program.cs\">the implementation<\/a>, we see the NFA is implemented by the class\u00c2\u00a0<a href=\"https:\/\/github.com\/kaby76\/re2cs\/blob\/master\/engine\/Machine.cs#L17\">Machine<\/a>. <a href=\"https:\/\/github.com\/kaby76\/re2cs\/blob\/master\/engine\/Prog.cs#L16\">Prog<\/a> encodes the states and transitions of the NFA. The\u00c2\u00a0<em>inst<\/em> field of the class encodes the transitions organized by state, which is identified by a\u00c2\u00a0non-negative integer. The main loop of the recognizer is <a href=\"https:\/\/github.com\/kaby76\/re2cs\/blob\/master\/engine\/Machine.cs#L291\">for-loop<\/a>. Starting in state 1, the recognizer steps to {2} on &#8216;a&#8217;. Then, on &#8216;b&#8217; the recognizer steps to {7, 3, 5}. Then, on &#8216;b&#8217;, the recognizer steps to {6}. Then, on &#8216;b&#8217;, the recognizer steps to {8}, which is the <em>match<\/em> state.<\/p>\n<p><img loading=\"lazy\" class=\"aligncenter size-full wp-image-1627\" src=\"http:\/\/codinggorilla.com\/wordpress\/wp-content\/uploads\/2018\/11\/2018-11-10-1.png\" alt=\"\" width=\"688\" height=\"687\" \/><\/p>\n<h3>The Alternative<\/h3>\n<p>Note: The simplest implementation I&#8217;ve seen is just a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Recursive_descent_parser\">recursive descent<\/a>, which you can find <a href=\"https:\/\/github.com\/SirWumpus\/ioccc-ag\/blob\/master\/spoiler\/ag.c\">here<\/a>. I think I even had it on a quiz way back in compiler construction many years ago.<\/p>\n<p>&#8211;Ken<\/p>\n","protected":false},"excerpt":{"rendered":"<p>For these last few weeks, I&#8217;ve been trying to grapple with the problem of p\/invoke&#8211;the nasty but must-use feature in C#. While one could write these declarations out by hand, some libraries are too large, and change too often, so people use p\/invoke generators, like SWIG. However, the is no generator that is easy to &hellip; <\/p>\n<p class=\"link-more\"><a href=\"http:\/\/165.227.223.229\/index.php\/2018\/11\/09\/porting-re2-j-to-c\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Porting Re2\/j to C#&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[],"tags":[],"_links":{"self":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/posts\/1613"}],"collection":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/comments?post=1613"}],"version-history":[{"count":0,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/posts\/1613\/revisions"}],"wp:attachment":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/media?parent=1613"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/categories?post=1613"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/tags?post=1613"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}