{"id":186,"date":"2010-04-15T07:44:26","date_gmt":"2010-04-15T14:44:26","guid":{"rendered":"http:\/\/domemtech.com\/?p=186"},"modified":"2010-08-24T12:35:20","modified_gmt":"2010-08-24T19:35:20","slug":"march-madness","status":"publish","type":"post","link":"http:\/\/165.227.223.229\/index.php\/2010\/04\/15\/march-madness\/","title":{"rendered":"March madness&#8230;"},"content":{"rendered":"<p>For several months, I had been editing a new edition of a textbook (Atlas of the Canine Brain, ISBN 978-0-916182-17-5). This book was first published in Russian in 1959, then translated and published in English in 1964. Although the English book was for sale, the publishing company (<a href=\"http:\/\/nppbooks.com\">NPP Books<\/a>) had only a limited number of copies left. So, a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Print_on_demand\">Print-On-Demand<\/a> (POD) version of the book was needed. Of course, in 1964 there were no personal computers. (Even in the early &#39;70s, I was still using <a href=\"http:\/\/en.wikipedia.org\/wiki\/Punched_card\">punched cards<\/a>.) The book was typewritten on 8.5&quot; by 11&quot; paper, but the original manuscript, which also included the figures, was lost. Fortunately, the text and figures were recovered from the Russian and English books using a scanner and optical character recognition (<a href=\"http:\/\/en.wikipedia.org\/wiki\/Optical_character_recognition\">OCR<\/a>). 
Call me old fashioned, but it still seems quite remarkable that the technology exists to recover text from old books.<br \/>\n\t<!--more--><\/p>\n<p>However, even with the help of these technologies, the result was still a long way from text in a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Document_markup_language\">document markup language<\/a> (<a href=\"http:\/\/en.wikipedia.org\/wiki\/Latex\">L<\/a><sup><a href=\"http:\/\/en.wikipedia.org\/wiki\/Latex\">A<\/a><\/sup><a href=\"http:\/\/en.wikipedia.org\/wiki\/Latex\">T<\/a><sub><a href=\"http:\/\/en.wikipedia.org\/wiki\/Latex\">E<\/a><\/sub><a href=\"http:\/\/en.wikipedia.org\/wiki\/Latex\">X<\/a>), which is used to produce a typeset-quality PDF file of the book. Current OCR technologies are not 100% accurate, and they generally do not work well with text that is underlined, superscripted or subscripted, marked with language-dependent character accents, or formatted in tables.<\/p>\n<p>In addition, the bibliography was not in an acceptable style. &nbsp;Citations throughout the book could not be traced easily to a particular reference in the bibliography. Consequently, the bibliography needed to be converted into <a href=\"http:\/\/en.wikipedia.org\/wiki\/Bibtex\">B<\/a><span style=\"font-size: 10pt;\"><a href=\"http:\/\/en.wikipedia.org\/wiki\/Bibtex\">IB<\/a><\/span><a href=\"http:\/\/en.wikipedia.org\/wiki\/Bibtex\">T<\/a><sub><a href=\"http:\/\/en.wikipedia.org\/wiki\/Bibtex\">E<\/a><\/sub><a href=\"http:\/\/en.wikipedia.org\/wiki\/Bibtex\">X<\/a> format. 
Although I could have edited the large bibliography using Word or OpenOffice and used&nbsp;<a href=\"http:\/\/writer2latex.sourceforge.net\/index.html\">Writer2BibTeX<\/a> to output <a href=\"http:\/\/en.wikipedia.org\/wiki\/Bibtex\">B<\/a><span style=\"font-size: 10pt;\"><a href=\"http:\/\/en.wikipedia.org\/wiki\/Bibtex\">IB<\/a><\/span><a href=\"http:\/\/en.wikipedia.org\/wiki\/Bibtex\">T<\/a><sub><a href=\"http:\/\/en.wikipedia.org\/wiki\/Bibtex\">E<\/a><\/sub><a href=\"http:\/\/en.wikipedia.org\/wiki\/Bibtex\">X<\/a>, &nbsp;I decided to use <a href=\"http:\/\/www.antlr.org\">Antlr<\/a>, a compiler front-end generator, to parse and translate the bibliography from unformatted text into <a href=\"http:\/\/en.wikipedia.org\/wiki\/Bibtex\">B<\/a><span style=\"font-size: 10pt;\"><a href=\"http:\/\/en.wikipedia.org\/wiki\/Bibtex\">IB<\/a><\/span><a href=\"http:\/\/en.wikipedia.org\/wiki\/Bibtex\">T<\/a><sub><a href=\"http:\/\/en.wikipedia.org\/wiki\/Bibtex\">E<\/a><\/sub><a href=\"http:\/\/en.wikipedia.org\/wiki\/Bibtex\">X<\/a>.<\/p>\n<p><span style=\"text-decoration: underline;\">The Problem<\/span><\/p>\n<p>The program takes as input bibliography entries that are in a standard bibliography format for a book. &nbsp;These entries were produced by scanning the pages of the bibliography, converting the images into text using OCR, and then converting the text to <a href=\"http:\/\/en.wikipedia.org\/wiki\/Bibtex\">T<\/a><sub><a href=\"http:\/\/en.wikipedia.org\/wiki\/Bibtex\">E<\/a><\/sub><a href=\"http:\/\/en.wikipedia.org\/wiki\/Bibtex\">X<\/a>&nbsp;format&nbsp;using&nbsp;<a href=\"http:\/\/www.openoffice.org\/\">OpenOffice <\/a>Writer and the&nbsp;<a href=\"http:\/\/writer2latex.sourceforge.net\/index.html\">Writer2LaTeX<\/a> plugin. 
&nbsp;The output of the program is a bibliography in&nbsp;<a href=\"http:\/\/en.wikipedia.org\/wiki\/Bibtex\">B<\/a><span style=\"font-size: 10pt; \"><a href=\"http:\/\/en.wikipedia.org\/wiki\/Bibtex\">IB<\/a><\/span><a href=\"http:\/\/en.wikipedia.org\/wiki\/Bibtex\">T<\/a><sub><a href=\"http:\/\/en.wikipedia.org\/wiki\/Bibtex\">E<\/a><\/sub><a href=\"http:\/\/en.wikipedia.org\/wiki\/Bibtex\">X<\/a>&nbsp;format.<\/p>\n<p><script type=\"text\/javascript\" src=\"syntaxhighlighter\/scripts\/shCore.js\"><\/script><script type=\"text\/javascript\" src=\"syntaxhighlighter\/scripts\/shBrushPlain.js\" ><\/script><script type=\"text\/javascript\" src=\"syntaxhighlighter\/scripts\/shBrushJava.js\" ><\/script><link href=\"syntaxhighlighter\/styles\/shCore.css\" rel=\"stylesheet\" type=\"text\/css\" \/>\n<link href=\"syntaxhighlighter\/styles\/shThemeEclipse.css\" rel=\"Stylesheet\" type=\"text\/css\" \/>\n<p>INPUT:<\/p>\n<pre class=\"brush: plain;\">Abuladze, K.S., Zelikin, I.U., and Rosenthal, I.S.: The reactions of the\r\ndog after the removal of the entire cortex of the right hemisphere and\r\nthe motor region of the left hemisphere. Archive of Biological\r\nSciences, 61, 3, 94,\r\n1941.}\r\n\r\n{\\selectlanguage{english}\\sffamily\r\nIvanov-Smolenskiy, A.G.: Outlines of patho-physiology of higher nervous\r\nactivity. M., 1949.}\r\n\r\n{\\selectlanguage{english}\\sffamily\r\nR\\&quot;{u}dinger, N.: licher die Hirne verschiedener Hunderassen. Sitzungsber.\r\nd. Wissensch. zu M\\&quot;{u}nchen, 1894.}\r\n\r\n{\\selectlanguage{english}\\sffamily\r\nClark le Gros, W.E.: The homologies of the pulvinar in mammals. Monit.\r\nZool. ital., 41, 1931.}\r\n\r\n{\\selectlanguage{english}\\sffamily\r\nAdrianov, O.S. and Mering, T.A.: Materials on the morphology and\r\nphysiology of\r\nthe cortical ends of the analysors of the dog. VIII All-Union convention\r\nof physiologists, biochemists and pharmacologits. 
Theses of lectures.\r\nM., 9,\r\n1955.}\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p>OUTPUT:<\/p>\n<pre class=\"brush: plain;\">@article {\r\n\ttestx1,\r\n\tauthor = { {K. S.} {Abuladze} AND {I. U.} {Zelikin} AND {I. S.} {Rosenthal} },\r\n\ttitle  = { The reactions of the  dog after the removal of the entire cortex of the right hemisphere and  the motor region of the left hemisphere. },\r\n\tjournal = {  Archive of Biological  Sciences, 61, 3, 94. },\r\n\tyear   = { 1941 }\r\n}\r\n\r\n@article {\r\n\ttestx2,\r\n\tauthor = { {A. G.} {Ivanov-Smolenskiy} },\r\n\ttitle  = { Outlines of patho-physiology of higher nervous  activity. },\r\n\tjournal = {  M. },\r\n\tyear   = { 1949 }\r\n}\r\n\r\n@article {\r\n\ttestx3,\r\n\tauthor = { {N.} {R\\&quot;{u}dinger} },\r\n\ttitle  = { licher die Hirne verschiedener Hunderassen. },\r\n\tjournal = {  Sitzungsber.  d. Wissensch. zu M\\&quot;{u}nchen. },\r\n\tyear   = { 1894 }\r\n}\r\n\r\n@article {\r\n\ttestx4,\r\n\tauthor = { {W. E.} {Clark le Gros} },\r\n\ttitle  = { The homologies of the pulvinar in mammals. },\r\n\tjournal = {  Monit.  Zool. ital., 41. },\r\n\tyear   = { 1931 }\r\n}\r\n\r\n@article {\r\n\ttestx5,\r\n\tauthor = { {O. S.} {Adrianov} AND {T. A.} {Mering} },\r\n\ttitle  = { Materials on the morphology and  physiology of  the cortical ends of the analysors of the dog. },\r\n\tjournal = {  VIII All-Union convention  of physiologists, biochemists and pharmacologits. Theses of lectures.  M., 9. },\r\n\tyear   = { 1955 }\r\n}\r\n<\/pre>\n<p>The input format of each reference is relatively simple to state: it is a list of author(s), followed by a colon, the title, the journal or book, and finally the year of publication. There are several nuances in this simple definition: the text can contain LaTeX formatting directives for generating special character accents, e.g., &#39;\\&quot;{u}&#39;; the authors are optionally separated by commas; and a last name may be complex, like &quot;Clark le Gros&quot;. 
As in so many parsing problems, writing the grammar for this &quot;simple&quot; problem can be quite difficult because of these little nuances, and because Antlr is difficult to use. &nbsp;The experience brought back many unpleasant memories of Antlr (it is not a well-documented tool), and reminded me why I have many issues with much of the free software that developers use. &nbsp;But that is another story&#8230;<\/p>\n<p><span style=\"text-decoration: underline;\">The Solution<\/span><\/p>\n<p>Doit.g:<\/p>\n<pre class=\"brush: plain;\">grammar Doit;\r\n\r\noptions {\r\n    backtrack=true;\r\n    output=AST;\r\n}\r\n\r\ntokens {\r\n    ALLREFS;\r\n    REF;\r\n    NAME;\r\n    FIRSTNAME;\r\n    LASTNAME;\r\n    DATE;\r\n    TITLE;\r\n}\r\n\r\n@members{\r\n    boolean matchdigits(TokenStream input)\r\n    {\r\n        \/\/ return true if the digits at the lookahead are not the date that ends the reference.  The reference ends when the input is &quot;date . }&quot;.\r\n        int p = 1;\r\n        int t = input.LA(p);\r\n        if (t != DIGITS)\r\n        {\r\n            return false;\r\n        }\r\n        while (t == DIGITS)\r\n        {\r\n            t = input.LA(++p);\r\n        }\r\n        if (! (t == PERIOD || t == MINUS))\r\n        {\r\n            return true;\r\n        }\r\n        \/\/ two cases: t was a period or t was a minus.\r\n        if (t == MINUS)\r\n        {\r\n            t = input.LA(++p);\r\n            if (t != DIGITS)\r\n            {\r\n                return true;\r\n            }\r\n            while (t == DIGITS)\r\n            {\r\n                t = input.LA(++p);\r\n            }\r\n            if (! 
(t == PERIOD))\r\n            {\r\n                return true;\r\n            }\r\n        }\r\n        \/\/ skip past period.\r\n        t = input.LA(++p);\r\n        if (t != CC)\r\n        {\r\n            return true;\r\n        }\r\n        return false;\r\n    }\r\n\r\n    boolean matchlastname(TokenStream input)\r\n    {\r\n        \/\/ match long name.\r\n        int p = 1;\r\n        int t = input.LA(p);\r\n        if (t != WORD)\r\n        {\r\n            return false;\r\n        }\r\n        while (t == WORD || t == OC || t == CC || t == WS || t == MINUS || t == TEX)\r\n        {\r\n            t = input.LA(++p);\r\n            if (input.LT(1).getText().equals(&quot;and&quot;))\r\n                return true;\r\n            if (input.LT(1).getText().equals(&quot;und&quot;))\r\n                return true;\r\n            if (input.LT(1).getText().equals(&quot;et&quot;))\r\n                return true;\r\n        }\r\n        if (t == COMMA)\r\n        {\r\n            return true;\r\n        }\r\n        if (t == COLON)\r\n        {\r\n            return true;\r\n        }\r\n        if (t == PERIOD)\r\n        {\r\n            return false;\r\n        }\r\n        return false;\r\n    }\r\n\r\n}\r\n\r\nprog:   ref+ WS? EOF -&gt; ^(ALLREFS ref+)\r\n    ;\r\n\r\nref:   block\r\n    ;\r\n\r\nblock:\r\n    WS!\r\n    OC!\r\n    stuff\r\n    CC!\r\n    ;\r\n\r\nstuff:\r\n    header\r\n    authors\r\n    COLON\r\n    rest\r\n    date\r\n        -&gt; ^(REF authors ^(TITLE rest) date)\r\n    ;\r\n\r\nheader:\r\n        HEADER WS\r\n    ;\r\n\r\nauthors :\r\n    name (\r\n        COMMA! WS! (andauthor | name)\r\n        | WS! andauthor\r\n        )*\r\n    ;\r\n\r\nandauthor:\r\n    { input.LT(1).getText().equals(&quot;and&quot;)\r\n      || input.LT(1).getText().equals(&quot;und&quot;)\r\n      || input.LT(1).getText().equals(&quot;et&quot;)\r\n        }? WORD! WS! 
name\r\n    ;\r\n\r\nname:\r\n    (\r\n        lastname\r\n        ((COMMA WS firstname)=&gt; COMMA WS firstname)?\r\n    ) -&gt; ^(NAME (^(FIRSTNAME firstname))? ^(LASTNAME lastname))\r\n    ;\r\n\r\nlastname:\r\n    { matchlastname(input) }? ( WORD ((MINUS WORD) | (OC WORD CC WORD?) | TEX | (WS WORD))* )\r\n    ;\r\n\r\nfirstname:\r\n    WORD PERIOD (WORD PERIOD (WORD PERIOD (WORD PERIOD)? )? )?\r\n    ( { !(\r\n        input.LT(1).getText().equals(&quot;and&quot;)\r\n        || input.LT(1).getText().equals(&quot;und&quot;)\r\n        || input.LT(1).getText().equals(&quot;et&quot;)\r\n        ) }? WORD )?\r\n    ;\r\n\r\nrest:\r\n    (WORD\r\n        | PERIOD\r\n        | COLON\r\n        | COMMA\r\n        | MINUS\r\n        | { matchdigits(input) }?=&gt; DIGITS\r\n        | esc\r\n        | WS\r\n        | TEX\r\n        )+\r\n    ;\r\n\r\nesc:\r\n    OC (WORD | MINUS)+ CC\r\n    ;\r\n\r\ndate:\r\n        DIGITS (MINUS DIGITS)?  PERIOD -&gt; ^(DATE DIGITS (MINUS DIGITS)?)\r\n    ;\r\n\r\nPERIOD :\r\n    &#39;.&#39;\r\n    ;\r\n\r\nCOLON :\r\n    &#39;:&#39;\r\n    ;\r\n\r\nMINUS:\r\n    &#39;-&#39;\r\n    ;\r\n\r\nDIGITS:\r\n    (&#39;0&#39; .. &#39;9&#39;)+\r\n    ;\r\n\r\nWORD  :\r\n    (&#39;a&#39;..&#39;z&#39;\r\n        | &#39;A&#39;..&#39;Z&#39;\r\n        | &#39;&quot;&#39;\r\n        | &#39;\\&#39;&#39;\r\n        | &#39;`&#39;\r\n        | &#39;?&#39;\r\n        | &#39;\/&#39;\r\n        | &#39;;&#39;\r\n        | &#39;(&#39;\r\n        | &#39;)&#39;\r\n        | &#39;=&#39;)+\r\n    ;\r\n\r\nTEX:\r\n    &#39;\\\\&#39; .\r\n    ;\r\n\r\nCOMMA:\r\n    &#39;,&#39;\r\n    ;\r\n\r\nHEADER:\r\n    &#39;\\\\selectlanguage{english}\\\\sffamily&#39;\r\n    ;\r\n\r\nOC :\r\n    &#39;{&#39;\r\n    ;\r\n\r\nCC :\r\n    &#39;}&#39;\r\n    ;\r\n\r\nWS  :\r\n    (\r\n        &#39; &#39;\r\n        | &#39;\\t&#39;\r\n        | &#39;\\r&#39;? &#39;\\n&#39;\r\n        | &#39;%&#39; .* &#39;\\r&#39;? 
&#39;\\n&#39;\r\n        )+\r\n    ;\r\n<\/pre>\n<p>Doit.java:<\/p>\n<pre class=\"brush: java;\">import org.antlr.runtime.*;\r\nimport org.antlr.runtime.tree.*;\r\nimport java.io.*;\r\n\r\npublic class Doit {\r\n\r\n    int my_i = 0;\r\n\r\n    String my_chop(String inp)\r\n    {\r\n        \/\/ remove any trailing blanks and comma.\r\n        int last = inp.length() - 1;\r\n        while (last &gt; 0)\r\n        {\r\n            if (inp.charAt(last) == &#39; &#39;\r\n                  || inp.charAt(last) == &#39;,&#39;\r\n                  || inp.charAt(last) == &#39;\\n&#39;\r\n                  || inp.charAt(last) == &#39;\\r&#39;)\r\n            {\r\n                last--;\r\n            }\r\n            else\r\n                break;\r\n        }\r\n        String tmp = inp.substring(0, last+1);\r\n        \/\/ make sure it ends with a &#39;.&#39;.\r\n        if (tmp.charAt(tmp.length()-1) != &#39;.&#39;)\r\n            tmp += &#39;.&#39;;\r\n        return tmp;\r\n    }\r\n\r\n    String my_year(String inp)\r\n    {\r\n        \/\/ remove any trailing .\r\n        int last = inp.length() - 1;\r\n        while (last &gt; 0)\r\n        {\r\n            if (inp.charAt(last) == &#39; &#39;\r\n                  || inp.charAt(last) == &#39;.&#39;\r\n                  || inp.charAt(last) == &#39;\\n&#39;\r\n                  || inp.charAt(last) == &#39;\\r&#39;)\r\n            {\r\n                last--;\r\n            }\r\n            else\r\n                break;\r\n        }\r\n        return inp.substring(0,last+1);\r\n    }\r\n\r\n    String my_escape(String inp)\r\n    {\r\n        \/\/ escape any backslashes that don&#39;t make sense.\r\n        String s = new String();\r\n        int last = 0;\r\n        s = &quot;&quot;;\r\n        while (last &lt; inp.length())\r\n        {\r\n            if (inp.charAt(last) == &#39;\\\\&#39; &amp;&amp; inp.charAt(last+1) == &#39;t&#39;)\r\n                s += &quot;\\\\\\\\&quot;;\r\n            else if 
(inp.charAt(last) == &#39;\\\\&#39; &amp;&amp; inp.charAt(last+1) == &#39;d&#39;)\r\n                s += &quot;\\\\\\\\&quot;;\r\n            else\r\n                s += inp.charAt(last);\r\n            last++;\r\n        }\r\n        return s;\r\n    }\r\n\r\n    String my_rmnewline(String inp)\r\n    {\r\n        \/\/ remove new lines.\r\n        String s = new String();\r\n        int last = 0;\r\n        s = &quot;&quot;;\r\n        while (last &lt; inp.length())\r\n        {\r\n            if (!(inp.charAt(last) == &#39;\\n&#39; || inp.charAt(last) == &#39;\\r&#39;))\r\n                s += inp.charAt(last);\r\n            else\r\n                s += &#39; &#39;;\r\n            last++;\r\n        }\r\n        return s;\r\n    }\r\n\r\n    String walk(String id, Tree ast) {\r\n        int i;\r\n        String result = &quot;&quot;;\r\n        int t = ast.getType();\r\n        switch(t) {\r\n            case DoitParser.ALLREFS:\r\n                for (i = 0; i &lt; ast.getChildCount(); ++i)\r\n                    result += walk(id, (Tree)ast.getChild(i));\r\n                break;\r\n\r\n            case DoitParser.REF:\r\n            {\r\n                result += &quot;\\n@article {\\n&quot;;\r\n                result += &quot;     &quot; + id + (++my_i) + &quot;,\\n&quot;;\r\n                boolean first = true;\r\n                for (i = 0; i &lt; ast.getChildCount(); ++i)\r\n                {\r\n                    Tree c = (Tree)ast.getChild(i);\r\n                    int ct = c.getType();\r\n                    if (ct == DoitParser.NAME)\r\n                    {\r\n                        if (first)\r\n                            result += &quot; author = { &quot;;\r\n                        else\r\n                            result += &quot; AND &quot;;\r\n                        first = false;\r\n                        result += walk(id, (Tree)ast.getChild(i));\r\n                        continue;\r\n                    } else if (ct == 
DoitParser.TITLE) {\r\n                        result += &quot; },\\n&quot;;\r\n                    }\r\n                    result += walk(id, ast.getChild(i));\r\n                }\r\n                result += &quot;}\\n\\n&quot;;\r\n                break;\r\n            }\r\n\r\n            case DoitParser.NAME:\r\n                for (i = 0; i &lt; ast.getChildCount(); ++i)\r\n                    result += walk(id, (Tree)ast.getChild(i));\r\n                break;\r\n\r\n            case DoitParser.FIRSTNAME:\r\n                result += &quot;{&quot;;\r\n                for (i = 0; i &lt; ast.getChildCount(); ++i)\r\n                {\r\n                    result += ast.getChild(i).getText();\r\n                    if (ast.getChild(i).getType() == DoitParser.PERIOD\r\n                          &amp;&amp; i != (ast.getChildCount()-1))\r\n                        result += &#39; &#39;;\r\n                }\r\n                result += &quot;}&quot;;\r\n                result += &quot; &quot;;\r\n                break;\r\n\r\n            case DoitParser.LASTNAME:\r\n                result += &quot;{&quot;;\r\n                for (i = 0; i &lt; ast.getChildCount(); ++i)\r\n                    result += ast.getChild(i).getText();\r\n                result += &quot;}&quot;;\r\n                break;\r\n\r\n            case DoitParser.TITLE:\r\n            {\r\n                String temp = &quot; title  = {&quot;;\r\n                boolean first = true;\r\n                String last = null;\r\n                String lastm1 = null;\r\n                for (i = 0; i &lt; ast.getChildCount(); ++i) {\r\n                    temp += my_rmnewline(ast.getChild(i).getText());\r\n                    \/\/ On first period of long word, make this the end\r\n                    \/\/ of the title and start\r\n                    if (ast.getChild(i).getType() == DoitParser.PERIOD &amp;&amp; first\r\n                          &amp;&amp; last != null)\r\n                    {\r\n    
                    if (last.length() &gt; 1\r\n                              || (lastm1 != null &amp;&amp; lastm1.compareTo(&quot;.&quot;) == 0)\r\n                              || (last.compareTo(&quot;}&quot;) == 0))\r\n                        {\r\n                            temp += &quot; },&quot;;\r\n                            temp += &#39;\\n&#39;;\r\n                            temp += &quot;   journal = { &quot;;\r\n                            first = false;\r\n                        }\r\n                    }\r\n                    lastm1 = last;\r\n                    last = ast.getChild(i).getText();\r\n                }\r\n                result += my_chop(my_escape(temp));\r\n                result += &quot; },&quot;;\r\n                break;\r\n            }\r\n\r\n            case DoitParser.DATE:\r\n                result += &quot;\\n   year   = { &quot;;\r\n                for (i = 0; i &lt; ast.getChildCount(); ++i)\r\n                    result += ast.getChild(i).getText();\r\n                result += &quot; }\\n&quot;;\r\n                break;\r\n\r\n            default:\r\n                result += ast.getText();\r\n                break;\r\n        }\r\n        return result;\r\n    }\r\n\r\n    public static void main(String[] args) throws Exception {\r\n        for (String s: args) {\r\n            try {\r\n                System.out.println(&quot;Input file is &quot; + s);\r\n                File inFile = new File(s);\r\n                String inFileName = inFile.getName();\r\n                int whereDot = inFileName.lastIndexOf(&#39;.&#39;);\r\n                if (!(0 &lt; whereDot &amp;&amp; whereDot &lt;= inFile.getName().length() - 2 ))\r\n                {\r\n                    System.out.println(&quot;Illegal file name.&quot;);\r\n                    continue;\r\n                }\r\n                String prefix = inFileName.substring(0,whereDot) + &quot;x&quot;;\r\n                String outFileName = 
inFileName.substring(0,whereDot) + &quot;.bib&quot;;\r\n                System.out.println(&quot;Output file is &quot; + outFileName);\r\n\r\n                \/\/ Create output file.\r\n                FileOutputStream out;\r\n                PrintStream p;\r\n                out = new FileOutputStream(outFileName);\r\n                p = new PrintStream(out);\r\n\r\n                System.setIn(new FileInputStream(s));\r\n                ANTLRInputStream input = new ANTLRInputStream(System.in);\r\n                DoitLexer lexer = new DoitLexer(input);\r\n                CommonTokenStream tokens = new CommonTokenStream(lexer);\r\n                DoitParser parser = new DoitParser(tokens);\r\n                DoitParser.prog_return result = parser.prog();\r\n                Tree t = (Tree)result.getTree();\r\n                Doit doit = new Doit();\r\n                String translation = doit.walk(prefix, t);\r\n                p.println(translation);\r\n                out.close();\r\n\r\n            } catch (IOException e) {\r\n                e.printStackTrace();\r\n            }\r\n        }\r\n    }\r\n}\r\n<\/pre>\n<p>The grammar for the list of unformatted references starts on line 98 of Doit.g. This grammar produces an abstract syntax tree (<a href=\"http:\/\/en.wikipedia.org\/wiki\/Abstract_syntax_tree\">AST<\/a>), which is then walked by a simple tree walker (starting on line 87 of Doit.java) to print out the bibliography in BIBTEX format.<\/p>\n<p>This grammar requires several syntactic and semantic predicates, which are hard to use and understand. &nbsp;On line 141 of Doit.g, the syntactic predicate (COMMA WS firstname)=&gt; is used to recognize an author with a first name by looking ahead for the comma and the first name. &nbsp;For example, when matching &quot;Abuladze, K.S.&quot;, the comma is consumed as part of the name only when the first name &quot;K.S.&quot; follows it. Otherwise, the comma separates one author from the next. 
On lines 132-135 of Doit.g, the semantic predicate<\/p>\n<p>{ input.LT(1).getText().equals(&quot;and&quot;)<br \/>\n\t|| input.LT(1).getText().equals(&quot;und&quot;)<br \/>\n\t|| input.LT(1).getText().equals(&quot;et&quot;)<br \/>\n\t}?<br \/>\n\tgates the andauthor production (and any alternative that uses that non-terminal), so that the word &quot;and&quot; (or its equivalents &quot;und&quot; and &quot;et&quot;) starts another author.<br \/>\n\t<script type=\"text\/javascript\">\n     SyntaxHighlighter.all()\n<\/script><\/p>\n","protected":false},"excerpt":{"rendered":"<p>For several months, I had been editing a new edition of a textbook (Atlas of the Canine Brain, ISBN 978-0-916182-17-5).  This book was first published in Russian in 1959, then translated and published in English in 1964.  Although the English book was for sale, the publishing company (<a href=\"http:\/\/nppbooks.com\">NPP Books<\/a>) had only a limited number of copies left.  So, a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Print_on_demand\">Print-On-Demand<\/a> (POD) version of the book was needed.  Of course, in 1964 there were no personal computers. (Even in the early &#8217;70s, I was still using <a href=\"http:\/\/en.wikipedia.org\/wiki\/Punched_card\">punched cards<\/a>.)  The book was typewritten on 8.5&#8243; by 11&#8243; paper, but the original manuscript, which also included the figures, was lost.  Fortunately, the text and figures were recovered from the Russian and English books using a scanner and optical character recognition (<a href=\"http:\/\/en.wikipedia.org\/wiki\/Optical_character_recognition\">OCR<\/a>).  
Call me old fashioned, but it still seems quite remarkable that the technology exists to recover text from old books.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[],"_links":{"self":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/posts\/186"}],"collection":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/comments?post=186"}],"version-history":[{"count":0,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/posts\/186\/revisions"}],"wp:attachment":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/media?parent=186"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/categories?post=186"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/tags?post=186"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}