{"id":995,"date":"2011-09-15T18:04:26","date_gmt":"2011-09-16T01:04:26","guid":{"rendered":"http:\/\/domemtech.com\/?p=995"},"modified":"2011-09-16T03:50:58","modified_gmt":"2011-09-16T10:50:58","slug":"antlr-php-target","status":"publish","type":"post","link":"http:\/\/165.227.223.229\/index.php\/2011\/09\/15\/antlr-php-target\/","title":{"rendered":"Antlr PHP target"},"content":{"rendered":"<p style=\"text-align: justify; \">My wife&nbsp;recently started&nbsp;<a href=\"http:\/\/reg-aff-views.info\">a WordPress blog<\/a>&nbsp;on regulatory affairs of the drug and medical device industries. &nbsp;When she started it, she decided to add posts from time to time about newly approved drugs and generics, based on information from the FDA website. &nbsp;So, after proofreading her drafts a few times, I realized that this was a good example of an <a href=\"http:\/\/en.wikipedia.org\/wiki\/News_aggregator\">aggregator<\/a>&nbsp;problem, something that I&#39;ve been wanting to do for my own blog. &nbsp;The aggregator would read the <a href=\"http:\/\/www.accessdata.fda.gov\/scripts\/cder\/drugsatfda\/\">Drugs@FDA<\/a>&nbsp;and <a href=\"http:\/\/www.nlm.nih.gov\/medlineplus\/druginformation.html\">MedlinePlus<\/a> websites, extract approved pharmaceuticals and generics, then create a draft post for each new drug. &nbsp;Sounds like an easy project, right?<\/p>\n<p><!--more--><\/p>\n<p style=\"text-align: justify; \">As with any software, the devil was in the details. &nbsp;After studying what was involved in creating posts using an aggregator, I realized that the solution required three basic steps: (1) fetching web pages; (2) extracting the information from those pages; and (3) making a post in WordPress. &nbsp;I was able to figure out how to do steps (1) and (3) pretty quickly, but step (2) was a little harder. &nbsp;How do I extract information from a web page? &nbsp;Since all the web pages are in HTML, what I needed was a parser for HTML. 
&nbsp;The parser would create a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Parse_tree\">parse tree<\/a> (or <a href=\"http:\/\/en.wikipedia.org\/wiki\/Abstract_syntax_tree\">abstract syntax tree<\/a>) representing the document. &nbsp;I could then write code to walk the tree and extract the information I was looking for. &nbsp;In this case, I wanted to extract particular columns out of a table. &nbsp;<\/p>\n<p style=\"text-align: justify; \"><a href=\"http:\/\/en.wikipedia.org\/wiki\/Parsing\">Parsing<\/a> is not easy for any language, let alone HTML. &nbsp;While there are a few tools available to parse HTML (e.g.,&nbsp;<a href=\"http:\/\/simplehtmldom.sourceforge.net\/\">http:\/\/simplehtmldom.sourceforge.net\/<\/a>,&nbsp;<a href=\"http:\/\/tidy.sourceforge.net\/\">http:\/\/tidy.sourceforge.net\/<\/a>), I didn&#39;t want to be <a href=\"http:\/\/www.codinghorror.com\/blog\/2009\/11\/parsing-html-the-cthulhu-way.html\">smart<\/a>: I wanted to write my own because I was a compiler engineer long ago!<\/p>\n<p style=\"text-align: justify; \">So, I decided to use <a href=\"http:\/\/en.wikipedia.org\/wiki\/ANTLR\">Antlr<\/a>, a parser generator. &nbsp;Writing a grammar for HTML wouldn&#39;t be easy because&nbsp;<a href=\"http:\/\/taligarsiel.com\/Projects\/howbrowserswork1.htm#HTML_Parser\">HTML cannot easily be defined by a context-free grammar<\/a>. &nbsp;It also allows&nbsp;unmatched elements.&nbsp;In addition, the target language for the parser had to be PHP because it had to work with <a href=\"http:\/\/wordpress.org\/\">WordPress<\/a>. &nbsp;Unfortunately, the source code for the Antlr PHP runtime was <a href=\"http:\/\/www.antlr.org\/pipermail\/antlr-interest\/2011-September\/042627.html\">disorganized<\/a>. &nbsp;It was out of date, having last been modified over a year earlier, and several developers had created copies of the code, each modified in different ways. 
&nbsp;All of these problems created a lot of confusion.<\/p>\n<p style=\"text-align: justify; \">As it turned out, I was able to update the Antlr PHP runtime fairly quickly. &nbsp;That code depends on Antlr 3.4 and PHP 5.3, and is&nbsp;available&nbsp;<a href=\"http:\/\/domemtech.com\/code\/antlrphpruntime.zip\">here<\/a>.&nbsp;Just unpack the zip file and use the enclosed antlr.jar file to generate the parser in PHP, e.g., &#39;java -jar antlr.jar TableLexer.g&#39;. &nbsp;I was able to use Antlr to find nested HTML tables and HTML rows (<a href=\"http:\/\/domemtech.com\/code\/TableLexer.g\">TableLexer.g<\/a>). &nbsp;But, as with many Antlr grammars, this one is target dependent because it contains semantic checks written in PHP.<\/p>\n<p style=\"text-align: justify; \">Another example of Antlr targeted to the PHP runtime is an <a href=\"http:\/\/www.antlr.org\/wiki\/display\/ANTLR3\/Expression+evaluator\">expression grammar<\/a>. &nbsp;<a href=\"http:\/\/domemtech.com\/code\/php_calculator.php\">This web page<\/a> demonstrates how Antlr works with this grammar rewritten for PHP.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>My wife&nbsp;recently started&nbsp;a WordPress blog&nbsp;on regulatory affairs of the drug and medical device industries. &nbsp;When she started it, she decided to add posts from time to time about newly approved drugs and generics, based on information from the FDA website. 
&nbsp;So, after proofreading her drafts a few times, I realized that this was a good &hellip; <\/p>\n<p class=\"link-more\"><a href=\"http:\/\/165.227.223.229\/index.php\/2011\/09\/15\/antlr-php-target\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Antlr PHP target&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[],"tags":[],"_links":{"self":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/posts\/995"}],"collection":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/comments?post=995"}],"version-history":[{"count":0,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/posts\/995\/revisions"}],"wp:attachment":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/media?parent=995"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/categories?post=995"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/tags?post=995"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}