My wife recently started a WordPress blog on regulatory affairs in the drug and medical device industries. When she started it, she decided to post from time to time about newly approved drugs and generics, based on information from the FDA website. After proofreading her drafts a few times, I realized that this was a good example of an aggregator problem, something I've been wanting to build for my own blog. The aggregator would read the Drugs@FDA and MedlinePlus websites, extract approved pharmaceuticals and generics, then create a draft post for each new drug. Sounds like an easy project, right?
As with any software, the devil was in the details. After studying what was involved in creating posts using an aggregator, I realized that the solution required three basic steps: (1) fetching web pages; (2) extracting the information from those pages; and (3) making a post in WordPress. I was able to figure out steps (1) and (3) pretty quickly, but step (2) was a little harder. How do I extract information from a web page? Since all the web pages are in HTML, what I needed was an HTML parser. The parser would create a parse tree (or abstract syntax tree) representing the document. I could then write code to walk the tree and extract the information I was looking for. In this case, I wanted to extract particular columns out of a table.
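To make step (2) concrete, here is a minimal sketch of pulling particular columns out of an HTML table. It is written in Python for brevity and is not the Antlr/PHP code described in this post: instead of building a parse tree and walking it, it uses the event-driven `html.parser` module from the standard library. The names `TableColumnExtractor` and `extract_columns`, and the sample table contents, are made up for illustration, and the sketch assumes well-formed, non-nested tables.

```python
from html.parser import HTMLParser


class TableColumnExtractor(HTMLParser):
    """Collect the text of every cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_columns(html, wanted):
    """Return the cells at the column indexes in `wanted` for each row."""
    parser = TableColumnExtractor()
    parser.feed(html)
    parser.close()
    # Skip rows that are too short to contain every wanted column.
    return [[row[i] for i in wanted]
            for row in parser.rows if len(row) > max(wanted)]


# Hypothetical sample data, not taken from any real FDA page.
SAMPLE = (
    "<table>"
    "<tr><th>Name</th><th>Form</th><th>Date</th></tr>"
    "<tr><td>Drug A</td><td>Tablet</td><td>2012-01-05</td></tr>"
    "</table>"
)

if __name__ == "__main__":
    for name, date in extract_columns(SAMPLE, [0, 2]):
        print(name, date)
```

The header row comes back along with the data rows, so a caller that only wants data would drop the first entry. Real pages are rarely this tidy, which is where most of the difficulty lies.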
Parsing is not easy for any language, let alone HTML. While there are a few tools available to parse HTML (e.g., http://simplehtmldom.sourceforge.net/ , http://tidy.sourceforge.net/ ), I didn't want to take the smart route: I wanted to write my own because I was a compiler engineer long ago!
So, I decided to use Antlr, a parser generator. Writing a grammar for HTML wouldn't be easy because HTML is not a context-free language, and it also allows unmatched elements. In addition, the target language for the parser had to be PHP because it had to work with WordPress. Unfortunately, the source code for the Antlr PHP runtime was disorganized. It was out of date, having last been modified over a year earlier, and several developers had created copies of the code, each modified in different ways. All of these problems created a lot of confusion.
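To show what "unmatched elements" means in practice, here is a small sketch, again in Python rather than Antlr/PHP, that tracks open tags on a stack and reports the ones that never get a matching close. The names `TagBalanceChecker` and `find_unmatched` are hypothetical, and the list of void elements is abbreviated. It is only meant to illustrate why a grammar that insists on matched open and close tags would reject a lot of ordinary HTML.

```python
from html.parser import HTMLParser

# Elements that never take an end tag (abbreviated list).
VOID_ELEMENTS = {"br", "hr", "img", "input", "meta", "link", "col"}


class TagBalanceChecker(HTMLParser):
    """Track open tags and record the ones left unmatched."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, innermost last
        self.unclosed = []  # start tags closed implicitly by an outer end tag
        self.stray = []     # end tags with no matching start tag

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if tag not in self.stack:
            self.stray.append(tag)
            return
        # Pop until the matching start tag; anything above it was unclosed.
        while self.stack:
            top = self.stack.pop()
            if top == tag:
                break
            self.unclosed.append(top)


def find_unmatched(html):
    """Return (unclosed start tags, stray end tags) for the given HTML."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    # Tags still open at end of input were never closed either.
    return checker.unclosed + checker.stack, checker.stray


if __name__ == "__main__":
    # Cells without </td> are common in hand-written HTML.
    print(find_unmatched("<table><tr><td>a<td>b</tr></table>"))
```

In the example input, both `<td>` cells lack a closing tag, yet browsers display the table without complaint. A parser for pages found in the wild has to tolerate the same thing.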
As it turned out, I was able to update the Antlr PHP runtime fairly quickly. That code depends on Antlr 3.4 and PHP 5.3, and is available here. Just unpack the zip file and use the enclosed antlr.jar file to generate the parser in PHP, e.g., 'java -jar antlr.jar TableLexer.g'. I was able to use Antlr to find nested HTML tables and HTML rows (TableLexer.g). But, like any Antlr grammar with embedded actions, this one is target dependent because it contains semantic checks written in PHP.
Another example of Antlr targeted to the PHP runtime is an expression grammar. This web page demonstrates how Antlr works with that grammar rewritten for PHP.