Note to start: If you're not interested in parsing Web IDL (or if you don't know what that means), this entry will probably not be relevant to you.
For a work project, I'm trying to parse pieces of the HTML5 specification.
Among other things, I want to parse the Web IDL fragments that describe the methods and attributes for the DOM interfaces that correspond to HTML elements.
For someone pretty new to Python, this is not an easy task.
So this entry documents some of the steps I went through, both for my own future reference and in case anyone else is trying to figure out how to parse Web IDL using Python.
The code in question, webidl-check, relies on lxml, a Python library for parsing XML and HTML. I think I ran into some trouble installing that without knowing what I was doing, but the details are hazy in my memory at this point.
It also relies on widlproc, which converts Web IDL to XML. I grabbed it from its Subversion repository and built it locally, which I vaguely recall went pretty smoothly.
Eventually I got lxml and widlproc set up, copied the code from the online copy of webidl-check into a local file that I called webidl-check.py, and tried to run it from the command line. At that point it gave me a cryptic error message about a missing file called webidl-schematron.xsl.
Googling for that filename didn't help. Eventually I figured out that webidl-check has relevant support files in the neighboring widlproc-schematron and web-platform directories. Probably if I had checked these files out using CVS instead of just copying and pasting the code from that one file, things would've gone more smoothly.
Anyway, I got the support files in place, and created a small file containing nothing but a Web IDL fragment (copied and pasted from the HTML5 spec) and ran webidl-check from the command line again, giving it the URL of that Web IDL file.
And it told me that the file contained no Web IDL.
Cue half an hour of (a) looking at the Web IDL spec (including copying and pasting an example fragment directly from there), (b) trying various command-line options, and (c) reading through the source code for webidl-check, learning about Python CGI along the way. (webidl-check will also run as a CGI script, but I want to use it on the command line. Fortunately, it works on the command line as well.)
Eventually I found a comment in the source code that said "we try the HTMLParser", followed a little later by a comment that said "and then get extracts IDLs through XPath", and I realized that webidl-check is looking for an HTML file, not a raw Web IDL file.
I enclosed my Web IDL fragment in a minimal HTML file. But webidl-check still couldn't see any IDL in the file.
Then I saw the actual XPath code that webidl-check uses to find IDL fragments.
It turns out that to use webidl-check, your Web IDL fragments have to be in an HTML file, enclosed in one of the following tags:
Not the most robust approach in the world. Nor the best-documented. So I figured I should post an entry describing this requirement in case anyone else is having the same difficulty.
Anyway, I added an appropriate class name to my <pre> tag, and I ran webidl-check again, and lo! It worked!
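For illustration, here is a minimal sketch of that setup: wrapping a raw Web IDL fragment in a bare-bones HTML page, and then pulling it back out of the <pre> tag. A few assumptions to flag: the class name "idl" is the one the HTML5 spec uses on its own IDL blocks, and I have not confirmed here which class names webidl-check accepts; the helper names (wrap_idl_in_html, IdlExtractor, extract_idl) are made up for this example; and it uses the standard library's html.parser rather than the lxml XPath lookup that webidl-check itself performs.

```python
from html import escape
from html.parser import HTMLParser


def wrap_idl_in_html(idl_text, class_name="idl"):
    """Wrap a raw Web IDL fragment in a minimal HTML page.

    class_name="idl" is an assumption borrowed from the HTML5 spec's markup.
    """
    return (
        "<!DOCTYPE html>\n"
        "<html><head><title>IDL test</title></head><body>\n"
        # escape() protects characters such as < and > in sequence<long>
        f'<pre class="{class_name}">{escape(idl_text)}</pre>\n'
        "</body></html>\n"
    )


class IdlExtractor(HTMLParser):
    """Collect the text of every <pre> whose class list contains class_name."""

    def __init__(self, class_name="idl"):
        super().__init__(convert_charrefs=True)
        self.class_name = class_name
        self.fragments = []
        self._buffer = None  # a list while inside a matching <pre>, else None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "pre" and self.class_name in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "pre" and self._buffer is not None:
            self.fragments.append("".join(self._buffer))
            self._buffer = None


def extract_idl(html_text, class_name="idl"):
    """Return the IDL fragments found in html_text, in document order."""
    parser = IdlExtractor(class_name)
    parser.feed(html_text)
    parser.close()
    return parser.fragments
```

Calling extract_idl(wrap_idl_in_html(fragment)) should hand back the original fragment, while a <pre> with no class attribute is skipped, which mirrors the behavior that tripped me up.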
Of course, the output is XML rather than a data structure. What I really want is not to check a Web IDL file for correctness, but to parse Web IDL code into a data structure.
So it looks like what I'll need to do is borrow webidl-check's code that calls widlproc. I'll call widlproc on each Web IDL fragment to convert it to XML, and then I'll use XPath (from lxml) to parse the XML into data structures.
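As a rough sketch of that plan, something like the following might work. It rests on several assumptions: that widlproc takes the path of a Web IDL file as its argument and writes XML to standard output; that the XML has Interface, Attribute, and Operation elements carrying a name attribute (the sample below is hand-written and simplified, not real widlproc output); and that the function names are invented for this example. It also uses the standard library's xml.etree.ElementTree, whose find/findall paths are a small subset of the XPath that lxml offers.

```python
import os
import subprocess
import tempfile
import xml.etree.ElementTree as ET


def idl_to_xml(idl_text, command=("widlproc",)):
    """Write idl_text to a temp file, run `command <file>`, return its stdout.

    Assumes the converter reads a file path and prints XML to stdout.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".widl", delete=False) as handle:
        handle.write(idl_text)
        path = handle.name
    try:
        result = subprocess.run(
            list(command) + [path], capture_output=True, text=True, check=True
        )
    finally:
        os.remove(path)  # clean up the temp file even if the command fails
    return result.stdout


def parse_interfaces(xml_text):
    """Map each interface name to the names of its attributes and operations."""
    root = ET.fromstring(xml_text)
    interfaces = {}
    for interface in root.iter("Interface"):
        interfaces[interface.get("name")] = {
            "attributes": [a.get("name") for a in interface.findall("Attribute")],
            "operations": [o.get("name") for o in interface.findall("Operation")],
        }
    return interfaces


# Hand-written, simplified stand-in for converter output (HTMLFoo is hypothetical).
SAMPLE_XML = """<Definitions>
  <Interface name="HTMLFoo">
    <Attribute name="name"><Type type="DOMString"/></Attribute>
    <Operation name="reset"><Type type="void"/><ArgumentList/></Operation>
  </Interface>
</Definitions>"""

if __name__ == "__main__":
    print(parse_interfaces(SAMPLE_XML))
```

With widlproc on the PATH, the two pieces would chain together as parse_interfaces(idl_to_xml(fragment)). Swapping in lxml should mostly be a matter of changing the import and replacing findall with xpath calls.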
Kind of kludgy, but I think it'll work.