
Using the W3C's Web IDL checker


Note to start: If you're not interested in parsing Web IDL (or if you don't know what that means), this entry will probably not be relevant to you.

For a work project, I'm trying to parse pieces of the HTML5 specification.

Among other things, I want to parse the Web IDL fragments that describe the methods and attributes for the DOM interfaces that correspond to HTML elements.

For someone pretty new to Python, this is not an easy task.

So this entry documents some of the steps I went through, both for my own future reference and in case anyone else is trying to figure out how to parse Web IDL using Python.

Some judicious Googling led me to the W3C's webidl-checker system, and specifically the webidl-check code.

That code relies on lxml, a Python library for parsing XML and HTML. I think I ran into some trouble installing that without knowing what I was doing, but the details are hazy in my memory at this point.

It also relies on widlproc, which converts Web IDL to XML. I grabbed it from its subversion repository and built it locally, which I vaguely recall went pretty smoothly.

Eventually I got lxml and widlproc set up. I copied the code from the online copy of webidl-check into a local file that I called webidl-check.py and tried to run it from the command line, at which point it gave me a cryptic error message about a missing file called webidl-schematron.xsl.

Googling for that filename didn't help. Eventually I figured out that webidl-check has relevant support files in the neighboring widlproc-schematron and web-platform directories. Probably if I had checked these files out using CVS instead of just copying and pasting the code from that one file, things would've gone more smoothly.

Anyway, I got the support files in place, and created a small file containing nothing but a Web IDL fragment (copied and pasted from the HTML5 spec) and ran webidl-check from the command line again, giving it the URL of that Web IDL file.

And it told me that the file contained no Web IDL.

Cue half an hour of (a) looking at the Web IDL spec (including copying and pasting an example fragment directly from there), (b) trying various command-line options, and (c) reading through the source code for webidl-check, learning about Python CGI along the way. (webidl-check will also run as a CGI script, but I want to use it on the command line. Fortunately, it works on the command line as well.)

Eventually I found a comment in the source code that said "we try the HTMLParser," followed a little later by a comment that said "and then get extracts IDLs through XPath," and I realized that webidl-check is looking for an HTML file, not a raw Web IDL file.

I enclosed my Web IDL fragment in a minimal HTML file. But webidl-check still couldn't see any IDL in the file.

Then I saw the actual XPath code that webidl-check uses to find IDL fragments.

It turns out that to use webidl-check, your Web IDL fragments have to be in an HTML file, enclosed in one of the following tags:

  • <pre class="idl">
  • <pre class="webidl">
  • <code class="idl-code">

Not the most robust approach in the world. Nor the best-documented. So I figured I should post an entry describing this requirement in case anyone else is having the same difficulty.
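To make that requirement concrete, here is a rough sketch of how fragments in those three containers could be pulled out of an HTML file. It is not webidl-check's actual XPath code: it uses Python's standard-library html.parser instead of lxml, the names IdlExtractor and extract_idl are made up for illustration, and treating the class attribute as a whitespace-separated list of tokens is my assumption about how the matching should behave.

```python
from html.parser import HTMLParser

# The (tag, class) pairs listed above.
IDL_CONTAINERS = {("pre", "idl"), ("pre", "webidl"), ("code", "idl-code")}


class IdlExtractor(HTMLParser):
    """Collects the text inside each recognized IDL container (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fragments = []
        self._open_tag = None   # tag name of the container we are inside, if any
        self._depth = 0         # nesting depth of that same tag name
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._open_tag is None:
            # Assumption: class is matched as a whitespace-separated token.
            classes = (dict(attrs).get("class") or "").split()
            if any((tag, cls) in IDL_CONTAINERS for cls in classes):
                self._open_tag = tag
                self._depth = 1
                self._buffer = []
        elif tag == self._open_tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._open_tag is not None and tag == self._open_tag:
            self._depth -= 1
            if self._depth == 0:
                self.fragments.append("".join(self._buffer).strip())
                self._open_tag = None

    def handle_data(self, data):
        if self._open_tag is not None:
            self._buffer.append(data)


def extract_idl(html_text):
    """Return the text of every IDL container found in html_text."""
    parser = IdlExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.fragments
```

Feeding it a page with a bare <pre> returns an empty list, which matches the "no Web IDL" result I kept getting before I added the class name.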

Anyway, I added an appropriate class name to my <pre> tag, and I ran webidl-check again, and lo! It worked!

Of course, the output is XML rather than a data structure. What I really want is not to check a Web IDL file for correctness, but to parse Web IDL code into a data structure.

So it looks like what I'll need to do is borrow webidl-check's code that calls widlproc. I'll call widlproc on each Web IDL fragment to convert it to XML, and then I'll use XPath (from lxml) to parse the XML into data structures.
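Roughly, that plan might look like the sketch below. Several things here are assumptions rather than facts: that widlproc is on the PATH and takes the IDL file name as its only argument, that its output uses Interface and Attribute elements carrying a name attribute, and the function names themselves. The sample XML is hand-written for illustration, not real widlproc output. I've used the standard library's xml.etree.ElementTree, which supports a limited subset of XPath, in place of lxml.

```python
import subprocess
import xml.etree.ElementTree as ET


def run_widlproc(idl_path, widlproc="widlproc"):
    """Run widlproc on a Web IDL file and return its XML output as text.

    Assumption: the binary is on PATH and accepts the file name as its argument.
    """
    result = subprocess.run(
        [widlproc, idl_path], capture_output=True, text=True, check=True
    )
    return result.stdout


def attributes_by_interface(xml_text):
    """Map each interface name to the list of its attribute names.

    Assumption: the XML uses <Interface name="..."> and <Attribute name="...">.
    """
    root = ET.fromstring(xml_text)
    found = {}
    for interface in root.iter("Interface"):
        # ".//Attribute" is an XPath-style query for all descendant Attribute elements.
        names = [attr.get("name") for attr in interface.findall(".//Attribute")]
        found[interface.get("name")] = names
    return found


# Hand-written illustrative input, not actual widlproc output.
SAMPLE_XML = """<Definitions>
  <Interface name="HTMLExampleElement">
    <Attribute name="title"/>
    <Attribute name="hidden"/>
  </Interface>
</Definitions>"""

if __name__ == "__main__":
    print(attributes_by_interface(SAMPLE_XML))
```

In real use, the string passed to attributes_by_interface would come from run_widlproc rather than from a hard-coded sample.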

Kind of kludgy, but I think it'll work.


Whoa. That certainly seems like a very complicated way to do it. I can sort of understand conversion to XML for some uses, but as the default method it really seems bass-ackwards: every language has much more convenient accessors (JS or Java objects) than an XML infoset.
Too bad if the standard implementation takes such a complicated path.

I am also trying to find WebIDL processing tools, but given the seeming lack of them (for me, I'd want Java tools), perhaps I need to consider writing a simple parser (using ANTLR or something) and see how simple it would be.

Yeah, I agree that conversion to XML should not be necessary.

For a real programmer (unlike me), probably your best bet would be to grab the C source code for widlproc from its svn repository, and integrate its "Hand-crafted recursive descent parser" with your own code. The processfiles() routine appears to convert the Web IDL into internal data structures (root = parse();) before writing it out as XML (outputnode(root, 0);). So you could take the widlproc source code and call parse() directly from your own code, then use the resulting data structure.

But I was not up to that task, especially because I'm on a tight deadline and don't yet know anything about how to call C from Python, so I made do with the tools I had. I'm now using Python to call the compiled widlproc, then capturing the XML output and using BeautifulStoneSoup (from Beautiful Soup) to (for example) find all of the Attribute elements.

Although this approach is definitely kludgy, one advantage of it is that I don't have to learn about (or programmatically navigate through) the internal data structures used by widlproc; I already understand XML, and using BeautifulStoneSoup means I don't even have to mess with XPath.
