
RTF submission system tweak


A few authors have received mysterious Server Errors when they submit using our RTF submission form.

The short version of the following is that I think I've fixed the problem, but it's possible that the fix has introduced a worse problem. So if you submit to us in the next couple days and you receive a Server Error (or any other kind of error), please contact us immediately and let us know.

Technical version of the story:

I finally decided to track down the problem tonight (even though what I really should've been doing was reading subs and editing). I thought there must be some error condition that my code wasn't checking for.

After a long comedy of errors in which I attempted to download the Perl code to my local disk for testing but discovered (among other things) that Dreamweaver crashes on trying to download a UNIX file containing "::" in the filename to a Macintosh disk, I finally decided to just make a test copy of the conversion system on the SH site. After some initial testing, I discovered that the Server Error was occurring during the RTF parsing, not in my code at all. I'm using a (lightly modified) standard Perl module for the parsing--specifically, I'm using RTF::TEXT::Converter, which is part of RTF-Parser-1.09, which relies on RTF-Tokenizer-1.08. (Okay, actually I think it relies on any version of RTF-Tokenizer after 1.01, but I'm using 1.08.)

Not wanting to try to find an error in the RTF conversion code from scratch, I figured I could simplify things by figuring out what RTF code the parser was choking on. So I took a known-not-working RTF file and started removing code from it (using BBEdit) and then running the result through the test copy of the submission form. (As it turned out, there was probably a much better way of tracking down the problem, but I'm not a good enough programmer to have known about it; see below.)
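That trim-and-retest loop can also be automated. What follows is a minimal sketch in Python rather than Perl, and it is not what was actually done here (the trimming above was by hand in BBEdit). The names `shrink` and `still_fails` are hypothetical; `still_fails` stands in for "run these lines through the test copy of the form and see whether the Server Error still happens."

```python
def shrink(lines, still_fails):
    """Drop lines one at a time, keeping each removal only if the failure remains."""
    kept = list(lines)
    i = 0
    while i < len(kept):
        candidate = kept[:i] + kept[i + 1:]
        if still_fails(candidate):
            kept = candidate  # line i was not needed to trigger the failure
        else:
            i += 1            # line i matters, so keep it and move on
    return kept


# Stand-in check: pretend the parser chokes whenever the stylesheet line is present.
sample = ["header", "stylesheet line", "body text", "footer"]
culprit = shrink(sample, lambda ls: "stylesheet line" in ls)
```

With the stand-in check above, `culprit` ends up holding only the line that triggers the failure. A real check would be much slower, since each call means a full conversion attempt.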

Weirdly, what the parser seemed to be choking on was the RTF file's stylesheet. If I removed the stylesheet, the parsing worked; if I put it back in, it failed. In fact, the parsing failed if even one line of the stylesheet was in place.

Eventually I tried using a known-good RTF file's stylesheet in the problem file, and that worked fine.

I was baffled for a bit, but eventually I looked closely at the lines in the problem stylesheet, and discovered that there was no space between (for example) "\sbasedon32" and "Block Text". Whereas in the stylesheet that worked, there was a space after the last item in the style definition and before the style name.

So I added spaces to the original stylesheet, and it worked fine.

Which left me thinking "Okay, so the tokenizer is what's choking. But it's probably not possible to determine where a style definition element ends and a style name begins. So probably people whose word processors generate RTF with missing spaces are just out of luck."

But I figured it couldn't hurt to look at the tokenizer code. And lo and behold, the authors of the RTF Perl modules (who I'm pretty impressed with at this point) had anticipated even this problem. There's documentation in the module that explains that if you need to parse this particular form of invalid RTF, all you need to do is turn on the "sloppy" option. I looked for the code, and realized that in sloppy mode, the tokenizer looks for strings that have digits in the middle followed by letters, and acts as if there was a space after the digits. (At least, I think that's what it was doing; I was beginning to get sleepy, so I didn't look at it too closely.)
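To make the idea concrete, here is a toy illustration in Python of what a "sloppy" switch of this kind might do. It is only a sketch of my reading of the behavior described above, not the RTF::Tokenizer code; the function name `tokenize`, the regular expression, and the token shapes are all made up for the example.

```python
import re

# One control word: backslash, lowercase letters, an optional signed number,
# and an optional single space that acts as the delimiter.
CONTROL = re.compile(r"\\([a-z]+)(-?\d+)?( ?)")


def tokenize(rtf, sloppy=False):
    """Split an RTF fragment into ('control', name, arg) and ('text', chars) tokens."""
    tokens = []
    pos = 0
    while pos < len(rtf):
        match = CONTROL.match(rtf, pos)
        if match:
            name, arg, space = match.groups()
            pos = match.end()
            # Digits running straight into a letter is the malformed case.
            runs_into_letter = (
                arg is not None
                and not space
                and pos < len(rtf)
                and rtf[pos].isalpha()
            )
            if runs_into_letter and not sloppy:
                raise ValueError(f"no delimiter after \\{name}{arg}")
            # In sloppy mode, carry on as if a space had followed the digits.
            tokens.append(("control", name, arg))
        else:
            # Anything else is treated as text up to the next backslash.
            end = rtf.find("\\", pos + 1)
            if end == -1:
                end = len(rtf)
            tokens.append(("text", rtf[pos:end]))
            pos = end
    return tokens


good = tokenize(r"\sbasedon32 Block Text;")
repaired = tokenize(r"\sbasedon32Block Text;", sloppy=True)
```

In this toy version the well-formed and the malformed input produce the same tokens once sloppy mode is on, and the malformed input raises an error when it is off. Well-formed input goes through the same path either way, which fits the hope that sloppy mode leaves valid RTF alone, though the toy proves nothing about the real module.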

So then it was a matter of finding the code that calls the tokenizer and telling it to turn on sloppy mode.

If I were a real software engineer, I would've subclassed RTF::Parser. There are even instructions in Parser.pm on how to do that.

But I'm not an engineer, and I'm sleepy, and learning how to subclass is beyond tonight's meager remaining cognitive capabilities. So I just edited Parser.pm. I know, I know, bad Jed. Some day I'll come back and do it right. But for now, this seems to work.

But the one thing that the code doesn't explain is what the downside is, if any, of using sloppy mode. I tried a known-good RTF file and that converted okay, so I'm hoping that turning on sloppy mode doesn't cause non-sloppy RTF to break. But we'll see. It seems like if there were no downside, it would be on by default. But maybe the downside is just that it doesn't tell you when your RTF is invalid.

(Also, if I were an engineer I would probably be able to detect an error condition in the parsing and generate a useful error message, rather than just dying with a Server Error. I discovered tonight that the module uses Carp to generate error messages; perhaps some day I will learn to detect Carp error messages, which will make it a whole lot easier to track down the next parsing problem that comes up. But not tonight.)
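The general shape of that kind of error trapping, sketched in Python for illustration: wrap the conversion call, catch whatever it raises, and hand back a readable message instead of letting the whole request die. In Perl the usual equivalent is wrapping the call in an `eval` block and checking `$@`, since errors raised through Carp's `croak` are ordinary `die` errors. The names `convert_submission` and `converter` are hypothetical stand-ins, not part of the submission system.

```python
def convert_submission(rtf, converter):
    """Run the converter and return (text, error); error is None on success."""
    try:
        return converter(rtf), None
    except Exception as exc:
        # Report the parser's own complaint instead of a bare Server Error.
        message = (
            "Sorry, we couldn't read your RTF file. "
            f"Please send us this detail: {exc}"
        )
        return None, message


def broken_converter(rtf):
    # Stand-in for a parser that gives up on malformed input.
    raise ValueError("no delimiter after \\sbasedon32")


text, error = convert_submission("{\\rtf1 ...}", broken_converter)
```

Here `text` comes back as `None` and `error` carries the parser's complaint, which is the piece of information that would have shortened the debugging session described above.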

In the meantime, if any of y'all know the people who make Abiword, you might let them know that their RTF generator generates slightly malformed stylesheets. At least in some version; for all I know, this has been fixed but the author of the story in question was using an older version. If I were more awake, I would try and track the Abiword people down and drop them a note myself, but that's not happening tonight.

The only thing that's happening tonight is that I'm going to bed. But again, if you submit to us anytime soon and you get an error message, please let us know ASAP.

1 Comment

Hello,

How exactly do you pass the sloppy option to the new method? My Apache log just said to either pass this option to the new method or use the sloppy method.

I tried to find any info in the documentation or online, but didn't find anything.

Thanks for your help.

Larry

