Name parser
Got home late last night, should've gone to bed, but instead decided to write some code.
The following will probably be too simple/obvious for programmers, and too technical for non-programmers, but it was of interest to me, so I'm going to post it. :)
My system for parsing the metadata that authors give us about their stories--name, wordcount, etc.--is fairly robust, but the area where it most often falls down is author names. The algorithm I've been using for quite a while is to assume that everything before the last space in the full name is the first name, and everything after that last space is the last name. But there are about a hundred and fifty authors in our database (out of 7000+ total authors) who have last names that contain spaces for various reasons. (Some have a "Jr." or "III" or something similar at the end, others have a "de" or "von" or something similar at the beginning, and there are a couple of other cases too.)
I had written a hack a while back that caused the system to convert underscores to spaces after parsing a name into first name + last name. If an author gave their name as "Oscar de la Renta", then when I was entering their metadata into the form, I could add underscores by hand to produce "Oscar de_la_Renta", and the system would decide "de_la_Renta" was the surname and convert it back to "de la Renta" before saving it in the database. But that's kind of roundabout, and anyway I often forgot to add the underscores by hand, resulting in spurious database entries with (for example) first name "Oscar de la" and last name "Renta".
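For illustration, here is a minimal sketch of the last-space rule and the underscore workaround described above. It's written in Python rather than the PHP the real system uses, and the function name split_on_last_space is made up for this example, so treat it as a sketch of the idea and not the actual code.

```python
def split_on_last_space(full_name):
    """Split a full name at its last space (hypothetical helper).

    Everything before the last space is treated as the first name,
    everything after it as the last name.
    """
    first, _, last = full_name.strip().rpartition(" ")
    # Underscore workaround: underscores typed by hand keep a multi-word
    # surname together during the split, then become spaces again.
    return first.replace("_", " "), last.replace("_", " ")


print(split_on_last_space("Oscar de_la_Renta"))  # ('Oscar', 'de la Renta')
print(split_on_last_space("Oscar de la Renta"))  # ('Oscar de la', 'Renta')
```

The second call shows the failure mode: without the hand-typed underscores, "de la" ends up in the first name.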
Last night I was reminding myself to use an underscore in such a name, and decided that it was time I finally implemented something I'd been thinking of for a while: a system that recognizes common parts of multi-word surnames, without my having to put in underscores.
It turned out to be pretty easy. I'm finding a lot of things easier to code in PHP than they used to be, now that I'm not scared of using PHP arrays any more. I'm not sure why I was before; they just seemed intimidating somehow. (I've used arrays in lots of other languages, of course.)
So I spent a couple of hours last night building the name parsing system (and making it a callable subroutine so I could centralize all the name processing in one place). I was pretty sleepy by the end, but I didn't want to leave the database system in a potentially broken state overnight, in case Karen and Susan would be using it this morning before I was up.
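One way such a subroutine might look is sketched below, in Python rather than PHP. The function name parse_name, the prefix and suffix lists, and the rule that the first word always stays in the personal name are all assumptions made for this example; the real parser's lists and rules aren't given here.

```python
# Hypothetical word lists; the real ones are presumably longer.
SURNAME_PREFIXES = {"de", "la", "del", "der", "di", "van", "von"}
SURNAME_SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def parse_name(full_name):
    """Return (personal_name, surname) for a full name."""
    tokens = full_name.split()
    if not tokens:
        return "", ""

    # Step 1: skip back over trailing suffixes such as "Jr." or "III".
    end = len(tokens)
    while end > 1 and tokens[end - 1].strip(".,").lower() in SURNAME_SUFFIXES:
        end -= 1

    # Step 2: the word just before any suffixes is the core of the surname.
    start = end - 1

    # Step 3: pull in preceding prefixes such as "de" or "von", but never
    # the very first word, so a first name like "Van" stays a first name.
    while start > 1 and tokens[start - 1].lower() in SURNAME_PREFIXES:
        start -= 1

    return " ".join(tokens[:start]), " ".join(tokens[start:])


print(parse_name("Oscar de la Renta"))  # ('Oscar', 'de la Renta')
print(parse_name("Jane Smith III"))     # ('Jane', 'Smith III')
print(parse_name("Van Morrison"))       # ('Van', 'Morrison')
```

Keeping the word lists as plain data means a newly encountered prefix or suffix can be added without touching the parsing logic.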
One of the things I did was create a test system that lets me enter a full name and returns the parsed personal name and surname, without interacting with the database at all. This was a huge improvement over my old system of testing changes by entering info into the live database and then deleting it later.
This morning I woke up after too little sleep and was muzzily contemplating the day when I realized that I should take the test system a step further. Instead of having to type in a bunch of names by hand, I should create a unit test, or at least something vaguely resembling one: a piece of test code that takes a set of specified inputs and desired results, runs the name parser on the inputs, and compares the actual results to the desired ones. This way, if I make a change to the name parser, I can run the whole set of tests on it easily, quickly, and automatically, and see immediately if my change has broken anything. And as new unusual names come up, I can add them to the set of tests. So I just finished writing that, and that went pretty smoothly too.
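A test of that shape can be quite small. The sketch below is in Python rather than PHP, and everything in it is made up for illustration: the names run_name_tests and CASES, the sample names, and the simple last-space splitter that stands in for the real parser so the example runs on its own.

```python
def split_on_last_space(full_name):
    """Stand-in parser: split at the last space (not the real parser)."""
    first, _, last = full_name.strip().rpartition(" ")
    return first, last


# Each case pairs an input with the desired (personal name, surname) result.
CASES = [
    ("Jane Smith", ("Jane", "Smith")),
    ("Mary Ann Jones", ("Mary Ann", "Jones")),
    ("Oscar de la Renta", ("Oscar", "de la Renta")),
]


def run_name_tests(parser, cases):
    """Run parser on every input; return (input, expected, actual) per failure."""
    failures = []
    for full_name, expected in cases:
        actual = parser(full_name)
        if actual != expected:
            failures.append((full_name, expected, actual))
    return failures


failures = run_name_tests(split_on_last_space, CASES)
for full_name, expected, actual in failures:
    print(f"FAIL {full_name!r}: expected {expected}, got {actual}")
print(f"{len(CASES) - len(failures)} of {len(CASES)} passed")
```

With the stand-in parser, the "Oscar de la Renta" case fails and the other two pass, which is exactly the kind of report that shows at a glance what a change has broken. Adding a new unusual name is one more line in CASES.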
The parser is still far from perfect. There's an entire class of Spanish surnames it can't handle (names like "García y López"), and there are plenty of other surnames in the world that contain spaces but don't contain any of the prefixes or suffixes I'm looking for. Also, there are cases where a surname prefix is indistinguishable (without further context) from a first or middle name; for example, it's hard (especially for a computer program) to tell whether the "Ben" in "Joshua Ben David" is part of the personal name or the surname. And "Van" (with capital V) is another surname prefix that sometimes (though rarely) appears as a first or middle name.
But the parser now correctly parses all but half a dozen of the names of authors who've actually submitted to us. So I'm pretty pleased with it.
Arguably, it was silly to spend two hours last night and another two hours this morning writing code that will probably save me about half an hour a year, if that. But (a) it'll also save me a fair bit of frustration and annoyance; and (b) it was good practice--I'm trying to teach myself better software-engineering practices, and I've been meaning to try writing a unit test for months now. (And pulling the name-parsing code out into a separate function will make a fair number of things cleaner and simpler to implement in the future, and will improve various kludgy parts of the existing system.)