Got home late last night, should've gone to bed, but instead decided to write some code.
The following will probably be too simple/obvious for programmers, and too technical for non-programmers, but it was of interest to me, so I'm going to post it. :)
My system for parsing the metadata that authors give us about their stories--name, wordcount, etc.--is fairly robust, but the area where it most often falls down is author names. The algorithm I've been using for quite a while is to assume that everything before the last space in the full name is the first name, and everything after that last space is the last name. But there are about a hundred and fifty authors in our database (out of 7000+ total authors) who have last names that contain spaces for various reasons. (Some have a "Jr." or "III" or something similar at the end, others have a "de" or "von" or something similar at the beginning, and there are a couple other cases too.) I had written a hack a while back that caused the system to convert underscores to spaces after parsing a name into first name + last name; if an author gave their name as "Oscar de la Renta", then when I was entering their metadata into the form, I could add underscores by hand to produce "Oscar de_la_Renta", and the system would decide "de_la_Renta" was the surname and convert it back to "de la Renta" before saving it in the database. But that's kind of roundabout, and anyway I often forgot to add the underscores by hand, resulting in spurious database entries with (for example) first name "Oscar de la" and last name "Renta".
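To make the old behavior concrete, here is a minimal sketch of the last-space rule plus the underscore workaround. It's in Python for brevity rather than the PHP the real system uses, and the function name split_on_last_space and the treatment of one-word names are assumptions for illustration, not the actual code.

```python
def split_on_last_space(full_name):
    """Split a full name at its last space (the old, naive rule).

    Underscores are turned back into spaces only AFTER the split, so a
    hand-entered "de_la_Renta" survives as a single surname.
    """
    # Collapse runs of whitespace so the split is predictable.
    name = " ".join(full_name.split())
    # rpartition splits at the LAST space; for a one-word name it returns
    # ("", "", name), so the word lands in the surname slot (an assumption).
    first, _, last = name.rpartition(" ")
    return (first.replace("_", " "), last.replace("_", " "))


print(split_on_last_space("Oscar de_la_Renta"))  # underscores protect the surname
print(split_on_last_space("Oscar de la Renta"))  # forgot the underscores
```

The second call shows the failure mode: without the underscores, the rule yields first name "Oscar de la" and last name "Renta".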
Last night I was reminding myself to use an underscore in such a name, and decided that it was time I finally implemented something I'd been thinking of for a while: a system that recognizes common parts of multi-word surnames, without my having to put in underscores.
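Roughly, the idea looks like the sketch below. This is Python rather than my PHP, and the word lists, the name parse_name, and the exact order of the steps are hypothetical stand-ins to show the approach, not the code I actually wrote.

```python
# Hypothetical word lists; the real parser's lists are not shown here.
SURNAME_PREFIXES = {"de", "del", "della", "di", "la", "le", "van", "von", "der", "den"}
SURNAME_SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def parse_name(full_name):
    """Return (personal_name, surname), recognizing common surname parts."""
    words = full_name.split()
    if len(words) < 2:
        # One-word (or empty) name: put it in the surname slot (an assumption).
        return ("", " ".join(words))

    # Start by assuming the surname is just the last word.
    start = len(words) - 1

    # If the last word is a suffix like "Jr." or "III", the core surname
    # is the word before it. Always leave at least one word as the first name.
    while start > 1 and words[start].lower().rstrip(".") in SURNAME_SUFFIXES:
        start -= 1

    # Pull in any prefixes ("de", "la", "von", ...) sitting before the surname.
    while start > 1 and words[start - 1].lower() in SURNAME_PREFIXES:
        start -= 1

    return (" ".join(words[:start]), " ".join(words[start:]))


print(parse_name("Oscar de la Renta"))
print(parse_name("Martin Luther King Jr."))
```

Working backward from the end of the name keeps the simple "last word is the surname" rule as the default and only widens the surname when a listed word is present.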
It turned out to be pretty easy. I'm finding a lot of things easier to code in PHP than they used to be, now that I'm not scared of using PHP arrays any more. I'm not sure why I was before; they just seemed intimidating somehow. (I've used arrays in lots of other languages, of course.)
So I spent a couple of hours last night building the name parsing system (and making it a callable subroutine so I could centralize all the name processing in one place). I was pretty sleepy by the end, but I didn't want to leave the database system in a potentially broken state overnight, in case Karen and Susan would be using it this morning before I was up.
One of the things I did was create a test system that lets me enter a full name and returns the parsed personal name and surname, without interacting with the database at all. This was a huge improvement over my old approach of testing changes by entering info into the live database and then deleting it later.
This morning I woke up after too little sleep and was muzzily contemplating the day when I realized that I should take the test system a step further: instead of having to type in a bunch of names by hand, I should create a unit test, or at least something vaguely resembling one: a piece of test code that takes a set of specified inputs and desired results, runs the name parser on the inputs, and compares the actual results to the specified desired results. This way, if I make a change to the name parser, I can run the whole set of tests on it easily, quickly, and automatically, and see immediately if my change has broken anything. And as new unusual names come up, I can add them to the set of tests. So I just finished writing that, and that went pretty smoothly too.
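In outline, that kind of test is just a table of inputs and expected outputs plus a loop. Here's a minimal sketch in Python (not my PHP); naive_parse is a deliberately simple stand-in parser, and the names CASES and run_tests are made up for this example.

```python
def naive_parse(full_name):
    """Stand-in parser (hypothetical): split at the last space."""
    first, _, last = full_name.strip().rpartition(" ")
    return (first, last)


# Each case pairs an input with the desired (personal name, surname).
CASES = [
    ("Jane Doe", ("Jane", "Doe")),
    ("Mary Anne Smith", ("Mary Anne", "Smith")),
    ("Oscar de la Renta", ("Oscar", "de la Renta")),
]


def run_tests(parser, cases):
    """Run the parser on every case and collect the mismatches."""
    failures = []
    for full_name, expected in cases:
        actual = parser(full_name)
        if actual != expected:
            failures.append((full_name, expected, actual))
    return failures


for full_name, expected, actual in run_tests(naive_parse, CASES):
    print(f"FAIL {full_name!r}: expected {expected}, got {actual}")
```

With the stand-in parser, only the "Oscar de la Renta" case is reported as a failure. Adding a newly encountered unusual name is one more line in the table.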
The parser is still far from perfect. There's an entire class of Spanish surnames it can't handle (names like "García y López"), and there are plenty of other surnames in the world that contain spaces but don't contain any of the prefixes or suffixes I'm looking for. Also, there are cases where a surname prefix is indistinguishable (without further context) from a first or middle name; for example, it's hard (especially for a computer program) to tell whether the "Ben" in "Joshua Ben David" is part of the personal name or the surname. And "Van" (with capital V) is another surname prefix that sometimes (though rarely) appears as a first or middle name.
But the parser now correctly parses all but half a dozen of the names of authors who've actually submitted to us. So I'm pretty pleased with it.
Arguably, it was silly to spend two hours last night and another two hours this morning writing code that will probably save me about half an hour a year, if that. But (a) it'll also save me a fair bit of frustration and annoyance; and (b) it was good practice--I'm trying to teach myself better software-engineering practices, and I've been meaning to try writing a unit test for months now. (And pulling the name-parsing code out into a separate function will make a fair number of things cleaner and simpler to implement in the future, and will improve various kludgy parts of the existing system.)

Yay! Jed is test infected!
If you are not familiar with PHP functions and want to find the PHP functions whose names contain a particular word (for example, all the functions containing 'url'), try this site: [URL removed by Jed]
Hi, Leo -- I've removed the URL from your comment because, although your service does what you advertise it as doing, I'm afraid I feel that it has a bad user interface, and the very prominent placement of Google ads (more prominent than the PHP search itself) suggests that your goal is to make money rather than to provide a useful service. (It's quite possible to do both, but I don't get the impression that that's your intent.)
I personally find that the documentation on the PHP website is clear and well-organized and easy to search, and reasonably well presented; I don't see a need for a third-party service that does a search of their documentation. It's true that yours searches for substrings, so if your interface were nicer and more prominent I might have allowed your link, but I'm afraid as it is I don't feel that your service adds enough value.
Note to others: I'm fine with allowing links from here to commercial services that I like, or noncommercial services that I don't think much of. But I draw the line at allowing my site to be used to increase the PageRank of a commercial service that I don't feel is useful to my readers.
So... any chance you'll share the code with us? I found your site whilst searching for 'name parser' as I need to implement something similar.
Thanks,
James
Sure thing--didn't occur to me anyone would be interested. I don't seem to be able to use a pre tag in comments, so I've posted the name parser code on a separate page. If you have any questions, feel free to post them here or email me.
Hope it's useful!
Hey, thanks for the code! I was looking for a parser for author names (I'm writing a MySQL-based card catalogue for my personal book collection). You know how hard it is to find a Free/Open Source name parser? (Well, obviously, you do, since you had to write one, too. *grin*)
normalize_name() wasn't quite what I needed (I wanted to be able to parse it down to title, first, middle, last, suffix), but I got over some humps by looking at your code. I ended up ditching the regexes in favour of a state machine of sorts. It handles the Spanish last names (such as "Garcia y Lopez") you mentioned, and, since it lacks regexes ('\b' is a devil, isn't it?), apostrophes don't mess up tokenization: two things I never would've thought to check had I not read normalize_name. Some of the test vectors I used were:
John Doe
Doe, John. A., Jr.
Juan Velasquez y Garcia III
Velasquez y Garcia, Dr. Juan Q. Xavier
Again, thanks for the inspiration (and the help on multi-word last names, which, if you read my code, is essentially the same as your method, just applied to the tokenized stream). If you're interested, you can check it out at http://alphahelical.com/code/misc/nameparse .
Again, Thanks! and Cheers!
Keith
Thanks, Keith! I'll post more thoughts about this when I have a chance to read through your code.
For now, just a couple notes before I forget, about the limitations of my code--partly as a note to myself, partly because at the moment this journal entry is the first Google result for [php name parser] so I expect I'll get a certain amount of traffic from people interested in name parsing.
1. My code has a bug: it doesn't correctly deal with three initials in a row (as in "M.F.K. Fisher"). I thought I'd tested that, but apparently not. Will fix soon.
2. My code lumps first and middle names together, because it can't distinguish between a two-word first name ("Mary Anne", "Norma Jeane") and a first-name-plus-middle-name. It could look for common two-word first names, but I don't think even a human can tell for sure, by visual examination, whether the first two words in a name like "Mary Lou Retton" are one name or two.
3. Likewise, my code is unable to recognize multi-word surnames that don't use prefixes or infixes. Like "Gabriel García Márquez"--I gather that his last name is "García Márquez," but I don't see any way to determine programmatically that "García" in this case is part of the last name rather than a middle name. (I could compare a given "middle" name with a list of known last names--except that it's quite common, at least in the US, for common surnames to be used as middle names, and sometimes even as first names.)
4. My code assumes that the personal name comes first and the family name comes last. That's an invalid assumption in many cultures.
5. My code does correctly recognize that in some cultures, it's fairly common for people to have only one name.
6. My code assumes that the name is in a Western alphabet. For my purposes, that's a reasonable assumption; submissions to the magazine have to be in English, so it's reasonable to assume that authors who would normally use a different set of characters will transliterate their name into Western European letters. Even so, my code may well have flaws when confronted with accented characters and such. At the moment, I'm not worrying about that, because at least two other parts of my system force ASCII or Latin-1, but in the long run I want the whole system to use Unicode.
First, thank you very much for writing up your story! It made me feel less alone :-D.
I agree with all of your reasoning (the six points in your previous post). I just wanted to add that when I find a name like A. B. Cin, I take the first name to be everything up to the last dot (I don't use a separate middle name), and everything after that to be the surname. If the name contains no dots, then only the first word is the first name.
I also allow for a first name that contains a dash (like Ana-Marija). Mostly everything else will be the surname.
I hope you have found your answers in the years since (I mean, your last post was in 2007).
Cheers!!!
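A minimal sketch of the rule described in this comment, in Python: everything up to the last dot is the first name, and with no dots only the first word is. The function name split_by_last_dot is hypothetical, and this is one reading of the rule rather than the commenter's actual code.

```python
def split_by_last_dot(full_name):
    """Split into (first_name, surname) using the last-dot rule."""
    name = " ".join(full_name.split())  # collapse extra whitespace
    dot = name.rfind(".")
    if dot != -1:
        # Initials present: first name runs through the last dot.
        return (name[: dot + 1].strip(), name[dot + 1 :].strip())
    # No dots: only the first word is the first name. A dash does not
    # split a word, so "Ana-Marija" stays together.
    first, _, rest = name.partition(" ")
    return (first, rest)


print(split_by_last_dot("A. B. Cin"))
print(split_by_last_dot("Ana-Marija Horvat"))
```

Note that a trailing suffix with a dot, such as "Jr.", would be swallowed into the first name under this rule, so it would need separate handling.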