
Name parser


Got home late last night, should've gone to bed, but instead decided to write some code.

The following will probably be too simple/obvious for programmers, and too technical for non-programmers, but it was of interest to me, so I'm going to post it. :)

My system for parsing the metadata that authors give us about their stories--name, wordcount, etc.--is fairly robust, but the area where it most often falls down is author names. The algorithm I've been using for quite a while is to assume that everything before the last space in the full name is the first name, and everything after that last space is the last name. But there are about a hundred and fifty authors in our database (out of 7000+ total authors) who have last names that contain spaces for various reasons. (Some have a "Jr." or "III" or something similar at the end, others have a "de" or "von" or something similar at the beginning, and there are a couple of other cases too.)

I had written a hack a while back that caused the system to convert underscores to spaces after parsing a name into first name + last name; if an author gave their name as "Oscar de la Renta", then when I was entering their metadata into the form, I could add underscores by hand to produce "Oscar de_la_Renta", and the system would decide "de_la_Renta" was the surname and convert it back to "de la Renta" before saving it in the database. But that's kind of roundabout, and anyway I often forgot to add the underscores by hand, resulting in spurious database entries with (for example) first name "Oscar de la" and last name "Renta".
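For the programmers in the audience, here's roughly what that old approach looks like. This is a minimal sketch rather than the actual code from the system: the function name split_at_last_space() and the array keys are made up for illustration, and putting a one-word name in the surname field is just an assumption.

```php
<?php
// Hypothetical reconstruction of the old approach; not the original code.
function split_at_last_space(string $fullName): array
{
    $fullName = trim($fullName);
    $pos = strrpos($fullName, ' ');

    if ($pos === false) {
        // One-word name. Assumption: store it as the surname.
        return ['first' => '', 'last' => str_replace('_', ' ', $fullName)];
    }

    $first = substr($fullName, 0, $pos);
    $last  = substr($fullName, $pos + 1);

    // Underscores stood in for spaces inside a surname; turn them back
    // into spaces only after the split has been made.
    return ['first' => $first, 'last' => str_replace('_', ' ', $last)];
}
```

With the underscores typed in by hand, "Oscar de_la_Renta" comes out as first name "Oscar" and last name "de la Renta"; without them, the same call produces the spurious "Oscar de la" / "Renta" split.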

Last night I was reminding myself to use an underscore in such a name, and decided that it was time I finally implemented something I'd been thinking of for a while: a system that recognizes common parts of multi-word surnames, without my having to put in underscores.
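To give a flavor of the idea, here's a simplified sketch. It is not the real parser: the function name parse_name_with_affixes() is hypothetical, the prefix and suffix lists are short illustrative samples, and it works on whole words rather than regular expressions. Prefixes are matched in lowercase only, on the assumption that a capitalized "Van" is too ambiguous to absorb automatically.

```php
<?php
// Illustrative sample lists; a real list would be longer.
const SURNAME_PREFIXES = ['de', 'la', 'del', 'van', 'von', 'der', 'den'];
const NAME_SUFFIXES    = ['jr.', 'jr', 'sr.', 'sr', 'ii', 'iii', 'iv'];

// Hypothetical helper: splits a full name into first name + surname,
// keeping known prefixes and suffixes attached to the surname.
function parse_name_with_affixes(string $fullName): array
{
    $words = preg_split('/\s+/', trim($fullName), -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) === 0) {
        return ['first' => '', 'last' => ''];
    }

    // Peel suffixes such as "Jr." or "III" off the end (never the only word).
    $suffixes = [];
    while (count($words) > 1
        && in_array(strtolower($words[count($words) - 1]), NAME_SUFFIXES, true)) {
        array_unshift($suffixes, array_pop($words));
    }

    // The surname starts at the last remaining word, then grows leftward
    // over lowercase prefixes. The very first word is never absorbed.
    $start = count($words) - 1;
    while ($start > 1 && in_array($words[$start - 1], SURNAME_PREFIXES, true)) {
        $start--;
    }

    return [
        'first' => implode(' ', array_slice($words, 0, $start)),
        'last'  => implode(' ', array_merge(array_slice($words, $start), $suffixes)),
    ];
}
```

Under those assumptions, "Oscar de la Renta" splits into "Oscar" and "de la Renta" with no underscores needed, and "Martin Luther King Jr." keeps "King Jr." together as the surname.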

It turned out to be pretty easy. I'm finding a lot of things easier to code in PHP than they used to be, now that I'm not scared of using PHP arrays any more. I'm not sure why I was before; they just seemed intimidating somehow. (I've used arrays in lots of other languages, of course.)

So I spent a couple of hours last night building the name parsing system (and making it a callable subroutine so I could centralize all the name processing in one place). I was pretty sleepy by the end, but I didn't want to leave the database system in a potentially broken state overnight, in case Karen and Susan would be using it this morning before I was up.

One of the things I did was create a test system that let me enter a full name and see the parsed personal name and surname, without interacting with the database at all. This was a huge improvement over my old approach of testing changes by entering info into the live database and then deleting it later.

This morning I woke up after too little sleep and was muzzily contemplating the day when I realized that I should take the test system a step further: instead of having to type in a bunch of names by hand, I should create a unit test, or at least something vaguely resembling one: a piece of test code that takes a set of specified inputs and desired results, runs the name parser on the inputs, and compares the actual results to the specified desired results. This way, if I make a change to the name parser, I can run the whole set of tests on it easily, quickly, and automatically, and see immediately if my change has broken anything. And as new unusual names come up, I can add them to the set of tests. So I just finished writing that, and that went pretty smoothly too.
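In case it's useful to anyone, the shape of such a test is something like the following sketch. The names here (run_name_tests(), $naiveParser, $cases) are invented for the example, and the stand-in parser is just the old last-space split, included only so the snippet runs on its own; the real test calls the real parser.

```php
<?php
// Runs $parser on every input and returns a list of readable failure messages.
function run_name_tests(callable $parser, array $cases): array
{
    $failures = [];
    foreach ($cases as $input => $expected) {
        $actual = $parser($input);
        if ($actual !== $expected) {
            $failures[] = sprintf(
                '%s: expected [%s | %s], got [%s | %s]',
                $input, $expected[0], $expected[1], $actual[0], $actual[1]
            );
        }
    }
    return $failures;
}

// Stand-in parser (splits at the last space) so the sketch is self-contained.
$naiveParser = function (string $name): array {
    $pos = strrpos($name, ' ');
    return $pos === false
        ? ['', $name]
        : [substr($name, 0, $pos), substr($name, $pos + 1)];
};

// Input name => [expected first name, expected surname].
$cases = [
    'Jane Doe'          => ['Jane', 'Doe'],
    'Oscar de la Renta' => ['Oscar', 'de la Renta'], // the naive split gets this wrong
];

$failures = run_name_tests($naiveParser, $cases);
echo count($failures) === 0 ? "All tests passed\n" : implode("\n", $failures) . "\n";
```

Adding a newly encountered unusual name is then a one-line change to $cases, and a failing case shows up immediately with both the expected and the actual split.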

The parser is still far from perfect. There's an entire class of Spanish surnames it can't handle (names like "Garcia y López"), and there are plenty of other surnames in the world that contain spaces but don't contain any of the prefixes or suffixes I'm looking for. Also, there are cases where a surname prefix is indistinguishable (without further context) from a first or middle name; for example, it's hard (especially for a computer program) to tell whether the "Ben" in "Joshua Ben David" is part of the personal name or the surname. And "Van" (with capital V) is another surname prefix that sometimes (though rarely) appears as a first or middle name.

But the parser now correctly parses all but half a dozen of the names of authors who've actually submitted to us. So I'm pretty pleased with it.

Arguably, it was silly to spend two hours last night and another two hours this morning writing code that will probably save me about half an hour a year, if that. But (a) it'll also save me a fair bit of frustration and annoyance; and (b) it was good practice--I'm trying to teach myself better software-engineering practices, and I've been meaning to try writing a unit test for months now. (And pulling the name-parsing code out into a separate function will make a fair number of things cleaner and simpler to implement in the future, and will improve various kludgy parts of the existing system.)

12 Comments

Yay! Jed is test infected!


If you are not familiar with PHP functions and want to find the ones whose names contain a certain word (for example, all PHP functions containing 'url'), try this site: [URL removed by Jed]


Hi, Leo -- I've removed the URL from your comment because, although your service does what you advertise it as doing, I'm afraid I feel that it has a bad user interface, and the very prominent placement of Google ads (more prominent than the PHP search itself) suggests that your goal is to make money rather than to provide a useful service. (It's quite possible to do both, but I don't get the impression that that's your intent.)

I personally find that the documentation on the PHP website is clear and well-organized and easy to search, and reasonably well presented; I don't see a need for a third-party service that does a search of their documentation. It's true that yours searches for substrings, so if your interface were nicer and more prominent I might have allowed your link, but I'm afraid as it is I don't feel that your service adds enough value.

Note to others: I'm fine with allowing links from here to commercial services that I like, or noncommercial services that I don't think much of. But I draw the line at allowing my site to be used to increase the PageRank of a commercial service that I don't feel is useful to my readers.


So... any chance you'll share the code with us? I found your site whilst searching for 'name parser' as I need to implement something similar.

Thanks,
James


Sure thing--didn't occur to me anyone would be interested. I don't seem to be able to use a pre tag in comments, so I've posted the name parser code on a separate page. If you have any questions, feel free to post them here or email me.

Hope it's useful!


Hey, thanks for the code! I was looking for a parser for author names (I'm writing a MySQL-based card catalogue for my personal book collection). You know how hard it is to find a Free/Open Source name parser? (Well, obviously, you do, since you had to write one, too. *grin*)

normalize_name() wasn't quite what I needed (I wanted to be able to parse it down to title, first, middle, last, suffix), but I got over some humps by looking at your code. I ended up ditching the regexes in favour of a state machine of sorts. It handles the Spanish last names (such as "Garcia y Lopez") you mentioned, and, since it lacks regexes ('\b' is a devil, isn't it?), apostrophes don't mess up tokenization: two things I never would've thought to check had I not read normalize_name(). Some of the test vectors I used were:

John Doe
Doe, John. A., Jr.
Juan Velasquez y Garcia III
Velasquez y Garcia, Dr. Juan Q. Xavier

Again, thanks for the inspiration (and the help on multi-word last names, which, if you read my code, is essentially the same as your method, just applied to the tokenized stream). If you're interested, you can check it out at http://alphahelical.com/code/misc/nameparse .

Again, Thanks! and Cheers!
Keith


Thanks, Keith! I'll post more thoughts about this when I have a chance to read through your code.

For now, just a couple notes before I forget, about the limitations of my code--partly as a note to myself, partly because at the moment this journal entry is the first Google result for [php name parser] so I expect I'll get a certain amount of traffic from people interested in name parsing.

1. My code has a bug: it doesn't correctly deal with three initials in a row (as in "M.F.K. Fisher"). I thought I'd tested that, but apparently not. Will fix soon.

2. My code lumps first and middle names together, because it can't distinguish between a two-word first name ("Mary Anne", "Norma Jeane") and a first-name-plus-middle-name. It could look for common two-word first names, but I don't think even a human can tell for sure, by visual examination, whether the first two words in a name like "Mary Lou Retton" are one name or two.

3. Likewise, my code is unable to recognize multi-word surnames that don't use prefixes or infixes. Like "Gabriel García Márquez"--I gather that his last name is "García Márquez," but I don't see any way to determine programmatically that "García" in this case is part of the last name rather than a middle name. (I could compare a given "middle" name with a list of known last names--except that it's quite common, at least in the US, for common surnames to be used as middle names, and sometimes even as first names.)

4. My code assumes that the personal name comes first and the family name comes last. That's an invalid assumption in many cultures.

5. My code does correctly recognize that in some cultures, it's fairly common for people to have only one name.

6. My code assumes that the name is in a Western alphabet. For my purposes, that's a reasonable assumption; submissions to the magazine have to be in English, so it's reasonable to assume that authors who would normally use a different set of characters will transliterate their name into Western European letters. Even so, my code may well have flaws when confronted with accented characters and such. At the moment, I'm not worrying about that, because at least two other parts of my system force ASCII or Latin-1, but in the long run I want the whole system to use Unicode.
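On point 1, one possible way to spot a run of initials is sketched below. This is not the fix that went into the parser, and it makes no claim about what the actual bug is; leading_initials() is a hypothetical helper, and the pattern assumes UTF-8 input with each initial being a single letter followed by a period.

```php
<?php
// Hypothetical helper: returns the run of initials at the start of a name
// ("M.F.K." or "M. F. K."), or an empty string if the name doesn't start with one.
function leading_initials(string $name): string
{
    // One or more "letter + period" groups, each optionally followed by spaces.
    if (preg_match('/^(?:\p{L}\.\s*)+/u', trim($name), $matches) === 1) {
        return rtrim($matches[0]);
    }
    return '';
}
```

Both "M.F.K. Fisher" and "M. F. K. Fisher" are recognized this way, so the initials could be kept together as the personal name before the rest of the name is parsed.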


Firstly, thanks a lot for writing up your story! It made me feel less alone :-D.
I agree with all the reasoning in your previous comment (all six points). I just wanted to add that when I find a name like A. B. Cin, I treat everything up to the last dot as the first name (I don't use a separate middle name), and everything after that as the surname. If none of the words have dots, then only the first word is the first name.
I also allow the first name to contain a dash (like Ana-Marija). Almost everything else ends up in the surname.

I hope you've found your answers over these last few years (I mean, your last post was in 2007).

Cheers!!!


As a programmer with a last name that has both a space and inter-capitalization, let me suggest that you will never reliably parse first names from last names. The only way to do it right is populate two separate fields.

That said, another mistake people make is capitalizing letters that were not entered that way. You have to deal with this because when people enter their last name on a form, they sometimes use all lower-case, and sometimes all upper-case. However, I have found that people with mixed-case names always enter them in mixed-case. So, you just need a PHP function that turns all lower-case into mixed-case, all upper-case into mixed-case, and leaves mixed-case alone!

smith -> Smith
SMITH -> Smith
O'Reilly -> O'Reilly
van Lammeren -> van Lammeren
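A minimal sketch of such a function might look like this. The name normalize_case() is made up for the example, and the sketch assumes ASCII-only input; accented letters would need the mbstring functions instead.

```php
<?php
// Hypothetical helper: title-cases names typed in all-lowercase or
// all-uppercase, and leaves mixed-case names exactly as entered.
// Assumption: ASCII input only.
function normalize_case(string $name): string
{
    $isAllLower = ($name === strtolower($name));
    $isAllUpper = ($name === strtoupper($name));

    if (!$isAllLower && !$isAllUpper) {
        return $name; // mixed case: assume the person typed it deliberately
    }

    // Capitalize the first letter after a space, hyphen, or apostrophe.
    return ucwords(strtolower($name), " -'");
}
```

Note the limitation: a name typed as "VAN LAMMEREN" would come back as "Van Lammeren", because once the original case is gone there is no way to know the "v" should be lowercase.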


Thanks for the comment, Mike.

Although you're right that it's impossible for software to parse all names with certainty, it's worth noting that both my code and Keith's code (linked above) do correctly recognize last names that start with a lowercased "van" or "von" followed by a space.

(It becomes impossible again when "Van" or "Von" is capitalized, because those can be used as first names. At least "Van" can. ...I encountered an interesting situation a couple days ago: someone who gave "Von Foo" (where "Foo" was a different name) as their pseudonym. I had assumed that "Von" was intended as a first name, but my parser concluded that that was a single two-word name. I initially saw this as a bug in the parser, but now I think it may be correct.)

Interesting point about case-sensitivity. My code leaves case intact, because in the science fiction world, there are people who go by all-lowercase names (and perhaps a few who go by all-uppercase names). And our submission form does explicitly say to use the case that they actually intend.

That said, some people ignore that instruction, and enter their name in all-caps even though they don't intend it to be all-caps. I was baffled by that behavior until someone recently suggested that perhaps it's a carryover from paper forms, where people are used to writing in all caps.

Anyway, I think for now I'm going to continue to leave capitalization alone, but if I ever change that, I'll test for mixed-case and leave that alone, so thanks for pointing that out!


Thanks for your work, Jed, and the nice narrative. I wanted to let you know that I've used some of what you've done to make yet another name parser at http://jasonpriem.com/human-name-parse/. Like yours, it uses regular expressions for the matching, because I think that's easier to understand and modify; like Keith's, it works with utf-8, returns all the name's parts, and handles a variety of compound surnames and initialized names (here's the test list). It also captures two extra types of names:

* leading initials, like the 'J.' in "J. Walter Weatherman" (instead of making it a first name).

* nicknames, like the 'Gob' in 'George Oscar “Gob” Bluth, Jr.' (instead of making it part of a middle name).

However, I think the big advantage is that it uses some of the improvements in PHP development methods to make it easier to use and hack:

1. It's written in object-oriented PHP; to use it, you just instantiate the parser:

$parser = new Parser("O'Malley y Muñoz, C. Björn Roger III");

and then use the relevant 'get' method to retrieve name parts:

echo $parser->getFirst() . ' ' . $parser->getLast(); // returns 'Björn O'Malley y Muñoz'

You can also get the results as an associative or integer-indexed array.

2. As well as coming with a set of test names and an interface for testing them, it includes PHPUnit unit tests.

3. Everything's documented for PHPDoc.

So, there it is. Hopefully yet another option for PHP name parsing will be of some use to someone. Thanks again for your work and post on this!


That's great, Jason—thanks much for letting me know about it, and I'm sorry that your comment fell afoul of my comment-spam system!

I'll try and take a look at your parser soon, and may end up using it in place of mine. It would definitely be a major improvement over mine in terms of good programming practices.

For my purposes, I don't want a leading initial to be treated differently from a first name, but I suppose I could get the leading initial and the first name using your methods and then concatenate them for my purposes. Will ponder further.

