{"id":5481,"date":"2010-10-14T09:17:16","date_gmt":"2010-10-14T09:17:16","guid":{"rendered":"http:\/\/www.kith.org\/jed\/hodgepodge\/code\/name-parser-php-html\/"},"modified":"2025-09-20T23:42:05","modified_gmt":"2025-09-21T06:42:05","slug":"name-parser-php-html","status":"publish","type":"page","link":"https:\/\/www.kith.org\/jed\/hodgepodge\/code\/name-parser-php-html\/","title":{"rendered":"Name parser\/normalizer"},"content":{"rendered":"\r\n\r\n<p>Code that I wrote in 2007 for our <cite>Strange Horizons<\/cite> fiction-submission-handling system, to parse an author\u2019s name into a forename and a surname.<\/p>\r\n<p>In 2023, I\u2019m more inclined to think that the whole project of trying to programmatically parse human names is misguided. For more on that, see \u201c<a href=\"https:\/\/www.kalzumeus.com\/2010\/06\/17\/falsehoods-programmers-believe-about-names\/\">Falsehoods Programmers Believe About Names<\/a>.\u201d That said, I\u2019m leaving this code up in case it\u2019s of use to anyone.<\/p>\r\n<hr width=\"25%\" \/>\r\n\r\n<pre style=\"overflow-x: auto;\">\r\n\/\/ Name normalization function in PHP: splits a full name into\r\n\/\/   forename and surname, taking into account various common\r\n\/\/   surname prefixes and suffixes.\r\n\/\/\r\n\/\/ By Jed Hartman (logos@kith.org), 11 February 2007.\r\n\/\/ For more info, see:\r\n\/\/ https:\/\/www.kith.org\/jed\/2007\/02\/11\/name-parser\/\r\n\/\/\r\n\/\/ This code is in the public domain; no rights reserved.  Use as you like.\r\n\/\/\r\n\/\/ Parameter: Pass in a string containing a full name.\r\n\/\/ Returns: A 2-element array.  
First element is fore-and-middle-names;\r\n\/\/   second element is surname including prefixes and suffixes, if any.\r\n\/\/   (Note that it's impossible to distinguish between a two-word\r\n\/\/   forename and a first-name-plus-middle-name, so we don't even try.)\r\n\/\/   Initials are normalized to have periods and spaces after them.\r\n\/\/\r\n\/\/ 2008-01-19: Cleaned up handling of periods, \"Mr\", and \"Mrs\".  Note\r\n\/\/   that \"Jr\" and \"jr\" now automatically get periods added.  If I want\r\n\/\/   to allow no-period versions of those, I could go through the\r\n\/\/   passed-in name and keep track of what pieces have periods, but I\r\n\/\/   don't think that's worth doing.\r\n\/\/\r\n\/\/ 2009-10-20: Added \"du\".  Considered adding a special case where if\r\n\/\/   a name has only two parts, then prefixes are treated as first\r\n\/\/   names, but decided against it; for example, I think \"Von Foo\" is\r\n\/\/   meant to be a unitary name, not first + last. So not gonna do this.\r\n\/\/\r\n\/\/ 2010-10-14: Added \"Vander\".\r\n\/\/\r\n\/\/ Bugs:\r\n\/\/\r\n\/\/   *  Doesn't handle various non-Anglo approaches to naming, as in\r\n\/\/      names like \"Garcia y Lopez\".  To handle such cases, if you're\r\n\/\/      manually reviewing names before you call this function, you can\r\n\/\/      manually insert underscores between name elements that should\r\n\/\/      stay together: \"Jaime Garcia_y_Lopez\".\r\n\/\/\r\n\/\/   *  Doesn't handle cases where a surname prefix is used as a\r\n\/\/      middle name, as in names like \"Joshua Ben David\".  
Again,\r\n\/\/      you can manually insert underscores: \"Joshua_Ben David\".\r\n\/\/\r\n\/\/   *  Added in 2023: This entire endeavor is probably a bad way to\r\n\/\/      go about things; for more info, see\r\n\/\/      https:\/\/www.kalzumeus.com\/2010\/06\/17\/falsehoods-programmers-believe-about-names\/\r\n\/\/      But I\u2019m nonetheless leaving this code available in case it\u2019s\r\n\/\/      of use in certain limited contexts.\r\n\/\/\r\n\/\/ TODO:\r\n\/\/\r\n\/\/   *  Consider replacing my whole system with this code:\r\n\/\/      http:\/\/alphahelical.com\/code\/misc\/nameparse\/?misc\/nameparse\r\n\/\/      Or with this code:\r\n\/\/      http:\/\/jasonpriem.com\/human-name-parse\/\r\n\/\/\r\n\/\/   *  Consider generating capitalization and punctuation variants\r\n\/\/      for prefixes and suffixes rather than listing them all.\r\n\/\/\r\n\/\/   *  Clean up repetitive logic in middle of routine.\r\n\/\/\r\n\/\/   *  Remove other titles as well as \"Dr.\"\r\n\/\/\r\n\/\/   *  Find a more elegant way to handle apostrophes in forenames.\r\n\/\/\r\n\/\/   *  Don't treat non-ASCII characters and parentheses as word breaks.\r\n\r\nfunction normalize_name($full_name)\r\n{\r\n\r\n  $last_name_prefixes = array (\"ben\", \"da\", \"Da\", \"Dal\", \"de\", \"De\", \"del\", \"Del\", \"den\", \"der\", \"Di\", \"du\", \"e\", \"la\", \"La\", \"Le\", \"Mc\", \"San\", \"St\", \"Ste\", \"van\", \"Van\", \"Vander\", \"vel\", \"von\", \"Von\");\r\n  $last_name_suffixes = array (\"Jr\", \"jr\", \"Sr\", \"sr\", \"2\", \"II\", \"III\", \"IV\");\r\n  $add_periods = array (\"Ste\", \"St\", \"Jr\", \"jr\", \"Sr\", \"sr\");\r\n\r\n  $full_name = trim($full_name);\r\n  $full_name = preg_replace(\"\/&lt;[^&gt;]+&gt;\/\", \"\", $full_name); \/\/ Remove x-flowed and other tags.\r\n  $full_name = preg_replace(\"\/\\.+$\/\", \"\", $full_name); \/\/ Remove final periods.\r\n  $full_name = preg_replace(\"\/\\.\/\", \" \", $full_name); \/\/ Replace periods with spaces.\r\n  $full_name = 
preg_replace(\"\/ +\/\", \" \", $full_name); \/\/ Replace runs of spaces with single spaces.\r\n  $all_names = preg_split(\"\/[ \\xA0]\/\", $full_name); \/\/ Split on space or option-space.\r\n  $last_name = array_pop($all_names);\r\n  $second_to_last_word = array_pop($all_names);\r\n  if (is_null($second_to_last_word))\r\n  {\r\n    return array ($last_name, \"\");  \/\/ If only one name, consider it to be a \"first\" (personal) name.\r\n  }\r\n  if (in_array($last_name, $last_name_suffixes))  \/\/ Doesn't account for multiple suffixes; fix eventually, but v. rare.\r\n  {\r\n    $last_name = $second_to_last_word . \" \" . $last_name;\r\n    $second_to_last_word = array_pop($all_names);\r\n  }\r\n  if (is_null($second_to_last_word))\r\n  {\r\n    return array ($last_name, \"\");  \/\/ If only one name, consider it to be a \"first\" (personal) name.\r\n  }\r\n  while (in_array($second_to_last_word, $last_name_prefixes))\r\n  {\r\n    $last_name = $second_to_last_word . \" \" . $last_name;\r\n    $second_to_last_word = array_pop($all_names);\r\n  }\r\n  if (is_null($second_to_last_word))\r\n  {\r\n    return array ($last_name, \"\");  \/\/ If only one name, consider it to be a \"first\" (personal) name.\r\n  }\r\n  $last_name = preg_replace(\"\/_\/\", \" \", $last_name); \/\/ Change underscores to spaces, for multiword last names\r\n  array_push($all_names, $second_to_last_word); \/\/ Put latest unused name back on stack\r\n  $first_name = join(\" \", $all_names);\r\n  $first_name = preg_replace(\"\/_\/\", \" \", $first_name); \/\/ Change underscores to spaces, for multiword first names\r\n  $first_name = preg_replace(\"\/^Dr\\.? ?\\b\/\", \"\", $first_name);  \/\/ Remove \"Dr.\" from start of name\r\n  $first_name = preg_replace(\"\/^Mrs\\.? ?\\b\/\", \"\", $first_name);  \/\/ Remove \"Mrs.\" from start of name\r\n  $first_name = preg_replace(\"\/^Mr\\.? 
?\\b\/\", \"\", $first_name);  \/\/ Remove \"Mr.\" from start of name\r\n  \/\/ TODO: change the above to an array of titles to be removed, and iterate through the array.\r\n\r\n  \/\/ Change all initials to have periods and spaces after them.\r\n  \/\/ Apostrophes cause problems with the word-boundary test, so temporarily change them.\r\n  \/\/ This is inelegant; should probably come back and figure out how to do it right at some point.\r\n  $first_name = preg_replace(\"\/'\/\", \"QXZQXZQXZ\", $first_name);\r\n  $first_name = preg_replace(\"\/\\b([A-Z])([A-Z])(\\.|\\b)\/\", \"$1 $2\", $first_name); \/\/ Two cap letters followed by period or space\r\n  $first_name = preg_replace(\"\/\\b([A-Z])(\\.|\\b)\/\", \"$1.\", $first_name);  \/\/ Single letter followed by period or space\r\n  $first_name = preg_replace(\"\/QXZQXZQXZ\/\", \"'\", $first_name);\r\n  \r\n  \/\/ Now add back in any missing periods.\r\n  foreach ($add_periods as $word)\r\n  {\r\n    $first_name = preg_replace(\"\/$word\\b\/\", \"$word.\", $first_name);\r\n    $last_name = preg_replace(\"\/$word\\b\/\", \"$word.\", $last_name);\r\n  }\r\n  \r\n  return array ($first_name, 
$last_name);\r\n}\r\n<\/pre>\r\n\r\n\n","protected":false},"excerpt":{"rendered":"","protected":false},"author":5,"featured_media":0,"parent":5479,"menu_order":50,"comment_status":"closed","ping_status":"closed","template":"","meta":{"_acf_changed":false,"footnotes":""},"class_list":["post-5481","page","type-page","status-publish","hentry"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.kith.org\/jed\/wp-json\/wp\/v2\/pages\/5481","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.kith.org\/jed\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.kith.org\/jed\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.kith.org\/jed\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/www.kith.org\/jed\/wp-json\/wp\/v2\/comments?post=5481"}],"version-history":[{"count":8,"href":"https:\/\/www.kith.org\/jed\/wp-json\/wp\/v2\/pages\/5481\/revisions"}],"predecessor-version":[{"id":20672,"href":"https:\/\/www.kith.org\/jed\/wp-json\/wp\/v2\/pages\/5481\/revisions\/20672"}],"up":[{"embeddable":true,"href":"https:\/\/www.kith.org\/jed\/wp-json\/wp\/v2\/pages\/5479"}],"wp:attachment":[{"href":"https:\/\/www.kith.org\/jed\/wp-json\/wp\/v2\/media?parent=5481"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}