
Link syntax surprise: links can start with //


(Wrote this entry on November 12, 2010, but neglected to post it until now.)

Sometime around 1999, I was revising the Dreamweaver documentation for the first time, and I discovered that the information it gave about different kinds of HTML links (using absolute or relative URIs) was inaccurate and/or misleading.

So I rewrote it to cover what I saw as the three basic ways for a link to refer to a given web page. To summarize those three link types (though this isn't what I wrote at the time):

absolute
Generally starts with http:// and then specifies the host name and a path to the page from the host's web root directory. Example: http://www.example.com/foo/bar.html.
site root–relative
Leaves out http:// and the host name; starts with a slash (which, as in UNIX, denotes the root directory); specifies a path from the host's web root directory to a page. Example: /foo/bar.html.
document-relative
Specifies a path from the current document to the linked-to document. Leaves out http://, the host name, and the slash that indicates the root directory. Can consist of just a page name (if the target page is in the same directory), or can start with a subdirectory name (if the target page is in a subdirectory). If the path goes through a parent directory, then it starts with two dots, the UNIX syntax for “go up one directory level.” Example: ../foo/bar.html.

There are other kinds of links, of course, like mailto links or file links, but for the past ten years, I've thought that the above three were the only options for linking to web pages. A version of my explanation is still in the current Dreamweaver documentation, though there are a couple of oddly phrased bits there that I'd like to claim I didn't write.

(A link can also have other possible pieces: an IP address instead of a hostname; a username and password; a port number; query parameters; an ID within the page; etc. But I'm ignoring all those for this entry. More generally, there's some other stuff here that I'm simplifying for the sake of readability and semi-brevity.)

So why am I posting about this? Because yesterday, I learned that there's another option:

A “network-path” reference is a URI that looks like an absolute URI with the http: (or https:) left off; that is, it starts with // and a host name.

(I'm not going to get into the difference between URLs and URIs here; I'm just gonna say URI in this entry.)

It's hard to find information about this. Most sites that talk about link syntax just talk about the standard three options described above (though not always with those names), and searching for phrases like [link starting with two slashes] didn't help.

But after I found the relevant bit of the spec, I started searching for the official term “network-path reference,” which led me to a discussion of how and why you could use this.

The why is interesting: if someone might be arriving at your page using either http or https, and you want to be able to handle both appropriately, then you can use network-path links to other pages on your site, so that as they follow links, their browser will continue to use whichever protocol they started out with.
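To make that concrete, here's a minimal sketch using Python's standard urllib.parse.urljoin, which resolves a reference against a base URI in roughly the way a browser does. The host name and paths are placeholders I made up for illustration.

```python
from urllib.parse import urljoin

# Hypothetical network-path link, as it might appear in a page's href.
LINK = "//www.example.com/account/settings.html"


def follow(current_page, link=LINK):
    """Resolve a link against the URI of the page it appears on."""
    return urljoin(current_page, link)


# The same link keeps whichever protocol the visitor arrived with.
via_http = follow("http://www.example.com/index.html")
via_https = follow("https://www.example.com/index.html")

print(via_http)   # http://www.example.com/account/settings.html
print(via_https)  # https://www.example.com/account/settings.html
```

The link text itself never changes; only the page it's resolved against does.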

And that idea provides a much more coherent paradigm for the whole set of link syntax options:

If you leave off some part of a URI in a link, the browser substitutes the corresponding part of the current page's URI.

(Though there are only certain parts you can validly leave off; you can't just start in the middle of the host name, for example.)

So:

  • Give the full URI, and you've got what I call an absolute link, which I guess the spec would just call a URI.
  • Leave off the http: or https: and you've got a network-path reference; the browser substitutes whichever of those the user used to get to the page.
  • Leave off everything up to and including the host name (so the link starts with a single slash), and you've got what I call a site root–relative link, which the spec calls an “absolute-path reference”; the browser treats the link as if it included the host name of the current page.
  • Leave off the protocol, the host name, the starting slash, and some segments of the path, and you've got what I call a document-relative link, which the spec calls a “relative-path reference”; the browser fills in the protocol, the host name, and the directory portion of the path from the current page's URI.

I'm oversimplifying again; there's a detailed algorithm for how exactly to construct the full URI in the spec. But I think the above is a reasonable conceptual description for most purposes.
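As a rough illustration of that conceptual description, the sketch below resolves one link of each kind against the same current page, again using Python's urllib.parse.urljoin as a stand-in for the browser. The URIs are invented placeholders, and the labels are my names for the link types, not the spec's.

```python
from urllib.parse import urljoin

# Placeholder URI for the page that contains the links.
CURRENT_PAGE = "https://www.example.com/foo/bar.html"

# One hypothetical link of each kind, all meant to reach the same page.
LINKS = {
    "absolute": "https://www.example.com/baz/qux.html",
    "network-path": "//www.example.com/baz/qux.html",
    "site root-relative": "/baz/qux.html",
    "document-relative": "../baz/qux.html",
}


def resolve_all(base, links):
    """Fill in whatever each link leaves out, using the base URI."""
    return {kind: urljoin(base, link) for kind, link in links.items()}


RESOLVED = resolve_all(CURRENT_PAGE, LINKS)
for kind, full in RESOLVED.items():
    print(f"{kind:20} {LINKS[kind]:40} -> {full}")
```

All four links resolve to https://www.example.com/baz/qux.html; they differ only in how much of that URI they spell out and how much they borrow from the current page.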

By the way, if you leave off the http:// and start with, say, www, then what you've got is a suffix reference, which is (as the spec puts it) “primarily intended for human interpretation rather than for machines.” A browser user can enter such an item by hand into the browser's address box, but this kind of reference should not be used in a link, and browsers may behave badly when faced with such a link.
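Here's a small sketch of why that goes wrong, assuming Python's urllib.parse follows the same generic rules a browser would for this case. With no scheme and no leading //, the would-be host name is just the first segment of a relative path. The URIs are made-up examples.

```python
from urllib.parse import urljoin, urlsplit

# Placeholder page URI and a hypothetical suffix reference used as a link.
CURRENT_PAGE = "http://www.example.com/foo/bar.html"
SUFFIX_LINK = "www.example.org/page.html"

# Without a scheme or a leading //, nothing marks "www.example.org" as a host.
parts = urlsplit(SUFFIX_LINK)
print(repr(parts.netloc))  # ''
print(parts.path)          # www.example.org/page.html

# So it is resolved like any other document-relative link.
resolved = urljoin(CURRENT_PAGE, SUFFIX_LINK)
print(resolved)  # http://www.example.com/foo/www.example.org/page.html
```

The result points at a nonexistent directory on the current site rather than at the other host.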
