It is some sort of well known pattern, the “http://” stuff. It is so
frequently used that browsers just fill it in automatically by default. But what
does exactly mean? What’s behind it? In reality, behind a so called URI,
there’s more than meets the eye.
What is an URI?
First of all, we need some definition. an URI, or Uniform Resource
Identifier, is a string identifier that, as the name implies, identifies some
resource. A resource is “something” that can be referred to. It
could be an internet document, a book, whatever. The official definition of an URI
is detailed in its latest form in RFC
3986.
The URI is made of the following parts:
- A scheme
- A hierarchical part
- Optionally, a query
- Optionally, a fragment
Some URIs can be URLs (Locators) or URNs (Names). Both are URIs, but the URL
identifies a resource and the place where to find the resource,
while a URN describes a name of a resource through the special scheme name urn:.
When you type an address in your browser, you are specifying a URL. When you talk
about “urn:isbn:0-395-36341-1″, you are using a URN to talk about a book
with a specific ISBN. Please note that the relationship is
not bijective: although every URL is a URI, not all URIs are URLs, and the same can
be said for URNs instead of URL.
Although many scheme names are named after protocols (eg. http, ftp, ldap) this is
mostly incidental. There is no “technical correspondence” between the
scheme and the protocol: for example by saying a URI with the scheme name http,
trying to get the resource (therefore assuming the URI is also a location) will
trigger not only HTTP, but also DNS. Moreover, if you have the document in cache,
you will not use HTTP at all, but you will get a resource that is on your hard
drive. “Resolving” a URI means finding an access strategy to
“use” (in very broad term) the resource identified by that URI, an
operation called “dereference”. The most common dereferencing is
retrieval, like downloading the resource.
The hierarchical part includes either an authority and a path, or just a path. Please
note that the “path” concept is very loose: for example, in
“mailto:[email protected]” “[email protected]” is a path.
Similarly, in “urn:isbn:0-395-36341-1″, the path is
“isbn:0-395-36341-1″ (which corresponds to the Webster Dictionary, in
case you are wondering). When you have an authority involved for the resolution of
the path, and only in this case, then you have the “//” to indicate the
authority, for example “http://example.com/”, where the authority
is “example.com” and the path is “/”. An apparent
exception is “telnet://example.com”, but in any case you do have an
authority, and the path is always empty. Another apparent strange situation is
“mailto:[email protected]”: in this case, be careful not to confuse a
path with an authority (maybe user qualified, as in
“http://user@host/”).
What can you use URIs for ?
So, what does the distinction of URI, URL and URN is really useful for? The fact is
that, while a URL refers to a “place” where to find a resource, a URN
refers to a “name”of a resource in the urn scheme, and a URI as a name
of a resource in any scheme. When we talk about resource, the meaning is very broad.
It could be anything, even an abstract concept. An example is the definition of
namespaces in XML, something like
<h:html xmlns:h="http://www.w3.org/">
<h:head><h:title>Hello!</h:title></h:head>
<h:body>
<h:h1>Big Title</h:h1>
</h:body>
</h:html>
The URI http://www.w3.org/ is just an URI. It is not a location (the fact
that W3C is actually using the address to provide an informative page can be seen as
a coincidence). It is a symbol to mean something, namely, the fact that some
elements in that XML documents belong to the vocabulary of XHTML.
In Chestnut package manager, the XML manifest starts with
<Package xmlns="urn:uuid:d195be0c-200a-40a4-9d05-35fdf42eb29f" version="1.0.0">
Where the URI “urn:uuid:d195be0c-200a-40a4-9d05-35fdf42eb29f” again means
something: the fact that the “grammar” used is the one of Chestnut
package manager. I created this URI with the utility uuidgen, and by definition is a
unique id. I could have used anything else, even the URI of this post. The important
point is that it has to be a URI, and has to be unique for this specific use. In
this particular case, the URI is also a URN.
Another interesting usage of URIs is in ontologies and RDF descriptions. Briefly and
roughly, an ontology is a “description of a world”. Suppose that you
want to describe the following information
my guitar is white
and you want a computer to be able to understand it. Moreover, you would like to make
the computer understand that the guitar is a musical instrument
The guitar is a musical instrument
and that white is a color
white is a color
Now, if you ask the computer to search all the musical instruments that are white,
you would like to get my guitar, because it is white, and because it is a musical
instrument. Seems easy, but it’s not. The fact is that humans are very smart
at interpreting those phrases, but a computer is not. Quoting Bill Clinton, it depends on
what the meaning of “is” is. By saying that “my guitar is
white” we describe a property of the guitar in terms of its color. We have a
subject (”my guitar”), a verb (”is”, in terms of “has
color”) and a predicate (”white”). This is also known as a
“triplet”, and is the basis of RDF.
Of course, if we take “the guitar is a musical instrument” we have again
a case of triplet: “the guitar” is the subject, “is a” in
terms of “is a kind of” is the verb, and “musical
instrument” is the predicate. Please note that we have another interesting
fact here: while “my guitar” is a very specific guitar (that is, the
guitar I own), “the guitar” is an abstract concept that applies to any
guitar. They are not the same thing, they are two different concepts, but we can say
without doubt that “my guitar (the first concept) is a guitar (the second
concept)”(this is a case of instance/class relationship).
We can form very complex networks of subject-verb-predicate triplets describing the
digital and non-digital world we live in, so that a computer can help us in doing
complex search and analysis. How do we differentiate all the concepts and
ambiguities we just encountered? You guessed it: with URIs. We will have a URI to
express the concept of “my guitar”, another URI to express the concept
of “white”, another URI to express the concept of “guitar”,
“musical instrument”, “color”, and we will also have
different URIs expressing the concepts of “is (as a color)” and
“is (a kind of)”. All these concepts are part of the description (also
known as ontology) we want to grant to the small world we created in this example.
This description is rather simple and loose, but we can define way more complex
ontologies. For example, we can create a color ontology, describing the colors and
their relationships:
white is a color, red is a color, snow white is a kind of white, blood is a kind of red, red is a warm color, blue is a cold color
or an instrument ontology describing
guitar is a six-string instrument, violin is a four-string instrument, four-string instrument is a string instrument,
six-string instrument is a string instrument, string instrument is a musical instrument
Each of these concepts (”guitar”, “six-string instrument”,
“violin”, “four-string instrument”, “string
instrument”, “musical instrument”) will then be referred by means
of a unique URI.