From URI to ontologies

The “http://” prefix is a well-known pattern, so frequently used that browsers fill it in automatically. But what exactly does it mean? What’s behind it? In reality, behind a so-called URI, there’s more than meets the eye.

What is a URI?

First of all, we need a definition. A URI, or Uniform Resource Identifier, is a string that, as the name implies, identifies some resource. A resource is “something” that can be referred to. It could be an internet document, a book, whatever. The official definition of a URI, in its latest form, is given in RFC 3986.

The URI is made of the following parts:

A scheme
A hierarchical part
Optionally, a query
Optionally, a fragment
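
As a rough illustration of the four parts listed above, here is a minimal sketch using Python's standard urllib.parse module. The URI is made up for the example; note that Python reports the hierarchical part as two separate fields, netloc (the authority) and path.

```python
from urllib.parse import urlsplit

# A made-up URI used only for illustration.
parts = urlsplit("http://example.com/over/there?name=ferret#nose")

print("scheme:   ", parts.scheme)    # the scheme
print("authority:", parts.netloc)    # first half of the hierarchical part
print("path:     ", parts.path)      # second half of the hierarchical part
print("query:    ", parts.query)     # optional query
print("fragment: ", parts.fragment)  # optional fragment
```

The split is purely textual: nothing is fetched, and the parser does not need to know what the scheme means.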

Some URIs are URLs (Locators) or URNs (Names). Both are URIs, but a URL identifies a resource and also the place where to find it, while a URN gives a name to a resource through the special scheme name urn:. When you type an address in your browser, you are specifying a URL. When you talk about “urn:isbn:0-395-36341-1”, you are using a URN to talk about a book with a specific ISBN. Please note that the relationship only goes one way: every URL is a URI, but not every URI is a URL, and the same holds for URNs.

Although many scheme names are named after protocols (e.g. http, ftp, ldap), this is mostly incidental. There is no “technical correspondence” between the scheme and the protocol: for example, given a URI with the scheme name http, trying to get the resource (therefore assuming the URI is also a location) will trigger not only HTTP, but also DNS. Moreover, if you have the document in cache, you will not use HTTP at all: you will get a copy that is on your hard drive. “Resolving” a URI means finding an access strategy to “use” (in a very broad sense) the resource identified by that URI; actually using it is called “dereferencing”. The most common form of dereferencing is retrieval, such as downloading the resource.

The hierarchical part includes either an authority and a path, or just a path. Please note that the “path” concept is very loose: for example, in a “mailto:” URI, the mailbox address that follows the scheme is the path. Similarly, in “urn:isbn:0-395-36341-1”, the path is “isbn:0-395-36341-1” (which corresponds to the Webster Dictionary, in case you are wondering). When an authority is involved in the resolution of the path, and only in this case, you have the “//” to introduce the authority, for example “http://example.com/”, where the authority is “example.com” and the path is “/”. An apparent exception is “telnet://example.com”, but here too you do have an authority; it is the path that is empty. Another apparently strange situation is, again, the “mailto:” URI: the mailbox address contains an “@”, so be careful not to confuse this path with an authority (possibly user qualified, as in “http://user@host/”).
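
The authority/path distinction can be checked mechanically. The following is a small sketch, again with Python's urllib.parse; the helper name authority_and_path and the mailbox address are hypothetical, chosen only for this example.

```python
from urllib.parse import urlsplit

def authority_and_path(uri):
    """Return the (authority, path) pair of a URI as two strings."""
    parts = urlsplit(uri)
    return parts.netloc, parts.path

examples = [
    "http://example.com/",         # authority and path
    "telnet://example.com",        # authority, empty path
    "urn:isbn:0-395-36341-1",      # no "//", so no authority: path only
    "mailto:someone@example.org",  # hypothetical address: path only
    "http://user@host/",           # user-qualified authority
]

for uri in examples:
    authority, path = authority_and_path(uri)
    print(f"{uri!r}: authority={authority!r}, path={path!r}")
```

Only the URIs containing “//” come back with a non-empty authority; for the others, everything after the scheme is treated as the path, even when it contains an “@”.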

What can you use URIs for?

So, what is the distinction between URI, URL and URN really useful for? The fact is that, while a URL refers to a “place” where to find a resource, a URN refers to the “name” of a resource in the urn scheme, and a URI in general is a name for a resource in any scheme. When we talk about a resource, the meaning is very broad. It could be anything, even an abstract concept. An example is the definition of namespaces in XML, something like

<h:html xmlns:h="http://www.w3.org/">
 <h:head><h:title>Hello!</h:title></h:head>
 <h:body>
   <h:h1>Big Title</h:h1>
 </h:body>
</h:html>

The URI http://www.w3.org/ is just a URI. It is not used as a location (the fact that W3C actually serves an informative page at that address can be seen as a coincidence). It is a symbol that means something, namely that the prefixed elements in that XML document belong to a given vocabulary. Here it stands in for the vocabulary of XHTML, whose actual namespace URI is http://www.w3.org/1999/xhtml.
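
To see that a namespace URI is only compared as a string and never fetched, here is a small sketch with Python's xml.etree.ElementTree. Both namespace URIs are hypothetical, chosen for this example.

```python
import xml.etree.ElementTree as ET

NS = "http://example.org/ns/demo"  # hypothetical namespace URI

# Same namespace URI, different prefixes.
doc_a = f'<h:html xmlns:h="{NS}"><h:body/></h:html>'
doc_b = f'<x:html xmlns:x="{NS}"><x:body/></x:html>'
# Same prefix as doc_a, but a different (also hypothetical) namespace URI.
doc_c = '<h:html xmlns:h="http://example.org/ns/other"><h:body/></h:html>'

root_a = ET.fromstring(doc_a)
root_b = ET.fromstring(doc_b)
root_c = ET.fromstring(doc_c)

# ElementTree expands each prefix to "{namespace-uri}localname".
print(root_a.tag)
print(root_a.tag == root_b.tag)  # prefixes differ, URI is the same
print(root_a.tag == root_c.tag)  # prefix is the same, URI differs
```

The parser makes no network request here: the prefix is only shorthand, and what identifies the vocabulary is the URI string itself.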

In the Chestnut package manager, the XML manifest starts with

<Package xmlns="urn:uuid:d195be0c-200a-40a4-9d05-35fdf42eb29f" version="1.0.0">

Here the URI “urn:uuid:d195be0c-200a-40a4-9d05-35fdf42eb29f” again means something: that the “grammar” used is the one of the Chestnut package manager. I created this URI with the utility uuidgen, so it is, for all practical purposes, a unique id. I could have used anything else, even the URI of this post. The important point is that it has to be a URI, and it has to be unique for this specific use. In this particular case, the URI is also a URN.
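
The author used the uuidgen utility; as an alternative sketch, the same kind of URN can be produced with Python's standard uuid module, which is swapped in here only for illustration.

```python
import uuid

# Generate a fresh random UUID and render it as a URN.
fresh = uuid.uuid4()
print(fresh.urn)  # different on every run

# The identifier quoted in the manifest above, parsed and rendered back as a URN.
chestnut = uuid.UUID("d195be0c-200a-40a4-9d05-35fdf42eb29f")
print(chestnut.urn)
```

Any other way of producing a unique string would do; the point is only that the resulting identifier is a valid URI in the urn scheme.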

Another interesting usage of URIs is in ontologies and RDF descriptions. Briefly and roughly, an ontology is a “description of a world”. Suppose that you want to describe the following information

my guitar is white

and you want a computer to be able to understand it. Moreover, you would like to make the computer understand that the guitar is a musical instrument

The guitar is a musical instrument

and that white is a color

white is a color

Now, if you ask the computer to search for all the musical instruments that are white, you would like to get my guitar, because it is white, and because it is a musical instrument. Seems easy, but it’s not. The fact is that humans are very good at interpreting those phrases, but a computer is not. Quoting Bill Clinton, it depends on what the meaning of “is” is. By saying that “my guitar is white” we describe a property of the guitar in terms of its color. We have a subject (“my guitar”), a verb (“is”, in the sense of “has color”) and a predicate (“white”). This is also known as a “triplet”, and is the basis of RDF. (RDF’s own vocabulary calls the statement a “triple” and names its three parts subject, predicate and object: in those terms, “has color” is the predicate and “white” is the object.)

Of course, if we take “the guitar is a musical instrument” we again have a triplet: “the guitar” is the subject, “is a” in the sense of “is a kind of” is the verb, and “musical instrument” is the predicate. Please note that we have another interesting fact here: while “my guitar” is a very specific guitar (that is, the guitar I own), “the guitar” is an abstract concept that applies to any guitar. They are not the same thing, they are two different concepts, but we can say without doubt that “my guitar (the first concept) is a guitar (the second concept)” (this is a case of instance/class relationship).

We can form very complex networks of subject-verb-predicate triplets describing the digital and non-digital world we live in, so that a computer can help us with complex searches and analyses. How do we differentiate all the concepts and ambiguities we just encountered? You guessed it: with URIs. We will have a URI to express the concept of “my guitar”, another URI to express the concept of “white”, other URIs to express the concepts of “guitar”, “musical instrument” and “color”, and we will also have different URIs expressing the concepts of “is (as a color)” and “is (a kind of)”. All these concepts are part of the description (also known as ontology) we want to give to the small world we created in this example. This description is rather simple and loose, but we can define far more complex ontologies. For example, we can create a color ontology, describing the colors and their relationships:

white is a color
red is a color
snow white is a kind of white
blood red is a kind of red
red is a warm color
blue is a cold color

or an instrument ontology describing

guitar is a six-string instrument
violin is a four-string instrument
four-string instrument is a string instrument
six-string instrument is a string instrument
string instrument is a musical instrument

Each of these concepts (“guitar”, “six-string instrument”, “violin”, “four-string instrument”, “string instrument”, “musical instrument”) will then be referred to by means of a unique URI.
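
To make the idea concrete, here is a minimal sketch in plain Python of the guitar example: every concept and every relation is a URI, statements are three-element tuples, and the “white musical instruments” search follows the “is a kind of” links. The namespace http://example.org/demo# and all the names built on it (hasColor, isA, kindOf and so on) are assumptions made up for this example, not part of any real ontology, and a real system would use an RDF library rather than hand-written tuples.

```python
EX = "http://example.org/demo#"  # hypothetical namespace for this sketch

def uri(name):
    """Build a URI in the hypothetical example namespace."""
    return EX + name

HAS_COLOR = uri("hasColor")  # "is", in the sense of "has color"
IS_A = uri("isA")            # instance -> class ("my guitar is a guitar")
KIND_OF = uri("kindOf")      # class -> broader class ("is a kind of")

triples = {
    (uri("myGuitar"), HAS_COLOR, uri("white")),
    (uri("myGuitar"), IS_A, uri("guitar")),
    (uri("white"), IS_A, uri("color")),
    (uri("guitar"), KIND_OF, uri("sixStringInstrument")),
    (uri("violin"), KIND_OF, uri("fourStringInstrument")),
    (uri("fourStringInstrument"), KIND_OF, uri("stringInstrument")),
    (uri("sixStringInstrument"), KIND_OF, uri("stringInstrument")),
    (uri("stringInstrument"), KIND_OF, uri("musicalInstrument")),
}

def broader_classes(cls, triples):
    """Return cls plus every class reachable by following KIND_OF links."""
    seen, todo = set(), [cls]
    while todo:
        current = todo.pop()
        if current in seen:
            continue
        seen.add(current)
        todo.extend(o for s, p, o in triples if s == current and p == KIND_OF)
    return seen

def instances_of(cls, triples):
    """Return the subjects whose class is cls or a narrower kind of cls."""
    return {s for s, p, o in triples
            if p == IS_A and cls in broader_classes(o, triples)}

def white_instruments(triples):
    """Return the subjects that are white and are musical instruments."""
    white = {s for s, p, o in triples
             if p == HAS_COLOR and o == uri("white")}
    return white & instances_of(uri("musicalInstrument"), triples)

print(white_instruments(triples))
```

The search finds “my guitar” even though no single statement says it is a musical instrument: that fact is derived by walking from guitar to six-string instrument to string instrument to musical instrument.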
