Data ontologies, models, structures… whatever you want to call them, these are the devices that allow the Semantic Web to operate as it is supposed to. And how exactly is that? The stated aim of the Semantic Web is to move from the unstructured document web towards a ‘web of data’. The Semantic Web, in essence, is built around the idea of structure, and the transformation from the current web, or Web 2.0, to its next iteration, which we might as well call Web 3.0, is a journey from a lack of structure to a domain in which structure is all-encompassing and essential to its proper workings.
This agenda is exemplified by the conundrum that faces most data mapping exercises involving a transformation from unstructured to structured data: how to deal with free-text fields. Let us imagine that we have a field in a database. It matters not whether the database is built on the relational model or otherwise. All that concerns us is that a series of fields of completely unstructured free-text data needs to be given some form of structure. Free-text fields are strings of characters that probably, although not necessarily, constitute written language, and as we all know, language can be extremely complex and nuanced; these qualities can be challenging for computers to deal with.
Let us return to our example. Here is a possible entry that might be encountered.
‘Bob is the father of Liam, who is married to Catherine’
OK, as a native English speaker, I can immediately deduce the sense of this series of words. We often forget how amazingly competent the human brain is at reading information from a source and, with seemingly instantaneous effect, interpreting the data as meaning. One can imagine this breaking down of the source data as the construction of a sort of network diagram of connections and inter-dependencies. The Bob node in our network is linked to the Liam node and this node is in turn linked to Catherine. Catherine, therefore, has an indirect association with Bob. The links that associate our three nodes are also meaningful. Bob is the father of Liam. Liam is married to Catherine. By inference, as humans we can understand that Liam is the son of Bob and that Catherine is also married to Liam. We might also suggest that Liam is Catherine’s husband and that she is his wife. Catherine is also Bob’s daughter-in-law, et cetera, et cetera.
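This inferential step can be sketched in code. Below is a minimal Python sketch, in which the statement is broken down into subject-predicate-object facts and a handful of hand-written rules stand in for the inferences a human makes freely. The relation labels and rules are illustrative assumptions, not part of any standard vocabulary.

```python
# The example statement, broken down into subject-predicate-object facts.
facts = {
    ("Bob", "is_father_of", "Liam"),
    ("Liam", "is_married_to", "Catherine"),
}

# Hand-written rules standing in for human inference: fatherhood implies
# sonhood (assuming, as the text does, that Liam is male), and marriage
# is symmetric.
inferred = set()
for subject, predicate, obj in facts:
    if predicate == "is_father_of":
        inferred.add((obj, "is_son_of", subject))
    if predicate == "is_married_to":
        inferred.add((obj, "is_married_to", subject))

print(sorted(inferred))
```

Note that the computer only ‘knows’ what the two rules tell it; every further fact a human would add (Catherine as daughter-in-law, for instance) would need another explicit rule.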
I can arrive at these final facts because I exist in a culture in which groups of individuals associate themselves in genetically related social units called families, and in which separate families can be linked through an event called marriage. These are cultural facts, ones that I have learnt through the living of my life, through my experience. It is this experience, and the inferential power that it brings, particularly in the act of interpreting something like data via language, that is key to the way that we interpret and negotiate the world around us.
But what happens when a computer tries to arrive at the same network diagram of meaning? The problem with current computers is that they are relatively dumb. For the most part, they cannot delve back into what we would call experience, and they find it difficult to infer facts from other facts unless they are expressly programmed to do so. This working reality is changing, and the wave of technological research that addresses these shortcomings we know as Artificial Intelligence. In perhaps a decade or two it is likely that we will need to redefine the limitations of computer-based reasoning, but that is a discussion for another time, and we will continue here in the context of what is currently achievable.
The matter of reading in our string of data and making sense of it falls under the category of natural language processing. It is a complex area and is not the focus of this discussion. So let us just assume that we have an algorithm that can interpret the characters of text. What next? We need a device around which we can model the meaning of the statement. You’ve probably realised that I am talking about metadata here, data about data, and that is true to an extent, but metadata is far too broad a label for what we are trying to do here. If we want to output truly valuable data to be published to the Semantic Web, we need to start talking about ontologies.
So what, then, is an ontology? The answer depends largely on whether you approach the question from a philosophical or a computer science standpoint, and while there is obvious overlap in the objectives of both fields, we will address the nature of ontologies mainly as they are understood within computer science. Here an ontology is a formal definition of the naming of, the properties of and the relationships between the entities that exist within a particular data domain. Essentially, this means that if you have a data domain that follows a particular ontology, but within that domain you have data that deviates from this prescription in any way, those deviations will not be visible to any client that interrogates the domain data from the perspective of the ontology.
In other words, one cannot see what one does not know exists. I am an archaeologist, and there are numerous examples of this phenomenon at work in the discipline. Take for instance the ‘discovery’ of ancient mud-brick by the German archaeologist Robert Koldewey during his work at the site of the ancient city of Babylon in what is now Iraq. Of course, mud-brick existed before Koldewey, but until he devised a paradigm with which to understand it, it was invisible to the archaeological community. Exactly the same is true of a data mining algorithm that queries a data resource for a particular type of data. It asks for information that it knows is likely to exist within the datastore, and it uses an ontology to arrive at this understanding.
Let us consider a simple ontology that could be used to model the meaning that is inferred from the Bob, Liam and Catherine statement. Ontologies are built upon the basic building block of the class. A class is a prototype for a grouping of values. This grouping might have a physical manifestation in the real world, as in the case of a car, or it might not, as in the case of a house mortgage. Classes contain properties that flesh out the class in which they are housed. For example, a Car class might have the properties make and registration number associated with it. A Mortgage class might have a balance and an owner property. Different classes will have different properties associated with them, and given that we all view the world in different ways, different people will undoubtedly construct different classes to describe the same nominal entities. Classes and their properties are therefore subjective, and those who argue, as some data modellers do, that ontologies model concepts that are universal across all languages and cultures fail to appreciate the nuance that both of these variables introduce.
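To make the idea of classes and properties concrete, here is a hedged sketch in Python. The property names follow the text; the types and the example values are assumptions of mine, and a different modeller would, as just argued, make different choices.

```python
from dataclasses import dataclass


@dataclass
class Car:
    # Properties named in the text; the string types are an assumption.
    make: str
    registration_number: str


@dataclass
class Mortgage:
    # Again, the property names come from the text, the types do not.
    balance: float
    owner: str


# Hypothetical instances purely for illustration.
my_car = Car(make="Ford", registration_number="AB12 CDE")
my_mortgage = Mortgage(balance=150000.0, owner="Bob")
```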
Our example ontology will undoubtedly need a Person class. And perhaps that might be enough. As I said before, ontology building is subjective. You might disagree with my ontology. In fact, I would expect you to. Our Person class will have a property for the person’s name. This will contain a piece of text. The Person class will also have the property is_father_of and is_married_to and in these two cases their values will contain other persons. It would not make sense to have a person as a value for the name property and similarly a text string might not work for the other two properties. Properties, therefore, can have what is known as a range, or a set of value types that will be accommodated. A property’s domain is the class in which it occurs.
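The Person class just described, with its properties and their ranges, might be sketched in Python as follows. The runtime type checks are a simplified stand-in for the range constraints a real ontology language would declare, not how such languages actually work.

```python
class Person:
    """The domain of the name, is_father_of and is_married_to properties."""

    def __init__(self, name):
        # Range of 'name': a text string.
        if not isinstance(name, str):
            raise TypeError("the range of 'name' is a text string")
        self.name = name
        self.is_father_of = None
        self.is_married_to = None

    def set_father_of(self, child):
        # Range of 'is_father_of': another Person instance.
        if not isinstance(child, Person):
            raise TypeError("the range of 'is_father_of' is Person")
        self.is_father_of = child

    def set_married_to(self, spouse):
        # Range of 'is_married_to': another Person instance.
        if not isinstance(spouse, Person):
            raise TypeError("the range of 'is_married_to' is Person")
        self.is_married_to = spouse
```

Passing a text string where a Person is expected, or vice versa, raises an error, which is exactly the sense in which a property only accommodates values within its range.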
Classes are, however, abstract models. They need to be realised in some way in order to enter into being. We call this process instantiation and its products data instances. Here is a diagram showing an instance of the Person class that refers to the Bob entity in our example.
And here are the data instances that represent the Liam…
...and Catherine entities.
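Since the diagrams themselves are not reproduced here, the three instances can be stood in for by plain Python dictionaries. The keys mirror the properties of the Person class described above; the use of dictionaries rather than a class is purely a convenience of this sketch.

```python
# Three data instances of the notional Person class.
bob = {"name": "Bob", "is_father_of": None, "is_married_to": None}
liam = {"name": "Liam", "is_father_of": None, "is_married_to": None}
catherine = {"name": "Catherine", "is_father_of": None, "is_married_to": None}

# Wire up the property values asserted in the example statement.
bob["is_father_of"] = liam
liam["is_married_to"] = catherine
```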
Now, if we didn’t know anything about the data contained within the dataset, it would be impossible to construct our query unless it asked for only the most basic of information. We might ask, for instance, how many entries are contained within the datastore, or we could request the first ten entries. Both of these are largely senseless questions. They will return information, but their meaning will need to be deciphered by a human as a secondary step, and they will be of no use to a computer algorithm.
If, on the other hand, we know the structure of the ontology that was used to model the data, then we can ask meaningful questions such as: give me a list of all the Person instances in your dataset, tell me their names, and tell me whether they are linked by marriage to any other Person instances. Ontologies, therefore, as long as they are known by all parties involved, allow datasets to be interrogated intelligently, and this ability is the primary objective of the Semantic Web.
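The contrast between the two kinds of query can be sketched as follows. The datastore and its field names are illustrative assumptions of this sketch, not any real triple store or query language.

```python
# A toy datastore of Person instances, modelled as dictionaries.
datastore = [
    {"type": "Person", "name": "Bob", "is_married_to": None},
    {"type": "Person", "name": "Liam", "is_married_to": "Catherine"},
    {"type": "Person", "name": "Catherine", "is_married_to": "Liam"},
]

# Without the ontology: a largely senseless question.
entry_count = len(datastore)

# With the ontology: a meaningful question - every Person instance,
# its name, and any marriage link it carries.
marriages = [
    (entry["name"], entry["is_married_to"])
    for entry in datastore
    if entry["type"] == "Person" and entry["is_married_to"] is not None
]

print(entry_count, marriages)
```

The first answer is a bare number that still needs a human to interpret it; the second is structured information a client algorithm can act on directly, precisely because both sides know the ontology.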
This need for public awareness of ontologies is also the reason that it is best practice to discover first whether an ontology already exists that might suit your data needs before you set about building one yourself. And on that note, we will talk next about one of the more popular public ontologies in the cultural heritage sector, the CIDOC CRM.