School of Computer Science

Ontologies in Biology

An ontology, in the sense in which the term is used in informatics, is ``a representation of the shared background knowledge for a community''. Very broadly, it is an implementable model of the entities that need to be understood in common in order for some group of software systems and their users to function and communicate at the level required for a set of tasks. In doing so, an ontology provides the intended meaning of a formal vocabulary used to describe a certain conceptualisation of objects in a domain of interest. An ontology describes the categories of objects found in a body of data, the relationships between those objects and the relationships between those categories. It thereby describes those objects and sometimes defines what needs to be known in order to recognise one of them within the information being processed by an application. An ontology should be distinguished from thesauri, classification schemes and other simple knowledge organisation systems. By controlling the labels given to the categories in an ontology, a controlled vocabulary can be delivered, though an ontology is not itself a controlled vocabulary. When represented as a set of logical axioms with a strict semantics, an ontology can be used to make inferences about the objects that it describes and consequently provides a means to manipulate knowledge symbolically.

Ontology is a term with its origins with Aristotle in his writings on Metaphysics, IV,1, from the fourth century BCE. In very general terms, it is a branch of philosophy concerned with ``that which exists''; that is, a description of the things in the world. Philosophers in this field tend to be concerned with understanding what it means to be a particular thing in the world. The goal is to achieve a complete and true account of reality. Computer scientists have taken the term and somewhat re-defined it, removing the more philosophical aspects and concentrating upon the notion of a shared understanding or specification of the concepts of interest in a domain of information, one that can be used by both computers and humans to describe and process that information. The goal of a computer science ontology is to make knowledge of a domain computationally useful. There is less concern with a true account of reality, as it is information that is being processed, not reality.

Putting the string ``define: ontology'' into the Google search engine finds some twenty or so definitions of ontology. They all cluster around either a philosophical or a computer science definition of ontology. This is presumably the root of the jibe that ontology is all about definitions, but there is no definition of ontology. So, we should really distinguish between philosophical ontology and computer science ontology and remove some of the dispute. Tom Gruber has the most famous definition of ontology in the computer science sense and established the popularity of the word within the domain, though conceptual models of various types have been built within computer science for decades. Gruber's definition is:

``In the context of knowledge sharing, I use the term ontology to mean a specification of a conceptualisation. That is, an ontology is a description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents. This definition is consistent with the usage of ontology as set-of-concept-definitions, but more general. And it is certainly a different sense of the word than its use in philosophy.''

The most noteworthy point is that Gruber states that his definition of ontology is not ``ontology in the philosophical sense''. Computer science ontology is still informed by the philosophical, but the goals for their creation and use are different.

We live in a world of instances, individuals or objects. There are trees, flowers, the sky, stones, animals, etc. As well as these material objects, there are also immaterial objects, such as ideas, spaces, representations of real things, etc. In the world of molecular biology and beyond, we wish to understand the nature of, distinctions between and interactions of objects such as: small molecules and macromolecules; their functionalities; the cells in which they are made and work; the pieces of those cells; the tissues these cells aggregate to form; and so on. We do this through data collected about these phenomena and consequently we wish to describe the objects described in those data.

As human beings, we put these objects into categories or classes. These categories are a description of that which is described in a body of data. The categories themselves are a human conception. We live in a world of objects, but the categories into which humans put them are merely a way of describing the world; they do not themselves exist; they are a conceptualisation. The categories in an ontology are a representation of these concepts. The drive to categorise is not restricted to scientists; all human beings seem to indulge in the activity. If a community agrees upon which categories of objects exist in the world, then a shared understanding has been created.

In order to communicate about these categories, we need to give them labels. A collection of labels for the categories of interest forms a vocabulary or lexicon. Human beings can give multiple labels to each of these categories. This habit of giving multiple labels to the same category (synonymy), and the converse of giving the same label to different categories (polysemy), leads to grave problems when trying to use the descriptions of objects in biological data resources. This issue is one of the most powerful motivations for the use of ontologies within bioinformatics.
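
The labelling problem described above can be made concrete with a small sketch. The label and category names below are invented for illustration and are not taken from any real resource; the code simply finds categories that carry several labels and labels that point at several categories.

```python
from collections import defaultdict

# Hypothetical (label, category) assignments, invented for illustration.
LABELS = [
    ("catalase", "CatalaseEnzyme"),
    ("CAT", "CatalaseEnzyme"),
    ("CAT", "DomesticCat"),
    ("cat", "DomesticCat"),
    ("frog", "Frog"),
]

def find_synonyms(pairs):
    """Return the categories that have been given more than one label."""
    by_category = defaultdict(set)
    for label, category in pairs:
        by_category[category].add(label)
    return {c: sorted(ls) for c, ls in by_category.items() if len(ls) > 1}

def find_ambiguous_labels(pairs):
    """Return the labels that have been given to more than one category."""
    by_label = defaultdict(set)
    for label, category in pairs:
        by_label[label].add(category)
    return {l: sorted(cs) for l, cs in by_label.items() if len(cs) > 1}

synonyms = find_synonyms(LABELS)
ambiguous = find_ambiguous_labels(LABELS)
```

Here `ambiguous` contains only the label `CAT`, which a program cannot resolve without knowing which category was intended; this is the kind of reconciliation that an agreed set of categories and labels is meant to remove.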

As well as agreeing on the categories in which we will place the objects of interest described in our data, we can also agree upon what the labels are for these categories. This has obvious advantages for communication: everyone knows to which category of objects a particular label has been given. This is an essential part of the shared understanding. By agreeing upon these labels and committing to their use, a community creates a controlled vocabulary.

The objects of these categories can be related to each other. When each and every member of one category or class is also a member of another category or class, then the former is subsumed by the latter, or forms a subclass of the superclass. This subclass-superclass relationship is variously known as the `is-a', subsumption or taxonomic relationship. There can be more than one subclass for any given class. If every single kind of subclass is known, then the description is exhaustive or covered. Also, any pair of subclasses may overlap in their extent, that is, share some objects, or they may be mutually exclusive, in which case they are said to be disjoint. Both philosophical and ontology engineering best practice often advocate keeping sibling classes pairwise disjoint.
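
As a minimal sketch of these three notions, each category can be modelled as the set of objects it contains. The instance names and the three categories below are assumptions made for illustration. Checking sets of known instances only tests the data at hand; an ontology would normally state subsumption, covering and disjointness as axioms that hold for every possible object.

```python
# Hypothetical extents: each category is modelled as the set of its instances.
nucleic_acid = {"dna_1", "dna_2", "rna_1"}
dna = {"dna_1", "dna_2"}
rna = {"rna_1"}

def is_subclass(sub, sup):
    """Every member of `sub` is also a member of `sup` (the is-a relationship)."""
    return sub <= sup

def are_disjoint(a, b):
    """The two categories share no objects."""
    return a.isdisjoint(b)

def is_covered(sup, subs):
    """The listed subclasses together account for every member of `sup`."""
    return set().union(*subs) == sup

dna_is_a_nucleic_acid = is_subclass(dna, nucleic_acid)
siblings_disjoint = are_disjoint(dna, rna)
covering_complete = is_covered(nucleic_acid, [dna, rna])
```

With only `dna` listed as a subclass, `is_covered` returns `False`, because the object `rna_1` is left unaccounted for.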

As well as the `is-a' relationship, objects can be related to each other by different kinds of relationship. One of the most frequently used is the `partOf' relationship, which is used to describe how objects are parts of, components of, regions of, etc., other objects. Other relationships describe how one object `developsInTo' or is `transformed into' another object, whilst retaining its identity (such as tadpole to frog). The `deriveFrom' relationship describes how one object changes into another object with a change of identity. Another relationship describes how a discrete object can `ParticipateIn' a process object.
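
One common way to hold such labelled relationships is as (subject, relationship, object) triples. The sketch below is illustrative only: the biological statements are simplified assumptions, and the helper names `related` and `superclasses` are hypothetical rather than part of any ontology toolkit. The relationship labels follow those used in the text.

```python
# Hypothetical statements written as (subject, relationship, object) triples.
TRIPLES = [
    ("Mitochondrion", "is-a", "Organelle"),
    ("Organelle", "is-a", "CellComponent"),
    ("Mitochondrion", "partOf", "Cell"),
    ("Tadpole", "developsInTo", "Frog"),
    ("Enzyme", "ParticipateIn", "Catalysis"),
]

def related(subject, relationship, triples=TRIPLES):
    """List everything `subject` is linked to by `relationship`."""
    return [o for s, r, o in triples if s == subject and r == relationship]

def superclasses(category, triples=TRIPLES):
    """Follow is-a links upwards and collect every superclass reached."""
    found = []
    frontier = [category]
    while frontier:
        current = frontier.pop()
        for parent in related(current, "is-a", triples):
            if parent not in found:
                found.append(parent)
                frontier.append(parent)
    return found

parts = related("Mitochondrion", "partOf")
ancestors = superclasses("Mitochondrion")
```

Following `is-a` links repeatedly is justified by the definition of subsumption: if every mitochondrion is an organelle and every organelle is a cell component, then every mitochondrion is a cell component. No such chaining is assumed here for the other relationships.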

These relationships, particularly the `is-a' relationship, give structure to a description of a world of objects. The relationships, like the categories whose instances they relate, also have labels. Relationship labels are another part of a vocabulary. The structured description of objects also gives a structured controlled vocabulary. The entry on Ontologies and Life Science Database Management describes how such structured controlled vocabularies are exploited within biology.

So far, we have only described relationships that make some statement about the objects being described. It is also possible to make statements about the categories or classes. When describing a chemical element, for example `Helium', statements about the discovery date or industrial uses are about the category or class, not about the objects in the class. Each instance of a `Helium' object was not discovered in 1868; most helium atoms existed prior to that date, but humans discovered and labelled that category at that date.

Ideally, we wish to know how to recognise members of these categories. That is, we define what it is to be a member of a category. When describing the relationships held by an object in a category, we put inclusion conditions, or category membership criteria, upon those instances. We divide these conditions into two sorts:

  1. Necessary Conditions: These are conditions that an object must fulfil, but fulfilling that condition is not enough to recognise an object as being a member of a particular category.
  2. Necessary and Sufficient Conditions: These are conditions that an object must fulfil and are also sufficient to recognise an object to be a member of a particular category.

For example, each and every organic molecule of alcohol must have a hydroxyl group. That an organic molecule has a hydroxyl substituent is not, however, enough to make that molecule an alcohol. If, however, an organic molecule has a saturated backbone and a hydroxyl substituent on that backbone, this is enough to recognise it as an alcohol (at least according to the IUPAC ``Gold Book'').
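
The difference between the two sorts of condition can be sketched in code using the alcohol example. This is a deliberately simplified illustration: the `Molecule` record and its hand-set boolean flags are assumptions, not a chemical model, and `hydroxyl_on_saturated_backbone` stands in for a structural check that real cheminformatics software would compute.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Molecule:
    """Hypothetical record; the flags are set by hand for illustration."""
    name: str
    is_organic: bool
    has_hydroxyl: bool
    hydroxyl_on_saturated_backbone: bool

def meets_necessary_condition(m):
    """Necessary only: failing this rules a molecule out as an alcohol,
    but passing it does not prove the molecule is one."""
    return m.has_hydroxyl

def is_alcohol(m):
    """Necessary and sufficient (as stated in the text): passing this is
    enough to recognise the molecule as an alcohol."""
    return m.is_organic and m.has_hydroxyl and m.hydroxyl_on_saturated_backbone

ethanol = Molecule("ethanol", True, True, True)
phenol = Molecule("phenol", True, True, False)   # hydroxyl on an aromatic ring
ethane = Molecule("ethane", True, False, False)

results = {m.name: (meets_necessary_condition(m), is_alcohol(m))
           for m in (ethanol, phenol, ethane)}
```

Phenol passes the necessary condition but is not recognised as an alcohol, which shows why a necessary condition alone cannot be used to classify an object.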

In making such definitions, an ontology makes distinctions. A formal ontology makes these distinctions rigorously. Broad ontological distinctions would include those between `Continuant' and `Occurrent'; that is, between entities (things we can put in our hands) and processes. Continuants take part in processes and processes have participants that are continuants. Another distinction would be between `Dependent' and `Independent' objects. The existence of some objects depends on the existence of another object to ``bear'' that object. For example, a car is independent of the blue colour it bears, whereas that colour depends on the car that bears it. Continuants, for example, can be sub-categorised into material and immaterial continuants, such as the skull and the cavity in the skull. Making such ontological distinctions primarily helps in choosing the relationships between the objects being described, as well as providing some level of consistency.

Capturing such descriptions, including the definitions, forms an ontology. Representing these descriptions as a set of logical axioms with a strict semantics enables those descriptions to be reliably interpreted by both humans and computers. Forming a consensus on which categories should be used to describe a domain and agreeing on the definitions by which objects in those categories are recognised enables that knowledge to be shared.

The life sciences, unlike a discipline such as physics, have not yet reduced their laws and principles to mathematical formulae. It is not yet possible, as it is with physical observations, to take a biological observation, apply some equations, determine the nature of that observation and make predictions. Biologists record many facts about entities and from those facts make inferences. These facts are the knowledge about the domain of biology. This knowledge is held in the many databases and literature resources used in biology.

Due to human nature, the autonomous way in which these resources develop, the time span over which they develop, etc., the categories into which biologists put their objects and the labels used to describe those categories are highly heterogeneous. This heterogeneity makes the knowledge component of biological resources very difficult to use. Deep knowledge is required of human users, and the scale and complexity of these data make that task difficult. The computational use of this knowledge component is even more difficult, exacerbated by the overwhelmingly natural language representation of these facts.

In molecular biology, we are used to having nucleic acid and protein sequence data that are computationally amenable. There are good tools that inform a biologist when two sequences are similar. Any evolutionary inference based on that similarity, however, rests upon knowledge about the characterised sequence. Use of this knowledge has been human-dependent, and reconciliation of all the differing labels and conceptualisations used in representing that knowledge is necessary. For example, in post-genomic biology, it is possible to compare the sequences of the genome and the proteins it encodes, but not to compare the functionality of those gene products.

There is, therefore, a need to have a common understanding of the categories of objects described in biology's data and the labels used for those categories. In response to this need, biologists have begun to create ontologies that describe the biological world. The initial move came from computer scientists who used ontologies to create knowledge bases that described the domain with high fidelity; an example is EcoCyc. Ontologies were also used in projects such as TAMBIS to describe molecular biology and bioinformatics, to reconcile diverse information sources and to allow creation of rich queries over those resources. The explosion in activity came, however, in the post-genomic era with the advent of the Gene Ontology (GO). The GO describes the major functional attributes of gene products: molecular function, biological process and cellular component. Now some twenty plus genomic resources use GO to describe these aspects of the gene products of their respective organisms. Similarly, the Sequence Ontology describes sequence features; PATO (the Phenotype Attribute and Trait Ontology) describes the qualities necessary to describe an organism's phenotype. All these and more are part of the Open Biomedical Ontologies project.