Biologists need knowledge in order to perform their work. A biologist will often use some pre-existing item of knowledge to make inferences about the item under investigation. The most common example of this within molecular biology is the use of sequence comparison to infer the function of a novel protein sequence. The reasoning is that if a sequence of unknown function is highly similar to a sequence of known function, then it is probable that the novel sequence also has that function. So, rather than using a rule, law or equation to find the function of a protein, a biologist uses the knowledge that a similar sequence has a known function to make a judgement about the function of the new sequence. This is why it is sometimes said that biology is a `knowledge based', rather than an `axiom based', discipline.
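The inference described above can be illustrated with a minimal sketch. It assumes toy sequences and annotations (all hypothetical), and it uses Python's `difflib.SequenceMatcher` as a crude stand-in for a proper sequence alignment score; the threshold of 0.8 is likewise an arbitrary assumption, not a value from this paper.

```python
from difflib import SequenceMatcher

# Hypothetical annotated sequences (toy data, not real proteins).
KNOWN = {
    "MKTAYIAKQRQISFVKSHFSRQ": "hypothetical kinase",
    "GSHMLEDPVDAFKLNQWERTYC": "hypothetical transporter",
}


def similarity(a: str, b: str) -> float:
    """Crude similarity in [0, 1]; a stand-in for a real alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def infer_function(query: str, known=KNOWN, threshold: float = 0.8):
    """Transfer the annotation of the most similar known sequence.

    Returns (function, score), or (None, score) when the best match
    falls below the (assumed) similarity threshold.
    """
    best_seq = max(known, key=lambda s: similarity(query, s))
    score = similarity(query, best_seq)
    if score >= threshold:
        return known[best_seq], score
    return None, score


# A query differing from the first known sequence by one residue.
function, score = infer_function("MKTAYIAKQRQISFVKSHFSRA")
print(function, round(score, 2))
```

The result is a judgement rather than a derivation: the function is only proposed as probable, and a query with no sufficiently similar known sequence yields no annotation at all.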
Modern biologists also need knowledge for communication. Biology is a data-rich discipline, and its data form a fund of knowledge from which biologists generate further knowledge. This knowledge is stored in many hundreds of databases, and many of these databases need to be used in concert during an investigation. Knowledge is vital in two respects during this process. First, when using more than one data store or analysis tool, a biologist needs to be sure that knowledge within one resource can be reliably compared with that in another. A prime example is the differing uses of the term `gene' within the community. In one database, a gene may be defined as `the coding region of DNA'; in another as `DNA fragment that can be transcribed and translated into a protein'; and in a third as `DNA region of biological interest with a name and that carries a genetic trait or phenotype'. Being able to conform to a common definition, or to reason about the differences between definitions in order to reconcile databases, would be advantageous. The second need for knowledge is to define and constrain data within a resource. Biological data can be very complex, not only in the types of data stored but also in the richness of, and the constraints upon, the relationships between those data. When designing a database it is useful to be able to describe what values can be specified for which attributes under which conditions. This is the encapsulation of biological knowledge within a database schema.
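As a rough illustration of the first problem, the sketch below records the three definitions of `gene' quoted above and flags any term that is defined differently across resources. The database names (`db_a`, `db_b`, `db_c`) and the `TermDefinition` structure are hypothetical; only the definition strings come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermDefinition:
    database: str    # hypothetical database name
    term: str
    definition: str


DEFINITIONS = [
    TermDefinition("db_a", "gene", "the coding region of DNA"),
    TermDefinition(
        "db_b", "gene",
        "DNA fragment that can be transcribed and translated into a protein",
    ),
    TermDefinition(
        "db_c", "gene",
        "DNA region of biological interest with a name and that carries "
        "a genetic trait or phenotype",
    ),
]


def conflicting_terms(definitions):
    """Return terms that carry more than one distinct definition."""
    by_term = {}
    for d in definitions:
        by_term.setdefault(d.term, set()).add(d.definition)
    return {term: defs for term, defs in by_term.items() if len(defs) > 1}


for term, defs in conflicting_terms(DEFINITIONS).items():
    print(f"'{term}' has {len(defs)} differing definitions")
```

Comparing definition strings only detects that a conflict exists. Reasoning about how the definitions differ, which is what reconciliation requires, needs the definitions to be expressed in a form a machine can interpret.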
It is impossible for a single biologist to deal with all the domain knowledge. The arrival of whole genomes and the knowledge they contain only exacerbates the situation. There is, therefore, a need to create systems that can apply the knowledge in the heads of domain experts to biological data. It is not envisaged that such systems could ever perform better than human experts; however, they could play a crucial role in processing data to the point where human experts can again apply their knowledge sensibly. This raises numerous questions, in particular how knowledge can be captured in ways that make it available and useful within computer applications.
This briefing is about the use of such knowledge within bioinformatics applications. Knowledge can be captured and made available to both machines and humans by an ontology. The case for ontologies within bioinformatics rests on the need to make knowledge available to that community and its applications. This paper is only a brief introduction, not a complete guide to the philosophy, building and use of an ontology. It does, however, aim to provide the foundations.
Section 2 gives the definitions of ontology and related terms. Section 3 describes the uses to which ontologies can be put, and Section 4 describes some current bioinformatics and molecular biology ontologies and how they are used. Section 5 describes the processes of conceptualisation and specification, that is, the building of an ontology. Finally, Section 6 draws together the main themes of the paper and explores the future of ontologies in the bioinformatics domain.