Reconcile and Share (RASH): Methodologies for Coping with Semantic Heterogeneity in Bioinformatics Resources

Robert Stevens, Carole Goble, Norman W. Paton and Andy Brass
The Department of Computer Science
The University of Manchester
Oxford Road

School of Biological Sciences
University of Manchester
Oxford Road
Manchester M13 9PT
robert.stevens@cs.man.ac.uk

RASH, which started in February 2000, aims to address the problem of semantic heterogeneity in bioinformatics resources. The biological community is a distributed one, with a culture of sharing information. This network of information services forms a loose federation of autonomous, distributed, heterogeneous data repositories. These repositories are typically not databases, but proprietary file structures with their associated search facilities and analysis tools. Often, the schemas (or metadata) of such repositories are either implicit or not easily available, so it is difficult to determine exactly what type of conceptual information is captured within specific data records.

Similarly, many resources overlap in their content, but may vary considerably in the view they take of that content. This can make it difficult to integrate resources in sensible and structured ways. Moreover, the metadata of the sources changes frequently. Thus, a biologist using these resources may face questions such as:

Which resource contains information on concept x?
Is resource A's view of concept x different from resource B's view of the same concept?
Can resources A and B be combined easily?

Many bioinformaticians spend considerable amounts of time repeatedly integrating these diverse resources. Database heterogeneity is therefore a major problem in bioinformatics. There are broadly two types of heterogeneity problem:

Syntactic: differing storage paradigms and formats, platforms, type systems and communication protocols.
Semantic: overlapping intensional descriptions, where the same information is represented by differing models, and overlapping extensions, where the same instance, or different aspects of it, is present in multiple sources.
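
As an illustration only, the two kinds of semantic overlap can be sketched in a few lines of Python. Everything here is hypothetical: the two sources, their field names, the records and the mapping tables are invented for the example and do not describe any real resource or the RASH methodology itself. The sketch assumes the simplest possible reconciliation, a hand-written mapping from one source's schema onto the other's.

```python
# Hypothetical records from two invented sources describing proteins.
# Source A and source B model the same concepts with different field
# names and different value conventions (intensional overlap).
source_a = [
    {"accession": "P1", "protein_name": "example kinase", "organism": "Homo sapiens"},
    {"accession": "P2", "protein_name": "example ligase", "organism": "Mus musculus"},
]
source_b = [
    {"id": "P1", "desc": "example kinase", "species": "human"},
    {"id": "P3", "desc": "example protease", "species": "mouse"},
]

# Assumed, hand-written reconciliation: which field in B corresponds to
# which field in A, and how B's species labels map onto A's organism names.
FIELD_MAP_B_TO_A = {"id": "accession", "desc": "protein_name", "species": "organism"}
SPECIES_TO_ORGANISM = {"human": "Homo sapiens", "mouse": "Mus musculus"}


def to_global_view(record_b):
    """Rewrite a source-B record using source A's field names and values."""
    out = {FIELD_MAP_B_TO_A[key]: value for key, value in record_b.items()}
    out["organism"] = SPECIES_TO_ORGANISM.get(out["organism"], out["organism"])
    return out


def extensional_overlap(records_a, records_b):
    """Return the accessions of instances present in both sources."""
    ids_a = {record["accession"] for record in records_a}
    ids_b = {to_global_view(record)["accession"] for record in records_b}
    return ids_a & ids_b


if __name__ == "__main__":
    print(to_global_view(source_b[0]))
    print(extensional_overlap(source_a, source_b))
```

Once the mapping is written down explicitly, the first record of source B can be seen to describe the same instance as the first record of source A. In practice the hard part is eliciting and agreeing on such mappings, which is not something the sketch attempts.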

Creating the common global view of the sources, needed for integration and interoperation, requires that semantic heterogeneities be made explicit and reconciliations proposed. Rather than individuals and groups within the community repeatedly ploughing the reconciliation furrow, the community needs access to information on semantic heterogeneity and a systematic, replicable way of identifying and reconciling it, yielding a sharable resource that covers a wide range of the available sources.

This project has two major objectives:

To provide a methodology by which it will be possible to systematically identify, describe and reconcile the contents and semantic heterogeneities of bioinformatics sources.
To develop a web-based tool, the BioCOMPASS, that supports the description, analysis and interrogation of the heterogeneities between the sources, so that they can be more easily shared and examined by the wider community.

The work required to achieve these objectives breaks down into three major areas: the development of reconciliation methodologies; the construction of source and unifying schemas; and the development of a software tool to allow the management and sharing of this information.

As the RASH process is developed, the details of that process and the case studies supporting the process will become available via this site.