The RASH process is a methodology for resolving semantic heterogeneities between bioinformatics resources. The process starts with choosing resources, followed by making implicit schemas explicit (where necessary), developing a unifying schema (if desired), comparing the schemas and then resolving the conflicts found. Information about each resource's schema, and about the equivalences and conflicts found, is stored in RASHdb, whose own schema is a model of the information that appears in a database schema. In effect, the RASH process is a methodology for populating RASHdb. It is this database that is the core of the BioCOMPASS.
A case study using the SWISS-PROT and PIR protein sequence databanks will be used to illustrate the RASH process. It might be more realistic, in terms of everyday bioinformatics tasks, to take as an example the reconciliation of several schema fragments. This particular case study was chosen, however, because it is simple yet still reveals some common features of the RASH process.
This complete reconciliation and integration of PIR and SWISS-PROT will take place under the following scenario: TAMBIS allows queries to be formed over multiple resources, but only one resource of each type may be queried. For example, only one protein sequence database can be used to answer queries concerning the concept `protein'. To relax this single-source assumption, a common view (schema) is needed over the multiple protein sequence databases. In addition, it would be useful to remove any redundancy in the answers to queries -- that is, the same protein appearing more than once.
Both these resources exhibit a common feature of bioinformatics resources -- they are, or appear to be, flat-file resources. In flat-file databanks, the schema is implicit. One major task within the RASH process is to make this implicit schema manifest.
The BioCOMPASS is used to manage the RASH process. Several assumptions and requirements have been declared for the RASH process -- these are stated to help ensure that the RASH process is appropriate to its task. At each stage of the process, data is entered into the RASHdb that sits behind the BioCOMPASS. The process outlined below is deceptively simple -- the devil lies in the detail. The principal points of each stage are given, and links are provided to further detail and illustrations from the case study.
Before anything else can take place, the resources that will participate in the RASH process must be identified. The BioCOMPASS itself is one avenue for resource identification, through its biology topic queries. Otherwise, web searches can yield the bioinformatics resource you require. Three useful web resources are: dbCAT, Amos' Web Links and Molecular biology database index. Expasy's BioHunt offers a route to search for molecular biology resources. Finally, the January issue of each year's Nucleic Acids Research is dedicated to short articles on bioinformatics resources.
Many bioinformatics resources have, or appear to have, no schema. This is usually because the resource is available, or appears to be available, as a flat file. Otherwise, the resource may be available only via a web-based user interface. In such cases, it will be necessary to develop an explicit schema for the resource.
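One way to get a first draft of an implicit flat-file schema is to scan the records and note which fields occur, which are optional and which may repeat. The following Python fragment is a minimal sketch of that idea, not the procedure used in the case study (where the schemas were derived from the resources' documentation). It assumes a format with a two-letter line code at the start of each line and a `//` record terminator; the sample records and the function name `infer_schema` are invented for illustration.

```python
from collections import Counter

# Invented sample: two records with two-letter line codes and a "//" terminator.
SAMPLE = """\
ID   EXAMPLE_1
AC   X00001;
DE   Hypothetical example protein.
OS   Example organism.
//
ID   EXAMPLE_2
AC   X00002;
DE   Another hypothetical protein,
DE   described over two lines.
//
"""


def infer_schema(text, terminator="//"):
    """Return {line_code: {"optional": bool, "repeatable": bool}}."""
    records = 0
    present = Counter()    # records in which each code occurs at least once
    repeatable = set()     # codes seen more than once within a single record
    in_record = Counter()  # occurrences of each code in the current record

    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith(terminator):
            # End of record: fold the per-record counts into the totals.
            records += 1
            for code, count in in_record.items():
                present[code] += 1
                if count > 1:
                    repeatable.add(code)
            in_record = Counter()
        else:
            in_record[line[:2]] += 1

    return {
        code: {
            "optional": present[code] < records,
            "repeatable": code in repeatable,
        }
        for code in present
    }


schema = infer_schema(SAMPLE)
for code, props in schema.items():
    print(code, props)
```

A draft produced this way only reflects the records that were scanned, so it would still need to be checked against the resource's documentation before being treated as the explicit schema.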
If a schema is available, it could be in any of the following forms:
EXPRESS schemas for the two resources in the primary case study, SWISS-PROT and PIR, are available. These schemas were made explicit using the documentation available for the two resources.
There is necessarily a target schema for the reconciliation -- a schema to which the source schemas must conform. One of the source schemas can be promoted to be the unifying schema, e.g., SWISS-PROT is the unifying schema and PIR must be reconciled to that schema. Otherwise, either some intermediate form or synthesis of the source schemas, or a novel schema, will be developed. A unifying schema for SWISS-PROT and PIR in EXPRESS can be viewed here. A unifying schema is not mandatory -- it is possible to reconcile each of the source schemas with every other. This may, however, be costly in effort: with n source schemas, pairwise reconciliation needs n(n-1)/2 comparisons, whereas reconciling each source against a single unifying schema needs at most n.
It is at this stage that semantic heterogeneities in the schemas are identified and resolved. By this point, RASHdb contains separate entries for two or more schemas. Inter-schema relationships now need to be created that identify equivalent schema elements. Properties of these inter-schema relationships describe the type of heterogeneity and the mechanism by which it may be resolved. It is possible, however, for equivalent entities to be irreconcilable in one or both directions.
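An inter-schema relationship of this kind can be pictured as a small record that links two schema elements, names the type of heterogeneity, and carries a conversion for each direction in which one exists. The sketch below is an assumption about how such a record might look, not the RASHdb design: the class names, the resources `RESOURCE_A` and `RESOURCE_B`, and the organism example are all hypothetical. A missing conversion stands for "irreconcilable in that direction".

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class SchemaElement:
    resource: str
    entity: str
    attribute: str


@dataclass(frozen=True)
class Correspondence:
    """Hypothetical inter-schema relationship between two equivalent elements."""
    left: SchemaElement
    right: SchemaElement
    heterogeneity: str                              # type of conflict
    left_to_right: Optional[Callable[[str], str]]   # None: irreconcilable
    right_to_left: Optional[Callable[[str], str]]   # None: irreconcilable

    def convert(self, value: str, to_right: bool = True) -> str:
        fn = self.left_to_right if to_right else self.right_to_left
        if fn is None:
            raise ValueError("irreconcilable in this direction")
        return fn(value)


# Invented example: RESOURCE_A stores "Scientific name (Common name)",
# RESOURCE_B stores the scientific name only. A -> B drops the common name;
# B -> A cannot recover it, so that direction is left as None.
organism = Correspondence(
    left=SchemaElement("RESOURCE_A", "protein", "organism"),
    right=SchemaElement("RESOURCE_B", "protein", "species"),
    heterogeneity="representation (information loss)",
    left_to_right=lambda v: v.split(" (")[0],
    right_to_left=None,
)

print(organism.convert("Homo sapiens (Human)"))
```

Storing the two directions separately keeps the one-way case explicit, rather than hiding it behind a conversion that silently loses data.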
Once all the information from this run of the RASH process has been gathered and entered, the BioCOMPASS can be used to answer RASH queries. It is possible, for instance, to recover a resource schema, together with a list of `instructions' on how to place data from another, equivalent, resource into that schema. The BioCOMPASS user interface and mode of action are described fully here.
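To give a feel for such a query, the fragment below is a rough sketch using an in-memory SQLite database. The table layout (`element`, `correspondence`), the resource names `TARGET` and `SOURCE`, and the function `instructions` are assumptions made for illustration and are not the actual RASHdb schema or BioCOMPASS interface. The query returns every element of the target schema, paired with the equivalent source element and its resolution where one has been recorded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE element (
    id       INTEGER PRIMARY KEY,
    resource TEXT NOT NULL,
    name     TEXT NOT NULL
);
CREATE TABLE correspondence (
    source_id  INTEGER NOT NULL REFERENCES element(id),
    target_id  INTEGER NOT NULL REFERENCES element(id),
    conflict   TEXT NOT NULL,
    resolution TEXT            -- NULL: irreconcilable in this direction
);
""")

# Invented schema elements for two hypothetical resources.
conn.executemany("INSERT INTO element VALUES (?, ?, ?)", [
    (1, "TARGET", "accession"),
    (2, "TARGET", "description"),
    (3, "TARGET", "organism"),
    (4, "SOURCE", "acc_no"),
    (5, "SOURCE", "title"),
])
conn.executemany("INSERT INTO correspondence VALUES (?, ?, ?, ?)", [
    (4, 1, "naming", "copy value unchanged"),
    (5, 2, "naming", "copy value unchanged"),
])


def instructions(conn, target, source):
    """List each target element with its source equivalent, if any."""
    return conn.execute("""
        SELECT t.name, m.source_name, m.conflict, m.resolution
        FROM element AS t
        LEFT JOIN (
            SELECT c.target_id, s.name AS source_name,
                   c.conflict, c.resolution
            FROM correspondence AS c
            JOIN element AS s ON s.id = c.source_id
            WHERE s.resource = ?
        ) AS m ON m.target_id = t.id
        WHERE t.resource = ?
        ORDER BY t.id
    """, (source, target)).fetchall()


rows = instructions(conn, "TARGET", "SOURCE")
for row in rows:
    print(row)
```

The left join keeps target elements that have no recorded equivalent, so gaps in the reconciliation show up as rows with empty source columns rather than disappearing from the result.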