The RASH Process

The RASH process describes a methodology for resolving semantic heterogeneities between bioinformatics resources. The process starts with choosing resources, followed by making implicit schema explicit (where necessary), developing a unifying schema (if desired), comparing the schema and then resolving the conflicts found. Information about each resource's schema, and about the equivalences and conflicts found, is stored in RASHdb, whose own schema is a model of the information that appears in a database schema. In effect, the RASH process is a methodology for populating RASHdb. This database is the core of the BioCOMPASS.

A case study using the SWISS-PROT and PIR protein sequence databanks will be used to illustrate the RASH process. An example that reconciled several schema fragments might be more realistic in terms of everyday bioinformatics tasks. This particular case study was, however, chosen for its simplicity, whilst still revealing some common features of the RASH process.

This complete reconciliation and integration of PIR and SWISS-PROT will take place under the following scenario: TAMBIS allows queries to be formed over multiple resources, but only one resource of each type may be queried. For example, only one protein sequence database can be used to answer queries concerning the concept `protein'. To relax this single-source assumption, a common view (schema) is needed for the multiplicity of protein sequence databases. In addition, it would be useful to remove any redundancy in the answers to queries, that is, the same protein appearing more than once.

Both these resources exhibit a common feature of bioinformatics resources -- they are, or appear to be, flat-file resources. In flat-file databanks, the schema is implicit. One major task within the RASH process is to make this implicit schema manifest.

The BioCOMPASS is used to manage the RASH process. Several assumptions and requirements have been declared for the RASH process -- these are stated to help ensure that the RASH process is appropriate to its task. At each stage of the process, data is entered into the RASHdb that sits behind the BioCOMPASS. The process outlined below is deceptively simple -- the devil lies in the detail. The principal points of each stage are given, with links to further detail and illustrations from the case study.

  1. Resource identification

    Before anything else can take place, the resources that will participate in the RASH process must be identified. The BioCOMPASS itself is one avenue for resource identification -- through its biology topic queries. Otherwise, web searches can yield the bioinformatics resource you require. Three useful web resources are: dbCAT, Amos' Web Links and the Molecular biology database index. Expasy's BioHunt offers a route to search for molecular biology resources. Finally, the January issue of each year's Nucleic Acids Research is dedicated to short articles on bioinformatics resources.

    The BioCOMPASS will guide the user to submit these data to the RASH management part of RASHdb. The principal task of the BioCOMPASS is to populate the RASHdb, and in doing so it supports the RASH process.

  2. Schema manifestation

    Many bioinformatics resources have no, or appear to have no, schema. This is usually because the resource is available, or appears to be available, as a flat-file. Otherwise, the resource may be available via a web-based user interface. In such cases, it will be necessary to develop an explicit schema for the resource.

    If a schema is available, it could be in any of the following forms:

    • ER or EER schema;
    • A collection of relational tables;
    • An object orientated database schema;
    • An ACEdb schema.

    The BioCOMPASS can accommodate all of these forms. It may well be easier, however, to transform these schema representations into RASH's preferred representation, EXPRESS, as submission to RASHdb can then be semi-automatic.

    Two schema in EXPRESS, one for SWISS-PROT and one for PIR, are available for the primary case study. These schema were made explicit using the documentation available for these resources.
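
    As a rough illustration of what making an implicit schema manifest involves, the sketch below reads a flat-file whose lines begin with a two-letter line code and whose records end with `//', and summarises which codes are optional or repeatable. It is not the method used for the case study, where the schema were derived from the resources' documentation. The sample records, accession numbers and function names are hypothetical, and the layout is only loosely modelled on SWISS-PROT.

```python
from collections import defaultdict

# Hypothetical, heavily abridged records in a SWISS-PROT-like layout:
# each line starts with a two-letter line code; "//" ends a record.
SAMPLE = """\
ID   EXAMPLE_PROT
AC   X00001;
DE   Hypothetical example protein.
KW   Example; Illustration.
KW   Sketch.
//
ID   OTHER_PROT
AC   X00002;
DE   Another hypothetical protein.
//
"""


def split_records(text):
    """Yield each record as a list of (line_code, value) pairs."""
    record = []
    for line in text.splitlines():
        if line.startswith("//"):
            if record:
                yield record
            record = []
        elif line.strip():
            record.append((line[:2], line[2:].strip()))
    if record:  # tolerate a final record with no terminator
        yield record


def infer_schema(text):
    """Summarise, per line code, whether it is optional and repeatable."""
    records = list(split_records(text))
    occurrences = defaultdict(list)  # code -> count in each record that has it
    for record in records:
        per_record = defaultdict(int)
        for code, _value in record:
            per_record[code] += 1
        for code, count in per_record.items():
            occurrences[code].append(count)
    return {
        code: {
            "optional": len(counts) < len(records),
            "repeatable": max(counts) > 1,
        }
        for code, counts in occurrences.items()
    }


schema = infer_schema(SAMPLE)
```

    A summary of this kind is only a starting point: the meaning of each line code, and the structure inside each value, still has to come from the resource's documentation before the schema can be written in EXPRESS.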

  3. Development of unifying schema

    There is necessarily a target schema for the reconciliation -- a schema to which the source schema must conform. One of the source schema can be promoted to be the unifying schema, e.g., SWISS-PROT is the unifying schema and PIR must be reconciled to that schema. Otherwise, either some intermediate form or synthesis of the source schema, or a novel schema, will be developed. A unifying schema for SWISS-PROT and PIR in EXPRESS can be viewed here. A unifying schema is not mandatory -- it is possible to reconcile each of the source schema with each other. This may, however, be costly in effort.
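
    The difference in effort can be made concrete with a small count. The sketch below assumes that each pair of schema needs one reconciliation, whatever the direction; the function names are illustrative only.

```python
def pairwise_reconciliations(n_schema: int) -> int:
    """Reconciliations needed when every source schema is compared with every other."""
    return n_schema * (n_schema - 1) // 2


def unifying_reconciliations(n_schema: int, promoted: bool) -> int:
    """Reconciliations needed when every source schema maps to one unifying schema.

    If one source schema is promoted to be the unifying schema, it needs no
    mapping to itself; a novel or synthesised schema needs one per source.
    """
    return n_schema - 1 if promoted else n_schema
```

    For the two resources in the case study the approaches cost much the same, but with five source schema the pairwise route needs ten reconciliations against four or five for a unifying schema.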

  4. Schema comparison

    It is at this stage that semantic heterogeneities in the schema are identified and resolved. At this stage of the process, RASHdb contains separate entries for two or more schema. Inter-schema relationships need to be created that identify equivalent schema elements. Properties of these inter-schema relationships will describe the type of heterogeneity and the mechanism by which it may be resolved. It is possible, however, for equivalent entities to be irreconcilable in one or both directions.
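
    One way to picture an inter-schema relationship is as a record that names the two equivalent elements, the kind of heterogeneity, and a conversion for each direction in which reconciliation is possible. The sketch below is an assumption made for illustration and does not reproduce the RASHdb schema. The class, the attribute names, the resource names and the heterogeneity labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Transform = Callable[[str], str]


@dataclass(frozen=True)
class InterSchemaRelationship:
    """One equivalence between elements of two schema (all names hypothetical)."""

    source: str                    # element in the source schema
    target: str                    # equivalent element in the target schema
    heterogeneity: str             # kind of conflict, e.g. "naming"
    forward: Optional[Transform]   # source -> target; None if irreconcilable
    backward: Optional[Transform]  # target -> source; None if irreconcilable

    def reconcilable(self) -> str:
        """Report the direction(s) in which values can be converted."""
        if self.forward and self.backward:
            return "both directions"
        if self.forward:
            return "source to target only"
        if self.backward:
            return "target to source only"
        return "irreconcilable"


# Same meaning, different attribute names: values carry over unchanged.
naming = InterSchemaRelationship(
    "resource_a.entry_name", "resource_b.identifier", "naming",
    forward=lambda value: value, backward=lambda value: value,
)

# The target keeps only the year, so a full date cannot be rebuilt from it.
granularity = InterSchemaRelationship(
    "resource_a.created_date", "resource_b.created_year", "granularity",
    forward=lambda value: value[:4], backward=None,
)
```

    The second relationship shows the one-directional case: a date can be reduced to a year, but the year alone cannot restore the date.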

  5. Instance conflict resolution
  6. Querying and presentation

    Once all the information from this run of the RASH process has been gathered and entered, the BioCOMPASS can be used to answer RASH queries. It is possible, for instance, to recover a resource schema, together with a list of `instructions' on how to place data from another, equivalent, resource into that schema. The BioCOMPASS user interface and mode of action are described fully here.
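
    The kind of query described above might look like the following sketch, which uses an in-memory SQLite database as a stand-in for a small part of RASHdb. The table layout, the column names, the resource names and the wording of the instructions are assumptions made for illustration. The real RASHdb schema and the BioCOMPASS interface are not reproduced here.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory stand-in for part of RASHdb (hypothetical layout)."""
    conn = sqlite3.connect(":memory:")
    # A NULL instruction marks a pair that is irreconcilable in that direction.
    conn.execute(
        """CREATE TABLE mapping (
               source_resource TEXT, source_element TEXT,
               target_resource TEXT, target_element TEXT,
               instruction TEXT)"""
    )
    conn.executemany(
        "INSERT INTO mapping VALUES (?, ?, ?, ?, ?)",
        [
            ("resource_a", "entry_name", "resource_b", "identifier",
             "copy unchanged"),
            ("resource_a", "created_date", "resource_b", "created_year",
             "keep the first four characters"),
            ("resource_b", "created_year", "resource_a", "created_date",
             None),
        ],
    )
    return conn


def instructions_for(conn, source, target):
    """List how to place data from `source` into the schema of `target`."""
    rows = conn.execute(
        "SELECT source_element, target_element, instruction FROM mapping "
        "WHERE source_resource = ? AND target_resource = ? "
        "AND instruction IS NOT NULL ORDER BY target_element",
        (source, target),
    ).fetchall()
    return [f"{tgt} <- {src}: {how}" for src, tgt, how in rows]


conn = build_demo_db()
a_into_b = instructions_for(conn, "resource_a", "resource_b")
b_into_a = instructions_for(conn, "resource_b", "resource_a")
```

    Here `a_into_b` lists two instructions, while `b_into_a` is empty because the only recorded relationship in that direction is irreconcilable.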