Assumptions and Requirements for the RASH Process
The following assumptions and requirements about bioinformatics and the RASH process as a whole can be made:
- The data sources will remain autonomous.
- The data sources can be regarded as read-only systems. Updates of the resources are only carried out by the provider. Multidatabase updates require heterogeneous database concurrency control mechanisms that are hard to develop (Sheth and Larson 1990).
- Schema and terminology changes within the data sources can be frequent.
- Documentation, especially on metadata, will often be difficult to find. This matters most for definitions of schema elements and keywords.
- The reconciliation will not be automatic. The BioCOMPASS will assist in the reconciliation process, but at this stage it is only `supported' not `automatic'.
- Queries will be asked only of the schema; they are not to be made for either data retrieval or data update.
- The schema representation used in RASH to hold information about resource schemas should be independent of any DBMS implementation.
- The RASH process must be able to accommodate all types of metadata representation without loss of information. Bioinformatics resources are stored as flat files, other semi-structured forms (e.g., ASN.1), RDB, OODB, UML, ACEdb, etc., and the RASH process should deal with them all.
- The RASH process should be able to deal with the resources as they appear, not necessarily as they are really stored. For example, EMBL is stored as a relational database, but appears to the public in flat-file form. Nevertheless, the RASH process should be able to deal with EMBL in whatever form it appears.
- The RASH process must be `easy'. The people undertaking the RASH process may not be computer scientists, database experts, etc., so the RASH process must be as simple as possible, given the inherent complexity of the task.
- As well as being `understandable', the process should be relatively quick: changes to the data source schema can be frequent, so the RASH process may have to be undertaken frequently.
- The RASH process must be replicable. With frequent schema changes, the native form and reconciled form of the data source will have to be regenerated. Many of the steps will be identical, despite the changes. Therefore, it will be necessary to document, if not record for automation, the steps involved in the RASH process. The retro-fitting tool of OPM has such a provision.
- The unifying schema produced by the RASH process should not lose data. For instance, suppose two database entries use different names for the same gene. A reconciled entry must contain the names from both originals. It may be that one is chosen as the preferred name, but all information should be retained. In such a case, two string attributes may become an aggregation of strings (in a list, the first element may take precedence).
- The RASH process must be applicable to both complete resources and parts of resources. When integrating resources, it is common to pick only parts of the input resources. In contrast, when making non-redundant resources (e.g., protein), complete databases form the inputs to the reconciled result.
- Provenance is an important aspect of bioinformatics resources. Each resource has an associated credibility. Moreover, the rapidly changing nature of the resources means that the version of a resource used in a reconciliation will be important. In the final RASH schema, each element should have its own provenance. In the gene name example, each element of the resulting collection would record the database from which it was derived.
- Documentation -- Given that conceptualisations and definitions of terminology are often missing from the data sources, it will be necessary for the reconciler to document their intentions, and assumptions used, during the RASH process. Any change to the information capacity must be recorded. It is likely that a reconciler may wish to either add or remove information from the final, unifying schema being developed.
- The metadata for the reconciled sources must be queryable. Part of the RASH proposal is the ability to query source descriptions to find what concepts are represented in which resources. The types of queries possible in the bioCOMPASS include `what sources talk about concept x' and `how does the concept x differ between sources a and b?'.
- The need is to characterise and query the resources, but not to interoperate between the resources nor actually perform the reconciliation. The techniques used to reconcile and represent the resources need not go as far as to interoperate; only to give the information necessary to interoperate.
- The irreconcilable must be recorded -- together with argumentation for why it is irreconcilable.
- Much bioinformatics data is in text form, or consists of aggregated scalar values in textual form (e.g., the SWISS-PROT description line). A part of schema manifestation in the RASH process should be to produce as much scalar information as possible. For example, SWISS-PROT's description line could be reduced to a list of names (the first being the preferred name) and an EC number. The latter can be thought of as just another name, but could be used to set a boolean flag for `isEnzyme'.
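
The requirements on data loss and provenance can be illustrated with a minimal sketch. It is not part of RASH or the bioCOMPASS; the class names (Provenance, ReconciledNames), the function reconcile, and the database and gene names are all hypothetical and chosen only for illustration. The sketch assumes that a reconciled gene-name attribute is an ordered list in which the first element is the preferred name, and that every name records the source and version from which it was derived.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Provenance:
    """Where a value came from (hypothetical structure, for illustration)."""
    source: str   # name of the resource, e.g. "DB_A"
    version: str  # version of the resource used in the reconciliation


@dataclass
class ReconciledNames:
    """Ordered names for one gene; the first entry is the preferred name."""
    entries: list = field(default_factory=list)  # [(name, [Provenance, ...])]

    def add(self, name, provenance):
        # A name already present keeps its position and gains a provenance.
        for existing, provs in self.entries:
            if existing == name:
                if provenance not in provs:
                    provs.append(provenance)
                return
        # A new name is appended, so nothing from any source is discarded.
        self.entries.append((name, [provenance]))

    @property
    def preferred(self):
        return self.entries[0][0] if self.entries else None

    def names(self):
        return [name for name, _ in self.entries]

    def sources_for(self, name):
        for existing, provs in self.entries:
            if existing == name:
                return [p.source for p in provs]
        return []


def reconcile(*source_entries):
    """Merge (name, Provenance) pairs; argument order expresses precedence."""
    merged = ReconciledNames()
    for name, prov in source_entries:
        merged.add(name, prov)
    return merged


# Two hypothetical databases use different names for the same gene.
merged = reconcile(
    ("geneX", Provenance("DB_A", "r1")),
    ("gX1", Provenance("DB_B", "r7")),
)
print(merged.preferred)           # the name chosen to take precedence
print(merged.names())             # every original name is retained
print(merged.sources_for("gX1"))  # provenance of an individual element
```

Keeping the provenance on each element, rather than on the reconciled entry as a whole, is one way to meet the requirement that each element of the final schema has its own provenance; other representations would serve equally well.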