Semantic Heterogeneity in Bioinformatics Resources

A Definition of Semantic Heterogeneity

Hammer and McLeod describe semantic heterogeneity as follows: "By this [semantic heterogeneity] we mean variations in the manner in which data is specified and structured in different components. Semantic heterogeneity is a natural consequence of the independent creation and evolution of autonomous databases which are tailored to the requirements of the application system they serve." Won Kim expands this description of semantic heterogeneity in databases in the following way: "A schema contains a semantic description of the information in a given database. It is possible to define equivalent schemas in as many ways as there are data models. Further, the same (or similar) information can be represented in many ways in the same data model. Given such inter- and intra-model variability, it is indeed a formidable task to integrate many schemas into a homogeneous schema."

A schema is the organisation or structure of a database. Different people will structure or organise the same data or system in different ways. In addition to this human element, the modelling languages used to represent the modeller's view of the data differ in their capabilities. An ER schema, for instance, has neither inheritance nor collection types in its modelling repertoire, both of which are present in object-oriented modelling languages. This factor leads to further differences between schemata. Hammer and McLeod say that, in the database context, this heterogeneity refers to differences in the meaning and use of data that make it difficult to identify the various relationships that exist between similar or related objects in the schemata of different component databases. It is the identification and resolution of these differences that the RASH process seeks to accomplish.
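
The effect of differing modelling-language capabilities can be sketched in Python. This is an illustrative sketch only: the class names, the accession value and the keywords are hypothetical and are not taken from any real database. The object-oriented view uses inheritance and a collection-typed attribute; the flat, relational-style view of the same data has neither, so the keywords move into a separate set of rows.

```python
from dataclasses import dataclass, field


# Object-oriented view: inheritance and a collection-typed attribute.
@dataclass
class Sequence:
    accession: str


@dataclass
class ProteinSequence(Sequence):  # inheritance: a protein "is a" sequence
    keywords: list = field(default_factory=list)  # collection type


def to_flat_tables(protein):
    """Flatten the object into relational-style rows.

    With no inheritance and no collection types available, the keywords
    become separate rows linked back to the protein by its accession.
    """
    protein_row = {"accession": protein.accession}
    keyword_rows = [
        {"accession": protein.accession, "keyword": keyword}
        for keyword in protein.keywords
    ]
    return protein_row, keyword_rows


# Hypothetical instance, invented for illustration.
protein = ProteinSequence(accession="EXAMPLE1", keywords=["kw1", "kw2"])
protein_row, keyword_rows = to_flat_tables(protein)
```

Both forms hold the same information, yet one is a single class with a list-valued attribute and the other is two tables joined on the accession. A schema comparison has to recognise these as equivalent.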

Hammer and McLeod enumerate a set of causes for semantic heterogeneity that corresponds to Kim's:

  1. Metadata language (conceptual model): the differing capabilities of the languages used to represent the schema;
  2. Metadata specification (conceptual schema): differing schemata produced by different modellers;
  3. Object comparability: how objects in a schema are identified by the modeller (`species' and `organism' being the same object);
  4. Low-level data format: units of measure, etc.;
  5. Tools: differing DBMSs and interface tools altering the view taken of data.

Causes 1 to 4 are included within the standard view of semantic heterogeneity. Hammer and McLeod point out that cause 5 (tools) is really orthogonal to the first four. Batini et al. give the causes of semantic heterogeneity as follows:

  1. Different perspectives: this is the same as different modellers differing in how they conceptualise the same data;
  2. Equivalent constructs: this is the same as the differing capabilities of the schema representation languages;
  3. Incompatible design specifications: for the same data domain, application developers may have slightly differing purposes and constraints, which can lead to differing schemata.

Semantic heterogeneity thus concerns how data is represented in structural, organisational terms within a database. It captures, for example, whether some data is represented as an entity in one schema but only as an attribute of an entity in another. The definition extends to the data types used (integer, real or string, for example), the units used (centimetres or inches) and the precision (two or four decimal places; mark or grade in an exam). Semantic heterogeneity does not extend as far as the instances placed within the schemata of different databases. For example, this description of semantic heterogeneity does not cover the fact that SWISS-PROT entry P21598 is equivalent (not identical) to PIR entry S13142. Separate techniques are required to resolve these instance conflicts.
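
As a minimal sketch of the structural and representational differences just described, the following Python fragment holds the same length in two hypothetical schemas. All names and values are invented for illustration. In one schema the length is an attribute stored in centimetres; in the other the measurement is an entity of its own, stored in inches.

```python
CM_PER_INCH = 2.54  # exact by international definition

# Schema A (hypothetical): length is an attribute, in centimetres.
record_a = {"specimen_id": "s1", "length_cm": 12.70}

# Schema B (hypothetical): the measurement is a nested entity, in inches.
record_b = {
    "specimen_id": "s1",
    "measurement": {"kind": "length", "value": 5.0000, "unit": "inch"},
}


def length_cm_from_b(record):
    """Return the length of a schema-B record in centimetres."""
    measurement = record["measurement"]
    if measurement["unit"] == "inch":
        return measurement["value"] * CM_PER_INCH
    if measurement["unit"] == "cm":
        return measurement["value"]
    raise ValueError(f"unknown unit: {measurement['unit']!r}")


def same_length(a, b, decimals=2):
    """Compare a schema-A and a schema-B record at a chosen precision."""
    tolerance = 0.5 * 10 ** -decimals
    return abs(a["length_cm"] - length_cm_from_b(b)) < tolerance
```

Calling same_length(record_a, record_b) returns True once the entity-versus-attribute difference and the unit difference have both been bridged. Deciding that the two records describe the same specimen at all is an instance-level question and lies outside this sketch.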

A greyer area is encountered when considering whether the SWISS-PROT keyword `loop' in this entry is the same as the PIR keyword `p-loop', or whether the SWISS-PROT feature key `DISULFID' is equivalent to the term `disulfide' in PIR. It is easier to reconcile two large collections of keywords as a separate exercise from the data organisation reconciliation. Sometimes, a value in one database can correspond to an attribute or entity in another. For example, some of the feature key values from SWISS-PROT map onto record names in the feature table of PIR. These cases can be classified by their type of semantic heterogeneity if the value is assumed to be an attribute. The boundary between data and metadata is somewhat blurred and can depend upon perspective. Tackling this problem is a feature of the RASH schema comparison process.
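
Reconciling keyword collections separately from the schema can be sketched as a simple lookup table. The Python fragment below is an illustration under stated assumptions: the table holds only the two candidate pairs mentioned above, the names SWISSPROT_TO_PIR and to_pir are hypothetical, and whether each pair is a true equivalence remains a judgement for a curator.

```python
# Hypothetical correspondence table, keyed by (kind of term, SWISS-PROT term).
# Only the two candidate pairs named in the text are listed.
SWISSPROT_TO_PIR = {
    ("feature_key", "DISULFID"): "disulfide",
    ("keyword", "loop"): "p-loop",
}


def to_pir(kind, term):
    """Return the PIR term recorded for a SWISS-PROT term, or None."""
    return SWISSPROT_TO_PIR.get((kind, term))


# Terms without a recorded correspondence are reported, not guessed.
unmapped = [
    (kind, term)
    for kind, term in [("feature_key", "DISULFID"), ("keyword", "unlisted")]
    if to_pir(kind, term) is None
]
```

Keeping the kind of term in the key separates feature keys from keywords, since a value of one kind in one database may correspond to a structural element in the other.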

Classification of Semantic Heterogeneities

This classification of the semantic heterogeneities existing in database schemata has been taken from Won Kim. It has been adapted, by Won Kim, from an earlier, purely relational form to one that accommodates an object view. Here, the word entity is used for class, entity and table alike. Similarly, attribute is used for field, attribute and method. The classification is as follows:

      1. One to one entity conflict:
        1. Entity name:
          • different names for equivalent entities;
          • same name for different entities;
        2. Entity structure conflict:
          • missing attribute;
          • missing, but implicit attribute;
        3. Entity constraints;
        4. Entity inclusion;
      2. Many to many entity conflict:
        • Composition of two or more one to one entity conflicts;
      3. One to one attribute conflict:
        1. Attribute name:
          • different name for equivalent attribute;
          • same name for different attribute;
        2. Attribute constraints:
          1. integrity constraints;
          2. data type;
          3. composition;
        3. Default values;
        4. Attribute inclusion;
        5. Methods;
      4. Many to many attribute conflict:
        • Composition of two or more one to one attribute conflicts;
      5. Entity to attribute conflict:
        • Composition of two or more one to one entity and one to one attribute conflicts;
      6. Domain representation conflict:
        1. Different expression denoting same information;
        2. Different units;
        3. Different levels of precision.
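
A few of the one-to-one conflicts in this classification can be detected mechanically once each entity is described as a mapping from attribute name to data type. The Python sketch below is not Won Kim's method or the RASH process. It is a hypothetical illustration, and the entity descriptions, the synonym table and the function name compare_entities are all assumptions. Name conflicts cannot be found automatically, so equivalent names must be supplied by hand.

```python
from enum import Enum


class Conflict(Enum):
    ATTRIBUTE_NAME = "different name for equivalent attribute"
    MISSING_ATTRIBUTE = "missing attribute"
    DATA_TYPE = "attribute constraint: data type"


def compare_entities(left, right, synonyms=None):
    """List conflicts between two entities.

    left, right: dict mapping attribute name to data type name.
    synonyms: dict mapping a right-hand name to its left-hand equivalent,
    supplied by a person who knows both schemata.
    """
    synonyms = synonyms or {}
    conflicts = []
    renamed = {}
    # First record name conflicts and rename to the left-hand vocabulary.
    for name, dtype in right.items():
        if name in synonyms:
            conflicts.append((synonyms[name], Conflict.ATTRIBUTE_NAME))
            renamed[synonyms[name]] = dtype
        else:
            renamed[name] = dtype
    # Then compare structure and data types attribute by attribute.
    for name in sorted(set(left) | set(renamed)):
        if name not in left or name not in renamed:
            conflicts.append((name, Conflict.MISSING_ATTRIBUTE))
        elif left[name] != renamed[name]:
            conflicts.append((name, Conflict.DATA_TYPE))
    return conflicts


# Hypothetical entity descriptions, invented for illustration.
left = {"organism": "string", "length": "integer", "mass": "real"}
right = {"species": "string", "length": "real"}
found = compare_entities(left, right, synonyms={"species": "organism"})
```

For these inputs the sketch reports a name conflict on organism, a data type conflict on length and a missing attribute for mass. Many-to-many and entity-to-attribute conflicts are compositions of such one-to-one cases and would need a richer schema description than this one.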

Some Examples of Bioinformatics Schema Heterogeneity