Description and classification of shims in myGrid

Last updated on 2006-11-23 by Duncan Hull, University of Manchester, UK.

This document outlines some of the “shims” in myGrid: software components that align the input and output of closely related data. The shims have been implemented in two ways: on the client side, in Taverna, as “Local Java Widgets”, and on the server side as individual Web Services. Two classifications are presented here. One, by input, describes the client-side shims for software developers. The other classifies both client-side and server-side shims for end-users (e.g. biologists), based on the relationship between the input and output of a shim.

DISCLAIMER: This technical report is an incomplete and ongoing work, liable to change without notice. Some of these software components have been difficult to describe because it is not clear from the documentation what inputs they take and what outputs they produce without invoking them. These components are commented in red. Also, the Taverna workbench is a moving target, so some of these components (e.g. BIND) have become obsolete but are left here for historical purposes.

1. Classification of shims by I/O relationship

This classification aims to help end-users of Taverna find shim services when required. The shims are currently organised into four classes, based on the relationship between the input and output of the shim.

1. Input uniquelyIdentifies Output
e.g. a DE-REFERENCER service like RefSeq, which hasInput Accession_number and hasOutput RefSeq_Record, the object that the identifier identifies (see RefSeq Workflow Example). Another example is Text File Reader, which takes a file URL and retrieves the file at that URL.
2. Input isPartOf Output
e.g. an EXTRACTOR like BLAST simplifier on Phoebus, which extracts specified sub-parts of a BLAST report, e.g. all the GI numbers.
3. Input equivalentTo Output
e.g. a MAPPER that converts one identifier to another, like SHoundGBAccFromGi, which converts a GI identifier to a GenBank Accession number.
4. Input and Output are different representations of the same thing
e.g. a SYNTAX TRANSLATOR like SeqRet, which can be used to convert given DNA sequences between different syntaxes, e.g. FASTA to NCBI.
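
As a rough illustration, the four classes above could be recorded as a small lookup structure. This is only a sketch: the names ShimClass, EXAMPLES and shims_in_class are hypothetical and are not part of Taverna or myGrid; the example shims are the ones named in the list above.

```python
from enum import Enum


class ShimClass(Enum):
    """The four I/O relationships listed above (hypothetical names)."""
    DEREFERENCER = "Input uniquelyIdentifies Output"
    EXTRACTOR = "Input isPartOf Output"
    MAPPER = "Input equivalentTo Output"
    SYNTAX_TRANSLATOR = "Input and Output are different representations of the same thing"


# Example shims named in the text, tagged with their class.
EXAMPLES = {
    "RefSeq": ShimClass.DEREFERENCER,
    "Text File Reader": ShimClass.DEREFERENCER,
    "BLAST simplifier": ShimClass.EXTRACTOR,
    "SHoundGBAccFromGi": ShimClass.MAPPER,
    "SeqRet": ShimClass.SYNTAX_TRANSLATOR,
}


def shims_in_class(shim_class):
    """Return the example shim names in the given class, sorted by name."""
    return sorted(name for name, cls in EXAMPLES.items() if cls is shim_class)


print(shims_in_class(ShimClass.DEREFERENCER))
```

A lookup of this kind is one way an end-user tool might answer the question "which shims dereference an identifier?" without the user knowing the service names.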

Some of these types of shim are shown in the following workflows, which can be visualised and run in the Taverna workbench.

  1. Syntax translators: e.g. the SeqRet shim, which reads and writes sequences. In this example workflow, seqret is used to translate a sequence from FASTA to EMBL format: http://www.cs.man.ac.uk/~hulld/workflows/syntax_transation_with_seqret.xml
  2. Dereferencers: e.g. the GetFASTA shim, which retrieves a sequence for a GI number; see the example workflow at http://www.cs.man.ac.uk/~hulld/workflows/identifier_dereferencing_with_get_nucleotide.xml
  3. Identifier mappers: e.g. the SRS links shim, which can be used to map equivalence between different identifiers, as in http://www.cs.man.ac.uk/~hulld/workflows/identifier_mapping_with_SRS_links.xml
  4. Part extractors: e.g. the XPathTextWorker, which extracts the sequence from a GenBank record identified by the given GI, as in http://www.cs.man.ac.uk/~hulld/workflows/part_extraction_with_xpath.xml
  5. Finally, all these classes of shim service are combined in one workflow for illustration: http://www.cs.man.ac.uk/~hulld/workflows/allshims.xml

2. Classification of shims by input

The classification below by Tom Oinn and Mark Fortner is based on the type of data that the shim operates on (e.g. xml, ncbi, text etc). This classification is currently used to organise the shims in the Available Services Panel under Available processors > Local Services > Local Java widgets. However, this classification is currently of limited use to users of Taverna who are not software developers.

Service name (with link to Taverna API) | Description | Input I | Output O | Relation between O and I

Classified as "List"

StringListMerge | Merge string list to string. Consumes a string list and an optional separator character and emits a string formed from the concatenation of all items in the list, with the separator (default newline) interposed between them. | String list (and optional separator) | Merged list | output hasPart input
FlattenList | Flatten I(I()) to I(). Consumes a list of lists and emits a list containing the first-level flattening of the input. | List. Set of sets? | List. Subset? | Unclassified
StringStripDuplicates | Remove duplicate strings. Consumes a string list and emits the string list with duplicate entries removed. The first occurrence of a duplicate is preserved and all subsequent ones omitted, i.e. the string list 'a,b,c,b,a,d' is converted to 'a,b,c,d'. | String list | String list (stripped) | Unclassified
EchoList | EchoList. Echoes the input list to the output list and does no actual processing at all. This class is intended to be used in conjunction with nested workflows in order to split the iteration out from the previous stage in the flow. | List | List | Unclassified
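
The behaviour described for the first three list shims can be sketched in a few lines. This is an illustrative sketch written from the descriptions above, not the Taverna implementation; the function names are hypothetical.

```python
def string_list_merge(strings, separator="\n"):
    """Concatenate all items, with the separator (default newline) between them."""
    return separator.join(strings)


def flatten_list(list_of_lists):
    """Flatten one level only: a list of lists becomes a single list."""
    return [item for sublist in list_of_lists for item in sublist]


def string_strip_duplicates(strings):
    """Keep the first occurrence of each string and drop later repeats."""
    seen = set()
    result = []
    for s in strings:
        if s not in seen:
            seen.add(s)
            result.append(s)
    return result


print(string_strip_duplicates(["a", "b", "c", "b", "a", "d"]))  # ['a', 'b', 'c', 'd']
```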

Classified as "io"

TextFileWriter | Write Text File. This processor writes the "filecontents" out to the URL specified in the "outputFile" parameter. Note that the outputMap is always empty. | filecontents? | outputFile | Unclassified
EnvVariableWorker | Get Environment Variables as XML. This processor exposes the Java environment variables as an XML document. | None | Environment variables | n/a
FileListByRegexTask | List files by regex. This processor lists the files in a given subdirectory using a regular expression. | Directory and regular expression | FileList | Directory hasPart file?
LocalCommand | Execute cmd line app. This processor executes a command line and returns the response as a String. | command (string) | result (string) | Unclassified, could be anything, depends on command
FileListByExtTask | List Files By Extension. This processor gets a list of files in a local directory. | directory, extension | filelist | directory hasPart file?
DataRangeTask | Select Data Range From File. Extracts a range of values from a two-dimensional data array. | array, starting point, end point | array | outputArray isPartOf inputArray
TextFileReader | Read Text File. Reads text from a file specified by the "fileurl" attribute and returns the results in the "filecontents" item in the outputMap. | fileurl | filecontents | filecontents isidentifiedby fileurl
ConcatenateFileListWorker | Concatenate files. Concatenates a series of text files and saves the results into the output file. | filelist, outputfile, displayresults | results | results hasPart input
DataRangeColumnTask | Select Column From File. Extracts a single column of data from a data array produced by either an ExcelFileReader or a DelimitedFileReader. | array, column | array | Unclassified, SELECT or PROJECT?
ExcelFileReader | Read Excel File. Reads an Excel spreadsheet and creates an ArrayList of ArrayLists containing string data. Note that formulas are not currently evaluated and thus are returned as empty strings. | filename, firstRowContainsColumnNames, dateIndexes | data | Unclassified
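
To make the two file-listing shims concrete, here is a minimal sketch of listing files by regular expression and by extension. The function names are hypothetical, and it is an assumption that the regular expression is matched anywhere in the file name; the Taverna workers may behave differently.

```python
import os
import re


def list_files_by_regex(directory, pattern):
    """Return sorted names of files in the directory whose name matches the pattern."""
    regex = re.compile(pattern)
    return sorted(
        name for name in os.listdir(directory)
        if regex.search(name) and os.path.isfile(os.path.join(directory, name))
    )


def list_files_by_extension(directory, extension):
    """Return sorted names of files in the directory that end with the extension."""
    suffix = "." + extension.lstrip(".")
    return sorted(
        name for name in os.listdir(directory)
        if name.endswith(suffix) and os.path.isfile(os.path.join(directory, name))
    )


# Usage: list the XML files (e.g. saved workflows) in the current directory.
print(list_files_by_extension(".", "xml"))
```

Both functions relate a directory to the files it contains, which is why the table tentatively suggests "directory hasPart file" for these shims.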

Classified as "Metadata"

GetLSID | Get internal LSID of input. Outputs "replacelsid:input", which should be substituted for the input's LSID by the ProcessorTask. Chris Greenhalgh wrote this strips out two (or three?) of the five? components of an LSID | input | replacelsid | Unclassified: try it out

Classified as "xml"

XPathTextWorker | XPath From Text. Applies an arbitrary XPath expression to an XML document and returns a nodelist containing the nodes that match the XPath expression. | xpath, xml-text | nodelist, nodelistAsXML | Unclassified
XSLTWorker | Transform XML. Transforms an input XML document into an output document. If an inFileURL is supplied, it will use the document located at the URL as input. If the xml-text is supplied, it will use this in-memory XML document as input. If an outputFile URL is supplied, the results will be written to the output document. | xslFileURL, outFileURL, inFileURL, outputExt. @tavinput xslFileURL: the complete path to the XSL file; @tavinput outFileURL: the complete path to the output file (optional); @tavinput inFileURL: the complete path to the input file; @tavinput xml-text: the XML text to be processed (optional); @tavinput outputExt: the output file extension. Use this only if you want to add the extension to the input filename and use it as the output file name. | outputStr. @tavoutput outputStr: a string containing the output text. This is useful if you want to connect this processor to another and pass the results to it. | Unclassified
XPathWorker | XPath From XML File. Applies an arbitrary XPath expression to an XML document and returns a nodelist containing the nodes that match the XPath expression. | xpath, xmltext | xml-text, nodelist | unclassified
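
The part-extraction role of the XPath shims can be illustrated with the Python standard library. This is a sketch only: xml.etree.ElementTree supports a limited subset of XPath rather than arbitrary expressions, the function name xpath_text is hypothetical, and the XML fragment is a made-up, heavily shortened stand-in for a GBSeq record.

```python
import xml.etree.ElementTree as ET


def xpath_text(xpath, xml_text):
    """Apply a (limited) XPath expression and return the text of each matching node."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall(xpath)]


# Made-up fragment for illustration; real GBSeq records contain many more elements.
record = (
    "<GBSeq>"
    "<GBSeq_locus>EXAMPLE</GBSeq_locus>"
    "<GBSeq_sequence>atgc</GBSeq_sequence>"
    "</GBSeq>"
)

print(xpath_text(".//GBSeq_sequence", record))  # ['atgc']
```

Here the output (the sequence) is a part of the input (the record), which is the relationship the part-extractor class describes.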

Classified as "ncbi"

Note: parameter naming conventions follow those outlined in the Entrez Programming Utilities.

SNPWorker | Get SNP XML. Fetches SNP records. | id, rettype, retmode. Doesn't work: tried SNP id e.g. rs3091213 (taken from this list of SNPs) with no luck. See Get SNP example workflow (produces empty results). rettype defaults to xml, but it is not clear what values retmode takes or whether it is mandatory. | resultsXml | I uniquelyIdentifies O
ProteinGBSeqWorker | Get Protein GBSeq XML. Fetches protein data in GBSeq XML format. | id: GenBank identifier without the leading "GI:", e.g. 1293613; see GenBank Sequence example workflow | outputText | I uniquelyIdentifies O
NucleotideGBSeqWorker | Get Nucleotide GBSeq XML. Returns a GBSeq formatted record. | id: GenBank identifier without the leading "GI:", e.g. 1293613; see GenBank Sequence example workflow | outputText | I uniquelyIdentifies O
NucleotideFastaWorker | Get Nucleotide FASTA. Fetches a nucleotide sequence in FASTA format. | id: accession number, e.g. U49845; see Nucleotide FASTA workflow example | outputText | I uniquelyIdentifies O
NucleotideINSDSeqXMLWorker | Get Nucleotide INSDSeq XML. Returns an INSD formatted nucleotide record. | id: the nucleotide accession, e.g. U49845; see Nucleotide INSD workflow example | outputText, e.g. INSD formatted nucleotide record | I uniquelyIdentifies O
HomoloGeneWorker | Homologene XML. Fetches HomoloGene data from NCBI. | term, maxRecords, outputFile, xslt, ext. Example? Homologene terms? [Ancestor] Taxonomic name of common ancestor of the species represented in a HomoloGene entry. [Gene Description] Detailed description of a Gene. [Gene Id] Unique Gene Identifier. [Gene Name] Gene Aliases. [Nucleotide Accession] GenBank accession identifier of nucleotide sequence. [Nucleotide GI] Unique Nucleotide identifier. [Organism] Description of the organism or the NCBI Taxonomy ID of a species. [Protein Accession] The protein accession number of a protein. [Protein GI] Unique Protein identifier. [Text Word] Free text to be searched for in HomoloGene. [Title] Summary of HomoloGene entry. [UniGene ID] Unique UniGene identifier. | resultsXml | unclassified
ProteinFastaWorker | Get Protein FASTA. Fetches a protein sequence in FASTA format. | id: is this a GenBankIdentifier or something else? | outputText | dereferencer
LocusLinkWorker | Get LocusLink XML. Fetches LocusLink data from the NCBI database as XML. | id, rettype, retmode. Is this a GenBankIdentifier or something else? What values do rettype and retmode take? | outputText | dereferencer
PubMedSearchWorker | Search PubMed. Downloads PubMed records in XML format. Since NCBI does not currently support a pure XML | term, database, minDate, maxDate, reldate, rettype, cmd, cmd_current, dopt, orig_db, disp_max. Any valid PubMed query string / term? (e.g. apweiler). This doesn't currently work; are all the other parameters mandatory? See PubMed search example workflow | resultsXml | unclassified
NucleotideTinySeqXMLWorker | Get Nucleotide TinySeq XML. Fetches a nucleotide sequence from NCBI and returns the results in the TinySeqXML format. | id: is this a GenBankIdentifier or something else? | outputText | dereferencer
OMIMWorker | Get OMIM XML. Fetches an OMIM record from the NCBI database in XML format. | term, maxRecords, outputFile, xslt, ext. Can you give example terms? Are the other parameters optional or mandatory? | resultsXml | unclassified
EntrezGeneWorker | Get Entrez Gene XML. Fetches an Entrez Gene record in XML format. It can also transform the resulting XML document. | term, maxRecords, outputFile, xslt, ext. Can you give example terms? Are the other parameters optional or mandatory? | resultsXml | unclassified
ProteinINSDSeqXMLWorker | Get Protein INSDSeq XML. Fetches an INSD formatted protein record. | id: is this a GenBankIdentifier or something else? | outputText | dereferencer
ProteinTinySeqXMLWorker | Get Protein TinySeq XML. Fetches a protein in TinySeqXML format. | id: is this a GenBankIdentifier or something else? | outputText | dereferencer
EntrezProteinWorker | Get Entrez Protein XML. Fetches an Entrez Protein record from NCBI. | term, maxRecords, outputFile, xslt, ext. Can you give example terms? Are the other parameters optional and what do they do? | resultsXml | unclassified
PubMedEFetchWorker | Get PubMed XML by PMID. Fetches PubMed articles in XML form. Use this worker only if you already know the PubMed id. | id, rettype, retmode. e.g. 15262813; see PMID workflow example | outputText | dereferencer
NucleotideXMLWorker | Get Nucleotide XML. Fetches Nucleotide XML documents. | term, maxRecords, outputFile, xslt, ext. Can you give example terms? Are the other parameters optional or mandatory? | resultsXml | unclassified
PubMedESearchWorker | Search PubMed XML. Searches for articles in PubMed and returns their IDs in XML format. | term, db, field, retstart, retmax, mindate, maxdate, rettype. Can you give example terms? Are the other parameters optional or mandatory? | outputText | unclassified
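
Since the id, rettype and retmode parameters follow the Entrez Programming Utilities conventions, a dereferencing request can be pictured as an EFetch URL. The sketch below only builds the URL and makes no network call. The helper name efetch_url is hypothetical, and it is an assumption (not confirmed by the Taverna documentation) that the workers issue requests of this form.

```python
from urllib.parse import urlencode

# Base address of the NCBI EFetch utility.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(db, record_id, rettype, retmode):
    """Build an EFetch URL for one record; does not contact NCBI."""
    query = urlencode(
        {"db": db, "id": record_id, "rettype": rettype, "retmode": retmode}
    )
    return EFETCH_URL + "?" + query


# The accession U49845 is the example used for NucleotideFastaWorker above.
print(efetch_url("nucleotide", "U49845", "fasta", "text"))
```

In the terms of the classification, the id uniquely identifies the record that such a request returns.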

Classified as "net"

ExtractImageLinks | Get image URLs from HTTP document. Extracts a list of all image links in the supplied HTML document. | document | imagelinks | unclassified
SendEmail | Send an email. Sends an email from a workflow. | to, from, subject, body, smtpserver | none | n/a
WebPageFetcher | Get web page from URL. Fetches a single web page from a URL. | url, base | contents | dereferencer
WebImageFetcher | Get image from URL. Fetches a single image from a URL. | url, base | image | dereferencer
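
The link extraction described for ExtractImageLinks can be sketched with the standard library HTML parser. This is an illustration written from the description above; the class and function names are hypothetical, and treating only the src attribute of img tags as an "image link" is an assumption.

```python
from html.parser import HTMLParser


class ImageLinkParser(HTMLParser):
    """Collect the src attribute of every img tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.links.append(value)


def extract_image_links(document):
    """Return the list of image links found in an HTML document string."""
    parser = ImageLinkParser()
    parser.feed(document)
    return parser.links


page = '<html><body><img src="a.png"><p>text</p><img alt="x" src="b.gif"></body></html>'
print(extract_image_links(page))  # ['a.png', 'b.gif']
```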

Classified as "text"

ByteArrayToString (org.embl.ebi.escience.scuflworkers.java.ByteArrayToString) | Byte[] to String. No description available yet. There isn't a String to Byte[] but there probably should be. | bytes 'application/octet-stream' | string 'text/plain' | syntax translator
StringSetUnion | String list union. Provides the union of two lists of strings, the result being a string list containing all strings that occur in either of the input lists. | list1, list2 | union | unclassified
StringConcat | Concatenate two strings. Returns the result of appending firststring to secondstring. | string1, string2 | output | unclassified
StringSetDifference | String list difference. Returns the items that are different between two sets or lists of string types, where elements only exist in the output if they occur in either input, but not both. | list1, list2 | difference | unclassified
FilterStringList | Filter list of strings by regex. Filters a list of Strings, only passing through those that match the supplied regular expression. | stringlist, regex | filteredlist | unclassified
SplitByRegex | Split string into string list by regular expression. Splits an input string into a list of strings using the given regular expression to determine the delimiter. If the regular expression is not supplied then it will default to the ',' character. | string, regex | split | unclassified
PadNumber | Pad numeral with leading 0's. Pads a numeral with leading zeroes to take it up to a specified length, which defaults to seven. | input, targetlength | padded | unclassified
RegularExpressionStringList | Filter list of strings extracting match to a regex. Applies a regular expression to a string, returning a group that matches if there is a match. | stringlist, regex, group | filteredlist | unclassified
StringSetIntersection | String list intersection. Returns the intersection of two sets or lists of string types, where elements only exist in the output if they occur in both inputs. | list, list2 | intersection | unclassified
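
Several of the text shims have simple, well-defined behaviour that can be sketched directly from their descriptions. The function names below are hypothetical, and preserving the order of first occurrence in the set operations is an assumption; the descriptions do not say how the Taverna workers order their output.

```python
import re


def string_set_union(list1, list2):
    """All strings that occur in either list, without repeats."""
    return list(dict.fromkeys(list1 + list2))


def string_set_intersection(list1, list2):
    """Strings that occur in both lists."""
    second = set(list2)
    return [s for s in dict.fromkeys(list1) if s in second]


def string_set_difference(list1, list2):
    """Strings that occur in either list, but not both."""
    first, second = set(list1), set(list2)
    return [s for s in dict.fromkeys(list1 + list2) if (s in first) != (s in second)]


def pad_number(value, target_length=7):
    """Pad a numeral with leading zeroes up to the target length (default seven)."""
    return str(value).zfill(target_length)


def split_by_regex(text, regex=","):
    """Split a string into a list, using the regex as delimiter (default ',')."""
    return re.split(regex, text)


print(pad_number("42"))          # 0000042
print(split_by_regex("a,b,c"))   # ['a', 'b', 'c']
```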

Classified as "biojava"

BlastParserWorker | Read BLAST results. Parses BLAST results and returns an XML document containing the results. | fileUrl, strict | blastresults | unclassified
TranscribeWorker | Transcribe DNA. Takes a DNA sequence and transcribes it into an RNA sequence. | dna_seq | rna_seq | unclassified
EMBLParserWorker | Read EMBL file. Parses an EMBL-based file and outputs the results in Agave XML format. | fileUrl | emblFile | unclassified
TranslateWorker | Translate DNA. Translates a DNA sequence into a protein sequence. | dna_seq | prot_seq | unclassified
ReverseCompWorker | Reverse Complement DNA. Takes a raw DNA sequence and returns the reverse complement of the sequence. | rawSeq | revSeq | unclassified
GenBankParserWorker | Read GenBank file. Parses GenBank files and outputs the results in Agave XML format. | fileUrl | genbankdata | unclassified
SwissProtParserWorker | Read SwissProt file. Parses a SwissProt file and outputs the results in Agave XML format. | fileUrl | results | unclassified
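
Two of the sequence shims can be sketched without BioJava. This is an illustration only, not the BioJava code: the function names are hypothetical, ambiguity codes are ignored, and the transcription sketch assumes the input is the coding strand, so that transcription amounts to replacing T with U.

```python
# Translation table mapping each unambiguous DNA base to its complement.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(raw_seq):
    """Complement every base, then reverse the sequence."""
    return raw_seq.translate(_COMPLEMENT)[::-1]


def transcribe(dna_seq):
    """Assumption: input is the coding strand, so replace T with U."""
    return dna_seq.replace("T", "U").replace("t", "u")


print(reverse_complement("ATGC"))  # GCAT
print(transcribe("ATGC"))          # AUGC
```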

Classified as "jdbc"

SQLQueryWorker | Execute SQL Query. Executes SQL prepared statements and returns the results as an array of arrays. It can also optionally generate an XML representation of the results. | url, driver, userid, password, sql, params, provideXml | resultsList, xmlresults | unclassified
SQLUpdateWorker | Execute SQL Update. Executes SQL update/insert statements. | url, driver, userid, password, sql, params | resultsList | unclassified
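
The "prepared statement in, array of arrays out" pattern can be illustrated with an in-memory database. Note the substitution: this sketch uses Python's sqlite3 module rather than JDBC, so there are no url, driver, userid or password inputs. The function name run_query and the genes table are hypothetical; the values in the table are reused from the examples in this document purely for illustration.

```python
import sqlite3


def run_query(connection, sql, params=()):
    """Execute a parameterised query and return the rows as a list of lists."""
    cursor = connection.execute(sql, params)
    return [list(row) for row in cursor.fetchall()]


# Hypothetical table, created in memory so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genes (gi INTEGER, accession TEXT)")
conn.execute("INSERT INTO genes VALUES (?, ?)", (1293613, "U49845"))

rows = run_query(conn, "SELECT accession FROM genes WHERE gi = ?", (1293613,))
print(rows)  # [['U49845']]
```

Passing values through params, instead of pasting them into the SQL string, is what distinguishes a prepared statement from a plain query.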

Classified as "base64"

EncodeBase64 | Encode byte[] to base64. Encodes byte[] data into a base64 string. | bytes | base64 | unclassified
DecodeBase64 | Decode base64 to byte[]. Decodes a base64 string into byte[]. | base64 | bytes | unclassified
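
The two base64 shims are inverses of each other, which a short sketch makes explicit. The function names are hypothetical; the encoding itself is the standard base64 alphabet provided by the Python standard library.

```python
import base64


def encode_base64(data: bytes) -> str:
    """Encode raw bytes into a base64 string."""
    return base64.b64encode(data).decode("ascii")


def decode_base64(text: str) -> bytes:
    """Decode a base64 string back into raw bytes."""
    return base64.b64decode(text)


encoded = encode_base64(b"shim")
print(encoded)
print(decode_base64(encoded))
```

Because decoding recovers exactly the bytes that were encoded, these two shims are candidates for the "different representations of the same thing" class.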

Classified as "moby"

CreateMobyData | Create moby data. Constructs a biomoby data packet from either an ID or a string content. | namespace, id, value, type | mobydata | unclassified
ExtractMobyData | Parse moby data. Extracts simple data types from biomoby data packets. | mobydata | namespace, id, value, type | unclassified
CreateMobyCollection | Create a moby collection. Constructs a biomoby data packet from either an ID or a string content. | collectionName, mobySimple1, mobysimple2, ..., mobysimple35 | mobyCollection | unclassified

BioMoby services

mobycentral
Arbitrary biomoby service | no description | namespace 'text/plain', id 'text/plain', article name 'text/plain' | mobyData 'text/xml' | unclassified, depends on service
