Several essential text mining algorithms were implemented as Web services. Two of them filter text: StopWordsRemover and CharacterFilter. Two deal with linguistic morphology: a lemmatizer named LemmaGen and a stemmer named PorterStemmer. One, GenerateBows, is a text format converter. An auxiliary Web service named getValues returns the list of possible values of a Web service parameter; it is used to build user interfaces for Web services whose parameters accept one of several predefined values.

StopWordsRemover

operation: StopWordsRemover

WSDL: http://zulu.ijs.si:8086/SW_service?wsdl
Description: This operation takes plain text and a dictionary of stop words as input and removes the stop words from the text.
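
The idea behind stop-word removal can be sketched locally in Python. This is only an illustration, not the service's implementation: the function name remove_stop_words, the whitespace tokenization, and the case-insensitive comparison are all assumptions.

```python
import string


def remove_stop_words(text, stop_words):
    """Drop every whitespace-separated token that is a stop word.

    Assumption: matching ignores case and surrounding punctuation.
    The real service may tokenize and compare differently.
    """
    stop = {word.lower() for word in stop_words}
    kept = []
    for token in text.split():
        # Compare the token without leading/trailing punctuation.
        core = token.strip(string.punctuation).lower()
        if core not in stop:
            kept.append(token)
    return " ".join(kept)


print(remove_stop_words("The cat sat on the mat.", ["the", "on"]))
# cat sat mat.
```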

LemmaGen

operation: LemmaGen

WSDL: http://zulu.ijs.si:8086/LM_service?wsdl
Description: This operation lemmatizes the input text according to the language parameter. Currently, 12 languages are supported: en, sl, ge, bg, cs, et, fr, hu, ro, sr, it, sp. The output is the lemmatized text for the chosen language: the words appear in the same order as in the original text, but each is transformed to its dictionary form.

PorterStemmer

operation: PorterStemmer

WSDL: http://zulu.ijs.si:8086/PS_service?wsdl
Description: This operation performs text stemming. Stemming removes the inflected endings of words. It is often used as a preprocessing step for text mining, since stemmed words can be easily matched and counted. The input to this operation is the text to be stemmed; the output is the stemmed text.

GenerateBows

operation: GenerateBows

WSDL: http://bison.ijs.si/WebServices/TextNet.svc?wsdl
Description: BOW construction is a corpus processing task that transforms a corpus of documents into the Bag-Of-Words format. In this format, each document is represented as an unordered collection of words, disregarding grammar and word order. Several preprocessing options and parameters can be set for this service:

  • Stemmer: Lemmatizer_Bulgarian, Lemmatizer_Czech,
    Lemmatizer_English, Lemmatizer_Estonian, Lemmatizer_French,
    Lemmatizer_German, Lemmatizer_Hungarian, Lemmatizer_Italian,
    Lemmatizer_Romanian, Lemmatizer_Serbian, Lemmatizer_Slovene,
    Lemmatizer_Spanish, PorterStemmer, None
  • StopWordSets: English, EnglishGoogle, English523, English425,
    English319, English8, EnglishInet, French, German, Spanish,
    Slovene, Empty
  • Tokenizer: UnicodeTokenizer, VocabularyTokenizer
  • WordWeightType: TermFreq, TfIdf, LogDfTfId
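
To make the Bag-Of-Words format and the word weighting options more concrete, here is a minimal local sketch in Python. It is not the GenerateBows implementation: the helper names, the regular-expression tokenizer, and the tf * log(N / df) formula (a common textbook variant of TF-IDF) are assumptions, and the weights computed by the service may differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase word tokens; the service offers its own tokenizers.
    return re.findall(r"\w+", text.lower())


def term_freq(docs):
    """One bag of words per document: word -> raw count."""
    return [Counter(tokenize(doc)) for doc in docs]


def tf_idf(docs):
    """Weight each count by log(N / df), where df is the number of
    documents containing the word (textbook variant, assumed here)."""
    bows = term_freq(docs)
    n_docs = len(bows)
    doc_freq = Counter()
    for bow in bows:
        doc_freq.update(bow.keys())
    return [
        {word: count * math.log(n_docs / doc_freq[word])
         for word, count in bow.items()}
        for bow in bows
    ]


corpus = ["a cat and a dog", "a dog"]
print(term_freq(corpus)[0])
print(tf_idf(corpus)[0])
```

With this variant, a word that occurs in every document (such as "a" in the toy corpus) receives weight 0, because it does not help to tell the documents apart.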

getValues

operation: getValues

WSDL: http://ropot.ijs.si/webservices/janez/getvalues.php?wsdl
Description: This operation parses the WSDL description of a Web service and returns a list of possible parameter values for the given parameter name.
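
The following Python sketch shows one way such a lookup could work. It assumes that the allowed values are declared as xsd:enumeration facets inside the schema element named after the parameter; how the getValues service actually locates the values is not documented here. The function name get_values and the SAMPLE_WSDL fragment are hypothetical.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"


def get_values(wsdl_text, parameter_name):
    """Return the enumeration values declared for the named element.

    Assumption: values are listed as <xsd:enumeration value="..."/>
    somewhere below <xsd:element name="parameter_name">.
    """
    root = ET.fromstring(wsdl_text)
    for element in root.iter(XSD + "element"):
        if element.get("name") == parameter_name:
            return [e.get("value") for e in element.iter(XSD + "enumeration")]
    return []


# Hypothetical WSDL fragment, used for illustration only.
SAMPLE_WSDL = """<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
             xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <types>
    <xsd:schema>
      <xsd:element name="language">
        <xsd:simpleType>
          <xsd:restriction base="xsd:string">
            <xsd:enumeration value="en"/>
            <xsd:enumeration value="sl"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:element>
    </xsd:schema>
  </types>
</definitions>"""

print(get_values(SAMPLE_WSDL, "language"))
```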

Text mining tasks available in Taverna

In addition to the services listed above, e-LICO provides further text mining Web services. The majority of e-LICO services are listed on BioCatalogue.

Here is a short summary of the Web service operations available. For more information, please follow the BioCatalogue link.

Text cleaner (BioCatalogue:2173)

operation: cleanText

This operation will remove all XML-invalid characters from the text supplied. Valid XML characters are specified at http://www.w3.org/TR/REC-xml/#charsets.

operation: cleanTextASCII

This operation will remove all XML-invalid and non-ASCII characters from the text supplied. This operation can be used to clean text so that it is suitable as input for the NaCTeM service TerMine (http://www.biocatalogue.org/services/32-termine_35834), which only accepts ASCII text. Valid XML characters are specified at http://www.w3.org/TR/REC-xml/#charsets; every other character is XML-invalid. ASCII characters are defined as having a Unicode code point between U+0000 and U+007F.
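
The two filters can be approximated locally with regular expressions. The sketch below is not the service code; it follows the Char production of the XML 1.0 specification (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF) and the ASCII range given above. The function names clean_text and clean_text_ascii are chosen here to mirror the operation names.

```python
import re

# Anything outside the XML 1.0 "Char" production is XML-invalid.
_XML_INVALID = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)
# Anything above U+007F is non-ASCII.
_NON_ASCII = re.compile(r"[^\x00-\x7F]")


def clean_text(text):
    """Remove characters that may not appear in an XML 1.0 document."""
    return _XML_INVALID.sub("", text)


def clean_text_ascii(text):
    """Remove XML-invalid characters, then all non-ASCII characters."""
    return _NON_ASCII.sub("", clean_text(text))


print(repr(clean_text("a\x00b\x0bc\td")))
print(repr(clean_text_ascii("caf\u00e9\x01 ok")))
```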

PDF to text (BioCatalogue:2172)

operation: pdfToText

This operation accepts a byte array representation of a PDF file and returns a byte array representation of the extracted text.

operation: pdfToTextBase64

This operation accepts a Base64-encoded string representation of a PDF file and returns the extracted text as a Base64-encoded string.
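
A client of pdfToTextBase64 has to Base64-encode the PDF bytes before the call and decode the reply afterwards. The Python sketch below shows only these two conversion steps, not the Web service call itself. The helper names are hypothetical, and the assumption that the extracted text is UTF-8 is not confirmed by the service description.

```python
import base64


def encode_pdf(pdf_bytes):
    """Turn raw PDF bytes into the Base64 string sent to the operation."""
    return base64.b64encode(pdf_bytes).decode("ascii")


def decode_text(b64_text, encoding="utf-8"):
    """Turn the Base64 reply back into text.

    Assumption: the extracted text is encoded as UTF-8.
    """
    return base64.b64decode(b64_text).decode(encoding)


# Example with the first bytes of a PDF header instead of a real file.
request_payload = encode_pdf(b"%PDF-1.4")
print(request_payload)
```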

Article section text classifier (BioCatalogue:2171)

operation: classifyText

This operation will classify a piece of text as being most likely to come from one of the four common scientific article sections (Introduction, Methods, Results, Discussion). This is a document-type web service, and this operation accepts a single string as input (the text to be classified). If you want to use this operation in Taverna, then you should use an XML input and output splitter.

operation: classifyTextDetailed

This operation will classify a piece of text as being most likely to come from one of the four common scientific article sections (Introduction, Methods, Results, Discussion). This is a document-type web service, and this operation accepts a single string as input (the text to be classified). If you want to use this operation in Taverna, then you should use an input XML splitter and a chain of two output XML splitters.

Sentence splitter service (BioCatalogue:2161)

operation: splitIntoSentences

This is the only operation of the service. It accepts a single string and returns an array of strings. Both the input and the output are wrapped in an XML document. To get access to the input and output data in Taverna, please add an "XML Input Splitter" and an "XML Output Splitter" after adding the operation to your workflow.
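
The input and output shape of this operation (one string in, a list of strings out) can be illustrated with a deliberately naive Python sketch. It splits after ".", "!" or "?" followed by whitespace, which is an assumption made here for illustration; the service very likely uses a more robust method, and this rule wrongly splits after abbreviations such as "Dr.".

```python
import re


def split_into_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


print(split_into_sentences("It works. Does it? Yes!"))
# ['It works.', 'Does it?', 'Yes!']
```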

Finding things service (BioCatalogue:3334)

operation: findCellTypesInText

This operation accepts plain text and returns a list of cell types found in the text. Character offsets into the original submitted text string are provided for each cell type found.

operation: findTissueTypesInText

This operation searches the provided text string for mentions of tissue types. The tissue types are obtained from the Mouse adult gross anatomy ontology (http://purl.org/obo/owl/MA).

operation: findSynonymsInText

This operation accepts two inputs: a list of ids, each with a set of associated literal strings to be found, and a text string to be searched for these literal strings.

operation: findSynonymsInTexts

This operation accepts two inputs: a list of ids, each with a set of associated literal strings, and a list of text strings to be searched for all of these literal strings.
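
The two synonym operations amount to dictionary matching with character offsets. The Python sketch below shows the idea under stated assumptions: matching is case-insensitive and restricted to whole words, each hit is reported as an (id, matched text, start, end) tuple, and the id "T1" is made up. None of these details are taken from the service itself.

```python
import re


def find_synonyms_in_text(synonyms_by_id, text):
    """Return (id, matched text, start, end) for every literal found.

    Assumptions: case-insensitive, whole-word matching; end is exclusive.
    """
    hits = []
    for term_id, literals in synonyms_by_id.items():
        for literal in literals:
            pattern = re.compile(r"\b" + re.escape(literal) + r"\b",
                                 re.IGNORECASE)
            for match in pattern.finditer(text):
                hits.append((term_id, match.group(0),
                             match.start(), match.end()))
    # Report hits in the order they occur in the text.
    return sorted(hits, key=lambda hit: hit[2])


def find_synonyms_in_texts(synonyms_by_id, texts):
    """Apply the single-text search to each text in turn."""
    return [find_synonyms_in_text(synonyms_by_id, text) for text in texts]


synonyms = {"T1": ["neuron", "nerve cell"]}  # hypothetical id
print(find_synonyms_in_text(synonyms, "A nerve cell is a neuron."))
# [('T1', 'nerve cell', 2, 12), ('T1', 'neuron', 18, 24)]
```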

Finding relationships service (BioCatalogue:3335)

operation: findMetaboliteLocalisationRelationships

This operation accepts a list of chemical, cell type and tissue type annotations. It then returns a list of relationships between these entities.

operation: findProteinInteractionRelationships

This operation accepts a list of protein entity annotations. It then returns a list of relationships between these entities.

operation: findProteinLocalisationRelationships

This operation accepts a list of protein, cell type and tissue type annotations. It then returns a list of relationships between these entities.

operation: findTermRelationships

This operation accepts a list of term annotations. It then returns a list of relationships between these entities.

All these Web services are available as Taverna workflows through the MyExperiment portal, where example workflows are given: