Taverna Workflows

All e-LICO text mining Taverna workflows are available at myExperiment in the e-LICO group.

Our workflows can be used to perform these tasks:

  • Text cleaning
  • PDF to text conversion
  • Term recognition
  • Biological entity recognition
  • Protein-protein interaction detection
  • Chemical localisation detection
  • Term relationship detection
  • Gene expression detection
  • Text stemming with Porter stemmer
  • Lemmatization
  • BOW (bag of words) construction

Have a look at the Workflows tab to see more information.

The more complex workflows are built from a number of component workflows in a modular fashion (see image). This allows you to build your own text mining workflow using modular components of your own choosing.

Here is a demo of how to use the e-LICO Taverna workflows for text mining. In this video a directory of PDFs is converted to plain text and the keywords in the text are returned as output.

Here is a summary of the e-LICO text mining workflows. More information can be found by following the myExperiment link.

Identify relationships between terms from plain text documents (myExperiment workflow:1961)

This workflow accepts a single directory path as input. The path should point to a directory containing only text files. The text files are processed by identifying candidate terms and then possible relationships between them.

Protein-protein interaction identification in plain text files (myExperiment workflow:1960)

The input is a path to a directory containing only plain text files. Proteins are then identified in the text using Whatizit and the interactions between them are identified using software from the e-LICO project.

Protein localisation from plain text (myExperiment workflow:1959)

Protein localisation workflow. This workflow finds protein localisation relationships from tissue types, cell types and proteins.

Chemical localisation from plain text (myExperiment workflow:1958)

Chemical localisation workflow.  Detects relationships between cell types, tissue types and chemicals.

Terms from collection of PDF files (myExperiment workflow:1061)

This workflow will give you a set of candidate terms for each PDF document in a user-specified directory. You can also specify a c-value threshold that will restrict the terms to those with higher scores. This workflow was created using only nested workflows. These workflow components work on their own and can be linked together to form more complex workflows.

Terms from collection of text files (myExperiment workflow:1065)

This workflow will give you a set of candidate terms for each text file in a user-specified directory. You can also specify a c-value threshold that will restrict the terms to those with higher scores. This workflow was created using only nested workflows. These workflow components work on their own and can be linked together to form more complex workflows.

One sentence per line (myExperiment workflow:2106)

This workflow accepts a plain text input and provides a single text document per input containing one sentence per line. Newline characters are removed from the original input. The text is split using the OpenNLP sentence splitter, which is provided by University of Manchester Web Services.
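
The reshaping this workflow performs can be sketched in Python. This is only an illustration under stated assumptions: the workflow itself uses the OpenNLP sentence splitter, whereas the sketch substitutes a naive regular expression that treats '.', '!' or '?' followed by whitespace as a sentence boundary. The function name one_sentence_per_line is hypothetical.

```python
import re

def one_sentence_per_line(text: str) -> str:
    """Return the text with one sentence per line (naive splitter, not OpenNLP)."""
    # Remove the original newlines by collapsing every whitespace run to one space.
    flat = " ".join(text.split())
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", flat)
    return "\n".join(s for s in sentences if s)

if __name__ == "__main__":
    print(one_sentence_per_line("Proteins bind.\nThey also\nfold. Do they move?"))
```

A rule this simple will split wrongly after abbreviations such as "e.g.", which is one reason a trained sentence detector is preferable for real documents.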

Clean plain text (myExperiment workflow:1055)

This workflow will remove any XML-invalid characters (these characters often appear in the output of PDF to text software) from any text supplied to the input port. This is a workflow component, designed to be used as a nested workflow inside a larger text mining or text processing workflow.
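
As a rough idea of what "XML-invalid" means here, the following Python sketch removes every character outside the Char production of the XML 1.0 specification. It is an assumption that the workflow applies exactly this rule; the function name clean_xml_invalid is hypothetical.

```python
import re

# Matches any character NOT allowed by the XML 1.0 "Char" production:
# tab, line feed, carriage return, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
_XML_INVALID = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def clean_xml_invalid(text: str) -> str:
    """Delete characters that cannot appear in an XML 1.0 document."""
    return _XML_INVALID.sub("", text)

if __name__ == "__main__":
    # The form feed (\x0c) and NUL (\x00) are removed; tab and newline are kept.
    print(repr(clean_xml_invalid("abc\x0cdef\x00\tok\n")))
```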

Load plain text from directory (myExperiment workflow:1056)

This workflow will automate the reading of a set of text files stored in a single directory (the path to which should be supplied as a single input value).  It will assume that the text files are saved using the default character encoding for the system that Taverna is running on.  This is a workflow component, designed to be used as a nested workflow inside a larger text mining or text processing workflow.
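
The behaviour described above, including the reliance on the system's default character encoding, might look like this in Python. It is a sketch of the idea, not the workflow's implementation; the function name load_text_files is hypothetical.

```python
import locale
from pathlib import Path

def load_text_files(directory: str) -> dict:
    """Read every file in `directory`, keyed by file name."""
    # Assumption mirrored from the description: files use the system default encoding.
    encoding = locale.getpreferredencoding(False)
    texts = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            texts[path.name] = path.read_text(encoding=encoding)
    return texts
```

Because the encoding is taken from the system, files written on a machine with a different default encoding may be decoded incorrectly.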

Clean plain text (ASCII) (myExperiment workflow:1054)

This workflow will remove any XML-invalid and non-ASCII characters (e.g. for sending to the ASCII-only Termine service) from any text supplied to the input port. This is a workflow component, designed to be used as a nested workflow inside a larger text mining or text processing workflow.
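
A minimal Python sketch of ASCII-only cleaning follows. It assumes that "XML-invalid and non-ASCII" leaves exactly tab, line feed, carriage return and the code points U+0020 to U+007F; the function name clean_to_ascii is hypothetical.

```python
def clean_to_ascii(text: str) -> str:
    """Keep only ASCII characters that are also valid in XML 1.0."""
    # Tab, LF and CR are the only control characters XML 1.0 allows.
    return "".join(
        ch for ch in text if ch in "\t\n\r" or " " <= ch <= "\x7f"
    )

if __name__ == "__main__":
    # The accented letter, the Greek letter and the NUL character are dropped.
    print(repr(clean_to_ascii("caf\u00e9 \u03b1-helix\x00\n")))
```

Note that characters are deleted rather than transliterated, so a term containing a Greek letter loses that letter.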

Load PDF from directory (myExperiment workflow:1057)

This workflow will automate the reading of a set of PDF files stored in a single directory (the path to which should be supplied as a single input value). This is a workflow component, designed to be used as a nested workflow inside a larger text mining or text processing workflow.

PDF to plain text (myExperiment workflow:1058)

This workflow will extract the plain text content of PDF files supplied to the input port. You can connect the Load PDF from directory workflow to this workflow's input. We recommend you send the output from this workflow to the Clean plain text workflow, because the PDF to text process can add characters to the text that are XML-invalid and therefore cannot be sent to most services as plain text. Another way round this problem is to encode the text as Base64 using the handy local service ("Encode Byte Array to Base 64") included with Taverna, although this requires a service that knows to decode the Base64 back to text, which is not common. The PDF to text service makes use of the "pdftotext" executable from Xpdf. This is a workflow component, designed to be used as a nested workflow inside a larger text mining or text processing workflow.
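
The Base64 route mentioned above can be illustrated in Python. This is not the Taverna local service, only a sketch of the same idea: the encoded form contains only ASCII letters, digits, '+', '/' and '=', so it is safe to transmit, but the receiving side has to decode it. The function names and the choice of UTF-8 are assumptions.

```python
import base64

def encode_text(text: str) -> str:
    """Encode text as Base64 so problem characters survive transport."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

def decode_text(encoded: str) -> str:
    """Reverse encode_text; the receiving service must perform this step."""
    return base64.b64decode(encoded).decode("utf-8")

if __name__ == "__main__":
    raw = "page one\x0cpage two"  # contains a form feed, which is XML-invalid
    wire = encode_text(raw)
    print(wire)
    print(decode_text(wire) == raw)
```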

Sentence splitting (myExperiment workflow:1059)

This workflow will attempt to split up text into sentences, returning a list of sentences to the output port. The sentence splitting service makes use of the OpenNLP sentence detector and has been trained to work on English text. This workflow can be used to provide input to the Termine with c-value threshold workflow. This is a workflow component, designed to be used as a nested workflow inside a larger text mining or text processing workflow.

Termine with c-value threshold (myExperiment workflow:1060)

This workflow accepts a list of sentences from a single document and returns the terms found by the TerMine web service. It also allows you to set a threshold c-value score so that only terms whose score meets the threshold (a higher score indicating a more likely real term) are returned as output. To get sentences to supply to this workflow you can use the Sentence splitting workflow. The TerMine service (used in this workflow) only accepts text in ASCII encoding, so you should also use the Clean plain text (ASCII) workflow before splitting sentences. This is a workflow component, designed to be used as a nested workflow inside a larger text mining or text processing workflow.
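
The thresholding step can be pictured as a simple filter over (term, score) pairs. In this Python sketch the function name filter_terms, the sample terms and their scores are all invented for illustration, and keeping scores equal to the threshold is an assumption, since the description does not say whether the comparison is inclusive.

```python
def filter_terms(scored_terms, threshold):
    """Keep only the (term, c_value) pairs whose score reaches the threshold."""
    # Assumption: a score equal to the threshold is kept.
    return [(term, score) for term, score in scored_terms if score >= threshold]

if __name__ == "__main__":
    # Made-up scores, not real TerMine output.
    candidates = [("protein kinase", 5.2), ("cell", 1.0), ("gene expression", 3.0)]
    print(filter_terms(candidates, threshold=3.0))
```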

Text stemming with Porter Stemmer (myExperiment workflow:1753)

This workflow performs text stemming. Stemming removes the inflected endings of words. It is often used as a preprocessing step for text mining, since stemmed words can be easily matched and counted. The input to the workflow is the text to be stemmed; the output is the stemmed text.

Text preprocessing (myExperiment workflow:1750)

The input to this workflow is plain text. The text is preprocessed so that non-alphanumeric symbols are removed, the text is transformed to lower case and stop words are removed.

The workflow first removes the characters from this set: `~!@#$%^&*()_+=-{}|\][":;'?><,./.

Then it transforms the text to lower case. The user will be prompted to select a dictionary for stop words from a list. The workflow will, based on the selected list, remove the stop words. Stop words are words that carry little meaning on their own, such as "the" and "an". The web service for stop word removal integrates six English stop word dictionaries and one for the Slovenian language. The output of the workflow is text in lower case without non-alphanumeric characters and without stop words.
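
The three steps can be sketched in Python as follows. The removed character set is copied from the description above; the stop word list is a tiny hypothetical stand-in for the dictionaries offered by the web service, and the function name preprocess is likewise an assumption.

```python
# Character set listed in the workflow description.
REMOVED_CHARS = "`~!@#$%^&*()_+=-{}|\\][\":;'?><,./"
# Hypothetical stand-in for a stop word dictionary from the web service.
STOP_WORDS = {"the", "an", "a", "of", "and"}

def preprocess(text, stop_words=STOP_WORDS):
    """Strip the listed symbols, lower-case the text and drop stop words."""
    stripped = text.translate(str.maketrans("", "", REMOVED_CHARS))
    lowered = stripped.lower()
    return " ".join(word for word in lowered.split() if word not in stop_words)

if __name__ == "__main__":
    print(preprocess("The BRCA1 gene, and an enzyme!"))
```

The sketch deletes the symbols outright, so a hyphenated word is joined into a single token; whether the web service behaves the same way is not stated.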

Select from a list of possible web service parameter values (myExperiment workflow:1743)

The workflow for selecting from a list of possible web service parameter values has two input ports: the WSDL address of the web service and the variable name. It parses the web service's WSDL description (using the web service at http://ropot.ijs.si/webservices/janez/getvalues.php?wsdl) and then asks the user to select one value from a drop-down menu. This workflow is useful when a web service input expects one value from a fixed list of possible values.

BOW construction (myExperiment workflow:1739)

BOW construction is a corpus processing task: it transforms a corpus of documents into Bag-Of-Words format. In this format, each document is represented as an unordered collection of words, disregarding grammar and word order. Several preprocessing options and parameters can be set for this service.

This version of BOW construction prompts the user for the following BOW
construction parameters:

      * Stemmer: Lemmatizer_Bulgarian, Lemmatizer_Czech,
        Lemmatizer_English, Lemmatizer_Estonian, Lemmatizer_French,
        Lemmatizer_German, Lemmatizer_Hungarian, Lemmatizer_Italian,
        Lemmatizer_Romanian, Lemmatizer_Serbian, Lemmatizer_Slovene,
        Lemmatizer_Spanish, PorterStemmer, None
      * StopWordSets: English, EnglishGoogle, English523, English425,
        English319, English8, EnglishInet, French, German, Spanish,
        Slovene, Empty
      * Tokenizer: UnicodeTokenizer, VocabularyTokenizer
      * WordWeightType: TermFreq, TfIdf, LogDfTfId

It is user friendly, as it offers the possible values in drop-down
menus.
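
To make the Bag-Of-Words idea concrete, here is a minimal Python sketch of term-frequency and TF-IDF weighting. It is not the e-LICO service: whitespace tokenization, the absence of stemming and stop word removal, and the idf formula log(N / df) are all assumptions, as are the function names.

```python
import math
from collections import Counter

def bag_of_words(documents):
    """Term-frequency bags: one word-count mapping per document."""
    # Assumption: lower-cased whitespace tokens, no stemming or stop words.
    return [Counter(doc.lower().split()) for doc in documents]

def tf_idf(bags):
    """Weight each count by log(N / df), one common TF-IDF variant."""
    n_docs = len(bags)
    # Number of documents in which each word occurs at least once.
    doc_freq = Counter(word for bag in bags for word in bag)
    return [
        {word: count * math.log(n_docs / doc_freq[word]) for word, count in bag.items()}
        for bag in bags
    ]

if __name__ == "__main__":
    bags = bag_of_words(["gene protein gene", "protein cell"])
    print(bags)
    print(tf_idf(bags))
```

With this weighting, a word that occurs in every document receives a weight of zero, which reflects that it does not help to tell documents apart.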

Lemmatization (myExperiment workflow:1738)

The workflow lemmatizes the text supplied to the input port: it takes text as input and returns (language-dependent) lemmatized text as output. All the words in the resulting text are in the same order as in the original text, but they are transformed to their dictionary form.

The workflow asks for the language of lemmatization. Currently, 12 languages are supported: en,sl,ge,bg,cs,et,fr,hu,ro,sr,it,sp.

From PDF to lemmatized text (myExperiment workflow:1516)

This workflow lemmatizes the text from a supplied PDF document. The workflow accepts a PDF file as input and uses e-LICO workflows to preprocess the data. The workflow interactively asks the user to select the language for lemmatization. The output is a string in Taverna Workbench.