RapidAnalytics


RapidAnalytics is the core data mining workflow execution engine of e-LICO. Storing data, processes, and meta data, it also serves as e-LICO's Data Mining Experiments Repository (DMER).

Figure: The RapidAnalytics architecture and its role within e-LICO.

Both Taverna and RapidMiner can connect to RapidAnalytics, using it as a repository for processes, data, meta data, and results. Furthermore, processes can be executed inside the server. Whereas RapidMiner normally runs processes on the user's desktop PC, processes can now be sent to the server and executed there. Once the process completes, results can be downloaded and viewed in RapidMiner or browsed in RapidAnalytics' Web interface.

Within e-LICO, RapidAnalytics serves as the DMER in that it stores data, processes, meta data, and logs. Thus, the execution of processes is completely recorded in RapidAnalytics and can be reproduced at any time.

All RapidMiner extensions that provide operators automatically become available as Web services in RapidAnalytics.

The following video gives a short introduction to using RapidAnalytics.

Video: First introduction to using RapidAnalytics.

A larger collection of video tutorials is available on YouTube.

Information about usage and installation can be found in the Installation and User Manual (PDF). Miscellaneous information is also available on Rapid-I's blog.

Workflows

Among others, a number of workflows using RapidAnalytics are available on myExperiment.org. Note that some of these workflows are RapidMiner processes, some are meant to be used as services in RapidAnalytics, and some are Taverna workflows that use RapidAnalytics through the RapidMiner service type available in Taverna.

Search more on myExperiment.org.

Since operators are passed their input objects by reference, data must be uploaded to the server as a first step. Uploading data is a non-trivial task since so many different data formats exist. In general, it is not sufficient to upload a data file to a repository for the purpose of data mining. E. g., many commonly used file formats, like CSV and Excel, don't specify the full set of meta data required for data mining, like data types and the roles of columns. Other information, like column names, is either missing or represented in a non-standard format. Typically, the first row in a CSV file will contain column names, but this is not necessarily the case. Because of this, user interaction is often necessary.

When uploading data to the repository, the user has several options.

  • RapidMiner acts as a client attaching to the Web service. It offers a Wizard to import common data file types, including CSV, Excel, Access, and SQL databases. The Wizard guides the user through the import process and finally stores the data table in the repository.
  • Data files can be directly uploaded using the Web interface.
  • The most flexible approach is to use one of the various data reader operators, or even an entire ETL process, provided by RapidMiner to parse data files and store the process' output inside the repository. The import process can then either run on the client machine or can be executed directly inside the server (provided the server can access the data file or URL).
  • Programmatically, using a command line tool or a workflow execution engine, the user can make an HTTP request to the URI corresponding to the document.

The first three options are all clients of the last one, which is described below.

Browsing the Repository through the RepositoryService

In order to administer the repository, the user can use the RepositoryService Web service. Among others, the service exposes the following four methods. All of them return a Response object containing a status code and an errorMessage property indicating a potential error. A status code of zero indicates no error.

In order to query the contents of the repository, the user can call

EntryResponse getEntry(String entryLocation)

and

FolderContentsResponse getFolderContents(String folderLocation)

The first method returns a structured object, an EntryResponse, containing various information about the referenced object, like its type, owner, and modification date. The FolderContentsResponse returned by the second method contains a list of such EntryResponses.

Using the method

Response deleteEntry(String entryLocation)

the user can delete an entry, and new folders can be created by calling

EntryResponse makeFolder(String parentLocation, String subfolderName)
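
As an illustration, a client generated from the service's WSDL (e.g., with JAX-WS) might be used roughly as follows. This is a sketch only: the stub class RapidAnalyticsService, the port accessor, and the accessors getStatus(), getEntries(), and getLocation() are assumptions based on the description above, not the actual generated names.

// Sketch only: stub and accessor names depend on the actual WSDL import.
RepositoryService service = new RapidAnalyticsService().getRepositoryServicePort();

FolderContentsResponse contents = service.getFolderContents("/home/user");
if (contents.getStatus() == 0) {                    // zero indicates no error
    for (EntryResponse entry : contents.getEntries()) {
        System.out.println(entry.getLocation());    // print each entry's location
    }
}

service.makeFolder("/home/user", "experiments");    // create a new subfolder
service.deleteEntry("/home/user/old-data");         // remove an obsolete entry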

 

RESTful API to Upload and Download Data

In order to upload, download, or delete objects to and from the repository, the user makes an HTTP GET,
PUT, POST or DELETE request to a URL of the form

http://rapid-i.dyndns.org:8080/RAWS/resources/location

where location references the object's location inside the server. The effects of the four HTTP methods are as follows:

  • GET: Downloads the entry in a RapidMiner specific binary format.
  • PUT and POST: Uploads a new entry to the given location. Directories will be created as necessary.
  • DELETE: Deletes an entry from the repository and frees all associated resources and references.

For both upload and download, the default format delivered and expected is a RapidMiner specific binary format
containing all the necessary meta information.
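
For example, a plain Java client might download the Iris demo set referenced later in this document as follows; this is a minimal sketch with error handling omitted.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class RepositoryDownload {
    public static void main(String[] args) throws Exception {
        // GET downloads the entry; without further parameters the
        // RapidMiner specific binary format is delivered.
        URL url = new URL("http://rapid-i.dyndns.org:8080/RAWS/resources/demo/Iris");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("GET");
        try (InputStream in = con.getInputStream()) {
            Files.copy(in, Paths.get("Iris.ioo"), StandardCopyOption.REPLACE_EXISTING);
        }
        con.disconnect();
    }
}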

Download Formats

In order to download a representation of the data in other formats, the user can attach a format=formatName
query parameter.

The set of formats available for this conversion is not fixed. It can be extended programmatically by implementing
the interface IOObjectFormatter. This interface contains three methods. The main method simply writes
the representation of the I/O-Object to a stream. The second method returns the class of objects that this
converter can handle (e. g., data tables), and the last method returns a self-description object containing the
format name, a human readable name, icon, etc. Once the interface is implemented, it can be registered with
the IOObjectFormatterRegistry to make it available to the REST API and the Web interface.
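
A hypothetical formatter might look roughly like the following sketch. The method names, the FormatDescription type, and its constructor are assumptions based on the description above, not the actual interface.

// Sketch only: method names and the FormatDescription type are assumed.
public class JsonExampleSetFormatter implements IOObjectFormatter<ExampleSet> {

    // Main method: writes the representation of the I/O-Object to a stream.
    public void writeObject(ExampleSet exampleSet, OutputStream out) throws IOException {
        Writer writer = new OutputStreamWriter(out, "UTF-8");
        writer.write("{ \"rows\": " + exampleSet.size() + " }"); // trivial placeholder output
        writer.flush();
    }

    // Returns the class of objects this converter can handle (here: data tables).
    public Class<ExampleSet> getSupportedClass() {
        return ExampleSet.class;
    }

    // Returns the self-description: format name, human readable name, icon, etc.
    public FormatDescription getDescription() {
        return new FormatDescription("json", "JSON representation", null);
    }
}

Registering the formatter, e.g. via IOObjectFormatterRegistry.register(new JsonExampleSetFormatter()), would then make format=json available to the REST API and the Web interface.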

Currently supported formats include:

  • arff: The table data in Attribute-Relation File Format (ARFF).
  • xrff: The table data in eXtensible Attribute-Relation File Format (XRFF).
  • xml: An XML representation.
  • excel: Microsoft Excel 2003 spreadsheet.
  • csv: The table data in Comma Separated Values format.
  • html: An HTML representation of the table data.
  • owl: An OWL representation of the data table in terms of the Data Mining Ontology developed in Work Package 5. Details are described in the Section "Retrieving Meta Data".
  • pmml: Predictive Model Markup Language (PMML) for exchanging prediction models.
  • mhtml: A human readable description of the meta data in HTML format.

Note that all of the above is not limited to data tables. The RapidAnalytics repository also recognizes all other I/O-Objects produced by RapidMiner as well as RapidMiner processes. These processes can, e. g., be downloaded as an OWL representation compatible with the plans generated by the e-LICO Planner. In order to do so, the user has to append a format=powl query parameter to the URL.

Furthermore, arbitrary binary data can be stored in the repository as blobs.

Upload

When uploading, the format is not indicated by a query parameter but by an HTTP Content-Type header. The
interpreted formats are:

  • application/vnd.rapidminer.rmp+xml: RapidMiner process description files in XML format.
  • application/vnd.rapidminer.ioo: RapidMiner specific binary format for data tables and other objects passed between operators.
  • application/arff: Data tables in Arff format.

Any other content type will be stored as an uninterpreted blob in the repository.
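
For example, an ARFF file could be uploaded from Java roughly as follows; this is a sketch, the target location home/user/Iris is hypothetical, and error handling is omitted.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class RepositoryUpload {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://rapid-i.dyndns.org:8080/RAWS/resources/home/user/Iris");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("PUT");
        con.setDoOutput(true);
        // The Content-Type header tells the server how to interpret the upload.
        con.setRequestProperty("Content-Type", "application/arff");
        try (OutputStream out = con.getOutputStream()) {
            Files.copy(Paths.get("Iris.arff"), out);
        }
        System.out.println("Server responded: " + con.getResponseCode());
        con.disconnect();
    }
}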

Paging

When table data is displayed by a front-end (e. g., by eProPlan), it is desirable to show only a fraction of the data. In order to enable this, we have introduced a paging mechanism. Data tables can be downloaded partially in all of the above formats by appending two query parameters to the URL:

  • offset: The first item to deliver, where numbering starts at 0.
  • length: The maximum number of items to deliver.
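
For example, the following URL would deliver 50 rows of the Iris demo set as CSV, starting at row 100:

http://rapid-i.dyndns.org:8080/RAWS/resources/demo/Iris?format=csv&offset=100&length=50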

Efficient Internal Data Representation

When storing data on the server, one has to take into consideration the characteristics of typical data sets used for mining. It is a well-known fact that many data sets contain hundreds of thousands of attributes. This is true, e. g., for text mining and biomedical data sets. For typical database management systems it is therefore impossible to store these tables as a single database table, since such systems impose restrictions on the number of columns that can be used. Typically, the limit is on the order of several hundred columns. One way out of this issue is to distribute the data across several tables sharing a common identifier column, using a join to access the whole table. However, this approach is of limited use since joins and views also impose limits on the number of columns. We circumvent this issue by implementing the join outside the database. When iterating over the rows of a data table, we send independent queries for each table representing a subset of the attributes, and join these rows manually. Since rows arrive in order of their ids, this manual join is simple and computationally cheap. It is also easy to process data in batches as long as we consider batches that are blocks of contiguous ids.
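
A minimal sketch of this manual join in Java, assuming two partial tables data_part1 and data_part2 sharing an id column; the table and column names as well as the helpers concatColumns and process are illustrative only:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

// Both queries return their rows ordered by the shared id column, so the
// partial rows can be zipped together without any lookup structure.
static void joinOutsideDatabase(Connection conn) throws Exception {
    try (Statement stmt1 = conn.createStatement();
         Statement stmt2 = conn.createStatement();
         ResultSet part1 = stmt1.executeQuery("SELECT * FROM data_part1 ORDER BY id");
         ResultSet part2 = stmt2.executeQuery("SELECT * FROM data_part2 ORDER BY id")) {
        while (part1.next() && part2.next()) {
            // Rows arrive in the same id order, so corresponding rows line up.
            Object[] row = concatColumns(part1, part2); // merge attribute subsets (hypothetical helper)
            process(row);                               // hand the complete row downstream (hypothetical helper)
        }
    }
}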

Another approach to handling large data sets is to perform the actual computation inside the database. Together with Ingres VectorWise, a column-based database, we implemented some data mining operations that are partially executed inside the database, using SQL statements, and partially inside RapidMiner, using plain Java. Thus, we were able to run Naive Bayes and decision tree induction algorithms on data sets that would not have fit into memory, with near in-memory performance.

Whereas this approach is useful for executing the actual learning task on data sets that exceed main memory, the Predictive Model Markup Language (PMML) provides a standard for running the actual scoring inside the database. A PMML model is an XML representation of a data mining model that can be stored inside a database. We have implemented a RapidMiner extension that allows for the generation of such representations for various model types, including decision trees, rule sets, linear regression, Naive Bayes models, and support vector machines. This boosts performance for scoring large amounts of data when a PMML-capable database is available.

The MetaDataService: Retrieving Meta Data and Annotations

The MetaDataService provides methods for retrieving meta data and managing annotations.

Retrieving Meta Data

The Data Mining Ontology (DMO) defined in WP5.1 (see [10]) defines a class IOObject and a subclass DataTable. IOObjects are objects that can be consumed and produced by operators. The most important of these are DataTables, which can be described in terms of their attributes, columns, rows, and various statistics. In order to decide which operator can be meaningfully applied to an object, we describe these objects as individuals of classes defined in the DMO.

The ontological description of each object stored in the repository can be accessed by appending a format=owl
query parameter to the URI associated with the object. E. g., to retrieve the OWL description of a well-known
example data set, one can retrieve the document referenced by http://rapid-i.dyndns.org:8080/RAWS/resources/demo/Iris?format=owl.

The returned file is an XML file that can be opened, e. g., in Protégé.

The ExecutorService: Executing Operators

The ExecutorService provides methods for service discovery and operator execution. These are described
in the following two sections. Since the ExecutorService is integrated with the Data Mining Experiments
Repository (DMER), details on exchanging actual data will be described in the corresponding Deliverable 5.2
and in the repository manual.

Service Discovery and Self Description

All RapidMiner operators provided by the ExecutorService are annotated with conditions and effects as specified in the data mining ontology (DMO, WP 5.1). This ontological description can be obtained from the server by accessing the following URL: http://rapid-i.dyndns.org:8080/e-LICO/RapidMiner-Operators.owl

The planner and meta mining tools developed within e-LICO can access this file.

The DMO annotates the operators at a very fine-grained level. Other tools like workflow execution engines may
only utilize a subset of this information, and it may be inadequate to parse the entire DMO just to extract this
information. Hence, we provide a simpler interface that can be used, e. g., by workflow execution engines to
present an appropriate user interface. This approach is currently being evaluated by the Taverna development
team.

The ExecutorService exposes two self-description methods.

public String[] getRegisteredOperatorNames()

This method returns a list of all operator names currently registered with the service. Each of these values can
be passed to the following method to obtain a list of parameter names:

public String[] getParameterNames(String operatorName)

The parameter names returned by this method can be passed to any of the operator invocation methods
described in the following section.
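
For illustration, a client holding a stub for the ExecutorService (here called executorService; the stub itself is an assumption) could enumerate the available operators and their parameters as follows:

// List all registered operators and, for each, its parameter names.
for (String operatorName : executorService.getRegisteredOperatorNames()) {
    System.out.println(operatorName);
    for (String parameterName : executorService.getParameterNames(operatorName)) {
        System.out.println("  " + parameterName);
    }
}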

Execution of Basic Operators

RapidMiner provides several hundred operators, including all Weka operators [8]. An operator execution is
defined by a name and a set of parameter values. It is therefore a natural approach to execute all of these
operators by a single generic execution service rather than exposing a Web method for each. This is particularly
true since this approach is easily extensible with new operators and new parameters. Most importantly, it is
desirable that the Web interface (WSDL) does not change and thus does not need to be reimported whenever
new operators are registered with the service.

In order to execute an operator, the user needs to specify the following:

  1. An operator name, as returned by getRegisteredOperatorNames() or as specified in the DMO.
  2. A list of parameters, specified as an array of OperatorParameters. An OperatorParameter is a
    structured object containing a key and a value where the key corresponds to one of the parameter
    names specified in the DMO or returned by getParameterNames(String) for this operator.
  3. A (set of) input objects. Since we are not passing real data but only references, input objects are specified
    as references to objects stored on the server.

This information is sufficient to execute an operator using the following Web method:

public String[] executeBasicOperatorImplicitOutput(
    String operatorName,
    OperatorParameter[] parameters,
    String[] inputLocations)

This method will look up and instantiate the operator, pass it the specified parameters, and invoke it. The output objects will then be stored on the server under generated (temporary) names. The names generated for this invocation are returned as a result and can be used as input to a subsequent operator invocation or downloaded as results.

In cases where the client wants to have more control over where the output is stored, the following method can
be used:

public void executeBasicOperatorExplicitOutput(
    String operatorName,
    OperatorParameter[] parameters,
    String[] inputLocations,
    String[] outputLocations)

This method explicitly specifies the locations at which the output objects should be stored. E. g., when transforming a data table several times in a row, it may be desirable to save it back to the place it was loaded from in order to save resources.
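
As an illustration, a single invocation might look as follows; the operator name "sample", the parameter key "sample_ratio", the repository locations, and the two-argument OperatorParameter constructor are assumptions:

// Apply a single operator to an object stored on the server.
OperatorParameter[] parameters = {
    new OperatorParameter("sample_ratio", "0.1")  // key/value pair as specified in the DMO
};
String[] outputLocations = executorService.executeBasicOperatorImplicitOutput(
    "sample",                                     // name from getRegisteredOperatorNames()
    parameters,
    new String[] { "/home/user/Iris" });          // reference to an input object on the server
// The returned locations can be passed to the next invocation or
// downloaded through the RESTful API described above.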

The Web methods described above can be conveniently called through Taverna's RapidMiner activity.

Below is a demo report based on the KupKB data, made with RapidAnalytics. It may not be available at all times.

Installation

RapidAnalytics comes with a graphical installer. To run this installer, you must first install a Java Runtime Environment, which you can download from java.com. Furthermore, you need to install an SQL database that has a JDBC driver. In this database system, create a database and a user that RapidAnalytics can use.

Download and unpack the zip file and type

java -jar RapidAnalytics-Installer.jar

The installer will ask for a hostname and port to run on, memory assignment, a JDBC driver and the parameters of the database you just created. Optionally, you can specify a connection to a mail server or register RapidAnalytics as a Windows service.

To get the features that are used by the e-LICO components (required for the Taverna RapidMiner Activity, the Taverna IDA, and eProPlan, in particular the ExecutorService, MetaDataService, and IDAService), you need to install a special build that can be downloaded below. To install it, just replace the file server/default/deploy/RapidAnalytics.ear in your RapidAnalytics installation directory with the file RapidAnalytics-eLICO.ear below.

Once this is done, you can start RapidAnalytics from your installation directory by typing

run.bat -b 0.0.0.0

or

run.sh -b 0.0.0.0

Once the server is started, you can open a Web browser and go to http://localhost:8080 or whatever hostname and port you installed it on. You can connect RapidMiner to RapidAnalytics by going to the Repository Browser and selecting "New Repository", "Remote Repository" and entering this URL.

Downloads

RapidAnalytics-e-LICO.ear (35.5 MB)

RapidAnalytics-Manual-1.1.014.pdf (1.7 MB)