Ontology Based Meta Miner

Ontology-based meta-mining (MM for short) is a new approach for meta-learning with two distinctive features: first, instead of focusing exclusively on the learning task, MM applies meta-learning to the full data mining (DM) process. Second, while traditional meta-learning regards learning algorithms as black boxes and essentially correlates properties of their input (data) with the performance of their output (learned model), MM makes use of the Data Mining Optimization Ontology (DMOP) to tear open the black box and analyse DM algorithms and workflows in terms of their core components, their underlying assumptions, the cost functions and optimization strategies they use, and the models and decision boundaries they generate. These two features together lay the groundwork for what we call deep or semantic meta-mining, i.e., DM process or workflow mining that is driven simultaneously by meta-data and by the collective expertise of data miners embodied in the data mining ontology and knowledge base.

The tool delivered here is a basic prototype that extracts meta-data from a set of base-level DM processes: dataset characteristics computed by the DCT tool, and frequent abstract workflow patterns as described in [1]. These meta-data are then used by the Probabilistic Ranker to rank candidate workflows submitted by the AI-planner, following the approach described in [2].

 

[1] M. Hilario, P. Nguyen, H. Do, A. Woznica, and A. Kalousis. Ontology-based meta-mining of knowledge discovery workflows. In N. Jankowski et al. (eds.), Meta-Learning in Computational Intelligence, Springer, 2011. Please note that this paper describes the status of the DMOP ontology in 2010 and does not perfectly reflect the current version.

[2] P. Nguyen, A. Kalousis, and M. Hilario. A meta-mining infrastructure to support KD workflow optimization. In Proc. of the ECML/PKDD-11 Workshop on Planning to Learn and Service-Oriented Knowledge Discovery (PlanSoKD-2011), 2011.


Running Base-level DM Experiments

The MM receives two sets as input: a set of workflows WF and a set of datasets DS, both stored in the repository folder. These two sets are crossed to run base-level DM experiments. We assume that these inputs are in RapidMiner format, i.e. ".rmp" for workflows and ".ioo" for datasets.

In order to do meta-mining, we need an experimental protocol. For now, the MM has been built for classification problems, and it requires the following settings for each experiment:

  • Validation method: 10-fold cross-validation.
  • Predictions for each separate test set should be saved in order to compare DM experiments later.
  • The final learnt model and performance measure should also be saved for future research.

These constraints have to be defined in your RapidMiner workflows. You can find some workflow examples in "repository/workflows". Note that these workflows are customized so that they can be run from bash scripts. For instance, the input of each workflow is a macro called "INPUT", which the script replaces with the correct input dataset.
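As an illustration only, the sketch below shows one way such a substitution could be scripted; the file names, the repository location, the sed-based substitution and the launcher path are assumptions made for this example, not the exact contents of the delivered scripts:

  #!/bin/bash
  # Hypothetical sketch: substitute the INPUT macro in a workflow template
  # and run the resulting process with RapidMiner in batch mode.
  WF_TEMPLATE="repository/workflows/NaiveBayes.rmp"   # assumed example workflow
  DATASET="//MMRepository/datasets/iris"              # assumed repository location of the dataset
  TMP_WF="/tmp/current_process.rmp"

  # Replace every occurrence of the INPUT macro with the dataset location.
  sed "s|%{INPUT}|${DATASET}|g" "${WF_TEMPLATE}" > "${TMP_WF}"

  # Run the instantiated process from the command line (the launcher path may differ).
  "${RAPIDMINER_HOME}/scripts/rapidminer" "${TMP_WF}"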

You can find a main script in "bin/exp/allExp.sh". Just run it from the MM root folder. This script looks at your repository and crosses all workflows ∈ WF with all datasets ∈ DS. The results are stored in "repository/experiments" in the form of several archives, one per experiment, each containing all the information saved according to our protocol. The name of an experiment archive is the concatenation of the name of the workflow and the name of the dataset. Note that, depending on the size of your two sets WF and DS, these base-level experiments can be highly time-consuming.
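Conceptually, the cross performed by "bin/exp/allExp.sh" corresponds to a nested loop of the following kind. This is a simplified sketch, not the actual script: the datasets directory, the helper script runOneExp.sh and the archive naming separator are assumptions used only for illustration:

  #!/bin/bash
  # Simplified sketch of the WF x DS cross: one experiment per (workflow, dataset) pair.
  for wf in repository/workflows/*.rmp; do
    for ds in repository/datasets/*.ioo; do
      wf_name=$(basename "${wf}" .rmp)
      ds_name=$(basename "${ds}" .ioo)
      # Hypothetical helper that runs one base-level experiment and archives the
      # predictions, final model and performance under the concatenated name.
      ./runOneExp.sh "${wf}" "${ds}" "repository/experiments/${wf_name}_${ds_name}"
    done
  done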

Extracting metadata

Once your base-level DM experiments are done, you can extract metadata from them. In summary, we define a base-level DM experiment as a { workflow, dataset } pair associated with a rank value. Thus, we have to extract three types of metadata: one for the workflow, one for the dataset and one for the experiment's rank. For extracting metadata from workflows, we use the tree mining approach described in [1]. For dataset characteristics, we use the DCT tool incorporated in RapidMiner as a plugin. For ranking DM experiments, we do a statistical pairwise comparison of DM experiments applied on the same data. More details are given below. The results of these extractions are three separate files stored in the "metadata" folder: "wfs.csv" for workflow metadata, "dct.csv" for dataset metadata and "rank.csv" for experiment ranks, as sketched below.
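The resulting layout of the metadata folder is therefore:

  metadata/
    wfs.csv     one row per workflow, one column per frequent workflow pattern
    dct.csv     one row per dataset, one column per dataset characteristic
    rank.csv    one row per dataset, one column per workflow (the cell holds the experiment's rank)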

Frequent Workflow Patterns

Before starting the mining, you first need to export the RapidMiner workflow XML files into their tree representation. To do so, execute the bash script "bin/mining/parse.sh", which will parse all your RapidMiner workflows.

Then go to the subdirectory "WF MetaMiner". The patterns are extracted by mining workflow parse trees augmented with DMOP concepts (Hilario et al., 2011). The full process is executed within Flora2. The steps are the following (a condensed session transcript is given after the list):

  1. Start the Flora2 system by typing "runflora".
  2. Load the knowledge base by typing "[kb].".
  3. Execute the first mining step by typing "step1.". This should generate two files: ParseTree.myWF.flr and TreePattern.myWF.pl.
  4. Restart the system by typing "halt.", then "runflora" again.
  5. Load the knowledge base again by typing "[kb].".
  6. Execute the second mining step by typing "step2.". This should generate the final file: TreePattern.myWF.2.pl.
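Put together, the session consists of the following commands, entered in this order (only what you type is shown; prompts and system output are omitted):

  runflora
  [kb].
  step1.
  halt.
  runflora
  [kb].
  step2.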

Once these steps are done, come back to the MM root directory, then execute the bash script "bin/mining/buildWFPMatrix.sh". This script propositionalizes the patterns and generates the final csv file, "metadata/wfs.csv". This file contains one row entry per workflow, and the columns are the patterns.
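For illustration, the propositionalized matrix has the following shape; the workflow names, pattern identifiers and cell values below are made up, the cells indicating whether (or how often) a given pattern matches a workflow:

  workflow      , pattern_1 , pattern_2 , pattern_3 , ...
  NaiveBayes_wf , 1         , 0         , 1         , ...
  SVM_FS_wf     , 1         , 1         , 0         , ...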

Dataset Characteristics

In order to extract characteristics for your datasets, you first need to install the DCT tool as a RapidMiner extension. The DCT tool is available in the e-Lico svn at http://www.e-lico.eu/svn/public/trunk/software/DCT/, with instructions on how to install it at http://www.e-lico.eu/?q=node/267. We provide a bash script "bin/exp/runDCT.sh" to build the final csv file, "metadata/dct.csv". This script launches a specific workflow, "repository/utils/Denis DCT.rmp", which is run under RapidMiner. If you want to modify which characteristics to compute, simply modify the options of the DCT operator in this workflow. The final file contains one row entry per dataset, where the columns are the characteristics.
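Its shape is analogous to the workflow pattern matrix, with one row per dataset; the characteristic names and values below are only illustrative, since the actual columns depend on the options of the DCT operator:

  dataset , NumInstances , NumFeatures , ClassEntropy , ...
  iris    , 150          , 4           , 1.58         , ...
  wine    , 178          , 13          , 1.57         , ...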

Ranking Base-level DM Experiments

In order to rank base-level DM experiments, we statistically compare, pairwise, the DM experiments applied to the same dataset. An experiment gets 1 point if it is statistically better (in terms of predictions) than the other one; otherwise the other experiment gets the point. If there is no statistical difference, both experiments get 0.5 points. The rank of an experiment is the total number of points it wins over the N(N − 1)/2 comparisons (for N experiments).
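For example, with N = 4 experiments on a given dataset there are 4 · 3/2 = 6 pairwise comparisons, and each experiment takes part in 3 of them. An experiment that is statistically better than two of its competitors and tied with the third scores 2 × 1 + 0.5 = 2.5 points, while one that loses all three of its comparisons scores 0.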

We provide a bash script "bin/stats/rankExp.sh" that looks at all experiments applied to the same dataset and ranks each of them. The result is a csv file stored in "metadata/rank.csv", where the row entries are the datasets and the columns are the workflows.
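Continuing the four-experiment example above, a row of "metadata/rank.csv" could look like this (the dataset and workflow names are purely illustrative):

  dataset , wf_A , wf_B , wf_C , wf_D
  iris    , 2.5  , 2.5  , 1    , 0

Note that, for a given dataset, the ranks always sum to N(N − 1)/2, since each pairwise comparison distributes exactly one point.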

Building meta-mined models

Having all metadata available, we can now start building simple meta-models. Our current approach is to weight workflow patterns according to the ranks of the DM experiments that match a given pattern. We provide a bash script to compute these weights: run the script "bin/R/runMM.sh". This script will output the file "metadata/Z.discr.pl", which contains a weight for each workflow pattern. This resulting file can then be seeded into the Probabilistic Ranker to rank candidate workflows proposed by the e-Lico AI-planner.

Installation

The prototype has been tested on a MacBook Pro, Mac OS X 10.6.2, with 2 GB of RAM. You can download it from the Download tab. Note that this installation is aimed at experienced users and requires advanced knowledge of bash scripting, Perl, Flora2 (XSB), Java and R. It requires the following tool versions:

  • GNU Bash, version 3.2.48.
  • Perl, version 5.8.9.
  • GNU GCC, version 4.2.1.
  • FLORA-2 with XSB, version 3.2.
  • Java, version 1.6.0.
  • CRAN R, version 2.12.2.
  • RapidMiner, version 5.1.006 (or higher) with the R extension.

To have a working MM, you need to compile Zaki's tree miner (2005) and the Java code that parses RapidMiner workflows. Execute the bash script "bin/install.sh" to compile and install them.

The last installation step is to configure your MM. In "bin/config.sh" you will find configuration settings to customize for your machine (a sample configuration is sketched after the list):

  • MM_HOME = Full path to the MetaMining infra.
  • RAPIDMINER_HOME = Full path to RapidMiner.
  • RM_repository_name = Alias name of the MM infra repository (i.e. the subdir "repository").
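For instance, a configured "bin/config.sh" could contain the following assignments; the paths and the alias are example values to adapt to your machine:

  MM_HOME="/Users/me/MetaMining_Infra"          # full path to the MetaMining infra
  RAPIDMINER_HOME="/Applications/RapidMiner5"   # full path to RapidMiner
  RM_repository_name="MMRepository"             # alias of the MM repository inside RapidMiner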

Your installation of RapidMiner needs to know where your MM repository resides in order to retrieve the correct input dataset when running a workflow. For that, you need to update the file ~/.RapidMiner5/repositories.xml by adding the full path and alias name of the MM infra repository, or you can add the MM repository from within RapidMiner itself.
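As an illustration only, the added entry would look roughly like the snippet below; the exact element names may differ between RapidMiner versions, which is why adding the repository from within the RapidMiner GUI is the safer option:

  <localRepository>
    <alias>MMRepository</alias>
    <file>/Users/me/MetaMining_Infra/repository</file>
  </localRepository>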

Download

MetaMining_Infra.zip (2.9 MB)