Java Data Mining

JDM aims to provide a standard API for data mining, so that client applications coded to the specification do not depend on any specific vendor's application. The JDBC specification is a good analogy for the potential of JDM: just as JDBC makes it fairly easy to access different databases, applications written to the JDM specification should be able to switch between different implementations of data mining functions with little effort. JDM has wide industry support, with representation from a number of companies including Oracle, IBM, SPSS, CA, Fair Isaac, SAP, SAS, BEA, and others. Oracle [1] and KXEN [2] had implementations compliant with the JDM specification as of early 2008. It is only a question of time before other vendors and data mining toolkits adopt the specification.

This article is based on Collective Intelligence in Action, published October, 2008. It is being reproduced here by permission from Manning Publications. Manning early access books and ebooks are sold exclusively through Manning. Visit the book’s page for more information.

Work on JSR 73 [3] began in July 2000, with the final release in August 2004. JDM supports five different types of algorithms: clustering, classification, regression, attribute importance, and association rules. It also supports common data mining operations such as building, evaluating, applying, and saving a model. In addition, it defines an XML schema for representing models and for accessing data mining capabilities from a web service.

JSR 247 [4], commonly known as JDM 2.0, addresses features that were deferred from JDM 1.0. These include multivariate statistics, time series analysis, anomaly detection, transformations, text mining, multi-target models, and model comparisons. Work on it started in June 2004 and the public review draft was approved in December 2006.

If you are interested in the details of JDM, you are encouraged to download and read the two specifications; they are well written and easy to follow. You should also look at a recent, well-written book [5] by Mark Hornick, the specification lead for the two JSRs on data mining and JDM. He co-authored the book with two other members of the specification committee: Erik Marcadé from KXEN and Sunil Venkayala from Oracle.

Here, we will briefly look at the JDM architecture and the core components of the API. Toward the end of the article, we will write code that demonstrates how a connection can be made to a data mining engine using the JDM APIs.

[1] http://www.oracle.com/technology/products/bi/odm/odm_jdev_extension.html
[2] http://kxen.com/products/analytic_framework/apis.php
[3] http://www.jcp.org/en/jsr/detail?id=73
[4] http://www.jcp.org/en/jsr/detail?id=247
[5] Java Data Mining: Strategy, Standard, and Practice, 2007, Morgan Kaufmann.

JDM Architecture

The JDM architecture has the following three logical components. These components could be either collocated or distributed on different machines.

  1. The API: the programming interface used by the client. It shields the client from vendor-specific implementation details.
  2. The data mining engine (DME): the engine that provides data mining functionality to the client.
  3. The mining object repository (MOR): the repository that stores the data mining objects.
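The split between the three components can be sketched with a few plain Java types. This is only an illustration under stated assumptions: ObjectRepository, InMemoryRepository, MiningEngine, and MeanEngine are hypothetical stand-ins invented here and are not part of the javax.datamining API. The "model" is deliberately trivial (the mean of the input) so that the focus stays on which component does what.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the MOR: stores named mining objects.
interface ObjectRepository {
    void save(String name, Object miningObject);
    Object load(String name);
}

// Simplest possible repository: keeps everything in memory.
class InMemoryRepository implements ObjectRepository {
    private final Map<String, Object> objects = new HashMap<String, Object>();

    public void save(String name, Object miningObject) {
        objects.put(name, miningObject);
    }

    public Object load(String name) {
        return objects.get(name);
    }
}

// Hypothetical stand-in for the API that the client codes against.
interface MiningEngine {
    String buildModel(String modelName, double[] data);
}

// Hypothetical stand-in for a vendor DME; it persists its results in the MOR.
class MeanEngine implements MiningEngine {
    private final ObjectRepository repository;

    MeanEngine(ObjectRepository repository) {
        this.repository = repository;
    }

    // "Builds" a trivial model (the mean of the data) and stores it by name.
    public String buildModel(String modelName, double[] data) {
        double sum = 0.0;
        for (double value : data) {
            sum += value;
        }
        repository.save(modelName, Double.valueOf(sum / data.length));
        return modelName;
    }
}

class ArchitectureSketch {
    public static void main(String[] args) {
        ObjectRepository mor = new InMemoryRepository();
        // Only this line names a concrete engine; the rest uses the interface.
        MiningEngine engine = new MeanEngine(mor);
        String name = engine.buildModel("meanModel", new double[] {1.0, 2.0, 3.0});
        System.out.println(name + " -> " + mor.load(name));
    }
}
```

Because the client only touches the MiningEngine and ObjectRepository interfaces, replacing MeanEngine with another implementation changes a single line, which is the kind of vendor independence the architecture is aiming for.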

All packages in JDM begin with javax.datamining. The key packages are shown in Table 1.

Table 1: Key JDM packages

| Concept | Packages | Comments |
| --- | --- | --- |
| Common objects used throughout | javax.datamining | Contains common objects, such as MiningObject and Factory, that are used throughout the JDM packages |
| Top-level objects used in other packages | javax.datamining.base | Contains top-level interfaces such as Task, Model, BuildSettings, and AlgorithmSettings. Also introduced to avoid cyclic package dependencies |
| Algorithm-related packages | javax.datamining.algorithm, javax.datamining.association, javax.datamining.attributeimportance, javax.datamining.clustering, javax.datamining.supervised, javax.datamining.rule | Contains interfaces associated with the different types of algorithms, namely association, attribute importance, clustering, and supervised learning (which includes both classification and regression). Also contains Java interfaces representing the predicate rules created as part of models such as a tree model |
| Connecting to the data mining engine | javax.datamining.resource | Contains classes associated with connecting to a data mining engine (DME) and the metadata associated with the DME |
| Data-related packages | javax.datamining.data, javax.datamining.statistics | Contains classes associated with representing both a physical and a logical dataset, and statistics associated with the input mining data |
| Models and tasks | javax.datamining.task, javax.datamining.modeldetail | Contains classes for the different types of tasks: build, evaluate, import, and export. Provides detail on the various model representations |

Next, let us take a deeper look at some of the key JDM objects.

Key JDM Objects

The MiningObject is a top-level interface for JDM classes. It has basic information such as a name and description and can be saved in the MOR by the DME. JDM has the following types of MiningObject as shown in Figure 1.

  • Classes associated with describing the input data, including both the physical (PhysicalDataSet) and logical (LogicalData) aspects of the data.
  • Classes associated with settings. There are two kinds of settings. The first relates to the algorithm: AlgorithmSettings is the base class for specifying the settings associated with an algorithm. The second is the high-level specification for building a data mining model: BuildSettings is the base for the five different kinds of models (association, clustering, regression, classification, and attribute importance).
  • Model is the base class for mining models created by analyzing the data. There are five different kinds of models: association, clustering, regression, classification, and attribute importance.
  • Task is the base class for the different kinds of data mining operations, such as applying, testing, importing, and exporting a model.

[Figure 1]

We will look at each of these in more detail. Let’s begin with representing the dataset.

Representing the Dataset

JDM has different interfaces to describe the physical and logical aspects of the data, as shown in Figure 2. PhysicalDataSet is an interface that describes the input data used for data mining, while LogicalData represents the data used for model input. Attributes of the PhysicalDataSet, represented by PhysicalAttribute, are mapped to attributes of the LogicalData, represented by LogicalAttribute. The separation of physical and logical data makes it possible to map multiple PhysicalDataSet objects into one LogicalData for building a model. One PhysicalDataSet can also translate to multiple LogicalData objects with variations in the mappings or definitions of the attributes.

Each PhysicalDataSet is composed of zero or more PhysicalAttribute objects. An instance of PhysicalAttribute is created through the PhysicalAttributeFactory. Each PhysicalAttribute has an AttributeDataType, an enumeration with the values {double, integer, string, unknown}. A PhysicalAttribute also has a PhysicalAttributeRole, another enumeration, which defines special roles that some attributes may have. For example, taxonomyParentId represents a column of data that contains the parent identifiers for a taxonomy.

LogicalData is composed of one or more LogicalAttribute objects. Each LogicalAttribute is created by the LogicalAttributeFactory and has an associated AttributeType, an enumeration with the values {numerical, categorical, ordinal, not specified}. Each LogicalAttribute also has a DataPreparationStatus, which specifies whether the data is prepared or unprepared. For categorical attributes there is also an associated CategorySet, which specifies the set of categorical values associated with the LogicalAttribute.

Figure 2: Key JDM interfaces to describe the physical and logical aspects of the data
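The mapping from physical to logical attributes can be sketched as follows. The types below are simplified stand-ins written for this illustration (hence the Sketch suffix), not the javax.datamining.data interfaces, and the default rule that numbers become numerical attributes and strings become categorical ones is an assumption chosen for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-ins for the enumerations described above.
enum DataTypeSketch { DOUBLE, INTEGER, STRING, UNKNOWN }

enum AttributeTypeSketch { NUMERICAL, CATEGORICAL, ORDINAL, NOT_SPECIFIED }

// A column as it exists in the source data.
class PhysicalAttributeSketch {
    final String name;
    final DataTypeSketch dataType;

    PhysicalAttributeSketch(String name, DataTypeSketch dataType) {
        this.name = name;
        this.dataType = dataType;
    }
}

// The same column as the model should interpret it.
class LogicalAttributeSketch {
    final String name;
    final AttributeTypeSketch attributeType;

    LogicalAttributeSketch(String name, AttributeTypeSketch attributeType) {
        this.name = name;
        this.attributeType = attributeType;
    }
}

class DataMappingSketch {
    // Maps each physical attribute name to the logical attribute built from it.
    static Map<String, LogicalAttributeSketch> mapToLogical(PhysicalAttributeSketch[] columns) {
        Map<String, LogicalAttributeSketch> mapping =
                new LinkedHashMap<String, LogicalAttributeSketch>();
        for (PhysicalAttributeSketch column : columns) {
            mapping.put(column.name,
                    new LogicalAttributeSketch(column.name, defaultType(column.dataType)));
        }
        return mapping;
    }

    // Assumed default rule: numbers are numerical, strings are categorical.
    static AttributeTypeSketch defaultType(DataTypeSketch dataType) {
        switch (dataType) {
            case DOUBLE:
            case INTEGER:
                return AttributeTypeSketch.NUMERICAL;
            case STRING:
                return AttributeTypeSketch.CATEGORICAL;
            default:
                return AttributeTypeSketch.NOT_SPECIFIED;
        }
    }

    public static void main(String[] args) {
        PhysicalAttributeSketch[] columns = {
            new PhysicalAttributeSketch("age", DataTypeSketch.INTEGER),
            new PhysicalAttributeSketch("city", DataTypeSketch.STRING)
        };
        for (Map.Entry<String, LogicalAttributeSketch> entry
                : mapToLogical(columns).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue().attributeType);
        }
    }
}
```

Keeping the mapping in a separate step is what allows the same physical columns to be reinterpreted, for example treating an integer code as categorical, without touching the source data description.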

Now that we know how to represent a dataset, let us look at how models are represented in JDM.

Learning Models

The output of a data mining algorithm from analyzing data is represented by the Model interface. Model, which extends MiningObject, is the base interface for representing the five different kinds of data mining models, as shown in Figure 3. Each Model may have an associated ModelDetail, which captures algorithm-specific details. For example, NeuralNetworkModelDetail captures the detailed representation of a fully connected multilayer perceptron (MLP) network model. Similarly, TreeModelDetail contains the details of a decision tree model, along with methods to traverse the tree and retrieve information about it. To keep Figure 3 simple, the subclasses of ModelDetail are omitted.

[Figure 3]

Table 2 shows the six subclasses of the Model interface. Note that SupervisedModel acts as a base interface for both ClassificationModel and RegressionModel.

Table 2: Key subclasses for Model

| Model type | Description |
| --- | --- |
| AssociationModel | Model created by an association algorithm. It contains data associated with itemsets and rules |
| AttributeImportanceModel | Ranks the attributes analyzed. Each attribute has a weight associated with it, which can be used as an input for building a model |
| ClusteringModel | Represents the output from a clustering algorithm. Contains information to describe the clusters and associate a point with the appropriate cluster |
| SupervisedModel | A common interface for supervised learning models |
| ClassificationModel | Represents the model created by a classification algorithm |
| RegressionModel | Represents the model created by a regression algorithm |
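The inheritance relationships among the supervised model types in Table 2 can be sketched with a handful of interfaces. These are simplified stand-ins written for this illustration (hence the Sketch suffix), not the real javax.datamining types; the getTargetAttributeName method and the ToyClassifier class are assumptions added so that there is something to call.

```java
// Simplified stand-ins that mirror the inheritance described in Table 2.
interface MiningObjectSketch {
    String getName();
}

interface ModelSketch extends MiningObjectSketch { }

// Common base for both kinds of supervised model.
interface SupervisedModelSketch extends ModelSketch {
    String getTargetAttributeName();
}

interface ClassificationModelSketch extends SupervisedModelSketch { }

interface RegressionModelSketch extends SupervisedModelSketch { }

interface ClusteringModelSketch extends ModelSketch { }

// A toy classification model, present only to exercise the hierarchy.
class ToyClassifier implements ClassificationModelSketch {
    public String getName() {
        return "toyClassifier";
    }

    public String getTargetAttributeName() {
        return "churned";
    }
}

// A toy clustering model with no target attribute.
class ToyClusters implements ClusteringModelSketch {
    public String getName() {
        return "toyClusters";
    }
}

class ModelHierarchySketch {
    // Code written against the base interfaces works for any subtype.
    static String describe(ModelSketch model) {
        if (model instanceof SupervisedModelSketch) {
            SupervisedModelSketch supervised = (SupervisedModelSketch) model;
            return model.getName() + " predicts " + supervised.getTargetAttributeName();
        }
        return model.getName() + " has no target attribute";
    }

    public static void main(String[] args) {
        System.out.println(describe(new ToyClassifier()));
        System.out.println(describe(new ToyClusters()));
    }
}
```

The shared supervised base is what lets client code handle classification and regression models uniformly wherever only the common behavior matters.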

So far, we have looked at how to represent the data and the kinds of model representation. Next, let us look at how settings are set for the different kinds of algorithms.

Algorithm Settings

AlgorithmSettings, as shown in Figure 4, is the common base class for specifying the settings associated with the various algorithms. A DME will typically apply default values for the settings, and any settings specified by the client override those defaults.

[Figure 4]

Each kind of algorithm typically has its own interface to capture its settings. For example, KMeansSettings captures the settings associated with the k-means algorithm. This interface specifies settings such as the number of clusters, the maximum number of iterations, the distance function to be used, and the error tolerance range.
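The idea of engine defaults that the client selectively overrides can be sketched with a small settings holder. KMeansSettingsSketch is a hypothetical class written for this illustration; its method names and default values are assumptions and do not reproduce the javax.datamining KMeansSettings interface.

```java
// Hypothetical settings holder: names and default values are assumptions,
// not the javax.datamining KMeansSettings interface.
class KMeansSettingsSketch {
    // Defaults that an engine might apply when the client sets nothing.
    private int numberOfClusters = 3;
    private int maxIterations = 50;
    private double errorTolerance = 0.01;

    KMeansSettingsSketch withNumberOfClusters(int k) {
        if (k < 1) {
            throw new IllegalArgumentException("at least one cluster is required");
        }
        this.numberOfClusters = k;
        return this;
    }

    KMeansSettingsSketch withMaxIterations(int iterations) {
        if (iterations < 1) {
            throw new IllegalArgumentException("at least one iteration is required");
        }
        this.maxIterations = iterations;
        return this;
    }

    KMeansSettingsSketch withErrorTolerance(double tolerance) {
        if (tolerance <= 0.0) {
            throw new IllegalArgumentException("tolerance must be positive");
        }
        this.errorTolerance = tolerance;
        return this;
    }

    int getNumberOfClusters() {
        return numberOfClusters;
    }

    int getMaxIterations() {
        return maxIterations;
    }

    double getErrorTolerance() {
        return errorTolerance;
    }
}

class SettingsSketchDemo {
    public static void main(String[] args) {
        // Override one setting; the other two keep their defaults.
        KMeansSettingsSketch settings = new KMeansSettingsSketch().withNumberOfClusters(5);
        System.out.println(settings.getNumberOfClusters() + " clusters, "
                + settings.getMaxIterations() + " iterations, tolerance "
                + settings.getErrorTolerance());
    }
}
```

Validating each value as it is set keeps an invalid configuration from reaching the engine, where a bad cluster count would otherwise surface only after a long-running build had started.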

Next, let us look at the different kinds of tasks that are supported by the JDM.

JDM Tasks

There are five main types of tasks in JDM: building a model, evaluating a model, computing statistics, applying a model, and importing and exporting models from the MOR. Figure 5 shows the interfaces for some of the tasks in JDM. Tasks can be executed either synchronously or asynchronously. Some data mining tasks, such as learning a model or evaluating a very large dataset, take a long time to run. JDM supports running these as asynchronous tasks and monitoring their status.

The Task interface is an abstraction of the metadata needed to define a data mining task. The task of applying a mining model to data is captured by ApplyTask. DataSetApplyTask is used to apply the model to a dataset, while RecordApplyTask is used to apply the mining model to a single record. ExportTask and ImportTask are used to export and import mining models from the MOR.

Figure 5: The interfaces associated with the various tasks supported by JDM

Task objects can be referenced, re-executed, or executed at a later time. The DME does not allow two tasks with the same name to be executed at once, but a task that has completed can be re-executed if required. Tasks executed asynchronously provide a reference to an ExecutionHandle. Clients can monitor and control the execution of the task using the ExecutionHandle object.
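The difference between synchronous and asynchronous execution, and the rule about task names, can be sketched with the standard java.util.concurrent classes. This is an analogy under stated assumptions rather than JDM code: TaskRunnerSketch is a hypothetical class, and a Future plays the role that ExecutionHandle plays in JDM.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical task runner; a Future stands in for the execution handle.
class TaskRunnerSketch {
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final Map<String, Future<String>> handles =
            new HashMap<String, Future<String>>();

    // Asynchronous execution: returns immediately with a handle.
    synchronized Future<String> executeAsync(String taskName, Callable<String> work) {
        Future<String> previous = handles.get(taskName);
        if (previous != null && !previous.isDone()) {
            // Mirrors the rule that two tasks may not execute under one name.
            throw new IllegalStateException("task already executing: " + taskName);
        }
        Future<String> handle = executor.submit(work);
        handles.put(taskName, handle);
        return handle;
    }

    // Synchronous execution: blocks until the task ends or the timeout expires.
    String executeSync(String taskName, Callable<String> work, long timeoutMillis) {
        Future<String> handle = executeAsync(taskName, work);
        try {
            return handle.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (Exception e) {
            handle.cancel(true);
            throw new IllegalStateException("task failed or timed out: " + taskName, e);
        }
    }

    // Waits for an asynchronous task and returns its result.
    static String resultOf(Future<String> handle) {
        try {
            return handle.get();
        } catch (Exception e) {
            throw new IllegalStateException("task failed", e);
        }
    }

    // Lets the worker thread exit once queued tasks have finished.
    void shutdown() {
        executor.shutdown();
    }
}

class TaskRunnerDemo {
    public static void main(String[] args) {
        TaskRunnerSketch runner = new TaskRunnerSketch();
        Future<String> handle = runner.executeAsync("buildModel", new Callable<String>() {
            public String call() {
                return "model built";
            }
        });
        // The client is free to do other work here and poll handle.isDone().
        System.out.println(TaskRunnerSketch.resultOf(handle));
        runner.shutdown();
    }
}
```

Once a task has finished, submitting the same name again is accepted, which matches the statement above that a completed task can be re-executed.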

Next, we will look at the details of clients connecting to the DME and the use of ExecutionHandle to monitor the status.

JDM Connection

JDM allows clients to connect to the DME using a vendor-neutral connection architecture. This architecture is based on the principles of the Java Connector Architecture (JCA). Figure 6 shows the key interfaces associated with this process.

The client code looks up an instance of ConnectionFactory, perhaps by using JNDI, and supplies a user name and password. The ConnectionFactory creates Connection objects, which are expected to be single-threaded and are analogous to the Connection objects used to access a database through JDBC. The ConnectionSpec associated with the ConnectionFactory contains details such as the DME name, the URI, the locale, and the user name and password to be used.

A Connection object encapsulates a connection to the DME. It authenticates users, supports the retrieval and storage of named objects, and executes tasks. Each Connection object is a relatively heavyweight JDM object and needs to be associated with a single thread. Clients can access the DME through either a single Connection object or multiple instances. Version information for the implementation is captured in the ConnectionMetaData object.

[Figure 6]

The Connection interface has two methods for executing a task. The first is used for synchronous execution and returns an ExecutionStatus object:

public ExecutionStatus execute(Task task, java.lang.Long timeout) throws JDMException

The second is used for asynchronous execution:

public ExecutionHandle execute(java.lang.String taskName) throws JDMException

It returns a reference to an ExecutionHandle, which can be used to monitor the status of the task. The Connection object also has methods to look up mining objects, such as the one below, which returns the names of mining objects of the specified type that were created in a specified time period:

public java.util.Collection getObjectNames(java.util.Date createdAfter, java.util.Date createdBefore, NamedObject objectType) throws JDMException
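To make the time-window lookup concrete, here is a minimal sketch of how a repository might answer such a query. ObjectCatalogSketch is a hypothetical in-memory class, not the JDM Connection; it omits the object type parameter, and treating both bounds as exclusive is an assumption made for this example rather than something the text above states.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory catalog of named objects and their creation times.
class ObjectCatalogSketch {
    private final Map<String, Date> creationDates = new LinkedHashMap<String, Date>();

    void register(String name, Date created) {
        creationDates.put(name, created);
    }

    // Returns the names created strictly inside the window (assumed convention).
    Collection<String> getObjectNames(Date createdAfter, Date createdBefore) {
        List<String> names = new ArrayList<String>();
        for (Map.Entry<String, Date> entry : creationDates.entrySet()) {
            Date created = entry.getValue();
            if (created.after(createdAfter) && created.before(createdBefore)) {
                names.add(entry.getKey());
            }
        }
        return names;
    }
}

class ObjectCatalogDemo {
    public static void main(String[] args) {
        ObjectCatalogSketch catalog = new ObjectCatalogSketch();
        long now = System.currentTimeMillis();
        long day = 24L * 60L * 60L * 1000L;
        catalog.register("lastWeekModel", new Date(now - 7 * day));
        catalog.register("yesterdayModel", new Date(now - day));
        // Ask for everything created in the last three days.
        System.out.println(catalog.getObjectNames(new Date(now - 3 * day), new Date(now)));
    }
}
```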

With this overview of the connection process let us look at some sample code that can be used to connect to the DME.

Sample Code for Accessing DME

It is now time to write some code to illustrate how the JDM APIs can be used to create a Connection to the DME. The first part of the code deals with the constructor and the main method, which calls the method to create a new connection; this is shown in Listing 1.

Listing 1 Constructor and main method for JDMConnectionExample

package com.alag.ci.jdm.connect;

import java.util.Hashtable;

import javax.datamining.JDMException;
import javax.datamining.resource.Connection;
import javax.datamining.resource.ConnectionFactory;
import javax.datamining.resource.ConnectionSpec;
import javax.naming.Context;
import javax.naming.InitialContext;
import javax.naming.NamingException;

public class JDMConnectionExample {

    private String userName = null;
    private String password = null;
    private String serverURI = null;
    private String providerURI = null;

    // Constructor for the JDMConnectionExample
    public JDMConnectionExample(String userName, String password,
            String serverURI, String providerURI) {
        this.userName = userName;
        this.password = password;
        this.serverURI = serverURI;
        this.providerURI = providerURI;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder values; replace them with the details of your DME
        JDMConnectionExample eg = new JDMConnectionExample("username", "password",
                "serverURI", "http://yourHost:yourPort/yourDMService");

        // Get a Connection using the JDMConnectionExample instance
        Connection connection = eg.createANewConnection();
    }

    // The class continues in Listings 2 and 3

In our example, we use a JDMConnectionExample object to create a new instance of the Connection object. The constructor for JDMConnectionExample takes four parameters: the username and password for the DME, the URI for the DME server, and the URI for the provider. Sample values for these are shown in the main method. The main method creates a Connection object by calling

Connection connection = eg.createANewConnection();

There are three steps involved in getting a new Connection, as shown in Listing 2.

Listing 2 Creating a new connection in the JDMConnectionExample

    public Connection createANewConnection() throws JDMException, NamingException {
        // Step 1: create a ConnectionFactory
        ConnectionFactory connectionFactory = createConnectionFactory();
        // Step 2: get a ConnectionSpec populated with the credentials
        ConnectionSpec connectionSpec = getConnectionSpec(connectionFactory);
        // Step 3: get a Connection from the ConnectionFactory
        return connectionFactory.getConnection(connectionSpec);
    }

First, we need to create an instance of the ConnectionFactory. Next, we need to obtain a ConnectionSpec from the ConnectionFactory, populate it with the credentials and then create a new Connection from the ConnectionFactory using the ConnectionSpec.

Listing 3 contains the remaining part of the code for this example and deals with creating the connection factory and the initial context.

Listing 3 Getting a ConnectionFactory and ConnectionSpec

    private ConnectionFactory createConnectionFactory() throws NamingException {
        // Create the InitialContext for the JNDI lookup
        InitialContext initialJNDIContext = createInitialContext();
        return (ConnectionFactory) initialJNDIContext.lookup("java:comp/env/jdm/yourDMServer");
    }

    private InitialContext createInitialContext() throws NamingException {
        // The environment variables for the lookup are set in the Hashtable
        Hashtable<String, Object> environment = new Hashtable<String, Object>();
        environment.put(Context.INITIAL_CONTEXT_FACTORY,
                "com.your-company.javax.datamining.resource.initialContextFactory-impl");
        environment.put(Context.PROVIDER_URL, this.providerURI);
        environment.put(Context.SECURITY_PRINCIPAL, this.userName);
        environment.put(Context.SECURITY_CREDENTIALS, this.password);
        return new InitialContext(environment);
    }

    private ConnectionSpec getConnectionSpec(ConnectionFactory connectionFactory) {
        // Get the ConnectionSpec from the ConnectionFactory and populate it
        ConnectionSpec connectionSpec = connectionFactory.getConnectionSpec();
        connectionSpec.setName(this.userName);
        connectionSpec.setPassword(this.password);
        connectionSpec.setURI(this.serverURI);
        return connectionSpec;
    }

}

To get the ConnectionFactory, we first need to create the InitialContext for the JNDI lookup. The constructor for InitialContext takes a Hashtable, in which we set the initial context factory, the provider URL, and the username and password for the lookup. The code

(ConnectionFactory) initialJNDIContext.lookup("java:comp/env/jdm/yourDMServer");

provides access to the ConnectionFactory. We get access to the ConnectionSpec by

ConnectionSpec connectionSpec = connectionFactory.getConnectionSpec();

The ConnectionSpec object is populated with the serverURI, the name and password credentials and a new Connection object is created from the ConnectionFactory by the code

connectionFactory.getConnection(connectionSpec);

Once you have a Connection object, you can execute the different types of tasks that are available as per the JDM specification. This completes our JDM example and our brief overview of the JDM architecture and the key APIs.

Summary

You do not want to tie your application code to a specific vendor's implementation of data mining algorithms. Java Data Mining (JDM) is a specification developed under the Java Community Process as JSR 73 and JSR 247. JDM aims to provide a set of vendor-neutral APIs for accessing and using a data mining engine. A couple of data mining engines are compliant with the JDM specification, and more companies are expected to implement it in the future.

Published December 7th, 2009 | Java
