Python Plugins – Using scikit-learn for Outlier Detection

Apama_team · December 19, 2018, 10:00pm

Machine learning is increasingly useful in data processing, and Apama’s new Python plug-in capability makes it easy to use from within EPL. Several machine learning libraries are available, such as TensorFlow and scikit-learn. We’ve chosen scikit-learn for this demo because an outlier detection example built on that library already exists. We’ll be basing this demo on that example (found here).

This demo will train several classifiers on a subset of the Boston Housing Dataset. It will then receive a series of events and check each one to see if it is considered an outlier by each classifier. The results will be output in the log.

The demo will be created within Software AG Designer. Steps for setting this up can be found here and in this video tutorial.

The full source for this demo can be found here.

Setup

In order to run this sample, several libraries are required. In Designer, open Window > Preferences > PyDev > Interpreters > Python Interpreters. Select the interpreter you set up and click the ‘Install/Uninstall with pip’ button.

Install scikit-learn by running the command: install scikit-learn

Install NumPy by running the command: install numpy
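
To confirm that both installs succeeded, a quick check along the following lines can be run with the same interpreter. This is only a sketch and not part of the demo source; the helper name missing_modules is made up for illustration.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that the interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# scikit-learn is imported under the name "sklearn", not "scikit-learn".
missing = missing_modules(["sklearn", "numpy"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required libraries are available.")
```

If either name is reported as missing, the pip command was most likely run against a different interpreter than the one configured in Designer.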

The Sample

Begin by creating a monitor file to encapsulate the EPL logic. This file will load and initialize a plug-in and pass events to the plug-in for analysis.

At the top of the monitor file, create an event. This will represent the housing data that is sent into the system.

event HousingData {
    float RAD;      // index of accessibility to radial highways
    float PTRATIO;  // pupil-teacher ratio by town
}

Then create your monitor. This monitor listens for the event we created above and checks each one to see if it is an outlier. The complete file, including the package declaration and the event definition, looks like this:

package apamax.ml;

event HousingData {
	float RAD; // index of accessibility to radial highways
	float PTRATIO; // pupil-teacher ratio by town
}

monitor Test {
    import "testPlugin" as plugin;        // Load our Python plug-in
    
    action onload() {
        plugin.Train();        // Call the plug-in function to train our classifiers
        
        on all HousingData() as hd {
            
            // Check if this event is an outlier.
            // Results is a dictionary of {name : result} where name is the name of
            // the classifier and result is whether or not that classifier considers
            // this data to be an outlier.
            dictionary<string, boolean> results := plugin.CheckIfOutlier(hd);
            
            string r;
            for r in results.keys() {
                if(results[r]) {
                    // If this classifier determines this data to be an outlier, output to the log
                    log hd.toString() + " - " + r + " determined this to be an outlier!";
                }
            }
        }        
    }
}

With the wrapper monitor file created, it’s time to create the plug-in. Begin by importing the relevant modules.

from sklearn.covariance import EllipticEnvelope
from sklearn.svm import OneClassSVM
from sklearn.datasets import load_boston
from apama.eplplugin import EPLAction, EPLPluginBase

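In your class initialization, create some classifiers to use for outlier detection, and load the training data. Note that load_boston was removed in scikit-learn 1.2, so this code needs an earlier release of the library.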

self.classifiers = {
    "Empirical Covariance": EllipticEnvelope(support_fraction=1., contamination=0.261),
    "Robust Covariance (Minimum Covariance Determinant)": EllipticEnvelope(contamination=0.261),
    "OCSVM": OneClassSVM(nu=0.261, gamma=0.05)
}
        
# Get data
self.TrainingData = load_boston()['data'][:, [8, 10]]
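
The index list [8, 10] picks out the RAD and PTRATIO columns, which are the two fields of the HousingData event. The sketch below illustrates that selection on a small stand-in array rather than the real dataset. The helper select_columns and the demo matrix are hypothetical; FEATURE_NAMES lists the dataset's columns in order.

```python
import numpy as np

# Column order of the Boston Housing feature matrix.
FEATURE_NAMES = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]


def select_columns(data, names):
    """Return only the named feature columns of a 2-D array, in the given order."""
    indices = [FEATURE_NAMES.index(name) for name in names]
    return data[:, indices]


# Stand-in matrix (not real data): 3 rows, 13 columns, each cell holds its column index.
demo = np.tile(np.arange(13, dtype=float), (3, 1))
subset = select_columns(demo, ["RAD", "PTRATIO"])

print(subset.shape)        # (3, 2)
print(subset[0].tolist())  # [8.0, 10.0], i.e. columns 8 and 10 were selected
```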

Create a training function to train each classifier on the loaded data.

for clf in self.classifiers.values():
    clf.fit(self.TrainingData)
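
To try the classifiers outside the correlator, the setup and training steps can be sketched as a standalone script. This is an illustration under assumptions, not the demo source: the training data is synthetic (random values loosely centred where RAD and PTRATIO might lie), the helper check_point is hypothetical, and random_state is set only to make the run repeatable.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.svm import OneClassSVM

# Synthetic stand-in for the (RAD, PTRATIO) training columns; NOT the real Boston data.
rng = np.random.default_rng(0)
training_data = rng.normal(loc=[5.0, 18.0], scale=[2.0, 2.0], size=(200, 2))

# Same three classifiers and parameters as in the plug-in.
classifiers = {
    "Empirical Covariance": EllipticEnvelope(support_fraction=1., contamination=0.261),
    "Robust Covariance (Minimum Covariance Determinant)": EllipticEnvelope(
        contamination=0.261, random_state=0),
    "OCSVM": OneClassSVM(nu=0.261, gamma=0.05),
}

# Train each classifier on the same data.
for clf in classifiers.values():
    clf.fit(training_data)


def check_point(point):
    """Return {classifier name: True if the point is predicted to be an outlier}."""
    return {name: bool(clf.predict([point])[0] == -1)
            for name, clf in classifiers.items()}


# A point far from all the training data should be flagged by every classifier.
far_away = check_point([100.0, 100.0])
print(far_away)
```

Because the data here is synthetic, the results for the events used later in this post will not match the output of the real demo.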

Finally, create the function to check if a given piece of housing data is an outlier or not.

Begin by extracting the data from the EPL event and storing it in the correct format. Then, for each classifier, run the predict function, which will return a list of values representing whether or not it considered each item of data an outlier. Since we are only passing in one piece of data, we can just read back the first entry in the list. The result will be -1 for an outlier, or 1 for an inlier. Store the result as a boolean along with the name of the classifier in a dictionary. Once each classifier has been run, return the results to EPL.

Since each entry in predictions is a numpy.int32 or numpy.int64, comparing it to -1 returns a numpy.bool_. EPL can’t implicitly convert this to a boolean, and will throw an error for a return value of the wrong type. To avoid this, cast the result of predictions[0] == -1 to a Python bool before adding it to the results dictionary.
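
The type difference is easy to see in plain Python. The short sketch below uses a hand-written array in place of a real predict() result:

```python
import numpy as np

predictions = np.array([-1, 1])   # stand-in for the array returned by clf.predict(...)
raw = predictions[0] == -1        # comparing a NumPy integer yields a numpy.bool_
as_bool = bool(raw)               # explicit conversion to a built-in Python bool

print(isinstance(raw, bool))      # False: numpy.bool_ is not a Python bool
print(isinstance(as_bool, bool))  # True
```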

@EPLAction("action<apamax.ml.HousingData> returns dictionary<string, boolean>")
def CheckIfOutlier(self, d):
    asData = [d.fields["RAD"], d.fields["PTRATIO"]]
    res = {}
    for (clf_name, clf) in self.classifiers.items():
        # Test this data point against our classifier
        predictions = clf.predict([asData])
        # This is numpy.bool_ by default, since predictions[0] is a numpy.int32 or int64. Cast so EPL understands it
        entry = {clf_name : bool(predictions[0] == -1)}   # -1 means outlier, 1 means inlier
        res.update(entry)
        
    return res
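
The logic of this function can also be exercised without a correlator or scikit-learn by substituting simple stand-ins. Everything in the sketch below is hypothetical scaffolding: FakeEvent only mimics the fields dictionary used above, and ThresholdClassifier is a stub that follows the predict() convention of -1 for outliers and 1 for inliers.

```python
class FakeEvent:
    """Hypothetical stand-in for the event object; only mimics the `fields` dictionary."""

    def __init__(self, **fields):
        self.fields = fields


class ThresholdClassifier:
    """Hypothetical stub: predicts -1 (outlier) when RAD exceeds a limit, else 1."""

    def __init__(self, limit):
        self.limit = limit

    def predict(self, rows):
        return [-1 if row[0] > self.limit else 1 for row in rows]


def check_if_outlier(classifiers, event):
    """Same steps as CheckIfOutlier, written as a plain function."""
    as_data = [event.fields["RAD"], event.fields["PTRATIO"]]
    return {name: bool(clf.predict([as_data])[0] == -1)
            for name, clf in classifiers.items()}


stubs = {"strict": ThresholdClassifier(4.0), "lenient": ThresholdClassifier(20.0)}
result = check_if_outlier(stubs, FakeEvent(RAD=5.0, PTRATIO=15.0))
print(result)  # {'strict': True, 'lenient': False}
```

The dictionary comprehension is equivalent to building the result with repeated update() calls, as the plug-in does.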

With this, your application should be ready to run. Create an event file to send in some events, some of which are outliers, to see the results.

// This is considered an outlier only by the Empirical Covariance classifier
apamax.ml.HousingData(4.,21.)
// These are all inlier events
apamax.ml.HousingData(5.,15.)
apamax.ml.HousingData(5.,20.)
apamax.ml.HousingData(6.,16.)
// These are considered outliers by several or all of the classifiers
apamax.ml.HousingData(5.,10.)
apamax.ml.HousingData(1.,2.)
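
If you want to try many data points, the event lines can be generated rather than typed by hand. A minimal sketch, assuming the same event type and field order as above; the helper to_event_line and the sample values are made up for illustration:

```python
def to_event_line(rad, ptratio):
    """Format one apamax.ml.HousingData event as a line of text for the event file."""
    return "apamax.ml.HousingData({!r},{!r})".format(float(rad), float(ptratio))


samples = [(4, 21), (5, 15), (5, 10)]
lines = [to_event_line(rad, ptratio) for rad, ptratio in samples]
print("\n".join(lines))
```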

You should see output similar to the following in the log.

apamax.ml.HousingData(4,21) - Empirical Covariance determined this to be an outlier!
apamax.ml.HousingData(5,10) - Empirical Covariance determined this to be an outlier!
apamax.ml.HousingData(5,10) - OCSVM determined this to be an outlier!
apamax.ml.HousingData(1,2) - Empirical Covariance determined this to be an outlier!
apamax.ml.HousingData(1,2) - OCSVM determined this to be an outlier!
apamax.ml.HousingData(1,2) - Robust Covariance (Minimum Covariance Determinant) determined this to be an outlier!
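
To see at a glance which classifiers flagged each event, the messages can be grouped by event. The sketch below assumes each line has exactly the form shown above, with any log prefix such as a timestamp already stripped. The helper flagged_by is hypothetical.

```python
from collections import defaultdict

SUFFIX = " determined this to be an outlier!"


def flagged_by(log_lines):
    """Group classifier names by the event they flagged."""
    flagged = defaultdict(list)
    for line in log_lines:
        if not line.endswith(SUFFIX):
            continue  # ignore unrelated log lines
        event, _, classifier = line[:-len(SUFFIX)].partition(" - ")
        flagged[event].append(classifier)
    return dict(flagged)


sample = [
    "apamax.ml.HousingData(4,21) - Empirical Covariance determined this to be an outlier!",
    "apamax.ml.HousingData(5,10) - Empirical Covariance determined this to be an outlier!",
    "apamax.ml.HousingData(5,10) - OCSVM determined this to be an outlier!",
]
summary = flagged_by(sample)
for event, names in summary.items():
    print(event, "->", ", ".join(names))
```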

And with that, we’ve performed outlier detection in EPL using a Python plug-in. This demonstrates some of the powerful possibilities of being able to use Python from an Apama correlator.