Thursday, September 23, 2010

Modern Data Algorithms.

Modern Analytics: A look at the SMO. part 1 of a week long series.

In celebration of modern data intelligence week, I will begin a week long series of posts that explores the cutting edge of predictive analytics so that our customers and friends of the blog can get an idea of what is out there in the big wide world of data. Our first study is on the SVM and the popular SMO by John Platt.

——————————————————————————————————————-

1. Introduction to modern PA.

Predictive Analytics is the art of mathematically manipulating distributions of historical input variables in the hope that a model can be made that maps a general input to an output variable with some measure of consistency {often either binary, multiple integer or continuous parameter}.

In plain speak: look at my past data and find a measure of similarity to some new data I just received.

In a way it is a quest for similarity, and people love similarity because it invokes comforting feelings of certainty about things in life and health and business. The chaotic world of massive data inspires us to pull measures of similarity out of data streams and this process is called classification or regression in the case of continuous variables.

——————————————————————————————————————-

2. The Quick and the Dead

Statistics, regression techniques,the Bayesian arts and Neural nets all have their place in the world of PA but we here are very much in support of modern algorithms. (the former are very old, well studied theories). big, established BI companies use these simpler algorithms, and to be fair one must give great respect to SAS in particular. SAS is able to use these algorithms competently and the competitors are pretty much dead in the water trying to figure out how to leverage 100 year old algorithms in a world where data is exploding at massive rates and at the same time slowly becoming numerically intense.

I believe that the future of operational financial analysis will be driven by Statisticians who use mathematical models to drive insights and discover indicators. This drives us to somewhat break away from the Bayesian techniques and explore more complex ways of data crunching. As a footnote, one must also respect open source methods like R and Rapid Miner who ambitiously try to code up the modern algorithms for the everyday IT worker to deploy.

——————————————————————————————————————-

3. Precursor to the SMO. The Support Vector Machine, also known as convex optimization.

If you do PA you have likely heard a lot of buzz over the last 10 years on the Support Vector Machine (SVM). The SVM was the first modern predictive algorithm that had real chops and was invented by the great Vladimir Vapnik and Corinna Cortes in 1995 at Bell Labs. It attempts to answer a simple problem called supervised learning. If you have a bunch of inputs and a related binary output {basically true or false}:

Xi = [x1 , x2 , ... , xn] ->Yi either [-1,1]

can we look at a bunch of past results and generalize a test vector map to a new result accurately? It turns out that you can and you can do this while robustly guarding against getting the mapping wrong. The answer was to build a plane that slices the statistical space into two half spaces in such a way that you maximize the distance between the two convex hulls of the labeled historical sets.

In plain speak: draw the tightest box around the similar data points that you can and make a cut in the stat space such that you max out the distance between the two boxes and the “line in the sand“. The line in the sand represents the plane where you would be exactly 50-50 in decision.



Using some clever Lagrangian optimization mathematics and then mapping the Lagrangian to its dual allows you to form the problem into a convex function with convex constraints (namely that the edge of the convex hull you are trying to max out the distance of with respect to the separating plane is touching another plane called the “support vector”).

min ||w||^2 + C*SUM(ε) w.r.t ||w||

subject to: y[i]*{w’Φ(x)} ≥ 1-ε

The dual is a classic minimization problem, specifically the minimization of the norm of the model parameters with slack variables added for each data point. This problem can be solved with a QP solver and a total global optima for the model parameters can be found which sets up an environment for robust classification.

In plain speak: you can play tricks with building a linear separator in your stat space and encourage sparsity by minimizing the norm of the model parameters vector. (this can be seen as choosing as few data points as possible to describe the edges of the convex hull) while at the same time keeping misclassification as low as possible by adding parameter C which basically penalizes data points that are on the wrong side of the line.

It turns out this is a robust predictor that is the most well known of modern algorithms, famously being used in your spam filters, in medical algorithms, classifying skin cancer and a plethora of financial operations.

——————————————————————————————————————-

4. The SMO. Sequential Minimization Optimization is just a way of speeding up the SVM

It turns out that using a Quadratic Programming solver to solve the minimization above is not as efficient as the famous SMO method of John Platt. Essentially, but perhaps not always, a QP solver is a numeric gradient descent method that iteratively solves the entire optimization problem in a stepwise fashion until the optimal parameters are found. The parameters are the Lagrange multipliers which minimize the cost function and represent the global minimum. This is found when the Objective function stops changing significantly with subsequent pair-wise analytic steps in any direction of feature space.

Platt proved that you can solve the optimization problem analytically rather then numerically in his seminal paper. It turns out this algorithm was very very fast. It remains as far as I’m aware the method of choice for deploying a convex optimization solver to build supervised classification point estimation models quickly. Models with a million vectors of inputs can be solved on a relatively small data crunching center rapidly. It works superbly and gives strong predictive models for all types of supervised learning questions.

——————————————————————————————————————-

5. So what does SMO mean for me?

The SMO is a robust, powerful model for classification that answers the problem of putting a test input into a bin based on past recorded inputs that have given clean measurable results. When you have data inputs which have clear measurable past outputs and you don’t care about the magnitude of the prediction, you should look here and no further. a good example would be a buy sell on a stock with small value. Should I buy or sell this stock at value X based on the returns similar stocks have got at different values?

Give me a call if you want to use this algorithm in your business. Tomorrow Ie will explore this algorithm in a Bayesian format that retains sparsity and robustness against misclassification. This is Tipping’s Relevance Vector Machine (RVM)

Until next time!

No comments:

Post a Comment