
Support Vector Machine

Brief Description

In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier (although methods such as Platt scaling exist to use SVM in a probabilistic classification setting). An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.
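A minimal sketch of that workflow, using scikit-learn's `SVC` as one concrete library (an assumption; any SVM implementation would do):

```python
# Fit an SVM on labeled points, then predict which side of the gap a new
# point falls on.  Toy data; assumes scikit-learn is installed.
from sklearn import svm

X = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]  # training examples
y = [0, 0, 1, 1]                                       # their two categories

clf = svm.SVC(kernel="linear")
clf.fit(X, y)

print(clf.predict([[0.8, 0.9]]))  # -> [1]: lands on the class-1 side of the gap
print(clf.support_vectors_)       # the boundary points that define the gap
```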

Quick View

| Category | Usage | Mathematics | Application Field |
| --- | --- | --- | --- |
| Supervised Learning | Classification (main), Regression, Outlier Detection, Clustering (unsupervised) | Convex Optimization, Constrained Optimization, Lagrange Multipliers | Numerous |
  • Support Vector Machine is well suited for extreme cases (very small sample sets)

  • SVM finds a hyperplane that separates its training data in such a way that the distance between the hyperplane and the closest points from each class is maximized

  • This implies that only the support vectors are important, whereas the other training examples are ignorable

  • SVM can only be used on data that is linearly separable (i.e. a hyperplane can be drawn between the two groups)

  • By using the kernel trick, data becomes linearly separable in a higher dimension

  • Important Notes:

    • SVM natively only handles binary classification
      • If you implement the optimizer yourself, make sure your data labels are either -1 or 1 (otherwise it can never find the alphas, and hence the support vectors); see the sketch after this list
      • For better convergence you have to normalize your data
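A minimal sketch of that label fix, relevant when you implement the dual optimizer yourself (libraries such as scikit-learn remap labels internally):

```python
import numpy as np

# Hypothetical labels arriving as {0, 1}; the SVM dual (and the alphas it
# solves for) assumes y_i in {-1, +1}, so remap before training.
y = np.array([0, 1, 1, 0])
y_pm = np.where(y == 0, -1, 1)
print(y_pm)  # [-1  1  1 -1]
```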

Advantages

  • Effective in high-dimensional spaces
  • Still effective when the number of dimensions exceeds the number of samples
  • Uses a subset of the training points (the support vectors) in the decision function => memory efficient
  • Different kernel functions can be specified for the decision function
    • Kernel functions can be added together to achieve even more complex hyperplanes

Disadvantages

  • Prone to overfitting when the number of features is much greater than the number of samples (kernel choice and regularization become crucial)
  • SVMs do not directly provide probability estimates
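When probabilities are needed anyway, libraries typically bolt Platt scaling on top. In scikit-learn this is the `probability=True` flag (it fits an extra cross-validated sigmoid, so training is slower); toy data below is assumed:

```python
import numpy as np
from sklearn import svm

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 2, rng.randn(20, 2) + 2])  # two blobs
y = np.array([0] * 20 + [1] * 20)

# probability=True enables Platt scaling internally (at extra training cost).
clf = svm.SVC(kernel="linear", probability=True).fit(X, y)
print(clf.predict_proba([[0.0, 0.0]]))  # per-class probability estimates
```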

Terminology

  • Hyperplane: the decision boundary that separates the different classes
    • 2 dimensions: a line
    • 3 dimensions: a plane
    • 4 or more dimensions: a hyperplane
  • Support Vectors: the vectors (training points) that determine the best hyperplane
  • Margin: the space between the support vectors and the hyperplane

The goal is to find the hyperplane that best splits the data: the one that is as far as possible from the support vectors, which is another way of saying that the margin is maximized.

  • Parameters (see the sketch after this list)

    • C parameter (penalty parameter): decides how much you want to penalize misclassified points
      • Low C: prioritize simplicity (a soft margin, since we allow some misclassification)
      • High C: prioritize making few mistakes
    • Gamma parameter (for the RBF, polynomial, and sigmoid kernels)
      • Small gamma: less complexity
      • High gamma: more complexity
  • Multiclass SVM (decision function shape)

    • OVR: One vs. Rest (for each class, build a binary classifier that answers "is or isn't this class")
      • Pros: fewer classifiers
      • Cons: the two sides of each classifier may be imbalanced
    • OVO: One vs. One (one binary classifier per pair of classes)
      • Pros: less sensitive to imbalance
      • Cons: more classifiers
  • Linearly Separable

  • Kernel Function

    • Linear
    • Radial Basis Function (RBF)
    • Polynomial
    • Sigmoid
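A sketch of how these knobs map onto scikit-learn's `SVC` (one common library; the data and values below are illustrative assumptions, not recommendations):

```python
import numpy as np
from sklearn import svm

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - 1, rng.randn(50, 2) + 1])  # overlapping blobs
y = np.array([0] * 50 + [1] * 50)

# Low C / low gamma: simpler, softer boundary.  High C / high gamma: more
# complex boundary that bends around individual (possibly noisy) points.
for C, gamma in [(0.1, 0.1), (100.0, 10.0)]:
    clf = svm.SVC(kernel="rbf", C=C, gamma=gamma).fit(X, y)
    print(f"C={C}, gamma={gamma}: {len(clf.support_)} support vectors")
```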

Concepts

Maximize the Margin

Maximizing the margin is a constrained optimization problem

  • The way to solve a constrained optimization problem is the Lagrange multipliers technique

Convex quadratic programming problem ===[Lagrange Multipliers]===> Dual problem

  • SMO (Sequential Minimal Optimization)
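For reference, the standard hard-margin primal (maximizing the margin $2/\|w\|$ is equivalent to minimizing $\frac{1}{2}\|w\|^2$) and the dual that the Lagrange multipliers produce:

$$\min_{w,\,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^\top x_i + b) \ge 1,\ i = 1,\dots,n$$

$$\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\, x_i^\top x_j \quad \text{s.t.} \quad \alpha_i \ge 0,\ \sum_{i=1}^{n}\alpha_i y_i = 0$$

Training points with $\alpha_i > 0$ are exactly the support vectors; SMO solves this dual by repeatedly optimizing two of the $\alpha_i$ at a time.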

Consider a non-linearly separable problem

  • Linear Support Vector Machine (LSVM): apply when the classes are linearly separable

  • If you have a data set that is not linearly separable

    => Transform the data into a high-dimensional feature space (make the data linearly separable by mapping it to a higher dimension)

    • e.g. 1D to 2D: use a polynomial function $f(x) = x^2$ to get a parabola

But computing such a mapping explicitly is computationally expensive (see the sketch below)
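A quick sketch of that 1D-to-2D example with assumed toy values:

```python
import numpy as np

# 1-D data that is NOT linearly separable: class 1 sits between class 0.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([0, 0, 1, 1, 1, 0, 0])

# Map each point to 2-D with f(x) = x^2.  In the mapped space a horizontal
# line (e.g. x2 = 2) separates the classes, so a linear SVM now works.
X_mapped = np.column_stack([x, x ** 2])
print(X_mapped)
```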

Kernel Function

  • Use the kernel trick to reduce the computational cost
    • Kernel function: computes the inner product as if the data had already been mapped into the higher-dimensional (linearly separable) space, without ever performing the mapping
    • Popular kernel types
      • Linear Kernel

        $K(x, y) = x^\top y$

      • Polynomial Kernel

        $K(x, y) = (x^\top y + 1)^d$

      • Radial Basis Function (RBF) Kernel

        $K(x, y) = e^{-\gamma \|x-y\|^2}$

      • Sigmoid Kernel

        $K(x, y) = \tanh(\gamma \, x^\top y + r)$

      • ...
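The point of the trick: evaluate $K(x, y)$ directly in the original space instead of mapping first. A quick numeric check for the degree-2 polynomial kernel in 2-D, using its standard explicit feature map (toy vectors assumed):

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

# Kernel trick: evaluate K in the original 2-D space...
K = (x @ y + 1) ** 2

# ...versus explicitly mapping to the 6-D feature space and taking a dot
# product there.  phi is the standard feature map for (x.y + 1)^2 in 2-D.
def phi(v):
    s = np.sqrt(2)
    return np.array([1.0, s * v[0], s * v[1], v[0] ** 2, v[1] ** 2, s * v[0] * v[1]])

print(K, phi(x) @ phi(y))  # 25.0 25.0 -- same value, no 6-D vectors needed for K
```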

Tune Parameters

  • Choosing the correct kernel is a non-trivial task and may depend on the specific task at hand
    • You need to tune the kernel parameters to get good performance from a classifier
    • Parameter tuning technique
      • k-fold cross-validation, typically wrapped around a grid search over the parameters (see the sketch below)
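A sketch of that tuning loop with scikit-learn's `GridSearchCV` (the data and grid values are assumptions; useful ranges depend on your problem):

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(30, 2) - 1, rng.randn(30, 2) + 1])  # toy blobs
y = np.array([0] * 30 + [1] * 30)

# Runs k-fold cross-validation (cv=5) for every (C, gamma) combination and
# keeps the best-scoring one.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(svm.SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```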

C Parameter

Multiple Classes

  • MSVM (Multiclass Support Vector Machine)

Decision Function Shape

  • OVR
  • OVO
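A sketch contrasting the two strategies via scikit-learn's meta-estimators (note that `SVC` itself always trains OVO internally; its `decision_function_shape` option only reshapes the decision values):

```python
from sklearn import svm, datasets
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = datasets.load_digits(return_X_y=True)  # 10 classes

# OVR trains k binary classifiers; OVO trains k(k-1)/2 of them.
ovr = OneVsRestClassifier(svm.SVC(kernel="linear")).fit(X, y)
ovo = OneVsOneClassifier(svm.SVC(kernel="linear")).fit(X, y)
print(len(ovr.estimators_), len(ovo.estimators_))  # 10 vs 45
```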

Maximize Margin

Convergence Problem

  • A simplified implementation (e.g. simplified SMO) is not guaranteed to converge to the global optimum for all datasets!

  • But it should always converge unless there are numerical problems. Make sure the data is properly scaled: it is a bad idea for different features to have values on different orders of magnitude. You might want to normalize all features to the range [-1, +1], especially for problems with more than 100 features (see the sketch below).
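A sketch of that [-1, +1] normalization with scikit-learn's `MinMaxScaler` (toy values assumed):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features on wildly different scales: column 0 in [0, 1],
# column 1 in the thousands -- a common cause of slow or failed convergence.
X = np.array([[0.2, 1500.0], [0.7, 3200.0], [0.5, 900.0]])

X_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)
print(X_scaled)  # every feature now lies in [-1, +1]
```

In practice, fit the scaler on the training set only and apply the same transform to the test set.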

LIBSVM FAQ about Convergence

Q: The program keeps running (with output, i.e. many dots). What should I do?

In theory LIBSVM guarantees convergence. Therefore, this means you are handling an ill-conditioned situation (e.g. too large/small parameters), so numerical difficulties occur.

You may get better numerical stability by replacing `typedef float Qfloat;` in svm.cpp with `typedef double Qfloat;`. That is, elements in the kernel cache are stored in double instead of single precision. However, this means fewer elements can be put in the kernel cache.

Q: The training time is too long. What should I do?

For large problems, please specify enough cache size (i.e., `-m`). Slow convergence may happen for some difficult cases (e.g. `-c` is large). You can try to use a looser stopping tolerance with `-e`. If that still doesn't work, you may train only a subset of the data. You can use the program subset.py in the "tools" directory to obtain a random subset.

If you have extremely large data and face this difficulty, please contact us. We will be happy to discuss possible solutions.

When using a large `-e`, you may want to check whether `-h 0` (no shrinking) or `-h 1` (shrinking) is faster.
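Putting those flags together through LIBSVM's Python wrapper (a sketch: the import path assumes the pip `libsvm` package, and `train.txt` is a hypothetical LIBSVM-format file):

```python
from libsvm.svmutil import svm_read_problem, svm_train

y, x = svm_read_problem("train.txt")  # hypothetical training file
# -m 1000: 1000 MB kernel cache; -e 0.01: looser stopping tolerance than the
# default; -h 0: disable shrinking (worth timing against -h 1 when -e is large).
model = svm_train(y, x, "-c 1 -m 1000 -e 0.01 -h 0")
```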

Very Large Datasets

Links

Tutorial

MOOC

  • NTU Hsuan-Tien Lin - Machine Learning Technique

    • Linear SVM
      • Course Introduction
      • Large-Margin Separating Hyperplane
      • Standard Large-Margin Problem
      • Support Vector Machine
      • Reasons behind Large-Margin Hyperplane
    • Dual SVM
      • Motivation of Dual SVM
      • Lagrange Dual SVM
      • Solving Dual SVM
      • Messages behind Dual SVM
    • Kernel SVM
      • Kernel Trick
      • Polynomial Kernel
      • Gaussian Kernel
      • Comparison of Kernels
    • Soft-Margin SVM
      • Motivation and Primal
      • Dual Problem
      • Messages
      • Model Selection
  • Stanford Andrew Ng - CS229

    • Optimization Objective
    • Large Margin Intuition
    • Mathematics Behind Large Margin Classification
    • Kernels-I, Kernels-II
    • Using an SVM

Library

Wikipedia

Github