
Support Vector Machine

Brief Description

In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier (although methods such as Platt scaling exist to use SVM in a probabilistic classification setting). An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.
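A minimal sketch of that workflow, using scikit-learn's `SVC` as one concrete library (an assumption; any SVM implementation would do):

```python
# Fit an SVM on labeled points, then predict which side of the gap a new
# point falls on.  Toy data; assumes scikit-learn is installed.
from sklearn import svm

X = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]  # training examples
y = [0, 0, 1, 1]                                       # their two categories

clf = svm.SVC(kernel="linear")
clf.fit(X, y)

print(clf.predict([[0.8, 0.9]]))  # -> [1]: lands on the class-1 side of the gap
print(clf.support_vectors_)       # the boundary points that define the gap
```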

Quick View

| Category | Usage | Mathematics | Application Field |
| --- | --- | --- | --- |
| Supervised Learning | Classification (main), Regression, Outlier Detection, Clustering (unsupervised) | Convex Optimization, Constrained Optimization, Lagrange Multipliers | Numerous |
  • Support Vector Machine is well suited for extreme cases (very small sample sets)

  • SVM finds a hyperplane that separates its training data in such a way that the distance between the hyperplane and the closest points from each class is maximized

  • This implies that only the support vectors are important, whereas the other training examples are ignorable

  • SVM can only be used on data that is linearly separable (i.e. a hyperplane can be drawn between the two groups)

  • By using the kernel trick, data becomes linearly separable in a higher dimension

  • Important Notes:

    • SVM natively only handles binary classification
      • If you implement the optimizer yourself, make sure your data labels are either -1 or 1 (otherwise it can never find the alphas, and hence the support vectors); see the sketch after this list
      • For better convergence you have to normalize your data
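A minimal sketch of that label fix, relevant when you implement the dual optimizer yourself (libraries such as scikit-learn remap labels internally):

```python
import numpy as np

# Hypothetical labels arriving as {0, 1}; the SVM dual (and the alphas it
# solves for) assumes y_i in {-1, +1}, so remap before training.
y = np.array([0, 1, 1, 0])
y_pm = np.where(y == 0, -1, 1)
print(y_pm)  # [-1  1  1 -1]
```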

Advantages

  • Effective in high-dimensional spaces
  • Still effective when the number of dimensions exceeds the number of samples
  • Uses a subset of the training points (the support vectors) in the decision function => memory efficient
  • Different kernel functions can be specified for the decision function
    • Kernel functions can be added together to achieve even more complex hyperplanes

Disadvantages

  • Prone to overfitting when the number of features is much greater than the number of samples (kernel choice and regularization become crucial)
  • SVMs do not directly provide probability estimates
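When probabilities are needed anyway, libraries typically bolt Platt scaling on top. In scikit-learn this is the `probability=True` flag (it fits an extra cross-validated sigmoid, so training is slower); toy data below is assumed:

```python
import numpy as np
from sklearn import svm

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 2, rng.randn(20, 2) + 2])  # two blobs
y = np.array([0] * 20 + [1] * 20)

# probability=True enables Platt scaling internally (at extra training cost).
clf = svm.SVC(kernel="linear", probability=True).fit(X, y)
print(clf.predict_proba([[0.0, 0.0]]))  # per-class probability estimates
```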

Terminology

  • Hyperplane: the decision boundary that separates the different classes
    • 2 dimensions: a line
    • 3 dimensions: a plane
    • 4 or more dimensions: a hyperplane
  • Support Vectors: the vectors (training points) that determine the best hyperplane
  • Margin: the space between the support vectors and the hyperplane

The goal is to find the hyperplane that best splits the data: the one that is as far as possible from the support vectors, which is another way of saying that the margin is maximized.

  • Parameters (see the sketch after this list)

    • C parameter (penalty parameter): decides how much you want to penalize misclassified points
      • Low C: prioritize simplicity (a soft margin, since we allow some misclassification)
      • High C: prioritize making few mistakes
    • Gamma parameter (for the RBF, polynomial, and sigmoid kernels)
      • Small gamma: less complexity
      • High gamma: more complexity
  • Multiclass SVM (decision function shape)

    • OVR: One vs. Rest (for each class, build a binary classifier that answers "is or isn't this class")
      • Pros: fewer classifiers
      • Cons: the two sides of each classifier may be imbalanced
    • OVO: One vs. One (one binary classifier per pair of classes)
      • Pros: less sensitive to imbalance
      • Cons: more classifiers
  • Linearly Separable

  • Kernel Function

    • Linear
    • Radial Basis Function (RBF)
    • Polynomial
    • Sigmoid
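A sketch of how these knobs map onto scikit-learn's `SVC` (one common library; the data and values below are illustrative assumptions, not recommendations):

```python
import numpy as np
from sklearn import svm

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - 1, rng.randn(50, 2) + 1])  # overlapping blobs
y = np.array([0] * 50 + [1] * 50)

# Low C / low gamma: simpler, softer boundary.  High C / high gamma: more
# complex boundary that bends around individual (possibly noisy) points.
for C, gamma in [(0.1, 0.1), (100.0, 10.0)]:
    clf = svm.SVC(kernel="rbf", C=C, gamma=gamma).fit(X, y)
    print(f"C={C}, gamma={gamma}: {len(clf.support_)} support vectors")
```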

Concepts

Maximize the Margin

Maximizing the margin is a constrained optimization problem

  • The way to solve a constrained optimization problem is the Lagrange multipliers technique

Convex quadratic programming problem ===[Lagrange Multipliers]===> Dual problem

  • SMO (Sequential Minimal Optimization)
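For reference, the standard hard-margin primal (maximizing the margin $2/\|w\|$ is equivalent to minimizing $\frac{1}{2}\|w\|^2$) and the dual that the Lagrange multipliers produce:

$$\min_{w,\,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^\top x_i + b) \ge 1,\ i = 1,\dots,n$$

$$\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\, x_i^\top x_j \quad \text{s.t.} \quad \alpha_i \ge 0,\ \sum_{i=1}^{n}\alpha_i y_i = 0$$

Training points with $\alpha_i > 0$ are exactly the support vectors; SMO solves this dual by repeatedly optimizing two of the $\alpha_i$ at a time.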

Consider a non-linearly separable problem

  • Linear Support Vector Machine (LSVM): apply when the classes are linearly separable

  • If you have a data set that is not linearly separable

    => Transform the data into a high-dimensional feature space (make the data linearly separable by mapping it to a higher dimension)

    • e.g. 1D to 2D: use a polynomial function $f(x) = x^2$ to get a parabola

But computing such a mapping explicitly is computationally expensive (see the sketch below)
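A quick sketch of that 1D-to-2D example with assumed toy values:

```python
import numpy as np

# 1-D data that is NOT linearly separable: class 1 sits between class 0.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([0, 0, 1, 1, 1, 0, 0])

# Map each point to 2-D with f(x) = x^2.  In the mapped space a horizontal
# line (e.g. x2 = 2) separates the classes, so a linear SVM now works.
X_mapped = np.column_stack([x, x ** 2])
print(X_mapped)
```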

Kernel Function

  • Use the kernel trick to reduce the computational cost
    • Kernel function: computes the inner product as if the data had already been mapped into the higher-dimensional (linearly separable) space, without ever performing the mapping
    • Popular kernel types
      • Linear Kernel

        $K(x, y) = x^\top y$

      • Polynomial Kernel

        $K(x, y) = (x^\top y + 1)^d$

      • Radial Basis Function (RBF) Kernel

        $K(x, y) = e^{-\gamma \|x-y\|^2}$

      • Sigmoid Kernel

        $K(x, y) = \tanh(\gamma \, x^\top y + r)$

      • ...
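The point of the trick: evaluate $K(x, y)$ directly in the original space instead of mapping first. A quick numeric check for the degree-2 polynomial kernel in 2-D, using its standard explicit feature map (toy vectors assumed):

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

# Kernel trick: evaluate K in the original 2-D space...
K = (x @ y + 1) ** 2

# ...versus explicitly mapping to the 6-D feature space and taking a dot
# product there.  phi is the standard feature map for (x.y + 1)^2 in 2-D.
def phi(v):
    s = np.sqrt(2)
    return np.array([1.0, s * v[0], s * v[1], v[0] ** 2, v[1] ** 2, s * v[0] * v[1]])

print(K, phi(x) @ phi(y))  # 25.0 25.0 -- same value, no 6-D vectors needed for K
```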

Tune Parameters

  • Choosing the correct kernel is a non-trivial task and may depend on the specific task at hand
    • You need to tune the kernel parameters to get good performance from a classifier
    • Parameter tuning technique
      • k-fold cross-validation, typically wrapped around a grid search over the parameters (see the sketch below)
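A sketch of that tuning loop with scikit-learn's `GridSearchCV` (the data and grid values are assumptions; useful ranges depend on your problem):

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(30, 2) - 1, rng.randn(30, 2) + 1])  # toy blobs
y = np.array([0] * 30 + [1] * 30)

# Runs k-fold cross-validation (cv=5) for every (C, gamma) combination and
# keeps the best-scoring one.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(svm.SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```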

C Parameter

Multiple Classes

  • MSVM (Multiclass Support Vector Machine)

Decision Function Shape

  • OVR
  • OVO
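A sketch contrasting the two strategies via scikit-learn's meta-estimators (note that `SVC` itself always trains OVO internally; its `decision_function_shape` option only reshapes the decision values):

```python
from sklearn import svm, datasets
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = datasets.load_digits(return_X_y=True)  # 10 classes

# OVR trains k binary classifiers; OVO trains k(k-1)/2 of them.
ovr = OneVsRestClassifier(svm.SVC(kernel="linear")).fit(X, y)
ovo = OneVsOneClassifier(svm.SVC(kernel="linear")).fit(X, y)
print(len(ovr.estimators_), len(ovo.estimators_))  # 10 vs 45
```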

Maximize Margin

Convergence Problem

  • A simplified implementation (e.g. simplified SMO) is not guaranteed to converge to the global optimum for all datasets!

  • But it should always converge unless there are numerical problems. Make sure the data is properly scaled: it is a bad idea for different features to have values on different orders of magnitude. You might want to normalize all features to the range [-1, +1], especially for problems with more than 100 features (see the sketch below).
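A sketch of that [-1, +1] normalization with scikit-learn's `MinMaxScaler` (toy values assumed):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features on wildly different scales: column 0 in [0, 1],
# column 1 in the thousands -- a common cause of slow or failed convergence.
X = np.array([[0.2, 1500.0], [0.7, 3200.0], [0.5, 900.0]])

X_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)
print(X_scaled)  # every feature now lies in [-1, +1]
```

In practice, fit the scaler on the training set only and apply the same transform to the test set.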

LIBSVM FAQ about Convergence

Q: The program keeps running (with output, i.e. many dots). What should I do?

In theory LIBSVM guarantees convergence. Therefore, this means you are handling an ill-conditioned situation (e.g. too large/small parameters), so numerical difficulties occur.

You may get better numerical stability by replacing `typedef float Qfloat;` in svm.cpp with `typedef double Qfloat;`. That is, elements in the kernel cache are stored in double instead of single precision. However, this means fewer elements can be put in the kernel cache.

Q: The training time is too long. What should I do?

For large problems, please specify enough cache size (i.e., `-m`). Slow convergence may happen for some difficult cases (e.g. `-c` is large). You can try to use a looser stopping tolerance with `-e`. If that still doesn't work, you may train only a subset of the data. You can use the program subset.py in the "tools" directory to obtain a random subset.

If you have extremely large data and face this difficulty, please contact us. We will be happy to discuss possible solutions.

When using a large `-e`, you may want to check whether `-h 0` (no shrinking) or `-h 1` (shrinking) is faster.
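Putting those flags together through LIBSVM's Python wrapper (a sketch: the import path assumes the pip `libsvm` package, and `train.txt` is a hypothetical LIBSVM-format file):

```python
from libsvm.svmutil import svm_read_problem, svm_train

y, x = svm_read_problem("train.txt")  # hypothetical training file
# -m 1000: 1000 MB kernel cache; -e 0.01: looser stopping tolerance than the
# default; -h 0: disable shrinking (worth timing against -h 1 when -e is large).
model = svm_train(y, x, "-c 1 -m 1000 -e 0.01 -h 0")
```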

Very Large Datasets

Links

Tutorial

MOOC

  • NTU Hsuan-Tien Lin - Machine Learning Technique

    • Linear SVM
      • Course Introduction
      • Large-Margin Separating Hyperplane
      • Standard Large-Margin Problem
      • Support Vector Machine
      • Reasons behind Large-Margin Hyperplane
    • Dual SVM
      • Motivation of Dual SVM
      • Lagrange Dual SVM
      • Solving Dual SVM
      • Messages behind Dual SVM
    • Kernel SVM
      • Kernel Trick
      • Polynomial Kernel
      • Gaussian Kernel
      • Comparison of Kernels
    • Soft-Margin SVM
      • Motivation and Primal
      • Dual Problem
      • Messages
      • Model Selection
  • Stanford Andrew Ng - CS229

    • Optimization Objective
    • Large Margin Intuition
    • Mathematics Behind Large Margin Classification
    • Kernels-I, Kernels-II
    • Using an SVM

Library

Wikipedia

Github