cnn-image-classification

Overview:

  • What are Convolutional Neural Networks? -> End goals, features, examples, human brains vs ANN image recognition

  • Step 1 - Convolution Operation -> feature detector, filters, feature maps, different parameters, visual examples

  • Step 1(b) - Rectified Linear Unit (ReLU) Layer - why linearity is not good, more non-linearity for image recognition

  • Step 2 - Pooling -> max pooling, mean pooling, sub-sampling, and other approaches. Cool example (visual interactive tool)

  • Step 3 - Flattening

  • Step 4 - Full Connection -> puts everything together, how it all works, how the final neurons classify the image

  • Summary

  • Extra - Softmax and Cross-Entropy

  • What are Convolutional Neural Networks?

    • Our brain looks for features and uses them to categorize and classify what we see.

    • A CNN does the same: it processes certain features and classifies the image based on them.

    • Search interest: the term "Convolutional Neural Network" has overtaken "Artificial Neural Network"

    • CNN applications -> self-driving cars recognizing stop signs, Facebook tagging people in images

    • Yann LeCun -> grandfather of CNNs, former postdoc of Geoffrey Hinton, NYU professor, Facebook AI Research, part of the "Mafia of Deep Learning"

    • How CNN works

      • Input Image
      • CNN
      • Output Label (Image Class, e.g. Cheetah)
    • This works after the CNN has been trained on images that were categorized beforehand

      • Example -> Smiling face -> CNN -> Happy (probability)
      • Example -> Frowning -> CNN -> Sad (probability)
      • Sometimes we don't see enough features, it's all about features
      • How to recognize these features?
    • B/W Image 2x2px -> 2-D Array (Pixel 1, Pixel 2, Pixel 3, Pixel 4)

      • 0 <= pixel value <= 255
      • Any black and white image has a digital form
    • Colored Image 2x2px -> 3-D Array -> RGB layers (Red, Green, Blue), 0 <= pixel value <= 255

      • Red Channel, Green Channel, Blue Channel
      • 0 <= pixel value <= 255 in each channel
    • Smiling Face Example

      • 0 = white, 1 = black
    • Step 1: Convolution

    • Step 2: Max Pooling

    • Step 3: Flattening

    • Step 4: Full Connection

    • Yann LeCun et al., 1998, Gradient-Based Learning Applied to Document Recognition http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf

    • In machine learning, a convolutional neural network (CNN, or ConvNet) is a class of deep, feed-forward artificial neural networks that has been successfully applied to analyzing visual imagery.

    • CNNs use a variation of multilayer perceptrons designed to require minimal preprocessing. They are also known as shift invariant or space invariant artificial neural networks (SIANN), based on their shared-weights architecture and translation invariance characteristics.

    • Convolutional networks were inspired by biological processes: the connectivity pattern between neurons resembles the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field.

    • CNNs use relatively little pre-processing compared to other image classification algorithms. This means that the network learns the filters that in traditional algorithms were hand-engineered. This independence from prior knowledge and human effort in feature design is a major advantage.

    • They have applications in image and video recognition, recommender systems, and natural language processing.
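
The pixel arrays described in this section (2-D for black and white, 3-D for RGB) can be sketched in plain Python. The pixel values are made up for illustration, `shape` and `channel` are hypothetical helper names, and the sketch assumes the usual 0 = black, 255 = white grayscale convention rather than the simplified 0/1 coding of the smiling face example.

```python
# A 2x2 grayscale image: one intensity per pixel, from 0 (black) to 255 (white).
gray = [
    [0, 255],
    [128, 64],
]

# A 2x2 colour image: every pixel holds [red, green, blue], so the array is 3-D.
rgb = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
]


def shape(array):
    """Return the dimensions of a nested list, e.g. (2, 2, 3)."""
    dims = []
    while isinstance(array, list):
        dims.append(len(array))
        array = array[0]
    return tuple(dims)


def channel(image, index):
    """Extract one colour channel (0 = red, 1 = green, 2 = blue) as a 2-D array."""
    return [[pixel[index] for pixel in row] for row in image]


print(shape(gray))      # (2, 2)
print(shape(rgb))       # (2, 2, 3)
print(channel(rgb, 0))  # [[255, 0], [0, 255]]
```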

  • Design of CNN

    • A CNN consists of an input and an output layer, as well as multiple hidden layers. The hidden layers are either convolutional, pooling, or fully connected.
    • Convolutional
      • Convolutional layers apply a convolution operation to the input, passing the result to the next layer. The convolution emulates the response of an individual neuron to visual stimuli.
      • Each convolutional neuron processes data only for its receptive field. Tiling the receptive fields across the image allows CNNs to tolerate translation of the input image; tolerance to rotation or perspective distortion is only approximate and has to be learned from the data.
      • The convolution operation shares weights across positions, which reduces the number of free parameters and improves generalization. Fewer parameters make deep networks more practical to train with backpropagation, although this alone does not remove the vanishing or exploding gradient problem of traditional multi-layer networks with many layers.
    • Pooling
      • Convolutional networks may include local or global pooling layers, which combine the outputs of neuron clusters at one layer into a single neuron in the next layer.
      • For example, max pooling uses the maximum value from each of a cluster of neurons at the prior layer.
      • Another example is average pooling, which uses the average value from each of a cluster of neurons at the prior layer.
    • Fully connected
      • Fully connected layers connect every neuron in one layer to every neuron in another layer. This is in principle the same as the traditional multi-layer perceptron neural network (MLP).
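
To make the "fewer free parameters" point concrete, here is a rough count for one layer of each kind. The 300x300 RGB input, the 100 neurons or filters, and the helper names `dense_params` and `conv_params` are assumptions chosen for illustration only.

```python
def dense_params(n_inputs, n_neurons):
    """Fully connected layer: one weight per input per neuron, plus one bias per neuron."""
    return n_inputs * n_neurons + n_neurons


def conv_params(kernel_h, kernel_w, in_channels, n_filters):
    """Convolutional layer: one small shared kernel per filter, plus one bias per filter.

    The count does not depend on the image size because the same kernel
    is reused at every position.
    """
    return kernel_h * kernel_w * in_channels * n_filters + n_filters


height, width, channels = 300, 300, 3  # hypothetical RGB input image

print(dense_params(height * width * channels, 100))  # 27000100
print(conv_params(3, 3, channels, 100))              # 2800
```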
  • Step 1 - Convolution

    • $(f * g)(t) \overset{\text{def}}{=} \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$
    • Mathematics of CNNs -> Jianxin Wu, 2017, Introduction to Convolutional Neural Networks - https://cs.nju.edu.cn/wujx/paper/CNN.pdf
    • What is a convolution
      • Feature Detector ~ Filter ~ Kernel (usually a 3x3 matrix, sometimes 5x5)
      • Input Image (x) Feature Detector = Feature Map
        • Place the feature detector over a patch of the image, multiply the matching values element-wise, and sum the products to get one cell of the feature map
        • The detector moves with a stride of one pixel here; the stride can be changed (1, 2, 3), and these notes take two as the conventional choice
        • Iterate over the whole image; the result is highest where the detector matches the patch
        • What we have done in creating the Feature Map (also called Convolved Feature or Activation Map)
          • Reduced the size (a stride of 2 gives an even smaller map)
          • Made the image smaller so it is easier and faster to process, e.g. a 300x300 image (300 squares per side) is reduced by the feature detector
          • But do we lose information? Some is lost, of course, but the purpose of the feature detector is to detect the features that are integral to the image: the highest values in the feature map mark where the pattern matched exactly. Features are how we see and recognize things (feathers, nose, eyes, cheetahs), and that is what the feature detector helps us preserve.
      • Input Image
        • We create many feature maps to obtain our first convolution layer - to preserve features
        • Through training the network determines which features are important and which filters to apply; each feature map is the result of applying one filter
        • "Feature detector" is a better-suited name than "filter", since the purpose is to detect features
      • Example
        • Taj Mahal image run through image-processing filters (convolution matrices); the values below list the non-zero kernel entries
        • Sharpen (-1, -1, 5, -1, -1)
        • Blur (1,1,1,1,1)
        • Edge Enhance
        • Edge Detect (1, 1, -4, 1, 1)
        • Emboss - asymmetrical
        • Feature maps, using different feature detectors
      • The beauty of neural networks: they process so many things and decide which features are important to them, without us having to supply the intuition or an explanation
    • The main takeaway of convolution is to find features in images and put them in a feature map, which still preserves the spatial relationships between the pixels; this is very important. Most of the time the features a neural network uses mean nothing to humans, but nevertheless they work.
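
A minimal sketch of the convolution step in plain Python, assuming no padding. The 5x5 binary image and the 3x3 feature detector are made-up values, and `convolve2d` is a hypothetical helper name, not a library call. As in most CNN libraries the kernel is not flipped, so strictly speaking this computes a cross-correlation.

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the feature map."""
    k_h, k_w = len(kernel), len(kernel[0])
    # Output size per dimension: (input - kernel) // stride + 1
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Element-wise multiply the patch with the kernel, then sum.
            total = 0
            for a in range(k_h):
                for b in range(k_w):
                    total += image[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(total)
        feature_map.append(row)
    return feature_map


image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
feature_detector = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

print(convolve2d(image, feature_detector))            # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
print(convolve2d(image, feature_detector, stride=2))  # [[4, 4], [2, 4]]
```

The stride-2 result is smaller, and the highest values (4) mark the patches where the detector matched best.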
  • Step 1(B) - ReLU Layer

    • Additional step on top of the convolution step
    • Input Image -> Convolutional Layer -> Rectifier Function
    • Increase non-linearity in the network; the rectifier acts as the function that breaks up linearity
    • Reason why: images are highly non-linear, and transitions between adjacent pixels are often non-linear; this is true of lots of images
    • Convolution (feature detection) carries the risk that we create something linear
    • e.g. the Rectified Linear Unit removes all the black (negative values) from a feature map and keeps only non-negative values
    • Very mathematical concept to really explain
    • C.-C. Jay Kuo, 2016, Understanding Convolutional Neural Networks with A Mathematical Model https://arxiv.org/pdf/1609.04112.pdf
      • Explains ReLU better in a mathematical sense
      • First question it answers: why is a non-linear activation function essential at the filter output of all intermediate layers?
    • Kaiming He et al., 2015, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
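
The rectifier itself is a one-liner. A small sketch with a made-up feature map that contains negative values (`relu` is a hypothetical helper name):

```python
def relu(feature_map):
    """Apply max(0, x) to every cell: negatives become 0, the rest is unchanged."""
    return [[max(0, value) for value in row] for row in feature_map]


feature_map = [
    [-3, 2, 0],
    [1, -1, 4],
]

print(relu(feature_map))  # [[0, 2, 0], [1, 0, 4]]
```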
  • Step 2 - Max Pooling

    • What is pooling? And why do we need it?
      • E.g. Cheetahs (three images)
        • Shown properly
        • Rotated
        • Squashed
        • Lots of cheetahs in different poses
        • All looking in different directions, all different: lighting, landscape
        • If the neural network looks for one exact distinctive feature in one exact place, it will miss most of them
      • Our neural network needs spatial invariance: it should not care where the feature is, or whether it is tilted, closer, or further apart relative to other features; with a bit of distortion it still needs the flexibility to find the feature
    • How does pooling work?
      • Feature Map -> Max Pooling -> Pooled Feature Map
      • Take a box of 2x2 pixels (it doesn't have to be 2x2), place it in the top-left corner of the feature map, record the maximum value in the box and disregard the other three, then move the box by the stride and repeat, putting the max values in the Pooled Feature Map
      • We are still able to preserve the features (the large numbers are where the match to the feature was closest)
      • With a 2x2 box we get rid of 75% of the information, which is not important for the feature, by disregarding three pixels out of four
      • Also, because we take the maximum, we are accounting for distortion
      • With pooling we preserve features, account for distortion, reduce the size, and introduce spatial invariance; moreover, we reduce the number of parameters going into the final layers, which helps prevent overfitting, just as humans look at features and not at unnecessary noise
      • Why max pooling? Why a stride of two?
        • Read here: Dominik Scherer et al., 2010, Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition http://ais.uni-bonn.de/papers/icann2010_maxpool.pdf
        • The beauty of this paper is that it is very simple: a great place to start and very easy to read once you know convolution & pooling
        • About a 20-minute read; part two (related work) can even be skipped
        • Sub-sampling ~ average pooling (instead of the maximum we take here, it takes the average of the values); sub-sampling is a generalized approach to mean pooling
      • Recap
        • Input Image -> Convolution Layer -> Pooling Layer
      • http://scs.ryerson.ca/~aharley/vis/conv/flat.html
        • Pooling and downsampling are used synonymously here
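
A sketch of max pooling and mean pooling with a 2x2 box and a stride of two. The 4x4 feature map is made up, `pool2d` and `mean` are hypothetical helper names, and this version simply drops windows that would run over the edge, which is one of several possible conventions.

```python
def pool2d(feature_map, size=2, stride=2, combine=max):
    """Slide a size x size box over the map and keep one value per box.

    `combine` decides what is kept: `max` for max pooling,
    an averaging function for mean pooling.
    """
    pooled = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [
                feature_map[i + a][j + b]
                for a in range(size)
                for b in range(size)
            ]
            row.append(combine(window))
        pooled.append(row)
    return pooled


def mean(values):
    """Average of a list of numbers."""
    return sum(values) / len(values)


feature_map = [
    [1, 0, 2, 3],
    [4, 6, 6, 8],
    [3, 1, 1, 0],
    [1, 2, 2, 4],
]

print(pool2d(feature_map))                # [[6, 8], [3, 4]]
print(pool2d(feature_map, combine=mean))  # [[2.75, 4.75], [1.75, 1.75]]
```

Sixteen values go in and four come out, which is the 75% reduction mentioned above.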
  • Step 3: Flattening

    • Pooled Feature Map (After applying convolution + pooling)
    • Pooled Feature Map -> Flattening (row by row into one column, for processing by an ANN)
    • Pooling Layer -> Flattening -> Vector of Inputs for the ANN that follows
    • Input Layer -> Convolutional Layer -> Pooling -> Flattening
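
Flattening can be sketched as reading every pooled feature map row by row into one long list. The two 2x2 pooled maps are made-up values and `flatten` is a hypothetical helper name.

```python
def flatten(pooled_maps):
    """Concatenate all pooled feature maps, row by row, into one input vector."""
    return [value for fmap in pooled_maps for row in fmap for value in row]


pooled_maps = [
    [[6, 8], [3, 4]],  # pooled feature map from the first feature detector
    [[1, 0], [2, 5]],  # pooled feature map from the second feature detector
]

print(flatten(pooled_maps))  # [6, 8, 3, 4, 1, 0, 2, 5]
```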
  • Step 4: Full Connection

    • In this step we add a whole ANN to the Convolutional Neural Network (CNN)
    • Flattening -> Input Layer -> Fully Connected Layer (a more specific kind of hidden layer in which every neuron is connected) -> Output Value
    • Pass the flattened values to the ANN, which combines the features further to improve the prediction
    • Example: Flattening -> 5 attributes (input) -> hidden layers of six and eight neurons -> output (Dog and Cat)
      • Classification output: one output neuron per class (with only two classes a single binary output would also work)
      • More than two categories -> one neuron per category
      • Prediction is made (e.g. 80% dog when the image is actually a cat)
      • Error is calculated with a Loss Function (Cost Function) -> e.g. cross-entropy or mean squared error
      • The error is backpropagated through the network to minimize that function: the weights of the synapses are adjusted and, in a CNN, the feature detectors are adjusted too, all via gradient descent
      • Data goes from the very start to the very end, then the error is backpropagated
      • Same story as in an ANN, just a bit longer
    • How do these two output neurons work? Rather than one?
      • Dog neuron -> what weights should be assigned to all of the synapses connected to the dog neuron?
      • The neurons in the previous layer hold features of the image (heavily processed by the ANN) with values from 0.1 to 1.0. A neuron such as "floppy ears" lights up very often when the image is a dog, so over many samples and epochs the dog neuron learns to place more significance on it. Eyebrow and floppy-ears neurons contribute well to the classification the dog neuron is looking for, while the cat neuron learns to ignore neurons related to dogs (e.g. floppy ears)
      • Cat neuron -> (whiskers, small size, pointy ears, cat eyes)
      • Features are propagated to the output as distinctive features of a class. The network is trained by backpropagation over 1000s of iterations, which also adjusts the feature detectors; features that do not help are disregarded, and the final layer of neurons ends up holding combinations of features that indicate dogs or cats.
      • Example (dog) -> the output neurons have no idea what the image is, but they have learned which neurons to listen to: dog 0.95, cat 0.05
      • Example (cat) -> high and low the other way round: dog 0.21, cat 0.79
      • Voting is used for the final fully connected layer: the neurons get to vote, and the learned weights vary based on each neuron's importance relative to the output neuron
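
The "voting" in the final layer is a weighted sum per output neuron. A minimal sketch, assuming three already-processed features and hand-picked weights; the feature names, weights, and the `dense` helper are illustrative only, and in a real network the weights come out of training.

```python
def dense(inputs, weights, biases):
    """One fully connected layer: every input feeds every output neuron."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        for neuron_weights, bias in zip(weights, biases)
    ]


# Hypothetical activations of the last hidden layer for one image:
# [floppy ears, whiskers, big nose]
features = [0.9, 0.1, 0.8]

# One row of weights per output neuron (made-up values, not trained).
weights = [
    [2.0, -1.0, 1.5],   # dog neuron listens to floppy ears and big nose
    [-1.0, 2.0, -0.5],  # cat neuron listens to whiskers
]
biases = [0.0, 0.0]

dog_score, cat_score = dense(features, weights, biases)
print(dog_score > cat_score)  # True: the dog neuron collects more votes
```

These raw scores are not probabilities yet; that is what Softmax is for.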
  • Summary

  1. Start with an input image, to which we apply multiple different feature detectors (filters) to create the feature maps that make up our convolution layer.
  2. On top of the convolution layer we apply ReLU to remove linearity (increase non-linearity).
  3. Apply a pooling layer to our convolutional layer: for every single feature map we create a pooled feature map. Pooling has lots of advantages; the main one is to make sure that we have spatial invariance in our images (with tilts, twists, or distortion we can still pick up the features). Pooling also reduces the size of our images and helps avoid overfitting of our model to the data by simply getting rid of a lot of that data. At the same time, pooling preserves the features.
  4. Flatten all of the pooled images into one long vector or column of values and input that into an ANN.
  5. Fully connected Artificial Neural Network, where all of the features are processed through the network and the final fully connected layer performs the voting towards the classes we're after. Trained through forward propagation and backpropagation.
  6. Lots of epochs, well-defined neural network; the weights are trained, but the feature detectors are also trained and adjusted in the same gradient descent process, allowing us to create the best feature maps. We get a fully trained convolutional neural network which can recognize images and classify them.

Adit Deshpande, 2016, The 9 Deep Learning Papers You Need To Know About (Understanding CNNs Part 3) - https://adeshpande3.github.io/adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html

  • An overview of nine different CNN papers to study further

  • Lots of value in seeing how they structured their CNNs; helps with the architecture challenge of getting to the best possible design and performance.

  • Softmax and Cross-Entropy

    • Convolutional Neural Network -> after training, pops out a probability for dog (0.95) and cat (0.05)
    • How can these two values add up to one? As far as we know the two output neurons are not connected, so how would each know the other's value?
    • On their own they wouldn't; applying the Softmax function brings these values between 0 and 1 so that they sum to 1
    • Softmax squashes a K-dimensional vector of real values so that each entry lies between 0 and 1 and the entries add up to 1; this works because each exponentiated value is divided by the sum of all of them
    • Makes sense to introduce Softmax to a CNN (otherwise we could get something weird like 80% for dogs and 45% for cats)
    • Softmax comes hand-in-hand with Cross Entropy function
    • What is the Cross Entropy Function?
      • Before: the mean squared error (MSE) function was the cost function in the ANN, and minimizing it was the goal
      • We can still use MSE, but the Cross Entropy function is better in a CNN
      • Here it is called the Loss Function (vs. Cost Function)
      • It is the loss function we want to minimize to maximize the performance of the network
    • Example of how Cross-Entropy & Softmax work together: predicted Dog 0.9, Cat 0.1 => cross entropy $H(p, q) = -\sum_x p(x) \log q(x)$
    • Example: two networks, NN1 and NN2, output a dog probability for three images
      • NN1 vs NN2
      • Image of a dog -> NN1 (0.9), NN2 (0.6)
      • Image of a cat -> NN1 (0.1), NN2 (0.3)
      • Image of a dog -> NN1 (0.4), NN2 (0.1)
    • Classification Error
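      • Did you get it right or not, regardless of probability?
      • Not a good measure, especially for backpropagation
      • NN1 0.33 vs NN2 0.33 (each network gets one image out of three wrong)
    • Mean Squared Error
      • NN1 0.25 vs NN2 0.71
      • More accurate: shows that NN1 has the lower error
    • Cross Entropy
      • NN1 0.38 vs NN2 1.06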
    • Why Cross Entropy > Mean Squared Error
      • Not just about numbers
      • For instance, at the very start of backpropagation, if your output is very small compared with the target, the MSE gradient is very low and it is hard for the neural network to adjust the weights
      • With Cross Entropy, the logarithm helps the network assess even small errors and act on them
      • Intuitive approach: you want an outcome of 1, but right now you are at 0.000001 and you improve to 0.001. In terms of squared error that is hardly any improvement, but with Cross Entropy you see that the network has improved significantly, from one millionth to one thousandth, which is a big improvement in relative terms. Cross Entropy helps the NN get to the optimal state.
      • Cross Entropy is the preferred method only for classification; for regression MSE is better
      • Geoffrey Hinton on Softmax and Cross-Entropy
      • Cross Entropy is tailored for classification and CNNs, and comes hand-in-hand with Softmax
    • Rob DiPietro, 2016, A Friendly Introduction to Cross-Entropy Loss https://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/
    • Peter Roelants, 2016, How to implement a neural network Intermezzo 2 - https://peterroelants.github.io/posts/neural_network_implementation_intermezzo02/
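
A sketch of Softmax and Cross-Entropy in plain Python, followed by a check of the NN1 vs NN2 comparison from these notes. Two assumptions are made to reproduce the noted figures: the recorded values are dog probabilities with cat = 1 - dog, and the squared error is summed over both outputs before averaging over the three images. The function names are hypothetical helpers, not a library API.

```python
import math


def softmax(scores):
    """Turn raw output scores into probabilities that sum to 1."""
    top = max(scores)  # subtracting the max avoids overflow, result is unchanged
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted):
    """H(p, q) = -sum over x of p(x) * log q(x), for one sample."""
    return -sum(p * math.log(q) for p, q in zip(target, predicted) if p > 0)


def compare(labels, dog_probs):
    """Return (classification error, mean squared error, mean cross entropy)."""
    n = len(labels)
    wrong = squared = entropy = 0.0
    for label, p_dog in zip(labels, dog_probs):
        target = [label, 1 - label]      # [dog, cat] one-hot target
        predicted = [p_dog, 1 - p_dog]   # assumption: cat = 1 - dog
        wrong += (p_dog >= 0.5) != (label == 1)
        squared += sum((t - q) ** 2 for t, q in zip(target, predicted))
        entropy += cross_entropy(target, predicted)
    return round(wrong / n, 2), round(squared / n, 2), round(entropy / n, 2)


labels = [1, 0, 1]  # dog, cat, dog
nn1 = [0.9, 0.1, 0.4]
nn2 = [0.6, 0.3, 0.1]

print(compare(labels, nn1))  # (0.33, 0.25, 0.38)
print(compare(labels, nn2))  # (0.33, 0.71, 1.06)

probabilities = softmax([2.9, -1.1])  # hypothetical raw dog and cat scores
print(abs(sum(probabilities) - 1.0) < 1e-9)  # True
```

Both networks have the same classification error, while mean squared error and cross entropy both show NN1 as the better network, with cross entropy separating the two more strongly.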