
Short Notes

Other Points
  • Tensors are the standard way of representing data in deep learning. They are just multidimensional arrays, an extension of two-dimensional tables (matrices) to data with higher dimensions.

  • In GD, you have to run through ALL the samples in your training set to do a single update of a parameter in a particular iteration. In SGD, on the other hand, you use ONLY ONE sample (or a SUBSET of samples) from your training set to do the update of a parameter in a particular iteration. If you use a SUBSET, it is called Mini-batch Stochastic Gradient Descent.
    SGD often converges much faster than GD, but the error function is not as well minimized as in the case of GD. In most cases, the close approximation that you get in SGD for the parameter values is enough, because the parameters reach near-optimal values and keep oscillating there. The error curve is noisier in SGD rather than smooth as in GD.
    Mini-batch gradient descent uses n data points (instead of 1 sample as in SGD) at each iteration; a minimal sketch follows.
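
    A minimal sketch of the difference, assuming a toy single-weight linear-regression objective (the data, learning rate, and batch size below are illustrative assumptions, not from the notes):

```python
import numpy as np

# Toy data: y = 3x + noise (made-up for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=1000)

def gradient(w, X_batch, y_batch):
    # Gradient of mean squared error for a single-weight linear model
    pred = X_batch[:, 0] * w
    return 2 * np.mean((pred - y_batch) * X_batch[:, 0])

lr = 0.1

# Batch GD: every update uses ALL samples
w_gd = 0.0
for _ in range(100):
    w_gd -= lr * gradient(w_gd, X, y)

# Mini-batch SGD: every update uses a small random subset
w_sgd, batch_size = 0.0, 32
for _ in range(100):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    w_sgd -= lr * gradient(w_sgd, X[idx], y[idx])

print(w_gd, w_sgd)  # both approach 3; the SGD path is noisier
```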

  • The Hessian matrix is the square matrix of second-order partial derivatives of a function (if they exist and are continuous).
    The Jacobian matrix is the matrix of first-order partial derivatives of a function (if they exist); it is square only when the function's input and output dimensions match.
    The trace of the Hessian matrix is known as the Laplacian operator, denoted by ∇².
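
    A small worked example using SymPy purely for illustration (the choice of function here is an assumption):

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 * y + sp.sin(y)             # a scalar function of two variables

# Jacobian: first-order partial derivatives (here a 1x2 row, since f is scalar-valued)
J = sp.Matrix([f]).jacobian([x, y])

# Hessian: square matrix of second-order partial derivatives
H = sp.hessian(f, (x, y))

# Laplacian = trace of the Hessian
laplacian = H.trace()
print(J)          # [2*x*y, x**2 + cos(y)]
print(H)          # [[2*y, 2*x], [2*x, -sin(y)]]
print(laplacian)  # 2*y - sin(y)
```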

  • Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.
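
    One common way to carve out such a set, sketched with scikit-learn's train_test_split (the 60/20/20 split and the synthetic data are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data purely for illustration
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# First split off a held-out test set, then split the rest into train/validation (60/20/20)
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=42)

# Tune hyperparameters against (X_val, y_val); keep (X_test, y_test) untouched
# until the very end for an unbiased estimate of generalization error.
```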

  • The difficulty of training the early layers of a deep network comes from the problem of the Vanishing Gradient. In order to understand this, you'll need some knowledge about how a feed-forward neural network learns. For a conventional feed-forward neural network, the weight update applied at a particular layer is a multiple of the learning rate, the error term from the previous layer, and the input to that layer. Thus, the error term for a particular layer is roughly a product of all the previous layers' errors. When dealing with activation functions like the sigmoid, the small values of its derivative (occurring in the error term) get multiplied many times as we move towards the earlier layers. As a result, the gradient almost vanishes as we move towards the earlier layers, and it becomes difficult to train them.
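
    A quick numerical illustration of why the product of sigmoid derivatives shrinks with depth (the depths chosen are arbitrary assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)   # maximum possible value is 0.25, at z = 0

# Even in the best case (derivative = 0.25 at every layer), the back-propagated
# factor shrinks geometrically with depth.
for depth in (1, 5, 10, 20):
    print(depth, 0.25 ** depth)
# 20 layers -> ~9e-13: the gradient reaching the earliest layers is effectively zero.
```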

  • Covariance is a measure of how two random variables change together and is used to calculate the correlation between variables. Variance refers to the spread of a data set, i.e. how far the values are from the mean.
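
    A small numerical check with NumPy (the arrays are made up for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly 2 * x

print(np.var(x))                 # variance: spread of x around its mean
print(np.cov(x, y)[0, 1])        # covariance: how x and y change together
print(np.corrcoef(x, y)[0, 1])   # correlation: covariance scaled to [-1, 1]
```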

  • Both the RMSE and the MAE are ways to measure the distance between two vectors: the vector of predictions and the vector of target values. Various distance measures, or norms, are possible:

    • Computing the root of a sum of squares (RMSE) corresponds to the Euclidean norm: it is the notion of distance you are familiar with. It is also called the ℓ2 norm (...)
    • Computing the sum of absolutes (MAE) corresponds to the ℓ1 norm (...). It is sometimes called the Manhattan norm because it measures the distance between two points in a city if you can only travel along orthogonal city blocks.
    • More generally, (...) the ℓ0 norm just gives the number of non-zero elements in the vector, and ℓ∞ gives the maximum absolute value in the vector.
    • The higher the norm index, the more it focuses on large values and neglects small ones. This is why the RMSE is more sensitive to outliers than the MAE. But when outliers are exponentially rare (like in a bell-shaped curve), the RMSE performs very well and is generally preferred.
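
    The same idea in code, using NumPy's norm with different orders (the error vector is made up for illustration):

```python
import numpy as np

errors = np.array([1.0, -2.0, 0.0, 4.0])    # prediction minus target, made-up values

l1   = np.linalg.norm(errors, ord=1)        # MAE-style: sum of absolute values -> 7.0
l2   = np.linalg.norm(errors, ord=2)        # RMSE-style: root of sum of squares -> ~4.58
linf = np.linalg.norm(errors, ord=np.inf)   # maximum absolute value -> 4.0
l0   = np.count_nonzero(errors)             # number of non-zero elements -> 3

# The single large error (4.0) dominates the l2 norm more than the l1 norm,
# which is why RMSE is more sensitive to outliers than MAE.
print(l0, l1, l2, linf)
```
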
  • To reduce dimensionality, we can separate the numerical and categorical variables and remove the correlated variables. For numerical variables, we’ll use correlation. For categorical variables, we’ll use the chi-square test.
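
    One way this is commonly done with pandas and scikit-learn; the column names, thresholds, and synthetic data below are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2

# Made-up frame: two numeric columns (one nearly redundant) and one categorical column
df = pd.DataFrame({"height_cm": np.random.normal(170, 10, 200)})
df["height_in"] = df["height_cm"] / 2.54 + np.random.normal(0, 0.1, 200)  # highly correlated
df["city"] = np.random.choice(["A", "B", "C"], 200)
target = np.random.randint(0, 2, 200)

# Numerical variables: inspect pairwise correlation and drop one of each highly correlated pair
corr = df[["height_cm", "height_in"]].corr().abs()
print(corr)   # off-diagonal close to 1 -> keep only one of the two

# Categorical variables: chi-square test against the target (requires non-negative encodings)
encoded = pd.get_dummies(df["city"])
chi2_scores, p_values = chi2(encoded, target)
print(p_values)  # large p-value -> little association with the target -> candidate to drop
```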

  • R-squared is the proportion of variance explained, meaning the proportion of variance in the observed data that is explained by the model, or the reduction in error over the null model.
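
    Written out, R² = 1 − SS_res / SS_tot; a tiny check against scikit-learn (the values are made up):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.2])

ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variance
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variance (null model: predict the mean)
print(1 - ss_res / ss_tot, r2_score(y_true, y_pred))  # identical values
```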

  • When training an SVM with the Radial Basis Function (RBF) kernel, two parameters must be considered: C and gamma. The parameter C, common to all SVM kernels, trades off misclassification of training examples against simplicity of the decision surface. A low C makes the decision surface smooth, while a high C aims at classifying all training examples correctly. gamma defines how much influence a single training example has. The larger gamma is, the closer other examples must be to be affected.
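
    A sketch of how those two knobs appear in scikit-learn's SVC, tuned with a small grid search (the dataset and grid values are arbitrary assumptions):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Low C / low gamma -> smoother, simpler boundary; high C / high gamma -> the boundary
# bends around individual training points (risk of overfitting).
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```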

  • Ensemble learning (or "ensembling") is simply the process of combining several models to solve a prediction problem, with the goal of producing a combined model that is more accurate than any individual model. For classification problems, the combination is often done by majority vote. For regression problems, the combination is often done by taking an average of the predictions. For ensembling to work well, the individual models must meet two conditions:

    • Models should be accurate (they must outperform random guessing)
    • Models should be independent (their predictions are not correlated with one another)
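
    A minimal illustration with scikit-learn's VotingClassifier; the base models chosen here are assumptions, picked to be reasonably accurate and not too correlated with one another:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Three fairly different model families -> predictions less correlated with one another
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(max_depth=5)),
        ("nb", GaussianNB()),
    ],
    voting="hard",   # majority vote; for regression, average predictions instead
)
print(cross_val_score(ensemble, X, y, cv=5).mean())
```
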
  • Random Forests is a slight variation of bagged trees that has even better performance! Here's how it works:

    • Exactly like bagging, we create an ensemble of decision trees using bootstrapped samples of the training set.

    • However, when building each tree, each time a split is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors. The split is only allowed to use one of those m predictors.

    • Notes:

    • A new random sample of predictors is chosen for every single tree at every single split.

    • For classification, m is typically chosen to be the square root of p. For regression, m is typically chosen to be somewhere between p/3 and p.

    • What's the point?

    • Suppose there is one very strong predictor in the data set. When using bagged trees, most of the trees will use that predictor as the top split, resulting in an ensemble of similar trees that are "highly correlated".

    • Averaging highly correlated quantities does not significantly reduce variance (which is the entire goal of bagging).

    • By randomly leaving out candidate predictors from each split, Random Forests "decorrelates" the trees, such that the averaging process can reduce the variance of the resulting model.
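
    The m-out-of-p idea maps directly to the max_features argument of scikit-learn's RandomForestClassifier (the dataset below is synthetic, for illustration only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=16, n_informative=5, random_state=0)

# max_features="sqrt": at every split, only sqrt(p) randomly chosen predictors are
# considered, which decorrelates the trees. (For regression, p/3 is a common choice.)
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())
```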

  • Linear models can overfit if you include irrelevant features.
    Question: Why would that be the case?
    Answer: Because it will learn a coefficient for any feature you feed into the model, regardless of whether that feature carries signal or just noise.
    This is especially a problem when p (number of features) is close to n (number of observations), because that model will naturally have high variance.
    Linear models can also overfit when the included features are highly correlated.
    "...coefficient estimates for Ordinary Least Squares rely on the independence of the model terms. When terms are correlated and the columns of the design matrix X have an approximate linear dependence, the design matrix becomes close to singular and as a result, the least-squares estimate becomes highly sensitive to random errors in the observed response, producing a large variance."
    Linear models can also overfit if the coefficients are too large.
    Question: Why would that be the case?
    Answer: Because the larger the absolute value of the coefficient, the more power it has to change the predicted response. Thus it tends toward high variance, which can result in overfitting.
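
    A small demonstration of the correlated-features point: two nearly identical columns give a near-singular design matrix, so the individual OLS coefficients become huge and unstable from sample to sample (the data-generating process is a made-up assumption):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

n = 50
# Refit on several draws from the same data-generating process
for seed in range(3):
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=1e-3, size=n)     # almost perfectly correlated with x1
    y = 3 * x1 + rng.normal(scale=0.5, size=n)   # the true effect comes only through x1
    model = LinearRegression().fit(np.column_stack([x1, x2]), y)
    print(model.coef_)  # the individual coefficients swing wildly between draws,
                        # even though their sum stays close to 3 -> high variance
```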

Autoencoder

For higher dimensional data, autoencoders are capable of learning a complex representation of the data (manifold) which can be used to describe observations in a lower dimensionality and correspondingly decoded into the original input space.
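
A minimal dense autoencoder sketch in Keras, assuming 784-dimensional inputs (e.g. flattened 28x28 images) and a 32-dimensional bottleneck; the layer sizes and data shapes are illustrative assumptions:

```python
from tensorflow.keras import layers, models

input_dim, latent_dim = 784, 32   # assumed sizes, e.g. flattened 28x28 images

inputs = layers.Input(shape=(input_dim,))
# Encoder: compress the input onto a lower-dimensional representation (the manifold)
encoded = layers.Dense(128, activation="relu")(inputs)
encoded = layers.Dense(latent_dim, activation="relu")(encoded)
# Decoder: map the code back into the original input space
decoded = layers.Dense(128, activation="relu")(encoded)
decoded = layers.Dense(input_dim, activation="sigmoid")(decoded)

autoencoder = models.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=256)  # the input is also the target
```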

Tolerance

Tolerance (1 / VIF) is used as an indicator of multicollinearity. It is the percentage of the variance in a predictor that cannot be accounted for by the other predictors. Large values of tolerance are desirable.
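
A sketch of computing VIF (and tolerance as 1 / VIF) with statsmodels; the predictors below are made up, with x2 deliberately correlated with x1:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100)})
df["x2"] = df["x1"] * 0.9 + rng.normal(scale=0.3, size=100)  # correlated with x1
df["x3"] = rng.normal(size=100)

X = add_constant(df)
for i, col in enumerate(X.columns):
    if col == "const":
        continue
    vif = variance_inflation_factor(X.values, i)
    print(col, "VIF:", round(vif, 2), "tolerance:", round(1 / vif, 2))
# Low tolerance (high VIF) for x1 and x2 signals multicollinearity.
```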
