From ba0088acde09473249aa04d366ee2bf94afa6391 Mon Sep 17 00:00:00 2001
From: sergiomarchio <83197607+sergiomarchio@users.noreply.github.com>
Date: Tue, 12 Nov 2024 00:42:32 -0300
Subject: [PATCH] linear calculations nb (#46)

---
 guides/en/05. Linear Layer.ipynb | 7 +-
 guides/en/05b. Linear Calculations.ipynb | 331 ++++++++++++++++++
 guides/es/05. Capa Linear.ipynb | 12 +-
 "guides/es/05b. C\303\241lculos Linear.ipynb" | 331 ++++++++++++++++++
 4 files changed, 673 insertions(+), 8 deletions(-)
 create mode 100644 guides/en/05b. Linear Calculations.ipynb
 create mode 100644 "guides/es/05b. C\303\241lculos Linear.ipynb"

diff --git a/guides/en/05. Linear Layer.ipynb b/guides/en/05. Linear Layer.ipynb
index 71d0708..40c286c 100644
--- a/guides/en/05. Linear Layer.ipynb
+++ b/guides/en/05. Linear Layer.ipynb
@@ -25,7 +25,7 @@
 "\n",
 "## Case with 1 Input and 1 Output\n",
 "\n",
- "In this case the math is similar to the well known 2D line equation $y = wx+b$. In this case $w$, $x$, $b$ and $y$ are all scalars, and we are just multiplying $x$ by $w$ and then adding $b$. \n",
+ "In this case the math is similar to the well known 2D line equation $y = wx+b$, where $w$, $x$, $b$ and $y$ are all scalars, and we are just multiplying $x$ by $w$ and then adding $b$. \n",
 "\n",
 "\n",
 "\n",
@@ -43,7 +43,7 @@
 "\n",
 "Note that:\n",
 "* $x w$ is now a matrix multiplication\n",
- "* The order between $x$ and $w$ matters because matrix multiplication is not associative\n",
+ "* The order between $x$ and $w$ matters because matrix multiplication is not commutative\n",
 " * A $1×I$ array ($x$) multiplied by another $I×O$ array ($w$) results in a $1×O$ array ($y$)\n",
 " * The reverse definition, $y=wx$, would require that $x$ and $y$ be column vectors, or that $w$ has size $O×I$,\n",
 "\n",
@@ -78,6 +78,7 @@
 "source": [
 "# Create a Linear layer with 2 input and 3 output values\n",
 "# Initialize it with values sampled from a normal distribution\n",
+ "# with mean 0 and standard deviation 1e-12\n",
 "\n",
 "std = 1e-12\n",
 "input_dimension = 2\n",
@@ -93,7 +94,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "# Forward Method\n",
+ "# Forward method\n",
 "\n",
 "Now that we know how to create and initialize `Linear` layer objects, let's move on to the `forward` method, which can be found in the `edunn/models/linear.py` file.\n",
 "\n",
diff --git a/guides/en/05b. Linear Calculations.ipynb b/guides/en/05b. Linear Calculations.ipynb
new file mode 100644
index 0000000..fc79bb8
--- /dev/null
+++ b/guides/en/05b. Linear Calculations.ipynb
@@ -0,0 +1,331 @@
+{
+ "cells": [
+ {
+ "metadata": {},
+ "cell_type": "markdown",
+ "source": [
+ "# Backward method\n",
+ "\n",
+ "In the `Bias` layer implementation, the formulas for the derivatives were relatively simple, and the complexity lay mostly in how to use the framework and in understanding the difference between the derivative with respect to the input and the one with respect to the parameters.\n",
+ "\n",
+ "The `Linear` layer's backward method requires the calculation of $\frac{dE}{dx}$ and $\frac{dE}{dw}$. In terms of the framework, the implementation is very similar to the `Bias` layer, but the formulas for the derivatives are more complex.\n",
+ "\n",
+ "First we'll assume there's only one input example $x$ to keep things simple, then we'll generalize to a batch of $N$ examples.\n"
+ ]
+ },
+ {
+ "metadata": {},
+ "cell_type": "markdown",
+ "source": [
+ "## $dE / dx$\n",
+ "\n",
+ "Let's start with $\frac{dE}{dx}$. While this is actually symmetrical to $\frac{dE}{dw}$, it's a bit easier to grasp conceptually.\n",
+ "\n",
+ "We'll work through this derivative case by case, from the simplest to the most complex, increasing the input and output dimensions.\n"
+ ]
+ },
+ {
+ "metadata": {},
+ "cell_type": "markdown",
+ "source": [
+ "### 1 input, 1 output\n",
+ "\n",
+ "When both the input and the output are 1D, $x \in R$ and $w \in R$ are scalars. Then $\frac{dE}{dy}$ is also a scalar, and following the Chain Rule:\n",
+ "\n",
+ "$\frac{dE}{dx} = \frac{dE}{dy} \frac{dy}{dx} = \frac{dE}{dy} \frac{d(wx)}{dx} = \frac{dE}{dy} w$\n"
+ ]
+ },
+ {
+ "metadata": {},
+ "cell_type": "markdown",
+ "source": [
+ "\n",
+ "### I inputs, 1 output\n",
+ "\n",
+ "With $I$ inputs and 1 output, $x$ is a vector with $I$ values, i.e. $x \in R^I$, and $w \in R^I$ is also a vector with $I$ values. We can then think of the output as the dot product between $x$ and $w$:\n",
+ "\n",
+ "$y = x . w = \sum_{i=1}^I x_i w_i$\n",
+ "\n",
+ "We have one partial derivative for each input: $\frac{dE}{dx_j}$. Given that $\frac{dE}{dy}$ is still a scalar (there's only one output) and applying the Chain Rule, we can calculate this derivative:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dx_j} \n",
+ "= \frac{dE}{dy} \frac{dy}{dx_j} \\\n",
+ "= \frac{dE}{dy} \frac{d (\sum_{i=1}^I w_i x_i)}{dx_j} \\\n",
+ "= \frac{dE}{dy} \sum_{i=1}^I \frac{d (w_i x_i) }{dx_j} \\\n",
+ "= \frac{dE}{dy} \frac{d (w_j x_j) }{dx_j} \\\n",
+ "= \frac{dE}{dy} w_j \n",
+ "$\n",
+ "\n",
+ "Then $\frac{dE}{dx_j} = \frac{dE}{dy} w_j$. We can generalize this definition and calculate the gradient with respect to the whole vector $x$ as:\n",
+ "\n",
+ "$\frac{dE}{dx} = \frac{dE}{dy} w$\n",
+ "\n",
+ "\n",
+ "#### Notes\n",
+ "\n",
+ "1. It's great that the same definition of $\frac{dE}{dx}$ works in both scenarios, whether with $1$ input or with an arbitrary number $I$ of inputs.\n",
+ "1. It's important to consider that in this context we can treat $\frac{dE}{dy}$ as a constant, since its values were calculated previously.\n",
+ "1. We could obtain $\frac{dy}{dx}$ on its own, without taking the network error into account, and then get $\frac{dE}{dx}$ by applying the Chain Rule $\frac{dE}{dx} = \frac{dE}{dy} \frac{dy}{dx}$. However, we're doing everything at once to be clearer in the context of a network's `backward` method.\n"
+ ]
+ },
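+ {
+ "metadata": {},
+ "cell_type": "markdown",
+ "source": [
+ "As a quick sanity check (this snippet is illustrative only and is not part of the `edunn` code; the toy error and variable names are made up), we can compare the formula $\frac{dE}{dx} = \frac{dE}{dy} w$ against finite differences for the case of $I$ inputs and 1 output, using $E(y) = y^2$:\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "\n",
+ "np.random.seed(0)\n",
+ "I = 4\n",
+ "x = np.random.randn(I)\n",
+ "w = np.random.randn(I)\n",
+ "\n",
+ "y = x @ w           # forward pass: a single output\n",
+ "dEdy = 2 * y        # toy error E(y) = y**2, so dE/dy = 2y\n",
+ "dEdx = dEdy * w     # formula derived above\n",
+ "\n",
+ "# numerical gradient with forward differences\n",
+ "eps = 1e-6\n",
+ "dEdx_num = np.array([(((x + eps * np.eye(I)[j]) @ w) ** 2 - y ** 2) / eps for j in range(I)])\n",
+ "print(np.allclose(dEdx, dEdx_num, atol=1e-4))  # expected: True\n",
+ "```\n"
+ ]
+ },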
+ {
+ "metadata": {},
+ "cell_type": "markdown",
+ "source": [
+ "\n",
+ "### I inputs, O outputs\n",
+ "\n",
+ "Again, let's go for the derivative with respect to one of the inputs, i.e., $\frac{dE}{dx_j}$:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dx_j} = \frac{dE}{dy} \frac{dy}{dx_j} \n",
+ "$\n",
+ "\n",
+ "In this case $y$ is a vector, so we have to add the contribution of each element of $y$ to the Chain Rule:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dx_j} \n",
+ "= \frac{dE}{dy} \frac{dy}{dx_j} \n",
+ "= \sum_{i=1}^O \frac{dE}{dy_i} \frac{dy_i}{dx_j}\n",
+ "$\n",
+ "\n",
+ "We know that $y_i$ is the dot product between column $i$ of $w$ and the input $x$, according to the matrix multiplication definition, so:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dx_j} \n",
+ "= \sum_{i=1}^O \frac{dE}{dy_i} \frac{dy_i}{dx_j} \\\n",
+ "= \sum_{i=1}^O \frac{dE}{dy_i} \frac{d(w_{:,i} \cdot x)}{dx_j} \\\n",
+ "= \sum_{i=1}^O \frac{dE}{dy_i} \frac{d(\sum_{k=1}^I w_{k,i} x_k)}{dx_j} \\\n",
+ "= \sum_{i=1}^O \frac{dE}{dy_i} ( \sum_{k=1}^I \frac{d (w_{k,i} x_k)}{dx_j} ) \\\n",
+ "= \sum_{i=1}^O \frac{dE}{dy_i} w_{j,i}\n",
+ "$\n",
+ "\n",
+ "Now, $\sum_{i=1}^O \frac{dE}{dy_i} w_{j,i}$ is simply the dot product between row $j$ of $w$ ($w_{j,:}$) and $\frac{dE}{dy}$. Then we can write:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dx_j} = \frac{dE}{dy} \cdot w_{j,:}\n",
+ "$\n",
+ "\n",
+ "Generalizing to the entire vector $x$: since each $\frac{dE}{dx_j}$ is the dot product between two vectors, where $j$ selects the row of $w$, we can write $\frac{dE}{dx}$ as a product between the entire $w$ matrix and the $\frac{dE}{dy}$ vector:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dx} = w \frac{dE}{dy}\n",
+ "$\n",
+ "\n",
+ "In this case again, the order matters: $w$ has size $I \times O$ and $\frac{dE}{dy}$ has size $O$, so $w \frac{dE}{dy}$ has size $I$ (the same as $x$).\n",
+ "\n",
+ "\n"
+ ]
+ },
+ {
+ "metadata": {},
+ "cell_type": "markdown",
+ "source": [
+ "### Batch implementation\n",
+ "\n",
+ "To implement the derivative for a batch of examples we could iterate over each example and calculate the derivatives as we did before. Alternatively, we can rewrite the derivative so that it works directly for a batch of $N$ examples (and therefore for $N$ vectors of derivatives, both for the input and for the output).\n",
+ "\n",
+ "In the batch implementation of $\frac{dE}{dx}$, $x$ is a matrix of size $N \times I$, and so is $\frac{dE}{dx}$. At the same time, since $\frac{dE}{dy}$ is the next layer's $\frac{dE}{dx}$, $\frac{dE}{dy}$ is a matrix of size $N \times O$.\n",
+ "\n",
+ "Given that, we can't multiply $w \in R^{I \times O}$ by $\frac{dE}{dy} \in R^{N \times O}$. In this case you can verify that the correct formula is $\frac{dE}{dy} w^T$, since when multiplying a matrix of size $N \times O$ by one of size $O \times I$ ($w^T$), we get a matrix of size $N \times I$: the same size as $x$:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dx} = \frac{dE}{dy} w^T\n",
+ "$"
+ ]
+ },
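+ {
+ "metadata": {},
+ "cell_type": "markdown",
+ "source": [
+ "A minimal `numpy` sketch of this batch formula (for illustration only; the array names are made up and the actual `edunn` implementation may organize the code differently):\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "\n",
+ "np.random.seed(0)\n",
+ "N, I, O = 5, 3, 2                # batch size, inputs, outputs\n",
+ "x = np.random.randn(N, I)\n",
+ "w = np.random.randn(I, O)\n",
+ "dEdy = np.random.randn(N, O)     # in practice this comes from the next layer\n",
+ "\n",
+ "dEdx = dEdy @ w.T                # one row of derivatives per example\n",
+ "print(dEdx.shape)                # (5, 3), the same shape as x\n",
+ "```\n"
+ ]
+ },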
+ {
+ "metadata": {},
+ "cell_type": "markdown",
+ "source": [
+ "## $dE/dw$\n",
+ "\n",
+ "For the gradient of the error with respect to $w$, we'll also assume at first that there's only one input example $x$, and we'll again go from the simplest to the most complex scenario.\n",
+ "\n"
+ ]
+ },
+ {
+ "metadata": {},
+ "cell_type": "markdown",
+ "source": [
+ "### 1 input, 1 output\n",
+ "\n",
+ "This is the simplest scenario, and it's symmetrical to $\frac{dE}{dx}$:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dw} = \frac{dE}{dy} \frac{dy}{dw} = \frac{dE}{dy} \frac{d (wx)}{dw} = \frac{dE}{dy} x\n",
+ "$\n"
+ ]
+ },
+ {
+ "metadata": {},
+ "cell_type": "markdown",
+ "source": [
+ "\n",
+ "### I inputs, 1 output\n",
+ "\n",
+ "Now we have $I$ inputs and 1 output.\n",
+ "\n",
+ "$y = x . w = \sum_{i=1}^I x_i w_i$\n",
+ "\n",
+ "As $w$ has $I$ elements there's a partial derivative for each value of $w$: $\frac{dE}{dw_j}$. Keep in mind that $\frac{dE}{dy}$ is still a scalar (there's only one output), so applying the Chain Rule we can calculate this derivative:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dw_j} \n",
+ "= \frac{dE}{dy} \frac{dy}{dw_j} \\\n",
+ "= \frac{dE}{dy} \frac{d \sum_{i=1}^I w_i x_i }{dw_j} \\\n",
+ "= \frac{dE}{dy} \sum_{i=1}^I \frac{d (w_i x_i) }{dw_j} \\\n",
+ "= \frac{dE}{dy} \frac{d (w_j x_j) }{dw_j} \\\n",
+ "= \frac{dE}{dy} x_j\n",
+ "$\n",
+ "\n",
+ "Then $\frac{dE}{dw_j} = \frac{dE}{dy} x_j$. We can generalize this definition and calculate the gradient with respect to the entire vector $w$ as:\n",
+ "\n",
+ "$\frac{dE}{dw} = \frac{dE}{dy} x$\n",
+ "\n",
+ "Again, this case is **symmetrical** to the one for $x$, since $\frac{dE}{dx} = \frac{dE}{dy} w$.\n"
+ ]
+ },
+ {
+ "metadata": {},
+ "cell_type": "markdown",
+ "source": [
+ "\n",
+ "### I inputs, O outputs\n",
+ "\n",
+ "In this case, having $O$ outputs, we'll have to find the derivative with respect to each weight, for each input $i$ and each output $j$. We lose the previous symmetry, but it will be recovered in the batch version.\n",
+ "\n",
+ "Given that, we'll find $\frac{dE}{dw_{i,j}}$. Applying the Chain Rule:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dw_{i,j}}\n",
+ "= \frac{dE}{dy} \frac{dy}{dw_{i,j}}\n",
+ "= \frac{dE}{dy} \frac{d (xw)}{dw_{i,j}} \n",
+ "$\n",
+ "\n",
+ "As $y$ is a vector, we have to sum over all its values to apply the Chain Rule:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dw_{i,j}}\n",
+ "= \frac{dE}{dy} \frac{d (xw)}{dw_{i,j}}\n",
+ "= \sum_{k=1}^O \frac{dE}{dy_k} \frac{d(xw)_k}{dw_{i,j}} \n",
+ "$\n",
+ "\n",
+ "As $y_k$ only depends on $w_{i,j}$ if $j=k$, i.e. if we're calculating the output for column $k$, then:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dw_{i,j}}\n",
+ "= \frac{dE}{dy} \frac{d (xw)}{dw_{i,j}}\n",
+ "= \frac{dE}{dy_j} \frac{d(xw)_j}{dw_{i,j}} \n",
+ "$\n",
+ "\n",
+ "By the matrix multiplication definition, $(xw)_j = \sum_{l=1}^I x_l w_{l,j}$, so we multiply $x$ by column $j$ of $w$. Replacing the values:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dw_{i,j}}\n",
+ "= \frac{dE}{dy_j} \frac{d(xw)_j}{dw_{i,j}} \\\n",
+ "= \frac{dE}{dy_j} \frac{d(\sum_{l=1}^I x_l w_{l,j})}{dw_{i,j}} \\\n",
+ "= \frac{dE}{dy_j} \sum_{l=1}^I \frac{d (x_l w_{l,j}) }{dw_{i,j}}\n",
+ "$\n",
+ "\n",
+ "As $w_{i,j}$ is one particular weight of $w$, only the term of the sum that contains it remains: $\frac{d (x_i w_{i,j})}{dw_{i,j}} = x_i$. Replacing the values:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dw_{i,j}} \n",
+ "= \frac{dE}{dy_j} \sum_{l=1}^I \frac{d (x_l w_{l,j})}{dw_{i,j}} \\\n",
+ "= \frac{dE}{dy_j} \frac{d(x_i w_{i,j})}{d w_{i,j}} \\ \n",
+ "= \frac{dE}{dy_j} x_i\n",
+ "$\n"
+ ]
+ },
+ {
+ "metadata": {},
+ "cell_type": "markdown",
+ "source": [
+ "\n",
+ "### Vector expression\n",
+ "\n",
+ "The previous expression is helpful, but it would require a `for` loop with `i` and `j` indices over the entire `w` matrix. Instead, we can generalize by observing the pattern of the $\frac{dE}{dw}$ matrix:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dw} = \left(\n",
+ "\begin{matrix} \n",
+ " \frac{dE}{dy_1} x_1 & \frac{dE}{dy_2} x_1 & \dots & \frac{dE}{dy_O} x_1 \\\n",
+ " \frac{dE}{dy_1} x_2 & \frac{dE}{dy_2} x_2 & \dots & \frac{dE}{dy_O} x_2 \\\n",
+ " \vdots & \vdots & \ddots & \vdots \\\n",
+ " \frac{dE}{dy_1} x_I & \frac{dE}{dy_2} x_I & \dots & \frac{dE}{dy_O} x_I \\\n",
+ "\end{matrix}\n",
+ "\right) = x \otimes \frac{dE}{dy}\n",
+ "$\n",
+ "\n",
+ "Where $\otimes$ is the [outer product](https://en.wikipedia.org/wiki/Outer_product) between two vectors. With `numpy`, the [`outer`](https://numpy.org/doc/stable/reference/generated/numpy.outer.html) function allows this kind of operation without the need for loops.\n",
+ "\n",
+ "Keep in mind that the outer product is *not* commutative: if $a$ and $b$ have sizes $p$ and $q$, then $a \otimes b$ has size $p \times q$ and $b \otimes a$ has size $q \times p$. Given that, as $\frac{dE}{dw}$ must have size $I \times O$, we have to calculate $x \otimes \frac{dE}{dy}$ and not $\frac{dE}{dy} \otimes x$.\n"
+ ]
+ },
+ {
+ "metadata": {},
+ "cell_type": "markdown",
+ "source": [
+ "\n",
+ "## Batch calculation\n",
+ "\n",
+ "When we have a batch of $N$ examples, $x$ has size $N \times I$, $w$ has size $I \times O$, and $\frac{dE}{dy}$ has size $N \times O$.\n",
+ "\n",
+ "As before with $b$, to calculate $\frac{dE}{dw}$ we need to add the gradient contributed by each example $x_{i,:}$, then:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dw} = \sum_{i=1}^{N} x_{i,:} \otimes \frac{dE}{dy_{i,:}}\n",
+ "$\n",
+ "\n",
+ "Where $x_{i,:}$ is row $i$ of $x$, i.e. the $i^{th}$ example (`numpy`'s equivalent would be `x[i,:]`).\n",
+ "\n",
+ "For example, for $N=2$ we can verify:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dw} = x_{1,:} \otimes \frac{dE}{dy_{1,:}} + x_{2,:} \otimes \frac{dE}{dy_{2,:}} \\\n",
+ "= \left(\n",
+ "\begin{matrix}\n",
+ " \frac{dE}{dy_{1,1}} x_{1,1} + \frac{dE}{dy_{2,1}} x_{2,1} & \frac{dE}{dy_{1,2}} x_{1,1} + \frac{dE}{dy_{2,2}} x_{2,1} & \dots & \frac{dE}{dy_{1,O}} x_{1,1} + \frac{dE}{dy_{2,O}} x_{2,1} \\\n",
+ " \frac{dE}{dy_{1,1}} x_{1,2} + \frac{dE}{dy_{2,1}} x_{2,2} & \frac{dE}{dy_{1,2}} x_{1,2} + \frac{dE}{dy_{2,2}} x_{2,2} & \dots & \frac{dE}{dy_{1,O}} x_{1,2} + \frac{dE}{dy_{2,O}} x_{2,2} \\\n",
+ " \vdots & \vdots & \ddots & \vdots \\\n",
+ " \frac{dE}{dy_{1,1}} x_{1,I} + \frac{dE}{dy_{2,1}} x_{2,I} & \frac{dE}{dy_{1,2}} x_{1,I} + \frac{dE}{dy_{2,2}} x_{2,I} & \dots & \frac{dE}{dy_{1,O}} x_{1,I} + \frac{dE}{dy_{2,O}} x_{2,I} \\\n",
+ "\end{matrix}\n",
+ "\right) \\\n",
+ " = x^T \frac{dE}{dy}\n",
+ "$\n",
+ "\n",
+ "This is valid for any $N$. We can confirm this identity from the sizes: multiplying $x^T$ (size $I \times N$) by $\frac{dE}{dy}$ (size $N \times O$), we obtain a matrix of size $I \times O$, the same as $w$: just the size $\frac{dE}{dw}$ must have!\n",
+ "\n",
+ "Then we can see the symmetry between both derivatives:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dw} = \frac{dy}{dw} \frac{dE}{dy} = x^T \frac{dE}{dy} \\\n",
+ "\frac{dE}{dx} = \frac{dE}{dy} \frac{dy}{dx} = \frac{dE}{dy} w^T\n",
+ "$\n"
+ ]
+ },
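+ {
+ "metadata": {},
+ "cell_type": "markdown",
+ "source": [
+ "To see that the two expressions agree, here is a small illustrative `numpy` check (not part of the `edunn` code; the names are made up): it sums the per-example outer products with a loop and compares the result against the single matrix product $x^T \frac{dE}{dy}$:\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "\n",
+ "np.random.seed(0)\n",
+ "N, I, O = 5, 3, 2\n",
+ "x = np.random.randn(N, I)\n",
+ "dEdy = np.random.randn(N, O)\n",
+ "\n",
+ "dEdw_loop = np.zeros((I, O))\n",
+ "for i in range(N):\n",
+ "    dEdw_loop += np.outer(x[i, :], dEdy[i, :])\n",
+ "\n",
+ "dEdw = x.T @ dEdy                    # batch formula derived above\n",
+ "print(np.allclose(dEdw, dEdw_loop))  # expected: True\n",
+ "```\n"
+ ]
+ }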
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.10"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/guides/es/05. Capa Linear.ipynb b/guides/es/05. Capa Linear.ipynb
index e7714bf..c0384c6 100644
--- a/guides/es/05. Capa Linear.ipynb
+++ b/guides/es/05. Capa Linear.ipynb
@@ -24,6 +24,8 @@
 "\n",
 "## Caso con 1 entrada y 1 salida\n",
 "\n",
+ "En este caso la matemática es similar al caso de la ecuación lineal en 2D $y = wx + b$, donde $y$, $w$, $x$, y $b$ son todos escalares, y solo se multiplica $x$ por $w$ y luego se le suma $b$.\n",
+ "\n",
 "## Caso con I entradas y S salidas\n",
 "\n",
 "En el caso más general, donde $w$ es una matriz que combina $I$ entradas de forma lineal para generar $O$ salidas, entonces $x \in R^{1×I}$ e $y \in R^{1×O}$. En este caso definimos entonces a $x$ e $y$ como _vectores fila_. \n",
@@ -38,12 +40,12 @@
 "\n",
 "Notamos que\n",
 "* $x w$ ahora es un producto matricial\n",
- "* En este caso es importante el orden entre $x$ y $w$, ya que el producto de matrices no es asociativo\n",
+ "* En este caso es importante el orden entre $x$ y $w$, ya que el producto de matrices no es conmutativo\n",
 "	* Un arreglo de $1×I$ ($x$) multiplicado por otro de $I×O$ ($w$) da como resultado un arreglo de $1×O$ ($y$)\n",
 "	* La definición inversa, $y=wx$, requeriría que $x$ e $y$ sean vectores columna, o que $w$ tenga tamaño $O×I$, \n",
 "\n",
 "\n",
- "## Lotes\n",
+ "## Lotes (Batches)\n",
 "\n",
 "Las capas reciben no un solo ejemplo, sino un lote de los mismos. Entonces, dada una entrada `x` de $N×I$ valores, donde $N$ es el tamaño de lote de ejemplos, `y` tiene tamaño $N×O$. El tamaño de $w$ no se ve afectado; sigue siendo $I×O$.\n",
 "\n",
@@ -58,7 +60,7 @@
 "source": [
 "# Creación e Inicialización\n",
 "\n",
- "La capa `Linear` tiene un vector de parámetros `w`, que debe crearse en base a un tamaño de entrada y de salida de la capa, que debe establecerse al crearse.\n",
+ "La capa `Linear` tiene un vector de parámetros `w`, que debe crearse en base a un tamaño de entrada y de salida de la capa, establecidos al momento de crearse.\n",
 "\n",
 "Usaremos el inicializador `RandomNormal` creado previamente\n"
 ]
@@ -69,7 +71,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
- "# Creamos una capa Linear con 2 valores de entrada y 3 de salida\n",
+ "# Crea una capa Linear con 2 valores de entrada y 3 de salida\n",
 "# inicializado con valores muestreados de una normal\n",
 "# con media 0 y desviación estándar 1e-12\n",
 "\n",
 "std = 1e-12\n",
 "input_dimension = 2\n",
 "output_dimension = 3\n",
 "linear = nn.Linear(input_dimension, output_dimension, initializer=nn.initializers.RandomNormal(std))\n",
 "print(f\"Nombre de la capa: {linear.name}\")\n",
 "print(f\"Parámetros de la capa : {linear.get_parameters()}\")\n",
- "print(\"(deben cambiar cada vez que vuelvas a correr esta celda)\")\n",
+ "print(\"(estos valores deben cambiar cada vez que vuelvas a correr esta celda)\")\n",
 "print()\n"
 ]
 },
diff --git "a/guides/es/05b. C\303\241lculos Linear.ipynb" "b/guides/es/05b. C\303\241lculos Linear.ipynb"
new file mode 100644
index 0000000..e6860c5
--- /dev/null
+++ "b/guides/es/05b. C\303\241lculos Linear.ipynb"
@@ -0,0 +1,331 @@
+{
+ "cells": [
+ {
+ "metadata": {},
+ "cell_type": "markdown",
+ "source": [
+ "# Método backward\n",
+ "\n",
+ "En la implementación de la capa `Bias` las fórmulas de las derivadas eran relativamente sencillas, y la complejidad estaba más que todo en cómo utilizar el framework y comprender la diferencia entre la derivada de la entrada y la de los parámetros.\n",
+ "\n",
+ "El método backward de la capa `Linear` requiere calcular $\frac{dE}{dx}$ y $\frac{dE}{dw}$. En términos de uso del framework, la implementación es muy similar a la de `Bias`, pero las fórmulas de las derivadas son más complicadas.\n",
+ "\n",
+ "Primero asumiremos que hay un solo ejemplo de entrada $x$, para simplificar el desarrollo, y luego generalizaremos a un lote de $N$ ejemplos.\n"
+ ]
+ },
+ {
+ "metadata": {},
+ "cell_type": "markdown",
+ "source": [
+ "## $dE / dx$\n",
+ "\n",
+ "Comenzamos con el caso de $\frac{dE}{dx}$. Si bien este caso es en realidad simétrico con respecto a $\frac{dE}{dw}$, es un poco más fácil de atacar conceptualmente.\n",
+ "\n",
+ "Vamos a pensar en esta derivada por casos, desde el más simple al más complejo, incrementando las dimensiones de entrada y salida.\n"
+ ]
+ },
+ {
+ "metadata": {},
+ "cell_type": "markdown",
+ "source": [
+ "### 1 entrada, 1 salida\n",
+ "\n",
+ "Comenzamos por el caso más simple, donde tanto la entrada como la salida son 1D, entonces $x \in R$ y $w \in R$, es decir, son escalares. Entonces $\frac{dE}{dy}$ también es un escalar, y por regla de la cadena:\n",
+ "\n",
+ "$\frac{dE}{dx} = \frac{dE}{dy} \frac{dy}{dx} = \frac{dE}{dy} \frac{d(wx)}{dx} = \frac{dE}{dy} w$\n"
+ ]
+ },
+ {
+ "metadata": {},
+ "cell_type": "markdown",
+ "source": [
+ "\n",
+ "### I entradas, 1 salida\n",
+ "\n",
+ "Pasemos ahora al caso con $I$ entradas y 1 salida. Entonces $x$ es un vector de $I$ valores, es decir, $x \in R^I$ y por ende $w \in R^I$ también es un vector con $I$ valores. En tal caso, podemos pensar la salida como el producto punto o matricial entre $w$ y $x$:\n",
+ "\n",
+ "$y = x . w = \sum_{i=1}^I x_i w_i$\n",
+ "\n",
+ "Entonces, tenemos una derivada parcial por cada entrada: $\frac{dE}{dx_j}$. Recordando que $\frac{dE}{dy}$ sigue siendo un escalar (porque hay una sola salida), y utilizando la regla de la cadena, podemos calcular esta derivada:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dx_j} \n",
+ "= \frac{dE}{dy} \frac{dy}{dx_j} \\\n",
+ "= \frac{dE}{dy} \frac{d (\sum_{i=1}^I w_i x_i)}{dx_j} \\\n",
+ "= \frac{dE}{dy} \sum_{i=1}^I \frac{d (w_i x_i) }{dx_j} \\\n",
+ "= \frac{dE}{dy} \frac{d (w_j x_j) }{dx_j} \\\n",
+ "= \frac{dE}{dy} w_j \n",
+ "$\n",
+ "\n",
+ "Entonces $\frac{dE}{dx_j} = \frac{dE}{dy} w_j$. Podemos generalizar esta definición, y calcular el gradiente respecto a todo el vector $x$ como:\n",
+ "\n",
+ "$\frac{dE}{dx} = \frac{dE}{dy} w$\n",
+ "\n",
+ "\n",
+ "#### Notas\n",
+ "\n",
+ "1. Es genial que la misma definición de $\frac{dE}{dx}$ funcione en ambos casos, ya sea con $1$ entrada o una cantidad $I$ arbitraria de ellas.\n",
+ "1. Es importante tener en cuenta que en este contexto podemos tratar a $\frac{dE}{dy}$ como una constante, ya que sus valores han sido previamente calculados.\n",
+ "1. Podríamos hacer la derivación de $\frac{dy}{dx}$ aparte, sin tomar en cuenta el error de la red, y luego obtener $\frac{dE}{dx}$ aplicando la regla de la cadena $\frac{dE}{dx} = \frac{dE}{dy} \frac{dy}{dx}$. No obstante, para ser más claros en el contexto del método `backward` de una red, lo estamos haciendo todo al mismo tiempo.\n"
+ ]
+ },
+ {
+ "metadata": {},
+ "cell_type": "markdown",
+ "source": [
+ "\n",
+ "### I entradas, O salidas\n",
+ "\n",
+ "Nuevamente, vamos por la derivada de una de las entradas, es decir, $\frac{dE}{dx_j}$:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dx_j} = \frac{dE}{dy} \frac{dy}{dx_j} \n",
+ "$\n",
+ "\n",
+ "En este caso, $y$ es ahora un vector, con lo cual tenemos que sumar las contribuciones de cada elemento de $y$ a la regla de la cadena. Por ende:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dx_j} \n",
+ "= \frac{dE}{dy} \frac{dy}{dx_j} \n",
+ "= \sum_{i=1}^O \frac{dE}{dy_i} \frac{dy_i}{dx_j}\n",
+ "$\n",
+ "\n",
+ "Ahora, sabemos que $y_i$ es el producto punto de la columna $i$ de $w$ con la entrada $x$, por la definición de la multiplicación de matrices. Entonces:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dx_j} \n",
+ "= \sum_{i=1}^O \frac{dE}{dy_i} \frac{dy_i}{dx_j} \\\n",
+ "= \sum_{i=1}^O \frac{dE}{dy_i} \frac{d(w_{:,i} \cdot x)}{dx_j} \\\n",
+ "= \sum_{i=1}^O \frac{dE}{dy_i} \frac{d(\sum_{k=1}^I w_{k,i} x_k)}{dx_j} \\\n",
+ "= \sum_{i=1}^O \frac{dE}{dy_i} ( \sum_{k=1}^I \frac{d (w_{k,i} x_k)}{dx_j} ) \\\n",
+ "= \sum_{i=1}^O \frac{dE}{dy_i} w_{j,i}\n",
+ "$\n",
+ "\n",
+ "Ahora, $\sum_{i=1}^O \frac{dE}{dy_i} w_{j,i}$ es simplemente el producto punto entre la fila $j$ de $w$ ($w_{j,:}$) y $\frac{dE}{dy}$. Entonces podemos escribir:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dx_j} = \frac{dE}{dy} \cdot w_{j,:}\n",
+ "$\n",
+ "\n",
+ "Generalizando para todo el vector $x$: como cada $\frac{dE}{dx_j}$ es el producto punto entre dos vectores, donde $j$ indica la fila de $w$, podemos escribir $\frac{dE}{dx}$ como un producto entre la matriz $w$ entera y el vector $\frac{dE}{dy}$:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dx} = w \frac{dE}{dy}\n",
+ "$\n",
+ "\n",
+ "En este caso, el orden importa nuevamente: $w$ tiene tamaño $I \times O$ y $\frac{dE}{dy}$ tiene tamaño $O$, con lo cual $w \frac{dE}{dy}$ tiene tamaño $I$ (el mismo que $x$).\n",
+ "\n",
+ "\n"
+ ]
+ },
+ {
+ "metadata": {},
+ "cell_type": "markdown",
+ "source": [
+ "### Implementación por lotes\n",
+ "\n",
+ "Para implementar la derivada para un lote de ejemplos podemos iterar sobre cada uno y calcular las derivadas como indicamos antes. Alternativamente, podemos reescribir la derivada para que funcione directamente para un lote de $N$ ejemplos (y por ende, de $N$ vectores de derivadas, tanto para la entrada como para la salida).\n",
+ "\n",
+ "En la implementación por lotes de $\frac{dE}{dx}$, tenemos que $x$ es una matriz de tamaño $N \times I$, y por ende también lo es $\frac{dE}{dx}$. Al mismo tiempo, como $\frac{dE}{dy}$ es en realidad $\frac{dE}{dx}$ de la capa siguiente, tenemos que $\frac{dE}{dy}$ es una matriz de tamaño $N \times O$.\n",
+ "\n",
+ "Entonces, no podemos multiplicar $w \in R^{I \times O}$ por $\frac{dE}{dy} \in R^{N \times O}$. En este caso, puedes comprobar que la fórmula correcta es $\frac{dE}{dy} w^T$, ya que al multiplicar una matriz de tamaño $N \times O$ por una de tamaño $O \times I$ ($w^T$), obtenemos una matriz de tamaño $N \times I$, o sea, del mismo tamaño de $x$:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dx} = \frac{dE}{dy} w^T\n",
+ "$"
+ ]
+ },
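+ {
+ "metadata": {},
+ "cell_type": "markdown",
+ "source": [
+ "Un pequeño ejemplo ilustrativo en `numpy` de esta fórmula por lotes (los nombres de las variables son inventados y no forman parte del código de `edunn`):\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "\n",
+ "np.random.seed(0)\n",
+ "N, I, O = 5, 3, 2                # tamaño de lote, entradas, salidas\n",
+ "x = np.random.randn(N, I)\n",
+ "w = np.random.randn(I, O)\n",
+ "dEdy = np.random.randn(N, O)     # en la práctica viene de la capa siguiente\n",
+ "\n",
+ "dEdx = dEdy @ w.T                # una fila de derivadas por cada ejemplo\n",
+ "print(dEdx.shape)                # (5, 3), el mismo tamaño que x\n",
+ "```\n"
+ ]
+ },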
+ {
+ "metadata": {},
+ "cell_type": "markdown",
+ "source": [
+ "## $dE/dw$\n",
+ "\n",
+ "En el caso del gradiente del error con respecto a $w$, también primero asumiremos que hay un solo ejemplo de entrada $x$, y vamos por casos de más simple a más complejo.\n",
+ "\n"
+ ]
+ },
+ {
+ "metadata": {},
+ "cell_type": "markdown",
+ "source": [
+ "### 1 entrada, 1 salida\n",
+ "\n",
+ "Este es el caso más simple, y es simétrico al de $\frac{dE}{dx}$:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dw} = \frac{dE}{dy} \frac{dy}{dw} = \frac{dE}{dy} \frac{d (wx)}{dw} = \frac{dE}{dy} x\n",
+ "$\n"
+ ]
+ },
+ {
+ "metadata": {},
+ "cell_type": "markdown",
+ "source": [
+ "\n",
+ "### I entradas, 1 salida\n",
+ "\n",
+ "Pasemos ahora al caso con $I$ entradas y 1 salida.\n",
+ "\n",
+ "$y = x . w = \sum_{i=1}^I x_i w_i$\n",
+ "\n",
+ "Como $w$ tiene $I$ elementos, entonces hay una derivada parcial por cada valor de $w$: $\frac{dE}{dw_j}$. Recordando que $\frac{dE}{dy}$ sigue siendo un escalar (porque hay una sola salida), y utilizando la regla de la cadena, podemos calcular esta derivada:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dw_j} \n",
+ "= \frac{dE}{dy} \frac{dy}{dw_j} \\\n",
+ "= \frac{dE}{dy} \frac{d \sum_{i=1}^I w_i x_i }{dw_j} \\\n",
+ "= \frac{dE}{dy} \sum_{i=1}^I \frac{d (w_i x_i) }{dw_j} \\\n",
+ "= \frac{dE}{dy} \frac{d (w_j x_j) }{dw_j} \\\n",
+ "= \frac{dE}{dy} x_j\n",
+ "$\n",
+ "\n",
+ "Entonces $\frac{dE}{dw_j} = \frac{dE}{dy} x_j$. Podemos generalizar esta definición entonces, y calcular el gradiente respecto a todo el vector $w$ como:\n",
+ "\n",
+ "$\frac{dE}{dw} = \frac{dE}{dy} x$\n",
+ "\n",
+ "De nuevo, este caso es entonces **simétrico** con el de $x$, ya que $\frac{dE}{dx} = \frac{dE}{dy} w$.\n"
+ ]
+ },
+ {
+ "metadata": {},
+ "cell_type": "markdown",
+ "source": [
+ "\n",
+ "### I entradas, O salidas\n",
+ "\n",
+ "En este caso, al tener $O$ salidas, ahora vamos a tener que buscar la derivada de los pesos de cada entrada $i$ para cada salida $j$. En este caso, perdemos la simetría anterior (pero la recuperaremos en la versión por lotes).\n",
+ "\n",
+ "Por ende, buscamos $\frac{dE}{dw_{i,j}}$. Por regla de la cadena:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dw_{i,j}}\n",
+ "= \frac{dE}{dy} \frac{dy}{dw_{i,j}}\n",
+ "= \frac{dE}{dy} \frac{d (xw)}{dw_{i,j}} \n",
+ "$\n",
+ "\n",
+ "Como $y$ es un vector, tenemos que sumar por todos sus valores para aplicar la regla de la cadena:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dw_{i,j}}\n",
+ "= \frac{dE}{dy} \frac{d (xw)}{dw_{i,j}}\n",
+ "= \sum_{k=1}^O \frac{dE}{dy_k} \frac{d(xw)_k}{dw_{i,j}} \n",
+ "$\n",
+ "\n",
+ "Como $y_k$ solo depende de $w_{i,j}$ si $j=k$, es decir, si estamos calculando la salida de la columna $k$, entonces:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dw_{i,j}}\n",
+ "= \frac{dE}{dy} \frac{d (xw)}{dw_{i,j}}\n",
+ "= \frac{dE}{dy_j} \frac{d(xw)_j}{dw_{i,j}} \n",
+ "$\n",
+ "\n",
+ "Por definición de la multiplicación de matrices, $(xw)_j = \sum_{l=1}^I x_l w_{l,j}$, o sea, multiplicamos $x$ por la columna $j$ de $w$. Reemplazando:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dw_{i,j}}\n",
+ "= \frac{dE}{dy_j} \frac{d(xw)_j}{dw_{i,j}} \\\n",
+ "= \frac{dE}{dy_j} \frac{d(\sum_{l=1}^I x_l w_{l,j})}{dw_{i,j}} \\\n",
+ "= \frac{dE}{dy_j} \sum_{l=1}^I \frac{d (x_l w_{l,j}) }{dw_{i,j}}\n",
+ "$\n",
+ "\n",
+ "Como $w_{i,j}$ es solo un peso en particular de $w$, entonces de toda esa sumatoria solo queda el término que lo contiene, es decir $\frac{d (x_i w_{i,j})}{dw_{i,j}} = x_i$. Reemplazando:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dw_{i,j}} \n",
+ "= \frac{dE}{dy_j} \sum_{l=1}^I \frac{d (x_l w_{l,j})}{dw_{i,j}} \\\n",
+ "= \frac{dE}{dy_j} \frac{d(x_i w_{i,j})}{d w_{i,j}} \\ \n",
+ "= \frac{dE}{dy_j} x_i\n",
+ "$\n"
+ ]
+ },
+ {
+ "metadata": {},
+ "cell_type": "markdown",
+ "source": [
+ "\n",
+ "### Expresión vectorial\n",
+ "\n",
+ "La expresión anterior nos ayuda, pero con ella deberíamos utilizar un loop `for` con índices `i` y `j` sobre toda la matriz `w`. En lugar de eso, podemos generalizar observando el patrón de la matriz $\frac{dE}{dw}$:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dw} = \left(\n",
+ "\begin{matrix} \n",
+ " \frac{dE}{dy_1} x_1 & \frac{dE}{dy_2} x_1 & \dots & \frac{dE}{dy_O} x_1 \\\n",
+ " \frac{dE}{dy_1} x_2 & \frac{dE}{dy_2} x_2 & \dots & \frac{dE}{dy_O} x_2 \\\n",
+ " \vdots & \vdots & \ddots & \vdots \\\n",
+ " \frac{dE}{dy_1} x_I & \frac{dE}{dy_2} x_I & \dots & \frac{dE}{dy_O} x_I \\\n",
+ "\end{matrix}\n",
+ "\right) = x \otimes \frac{dE}{dy}\n",
+ "$\n",
+ "\n",
+ "Donde $\otimes$ es el [producto diádico o tensorial](https://es.wikipedia.org/wiki/Producto_tensorial) ([outer product](https://en.wikipedia.org/wiki/Outer_product) en inglés) entre dos vectores. En `numpy`, la función [`outer`](https://numpy.org/doc/stable/reference/generated/numpy.outer.html) permite hacer este tipo de operación sin loops.\n",
+ "\n",
+ "Hay que tener en cuenta que el producto diádico *no* es conmutativo: si $a$ y $b$ tienen tamaño $p$ y $q$, entonces $a \otimes b$ tiene tamaño $p \times q$, y $b \otimes a$ tiene tamaño $q \times p$. Por eso, como $\frac{dE}{dw}$ debe tener tamaño $I \times O$, entonces debemos computar $x \otimes \frac{dE}{dy}$ y no $\frac{dE}{dy} \otimes x$.\n"
+ ]
+ },
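+ {
+ "metadata": {},
+ "cell_type": "markdown",
+ "source": [
+ "Como ejemplo ilustrativo (no forma parte del código de `edunn`; los nombres son inventados), así se calcularía $x \otimes \frac{dE}{dy}$ con `numpy.outer` para un solo ejemplo:\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "\n",
+ "np.random.seed(0)\n",
+ "I, O = 3, 2\n",
+ "x = np.random.randn(I)\n",
+ "dEdy = np.random.randn(O)\n",
+ "\n",
+ "dEdw = np.outer(x, dEdy)   # matriz de I x O, igual que w\n",
+ "print(dEdw.shape)          # (3, 2)\n",
+ "```\n"
+ ]
+ },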
+ {
+ "metadata": {},
+ "cell_type": "markdown",
+ "source": [
+ "\n",
+ "## Caso por lotes\n",
+ "\n",
+ "En el caso de tener un lote de $N$ ejemplos, entonces recordamos que $x$ tiene tamaño $N \times I$, $w$ tiene tamaño $I \times O$, y $\frac{dE}{dy}$ tiene tamaño $N \times O$.\n",
+ "\n",
+ "Al igual que en el caso de $b$, para calcular $\frac{dE}{dw}$ tenemos que sumar el gradiente con que contribuye cada ejemplo $x_{i,:}$. Entonces:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dw} = \sum_{i=1}^{N} x_{i,:} \otimes \frac{dE}{dy_{i,:}}\n",
+ "$\n",
+ "\n",
+ "Donde $x_{i,:}$ es la fila $i$ de $x$, es decir, el ejemplo $i$ (el equivalente en `numpy` sería `x[i,:]`).\n",
+ "\n",
+ "Por ejemplo, si $N=2$, podemos verificar que:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dw} = x_{1,:} \otimes \frac{dE}{dy_{1,:}} + x_{2,:} \otimes \frac{dE}{dy_{2,:}} \\\n",
+ "= \left(\n",
+ "\begin{matrix}\n",
+ " \frac{dE}{dy_{1,1}} x_{1,1} + \frac{dE}{dy_{2,1}} x_{2,1} & \frac{dE}{dy_{1,2}} x_{1,1} + \frac{dE}{dy_{2,2}} x_{2,1} & \dots & \frac{dE}{dy_{1,O}} x_{1,1} + \frac{dE}{dy_{2,O}} x_{2,1} \\\n",
+ " \frac{dE}{dy_{1,1}} x_{1,2} + \frac{dE}{dy_{2,1}} x_{2,2} & \frac{dE}{dy_{1,2}} x_{1,2} + \frac{dE}{dy_{2,2}} x_{2,2} & \dots & \frac{dE}{dy_{1,O}} x_{1,2} + \frac{dE}{dy_{2,O}} x_{2,2} \\\n",
+ " \vdots & \vdots & \ddots & \vdots \\\n",
+ " \frac{dE}{dy_{1,1}} x_{1,I} + \frac{dE}{dy_{2,1}} x_{2,I} & \frac{dE}{dy_{1,2}} x_{1,I} + \frac{dE}{dy_{2,2}} x_{2,I} & \dots & \frac{dE}{dy_{1,O}} x_{1,I} + \frac{dE}{dy_{2,O}} x_{2,I} \\\n",
+ "\end{matrix}\n",
+ "\right) \\\n",
+ " = x^T \frac{dE}{dy}\n",
+ "$\n",
+ "\n",
+ "Esto también vale para cualquier $N$. Podemos confirmar esta identidad en base a los tamaños: si multiplicamos $x^T$ (tamaño $I \times N$) con $\frac{dE}{dy}$ (tamaño $N \times O$), obtenemos una matriz de tamaño $I \times O$, igual que $w$ y ¡justo el tamaño que debe tener $\frac{dE}{dw}$!\n",
+ "\n",
+ "Entonces, ahora sí podemos ver la simetría entre las dos derivadas:\n",
+ "\n",
+ "$\n",
+ "\frac{dE}{dw} = \frac{dy}{dw} \frac{dE}{dy} = x^T \frac{dE}{dy} \\\n",
+ "\frac{dE}{dx} = \frac{dE}{dy} \frac{dy}{dx} = \frac{dE}{dy} w^T\n",
+ "$\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.10"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}