From 91f770bf81f37ca62a3a7f32a47bc9829e111a59 Mon Sep 17 00:00:00 2001 From: "github-actions[bot]" <41898282+github-actions[bot]@users.noreply.github.com> Date: Thu, 30 Jan 2025 03:30:37 +0000 Subject: [PATCH] Deployed e262b589 with MkDocs version: 1.6.1 --- 404.html | 2 +- AI/CS231n/CS231n_notes/index.html | 2 +- .../index.html | 2 +- .../index.html | 2 +- AI/CS231n/Numpy/index.html | 2 +- AI/Dive_into_Deep_Learning/index.html | 2 +- AI/EECS 498-007/KNN/index.html | 2 +- AI/EECS 498-007/Pytorch/index.html | 2 +- AI/EECS 498-007/linear_classifer/index.html | 2 +- AI/FFB6D/FFB6D_Conda/index.html | 2 +- AI/FFB6D/FFB6D_Docker/index.html | 2 +- AI/SLAM14/index.html | 2 +- AI/index.html | 2 +- .../index.html" | 2 +- Blogs/archives/index.html | 4 +- Blogs/index.html | 4 +- Blogs/posts/24-12-29/index.html | 2 +- Blogs/posts/24-12-30/index.html | 2 +- Blogs/posts/25-01-20/index.html | 2 +- .../index.html" | 2 +- .../posts/Gaussian_Splatting_Code/index.html | 2 +- .../index.html" | 2 +- Blogs/posts/OCRN/index.html | 2 +- Blogs/posts/ULIP-2/index.html | 2 +- Blogs/posts/notes_software/index.html | 86 ++++++++++++ .../index.html" | 4 +- CS_Basic/15-213/CSAPP/index.html | 2 +- CS_Basic/C++/Accelerated C++/index.html | 2 +- CS_Basic/C++/C++ Basic/index.html | 2 +- CS_Basic/CS61A/CS61A/index.html | 2 +- CS_Basic/CS61A/Composing_Programs/index.html | 2 +- .../index.html" | 2 +- CS_Basic/Network/Security/index.html | 2 +- CS_Basic/index.html | 2 +- Links/index.html | 2 +- Robot/calibration/index.html | 2 +- Robot/index.html | 2 +- Robot/kalman/index.html | 2 +- Robot/pnp/index.html | 2 +- Summaries/2024/weekly/2024-W51-12/index.html | 2 +- Summaries/2024/weekly/2024-W52-12/index.html | 2 +- Summaries/2025/weekly/2025-W01-12/index.html | 2 +- Summaries/2025/weekly/2025-W02-01/index.html | 2 +- Summaries/2025/weekly/2025-W03-01/index.html | 2 +- Summaries/2025/weekly/2025-W04-01/index.html | 2 +- .../Semesters/2024summer_vacation/index.html | 2 +- Summaries/index.html | 4 +- 
Tags/index.html | 2 +- Tools/AI/prompt/index.html | 2 +- Tools/AI/prompt_writing/index.html | 2 +- Tools/Blog/Mkdocs_Material/index.html | 2 +- Tools/Environment/Ubuntu_setup/index.html | 2 +- Tools/Environment/environment/index.html | 2 +- Tools/Environment/obsidian_setup/index.html | 2 +- Tools/Make/CMake/index.html | 2 +- Tools/Make/Makeflie/index.html | 2 +- Tools/Others/Chezmoi/index.html | 2 +- Tools/Others/SSH/index.html | 2 +- .../index.html" | 2 +- Tools/Terminal/Tabby_Zsh/index.html | 2 +- Tools/index.html | 2 +- about/index.html | 2 +- index.html | 2 +- search/search_index.json | 2 +- sitemap.xml | 126 +++++++++--------- sitemap.xml.gz | Bin 1010 -> 1022 bytes 66 files changed, 218 insertions(+), 128 deletions(-) create mode 100644 Blogs/posts/notes_software/index.html diff --git a/404.html b/404.html index 96b40ad5..b9320754 100644 --- a/404.html +++ b/404.html @@ -1 +1 @@ - wnc 的咖啡馆

404 - Not found

\ No newline at end of file + wnc 的咖啡馆

404 - Not found

\ No newline at end of file diff --git a/AI/CS231n/CS231n_notes/index.html b/AI/CS231n/CS231n_notes/index.html index f4f5b116..0ad28b6f 100644 --- a/AI/CS231n/CS231n_notes/index.html +++ b/AI/CS231n/CS231n_notes/index.html @@ -7,7 +7,7 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}
Computer Vision


This note is based on the GitHub repository DaizeDong/Stanford-CS231n-2021-and-2022, which provides notes and slides for Stanford CS231n 2021 & 2022 in English, merges the two years of the course into a single version, and does not include the assignments.
I will also add material from blogs, articles, and my own understanding.

Topic Chapter
Deep Learning Basics 2 - 4
Perceiving and Understanding the Visual World 5 - 12
Reconstructing and Interacting with the Visual World 13 - 16
Human-Centered Applications and Implications 17 - 18

1 - Introduction

A brief history of computer vision & deep learning...

2 - Image Classification

Image Classification: A core task in computer vision, and a main driver of progress in the field.

Challenges: Viewpoint variation, background clutter, illumination, occlusion, deformation, intra-class variation...

K Nearest Neighbor

Hyperparameters: Distance metric (\(p\)-norm) and number of neighbors \(k\).

Choose hyperparameters using validation set.

Never use k-Nearest Neighbor with pixel distance.

Linear Classifier

Pass...

3 - Loss Functions and Optimization

Loss Functions

Dataset \(\big\{(x_i,y_i)\big\}_{i=1}^N\\\)
Loss Function \(L=\frac{1}{N}\sum_{i=1}^NL_i\big(f(x_i,W),y_i\big)\\\)
Loss Function with Regularization \(L=\frac{1}{N}\sum_{i=1}^NL_i\big(f(x_i,W),y_i\big)+\lambda R(W)\\\)

Motivation: Want to interpret raw classifier scores as probabilities.

Softmax Classifier \(p_{i,k}=\frac{\exp(s_{i,k})}{\sum_{j=1}^C\exp(s_{i,j})}\\\), where \(s_i=f(x_i,W)\) are the scores of example \(i\) over \(C\) classes
Cross Entropy Loss \(L_i=-\log p_{i,y_i}\\\)
Cross Entropy Loss with Regularization \(L=-\frac{1}{N}\sum_{i=1}^N\log p_{i,y_i}+\lambda R(W)\\\)
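A minimal NumPy sketch of the softmax cross-entropy loss for integer class labels. The function name and the max-subtraction trick for numerical stability are assumptions added here, not part of the course notes.

```python
import numpy as np

def softmax_cross_entropy(scores, labels):
    """Mean cross-entropy loss.

    scores: (N, C) raw classifier scores f(x_i, W)
    labels: (N,) integer class indices y_i
    """
    # Subtracting the row-wise max leaves the softmax unchanged but avoids overflow.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the correct class for every example.
    return -log_probs[np.arange(len(labels)), labels].mean()

# With identical scores every class has probability 1/C, so the loss is log(C).
scores = np.zeros((4, 10))
labels = np.array([0, 3, 5, 9])
loss = softmax_cross_entropy(scores, labels)
```

The regularized loss simply adds \(\lambda R(W)\) to the returned value.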

Optimization

SGD with Momentum

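Problems that plain SGD handles poorly:

  1. Gradient magnitudes differ greatly across directions (poor conditioning), so progress jitters along steep directions and is slow along shallow ones.
  2. Local minima and saddle points (saddle points are much more common in high dimensions).
  3. Gradient noise from mini-batches.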

Momentum: Build up “velocity” \(v_t\) as a running mean of gradients.

SGD SGD + Momentum
\(x_{t+1}=x_t-\alpha\nabla f(x_t)\) \(\begin{align}&v_{t+1}=\rho v_t+\nabla f(x_t)\\&x_{t+1}=x_t-\alpha v_{t+1}\end{align}\)
Naive gradient descent. \(\rho\) gives "friction", typically \(\rho=0.9,0.99,0.999,...\)

Nesterov Momentum: Evaluate the gradient at the look-ahead point \(x_t-\alpha\rho v_t\) (where the velocity alone would carry \(x_t\)) instead of at \(x_t\).

Momentum Nesterov Momentum
\(\begin{align}&v_{t+1}=\rho v_t+\nabla f(x_t)\\&x_{t+1}=x_t-\alpha v_{t+1}\end{align}\) \(\begin{align}&v_{t+1}=\rho v_t+\nabla f(x_t-\alpha\rho v_t)\\&x_{t+1}=x_t-\alpha v_{t+1}\end{align}\)
Use gradient at current point. Look ahead for the gradient in velocity direction.
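The two update rules can be sketched on a toy ill-conditioned quadratic. This is an illustration under stated assumptions: the test function, the hyperparameter values, and the choice of \(x-\alpha\rho v\) as the look-ahead point (which follows from the sign convention \(x_{t+1}=x_t-\alpha v_{t+1}\)) are not taken from the course notes.

```python
import numpy as np

def grad(x):
    # Gradient of the toy objective f(x) = 0.5 * (x0^2 + 50 * x1^2).
    return np.array([1.0, 50.0]) * x

def sgd_momentum(x, steps=200, lr=0.01, rho=0.9, nesterov=False):
    v = np.zeros_like(x)
    for _ in range(steps):
        # Nesterov evaluates the gradient where the velocity alone would move x.
        point = x - lr * rho * v if nesterov else x
        v = rho * v + grad(point)
        x = x - lr * v
    return x

x0 = np.array([10.0, 1.0])
x_plain = sgd_momentum(x0, rho=0.0)               # rho = 0 reduces to plain SGD
x_mom = sgd_momentum(x0, rho=0.9)
x_nag = sgd_momentum(x0, rho=0.9, nesterov=True)
```

On this problem both momentum variants end much closer to the minimum at the origin than plain SGD with the same learning rate, because the velocity builds up along the shallow direction.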

AdaGrad and RMSProp

AdaGrad: Accumulate squared gradient, and gradually decrease the step size.

RMSProp: Accumulate squared gradients while exponentially decaying older ones, so the step size adapts per dimension without shrinking toward zero. ("Leaky AdaGrad")

AdaGrad RMSProp
\(\begin{align}\text{Initialize:}&\\&r:=0\\\text{Update:}&\\&r:=r+\Big[\nabla f(x_t)\Big]^2\\&x_{t+1}=x_t-\alpha\frac{\nabla f(x_t)}{\sqrt{r}}\end{align}\) \(\begin{align}\text{Initialize:}&\\&r:=0\\\text{Update:}&\\&r:=\rho r+(1-\rho)\Big[\nabla f(x_t)\Big]^2\\&x_{t+1}=x_t-\alpha\frac{\nabla f(x_t)}{\sqrt{r}}\end{align}\)
Continually accumulate squared gradients. \(\rho\) gives "decay rate", typically \(\rho=0.9,0.99,0.999,...\)
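A small sketch of both update rules, run on a constant gradient to show how the effective step size evolves. The `eps` term in the denominator is a common safeguard against division by zero and is an assumption here; the table above omits it.

```python
import numpy as np

def adagrad_step(x, g, r, lr=0.1, eps=1e-8):
    r = r + g * g                          # accumulate squared gradients forever
    return x - lr * g / (np.sqrt(r) + eps), r

def rmsprop_step(x, g, r, lr=0.1, rho=0.9, eps=1e-8):
    r = rho * r + (1 - rho) * g * g        # leaky accumulation
    return x - lr * g / (np.sqrt(r) + eps), r

g = np.array([1.0])                        # constant gradient, for illustration only
x_a, r_a = np.array([0.0]), np.zeros(1)
x_r, r_r = np.array([0.0]), np.zeros(1)
for _ in range(100):
    new_a, r_a = adagrad_step(x_a, g, r_a)
    new_r, r_r = rmsprop_step(x_r, g, r_r)
    step_a = float(np.abs(new_a - x_a)[0])  # size of the latest AdaGrad step
    step_r = float(np.abs(new_r - x_r)[0])  # size of the latest RMSProp step
    x_a, x_r = new_a, new_r
```

With a constant gradient the AdaGrad step decays like \(\alpha/\sqrt{t}\), while the RMSProp step settles near \(\alpha\).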

Adam

Sort of like "RMSProp + Momentum".

Adam (simple version) Adam (full version)
\(\begin{align}\text{Initialize:}&\\&r_1:=0\\&r_2:=0\\\text{Update:}&\\&r_1:=\beta_1r_1+(1-\beta_1)\nabla f(x_t)\\&r_2:=\beta_2r_2+(1-\beta_2)\Big[\nabla f(x_t)\Big]^2\\&x_{t+1}=x_t-\alpha\frac{r_1}{\sqrt{r_2}}\end{align}\) \(\begin{align}\text{Initialize:}&\\&r_1:=0\\&r_2:=0\\\text{For }t=1,2,\dots\text{:}&\\&r_1:=\beta_1r_1+(1-\beta_1)\nabla f(x_t)\\&r_2:=\beta_2r_2+(1-\beta_2)\Big[\nabla f(x_t)\Big]^2\\&r_1'=\frac{r_1}{1-\beta_1^t}\\&r_2'=\frac{r_2}{1-\beta_2^t}\\&x_{t+1}=x_t-\alpha\frac{r_1'}{\sqrt{r_2'}}\end{align}\)
Build up “velocity” for both gradient and squared gradient. Correct the bias toward zero that comes from initializing \(r_1=r_2=0\), which matters in the first few iterations.
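A minimal sketch of Adam with bias correction. The default hyperparameters and the `eps` term are common choices assumed here, and `grad_fn` is a hypothetical callable returning the gradient.

```python
import numpy as np

def adam(grad_fn, x, steps=500, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    r1 = np.zeros_like(x)   # running mean of gradients (momentum part)
    r2 = np.zeros_like(x)   # running mean of squared gradients (RMSProp part)
    for t in range(1, steps + 1):
        g = grad_fn(x)
        r1 = beta1 * r1 + (1 - beta1) * g
        r2 = beta2 * r2 + (1 - beta2) * g * g
        # Bias correction: r1 and r2 start at zero, so early estimates are too small.
        r1_hat = r1 / (1 - beta1 ** t)
        r2_hat = r2 / (1 - beta2 ** t)
        x = x - lr * r1_hat / (np.sqrt(r2_hat) + eps)
    return x

toy_grad = lambda x: np.array([1.0, 50.0]) * x    # gradient of a toy quadratic
x_first = adam(toy_grad, np.array([10.0, 1.0]), steps=1)
```

Thanks to the bias correction, the very first step has magnitude close to the learning rate in every dimension, regardless of the gradient scale.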

Overview

Learning Rate Decay

Reduce learning rate at a few fixed points to get a better convergence over time.

\(\alpha_0\) : Initial learning rate.

\(\alpha_t\) : Learning rate in epoch \(t\).

\(T\) : Total number of epochs.

Method Equation Picture
Step Multiply \(\alpha_t\) by a constant factor (e.g. \(0.1\)) at a few fixed epochs.
Cosine \(\begin{align}\alpha_t=\frac{1}{2}\alpha_0\Bigg[1+\cos(\frac{t\pi}{T})\Bigg]\end{align}\)
Linear \(\begin{align}\alpha_t=\alpha_0\Big(1-\frac{t}{T}\Big)\end{align}\)
Inverse Sqrt \(\begin{align}\alpha_t=\frac{\alpha_0}{\sqrt{t}}\end{align}\)

Learning rate warm-up: High initial learning rates can make the loss explode; linearly increasing the learning rate from \(0\) over the first few iterations can prevent this.

Empirical rule of thumb: If you increase the batch size by a factor of \(N\), also scale the initial learning rate by \(N\).
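The schedules in the table, plus linear warm-up, can be written as small functions. The function names and the `warmup_steps` parameter are hypothetical names chosen for this sketch.

```python
import math

def cosine_lr(alpha0, t, T):
    return 0.5 * alpha0 * (1 + math.cos(math.pi * t / T))

def linear_lr(alpha0, t, T):
    return alpha0 * (1 - t / T)

def inverse_sqrt_lr(alpha0, t):
    return alpha0 / math.sqrt(t)       # only defined for t >= 1

def with_warmup(scheduled_lr, t, warmup_steps):
    # Ramp linearly from 0 up to the scheduled value over the first warmup_steps steps.
    return scheduled_lr * min(1.0, t / warmup_steps)

halfway = cosine_lr(0.1, 50, 100)      # half of alpha0 at t = T / 2
```

Warm-up composes with any schedule, for example `with_warmup(cosine_lr(0.1, t, T), t, warmup_steps=10)`.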

Second-Order Optimization

Picture Time Complexity Space Complexity
First Order \(O(n)\) \(O(n)\)
Second Order \(O(n^2)\) with BFGS optimization \(O(n)\) with L-BFGS optimization

L-BFGS : Limited memory BFGS.

  1. Works very well in full batch, deterministic \(f(x)\).
  2. Does not transfer very well to mini-batch setting.

Summary

Method Performance
Adam Often chosen as default method.
Works OK even with a constant learning rate.
SGD + Momentum Can outperform Adam.
Requires more tuning of learning rate and schedule.
L-BFGS Worth trying if you can afford full-batch updates.

4 - Neural Networks and Backpropagation

Neural Networks

Motivation: Human-designed features can impose a strong inductive bias.

Activation: Sigmoid, tanh, ReLU, LeakyReLU...

Architecture: Input layer, hidden layer, output layer.

Do not use the size of a neural network as the regularizer. Use regularization instead!

Gradient Calculation: Computational Graph + Backpropagation.

Backpropagation

Use the Jacobian matrix to calculate the gradient of each node in a computation graph.

Suppose that we have a linear layer \(Y=WX\) with the following computation flow:

Input X Input W Output Y
\(X=\begin{bmatrix}x_1\\x_2\\\vdots\\x_n\end{bmatrix}\) \(W=\begin{bmatrix}w_{11}&w_{12}&\cdots&w_{1n}\\w_{21}&w_{22}&\cdots&w_{2n}\\\vdots&\vdots&\ddots&\vdots\\w_{m1}&w_{m2}&\cdots&w_{mn}\end{bmatrix}\) \(Y=\begin{bmatrix}y_1\\y_2\\\vdots\\y_m\end{bmatrix}\)
\(n\times 1\) \(m\times n\) \(m\times 1\)

After applying feed forward, we can calculate gradients like this:

Derivative Matrix of X Jacobian Matrix of X Derivative Matrix of Y
\(D_X=\begin{bmatrix}\frac{\partial L}{\partial x_1}\\\frac{\partial L}{\partial x_2}\\\vdots\\\frac{\partial L}{\partial x_n}\end{bmatrix}\) \(J_X=\begin{bmatrix}\frac{\partial y_1}{\partial x_1}&\frac{\partial y_1}{\partial x_2}&\cdots&\frac{\partial y_1}{\partial x_n}\\\frac{\partial y_2}{\partial x_1}&\frac{\partial y_2}{\partial x_2}&\cdots&\frac{\partial y_2}{\partial x_n}\\\vdots&\vdots&\ddots&\vdots\\\frac{\partial y_m}{\partial x_1}&\frac{\partial y_m}{\partial x_2}&\cdots&\frac{\partial y_m}{\partial x_n}\end{bmatrix}\) \(D_Y=\begin{bmatrix}\frac{\partial L}{\partial y_1}\\\frac{\partial L}{\partial y_2}\\\vdots\\\frac{\partial L}{\partial y_m}\end{bmatrix}\)
\(n\times 1\) \(m\times n\) \(m\times 1\)
Derivative Matrix of W Jacobian Matrix of W Derivative Matrix of Y
\(D_W=\begin{bmatrix}\frac{\partial L}{\partial w_{11}}&\frac{\partial L}{\partial w_{12}}&\cdots&\frac{\partial L}{\partial w_{1n}}\\\frac{\partial L}{\partial w_{21}}&\frac{\partial L}{\partial w_{22}}&\cdots&\frac{\partial L}{\partial w_{2n}}\\\vdots&\vdots&\ddots&\vdots\\\frac{\partial L}{\partial w_{m1}}&\frac{\partial L}{\partial w_{m2}}&\cdots&\frac{\partial L}{\partial w_{mn}}\end{bmatrix}\) \(J_W^{(k)}=\begin{bmatrix}\frac{\partial y_k}{\partial w_{11}}&\frac{\partial y_k}{\partial w_{12}}&\cdots&\frac{\partial y_k}{\partial w_{1n}}\\\frac{\partial y_k}{\partial w_{21}}&\frac{\partial y_k}{\partial w_{22}}&\cdots&\frac{\partial y_k}{\partial w_{2n}}\\\vdots&\vdots&\ddots&\vdots\\\frac{\partial y_k}{\partial w_{m1}}&\frac{\partial y_k}{\partial w_{m2}}&\cdots&\frac{\partial y_k}{\partial w_{mn}}\end{bmatrix}\)
\(J_W=\begin{bmatrix}J_W^{(1)}&J_W^{(2)}&\cdots&J_W^{(m)}\end{bmatrix}\)
\(D_Y=\begin{bmatrix}\frac{\partial L}{\partial y_1}\\\frac{\partial L}{\partial y_2}\\\vdots\\\frac{\partial L}{\partial y_m}\end{bmatrix}\)
\(m\times n\) \(m\times m\times n\) \(m\times 1\)

For each element in \(D_X\) , we have:

\(D_{Xi}=\frac{\partial L}{\partial x_i}=\sum_{j=1}^m\frac{\partial L}{\partial y_j}\frac{\partial y_j}{\partial x_i}\\\)
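For a linear layer the sum above collapses to a matrix product. The sketch below assumes \(Y=WX\) (as the shapes in the tables suggest) and uses a toy scalar loss chosen only for illustration; it then checks one entry against a centered finite difference.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
W = rng.standard_normal((m, n))
X = rng.standard_normal((n, 1))

def loss(W, X):
    # Toy scalar loss on top of the linear layer Y = W X (an assumption of this sketch).
    return 0.5 * np.sum((W @ X) ** 2)

Y = W @ X
D_Y = Y                   # dL/dY for the toy loss above, shape (m, 1)
D_X = W.T @ D_Y           # sum_j dL/dy_j * dy_j/dx_i, shape (n, 1)
D_W = D_Y @ X.T           # dL/dw_ji = dL/dy_j * x_i, shape (m, n)

# Numerical check of one entry of D_X with a centered difference.
h = 1e-6
E = np.zeros_like(X)
E[1, 0] = h
numeric = (loss(W, X + E) - loss(W, X - E)) / (2 * h)
```

In practice the full Jacobians are never materialized; the products `W.T @ D_Y` and `D_Y @ X.T` give the same result at a fraction of the memory cost.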

5 - Convolutional Neural Networks

Convolution Layer

Introduction

Convolve a filter with an image: Slide the filter spatially within the image, computing dot products in each region.

Given a \(32\times32\times3\) image and a \(5\times5\times3\) filter, a convolution looks like:

Convolving six \(5\times5\times3\) filters with a \(32\times32\times3\) image with stride \(1\) gives a \(28\times28\times6\) feature map:

With an activation function after each convolution layer, we can build the ConvNet with a sequence of convolution layers:

By changing the stride of the filters, or adding zero-padding around the image, we can modify the size of the output:

\(1\times1\) Convolution Layer

This kind of layer makes perfect sense. It is usually used to change the dimension (channel) of features.

A \(1\times1\) convolution layer can also be treated as a fully-connected linear layer applied independently at each spatial position.

Summary

Input
image size \(W_1\times H_1\times C\)
filter size \(F\times F\times C\)
filter number \(K\)
stride \(S\)
zero padding \(P\)
Output
output size \(W_2\times H_2\times K\)
output width \(W_2=\frac{W_1-F+2P}{S}+1\\\)
output height \(H_2=\frac{H_1-F+2P}{S}+1\\\)
Parameters
parameter number (weight) \(F^2CK\)
parameter number (bias) \(K\)
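The formulas in the summary can be checked with a short helper. The function names are hypothetical, and the divisibility check is an assumption of this sketch (frameworks typically round down instead).

```python
def conv_output_shape(w1, h1, f, k, s=1, p=0):
    """Output shape (W2, H2, K) of a conv layer with K filters of size F x F."""
    if (w1 - f + 2 * p) % s or (h1 - f + 2 * p) % s:
        raise ValueError("filter does not tile the padded input evenly")
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    return w2, h2, k

def conv_param_count(c, f, k):
    return f * f * c * k + k       # weights plus one bias per filter

# Six 5x5x3 filters on a 32x32x3 image with stride 1 and no padding.
shape = conv_output_shape(32, 32, f=5, k=6)
# Ten 5x5x3 filters: 5 * 5 * 3 * 10 weights plus 10 biases.
params = conv_param_count(c=3, f=5, k=10)
```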

Pooling layer

Make the representations smaller and more manageable.

An example of max pooling:

Input
image size \(W_1\times H_1\times C\)
spatial extent \(F\times F\)
stride \(S\)
Output
output size \(W_2\times H_2\times C\)
output width \(W_2=\frac{W_1-F}{S}+1\\\)
output height \(H_2=\frac{H_1-F}{S}+1\\\)
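A direct, loop-based sketch of max pooling that follows the output-size formulas above. It is written for clarity rather than speed, and the function name is an assumption.

```python
import numpy as np

def max_pool2d(x, f=2, s=2):
    """x: (H, W, C) feature map; returns the pooled (H2, W2, C) map."""
    h1, w1, c = x.shape
    h2, w2 = (h1 - f) // s + 1, (w1 - f) // s + 1
    out = np.empty((h2, w2, c), dtype=x.dtype)
    for i in range(h2):
        for j in range(w2):
            window = x[i * s:i * s + f, j * s:j * s + f, :]
            out[i, j, :] = window.max(axis=(0, 1))   # max over the spatial window
    return out

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)[:, :, None]   # single channel
pooled = max_pool2d(x)[:, :, 0]
```

Pooling has no learnable parameters and leaves the channel count \(C\) unchanged.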

Convolutional Neural Networks (CNN)

CNNs stack CONV, POOL, and FC layers.

CNN Trends:

  1. Smaller filters and deeper architectures.
  2. Getting rid of POOL/FC layers (just CONV).

Historically, CNN architectures looked like:

[(CONV-ReLU) × n - POOL?] × m - (FC-ReLU) × k - Softmax

where usually \(m\) is large, \(0\le n\le5\), \(0\le k\le2\).

Recent advances such as ResNet / GoogLeNet have challenged this paradigm.

6 - CNN Architectures

Best model in ImageNet competition:

AlexNet

8 layers.

First ConvNet-based winner of the ImageNet classification challenge.

Filter size decreases in deeper layer.

Channel number increases in deeper layer.

VGG

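19 layers. (A 16-layer version is also provided.)

Fixed filter size (\(3\times3\)) in all layers:

  1. The effective receptive field expands as the layers get deeper: three stacked \(3\times3\) conv layers have the same effective receptive field as one \(7\times7\) layer.
  2. A deeper architecture gets more non-linearities and fewer parameters: \(3\times(3^2C^2)=27C^2\) versus \(7^2C^2=49C^2\) for \(C\) channels per layer.

Most memory is in early convolution layers.

Most parameters are in late FC layers.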

GoogLeNet

22 layers.

No FC layers, only 5M parameters. ( \(8.3\%\) of AlexNet, \(3.7\%\) of VGG )

Devise efficient "inception module".

Inception Module

Design a good local network topology (network within a network) and then stack these modules on top of each other.

Naive Inception Module:

  1. Apply parallel filter operations on the input from previous layer.
  2. Concatenate all filter outputs together channel-wise.
  3. Problem: The depth (channel number) increases too fast, which makes the computation expensive.

Inception Module with Dimension Reduction:

  1. Add "bottleneck" layers (\(1\times1\) convolutions) to reduce the dimension.
  2. This also lowers the computation cost.

Architecture

ResNet

152 layers for ImageNet.

Devise "residual connections".

Use BN in place of dropout.

Residual Connections

Hypothesis: Deeper models have more representation power than shallow ones. But they are harder to optimize.

Solution: Use network layers to fit a residual mapping instead of directly trying to fit a desired underlying mapping.

With ReLU as the activation function, the block reduces to an identity mapping when \(F(x)=0\): its input \(x\) comes out of a previous ReLU and is non-negative, so \(\text{ReLU}(F(x)+x)=x\).
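A toy residual block, using plain matrix products as stand-ins for the convolution layers. The structure (two layers inside \(F\), ReLU after the addition) is an assumption of this sketch.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    # F(x): two linear layers with a ReLU in between (stand-ins for conv layers).
    f = W2 @ relu(W1 @ x)
    # The skip connection adds the input back before the final ReLU.
    return relu(f + x)

# A non-negative input, like a feature that already passed through a ReLU.
x = relu(np.random.default_rng(0).standard_normal(8))
zeros = np.zeros((8, 8))
out = residual_block(x, zeros, zeros)   # F(x) = 0, so the block is the identity
```

Because the identity is reachable by driving the weights to zero, adding more blocks should not make the network harder to fit than its shallower counterpart.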

Architecture

SENet

Using ResNeXt-152 as a base architecture.

Add a “feature recalibration” module. (adjust weights of each channel)

Using the global avg-pooling layer + FC layers to determine feature map weights.

Improvements of ResNet

Wide Residual Networks, ResNeXt, DenseNet, MobileNets...

Other Interesting Networks

NASNet: Neural Architecture Search with Reinforcement Learning.

EfficientNet: Smart Compound Scaling.

7 - Training Neural Networks

Activation Functions

Activation Usage
Sigmoid, tanh Do not use.
ReLU Use as default.
Leaky ReLU, Maxout, ELU, SELU Replace ReLU to squeeze out some marginal gains.
Swish No clear usage.

Data Processing

Zero-center and normalize the data before training.

In practice, for images we usually only subtract the per-channel mean.

Weight Initialization

Assume that we have 6 layers in a network.

\(D_i\) : input size of layer \(i\)

\(W_i\) : weights in layer \(i\)

\(X_i\) : output after activation of layer \(i\), we have \(X_i=g(Z_i)=g(W_iX_{i-1}+B_i)\)

We initialize each parameter in \(W_i\) randomly from a zero-mean Gaussian with standard deviation \(k_i\).

Tanh Activation Output Distribution
\(k_i=0.01\)
\(k_i=0.05\)
Xavier Initialization \(k_i=\frac{1}{\sqrt{D_i}}\)

When \(k_i=0.01\), the variance keeps decreasing as the layers get deeper. As a result, the outputs of neurons in deep layers are all close to 0. The partial derivative \(\frac{\partial Z_i}{\partial W_i}=X_{i-1}=0\\\). (no gradient)

When \(k_i=0.05\), most neurons are saturated. The partial derivative \(\frac{\partial X_i}{\partial Z_i}=g'(Z_i)=0\\\). (no gradient)

To solve this problem, we need to keep the variance the same in each layer.

Assuming that \(Var\big(X_{i-1}^{(1)}\big)=Var\big(X_{i-1}^{(2)}\big)=\dots=Var\big(X_{i-1}^{(D_i)}\big)\)

We have \(Z_i=X_{i-1}^{(1)}W_i^{(:,1)}+X_{i-1}^{(2)}W_i^{(:,2)}+\dots+X_{i-1}^{(D_i)}W_i^{(:,D_i)}=\sum_{n=1}^{D_i}X_{i-1}^{(n)}W_i^{(:,n)}\\\)

We want \(Var\big(Z_i\big)=Var\big(X_{i-1}^{(n)}\big)\)

Let's do the derivation, assuming the inputs and weights are independent and zero-mean:

\(\begin{aligned}Var\big(Z_i\big)&=Var\Bigg(\sum_{n=1}^{D_i}X_{i-1}^{(n)}W_i^{(:,n)}\Bigg)\\&=D_i\ Var\Big(X_{i-1}^{(n)}W_i^{(:,n)}\Big)\\&=D_i\ Var\Big(X_{i-1}^{(n)}\Big)\ Var\Big(W_i^{(:,n)}\Big)\end{aligned}\)

So \(Var\big(Z_i\big)=Var\big(X_{i-1}^{(n)}\big)\) only when \(Var\Big(W_i^{(:,n)}\Big)=\frac{1}{D_i}\\\), that is to say \(k_i=\frac{1}{\sqrt{D_i}}\\\)

ReLU Activation Output Distribution
Xavier Initialization \(k_i=\frac{1}{\sqrt{D_i}}\)
Kaiming Initialization \(k_i=\sqrt{\frac{2}{D_i}}\)

For ReLU activation, Xavier initialization still suffers from the "variance decreasing" problem, because ReLU zeroes out about half of the pre-activations.

We can use Kaiming initialization instead to fix this.
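The effect of the initialization scale can be reproduced with a small experiment. This sketch assumes weights drawn from a zero-mean Gaussian with standard deviation \(k_i\), standard-normal inputs, no biases, and the Kaiming value \(\sqrt{2/D_i}\); the width, depth and helper names are hypothetical.

```python
import numpy as np

def layer_stds(std_fn, act, depth=6, width=512, batch=256, seed=0):
    """Return the standard deviation of the activations after each layer."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, width))
    stds = []
    for _ in range(depth):
        W = std_fn(width) * rng.standard_normal((width, width))
        x = act(x @ W)
        stds.append(float(x.std()))
    return stds

def relu(z):
    return np.maximum(z, 0.0)

small_tanh = layer_stds(lambda d: 0.01, np.tanh)              # activations collapse to 0
xavier_tanh = layer_stds(lambda d: 1 / np.sqrt(d), np.tanh)   # scale roughly preserved
xavier_relu = layer_stds(lambda d: 1 / np.sqrt(d), relu)      # shrinks layer by layer
kaiming_relu = layer_stds(lambda d: np.sqrt(2 / d), relu)     # scale roughly preserved
```

Printing the four lists shows the per-layer spread of the activations, which is what the output-distribution histograms visualize.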

Batch Normalization

Force the inputs to be "nicely scaled" at each layer.

\(N\) : batch size

\(D\) : feature size

\(x\) : input with shape \(N\times D\)

\(\gamma\) : learnable scale parameter with shape \(D\)

\(\beta\) : learnable shift parameter with shape \(D\)

The procedure of batch normalization:

  1. Calculate the per-feature mean \(\mu_j=\frac{1}{N}\sum_{i=1}^Nx_{i,j}\\\). The result \(\mu\) has shape \(D\).
  2. Calculate the per-feature variance \(\sigma_j^2=\frac{1}{N}\sum_{i=1}^N(x_{i,j}-\mu_j)^2\\\). The result \(\sigma^2\) has shape \(D\).
  3. Calculate the normalized input \(\hat{x}_{i,j}=\frac{x_{i,j}-\mu_j}{\sqrt{\sigma_j^2+\epsilon}}\\\). The result \(\hat{x}\) has shape \(N\times D\).
  4. Scale and shift the normalized input to get the output \(y_{i,j}=\gamma_j\hat{x}_{i,j}+\beta_j\). The result \(y\) has shape \(N\times D\).

Why scale and shift: The constraint "zero-mean, unit variance" may be too restrictive, so the learnable \(\gamma\) and \(\beta\) let the network relax it.
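The four steps map directly onto NumPy. This is a training-mode sketch only: at test time implementations typically use running averages of the mean and variance collected during training, which is omitted here. The `eps` default is an assumed value.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                      # step 1: per-feature mean, shape (D,)
    var = x.var(axis=0)                      # step 2: per-feature variance (1/N), shape (D,)
    x_hat = (x - mu) / np.sqrt(var + eps)    # step 3: normalize, shape (N, D)
    return gamma * x_hat + beta              # step 4: scale and shift, shape (N, D)

rng = np.random.default_rng(0)
x = 5.0 + 3.0 * rng.standard_normal((64, 4))   # badly scaled input
y = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
```

With \(\gamma=1\) and \(\beta=0\) every feature of the output has approximately zero mean and unit variance.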

Pros:

  1. Makes deep networks much easier to train!
  2. Improves gradient flow.
  3. Allows higher learning rates, faster convergence.
  4. Networks become more robust to initialization.
  5. Acts as regularization during training.
  6. Zero overhead at test-time: can be fused with conv!

Cons:

Behaves differently during training and testing: this is a very common source of bugs!

Transfer Learning

Start from a model pre-trained on another dataset and continue training it on the target dataset.

An empirical suggestion:

very similar dataset very different dataset
very little data Use Linear Classifier on top layer. You’re in trouble… Try linear classifier from different stages.
quite a lot of data Finetune a few layers. Finetune a larger number of layers.

Regularization

Common Pattern of Regularization

Training: Add some kind of randomness. \(y=f(x,z)\)

Testing: Average out randomness (sometimes approximate). \(y=f(x)=E_z\big[f(x,z)\big]=\int p(z)f(x,z)dz\\\)

Regularization Term

L2 regularization: \(R(W)=\sum_k\sum_lW_{k,l}^2\) (weight decay)

L1 regularization: \(R(W)=\sum_k\sum_l|W_{k,l}|\)

Elastic net : \(R(W)=\sum_k\sum_l\big(\beta W_{k,l}^2+|W_{k,l}|\big)\) (L1+L2)

Dropout

Training: Randomly keep each neuron with probability \(p\) and set the others to 0.

Testing: Multiply each neuron's output by the keep probability \(p\). (scale the output back)

More common: Scale the output by \(\frac{1}{p}\) when training, keep the original output when testing. ("inverted dropout")
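A sketch of the more common variant, where the scaling happens at training time. Here `p` is taken to be the probability of keeping a unit, which is the reading under which scaling by \(\frac{1}{p}\) preserves the expected output; the function names are hypothetical.

```python
import numpy as np

def dropout_train(x, p, rng):
    # p is the probability of KEEPING a unit (assumption of this sketch).
    mask = (rng.random(x.shape) < p) / p     # kept units are scaled up by 1 / p
    return x * mask

def dropout_test(x):
    return x                                 # nothing to do at test time

rng = np.random.default_rng(0)
x = np.ones((1000, 100))
y = dropout_train(x, p=0.5, rng=rng)
```

Each entry of `y` is either 0 or \(\frac{1}{p}\), and the mean of `y` stays close to the mean of `x`, so the test-time forward pass needs no extra scaling.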

Why dropout works:

  1. Forces the network to have a redundant representation. Prevents co-adaptation of features.
  2. Another interpretation: Dropout is training a large ensemble of models (that share parameters).

Batch Normalization

See above.

Data Augmentation

  1. Horizontal Flips
  2. Random Crops and Scales
  3. Color Jitter
  4. Rotation
  5. Stretching
  6. Shearing
  7. Lens Distortions
  8. ...

There are also automatic data augmentation methods that use neural networks.

Other Methods and Summary

DropConnect: Drop connections between neurons.

Fractional Max Pooling: Use randomized pooling regions.

Stochastic Depth: Skip some layers in the network.

Cutout: Set random image regions to zero.

Mixup: Train on random blends of images.

Regularization Method Usage
Dropout For large fully-connected layers.
Batch Normalization & Data Augmentation Almost always a good idea.
Cutout & Mixup For small classification datasets.

Hyperparameter Tuning

Most Common Hyperparameters Less Sensitive Hyperparameters
learning rate
learning rate decay schedule
weight decay
setting of momentum
...

Tips on hyperparameter tuning:

  1. Prefer one validation fold to cross-validation.
  2. Search for hyperparameters on log scale. (e.g. multiply the hyperparameter by a fixed number \(k\) at each search)
  3. Prefer random search to grid search.
  4. Careful with best values on border.
  5. Stage your search from coarse to fine.
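Random search on a log scale can be sketched by sampling the exponent uniformly. The search ranges and the helper name below are placeholders chosen for illustration, not recommended values.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    # Sample the exponent uniformly so that every decade is equally likely.
    return 10 ** rng.uniform(math.log10(low), math.log10(high))

rng = random.Random(0)
trials = [
    {"lr": sample_log_uniform(1e-4, 1e-1, rng),
     "weight_decay": sample_log_uniform(1e-5, 1e-3, rng)}
    for _ in range(20)
]
```

Each dictionary in `trials` is one configuration to train and evaluate on the validation fold; a later, finer stage can narrow the ranges around the best results.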

Implementation

Have a worker that continuously samples random hyperparameters and performs the optimization. During training, the worker keeps track of the validation performance after every epoch and writes a model checkpoint to a file.

Have a master that launches or kills workers across a computing cluster, and may additionally inspect the checkpoints written by workers and plot their training statistics.

Common Procedures

  1. Check initial loss.

Turn off weight decay and sanity-check the loss at initialization: it should be about \(\log(C)\) for softmax with \(C\) classes.

  2. Overfit a small sample. (important)

Try to train to 100% training accuracy on a small sample of training data.

Fiddle with architecture, learning rate, weight initialization.

  3. Find learning rate that makes loss go down.

Use the architecture from the previous step, use all training data, turn on small weight decay, find a learning rate that makes the loss drop significantly within 100 iterations.

Good learning rates to try: \(0.1,0.01,0.001,0.0001,\dots\)

  4. Coarse grid, train for 1-5 epochs.

Choose a few values of learning rate and weight decay around what worked from Step 3, train a few models for 1-5 epochs.

Good weight decay to try: \(0.0001,0.00001,0\)

  5. Refine grid, train longer.

Pick best models from Step 4, train them for longer (10-20 epochs) without learning rate decay.

  6. Look at loss and accuracy curves.
  7. GOTO step 5.

Gradient Checks

CS231n Convolutional Neural Networks for Visual Recognition

Compute the numerical gradient with the centered difference \(f_n'=\frac{f(x+h)-f(x-h)}{2h}\\\) and compare it with the analytical gradient \(f_a'\) obtained from backpropagation.

Get relative error between numerical gradient \(f_n'\) and analytical gradient \(f_a'\) using \(E=\frac{|f_n'-f_a'|}{\max\big(|f_n'|,|f_a'|\big)}\\\)

Relative Error Result
\(E>10^{-2}\) Probably \(f_a'\) is wrong.
\(10^{-2}>E>10^{-4}\) Not good, should check the gradient.
\(10^{-4}>E>10^{-6}\) Okay for objectives with kinks. (e.g. ReLU)
Not good for objectives with no kink. (e.g. softmax, tanh)
\(10^{-7}>E\) Good.

Tips on gradient checks:

  1. Use double precision.
  2. Use only few data points.
  3. Careful about kinks in the objective. (e.g. \(x=0\) for ReLU activation)
  4. Careful with the step size \(h\).
  5. Use gradient check after the loss starts to go down.
  6. Remember to turn off anything that may affect the gradient. (e.g. regularization / dropout / augmentations)
  7. Check only few dimensions for every parameter. (reduce time cost)
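A compact gradient check in double precision. The test function, the step size `h`, and the small constant guarding against division by zero are assumptions of this sketch.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h
        f_plus = f(x)
        x.flat[i] = old - h
        f_minus = f(x)
        x.flat[i] = old                          # restore the original value
        grad.flat[i] = (f_plus - f_minus) / (2 * h)
    return grad

def relative_error(a, b):
    # The tiny constant only guards against 0 / 0 when both gradients vanish.
    return np.max(np.abs(a - b) / (np.maximum(np.abs(a), np.abs(b)) + 1e-12))

f = lambda x: np.sum(np.tanh(x) ** 2)                           # smooth objective, no kinks
analytic = lambda x: 2 * np.tanh(x) * (1 - np.tanh(x) ** 2)     # its exact gradient
x = np.array([0.5, -1.2, 2.0, 0.3, -0.7])                       # float64 by default
err = relative_error(numerical_gradient(f, x), analytic(x))
```

For a smooth objective like this one, the relative error should land well inside the "good" range of the table.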

8 - Visualizing and Understanding

Feature Visualization and Inversion

Visualizing what models have learned

Visualize Areas
Filters Visualize the raw weights of each convolution kernel. (better in the first layer)
Final Layer Features Run dimensionality reduction for features in the last FC layer. (PCA, t-SNE...)
Activations Visualize activated areas. (Understanding Neural Networks Through Deep Visualization)

Understanding input pixels

Maximally Activating Patches
  1. Pick a layer and a channel.
  2. Run many images through the network, record values of the chosen channel.
  3. Visualize image patches that correspond to maximal activation features.

For example, we have a layer with shape \(128\times13\times13\). We pick the 17th channel from all 128 channels. Then we run many pictures through the network. During each run we can find a maximal activation feature among all the \(13\times13\) features in channel 17. We then record the corresponding picture patch for each maximal activation feature. At last, we visualize all picture patches for each feature.

This will help us find the relationship between each maximal activation feature and its corresponding picture patches.

(each row of the following picture represents a feature)

Saliency via Occlusion

Mask part of the image before feeding to CNN, check how much predicted probabilities change.

Saliency via Backprop
  1. Compute gradient of (unnormalized) class score with respect to image pixels.
  2. Take absolute value and max over RGB channels to get saliency maps.

Intermediate Features via Guided Backprop
  1. Pick a single intermediate neuron. (e.g. one feature in a \(128\times13\times13\) feature map)
  2. Compute gradient of neuron value with respect to image pixels.

Striving for Simplicity: The All Convolutional Net

Just like "Maximally Activating Patches", this could find the part of an image that a neuron responds to.

Gradient Ascent

Generate a synthetic image that maximally activates a neuron.

  1. Initialize image \(I\) to zeros.
  2. Forward image to compute current scores \(S_c(I)\) (for class \(c\) before softmax).
  3. Backprop to get gradient of neuron value with respect to image pixels.
  4. Make a small update to the image.

Objective: \(\max S_c(I)-\lambda\lVert I\rVert_2^2\)

Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps

Adversarial Examples

Find a small perturbation that, when added to a correctly-classified image, makes the network misclassify it.

  1. Start from an arbitrary image.
  2. Pick an arbitrary class.
  3. Modify the image to maximize the score of that class.
  4. Repeat until network is fooled.

DeepDream and Style Transfer

Feature Inversion

Given a CNN feature vector \(\Phi_0\) for an image, find a new image \(x\) that:

  1. Features of the new image \(\Phi(x)\) match the given feature vector \(\Phi_0\).
  2. "Looks natural". (image prior regularization)

Objective: \(\min \lVert\Phi(x)-\Phi_0\rVert+\lambda R(x)\)

Understanding Deep Image Representations by Inverting Them

DeepDream: Amplify Existing Features

Given an image, amplify the neuron activations at a layer to generate a new one.

  1. Forward: compute activations at chosen layer.
  2. Set gradient of chosen layer equal to its activation.
  3. Backward: Compute gradient on image.
  4. Update image.

Texture Synthesis

Nearest Neighbor
  1. Generate pixels one at a time in scanline order
  2. Form neighborhood of already generated pixels, copy the nearest neighbor from input.

Neural Texture Synthesis

Gram Matrix: A Detailed Explanation of the Gram Matrix

  1. Pretrain a CNN on ImageNet.
  2. Run input texture forward through CNN, record activations on every layer.

Layer \(i\) gives feature map of shape \(C_i\times H_i\times W_i\).

  3. At each layer compute the Gram matrix \(G_i\) giving outer product of features.
  • Reshape feature map at layer \(i\) to \(C_i\times H_iW_i\).
  • Compute the Gram matrix \(G_i\) with shape \(C_i\times C_i\).
  4. Initialize generated image from random noise.
  5. Pass generated image through CNN, compute Gram matrix \(\hat{G}_l\) on each layer.
  6. Compute loss: Weighted sum of L2 distance between Gram matrices.
  • \(E_l=\frac{1}{4N_l^2M_l^2}\sum_{i,j}\Big(G_l^{(i,j)}-\hat{G}_l^{(i,j)}\Big)^2\\\), where \(N_l=C_l\) and \(M_l=H_lW_l\).
  • \(\mathcal{L}(\vec{x},\hat{\vec{x}})=\sum_{l=0}^L\omega_lE_l\\\)
  7. Backprop to get gradient on image.
  8. Make gradient step on image.
  9. GOTO 5.
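The Gram matrix and the per-layer loss can be sketched as follows. The normalization constant \(\frac{1}{4N_l^2M_l^2}\) with \(N_l=C_l\) and \(M_l=H_lW_l\) follows the texture synthesis paper cited below and should be treated as an assumption here; the function names are hypothetical.

```python
import numpy as np

def gram_matrix(features):
    """features: (C, H, W) feature map -> (C, C) Gram matrix."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)      # C x HW
    return flat @ flat.T                   # C x C, inner products between channels

def style_layer_loss(feat, feat_hat):
    c, h, w = feat.shape
    n_l, m_l = c, h * w
    diff = gram_matrix(feat) - gram_matrix(feat_hat)
    return np.sum(diff ** 2) / (4 * n_l ** 2 * m_l ** 2)

rng = np.random.default_rng(0)
feat = rng.standard_normal((3, 4, 5))
G = gram_matrix(feat)
same = style_layer_loss(feat, feat)        # identical features give zero loss
```

The Gram matrix discards all spatial layout and keeps only which channels fire together, which is why it captures texture rather than content.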

Texture Synthesis Using Convolutional Neural Networks

Style Transfer

Feature + Gram Reconstruction

Problem: Style transfer requires many forward / backward passes. Very slow!

Fast Style Transfer

9 - Object Detection and Image Segmentation

Semantic Segmentation

Paired Training Data: For each training image, each pixel is labeled with a semantic category.

Fully Convolutional Network: Design a network with only convolutional layers without downsampling operators to make predictions for pixels all at once!

Problem: Convolutions at original image resolution will be very expensive...

Solution: Design fully convolutional network with downsampling and upsampling inside it!

  • Downsampling: Pooling, strided convolution.
  • Upsampling: Unpooling, transposed convolution.

Unpooling:

Nearest Neighbor "Bed of Nails" "Position Memory"

Transposed Convolution: (example size \(3\times3\), stride \(2\), pad \(1\))

Normal Convolution Transposed Convolution
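A 1D sketch makes the transposed convolution easier to follow: each input value scales a copy of the filter, the copies are placed `stride` positions apart, and overlapping entries are summed. The sketch uses no padding, and the function name is an assumption.

```python
import numpy as np

def transposed_conv1d(x, kernel, stride=2):
    kernel = np.asarray(kernel, dtype=float)
    k = len(kernel)
    out = np.zeros((len(x) - 1) * stride + k)
    for i, value in enumerate(x):
        # Each input value scales a copy of the kernel; overlaps are summed.
        out[i * stride:i * stride + k] += value * kernel
    return out

out = transposed_conv1d([1.0, 2.0], [1.0, 1.0, 1.0])
```

With input `[a, b]` and filter `[x, y, z]` the result is `[ax, ay, az + bx, by, bz]`, so the output is longer than the input, which is what makes the operation useful for upsampling.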

Object Detection

Single Object

Classification + Localization. (classification + regression problem)

Multiple Object

R-CNN

Using selective search to find “blobby” image regions that are likely to contain objects.

  1. Find regions of interest (RoI) using selective search. (region proposal)
  2. Forward each region through ConvNet.
  3. Classify features with SVMs.

Problem: Very slow. Need to do 2000 independent forward passes for each image!

Fast R-CNN

Pass the image through ConvNet before cropping. Crop the conv feature instead.

  1. Run whole image through ConvNet.
  2. Find regions of interest (RoI) with selective search on the image and project them onto the conv features. (region proposal)
  3. Classify RoIs using CNN.

Problem: Runtime is dominated by region proposals. (about \(90\%\) time cost)

Faster R-CNN

Insert Region Proposal Network (RPN) to predict proposals from features.

Otherwise same as Fast R-CNN: Crop features for each proposal, classify each one.

Region Proposal Network (RPN) : Slide many fixed windows over ConvNet features.

  1. Treat each point in the feature map as an anchor.

We have \(k\) fixed windows (anchor boxes) of different sizes/scales centered at each anchor.

  2. For each anchor box, predict whether it contains an object.

For positive boxes, also predict a correction from the anchor box to the ground-truth box.

  3. Slide the anchor over the feature map, get the “objectness” score for each box at each point.
  4. Sort the “objectness” scores, take the top \(300\) as the proposals.

Faster R-CNN is a two-stage object detector:

  1. First stage: Run once per image
  • Backbone network
  • Region proposal network
  2. Second stage: Run once per region
  • Crop features: RoI pool / align
  • Predict object class
  • Predict bbox offset

Single-Stage Object Detectors: YOLO

You Only Look Once: Unified, Real-Time Object Detection

  1. Divide the image into a grid of cells. (e.g. \(7\times7\))
  2. Place anchors at the center of each cell.
  3. For each cell, use \(B\) anchor boxes to regress \(5\) numbers each, \(\text{dx, dy, dh, dw, confidence}\), and predict scores for each of \(C\) classes.
  4. The final output has shape \(7\times7\times(5B+C)\).
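As a quick check of the output size, the sketch below computes the shape of the prediction tensor. The values \(B=2\) and \(C=20\) are assumptions picked for illustration (these notes do not fix them), and `yolo_output_shape` is a hypothetical helper name.

```python
def yolo_output_shape(grid_size, num_boxes, num_classes):
    """Shape of the prediction tensor: S x S x (5B + C)."""
    per_cell = 5 * num_boxes + num_classes   # 5 numbers per box plus class scores
    return (grid_size, grid_size, per_cell)


# Assumed example values: 7x7 grid, B = 2 boxes, C = 20 classes.
shape = yolo_output_shape(grid_size=7, num_boxes=2, num_classes=20)
print(shape)
```

Every cell predicts all of its boxes and class scores in one forward pass, which is what makes the detector single-stage.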

Instance Segmentation

Mask R-CNN: Add a small mask network that operates on each RoI and predicts a \(28\times28\) binary mask.

Mask R-CNN achieves very good results!

10 - Recurrent Neural Networks

Supplement content added according to Deep Learning Book - RNN.

Recurrent Neural Network (RNN)

Motivation: Sequence Processing

| One to One | One to Many | Many to One | Many to Many | Many to Many |
| --- | --- | --- | --- | --- |
| Vanilla Neural Networks | Image Captioning | Action Prediction | Video Captioning | Video Classification on Frame Level |

Vanilla RNN

\(x^{(t)}\) : Input at time \(t\).

\(h^{(t)}\) : State at time \(t\).

\(o^{(t)}\) : Output at time \(t\).

\(y^{(t)}\) : Expected output at time \(t\).

Many to One

Calculation
State Transition \(h^{(t)}=\tanh(Wh^{(t-1)}+Ux^{(t)}+b)\)
Output Calculation \(o^{(\tau)}=\text{sigmoid}\ \big(Vh^{(\tau)}+c\big)\)
Many to Many (type 2)

Calculation
State Transition \(h^{(t)}=\tanh(Wh^{(t-1)}+Ux^{(t)}+b)\)
Output Calculation \(o^{(t)}=\text{sigmoid}\ \big(Vh^{(t)}+c\big)\)
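The two equations above can be sketched in NumPy as follows. This is a minimal illustration, not the course's reference code: the toy dimensions, the random initialization and the helper name `rnn_forward` are all assumptions.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def rnn_forward(xs, W, U, V, b, c, h0):
    """Run a vanilla RNN over a sequence; return all states and outputs."""
    h = h0
    states, outputs = [], []
    for x in xs:
        h = np.tanh(W @ h + U @ x + b)   # state transition
        o = sigmoid(V @ h + c)           # output at this time step
        states.append(h)
        outputs.append(o)
    return np.stack(states), np.stack(outputs)


rng = np.random.default_rng(0)
T, input_dim, hidden_dim, output_dim = 5, 3, 4, 2   # toy sizes (assumed)
xs = rng.normal(size=(T, input_dim))
W = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
U = rng.normal(size=(hidden_dim, input_dim)) * 0.1
V = rng.normal(size=(output_dim, hidden_dim)) * 0.1
b = np.zeros(hidden_dim)
c = np.zeros(output_dim)

states, outputs = rnn_forward(xs, W, U, V, b, c, h0=np.zeros(hidden_dim))
print(states.shape, outputs.shape)
```

For the many-to-one case, only the last row `outputs[-1]` would be used; the same weights are applied at every time step in both cases.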

RNN with Teacher Forcing

Update current state according to last-time output instead of last-time state.

Calculation
State Transition \(h^{(t)}=\tanh(Wo^{(t-1)}+Ux^{(t)}+b)\)
Output Calculation \(o^{(t)}=\text{sigmoid}\ \big(Vh^{(t)}+c\big)\)

RNN with "Output Forwarding"

We can also feed the previous step's output into the state transition together with the current input.

Calculation
State Transition (training) \(h^{(t)}=\tanh(Wh^{(t-1)}+Ux^{(t)}+Ry^{(t-1)}+b)\)
State Transition (testing) \(h^{(t)}=\tanh(Wh^{(t-1)}+Ux^{(t)}+Ro^{(t-1)}+b)\)
Output Calculation \(o^{(t)}=\text{sigmoid}\ \big(Vh^{(t)}+c\big)\)

Usually we use \(o^{(t-1)}\) in place of \(y^{(t-1)}\) at testing time.

Bidirectional RNN

When dealing with a whole input sequence, we can process features from two directions.

Calculation
State Transition (forward) \(h^{(t)}=\tanh(W_1h^{(t-1)}+U_1x^{(t)}+b_1)\)
State Transition (backward) \(g^{(t)}=\tanh(W_2g^{(t+1)}+U_2x^{(t)}+b_2)\)
Output Calculation \(o^{(t)}=\text{sigmoid}\ \big(Vh^{(t)}+Wg^{(t)}+c\big)\)

Encoder-Decoder Sequence to Sequence RNN

This is a many-to-many structure (type 1).

First we encode information according to \(x\) with no output.

Later we decode information according to \(y\) with no input.

\(C\) : Context vector, often \(C=h^{(T)}\) (last state of encoder).

Calculation
State Transition (encode) \(h^{(t)}=\tanh(W_1h^{(t-1)}+U_1x^{(t)}+b_1)\)
State Transition (decode, training) \(s^{(t)}=\tanh(W_2s^{(t-1)}+U_2y^{(t-1)}+TC+b_2)\)
State Transition (decode, testing) \(s^{(t)}=\tanh(W_2s^{(t-1)}+U_2o^{(t-1)}+TC+b_2)\)
Output Calculation \(o^{(t)}=\text{sigmoid}\ \big(Vs^{(t)}+c\big)\)

Example: Image Captioning

Summary

Advantages of RNN:

  1. Can process any length input.
  2. Computation for step \(t\) can (in theory) use information from many steps back.
  3. Model size doesn’t increase for longer input.
  4. Same weights applied on every timestep, so there is symmetry in how inputs are processed.

Disadvantages of RNN:

  1. Recurrent computation is slow.
  2. In practice, difficult to access information from many steps back.
  3. Problems with gradient exploding and gradient vanishing. (check Deep Learning Book - RNN Page 396, Chap 10.7)

Long Short Term Memory (LSTM)

Add a "cell block" to store long-term history information.

\(c^{(t)}\) : Cell at time \(t\).

\(f^{(t)}\) : Forget gate at time \(t\). Deciding whether to erase the cell.

\(i^{(t)}\) : Input gate at time \(t\). Deciding whether to write to the cell.

\(g^{(t)}\) : External input gate at time \(t\). Deciding how much to write to the cell.

\(o^{(t)}\) : Output gate at time \(t\). Deciding how much to reveal the cell.

Calculation (Gate)
Forget Gate \(f^{(t)}=\text{sigmoid}\ \big(W_fh^{(t-1)}+U_fx^{(t)}+b_f\big)\)
Input Gate \(i^{(t)}=\text{sigmoid}\ \big(W_ih^{(t-1)}+U_ix^{(t)}+b_i\big)\)
External Input Gate \(g^{(t)}=\tanh(W_gh^{(t-1)}+U_gx^{(t)}+b_g)\)
Output Gate \(o^{(t)}=\text{sigmoid}\ \big(W_oh^{(t-1)}+U_ox^{(t)}+b_o\big)\)
Calculation (Main)
Cell Transition \(c^{(t)}=f^{(t)}\odot c^{(t-1)}+i^{(t)}\odot g^{(t)}\)
State Transition \(h^{(t)}=o^{(t)}\odot\tanh(c^{(t)})\)
Output Calculation \(O^{(t)}=\text{sigmoid}\ \big(Vh^{(t)}+c\big)\)
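A single LSTM step following the gate and cell equations above might look like the sketch below. The parameter layout (a dict keyed by gate name), the toy sizes and the name `lstm_step` are assumptions made for illustration.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step: compute the four gates, then update cell and state."""
    W, U, b = params["W"], params["U"], params["b"]
    f = sigmoid(W["f"] @ h_prev + U["f"] @ x + b["f"])   # forget gate
    i = sigmoid(W["i"] @ h_prev + U["i"] @ x + b["i"])   # input gate
    g = np.tanh(W["g"] @ h_prev + U["g"] @ x + b["g"])   # external input gate
    o = sigmoid(W["o"] @ h_prev + U["o"] @ x + b["o"])   # output gate
    c = f * c_prev + i * g                               # cell transition
    h = o * np.tanh(c)                                   # state transition
    return h, c


rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4                             # toy sizes (assumed)
params = {
    "W": {k: rng.normal(size=(hidden_dim, hidden_dim)) * 0.1 for k in "figo"},
    "U": {k: rng.normal(size=(hidden_dim, input_dim)) * 0.1 for k in "figo"},
    "b": {k: np.zeros(hidden_dim) for k in "figo"},
}
h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x in rng.normal(size=(6, input_dim)):
    h, c = lstm_step(x, h, c, params)
print(h.shape, c.shape)
```

Note that the cell update is additive (`f * c_prev + i * g`), which is what lets information and gradients pass through many steps more easily than in the vanilla RNN.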

Other RNN Variants

GRU...

11 - Attention and Transformers

RNN with Attention

Encoder-Decoder Sequence to Sequence RNN Problem:

Input sequence bottlenecked through a fixed-sized context vector \(C\). (e.g. \(T=1000\))

Intuitive Solution:

Generate a new context vector \(C^{(t)}\) at each step \(t\)!

\(e_i^{(t)}\) : Alignment score for input \(i\) at state \(t\). (scalar)

\(a_i^{(t)}\) : Attention weight for input \(i\) at state \(t\).

\(C^{(t)}\) : Context vector at state \(t\).

Calculation
Alignment Score \(e_i^{(t)}=f(s^{(t-1)},h^{(i)})\), where \(f\) is an MLP.
Attention Weight \(a_i^{(t)}=\frac{\exp(e_i^{(t)})}{\sum_j\exp(e_j^{(t)})}\), i.e. a softmax over all \(e_j^{(t)}\) at state \(t\).
Context Vector \(C^{(t)}=\sum_i a_i^{(t)}h^{(i)}\)
Decoder State Transition \(s^{(t)}=\tanh(Ws^{(t-1)}+Uy^{(t-1)}+TC^{(t)}+b)\)

Example on Image Captioning:

General Attention Layer

Add linear transformations to the input vector before attention.

Notice:

  1. The number of queries \(q\) can vary. (it can differ from the number of keys \(k\))
  2. The number of outputs \(y\) equals the number of queries \(q\). Each \(y\) is a weighted sum of the values \(v\).
  3. Alignment \(e\) is divided by \(\sqrt{D}\), where \(D\) is the dimension of the input feature, so that the scores do not grow with \(D\) and saturate the softmax.
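The general attention layer can be sketched in NumPy as below. This is an illustrative sketch under assumptions: keys and values come from linear maps `W_k` and `W_v` of the inputs, \(D\) is taken to be the query/key dimension, and all sizes are toy values.

```python
import numpy as np


def softmax(e, axis=-1):
    e = e - e.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    w = np.exp(e)
    return w / w.sum(axis=axis, keepdims=True)


def attention(Q, X, W_k, W_v):
    """Q: (num_queries, D) queries; X: (num_inputs, D_in) input vectors."""
    K = X @ W_k                               # keys,   (num_inputs, D)
    V = X @ W_v                               # values, (num_inputs, D_v)
    E = Q @ K.T / np.sqrt(Q.shape[1])         # scaled alignment scores
    A = softmax(E, axis=1)                    # attention weights over the inputs
    Y = A @ V                                 # one output per query
    return Y, A


rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))                   # 2 queries
X = rng.normal(size=(5, 6))                   # 5 inputs
W_k = rng.normal(size=(6, 8))
W_v = rng.normal(size=(6, 4))
Y, A = attention(Q, X, W_k, W_v)
print(Y.shape, A.shape)
```

With 2 queries and 5 inputs, the output has 2 rows, which matches the note that the number of outputs follows the number of queries rather than the number of keys.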

Self-attention Layer

The query vectors \(q\) are also generated from the inputs.

In this way, the shape of \(y\) is equal to the shape of \(x\).

Example with CNN:

Positional Encoding

A self-attention layer does not know the order of its inputs: permuting the inputs simply permutes the outputs.

To encode ordered sequences like language or spatially ordered image features, we can add positional encoding to the inputs.

We use a function \(P:R\rightarrow R^d\) to process the position \(i\) into a d-dimensional vector \(p_i=P(i)\).

Constraint Condition of \(P\)
Uniqueness \(P(i)\ne P(j)\) for \(i\ne j\)
Equidistance \(\lVert P(i+k)-P(i)\rVert^2=\lVert P(j+k)-P(j)\rVert^2\)
Boundedness \(P(i)\in[a,b]\)
Determinacy \(P(i)\) is always a static value. (function is not dynamic)

We can either train an encoder model or design a fixed function.

A Practical Positional Encoding Method: Use \(\sin\) and \(\cos\) with a different frequency \(\omega\) in each pair of dimensions.

\(P(t)=\begin{bmatrix}\sin(\omega_1t)\\\cos(\omega_1t)\\\sin(\omega_2t)\\\cos(\omega_2t)\\\vdots\\\sin(\omega_{\frac{d}{2}}t)\\\cos(\omega_{\frac{d}{2}}t)\end{bmatrix}\), where frequency \(\omega_k=\frac{1}{10000^{\frac{2k}{d}}}\). (wavelength \(\lambda_k=\frac{2\pi}{\omega_k}=2\pi\cdot10000^{\frac{2k}{d}}\))

\(P(t)=\begin{bmatrix}\sin(t/10000^{\frac{2}{d}})\\\cos(t/10000^{\frac{2}{d}})\\\sin(t/10000^{\frac{4}{d}})\\\cos(t/10000^{\frac{4}{d}})\\\vdots\\\sin(t/10000^1)\\\cos(t/10000^1)\end{bmatrix}\), after we substitute \(\omega_k\) into the equation.

\(P(t)\) is a vector with size \(d\), where \(d\) is a hyperparameter to choose according to the length of input sequence.

An intuition of this method is the binary encoding of numbers.

[lecture 11d] Attention and Transformers (positional encoding supplement, code implementation, distance computation)

It is easy to prove that \(P(t)\) satisfies "Equidistance": (set \(d=2\) for example)

\(\begin{aligned}\lVert P(i+k)-P(i)\rVert^2&=\big[\sin(\omega_1(i+k))-\sin(\omega_1i)\big]^2+\big[\cos(\omega_1(i+k))-\cos(\omega_1i)\big]^2\\&=2-2\sin(\omega_1(i+k))\sin(\omega_1i)-2\cos(\omega_1(i+k))\cos(\omega_1i)\\&=2-2\cos(\omega_1k)\end{aligned}\)

So the distance does not depend on \(i\), and we have \(\lVert P(i+k)-P(i)\rVert^2=\lVert P(j+k)-P(j)\rVert^2\).
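The equidistance property can also be checked numerically. The sketch below builds the sinusoidal encoding with \(\omega_k=1/10000^{2k/d}\) and compares the squared distance for the same offset at two different positions; the positions, offset and function names are arbitrary choices for illustration.

```python
import math


def positional_encoding(t, d):
    """Sinusoidal encoding P(t) with omega_k = 1 / 10000^(2k/d), k = 1..d/2."""
    p = []
    for k in range(1, d // 2 + 1):
        omega = 1.0 / (10000 ** (2 * k / d))
        p.append(math.sin(omega * t))
        p.append(math.cos(omega * t))
    return p


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


d, k = 32, 3
# Same offset k, two different starting positions (5 and 40).
dist_a = squared_distance(positional_encoding(5 + k, d), positional_encoding(5, d))
dist_b = squared_distance(positional_encoding(40 + k, d), positional_encoding(40, d))
print(abs(dist_a - dist_b) < 1e-9)
```

Each sin/cos pair contributes \(2-2\cos(\omega_kk)\) to the squared distance, so the total depends only on the offset.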

Visualization of \(P(t)\) features: (set \(d=32\), \(x\) axis represents the position of sequence)

Masked Self-attention Layer

To prevent vectors from looking at future vectors, we manually set alignment scores to \(-\infty\).
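A small NumPy sketch of this masking step is shown below. It assumes the alignment scores are stored in a square matrix whose row \(i\) holds the scores of query \(i\) against every position, so "future" means column index greater than row index.

```python
import numpy as np


def causal_mask_scores(E):
    """Set the scores of future positions (column j > row i) to -inf."""
    n = E.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    return np.where(future, -np.inf, E)


def softmax_rows(E):
    E = E - E.max(axis=1, keepdims=True)      # the diagonal is never masked, so max is finite
    w = np.exp(E)                             # exp(-inf) evaluates to 0
    return w / w.sum(axis=1, keepdims=True)


E = np.arange(16, dtype=float).reshape(4, 4)  # toy alignment scores
A = softmax_rows(causal_mask_scores(E))
print(np.round(A, 3))
```

After the softmax, masked positions receive exactly zero weight, so each output only mixes values from the current and earlier positions.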

Multi-head Self-attention Layer

Multiple self-attention heads in parallel.

Transformer

Attention Is All You Need

Encoder Block

Inputs: Set of vectors \(z\). (in which \(z_i\) can be a word in a sentence, or a pixel in a picture...)

Output: Set of context vectors \(c\). (encoded features of \(z\))

The number of blocks \(N=6\) in original paper.

Notice:

  1. Self-attention is the only interaction between vectors \(z_0,z_1,\dots,z_n\).
  2. Layer norm and MLP operate independently per vector.
  3. Highly scalable, highly parallelizable, but high memory usage.

Decoder Block

Inputs: Set of vectors \(y\). (\(y_i\) can be a word in a sentence, or a pixel in a picture...)

Inputs: Set of context vectors \(c\).

Output: Set of vectors \(y'\). (decoded result, \(y'_i=y_{i+1}\) for the first \(n-1\) outputs)

The number of blocks \(N=6\) in original paper.

Notice:

  1. Masked self-attention only interacts with past inputs.
  2. Multi-head attention block is NOT self-attention. It attends over encoder outputs.
  3. Highly scalable, highly parallelizable, but high memory usage. (same as encoder)

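Why we need a mask in the decoder:

  1. The output is a shifted copy of the input, \(y'_i=y_{i+1}\), so each position must not see the tokens it has to predict.
  2. It lets all positions be computed in parallel during training.

A worked example of the Transformer's input/output details, and more

Why does the Transformer decoder still need a sequence mask at test or prediction time?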

Example on Image Captioning (Only with Transformers)

Comparing RNNs to Transformer

| | RNNs | Transformer |
| --- | --- | --- |
| Pros | LSTMs work reasonably well for long sequences. | 1. Good at long sequences. Each attention calculation looks at all inputs. 2. Can operate over unordered sets or ordered sequences with positional encodings. 3. Parallel computation: All alignment and attention scores for all inputs can be done in parallel. |
| Cons | 1. Expects an ordered sequence of inputs. 2. Sequential computation: Subsequent hidden states can only be computed after the previous ones are done. | Requires a lot of memory: \(N\times M\) alignment and attention scalars need to be calculated and stored for a single self-attention head. |

Comparing ConvNets to Transformer

ConvNets strike back!

12 - Video Understanding

Video Classification

Take video classification task for example.

Input size: \(C\times T\times H\times W\).

The problem is that videos are quite big. We can't afford to train on raw videos, so we train on short, low-resolution video clips instead.

| Raw Videos | Video Clips |
| --- | --- |
| \(1920\times1080,\ 30\text{fps}\) | \(112\times112,\ 16\text{ frames at }5\text{fps}\ (3.2\text{s})\) |
| \(10\text{GB}/\text{min}\) | \(588\text{KB}/\text{clip}\) |
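These sizes follow from simple arithmetic. The sketch below assumes uncompressed video at 3 bytes per pixel and, for the clip, 16 frames (3.2 s at 5 fps); both are assumptions taken from the usual lecture setup rather than stated explicitly here, and `video_bytes` is a hypothetical helper.

```python
def video_bytes(height, width, frames, channels=3, bytes_per_value=1):
    """Uncompressed size of a video in bytes."""
    return height * width * frames * channels * bytes_per_value


raw_per_minute = video_bytes(1080, 1920, frames=30 * 60)   # one minute at 30 fps
clip = video_bytes(112, 112, frames=16)                     # one 3.2 s clip at 5 fps

print(f"raw: {raw_per_minute / 1e9:.1f} GB per minute")
print(f"clip: {clip / 1024:.0f} KB per clip")
```

The gap of more than four orders of magnitude is why training uses clips rather than raw video.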

Plain CNN Structure

Single Frame 2D-CNN

Train a normal 2D-CNN model.

Classify each frame independently.

Average the result of each frame as the final result.

Late Fusion

Get high-level appearance of each frame, and combine them.

Run 2D-CNN on each frame, pool features and feed to Linear Layers.

Problem: Hard to compare low-level motion between frames.

Early Fusion

Fuse the frames in the very first conv layer, then continue with a normal 2D-CNN.

Problem: One layer of temporal processing may not be enough!

3D-CNN

Convolve on 3 dimensions: Height, Width, Time.

Input size: \(C_{in}\times T\times H\times W\).

Kernel size: \(C_{in}\times C_{out}\times 3\times 3\times 3\).

Output size: \(C_{out}\times T\times H\times W\). (with zero padding)

C3D (VGG of 3D-CNNs)

The computation is quite expensive...

| Network | Calculation |
| --- | --- |
| AlexNet | 0.7 GFLOP |
| VGG-16 | 13.6 GFLOP |
| C3D | 39.5 GFLOP |

Two-Stream Networks

Separate motion and appearance.

I3D (Inflating 2D Networks to 3D)

Take a 2D-CNN architecture.

Replace each 2D conv/pool layer with a 3D version.

Modeling Long-term Temporal Structure

Recurrent Convolutional Network

Similar to multi-layer RNN, we replace the dot-product operation with convolution.

Feature size in layer \(L\), time \(t-1\): \(W_h\times H\times W\).

Feature size in layer \(L-1\), time \(t\): \(W_x\times H\times W\).

Feature size in layer \(L\), time \(t\): \((W_h+W_x)\times H\times W\).

Problem: RNNs are slow for long sequences. (can’t be parallelized)

Spatio-temporal Self-attention

Introduce self-attention into video classification problems.

Vision Transformers for Video

Factorized attention: Attend over space / time.

So many papers...

Visualizing Video Models

Multimodal Video Understanding

Temporal Action Localization

Given a long untrimmed video sequence, identify frames corresponding to different actions.

Spatio-Temporal Detection

Given a long untrimmed video, detect all the people in both space and time and classify the activities they are performing.

Visually-guided Audio Source Separation

And So on...

13 - Generative Models

PixelRNN and PixelCNN

Fully Visible Belief Network (FVBN)

\(p(x)\) : Likelihood of image \(x\).

\(p(x_1,x_2,\dots,x_n)\) : Joint likelihood of all \(n\) pixels in image \(x\).

\(p(x_i|x_1,x_2,\dots,x_{i-1})\) : Probability of pixel \(i\) value given all previous pixels.

For explicit density models, we have \(p(x)=p(x_1,x_2,\dots,x_n)=\prod_{i=1}^np(x_i|x_1,x_2,\dots,x_{i-1})\).

Objective: Maximize the likelihood of training data.

PixelRNN

Generate image pixels starting from corner.

Dependency on previous pixels modeled using an RNN (LSTM).

Drawback: Sequential generation is slow in both training and inference!

PixelCNN

Still generate image pixels starting from corner.

Dependency on previous pixels modeled using a CNN over context region (masked convolution).

Drawback: Though its training is faster, its generation is still slow. (pixel by pixel)

Variational Autoencoder

Supplement content added according to Tutorial on Variational Autoencoders. (paper with notes: VAE Tutorial.pdf)

Variational Autoencoder (VAE): So That's How It Works | Open-Source Code Included

Autoencoder

Learn a lower-dimensional feature representation with unsupervised approaches.

\(x\rightarrow z\) : Dimension reduction for input features.

\(z\rightarrow \hat{x}\) : Reconstruct input features.

After training, we throw the decoder away and use the encoder for transferring.

For generative models, there is a problem:

We can’t generate new images from an autoencoder because we don’t know the space of \(z\).

Variational Autoencoder

Character Description

\(X\) : Images. (random variable)

\(Z\) : Latent representations. (random variable)

\(P(X)\) : True distribution of all training images \(X\).

\(P(Z)\) : True distribution of all latent representations \(Z\).

\(P(X|Z)\) : True conditional distribution (likelihood) of images \(X\) given \(Z\).

\(P(Z|X)\) : True posterior distribution of latent representations \(Z\) given \(X\).

\(Q(Z|X)\) : Approximate posterior distribution of latent representations \(Z\) given \(X\).

\(x\) : A specific image.

\(z\) : A specific latent representation.

\(\theta\): Learned parameters in decoder network.

\(\phi\): Learned parameters in encoder network.

\(p_\theta(x)\) : Probability that \(x\sim P(X)\).

\(p_\theta(z)\) : Probability that \(z\sim P(Z)\).

\(p_\theta(x|z)\) : Probability that \(x\sim P(X|Z)\).

\(p_\theta(z|x)\) : Probability that \(z\sim P(Z|X)\).

\(q_\phi(z|x)\) : Probability that \(z\sim Q(Z|X)\).

Decoder

Objective:

Generate new images from \(\mathscr{z}\).

  1. Generate a value \(z^{(i)}\) from the prior distribution \(P(Z)\).
  2. Generate a value \(x^{(i)}\) from the conditional distribution \(P(X|Z)\).

Lemma:

Any distribution in \(d\) dimensions can be generated by taking a set of \(d\) variables that are normally distributed and mapping them through a sufficiently complicated function. (source: Tutorial on Variational Autoencoders, Page 6)

Solutions:

  1. Choose prior distribution \(P(Z)\) to be a simple distribution, for example \(P(Z)\sim N(0,1)\).
  2. Learn the conditional distribution \(P(X|Z)\) through a neural network (decoder) with parameter \(\theta\).

Encoder

Objective:

Learn \(\mathscr{z}\) with training images.

Given: (From the decoder, we can deduce the following probabilities.)

  1. data likelihood: \(p_\theta(x)=\int p_\theta(x|z)p_\theta(z)dz\).
  2. posterior density: \(p_\theta(z|x)=\frac{p_\theta(x|z)p_\theta(z)}{p_\theta(x)}=\frac{p_\theta(x|z)p_\theta(z)}{\int p_\theta(x|z)p_\theta(z)dz}\).

Problem:

Both \(p_\theta(x)\) and \(p_\theta(z|x)\) are intractable. (can't be optimized directly as they contain integral operation)

Solution:

Learn \(Q(Z|X)\) to approximate the true posterior \(P(Z|X)\).

Use \(q_\phi(z|x)\) in place of \(p_\theta(z|x)\).

Variational Autoencoder (Combination of Encoder and Decoder)

Objective:

Maximize \(p_\theta(x)\) for all \(x^{(i)}\) in the training set.

$$
\begin{aligned}
\log p_\theta\big(x^{(i)}\big)&=\mathbb{E}_{z\sim q_\phi(z|x^{(i)})}\Big[\log p_\theta\big(x^{(i)}\big)\Big]\\
&=\mathbb{E}_z\Bigg[\log\frac{p_\theta\big(x^{(i)}|z\big)p_\theta\big(z\big)}{p_\theta\big(z|x^{(i)}\big)}\Bigg]\quad\text{(Bayes' Rule)}\\
&=\mathbb{E}_z\Bigg[\log\frac{p_\theta\big(x^{(i)}|z\big)p_\theta\big(z\big)}{p_\theta\big(z|x^{(i)}\big)}\frac{q_\phi\big(z|x^{(i)}\big)}{q_\phi\big(z|x^{(i)}\big)}\Bigg]\quad\text{(Multiply by Constant)}\\
&=\mathbb{E}_z\Big[\log p_\theta\big(x^{(i)}|z\big)\Big]-\mathbb{E}_z\Bigg[\log\frac{q_\phi\big(z|x^{(i)}\big)}{p_\theta\big(z\big)}\Bigg]+\mathbb{E}_z\Bigg[\log\frac{q_\phi\big(z|x^{(i)}\big)}{p_\theta\big(z|x^{(i)}\big)}\Bigg]\quad\text{(Logarithm)}\\
&=\mathbb{E}_z\Big[\log p_\theta\big(x^{(i)}|z\big)\Big]-D_{\text{KL}}\Big[q_\phi\big(z|x^{(i)}\big)||p_\theta\big(z\big)\Big]+D_{\text{KL}}\Big[q_\phi\big(z|x^{(i)}\big)||p_\theta\big(z|x^{(i)}\big)\Big]\quad\text{(KL Divergence)}
\end{aligned}
$$

Analyze the Formula by Term:

\(\mathbb{E}_z\Big[\log p_\theta\big(x^{(i)}|z\big)\Big]\): Decoder network gives \(p_\theta\big(x^{(i)}|z\big)\), so we can estimate this term through sampling.

\(D_{\text{KL}}\Big[q_\phi\big(z|x^{(i)}\big)||p_\theta\big(z\big)\Big]\): This KL term (between Gaussians for encoder and \(z\) prior) has a nice closed-form solution!

\(D_{\text{KL}}\Big[q_\phi\big(z|x^{(i)}\big)||p_\theta\big(z|x^{(i)}\big)\Big]\): The part \(p_\theta\big(z|x^{(i)}\big)\) is intractable. However, we know KL divergence is always \(\ge0\).

Tractable Lower Bound:

We can maximize the lower bound of that formula.

As \(D_{\text{KL}}\Big[q_\phi\big(z|x^{(i)}\big)||p_\theta\big(z|x^{(i)}\big)\Big]\ge0\), we can deduce that:

$$
\begin{aligned}
\log p_\theta\big(x^{(i)}\big)&=\mathbb{E}_z\Big[\log p_\theta\big(x^{(i)}|z\big)\Big]-D_{\text{KL}}\Big[q_\phi\big(z|x^{(i)}\big)||p_\theta\big(z\big)\Big]+D_{\text{KL}}\Big[q_\phi\big(z|x^{(i)}\big)||p_\theta\big(z|x^{(i)}\big)\Big]\\
&\ge\mathbb{E}_z\Big[\log p_\theta\big(x^{(i)}|z\big)\Big]-D_{\text{KL}}\Big[q_\phi\big(z|x^{(i)}\big)||p_\theta\big(z\big)\Big]
\end{aligned}
$$

So the loss function \(\mathcal{L}\big(x^{(i)},\theta,\phi\big)=-\mathbb{E}_z\Big[\log p_\theta\big(x^{(i)}|z\big)\Big]+D_{\text{KL}}\Big[q_\phi\big(z|x^{(i)}\big)||p_\theta\big(z\big)\Big]\).

\(\mathbb{E}_z\Big[\log p_\theta\big(x^{(i)}|z\big)\Big]\): Decoder, reconstruct the input data.

\(D_{\text{KL}}\Big[q_\phi\big(z|x^{(i)}\big)||p_\theta\big(z\big)\Big]\): Encoder, make approximate posterior distribution close to prior.
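A single-sample estimate of this loss can be sketched as follows. The sketch assumes a diagonal Gaussian encoder output (`mu`, `log_var`), a standard normal prior, and a Gaussian decoder with unit variance, so the reconstruction term reduces to a squared error up to a constant. The toy `decode` function stands in for the decoder network and is purely hypothetical.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL[N(mu, diag(exp(log_var))) || N(0, I)]."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def sample_latent(mu, log_var, rng):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def vae_loss(x, mu, log_var, decode, rng):
    """Single-sample estimate of -E_z[log p(x|z)] + KL, up to an additive constant."""
    z = sample_latent(mu, log_var, rng)
    x_hat = decode(z)
    reconstruction = 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return reconstruction + kl_to_standard_normal(mu, log_var)


rng = random.Random(0)
decode = lambda z: [z[0] + z[1], z[0] - z[1]]   # toy stand-in for the decoder network
loss = vae_loss(x=[0.5, -0.2], mu=[0.1, 0.3], log_var=[-1.0, -0.5], decode=decode, rng=rng)
print(loss)
```

Sampling through `mu + sigma * eps` keeps the sample differentiable with respect to the encoder outputs, which is what allows both networks to be trained jointly.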

Generative Adversarial Networks (GANs)

Motivation & Modeling

Objective: Not modeling any explicit density function.

Problem: Want to sample from complex, high-dimensional training distribution. No direct way to do this!

Solution: Sample from a simple distribution, e.g. random noise. Learn the transformation to training distribution.

Problem: We can't supervise this mapping directly, because we don't know which sample \(z\) corresponds to which training image.

Solution: Use a discriminator network to tell whether the generated image is within the data distribution or not.

Discriminator network: Try to distinguish between real and fake images.

Generator network: Try to fool the discriminator by generating real-looking images.

\(x\) : Real data.

\(y\) : Fake data, which is generated by the generator network. \(y=G_{\theta_g}(z)\).

\(D_{\theta_d}(x)\) : Discriminator score, which is the likelihood of real image. \(D_{\theta_d}(x)\in[0,1]\).

Objective of discriminator network:

\(\max_{\theta_d}\bigg[\mathbb{E}_x\Big(\log D_{\theta_d}(x)\Big)+\mathbb{E}_{z\sim p(z)}\Big(\log\big(1-D_{\theta_d}(y)\big)\Big)\bigg]\)

Objective of generator network:

\(\min_{\theta_g}\max_{\theta_d}\bigg[\mathbb{E}_x\Big(\log D_{\theta_d}(x)\Big)+\mathbb{E}_{z\sim p(z)}\Big(\log\big(1-D_{\theta_d}(y)\big)\Big)\bigg]\)

Training Strategy

To combine these two networks, we can train them alternately:

  1. Gradient ascent on discriminator.

\(\max_{\theta_d}\bigg[\mathbb{E}_x\Big(\log D_{\theta_d}(x)\Big)+\mathbb{E}_{z\sim p(z)}\Big(\log\big(1-D_{\theta_d}(y)\big)\Big)\bigg]\)

  2. Gradient descent on generator.

\(\min_{\theta_g}\bigg[\mathbb{E}_{z\sim p(z)}\Big(\log\big(1-D_{\theta_d}(y)\big)\Big)\bigg]\)

However, when the generator is still poor and \(D_{\theta_d}(y)\) is close to \(0\), the gradient signal from \(\log\big(1-D_{\theta_d}(y)\big)\) is weak, making the generator hard to optimize.

So instead of minimizing \(\log\big(1-D_{\theta_d}(y)\big)\), we maximize \(\log D_{\theta_d}(y)\) with gradient ascent.

  1. Gradient ascent on discriminator.

\(\max_{\theta_d}\bigg[\mathbb{E}_x\Big(\log D_{\theta_d}(x)\Big)+\mathbb{E}_{z\sim p(z)}\Big(\log\big(1-D_{\theta_d}(y)\big)\Big)\bigg]\)

  2. Gradient ascent on generator.

\(\max_{\theta_g}\bigg[\mathbb{E}_{z\sim p(z)}\Big(\log D_{\theta_d}(y)\Big)\bigg]\)
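The difference between the two generator objectives shows up in their gradients. The sketch below assumes the discriminator ends in a sigmoid, \(D(y)=\sigma(s)\) with logit \(s\), and compares the derivative of each objective with respect to \(s\); this setup and the function names are assumptions for illustration.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def grad_saturating(s):
    """d/ds of log(1 - sigmoid(s)), the term the generator minimizes."""
    return -sigmoid(s)


def grad_non_saturating(s):
    """d/ds of log(sigmoid(s)), the term the generator maximizes instead."""
    return 1.0 - sigmoid(s)


for s in (-5.0, 0.0, 5.0):
    print(f"D(y)={sigmoid(s):.3f}  "
          f"|grad log(1-D)|={abs(grad_saturating(s)):.3f}  "
          f"|grad log D|={abs(grad_non_saturating(s)):.3f}")
```

When the discriminator confidently rejects a sample (strongly negative logit), the first gradient nearly vanishes while the second stays close to 1, so the generator receives a useful signal exactly when it is doing badly.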

Summary

Pros: Beautiful, state-of-the-art samples!

Cons:

  1. Trickier / more unstable to train.
  2. Can’t solve inference queries such as \(p(x), p(z|x)\).

14 - Self-supervised Learning

Aim: Solve “pretext” tasks that produce good features for downstream tasks.

Application:

  1. Learn a feature extractor from pretext tasks. (self-supervised)
  2. Attach a shallow network on the feature extractor.
  3. Train the shallow network on target task with small amount of labeled data. (supervised)

Pretext Tasks

Labels are generated automatically.

Rotation

Train a classifier on randomly rotated images.

Rearrangement

Train a classifier on randomly shuffled image pieces.

Predict the location of image pieces.

Inpainting

Mask part of the image, train a network to predict the masked area.

Method referencing Context Encoders: Feature Learning by Inpainting.

Combine two types of loss together to get better performance:

  1. Reconstruction loss (L2 loss): Used for reconstructing global features.
  2. Adversarial loss: Used for generating texture features.

Coloring

Transfer between greyscale images and colored images.

Cross-channel predictions for images: Split-Brain Autoencoders.

Video coloring: Establish mappings between reference and target frames in a learned feature space. Tracking Emerges by Colorizing Videos.

Summary for Pretext Tasks

  1. Pretext tasks focus on “visual common sense”.
  2. The models are forced to learn good features about natural images.
  3. We don’t care about the performance of these pretext tasks. What we care about is the performance of downstream tasks.

Problems of Specific Pretext Tasks

  1. Coming up with individual pretext tasks is tedious.
  2. The learned representations may not be general.

Intuitive Solution: Contrastive Learning.

Contrastive Representation Learning

Local additional references: Contrastive Learning.md.

Objective:

Given a chosen score function \(s\), we aim to learn an encoder function \(f\) that yields:

  1. For each sample \(x\), increase the similarity \(s\big(f(x),f(x^+)\big)\) between \(x\) and positive samples \(x^+\).
  2. Finally we want \(s\big(f(x),f(x^+)\big)\gg s\big(f(x),f(x^-)\big)\).

Loss Function:

Given \(1\) positive sample \(x^+\) and \(N-1\) negative samples \(x_j^-\):

| InfoNCE Loss | Cross Entropy Loss |
| --- | --- |
| \(\begin{aligned}\mathcal{L}=-\mathbb{E}_X\Bigg[\log\frac{\exp{s\big(f(x),f(x^+)\big)}}{\exp{s\big(f(x),f(x^+)\big)}+\sum_{j=1}^{N-1}\exp{s\big(f(x),f(x_j^-)\big)}}\Bigg]\end{aligned}\) | \(\begin{aligned}\mathcal{L}&=-\sum_{i=1}^Np(x_i)\log q(x_i)\\&=-\mathbb{E}_X\big[\log q(x)\big]\\&=-\mathbb{E}_X\Bigg[\log\frac{\exp(x)}{\sum_{j=1}^N\exp(x_j)}\Bigg]\end{aligned}\) |

The InfoNCE loss gives a lower bound on the mutual information between \(f(x)\) and \(f(x^+)\):

\(\text{MI}\big[f(x),f(x^+)\big]\ge\log(N)-\mathcal{L}\)

The larger the number of negative samples, the tighter the bound.

So we want to use as many negative samples as possible.
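For a single sample, the InfoNCE loss is a cross entropy over one positive and \(N-1\) negative scores. The sketch below takes precomputed scores as input; the example values and the name `info_nce` are illustrative assumptions.

```python
import math


def info_nce(pos_score, neg_scores):
    """-log( exp(s+) / (exp(s+) + sum_j exp(s_j-)) ) for one sample."""
    scores = [pos_score] + list(neg_scores)
    m = max(scores)                                   # subtract max for numerical stability
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - pos_score


n = 8
uninformative = info_nce(0.0, [0.0] * (n - 1))        # all scores equal
well_separated = info_nce(5.0, [-5.0] * (n - 1))      # the positive clearly wins
print(uninformative, well_separated)
```

When all scores are equal the loss equals \(\log N\), so the bound \(\log(N)-\mathcal{L}\) is zero; as the positive score separates from the negatives the loss approaches zero and the bound approaches \(\log N\).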

Instance Contrastive Learning

SimCLR

Use a projection function \(g(\cdot)\) to project features to a space where contrastive learning is applied.

The extra projection contributes a lot to the final performance.

Score Function: Cosine similarity \(s(u,v)=\frac{u^Tv}{\lVert u\rVert\lVert v\rVert}\).

Positive Pair: Pair of augmented data.

Momentum Contrastive Learning (MoCo)

There are mainly \(3\) training strategies in contrastive learning:

  1. end-to-end: Keys are updated together with queries, e.g. SimCLR. (limited by GPU size)
  2. memory bank: Store last-time keys for sampling. (inconsistency between \(q\) and \(k\))
  3. MoCo: Use momentum methods to encode keys. (combination of end-to-end & memory bank)

Key differences to SimCLR:

  1. Keep a running queue of keys (negative samples).
  2. Compute gradients and update the encoder only through the queries.
  3. Decouple the mini-batch size from the number of keys: can support a large number of negative samples.
  4. The key encoder progresses slowly through the momentum update rule:

\(\theta_k\leftarrow m\theta_k+(1-m)\theta_q\)
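The momentum update is an exponential moving average of the query-encoder parameters. The sketch below treats the parameters as plain lists of floats and uses \(m=0.999\); both the value of \(m\) and the toy parameters are assumptions for illustration.

```python
def momentum_update(theta_k, theta_q, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q, applied element-wise."""
    return [m * k + (1.0 - m) * q for k, q in zip(theta_k, theta_q)]


theta_q = [1.0, -2.0, 0.5]      # query-encoder parameters (updated by backprop)
theta_k = [0.0, 0.0, 0.0]       # key-encoder parameters (receive no gradient)
for _ in range(1000):
    theta_k = momentum_update(theta_k, theta_q)
print(theta_k)
```

With \(m\) close to 1 the key encoder drifts only slowly toward the query encoder, which keeps the keys in the queue reasonably consistent with each other.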

Sequence Contrastive Learning

Contrastive Predictive Coding (CPC)

Contrastive: Contrast between “right” and “wrong” sequences using contrastive learning.

Predictive: The model has to predict future patterns given the current context.

Coding: The model learns useful feature vectors, or “code”, for downstream tasks, similar to other self-supervised methods.

Other Examples (Frontier)

Contrastive Language Image Pre-training (CLIP)

Contrastive learning between image and natural language sentences.

15 - Low-Level Vision

Pass...

16 - 3D Vision

Representation

Explicit vs Implicit

Explicit: Easy to sample examples, hard to do inside/outside check.

Implicit: Hard to sample examples, easy to do inside/outside check.

| | Non-parametric | Parametric |
| --- | --- | --- |
| Explicit | Points. Meshes. | Splines. Subdivision Surfaces. |
| Implicit | Level Sets. Voxels. | Algebraic Surfaces. Constructive Solid Geometry. |

Point Clouds

The simplest representation.

Collection of \((x,y,z)\) coordinates.

Cons:

  1. Difficult to draw in under-sampled regions.
  2. No simplification or subdivision.
  3. No direct smooth rendering.
  4. No topological information.

Polygonal Meshes

Collection of vertices \(v\) and edges \(e\).

Pros:

  1. Can apply downsampling or upsampling on meshes.
  2. Error decreases by \(O(n^2)\) while meshes increase by \(O(n)\).
  3. Can approximate arbitrary topology.
  4. Efficient rendering.

Splines

Use specific functions to approximate the surface. (e.g. Bézier Curves)

Algebraic Surfaces

Use specific functions to represent the surface.

Constructive Solid Geometry

Combine implicit geometry with Boolean operations.

Level Sets

Store a grid of values to approximate the function.

The surface is found where the interpolated value equals \(0\).
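The implicit view can be illustrated with a signed distance function. The sketch below uses a unit sphere as the implicit function (an assumed example), performs the inside/outside check by sign, and locates the surface between two grid points by linear interpolation of the stored values.

```python
import math


def sphere_sdf(x, y, z, radius=1.0):
    """Signed distance to a sphere centered at the origin (negative inside)."""
    return math.sqrt(x * x + y * y + z * z) - radius


def is_inside(point):
    return sphere_sdf(*point) < 0.0


def zero_crossing(p0, p1):
    """Linearly interpolate where the value changes sign between two grid points."""
    v0, v1 = sphere_sdf(*p0), sphere_sdf(*p1)
    t = v0 / (v0 - v1)
    return tuple(a + t * (b - a) for a, b in zip(p0, p1))


print(is_inside((0.2, 0.1, 0.0)), is_inside((1.5, 0.0, 0.0)))
print(zero_crossing((0.5, 0.0, 0.0), (1.5, 0.0, 0.0)))
```

The inside/outside check is a single function evaluation, whereas producing surface points requires searching for sign changes, which matches the trade-off listed for implicit representations.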

Voxels

Binary thresholding the volumetric grid.

AI + 3D

Pass...

wnc's café

Computer Vision


This note is based on GitHub - DaizeDong/Stanford-CS231n-2021-and-2022: Notes and slides for Stanford CS231n 2021 & 2022 in English. I merged the contents together to get a better version. Assignments are not included. (Course notes for Stanford CS231n, English version without lab code; the 2021 and 2022 courses are merged and shared for discussion.)
I will also add some blogs, articles and my own understanding.

| Topic | Chapter |
| --- | --- |
| Deep Learning Basics | 2 - 4 |
| Perceiving and Understanding the Visual World | 5 - 12 |
| Reconstructing and Interacting with the Visual World | 13 - 16 |
| Human-Centered Applications and Implications | 17 - 18 |

1 - Introduction

A brief history of computer vision & deep learning...

2 - Image Classification

Image Classification: A core task in Computer Vision and a main driver of progress in CV.

Challenges: Viewpoint variation, background clutter, illumination, occlusion, deformation, intra-class variation...

K Nearest Neighbor

Hyperparameters: Distance metric (\(p\) norm), number of neighbors \(k\).

Choose hyperparameters using validation set.

Never use k-Nearest Neighbor with pixel distance.
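A minimal k-Nearest Neighbor classifier with the two hyperparameters above can be sketched as follows. The 2-D toy features and labels are made up for illustration; in line with the warning above, they stand for some feature representation rather than raw pixels.

```python
from collections import Counter


def lp_distance(a, b, p=2):
    """p-norm distance between two feature vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def knn_predict(train_x, train_y, query, k=3, p=2):
    """Majority vote among the k nearest training points."""
    order = sorted(range(len(train_x)), key=lambda i: lp_distance(train_x[i], query, p))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
train_y = ["cat", "cat", "cat", "dog", "dog"]
print(knn_predict(train_x, train_y, query=(0.8, 0.9), k=3))
```

Both `k` and `p` would be chosen on a validation set, never on the test set.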

Linear Classifier

Pass...

3 - Loss Functions and Optimization

Loss Functions

Dataset \(\big\{(x_i,y_i)\big\}_{i=1}^N\)
Loss Function \(L=\frac{1}{N}\sum_{i=1}^NL_i\big(f(x_i,W),y_i\big)\)
Loss Function with Regularization \(L=\frac{1}{N}\sum_{i=1}^NL_i\big(f(x_i,W),y_i\big)+\lambda R(W)\)

Motivation: Want to interpret raw classifier scores as probabilities.

Softmax Classifier \(p_i=\text{Softmax}(y_i)=\frac{\exp(y_i)}{\sum_{j=1}^N\exp(y_j)}\)
Cross Entropy Loss \(L_i=-y_i\log p_i\)
Cross Entropy Loss with Regularization \(L=-\frac{1}{N}\sum_{i=1}^Ny_i\log p_i+\lambda R(W)\)
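The softmax and cross entropy loss for one example can be sketched as below. The sketch assumes a one-hot label, so the loss reduces to the negative log-probability of the correct class; the score values are toy numbers.

```python
import math


def softmax(scores):
    m = max(scores)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, label):
    """Loss for one example: -log of the probability of the correct class."""
    return -math.log(softmax(scores)[label])


scores = [3.2, 5.1, -1.7]                        # raw class scores (toy values)
probs = softmax(scores)
print(probs, cross_entropy(scores, label=0))
```

If all scores are equal, the loss is \(\log\) of the number of classes, which is a handy sanity check at the start of training.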

Optimization

SGD with Momentum

Problems that SGD can't handle:

  1. Very different gradient magnitudes in different directions (poor conditioning).
  2. Local minima and saddle point (much more common in high dimension).
  3. Noise of gradient from mini-batch.

Momentum: Build up “velocity” \(v_t\) as a running mean of gradients.

| SGD | SGD + Momentum |
| --- | --- |
| \(x_{t+1}=x_t-\alpha\nabla f(x_t)\) | \(\begin{align}&v_{t+1}=\rho v_t+\nabla f(x_t)\\&x_{t+1}=x_t-\alpha v_{t+1}\end{align}\) |
| Naive gradient descent. | \(\rho\) gives "friction", typically \(\rho=0.9,0.99,0.999,...\) |

Nesterov Momentum: Evaluate the gradient at the look-ahead point \(x_t-\alpha\rho v_t\) (where the velocity alone would move \(x_t\)) instead of at \(x_t\).

| Momentum | Nesterov Momentum |
| --- | --- |
| \(\begin{align}&v_{t+1}=\rho v_t+\nabla f(x_t)\\&x_{t+1}=x_t-\alpha v_{t+1}\end{align}\) | \(\begin{align}&v_{t+1}=\rho v_t+\nabla f(x_t-\alpha\rho v_t)\\&x_{t+1}=x_t-\alpha v_{t+1}\end{align}\) |
| Use gradient at current point. | Look ahead for the gradient in velocity direction. |
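Both momentum variants can be sketched on a 1-D quadratic. In this sketch the look-ahead point is taken as \(x_t-\alpha\rho v_t\), which is where the velocity term alone would move \(x_t\) under the sign convention \(x_{t+1}=x_t-\alpha v_{t+1}\); the test function, learning rate and step count are assumptions.

```python
def sgd_momentum(grad, x0, lr=0.1, rho=0.9, steps=200, nesterov=False):
    """Minimize a 1-D function from its gradient with (Nesterov) momentum."""
    x, v = x0, 0.0
    for _ in range(steps):
        # Point where the gradient is evaluated: look-ahead for Nesterov, else x.
        look = x - lr * rho * v if nesterov else x
        v = rho * v + grad(look)
        x = x - lr * v
    return x


grad = lambda x: 2.0 * (x - 3.0)                 # gradient of f(x) = (x - 3)^2
print(sgd_momentum(grad, x0=0.0), sgd_momentum(grad, x0=0.0, nesterov=True))
```

Both runs should end close to the minimizer \(x=3\); the Nesterov variant typically oscillates less because the gradient is measured after the velocity step.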

AdaGrad and RMSProp

AdaGrad: Accumulate squared gradient, and gradually decrease the step size.

RMSProp: Accumulate squared gradient while decaying former ones, and gradually decrease the step size. ("Leaky AdaGrad")

| AdaGrad | RMSProp |
| --- | --- |
| \(\begin{align}\text{Initialize:}&\\&r:=0\\\text{Update:}&\\&r:=r+\Big[\nabla f(x_t)\Big]^2\\&x_{t+1}=x_t-\alpha\frac{\nabla f(x_t)}{\sqrt{r}}\end{align}\) | \(\begin{align}\text{Initialize:}&\\&r:=0\\\text{Update:}&\\&r:=\rho r+(1-\rho)\Big[\nabla f(x_t)\Big]^2\\&x_{t+1}=x_t-\alpha\frac{\nabla f(x_t)}{\sqrt{r}}\end{align}\) |
| Continually accumulate squared gradients. | \(\rho\) gives "decay rate", typically \(\rho=0.9,0.99,0.999,...\) |

Adam

Sort of like "RMSProp + Momentum".

| Adam (simple version) | Adam (full version) |
| --- | --- |
| \(\begin{align}\text{Initialize:}&\\&r_1:=0\\&r_2:=0\\\text{Update:}&\\&r_1:=\beta_1r_1+(1-\beta_1)\nabla f(x_t)\\&r_2:=\beta_2r_2+(1-\beta_2)\Big[\nabla f(x_t)\Big]^2\\&x_{t+1}=x_t-\alpha\frac{r_1}{\sqrt{r_2}}\end{align}\) | \(\begin{align}\text{Initialize:}&\\&r_1:=0\\&r_2:=0\\\text{For }i\text{:}&\\&r_1:=\beta_1r_1+(1-\beta_1)\nabla f(x_t)\\&r_2:=\beta_2r_2+(1-\beta_2)\Big[\nabla f(x_t)\Big]^2\\&r_1'=\frac{r_1}{1-\beta_1^i}\\&r_2'=\frac{r_2}{1-\beta_2^i}\\&x_{t+1}=x_t-\alpha\frac{r_1'}{\sqrt{r_2'}}\end{align}\) |
| Build up “velocity” for both gradient and squared gradient. | Correct the "bias" that \(r_1=r_2=0\) for the first few iterations. |
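The full version of Adam, including bias correction, can be sketched on a 1-D problem as below. The small `eps` in the denominator is an addition not shown in the formulas above (it avoids division by zero), and the hyperparameter values and test function are assumptions.

```python
import math


def adam(grad, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Adam with bias correction on a 1-D problem."""
    x, r1, r2 = x0, 0.0, 0.0
    for i in range(1, steps + 1):
        g = grad(x)
        r1 = beta1 * r1 + (1 - beta1) * g            # first moment ("velocity")
        r2 = beta2 * r2 + (1 - beta2) * g * g        # second moment (squared gradient)
        r1_hat = r1 / (1 - beta1 ** i)               # bias correction
        r2_hat = r2 / (1 - beta2 ** i)
        x = x - lr * r1_hat / (math.sqrt(r2_hat) + eps)
    return x


grad = lambda x: 2.0 * (x - 3.0)                     # gradient of f(x) = (x - 3)^2
print(adam(grad, x0=0.0))
```

Without the bias correction, `r1` and `r2` start near zero and the first few steps would be badly scaled; dividing by \(1-\beta^i\) removes that start-up bias.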

Overview

Learning Rate Decay

Reduce learning rate at a few fixed points to get a better convergence over time.

\(\alpha_0\) : Initial learning rate.

\(\alpha_t\) : Learning rate in epoch \(t\).

\(T\) : Total number of epochs.

| Method | Equation |
| --- | --- |
| Step | Reduce \(\alpha_t\) by a constant factor at fixed epochs. |
| Cosine | \(\begin{align}\alpha_t=\frac{1}{2}\alpha_0\Bigg[1+\cos(\frac{t\pi}{T})\Bigg]\end{align}\) |
| Linear | \(\begin{align}\alpha_t=\alpha_0\Big(1-\frac{t}{T}\Big)\end{align}\) |
| Inverse Sqrt | \(\begin{align}\alpha_t=\frac{\alpha_0}{\sqrt{t}}\end{align}\) |
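The three closed-form schedules can be written directly as functions of the epoch \(t\). The sketch below is a plain transcription of the equations; the values of \(\alpha_0\) and \(T\) are arbitrary.

```python
import math


def cosine_lr(alpha0, t, T):
    return 0.5 * alpha0 * (1.0 + math.cos(t * math.pi / T))


def linear_lr(alpha0, t, T):
    return alpha0 * (1.0 - t / T)


def inverse_sqrt_lr(alpha0, t):
    return alpha0 / math.sqrt(t)          # defined for t >= 1


alpha0, T = 0.1, 100
for t in (1, 50, 100):
    print(t, cosine_lr(alpha0, t, T), linear_lr(alpha0, t, T), inverse_sqrt_lr(alpha0, t))
```

Cosine and linear decay both reach zero at \(t=T\), while the inverse square root schedule never reaches zero and does not need \(T\) in advance.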

Learning rate warm up: High initial learning rates can make the loss explode. Linearly increasing the learning rate over the first few iterations can prevent this.

Empirical rule of thumb: If you increase the batch size by a factor of \(N\), also scale the initial learning rate by \(N\).

Second-Order Optimization

| | Time Complexity | Space Complexity |
| --- | --- | --- |
| First Order | \(O(n)\) | \(O(n)\) |
| Second Order | \(O(n^2)\) with BFGS optimization | \(O(n)\) with L-BFGS optimization |

L-BFGS : Limited memory BFGS.

  1. Works very well in full batch, deterministic \(f(x)\).
  2. Does not transfer very well to mini-batch setting.

Summary

Method Performance
Adam Often chosen as the default method.
Works OK even with a constant learning rate.
SGD + Momentum Can outperform Adam.
Requires more tuning of the learning rate and schedule.
L-BFGS Worth trying if you can afford full-batch updates.

4 - Neural Networks and Backpropagation

Neural Networks

Motivation: Inductive bias can be high when using human-designed features.

Activation: Sigmoid, tanh, ReLU, LeakyReLU...

Architecture: Input layer, hidden layer, output layer.

Do not use the size of a neural network as the regularizer. Use regularization instead!

Gradient Calculation: Computational Graph + Backpropagation.

Backpropagation

Using Jacobian matrix to calculate the gradient of each node in a computation graph.

Suppose that we have a computation flow like this:

Input X Input W Output Y
\(X=\begin{bmatrix}x_1\\x_2\\\vdots\\x_n\end{bmatrix}\) \(W=\begin{bmatrix}w_{11}&w_{12}&\cdots&w_{1n}\\w_{21}&w_{22}&\cdots&w_{2n}\\\vdots&\vdots&\ddots&\vdots\\w_{m1}&w_{m2}&\cdots&w_{mn}\end{bmatrix}\) \(Y=\begin{bmatrix}y_1\\y_2\\\vdots\\y_m\end{bmatrix}\)
\(n\times 1\) \(m\times n\) \(m\times 1\)

After applying feed forward, we can calculate gradients like this:

Derivative Matrix of X Jacobian Matrix of X Derivative Matrix of Y
\(D_X=\begin{bmatrix}\frac{\partial L}{\partial x_1}\\\frac{\partial L}{\partial x_2}\\\vdots\\\frac{\partial L}{\partial x_n}\end{bmatrix}\) \(J_X=\begin{bmatrix}\frac{\partial y_1}{\partial x_1}&\frac{\partial y_1}{\partial x_2}&\cdots&\frac{\partial y_1}{\partial x_n}\\\frac{\partial y_2}{\partial x_1}&\frac{\partial y_2}{\partial x_2}&\cdots&\frac{\partial y_2}{\partial x_n}\\\vdots&\vdots&\ddots&\vdots\\\frac{\partial y_m}{\partial x_1}&\frac{\partial y_m}{\partial x_2}&\cdots&\frac{\partial y_m}{\partial x_n}\end{bmatrix}\) \(D_Y=\begin{bmatrix}\frac{\partial L}{\partial y_1}\\\frac{\partial L}{\partial y_2}\\\vdots\\\frac{\partial L}{\partial y_m}\end{bmatrix}\)
\(n\times 1\) \(m\times n\) \(m\times 1\)
Derivative Matrix of W Jacobian Matrix of W Derivative Matrix of Y
\(D_W=\begin{bmatrix}\frac{\partial L}{\partial w_{11}}&\frac{\partial L}{\partial w_{12}}&\cdots&\frac{\partial L}{\partial w_{1n}}\\\frac{\partial L}{\partial w_{21}}&\frac{\partial L}{\partial w_{22}}&\cdots&\frac{\partial L}{\partial w_{2n}}\\\vdots&\vdots&\ddots&\vdots\\\frac{\partial L}{\partial w_{m1}}&\frac{\partial L}{\partial w_{m2}}&\cdots&\frac{\partial L}{\partial w_{mn}}\end{bmatrix}\) \(J_W^{(k)}=\begin{bmatrix}\frac{\partial y_k}{\partial w_{11}}&\frac{\partial y_k}{\partial w_{12}}&\cdots&\frac{\partial y_k}{\partial w_{1n}}\\\frac{\partial y_k}{\partial w_{21}}&\frac{\partial y_k}{\partial w_{22}}&\cdots&\frac{\partial y_k}{\partial w_{2n}}\\\vdots&\vdots&\ddots&\vdots\\\frac{\partial y_k}{\partial w_{m1}}&\frac{\partial y_k}{\partial w_{m2}}&\cdots&\frac{\partial y_k}{\partial w_{mn}}\end{bmatrix}\)
\(J_W=\begin{bmatrix}J_W^{(1)}&J_W^{(2)}&\cdots&J_W^{(m)}\end{bmatrix}\)
\(D_Y=\begin{bmatrix}\frac{\partial L}{\partial y_1}\\\frac{\partial L}{\partial y_2}\\\vdots\\\frac{\partial L}{\partial y_m}\end{bmatrix}\)
\(m\times n\) \(m\times m\times n\) \(m\times 1\)

For each element in \(D_X\) , we have:

\(D_{Xi}=\frac{\partial L}{\partial x_i}=\sum_{j=1}^m\frac{\partial L}{\partial y_j}\frac{\partial y_j}{\partial x_i}\\\)
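For the linear map \(Y=WX\), the Jacobian \(J_X\) is \(W\) itself, so the sum above collapses to a matrix product. The sketch below checks this numerically with NumPy; the concrete sizes, the random upstream gradient `D_Y`, and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
W = rng.standard_normal((m, n))
X = rng.standard_normal((n, 1))

Y = W @ X                            # forward pass, shape (m, 1)
D_Y = rng.standard_normal((m, 1))    # stand-in for the upstream gradient dL/dY

D_X = W.T @ D_Y                      # J_X = W, so D_X = J_X^T D_Y, shape (n, 1)
D_W = D_Y @ X.T                      # dL/dw_ij = dL/dy_i * x_j, shape (m, n)

# The same D_X written as the explicit sum over j from the formula above.
D_X_sum = np.array(
    [[sum(D_Y[j, 0] * W[j, i] for j in range(m))] for i in range(n)]
)
```

In practice the full Jacobian \(J_W\) of shape \(m\times m\times n\) is never materialized; the outer product `D_Y @ X.T` gives the same result directly.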

5 - Convolutional Neural Networks

Convolution Layer

Introduction

Convolve a filter with an image: Slide the filter spatially within the image, computing dot products in each region.

Given a \(32\times32\times3\) image and a \(5\times5\times3\) filter, a convolution looks like:

Convolving six \(5\times5\times3\) filters with a \(32\times32\times3\) image at stride \(1\) gives a \(28\times28\times6\) feature map:

With an activation function after each convolution layer, we can build the ConvNet with a sequence of convolution layers:

By changing the step size between each move for filters, or adding zero-padding around the image, we can modify the size of the output:

\(1\times1\) Convolution Layer

This kind of layer makes perfect sense. It is usually used to change the dimension (channel) of features.

A \(1\times1\) convolution layer can also be treated as a fully connected linear layer applied at each spatial position.

Summary

Input
image size \(W_1\times H_1\times C\)
filter size \(F\times F\times C\)
filter number \(K\)
stride \(S\)
zero padding \(P\)
Output
output size \(W_2\times H_2\times K\)
output width \(W_2=\frac{W_1-F+2P}{S}+1\\\)
output height \(H_2=\frac{H_1-F+2P}{S}+1\\\)
Parameters
parameter number (weight) \(F^2CK\)
parameter number (bias) \(K\)
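The summary table can be turned into a small helper for checking layer shapes. This is a sketch with a hypothetical function name; it assumes the stride divides the padded extent evenly, as in the formulas above.

```python
def conv_output_shape(w1, h1, c, f, k, s=1, p=0):
    """Return ((W2, H2, K), number of weights, number of biases) for a conv layer."""
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    n_weights = f * f * c * k        # F^2 * C * K
    n_biases = k                     # one bias per filter
    return (w2, h2, k), n_weights, n_biases

# The example from the text: six 5x5x3 filters on a 32x32x3 image, stride 1, no padding.
shape, n_weights, n_biases = conv_output_shape(32, 32, 3, f=5, k=6, s=1, p=0)
```

For the example in the text this gives a \(28\times28\times6\) output with \(450\) weights and \(6\) biases.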

Pooling layer

Make the representations smaller and more manageable.

An example of max pooling:

Input
image size \(W_1\times H_1\times C\)
spatial extent \(F\times F\)
stride \(S\)
Output
output size \(W_2\times H_2\times C\)
output width \(W_2=\frac{W_1-F}{S}+1\\\)
output height \(H_2=\frac{H_1-F}{S}+1\\\)
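A minimal single-channel max pooling sketch that follows the output-size formulas above. The function name and the loop-based implementation are illustrative choices, not an optimized routine.

```python
import numpy as np

def max_pool2d(x, f=2, s=2):
    """Max pooling over a 2-D array with an f x f window and stride s."""
    h, w = x.shape
    h2, w2 = (h - f) // s + 1, (w - f) // s + 1
    out = np.empty((h2, w2), dtype=x.dtype)
    for i in range(h2):
        for j in range(w2):
            out[i, j] = x[i * s:i * s + f, j * s:j * s + f].max()
    return out

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
pooled = max_pool2d(x)   # 2x2 window, stride 2 -> 2x2 output
```

Pooling has no learnable parameters and is applied to each channel independently, which is why the channel count \(C\) is unchanged in the table.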

Convolutional Neural Networks (CNN)

CNNs stack CONV, POOL, and FC layers.

CNN Trends:

  1. Smaller filters and deeper architectures.
  2. Getting rid of POOL/FC layers (just CONV).

Historically architectures of CNN looked like:

where usually \(m\) is large, \(0\le n\le5\), \(0\le k\le2\).

Recent advances such as ResNet / GoogLeNet have challenged this paradigm.

6 - CNN Architectures

Best model in ImageNet competition:

AlexNet

8 layers.

First ConvNet-based winner of the ImageNet classification challenge.

Filter size decreases in deeper layer.

Channel number increases in deeper layer.

VGG

19 layers. (a 16-layer version is also provided)

Static filter size (\(3\times3\)) in all layers:

  1. The effective receptive field expands as the layers get deeper.
  2. A deeper architecture gets more non-linearities and fewer parameters.

Most memory is in early convolution layers.

Most parameters are in late FC layers.

GoogLeNet

22 layers.

No FC layers, only 5M parameters. ( \(8.3\%\) of AlexNet, \(3.7\%\) of VGG )

Devise efficient "inception module".

Inception Module

Design a good local network topology (network within a network) and then stack these modules on top of each other.

Naive Inception Module:

  1. Apply parallel filter operations on the input from previous layer.
  2. Concatenate all filter outputs together channel-wise.
  3. Problem: The depth (channel number) increases too fast, costing expensive computation.

Inception Module with Dimension Reduction:

  1. Add "bottleneck" layers (\(1\times1\) convolutions) to reduce the dimension.
  2. This also lowers the computation cost.

Architecture

ResNet

152 layers for ImageNet.

Devise "residual connections".

Use BN in place of dropout.

Residual Connections

Hypothesis: Deeper models have more representation power than shallow ones. But they are harder to optimize.

Solution: Use network layers to fit a residual mapping instead of directly trying to fit a desired underlying mapping.

It is necessary to use ReLU as activation function, in order to apply identity mapping when \(F(x)=0\) .

Architecture

SENet

Using ResNeXt-152 as a base architecture.

Add a “feature recalibration” module. (adjust weights of each channel)

Using the global avg-pooling layer + FC layers to determine feature map weights.

Improvements of ResNet

Wide Residual Networks, ResNeXt, DenseNet, MobileNets...

Other Interesting Networks

NASNet: Neural Architecture Search with Reinforcement Learning.

EfficientNet: Smart Compound Scaling.

7 - Training Neural Networks

Activation Functions

Activation Usage
Sigmoid, tanh Do not use.
ReLU Use as default.
Leaky ReLU, Maxout, ELU, SELU Replace ReLU to squeeze out some marginal gains.
Swish No clear usage.

Data Processing

Apply centering (zero mean) and normalization before training.

In practice, for images we usually apply channel-wise centering only.

Weight Initialization

Assume that we have 6 layers in a network.

\(D_i\) : input size of layer \(i\)

\(W_i\) : weights in layer \(i\)

\(X_i\) : output after activation of layer \(i\), we have \(X_i=g(Z_i)=g(W_iX_{i-1}+B_i)\)

We initialize each parameter in \(W_i\) randomly with zero mean and standard deviation \(k_i\).

Tanh Activation Output Distribution
\(k_i=0.01\)
\(k_i=0.05\)
Xavier Initialization \(k_i=\frac{1}{\sqrt{D_i}}\)

When \(k_i=0.01\), the variance keeps decreasing as the layers get deeper. As a result, the outputs of neurons in deep layers are all close to 0. The partial derivative \(\frac{\partial Z_i}{\partial W_i}=X_{i-1}=0\). (no gradient)

When \(k_i=0.05\), most neurons are saturated. The partial derivative \(\frac{\partial X_i}{\partial Z_i}=g'(Z_i)=0\). (no gradient)

To solve this problem, we need to keep the variance the same in each layer.

Assuming that \(Var\big(X_{i-1}^{(1)}\big)=Var\big(X_{i-1}^{(2)}\big)=\dots=Var\big(X_{i-1}^{(D_i)}\big)\)

We have \(Z_i=X_{i-1}^{(1)}W_i^{(:,1)}+X_{i-1}^{(2)}W_i^{(:,2)}+\dots+X_{i-1}^{(D_i)}W_i^{(:,D_i)}=\sum_{n=1}^{D_i}X_{i-1}^{(n)}W_i^{(:,n)}\\\)

We want \(Var\big(Z_i\big)=Var\big(X_{i-1}^{(n)}\big)\)

Let's derive the condition, assuming the inputs and weights are independent and zero-mean:

\(\begin{aligned}Var\big(Z_i\big)&=Var\Bigg(\sum_{n=1}^{D_i}X_{i-1}^{(n)}W_i^{(:,n)}\Bigg)\\&=D_i\ Var\Big(X_{i-1}^{(n)}W_i^{(:,n)}\Big)\\&=D_i\ Var\Big(X_{i-1}^{(n)}\Big)\ Var\Big(W_i^{(:,n)}\Big)\end{aligned}\)

So \(Var\big(Z_i\big)=Var\big(X_{i-1}^{(n)}\big)\) only when \(Var\Big(W_i^{(:,n)}\Big)=\frac{1}{D_i}\\\), that is to say \(k_i=\frac{1}{\sqrt{D_i}}\\\)
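The variance argument can be checked numerically for a single linear layer. This is a sketch under stated assumptions: Gaussian inputs with unit variance, Gaussian weights with standard deviation \(k_i\), and the sizes `D` and `N` chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 512, 2000                         # input size and number of samples (assumed)
X = rng.standard_normal((N, D))          # inputs with Var(X) = 1

W_small = 0.01 * rng.standard_normal((D, D))           # k = 0.01
W_xavier = rng.standard_normal((D, D)) / np.sqrt(D)    # k = 1 / sqrt(D)

var_small = (X @ W_small).var()      # expected near D * 0.01^2 = 0.0512
var_xavier = (X @ W_xavier).var()    # expected near 1, matching Var(X)
```

With the small scale the variance shrinks at every layer, so stacking several layers drives the activations toward zero; with the \(\frac{1}{\sqrt{D_i}}\) scale it is preserved.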

ReLU Activation Output Distribution
Xavier Initialization \(k_i=\frac{1}{\sqrt{D_i}}\)
Kaiming Initialization \(k_i=\sqrt{\frac{2}{D_i}}\)

For ReLU activation, the "variance decreasing" problem still exists when using Xavier initialization.

We can use Kaiming initialization instead to fix this.

Batch Normalization

Force the inputs to be "nicely scaled" at each layer.

\(N\) : batch size

\(D\) : feature size

\(x\) : input with shape \(N\times D\)

\(\gamma\) : learnable scale parameter with shape \(D\)

\(\beta\) : learnable shift parameter with shape \(D\)

The procedure of batch normalization:

  1. Calculate the channel-wise mean \(\mu_j=\frac{1}{N}\sum_{i=1}^Nx_{i,j}\). The result \(\mu\) has shape \(D\).
  2. Calculate the channel-wise variance \(\sigma_j^2=\frac{1}{N}\sum_{i=1}^N(x_{i,j}-\mu_j)^2\). The result \(\sigma^2\) has shape \(D\).
  3. Calculate the normalized input \(\hat{x}_{i,j}=\frac{x_{i,j}-\mu_j}{\sqrt{\sigma_j^2+\epsilon}}\). The result \(\hat{x}\) has shape \(N\times D\).
  4. Scale and shift the normalized input to get the output \(y_{i,j}=\gamma_j\hat{x}_{i,j}+\beta_j\). The result \(y\) has shape \(N\times D\).
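The four steps above map directly onto a few NumPy lines. This sketch covers the training-time forward pass only; the function name and the default `eps` are assumptions, and the running statistics that are typically used at test time are left out.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode batch normalization for an input of shape (N, D)."""
    mu = x.mean(axis=0)                      # step 1: per-feature mean, shape (D,)
    var = x.var(axis=0)                      # step 2: per-feature variance (divides by N)
    x_hat = (x - mu) / np.sqrt(var + eps)    # step 3: normalize, shape (N, D)
    return gamma * x_hat + beta              # step 4: scale and shift, shape (N, D)

rng = np.random.default_rng(0)
x = 3.0 * rng.standard_normal((64, 4)) + 5.0         # badly scaled input
y = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
```

With \(\gamma=1\) and \(\beta=0\) the output has zero mean and approximately unit variance per feature; learning other values lets the network undo the normalization where that helps.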

Why scale: The constraint "zero-mean, unit variance" may be too hard.

Pros:

  1. Makes deep networks much easier to train!
  2. Improves gradient flow.
  3. Allows higher learning rates, faster convergence.
  4. Networks become more robust to initialization.
  5. Acts as regularization during training.
  6. Zero overhead at test-time: can be fused with conv!

Cons:

Behaves differently during training and testing: this is a very common source of bugs!

Transfer Learning

Train on a pre-trained model with other datasets.

An empirical suggestion:

very similar dataset very different dataset
very little data Use Linear Classifier on top layer. You’re in trouble… Try linear classifier from different stages.
quite a lot of data Finetune a few layers. Finetune a larger number of layers.

Regularization

Common Pattern of Regularization

Training: Add some kind of randomness. \(y=f(x,z)\)

Testing: Average out randomness (sometimes approximate). \(y=f(x)=E_z\big[f(x,z)\big]=\int p(z)f(x,z)dz\\\)

Regularization Term

L2 regularization: \(R(W)=\sum_k\sum_lW_{k,l}^2\) (weight decay)

L1 regularization: \(R(W)=\sum_k\sum_l|W_{k,l}|\)

Elastic net : \(R(W)=\sum_k\sum_l\big(\beta W_{k,l}^2+|W_{k,l}|\big)\) (L1+L2)

Dropout

Training: Randomly set each neuron to 0, keeping it active with probability \(p\).

Testing: Multiply each neuron's output by the keep probability \(p\). (scale the output to match its expected value during training)

More common (inverted dropout): Scale the output by \(\frac{1}{p}\) when training, and keep the original output when testing.
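A sketch of the "more common" variant, often called inverted dropout. It assumes \(p\) denotes the probability of keeping a neuron active, written here as `p_keep`; the function names are hypothetical.

```python
import numpy as np

def dropout_train(x, p_keep, rng):
    """Inverted dropout: zero units at random and rescale survivors by 1 / p_keep."""
    mask = (rng.random(x.shape) < p_keep) / p_keep
    return x * mask

def dropout_test(x):
    """At test time the activations pass through unchanged."""
    return x

x = np.ones((1000, 100))
out = dropout_train(x, p_keep=0.5, rng=np.random.default_rng(0))
```

Because the survivors are scaled up during training, the expected activation matches the test-time activation, so no extra scaling is needed at inference.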

Why dropout works:

  1. Forces the network to have a redundant representation. Prevents co-adaptation of features.
  2. Another interpretation: Dropout is training a large ensemble of models (that share parameters).

Batch Normalization

See above.

Data Augmentation

  1. Horizontal Flips
  2. Random Crops and Scales
  3. Color Jitter
  4. Rotation
  5. Stretching
  6. Shearing
  7. Lens Distortions
  8. ...

There are also automatic data augmentation methods that use neural networks.

Other Methods and Summary

DropConnect: Drop connections between neurons.

Fractional Max Pooling: Use randomized pooling regions.

Stochastic Depth: Skip some layers in the network.

Cutout: Set random image regions to zero.

Mixup: Train on random blends of images.

Regularization Method Usage
Dropout For large fully-connected layers.
Batch Normalization & Data Augmentation Almost always a good idea.
Cutout & Mixup For small classification datasets.

Hyperparameter Tuning

Most Common Hyperparameters Less Sensitive Hyperparameters
learning rate
learning rate decay schedule
weight decay
setting of momentum
...

Tips on hyperparameter tuning:

  1. Prefer one validation fold to cross-validation.
  2. Search for hyperparameters on log scale. (e.g. multiply the hyperparameter by a fixed number \(k\) at each search)
  3. Prefer random search to grid search.
  4. Careful with best values on border.
  5. Stage your search from coarse to fine.

Implementation

Have a worker that continuously samples random hyperparameters and performs the optimization. During the training, the worker will keep track of the validation performance after every epoch, and writes a model checkpoint to a file.

Have a master that launches or kills workers across a computing cluster, and may additionally inspect the checkpoints written by workers and plot their training statistics.

Common Procedures

  1. Check initial loss.

Turn off weight decay and sanity-check the loss at initialization: it should be about \(\log(C)\) for softmax with \(C\) classes.

  2. Overfit a small sample. (important)

Try to train to 100% training accuracy on a small sample of training data.

Fiddle with architecture, learning rate, weight initialization.

  3. Find a learning rate that makes the loss go down.

Use the architecture from the previous step, use all training data, turn on small weight decay, and find a learning rate that makes the loss drop significantly within 100 iterations.

Good learning rates to try: \(0.1,0.01,0.001,0.0001,\dots\)

  4. Coarse grid, train for 1-5 epochs.

Choose a few values of learning rate and weight decay around what worked in Step 3, and train a few models for 1-5 epochs.

Good weight decay to try: \(0.0001,0.00001,0\)

  5. Refine grid, train longer.

Pick the best models from Step 4 and train them for longer (10-20 epochs) without learning rate decay.

  6. Look at loss and accuracy curves.
  7. GOTO step 5.

Gradient Checks

CS231n Convolutional Neural Networks for Visual Recognition

Compute the numerical gradient using the centered difference \(f_n'=\frac{f(x+h)-f(x-h)}{2h}\), and compute the analytical gradient \(f_a'=\frac{\partial f(x)}{\partial x}\) by backpropagation.

Get the relative error between the numerical gradient \(f_n'\) and the analytical gradient \(f_a'\) using \(E=\frac{|f_n'-f_a'|}{\max(|f_n'|,|f_a'|)}\)

Relative Error Result
\(E>10^{-2}\) Probably \(f_a'\) is wrong.
\(10^{-2}>E>10^{-4}\) Not good, should check the gradient.
\(10^{-4}>E>10^{-6}\) Okay for objectives with kinks. (e.g. ReLU)
Not good for objectives with no kink. (e.g. softmax, tanh)
\(10^{-7}>E\) Good.
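A scalar gradient check can be sketched as below, using a centered difference for the numerical gradient and the relative error \(E\). The test function \(f(x)=x^3\), the step size `h`, and the helper names are illustrative assumptions.

```python
def numerical_grad(f, x, h=1e-5):
    """Centered difference approximation of df/dx."""
    return (f(x + h) - f(x - h)) / (2 * h)

def relative_error(fn, fa):
    """Relative error between numerical and analytical gradients."""
    denom = max(abs(fn), abs(fa))
    return 0.0 if denom == 0 else abs(fn - fa) / denom

f = lambda x: x ** 3
fa = 3 * 2.0 ** 2                 # analytical derivative 3 x^2 at x = 2
fn = numerical_grad(f, 2.0)       # numerical estimate at the same point
err = relative_error(fn, fa)
```

Python floats are double precision, which matches the first tip in the list that follows; in single precision the same check can report much larger errors even for a correct gradient.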

Tips on gradient checks:

  1. Use double precision.
  2. Use only few data points.
  3. Careful about kinks in the objective. (e.g. \(x=0\) for ReLU activation)
  4. Careful with the step size \(h\).
  5. Use gradient check after the loss starts to go down.
  6. Remember to turn off anything that may affect the gradient. (e.g. regularization / dropout / augmentations)
  7. Check only few dimensions for every parameter. (reduce time cost)

8 - Visualizing and Understanding

Feature Visualization and Inversion

Visualizing what models have learned

Visualize Areas
Filters Visualize the raw weights of each convolution kernel. (better in the first layer)
Final Layer Features Run dimensionality reduction for features in the last FC layer. (PCA, t-SNE...)
Activations Visualize activated areas. (Understanding Neural Networks Through Deep Visualization)

Understanding input pixels

Maximally Activating Patches
  1. Pick a layer and a channel.
  2. Run many images through the network, record values of the chosen channel.
  3. Visualize image patches that correspond to maximal activation features.

For example, we have a layer with shape \(128\times13\times13\). We pick the 17th channel from all 128 channels. Then we run many pictures through the network. During each run we can find a maximal activation feature among all the \(13\times13\) features in channel 17. We then record the corresponding picture patch for each maximal activation feature. At last, we visualize all picture patches for each feature.

This will help us find the relationship between each maximal activation feature and its corresponding picture patches.

(each row of the following picture represents a feature)

Saliency via Occlusion

Mask part of the image before feeding to CNN, check how much predicted probabilities change.

Saliency via Backprop
  1. Compute gradient of (unnormalized) class score with respect to image pixels.
  2. Take absolute value and max over RGB channels to get saliency maps.

Intermediate Features via Guided Backprop
  1. Pick a single intermediate neuron. (e.g. one feature in a \(128\times13\times13\) feature map)
  2. Compute gradient of neuron value with respect to image pixels.

Striving for Simplicity: The All Convolutional Net

Just like "Maximally Activating Patches", this could find the part of an image that a neuron responds to.

Gradient Ascent

Generate a synthetic image that maximally activates a neuron.

  1. Initialize image \(I\) to zeros.
  2. Forward image to compute current scores \(S_c(I)\) (for class \(c\) before softmax).
  3. Backprop to get gradient of neuron value with respect to image pixels.
  4. Make a small update to the image.

Objective: \(\max S_c(I)-\lambda\lVert I\rVert^2\)

Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps

Adversarial Examples

Find a fooling perturbation that, when added to a correctly classified image, makes the network misclassify it.

  1. Start from an arbitrary image.
  2. Pick an arbitrary class.
  3. Modify the image to maximize the class.
  4. Repeat until network is fooled.

DeepDream and Style Transfer

Feature Inversion

Given a CNN feature vector \(\Phi_0\) for an image, find a new image \(x\) that:

  1. Features of the new image \(\Phi(x)\) match the given feature vector \(\Phi_0\).
  2. The new image "looks natural". (image prior regularization)

Objective: \(\min \lVert\Phi(x)-\Phi_0\rVert+\lambda R(x)\)

Understanding Deep Image Representations by Inverting Them

DeepDream: Amplify Existing Features

Given an image, amplify the neuron activations at a layer to generate a new one.

  1. Forward: compute activations at chosen layer.
  2. Set gradient of chosen layer equal to its activation.
  3. Backward: Compute gradient on image.
  4. Update image.

Texture Synthesis

Nearest Neighbor
  1. Generate pixels one at a time in scanline order
  2. Form neighborhood of already generated pixels, copy the nearest neighbor from input.

Neural Texture Synthesis

Gram Matrix: A Detailed Explanation of the Gram Matrix (article in Chinese)

  1. Pretrain a CNN on ImageNet.
  2. Run input texture forward through CNN, record activations on every layer.

Layer \(l\) gives feature map of shape \(C_l\times H_l\times W_l\).

  3. At each layer compute the Gram matrix \(G_l\) giving outer product of features.
  • Reshape feature map at layer \(l\) to \(C_l\times H_lW_l\).
  • Compute the Gram matrix \(G_l\) with shape \(C_l\times C_l\).
  4. Initialize generated image from random noise.
  5. Pass generated image through CNN, compute Gram matrix \(\hat{G}_l\) on each layer.
  6. Compute loss: Weighted sum of L2 distance between Gram matrices.
  • \(E_l=\frac{1}{4N_l^2M_l^2}\sum_{i,j}\Big(G_l^{(i,j)}-\hat{G}_l^{(i,j)}\Big)^2\), where \(N_l=C_l\) and \(M_l=H_lW_l\).
  • \(\mathcal{L}(\vec{x},\hat{\vec{x}})=\sum_{l=0}^L\omega_lE_l\)
  7. Backprop to get gradient on image.
  8. Make gradient step on image.
  9. GOTO 5.
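The Gram matrix and per-layer loss from the steps above can be sketched with NumPy. Random arrays stand in for real CNN feature maps here, and the function names are hypothetical; the normalization assumes \(N_l\) is the channel count and \(M_l\) the number of spatial positions.

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a (C, H, W) feature map, shape (C, C)."""
    c, h, w = features.shape
    f = features.reshape(c, h * w)       # reshape to C x (H*W)
    return f @ f.T                       # channel-by-channel inner products

def layer_style_loss(feat_a, feat_b):
    """E_l: normalized squared distance between the two Gram matrices."""
    c, h, w = feat_a.shape
    n_l, m_l = c, h * w
    diff = gram_matrix(feat_a) - gram_matrix(feat_b)
    return (diff ** 2).sum() / (4 * n_l ** 2 * m_l ** 2)

rng = np.random.default_rng(0)
feat_texture = rng.standard_normal((3, 5, 5))     # stand-in for the texture features
feat_generated = rng.standard_normal((3, 5, 5))   # stand-in for the generated image
G = gram_matrix(feat_texture)
loss = layer_style_loss(feat_texture, feat_generated)
```

The Gram matrix discards spatial layout and keeps only which channels fire together, which is why matching it reproduces texture rather than image content.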

Texture Synthesis Using Convolutional Neural Networks

Style Transfer

Feature + Gram Reconstruction

Problem: Style transfer requires many forward / backward passes. Very slow!

Fast Style Transfer

9 - Object Detection and Image Segmentation

Semantic Segmentation

Paired Training Data: For each training image, each pixel is labeled with a semantic category.

Fully Convolutional Network: Design a network with only convolutional layers without downsampling operators to make predictions for pixels all at once!

Problem: Convolutions at original image resolution will be very expensive...

Solution: Design fully convolutional network with downsampling and upsampling inside it!

  • Downsampling: Pooling, strided convolution.
  • Upsampling: Unpooling, transposed convolution.

Unpooling:

Nearest Neighbor "Bed of Nails" "Position Memory"

Transposed Convolution: (example size \(3\times3\), stride \(2\), pad \(1\))

Normal Convolution Transposed Convolution

Object Detection

Single Object

Classification + Localization. (classification + regression problem)

Multiple Object

R-CNN

Using selective search to find “blobby” image regions that are likely to contain objects.

  1. Find regions of interest (RoI) using selective search. (region proposal)
  2. Forward each region through ConvNet.
  3. Classify features with SVMs.

Problem: Very slow. Need to do 2000 independent forward passes for each image!

Fast R-CNN

Pass the image through ConvNet before cropping. Crop the conv feature instead.

  1. Run whole image through ConvNet.
  2. Find regions of interest (RoI) from conv features using selective search. (region proposal)
  3. Classify RoIs using CNN.

Problem: Runtime is dominated by region proposals. (about \(90\%\) time cost)

Faster R-CNN

Insert Region Proposal Network (RPN) to predict proposals from features.

Otherwise same as Fast R-CNN: Crop features for each proposal, classify each one.

Region Proposal Network (RPN) : Slide many fixed windows over ConvNet features.

  1. Treat each point in the feature map as an anchor.

We have \(k\) fixed windows (anchor boxes) of different sizes/scales centered at each anchor.

  2. For each anchor box, predict whether it contains an object.

For positive boxes, also predict a correction from the anchor box to the ground-truth box.

  3. Slide the anchor over the feature map to get the “objectness” score for each box at each point.
  4. Sort the “objectness” scores and take the top \(300\) as the proposals.

Faster R-CNN is a Two-stage object detector:

  1. First stage: Run once per image
  • Backbone network
  • Region proposal network
  2. Second stage: Run once per region
  • Crop features: RoI pool / align
  • Predict object class
  • Predict bbox offset

Single-Stage Object Detectors: YOLO

You Only Look Once: Unified, Real-Time Object Detection

  1. Divide the image into grid cells. (example grid shape \(7\times7\))
  2. Set anchors in the middle of each grid cell.
  3. For each grid cell:
  • Use \(B\) anchor boxes to regress \(5\) numbers each: \(\text{dx, dy, dh, dw, confidence}\).
  • Predict scores for each of \(C\) classes.
  4. Finally the output is \(7\times7\times(5B+C)\).

Instance Segmentation

Mask R-CNN: Add a small mask network that operates on each RoI and predicts a \(28\times28\) binary mask.

Mask R-CNN achieves very good results!

10 - Recurrent Neural Networks

Supplement content added according to Deep Learning Book - RNN.

Recurrent Neural Network (RNN)

Motivation: Sequence Processing

One to One One to Many Many to One Many to Many Many to Many
Vanilla Neural Networks Image Captioning Action Prediction Video Captioning Video Classification on Frame Level

Vanilla RNN

\(x^{(t)}\) : Input at time \(t\).

\(h^{(t)}\) : State at time \(t\).

\(o^{(t)}\) : Output at time \(t\).

\(y^{(t)}\) : Expected output at time \(t\).

Many to One

Calculation
State Transition \(h^{(t)}=\tanh(Wh^{(t-1)}+Ux^{(t)}+b)\)
Output Calculation \(o^{(\tau)}=\text{sigmoid}\ \big(Vh^{(\tau)}+c\big)\)
Many to Many (type 2)

Calculation
State Transition \(h^{(t)}=\tanh(Wh^{(t-1)}+Ux^{(t)}+b)\)
Output Calculation \(o^{(t)}=\text{sigmoid}\ \big(Vh^{(t)}+c\big)\)
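The state transition and output equations can be sketched as a forward pass over a sequence. This is a minimal illustration with randomly initialized parameters; the dimensions and the function name `rnn_forward` are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, h0, W, U, b, V, c):
    """Vanilla RNN: returns the output at every step and the final state."""
    h = h0
    outputs = []
    for x in xs:
        h = np.tanh(W @ h + U @ x + b)         # state transition
        outputs.append(sigmoid(V @ h + c))     # output at this step
    return outputs, h

rng = np.random.default_rng(0)
D, H, O, T = 3, 5, 2, 4                        # input, state, output sizes; sequence length
W, U, V = (rng.standard_normal(s) * 0.1 for s in [(H, H), (H, D), (O, H)])
b, c = np.zeros(H), np.zeros(O)
xs = [rng.standard_normal(D) for _ in range(T)]
outputs, h_last = rnn_forward(xs, np.zeros(H), W, U, b, V, c)
```

For the many-to-many case all entries of `outputs` are used; for many-to-one only the last entry, `outputs[-1]`, is kept. The same weights are reused at every step.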

RNN with Teacher Forcing

Update the current state from the previous step's output instead of the previous step's state.

Calculation
State Transition \(h^{(t)}=\tanh(Wo^{(t-1)}+Ux^{(t)}+b)\)
Output Calculation \(o^{(t)}=\text{sigmoid}\ \big(Vh^{(t)}+c\big)\)

RNN with "Output Forwarding"

We can also feed the previous step's output together with the current step's input.

Calculation
State Transition (training) \(h^{(t)}=\tanh(Wh^{(t-1)}+Ux^{(t)}+Ry^{(t-1)}+b)\)
State Transition (testing) \(h^{(t)}=\tanh(Wh^{(t-1)}+Ux^{(t)}+Ro^{(t-1)}+b)\)
Output Calculation \(o^{(t)}=\text{sigmoid}\ \big(Vh^{(t)}+c\big)\)

Usually we use \(o^{(t-1)}\) in place of \(y^{(t-1)}\) at testing time.

Bidirectional RNN

When dealing with a whole input sequence, we can process features from two directions.

Calculation
State Transition (forward) \(h^{(t)}=\tanh(W_1h^{(t-1)}+U_1x^{(t)}+b_1)\)
State Transition (backward) \(g^{(t)}=\tanh(W_2g^{(t+1)}+U_2x^{(t)}+b_2)\)
Output Calculation \(o^{(t)}=\text{sigmoid}\ \big(Vh^{(t)}+Wg^{(t)}+c\big)\)

Encoder-Decoder Sequence to Sequence RNN

This is a many-to-many structure (type 1).

First we encode information according to \(x\) with no output.

Later we decode information according to \(y\) with no input.

\(C\) : Context vector, often \(C=h^{(T)}\) (last state of encoder).

Calculation
State Transition (encode) \(h^{(t)}=\tanh(W_1h^{(t-1)}+U_1x^{(t)}+b_1)\)
State Transition (decode, training) \(s^{(t)}=\tanh(W_2s^{(t-1)}+U_2y^{(t-1)}+TC+b_2)\)
State Transition (decode, testing) \(s^{(t)}=\tanh(W_2s^{(t-1)}+U_2o^{(t-1)}+TC+b_2)\)
Output Calculation \(o^{(t)}=\text{sigmoid}\ \big(Vs^{(t)}+c\big)\)

Example: Image Captioning

Summary

Advantages of RNN:

  1. Can process any length input.
  2. Computation for step \(t\) can (in theory) use information from many steps back.
  3. Model size doesn’t increase for longer input.
  4. Same weights applied on every timestep, so there is symmetry in how inputs are processed.

Disadvantages of RNN:

  1. Recurrent computation is slow.
  2. In practice, difficult to access information from many steps back.
  3. Problems with gradient exploding and gradient vanishing. (check Deep Learning Book - RNN Page 396, Chap 10.7)

Long Short Term Memory (LSTM)

Add a "cell block" to store history information.

\(c^{(t)}\) : Cell at time \(t\).

\(f^{(t)}\) : Forget gate at time \(t\). Deciding whether to erase the cell.

\(i^{(t)}\) : Input gate at time \(t\). Deciding whether to write to the cell.

\(g^{(t)}\) : External input gate at time \(t\). Deciding how much to write to the cell.

\(o^{(t)}\) : Output gate at time \(t\). Deciding how much to reveal the cell.

Calculation (Gate)
Forget Gate \(f^{(t)}=\text{sigmoid}\ \big(W_fh^{(t-1)}+U_fx^{(t)}+b_f\big)\)
Input Gate \(i^{(t)}=\text{sigmoid}\ \big(W_ih^{(t-1)}+U_ix^{(t)}+b_i\big)\)
External Input Gate \(g^{(t)}=\tanh(W_gh^{(t-1)}+U_gx^{(t)}+b_g)\)
Output Gate \(o^{(t)}=\text{sigmoid}\ \big(W_oh^{(t-1)}+U_ox^{(t)}+b_o\big)\)
Calculation (Main)
Cell Transition \(c^{(t)}=f^{(t)}\odot c^{(t-1)}+i^{(t)}\odot g^{(t)}\)
State Transition \(h^{(t)}=o^{(t)}\odot\tanh(c^{(t)})\)
Output Calculation \(O^{(t)}=\text{sigmoid}\ \big(Vh^{(t)}+c\big)\)
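One LSTM step following the gate and main equations above can be sketched as below. Storing the parameters in dictionaries keyed by gate name is an illustrative choice, and the final output projection \(O^{(t)}\) is omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b are dicts keyed by gate name: 'f', 'i', 'g', 'o'."""
    f = sigmoid(W["f"] @ h_prev + U["f"] @ x + b["f"])   # forget gate
    i = sigmoid(W["i"] @ h_prev + U["i"] @ x + b["i"])   # input gate
    g = np.tanh(W["g"] @ h_prev + U["g"] @ x + b["g"])   # external input gate
    o = sigmoid(W["o"] @ h_prev + U["o"] @ x + b["o"])   # output gate
    c = f * c_prev + i * g                               # cell transition
    h = o * np.tanh(c)                                   # state transition
    return h, c

H, D = 4, 3                                              # state and input sizes (assumed)
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((H, H)) * 0.1 for k in "figo"}
U = {k: rng.standard_normal((H, D)) * 0.1 for k in "figo"}
b = {k: np.zeros(H) for k in "figo"}
h, c = lstm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), W, U, b)
```

When the forget gate is close to 1 and the input gate close to 0, the cell is copied forward unchanged, which is what lets information survive over many steps.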

Other RNN Variants

GRU...

11 - Attention and Transformers

RNN with Attention

Encoder-Decoder Sequence to Sequence RNN Problem:

Input sequence bottlenecked through a fixed-sized context vector \(C\). (e.g. \(T=1000\))

Intuitive Solution:

Generate new context vector \(C^{(t)}\) at each step \(t\) !

\(e_i^{(t)}\) : Alignment score for input \(i\) at state \(t\). (scalar)

\(a_i^{(t)}\) : Attention weight for input \(i\) at state \(t\).

\(C^{(t)}\) : Context vector at state \(t\).

Calculation
Alignment Score \(e_i^{(t)}=f(s^{(t-1)},h^{(i)})\).
Where \(f\) is an MLP.
Attention Weight \(a_i^{(t)}=\text{softmax}\ (e_i^{(t)})\).
Softmax includes all \(e_i\) at state \(t\).
Context Vector \(C^{(t)}=\sum_i a_i^{(t)}h^{(i)}\)
Decoder State Transition \(s^{(t)}=\tanh(Ws^{(t-1)}+Uy^{(t)}+TC^{(t)}+b)\)

Example on Image Captioning:

General Attention Layer

Add linear transformations to the input vector before attention.

Notice:

  1. The number of queries \(q\) can vary. (it can differ from the number of keys \(k\))
  2. The number of outputs \(y\) is equal to the number of queries \(q\). Each \(y\) is a linear weighting of the values \(v\).
  3. Alignment \(e\) is divided by \(\sqrt{D}\) to avoid "explosion of softmax", where \(D\) is the dimension of the input feature.
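A general attention layer with scaled dot-product alignment can be sketched as follows. The projection matrices `W_k` and `W_v`, the sizes, and the function names are illustrative assumptions; the queries are passed in directly.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, X, W_k, W_v):
    """Q: (N_q, D) queries, X: (N_x, D_in) inputs. Returns outputs and attention weights."""
    K = X @ W_k                                # keys,   shape (N_x, D)
    V = X @ W_v                                # values, shape (N_x, D_v)
    E = Q @ K.T / np.sqrt(Q.shape[1])          # alignment scores scaled by sqrt(D)
    A = softmax(E, axis=1)                     # attention weights over the inputs
    return A @ V, A                            # each output is a weighted sum of values

rng = np.random.default_rng(0)
N_q, N_x, D_in, D, D_v = 2, 5, 6, 4, 3
Q = rng.standard_normal((N_q, D))
X = rng.standard_normal((N_x, D_in))
Y, A = attention(Q, X, rng.standard_normal((D_in, D)), rng.standard_normal((D_in, D_v)))
```

The output has one row per query, and each row of `A` sums to 1, matching the notes above.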

Self-attention Layer

The query vectors \(q\) are also generated from the inputs.

In this way, the shape of \(y\) is equal to the shape of \(x\).

Example with CNN:

Positional Encoding

Self-attention layer doesn’t care about the orders of the inputs!

To encode ordered sequences like language or spatially ordered image features, we can add positional encoding to the inputs.

We use a function \(P:R\rightarrow R^d\) to process the position \(i\) into a d-dimensional vector \(p_i=P(i)\).

Constraint Condition of \(P\)
Uniqueness \(P(i)\ne P(j)\) for \(i\ne j\)
Equidistance \(\lVert P(i+k)-P(i)\rVert^2=\lVert P(j+k)-P(j)\rVert^2\)
Boundedness \(P(i)\in[a,b]\)
Determinacy \(P(i)\) is always a static value. (function is not dynamic)

We can either train an encoder model, or design a fixed function.

A Practical Positional Encoding Method: Using \(\sin\) and \(\cos\) with different frequency \(\omega\) at different dimension.

\(P(t)=\begin{bmatrix}\sin(\omega_1t)\\\cos(\omega_1t)\\\sin(\omega_2t)\\\cos(\omega_2t)\\\vdots\\\sin(\omega_{\frac{d}{2}}t)\\\cos(\omega_{\frac{d}{2}}t)\end{bmatrix}\), where frequency \(\omega_k=\frac{1}{10000^{\frac{2k}{d}}}\). (wavelength \(\lambda_k=\frac{2\pi}{\omega_k}=2\pi\cdot10000^{\frac{2k}{d}}\))

\(P(t)=\begin{bmatrix}\sin(t/10000^{\frac{2}{d}})\\\cos(t/10000^{\frac{2}{d}})\\\sin(t/10000^{\frac{4}{d}})\\\cos(t/10000^{\frac{4}{d}})\\\vdots\\\sin(t/10000^1)\\\cos(t/10000^1)\end{bmatrix}\), after we substitute \(\omega_k\) into the equation.

\(P(t)\) is a vector with size \(d\), where \(d\) is a hyperparameter to choose according to the length of input sequence.

An intuition of this method is the binary encoding of numbers.

[lecture 11d] Attention and Transformers (positional encoding supplement, code implementation, distance calculation)

It is easy to prove that \(P(t)\) satisfies "Equidistance": (set \(d=2\) for example)

\(\begin{aligned}\lVert P(i+k)-P(i)\rVert^2&=\big[\sin(\omega_1(i+k))-\sin(\omega_1i)\big]^2+\big[\cos(\omega_1(i+k))-\cos(\omega_1i)\big]^2\\&=2-2\sin(\omega_1(i+k))\sin(\omega_1i)-2\cos(\omega_1(i+k))\cos(\omega_1i)\\&=2-2\cos(\omega_1k)\end{aligned}\)

So the distance does not depend on \(i\), and we have \(\lVert P(i+k)-P(i)\rVert^2=\lVert P(j+k)-P(j)\rVert^2\).
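The sinusoidal encoding and its equidistance property can be checked with a few lines of plain Python. This is a sketch: the function names are hypothetical and \(d\) is assumed to be even.

```python
import math

def positional_encoding(t, d):
    """Sinusoidal encoding of position t as a list of d values (d even)."""
    p = []
    for k in range(1, d // 2 + 1):
        omega = 1.0 / 10000 ** (2 * k / d)     # frequency for the k-th sin/cos pair
        p += [math.sin(omega * t), math.cos(omega * t)]
    return p

def sq_dist(a, b):
    """Squared Euclidean distance between two encodings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# The same offset k = 5 from two different starting positions.
d1 = sq_dist(positional_encoding(3 + 5, 32), positional_encoding(3, 32))
d2 = sq_dist(positional_encoding(40 + 5, 32), positional_encoding(40, 32))
```

The two distances agree up to floating-point rounding, and every component stays within \([-1,1]\), which covers the boundedness condition as well.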

Visualization of \(P(t)\) features: (set \(d=32\), \(x\) axis represents the position of sequence)

Masked Self-attention Layer

To prevent vectors from looking at future vectors, we manually set the alignment scores of future positions to \(-\infty\) before the softmax.
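A causal mask on a square matrix of alignment scores can be sketched as follows. Row \(i\) is assumed to hold the scores of query \(i\) against all keys; the function names are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask_scores(E):
    """Set scores of future positions (column index > row index) to -inf."""
    n = E.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    return np.where(future, -np.inf, E)

rng = np.random.default_rng(0)
E = rng.standard_normal((4, 4))            # stand-in alignment scores
A = softmax(causal_mask_scores(E), axis=1)
```

Since \(e^{-\infty}=0\), masked positions receive exactly zero attention weight, and each row still sums to 1 over the allowed positions.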

Multi-head Self-attention Layer

Multiple self-attention heads in parallel.

Transformer

Attention Is All You Need

Encoder Block

Inputs: Set of vectors \(z\). (in which \(z_i\) can be a word in a sentence, or a pixel in a picture...)

Output: Set of context vectors \(c\). (encoded features of \(z\))

The number of blocks \(N=6\) in original paper.

Notice:

  1. Self-attention is the only interaction between vectors \(z_0,z_1,\dots,z_n\).
  2. Layer norm and MLP operate independently per vector.
  3. Highly scalable, highly parallelizable, but high memory usage.

Decoder Block

Inputs: Set of vectors \(y\). (\(y_i\) can be a word in a sentence, or a pixel in a picture...)

Inputs: Set of context vectors \(c\).

Output: Set of vectors \(y'\). (decoded result; \(y'_i=y_{i+1}\) for the first \(n-1\) elements of \(y'\))

The number of blocks \(N=6\) in original paper.

Notice:

  1. Masked self-attention only interacts with past inputs.
  2. Multi-head attention block is NOT self-attention. It attends over encoder outputs.
  3. Highly scalable, highly parallelizable, but high memory usage. (same as encoder)

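Why we need a mask in the decoder:

  1. The output is the input shifted by one position (\(y'_i=y_{i+1}\)), so position \(i\) must not see \(y_{i+1}\) or anything after it.
  2. With the mask, all positions can be computed in parallel during training.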

An example walking through the details of Transformer inputs and outputs, and more

Why the Transformer decoder still needs a sequence mask at test or prediction time

Example on Image Captioning (Only with Transformers)

Comparing RNNs to Transformer

| | RNNs | Transformer |
| --- | --- | --- |
| Pros | LSTMs work reasonably well for long sequences. | 1. Good at long sequences: each attention calculation looks at all inputs. 2. Can operate over unordered sets, or over ordered sequences with positional encodings. 3. Parallel computation: all alignment and attention scores for all inputs can be computed in parallel. |
| Cons | 1. Expects an ordered sequence of inputs. 2. Sequential computation: subsequent hidden states can only be computed after the previous ones are done. | Requires a lot of memory: \(N\times M\) alignment and attention scalars need to be calculated and stored for a single self-attention head. |

Comparing ConvNets to Transformer

ConvNets strike back!

12 - Video Understanding

Video Classification

Take video classification task for example.

Input size: \(C\times T\times H\times W\).

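The problem is that videos are quite big. We can't afford to train on raw videos, so we train on short, low-resolution video clips instead.

| | Raw Videos | Video Clips |
| --- | --- | --- |
| Resolution, frame rate | \(1920\times1080,\ 30\text{fps}\) | \(112\times112,\ 5\text{fps}\) (16 frames, \(3.2\text{s}\)) |
| Uncompressed size | \(\sim10\text{GB}/\text{min}\) | \(588\text{KB}\) per clip |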
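The two sizes can be reproduced with simple arithmetic. This sketch assumes uncompressed 8-bit RGB (3 bytes per pixel) and a clip of 16 frames, i.e. 3.2 seconds at 5 fps; the 16-frame count is an assumption that makes the 588 KB figure work out.

```python
def raw_bytes(frames, height, width, channels=3):
    """Uncompressed size in bytes, at one byte per channel per pixel."""
    return frames * channels * height * width

# One minute of 1920x1080 video at 30 fps.
hd_minute = raw_bytes(frames=30 * 60, height=1080, width=1920)
# One training clip: 16 frames of 112x112 (assumed: 3.2 s at 5 fps).
clip = raw_bytes(frames=16, height=112, width=112)

hd_gib = hd_minute / 1024 ** 3  # roughly 10 GiB per minute
clip_kib = clip / 1024          # 588 KiB per clip
```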

Plain CNN Structure

Single Frame 2D-CNN

Train a normal 2D-CNN model.

Classify each frame independently.

Average the result of each frame as the final result.

Late Fusion

Get high-level appearance of each frame, and combine them.

Run 2D-CNN on each frame, pool features and feed to Linear Layers.

Problem: Hard to compare low-level motion between frames.

Early Fusion

Compare frames in the very first conv layer; after that, the network is a normal 2D-CNN.

Problem: One layer of temporal processing may not be enough!

3D-CNN

Convolve on 3 dimensions: Height, Width, Time.

Input size: \(C_{in}\times T\times H\times W\).

Kernel size: \(C_{in}\times C_{out}\times 3\times 3\times 3\).

Output size: \(C_{out}\times T\times H\times W\). (with zero padding)
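As a quick check of the sizes above, here is a small sketch of the standard convolution output-size formula applied to the three convolved dimensions. The function names are illustrative, and stride 1 with padding 1 is assumed for the \(3\times3\times3\) kernel.

```python
def conv3d_output_shape(c_out, t, h, w, kernel=3, padding=1, stride=1):
    """Output shape (C_out, T', H', W') of a 3D convolution with a cubic kernel."""
    def out(n):
        return (n + 2 * padding - kernel) // stride + 1
    return (c_out, out(t), out(h), out(w))

def conv3d_num_weights(c_in, c_out, kernel=3):
    """Number of weights in the kernel tensor (bias terms not counted)."""
    return c_in * c_out * kernel ** 3

shape = conv3d_output_shape(c_out=64, t=16, h=112, w=112)
n_weights = conv3d_num_weights(c_in=3, c_out=64)
```

With padding 1, a \(3\times3\times3\) kernel keeps \(T\), \(H\) and \(W\) unchanged, which matches the output size stated above.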

C3D (VGG of 3D-CNNs)

The cost is quite expensive...

| Network | Computation |
| --- | --- |
| AlexNet | 0.7 GFLOP |
| VGG-16 | 13.6 GFLOP |
| C3D | 39.5 GFLOP |

Two-Stream Networks

Separate motion and appearance.

I3D (Inflating 2D Networks to 3D)

Take a 2D-CNN architecture.

Replace each 2D conv/pool layer with a 3D version.

Modeling Long-term Temporal Structure

Recurrent Convolutional Network

Similar to multi-layer RNN, we replace the dot-product operation with convolution.

Feature size in layer \(L\), time \(t-1\): \(W_h\times H\times W\).

Feature size in layer \(L-1\), time \(t\): \(W_x\times H\times W\).

Feature size in layer \(L\), time \(t\): \((W_h+W_x)\times H\times W\).

Problem: RNNs are slow for long sequences. (can’t be parallelized)

Spatio-temporal Self-attention

Introduce self-attention into video classification problems.

Vision Transformers for Video

Factorized attention: Attend over space / time.

So many papers...

Visualizing Video Models

Multimodal Video Understanding

Temporal Action Localization

Given a long untrimmed video sequence, identify frames corresponding to different actions.

Spatio-Temporal Detection

Given a long untrimmed video, detect all the people in both space and time and classify the activities they are performing.

Visually-guided Audio Source Separation

And So on...

13 - Generative Models

PixelRNN and PixelCNN

Fully Visible Belief Network (FVBN)

\(p(x)\) : Likelihood of image \(x\).

\(p(x_1,x_2,\dots,x_n)\) : Joint likelihood of all \(n\) pixels in image \(x\).

\(p(x_i|x_1,x_2,\dots,x_{i-1})\) : Probability of pixel \(i\) value given all previous pixels.

For explicit density models, we have \(p(x)=p(x_1,x_2,\dots,x_n)=\prod_{i=1}^np(x_i|x_1,x_2,\dots,x_{i-1})\).

Objective: Maximize the likelihood of training data.
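To make the chain-rule factorization concrete, here is a toy sketch on binary "pixels". The conditional `cond_prob_one` is a hypothetical hand-written rule standing in for the RNN or CNN; the point is only that the log-likelihood is a sum of log conditionals and that the factorized model is a valid distribution.

```python
import math
from itertools import product

def cond_prob_one(prev):
    """Toy conditional p(x_i = 1 | x_1..x_{i-1}); a stand-in for a learned network."""
    return (1 + sum(prev)) / (2 + len(prev))

def log_likelihood(pixels):
    """log p(x) = sum_i log p(x_i | x_1..x_{i-1})."""
    total = 0.0
    for i, x in enumerate(pixels):
        p1 = cond_prob_one(pixels[:i])
        total += math.log(p1 if x == 1 else 1.0 - p1)
    return total

# Summing p(x) over all 2^4 binary images of 4 pixels should give 1.
total_prob = sum(math.exp(log_likelihood(list(img)))
                 for img in product([0, 1], repeat=4))
```

Training maximizes `log_likelihood` over the training images with respect to the parameters of the conditional model.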

PixelRNN

Generate image pixels starting from corner.

Dependency on previous pixels modeled using an RNN (LSTM).

Drawback: Sequential generation is slow in both training and inference!

PixelCNN

Still generate image pixels starting from corner.

Dependency on previous pixels modeled using a CNN over context region (masked convolution).

Drawback: Though its training is faster, its generation is still slow. (pixel by pixel)

Variational Autoencoder

Supplement content added according to Tutorial on Variational Autoencoders. (paper with notes: VAE Tutorial.pdf)

Variational Autoencoder (VAE): So That's How It Works | With Open-Source Code

Autoencoder

Learn a lower-dimensional feature representation with unsupervised approaches.

\(x\rightarrow z\) : Dimension reduction for input features.

\(z\rightarrow \hat{x}\) : Reconstruct input features.

After training, we throw the decoder away and use the encoder as a feature extractor for transfer learning.

For generative models, there is a problem:

We can’t generate new images from an autoencoder because we don’t know the space of \(z\).

Variational Autoencoder

Notation

\(X\) : Images. (random variable)

\(Z\) : Latent representations. (random variable)

\(P(X)\) : True distribution of all training images \(X\).

\(P(Z)\) : True distribution of all latent representations \(Z\).

\(P(X|Z)\) : True conditional distribution (likelihood) of images \(X\) given \(Z\).

\(P(Z|X)\) : True posterior distribution of latent representations \(Z\) given \(X\).

\(Q(Z|X)\) : Approximate posterior distribution of latent representations \(Z\) given \(X\).

\(x\) : A specific image.

\(z\) : A specific latent representation.

\(\theta\): Learned parameters in decoder network.

\(\phi\): Learned parameters in encoder network.

\(p_\theta(x)\) : Probability that \(x\sim P(X)\).

\(p_\theta(z)\) : Probability that \(z\sim P(Z)\).

\(p_\theta(x|z)\) : Probability that \(x\sim P(X|Z)\).

\(p_\theta(z|x)\) : Probability that \(z\sim P(Z|X)\).

\(q_\phi(z|x)\) : Probability that \(z\sim Q(Z|X)\).

Decoder

Objective:

Generate new images from \(z\).

  1. Generate a value \(z^{(i)}\) from the prior distribution \(P(Z)\).
  2. Generate a value \(x^{(i)}\) from the conditional distribution \(P(X|Z)\).

Lemma:

Any distribution in \(d\) dimensions can be generated by taking a set of \(d\) variables that are normally distributed and mapping them through a sufficiently complicated function. (source: Tutorial on Variational Autoencoders, Page 6)

Solutions:

  1. Choose the prior distribution \(P(Z)\) to be a simple distribution, for example \(Z\sim N(0,I)\).
  2. Learn the conditional distribution \(P(X|Z)\) through a neural network (decoder) with parameter \(\theta\).

Encoder

Objective:

Learn \(z\) from training images.

Given: (From the decoder, we can deduce the following probabilities.)

  1. data likelihood: \(p_\theta(x)=\int p_\theta(x|z)p_\theta(z)dz\).
  2. posterior density: \(p_\theta(z|x)=\frac{p_\theta(x|z)p_\theta(z)}{p_\theta(x)}=\frac{p_\theta(x|z)p_\theta(z)}{\int p_\theta(x|z)p_\theta(z)dz}\).

Problem:

Both \(p_\theta(x)\) and \(p_\theta(z|x)\) are intractable. (they can't be computed or optimized directly because they contain an integral over \(z\))

Solution:

Learn \(Q(Z|X)\) to approximate the true posterior \(P(Z|X)\).

Use \(q_\phi(z|x)\) in place of \(p_\theta(z|x)\).

Variational Autoencoder (Combination of Encoder and Decoder)

Objective:

Maximize \(p_\theta(x)\) for all \(x^{(i)}\) in the training set.

$$ \begin{aligned} \log p_\theta\big(x^{(i)}\big)&=\mathbb{E}_{z\sim q_\phi(z|x^{(i)})}\Big[\log p_\theta\big(x^{(i)}\big)\Big]\\
&=\mathbb{E}_z\Bigg[\log\frac{p_\theta\big(x^{(i)}|z\big)p_\theta\big(z\big)}{p_\theta\big(z|x^{(i)}\big)}\Bigg]\quad\text{(Bayes' Rule)}\\
&=\mathbb{E}_z\Bigg[\log\frac{p_\theta\big(x^{(i)}|z\big)p_\theta\big(z\big)}{p_\theta\big(z|x^{(i)}\big)}\frac{q_\phi\big(z|x^{(i)}\big)}{q_\phi\big(z|x^{(i)}\big)}\Bigg]\quad\text{(Multiply by Constant)}\\
&=\mathbb{E}_z\Big[\log p_\theta\big(x^{(i)}|z\big)\Big]-\mathbb{E}_z\Bigg[\log\frac{q_\phi\big(z|x^{(i)}\big)}{p_\theta\big(z\big)}\Bigg]+\mathbb{E}_z\Bigg[\log\frac{q_\phi\big(z|x^{(i)}\big)}{p_\theta\big(z|x^{(i)}\big)}\Bigg]\quad\text{(Logarithm)}\\
&=\mathbb{E}_z\Big[\log p_\theta\big(x^{(i)}|z\big)\Big]-D_{\text{KL}}\Big[q_\phi\big(z|x^{(i)}\big)||p_\theta\big(z\big)\Big]+D_{\text{KL}}\Big[q_\phi\big(z|x^{(i)}\big)||p_\theta\big(z|x^{(i)}\big)\Big]\quad\text{(KL Divergence)} \end{aligned} $$

Analyze the Formula by Term:

\(\mathbb{E}_z\Big[\log p_\theta\big(x^{(i)}|z\big)\Big]\): The decoder network gives \(p_\theta\big(x^{(i)}|z\big)\), so we can estimate this term through sampling.

\(D_{\text{KL}}\Big[q_\phi\big(z|x^{(i)}\big)||p_\theta\big(z\big)\Big]\): This KL term (between the Gaussians for the encoder and the \(z\) prior) has a nice closed-form solution!

\(D_{\text{KL}}\Big[q_\phi\big(z|x^{(i)}\big)||p_\theta\big(z|x^{(i)}\big)\Big]\): The part \(p_\theta\big(z|x^{(i)}\big)\) is intractable. However, we know that a KL divergence is always \(\ge0\).

Tractable Lower Bound:

We can maximize the lower bound of that formula.

As \(D_{\text{KL}}\Big[q_\phi\big(z|x^{(i)}\big)||p_\theta\big(z|x^{(i)}\big)\Big]\ge0\), we can deduce that:

$$ \begin{aligned} \log p_\theta\big(x^{(i)}\big)&=\mathbb{E}_z\Big[\log p_\theta\big(x^{(i)}|z\big)\Big]-D_{\text{KL}}\Big[q_\phi\big(z|x^{(i)}\big)||p_\theta\big(z\big)\Big]+D_{\text{KL}}\Big[q_\phi\big(z|x^{(i)}\big)||p_\theta\big(z|x^{(i)}\big)\Big]\\
&\ge\mathbb{E}_z\Big[\log p_\theta\big(x^{(i)}|z\big)\Big]-D_{\text{KL}}\Big[q_\phi\big(z|x^{(i)}\big)||p_\theta\big(z\big)\Big] \end{aligned} $$

So the loss function \(\mathcal{L}\big(x^{(i)},\theta,\phi\big)=-\mathbb{E}_z\Big[\log p_\theta\big(x^{(i)}|z\big)\Big]+D_{\text{KL}}\Big[q_\phi\big(z|x^{(i)}\big)||p_\theta\big(z\big)\Big]\).

\(\mathbb{E}_z\Big[\log p_\theta\big(x^{(i)}|z\big)\Big]\): Decoder, reconstruct the input data.

\(D_{\text{KL}}\Big[q_\phi\big(z|x^{(i)}\big)||p_\theta\big(z\big)\Big]\): Encoder, make approximate posterior distribution close to prior.
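A minimal sketch of how the two loss terms could be computed, under assumptions that go beyond the notes: the encoder outputs a diagonal Gaussian (`mu`, `log_var`), the prior is \(N(0,I)\), the decoder is a unit-variance Gaussian, and \(z\) is sampled with the reparameterization \(z=\mu+\sigma\epsilon\). All function names are illustrative.

```python
import math
import random

def kl_to_standard_normal(mu, log_var):
    """Closed-form D_KL[N(mu, diag(exp(log_var))) || N(0, I)]."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def sample_z(mu, log_var, rng=random):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def vae_loss(x, mu, log_var, decode, rng=random):
    """One-sample estimate of -E_z[log p(x|z)] + KL (additive constants dropped)."""
    z = sample_z(mu, log_var, rng)
    x_hat = decode(z)
    # Negative log-likelihood of a unit-variance Gaussian decoder, up to a constant.
    recon = 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + kl_to_standard_normal(mu, log_var)

# Toy usage with an identity "decoder" (purely for illustration).
loss = vae_loss([1.0, 2.0], mu=[0.0, 0.0], log_var=[0.0, 0.0],
                decode=lambda z: z, rng=random.Random(0))
```

The KL term is zero exactly when the encoder output matches the prior, which is the "make the approximate posterior close to the prior" pressure described above.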

Generative Adversarial Networks (GANs)

Motivation & Modeling

Objective: Not modeling any explicit density function.

Problem: Want to sample from complex, high-dimensional training distribution. No direct way to do this!

Solution: Sample from a simple distribution, e.g. random noise. Learn the transformation to training distribution.

Problem: We can't learn the mapping directly, because there is no known correspondence between a sample \(z\) and a training image.

Solution: Use a discriminator network to tell whether a generated image is within the data distribution or not.

Discriminator network: Try to distinguish between real and fake images.

Generator network: Try to fool the discriminator by generating real-looking images.

\(x\) : Real data.

\(y\) : Fake data, which is generated by the generator network. \(y=G_{\theta_g}(z)\).

\(D_{\theta_d}(x)\) : Discriminator score, which is the likelihood of real image. \(D_{\theta_d}(x)\in[0,1]\).

Objective of discriminator network:

\(\max_{\theta_d}\bigg[\mathbb{E}_x\Big(\log D_{\theta_d}(x)\Big)+\mathbb{E}_{z\sim p(z)}\Big(\log\big(1-D_{\theta_d}(y)\big)\Big)\bigg]\)

Objective of generator network:

\(\min_{\theta_g}\max_{\theta_d}\bigg[\mathbb{E}_x\Big(\log D_{\theta_d}(x)\Big)+\mathbb{E}_{z\sim p(z)}\Big(\log\big(1-D_{\theta_d}(y)\big)\Big)\bigg]\)

Training Strategy

To combine these two networks, we can train them alternately:

  1. Gradient ascent on discriminator.

\(\max_{\theta_d}\bigg[\mathbb{E}_x\Big(\log D_{\theta_d}(x)\Big)+\mathbb{E}_{z\sim p(z)}\Big(\log\big(1-D_{\theta_d}(y)\big)\Big)\bigg]\)

  2. Gradient descent on generator.

\(\min_{\theta_g}\bigg[\mathbb{E}_{z\sim p(z)}\Big(\log\big(1-D_{\theta_d}(y)\big)\Big)\bigg]\)

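However, when the generated samples are poor (\(D_{\theta_d}(y)\approx0\)), the gradient of \(\log\big(1-D_{\theta_d}(y)\big)\) is small, so the generator learns slowly exactly when it most needs to improve.

So instead of minimizing \(\log\big(1-D_{\theta_d}(y)\big)\), we maximize \(\log D_{\theta_d}(y)\) and use gradient ascent on the generator.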

  1. Gradient ascent on discriminator.

\(\max_{\theta_d}\bigg[\mathbb{E}_x\Big(\log D_{\theta_d}(x)\Big)+\mathbb{E}_{z\sim p(z)}\Big(\log\big(1-D_{\theta_d}(y)\big)\Big)\bigg]\)

  2. Gradient ascent on generator.

\(\max_{\theta_g}\bigg[\mathbb{E}_{z\sim p(z)}\Big(\log D_{\theta_d}(y)\Big)\bigg]\)
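The difference between the two generator objectives shows up in their gradients. This sketch assumes the discriminator output is \(D=\sigma(a)\) for some logit \(a\) (an assumption, the notes do not specify the output layer) and compares the gradients with respect to \(a\).

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)): the original generator objective (minimized)."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da of log(sigmoid(a)): the replacement objective (maximized)."""
    return 1.0 - sigmoid(a)

# a = -5 means D(y) is close to 0: the discriminator easily rejects the fake.
weak_signal = grad_saturating(-5.0)
strong_signal = grad_non_saturating(-5.0)
```

When the discriminator confidently rejects a sample, the original objective gives almost no gradient, while the replacement gives a gradient close to 1.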

Summary

Pros: Beautiful, state-of-the-art samples!

Cons:

  1. Trickier / more unstable to train.
  2. Can’t solve inference queries such as \(p(x), p(z|x)\).

14 - Self-supervised Learning

Aim: Solve “pretext” tasks that produce good features for downstream tasks.

Application:

  1. Learn a feature extractor from pretext tasks. (self-supervised)
  2. Attach a shallow network on the feature extractor.
  3. Train the shallow network on target task with small amount of labeled data. (supervised)

Pretext Tasks

Labels are generated automatically.

Rotation

Train a classifier on randomly rotated images.
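A minimal sketch of how rotation labels can be generated automatically, using nested lists as a stand-in for image tensors. The names `rot90` and `make_rotation_example` are illustrative.

```python
import random

def rot90(image):
    """Rotate a 2D list-of-rows image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def make_rotation_example(image, rng=random):
    """Return (rotated_image, label) with label in {0, 1, 2, 3} for 0/90/180/270 degrees."""
    label = rng.randrange(4)
    rotated = [list(row) for row in image]
    for _ in range(label):
        rotated = rot90(rotated)
    return rotated, label

image = [[1, 2],
         [3, 4]]
rotated, label = make_rotation_example(image, rng=random.Random(0))
```

The classifier is then trained to predict `label` from `rotated`; no human annotation is needed.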

Rearrangement

Train a classifier on randomly shuffled image pieces.

Predict the location of image pieces.

Inpainting

Mask part of the image, train a network to predict the masked area.

Method referencing Context Encoders: Feature Learning by Inpainting.

Combine two types of loss together to get better performance:

  1. Reconstruction loss (L2 loss): Used for reconstructing global features.
  2. Adversarial loss: Used for generating texture features.

Coloring

Transfer between greyscale images and colored images.

Cross-channel predictions for images: Split-Brain Autoencoders.

Video coloring: Establish mappings between reference and target frames in a learned feature space. Tracking Emerges by Colorizing Videos.

Summary for Pretext Tasks

  1. Pretext tasks focus on “visual common sense”.
  2. The models are forced to learn good features about natural images.
  3. We don’t care about the performance of these pretext tasks.

What we care about is the performance of downstream tasks.

Problems of Specific Pretext Tasks

  1. Coming up with individual pretext tasks is tedious.
  2. The learned representations may not be general.

Intuitive Solution: Contrastive Learning.

Contrastive Representation Learning

Local additional references: Contrastive Learning.md.

Objective:

Given a chosen score function \(s\), we aim to learn an encoder function \(f\) that yields:

  1. For each sample \(x\), increase the similarity \(s\big(f(x),f(x^+)\big)\) between \(x\) and positive samples \(x^+\).
  2. Finally we want \(s\big(f(x),f(x^+)\big)\gg s\big(f(x),f(x^-)\big)\).

Loss Function:

Given \(1\) positive sample and \(N-1\) negative samples:

| InfoNCE Loss | Cross Entropy Loss |
| --- | --- |
| \(\begin{aligned}\mathcal{L}=-\mathbb{E}_X\Bigg[\log\frac{\exp{s\big(f(x),f(x^+)\big)}}{\exp{s\big(f(x),f(x^+)\big)}+\sum_{j=1}^{N-1}\exp{s\big(f(x),f(x_j^-)\big)}}\Bigg]\end{aligned}\) | \(\begin{aligned}\mathcal{L}&=-\sum_{i=1}^Np(x_i)\log q(x_i)\\&=-\mathbb{E}_X\big[\log q(x)\big]\\&=-\mathbb{E}_X\Bigg[\log\frac{\exp(x)}{\sum_{j=1}^N\exp(x_j)}\Bigg]\end{aligned}\) |

The InfoNCE loss gives a lower bound on the mutual information between \(f(x)\) and \(f(x^+)\):

\(\text{MI}\big[f(x),f(x^+)\big]\ge\log(N)-\mathcal{L}\)

The larger \(N\) is, the tighter the bound.

So we use a large number \(N-1\) of negative samples.
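A minimal sketch of the InfoNCE loss for a single sample, taking the already-computed scores \(s\big(f(x),f(x^+)\big)\) and \(s\big(f(x),f(x_j^-)\big)\) as plain numbers. The function name `info_nce` is illustrative.

```python
import math

def info_nce(pos_score, neg_scores):
    """-log( exp(s+) / (exp(s+) + sum_j exp(s_j-)) ) for one sample."""
    scores = [pos_score] + list(neg_scores)
    m = max(scores)  # subtract the max before exponentiating, for numerical stability
    log_denom = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denom - pos_score

# With 1 positive and 7 negatives that all score the same, the loss is log(8).
uninformative = info_nce(0.0, [0.0] * 7)
# A positive that scores far above the negatives drives the loss toward 0.
confident = info_nce(10.0, [0.0] * 7)
```

This is the same computation as a cross-entropy loss over \(N\) classes in which the positive sample is the correct class.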

Instance Contrastive Learning

SimCLR

Use a projection function \(g(\cdot)\) to project features to a space where contrastive learning is applied.

The extra projection contributes a lot to the final performance.

Score Function: Cosine similarity \(s(u,v)=\frac{u^Tv}{\lVert u\rVert\lVert v\rVert}\).

Positive Pair: Pair of augmented data.

Momentum Contrastive Learning (MoCo)

There are three main training strategies in contrastive learning:

  1. end-to-end: Keys are updated together with queries, e.g. SimCLR. (limited by GPU memory)
  2. memory bank: Store keys from earlier iterations for sampling. (inconsistency between \(q\) and \(k\))
  3. MoCo: Use a momentum-updated encoder to encode keys. (combination of end-to-end & memory bank)

Key differences to SimCLR:

  1. Keep a running queue of keys (negative samples).
  2. Compute gradients and update the encoder only through the queries.
  3. Decouple the mini-batch size from the number of keys: can support a large number of negative samples.
  4. The key encoder is slowly progressing through the momentum update rules:

\(\theta_k\leftarrow m\theta_k+(1-m)\theta_q\)
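A small sketch of the two MoCo-specific mechanisms listed above: the momentum update of the key encoder and the running queue of keys. Parameters are plain lists of floats here, and the default `m=0.999` and the queue size are illustrative assumptions rather than values given in these notes.

```python
from collections import deque

def momentum_update(theta_k, theta_q, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q, applied element-wise."""
    return [m * k + (1.0 - m) * q for k, q in zip(theta_k, theta_q)]

# The key encoder moves only a small step toward the query encoder.
theta_k = momentum_update([0.0, 1.0], [1.0, 1.0], m=0.9)

# Running queue of keys: the oldest keys are dropped once the queue is full.
key_queue = deque(maxlen=4)
for batch_keys in (["k1", "k2"], ["k3", "k4"], ["k5", "k6"]):
    key_queue.extend(batch_keys)
```

Because the queue length is independent of the batch size, the number of negatives can be much larger than one mini-batch.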

Sequence Contrastive Learning

Contrastive Predictive Coding (CPC)

Contrastive: Contrast between “right” and “wrong” sequences using contrastive learning.

Predictive: The model has to predict future patterns given the current context.

Coding: The model learns useful feature vectors, or “code”, for downstream tasks, similar to other self-supervised methods.

Other Examples (Frontier)

Contrastive Language Image Pre-training (CLIP)

Contrastive learning between image and natural language sentences.

15 - Low-Level Vision

Pass...

16 - 3D Vision

Representation

Explicit vs Implicit

Explicit: Easy to sample examples, hard to do inside/outside check.

Implicit: Hard to sample examples, easy to do inside/outside check.

| | Non-parametric | Parametric |
| --- | --- | --- |
| Explicit | Points. Meshes. | Splines. Subdivision Surfaces. |
| Implicit | Level Sets. Voxels. | Algebraic Surfaces. Constructive Solid Geometry. |
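The trade-off between the two families can be seen on a sphere. The sketch below is an illustration with hypothetical function names: the implicit form makes the inside/outside check a sign test, while the explicit (parametric) form makes it easy to generate surface points.

```python
import math

def sphere_implicit(x, y, z, r=1.0):
    """Implicit form f(x, y, z) = x^2 + y^2 + z^2 - r^2: negative inside, zero on the surface."""
    return x * x + y * y + z * z - r * r

def sphere_explicit(u, v, r=1.0):
    """Explicit (parametric) form: map angles (u, v) to a point on the surface."""
    return (r * math.sin(u) * math.cos(v),
            r * math.sin(u) * math.sin(v),
            r * math.cos(u))

inside = sphere_implicit(0.1, 0.2, 0.3) < 0   # easy inside/outside check
surface_point = sphere_explicit(0.3, 1.2)     # easy sampling of a surface point
```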

Point Clouds

The simplest representation.

Collection of \((x,y,z)\) coordinates.

Cons:

  1. Difficult to draw in under-sampled regions.
  2. No simplification or subdivision.
  3. No direct smooth rendering.
  4. No topological information.

Polygonal Meshes

Collection of vertices \(v\) and edges \(e\).

Pros:

  1. Can apply downsampling or upsampling on meshes.
  2. Error decreases by \(O(n^2)\) while meshes increase by \(O(n)\).
  3. Can approximate arbitrary topology.
  4. Efficient rendering.

Splines

Use specific functions to approximate the surface. (e.g. Bézier Curves)

Algebraic Surfaces

Use specific functions to represent the surface.

Constructive Solid Geometry

Combine implicit geometry with Boolean operations.

Level Sets

Store a grid of values to approximate the function.

Surface is found where the interpolated value equals \(0\).
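A one-dimensional sketch of the idea, assuming linear interpolation between neighboring grid samples; real level-set methods work on 2D or 3D grids, and the function name `zero_crossings` is illustrative.

```python
def zero_crossings(xs, values):
    """Locate points where the linearly interpolated grid function equals 0."""
    roots = []
    for i in range(len(xs) - 1):
        x0, x1 = xs[i], xs[i + 1]
        v0, v1 = values[i], values[i + 1]
        if v0 == 0.0:
            roots.append(x0)
        elif v0 * v1 < 0.0:
            # The sign changes inside this cell: interpolate linearly.
            roots.append(x0 + (x1 - x0) * v0 / (v0 - v1))
    if values and values[-1] == 0.0:
        roots.append(xs[-1])
    return roots

grid = [0.0, 1.0, 2.0, 3.0]
samples = [x - 1.5 for x in grid]  # grid samples of f(x) = x - 1.5
surface = zero_crossings(grid, samples)
```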

Voxels

Binary thresholding the volumetric grid.

AI + 3D

Pass...

wnc's café

Image Classification-Data-driven Approach, k-Nearest Neighbor, train_val_test splits

653 words, 28 lines of code, estimated reading time 4 minutes

image classification

  • challenges
    • viewpoint variation
    • scale variation
    • deformation
    • occlusion
    • illumination conditions
    • background clutter
    • intra-class variation
  • data-driven approach
  • the image classification pipeline
    • input
    • learning
      • training a classifier
      • learning a model
    • evaluation

Nearest Neighbor Classifier

\[ d_1 (I_1, I_2) = \sum_{p} \left| I^p_1 - I^p_2 \right| \]
Python
import numpy as np
+    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

Image Classification-Data-driven Approach, k-Nearest Neighbor, train_val_test splits

653 words, 28 lines of code, estimated reading time 4 minutes

image classification

  • challenges
    • viewpoint variation
    • scale variation
    • deformation
    • occlusion
    • illumination conditions
    • background clutter
    • intra-class variation
  • data-driven approach
  • the image classification pipeline
    • input
    • learning
      • training a classifier
      • learning a model
    • evaluation

Nearest Neighbor Classifier

\[ d_1 (I_1, I_2) = \sum_{p} \left| I^p_1 - I^p_2 \right| \]
Python
import numpy as np
 
 class NearestNeighbor(object):  
   def __init__(self):
diff --git a/AI/CS231n/Linear classification-Support Vector Machine, Softmax/index.html b/AI/CS231n/Linear classification-Support Vector Machine, Softmax/index.html
index ebfdfd52..28e4c645 100644
--- a/AI/CS231n/Linear classification-Support Vector Machine, Softmax/index.html	
+++ b/AI/CS231n/Linear classification-Support Vector Machine, Softmax/index.html	
@@ -7,7 +7,7 @@
     .gdesc-inner { font-size: 0.75rem; }
     body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
     body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
-    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

Linear classification-Support Vector Machine, Softmax

129 words, estimated reading time 1 minute

Linear Classification

\[ L_i = \sum_{j\neq y_i} \max(0, s_j - s_{y_i} + \Delta) \]
\[ L = \frac{1}{N} \sum_i \sum_{j\neq y_i} \left[ \max(0, f(x_i; W)_j - f(x_i; W)_{y_i} + \Delta) \right] + \lambda \sum_k\sum_l W_{k,l}^2 \]
\[ L_i = -\log\left(\frac{e^{f_{y_i}}}{ \sum_j e^{f_j} }\right) \hspace{0.5in} \text{or equivalently} \hspace{0.5in} L_i = -f_{y_i} + \log\sum_j e^{f_j} \]
\[ \frac{e^{f_{y_i}}}{\sum_j e^{f_j}} = \frac{Ce^{f_{y_i}}}{C\sum_j e^{f_j}} = \frac{e^{f_{y_i} + \log C}}{\sum_j e^{f_j + \log C}} \]

![[Pasted image 20241031210509.png]]

wnc's café

Linear classification-Support Vector Machine, Softmax

129 words, estimated reading time 1 minute

Linear Classification

\[ L_i = \sum_{j\neq y_i} \max(0, s_j - s_{y_i} + \Delta) \]
\[ L = \frac{1}{N} \sum_i \sum_{j\neq y_i} \left[ \max(0, f(x_i; W)_j - f(x_i; W)_{y_i} + \Delta) \right] + \lambda \sum_k\sum_l W_{k,l}^2 \]
\[ L_i = -\log\left(\frac{e^{f_{y_i}}}{ \sum_j e^{f_j} }\right) \hspace{0.5in} \text{or equivalently} \hspace{0.5in} L_i = -f_{y_i} + \log\sum_j e^{f_j} \]
\[ \frac{e^{f_{y_i}}}{\sum_j e^{f_j}} = \frac{Ce^{f_{y_i}}}{C\sum_j e^{f_j}} = \frac{e^{f_{y_i} + \log C}}{\sum_j e^{f_j + \log C}} \]

![[Pasted image 20241031210509.png]]

wnc's café

Numpy

Python

49 words, 104 lines of code, estimated reading time 2 minutes

string

Python
s = "hello"
+    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

Numpy

Python

49 words, 104 lines of code, estimated reading time 2 minutes

string

Python
s = "hello"
 print(s.capitalize())  # Capitalize a string; prints "Hello"
 print(s.upper())       # Convert a string to uppercase; prints "HELLO"
 print(s.rjust(7))      # Right-justify a string, padding with spaces; prints "  hello"
diff --git a/AI/Dive_into_Deep_Learning/index.html b/AI/Dive_into_Deep_Learning/index.html
index 1c97111c..589d2e48 100644
--- a/AI/Dive_into_Deep_Learning/index.html
+++ b/AI/Dive_into_Deep_Learning/index.html
@@ -7,7 +7,7 @@
     .gdesc-inner { font-size: 0.75rem; }
     body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
     body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
-    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

Dive into Deep Learning

1547 words, 387 lines of code, estimated reading time 13 minutes

1 Introduction

2 Preliminaries

2.1 Data Manipulation

  • tensor
  • ndarray (MXNet)
  • Tensor (TensorFlow)
Python
x = torch.arange(12)
+    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

Dive into Deep Learning

1547 words, 387 lines of code, estimated reading time 13 minutes

1 Introduction

2 Preliminaries

2.1 Data Manipulation

  • tensor
  • ndarray (MXNet)
  • Tensor (TensorFlow)
Python
x = torch.arange(12)
 x.shape
 x.numel()
 x.reshape(3, 4)
diff --git a/AI/EECS 498-007/KNN/index.html b/AI/EECS 498-007/KNN/index.html
index 4eb1b5cc..5b1f92a3 100644
--- a/AI/EECS 498-007/KNN/index.html	
+++ b/AI/EECS 498-007/KNN/index.html	
@@ -7,7 +7,7 @@
     .gdesc-inner { font-size: 0.75rem; }
     body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
     body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
-    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

KNN

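For a sample to be classified, find the K samples in the training set that are closest to it (its nearest neighbors), then decide its class from the classes of these K samples.

374 words, 100 lines of code, estimated reading time 3 minutes

Mathematical Derivation

Suppose we have a training set \(T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}\), where \(x_i\) is a feature vector and \(y_i\) is the corresponding class label. For a new sample \(x\) to be classified, the KNN algorithm aims to predict its class \(y\).

  1. Distance metric: First, we need a distance metric to compute the distance between the sample \(x\) and each training sample \(x_i\). Common metrics include the Euclidean distance, the Manhattan distance, and the Minkowski distance. Taking the Euclidean distance as an example, the distance between two samples \(x\) and \(x_i\) is defined as:
\[ d(x, x_i) = \sqrt{\sum_{j=1}^{d} (x_j - x_{i,j})^2} \]

where \(d\) is the feature dimension.

  2. Finding the nearest neighbors: Then, based on the computed distances, we select the K closest samples, which form the neighborhood \(N_k(x)\) of the sample to be classified.
  3. Decision rule: Finally, the class of the sample is decided by majority vote over the classes of the samples in the neighborhood \( N_k(x) \), i.e.:
\[ y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j) \]

where \(I\) is the indicator function, equal to 1 when \(y_i = c_j\) and 0 otherwise.

Implementation in the Assignment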

Python
import torch
+    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

KNN

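For a sample to be classified, find the K samples in the training set that are closest to it (its nearest neighbors), then decide its class from the classes of these K samples.

374 words, 100 lines of code, estimated reading time 3 minutes

Mathematical Derivation

Suppose we have a training set \(T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}\), where \(x_i\) is a feature vector and \(y_i\) is the corresponding class label. For a new sample \(x\) to be classified, the KNN algorithm aims to predict its class \(y\).

  1. Distance metric: First, we need a distance metric to compute the distance between the sample \(x\) and each training sample \(x_i\). Common metrics include the Euclidean distance, the Manhattan distance, and the Minkowski distance. Taking the Euclidean distance as an example, the distance between two samples \(x\) and \(x_i\) is defined as:
\[ d(x, x_i) = \sqrt{\sum_{j=1}^{d} (x_j - x_{i,j})^2} \]

where \(d\) is the feature dimension.

  2. Finding the nearest neighbors: Then, based on the computed distances, we select the K closest samples, which form the neighborhood \(N_k(x)\) of the sample to be classified.
  3. Decision rule: Finally, the class of the sample is decided by majority vote over the classes of the samples in the neighborhood \( N_k(x) \), i.e.:
\[ y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j) \]

where \(I\) is the indicator function, equal to 1 when \(y_i = c_j\) and 0 otherwise.

Implementation in the Assignment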

Python
import torch
 
 def compute_distances_two_loops(x_train, x_test):
   num_train = x_train.shape[0]
diff --git a/AI/EECS 498-007/Pytorch/index.html b/AI/EECS 498-007/Pytorch/index.html
index 07bb889a..592684a9 100644
--- a/AI/EECS 498-007/Pytorch/index.html	
+++ b/AI/EECS 498-007/Pytorch/index.html	
@@ -7,7 +7,7 @@
     .gdesc-inner { font-size: 0.75rem; }
     body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
     body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
-    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

Basic Usage of PyTorch

564 words, 45 lines of code, estimated reading time 3 minutes

Python
# Create a rank 2 tensor from a nested Python list
+    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

Basic Usage of PyTorch

564 words, 45 lines of code, estimated reading time 3 minutes

Python
# Create a rank 2 tensor from a nested Python list
 a = torch.tensor([[1, 2, 3], [4, 5, 6]])
 print('Here is a:')
 print(a)
diff --git a/AI/EECS 498-007/linear_classifer/index.html b/AI/EECS 498-007/linear_classifer/index.html
index c71420a3..f779b729 100644
--- a/AI/EECS 498-007/linear_classifer/index.html	
+++ b/AI/EECS 498-007/linear_classifer/index.html	
@@ -7,7 +7,7 @@
     .gdesc-inner { font-size: 0.75rem; }
     body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
     body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
-    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

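Linear classifier

Principles

677 words, 216 lines of code, estimated reading time 6 minutes

Two linear classifiers: the Support Vector Machine (SVM) and the Softmax classifier. Both are supervised learning algorithms used for classification tasks.

Support Vector Machine (SVM)

The goal of an SVM is to find a hyperplane that maximizes the margin between different classes. This hyperplane is called the optimal separating hyperplane. For binary classification, the SVM loss can be written as:

\[ L(W, b) = \frac{1}{N} \sum_{i=1}^{N} \max(0, 1 - y_i (W \cdot x_i + b)) \]

where \(W\) is the weight vector, \(b\) is the bias term, \(x_i\) is the input feature, \(y_i\) is the label (-1 or 1), and \(N\) is the number of samples.

For multi-class classification, we use the structured SVM loss, which considers the score of every class and tries to maximize the gap between the score of the correct class and the next-highest class score. The loss can be written as:

\[ L(W) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j \neq y_i} \max(0, \text{score}_j - \text{score}_{y_i} + \Delta) \]

where \(\text{score}_j = W_j \cdot x_i\), and \(\Delta\) is a constant, usually set to 1.

Softmax Classifier

The Softmax classifier uses the softmax function to map input features to a probability distribution. For each sample, the softmax function outputs the probability of each class. It is defined as:

\[ \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \]

where \(z_i\) is the score of the \(i\)-th class and \(K\) is the total number of classes.

The loss of the Softmax classifier is the cross-entropy loss, which can be written as:

\[ L(W) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{K} y_{ij} \log(\text{softmax}(z_j)) \]

where \(y_{ij}\) is an indicator variable, equal to 1 if sample \(i\) belongs to class \(j\) and 0 otherwise.

Regularization

To prevent overfitting, we add a regularization term to the loss. The loss with L2 regularization can be written as:

\[ L(W) = L(W) + \lambda \lVert W \rVert^2 \]

where \(\lambda\) is the regularization strength.

Code Implementation

The code implements naive versions of the two losses (svm_loss_naive, softmax_loss_naive) and vectorized versions (svm_loss_vectorized, softmax_loss_vectorized). The vectorized versions are more efficient because they avoid explicit loops.

The training procedure (train_linear_classifier) optimizes the loss with stochastic gradient descent (SGD). In each iteration, we randomly sample a mini-batch, compute the loss and gradient, and then update the weights.

The prediction procedure (predict_linear_classifier) uses the trained weights to predict the class of new samples.

Hyperparameter Search

The code also includes hyperparameter search functions (svm_get_search_params, softmax_get_search_params), which return candidate learning rates and regularization strengths so that the best model parameters can be found.

Assignment Implementation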

Python
import torch
+    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

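Linear classifier

Principles

677 words, 216 lines of code, estimated reading time 6 minutes

Two linear classifiers: the Support Vector Machine (SVM) and the Softmax classifier. Both are supervised learning algorithms used for classification tasks.

Support Vector Machine (SVM)

The goal of an SVM is to find a hyperplane that maximizes the margin between different classes. This hyperplane is called the optimal separating hyperplane. For binary classification, the SVM loss can be written as:

\[ L(W, b) = \frac{1}{N} \sum_{i=1}^{N} \max(0, 1 - y_i (W \cdot x_i + b)) \]

where \(W\) is the weight vector, \(b\) is the bias term, \(x_i\) is the input feature, \(y_i\) is the label (-1 or 1), and \(N\) is the number of samples.

For multi-class classification, we use the structured SVM loss, which considers the score of every class and tries to maximize the gap between the score of the correct class and the next-highest class score. The loss can be written as:

\[ L(W) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j \neq y_i} \max(0, \text{score}_j - \text{score}_{y_i} + \Delta) \]

where \(\text{score}_j = W_j \cdot x_i\), and \(\Delta\) is a constant, usually set to 1.

Softmax Classifier

The Softmax classifier uses the softmax function to map input features to a probability distribution. For each sample, the softmax function outputs the probability of each class. It is defined as:

\[ \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \]

where \(z_i\) is the score of the \(i\)-th class and \(K\) is the total number of classes.

The loss of the Softmax classifier is the cross-entropy loss, which can be written as:

\[ L(W) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{K} y_{ij} \log(\text{softmax}(z_j)) \]

where \(y_{ij}\) is an indicator variable, equal to 1 if sample \(i\) belongs to class \(j\) and 0 otherwise.

Regularization

To prevent overfitting, we add a regularization term to the loss. The loss with L2 regularization can be written as:

\[ L(W) = L(W) + \lambda \lVert W \rVert^2 \]

where \(\lambda\) is the regularization strength.

Code Implementation

The code implements naive versions of the two losses (svm_loss_naive, softmax_loss_naive) and vectorized versions (svm_loss_vectorized, softmax_loss_vectorized). The vectorized versions are more efficient because they avoid explicit loops.

The training procedure (train_linear_classifier) optimizes the loss with stochastic gradient descent (SGD). In each iteration, we randomly sample a mini-batch, compute the loss and gradient, and then update the weights.

The prediction procedure (predict_linear_classifier) uses the trained weights to predict the class of new samples.

Hyperparameter Search

The code also includes hyperparameter search functions (svm_get_search_params, softmax_get_search_params), which return candidate learning rates and regularization strengths so that the best model parameters can be found.

Assignment Implementation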

Python
import torch
 import random
 from abc import abstractmethod
 
diff --git a/AI/FFB6D/FFB6D_Conda/index.html b/AI/FFB6D/FFB6D_Conda/index.html
index b2fc1891..acd402bd 100644
--- a/AI/FFB6D/FFB6D_Conda/index.html
+++ b/AI/FFB6D/FFB6D_Conda/index.html
@@ -7,7 +7,7 @@
     .gdesc-inner { font-size: 0.75rem; }
     body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
     body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
-    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

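FFB6D Environment Setup Guide: Native System Installation

293 words, 96 lines of code, estimated reading time 3 minutes

1. System Requirements

  • Ubuntu 20.04/22.04/24.04
  • NVIDIA GPU (with CUDA support)
  • At least 8 GB of RAM
  • At least 30 GB of disk space

2. Basic Environment Setup

2.1 Install the NVIDIA Driver

Bash
# Add the NVIDIA package repository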
+    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

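FFB6D Environment Setup Guide: Native System Installation

293 words, 96 lines of code, estimated reading time 3 minutes

1. System Requirements

  • Ubuntu 20.04/22.04/24.04
  • NVIDIA GPU (with CUDA support)
  • At least 8 GB of RAM
  • At least 30 GB of disk space

2. Basic Environment Setup

2.1 Install the NVIDIA Driver

Bash
# Add the NVIDIA package repository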
 sudo add-apt-repository ppa:graphics-drivers/ppa
 sudo apt-get update
 
diff --git a/AI/FFB6D/FFB6D_Docker/index.html b/AI/FFB6D/FFB6D_Docker/index.html
index 1fb14e6f..205c805f 100644
--- a/AI/FFB6D/FFB6D_Docker/index.html
+++ b/AI/FFB6D/FFB6D_Docker/index.html
@@ -7,7 +7,7 @@
     .gdesc-inner { font-size: 0.75rem; }
     body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
     body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
-    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

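Docker from Beginner to Practice: FFB6D Environment Setup as an Example

653 words, 213 lines of code, estimated reading time 6 minutes

1. Introduction

Docker is an open-source application container engine that lets developers package their applications and dependencies into a portable container, which can then be published to any popular Linux or Windows operating system. Using the runtime environment setup for FFB6D (a 3D object detection model) as an example, this article introduces the basic usage of Docker.

2. Environment Preparation

2.1 System Requirements

  • Ubuntu 20.04/22.04/24.04
  • NVIDIA GPU (with CUDA support)
  • At least 8 GB of RAM
  • At least 30 GB of disk space

2.2 Installing Basic Components

Install Docker

Bash
# Update the apt package index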
+    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

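Docker from Beginner to Practice: Setting Up the FFB6D Environment as an Example

653 words · 213 lines of code · estimated reading time 6 minutes

1. Introduction

Docker is an open-source application container engine. It lets developers package an application together with its dependencies into a portable container, which can then be deployed on any mainstream Linux or Windows machine. This post introduces basic Docker usage, using the runtime environment for FFB6D (a 6D object pose estimation model) as the running example.

2. Preparing the environment

2.1 System requirements

  • Ubuntu 20.04/22.04/24.04
  • NVIDIA GPU (CUDA-capable)
  • At least 8 GB of RAM
  • At least 30 GB of disk space

2.2 Installing the base components

Install Docker

Bash
# Update the apt package index
 sudo apt-get update
 
 # Install the required system tools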
diff --git a/AI/SLAM14/index.html b/AI/SLAM14/index.html
index 3466ae73..2f815b87 100644
--- a/AI/SLAM14/index.html
+++ b/AI/SLAM14/index.html
@@ -7,7 +7,7 @@
     .gdesc-inner { font-size: 0.75rem; }
     body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
     body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
-    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

+    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

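14 Lectures on Visual SLAM (视觉 SLAM 十四讲)

14110 words · 72 lines of code · 20 images · estimated reading time 71 minutes

1 Preliminaries

1.1 What this book covers

Simultaneous localization and mapping (SLAM)

  • Localization
  • Mapping
  • Background knowledge:
    • Projective geometry
    • Computer vision
    • State estimation theory
    • Lie groups and Lie algebras

1.2 How to use this book

1.2.1 Organization

  • Mathematical foundations
    • QQ_1725118522621.png
  • Practice and applications
    • QQ_1725118558994.png

1.2.2 Code

GitHub - gaoxiang12/slambook2: edition 2 of the slambook

1.2.3 Intended readers

  • Prerequisites:
    • Calculus, linear algebra, and probability theory
    • C++ basics (the C++ standard library, template classes, and some C++11)
    • Linux basics

1.3 Style conventions

1.4 Acknowledgments and declarations

1.5 Exercises

  • Question: Given the linear equation \(A x=b\) with \(A\) and \(b\) known, how do we solve for \(x\)? What does this require of \(A\) and \(b\)? Hint: analyze the dimensions and rank of \(A\).
  • Answer: The system \(Ax = b\) can be solved in several ways, such as Gaussian elimination or matrix inversion. If \(A\) is square and invertible (its determinant is nonzero), there is a unique solution \(x = A^{-1}b\). In general, a solution exists if and only if the rank of \(A\) equals the rank of the augmented matrix \(\displaystyle [A|b]\), and it is unique if and only if that rank also equals the number of columns of \(A\).
  • Question: What is the Gaussian distribution? What does its one-dimensional form look like? What about its high-dimensional form?
  • Answer: The Gaussian distribution, also called the normal distribution, is a continuous probability distribution. The one-dimensional density is \(\displaystyle f (x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\), where \(\displaystyle \mu\) is the mean and \(\displaystyle \sigma\) is the standard deviation. The high-dimensional Gaussian generalizes this to \(N\) dimensions, with density \(\displaystyle N (\mathbf{x}; \mathbf{\mu}, \Sigma) = \frac{1}{\sqrt{(2\pi)^{N}\det\Sigma}} \exp\left(-\frac{1}{2}(\mathbf{x}-\mathbf{\mu})^{T}\Sigma^{-1}(\mathbf{x}-\mathbf{\mu})\right)\), where \(\displaystyle \mathbf{\mu}\) is the mean vector and \(\displaystyle \Sigma\) is the covariance matrix.
  • Question: Do you know the C++11 standard? Which of its new features have you heard of or used? Are there other standards?
  • Answer: Yes. C++11 is an important standard of the C++ language. It introduced many new features, such as automatic type deduction (auto), range-based for loops, lambda expressions, and smart pointers. After C++11 came C++14, C++17, and C++20, each of which added further features and improvements.
  • Question: How do you install software on Ubuntu without opening the Software Center? Where is the software installed? If you only know part of a package name (for example, you want a library whose name contains Eigen), how do you install it?
  • Answer:
  • Installing software: on Ubuntu, use the command-line tool apt. The basic command is sudo apt install [package-name].
  • Install location: files are usually installed under /usr/, although the individual files may be spread across several subdirectories.
  • Installing from a partial name: if you only know part of the name, search with apt search. For example, apt search eigen lists all packages whose name or description contains "eigen"; then install the one you need with sudo apt install.
  • Question*: Spend an hour learning Vim, because sooner or later you will use it. You can type vimtutor in a terminal and read through everything. You do not need to be very fluent; being able to type code with it while working through this book is enough. Do not waste time on plugins and do not try to turn Vim into an IDE; we only use it for text editing.
  • Answer:
    • I am still far from fluent in Vim.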
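
The rank conditions from the first exercise can be checked mechanically. The block below is a minimal sketch, not code from the book: the helper names `rank` and `classify` are made up for illustration, and it uses exact fractions with plain Gaussian elimination rather than a numerical library.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a matrix (list of rows) via Gaussian elimination over exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0  # number of pivots found so far
    for col in range(len(m[0]) if m else 0):
        # Look for a nonzero entry in this column at or below row r.
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        # Eliminate the entries below the pivot.
        for i in range(r + 1, len(m)):
            factor = m[i][col] / m[r][col]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


def classify(A, b):
    """Return 'unique', 'infinite', or 'none' for the system A x = b."""
    augmented = [list(row) + [bi] for row, bi in zip(A, b)]
    rank_a, rank_ab = rank(A), rank(augmented)
    if rank_a != rank_ab:
        return "none"  # b is not in the column space of A
    return "unique" if rank_a == len(A[0]) else "infinite"
```

For example, `classify([[1, 2], [2, 4]], [3, 7])` reports that no solution exists, because appending `b` raises the rank from 1 to 2.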

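2 A First Look at SLAM

2.1 Introduction: the "Little Carrot" (小萝卜) robot example

  • Autonomous motion
  • Perceiving the surrounding environment
    • State
    • Environment
  • Sensors installed in the environment (not ideal in practice)
  • Sensors mounted on the robot body
    • Laser SLAM
    • Visual SLAM (the focus of this book)
      • Monocular
        • Uses only a single camera
        • Sense of distance
          • Motion
          • Structure
          • Disparity
          • Scale
            • Scale Ambiguity
          • Absolute depth cannot be determined
      • Stereo
        • The distance between the two cameras (the baseline) is known
        • Configuration and calibration are fairly complex
      • Depth (RGB-D)
        • Infrared structured light or Time-of-Flight (ToF)
        • Mainly used indoors; outdoor use suffers from many disturbances
      • Some less common sensors: panoramic cameras, event cameras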

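2.2 The classical visual SLAM framework

  • QQ_1725120955824.png QQ_1725121088279.png
  • When the external environment is relatively stable, SLAM technology is already fairly mature

2.2.1 Visual odometry

  • Estimating the trajectory with visual odometry alone leads to accumulating drift
  • Loop closure detection and backend optimization are therefore needed

2.2.2 Backend optimization

  • Maximum a posteriori (MAP) estimation
  • Frontend
    • Image feature extraction and matching
  • Backend
    • Filtering and nonlinear optimization algorithms
  • Estimates the uncertainty of the moving body's own state and of the surrounding environment

2.2.3 Loop closure detection

  • Also called closed-loop detection
  • Recognizes scenes that have been visited before
  • Relies on image similarity

2.2.4 Mapping

  • Metric maps
    • Sparse
      • Landmark
      • Used for localization
    • Dense
      • Grid / Voxel
      • Used for navigation
  • Topological maps
    • Graph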

2.3 Mathematical formulation of the SLAM problem

  • Motion equation
    • \(\displaystyle \quad\boldsymbol{x}_k=f\left(\boldsymbol{x}_{k-1},\boldsymbol{u}_k,\boldsymbol{w}_k\right).\)
      • \(\displaystyle \boldsymbol{u}_{k}\) is the motion sensor input
      • \(\displaystyle \boldsymbol{w}_{k}\) is the noise added during the process
  • Observation equation
    • \(\displaystyle \boldsymbol{z}_{k,j} = h (\boldsymbol{y}_{j},\boldsymbol{x}_{k},\boldsymbol{v}_{k,j})\)
      • \(\displaystyle \boldsymbol{v}_{k,j}\) is the observation noise
  • There are many ways to parameterize these equations
  • They can be summarized as the following two equations
\[ \begin{cases}\boldsymbol{x}_k=f\left(\boldsymbol{x}_{k-1},\boldsymbol{u}_k,\boldsymbol{w}_k\right),&k=1,\cdots,K\\\boldsymbol{z}_{k,j}=h\left(\boldsymbol{y}_j,\boldsymbol{x}_k,\boldsymbol{v}_{k,j}\right),&(k,j)\in\mathcal{O}\end{cases}. \]
  • Given the motion readings \(\displaystyle \boldsymbol{u}\) and the sensor readings \(\displaystyle \boldsymbol{z}\), how do we solve the localization problem (estimate \(\boldsymbol{x}\)) and the mapping problem (estimate \(\boldsymbol{y}\))?
    • This is a state estimation problem: estimating internal, hidden state variables from noisy measurements
  • Linear Gaussian -> Kalman Filter
  • Non-Linear Non-Gaussian -> Extended Kalman Filter and nonlinear optimization
  • EKF -> Particle Filter -> Graph Optimization
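
To make the linear Gaussian case concrete, here is a minimal sketch of a scalar Kalman filter. The specific model is an assumption chosen for illustration and is not taken from the book: the motion equation is \(x_k = x_{k-1} + u_k + w_k\) and the observation equation is \(z_k = x_k + v_k\), with noise variances `q` and `r`. The function name `kalman_1d` is hypothetical.

```python
def kalman_1d(x, p, controls, measurements, q, r):
    """Scalar Kalman filter for x_k = x_{k-1} + u_k + w_k, z_k = x_k + v_k.

    x, p: initial state mean and variance
    q, r: variances of the process noise w_k and the observation noise v_k
    Returns the list of (mean, variance) estimates after each step.
    """
    estimates = []
    for u, z in zip(controls, measurements):
        # Prediction step: apply the motion equation, uncertainty grows by q.
        x = x + u
        p = p + q
        # Update step: blend prediction and measurement using the Kalman gain.
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1 - k) * p
        estimates.append((x, p))
    return estimates
```

With `kalman_1d(0.0, 1.0, [1.0], [1.2], q=0.0, r=1.0)` the prediction is 1.0 with variance 1.0, the gain is 0.5, and the fused estimate lands halfway between the prediction and the measurement.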

2.4 Practice: programming basics

2.4.1 Installing the Linux operating system

2.4.2 Hello SLAM

2.4.3 Using cmake

Text Only
cmake_minimum_required( VERSION 2.8)
 
 project(HelloSLAM)
 
diff --git a/AI/index.html b/AI/index.html
index 43a9192e..02e70e21 100644
--- a/AI/index.html
+++ b/AI/index.html
@@ -7,7 +7,7 @@
     .gdesc-inner { font-size: 0.75rem; }
     body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
     body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
-    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

Artificial Intelligence

Abstract

Unless otherwise stated, the content in this section is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

  • 653 213 5 mins
    1734024510
  • 293 96 2 mins
    1734024510
  • 49 104 1 mins
    1734024510
  • 8852 30 mins
    1734012860


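Statistical Learning Methods (统计学习方法)

3356 words · 43 images · estimated reading time 17 minutes

1 Introduction to statistical learning methods

1.1 Statistical learning

  1. Characteristics of statistical learning
    1. Uses computers and networks as its platform
    2. Takes data as its object of study
    3. Aims to predict and analyze data
    4. Is an interdisciplinary field
  2. Object of statistical learning
    1. Data
  3. Goal of statistical learning
  4. Methods of statistical learning
    1. Main categories
      1. Supervised learning (the main topic of this book)
      2. Unsupervised learning
      3. Semi-supervised learning
      4. Reinforcement learning
    2. Three elements
      1. Model
      2. Strategy
      3. Algorithm
    3. Steps
      1. Obtain a training data set
      2. Determine the set of candidate models
      3. Determine the learning strategy
      4. Determine the learning algorithm
      5. Select the optimal model through the learning method
      6. Use the learned optimal model to predict or analyze new data
  5. Research in statistical learning
    1. Methods
    2. Theory
    3. Applications
  6. Importance of statistical learning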

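1.2 Supervised learning

1.2.1 Basic concepts

  1. Input space, feature space, and output space
    1. Each input is an instance, usually represented by a feature vector
    2. Supervised learning learns a model from the training data set and uses it to predict on test data
    3. Problem types, by the types of the input and output variables
      1. Regression: both input and output are continuous
      2. Classification: the output takes finitely many discrete values
      3. Tagging: both input and output are sequences of variables
  2. Joint probability distribution
  3. Hypothesis space
    1. The model belongs to a set of mappings from the input space to the output space; this set is the hypothesis space
    2. The model can be probabilistic or non-probabilistic

1.2.2 Formalization of the problem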

QQ_1725975153680.png

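1.3 The three elements of statistical learning

  • Method = model + strategy + algorithm

1.3.1 Model

  • The model is the conditional probability distribution or decision function to be learned
\[ \mathcal{F}=\{f\mid Y=f(X)\} \]
  • Parameter space
\[ \mathcal{F}=\{f | Y=f_{\theta}(X),\theta\in\mathbf{R}^{n}\} \]
  • The hypothesis space can equally be defined as a set of conditional probability distributions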
\[ \mathcal{F}=\{P|P(Y|X)\} \]
\[ \mathcal{F}=\{P\mid P_{\theta}(Y\mid X),\theta\in\mathbf{R}^{n}\} \]

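1.3.2 Strategy

  1. Loss function and risk function
    1. Loss function (or cost function) \(\displaystyle L(Y,f(X))\)
      1. 0-1 loss function
        1. \(\displaystyle L(Y,f(X))=\begin{cases}1,&Y\neq f(X)\\0,&Y=f(X)\end{cases}\)
      2. quadratic loss function
        1. \(\displaystyle L(Y,f(X))=(Y-f(X))^{2}\)
      3. absolute loss function
        1. \(\displaystyle L(Y,f(X))=|Y-f(X)|\)
      4. logarithmic loss function (log-likelihood loss function)
        1. \(\displaystyle L(Y,P(Y\mid X))=-\log P(Y\mid X)\)
    2. \(\displaystyle R_{\exp}(f)=E_{P}[L(Y,f(X))]=\int_{\mathcal{X}\times\mathcal{Y}}L(y,f(x))P(x,y)\mathrm{d}x\mathrm{d}y\)
      1. Risk function (expected loss)
      2. The joint distribution \(P(X,Y)\) is unknown, which is why learning is needed; yet minimizing the expected risk itself requires the joint distribution, so supervised learning becomes an ill-formed problem
    3. Empirical risk (empirical loss)
      1. \(\displaystyle R_{\mathrm{emp}}(f)=\frac{1}{N}\sum_{i=1}^{N}L(y_{i},f(x_{i}))\)
      2. As \(\displaystyle N\) tends to infinity, the empirical risk tends to the expected risk
        1. With a finite sample the empirical risk has to be corrected, which leads to two basic strategies:
          1. Empirical risk minimization
          2. Structural risk minimization
  2. Empirical risk minimization and structural risk minimization
    1. Empirical risk minimization (works well when the sample size is large enough)
      1. \(\displaystyle \min_{f\in\mathcal{F}} \frac{1}{N}\sum_{i=1}^{N}L(y_{i},f(x_{i}))\)
      2. Maximum likelihood estimation is the special case where the model is a conditional probability distribution and the loss is the log loss
    2. Structural risk minimization
      1. Equivalent to regularization
      2. \(\displaystyle R_{\mathrm{srm}}(f)=\frac{1}{N}\sum_{i=1}^{N}L(y_{i},f(x_{i}))+\lambda J(f)\)
      3. \(J(f)\) is the model complexity; the term \(\lambda J(f)\), with \(\lambda \geqslant 0\), penalizes complex models
      4. Maximum a posteriori (MAP) estimation is the special case where the model is a conditional probability distribution, the loss is the log loss, and the complexity is given by the prior probability of the model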
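
The loss functions and the two risks above translate almost line by line into code. The following is a small sketch for illustration; the function names and the way the complexity \(J(f)\) is passed in as a plain number are assumptions, not something prescribed by the book.

```python
import math


def zero_one_loss(y, fx):
    """0-1 loss: 1 if the prediction is wrong, 0 otherwise."""
    return 0.0 if y == fx else 1.0


def quadratic_loss(y, fx):
    """Quadratic loss (Y - f(X))^2."""
    return (y - fx) ** 2


def absolute_loss(y, fx):
    """Absolute loss |Y - f(X)|."""
    return abs(y - fx)


def log_loss(p_y_given_x):
    """Logarithmic loss -log P(Y | X), given the predicted probability of the true label."""
    return -math.log(p_y_given_x)


def empirical_risk(loss, ys, preds):
    """R_emp(f): average loss over the training sample."""
    return sum(loss(y, p) for y, p in zip(ys, preds)) / len(ys)


def structural_risk(loss, ys, preds, complexity, lam):
    """R_srm(f) = R_emp(f) + lambda * J(f), with J(f) supplied as a number."""
    return empirical_risk(loss, ys, preds) + lam * complexity
```

For instance, `empirical_risk(quadratic_loss, [1, 2, 3], [1, 2, 5])` averages the squared errors 0, 0, and 4.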

1.3.3 Algorithm

1.4 Model evaluation and model selection

1.4.1 Training error and test error

\[ R_{\mathrm{emp}}(\hat{f})=\frac{1}{N}\sum_{i=1}^{N}L(y_{i},\hat{f}(x_{i})) \]
\[ e_{\mathrm{test}}=\frac{1}{N^{\prime}}\sum_{i=1}^{N^{\prime}}L(y_{i},\hat{f}(x_{i})) \]
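\[ r_{\mathrm{test}}+e_{\mathrm{test}}=1 \]
  • Under the 0-1 loss, \(e_{\mathrm{test}}\) is the error rate on the test set and \(r_{\mathrm{test}}\) is the accuracy on the test set
  • generalization ability

1.4.2 Overfitting and model selection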

  • model selection
  • over-fitting
    QQ_1725977613135.png

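1.5 Regularization and cross-validation

1.5.1 Regularization

\[ L(w)=\frac{1}{N}\sum_{i=1}^{N}(f(x_{i};w)-y_{i})^{2}+\frac{\lambda}{2}\parallel w\parallel^{2} \]

1.5.2 Cross-validation

  • cross validation
  • Data set split
    • Training set
    • Validation set
    • Test set
  • Variants
    1. Simple cross-validation
    2. \(\displaystyle S\)-fold cross-validation
    3. Leave-one-out cross-validation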
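
The index bookkeeping behind \(S\)-fold cross-validation can be sketched as follows. This is an illustrative helper (the name `s_fold_splits` is made up here) that splits indices in order; in practice the data is usually shuffled first, which is omitted to keep the example deterministic.

```python
def s_fold_splits(n, s):
    """Yield (train_indices, validation_indices) for S-fold cross-validation.

    The n sample indices 0..n-1 are cut into s contiguous folds whose sizes
    differ by at most one; each fold serves as the validation set exactly once.
    """
    fold_sizes = [n // s + (1 if i < n % s else 0) for i in range(s)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, validation
        start += size
```

Leave-one-out cross-validation is the special case `s_fold_splits(n, n)`, where every validation fold holds a single sample.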

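1.6 Generalization ability

1.6.1 Generalization error

  • generalization error
\[ R_{\exp}(\hat{f})=E_{P}[L(Y,\hat{f}(X))]=\int_{\mathcal{X}\times\mathcal{Y}}L(y,\hat{f}(x))P(x,y)\mathrm{d}x\mathrm{d}y \]

1.6.2 Generalization error bound

  • generalization error bound
  • As the sample size increases, the generalization error bound tends to 0
  • The larger the hypothesis space, the larger the generalization error bound
    QQ_1725978149442.png
  • This theorem applies only to hypothesis spaces that contain finitely many functions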

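1.7 Generative and discriminative models

  • generative model
    • Learns the joint probability distribution \(\displaystyle P(X,Y)\), from which \(P(Y|X)\) is derived
    • Naive Bayes
    • Hidden Markov model
    • Converges faster
  • discriminative model
    • Directly learns the decision function or the conditional probability distribution \(\displaystyle P(Y|X)\)
    • \(\displaystyle k\)-nearest neighbor
    • Perceptron
    • Decision tree
    • Logistic regression model
    • Maximum entropy model
    • Support vector machine
    • Boosting
    • Conditional random field
    • Often higher accuracy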

1.8 Classification

  • precision \(\displaystyle P=\frac{TP}{TP+FP}\)
  • recall \(\displaystyle R=\frac{TP}{TP+FN}\)
    QQ_1725979882159.png
\[ \frac{2}{F_{1}}=\frac{1}{P}+\frac{1}{R} \]
\[ F_{1}=\frac{2TP}{2TP+FP+FN} \]
  • text classification
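
As a quick check of the precision, recall, and \(F_1\) formulas above, here is a minimal sketch that counts TP, FP, and FN from two label lists. The function name and the convention of returning 0.0 when a denominator is zero are assumptions made for this example.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall.
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1
```

With `y_true = [1, 1, 1, 0, 0]` and `y_pred = [1, 1, 0, 1, 0]` there are 2 true positives, 1 false positive, and 1 false negative, so all three metrics equal 2/3.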

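1.9 Tagging

  • Tagging is a generalization of classification
  • It is a simple form of structure prediction
  • Hidden Markov model
  • Conditional random field

1.10 Regression

  • regression
  • Linear and nonlinear regression, simple (one-variable) regression, multiple regression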

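2 Perceptron

  • perceptron
  • The perceptron corresponds to a separating hyperplane that divides the instances in the input space into a positive and a negative class; it is a discriminative model
  • It has a primal form and a dual form

2.1 Perceptron model

QQ_1725980556672.png

  • The hypothesis space is the set of all linear classification models defined on the feature space, \(\displaystyle \{f|f(x) = w \cdot x+b\}\)
  • separating hyperplane
    QQ_1725980719817.png

2.2 Perceptron learning strategy

2.2.1 Linear separability of the data set

QQ_1725980832517.png

2.2.2 Perceptron learning strategy

  • Define a loss function and minimize it; \(M\) denotes the set of misclassified points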
\[ L(w,b)=-\sum_{x_{i}\in M}y_{i}(w\cdot x_{i}+b) \]

2.2.3 Perceptron learning algorithm

2.2.4 Primal form of the perceptron learning algorithm

\[ \min_{w,b}L(w,b)=-\sum_{x_{i}\in M}y_{i}(w\cdot x_{i}+b) \]
  • stochastic gradient descent
\[ \nabla_{w}L(w,b)=-\sum_{x_{i}\in M}y_{i}x_{i} \]
\[ \nabla_{b}L(w,b)=-\sum_{x_{i}\in M}y_{i} \]
\[ w\leftarrow w+\eta y_{i}x_{i} \]
\[ b\leftarrow b+\eta y_{i} \]
  • \(\displaystyle \eta\) is called the learning rate
    QQ_1725981107428.png
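
The update rules above can be turned into a short training loop. The sketch below is one possible reading of the primal algorithm, assuming that at each step the first misclassified point in data order is chosen; the function name and the `max_updates` safety limit are additions for illustration.

```python
def train_perceptron(xs, ys, eta=1.0, max_updates=1000):
    """Primal perceptron: repeat w <- w + eta*y_i*x_i, b <- b + eta*y_i
    on a misclassified point until every point satisfies y_i (w . x_i + b) > 0."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(max_updates):
        # Pick the first point with y_i (w . x_i + b) <= 0, if any.
        wrong = next(
            ((x, y) for x, y in zip(xs, ys)
             if y * (sum(wj * xj for wj, xj in zip(w, x)) + b) <= 0),
            None,
        )
        if wrong is None:
            return w, b  # no misclassified points remain
        x, y = wrong
        w = [wj + eta * y * xj for wj, xj in zip(w, x)]
        b += eta * y
    raise RuntimeError("no separating hyperplane found within max_updates")
```

On the toy set with positive points (3, 3) and (4, 3) and negative point (1, 1), this loop stops after seven updates with `w = [1.0, 1.0]` and `b = -3.0`. A different choice of misclassified point at each step can lead to a different separating hyperplane.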

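2.2.5 Convergence of the algorithm

QQ_1725981340195.png

  • The solution found by the perceptron is not unique; to obtain a unique hyperplane, extra constraints must be placed on the separating hyperplane, which is the idea behind the linear support vector machine
  • If the training set is not linearly separable, the perceptron learning algorithm does not converge

2.2.6 Dual form of the perceptron learning algorithm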

\[ \begin{aligned}&w\leftarrow w+\eta y_{i}x_{i}\\&b\leftarrow b+\eta y_{i}\end{aligned} \]
\[ w=\sum_{i=1}^{N}\alpha_{i}y_{i}x_{i} \]
\[ b=\sum_{i=1}^{N}\alpha_{i}y_{i} \]

QQ_1725982357513.png
QQ_1725982366353.png

  • Gram matrix
\[ G=[x_{i}\cdot x_{j}]_{N\times N} \]
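
In the dual form the training instances enter only through inner products, which is why the Gram matrix is precomputed. A minimal sketch follows, under the same assumptions as the usual textbook presentation: \(\alpha_i = n_i\eta\) counts the updates on point \(i\), and the first misclassified point in data order is updated. The function name is hypothetical.

```python
def train_perceptron_dual(xs, ys, eta=1.0, max_updates=1000):
    """Dual perceptron: learn alpha and b, where w = sum_i alpha_i * y_i * x_i."""
    n = len(xs)
    # Gram matrix G[i][j] = x_i . x_j, computed once up front.
    gram = [[sum(a * c for a, c in zip(xi, xj)) for xj in xs] for xi in xs]
    alpha = [0.0] * n
    b = 0.0
    for _ in range(max_updates):
        for i in range(n):
            score = sum(alpha[j] * ys[j] * gram[j][i] for j in range(n)) + b
            if ys[i] * score <= 0:
                # Point i is misclassified: alpha_i <- alpha_i + eta, b <- b + eta*y_i.
                alpha[i] += eta
                b += eta * ys[i]
                break
        else:
            return alpha, b  # a full pass found no misclassified point
    raise RuntimeError("no separating hyperplane found within max_updates")
```

On the toy set (3, 3), (4, 3) labelled +1 and (1, 1) labelled -1, this yields `alpha = [2.0, 0.0, 5.0]` and `b = -3.0`, which corresponds to \(w = 2(3,3) - 5(1,1) = (1,1)\).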

3 \(\displaystyle k\)-nearest neighbor method

  • k-nearest neighbor

3.1 The \(\displaystyle k\)-nearest neighbor algorithm

QQ_1725982597756.png

3.2 The \(\displaystyle k\)-nearest neighbor model

3.2.1 Model

  • cell
  • class label
    QQ_1726016538719.png

3.3 Distance metrics

  • \(\displaystyle L_{p}\) distance or Minkowski distance
  • \(\displaystyle L_{p}(x_{i},x_{j})=\left(\sum_{l=1}^{n}\mid x_{i}^{(l)}-x_{j}^{(l)}\mid^{p}\right)^{\frac{1}{p}}\)
  • \(\displaystyle L_{2}(x_{i},x_{j})=\left(\sum_{l=1}^{n}\mid x_{i}^{(l)}-x_{j}^{(l)}\mid^{2}\right)^{\frac{1}{2}}\)
  • \(\displaystyle L_{1}(x_{i}, x_{j})=\sum_{l=1}^{n}\mid x_{i}^{(l)}-x_{j}^{(l)}\mid\)
  • \(\displaystyle L_{\infty}(x_{i}, x_{j})=\max_{l}\mid x_{i}^{(l)}-x_{j}^{(l)}\mid\)
    QQ_1726016713297.png

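3.3.1 Choosing \(\displaystyle k\)

  • If \(k\) is small, the approximation error decreases
  • but the estimation error increases
  • A smaller \(\displaystyle k\) means a more complex overall model, which is prone to overfitting
  • In practice, \(\displaystyle k\) is usually set to a fairly small value, and cross-validation is typically used to select the best \(\displaystyle k\)

3.3.2 Classification decision rule

  • Majority voting rule
    QQ_1726017033669.png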

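3.4 Implementing \(\displaystyle k\)-nearest neighbor: the \(\displaystyle kd\) tree

  • linear scan
  • kd tree

3.4.1 Constructing a \(\displaystyle kd\) tree

  • A \(\displaystyle kd\) tree is a binary tree that represents a partition of the \(\displaystyle k\)-dimensional space
  • The splitting point is usually the median of the training instances along the chosen coordinate axis; the resulting tree is balanced, but its search efficiency is not necessarily optimal
    QQ_1726017375039.png
    QQ_1726017382886.png
    QQ_1726017509413.png
    Interesting.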
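
The median-split construction can be sketched recursively. The dictionary-based node layout and the function name are assumptions made for this example; the splitting axis cycles through the coordinates with depth, and for an even number of points the upper median is taken.

```python
def build_kdtree(points, depth=0):
    """Build a kd tree by splitting on the median along a cycling axis.

    Each node is a dict with the splitting point, the axis used,
    and the left/right subtrees (None for an empty subtree).
    """
    if not points:
        return None
    axis = depth % len(points[0])  # cycle through the coordinate axes
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2  # upper median when the count is even
    return {
        "point": ordered[mid],
        "axis": axis,
        "left": build_kdtree(ordered[:mid], depth + 1),
        "right": build_kdtree(ordered[mid + 1:], depth + 1),
    }
```

For the six points (2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2), the root splits on the first axis at (7, 2), with (5, 4) and (9, 6) as the roots of the left and right subtrees.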

3.4.2 Searching a \(\displaystyle kd\) tree

QQ_1726017566451.png
QQ_1726017574264.png
QQ_1726017707307.png

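4 Naive Bayes

  • A classification method based on Bayes' theorem and the assumption that features are conditionally independent

4.1 Learning and classification with naive Bayes

4.1.1 Basic method

  • Learn the prior probability distribution and the conditional probability distribution, and thereby the joint probability distribution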
\[ P(X=x\mid Y=c_{k})=P(X^{(1)}=x^{(1)},\cdots,X^{(n)}=x^{(n)}\mid Y=c_{k}),\quad k=1,2,\cdots,K \]
  • Introduce the conditional independence assumption
\[ \begin{aligned} P(X=x|Y=c_{k})& =P(X^{(1)}=x^{(1)},\cdots,X^{(n)}=x^{(n)}\mid Y=c_{k}) \\ &=\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}\mid Y=c_{k}) \end{aligned} \]
\[ P(Y=c_{k}\mid X=x)=\frac{P(X=x\mid Y=c_{k})P(Y=c_{k})}{\sum_{k}P(X=x\mid Y=c_{k})P(Y=c_{k})} \]
\[ P(Y=c_{k}\mid X=x)=\frac{P(Y=c_{k})\prod_{j}P(X^{(j)}=x^{(j)}\mid Y=c_{k})}{\sum_{k}P(Y=c_{k})\prod_{j}P(X^{(j)}=x^{(j)}\mid Y=c_{k})},\quad k=1,2,\cdots,K \]
\[ y=f(x)=\arg\max_{c_{k}}\frac{P(Y=c_{k})\prod_{j}P(X^{(j)}=x^{(j)}\mid Y=c_{k})}{\sum_{k}P(Y=c_{k})\prod_{j}P(X^{(j)}=x^{(j)}\mid Y=c_{k})} \]
\[ y=\arg\max_{c_{k}}P(Y=c_{k})\prod_{j}P(X^{(j)}=x^{(j)}\mid Y=c_{k}) \]

4.1.2 The meaning of posterior probability maximization

\[ L(Y,f(X))=\begin{cases}1,&Y\neq f(X)\\0,&Y=f(X)\end{cases} \]
\[ R_{\exp}(f)=E[L(Y,f(X))] \]
\[ R_{\exp}(f)=E_{X}\sum_{k=1}^{K}[L(c_{k},f(X))]P(c_{k}\mid X) \]
\[ \begin{aligned} f(x) &=\arg\min_{y\in\mathcal{Y}}\sum_{k=1}^{K}L(c_{k},y)P(c_{k}\mid X=x) \\ &=\arg\min_{y\in\mathcal{Y}}\sum_{k=1}^{K}P(y\neq c_{k}\mid X=x) \\ &=\arg\min_{y\in\mathcal{Y}}(1-P(y=c_{k}\mid X=x)) \\ &=\arg\max_{y\in\mathcal{Y}}P(y=c_{k}\mid X=x) \end{aligned} \]
\[ f(x)=\arg\max_{c_{k}}P(c_{k}\mid X=x) \]
  • Thus the expected risk minimization criterion yields the posterior probability maximization criterion

4.2 Parameter estimation for naive Bayes

4.2.1 Maximum likelihood estimation

\[ P(Y=c_{k})=\frac{\sum_{i=1}^{N}I(y_{i}=c_{k})}{N} , k=1,2,\cdots,K \]
\[ P(X^{(j)}=a_{jl}\mid Y=c_{k})=\frac{\sum_{i=1}^{N}I(x_{i}^{(j)}=a_{jl},y_{i}=c_{k})}{\sum_{i=1}^{N}I(y_{i}=c_{k})}\\j=1,2,\cdots,n ;\quad l=1,2,\cdots,S_{j} ;\quad k=1,2,\cdots,K \]

4.2.2 Learning and classification algorithm

QQ_1726018558207.png

4.2.3 Bayesian estimation

  • Maximum likelihood estimation may produce estimated probabilities equal to 0
  • Bayesian estimate of the conditional probability
\[ P_{\lambda}(X^{(j)}=a_{jl}\mid Y=c_{k})=\frac{\sum_{i=1}^{N}I(x_{i}^{(j)}=a_{jl},y_{i}=c_{k})+\lambda}{\sum_{i=1}^{N}I(y_{i}=c_{k})+S_{j}\lambda} \]
  • Here \(\lambda \geqslant 0\); \(\displaystyle \lambda = 0\) recovers maximum likelihood estimation, and \(\displaystyle \lambda = 1\) is called Laplace smoothing
\[ \begin{aligned}&P_{\lambda}(X^{(j)}=a_{jl}\mid Y=c_{k})>0\\&\sum_{l=1}^{S_{j}}P_{\lambda}(X^{(j)}=a_{jl}\mid Y=c_{k})=1\end{aligned} \]
  • This shows that the Bayesian estimate is indeed a probability distribution
  • Bayesian estimate of the prior probability
\[ P_{\lambda}(Y=c_{k})=\frac{\sum_{i=1}^{N}I(y_{i}=c_{k})+\lambda}{N+K\lambda} \]
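
The smoothed estimates above can be sketched as a tiny categorical naive Bayes classifier. The function names, the returned tuple layout, and the toy data in the usage note are all assumptions for illustration; setting `lam=0` gives the maximum likelihood estimates and `lam=1` gives Laplace smoothing.

```python
from collections import Counter


def fit_naive_bayes(X, y, lam=1.0):
    """Estimate smoothed priors and conditionals for categorical features.

    Returns (classes, prior, cond) where prior[c] = P_lambda(Y=c) and
    cond(j, a, c) = P_lambda(X^(j)=a | Y=c).
    """
    n = len(y)
    classes = sorted(set(y))
    class_count = Counter(y)
    # S_j: the distinct values seen for feature j.
    values = [sorted(set(row[j] for row in X)) for j in range(len(X[0]))]
    prior = {c: (class_count[c] + lam) / (n + len(classes) * lam) for c in classes}
    joint = Counter((j, row[j], c) for row, c in zip(X, y) for j in range(len(row)))

    def cond(j, a, c):
        return (joint[(j, a, c)] + lam) / (class_count[c] + len(values[j]) * lam)

    return classes, prior, cond


def predict(model, x):
    """Return the class maximizing P(Y=c) * prod_j P(X^(j)=x_j | Y=c)."""
    classes, prior, cond = model

    def score(c):
        s = prior[c]
        for j, a in enumerate(x):
            s *= cond(j, a, c)
        return s

    return max(classes, key=score)
```

As a usage sketch, take a toy table of 15 samples with features `X1` in {1, 2, 3} and `X2` in {S, M, L}, of which 9 are labelled +1 and 6 are labelled -1. With `lam=1` the prior of the positive class is (9 + 1) / (15 + 2), and no conditional probability is ever exactly zero.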

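5 Decision trees

  • decision tree
    • Feature selection
    • Decision tree generation
    • Decision tree pruning

5.1 Decision tree model and learning

5.1.1 Decision tree model

QQ_1726019158987.png
QQ_1726019189434.png

5.1.2 Decision trees and if-then rules

  • The rules are mutually exclusive and complete
  • Every instance is covered by exactly one path, or equivalently by exactly one rule

5.1.3 Decision trees and conditional probability distributions

QQ_1726019332724.png

5.1.4 Decision tree learning

  • Decision tree learning essentially induces a set of classification rules from the training data set
  • Choosing the optimal decision tree with respect to the loss function is an NP-complete problem, so heuristic methods are used to solve it approximately; the resulting tree is sub-optimal
  • To prevent overfitting, the generated tree is pruned from the bottom up
  • Tree generation considers only local optimality, whereas pruning considers global optimality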

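5.2 Feature selection

5.2.1 The feature selection problem

  • The usual criterion for feature selection is information gain or the information gain ratio
  • information gain

5.2.2 Information gain

  • Entropy and conditional entropy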
\[ P(X=x_{i})=p_{i} ,\quad i=1,2,\cdots,n \]
\[ H(X)=-\sum_{i=1}^{n}p_{i}\log p_{i} \]
\[ H(p)=-\sum_{i=1}^{n}p_{i}\log p_{i} \]
\[ 0\leqslant H(p)\leqslant\log n \]
\[ P(X=x_{i},Y=y_{j})=p_{ij} ,\quad i=1,2,\cdots,n ;\quad j=1,2,\cdots,m \]
\[ H(Y\mid X)=\sum_{i=1}^{n}p_{i}H(Y\mid X=x_{i}) \]

QQ_1726020006087.png

  • mutual information
    QQ_1726020056647.png
    QQ_1726020063140.png
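
Entropy, conditional entropy, and information gain can be computed directly from label counts. The sketch below estimates all probabilities by empirical frequencies and uses base-2 logarithms; the function names are assumptions for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Empirical entropy H(D) = -sum_k p_k log2 p_k of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def conditional_entropy(features, labels):
    """Empirical conditional entropy H(D | A) = sum_i (|D_i| / |D|) H(D_i)."""
    n = len(labels)
    total = 0.0
    for value in set(features):
        subset = [l for f, l in zip(features, labels) if f == value]
        total += len(subset) / n * entropy(subset)
    return total


def information_gain(features, labels):
    """Information gain g(D, A) = H(D) - H(D | A)."""
    return entropy(labels) - conditional_entropy(features, labels)
```

A feature that determines the label perfectly recovers the full entropy of the labels as its gain, while a feature independent of the label has a gain of zero.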

5.2.3 Information gain ratio

QQ_1726020090067.png

5.3 Decision tree generation

5.3.1 The ID3 algorithm

QQ_1726020201483.png
QQ_1726020212514.png

  • ID3 only generates the tree and has no pruning step, so the trees it produces tend to overfit

5.3.2 The C4.5 generation algorithm

QQ_1726020446602.png

5.4 Decision tree pruning

  • pruning
\[ C_{\alpha}(T)=\sum_{t=1}^{|T|}N_{t}H_{t}(T)+\alpha|T| \]
\[ H_{t}(T)=-\sum_{k}\frac{N_{tk}}{N_{t}}\log\frac{N_{tk}}{N_{t}} \]
\[ C(T)=\sum_{t=1}^{|T|}N_{t}H_{t}(T)=-\sum_{t=1}^{|T|}\sum_{k=1}^{K}N_{tk}\log\frac{N_{tk}}{N_{t}} \]
\[ C_{\alpha}(T)=C(T)+\alpha|T| \]

QQ_1726020900891.png
QQ_1726020910643.png

5.5 The CART algorithm

  • Classification and regression tree (CART)

5.5.1 CART generation

  • Regression trees use the squared error minimization criterion
  • Classification trees use the Gini index minimization criterion

  1. Regression tree generation
\[ f(x)=\sum_{m=1}^{M}c_{m}I(x\in R_{m}) \]
\[ \hat{c}_{m}=\mathrm{ave}(y_{i}\mid x_{i}\in R_{m}) \]
  • splitting variable
  • splitting point
\[ R_{1}(j,s)=\{x\mid x^{(j)}\leqslant s\}\quad\text{and}\quad R_{2}(j,s)=\{x\mid x^{(j)}>s\} \]
\[ \min_{j,s}\biggl[\min_{c_{1}}\sum_{x_{i}\in R_{1}(j,s)}(y_{i}-c_{1})^{2}+\min_{c_{2}}\sum_{x_{i}\in R_{2}(j,s)}(y_{i}-c_{2})^{2}\biggr] \]
\[ \hat{c}_{1}=\mathrm{ave}(y_{i}\mid x_{i}\in R_{1}(j,s))\quad\text{and}\quad\hat{c}_{2}=\mathrm{ave}(y_{i}\mid x_{i}\in R_{2}(j,s)) \]
  • least squares regression tree
    QQ_1726021559438.png

  2. Classification tree generation
    QQ_1726021749142.png
    QQ_1726021871291.png
    QQ_1726021935444.png
    QQ_1726021942705.png
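
The two CART criteria can be sketched for the simplest setting. `best_split` searches the splitting point \(s\) of a single feature by least squares, and `gini` computes the standard Gini index \(1-\sum_k p_k^2\) of a label list. Both names are hypothetical, and restricting the search to one feature is a simplification made for this example.

```python
def best_split(xs, ys):
    """Least-squares split of one feature.

    Tries every observed value s (except the largest) as the splitting point,
    with R1 = {x <= s} and R2 = {x > s}, and returns (s, c1, c2, error)
    for the split with the smallest total squared error, or None if
    the feature takes a single value.
    """
    best = None
    for s in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= s]
        right = [y for x, y in zip(xs, ys) if x > s]
        c1 = sum(left) / len(left)    # optimal constant on R1 is the mean
        c2 = sum(right) / len(right)  # optimal constant on R2 is the mean
        error = sum((y - c1) ** 2 for y in left) + sum((y - c2) ** 2 for y in right)
        if best is None or error < best[3]:
            best = (s, c1, c2, error)
    return best


def gini(labels):
    """Gini index 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(k) / n) ** 2 for k in set(labels))
```

For `xs = [1, 2, 3, 4]` and `ys = [1.0, 1.0, 5.0, 5.0]` the best split is at `s = 2` with zero squared error; a full regression tree repeats this search over every feature and recurses into both regions.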

5.5.2 CART pruning

  1. Prune the tree to form a sequence of subtrees
\[ C_{\alpha}(T)=C(T)+\alpha\left|T\right| \]
\[ g(t)=\frac{C(t)-C(T_{t})}{\mid T_{t}\mid-1} \]
  2. From the resulting subtree sequence \(\displaystyle T_0,T_1,\cdots,T_n\), select the optimal subtree \(\displaystyle T_{\alpha}\) by cross-validation
    QQ_1726023182742.png

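6 Logistic regression and the maximum entropy model

  • logistic regression
  • maximum entropy model
  • Both the logistic regression model and the maximum entropy model are log-linear models

6.1 Logistic regression model

6.1.1 Logistic distribution

  • logistic distribution
    QQ_1726023396326.png
    QQ_1726023452749.png

6.1.2 Binomial logistic regression model

  • binomial logistic regression model
    QQ_1726023491542.png

7 Support vector machines

8 Boosting

9 The \(\displaystyle \boldsymbol{EM}\) algorithm and its extensions

10 Hidden Markov models

11 Conditional random fields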

wnc's café

统计学习方法

3356 个字 43 张图片 预计阅读时间 17 分钟 共被读过

1 统计学习方法概论

1.1 统计学习

  1. 统计学习的特点
    1. 以计算机及网络为平台
    2. 以数据为研究对象
    3. 目的是对数据进行预测与分析
    4. 交叉学科
  2. 统计学习的对象
    1. 是数据
  3. 统计学习的目的
  4. 统计学习的方法
    1. 主要有
      1. 监督学习(本书主要讨论)
      2. 非监督学习
      3. 半监督学习
      4. 强化学习
    2. 三要素
      1. 模型
      2. 策略
      3. 算法
    3. 实现步骤
      1. 得到一个训练数据集合
      2. 确定学习模型的集合
      3. 确定学习的策略
      4. 确定学习的算法
      5. 通过学习方法选择最优模型
      6. 利用学习的最优模型对新数据进行预测或分析
  5. 统计学习的研究
    1. 方法
    2. 理论
    3. 应用
  6. 统计学习的重要性

1.2 监督学习

1.2.1 基本概念

  1. 输入空间、特征空间与输出空间
    1. 每个输入是一个实例,通常由特征向量表示
    2. 监督学习从训练数据集合中学习模型,对测试数据进行预测
    3. 根据输入变量和输出变量的不同类型
      1. 回归问题 : 都连续
      2. 分类问题 : 输出有限离散
      3. 标注问题 : 都是变量序列
  2. 联合概率分布
  3. 假设空间
    1. 模型属于由输入空间到输出空间的映射的集合,这个集合就是假设空间
    2. 模型可以是(非)概率模型

1.2.2 问题的形式化

QQ_1725975153680.png

1.3 统计学习三要读

  • 方法 = 模型 + 策略 + 算法

1.3.1 模型

  • 模型就是索要学习的条件概率分布或决策函数
\[ \mathcal{F}=\{f\mid Y=f(X)\} \]
  • 参数空间
\[ \mathcal{F}=\{f | Y=f_{\theta}(X),\theta\in\mathbf{R}^{n}\} \]
  • 同样可以定义为条件概率的集合
\[ \mathcal{F}=\{P|P(Y|X)\} \]
\[ \mathcal{F}=\{P\mid P_{\theta}(Y\mid X),\theta\in\mathbf{R}^{n}\} \]

1.3.2 策略

  1. 损失函数和风险函数
    1. loos function or cost function \(\displaystyle L(Y,f(X))\)
      1. 0-1 loss function
        1. \(\displaystyle L(Y,f(X))=\begin{cases}1,&Y\neq f(X)\\0,&Y=f(X)\end{cases}\)
      2. quadratic loss function
        1. \(\displaystyle L(Y,f(X))=(Y-f(X))^{2}\)
      3. absolute loss function
        1. \(\displaystyle L(Y,f(X))=|Y-f(X)|\)
      4. logarithmic loss function or log-likelihood loss function
        1. \(\displaystyle L(Y,P(Y\mid X))=-\log P(Y\mid X)\)
    2. \(\displaystyle R_{\exp}(f)=E_{P}[L(Y,f(X))]=\int_{x\times y}L(y,f(x))P(x,y)\mathrm{d}x\mathrm{d}y\)
      1. risk function or expected loss
      2. 但是联合分布位置,所以要学习,但是这样以来风险最小又要用到联合分布,那么这就成为了病态问题 (ill-formed problem)
    3. empirical risk or empirical loss
      1. \(\displaystyle R_{\mathrm{emp}}(f)=\frac{1}{N}\sum_{i=1}^{N}L(y_{i},f(x_{i}))\)
      2. \(\displaystyle N\) 趋于无穷时,经验风险趋于期望风险
        1. 这就关系到两个基本策略 :
          1. 经验风险最小化
          2. 结构风险最小化
  2. 经验风险最小化与结构风险最小化
    1. empirical risk minimization (样本容量比较大的时候)
      1. \(\displaystyle \min_{f\in\mathcal{F}} \frac{1}{N}\sum_{i=1}^{N}L(y_{i},f(x_{i}))\)
      2. maximum likelihood estimation
    2. structural risk minimization
      1. regularization
      2. \(\displaystyle R_{\mathrm{sm}}(f)=\frac{1}{N}\sum_{i=1}^{N}L(y_{i},f(x_{i}))+\lambda J(f)\)
      3. 复杂度表示了对复杂模型的乘法
      4. maximum posterior probability estimation

1.3.3 算法

1.4 模型评估与模型选择

1.4.1 训练误差与测试误差

\[ R_{\mathrm{emp}}(\hat{f})=\frac{1}{N}\sum_{i=1}^{N}L(y_{i},\hat{f}(x_{i})) \]
\[ e_{\mathrm{test}}=\frac{1}{N^{\prime}}\sum_{i=1}^{N^{\prime}}L(y_{i},\hat{f}(x_{i})) \]
\[ r_{\mathrm{test}}+e_{\mathrm{test}}=1 \]
  • generalization ability

1.4.2 过拟合与模型选择

  • model selection
  • over-fitting
    QQ_1725977613135.png

1.5 正则化与交叉验证

1.5.1 正则化

\[ L(w)=\frac{1}{N}\sum_{i=1}^{N}(f(x_{i};w)-y_{i})^{2}+\frac{\lambda}{2}\parallel w\parallel^{2} \]

1.5.2 交叉验证

  • cross validation
  • 数据集
    • 训练集
    • 验证集
    • 测试集 1. 简单交叉验证 2. \(\displaystyle S\) 折交叉验证 3. 留一交叉验证

1.6 泛化能力

1.6.1 泛化误差

  • generalization error
\[ R_{\exp}(\hat{f})=E_{P}[L(Y,\hat{f}(X))]=\int_{R\times y}L(y,\hat{f}(x))P(x,y)\mathrm{d}x\mathrm{d}y \]

1.6.2 泛化误差上界

  • generalization error bound
  • 样本容量增加时,泛化上界趋于 0
  • 假设空间越大,泛化误差上界越大
    QQ_1725978149442.png
  • 这个定理只适用于假设空间包含有限个函数

1.7 生成模型与判别模型

  • generative model
    • 还原出联合概率分布 \(\displaystyle P(X,Y)\)
    • 朴素贝叶斯法
    • 隐马尔可夫模型
    • 收敛速度快
  • discriminative model
    • 直接学习决策函数或条件概率分布 \(\displaystyle P(Y|X)\)
    • \(\displaystyle k\) 近邻法
    • 感知机
    • 决策树
    • 逻辑斯谛回归模型
    • 最大熵模型
    • 支持向量机
    • 提升方法
    • 条件随机场
    • 准确度高

1.8 分类问题

  • precision \(\displaystyle P=\frac{TP}{TP+FP}\)
  • recall \(\displaystyle R=\frac{TP}{TP+FN}\)
    QQ_1725979882159.png
\[ \frac{2}{F_{1}}=\frac{1}{P}+\frac{1}{R} \]
\[ F_{1}=\frac{2TP}{2TP+FP+FN} \]
  • text classification

1.9 标注问题

  • tagging classificationd 一个推广
  • structure prediction 的简单形式
  • 隐马尔可夫模型
  • 条件随机场

1.10 回归问题

  • regression
  • (非)线性回归,一元回归,多元回归

2 感知机

  • perception
  • 感知机对应于输入空间中将实例划分成正负两类的分离超平面,属于判别模型
  • 原始形式和对偶形式

2.1 感知机模型

QQ_1725980556672.png

  • 假设空间是定义在特征空间中所有的线性分类模型(linear classification model)\(\displaystyle \{f|f(x) = w \cdot x+b\}\)
  • separating hyperplane
    QQ_1725980719817.png

2.2 感知机学习策略

2.2.1 数据集的线性可分性

QQ_1725980832517.png

2.2.2 感知机学习策略

  • 定义损失函数并将损失函数极小化
\[ L(w,b)=-\sum_{x_{i}\in M}y_{i}(w\cdot x_{i}+b) \]

2.2.3 感知机学习算法

2.2.4 感知机学习算法的原始形式

\[ \min_{w,b}L(w,b)=-\sum_{x_{i}\in M}y_{i}(w\cdot x_{i}+b) \]
  • stochastic gradient descent
\[ \nabla_{_w}L(w,b)=-\sum_{x_{i}\in M}y_{i}x_{i} \]
\[ \nabla_{b}L(w,b)=-\sum_{x_{i}eM}y_{i} \]
\[ w\leftarrow w+\eta y_{i}x_{i} \]
\[ b\leftarrow b+\eta y_{i} \]
  • \(\displaystyle \eta\) 被称为学习率(learning rate)
    QQ_1725981107428.png

2.2.5 算法的收敛性

QQ_1725981340195.png

  • 为了得到唯一的超平面,需要对分离超平面增加约束条件,即线性支持向量机
  • 如果训练集线性不可分,那么感知机学习算法不收敛

2.2.6 感知机学习算法的对偶形式

\[ \begin{aligned}&w\leftarrow w+\eta y_{i}x_{i}\\&b\leftarrow b+\eta y_{i}\end{aligned} \]
\[ w=\sum_{i=1}^{N}\alpha_{i}y_{i}x_{i} \]
\[ b=\sum_{i=1}^{N}\alpha_{i}y_{i} \]

QQ_1725982357513.png
QQ_1725982366353.png

  • Gram matrix
\[ G=[x_{i}\cdot x_{j}]_{N\times N} \]

3 \(\displaystyle k\) 近邻法

  • k-nearest neighbor

3.1 \(\displaystyle k\) 近邻算法

QQ_1725982597756.png

3.2 \(\displaystyle k\) 近邻模型

3.2.1 模型

  • cell
  • class label
    QQ_1726016538719.png

3.3 距离度量

  • \(\displaystyle L_{p}\) distance or Minkowski distamce
  • \(\displaystyle L_{p}(x_{i},x_{j})=\left(\sum_{l=1}^{n}\mid x_{i}^{(l)}-x_{j}^{(l)}\mid^{p}\right)^{\frac{1}{p}}\)
  • \(\displaystyle L_{2}(x_{i},x_{j})=\left(\sum_{i=1}^{n}\mid x_{i}^{(l)}-x_{j}^{(l)}\mid^{2}\right)^{\frac{1}{2}}\)
  • \(\displaystyle L_{1}(x_{i}, x_{j})=\sum_{l=1}^{n}\mid x_{i}^{(l)}-x_{j}^{(l)}\mid\)
  • \(\displaystyle L_{\infty}(x_{i}, x_{j})=\max_{l}\mid x_{i}^{(l)}-x_{j}^{(l)}\mid\)
    QQ_1726016713297.png

3.3.1 \(\displaystyle k\) 值的选择

  • if k is small, then the approximation error will reduce
  • estimation error
  • \(\displaystyle k\) 值的减小就意味着整体模型变得复杂,容易发生过拟合
  • 在应用中 , \(\displaystyle k\) 值一般取一个比较小的数值,通常采用交叉验证法来选取最优的 \(\displaystyle k\)

3.3.2 分类决策规则

  • 多数表决规则(majority voting rule)
    QQ_1726017033669.png

3.4 \(\displaystyle k\) 近邻法的实现 : \(\displaystyle kd\)

  • linear scan
  • kd tree

3.4.1 构造 \(\displaystyle kd\)

  • \(\displaystyle kd\) 树是一二叉树,表示对 \(\displaystyle k\) 维空间的一个划分(partition)
  • 通常选择训练实例点在选定坐标轴上的中位数为切分点,虽然这样得到的树是平衡的,但效率未必是最优的
    QQ_1726017375039.png
    QQ_1726017382886.png
    QQ_1726017509413.png
    有意思

3.4.2 搜索 \(\displaystyle kd\)

QQ_1726017566451.png
QQ_1726017574264.png
QQ_1726017707307.png

4 朴素贝叶斯法

  • 基于贝叶斯定理与特征条件独立假设的分类方法

4.1 朴素贝叶斯法的学习与分类

4.1.1 基本方法

  • 学习先验概率分布和条件概率分布于是学习到联合概率分布
\[ P(X=x\mid Y=c_{k})=P(X^{(1)}=x^{(1)},\cdots,X^{(n)}=x^{(n)}\mid Y=c_{k}),\quad k=1,2,\cdots,K \]
  • 引入了条件独立性假设
\[ \begin{aligned} P(X=x|Y=c_{k})& =P(X^{(1)}=x^{(1)},\cdots,X^{(n)}=x^{(n)}\mid Y=c_{k}) \\ &=\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}\mid Y=c_{k}) \end{aligned} \]
\[ P(Y=c_{k}\mid X=x)=\frac{P(X=x\mid Y=c_{k})P(Y=c_{k})}{\sum_{k}P(X=x\mid Y=c_{k})P(Y=c_{k})} \]
\[ P(Y=c_{k}\mid X=x)=\frac{P(Y=c_{k})\prod_{j}P(X^{(j)}=x^{(j)}\mid Y=c_{k})}{\sum_{k}P(Y=c_{k})\prod_{j}P(X^{(j)}=x^{(j)}\mid Y=c_{k})},\quad k=1,2,\cdots,K \]
\[ y=f(x)=\arg\max_{c_{k}}\frac{P(Y=c_{k})\prod_{j}P(X^{(j)}=x^{(j)}\mid Y=c_{k})}{\sum_{k}P(Y=c_{k})\prod_{j}P(X^{(j)}=x^{(j)}\mid Y=c_{k})} \]
\[ y=\arg\max_{c_{k}}P(Y=c_{k})\prod_{j}P(X^{(j)}=x^{(j)}\mid Y=c_{k}) \]

4.1.2 后验概率最大化的含义

\[ L(Y,f(X))=\begin{cases}1,&Y\neq f(X)\\0,&Y=f(X)\end{cases} \]
\[ R_{\exp}(f)=E[L(Y,f(X))] \]
\[ R_{\exp}(f)=E_{\chi}\sum_{k=1}^{K}[L(c_{k},f(X))]P(c_{k}\mid X) \]
\[ \begin{align} f(x) &=\arg\min_{y\in\mathcal{Y}}\sum_{k=1}^{K}L(c_{k},y)P(c_{k}\mid X=x) \\ &=\arg\min_{y\in\mathcal{Y}}\sum_{k=1}^{K}P(y\neq c_{k}\mid X=x) \\ &=\arg\min_{y\in\mathcal{Y}}(1-P(y=c_{k}\mid X=x)) \\ &=\arg\max_{y\in\mathcal{Y}}P(y=c_{k}\mid X=x) \end{align} \]
\[ f(x)=\arg\max_{c_{k}}P(c_{k}\mid X=x) \]
  • 期望风险最小化准则就得到联考后验概率最大化准则

4.2 朴素贝叶斯法的参数估计

4.2.1 极大似然估计

\[ P(Y=c_{k})=\frac{\sum_{i=1}^{N}I(y_{i}=c_{k})}{N} , k=1,2,\cdots,K \]
\[ P(X^{(j)}=a_{ji}\mid Y=c_{k})=\frac{\sum_{i=1}^{N}I(x_{i}^{(j)}=a_{ji},y_{i}=c_{k})}{\sum_{i=1}^{N}I(y_{i}=c_{k})}\\j=1,2,\cdots,n ;\quad l=1,2,\cdots,S_{j} ;\quad k=1,2,\cdots,K \]

4.2.2 学习与分类算法

QQ_1726018558207.png

4.2.3 贝叶斯估计

  • 极大似然估计可能会出现所要估计的概率值为 0 的情况
  • 条件概率的贝叶斯估计
\[ P_{\lambda}(X^{(j)}=a_{ji}\mid Y=c_{k})=\frac{\sum_{i=1}^{N}I(x_{i}^{(j)}=a_{ji},y_{i}=c_{k})+\lambda}{\sum_{i=1}^{N}I(y_{i}=c_{k})+S_{j}\lambda} \]
  • when \(\displaystyle \lambda = 0\), it's called Laplace smoothing
\[ \begin{aligned}&P_{\lambda}(X^{(j)}=a_{jl}\mid Y=c_{k})>0\\&\sum_{l=1}^{s_{j}}P(X^{(j)}=a_{jl}\mid Y=c_{k})=1\end{aligned} \]
  • 表明贝叶斯估计确实是一种概率分布
  • 先验概率的贝叶斯估计
\[ P_{\lambda}(Y=c_{k})=\frac{\sum_{i=1}^{N}I(y_{i}=c_{k})+\lambda}{N+K\lambda} \]

5 决策树

  • decision tree
    • 特征选择
    • 决策树的生成
    • 决策树的修剪

5.1 决策树模型与学习

5.1.1 决策树模型

QQ_1726019158987.png
QQ_1726019189434.png

5.1.2 决策树与 if-then 规则

  • 互斥且完备
  • 每一个实例都被一条路径会规则所覆盖,而且只被一条路径或一条规则所覆盖

5.1.3 决策树与条件概率分布

QQ_1726019332724.png

5.1.4 决策树学习

  • 决策树学习本质上是从训练数据集中归纳出一组分类规则
  • 在损失函数意义下选择最优决策树的问题,是 NP 完全问题,采用启发式方法,近似求解,这样得到的决策树是次最优(sub-optimal)
  • 为了防止过拟合,我们需要对已生成的树自上而下进行剪枝
  • 决策树的生成值考虑局部最优,剪枝则考虑全局最优

5.2 特征选择

5.2.1 特征选择问题

  • 通常特征选择的准则是信息增益或信息增益比
  • information gain

5.2.2 信息增益

  • 熵和条件熵
\[ P(X=x_{i})=p_{i} ,\quad i=1,2,\cdots,n \]
\[ H(X)=-\sum_{i=1}^{n}p_{i}\log p_{i} \]
\[ H(p)=-\sum_{i=1}^{n}p_{i}\log p_{i} \]
\[ 0\leqslant H(p)\leqslant\log n \]
\[ P(X=x_{i},Y=y_{j})=p_{ij} ,\quad i=1,2,\cdots,n ;\quad j=1,2,\cdots,m \]
\[ H(Y\mid X)=\sum_{i=1}^{n}p_{i}H(Y\mid X=x_{i}) \]

QQ_1726020006087.png

  • mutual information
    QQ_1726020056647.png
    QQ_1726020063140.png

5.2.3 信息增益比

QQ_1726020090067.png

5.3 决策树的生成

5.3.1 ID 3 算法

QQ_1726020201483.png
QQ_1726020212514.png

  • ID 3 算法只有树的生成,所以该算法生成的树容易产生过拟合

5.3.2 C 4.5 的生成算法

QQ_1726020446602.png

5.4 决策树的剪枝

  • pruning
\[ C_{\alpha}(T)=\sum_{t=1}^{|T|}N_{t}H_{t}(T)+\alpha|T| \]
\[ H_{t}(T)=-\sum_{k}\frac{N_{ik}}{N_{t}}\log\frac{N_{ik}}{N_{t}} \]
\[ C(T)=\sum_{t=1}^{|T|}N_{t}H_{t}(T)=-\sum_{t=1}^{|T|}\sum_{k=1}^{K}N_{tk}\log\frac{N_{tk}}{N_{t}} \]
\[ C_{\alpha}(T)=C(T)+\alpha|T| \]

QQ_1726020900891.png
QQ_1726020910643.png
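
The cost \(C_{\alpha}(T)=C(T)+\alpha|T|\) can be evaluated directly from the class counts in each leaf. The sketch below compares a two-leaf subtree with the single leaf obtained by collapsing it; the helper names and the counts are made up for illustration, and natural logarithms are assumed.

```python
import math


def leaf_cost(class_counts):
    """N_t * H_t(T) for one leaf, i.e. -sum_k N_tk * log(N_tk / N_t)."""
    n_t = sum(class_counts)
    return -sum(n * math.log(n / n_t) for n in class_counts if n > 0)


def tree_cost(leaves, alpha):
    """C_alpha(T) = C(T) + alpha * |T|, with leaves given as class-count lists."""
    return sum(leaf_cost(counts) for counts in leaves) + alpha * len(leaves)


def should_prune(subtree_leaves, collapsed_leaf, alpha):
    """Prune when collapsing the subtree does not increase C_alpha."""
    return tree_cost([collapsed_leaf], alpha) <= tree_cost(subtree_leaves, alpha)


# Hypothetical counts: two fairly pure leaves versus their merged parent.
subtree = [[8, 2], [1, 9]]
collapsed = [9, 11]
keep_for_small_alpha = not should_prune(subtree, collapsed, alpha=1.0)
prune_for_large_alpha = should_prune(subtree, collapsed, alpha=10.0)
```

A small \(\alpha\) favours the better-fitting subtree, while a large \(\alpha\) penalises the extra leaf enough that the collapsed tree wins.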

5.5 CART 算法

  • Classification and regression tree (CART)

5.5.1 CART 生成

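  • Regression trees use the squared-error minimization criterion
  • Classification trees use the Gini index minimization criterion
  1. Regression tree generation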
\[ f(x)=\sum_{m=1}^{M}c_{m}I(x\in R_{m}) \]
\[ \hat{c}_{m}=\mathrm{ave}(y_{i}\mid x_{i}\in R_{m}) \]
  • splitting variable
  • splitting point
\[ R_{1}(j,s)=\{x\mid x^{(j)}\leqslant s\}\quad\text{and}\quad R_{2}(j,s)=\{x\mid x^{(j)}>s\} \]
\[ \min_{j,s}\biggl[\min_{c_{1}}\sum_{x_{i}\in R_{1}(j,s)}(y_{i}-c_{1})^{2}+\min_{c_{2}}\sum_{x_{i}\in R_{2}(j,s)}(y_{i}-c_{2})^{2}\biggr] \]
\[ \hat{c}_{1}=\mathrm{ave}(y_{i}\mid x_{i}\in R_{1}(j,s))\quad\text{and}\quad\hat{c}_{2}=\mathrm{ave}(y_{i}\mid x_{i}\in R_{2}(j,s)) \]
  • least squares regression tree
    QQ_1726021559438.png
  2. Classification tree generation
    QQ_1726021749142.png
    QQ_1726021871291.png
    QQ_1726021935444.png
    QQ_1726021942705.png
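
For the regression-tree case, the search over the splitting point \(s\) for one splitting variable can be sketched as follows. This is a brute-force illustration under the assumption of a single numeric feature; `best_split` and the data are hypothetical names and values, not taken from the notes.

```python
def best_split(xs, ys):
    """Find the split point s on one feature that minimises the squared error.

    For each candidate s, R1 = {x <= s} and R2 = {x > s}; the optimal
    constants c1 and c2 are the means of y in each region.
    Returns (error, s, c1, c2).
    """
    best = None
    # The largest value is excluded so that R2 is never empty.
    for s in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= s]
        right = [y for x, y in zip(xs, ys) if x > s]
        c1 = sum(left) / len(left)
        c2 = sum(right) / len(right)
        err = sum((y - c1) ** 2 for y in left) + sum((y - c2) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, s, c1, c2)
    return best


# Hypothetical data with an obvious step between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
err, s, c1, c2 = best_split(xs, ys)
```

A full regression tree would repeat this search over every feature \(j\) and then recurse into the two regions.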

5.5.2 CART 剪枝

  1. Prune the tree to form a sequence of subtrees
\[ C_{\alpha}(T)=C(T)+\alpha\left|T\right| \]
\[ g(t)=\frac{C(t)-C(T_{t})}{\mid T_{t}\mid-1} \]
  2. Select the optimal subtree \(\displaystyle T_{\alpha}\) from the pruned sequence \(\displaystyle T_0,T_1,\cdots,T_n\) by cross-validation
    QQ_1726023182742.png

6 逻辑斯谛回归与最大熵模型

  • logistic regression
  • maximum entropy model
  • 逻辑斯谛回归模型和最大熵模型都属于对数线性模型

6.1 逻辑斯谛回归模型

6.1.1 逻辑斯谛分布

  • logistic distribution
    QQ_1726023396326.png
    QQ_1726023452749.png

6.1.2 二项逻辑斯谛回归模型

  • binomial logistic regression model
    QQ_1726023491542.png

7 支持向量机

8 提升方法

9 \(\displaystyle \boldsymbol{EM}\) 算法及其推广

10 隐马尔可夫模型

11 条件随机场

wnc's café

Archives

\ No newline at end of file diff --git a/Blogs/index.html b/Blogs/index.html index 65cb86ff..16ef5aa3 100644 --- a/Blogs/index.html +++ b/Blogs/index.html @@ -7,7 +7,7 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Blogs ✍

Abstract

个人博客

本部分内容(除特别声明外)采用 署名 - 非商业性使用 - 保持一致 4.0 国际 (CC BY-NC-SA 4.0) 许可协议进行许可。

Selected Blogs

  • 1421 5 mins
    1735373878

Blogs ✍

Abstract

个人博客

本部分内容(除特别声明外)采用 署名 - 非商业性使用 - 保持一致 4.0 国际 (CC BY-NC-SA 4.0) 许可协议进行许可。

Selected Blogs

  • 1421 5 mins
    1735373878
\ No newline at end of file diff --git a/Blogs/posts/24-12-29/index.html b/Blogs/posts/24-12-29/index.html index d247f26f..bd95dd93 100644 --- a/Blogs/posts/24-12-29/index.html +++ b/Blogs/posts/24-12-29/index.html @@ -7,7 +7,7 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

1038 个字 1 张图片 预计阅读时间 5 分钟 共被读过

1038 个字 1 张图片 预计阅读时间 5 分钟 共被读过

2271 个字 2 张图片 预计阅读时间 11 分钟 共被读过

2271 个字 2 张图片 预计阅读时间 11 分钟 共被读过

2803 个字 预计阅读时间 14 分钟 共被读过

2803 个字 预计阅读时间 14 分钟 共被读过

192 个字 375 行代码 预计阅读时间 6 分钟 共被读过

192 个字 375 行代码 预计阅读时间 6 分钟 共被读过

982 个字 34 行代码 预计阅读时间 5 分钟 共被读过

982 个字 34 行代码 预计阅读时间 5 分钟 共被读过

420 个字 4 张图片 预计阅读时间 2 分钟 共被读过

420 个字 4 张图片 预计阅读时间 2 分钟 共被读过

374 个字 1 张图片 预计阅读时间 2 分钟 共被读过

374 个字 1 张图片 预计阅读时间 2 分钟 共被读过

1428 个字 7 张图片 预计阅读时间 7 分钟 共被读过

1428 个字 7 张图片 预计阅读时间 7 分钟 共被读过

6801 个字 1 张图片 预计阅读时间 34 分钟 共被读过

笔记软件选择

在这篇文章中,我主要会分享我个人的笔记发展历程。穿插一些问题的思考。

注意,文中有大量 deepseek 生成内容,真实性请自行判断。

我对笔记软件的要求

这笔记软件的不断更换实际上是对这几个问题的不同选择:

  1. 商业化运营 vs 非商业化运营
  2. 本地 vs 在线
  3. 笔记形式
    1. 纯文本 vs 富文本
  4. 平台支持
  5. 社区生态

你也可以自己思考思考这几个问题。

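For my own use, a note-taking app has to meet the following requirements:

  • Required:
    1. Notes can be exported as Markdown, which makes it easy to publish them to a blog or migrate later. If the app shuts down or stops being maintained, I can take the Markdown files to the next one.
    2. Bidirectional links, direct Markdown input, math formulas, quotes, code blocks, and tags.
    3. A healthy ecosystem with plenty of plugins to improve the experience.
    4. Cross-platform support, at least Windows 11 and Ubuntu (though this may no longer matter, since I have switched entirely to Linux).
    5. Convenient sync and backup.
  • Optional:
    1. Nested files, as in image.png
    2. Real-time collaborative editing, although this tends to make editing less smooth.
    3. Image hosting support (publishing and migrating without an image host is too painful).
    4. A forum and a reasonably complete getting-started tutorial.
    5. It does not need to be all-in-one; PDFs and videos are better kept out of the notes.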

个人笔记软件使用情况

我的电子化笔记大概是从初中开始的。

  • 初中
    • 主要使用幕布,大纲式笔记,方便梳理知识点和列 TODO list
      • 实际上在 markdown 文件中只是无序列表的一个功能而已。
    • 用过印象笔记
      • 当时好像主要卖点是收纳信息,方便从网上收纳到笔记中。编辑体验也还不错,但是后来因为会员以及过多的广告而停止使用。
  • 高中
    • 疫情时期上网课,主要使用 notability 以及 goodnote 等手写笔记软件。
    • 比较方便地标注 PPT 和电子书,以及写作业,绘图。这些都是手写笔记的优点。
    • 缺点就是不方便整理和搜索,进行二次开发,进一步完善笔记。同时手写速度会有上限。
    • 这一系列还有 OneNote 等。
  • 大学
    • obsidian
      • 在插件的加成下,基本满足我的要求
      • 插件生态好
      • 满足我的基本所有要求,只是有些插件做的还不够好,比如文件嵌套。
    • notion
      • 网卡,编辑体验差
      • 虽然可以进行协同编辑,但是卡(
      • 文件嵌套
      • 块语法,编辑相对 obsidian 来说不够流畅。
      • 可以塞其他文件
    • 飞书
      • 也有点卡,但是相对好很多了
      • 有知识库,内置的文件嵌套
      • 方便分享,同时可以协同编辑
      • 编辑相对 obsidian 来说不够流畅
      • 可以塞其他文件
      • 类似软件有语雀,大学生可以认证获取一年会员。但是付费不太友好。
    • 思源
      • 本地化,很流畅
      • 同步功能需付费
      • 文件嵌套
      • 有很多功能,但是有点过于累赘了
      • 插件生态较差
      • 基本满足所有需求,适合开箱即用的朋友。可 DIY 性不高。
    • zotero
      • 感觉批注不方便,依旧喜欢一边看 PDF, 一边记录内容的方式。
      • 不喜欢拿 zotero obsidian 强行联动,过分复杂且丑陋。
      • 同步不行

我目前的笔记软件使用情况

  • 个人笔记主要使用 obsidian
    • 使用 PicGo + GitHub + Image auto upload 实现图床
    • mkdocs 发布博客,支持双链和 callouts
    • Export Image plugin 导出成图片形式,发布到小红书等 markdown 不友好平台。
    • Git 进行自动备份和同步(不需要同时编辑,所以可以进行同步)
    • Latex Suite + Completr 支持 \(\LaTeX\) 数学公式
    • Auto Link Tile + Easy Typing + Linter + Outliner + Image Toolkit + Paste image rename 增强编辑体验,自动格式化,自动获取链接名称,增强图片查看。
    • Templater 优化工作流
    • Calendar + Periodic Notes 写日记,搭配模板降低复盘压力,同时支持 TODO 功能。
    • TagFolder 管理 Tag, 实现 Tag + 文件夹 + 双链多重方式索引管理文件。
    • Style Settings + MySnippets 管理主题,好看的界面更让人想写笔记。
    • (可选)用 Envelope 自动转换特殊语法,并结合 Hugo 等框架进行发送。
    • (可选)用 Dataview 实现数据库功能(不过我不喜欢,会降低编辑流畅度)
  • 团队协作主要使用飞书
    • 可以说是做的最好的团队协作笔记软件,在团队协作方面的第一选择。
    • 支持知识库形式,且免费。
    • 可以通过链接邀请进行编辑,不需要充会员。
    • 支持数学公式,markdown 语法,支持双链!
    • 在线,不需要同步。
    • Calendar, 但是 notion 有。

🤔,顺便贴一下 deepseek 优化以后的版本:

Obsidian:可编程知识中枢

作为个人知识管理的核心工具,Obsidian 通过模块化插件体系实现了「知识操作系统」的定位:

  • 数据主权架构
    • 基于本地 Markdown 文件存储,通过 Git 实现版本控制与跨设备同步(Windows/Linux/macOS 全兼容)
    • 采用 PicGo+GitHub 构建去中心化图床,实现媒体资产与文本的原子化分离管理
  • 认知增强矩阵
    • 公式系统:Latex Suite+Completr 实现论文级数学表达,支持实时渲染与语义补全
    • 输入优化:Paste image rename+Image Toolkit 重构媒体处理流,降低认知中断频率
    • 知识拓扑:通过 TagFolder 插件实现三维索引(标签 × 文件夹 × 双链,建立非对称信息网络
  • 创作扩展生态
    • 发布管道:mkdocs/Envelope 实现「一次编写,多端发布,适配博客、社交媒体等多重场景
    • 时间管理:Calendar+Periodic Notes 构建时空坐标系统,将日记 /TODO 与知识节点动态绑定
    • 界面工程:Style Settings+MySnippets 实施视觉降噪策略,维持心流状态下的最小视觉熵值

注:刻意规避 Dataview 等结构化插件,保持 Markdown 的文本纯洁性,避免陷入伪数据库建模陷阱

飞书:协同认知空间

在团队协作场景中,飞书展现了 SaaS 时代知识工具的典型进化特征:

  • 群体心智接口
    • 实时协同引擎:采用 OT+CRDT 混合算法,在保持低延迟编辑的同时规避版本冲突风险
    • 知识拓扑支持:通过「知识库 + 多维表格」构建立体信息架构,实现文档 - 数据 - 流程的三元融合
  • 轻量级学术支持
    • 内置 KaTeX 引擎支持学术写作基础需求,双链语法降低团队知识图谱构建门槛
    • 免费版提供 100GB 云存储空间,满足中小型团队非结构化知识资产管理
  • 生态位优势
    • 相比 Notion 的海外服务器延迟问题,国内部署节点保证毫秒级响应
    • 相较于语雀的教育认证限制,其永久免费策略更适合初创团队敏捷迭代

如果你想要选择一个笔记软件

这里有一个表格可以作为参考,欢迎补充:

工具 / 特点 商业 / 非商业 开源 / 闭源 存储方式 文本格式 强生态
Vim/NeoVim/VsCode 非商业 开源 本地 纯文本
Emacs/Org mode 非商业 开源 本地 纯文本
Logseq 非商业 开源 本地 + 在线 纯文本
思源笔记 非商业 开源 本地 + 在线 富文本
Obsidian 非商业 闭源 本地 纯文本
Joplin 非商业 开源 本地 纯文本
Notion 商业 闭源 在线 富文本
Wolai 商业 闭源 在线 富文本
Flowus 商业 闭源 在线 富文本
RoamResearch 商业 闭源 在线 富文本 (JSON)
Tana 商业 闭源 在线 富文本
AppFlowy 非商业 开源 本地 富文本
Affine 非商业 开源 本地 + 在线 富文本
Trilium 非商业 开源 本地 富文本
OneNote 商业 闭源 本地 + 在线 富文本
Heptabase 商业 闭源 在线 富文本
飞书 商业 闭源 在线 富文本
语雀 商业 闭源 在线 富文本
Notability / Goodnotes 商业 闭源 本地 富文本
Zotero 非商业 闭源 本地 富文本

笔记本身的形式

但是实际上笔记软件对应着笔记的不同形式,有的人喜欢手写,有的人喜欢用 Word, 有的人甚至纯用 Vim, 也有喜欢用 \(\LaTeX\) 的。

我想这笔记的形式也对应着不同的学习风格。

笔记的形式与学习风格之间的关联,本质上反映了不同人群在信息处理、知识内化及思维呈现上的差异。这些差异往往由个人认知习惯、学科需求、创作场景共同塑造。

1. 手写笔记:感官沉浸与非线性思维

  • 适用场景:数理推导、草图绘制、课堂速记。
  • 学习风格
    • 动觉型学习者:通过手部动作强化记忆锚点,如化学分子结构的手绘能帮助建立空间想象。
    • 碎片重组者:在纸张空白处随意添加箭头、批注,适合需要反复调整逻辑链条的创意性思考。
  • 工具进化GoodNotes 等数字化手写工具通过「图层分离「矢量笔迹」功能,实现了传统手写与电子检索的平衡。但本质仍是「模拟现实」——正如物理学家费曼坚持用纸笔推演公式,认为触感反馈能激活深层思维。

2. 纯文本(Markdown/Vim/Org-mode:极简主义与系统思维

  • 典型用户:程序员、学术研究者、知识体系构建者。
  • 核心优势
    • 低认知负荷:摆脱格式工具栏干扰,专注内容本身。例如 Vim 用户通过快捷键实现「思维流不间断
    • 可编程性:用代码思维管理笔记,如 Org-mode #+BEGIN_SRC 块直接执行 Python 脚本处理数据。
  • 隐喻延伸:纯文本爱好者常将笔记视为「知识代码库」——双链是函数调用,标签是版本分支,通过 Git 实现「知识迭代管理。这种思维模型尤其适合需要长期演进的技术文档或研究课题。

3. 富文本(Notion/ 飞书:视觉叙事与协作导向

  • 设计哲学:将文档看作「可交互的信息仪表盘
  • 典型用例
    • 产品经理:在 Notion 中嵌入 Figma 原型、用户反馈数据库、甘特图,构建需求全景视图。
    • 学生团队:用飞书多维表格管理实验数据,@ 提及成员更新进度,形成异步协作闭环。
  • 认知陷阱:过度装饰的排版可能沦为「数字手账,陷入形式大于内容的自我感动。真正高效者会像建筑师使用 CAD 软件般克制——用块结构(Block)搭建信息骨架,而非沉迷渐变色图标。

4. 结构化数据(Roam/Tana:网状思维与认知涌现

  • 革命性特征:将「文本」解构为「原子化块(Block,通过双向链接形成知识图谱。
  • 认知科学依据
    • Zettelkasten 卡片盒:社会学家卢曼的 55000 张卡片证明,碎片化输入 + 主动连接能激发创造性洞见。
    • 渐进式总结Tiago Forte 提出的「信息炼金术」——通过多层级折叠(Bullet→段落→思维导图)实现知识蒸馏。
  • 风险预警:初学者易陷入「链接狂热,给每个名词添加双链反而导致认知过载。高阶用户会像园丁修剪枝条般,定期用 Graph Analysis 插件清除低价值节点。

5. 多媒体融合(OneNote/Heptabase:跨模态思维

  • 独特价值:突破文本单一维度,实现「视觉 - 听觉 - 触觉」协同记忆。
  • 实践案例
    • 医学解剖:在 OneNote 中叠加 3D 模型切片、课堂录音、手写标注,构建多感官学习网络。
    • 设计思考Heptabase 的白板功能允许将用户访谈视频、情绪版图片、用户旅程图进行空间化排布,触发右脑直觉思维。
  • 硬件依赖:此类笔记效能与设备密切相关,如 Surface Pen 4096 级压感对素描笔记至关重要,普通触控笔则难以实现细腻笔触。

6. 代码驱动型(Jupyter Notebook/Obsidian Dataview:量化思维

  • 前沿趋势:当笔记系统具备图灵完备性,知识管理开始向「可计算知识」进化。
  • 典型模式
    • 动态文档:在 Jupyter 中混合 Markdown Python 代码,实时运行数据分析并可视化结果。
    • 自动化工作流:用 Obsidian Dataview 插件将笔记转化为数据库,执行 SQL 式查询生成动态列表(如 列出所有包含"待评审"标签的论文摘要
  • 思维跃迁:这种方式将「记录」升级为「知识引擎,如同数学家 Stephen Wolfram 倡导的「计算型知识」——笔记不仅是记忆载体,更是产生新知的实验平台。

选择背后的元问题

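Whatever form you choose, you are ultimately answering three fundamental questions:

  1. Knowledge fluidity: are your notes a sealed archive, or idea Lego that can be recombined?
  2. Cognitive friction: to what extent does the tool extend your thinking rather than obstruct it?
  3. Time discount rate: how will the time you spend on formatting now be repaid, with compound interest, in the future?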

一个值得警惕的现象是:许多人把「优化笔记方法论」本身变成了一种生产力表演。真正有效的系统往往呈现「隐形性」——就像呼吸不需要思考如何呼吸,当你不再纠结于工具切换,而是让笔记自然地成为思维的体外缓存时,或许才是找到了属于自己的「元解决方案

PKM(个人知识管理:在解决问题与构建体系之间找到平衡

使用 deepseek 进行完善和润色。

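Some thoughts on problem-driven learning versus systematic learning, that is, whether it is enough to solve the problem at hand or whether one should build a complete body of knowledge.

I think the two should be balanced rather than favoring either side. We should solve problems, recursing down to ground we already know, write down the solution, and later fold it back into our own knowledge system. At the same time we should build a complete knowledge system, for example by studying the four core computer science subjects. In research you can learn a few basics such as deep learning and go straight to reading papers, but then you can hardly avoid becoming someone who only calls library functions.

In the longer run, without a complete knowledge system you are merely a Ctrl C/V-er who knows how to Google: still cheap labor, and replaceable. That is why building a PKM matters.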

在知识管理的实践中,我们常常面临一个根本性的问题:是应该专注于解决眼前的问题,还是致力于构建一个完整的知识体系? 这个问题并没有一个非黑即白的答案,而是需要在两者之间找到一个动态的平衡。PKM(Personal Knowledge Management,个人知识管理)的核心目标就是帮助我们在解决具体问题的同时,逐步构建并完善自己的知识体系,从而实现长期的认知复利。

解决问题的即时性与知识体系的长期性

  • 解决问题的即时性
    • 快速响应需求:在面对具体问题时,我们往往需要快速找到解决方案。这种“解决问题导向”的学习方式能够迅速满足当下的需求,尤其是在工作或学习中遇到紧急任务时。
    • 递归至熟悉领域:解决问题的过程中,我们通常会从已知的知识出发,逐步扩展到新的领域。通过这种方式,我们可以将新知识与已有的知识体系进行连接,形成更深层次的理解。
    • 记录解决方案:每次解决问题的过程都应该被记录下来,形成可复用的知识片段。这些片段不仅是未来的参考,也是构建知识体系的基础。
  • 知识体系的长期性
    • 系统性学习:构建完整的知识体系需要系统性的学习,尤其是在基础学科领域。例如,计算机科学中的“四大件”(数据结构、操作系统、计算机网络、数据库)是构建技术知识体系的基石。没有这些基础,我们很容易陷入“调包侠”的困境,只能依赖现成的工具而无法深入理解其背后的原理。
    • 避免成为“廉价劳动力”:如果不构建完整的知识体系,我们可能会沦为“会 Google Ctrl+C/V 工程师”,只能解决表面问题,而无法应对复杂的挑战。这种状态下的知识工作者往往具有高度的可替代性,缺乏核心竞争力。
    • 长期认知复利:知识体系的构建是一个长期的过程,但它能够带来持续的认知复利。通过不断积累和整合知识,我们能够在未来的工作和学习中更加高效地解决问题,甚至能够预见问题并提前做好准备。

PKM 的核心原则:动态平衡与递归整合

PKM 的核心在于在解决问题与构建知识体系之间找到动态平衡,并通过递归整合的方式将两者有机结合。以下是 PKM 的几个核心原则:

  • 问题驱动的知识积累
    • 从问题出发:每次遇到新问题时,首先尝试从已有的知识体系中寻找解决方案。如果现有的知识不足以解决问题,则通过学习和研究来扩展知识边界。
    • 记录与反思:在解决问题的过程中,记录下关键的思考步骤、解决方案以及遇到的挑战。通过定期的反思和总结,将这些零散的知识点逐步整合到已有的知识体系中。
  • 递归整合与知识重构
    • 递归至熟悉领域:将新学到的知识与已有的知识进行递归整合,找到它们之间的联系。例如,学习一个新的算法时,可以将其与已有的数据结构知识进行关联,理解其背后的原理。
    • 知识重构:随着知识的不断积累,定期对知识体系进行重构。通过重新组织知识结构,删除过时的内容,强化重要的概念,确保知识体系的简洁性和有效性。
  • 工具与流程的支持
    • 笔记工具的选择:选择适合个人需求的笔记工具(如 ObsidianNotion ,利用双链、标签、文件夹等功能,构建一个灵活的知识管理系统。
    • 自动化与工作流优化:通过自动化工具(如 Git 同步、Templater 插件等)优化知识管理的工作流,减少重复劳动,提高知识积累和整合的效率。

PKM 的实践:从碎片到体系

PKM 的实践过程可以看作是从碎片化知识到系统化知识的逐步演进。以下是 PKM 实践的几个关键步骤:

  • 碎片化知识的收集
    • 多渠道输入:通过阅读书籍、论文、博客,观看视频课程,参与讨论等多种方式获取碎片化知识。
    • 即时记录:使用笔记工具快速记录下有价值的知识点、灵感或问题,确保不会遗漏重要的信息。
  • 知识的初步整理
    • 分类与标签:将收集到的知识按照主题、项目或领域进行分类,并打上标签,方便后续的检索和整合。
    • 初步连接:通过双链功能将相关的知识点进行连接,形成初步的知识网络。
  • 知识的深度整合
    • 主题笔记的创建:针对某个主题或领域,创建专门的笔记,将相关的碎片化知识进行整合,形成系统化的理解。
    • 知识图谱的构建:通过双链和标签,逐步构建个人知识图谱,可视化知识之间的联系,发现潜在的知识盲区。
  • 知识的应用与迭代
    • 实践与验证:将整合后的知识应用到实际问题中,验证其有效性,并根据实践结果进行迭代和优化。
    • 定期复盘:定期对知识体系进行复盘,删除过时的内容,强化重要的概念,确保知识体系的简洁性和实用性。

PKM 的长期价值:从知识工作者到知识创造者

通过有效的 PKM 实践,我们不仅能够提高解决问题的效率,还能够逐步构建起一个强大的个人知识体系。这个体系不仅是我们应对复杂问题的武器库,更是我们进行知识创造的基础。

  • 从知识消费者到知识创造者
    • 知识消费者:仅仅依赖外部资源(如 Google、Stack Overflow)解决问题,缺乏对知识的深入理解和整合。
    • 知识创造者:通过 PKM 构建起自己的知识体系,能够从更高的维度理解问题,并提出创新的解决方案。知识创造者不仅能够解决问题,还能够预见问题,并主动进行知识的探索和创新。
  • 提升个人竞争力
    • 不可替代性:拥有完整知识体系的知识工作者具有更强的不可替代性。他们不仅能够解决表面问题,还能够深入理解问题的本质,提出系统性的解决方案。
    • 认知复利:通过长期的 PKM 实践,知识工作者能够积累大量的认知复利,使他们在未来的工作和学习中更加高效和自信。

最后扔一个 deepseek 给出 PKM 建立方案,个人感觉还是有一点启发性的。包括但不限于引用了 PARA 等经典方案。

如何构建一个 PKM

个人知识管理(PKM)的本质,是在碎片化实践与系统化认知之间建立双向通道。这个过程如同量子隧穿效应——通过持续的知识重组,让经验碎片突破认知势垒,跃迁到更高能级的知识轨道。

知识工程的二象性模型

  1. 粒子态(问题驱动)
    • 突击学习模式:面对具体问题时,启动「最小必要知识」快速检索
      • 示例:开发登录功能时,直接研究 OAuth2.0 协议实现方案
    • 知识捕手工具链
      • 浏览器书签分组 +Raindrop. io 实现临时知识暂存
      • Readwise 配合 Hypothesis 完成高亮批注的自动归集
      • Obsidian QuickAdd 插件实现「闪念笔记→文献笔记」的即时转化
  2. 波动态(体系建设)
    • 渐进式知识炼金术
      • 每周用 MECE 原则对临时笔记进行原子化拆分
      • 通过双链构建概念间的「非对称关系(如「卷积神经网络→计算机视觉」是强关联,反向链接则为弱关联)
    • 认知脚手架
      • 使用 Excalidraw 绘制学科知识地图,标注掌握程度缺口
      • Logseq 中建立「学习看板,用 Kanban 管理知识模块的开发进度

知识演化的三重熔炉

  1. 项目熔炉(实战淬炼)
    • 采用 PARA 方法构建项目知识库:
      • Projects:将每个开发任务视为独立知识单元
      • Areas:维护「前端工程化「机器学习部署」等长期关注领域
      • Resources:积累技术白皮书、论文合集等参考资料
      • Archives:定期归档过时方案(如 Webpack3Vite 迁移文档)
  2. 对话熔炉(认知碰撞)
    • 在飞书知识库中建立「认知冲突区
      • 技术方案评审时强制要求提交对比分析矩阵
      • 用多维表格记录不同架构选择的决策树(如微服务 vs 单体应用的 12 个评估维度)
    • 每周组织「认知红队演练
      • 随机指定成员挑战现有技术方案的底层假设
      • 使用 Miro 白板进行实时架构图攻击推演
  3. 元认知熔炉(系统升级)
    • 构建「反知识」监测体系:
      • Obsidian 中设置「过时警告」标签,标记可能失效的技术方案
      • Dataview 自动生成「知识新鲜度」看板(根据最后更新时间排序)
    • 实施「认知版本控制
      • 使用 Git Tag 标记知识库的重大演进节点(如「Vue2→Vue3 迁移经验总结 v2.1
      • 通过分支管理进行认知实验(feature/blockchain-research 实验性探索)

避免知识管理中的热力学陷阱

  1. 熵增定律防御策略

    • 知识压缩算法
      • 每季度执行「概念蒸馏,将 10 篇相关论文提炼为 1 张本质洞察脑图
      • 用费曼技巧重构复杂知识,强制输出 500 字通俗解释
    • 负熵输入机制
      • Readwise 设置「反常识过滤器,优先推送挑战现有认知的文章
      • 订阅 arXiv 特定分类,保持对前沿研究的触觉敏锐度
  2. 工具理性批判

    • 建立工具评估矩阵:
评估维度 权重 Obsidian Notion 飞书
认知流畅度 30% 9 7 8
知识可迁移性 25% 10 6 5
协作效能 20% 6 9 10
系统扩展性 15% 10 8 6
心智负荷 10% 7 5 8
  • 每半年进行「工具断舍离,移除使用频率低于每周 1 次的插件 / 功能
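
The evaluation matrix above only becomes a ranking once the weights are applied. A minimal sketch of that arithmetic, using the weights and scores from the table (the English dimension names and variable names are just labels chosen for this example):

```python
# Weights per dimension, in the order of the table rows:
# cognitive fluency, knowledge portability, collaboration, extensibility, mental load.
weights = [0.30, 0.25, 0.20, 0.15, 0.10]

# Scores per tool in the same row order, copied from the table.
scores = {
    "Obsidian": [9, 10, 6, 10, 7],
    "Notion": [7, 6, 9, 8, 5],
    "Feishu": [8, 5, 10, 6, 8],
}

# Weighted total = sum of weight * score over all dimensions.
totals = {
    tool: sum(w * s for w, s in zip(weights, values))
    for tool, values in scores.items()
}
ranking = sorted(totals, key=totals.get, reverse=True)
```

With these numbers the totals come out to roughly 8.6 for Obsidian, 7.35 for Feishu, and 7.1 for Notion, so the result depends heavily on how much weight collaboration is given.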

认知复利增长模型

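Real knowledge management should yield exponential returns; its value follows the compound-interest formula "knowledge capital = initial cognition × (1 + restructuring efficiency)^time". Once the number of nodes in your knowledge network passes the "innovation critical point" (usually about 500 high-quality concept nodes), unexpected cross-domain insights start to emerge: an engineer may suddenly grasp the grid aesthetics in Mondrian's paintings, or a designer may find fractal beauty in a neural network architecture.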

最终,PKM 不应成为束缚思维的精致牢笼,而应进化为「可生长的认知操作系统。就像 Lisp 语言发明者 John McCarthy 所说:" 我们不是在记录知识,而是在培育会思考的笔记。" 当你的知识库开始反哺你的创造力时,就是认知飞轮突破静摩擦力的时刻。

\ No newline at end of file diff --git "a/Blogs/posts/\347\224\250AI\345\220\216\347\232\204\351\227\256\351\242\230/index.html" "b/Blogs/posts/\347\224\250AI\345\220\216\347\232\204\351\227\256\351\242\230/index.html" index ed60bade..0970c3c9 100644 --- "a/Blogs/posts/\347\224\250AI\345\220\216\347\232\204\351\227\256\351\242\230/index.html" +++ "b/Blogs/posts/\347\224\250AI\345\220\216\347\232\204\351\227\256\351\242\230/index.html" @@ -1,4 +1,4 @@ - 一些 AI 与个人学习的思考 - wnc 的咖啡馆

2374 个字 预计阅读时间 12 分钟 共被读过

2374 个字 预计阅读时间 12 分钟 共被读过

深入理解计算机系统

1012 个字 36 行代码 14 张图片 预计阅读时间 6 分钟 共被读过

1 计算机系统漫游

2 信息的表示和处理

  • Bits are grouped together and then given an interpretation
  • 三种重要的数字表示
    • unsigned
    • two's-complement
    • floating-point
  • overflow
  • 浮点数是近似的

2.1 信息存储

  • 1 byte = 8 bits
  • virtual memory
  • address
    • virtual address space
  • The memory space is partitioned into more manageable units that hold different program objects

2.1.1 十六进制表示法

  • 0x...

2.1.2 字数据大小

  • word size
  • nominal size
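  • The most important system parameter determined by the word size is the maximum size of the virtual address space
    • On a machine with a word size of \(\displaystyle \omega\) bits, virtual addresses range over \(\displaystyle 0\sim2^{\omega} - 1\)
    • Most 64-bit machines can run programs compiled for 32-bit machines, i.e. they are backward compatible
      QQ_1726230175376.png
  • To avoid surprises caused by differing sizes and compiler settings, C provides the fixed-width types int32_t and int64_t
  • C is insensitive to the order of keywords in a declaration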

2.1.3 寻址和字节顺序

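  • [[ 计算机组成与设计硬件软件接口 #^da8be4| little-endian addressing ]]
    • The least significant byte is stored at the lowest address, so with addresses written left to right the value reads from right to left
  • Three situations in which byte order matters
    • Networked applications must follow the established byte-order conventions
    • disassembler
    • Writing programs that circumvent the normal type system
      • cast or union in C
      • Not recommended for application programming, but necessary for system-level programming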
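
As a quick way to see the two byte orders without writing C, the sketch below lays out the same 32-bit value in both orders using Python's built-in `int.to_bytes`. The value `0x01234567` is just an example constant.

```python
import sys

value = 0x01234567

# Little-endian: least significant byte first (lowest address).
little = value.to_bytes(4, byteorder="little")
# Big-endian: most significant byte first.
big = value.to_bytes(4, byteorder="big")

print(little.hex(" "))  # 67 45 23 01
print(big.hex(" "))     # 01 23 45 67
print(sys.byteorder)    # byte order of the machine running this script

# Reading the bytes back with the wrong byte order gives a different number.
misread = int.from_bytes(little, byteorder="big")
```

This is the same effect a C program sees when it inspects an `int` through a byte pointer or a `union`.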
C
#include <stdio.h>
+    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

深入理解计算机系统

1012 个字 36 行代码 14 张图片 预计阅读时间 6 分钟 共被读过

1 计算机系统漫游

2 信息的表示和处理

  • Bits are grouped together and then given an interpretation
  • 三种重要的数字表示
    • unsigned
    • two's-complement
    • floating-point
  • overflow
  • 浮点数是近似的

2.1 信息存储

  • 1 byte = 8 bits
  • virtual memory
  • address
    • virtual address space
  • The memory space is partitioned into more manageable units that hold different program objects

2.1.1 十六进制表示法

  • 0x...

2.1.2 字数据大小

  • word size
  • nominal size
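  • The most important system parameter determined by the word size is the maximum size of the virtual address space
    • On a machine with a word size of \(\displaystyle \omega\) bits, virtual addresses range over \(\displaystyle 0\sim2^{\omega} - 1\)
    • Most 64-bit machines can run programs compiled for 32-bit machines, i.e. they are backward compatible
      QQ_1726230175376.png
  • To avoid surprises caused by differing sizes and compiler settings, C provides the fixed-width types int32_t and int64_t
  • C is insensitive to the order of keywords in a declaration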

2.1.3 寻址和字节顺序

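  • [[ 计算机组成与设计硬件软件接口 #^da8be4| little-endian addressing ]]
    • The least significant byte is stored at the lowest address, so with addresses written left to right the value reads from right to left
  • Three situations in which byte order matters
    • Networked applications must follow the established byte-order conventions
    • disassembler
    • Writing programs that circumvent the normal type system
      • cast or union in C
      • Not recommended for application programming, but necessary for system-level programming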
C
#include <stdio.h>
 
 
 
diff --git a/CS_Basic/C++/Accelerated C++/index.html b/CS_Basic/C++/Accelerated C++/index.html
index ea6b52e5..ea0a5547 100644
--- a/CS_Basic/C++/Accelerated C++/index.html	
+++ b/CS_Basic/C++/Accelerated C++/index.html	
@@ -7,7 +7,7 @@
     .gdesc-inner { font-size: 0.75rem; }
     body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
     body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
-    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

Accelerated C++

2520 个字 489 行代码 2 张图片 预计阅读时间 19 分钟 共被读过

0 开始学习 C++

C++
#include <iostream>
+    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

Accelerated C++

2520 个字 489 行代码 2 张图片 预计阅读时间 19 分钟 共被读过

0 开始学习 C++

C++
#include <iostream>
 
 int main()
 {
diff --git a/CS_Basic/C++/C++ Basic/index.html b/CS_Basic/C++/C++ Basic/index.html
index 6d9d26dc..b19f66c1 100644
--- a/CS_Basic/C++/C++ Basic/index.html	
+++ b/CS_Basic/C++/C++ Basic/index.html	
@@ -7,7 +7,7 @@
     .gdesc-inner { font-size: 0.75rem; }
     body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
     body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
-    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

C++

3115 个字 523 行代码 9 张图片 预计阅读时间 22 分钟 共被读过

1 文件操作

1.1 文件的概念

  • C/C++ 把每一个文件都看成是一个有序的字节流,以文件结束标志(EOF)结束

1.2 文件的操作步骤

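  1. Open the file, point the file pointer at it, and decide the mode in which it is opened
  2. Read from / write to the file
  3. Close the file once you are done with it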

1.3 一些函数

1.3.1 freopen 函数

C++
FILE* freopen(const char* filename, const char* mode, FILE* stream);
+    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

C++

3115 个字 523 行代码 9 张图片 预计阅读时间 22 分钟 共被读过

1 文件操作

1.1 文件的概念

  • C/C++ 把每一个文件都看成是一个有序的字节流,以文件结束标志(EOF)结束

1.2 文件的操作步骤

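  1. Open the file, point the file pointer at it, and decide the mode in which it is opened
  2. Read from / write to the file
  3. Close the file once you are done with it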

1.3 一些函数

1.3.1 freopen 函数

C++
FILE* freopen(const char* filename, const char* mode, FILE* stream);
 
  • 参数说明
    • filename: 要打开的文件名
    • mode: 文件打开的模式,表示文件访问的权限
    • stream: 文件指针,通常使用标准文件流 ( stdin/stdout ) 或标准错误输出流 (stderr )
    • 返回值:文件指针,指向被打开文件
  • 文件打开格式
    • r:以只读方式打开文件,文件必须存在,只允许读入数据 (常用)
    • r+:以读 / 写方式打开文件,文件必须存在,允许读 / 写数据
    • rb:以只读方式打开二进制文件,文件必须存在,只允许读入数据
    • rb+:以读 / 写方式打开二进制文件,文件必须存在,允许读 / 写数据
    • rt+:以读 / 写方式打开文本文件,允许读 / 写数据
    • w:以只写方式打开文件,文件不存在会新建文件,否则清空内容,只允许写入数据 (常用)
    • w+:以读 / 写方式打开文件,文件不存在将新建文件,否则清空内容,允许读 / 写数据
    • wb:以只写方式打开二进制文件,文件不存在将会新建文件,否则清空内容,只允许写入数据
    • wb+:以读 / 写方式打开二进制文件,文件不存在将新建文件,否则清空内容,允许读 / 写数据
    • a:以只写方式打开文件,文件不存在将新建文件,写入数据将被附加在文件末尾(保留 EOF 符)
    • a+:以读 / 写方式打开文件,文件不存在将新建文件,写入数据将被附加在文件末尾(不保留 EOF 符)
    • at+:以读 / 写方式打开文本文件,写入数据将被附加在文件末尾
    • ab+:以读 / 写方式打开二进制文件,写入数据将被附加在文件末尾
      使用方式
C++
#include <cstdio>
 #include <iostream>
 int main(void) {
diff --git a/CS_Basic/CS61A/CS61A/index.html b/CS_Basic/CS61A/CS61A/index.html
index 8206d59f..23822c2e 100644
--- a/CS_Basic/CS61A/CS61A/index.html
+++ b/CS_Basic/CS61A/CS61A/index.html
@@ -7,7 +7,7 @@
     .gdesc-inner { font-size: 0.75rem; }
     body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
     body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
-    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      
wnc's café
wnc's café

COMPOSING PROGRAMS

2713 个字 651 行代码 2 张图片 预计阅读时间 22 分钟 共被读过

1 使用函数构建抽象

1.1 开始

程序由两部分组成 :

  • 计算一些值
  • 执行一些操作
  • 函数
  • 对象
  • 解释器 :
    • 用于计算复杂表达式的程序
  • 增量测试、模块化设计、明确的假设和团队合作

1.2 编程要素

1.2.1 表达式

  • 语言要有的机制 :
    • 原始表达式和语句:语言所关心的最简单的个体
    • 组合方法:由简单元素组合构建复合元素
    • 抽象方法:命名复合元素,并将其作为单元进行操作
  • infix notation

1.2.2 调用表达式

image.png

  • subexpressions
  • 用参数来调用函数
  • nested(嵌套)

1.2.3 导入库函数

1.2.4 名称与环境

  • = is assignment operator
    • 最简单的抽象方法
  • environment

1.2.5 求解嵌套表达式

求值程序本质上是递归的
image.png

  • 表达式树

1.2.6 非纯函数 print

Pure functions
Non-pure functions
which has a side effect

1.3 定义新的函数

Python
def <name>(<formal parameters>):
+    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

COMPOSING PROGRAMS

2713 个字 651 行代码 2 张图片 预计阅读时间 22 分钟 共被读过

1 使用函数构建抽象

1.1 开始

程序由两部分组成 :

  • 计算一些值
  • 执行一些操作
  • 函数
  • 对象
  • 解释器 :
    • 用于计算复杂表达式的程序
  • 增量测试、模块化设计、明确的假设和团队合作

1.2 编程要素

1.2.1 表达式

  • 语言要有的机制 :
    • 原始表达式和语句:语言所关心的最简单的个体
    • 组合方法:由简单元素组合构建复合元素
    • 抽象方法:命名复合元素,并将其作为单元进行操作
  • infix notation

1.2.2 调用表达式

image.png

  • subexpressions
  • 用参数来调用函数
  • nested(嵌套)

1.2.3 导入库函数

1.2.4 名称与环境

  • = is assignment operator
    • 最简单的抽象方法
  • environment

1.2.5 求解嵌套表达式

求值程序本质上是递归的
image.png

  • 表达式树

1.2.6 非纯函数 print

Pure functions
Non-pure functions
which has a side effect

1.3 定义新的函数

Python
def <name>(<formal parameters>):
     return <return expression>  
 

1.3.1 环境

environment has some frames
frames have some bindings

  • intrinsic name
  • bound name
    不同的名称可能指的是同一个函数,但该函数本身只有一个内在名称
    对函数形式参数的描述被称为函数的签名

1.3.2 调用用户定义的函数

  1. 在新的局部帧中,将实参绑定到函数的形参上。
  2. 在以此帧开始的环境中执行函数体。
    name evaluation

1.3.3 示例:调用用户定义的函数

1.3.4 局部名称

1.3.5 选择名称

PEP 8 – Style Guide for Python Code | peps.python.org

1.3.6 抽象函数

  • functional abstraction
    • domain
    • range
    • intent

1.3.7 运算符

  • truediv
  • floordiv

1.4 设计函数

  • 一个函数一个任务
  • Don't repeat yourself (DRY)
  • 定义通用的函数

1.4.1 文档

docstring

1.4.2 参数默认值

1.5 控制

1.5.1 语句

  • assignment
  • def
  • return

1.5.2 复合语句

header
suite

Python
<header>:
     <statement>
diff --git "a/CS_Basic/CS61C/\350\256\241\347\256\227\346\234\272\347\273\204\346\210\220\344\270\216\350\256\276\350\256\241\347\241\254\344\273\266\350\275\257\344\273\266\346\216\245\345\217\243/index.html" "b/CS_Basic/CS61C/\350\256\241\347\256\227\346\234\272\347\273\204\346\210\220\344\270\216\350\256\276\350\256\241\347\241\254\344\273\266\350\275\257\344\273\266\346\216\245\345\217\243/index.html"
index d2aa1ebc..677bfe26 100644
--- "a/CS_Basic/CS61C/\350\256\241\347\256\227\346\234\272\347\273\204\346\210\220\344\270\216\350\256\276\350\256\241\347\241\254\344\273\266\350\275\257\344\273\266\346\216\245\345\217\243/index.html"
+++ "b/CS_Basic/CS61C/\350\256\241\347\256\227\346\234\272\347\273\204\346\210\220\344\270\216\350\256\276\350\256\241\347\241\254\344\273\266\350\275\257\344\273\266\346\216\245\345\217\243/index.html"
@@ -7,7 +7,7 @@
     .gdesc-inner { font-size: 0.75rem; }
     body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
     body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
-    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

计算机组成与设计硬件软件接口

2978 个字 13 行代码 4 张图片 预计阅读时间 15 分钟 共被读过

1 计算机抽象及相关技术

2 指令 : 计算机的语言

2.1 引言

设计原则 :

  • 简单源于规整
  • 更少则更快
  • 优秀的设计需要适当的折中

2.2 计算机硬件的操作

Java 编译器 : Just In Time 编译器

2.3 计算机硬件的操作数

  • 寄存器
    • 大小为 64 bits 双字
    • 数量有限通常为 32
    • x + 寄存器编号

2.3.1 存储器操作数

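Instructions that transfer data between memory and registers: data transfer instructions
The instruction supplies the memory address
Load instruction: ld

Text Only
ld x9, 8(x22)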
+    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

计算机组成与设计硬件软件接口

2978 个字 13 行代码 4 张图片 预计阅读时间 15 分钟 共被读过

1 计算机抽象及相关技术

2 指令 : 计算机的语言

2.1 引言

设计原则 :

  • 简单源于规整
  • 更少则更快
  • 优秀的设计需要适当的折中

2.2 计算机硬件的操作

Java 编译器 : Just In Time 编译器

2.3 计算机硬件的操作数

  • 寄存器
    • 大小为 64 bits 双字
    • 数量有限通常为 32
    • x + 寄存器编号

2.3.1 存储器操作数

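Instructions that transfer data between memory and registers: data transfer instructions
The instruction supplies the memory address
Load instruction: ld

Text Only
ld x9, 8(x22)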
 

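x22 is the base register
8 is the offset
Byte addresses of consecutive doublewords: 0 8 16 24
RISC-V uses little-endian addressing: this only matters when the same data is accessed both as a doubleword and as eight individual bytes ^da8be4

Store instruction (stores a doubleword): sd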

Text Only
sd x9, 96(x22)
 
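  • Alignment restriction :
    • The start address of a word is a multiple of 4
    • The start address of a doubleword is a multiple of 8
    • RISC-V and Intel x86 do not have this restriction
    • MIPS does

Gibibyte (\(\displaystyle 2^{30}\)) and tebibyte (\(\displaystyle 2^{40}\))
If there are more variables than registers, some of them are kept in memory, i.e. registers are spilled.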
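
To make the base-plus-offset addressing of `ld` and `sd` concrete, here is a toy simulation in Python. It is only a sketch under simplifying assumptions: memory is a 128-byte `bytearray`, registers are a dictionary, values are treated as unsigned, and the helper names `ld` and `sd` merely mirror the instruction mnemonics.

```python
# Toy machine state: byte-addressed memory and a few registers.
memory = bytearray(128)
regs = {"x9": 0, "x22": 16}


def sd(src, offset, base):
    """Store the doubleword in register src at address regs[base] + offset."""
    addr = regs[base] + offset
    # RISC-V is little-endian: least significant byte at the lowest address.
    memory[addr:addr + 8] = regs[src].to_bytes(8, "little")


def ld(dst, offset, base):
    """Load the doubleword at address regs[base] + offset into register dst."""
    addr = regs[base] + offset
    regs[dst] = int.from_bytes(memory[addr:addr + 8], "little")


regs["x9"] = 0x1122334455667788
sd("x9", 96, "x22")   # like: sd x9, 96(x22), writes bytes 112..119
regs["x9"] = 0
ld("x9", 96, "x22")   # like: ld x9, 96(x22), reads them back
```

After the store, `memory[112]` holds `0x88`, the least significant byte, which is the little-endian layout noted above.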

2.3.2 常数或立即数操作数

Text Only
ld x9, AddConstant4(x3)
 add x22, x22, x9
diff --git a/CS_Basic/Network/Security/index.html b/CS_Basic/Network/Security/index.html
index 30d01365..254c2509 100644
--- a/CS_Basic/Network/Security/index.html
+++ b/CS_Basic/Network/Security/index.html
@@ -7,7 +7,7 @@
     .gdesc-inner { font-size: 0.75rem; }
     body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
     body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
-    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

Security

常见的密码算法

368 个字 预计阅读时间 2 分钟 共被读过

  • Hash algorithms (e.g. MD5, SHA256)
  • Symmetric encryption algorithms (e.g. AES, DES)
  • Asymmetric encryption algorithms (e.g. RSA)

[[ 加密原理 ]]

弱口令

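  1. Short passwords
  2. Passwords that are easy to guess or vulnerable to channel attacks
  • Risks
    • If ssh has password authentication enabled with a weak password, the server can be logged into without authorization, and the attacker can do anything you have permission to do
    • If a wireless LAN uses a weak password and it is guessed, the attacker can join the LAN and attack other devices on it
  • Mitigations
    • Use a more secure authentication method (e.g. publickey authentication for ssh)
    • Use a random string as the password, longer than 8 characters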
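
One possible way to follow the last recommendation is to generate the password with Python's `secrets` module, which draws from a cryptographically secure random source. The function name `random_password`, the default length of 16, and the chosen alphabet are assumptions for this sketch.

```python
import secrets
import string

# Letters, digits, and punctuation; adjust if a service restricts characters.
ALPHABET = string.ascii_letters + string.digits + string.punctuation


def random_password(length=16):
    """Return a random password of the given length (must exceed 8)."""
    if length <= 8:
        raise ValueError("use more than 8 characters")
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


password = random_password(16)
```

For ssh specifically, publickey authentication remains the stronger option; a random password mainly helps for services that only support passwords.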

公网 IP

  • 我们希望从任意接入互联网的地方使用 ssh 连接到服务器,一个简单的方法是让服务器拥有一个公网 IP 并运行 sshd 服务。
  • 常见的攻击方式
    1. 扫描开放端口信息,并确定端口上运行的服务
    2. 对可能存在的服务进行攻击,尝试利用服务的漏洞(如弱口令)获取服务器的访问权限
  • 常见的防范方式
    • 使用防火墙。配置防火墙规则,仅允许必要的服务和端口对外开放。
    • 审查开放的服务的安全性。确保当前主机开放的所有服务均是安全的。
wnc's café

Security

常见的密码算法

368 个字 预计阅读时间 2 分钟 共被读过

  • Hash algorithms (e.g. MD5, SHA256)
  • Symmetric encryption algorithms (e.g. AES, DES)
  • Asymmetric encryption algorithms (e.g. RSA)

[[ 加密原理 ]]

弱口令

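  1. Short passwords
  2. Passwords that are easy to guess or vulnerable to channel attacks
  • Risks
    • If ssh has password authentication enabled with a weak password, the server can be logged into without authorization, and the attacker can do anything you have permission to do
    • If a wireless LAN uses a weak password and it is guessed, the attacker can join the LAN and attack other devices on it
  • Mitigations
    • Use a more secure authentication method (e.g. publickey authentication for ssh)
    • Use a random string as the password, longer than 8 characters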

公网 IP

  • 我们希望从任意接入互联网的地方使用 ssh 连接到服务器,一个简单的方法是让服务器拥有一个公网 IP 并运行 sshd 服务。
  • 常见的攻击方式
    1. 扫描开放端口信息,并确定端口上运行的服务
    2. 对可能存在的服务进行攻击,尝试利用服务的漏洞(如弱口令)获取服务器的访问权限
  • 常见的防范方式
    • 使用防火墙。配置防火墙规则,仅允许必要的服务和端口对外开放。
    • 审查开放的服务的安全性。确保当前主机开放的所有服务均是安全的。
wnc's café

Computer Science Basic

Abstract

本部分内容(除特别声明外)采用 署名 - 非商业性使用 - 保持一致 4.0 国际 (CC BY-NC-SA 4.0) 许可协议进行许可。

  • 1012 36 4 mins
    1734027543

Computer Science Basic

Abstract

本部分内容(除特别声明外)采用 署名 - 非商业性使用 - 保持一致 4.0 国际 (CC BY-NC-SA 4.0) 许可协议进行许可。

  • 1012 36 4 mins
    1734027543
\ No newline at end of file + 友链 - wnc 的咖啡馆
\ No newline at end of file diff --git a/Robot/calibration/index.html b/Robot/calibration/index.html index 3fa5f3d4..a382128a 100644 --- a/Robot/calibration/index.html +++ b/Robot/calibration/index.html @@ -7,7 +7,7 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}
wnc's café
wnc's café

Robot

Abstract

本部分内容(除特别声明外)采用 署名 - 非商业性使用 - 保持一致 4.0 国际 (CC BY-NC-SA 4.0) 许可协议进行许可。

Robot

Abstract

本部分内容(除特别声明外)采用 署名 - 非商业性使用 - 保持一致 4.0 国际 (CC BY-NC-SA 4.0) 许可协议进行许可。

卡尔曼滤波

193 个字 115 行代码 预计阅读时间 2 分钟 共被读过

1 Why

  • 差分
    • 受噪声干扰大
    • 有延迟
    • 速度不连续(不能得到瞬时速度)

2 How

2.1 卡尔曼滤波

\[ \begin{array}{|c|}\hline\textbf{Prediction}\\\hline x^{'}=Ax+u\\P^{'}=APA^{T}+R\\\hline\textbf{Measurement update}\\\hline y=z-Cx^{'}\\S=CP^{'}C^{T}+Q\\K=P^{'}C^{T}S^{-1}\\x=x^{'}+Ky\\P=(I-KC)P^{'}\\\hline\end{array} \]
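
Before the full C++ version, a scalar sketch in Python may help to see the prediction and measurement-update steps in isolation. It keeps the symbols of the table (\(R\) added in the prediction, \(Q\) in the measurement update) and uses the predicted covariance in the update, as in the standard filter. The function name, the default noise values, and the measurements are assumptions made for the example.

```python
def kalman_step(x, p, z, a=1.0, c=1.0, u=0.0, r=1e-3, q=0.5):
    """One scalar Kalman filter step: returns the updated (x, p).

    x, p : previous state estimate and its variance
    z    : new measurement
    a, c : state-transition and observation coefficients
    r, q : noise terms added in prediction and in the update, as in the table
    """
    # Prediction
    x_pred = a * x + u
    p_pred = a * p * a + r
    # Measurement update
    y = z - c * x_pred            # innovation
    s = c * p_pred * c + q        # innovation variance
    k = p_pred * c / s            # Kalman gain
    x_new = x_pred + k * y
    p_new = (1 - k * c) * p_pred
    return x_new, p_new


# Hypothetical noisy measurements of a constant value near 1.0.
x, p = 0.0, 1.0
for z in [1.1, 0.9, 1.05, 0.95, 1.0]:
    x, p = kalman_step(x, p, z)
```

The estimate moves from the initial guess 0.0 towards the measurements while the variance `p` shrinks with every update.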
C++
#include <iostream>
+    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

卡尔曼滤波

193 个字 115 行代码 预计阅读时间 2 分钟 共被读过

1 Why

  • 差分
    • 受噪声干扰大
    • 有延迟
    • 速度不连续(不能得到瞬时速度)

2 How

2.1 卡尔曼滤波

\[ \begin{array}{|c|}\hline\textbf{Prediction}\\\hline x^{'}=Ax+u\\P^{'}=APA^{T}+R\\\hline\textbf{Measurement update}\\\hline y=z-Cx^{'}\\S=CP^{'}C^{T}+Q\\K=P^{'}C^{T}S^{-1}\\x=x^{'}+Ky\\P=(I-KC)P^{'}\\\hline\end{array} \]
C++
#include <iostream>
 #include <cstdio>
 #include <string>
 #include <vector>
diff --git a/Robot/pnp/index.html b/Robot/pnp/index.html
index a5208adf..71a06c19 100644
--- a/Robot/pnp/index.html
+++ b/Robot/pnp/index.html
@@ -7,7 +7,7 @@
     .gdesc-inner { font-size: 0.75rem; }
     body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
     body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
-    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

pnp

79 个字 55 行代码 预计阅读时间 1 分钟 共被读过

  • 已知
    • 目标物体特定点的像素坐标
    • 目标物体特定点的真实尺寸
    • 相机内参
    • 目标物体在相机坐标系下的 6d pose

像素坐标和物体坐标的对点
但是一般只用 t, 因为 R 的精度不够高
Fetching Title#g70i

![[Pasted image 20241008201602.png]]

C++
#include <iostream>
+    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

pnp

79 个字 55 行代码 预计阅读时间 1 分钟 共被读过

  • 已知
    • 目标物体特定点的像素坐标
    • 目标物体特定点的真实尺寸
    • 相机内参
    • 目标物体在相机坐标系下的 6d pose

像素坐标和物体坐标的对点
但是一般只用 t, 因为 R 的精度不够高
Fetching Title#g70i

![[Pasted image 20241008201602.png]]

C++
#include <iostream>
 #include <opencv2/opencv.hpp>
 #include <opencv2/imgproc/imgproc.hpp>
 #include <opencv2/calib3d/calib3d.hpp>
diff --git a/Summaries/2024/weekly/2024-W51-12/index.html b/Summaries/2024/weekly/2024-W51-12/index.html
index 3a212958..ac5da426 100644
--- a/Summaries/2024/weekly/2024-W51-12/index.html
+++ b/Summaries/2024/weekly/2024-W51-12/index.html
@@ -7,7 +7,7 @@
     .gdesc-inner { font-size: 0.75rem; }
     body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
     body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
-    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

657 个字 预计阅读时间 3 分钟 共被读过

657 个字 预计阅读时间 3 分钟 共被读过

814 个字 预计阅读时间 4 分钟 共被读过

814 个字 预计阅读时间 4 分钟 共被读过

735 个字 预计阅读时间 4 分钟 共被读过

735 个字 预计阅读时间 4 分钟 共被读过

656 个字 预计阅读时间 3 分钟 共被读过

656 个字 预计阅读时间 3 分钟 共被读过

728 个字 4 张图片 预计阅读时间 4 分钟 共被读过

728 个字 4 张图片 预计阅读时间 4 分钟 共被读过

514 个字 预计阅读时间 3 分钟 共被读过

514 个字 预计阅读时间 3 分钟 共被读过

2024 年高三 - 大一暑假总结

518 个字 预计阅读时间 3 分钟 共被读过

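My summer break started around 24.7.10, when things finally settled. But I kept wavering about what to do next (study abroad? a recommended graduate place? work? change majors?), so I only really started studying in August.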

  • 计算机
    • crash course computer 看了前 20 讲,后来因为懒得看了就摆烂了
    • cs 61 A
      • 看了前 10 讲的 lecture,但是没做笔记
      • 看了 composing programs 前三章
      • 做完了 4 proj,但是没有做 hw lab
    • cs 61 C
      • 看了前 10 讲的 slide
      • 做了前两个 proj 和前六个 lab
      • 看计算机组成与设计硬件软件接口前两章
    • csapp
      • 书看了前三章
      • 九曲阑干看了前 4
    • Dive into Deep Learning
      • 看了前两章并做了笔记,但感觉一下子跳过太多前置知识很难感受到美感便先放放。
    • games 101
      • 几乎看完了所有的 lecture ( 但是后面几讲不是很认真 ),但是没有做笔记
    • 代码随想录
      • 做到回溯了,但是打算之后不会很经常做(等到要用了再说)
    • mkdocs 搭建了自己的 blog
    • C++
      • 看了菜鸟教程上的相关内容,没做笔记
      • 看了浙大的 C++ ,没做笔记,也没看完()
      • 看了 accelerated C++,做了笔记
    • 看了浙大的实用技能拾遗
      • 复习了 Markdown Latex 语法,学习了如何使用 git,学习了最基础的 shell,vim。
    • 视觉 slam 十四讲
      • Finished the first 7 lectures (the theory part) and took notes, but did not run the code (the environment was too hard to set up)
    • 配置环境
      • wsl 2 , git,vmware,vscode
      • 配置了 obsidian,装了好多插件,现在用起来是很舒服了
  • 运动
    • 每天做做俯卧撑,感觉还不错
    • 大概 7 月份的时候每天下午会出去骑车(city cycling?)
  • 其他
    • 家教
    • 学了驾照
    • 买了一个键盘和显示器 , 重装了电脑
    • 和朋友旅游,去泉州 + 福州
    • 给高中的学弟学妹写了经验分享(数学 + 英语 + 物理 + 技术)
    • 看了不少电影
wnc's café

2024 年高三 - 大一暑假总结

518 个字 预计阅读时间 3 分钟 共被读过

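My summer break started around 24.7.10, when things finally settled. But I kept wavering about what to do next (study abroad? a recommended graduate place? work? change majors?), so I only really started studying in August.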

  • 计算机
    • crash course computer 看了前 20 讲,后来因为懒得看了就摆烂了
    • cs 61 A
      • 看了前 10 讲的 lecture,但是没做笔记
      • 看了 composing programs 前三章
      • 做完了 4 proj,但是没有做 hw lab
    • cs 61 C
      • 看了前 10 讲的 slide
      • 做了前两个 proj 和前六个 lab
      • 看计算机组成与设计硬件软件接口前两章
    • csapp
      • 书看了前三章
      • 九曲阑干看了前 4
    • Dive into Deep Learning
      • 看了前两章并做了笔记,但感觉一下子跳过太多前置知识很难感受到美感便先放放。
    • games 101
      • 几乎看完了所有的 lecture ( 但是后面几讲不是很认真 ),但是没有做笔记
    • 代码随想录
      • 做到回溯了,但是打算之后不会很经常做(等到要用了再说)
    • mkdocs 搭建了自己的 blog
    • C++
      • 看了菜鸟教程上的相关内容,没做笔记
      • 看了浙大的 C++ ,没做笔记,也没看完()
      • 看了 accelerated C++,做了笔记
    • 看了浙大的实用技能拾遗
      • 复习了 Markdown Latex 语法,学习了如何使用 git,学习了最基础的 shell,vim。
    • 视觉 slam 十四讲
      • 看完了前 7 讲(即理论部分,做了笔记,但是没有跑代码 (环境太难配了)
    • 配置环境
      • wsl 2 , git,vmware,vscode
      • 配置了 obsidian,装了好多插件,现在用起来是很舒服了
  • 运动
    • 每天做做俯卧撑,感觉还不错
    • 大概 7 月份的时候每天下午会出去骑车(city cycling?)
  • 其他
    • 家教
    • 学了驾照
    • 买了一个键盘和显示器 , 重装了电脑
    • 和朋友旅游,去泉州 + 福州
    • 给高中的学弟学妹写了经验分享(数学 + 英语 + 物理 + 技术)
    • 看了不少电影
wnc's café

Summaries

Abstract

Unless otherwise stated, the content of this section is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.


Tags

#Blog

  • Some Recent Thoughts    1/20/25, 2:34 AM
  • Some Thoughts on AI and Personal Learning    1/2/25, 6:43 AM
  • Work Routines    12/29/24, 6:35 AM
  • Information    12/21/24, 9:21 AM

#科研 (Research)

  • FreeSplatter Code Walkthrough    1/2/25, 5:22 AM
  • Gaussian_Splatting_Code    1/1/25, 4:11 AM
  • Gaussian Splatting Reproduction    12/31/24, 1:11 PM
  • ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding    12/28/24, 7:08 AM
  • Beyond Object Recognition: A New Benchmark towards Object Concept Learning    12/24/24, 1:04 PM

#三维重建 (3D reconstruction)

  • FreeSplatter Code Walkthrough    1/2/25, 5:22 AM
  • Gaussian_Splatting_Code    1/1/25, 4:11 AM
  • Gaussian Splatting Reproduction    12/31/24, 1:11 PM

#复现 (Reproduction)

  • FreeSplatter Code Walkthrough    1/2/25, 5:22 AM
  • Gaussian_Splatting_Code    1/1/25, 4:11 AM
  • Gaussian Splatting Reproduction    12/31/24, 1:11 PM

#PKM

  • Choosing Note-Taking Software    1/29/25, 11:52 AM

#周记 (Weekly notes)

  • 2025-W04-01    1/26/25, 2:43 PM
  • 2025-W03-01    1/19/25, 11:16 AM
  • 2025-W02-01    1/6/25, 10:04 AM
  • 2025-W01-12    1/5/25, 2:56 PM
  • 2024-W52-12    12/23/24, 4:06 AM
  • 2024-W51-12    12/16/24, 4:06 AM

#prompt

  • prompt    12/30/24, 7:20 AM

#Environment

  • Obsidian Configuration    12/22/24, 10:41 AM

#Obsidian

  • Obsidian Configuration    12/22/24, 10:41 AM

#Zotero

  • zotero_使用指南 (Zotero usage guide)    12/23/24, 4:16 AM

#Tools

  • zotero_使用指南 (Zotero usage guide)    12/23/24, 4:16 AM
\ No newline at end of file
diff --git a/Tools/AI/prompt/index.html b/Tools/AI/prompt/index.html
index b1afc2ac..0bfa5b6a 100644
--- a/Tools/AI/prompt/index.html
+++ b/Tools/AI/prompt/index.html
@@ -7,7 +7,7 @@
     .gdesc-inner { font-size: 0.75rem; }
     body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
     body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
-    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

    +    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

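    Using AI

    1 How to Write a Prompt

    1.1 Basic Principles of a Prompt

    1.1.1 State the Requirement Clearly

    1.1.1.1 What is a clear instruction?

    A clear instruction is a description that conveys the intent of the task accurately and without ambiguity, so that the model understands it and produces the expected result.
    Its core is making the requirement concrete: explicit wording and a structured description make the task goal easy for the model to parse.

    1.1.1.2 How do you express a requirement without ambiguity?
    1. Use specific language
      For example, instead of simply saying "generate a summary", state the form and requirements of the content:
    • ❌ Unclear: Summarize this article.
    • ✅ Clear: Summarize the following article in 3 points using plain language, and output them as a Markdown list.
    2. Set clear boundaries and limits
      Give an explicit scope so the model does not output unrelated information.
    • ❌ Unclear: Explain AI.
    • ✅ Clear: In 2-3 sentences, explain what AI is to a high-school student, avoiding overly technical terms.
    3. Use words that point strongly at the task
      Emphasize the core of the task, for example "explain in detail", "summarize in concise language", or "list the concrete steps".
    1.1.1.3 Example: clear vs. vague instructions
    Vague instruction | Clear instruction
    Summarize the meeting notes. | Summarize the meeting notes in one paragraph, then list each speaker and the action items they proposed as a Markdown list.
    Explain big data. | In three sentences, explain to a middle-school student what big data is and what it is used for, and give one example from everyday life.
    Generate a travel plan. | Generate a detailed plan for a three-day trip to Beijing, including each day's itinerary, an introduction to the sights, a budget range, and recommended food.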

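    1.1.2 Be Concise

    1.1.2.1 How to avoid verbosity and useless information

    In a prompt, long-winded descriptions make the task harder for the model to understand and can introduce irrelevant content. The following techniques help tighten the wording:

    1. Cut useless information: remove unnecessary modifiers and repeated content.
      • ❌ Verbose: When writing the summary of this article, you may refer to the content of the following articles……
      • ✅ Concise: Write a summary of the following article.
    2. Get straight to the point: describe the core requirement first, and do not let background information crowd out the task.
    3. Keep the hierarchy clear: use a structured format instead of lumping several instructions together.
    1.1.2.2 Applying Occam's razor in practice

    Occam's razor says "entities should not be multiplied beyond necessity". In a prompt, this means keeping unnecessary constraints and extra requirements to a minimum.

    • Example
      • With unnecessary constraints:
        Summarize the following text in about 500 characters, do not mention anything unrelated to the text, do not add personal opinions, just briefly outline the main points……
      • Optimized:
        Summarize the main points of the following text in 500 characters, in concise and clear language.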

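    1.1.3 Tone and Style

    1.1.3.1 Use formal, polite language to improve accuracy

    In a prompt, tone and language style affect the quality of what the model generates.

    • Formal language is usually closer to the distribution of the training data of large models, which tends to produce more rigorous content.
    • Example:
      • Formal: Please explain the following technical concept in plain language.
      • Informal: Just put this bit in simple words for me.
    1.1.3.2 Adjusting tone for different tasks
    1. Creative tasks
      The instruction should be more evocative, to encourage more imaginative output.
    • Example:
      • You are an expert at writing viral Xiaohongshu (小红书) posts. Design an appealing Qingdao travel guide for young people.
    2. Educational tasks
      The tone should guide the learner step by step, and the content should be clearly structured.
    • Example:
      • You are a high-school math teacher. Explain the concept of a quadratic function in plain language.
    3. Professional tasks
      The tone should be rigorous and the information precise, avoiding subjective statements.
    • Example:
      • Explain blockchain technology in detail in terms of its definition, characteristics, and applications, and give relevant industry examples.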

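    1.2 Advanced Techniques

    1.2.1 Provide Context and Examples

    1.2.1.1 Use a few-shot prompt to provide effective examples

    A few-shot prompt includes examples in the prompt to steer the model toward similar output. It is especially suited to complex tasks or cases where the requirement is hard to state precisely.

    1. Why use a few-shot prompt
    • Less room for improvisation: examples constrain the style and structure of the model's output.
    • Better task accuracy: examples convey a concrete standard and reduce deviation.
    • Broader adaptability: examples help the model handle scenarios it may not have seen in its training data.
    2. Key points when designing a few-shot prompt
    • Number of examples: 2-5 is usually enough; more can make the prompt too long and add noise.
    • Cover cases of different difficulty: include simple scenarios (easy case), complex scenarios (hard case), and edge conditions (corner case).
    • Example quality: make sure the examples are highly relevant to the intended task.
    3. Example
      Task: decide whether the input is a knowledge Q&A question.

    Few-shot Prompt

    Text Only
    Decide whether each of the following questions is a knowledge Q&A question.
     
     Question: What is the tallest mountain in the world? # easy case, an objective knowledge question
     Answer: yes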
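
The few-shot structure described above (an instruction, a handful of labeled examples, then the new question) can be assembled programmatically. The following is only a minimal sketch, not the author's tooling: the names `Example` and `build_few_shot_prompt` are hypothetical, and every example other than the tallest-mountain question is invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Example:
    question: str
    is_knowledge_qa: bool  # the label shown to the model as "yes" / "no"


def build_few_shot_prompt(instruction: str, examples: list[Example], query: str) -> str:
    """Join the instruction, the labeled examples, and the unanswered query."""
    lines = [instruction, ""]
    for ex in examples:
        lines.append(f"Question: {ex.question}")
        lines.append(f"Answer: {'yes' if ex.is_knowledge_qa else 'no'}")
        lines.append("")
    # The final question is left unanswered so the model completes it.
    lines.append(f"Question: {query}")
    lines.append("Answer:")
    return "\n".join(lines)


EXAMPLES = [
    # easy case, taken from the text above
    Example("What is the tallest mountain in the world?", True),
    # hypothetical hard case: factual, but needs an explanation
    Example("Why is the sky blue at noon but red at sunset?", True),
    # hypothetical corner case: mentions a fact-like topic but asks for creative writing
    Example("Write me a poem about mountains.", False),
]

prompt = build_few_shot_prompt(
    "Decide whether each question is a knowledge Q&A question.",
    EXAMPLES,
    "How many moons does Mars have?",
)
print(prompt)
```

Keeping the examples in a list makes it easy to stay within the suggested 2-5 examples and to swap in cases of different difficulty.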
    diff --git a/Tools/Blog/Mkdocs_Material/index.html b/Tools/Blog/Mkdocs_Material/index.html
    index c7f6c9d0..129585d3 100644
    --- a/Tools/Blog/Mkdocs_Material/index.html
    +++ b/Tools/Blog/Mkdocs_Material/index.html
    @@ -7,7 +7,7 @@
         .gdesc-inner { font-size: 0.75rem; }
         body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
         body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
    -    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

    +    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

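    The Complete mkdocs material Configuration Guide

    Still being revised.
    If you need any of the files, you can get them directly from this blog's GitHub page.

    1 Getting Started

    1.1 What is MkDocs

    MkDocs is a fast, simple, and good-looking static site generator built specifically for project documentation. Documentation source files are written in Markdown, and the configuration file uses YAML.

    1.1.1 Advantages of MkDocs

    1. Easy to use
      • Write documentation in Markdown
      • Simple, intuitive configuration file
      • One-step build and deployment

    2. Feature-rich
      • Built-in development server with live preview
      • Several themes to choose from
      • Custom themes supported
      • Automatically generated navigation
      • Full-text search

    3. Easy to deploy
      • Generates purely static pages
      • One command deploys to GitHub Pages
      • Custom domains supported
      • Works with any static hosting platform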

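    1.1.2 MkDocs vs. other documentation tools

    Tool | Advantages | Disadvantages
    MkDocs | Easy to use; focused on documentation; easy to deploy; many themes | Relatively basic features; smaller plugin ecosystem
    GitBook | Elegant interface; complete ecosystem; good multi-user collaboration | Slow builds; limited customization; many restrictions in the free tier
    Docusaurus | React stack; powerful; extensible | Steep learning curve; complex configuration; slower builds
    VuePress | Vue stack; highly customizable; many plugins | Fewer themes; tedious configuration; high learning cost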

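    1.1.3 How MkDocs works

    The MkDocs workflow is as follows:

    1. Writing documentation
      • Write documents in Markdown
      • Store them under the docs directory
      • Multi-level directory structures are supported

    2. Parsing configuration
      • Read the mkdocs.yml configuration file
      • Parse theme settings, plugin configuration, and so on
      • Generate the navigation structure

    3. Build process

    Text Only
    Markdown files -> parser -> HTML files
                   -> theme rendering
                   -> plugin processing
                   -> static asset handling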
    diff --git a/Tools/Environment/Ubuntu_setup/index.html b/Tools/Environment/Ubuntu_setup/index.html
    index b10f781f..c12b3812 100644
    --- a/Tools/Environment/Ubuntu_setup/index.html
    +++ b/Tools/Environment/Ubuntu_setup/index.html
    @@ -7,7 +7,7 @@
         .gdesc-inner { font-size: 0.75rem; }
         body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
         body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
    -    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

    +    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

    Ubuntu Setup

    Bash
    sudo visudo    # edits /etc/sudoers by default, with syntax checking; set the %sudo line to:
     %sudo   ALL=(ALL:ALL) NOPASSWD: ALL
     
    Bash
    git clone https://github.com/zsh-users/zsh-syntax-highlighting.git ${ZSH_CUSTOM:-~/.oh-my-zsh/custom}/plugins/zsh-syntax-highlighting
     
    diff --git a/Tools/Environment/environment/index.html b/Tools/Environment/environment/index.html
    index 26d98377..924be2da 100644
    --- a/Tools/Environment/environment/index.html
    +++ b/Tools/Environment/environment/index.html
    @@ -7,7 +7,7 @@
         .gdesc-inner { font-size: 0.75rem; }
         body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
         body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
    -    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}     
    +    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

    CMake Notes

    1 Building a minimal project

    • CMake command names are case-insensitive: uppercase, lowercase, and mixed case all work.
    Text Only
    mkdir build
     cd build
     cmake -G"MinGW Makefiles" ..
     cmake --build .
    diff --git a/Tools/Make/Makeflie/index.html b/Tools/Make/Makeflie/index.html
    index 507208fe..559b4440 100644
    --- a/Tools/Make/Makeflie/index.html
    +++ b/Tools/Make/Makeflie/index.html
    @@ -7,7 +7,7 @@
         .gdesc-inner { font-size: 0.75rem; }
         body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
         body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
    -    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

    +    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

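    Makefile

    Make Basics

    What is Make

    Make is a build automation tool that uses a Makefile to define how a program is compiled and linked. It compares file timestamps to decide which files need to be recompiled: a target is rebuilt when it is missing or older than any of its dependencies.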
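
The timestamp rule above can be illustrated outside of Make. This is a simplified sketch of the decision only, assuming every dependency file exists; the function name `needs_rebuild` is hypothetical, and real Make additionally handles missing prerequisites, phony targets, and chained rules.

```python
import os


def needs_rebuild(target: str, dependencies: list[str]) -> bool:
    """Return True if the target is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # A dependency modified after the target means the target is stale.
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)
```

For example, `needs_rebuild("main.o", ["main.c", "util.h"])` reports whether `main.o` would be recompiled under this rule; the file names here are placeholders.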

    Basic structure of a Makefile

    A Makefile is built from targets, dependencies, and commands, usually in this form:

    Text Only
    target: dependencies     
         command
     

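    Makefile example

    Consider a simple C project. The example shows how to use a Makefile to compile a program with several source and header files, and how that compares with compiling by hand on the command line.
    编译进阶 - HPC入门指南 (Advanced Compilation - HPC Getting Started Guide)

    Common make commands

    • make: runs the default goal, which is the first target in the Makefile; this equals make all only when all is that first target.
    • make <target>: runs the named <target>; if no such target exists, make reports an error.
    • make -j: runs the build in parallel. With no number, make places no limit on simultaneous jobs; make -j$(nproc) caps the jobs at the number of available processing units.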
    +    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

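    Syncing Configuration Across Devices with chezmoi

    This guide helps you manage your configuration files (dotfiles) with chezmoi and maintain your software list with a package manager.

    Preparation

    1. Required tools

    • Git
    • GitHub account
    • chezmoi
    • Package manager (Windows: Scoop, Ubuntu: apt/snap)

    2. Important configuration files

    Common configuration files on Windows:

    Text Only
    %USERPROFILE%/
     ├── .gitconfig                        # Git configuration
     ├── .ssh/                            # SSH configuration
     ├── Documents/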
    diff --git a/Tools/Others/SSH/index.html b/Tools/Others/SSH/index.html
    index ce03d837..91bcb75f 100644
    --- a/Tools/Others/SSH/index.html
    +++ b/Tools/Others/SSH/index.html
    @@ -7,7 +7,7 @@
         .gdesc-inner { font-size: 0.75rem; }
         body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
         body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
    -    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

    +    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

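    SSH Configuration Guide

    I. SSH Basics

    1. How SSH works

    SSH (Secure Shell) is an encrypted network protocol that gives network services a secure channel over an insecure network. Through encryption, SSH protects data in transit and can effectively prevent man-in-the-middle attacks, as long as the server's host key is verified.

    SSH workflow:
    1. TCP connection: the client and server establish a TCP connection (default port 22)
    2. Version negotiation: both sides exchange version information and settle on the SSH protocol version
    3. Key exchange: a session key is agreed using the Diffie-Hellman algorithm
    4. Authentication: identity is verified with a public key or a password
    5. Session: an encrypted communication channel is established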
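
The key-exchange step can be illustrated with the arithmetic behind Diffie-Hellman. This is a toy sketch only: the prime 23 and generator 5 are textbook values far too small for real use, the helper names `dh_keypair` and `dh_shared` are hypothetical, and real SSH implementations use large groups or elliptic-curve variants and also sign the exchange with the server's host key.

```python
import secrets

# Toy public parameters (assumption: textbook values, NOT secure).
P = 23  # prime modulus
G = 5   # generator


def dh_keypair(p: int, g: int) -> tuple[int, int]:
    """Pick a random private exponent and derive the public value g^x mod p."""
    private = secrets.randbelow(p - 2) + 1  # in the range 1 .. p-2
    public = pow(g, private, p)
    return private, public


def dh_shared(their_public: int, my_private: int, p: int) -> int:
    """Combine the peer's public value with our private exponent."""
    return pow(their_public, my_private, p)


client_private, client_public = dh_keypair(P, G)
server_private, server_public = dh_keypair(P, G)

# Only the public values cross the network; both sides derive the same secret.
client_secret = dh_shared(server_public, client_private, P)
server_secret = dh_shared(client_public, server_private, P)
print(client_secret == server_secret)
```

Both sides arrive at the same value because (g^a)^b and (g^b)^a are equal modulo p, while an eavesdropper sees only g^a and g^b.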

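    2. Authentication methods in detail

    2.1 Password authentication

    • The simplest but least secure method
    • Vulnerable to brute-force attacks
    • Not recommended in production

    2.2 Public-key authentication

    Authentication flow:
    1. The client sends its public key information to the server
    2. The server checks the authorized_keys file
    3. The server generates a random string, encrypts it with the public key, and sends it to the client
    4. The client decrypts it with the private key and returns the result to the server
    5. The server verifies the result, completing authentication

    This encrypt-and-decrypt challenge matches the legacy SSH-1 protocol. In SSH-2, the client instead signs data that includes the session identifier with its private key, and the server verifies that signature against the public key listed in authorized_keys.

    3. Security recommendations

    3.1 Basic security settings

    Bash
    # /etc/ssh/sshd_config security settings
     PermitRootLogin no                 # disallow direct root login
     PasswordAuthentication no          # disable password authentication
     PubkeyAuthentication yes          # enable public-key authentication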
    diff --git "a/Tools/Others/zotero_\344\275\277\347\224\250\346\214\207\345\215\227/index.html" "b/Tools/Others/zotero_\344\275\277\347\224\250\346\214\207\345\215\227/index.html"
    index 0e30bf3e..17a11c9f 100644
    --- "a/Tools/Others/zotero_\344\275\277\347\224\250\346\214\207\345\215\227/index.html"
    +++ "b/Tools/Others/zotero_\344\275\277\347\224\250\346\214\207\345\215\227/index.html"
    @@ -7,7 +7,7 @@
         .gdesc-inner { font-size: 0.75rem; }
         body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
         body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
    -    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

    +    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

    Tabby + Zsh Configuration Guide

    Prerequisites

    System requirements

    Bash
    # Ubuntu/Debian
     sudo apt update
     sudo apt install -y \
         git \
    diff --git a/Tools/index.html b/Tools/index.html
    index 8e0ef5fd..d4b20381 100644
    --- a/Tools/index.html
    +++ b/Tools/index.html
    @@ -7,7 +7,7 @@
         .gdesc-inner { font-size: 0.75rem; }
         body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
         body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
    -    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

    Toolbox

    Abstract

    Unless otherwise stated, the content of this section is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.


    About 🥳

    Welcome, I'm Wnc.

    Some Tags

    • Undergraduate (2024 intake) at the UM-SJTU Joint Institute, Shanghai Jiao Tong University
    • INTJ (Maybe)
    • Interested in AI, robotics and ...
    Ways to reach me

    You can find my email, QQ, and WeChat via the icons above.

    Feel free to contact me!
