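10.5 Neural Network Principles: From the Perceptron to the Transformer

A Brief History of Neural Networks: From Biological Inspiration to the AI Revolution

💡 A philosophical question: how can a system built from billions of simple "switches" (neurons) understand language, recognize images, and even create art?

This is what makes deep learning so fascinating: complex combinations of simple units give rise to intelligent behavior.

Timeline of neural network development: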

1943: McCulloch-Pitts neuron model
1958: Rosenblatt's perceptron
1986: Rumelhart, Hinton, and Williams popularize backpropagation
1998: LeCun's CNN (LeNet-5)
2006: Hinton's Deep Belief Networks
2012: AlexNet ignites the deep learning boom
2014: GAN (Generative Adversarial Network)
2017: Transformer architecture ← the foundation of modern LLMs
2018: BERT, GPT
2020: GPT-3
2022: ChatGPT

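This section is the theoretical high point of the course. We will:

Start from a single neuron and build up the mathematical foundations of deep learning
Derive the backpropagation algorithm, the core of deep learning
Study classic architectures such as CNN and RNN/LSTM
Dig into the Transformer to understand what powers modern large language models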

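Part 1: The Perceptron and the Neuron

1. Biological Neurons vs. Artificial Neurons

Biological neuron:

Dendrites (receive signals) → cell body (processes them) → axon (transmits) → synapse (connects to the next neuron)

Artificial neuron (perceptron):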

python

"""
Mathematical model:
    z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
    y = activation(z)

where:
- x: inputs
- w: weights (synaptic strengths)
- b: bias
- activation: activation function (decides whether the neuron "fires")
"""

import numpy as np

class Perceptron:
    """
    Single-layer perceptron: a binary classifier.

    Formula:
        y = sign(w·x + b)
        where sign(z) = 1 if z >= 0 else -1
    """

    def __init__(self, input_dim: int, learning_rate: float = 0.01):
        self.w = np.zeros(input_dim)  # weights start at 0
        self.b = 0.0  # bias
        self.lr = learning_rate

    def predict(self, x: np.ndarray) -> int:
        """Predict the class label (+1 or -1) for a single sample."""
        z = np.dot(self.w, x) + self.b
        return 1 if z >= 0 else -1

    def train(self, X: np.ndarray, y: np.ndarray, epochs: int = 100):
        """
        Train the perceptron.

        Learning rule:
            correct prediction: no update
            wrong prediction:   w = w + lr * y * x
                                b = b + lr * y
        """
        for epoch in range(epochs):
            errors = 0
            for xi, yi in zip(X, y):
                prediction = self.predict(xi)
                if prediction != yi:
                    # Shift the decision boundary toward the misclassified sample
                    self.w += self.lr * yi * xi
                    self.b += self.lr * yi
                    errors += 1

            if errors == 0:
                print(f"Converged at epoch {epoch + 1}")
                break

# Example: learning the AND gate
X_and = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])
y_and = np.array([-1, -1, -1, 1])  # only (1, 1) maps to +1

perceptron = Perceptron(input_dim=2, learning_rate=0.1)
perceptron.train(X_and, y_and)

print("Weights:", perceptron.w)
print("Bias:", perceptron.b)

# Test
for xi in X_and:
    print(f"Input: {xi}, prediction: {perceptron.predict(xi)}")
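Limitations of the perceptron:

python

# The XOR problem: a perceptron cannot solve problems that are not linearly separable
X_xor = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])
y_xor = np.array([-1, 1, 1, -1])  # XOR outputs

# No matter how long it trains, a single-layer perceptron cannot learn XOR
# Reason: no straight line separates the two XOR classes

💡 Key insight: a single-layer perceptron can only solve linearly separable problems. To solve XOR we need a multi-layer neural network.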
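The claim above can be checked empirically. The sketch below reruns the same perceptron learning rule on both AND and XOR; the helper name `train_perceptron` and the 200-epoch budget are illustrative choices, not part of the original class. On a separable problem the error count reaches zero, while on XOR every epoch contains at least one mistake.

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=200):
    """Run the perceptron rule and return (w, b, errors_per_epoch)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    history = []
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            pred = 1 if np.dot(w, xi) + b >= 0 else -1
            if pred != yi:
                w += lr * yi * xi
                b += lr * yi
                errors += 1
        history.append(errors)
        if errors == 0:  # a full pass without mistakes means convergence
            break
    return w, b, history

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([-1, -1, -1, 1])
y_xor = np.array([-1, 1, 1, -1])

_, _, and_history = train_perceptron(X, y_and)
_, _, xor_history = train_perceptron(X, y_xor)

print("AND: epochs run =", len(and_history), "final errors =", and_history[-1])
print("XOR: epochs run =", len(xor_history), "fewest errors in any epoch =", min(xor_history))
```

The XOR run always uses the full epoch budget: a mistake-free pass would require a line that classifies all four points correctly, and no such line exists.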

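Part 2: Multi-Layer Neural Networks and Backpropagation

1. The Multi-Layer Perceptron (MLP)

By stacking layers of neurons with nonlinear activations, a network can learn nonlinear functions.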

python

import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    """
    Multi-layer perceptron.

    Architecture:
        input layer → hidden layer → output layer
    """

    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int):
        super(MLP, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # Hidden layer: linear transform + nonlinear activation
        h = F.relu(self.fc1(x))
        # Output layer
        y = self.fc2(h)
        return y

# Solving the XOR problem
X_xor = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y_xor = torch.tensor([[0.], [1.], [1.], [0.]])

# Note: the weights are initialized randomly, so results vary between runs.
# With an unlucky initialization some ReLU units can die and the loss may plateau;
# if that happens, re-run or widen the hidden layer.
model = MLP(input_dim=2, hidden_dim=4, output_dim=1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Training
for epoch in range(5000):
    # Forward pass
    predictions = model(X_xor)
    loss = criterion(predictions, y_xor)

    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 1000 == 0:
        print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")

# Test
model.eval()
with torch.no_grad():
    predictions = model(X_xor)
    print("\nXOR predictions:")
    for x, y, pred in zip(X_xor, y_xor, predictions):
        print(f"Input: {x.numpy()}, target: {y.item():.0f}, prediction: {pred.item():.4f}")
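To see why one hidden layer is enough for XOR, it helps to look at a network whose weights are written down by hand rather than learned. The sketch below is an illustration only: the weights `W1`, `b1`, `w2`, `b2` are hand-picked assumptions, not what SGD would necessarily find. It uses two ReLU units, h1 = relu(x1 + x2) and h2 = relu(x1 + x2 - 1), and outputs y = h1 - 2·h2.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hand-picked weights (illustrative, not learned)
W1 = np.array([[1.0, 1.0],    # hidden unit 1: x1 + x2
               [1.0, 1.0]])   # hidden unit 2: x1 + x2 - 1 (after the bias)
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])
b2 = 0.0

H = relu(X @ W1.T + b1)   # hidden activations, shape (4, 2)
y = H @ w2 + b2           # network output, shape (4,)

for xi, yi in zip(X, y):
    print(f"input {xi} -> output {yi:.0f}")
```

The second hidden unit only switches on for the input (1, 1), where it cancels the contribution of the first unit. That "bend" is exactly what a single linear unit cannot produce.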

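2. Activation Functions: Introducing Nonlinearity

Without activation functions, a multi-layer network is equivalent to a single-layer one, because a composition of linear transformations is still linear.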
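The collapse of stacked linear layers can be verified numerically. In the sketch below, two linear layers with random weights are compared against a single layer with weights W2·W1 and bias W2·b1 + b2; the layer sizes and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear layers without any activation in between
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

two_layer = W2 @ (W1 @ x + b1) + b2

# The equivalent single linear layer
W_eff = W2 @ W1
b_eff = W2 @ b1 + b2
one_layer = W_eff @ x + b_eff

print("two layers :", two_layer)
print("one layer  :", one_layer)
print("identical up to rounding:", np.allclose(two_layer, one_layer))
```

Inserting a nonlinearity such as ReLU between the two layers breaks this identity, which is what gives depth its extra expressive power.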

python

import matplotlib.pyplot as plt
import numpy as np

# Common activation functions
x = np.linspace(-5, 5, 100)

# 1. Sigmoid: σ(x) = 1 / (1 + e^(-x))
sigmoid = 1 / (1 + np.exp(-x))

# 2. Tanh: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
tanh = np.tanh(x)

# 3. ReLU: max(0, x)
relu = np.maximum(0, x)

# 4. Leaky ReLU: max(0.01x, x)
leaky_relu = np.where(x > 0, x, 0.01 * x)

# 5. GELU (used in Transformers), tanh approximation of x·Φ(x)
gelu = 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Plot
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

activations = [
    (sigmoid, 'Sigmoid'),
    (tanh, 'Tanh'),
    (relu, 'ReLU'),
    (leaky_relu, 'Leaky ReLU'),
    (gelu, 'GELU'),
]

for i, (y, name) in enumerate(activations):
    ax = axes[i // 3, i % 3]
    ax.plot(x, y)
    ax.set_title(name)
    ax.grid(True)
    ax.axhline(0, color='black', linewidth=0.5)
    ax.axvline(0, color='black', linewidth=0.5)

# Five functions on a 2×3 grid leave the last panel empty, so hide it
axes[1, 2].axis('off')

plt.tight_layout()
plt.show()

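Activation function comparison:

Activation	Formula	Pros	Cons	Typical use
Sigmoid	σ(x) = 1/(1+e⁻ˣ)	Output in (0, 1), interpretable as a probability	Vanishing gradients	Output layer for binary classification
Tanh	tanh(x)	Output in (-1, 1), zero-centered	Vanishing gradients	Classic RNNs
ReLU	max(0, x)	Cheap to compute, mitigates vanishing gradients	Dying ReLU problem	Hidden layers of CNNs and MLPs
Leaky ReLU	max(αx, x), 0 < α < 1	Avoids dying ReLU	Extra hyperparameter α	Deep networks
GELU	x·Φ(x), Φ = standard normal CDF	Smooth, strong empirical performance	More expensive than ReLU	Transformer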
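The "vanishing gradients" entries in the table can be made concrete with a little arithmetic. The sketch below evaluates the derivatives of sigmoid, tanh, and ReLU on a grid and then raises the largest sigmoid derivative to the power of the network depth. This is a best-case bound on the activation factors alone; it deliberately ignores the weight matrices, which is an assumption made to keep the example small.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-10, 10, 2001)

sig_grad = sigmoid(x) * (1 - sigmoid(x))   # σ'(x) = σ(x)(1 - σ(x))
tanh_grad = 1 - np.tanh(x) ** 2            # tanh'(x) = 1 - tanh²(x)
relu_grad = (x > 0).astype(float)          # ReLU'(x) = 1 for x > 0, else 0

max_sig = sig_grad.max()
print(f"largest sigmoid derivative: {max_sig:.4f}")
print(f"largest tanh derivative:    {tanh_grad.max():.4f}")
print(f"largest ReLU derivative:    {relu_grad.max():.4f}")

# Backpropagation multiplies one such factor per layer
for depth in (5, 10, 20):
    print(f"depth {depth:2d}: sigmoid factor at most {max_sig ** depth:.3e}")
```

Because the sigmoid derivative never exceeds 0.25, the product shrinks geometrically with depth, whereas ReLU passes a factor of exactly 1 through every active unit.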

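3. Backpropagation: The Core of Deep Learning

📐 Mathematical derivation: backpropagation is, at heart, an application of the chain rule.

A simple example: a two-layer network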

python

"""
Network structure:
    x → [w1] → h → [w2] → y

Forward pass:
    h = σ(w1 * x)
    y = w2 * h

Loss function:
    L = (y - y_true)²

Backward pass (computing the gradients):
    ∂L/∂w2 = ∂L/∂y * ∂y/∂w2 = 2(y - y_true) * h
    ∂L/∂w1 = ∂L/∂y * ∂y/∂h * ∂h/∂w1
            = 2(y - y_true) * w2 * σ'(w1*x) * x
"""

class TwoLayerNet:
    """A two-layer network with scalar weights, implemented from scratch."""

    def __init__(self):
        # Random weight initialization
        self.w1 = np.random.randn()
        self.w2 = np.random.randn()

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def sigmoid_derivative(self, x):
        """Sigmoid derivative: σ'(x) = σ(x)(1 - σ(x))"""
        s = self.sigmoid(x)
        return s * (1 - s)

    def forward(self, x):
        """Forward pass; caches intermediate values for the backward pass."""
        self.x = x
        self.z1 = self.w1 * x
        self.h = self.sigmoid(self.z1)
        self.y = self.w2 * self.h
        return self.y

    def backward(self, y_true, learning_rate=0.1):
        """Backward pass followed by one gradient descent step."""
        # Gradient of the loss with respect to the output
        loss_gradient = 2 * (self.y - y_true)

        # Output-layer gradient
        grad_w2 = loss_gradient * self.h

        # Hidden-layer gradient (chain rule)
        grad_h = loss_gradient * self.w2
        grad_z1 = grad_h * self.sigmoid_derivative(self.z1)
        grad_w1 = grad_z1 * self.x

        # Update the weights
        self.w2 -= learning_rate * grad_w2
        self.w1 -= learning_rate * grad_w1

        # Loss of the prediction made before this update
        return (self.y - y_true) ** 2

# Training
net = TwoLayerNet()
# Plain floats keep the weights and the loss scalar, so the ':.4f' format
# specifiers below work (one-element arrays would raise a TypeError there)
x_train = 0.5
y_train = 0.8

for epoch in range(1000):
    y_pred = net.forward(x_train)
    loss = net.backward(y_train)

    if (epoch + 1) % 200 == 0:
        print(f"Epoch {epoch+1}: Loss = {loss:.6f}, w1 = {net.w1:.4f}, w2 = {net.w2:.4f}")

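General backpropagation formulas:

For any layer l, with z_l = W_l a_{l-1} + b_l and a_l = σ(z_l):

δ_l = ∂L/∂z_l  (error term)

1. Output layer (l = N, the last layer): δ_N = ∂L/∂a_N ⊙ σ'(z_N)
2. Hidden layers: δ_l = (W_{l+1}^T δ_{l+1}) ⊙ σ'(z_l)
3. Weight gradient: ∂L/∂W_l = δ_l a_{l-1}^T
4. Bias gradient: ∂L/∂b_l = δ_l

Here ⊙ denotes element-wise multiplication.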
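The four formulas above translate almost line by line into NumPy. The sketch below is a minimal illustration under stated assumptions: sigmoid activations in every layer, the loss 0.5·‖a − y‖², a single input vector, and arbitrary layer sizes. The function names `forward`, `backward`, and `loss_fn` are hypothetical helpers. A finite-difference check on one weight confirms that the analytic gradient matches.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    """Return the pre-activations z_l and activations a_l of every layer."""
    a, zs, activations = x, [], [x]
    for W, b in params:
        z = W @ a + b
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)
    return zs, activations

def loss_fn(params, x, y):
    _, activations = forward(params, x)
    return 0.5 * np.sum((activations[-1] - y) ** 2)

def backward(params, x, y):
    """Return a list of (dW, db) pairs, one per layer."""
    zs, activations = forward(params, x)
    grads = [None] * len(params)
    # Formula 1: output-layer error, with dLoss/da = a - y for this loss
    s = sigmoid(zs[-1])
    delta = (activations[-1] - y) * s * (1 - s)
    for l in reversed(range(len(params))):
        # Formulas 3 and 4: weight and bias gradients
        grads[l] = (np.outer(delta, activations[l]), delta.copy())
        if l > 0:
            # Formula 2: propagate the error to the previous layer
            s = sigmoid(zs[l - 1])
            delta = (params[l][0].T @ delta) * s * (1 - s)
    return grads

rng = np.random.default_rng(0)
params = [(rng.normal(size=(4, 3)), rng.normal(size=4)),
          (rng.normal(size=(2, 4)), rng.normal(size=2))]
x, y = rng.normal(size=3), np.array([0.0, 1.0])

grads = backward(params, x, y)

# Finite-difference check on a single entry of the first weight matrix
eps = 1e-6
W1, b1 = params[0]
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss_fn([(W1_plus, b1), params[1]], x, y)
           - loss_fn([(W1_minus, b1), params[1]], x, y)) / (2 * eps)
analytic = grads[0][0][0, 0]

print(f"analytic gradient: {analytic:.8f}")
print(f"numeric gradient:  {numeric:.8f}")
```

A gradient check like this is a cheap way to catch index and transpose mistakes in a hand-written backward pass before trusting it for training.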

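Part 3: Convolutional Neural Networks (CNN)

1. Why Do We Need CNNs?

The problem with fully connected layers:

A 28×28×3 image has 2,352 input dimensions
A first layer with 1,000 neurons → 2,352,000 weights (plus 1,000 biases)
That many parameters overfit easily and are expensive to compute

Core ideas of a CNN:

Local connectivity: each neuron looks at only a small patch of the image
Weight sharing: the same kernel slides across the whole image
Spatial hierarchy: lower layers detect edges, higher layers detect complex features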
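The effect of local connectivity and weight sharing on the parameter count is easy to work out. The sketch below compares the fully connected layer from the example above with a convolutional layer; the choice of 32 filters of size 3×3 is an illustrative assumption, not a figure from the text.

```python
# Fully connected layer on a flattened 28×28×3 image
height, width, channels = 28, 28, 3
hidden_units = 1000

fc_weights = height * width * channels * hidden_units
fc_params = fc_weights + hidden_units          # weights + biases

# Convolutional layer: 32 filters of size 3×3 over 3 input channels (assumed sizes)
kernel_size, out_channels = 3, 32
conv_params = kernel_size * kernel_size * channels * out_channels + out_channels

print(f"fully connected: {fc_params:,} parameters")   # 2,353,000
print(f"convolutional:   {conv_params:,} parameters")  # 896
print(f"ratio:           {fc_params / conv_params:.0f}x")
```

The convolutional layer's parameter count depends only on the kernel size and channel counts, not on the image resolution, which is why CNNs scale to large images.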

2. The Convolution Operation

python

import torch
import torch.nn as nn

# Manual 2D convolution
def conv2d_manual(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """
    Manual 2D convolution (no padding, stride 1).

    As in most deep learning libraries, the kernel is not flipped, so this is
    strictly speaking a cross-correlation.

    Args:
        image: (H, W) image
        kernel: (k, k) square kernel

    Returns:
        Feature map of shape (H - k + 1, W - k + 1)
    """
    H, W = image.shape
    k = kernel.shape[0]
    output_h = H - k + 1
    output_w = W - k + 1

    output = np.zeros((output_h, output_w))

    for i in range(output_h):
        for j in range(output_w):
            # Extract the receptive field
            patch = image[i:i+k, j:j+k]
            # Element-wise product, then sum
            output[i, j] = np.sum(patch * kernel)

    return output

# Example: edge-detection kernels
image = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0]
], dtype=float)

# Vertical edge detection
vertical_kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1]
])

# Horizontal edge detection
horizontal_kernel = np.array([
    [-1, -1, -1],
    [ 0,  0,  0],
    [ 1,  1,  1]
])

vertical_edges = conv2d_manual(image, vertical_kernel)
horizontal_edges = conv2d_manual(image, horizontal_kernel)

print("Original image:")
print(image)
print("\nVertical edges:")
print(vertical_edges)
print("\nHorizontal edges:")
print(horizontal_edges)

3. Components of a CNN Architecture

python

import torch.nn.functional as F

class ConvNet(nn.Module):
    """
    A classic CNN architecture.

    Structure:
        Conv → ReLU → Pool → Conv → ReLU → Pool → FC → ReLU → FC
    """

    def __init__(self, num_classes: int = 10):
        super(ConvNet, self).__init__()

        # Convolutional layer 1
        self.conv1 = nn.Conv2d(
            in_channels=1,      # input channels (grayscale image)
            out_channels=32,    # output channels (number of feature maps)
            kernel_size=3,      # kernel size
            stride=1,           # stride
            padding=1           # padding
        )

        # Convolutional layer 2
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)

        # Pooling layer
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

        # Fully connected layers
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, x):
        # Input: (batch, 1, 28, 28)

        # Conv layer 1 + ReLU + pooling
        x = self.conv1(x)         # (batch, 32, 28, 28)
        x = F.relu(x)
        x = self.pool(x)          # (batch, 32, 14, 14)

        # Conv layer 2 + ReLU + pooling
        x = self.conv2(x)         # (batch, 64, 14, 14)
        x = F.relu(x)
        x = self.pool(x)          # (batch, 64, 7, 7)

        # Flatten
        x = x.view(-1, 64 * 7 * 7)  # (batch, 3136)

        # Fully connected layers
        x = self.fc1(x)           # (batch, 128)
        x = F.relu(x)
        x = self.fc2(x)           # (batch, num_classes)

        return x

model = ConvNet()
print(model)

# Count the parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"\nTotal parameters: {total_params:,}")

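Key concepts:

Padding: pads the image border with zeros, typically so that the output keeps the input's spatial size

output size = floor((input size - kernel size + 2*padding) / stride) + 1

Pooling: reduces the spatial dimensions
- Max pooling: takes the maximum value in each window
- Average pooling: takes the mean value in each window
Receptive field: the size of the input region a neuron "sees"
- Stacking convolutional layers → larger receptive field
- Neurons in higher layers can "see" a larger region of the image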
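The output-size formula and the growth of the receptive field can both be traced through the ConvNet defined earlier. The helpers below, `conv_output_size` and `receptive_field`, are hypothetical names introduced for this sketch; the receptive-field recurrence used here is the standard one for stacked layers with square kernels.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """floor((size - kernel + 2*padding) / stride) + 1"""
    return (size - kernel + 2 * padding) // stride + 1

def receptive_field(layers):
    """layers: list of (kernel, stride) pairs, ordered from input to output."""
    rf, jump = 1, 1   # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Spatial size through ConvNet: conv(3, pad 1) → pool(2) → conv(3, pad 1) → pool(2)
size = 28
size = conv_output_size(size, kernel=3, stride=1, padding=1)   # 28
size = conv_output_size(size, kernel=2, stride=2)              # 14
size = conv_output_size(size, kernel=3, stride=1, padding=1)   # 14
size = conv_output_size(size, kernel=2, stride=2)              # 7
print("final feature map size:", size)

# Receptive field of one output unit after the same four layers
convnet_rf = receptive_field([(3, 1), (2, 2), (3, 1), (2, 2)])
print("receptive field after ConvNet's conv/pool stack:", convnet_rf)

# Three stacked 3×3 convolutions cover the same area as a single 7×7 kernel
print("three 3x3 convs:", receptive_field([(3, 1)] * 3))
```

The final size of 7 is where the `64 * 7 * 7` input dimension of the first fully connected layer comes from.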

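Part 4: Recurrent Neural Networks (RNN) and LSTM

1. RNN: Processing Sequential Data

Why do we need RNNs?

Feedforward networks expect fixed-size inputs, so they cannot handle variable-length sequences directly
They keep no memory of earlier inputs

The core idea of an RNN: maintain a hidden state that carries information from one time step to the next.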

python

class SimpleRNN(nn.Module):
    """
    A simple RNN cell.

    Formulas:
        h_t = tanh(W_hh * h_{t-1} + W_xh * x_t + b_h)
        y_t = W_hy * h_t + b_y

    Note: W_xh and W_hh are both nn.Linear layers with their own bias, so b_h
    is effectively the sum of two bias vectors.
    """

    def __init__(self, input_size: int, hidden_size: int, output_size: int):
        super(SimpleRNN, self).__init__()

        self.hidden_size = hidden_size

        # Weight matrices
        self.W_xh = nn.Linear(input_size, hidden_size)   # input to hidden
        self.W_hh = nn.Linear(hidden_size, hidden_size)  # hidden to hidden
        self.W_hy = nn.Linear(hidden_size, output_size)  # hidden to output

    def forward(self, x, h_prev):
        """
        Args:
            x: current input (batch, input_size)
            h_prev: hidden state from the previous step (batch, hidden_size)

        Returns:
            y: output (batch, output_size)
            h: new hidden state (batch, hidden_size)
        """
        # Update the hidden state
        h = torch.tanh(self.W_xh(x) + self.W_hh(h_prev))

        # Compute the output
        y = self.W_hy(h)

        return y, h

# Processing a sequence
rnn = SimpleRNN(input_size=10, hidden_size=20, output_size=5)

# Initialize the hidden state
batch_size = 3
h = torch.zeros(batch_size, 20)

# Process one time step at a time
sequence_length = 5
for t in range(sequence_length):
    x_t = torch.randn(batch_size, 10)  # input at the current time step
    y_t, h = rnn(x_t, h)  # update the hidden state
    print(f"Step {t+1}: output shape {y_t.shape}, hidden state shape {h.shape}")

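2. LSTM: Long Short-Term Memory Networks

The problem with RNNs: gradients vanish or explode as they are propagated back through many time steps, which makes long-term dependencies hard to learn.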
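The vanishing-gradient problem can be observed directly by multiplying the per-step Jacobians of a tanh RNN. The sketch below is a simplified illustration: inputs are omitted, and the recurrent matrix `W_hh` is scaled so that its largest singular value is 0.5, an assumption chosen to make the decay easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, steps = 8, 50

# Orthogonal matrix scaled by 0.5, so every singular value of W_hh equals 0.5
Q, _ = np.linalg.qr(rng.normal(size=(hidden_size, hidden_size)))
W_hh = 0.5 * Q

h = rng.normal(size=hidden_size)
jacobian_product = np.eye(hidden_size)
norms = []
for t in range(steps):
    h = np.tanh(W_hh @ h)
    # d h_t / d h_{t-1} = diag(1 - h_t²) · W_hh
    step_jacobian = np.diag(1 - h ** 2) @ W_hh
    jacobian_product = step_jacobian @ jacobian_product
    # Spectral norm of d h_t / d h_0
    norms.append(np.linalg.norm(jacobian_product, 2))

for t in (0, 9, 24, 49):
    print(f"after {t + 1:2d} steps: ||d h_t / d h_0|| = {norms[t]:.3e}")
```

Each step multiplies the gradient by a factor of at most 0.5 here, so the signal from 50 steps back is numerically negligible. With singular values above 1 and unsaturated units the same product can grow instead, which is the exploding-gradient case.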

The LSTM solution: introduce gating mechanisms that selectively remember and forget information.

python

"""
LSTM architecture:

1. Forget gate: decides which information to discard
   f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

2. Input gate: decides which information to update
   i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
   C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)

3. Cell state update
   C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t

4. Output gate: decides what to output
   o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
   h_t = o_t ⊙ tanh(C_t)
"""

class LSTMCell(nn.Module):
    """An LSTM cell implemented from scratch."""

    def __init__(self, input_size: int, hidden_size: int):
        super(LSTMCell, self).__init__()

        self.input_size = input_size
        self.hidden_size = hidden_size

        # Weights of the three gates and the candidate state,
        # merged into one matrix for efficiency
        self.W = nn.Linear(input_size + hidden_size, 4 * hidden_size)

    def forward(self, x, h_prev, c_prev):
        """
        Args:
            x: input (batch, input_size)
            h_prev: previous hidden state (batch, hidden_size)
            c_prev: previous cell state (batch, hidden_size)

        Returns:
            h: new hidden state
            c: new cell state
        """
        # Concatenate the input and the hidden state
        combined = torch.cat([x, h_prev], dim=1)

        # Compute all gate pre-activations in one matrix multiplication
        gates = self.W(combined)

        # Split into the three gates and the candidate state
        i, f, g, o = gates.chunk(4, dim=1)

        # Apply the activation functions
        i = torch.sigmoid(i)  # input gate
        f = torch.sigmoid(f)  # forget gate
        g = torch.tanh(g)     # candidate cell state
        o = torch.sigmoid(o)  # output gate

        # Update the cell state
        c = f * c_prev + i * g

        # Compute the hidden state
        h = o * torch.tanh(c)

        return h, c

# Using PyTorch's built-in LSTM
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2, batch_first=True)

# Input: (batch, seq_len, input_size)
x = torch.randn(3, 5, 10)  # 3 samples, sequence length 5, feature dimension 10

# output: (batch, seq_len, hidden_size)
# h_n, c_n: (num_layers, batch, hidden_size)
output, (h_n, c_n) = lstm(x)

print(f"Output shape: {output.shape}")
print(f"Final hidden state: {h_n.shape}")
print(f"Final cell state: {c_n.shape}")

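Part 5: The Transformer Architecture in Detail

🚀 A landmark innovation: in 2017, researchers at Google published the paper "Attention Is All You Need", which introduced the Transformer architecture and reshaped the field of NLP.

1. Why Do We Need the Transformer?

Problems with RNN/LSTM:

Sequential computation: tokens must be processed in order, so the work cannot be parallelized across time steps
Long-range dependencies: even with LSTM, very long sequences remain difficult
Low computational efficiency: training is slow

Advantages of the Transformer:

Parallelization: during training, all positions are computed at once
Long-range dependencies: modeled directly through self-attention
Scalability: easy to scale up to very large models

2. Self-Attention

💡 Core idea: let every word in the sequence "attend" to every other word.

Intuition:

Input: The cat sat on the mat

"sat" should attend to:
- "cat" (subject)
- "on" (preposition)
- "mat" (object of the preposition)

The attention mechanism can learn such relations from data.

Mathematical formulation: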

python

"""
Self-attention formulas:

1. Compute Query, Key, Value
   Q = X · W_Q
   K = X · W_K
   V = X · W_V

2. Compute the attention scores
   scores = Q · K^T / sqrt(d_k)

3. Normalize with softmax
   attention_weights = softmax(scores)

4. Weighted sum
   output = attention_weights · V
"""

import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Scaled dot-product attention.

    Args:
        Q: Query (batch, seq_len, d_k)
        K: Key (batch, seq_len, d_k)
        V: Value (batch, seq_len, d_v)
        mask: attention mask (optional); positions where mask == 0 are hidden

    Returns:
        output: (batch, seq_len, d_v)
        attention_weights: (batch, seq_len, seq_len)
    """
    d_k = Q.size(-1)

    # 1. Compute the attention scores
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    # scores: (batch, seq_len, seq_len)

    # 2. Apply the mask (optional, e.g. to hide future positions)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)

    # 3. Normalize with softmax
    attention_weights = F.softmax(scores, dim=-1)

    # 4. Weighted sum
    output = torch.matmul(attention_weights, V)

    return output, attention_weights

# Example
batch_size = 2
seq_len = 4
d_k = 64

Q = torch.randn(batch_size, seq_len, d_k)
K = torch.randn(batch_size, seq_len, d_k)
V = torch.randn(batch_size, seq_len, d_k)

output, attn_weights = scaled_dot_product_attention(Q, K, V)

print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attn_weights.shape}")
print(f"\nAttention weight matrix (first sample):")
print(attn_weights[0].detach().numpy())
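The optional mask is what keeps a position from looking at later positions. The sketch below rebuilds the same four attention steps in NumPy so that it runs on its own, and applies a lower-triangular (causal) mask; the helper names `softmax` and `attention` and the tiny sizes are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    if mask is not None:
        # Same convention as above: positions where mask == 0 are hidden
        scores = np.where(mask == 0, -1e9, scores)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

seq_len, d_k = 4, 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(1, seq_len, d_k))
K = rng.normal(size=(1, seq_len, d_k))
V = rng.normal(size=(1, seq_len, d_k))

# Lower-triangular mask: position i may attend to positions 0..i only
causal_mask = np.tril(np.ones((seq_len, seq_len)))
print(causal_mask)

output, weights = attention(Q, K, V, mask=causal_mask)
print(np.round(weights[0], 2))
```

In the printed weight matrix everything above the diagonal is zero, each row still sums to 1, and the first position can only attend to itself.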

Visualizing attention weights:

python

import matplotlib.pyplot as plt
import seaborn as sns

# Example sentence
sentence = ["The", "cat", "sat", "on", "the", "mat"]
seq_len = len(sentence)

# Random attention weights for illustration (in practice, take them from a model)
attn_weights = torch.softmax(torch.randn(seq_len, seq_len), dim=-1)

# Plot
plt.figure(figsize=(8, 6))
sns.heatmap(attn_weights.numpy(), annot=True, fmt='.2f',
            xticklabels=sentence, yticklabels=sentence,
            cmap='YlOrRd')
plt.title('Self-Attention Weights')
plt.xlabel('Key')
plt.ylabel('Query')
plt.show()

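3. Multi-Head Attention

Why multiple heads?

A single attention head produces only one set of attention weights per position, which limits the patterns it can capture
Multiple heads can learn several kinds of relations in parallel (syntactic, semantic, positional, and so on)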

python

class MultiHeadAttention(nn.Module):
    """
    Multi-head attention.

    Formulas:
        MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W_O
        where head_i = Attention(Q·W_Q^i, K·W_K^i, V·W_V^i)
    """

    def __init__(self, d_model: int, num_heads: int):
        super(MultiHeadAttention, self).__init__()

        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # dimension of each head

        # Linear projections
        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)
        self.W_O = nn.Linear(d_model, d_model)

    def split_heads(self, x):
        """
        Split the last dimension into (num_heads, d_k).

        Args:
            x: (batch, seq_len, d_model)

        Returns:
            (batch, num_heads, seq_len, d_k)
        """
        batch_size, seq_len, d_model = x.size()
        return x.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)

        # 1. Linear projections
        Q = self.W_Q(Q)  # (batch, seq_len, d_model)
        K = self.W_K(K)
        V = self.W_V(V)

        # 2. Split into heads
        Q = self.split_heads(Q)  # (batch, num_heads, seq_len, d_k)
        K = self.split_heads(K)
        V = self.split_heads(V)

        # 3. Scaled dot-product attention
        d_k = Q.size(-1)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)

        # The mask must broadcast to (batch, num_heads, query_len, key_len)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)

        attention_weights = F.softmax(scores, dim=-1)
        output = torch.matmul(attention_weights, V)
        # output: (batch, num_heads, seq_len, d_k)

        # 4. Merge the heads
        output = output.transpose(1, 2).contiguous()
        output = output.view(batch_size, -1, self.d_model)
        # output: (batch, seq_len, d_model)

        # 5. Final linear projection
        output = self.W_O(output)

        return output, attention_weights

# Test
d_model = 512
num_heads = 8
seq_len = 10
batch_size = 2

mha = MultiHeadAttention(d_model, num_heads)
x = torch.randn(batch_size, seq_len, d_model)

output, attn_weights = mha(x, x, x)
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attn_weights.shape}")
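The reshape and transpose in `split_heads`, and the inverse used when merging, are the parts of multi-head attention that most often go wrong. The NumPy sketch below mirrors those two steps with the same sizes as the test above, so the shapes can be checked in isolation; using NumPy instead of PyTorch here is a choice made only to keep the example dependency-light.

```python
import numpy as np

batch, seq_len, d_model, num_heads = 2, 10, 512, 8
d_k = d_model // num_heads

# Distinct values make it easy to see where each element ends up
x = np.arange(batch * seq_len * d_model, dtype=np.float32).reshape(batch, seq_len, d_model)

# Split: (batch, seq_len, d_model) → (batch, num_heads, seq_len, d_k)
heads = x.reshape(batch, seq_len, num_heads, d_k).transpose(0, 2, 1, 3)
print("split shape: ", heads.shape)

# Head 0 holds the first d_k features of every token, head 1 the next d_k, and so on
print("head 0 == first d_k features:", np.array_equal(heads[:, 0], x[:, :, :d_k]))

# Merge: transpose back, then flatten the last two axes
merged = heads.transpose(0, 2, 1, 3).reshape(batch, seq_len, d_model)
print("merged shape:", merged.shape)
print("round trip is lossless:", np.array_equal(merged, x))
```

Skipping the transpose before the final reshape would still produce the right shape but would interleave features from different tokens, which is why the merge step above transposes first.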

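4. Positional Encoding

The problem: attention by itself is insensitive to position, so it cannot tell the order of the words.

The solution: add position information to the input embeddings.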
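The order-blindness of attention can be demonstrated in a few lines. In the sketch below, a single-head self-attention layer without positional encoding is applied to a sequence and to a shuffled copy of it; the output of the shuffled input is just the shuffled output. The helper `self_attention`, the sizes, and the permutation are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention with no positional information."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16
X = rng.normal(size=(seq_len, d_model))                      # token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
                 for _ in range(3))

perm = np.array([4, 2, 0, 3, 1])   # an arbitrary reordering of the tokens

out = self_attention(X, W_q, W_k, W_v)
out_shuffled = self_attention(X[perm], W_q, W_k, W_v)

# Shuffling the input only shuffles the output rows in the same way
print(np.allclose(out_shuffled, out[perm]))
```

Each token therefore receives the same representation no matter where it appears in the sentence, which is why position information has to be added to the embeddings explicitly.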

python

class PositionalEncoding(nn.Module):
    """
    Sinusoidal positional encoding.

    Formulas:
        PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
        PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

    where:
        pos: position in the sequence
        i: index of the sine/cosine dimension pair

    This implementation assumes that d_model is even.
    """

    def __init__(self, d_model: int, max_len: int = 5000):
        super(PositionalEncoding, self).__init__()

        # Build the positional encoding matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        # 1 / 10000^(2i/d_model), computed in log space
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

        # Apply sine and cosine
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions

        pe = pe.unsqueeze(0)  # (1, max_len, d_model)
        # A buffer is saved with the module but is not a trainable parameter
        self.register_buffer('pe', pe)

    def forward(self, x):
        """
        Args:
            x: (batch, seq_len, d_model)

        Returns:
            x + positional encoding
        """
        seq_len = x.size(1)
        return x + self.pe[:, :seq_len, :]

# Visualize the positional encoding
d_model = 128
max_len = 100

pos_enc = PositionalEncoding(d_model, max_len)
pe_matrix = pos_enc.pe.squeeze(0).numpy()

plt.figure(figsize=(15, 5))
plt.imshow(pe_matrix.T, aspect='auto', cmap='RdBu')
plt.colorbar()
plt.xlabel('Position')
plt.ylabel('Dimension')
plt.title('Positional Encoding')
plt.show()

5. A Complete Transformer Encoder Layer

python

class TransformerEncoderLayer(nn.Module):
    """
    Transformer encoder layer.

    Structure:
        input → Multi-Head Attention → Add & Norm
              → Feed-Forward → Add & Norm → output
    """

    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super(TransformerEncoderLayer, self).__init__()

        # Multi-head attention
        self.self_attn = MultiHeadAttention(d_model, num_heads)

        # Feed-forward network
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )

        # Layer normalization
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        # Dropout
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        """
        Args:
            x: (batch, seq_len, d_model)
            mask: attention mask

        Returns:
            output: (batch, seq_len, d_model)
        """
        # 1. Multi-head self-attention + residual connection + layer norm
        attn_output, _ = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))

        # 2. Feed-forward network + residual connection + layer norm
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))

        return x

# The full Transformer encoder
class TransformerEncoder(nn.Module):
    """
    Full Transformer encoder: embedding, positional encoding, and a stack of
    encoder layers.
    """

    def __init__(
        self,
        vocab_size: int,
        d_model: int = 512,
        num_heads: int = 8,
        num_layers: int = 6,
        d_ff: int = 2048,
        max_len: int = 5000,
        dropout: float = 0.1
    ):
        super(TransformerEncoder, self).__init__()

        # Token embedding
        self.embedding = nn.Embedding(vocab_size, d_model)

        # Positional encoding
        self.pos_encoding = PositionalEncoding(d_model, max_len)

        # Transformer layers
        self.layers = nn.ModuleList([
            TransformerEncoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])

        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        """
        Args:
            x: (batch, seq_len) sequence of token IDs

        Returns:
            output: (batch, seq_len, d_model)
        """
        # Token embedding, scaled by sqrt(d_model)
        x = self.embedding(x) * math.sqrt(self.embedding.embedding_dim)

        # Positional encoding
        x = self.pos_encoding(x)
        x = self.dropout(x)

        # Pass through all Transformer layers
        for layer in self.layers:
            x = layer(x, mask)

        return x

# Test the full model
vocab_size = 10000
model = TransformerEncoder(vocab_size)

# Input: a batch of token ID sequences
batch_size = 2
seq_len = 20
input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))

output = model(input_ids)
print(f"Output shape: {output.shape}")
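To get a feel for the size of this encoder, its parameters can be counted by hand. The arithmetic below assumes the defaults used above (d_model = 512, d_ff = 2048, 6 layers, a vocabulary of 10,000) and that every `nn.Linear` and `nn.LayerNorm` keeps its default bias and affine parameters. The positional encoding is registered as a buffer, so it contributes nothing.

```python
vocab_size, d_model, num_layers, d_ff = 10000, 512, 6, 2048

# Token embedding table
embedding = vocab_size * d_model

# Multi-head attention: W_Q, W_K, W_V, W_O, each d_model × d_model plus a bias.
# The number of heads does not change the count: it only splits d_model.
attention = 4 * (d_model * d_model + d_model)

# Feed-forward network: d_model → d_ff → d_model, with biases
feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

# Two LayerNorms per layer, each with a scale and a shift vector
layer_norms = 2 * (2 * d_model)

per_layer = attention + feed_forward + layer_norms
total = embedding + num_layers * per_layer

print(f"embedding:        {embedding:>12,}")
print(f"attention/layer:  {attention:>12,}")
print(f"ffn/layer:        {feed_forward:>12,}")
print(f"layernorm/layer:  {layer_norms:>12,}")
print(f"per layer:        {per_layer:>12,}")
print(f"total:            {total:>12,}")   # 24,034,304
```

About two thirds of each layer's parameters sit in the feed-forward network rather than in attention. The total can be compared against `sum(p.numel() for p in model.parameters())` for the model built above.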

6. The Transformer Decoder (for Generation Tasks)

python

class TransformerDecoderLayer(nn.Module):
    """
    Transformer decoder layer.

    Structure:
        input → Masked Self-Attention → Add & Norm
              → Cross-Attention → Add & Norm
              → Feed-Forward → Add & Norm → output
    """

    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super(TransformerDecoderLayer, self).__init__()

        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

        self.dropout = nn.Dropout(dropout)

    def forward(self, x, encoder_output, src_mask=None, tgt_mask=None):
        # 1. Masked self-attention (tgt_mask hides future positions)
        attn_output, _ = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn_output))

        # 2. Cross-attention: queries from the decoder, keys and values from the encoder output
        attn_output, _ = self.cross_attn(x, encoder_output, encoder_output, src_mask)
        x = self.norm2(x + self.dropout(attn_output))

        # 3. Feed-forward
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))

        return x

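Summary

In this section we traveled from the basics to the state of the art:

✅ The perceptron and the neuron

Biological vs. artificial neurons
Limitations of the single-layer perceptron

✅ Multi-layer neural networks

Derivation of the backpropagation algorithm
The role of activation functions

✅ CNN: conquering vision

Convolution and pooling
Receptive fields and weight sharing

✅ RNN/LSTM: processing sequences

Hidden states and memory
Gating units

✅ Transformer: the foundation of modern AI

Self-attention
Multi-head attention
Positional encoding
Encoder and decoder layers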

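Exercises

Basic

Compute by hand the result of convolving a 5x5 image with a 3x3 kernel
Implement a two-layer fully connected network and train it on the XOR problem with backpropagation
Visualize the gating mechanism of an LSTM

Intermediate

Implement a simple CNN from scratch and reach 95% accuracy on MNIST
Implement a Bi-LSTM (bidirectional LSTM) for sentiment classification
Implement a Seq2Seq model with attention

Challenge

Implement a complete Transformer from scratch for machine translation
Analyze how the number of attention heads affects model performance
Implement Transformer-XL (for very long sequences)

Next section: 10.6 Large Language Models: Transformers and Modern NLP

In the next section we will learn to use the Hugging Face Transformers library, work with pretrained models such as BERT and GPT, and apply fine-tuning and prompt engineering.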
