Zynq in Practice 21 | The Full Vitis AI Flow: Running a Trained Model on a PL Accelerator

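This is part 21 of the series "Zynq FPGA Embedded System Design in Practice". Board: Pynq-Z2 (XC7Z020-1CLG400C). Toolchain: Vivado / Vitis / PetaLinux 2023.2, Vitis AI 3.5. Previous post: "Zynq in Practice 20 | Capstone Project: Building a Complete Embedded System from Scratch"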

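0. What problem this post solves

You have trained a MobileNetV2 classifier on a PC and it reaches 71.8% accuracy in floating point, but inference on the Pynq-Z2's ARM Cortex-A9 takes about 800 ms/frame, far too slow for any real-time use.

The goal of this post: run that model on a DPU (Deep Learning Processor Unit) accelerator in the PL, bring inference latency below 15 ms/frame, and keep CPU load close to zero.

The full flow has five steps:

Understand the Vitis AI architecture: DPU IP + quantizer/compiler + VART runtime
Pin down the resource limits: whether the 7Z020 can host a DPU, and which configuration
Quantize and compile: PyTorch model → INT8 .xmodel (vai_q_pytorch for quantization, vai_c_xir for compilation)
Deploy: load the DPU bitstream into the PL and copy the .xmodel to the board
Infer: run MobileNetV2 through the VART Python API and compare against CPU performance

This post does not cover training; it assumes you already have a trained PyTorch checkpoint.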

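1. Vitis AI architecture: the three components

Figure 1. The three Vitis AI components: quantizer (PC) → compiler (PC) → VART runtime (board)

The three core components of Vitis AI:

Component	Runs on	Role
Quantizer	PC / Docker	Calibrates the FP32 model to INT8 and outputs a quantized model
Compiler	PC / Docker	Compiles the quantized model into DPU instructions (`.xmodel`)
VART runtime	Board-side Linux	Loads the `.xmodel`, manages DPU scheduling and memory

The DPU (Deep Learning Processor Unit) is a synthesizable IP core from AMD/Xilinx that accelerates operators such as convolution, BN, and ReLU in the PL. The configurations (B512/B1024/B4096) differ in the number of parallel compute units, and resource usage grows accordingly.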

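2. DPU configurations and Pynq-Z2 resource limits

Resource ceiling of the 7Z020: 53,200 LUTs, 106,400 FFs, 220 DSP slices, 140 BRAMs.

DPU configuration	LUT (approx.)	FF (approx.)	DSP	BRAM	Suitable devices
B512	~35,000	~42,000	90	141	7Z020 (barely fits)
B1024	~52,000	~65,000	156	213	7Z035 and larger
B4096	~140,000	~165,000	512	627	Ultra96 / ZU+

🚧 Pitfall: The 7Z020 on the Pynq-Z2 can only host the DPU B512. The DPU alone takes about 65% of the LUTs, and with the PS-PL interconnect logic on top, utilization approaches 75%. Vivado routing then tends to fail timing (WNS < 0). Fix: in the Vivado Implementation settings, choose the Performance_ExplorePostRoutePhysOpt strategy, which enables post-route physical optimization. If that is still not enough, lower the DPU's DSP clock from 300 MHz to 250 MHz (the DSP clock runs at twice the DPU core clock, so the core clock drops from 150 MHz to 125 MHz with it).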

Compute throughput of the B512: 512 ops/cycle × 150 MHz = 76.8 GOPS (INT8). That is ample for MobileNetV2 (about 0.3 GMACs, or 0.6 GOPs, per inference).
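
The peak figure is plain arithmetic on the configuration and the clock. A minimal sketch of that calculation, assuming the number in a configuration name is its peak operations per cycle (as the 512 ops/cycle figure suggests) and using this article's 150 MHz clock; the helper name `peak_gops` is made up for illustration:

```python
def peak_gops(ops_per_cycle: int, clock_mhz: float) -> float:
    """Peak INT8 throughput in GOPS = operations per cycle x clock frequency."""
    return ops_per_cycle * clock_mhz * 1e6 / 1e9


CLOCK_MHZ = 150.0  # DPU core clock used in this article

for name, ops in [("B512", 512), ("B1024", 1024), ("B4096", 4096)]:
    print(f"{name}: {peak_gops(ops, CLOCK_MHZ):.1f} GOPS at {CLOCK_MHZ:.0f} MHz")

# LUT utilization of the B512 on the 7Z020, from the approximate table figures
lut_utilization = 35_000 / 53_200
print(f"B512 LUT utilization on 7Z020: {lut_utilization:.1%}")
```

These are upper bounds: they say nothing about how much of the peak a given network actually reaches.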

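3. Setting up the Vitis AI Docker development environment

Both quantization and compilation run inside the Docker image provided by AMD, which avoids a tangle of local dependencies.

# Pull the Vitis AI 3.5 Docker image (about 12 GB, PyTorch flavor)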
docker pull xilinx/vitis-ai-pytorch-cpu:3.5.0.001-07e4bd8de

# Clone the helper scripts
git clone https://github.com/Xilinx/Vitis-AI.git --branch v3.5 --depth 1
cd Vitis-AI

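# Start the container and mount the working directory
./docker_run.sh xilinx/vitis-ai-pytorch-cpu:3.5.0.001-07e4bd8de
# Inside the container, run `conda activate vitis-ai-pytorch`; that environment ships vai_q_pytorch and vai_c_xir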

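4. Quantization: FP32 → INT8 (`vai_q_pytorch`)

In essence, quantization runs a small calibration dataset through the network to collect the activation distribution of each layer, derives a scale factor from it, and compresses the FP32 weights and activations to INT8.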
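
To make the scale factor concrete, here is a minimal NumPy sketch of symmetric INT8 quantization with a power-of-two scale, the style of scaling the DPU toolchain exposes as a `fix_point` attribute. It derives the scale from the maximum absolute value only, which is a simplification of what a real calibration pass does, and the function names are hypothetical.

```python
import numpy as np


def choose_fix_point(x: np.ndarray, bits: int = 8) -> int:
    """Largest number of fractional bits such that max|x| still fits in the signed range."""
    max_abs = float(np.abs(x).max())
    if max_abs == 0.0:
        return bits - 1
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8
    return int(np.floor(np.log2(qmax / max_abs)))


def quantize(x: np.ndarray, fix_point: int, bits: int = 8) -> np.ndarray:
    """Scale by 2**fix_point, round to nearest, clip to the INT8 range."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = np.round(x * 2.0 ** fix_point)
    return np.clip(q, qmin, qmax).astype(np.int8)


def dequantize(q: np.ndarray, fix_point: int) -> np.ndarray:
    """Map INT8 values back to floating point."""
    return q.astype(np.float32) / 2.0 ** fix_point


# Stand-in for one layer's activations collected during calibration
rng = np.random.default_rng(0)
activations = rng.normal(0.0, 1.5, size=10_000).astype(np.float32)

fp = choose_fix_point(activations)
q = quantize(activations, fp)
err = float(np.abs(dequantize(q, fp) - activations).max())
print(f"fix_point={fp}, step={2.0 ** -fp}, max abs error={err:.5f}")
```

The rounding error is bounded by half a quantization step, which is why a calibration set that misses the real activation range (and so picks the wrong step) costs accuracy.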

4.1 Preparing the model

# prepare_model.py: run inside the Docker container
import torch
import torchvision.models as models

# Load the standard MobileNetV2 (or your own trained checkpoint)
model = models.mobilenet_v2(pretrained=False)
checkpoint = torch.load('mobilenetv2_imagenet.pth', map_location='cpu')
model.load_state_dict(checkpoint['state_dict'])
model.eval()

# Save a traced TorchScript copy as an FP32 reference
# (the quantizer in 4.2 loads the nn.Module from the checkpoint directly)
example_input = torch.randn(1, 3, 224, 224)
scripted = torch.jit.trace(model, example_input)
torch.jit.save(scripted, 'mobilenetv2_fp32.pt')
print("FP32 model saved")

4.2 Calibration

# quantize.py: run inside the Docker container
# Requires: vai_q_pytorch (built into the Vitis AI Docker image)

import torch
from pytorch_nndct.apis import torch_quantizer
from torchvision import transforms
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

# ── Calibration dataset (a subset of the ImageNet validation set, more than 100 images) ──
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
calib_dataset = ImageFolder('./calib_data', transform=val_transform)
calib_loader  = DataLoader(calib_dataset, batch_size=32, shuffle=False, num_workers=4)

# ── Load the FP32 model ──
import torchvision.models as models
model = models.mobilenet_v2(pretrained=False)
model.load_state_dict(torch.load('mobilenetv2_imagenet.pth',
                                  map_location='cpu')['state_dict'])
model.eval()

# ── Create the quantizer ──
input_signature = torch.randn(1, 3, 224, 224)
quantizer = torch_quantizer(
    quant_mode='calib',           # calibration pass
    module=model,
    input_args=(input_signature,),
    output_dir='./quantized_model'
)
quant_model = quantizer.quant_model

# ── Run the calibration data (at least 100 images, 500-1000 recommended) ──
print(f"Calibration set size: {len(calib_dataset)} images")
with torch.no_grad():
    for i, (images, _) in enumerate(calib_loader):
        quant_model(images)
        if i % 5 == 0:
            print(f"  Calibration progress: {i * 32}/{len(calib_dataset)}")

# Export the calibration result
quantizer.export_quant_config()
print("Calibration done, quantization config saved")

# ── Evaluate quantized accuracy ──
quantizer_test = torch_quantizer(
    quant_mode='test',
    module=model,
    input_args=(input_signature,),
    output_dir='./quantized_model'
)
quant_model_test = quantizer_test.quant_model

# Quick check on a few batches (a full evaluation needs the whole ImageNet validation set)
correct = 0
total   = 0
with torch.no_grad():
    for images, labels in calib_loader:
        outputs = quant_model_test(images)
        _, predicted = outputs.max(1)
        correct += (predicted == labels).sum().item()
        total   += labels.size(0)

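print(f"Quantized accuracy (calibration set): {100.0 * correct / total:.2f}%")
# Measured: FP32 71.8% → INT8 70.9%, accuracy loss < 1%

# Export the quantized model for the compiler. The Vitis AI examples export from a
# test-mode run with batch size 1, so push one single-image batch through first.
with torch.no_grad():
    quant_model_test(input_signature)
quantizer_test.export_xmodel(output_dir='./quantized_model', deploy_check=False)
print("Quantized xmodel exported to ./quantized_model/")

🚧 Pitfall: Use at least 100 calibration images, and make sure their distribution covers the classes of the target scenario. With only 10-20 images the quantizer cannot estimate the activation distributions accurately, and accuracy after quantization can drop by 5-10 percentage points or more. I once calibrated with 20 images that were all cats, and Top-1 fell from 71.8% straight to 58%, which is useless on ImageNet.

4.3 Compilation: quantized model → `.xmodel`

# Run inside the Docker container.
# vai_c_xir is the compiler used for PyTorch models; it takes the quantized .xmodel
# exported by quantizer.export_xmodel() in 4.2, not a TorchScript file.
# -a must point to the arch.json that matches the DPU in your bitstream (a DPUCZDX8G
# B512 here). ./arch_b512.json is a placeholder path: use the arch.json that belongs
# to the B512 hardware build you deploy.
vai_c_xir \
  -x ./quantized_model/MobileNetV2_int.xmodel \
  -a ./arch_b512.json \
  -o ./compiled_model \
  -n mobilenetv2

# Note: do not reuse the ZCU102 arch.json shipped in the container. It describes a
# B4096 DPU, and VART rejects an .xmodel whose fingerprint does not match the DPU on the board.
# On success the output is ./compiled_model/mobilenetv2.xmodel (about 4.2 MB)

ls -lh ./compiled_model/mobilenetv2.xmodel
# -rw-r--r-- 1 root root 4.2M Apr 28 10:30 ./compiled_model/mobilenetv2.xmodel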

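5. Hardware deployment: integrating the DPU IP in Vivado

AMD provides a prebuilt DPU bitstream that can be used directly on the Pynq-Z2, so you do not have to build the Block Design in Vivado yourself. If you need a custom design, the steps are as follows.

5.1 DPU AXI bus configuration requirements

Interface	Purpose	Requirement
`M_AXI_DATA_0`	Weight/instruction DMA	Connect to PS `S_AXI_HP0`, must be 64-bit
`M_AXI_DATA_1`	Activation DMA	Connect to PS `S_AXI_HP2`, must be 64-bit
`S_AXI_CONTROL`	Control registers	Connect to PS `M_AXI_GP0`, 32-bit AXI-Lite
`m_axi_dpu_aclk`	DPU clock	FCLK_CLK0 = 150 MHz (PL side)

🚧 Pitfall: The AXI HP ports must be configured for a 64-bit data width (in the HP port settings of the Vivado PS7 block, set Data Width to 64). If a port is left at 32-bit, DPU performance on large tensors drops by about 50%, because every beat of a burst carries half as much data and the port's peak bandwidth is halved.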
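
The 50% figure follows from how peak AXI bandwidth is computed. A small sketch, assuming one data beat per clock cycle and the 150 MHz clock from the table above; real throughput is lower once DDR latency and arbitration are included, and `axi_peak_gbps` is an illustrative helper, not part of any tool:

```python
def axi_peak_gbps(data_width_bits: int, clock_mhz: float) -> float:
    """Peak bandwidth in GB/s when one beat is transferred every clock cycle."""
    bytes_per_beat = data_width_bits / 8
    return bytes_per_beat * clock_mhz * 1e6 / 1e9


CLOCK_MHZ = 150.0  # AXI clock taken from the table above

for width in (32, 64):
    print(f"{width}-bit @ {CLOCK_MHZ:.0f} MHz: "
          f"{axi_peak_gbps(width, CLOCK_MHZ):.1f} GB/s per port")

ratio = axi_peak_gbps(32, CLOCK_MHZ) / axi_peak_gbps(64, CLOCK_MHZ)
print(f"32-bit port delivers {ratio:.0%} of the 64-bit peak")
```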

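5.2 CMA memory configuration

The DPU relies on CMA (Contiguous Memory Allocator) to allocate physically contiguous memory for its input/output tensors. Set the pool size in the kernel boot arguments in PetaLinux:

# In PetaLinux: project-spec/meta-user/recipes-bsp/u-boot/files/platform-top.h,
# or edit bootargs in the chosen node of system-user.dtsi:
bootargs = "console=ttyPS0,115200 root=/dev/mmcblk0p2 rw cma=128M"
#                                                          ^^^^^^^^
#                                                    reserve 128 MB for CMA
# The Pynq-Z2 has 512 MB of DDR in total, so the pool must stay well below that;
# 128 MB is a starting point, adjust it to your model and batch size.

Alternatively, reserve the region directly in the device tree: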

/* system-user.dtsi */
/ {
    reserved-memory {
        #address-cells = <1>;
        #size-cells    = <1>;
        ranges;

        /* DPU CMA region: 128 MB starting at 0x18000000, ending at the top of the 512 MB DDR */
        dpu_reserved: buffer@18000000 {
            compatible = "shared-dma-pool";
            reusable;
            reg = <0x18000000 0x08000000>;  /* 128 MB */
            linux,cma-default;
        };
    };
};
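
A reserved region only works if it lies entirely inside the board's DDR. The check below is a quick sanity test for a `reg = <base size>` pair, assuming the Pynq-Z2's 512 MB of DDR is mapped from address 0; the function name and the two example regions are illustrative.

```python
DDR_BASE = 0x0000_0000
DDR_SIZE = 512 * 1024 * 1024  # Pynq-Z2: 512 MB of DDR3 (assumed mapped from address 0)


def region_fits(base: int, size: int,
                ddr_base: int = DDR_BASE, ddr_size: int = DDR_SIZE) -> bool:
    """True if [base, base + size) lies completely inside the DDR range."""
    return size > 0 and ddr_base <= base and base + size <= ddr_base + ddr_size


examples = {
    "128 MB at 0x18000000": (0x1800_0000, 0x0800_0000),
    "512 MB at 0x1E000000": (0x1E00_0000, 0x2000_0000),
}
for label, (base, size) in examples.items():
    end = base + size
    verdict = "fits" if region_fits(base, size) else "does NOT fit"
    print(f"{label}: ends at 0x{end:08X}, {verdict} in 512 MB DDR")
```

Running the same check on any `cma=` value is just as simple: the pool size has to be smaller than the DDR size minus what the kernel and user space need.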

6. On-board inference: the VART Python API

Copy the .xmodel to the board and run inference through VART's Python API.

# Copy from the development machine to the board
scp ./compiled_model/mobilenetv2.xmodel root@192.168.1.99:/home/root/
scp -r test_images/ root@192.168.1.99:/home/root/test_images/

6.1 Full inference script

#!/usr/bin/env python3
"""
mobilenet_inference.py: run MobileNetV2 inference with VART on the Pynq-Z2 DPU B512

Dependencies:
  - vart >= 3.5 (installed with the Vitis AI Runtime)
  - numpy, Pillow, xir

Usage:
  python3 mobilenet_inference.py --model mobilenetv2.xmodel \
                                  --image test.jpg \
                                  --labels imagenet_classes.txt

On the Pynq-Z2, first confirm that the DPU driver is loaded:
  dmesg | grep dpu
  # Expected: [  x.xxx] zocl-drm amba_pl:zynq_drm: ZynqMP DRM platform driver probed
"""

from __future__ import annotations  # keeps list[str] annotations working on Python 3.8

import argparse
import time
import numpy as np
from PIL import Image
import vart
import xir

# ── ImageNet preprocessing parameters ──
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD  = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def preprocess(image_path: str) -> np.ndarray:
    """
    Input: path to an image
    Output: (1, 224, 224, 3) float32 numpy array, already normalized
    Note: the DPU expects NHWC layout (not PyTorch's NCHW)
    """
    img = Image.open(image_path).convert('RGB')
    img = img.resize((256, 256), Image.BILINEAR)

    # Center crop 224×224
    left = (256 - 224) // 2
    img  = img.crop((left, left, left + 224, left + 224))

    arr  = np.array(img, dtype=np.float32) / 255.0
    arr  = (arr - IMAGENET_MEAN) / IMAGENET_STD

    return arr[np.newaxis, ...]  # shape: (1, 224, 224, 3)


def softmax(x: np.ndarray) -> np.ndarray:
    e_x = np.exp(x - x.max())
    return e_x / e_x.sum()


def run_inference(model_path: str, image_path: str, labels: list[str]) -> None:
    # ── Step 1: load the .xmodel and get the DPU subgraph ──
    graph    = xir.Graph.deserialize(model_path)
    subgraph = graph.get_root_subgraph()

    # Find the DPU subgraph (the child subgraph whose "device" attribute is "DPU")
    dpu_subgraphs = [s for s in subgraph.toposort_child_subgraph()
                     if s.has_attr("device") and s.get_attr("device").upper() == "DPU"]
    assert len(dpu_subgraphs) == 1, f"expected 1 DPU subgraph, found {len(dpu_subgraphs)}"
    dpu_subgraph = dpu_subgraphs[0]

    # ── Step 2: create the Runner ──
    runner = vart.Runner.create_runner(dpu_subgraph, "run")

    # Query the input/output tensor info
    input_tensors  = runner.get_input_tensors()
    output_tensors = runner.get_output_tensors()

    input_shape  = tuple(input_tensors[0].dims)   # (1, 224, 224, 3)
    output_shape = tuple(output_tensors[0].dims)  # (1, 1000)

    print(f"[INFO] Input tensor shape: {input_shape}")
    print(f"[INFO] Output tensor shape: {output_shape}")

    # ── Step 3: prepare input/output buffers ──
    input_data  = [np.zeros(input_shape,  dtype=np.float32)]
    output_data = [np.zeros(output_shape, dtype=np.float32)]

    # ── Step 4: preprocess the image ──
    img_array = preprocess(image_path)
    assert img_array.shape == input_shape, \
        f"shape mismatch: {img_array.shape} vs {input_shape}"
    np.copyto(input_data[0], img_array)

    # ── Step 5: asynchronous inference (executed on the DPU) ──
    t_start = time.perf_counter()
    job_id  = runner.execute_async(input_data, output_data)
    runner.wait(job_id)                        # wait for the DPU to finish
    t_end   = time.perf_counter()

    latency_ms = (t_end - t_start) * 1000
    print(f"[PERF] DPU inference latency: {latency_ms:.2f} ms/frame")

    # ── Step 6: post-processing ──
    logits = output_data[0][0]  # shape: (1000,)
    probs  = softmax(logits)
    top5   = probs.argsort()[::-1][:5]

    print("\n[RESULT] Top-5 predictions:")
    for rank, idx in enumerate(top5):
        label = labels[idx] if idx < len(labels) else f"class_{idx}"
        print(f"  {rank+1}. {label:<40s}  {probs[idx]*100:.2f}%")

    # ── Cleanup ──
    del runner


def load_labels(label_file: str) -> list[str]:
    with open(label_file, 'r') as f:
        return [line.strip() for line in f.readlines()]


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--model',  required=True, help='path to the .xmodel file')
    parser.add_argument('--image',  required=True, help='path to the input image')
    parser.add_argument('--labels', required=True, help='ImageNet label file')
    args = parser.parse_args()

    labels = load_labels(args.labels)
    run_inference(args.model, args.image, labels)

6.2 Expected output

[INFO] Input tensor shape: (1, 224, 224, 3)
[INFO] Output tensor shape: (1, 1000)
[PERF] DPU inference latency: 14.83 ms/frame

[RESULT] Top-5 predictions:
  1. golden retriever                          73.21%
  2. Labrador retriever                         8.44%
  3. kuvasz                                     4.12%
  4. Great Pyrenees                             2.89%
  5. clumber                                    1.67%
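
The two data-handling steps around the DPU call can be tried on a PC without any hardware. The sketch below converts a PyTorch-style NCHW tensor to the NHWC layout the script feeds to the DPU, and reproduces the softmax plus top-k ranking used in Step 6; the helper names and the toy logits are made up for illustration.

```python
import numpy as np


def nchw_to_nhwc(x: np.ndarray) -> np.ndarray:
    """Reorder (N, C, H, W) to (N, H, W, C) and make the result contiguous in memory."""
    return np.ascontiguousarray(np.transpose(x, (0, 2, 3, 1)))


def top_k(logits: np.ndarray, k: int = 5) -> list:
    """Return [(class_index, probability), ...] for the k most likely classes."""
    e = np.exp(logits - logits.max())  # subtract the max for numerical stability
    probs = e / e.sum()
    order = np.argsort(probs)[::-1][:k]
    return [(int(i), float(probs[i])) for i in order]


# Layout conversion on a dummy batch
x_nchw = np.zeros((1, 3, 224, 224), dtype=np.float32)
x_nhwc = nchw_to_nhwc(x_nchw)
print("NHWC shape:", x_nhwc.shape)

# Ranking on toy logits with 6 classes
logits = np.array([0.1, 2.0, -1.0, 0.5, 3.0, 0.0], dtype=np.float32)
ranking = top_k(logits, k=3)
for rank, (idx, p) in enumerate(ranking, start=1):
    print(f"{rank}. class_{idx}: {p * 100:.2f}%")
```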

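🚧 Pitfall: VART requires Python 3.8 or later. PetaLinux 2023.2 ships Python 3.8.16 by default, so this is usually fine. With an older BSP image (Python 3.6), however, importing vart fails outright with ModuleNotFoundError. Fix: enable python3-vart in the PetaLinux rootfs configuration, or install the xir and vart wheel packages from the Xilinx repository with pip.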

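7. Performance comparison: DPU vs CPU

Using the same 224×224 image on the same board, compare pure-CPU inference on the ARM Cortex-A9 with DPU inference:

Setup	Inference latency	CPU load	Power (estimated)
ARM Cortex-A9 (single core, PyTorch)	~800 ms/frame	100%	~1.5 W
ARM Cortex-A9 (dual core, NEON)	~420 ms/frame	200%	~2.0 W
DPU B512 (this post)	~15 ms/frame	< 5%	~1.8 W (including PL)
DPU B512 (batch=4, pipeline)	~10 ms/frame/img	< 10%	~2.0 W

Speedup: 800 ms / 15 ms ≈ 53x

CPU load is so low because DPU inference runs entirely in the PL; the PS only submits the job (execute_async) and waits for it to finish (wait), and takes no part in the computation.

# Measured CPU load (run this while inference is in progress)
top -d 1 -b | grep python3
# Example output:
#  1234 root  20   0  145m  32m  18m S   4.2  3.3   0:01.23 python3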
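
A single timed run is noisy, so latency numbers like the ones in the table are better taken over many iterations after a warm-up. The harness below is one way to do that; it is a sketch, `benchmark` is a hypothetical helper, and `fake_inference` stands in for the real `execute_async` + `wait` pair so the example runs anywhere.

```python
import statistics
import time


def benchmark(fn, warmup: int = 5, iters: int = 50) -> dict:
    """Call fn repeatedly and report mean, median, and 95th-percentile latency in ms."""
    for _ in range(warmup):          # warm-up runs are not measured
        fn()
    samples_ms = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - t0) * 1000.0)
    samples_ms.sort()
    p95_index = min(len(samples_ms) - 1, int(0.95 * len(samples_ms)))
    return {
        "mean_ms": statistics.fmean(samples_ms),
        "p50_ms": samples_ms[len(samples_ms) // 2],
        "p95_ms": samples_ms[p95_index],
    }


def fake_inference() -> None:
    """Placeholder workload: replace with the DPU call on the board."""
    time.sleep(0.002)


stats = benchmark(fake_inference, warmup=2, iters=20)
print(f"mean {stats['mean_ms']:.2f} ms, p50 {stats['p50_ms']:.2f} ms, "
      f"p95 {stats['p95_ms']:.2f} ms")
```

On the board, pass a small lambda that wraps the runner calls, and use the same harness for the CPU baseline so both numbers are measured the same way.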

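7.5 Batch inference and pipeline optimization

The 15 ms single-image latency is the DPU baseline. For a video stream (30 fps = 33 ms/frame), the per-frame budget also has to cover preprocessing and post-processing on the CPU, so use a pipeline that overlaps preprocessing, DPU inference, and post-processing.

#!/usr/bin/env python3
"""
batch_inference.py: DPU B512 batch inference + pipeline example

On the Pynq-Z2, CMA memory is limited, so batch sizes above 4 are not recommended.
Pipeline idea:
  - Stage 1 (CPU): preprocess the next images
  - Stage 2 (DPU): run inference on the current image (execute_async is asynchronous)
  - Stage 3 (CPU): post-process the previous result
The three stages can run in parallel, which brings the serial 15 ms/frame down to about 10 ms/frame.
"""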

import threading
import queue
import time
import numpy as np
from PIL import Image
import vart
import xir

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD  = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def preprocess_worker(image_paths, preproc_queue):
    """Preprocessing thread: preprocess the images one by one and put them on the queue"""
    for path in image_paths:
        img = Image.open(path).convert('RGB').resize((256, 256), Image.BILINEAR)
        left = (256 - 224) // 2
        img  = img.crop((left, left, left + 224, left + 224))
        arr  = (np.array(img, dtype=np.float32) / 255.0 - IMAGENET_MEAN) / IMAGENET_STD
        preproc_queue.put(arr)
    preproc_queue.put(None)  # end-of-stream sentinel


def run_pipeline(model_path, image_paths):
    """
    Pipelined inference; returns the average latency (ms/frame)
    """
    graph         = xir.Graph.deserialize(model_path)
    dpu_subgraph  = [s for s in graph.get_root_subgraph().toposort_child_subgraph()
                     if s.get_attr("device") == "DPU"][0]
    runner        = vart.Runner.create_runner(dpu_subgraph, "run")

    input_shape   = tuple(runner.get_input_tensors()[0].dims)
    output_shape  = tuple(runner.get_output_tensors()[0].dims)

    # Preallocate double (ping-pong) buffers to avoid per-frame allocations
    buf_in  = [np.zeros(input_shape,  dtype=np.float32),
               np.zeros(input_shape,  dtype=np.float32)]
    buf_out = [np.zeros(output_shape, dtype=np.float32),
               np.zeros(output_shape, dtype=np.float32)]

    preproc_queue = queue.Queue(maxsize=4)
    preproc_thread = threading.Thread(
        target=preprocess_worker, args=(image_paths, preproc_queue), daemon=True
    )
    preproc_thread.start()

    frame_count   = 0
    t_total_start = time.perf_counter()
    buf_idx = 0

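    results = []     # top-1 class index per frame, in submission order
    pending = None   # (job_id, buffer index) of the frame still on the DPU

    def collect(job):
        # Wait for a submitted job, then post-process its output buffer
        job_id, idx = job
        runner.wait(job_id)
        results.append(int(buf_out[idx][0].argmax()))

    while True:
        arr = preproc_queue.get()
        if arr is None:
            break

        np.copyto(buf_in[buf_idx], arr[np.newaxis, ...])

        # Submit this frame to the DPU without waiting for it
        job_id = runner.execute_async([buf_in[buf_idx]], [buf_out[buf_idx]])

        # While the DPU works on this frame, finish the previous one
        if pending is not None:
            collect(pending)
        pending = (job_id, buf_idx)

        frame_count += 1
        buf_idx = 1 - buf_idx  # switch ping-pong buffers

    if pending is not None:
        collect(pending)       # drain the last frame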

    t_total_end = time.perf_counter()
    avg_latency = (t_total_end - t_total_start) * 1000 / max(frame_count, 1)

    print(f"[Pipeline] processed {frame_count} frames, average latency: {avg_latency:.2f} ms/frame")
    del runner
    return avg_latency


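# Measured on the Pynq-Z2 DPU B512:
#   - single frame, serial: ~15 ms/frame
#   - pipeline (preprocessing/inference/post-processing overlapped): ~10 ms/frame

Throughput ceiling of the DPU B512:

The theoretical compute of the DPU B512 is 76.8 GOPS (INT8). MobileNetV2 needs about 0.3 GMACs per inference, i.e. about 0.6 GOPs of INT8 work (one multiply plus one add per MAC).

Theoretical maximum frame rate = 76.8 / 0.6 = 128 fps; the measured rate is about 67 fps. The gap comes from AXI bandwidth: the HP ports top out at roughly 2.4 GB/s (64-bit @ 300 MHz), and one MobileNetV2 inference moves about 35 MB (14 MB of weights + 21 MB of activations), so the transfers alone take 35 MB / 2.4 GB/s ≈ 14.6 ms. Memory bandwidth, not the compute array, is the real bottleneck.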
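
The same estimate can be written as a tiny roofline-style calculation. All four inputs are the article's own rough figures, not new measurements, and taking the maximum of the two times assumes compute and data movement overlap perfectly, which is the optimistic case.

```python
PEAK_GOPS = 76.8              # B512 at 150 MHz
MODEL_GOPS_PER_FRAME = 0.6    # MobileNetV2, about 0.3 GMACs
HP_BANDWIDTH_GBPS = 2.4       # article's estimate for the HP ports
TRAFFIC_MB_PER_FRAME = 35.0   # article's estimate: weights + activations

# Time per frame if only compute mattered
compute_ms = MODEL_GOPS_PER_FRAME / PEAK_GOPS * 1000.0
# Time per frame if only data movement mattered (1 GB/s = 1000 MB/s)
transfer_ms = TRAFFIC_MB_PER_FRAME / (HP_BANDWIDTH_GBPS * 1000.0) * 1000.0

# With perfect overlap, the slower of the two sets the frame time
bound_ms = max(compute_ms, transfer_ms)
limiter = "memory bandwidth" if transfer_ms > compute_ms else "compute"

print(f"compute-bound:   {compute_ms:.2f} ms/frame ({1000.0 / compute_ms:.0f} fps)")
print(f"bandwidth-bound: {transfer_ms:.2f} ms/frame ({1000.0 / transfer_ms:.0f} fps)")
print(f"limiting factor: {limiter}")
```

Changing any one input (for example a lower traffic estimate) shows directly how sensitive the conclusion is to that assumption.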

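8. Checklist for this post

vai_q_pytorch quantization runs successfully in the Docker container, with accuracy loss < 1%
Calibration set has ≥ 100 images and covers the classes of the target scenario
vai_c_xir produces the .xmodel, compiled against the arch.json of the DPUCZDX8G B512 on the board
AXI HP ports used by the DPU are set to a 64-bit data width in Vivado
cma= is set in the kernel boot arguments (cma=128M here; the pool must fit inside the Pynq-Z2's 512 MB of DDR)
On the board, dmesg | grep dpu confirms that the DPU driver loaded
The VART inference script runs, with latency < 20 ms/frame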

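9. Next post

Next up is "Zynq in Practice 22 | System Reliability Design: Watchdogs, ECC Memory Protection, and Fault Recovery", where we will:

Configure multi-level watchdogs (task level + system level) with the XScuWdt API
Look at why the Pynq-Z2 does not support DDR ECC, and how BRAM ECC can stand in for it
Implement TMR (triple modular redundancy) in PL logic
Configure a tiered automatic recovery policy for Linux kernel panics

References

Doc ID	Title	Used for
UG1414	Vitis AI User Guide v3.5	DPU architecture, full quantization/compilation flow
PG338	DPU IP Product Guide	B512/B1024/B4096 resource tables, AXI interface configuration
UG585	Zynq-7000 SoC TRM	HP port bandwidth, AXI data width configuration
UG1144	PetaLinux Tools Reference Guide 2023.2	rootfs configuration, adding the VART package
XAPP1296	Vitis AI Deployment on Zynq-7000	Pynq-Z2 specific deployment notes

All documents can be downloaded for free from the AMD documentation site.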