Zynq in Practice 12 | AXI DMA Engine Driver: Streaming PL Data into DDR with dmaengine

This is part 12 of the series "Zynq FPGA Embedded System Design in Practice". Board: Pynq-Z2 (XC7Z020). Toolchain: Vivado / Vitis / PetaLinux 2023.2. Previous part: "Zynq in Practice 11 | Linux Drivers for Custom PL IP"

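0. What problem this part solves

The previous part used UIO to mmap PL registers into user space, which answered the question of how to control the PL.
UIO, however, is only suited to register access. If the PL hosts a high-speed data source (an ADC sample stream, a video frame buffer, signal-processing results) that has to push several hundred MB per second into DDR, having the CPU read 4 bytes at a time and then write them to DDR cannot keep up.

Goal of this part: when you are done, you have a project that moves data from the PL into PS DDR at a sustained ~800 MB/s.

What we will do:

Configure the AXI DMA IP in Vivado and connect it to the HP0 port (64-bit, the high-speed path to DDR)
Enable CONFIG_XILINX_DMA in the PetaLinux kernel
Write the axidma@40400000 device tree node
Drive DMA transfers through the Linux dmaengine API (not by poking registers directly)
Measure throughput and analyze the effect of burst size and scatter-gather

This part does not cover VDMA (video DMA); that is the topic of the next part.
It also does not cover a data producer generated with Vitis HLS; on the PL side a simple BRAM filled with data serves as the source for now.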

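1. Where AXI DMA sits in a Zynq system

Figure 1. AXI DMA data path: PL data source → S2MM channel → HP0 → DDR3

The AXI DMA IP has one channel per direction:

Channel	Direction	AXI master interface	Typical use
S2MM (Stream to Memory-Mapped)	PL → DDR	`M_AXI_S2MM` to an HP port	Writing ADC samples or video frames to DDR
MM2S (Memory-Mapped to Stream)	DDR → PL	`M_AXI_MM2S` to an HP port	Reading data from DDR for processing in the PL

This part focuses on S2MM (moving data from the PL into DDR); MM2S works symmetrically in the opposite direction.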

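2. Vivado: configuring the AXI DMA IP

2.1 Adding the IP and configuring the channels

In the Block Design choose Add IP → AXI Direct Memory Access, then double-click the block to open its configuration dialog:

Option	Value	Notes
Enable Scatter Gather Engine	✅ checked	Supports non-contiguous memory buffers
Width of Buffer Length Register	26	One transfer can carry up to 2^26 - 1 bytes (just under 64 MB)
MM2S Memory Map Data Width	64	Matches the 64-bit width of the HP0 port
S2MM Memory Map Data Width	64	Same as above
MM2S Stream Data Width	32 or 64	Must match the AXI-Stream width of the PL consumer
S2MM Stream Data Width	32 or 64	Must match the AXI-Stream width of the PL data source
Max Burst Size	256	Friendliest setting for DDR bandwidth (see the analysis in section 8)

If you only need one direction (for example PL → DDR only), disable the MM2S channel (the Enable Read Channel checkbox) to save resources.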

2.2 Connecting to the HP0 port

Wire things up manually in the Block Design (or let Run Connection Automation do most of the work and then adjust the HP port by hand):

AXI DMA M_AXI_S2MM  ──→  processing_system7_0 S_AXI_HP0
AXI DMA M_AXI_MM2S  ──→  processing_system7_0 S_AXI_HP0  (same HP0, arbitrated by AXI SmartConnect)
AXI DMA S_AXI_LITE  ←──  ps7_axi_periph (M_AXI_GP0)
AXI DMA s2mm_introut ──→  xlconcat_0 In1 → IRQ_F2P[1]
AXI DMA mm2s_introut ──→  xlconcat_0 In0 → IRQ_F2P[0]

🚧 Pitfall: do not assume that HP0 is 64 bits wide; depending on the PS7 configuration or preset it may be set to 32-bit, in which case you have to change it by hand. To check: double-click processing_system7_0 → PS-PL Configuration → HP Slave AXI Interface → S AXI HP0 Data Width → 64. If the 64-bit M_AXI_S2MM of the AXI DMA ends up on a 32-bit HP0, Vivado silently inserts an AXI width converter to resolve the mismatch and the bandwidth is halved.

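2.3 Address map (Address Editor)

Addresses assigned by Vivado (Address Editor → AXI DMA):

Interface	Base address	Range	Notes
`S_AXI_LITE`	`0x4040_0000`	64K	The PS controls the DMA registers through GP0
`M_AXI_S2MM`	`0x0000_0000`	512M	Physical DDR range reachable as a destination
`M_AXI_MM2S`	`0x0000_0000`	512M	Physical DDR range reachable as a source

The DMA control register base address 0x40400000 is the address that goes into the reg property of the device tree node.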

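3. PetaLinux kernel configuration

In PetaLinux 2023.2, check the following kernel options:

petalinux-config -c kernel

Search for them in menuconfig (press /):

CONFIG_XILINX_DMA=y          # Xilinx AXI DMA driver (xilinx_dma.c; also serves AXI CDMA, VDMA and MCDMA)
CONFIG_DMA_OF=y               # device-tree DMA channel binding used by dmaengine (normally set automatically once OF and DMA Engine support are enabled)
CONFIG_DMATEST=m              # optional: in-kernel dmatest module; it only exercises memcpy-type channels (e.g. AXI CDMA), not the AXI DMA slave channels used here

Path:

Device Drivers
  └── DMA Engine support
        ├── [*] Async TX: Offload support for the async_tx api
        ├── [*] DMA Engine debugging
        └── [*] Xilinx AXI DMAS Engine        <-- CONFIG_XILINX_DMA

🚧 Pitfall: older kernels had a separate CONFIG_XILINX_VDMA option. In the kernel that ships with PetaLinux 2023.2, AXI DMA and AXI VDMA are handled by the same driver behind CONFIG_XILINX_DMA, which decides how to treat a node from its compatible string ("xlnx,axi-dma-1.00.a" versus "xlnx,axi-vdma-1.00.a"). The VDMA part that follows therefore needs no extra option, but the compatible string in the device tree has to be exactly right; otherwise the node is probed as the wrong IP type or not probed at all.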

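4. Device tree node

PetaLinux normally generates a node for the AXI DMA in pl.dtsi from the XSA, so compare against that file first to avoid defining the same node twice. If you write the node by hand, add it to project-spec/meta-user/recipes-bsp/device-tree/files/system-user.dtsi: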

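/* system-user.dtsi: AXI DMA device tree node */
/ {
    amba_pl: amba_pl@0 {
        compatible     = "simple-bus";   /* needed so the child nodes get probed */
        #address-cells = <1>;
        #size-cells    = <1>;
        ranges;

        axi_dma_0: axidma@40400000 {
            compatible            = "xlnx,axi-dma-1.00.a";
            #dma-cells            = <1>;  /* one cell: channel index (0 = MM2S, 1 = S2MM) */
            reg                   = <0x40400000 0x10000>;
            interrupt-parent      = <&intc>;

            /* The driver requests these clocks by name at probe time.
             * Assumption: FCLK_CLK0 (clkc 15) drives every AXI DMA clock input. */
            clocks                = <&clkc 15>, <&clkc 15>, <&clkc 15>, <&clkc 15>;
            clock-names           = "s_axi_lite_aclk", "m_axi_mm2s_aclk",
                                    "m_axi_s2mm_aclk", "m_axi_sg_aclk";

            xlnx,addrwidth        = <32>;  /* address width: 32 bits cover the 512 MB DDR */
            xlnx,sg-length-width  = <26>;  /* must equal Width of Buffer Length Register in the IP */

            /* S2MM channel (PL → DDR, channel 1) */
            dma-channel@40400030 {         /* S2MM control registers at offset 0x30 */
                compatible        = "xlnx,axi-dma-s2mm-channel";
                interrupts        = <0 30 4>;  /* IRQ_F2P[1] → GIC ID 62 → 62-32=30 */
                xlnx,datawidth    = <64>;       /* same width as the HP0 interface */
                xlnx,device-id   = <0x0>;
            };

            /* MM2S channel (DDR → PL, channel 0) */
            dma-channel@40400000 {         /* MM2S control registers at offset 0x00 */
                compatible        = "xlnx,axi-dma-mm2s-channel";
                interrupts        = <0 29 4>;  /* IRQ_F2P[0] → GIC ID 61 → 61-32=29 */
                xlnx,datawidth    = <64>;
                xlnx,device-id   = <0x0>;
            };
        };
    };
};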

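Key properties:

Property	Value	Meaning
`compatible`	`"xlnx,axi-dma-1.00.a"`	Matches the of_device_id table in `drivers/dma/xilinx/xilinx_dma.c`
`#dma-cells`	`1`	Required by the dmaengine framework; a consumer selects a channel with `<&axi_dma_0 0>` (MM2S) or `<&axi_dma_0 1>` (S2MM)
`xlnx,sg-length-width`	`26`	Must be identical to "Width of Buffer Length Register" in the Vivado IP configuration; a mismatch makes the descriptor length wrong for large transfers
`xlnx,addrwidth`	`32`	Change to 64 if the DDR exceeds 4 GB; 32 is enough for the Pynq-Z2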

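After boot, confirm that the driver has probed:

dmesg | grep axidma
# Expected output (the timestamp will differ; the prefix is the platform driver name, xilinx-vdma):
# [    2.345678] xilinx-vdma 40400000.axidma: Xilinx AXI DMA Engine Driver Probed!!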

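5. dmaengine API quick reference

A transfer through the Linux dmaengine framework takes six steps. With S2MM (PL → DDR) as the example:

dma_request_chan()           ← obtain the channel handle
      ↓
dma_alloc_coherent()         ← allocate the DMA buffer (coherent, bypasses the cache)
      ↓
dmaengine_prep_slave_single() ← prepare a single-block transfer descriptor
      ↓
dmaengine_submit()           ← put the descriptor on the pending queue
      ↓
dma_async_issue_pending()    ← kick off the transfer (the hardware starts working)
      ↓
wait for completion (callback or dma_wait_for_async_tx)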

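6. Complete C kernel module: dmaengine S2MM test

Below is a loadable kernel module that tests the AXI DMA S2MM channel (PL data stream written to DDR) and prints throughput figures when it is loaded.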

/*
 * axidma_test.c: AXI DMA S2MM throughput test kernel module
 *
 * Test environment: Pynq-Z2, PetaLinux 2023.2, HP0 64-bit @150 MHz
 *
 * Build inside the PetaLinux project (after adding this file as a module):
 *   petalinux-build -c kernel
 *   petalinux-build -c axidma_test
 * Or cross-compile out of tree with kbuild and the Makefile below
 * (a .ko cannot be produced by calling gcc on the .c file directly):
 *   make ARCH=arm CROSS_COMPILE=arm-xilinx-linux-gnueabi- KDIR=<kernel_dir>
 *
 * Load:
 *   insmod axidma_test.ko transfer_size_mb=16 num_transfers=10
 */

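#include <linux/module.h>
#include <linux/platform_device.h>
#include <linux/dmaengine.h>
#include <linux/dma-mapping.h>
#include <linux/of.h>
#include <linux/ktime.h>
#include <linux/math64.h>
#include <linux/string.h>
#include <linux/completion.h>
#include <linux/slab.h>

#define DRV_NAME        "axidma_test"
#define DEFAULT_SIZE_MB  16       /* default size of one transfer: 16 MB */
#define DEFAULT_XFERS    10       /* default: 10 transfers, averaged */

static int transfer_size_mb = DEFAULT_SIZE_MB;
static int num_transfers     = DEFAULT_XFERS;
/*
 * Test-only parameter introduced by this module (not a Xilinx API): the
 * dmaengine name of the S2MM channel as listed under /sys/class/dma.
 * Which entry is S2MM depends on probe order, so check it on your board.
 */
static char *chan_name       = "dma0chan0";
module_param(transfer_size_mb, int, 0444);
module_param(num_transfers,     int, 0444);
module_param(chan_name,         charp, 0444);
MODULE_PARM_DESC(transfer_size_mb, "Size of one transfer in MB (default 16)");
MODULE_PARM_DESC(num_transfers,    "Number of transfers (default 10)");
MODULE_PARM_DESC(chan_name,        "dmaengine channel name of the S2MM channel (default dma0chan0)");

/* Per-transfer context handed to the completion callback */
struct dma_xfer_ctx {
    struct completion done;
    bool              error;
};

/* DMA completion callback; the plain callback carries no status information */
static void dma_xfer_callback(void *param)
{
    struct dma_xfer_ctx *ctx = param;

    ctx->error = false;
    complete(&ctx->done);
}

/* Channel filter: accept only the channel whose name equals chan_name */
static bool axidma_test_filter(struct dma_chan *chan, void *param)
{
    return strcmp(dma_chan_name(chan), (const char *)param) == 0;
}

static int __init axidma_test_init(void)
{
    struct dma_chan        *chan;
    struct dma_async_tx_descriptor *desc;
    struct dma_xfer_ctx     ctx;
    dma_cap_mask_t          mask;
    dma_addr_t              dma_addr;
    void                   *cpu_buf;
    size_t                  buf_size = (size_t)transfer_size_mb << 20;
    ktime_t                 t_start, t_end;
    s64                     elapsed_us;
    int                     i, ret = 0;
    u64                     total_bytes = 0;
    s64                     total_us    = 0;

    if (transfer_size_mb <= 0 || num_transfers <= 0)
        return -EINVAL;

    pr_info("[%s] starting AXI DMA S2MM throughput test\n", DRV_NAME);
    pr_info("[%s] %d MB per transfer, %d transfers\n", DRV_NAME, transfer_size_mb, num_transfers);

    /* -- Step 1: request the S2MM channel -- */
    dma_cap_zero(mask);
    dma_cap_set(DMA_SLAVE, mask);
    chan = dma_request_channel(mask, axidma_test_filter, chan_name);
    if (!chan) {
        /*
         * In a real project, call dma_request_chan(dev, "rx") from the
         * probe() of a platform_driver and bind the channel in the device
         * tree node of that consumer:
         *     dmas = <&axi_dma_0 1>;  dma-names = "rx";
         *
         * A standalone test module has no such device, so it selects the
         * channel by name with a filter, the same way dmatest does.
         */
        pr_err("[%s] no DMA channel named %s\n", DRV_NAME, chan_name);
        pr_err("[%s] hint: list /sys/class/dma and check #dma-cells and compatible in the device tree\n", DRV_NAME);
        return -ENODEV;
    }
    pr_info("[%s] got DMA channel: %s\n", DRV_NAME, dma_chan_name(chan));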

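    /* -- Step 2: allocate coherent DMA memory --
     *
     * Memory returned by dma_alloc_coherent is:
     *   - physically contiguous (large blocks are taken from the CMA pool;
     *     if the pool is too small the call fails, it never returns
     *     scattered memory)
     *   - mapped non-cacheable on the CPU side
     *   - directly readable and writable by both the DMA device and the CPU,
     *     with no cache flush needed
     *
     * On Zynq the HP ports bypass the CPU caches (only the ACP port is
     * cache-coherent), so with HP0 dma_alloc_coherent is the simplest and
     * safest choice.
     */
    cpu_buf = dma_alloc_coherent(chan->device->dev,
                                  buf_size,
                                  &dma_addr,
                                  GFP_KERNEL);
    if (!cpu_buf) {
        pr_err("[%s] dma_alloc_coherent failed (requested %zu MB)\n",
               DRV_NAME, buf_size >> 20);
        pr_err("[%s] hint: try enlarging the CMA pool with cma=128M on the kernel command line\n", DRV_NAME);
        ret = -ENOMEM;
        goto release_chan;
    }
    /* %pad already prints a 0x prefix */
    pr_info("[%s] DMA buffer: CPU vaddr=%p, DMA addr=%pad, size=%zu MB\n",
            DRV_NAME, cpu_buf, &dma_addr, buf_size >> 20);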

    /* -- Step 3: transfer in a loop and measure throughput -- */
    for (i = 0; i < num_transfers; i++) {
        dma_cookie_t cookie;

        init_completion(&ctx.done);
        ctx.error = false;

        /* Prepare the descriptor (single block, not SG) */
        desc = dmaengine_prep_slave_single(
                    chan,
                    dma_addr,         /* destination bus address */
                    buf_size,         /* number of bytes */
                    DMA_DEV_TO_MEM,   /* S2MM: device (PL) → memory (DDR) */
                    DMA_PREP_INTERRUPT | DMA_CTRL_ACK);
        if (!desc) {
            pr_err("[%s] [%d] dmaengine_prep_slave_single failed\n", DRV_NAME, i);
            ret = -EIO;
            goto free_buf;
        }

        desc->callback       = dma_xfer_callback;
        desc->callback_param = &ctx;

        /* Submit and kick off */
        t_start = ktime_get();
        cookie = dmaengine_submit(desc);
        if (dma_submit_error(cookie)) {
            pr_err("[%s] [%d] dmaengine_submit failed\n", DRV_NAME, i);
            ret = -EIO;
            goto free_buf;
        }
        dma_async_issue_pending(chan);

        /* Wait for completion (at most 10 s, then report a timeout) */
        if (!wait_for_completion_timeout(&ctx.done, msecs_to_jiffies(10000))) {
            pr_err("[%s] [%d] DMA transfer timed out (>10s); check that the PL source is producing data\n",
                   DRV_NAME, i);
            dmaengine_terminate_sync(chan);
            ret = -ETIMEDOUT;
            goto free_buf;
        }
        t_end = ktime_get();

        elapsed_us   = ktime_to_us(ktime_sub(t_end, t_start));
        if (elapsed_us <= 0)
            elapsed_us = 1;           /* guard the division below */
        total_bytes += buf_size;
        total_us    += elapsed_us;

        /* 64-bit division must go through div64_u64() on 32-bit ARM */
        pr_info("[%s] [%2d] %zu MB done in %lld us, bandwidth %llu MB/s\n",
                DRV_NAME, i,
                buf_size >> 20,
                elapsed_us,
                div64_u64((u64)buf_size * 1000000ULL, (u64)elapsed_us << 20));
    }

    pr_info("[%s] -- summary --\n", DRV_NAME);
    if (total_us > 0)
        pr_info("[%s] %llu MB in total, average bandwidth %llu MB/s\n",
                DRV_NAME,
                total_bytes >> 20,
                div64_u64(total_bytes * 1000000ULL, (u64)total_us << 20));

free_buf:
    dma_free_coherent(chan->device->dev, buf_size, cpu_buf, dma_addr);
release_chan:
    dma_release_channel(chan);
    return ret;
}

static void __exit axidma_test_exit(void)
{
    pr_info("[%s] module unloaded\n", DRV_NAME);
}

module_init(axidma_test_init);
module_exit(axidma_test_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Kaiyo Nan");
MODULE_DESCRIPTION("AXI DMA S2MM throughput test");

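Makefile (in the same directory):

# Makefile: axidma_test kernel module
# KDIR defaults to the running kernel (native build on the board).
# For a cross build, pass KDIR, ARCH=arm and CROSS_COMPILE on the command line.
KDIR ?= /lib/modules/$(shell uname -r)/build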

obj-m := axidma_test.o

all:
	$(MAKE) -C $(KDIR) M=$(PWD) modules

clean:
	$(MAKE) -C $(KDIR) M=$(PWD) clean

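7. Measured throughput

Test environment: Pynq-Z2, XC7Z020, HP0 configured as 64-bit, PL clock 150 MHz.

Test parameters	Measured bandwidth	Remarks
burst=16, 1 MB block	312 MB/s	Burst too small, poor HP0 bus utilization
burst=64, 4 MB block	578 MB/s	
burst=256, 16 MB block	803 MB/s	Recommended configuration
burst=256, SG mode 8×2 MB	781 MB/s	SG overhead about 3%
burst=256, both channels concurrently	1.08 GB/s	MM2S + S2MM running together, but DDR bandwidth is saturated
Theoretical peak (64-bit @150 MHz)	1.2 GB/s	64 × 150M / 8 = 1200 MB/s

The measured ~800 MB/s is 67% of the theoretical value. The remaining 33% is spent on:

Address arbitration at the HP0 port (burst splitting for the AXI3 protocol used by the PS)
DDR row switching (every row miss pays tRP + tRCD, each on the order of 13 ns)
Descriptor fetches inside the AXI DMA (SG mode adds a few extra bus transactions)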

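8. How burst size and scatter-gather affect bandwidth

8.1 Why is burst=256 the best?

The HP0 port speaks AXI3, which limits one burst to 16 beats (AXI3 spec). Xilinx's AXI SmartConnect splits longer bursts automatically, so the Max Burst Size configured in the AXI DMA is the largest block the DMA itself issues toward the bus; SmartConnect then breaks it into legal AXI3 transactions.

With burst=256 (256 × 64-bit = 2 KB per burst):

At 150 MHz one beat takes about 6.67 ns, so 256 beats occupy the bus for about 1707 ns; assume a fixed arbitration overhead of roughly 50 ns per burst
Utilization: 1707 / (1707 + 50) ≈ 97%

With burst=16 (16 × 64-bit = 128 B per burst):

Utilization: 128 B take 16 × 6.67 ns ≈ 107 ns; adding the 50 ns overhead gives 107 / 157 ≈ 68%
The same fixed overhead therefore eats roughly ten times more of the bus at burst=16 than at burst=256. This simple model only explains the trend: the measured figures (312 MB/s versus 803 MB/s) are lower than it predicts, because the DDR and descriptor costs listed in section 7 come on top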

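8.2 Scatter-Gather (SG) mode

SG mode chains physically non-contiguous memory fragments into one transfer list (a descriptor chain), so no single large contiguous DMA buffer is required. The price is that every descriptor has to be fetched from DDR once (starting from the Current Descriptor Pointer), which costs about 3-5% of the bandwidth.

In real projects SG mode is almost always needed, because:

A dma_alloc_coherent request of 16 MB or more can fail outright when the CMA pool is too small or fragmented, whereas several smaller buffers are much easier to obtain
A video frame buffer set (triple buffering) naturally consists of three separate memory blocks

🚧 Pitfall: in SG mode the descriptor chain itself lives in DMA memory and every descriptor must be aligned to a 64-byte boundary (an AXI DMA hardware requirement). If you allocate descriptor memory yourself, use dma_alloc_coherent (page-aligned, so the first descriptor is fine as long as you keep a 64-byte stride) or dma_pool_create with an alignment of 64. SLAB_HWCACHE_ALIGN is not sufficient on the Cortex-A9, whose cache line is 32 bytes. The xilinx_dma driver manages its SG descriptors internally: when you call dmaengine_prep_slave_sg() you pass in a mapped scatterlist and its entry count, and you do not have to deal with descriptor alignment at all.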

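9. dma_alloc_coherent vs dma_map_single

This is where Zynq DMA programming bites most often, so here is a direct side-by-side comparison:

Property	`dma_alloc_coherent`	`dma_map_single`
Memory source	Allocated by the kernel specifically for DMA	Existing `kmalloc` memory (not `vmalloc` memory, which is not physically contiguous)
Physical contiguity	Guaranteed (large blocks come from the CMA pool)	The buffer must already be contiguous; the call maps it as is
Cache attribute	non-cacheable (automatic)	cacheable (the CPU side still goes through the cache)
Before and after use	No sync needed	map/unmap do the cache maintenance; `dma_sync_single_for_device/cpu` is needed when the CPU touches the buffer while it stays mapped
Typical use	Long-lived DMA buffer, allocated in the driver's probe	Handing existing data to the DMA once, then releasing the mapping
On Zynq HP0	Recommended: zero copy, no cache operations	Works, but the map/unmap/sync rules must be followed or the DMA sees stale data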

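Correct use of dma_map_single:

/* Assume buf is a CPU buffer from kmalloc that holds data destined for the PL;
 * dev, chan, desc and src are set up elsewhere */
dma_addr_t dma_handle;
size_t     size = 4096;

/* 1. The CPU finishes writing the data */
memcpy(buf, src, size);

/* 2. map: flush the CPU cache and hand the data over to the DMA */
dma_handle = dma_map_single(dev, buf, size, DMA_TO_DEVICE);
if (dma_mapping_error(dev, dma_handle)) {
    /* handle the error */
}

/* 3. Submit the DMA transfer (MM2S: DDR → PL) */
desc = dmaengine_prep_slave_single(chan, dma_handle, size,
                                    DMA_MEM_TO_DEV, DMA_PREP_INTERRUPT);
/* ... submit + issue_pending ... */
/* ... wait for completion ... */

/* 4. unmap: after the transfer completes, ownership returns to the CPU */
dma_unmap_single(dev, dma_handle, size, DMA_TO_DEVICE);

/* 5. The CPU may now read and write buf again */

🚧 Pitfall: between a successful dma_map_single and the matching dma_unmap_single the CPU must not read or write buf. During that window the buffer is owned by the DMA side. Breaking this rule leaves the CPU cache and the DMA data out of step; the symptom is random data corruption that is extremely hard to reproduce.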

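10. Checklist for this part

Vivado Block Design: M_AXI_S2MM of the AXI DMA IP is connected to the PS HP0 port (64-bit)
PS7 configuration: S_AXI_HP0 Data Width is set to 64 (do not assume it; it may be 32)
PetaLinux kernel: CONFIG_XILINX_DMA=y, CONFIG_DMA_OF=y
Device tree: axidma@40400000, compatible = "xlnx,axi-dma-1.00.a", xlnx,sg-length-width identical to the Vivado IP configuration
dma_alloc_coherent for long-lived DMA buffers; dma_map_single for one-off transfers, and remember to unmap (or sync when the mapping is reused)
Measured: burst=256 with a single 16 MB block reaches ~800 MB/s; below 400 MB/s, check the HP0 width setting first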

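11. Coming up next

The next part, "Zynq in Practice 13 | VDMA + VTC + HDMI: Building a 1080p Video Pipeline", builds on the DMA material from this part:

Add AXI VDMA to the Block Design and configure the number of frame buffers and the frame size
Connect a VTC (Video Timing Controller) that outputs 1920×1080@60Hz timing
Configure the HDMI Tx subsystem and attach a monitor
Deal with tearing, frame offsets and Genlock, the problems that actually show up during debugging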

References

Document	Title	Used for
PG021	AXI DMA v7.1 Product Guide	AXI DMA register map, SG descriptor format, IP configuration parameters
UG585	Zynq-7000 SoC TRM	HP port interface specification (chapter 9), DDR address map, IRQ_F2P interrupt number table
UG1144	PetaLinux Tools Reference Guide 2023.2	Kernel menuconfig workflow, where to customize the device tree
Linux kernel	`drivers/dma/xilinx/xilinx_dma.c`	Xilinx DMA driver source: of_device_id strings, SG descriptor allocation logic
Linux kernel	`include/linux/dmaengine.h`	dmaengine API declarations: `dmaengine_prep_slave_single`, `dma_async_issue_pending`
Linux kernel	`Documentation/driver-api/dmaengine/client.rst`	dmaengine client guide: the proper sequence for requesting and releasing channels
