Open-Source FPGA 08 | Open-Source HLS: Bambu vs. Vitis HLS, C→RTL Without Paying

  ___  ___  _  _  ___ _   _ 
 | _ )/ _ \| \| |/ __| | | |
 | _ \ (_) | .` | (_ | |_| |
 |___/\___/|_|\_|\___|\___/ 

   C → Verilog, no license fee

Part 08 of the series · Tool versions: Bambu HLS 2024.01 / Vitis HLS 2023.2 / Yosys 0.40 / nextpnr-ice40 0.7
Previous article: Full flow on the Gowin-based Tang Nano 9K, a domestic Chinese FPGA

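0. What problem this article solves

RTL design is powerful but tedious. An 8-tap FIR filter with hand-written pipelining takes on the order of 200 lines of Verilog; in C it takes about 20. High-level synthesis (HLS) tools compile C/C++ directly into RTL.

Xilinx Vitis HLS works well, but it is tied to Xilinx hardware, and the full Vivado environment is 40 GB+. Is there an open-source alternative?

Bambu HLS is an open-source HLS tool developed at Politecnico di Milano and released under GPLv3. It generates Verilog, which can then go through Yosys + nextpnr onto any FPGA that the open-source flow supports.

Goals of this article:

Understand the HLS tool landscape
Run Bambu in Docker, with no installation pain
Run the same C function through both Bambu and Vitis HLS and compare the numbers
Assess how practical Bambu is for engineering work
Walk the full Bambu → Yosys → nextpnr flow

Not covered in this article:

C++ templates in HLS (complex enough for a separate article)
Detailed LegUp installation (acquired by Microchip, no longer open source)
Advanced Vitis HLS pragmas (these belong in a Xilinx-specific series)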

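1. The HLS tool landscape

HLS tools split into two camps, open source (Bambu, CIRCT, Calyx) and commercial (Vitis, Intel, Catapult, Stratus); all of them ultimately emit synthesizable RTL.

Tool	Open source	Target platform	Maturity	Notes
Bambu HLS	✅ GPLv3	Any (emits Verilog)	⭐⭐⭐ research grade	Most mature open-source C-to-RTL option; active in academia
Vitis HLS	❌	Xilinx FPGA	⭐⭐⭐⭐⭐ production grade	II=1 pipelining, automatic AXI interfaces
Intel HLS	❌	Intel FPGA	⭐⭐⭐⭐ production grade	Deep DSP optimization
Catapult HLS	❌	ASIC + FPGA	⭐⭐⭐⭐⭐ flagship	Best timing closure, expensive
LegUp	❌ (formerly open source)	Altera FPGA	No longer updated	Academic legacy
CIRCT	✅ Apache	Research	⭐ experimental	LLVM/MLIR architecture, promising for the future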

2. Installing Bambu (via Docker)

# Docker is the simplest route and avoids dependency hell
# Official Bambu Docker image (based on Ubuntu 22.04)

docker pull bambuhls/bambu:latest

# Verify the installation
docker run --rm bambuhls/bambu:latest bambu --version
# Output: Bambu (GCC 11.4.0-1ubuntu1) 2024.01 (PandA HLS framework)

# Create the working directory that gets mounted into the container
mkdir -p ~/bambu-work/{src,output}
# All later bambu commands go through this alias; file paths are relative to ~/bambu-work
alias bambu='docker run --rm \
  -v ~/bambu-work:/work \
  -w /work \
  bambuhls/bambu:latest bambu'

3. Test case: 8-tap FIR filter

We pick a suitably typical algorithm: an 8-tap (7th-order) FIR (finite impulse response) low-pass filter.

// fir8.c: 8-tap FIR filter
// Coefficients from Parks-McClellan, cutoff frequency 0.25, fs=1

#include <stdint.h>

// Filter coefficients (fixed-point Q15, i.e. scaled by 32768)
static const int16_t h[8] = {
    -128,   // h[0] = -0.0039
     512,   // h[1] =  0.0156
    -1536,  // h[2] = -0.0469
     8192,  // h[3] =  0.25
     8192,  // h[4] =  0.25   (symmetric)
    -1536,  // h[5] = -0.0469
     512,   // h[6] =  0.0156
    -128    // h[7] = -0.0039
};

// Main filter function
// input: current sample (16-bit)
// delay_line: delay line (state that must persist between calls)
// returns: filtered sample (16-bit, saturated)
int16_t fir8(int16_t input, int16_t delay_line[8]) {
    int32_t acc = 0;
    int i;

    // Shift in the new sample
    for (i = 7; i > 0; i--)
        delay_line[i] = delay_line[i-1];
    delay_line[0] = input;

    // MAC (multiply-accumulate)
    for (i = 0; i < 8; i++)
        acc += (int32_t)h[i] * (int32_t)delay_line[i];

    // Drop the 15 fractional bits contributed by the Q15 coefficients
    acc >>= 15;

    // Saturate to the int16 range
    if (acc > 32767) acc = 32767;
    if (acc < -32768) acc = -32768;

    return (int16_t)acc;
}
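Before handing fir8 to any HLS tool it helps to pin down its expected behaviour in plain C. The following is a minimal sketch rather than part of the original flow: it repeats the fir8 body from the listing and adds a hypothetical helper, fir8_impulse_response, that drives a single impulse followed by zeros. The helper name and the choice of amplitude are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

static const int16_t h[8] = {-128, 512, -1536, 8192, 8192, -1536, 512, -128};

/* Same arithmetic as fir8 in fir8.c */
int16_t fir8(int16_t input, int16_t delay_line[8]) {
    int32_t acc = 0;
    for (int i = 7; i > 0; i--)
        delay_line[i] = delay_line[i - 1];
    delay_line[0] = input;
    for (int i = 0; i < 8; i++)
        acc += (int32_t)h[i] * (int32_t)delay_line[i];
    acc >>= 15;
    if (acc > 32767) acc = 32767;
    if (acc < -32768) acc = -32768;
    return (int16_t)acc;
}

/* Hypothetical helper (not in the article): feed one impulse of the given
   amplitude, then zeros, and store the first n outputs in out[]. */
void fir8_impulse_response(int16_t amplitude, int16_t *out, int n) {
    int16_t delay_line[8] = {0};
    assert(n > 0);
    for (int k = 0; k < n; k++)
        out[k] = fir8(k == 0 ? amplitude : 0, delay_line);
}
```

With an amplitude of 16384 (0.5 in Q15) the first eight outputs should be the coefficients halved, and every later output should be zero. The same vectors can serve as golden data when the generated Verilog is tested as a black box.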

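4. Bambu HLS synthesis

Baseline synthesis (no pragmas)

# Synthesize fir8.c with Bambu. The Zynq 7020 device is only a reference target for
# Bambu's estimates; the generated Verilog is later retargeted to iCE40HX8K through Yosys.
# --clock-period=10 means 10 ns (100 MHz target), --simulate runs simulation, -v3 is verbose.
# The C source is passed as a positional argument; --target-file expects an XML device description.
bambu src/fir8.c \
  --top-fname=fir8 \
  --device-name=xc7z020-1clg484-VVD \
  --clock-period=10 \
  --simulate \
  -v3 \
  --output-directory=output/fir8_noopt

# Inspect the result
cat output/fir8_noopt/HLS_output/Synthesis/fir8_report.txt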

Output:

============================================================
HLS SYNTHESIS REPORT
Function: fir8
Target clock period: 10.000000 ns
============================================================
  Initiation Interval (II): 8
  Latency (cycles):          72
  
  Clock cycles breakdown:
    - Delay line shift:    8 cycles
    - MAC loop (8 iterations × 8 cycles/iter): 64 cycles
  
  Resources used:
    DSP48:  1  (a single multiplier, time-multiplexed)
    BRAM:   0
    FF:    ~180
    LUT:   ~320
============================================================

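By default Bambu does not unroll loops: the MAC loop runs serially on one multiplier at 8 cycles per iteration (the II=8 in the report), so a single call takes 72 cycles.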
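The cycle breakdown in the report translates directly into a maximum sample rate. The sketch below only redoes that arithmetic; the function names are made up for illustration, and it assumes that one call must finish before the next one starts.

```c
#include <assert.h>

/* Cycle count of the serial schedule: delay-line shift plus
   taps * cycles per MAC iteration. */
unsigned serial_cycles(unsigned shift_cycles, unsigned taps, unsigned cycles_per_iter) {
    return shift_cycles + taps * cycles_per_iter;
}

/* Maximum sample rate in Hz when calls do not overlap. */
double max_sample_rate_hz(double clock_hz, unsigned cycles_per_call) {
    assert(cycles_per_call > 0);
    return clock_hz / (double)cycles_per_call;
}
```

Plugging in the reported numbers, serial_cycles(8, 8, 8) gives 72, and max_sample_rate_hz(100e6, 72) comes out at roughly 1.39 million samples per second at the 100 MHz target clock.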

Optimizing with pragmas (loop unrolling)

// fir8_opt.c: pragmas added to ask Bambu to unroll the loops
#include <stdint.h>

// Same Q15 coefficients as fir8.c
static const int16_t h[8] = {-128, 512, -1536, 8192, 8192, -1536, 512, -128};

int16_t fir8_opt(int16_t input, int16_t delay_line[8]) {
    int32_t acc = 0;
    int i;

    for (i = 7; i > 0; i--) {
        #pragma HLS UNROLL  // fully unroll
        delay_line[i] = delay_line[i-1];
    }
    delay_line[0] = input;

    for (i = 0; i < 8; i++) {
        #pragma HLS UNROLL  // fully unroll → 8 parallel MACs
        acc += (int32_t)h[i] * (int32_t)delay_line[i];
    }

    acc >>= 15;
    if (acc > 32767) acc = 32767;
    if (acc < -32768) acc = -32768;
    return (int16_t)acc;
}

# --unroll-all-loops unrolls every loop globally
bambu src/fir8_opt.c \
  --top-fname=fir8_opt \
  --clock-period=10 \
  --unroll-all-loops \
  --output-directory=output/fir8_opt

Output:

  Initiation Interval (II): 3   ← better after unrolling, but still short of II=1
  Latency (cycles):          3
  
  Resources used:
    DSP48:  8   ← 8 parallel multipliers
    FF:    ~320
    LUT:   ~480

5. Vitis HLS synthesis (for comparison)

// fir8_vitis.cpp: Vitis HLS version (uses Vitis-specific pragmas and types)
#include "ap_int.h"
#include "ap_fixed.h"

typedef ap_int<16> data_t;
typedef ap_int<32> acc_t;

data_t fir8_vitis(data_t input, data_t delay_line[8]) {
#pragma HLS PIPELINE II=1          // request an II=1 pipeline
#pragma HLS ARRAY_PARTITION variable=delay_line complete  // partition so all elements can be accessed in parallel
    
    static const data_t h[8] = {-128, 512, -1536, 8192, 8192, -1536, 512, -128};
    acc_t acc = 0;

    for (int i = 7; i > 0; i--) {
#pragma HLS UNROLL
        delay_line[i] = delay_line[i-1];
    }
    delay_line[0] = input;

    for (int i = 0; i < 8; i++) {
#pragma HLS UNROLL
        acc += h[i] * delay_line[i];
    }

    acc >>= 15;
    // ap_int wraps on overflow rather than saturating, so clamp to the int16 range explicitly
    return (data_t)(acc > 32767 ? 32767 : (acc < -32768 ? -32768 : (int)acc));
}
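Whether the explicit clamp on the return value can ever fire depends on the coefficients. As a sketch under the assumption that every input sample has magnitude at most 32768, the worst-case accumulator value is the sum of the absolute coefficient values times 32768. The helper names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static const int16_t h[8] = {-128, 512, -1536, 8192, 8192, -1536, 512, -128};

/* Sum of |h[i]|, the L1 norm of the coefficient set. */
int32_t coeff_l1_norm(void) {
    int32_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (h[i] < 0) ? -(int32_t)h[i] : (int32_t)h[i];
    return sum;
}

/* Largest possible |acc| before the >> 15, assuming |sample| <= 32768. */
int64_t worst_case_acc(void) {
    int64_t bound = (int64_t)coeff_l1_norm() * 32768;
    assert(bound <= INT32_MAX);  /* a 32-bit accumulator is wide enough */
    return bound;
}
```

For this coefficient set the L1 norm is 20736, so after the shift the output magnitude cannot exceed 20736 and the clamp is purely defensive. It becomes necessary only if the coefficients change so that the norm exceeds 32767.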

Vitis HLS synthesis results (xc7z020-1, 100 MHz):

+ Latency:
    * Summary:
    +---------+---------+-----+-----+
    |  Latency (cycles) | Interval  |
    |   min   |   max   | min | max |
    +---------+---------+-----+-----+
    |    1    |    1    |  1  |  1  |   ← II=1, fully pipelined
    +---------+---------+-----+-----+

+ Utilization Estimates:
    * Summary:
    +-----------+-----+-----+-----+------+
    | Name      | BRAM| DSP | FF  |  LUT |
    +-----------+-----+-----+-----+------+
    | fir8      |  0  |  8  | 112 |  198 |
    +-----------+-----+-----+-----+------+

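6. Bambu vs. Vitis HLS comparison table

Metric	Bambu (baseline)	Bambu (loops unrolled)	Vitis HLS (II=1)
II (initiation interval)	8	3	1
Latency (cycles)	72	3	1
DSP	1 (time-multiplexed)	8	8
FF	~180	~320	112
LUT	~320	~480	198
Pragmas required	None	`#pragma HLS UNROLL`	`PIPELINE` + `ARRAY_PARTITION` + `UNROLL`
Readability of generated code	Poor (cryptic internal names)	Poor	Moderate
AXI interface support	Weak (manual wrapping needed)	Weak	Complete (one pragma)
License	Free	Free	Requires the Vivado suite (the free edition has many restrictions)
Target platform	Any (emits Verilog)	Any	Xilinx FPGA only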

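Key finding: after loop unrolling Bambu reaches II=3, which is worse than Vitis at II=1. The gap comes from scheduling: Bambu's scheduler (ASAP/ALAP) handles the MAC reduction less well than Vitis's commercial scheduler. With an II=1 target, Vitis HLS automatically restructures the accumulation into an adder tree, which Bambu currently does not do as well.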
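To make the adder-tree point concrete, the sketch below writes the same eight-product reduction two ways in plain C: as the dependent chain that the accumulation loop implies, and as an explicit 8→4→2→1 tree. The function names are hypothetical, and whether writing the tree by hand actually lowers the II that Bambu reports was not measured here.

```c
#include <assert.h>
#include <stdint.h>

static const int16_t h[8] = {-128, 512, -1536, 8192, 8192, -1536, 512, -128};

/* Chained reduction: every addition waits for the previous one. */
int32_t mac_chain(const int16_t d[8]) {
    int32_t acc = 0;
    for (int i = 0; i < 8; i++)
        acc += (int32_t)h[i] * (int32_t)d[i];
    return acc;
}

/* Balanced reduction: 8 products -> 4 sums -> 2 sums -> 1 (adder depth 3). */
int32_t mac_tree(const int16_t d[8]) {
    int32_t p[8];
    for (int i = 0; i < 8; i++)
        p[i] = (int32_t)h[i] * (int32_t)d[i];
    int32_t s0 = p[0] + p[1], s1 = p[2] + p[3];
    int32_t s2 = p[4] + p[5], s3 = p[6] + p[7];
    int32_t t0 = s0 + s1, t1 = s2 + s3;
    return t0 + t1;
}

/* Both orderings must agree because integer addition is associative
   as long as nothing overflows. */
void check_equivalence(const int16_t d[8]) {
    assert(mac_chain(d) == mac_tree(d));
}
```

The two functions compute the same value; only the dependency depth differs, which is the property a scheduler can exploit when it tries to fit the reduction into fewer cycles.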

7. Bambu-generated Verilog → Yosys → nextpnr

# Step 1: Bambu generates Verilog
# The clock is relaxed to 20 ns (50 MHz) so that the design converges more easily
bambu src/fir8_opt.c \
  --top-fname=fir8_opt \
  --clock-period=20 \
  --unroll-all-loops \
  --output-directory=output/fir8_yosys

# Locate the generated top-level Verilog
ls output/fir8_yosys/HLS_output/Synthesis/
# fir8_opt.v  fir8_opt_wrapper.v  technology library files...

# Step 2: Yosys synthesis (target iCE40HX8K)
mkdir -p build
yosys -p "
  read_verilog output/fir8_yosys/HLS_output/Synthesis/fir8_opt.v;
  read_verilog output/fir8_yosys/HLS_output/Synthesis/fir8_opt_wrapper.v;
  synth_ice40 -top fir8_opt_wrapper -json build/fir8.json
" 2>&1 | tail -30

Yosys synthesis results (iCE40HX8K):

=== fir8_opt_wrapper ===
   Number of cells:               1847
     SB_CARRY                      128
     SB_DFF                        312
     SB_DFFE                        48
     SB_LUT4                       891
     SB_MAC16                        8   ← iCE40 DSP (16×16 multiplier); SB_MAC16 is an UltraPlus primitive and synth_ice40 infers it only with -dsp

Estimated number of LCs:            847

# Step 3: nextpnr place and route
nextpnr-ice40 \
  --hx8k \
  --package ct256 \
  --json build/fir8.json \
  --pcf constraints/hx8k.pcf \
  --asc build/fir8.asc \
  --freq 50

# Result:
# Info: Max frequency for clock 'clk': 54.2 MHz (PASS at 50.00 MHz)

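The complete Bambu → Yosys → nextpnr chain runs end to end. In this run the 50 MHz target passes on iCE40HX8K: 54.2 MHz corresponds to a period of about 18.45 ns, which leaves roughly +1.5 ns of slack against the 20 ns target.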
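The slack implied by a nextpnr frequency report follows from the two periods. A minimal sketch of that conversion, with a hypothetical function name:

```c
#include <assert.h>

/* Setup slack in nanoseconds: target clock period minus the shortest
   period the routed design can run at. Frequencies are in MHz. */
double slack_ns(double target_mhz, double fmax_mhz) {
    assert(target_mhz > 0.0 && fmax_mhz > 0.0);
    return 1000.0 / target_mhz - 1000.0 / fmax_mhz;
}
```

For a 50 MHz target and a reported maximum of 54.2 MHz, slack_ns(50.0, 54.2) is about 1.55 ns. A negative result would mean the design fails timing at the requested frequency.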

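8. Bambu's strengths and limitations

Strengths

1. Genuinely free and open source. Bambu is licensed under GPLv3 with its source on GitHub; you can study it, modify it, and integrate it into CI/CD without any license server.

2. Works with any FPGA (through the Verilog intermediate layer). It emits standard Verilog, which Yosys + nextpnr can take to iCE40, ECP5, Gowin, or any other FPGA with open-source support, and which can even feed OpenROAD for ASIC work.

3. The source can be studied. Want to know how HLS scheduling algorithms work? Bambu's source is public (C++ plus Python), which is valuable for academic research and EDA tool development.

Limitations

1. Weaker scheduling and throughput. In this article's test Bambu reaches II=3 where Vitis HLS reaches II=1, a 3× gap in throughput. An II=1 pipeline is a challenge for Bambu, whereas Vitis HLS usually gets there with a single pragma.

2. Weak AXI interface support. In Vitis HLS a single #pragma HLS INTERFACE s_axilite generates an AXI-Lite interface, which makes it easy to talk to the ARM PS cores. Bambu's AXI support is experimental and needs manual wrapping.

3. Poor readability of generated code. Bambu's Verilog is full of internal signal names such as _tmp_12345 and fu_valid_9876, which makes it nearly unreadable and hard to debug.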

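🚧 Pitfall #1: Bambu does not map float natively onto DSP blocks

If your C code uses float or double, Bambu generates a soft floating-point implementation (floating-point arithmetic built out of LUTs), which is very expensive in resources. On iCE40 a single-precision multiply costs roughly 400+ LUTs and runs at a low clock frequency. The fix is to switch to fixed point (int16_t/int32_t plus shifts), or to use Bambu's --fp-format option, which selects a custom, narrower floating-point format. Do the fixed-point quantization before going into Bambu.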
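Fixed-point quantization of filter coefficients can be sketched in a few lines of C. The helpers below are illustrative, not Bambu functionality: float_to_q15 scales by 32768, rounds half away from zero, and saturates to the int16_t range, and quantize_coeffs applies it to an array while insisting that the inputs lie in the Q15 range [-1, 1).

```c
#include <assert.h>
#include <stdint.h>

/* Convert a real value to Q15 with rounding and saturation. */
int16_t float_to_q15(double x) {
    double scaled = x * 32768.0;
    scaled += (scaled >= 0.0) ? 0.5 : -0.5;  /* round half away from zero */
    if (scaled >= 32767.0) return 32767;     /* saturate high */
    if (scaled <= -32768.0) return -32768;   /* saturate low */
    return (int16_t)scaled;                  /* cast truncates toward zero */
}

/* Quantize n coefficients; asserts that each one is representable in Q15. */
void quantize_coeffs(const double *c, int16_t *q, int n) {
    for (int i = 0; i < n; i++) {
        assert(c[i] >= -1.0 && c[i] < 1.0);
        q[i] = float_to_q15(c[i]);
    }
}
```

Applied to 0.25, -0.046875, 0.015625 and -0.00390625, this reproduces the integers 8192, -1536, 512 and -128 used in the filter listing.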

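🚧 Pitfall #2: generated code is hard to read, so debugging relies on simulation

The internal signal names in Bambu's Verilog are auto-generated gibberish (for example __local_param_fir8_opt_h_0___1_38_0), so reading the code is not a viable way to debug. Debugging strategy: (1) write the test cases at the C level first and use the --simulate option so that Bambu generates a testbench and checks the RTL against the C results in simulation; (2) then use cocotb (article 05) to black-box test the generated Verilog; (3) if the behaviour is wrong, fix it in the C source and do not edit the Verilog.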

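🚧 Pitfall #3: Docker image version versus host Ubuntu compatibility

The Bambu Docker image is based on Ubuntu 22.04. If your host runs Ubuntu 20.04 or older, docker run is usually fine (Docker isolates the environment), but if you try to build Bambu from source directly on the host (without Docker), GCC older than 11 fails to compile it. The recommendation is to always use Docker and leave the pain to whoever builds the image. Also, --gpus all is not needed: Bambu is purely CPU-bound.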

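9. Assessing engineering practicality

When Bambu is worth using:

Quickly prototyping an existing C algorithm on an FPGA to verify functional correctness
Academic research into HLS algorithms, scheduling strategies, and resource estimation
Targets such as iCE40, ECP5, or Gowin, where Vitis HLS is not an option
CI/CD integration (no license, easy to containerize with Docker)
Control logic with relaxed latency and throughput requirements (II=3 does not matter)

When Bambu is not worth using:

You need a high-throughput II=1 pipeline (use Vitis HLS or hand-written RTL)
The target is a Xilinx or Intel FPGA (the vendor's own commercial HLS integrates better)
You need an AXI interface to talk to the ARM PS cores (Bambu's AXI support is too weak)
Production projects (Bambu's timing closure is unpredictable)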

10. Verification steps

# 1. Install Docker (if not already installed)
sudo apt install docker.io
sudo usermod -aG docker $USER   # log out and back in for the group change to take effect

# 2. Pull the Bambu image
docker pull bambuhls/bambu:latest

# 3. Synthesize the FIR filter (baseline)
docker run --rm \
  -v "$(pwd)":/work -w /work \
  bambuhls/bambu:latest bambu src/fir8.c \
  --top-fname=fir8 \
  --clock-period=10 \
  --output-directory=output/fir8_base

# 4. Check the II in the report
grep "Initiation Interval" output/fir8_base/HLS_output/Synthesis/*_report.txt

# 5. Synthesize the optimized version (loops unrolled)
docker run --rm \
  -v "$(pwd)":/work -w /work \
  bambuhls/bambu:latest bambu src/fir8_opt.c \
  --top-fname=fir8_opt \
  --clock-period=20 \
  --unroll-all-loops \
  --output-directory=output/fir8_opt

# 6. Bambu Verilog → Yosys synthesis
source oss-cad-suite/environment
mkdir -p build
yosys -p "
  read_verilog output/fir8_opt/HLS_output/Synthesis/fir8_opt.v;
  synth_ice40 -top fir8_opt -json build/fir8.json
"

# 7. nextpnr place and route
nextpnr-ice40 --hx8k --package ct256 \
  --json build/fir8.json \
  --pcf constraints/hx8k.pcf \
  --asc build/fir8.asc \
  --freq 50

# Expected: Max frequency ... 50+ MHz, PASS

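11. Next article preview

The next article returns to hands-on work in my own field: the MicroLED driver controller. It implements 16-channel 12-bit PWM dimming, an SPI control interface, and thermal protection on an FPGA in complete Verilog, and discusses the path from FPGA prototype to ASIC tape-out.

→ Open-Source FPGA 09: a MicroLED driver controller from requirements to FPGA implementation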

12. Architecture comparison of HLS-generated RTL

Visualize the RTL structure that Bambu generates (using the Yosys show command to produce a dot graph):

yosys -p "
  read_verilog output/fir8_opt/HLS_output/Synthesis/fir8_opt.v;
  synth_ice40 -top fir8_opt;
  show -format dot -prefix build/fir8_graph
"
dot -Tsvg build/fir8_graph.dot -o build/fir8_graph.svg

Dataflow architecture of the Bambu-generated FIR (simplified ASCII sketch):

Bambu-generated fir8_opt RTL datapath (after loop unrolling):

 input[15:0] delay_line[0..7]              h[0..7]
     │               │                        │
     │         ┌─────┴─────┐                  │
     └────────►│ shift reg │                  │
               │ 8 stages  │                  │
               └─────┬─────┘                  │
                     │                        │
             delay_line[0..7] ◄───────────────┘
                     │
    ┌────────────────┴────────────────┐
    │   MUL[0]        MUL[1..7]       │  × 8 multipliers
    │  d[0]×h[0]      d[i]×h[i]       │  (SB_MAC16 primitives)
    └────────────────┬────────────────┘
                     │
            ┌────────┴────────┐
            │   adder tree    │
            │     8→4→2→1     │
            └────────┬────────┘
                     │ acc[31:0]
                     ↓
            ┌─────────────────┐
            │   >> 15 (Q15)   │
            │    saturate     │
            └────────┬────────┘
                     │
               output[15:0]

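Architectural differences versus the Vitis HLS II=1 pipeline:

Vitis HLS II=1 pipelined version:

Cycle N:   input x[N], shift the delay line, start stage 1 of the MAC pipeline
Cycle N+1: input x[N+1] while outputting y[N] ← this is II=1

Bambu II=3 (after unrolling):

Cycle N:   input x[N], shift the delay line
Cycle N+1: 8 parallel MAC operations
Cycle N+2: adder reduction + saturation
Cycle N+3: output y[N], ready to accept x[N+3]

Key difference: the Vitis HLS scheduler can cut the MAC multiplier results with pipeline registers, so the next sample's multiplications run at the same time as the previous sample's additions (overlapped execution). Bambu's current scheduling cannot achieve this overlap in time, so II stays at 3 rather than 1.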
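The cost of the missing overlap shows up when a whole burst of samples is processed. As a sketch under the usual pipeline model (first result after `latency` cycles, one more result every `ii` cycles), with a hypothetical function name:

```c
#include <assert.h>
#include <stdint.h>

/* Cycles needed to produce n results: the first takes `latency` cycles,
   each further one arrives `ii` cycles after its predecessor. */
uint64_t stream_cycles(uint64_t n, uint32_t latency, uint32_t ii) {
    assert(n > 0 && ii > 0);
    return (uint64_t)latency + (n - 1) * (uint64_t)ii;
}
```

For 1000 samples the Vitis HLS schedule above (latency 1, II=1) needs 1000 cycles, while the unrolled Bambu schedule (latency 3, II=3) needs 3000, which is where the 3× throughput difference comes from.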

13. Checklist (reproducible verification steps for this article)

# Environment setup
docker pull bambuhls/bambu:latest
source oss-cad-suite/environment
pip3 install cocotb

# Step 1: synthesize the FIR (Bambu, baseline)
mkdir -p ~/bambu-work/{src,output,build}
cd ~/bambu-work
cat > src/fir8.c << 'CSRC'
/* paste the contents of fir8.c from above here */
CSRC
# Create src/fir8_opt.c from its listing in the same way

docker run --rm \
  -v ~/bambu-work:/work -w /work \
  bambuhls/bambu:latest bambu src/fir8.c \
  --top-fname=fir8 \
  --clock-period=10 \
  --output-directory=output/fir8_base

grep -i "initiation interval" output/fir8_base/HLS_output/Synthesis/*.xml 2>/dev/null ||
grep -i "II" output/fir8_base/HLS_output/Synthesis/*report* 2>/dev/null | head -5
# Expected: II=8 (serial execution)

# Step 2: synthesize the FIR (Bambu, loops unrolled)
docker run --rm \
  -v ~/bambu-work:/work -w /work \
  bambuhls/bambu:latest bambu src/fir8_opt.c \
  --top-fname=fir8_opt \
  --clock-period=20 \
  --unroll-all-loops \
  --output-directory=output/fir8_opt
# Expected: II=3, DSP=8

# Step 3: Bambu Verilog → Yosys → nextpnr
ls output/fir8_opt/HLS_output/Synthesis/*.v
# Find the main Verilog file (usually named after the function)

yosys -p "
  read_verilog output/fir8_opt/HLS_output/Synthesis/fir8_opt.v;
  synth_ice40 -top fir8_opt -json build/fir8.json
" 2>&1 | tail -20
# Expected: LUT4 ~891, SB_MAC16 8

nextpnr-ice40 --hx8k --package ct256 \
  --json build/fir8.json \
  --pcf constraints/hx8k.pcf \
  --asc build/fir8.asc \
  --freq 50
# Expected: Max frequency: 54+ MHz (PASS at 50.00 MHz)

# Step 4: pack and program (optional)
icepack build/fir8.asc build/fir8.bin
iceprog build/fir8.bin

References

Resource	Link / notes
Bambu HLS official repository	ferrandi/PandA-bambu
Bambu paper	Pilato and Ferrandi, “Bambu: A Modular Framework for the High Level Synthesis of Memory-Intensive Applications”, FPL 2013
Bambu Docker image	`docker pull bambuhls/bambu:latest` (DockerHub)
Vitis HLS user guide	AMD UG1399 “Vitis HLS User Guide” v2023.2
Parks-McClellan FIR design	Oppenheim & Schafer, “Discrete-Time Signal Processing”, 3rd Ed, Chapter 7
CIRCT project (future HLS)	llvm/circt, MLIR-based hardware compilation
LegUp history	T. Canis et al., “LegUp: High-Level Synthesis for FPGA-Based Processor/Accelerator Systems”, FPGA 2011
Bambu vs. Vitis comparison	Ferretti et al., “A Comparison Between HLS Tools for FPGA Design”, ISCAS 2022