
A Deep Dive into NVIDIA Dynamo's End-to-End Architecture Design

Source: NVIDIA | Eli (Dynamo chief architect) | Date: not provided · Category: NVIDIA · Originally published: not provided · Notes generated: 2026-03-09


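Episode Highlights


Speaker and Topic Overview

The speaker, Eli, is the chief architect of the NVIDIA Dynamo project. The talk focuses on the thinking behind Dynamo's end-to-end architecture and addresses a core pain point of LLM inference serving: balancing throughput against interactive latency. It covers the full path from pre-deployment configuration, cluster scheduling, and request routing through to fault tolerance, and serves as a reference for deploying LLM inference at scale.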


Section-by-Section Notes

[00:00] Core challenges in Dynamo's architecture design

Section highlights

Detailed notes

💬 Key excerpt

"So as we look at Dynamo, we need to be able to support not only disaggregated serving, but also aggregated serving. Right. And some areas of the curve aggregated serving will be better than disaggregated. So it's not really a one size fits all."

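The point that neither serving mode wins everywhere can be illustrated with a small selection routine. This is only a sketch: the `Measurement` type, the `best_mode` helper, and every number below are made up for illustration and do not come from the talk or from Dynamo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    mode: str             # "aggregated" or "disaggregated"
    interactivity: float  # tokens/s seen by each user (hypothetical metric)
    throughput: float     # tokens/s per GPU (hypothetical metric)


def best_mode(measurements, min_interactivity):
    """Return the highest-throughput measurement that meets the
    interactivity target, or None if no measurement qualifies."""
    feasible = [m for m in measurements if m.interactivity >= min_interactivity]
    return max(feasible, key=lambda m: m.throughput, default=None)


# Made-up points on two throughput/interactivity curves.
MEASUREMENTS = [
    Measurement("aggregated", 20.0, 900.0),
    Measurement("disaggregated", 20.0, 1300.0),
    Measurement("aggregated", 80.0, 400.0),
    Measurement("disaggregated", 80.0, 300.0),
]
```

With these invented numbers, a loose interactivity target selects the disaggregated point and a strict one selects the aggregated point, which is the "not one size fits all" situation the quote describes.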

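[02:58] Pre-deployment: AI Configurator, the offline configuration tool

Section highlights

Detailed notes

💬 Key excerpt

You can run this tool in any environment: it needs no GPU because it works entirely through simulation, and it quickly gives you a good starting configuration.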

"The idea here is that this allows you to basically do offline configuration for your performance, choosing your particular latency targets and offline be able to quickly determine what is a good starting point. So this will tell you exactly what TP settings to give, what parallelism strategies, also how to match pre-fill and decode workers. And the idea is that you can do it, you can run it anywhere, because it doesn't really require GPU, just because it's done through simulation."

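The idea of an offline, simulation-driven search can be sketched as follows. This is not AI Configurator's interface or its performance model: `Candidate`, `simulate`, `search`, and all constants are hypothetical, and the cost model is a deliberately crude stand-in. The sketch enumerates TP sizes and prefill/decode worker counts, then keeps the cheapest candidate that meets both latency targets.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Candidate:
    tp: int               # tensor-parallel size per worker
    prefill_workers: int
    decode_workers: int

    @property
    def gpus(self):
        return self.tp * (self.prefill_workers + self.decode_workers)


def simulate(c, request_rate):
    """Toy analytic model (an assumption, not a real simulator).
    Returns (ttft_ms, itl_ms) for a request rate in requests/s."""
    prefill_ms = 800.0 / c.tp                 # prefill cost shrinks with TP
    load = request_rate / c.prefill_workers   # requests/s per prefill worker
    capacity = 1000.0 / prefill_ms            # requests/s one worker sustains
    if load >= capacity:
        return float("inf"), float("inf")     # overloaded: the queue grows
    ttft_ms = prefill_ms / (1.0 - load / capacity)
    # Per-token cost shrinks with TP and grows with per-worker concurrency.
    itl_ms = 40.0 / c.tp + 2.0 * request_rate / c.decode_workers
    return ttft_ms, itl_ms


def search(request_rate, ttft_target_ms, itl_target_ms, max_gpus):
    """Return the feasible candidate using the fewest GPUs, or None."""
    best = None
    for tp, p, d in product([1, 2, 4, 8], range(1, 9), range(1, 9)):
        c = Candidate(tp, p, d)
        if c.gpus > max_gpus:
            continue
        ttft, itl = simulate(c, request_rate)
        if ttft <= ttft_target_ms and itl <= itl_target_ms:
            if best is None or c.gpus < best.gpus:
                best = c
    return best
```

Because nothing here touches a GPU, a search like this runs anywhere; the quality of its answer depends entirely on how well the model inside `simulate` matches real hardware.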

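[06:00] Cluster scheduling and service discovery: Kubernetes-native design

Section highlights

Detailed notes

💬 Key excerpt

We use standard Kubernetes techniques for service discovery, so users can adopt Dynamo without deploying any additional services in their cluster, which lowers the barrier to adoption.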

"So the information is shared between everything during discovery. And again, the main thing to mention is that we're using standard Kubernetes techniques for this. So it makes it easier for people to adopt without needing any additional services within their cluster."

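The talk does not say which Kubernetes mechanism carries the discovery data, so the following is only a sketch under an assumption: that components learn about each other from a watch-style event stream, modeled on the ADDED, MODIFIED, and DELETED event types of the Kubernetes watch API. The event dictionary shape and both helper names are hypothetical.

```python
def apply_event(registry, event):
    """Apply one watch-style event to an in-memory worker registry.

    `event` is a hypothetical dict: {"type": ..., "name": ..., "info": {...}}.
    """
    kind = event["type"]
    name = event["name"]
    if kind in ("ADDED", "MODIFIED"):
        registry[name] = event.get("info", {})
    elif kind == "DELETED":
        registry.pop(name, None)
    # Any other event type is ignored and leaves the registry unchanged.
    return registry


def workers_with_role(registry, role):
    """List the names of registered workers that advertise the given role."""
    return sorted(n for n, info in registry.items() if info.get("role") == role)
```

Keeping the registry as a pure function of the event stream means a component that restarts can rebuild its view by replaying the current state, without a separate discovery service.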

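[08:50] Runtime dynamic tuning: Planner and Model Express

Section highlights

Detailed notes

💬 Key excerpt

If time to first token (TTFT) starts to rise, we can add prefill workers; if inter-token latency (ITL) is the bottleneck, we can add decode workers.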

"So if your time to first token is starting to increase, we can increase the number of prefill workers. If the intertoken latency is really the challenge, then we can increase the number of decode workers."

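The scaling rule in this excerpt can be written down as a small function. Everything here is illustrative: `Metrics`, `Replicas`, `plan`, the one-worker step size, and the `max_total` budget are assumptions, not the Planner's actual interface or policy.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    ttft_ms: float  # observed time to first token
    itl_ms: float   # observed inter-token latency


@dataclass
class Replicas:
    prefill: int
    decode: int


def plan(current, metrics, ttft_target_ms, itl_target_ms, max_total):
    """Return new replica counts: add a prefill worker when TTFT exceeds its
    target and a decode worker when ITL exceeds its target, within a budget."""
    new = Replicas(current.prefill, current.decode)
    if metrics.ttft_ms > ttft_target_ms and new.prefill + new.decode < max_total:
        new.prefill += 1
    if metrics.itl_ms > itl_target_ms and new.prefill + new.decode < max_total:
        new.decode += 1
    return new
```

For example, `plan(Replicas(2, 2), Metrics(900.0, 20.0), 500.0, 30.0, 8)` adds one prefill worker and leaves decode untouched. A real controller would also need scale-down logic and some damping to avoid oscillation, which this sketch omits.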

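[11:10] Request path: routing, workers, and KV cache transfer

Section highlights

Detailed notes

💬 Key excerpt

The router maintains a global index of how the KV cache is distributed across all workers. The index is built from KV events reported directly by the workers, so it is precise: the router does not have to approximate whether a block is cached.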

"So when blocks are stored or evicted from particular workers, the router keeps track of that and contains a global index for the way the KV cache is distributed across the worker. We call this precise thing because it really gets events directly from the worker so it doesn't have to approximate whether something's in the cache or not."

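An event-driven index of this kind can be sketched in a few lines. This is a simplified illustration, not Dynamo's router: the `KvIndex` class, the `"stored"` and `"removed"` event names, the block-hash strings, and the tie-breaking rule in `route` are all assumptions made for the example.

```python
from collections import defaultdict


class KvIndex:
    """Global map from KV block hash to the set of workers holding it."""

    def __init__(self):
        self._holders = defaultdict(set)

    def on_event(self, worker, kind, block_hashes):
        """Update the index from a worker-reported event."""
        for h in block_hashes:
            if kind == "stored":
                self._holders[h].add(worker)
            elif kind == "removed":
                self._holders[h].discard(worker)
            else:
                raise ValueError(f"unknown event kind: {kind}")

    def overlap(self, worker, request_blocks):
        """Count the leading request blocks already cached on `worker`."""
        n = 0
        for h in request_blocks:
            if worker not in self._holders.get(h, ()):
                break
            n += 1
        return n


def route(index, workers, request_blocks, load):
    """Pick the worker with the longest cached prefix; on a tie, prefer the
    less loaded worker. `load` maps worker name to active request count."""
    return max(
        workers,
        key=lambda w: (index.overlap(w, request_blocks), -load.get(w, 0)),
    )
```

Because the index changes only when a worker reports a store or an eviction, a lookup reflects what workers have actually reported rather than a guess based on past routing decisions.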

[16:00] Fault tolerance and high-availability design

Section highlights

Detailed notes

💬 Key excerpt

One direction we are investing heavily in now is fast restart using Model Express and other techniques, to minimize recovery time after a failure.

"One of the other pieces that we're spending a lot of time on now is using Model Express and other technologies to do fast restart. So again, we talked about low latency GPU-to-GPU weight transfer. We're also looking at different ways to do checkpoint and restore of complete processes to really reduce that cold start and warm start time."


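Glossary of Technical Terms

Term: Explanation
Dynamo (NVIDIA): In this episode, NVIDIA's end-to-end LLM inference serving system, covering deployment, scheduling, routing, and fault tolerance.
AI Configurator: The offline, simulation-based configuration tool that accompanies Dynamo; it quickly suggests a good starting configuration for an inference deployment.
TP (Tensor Parallelism): A parallelism strategy for distributed LLM inference that splits model parameters across multiple GPUs.
K8s (Kubernetes): An open-source container orchestration system and the de facto standard for large-scale cloud deployments.
CRD (Custom Resource Definition): A Kubernetes extension mechanism that lets users define their own resource types.
Grove: A Kubernetes scheduling component developed for Dynamo, with support for topology awareness and fine-grained scaling.
HPA (Horizontal Pod Autoscaler): The built-in Kubernetes component for horizontal pod autoscaling.
KV Cache: Key-value cache. In LLM inference it stores the attention keys and values of tokens that have already been processed, which avoids recomputation and lowers latency.
Prefill Worker: A worker that handles the prefill phase of inference, in which the user's input prompt is processed.
Decode Worker: A worker that handles the decode phase of inference, in which the output sequence is generated token by token.
NIXL (NVIDIA Inference Xfer Library): The high-performance transfer library used by Dynamo, supporting low-latency data movement across CPU memory, GPU memory, and storage.
Rust: A high-performance, memory-safe systems programming language; in this episode it is used to build core Dynamo components such as the router and the worker core.
OpenAI Compatible Interface: An API compatible with the OpenAI standard, so code written against OpenAI can switch to a Dynamo service without modification.
vLLM: A widely used open-source LLM inference serving framework, with memory optimizations such as PagedAttention.
TensorRT-LLM (TRT-LLM): NVIDIA's LLM inference acceleration engine, heavily optimized for NVIDIA GPUs.
SGLang: A widely used open-source LLM inference serving framework, with features such as fast structured output.
SLA (Service Level Agreement): In this episode, the latency and availability commitments of the inference service.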

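Further Reflections

  1. Whether Dynamo's abstraction over multiple inference engines introduces extra performance overhead, and how completely each engine's features are supported, should be validated before adoption.
  2. Whether the synchronization overhead and routing performance of the precise global KV cache index become bottlenecks in very large clusters (more than ten thousand GPUs) remains to be tested.
  3. In disaggregated deployments, KV cache transfer latency between prefill and decode workers affects inference performance; the net benefit needs to be evaluated against real production traffic.
  4. The accuracy of AI Configurator's simulation, and how well it matches the actual hardware and model, directly determines how useful the initial configuration is; scenario-specific tuning is needed.
  5. Request migration requires transferring the KV cache; whether that overhead stays manageable for long-sequence requests is an open question, and suitable failure-trigger thresholds need to be set case by case.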
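
To put rough numbers on the KV cache transfer questions in items 3 and 5, the cache size of one request can be estimated with the standard formula for multi-head or grouped-query attention: 2 (keys and values) × layers × KV heads × head dimension × bytes per element × tokens. The model shape and link bandwidth below are hypothetical values chosen for illustration; the talk gives no such figures, and the estimate ignores protocol overhead, cache quantization, and alternative attention layouts.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Size of one sequence's KV cache in bytes.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_elem=2 corresponds to FP16 or BF16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


def transfer_seconds(num_bytes, bandwidth_gb_per_s):
    """Idealized transfer time at a sustained bandwidth given in GB/s (1e9 bytes/s)."""
    return num_bytes / (bandwidth_gb_per_s * 1e9)


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128, FP16.
per_token = kv_cache_bytes(1, 32, 8, 128)       # 131072 bytes = 128 KiB
long_request = kv_cache_bytes(32768, 32, 8, 128)  # 4 GiB for a 32K-token sequence
seconds = transfer_seconds(long_request, 50.0)    # about 0.086 s at an assumed 50 GB/s
```

Under these assumptions a 32K-token request carries about 4 GiB of KV cache, so whether migration or prefill-to-decode handoff is affordable depends strongly on sequence length and on the bandwidth actually available between workers.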
