Isaac GR00T N1.6 Model Architecture

1. Architecture Overview

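Isaac GR00T N1.6 is NVIDIA's vision-language-action (VLA) foundation model for generalist humanoid robots, in the 3B-parameter class. The architecture chains two parts. Upstream, the Eagle Vision-Language Backbone (based on a variant of NVIDIA's Cosmos-Reason-2B VLM) encodes images and language into a single multimodal feature sequence. Downstream, the Action Head uses a 32-layer AlternateVLDiT (Diffusion Transformer) that applies Flow Matching to denoise Gaussian noise step by step into an action chunk. To support cross-embodiment training, the state encoder, action encoder, and action decoder all use CategorySpecificMLP: the same layer structure with separate weights for each embodiment, supporting up to 32 embodiments. The core changes from N1.5 to N1.6 are: a larger Cosmos VLM with its top 4 layers unfrozen, DiT depth doubled from 16 to 32 layers, removal of N1.5's 4-layer post-VLM transformer adapter, and prediction of state-relative action chunks for most embodiments.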

graph TB
    subgraph Input["Input"]
        IMG["Multi-view images<br/>(native aspect ratio, no padding)"]
        TXT["Language instruction<br/>(input_ids)"]
        STATE["Robot state<br/>[B, state_dim]<br/>(≤ max_state_dim=29)"]
        EID["embodiment_id<br/>[B] ∈ [0, 32)"]
    end
    subgraph Backbone["Eagle Backbone (Cosmos-Reason-2B variant)"]
        EAGLE["vision_model + language_model<br/>(select_layer=16, truncated LLM)<br/>tune_top_llm_layers=4"]
        VLLN["VLLN LayerNorm<br/>(optional)"]
    end
    subgraph Encoders["Embodiment-conditioned encoding"]
        SE["state_encoder<br/>CategorySpecificMLP<br/>29 → 1024 → 1536"]
        AE["action_encoder<br/>MultiEmbodimentActionEncoder<br/>W1/W2/W3 + sinusoidal time encoding"]
    end
    subgraph FlowMatch["Flow Matching"]
        NOISE["x_t = (1 - t) * noise + t * actions<br/>velocity = actions - noise"]
        TBUCKET["t_discretized = t * 1000"]
    end
    subgraph DiT["32-layer AlternateVLDiT"]
        DITCORE["sa_embs = concat(state_features, action_features)<br/>cross-attention alternates between<br/>non-image tokens / image tokens<br/>(non-image when idx % 4 == 0, image on the other even blocks)<br/>odd blocks use self-attention"]
    end
    subgraph Decoder["Action decoding"]
        AD["action_decoder<br/>CategorySpecificMLP<br/>1024 → 1024 → 29"]
        EULER["Euler integration<br/>x_{t+dt} = x_t + dt * v_t<br/>(4 steps, dt = 1/4)"]
    end
    subgraph Output["Output"]
        ACT["Action chunk<br/>[B, 16, action_dim]<br/>(state-relative for most embodiments)"]
    end
    IMG --> EAGLE
    TXT --> EAGLE
    EAGLE --> VLLN
    VLLN -->|"backbone_features [B, S, 2048]<br/>+ backbone_attention_mask<br/>+ image_mask"| DITCORE
    STATE --> SE
    EID --> SE
    EID --> AE
    EID --> AD
    NOISE -->|"x_t"| AE
    TBUCKET -->|"t_discretized"| AE
    AE --> DITCORE
    SE --> DITCORE
    TBUCKET -->|"timestep"| DITCORE
    DITCORE --> AD --> EULER --> ACT
    style Input fill:#e8f4fd,stroke:#2196F3
    style Backbone fill:#fff3e0,stroke:#FF9800
    style Encoders fill:#f3e5f5,stroke:#9C27B0
    style FlowMatch fill:#fff9c4,stroke:#FFC107
    style DiT fill:#e8f5e9,stroke:#4CAF50
    style Decoder fill:#e0f2f1,stroke:#009688
    style Output fill:#fce4ec,stroke:#E91E63

2. Core Components in Detail

2.1 Eagle Vision-Language Backbone

EagleBackbone (gr00t/model/modules/eagle_backbone.py) wraps the upstream VLM. N1.6 defaults to model_name="nvidia/Eagle-Block2A-2B-v2", NVIDIA's internal variant of the Cosmos-Reason-2B VLM, loaded with flash attention in bf16. It consists of three stages: vision_model, the mlp1 visual projection, and language_model.

graph LR
    subgraph EagleIn["Input"]
        PIX["pixel_values<br/>(native aspect ratio, any-res)"]
        IDS["input_ids + attention_mask<br/>with image-token placeholders"]
    end
    subgraph EagleCore["Eagle (Cosmos-Reason-2B variant)"]
        VIT["vision_model<br/>(any-res ViT)"]
        MLP1["mlp1<br/>visual feature projection"]
        LLM["language_model<br/>(LLM, top 4 layers trainable)<br/>layers truncated to select_layer=16"]
    end
    PIX --> VIT --> MLP1 --> LLM
    IDS --> LLM
    LLM --> H["hidden_states[-1]<br/>backbone_features<br/>[B, S, 2048]"]
    IDS --> IMASK["image_mask = (input_ids == image_token_index)"]
    IDS --> AMASK["backbone_attention_mask = (attention_mask == 1)"]
    style EagleCore fill:#fff3e0,stroke:#FF9800
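
The two masks derived from input_ids and attention_mask are plain element-wise comparisons. The NumPy sketch below illustrates them on a toy batch. The value of IMAGE_TOKEN_INDEX and the token ids are made up for illustration, and the final non_image_mask line is an assumption about how a downstream consumer might combine the two masks, not a quote from the repository.

```python
import numpy as np

IMAGE_TOKEN_INDEX = 9  # hypothetical placeholder id, not the real tokenizer value

# Toy batch: two sequences of length 6; the second has two padding positions.
input_ids = np.array([[1, 9, 9, 4, 5, 2],
                      [1, 9, 7, 2, 0, 0]])
attention_mask = np.array([[1, 1, 1, 1, 1, 1],
                           [1, 1, 1, 1, 0, 0]])

# Same comparisons as in the diagram.
image_mask = input_ids == IMAGE_TOKEN_INDEX
backbone_attention_mask = attention_mask == 1

# Assumed combination: valid tokens that are not image placeholders.
non_image_mask = backbone_attention_mask & ~image_mask
```

Both masks keep the [B, S] shape of input_ids, so they can be passed along with backbone_features without any reshaping.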

Key details:


2.2 Embodiment-Conditioned State/Action Encoding

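GR00T has to handle many robots in a single model (GR1, G1, YAM, Galaxea R1 Pro, Bridge, DROID, ...). The key design is CategorySpecificLinear: each embodiment id selects its own (W, b) pair, so one network structure serves up to 32 embodiments.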
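
The per-embodiment weight selection can be sketched in a few lines of NumPy. This is an illustration of the idea only, not the repository's implementation: the class below mirrors the name CategorySpecificLinear, but the initialization scale and the use of einsum in place of torch.bmm are assumptions.

```python
import numpy as np


class CategorySpecificLinear:
    """One (W, b) pair per embodiment; each sample uses the pair picked by its id."""

    def __init__(self, num_categories, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Illustrative init scale (0.02); the real initialization may differ.
        self.W = 0.02 * rng.standard_normal((num_categories, input_dim, hidden_dim))
        self.b = np.zeros((num_categories, hidden_dim))

    def __call__(self, x, cat_ids):
        # x: [B, T, input_dim], cat_ids: [B]
        selected_W = self.W[cat_ids]           # [B, input_dim, hidden_dim]
        selected_b = self.b[cat_ids][:, None]  # [B, 1, hidden_dim]
        # Batched matmul over the batch axis, the NumPy analogue of bmm(x, selected_W).
        return np.einsum("bti,bih->bth", x, selected_W) + selected_b


layer = CategorySpecificLinear(num_categories=32, input_dim=29, hidden_dim=1024)
x = np.ones((2, 1, 29))
out = layer(x, cat_ids=np.array([0, 5]))    # two samples, two different embodiments
out_same = layer(x, cat_ids=np.array([5, 5]))
```

Identical inputs produce different outputs when their embodiment ids differ, and identical outputs when the ids match, which is the whole point of the design.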

graph TB
    subgraph CSL["CategorySpecificLinear"]
        IN["x: [B, T, input_dim]"] --> SELECT["W[cat_ids] / b[cat_ids]<br/>bmm(x, selected_W) + b"]
        EID["cat_ids: [B]"] --> SELECT
        SELECT --> OUT["[B, T, hidden_dim]"]
    end
    subgraph StateEnc["state_encoder = CategorySpecificMLP"]
        ST["state [B, state_dim≤29]"] --> SL1["CSL 29 → 1024<br/>+ ReLU"]
        SL1 --> SL2["CSL 1024 → 1536"]
        SL2 --> SF["state_features<br/>[B, state_horizon, 1536]"]
    end
    subgraph ActionEnc["action_encoder = MultiEmbodimentActionEncoder"]
        NA["noisy actions<br/>[B, 16, 29]"] --> W1["W1 (CSL): 29 → 1536"]
        TS["t_discretized: [B]<br/>expand to [B, 16]"] --> SIN["SinusoidalPositionalEncoding<br/>dim=1536"]
        W1 --> CAT2["concat → [B, 16, 3072]"]
        SIN --> CAT2
        CAT2 --> W2["W2 (CSL): 3072 → 1536<br/>+ Swish"]
        W2 --> W3["W3 (CSL): 1536 → 1536"]
        W3 --> AF["action_features<br/>[B, 16, 1536]"]
    end
    style CSL fill:#e3f2fd,stroke:#2196F3
    style StateEnc fill:#f3e5f5,stroke:#9C27B0
    style ActionEnc fill:#fff3e0,stroke:#FF9800
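
The action-encoder path in the diagram (W1, concatenation with a sinusoidal time encoding, W2 + Swish, W3) can be traced with a small NumPy sketch. The helper names, the toy hidden size of 64 (the document's value is 1536), and the exact sinusoidal frequency layout (base 10000, sin half then cos half) are assumptions made for illustration.

```python
import numpy as np


def sinusoidal_encoding(timesteps, dim):
    """timesteps: [B, T] -> [B, T, dim]; assumed layout: sin half, then cos half."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)  # [half]
    angles = timesteps[..., None] * freqs                      # [B, T, half]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)


def cs_linear(x, W, b, cat_ids):
    """Category-specific linear layer: pick W[cat_ids], b[cat_ids] per sample."""
    return np.einsum("bti,bih->bth", x, W[cat_ids]) + b[cat_ids][:, None]


def swish(x):
    return x / (1.0 + np.exp(-x))


def encode_actions(noisy_actions, t_discretized, cat_ids, params):
    B, T, _ = noisy_actions.shape
    hidden = params["W1"].shape[-1]
    a_emb = cs_linear(noisy_actions, params["W1"], params["b1"], cat_ids)   # [B, T, H]
    t = np.broadcast_to(t_discretized[:, None], (B, T)).astype(np.float64)  # [B, T]
    tau_emb = sinusoidal_encoding(t, hidden)                                # [B, T, H]
    x = np.concatenate([a_emb, tau_emb], axis=-1)                           # [B, T, 2H]
    x = swish(cs_linear(x, params["W2"], params["b2"], cat_ids))            # [B, T, H]
    return cs_linear(x, params["W3"], params["b3"], cat_ids)                # [B, T, H]


rng = np.random.default_rng(0)
E, A, H = 4, 29, 64  # toy sizes: 4 embodiments, 29 action dims, hidden 64
params = {
    "W1": 0.02 * rng.standard_normal((E, A, H)), "b1": np.zeros((E, H)),
    "W2": 0.02 * rng.standard_normal((E, 2 * H, H)), "b2": np.zeros((E, H)),
    "W3": 0.02 * rng.standard_normal((E, H, H)), "b3": np.zeros((E, H)),
}
noisy_actions = rng.standard_normal((2, 16, A))
action_features = encode_actions(
    noisy_actions, np.array([0, 750]), np.array([1, 3]), params
)
```

Each sample carries a single timestep bucket, which is broadcast across all 16 action steps before encoding, matching the "expand to [B, 16]" note in the diagram.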

Key details:


2.3 AlternateVLDiT (Action Head Backbone)

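The N1.6 Action Head is built from a 32-layer AlternateVLDiT (gr00t/model/modules/dit.py:289), a subclass of DiT. Each block consists of AdaLayerNorm + Attention + (Cross-Attention). The timestep embedding temb is injected into every layer through AdaLN, unlike Pi0, which places the time signal in the token sequence.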
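
To make the AdaLN injection concrete, here is a minimal NumPy sketch of an adaptive layer norm: the timestep embedding is projected to a per-channel scale and shift that modulate a parameter-free layer norm. The function names, the projection shape, and the (scale, shift) ordering after the split are assumptions, not the repository's exact code.

```python
import numpy as np


def silu(x):
    return x / (1.0 + np.exp(-x))


def layer_norm(x, eps=1e-5):
    """Layer norm over the last axis, without learned affine parameters."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def ada_layer_norm(x, temb, W, b):
    """x: [B, T, D], temb: [B, D], W: [D, 2D], b: [2D]."""
    mod = silu(temb) @ W + b                  # [B, 2D]
    scale, shift = np.split(mod, 2, axis=-1)  # each [B, D]; ordering is assumed
    # Broadcast the per-sample modulation over the T tokens.
    return layer_norm(x) * (1.0 + scale[:, None]) + shift[:, None]


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 8))
temb = rng.standard_normal((2, 8))

# With a zero projection the modulation vanishes and AdaLN reduces to plain layer norm.
plain = ada_layer_norm(x, temb, np.zeros((8, 16)), np.zeros(16))
modulated = ada_layer_norm(x, temb, 0.1 * rng.standard_normal((8, 16)), np.zeros(16))
```

Because the modulation is applied per channel and shared by all tokens of a sample, the time signal reaches every block without adding a token to the sequence.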

graph TB
    subgraph Token["Input sequence concatenation"]
        SF2["state_features<br/>[B, state_horizon, 1536]"] --> SAE["sa_embs = concat<br/>[B, state_horizon + 16, 1536]"]
        AF2["action_features<br/>[B, 16, 1536]"] --> SAE
        POS["+ position_embedding<br/>(nn.Embedding(max_seq_len=1024))"] --> SAE
    end
    subgraph DiTLoop["32 Transformer blocks (interleaved)"]
        direction TB
        IDX["block idx ∈ [0, 32)"]
        IDX --> COND{"idx % 2 == 1 ?"}
        COND -->|"odd block"| SA["Self-Attention<br/>within sa_embs only"]
        COND -->|"even block (cross)"| ALT{"idx % (2·attend_text_every_n_blocks = 4)<br/>== 0 ?"}
        ALT -->|"yes"| TXT_ATTN["attend to non-image<br/>backbone tokens"]
        ALT -->|"no"| IMG_ATTN["attend to image<br/>backbone tokens"]
        SA --> NEXT["next block"]
        TXT_ATTN --> NEXT
        IMG_ATTN --> NEXT
    end
    subgraph TimeInject["Timestep injection (AdaLN)"]
        TS2["t_discretized [B]"] --> TENC["TimestepEncoder<br/>embedding_dim=inner_dim<br/>(num_attention_heads · attention_head_dim<br/>= 32 · 48 = 1536)"]
        TENC --> TEMB["temb"]
        TEMB --> ADA["AdaLayerNorm injection<br/>into norm1 of every block"]
    end
    subgraph Out["Output projection"]
        SHIFT["shift, scale = proj_out_1(SiLU(temb)).chunk(2)"]
        NORM["norm_out → (1 + scale) * x + shift"]
        PROJ["proj_out_2 → output_dim=1024"]
    end
    SAE --> IDX
    NEXT --> SHIFT
    SHIFT --> NORM --> PROJ --> MO["model_output<br/>[B, T, 1024]"]
    style Token fill:#e3f2fd,stroke:#2196F3
    style DiTLoop fill:#e8f5e9,stroke:#4CAF50
    style TimeInject fill:#fff9c4,stroke:#FFC107
    style Out fill:#fce4ec,stroke:#E91E63
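
The branching rule in the block loop is easiest to check by enumerating it. The helper below is a hypothetical illustration (block_schedule is not a function in the repository) that applies the two modulo tests from the diagram and lists what each of the 32 blocks attends to.

```python
from collections import Counter


def block_schedule(num_layers=32, attend_text_every_n_blocks=2):
    """Return the attention type of each block: 'self', 'cross-text' or 'cross-image'."""
    schedule = []
    for idx in range(num_layers):
        if idx % 2 == 1:
            schedule.append("self")          # odd block: self-attention over sa_embs
        elif idx % (2 * attend_text_every_n_blocks) == 0:
            schedule.append("cross-text")    # attends to non-image backbone tokens
        else:
            schedule.append("cross-image")   # attends to image backbone tokens
    return schedule


schedule = block_schedule()
counts = Counter(schedule)
print(schedule[:8])
print(dict(counts))
```

With the default attend_text_every_n_blocks=2, the pattern repeats every four blocks as cross-text, self, cross-image, self, giving 16 self-attention blocks, 8 text cross-attention blocks, and 8 image cross-attention blocks.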

Key details:


2.4 Flow Matching Diffusion Mechanism

GR00T uses Flow Matching rather than DDPM-style diffusion, following the same idea as Pi0 / Pi0.5, with two GR00T-specific choices:

  1. Discretized timesteps: num_timestep_buckets=1000. The continuous t is multiplied by the bucket count and truncated to an integer, then fed to TimestepEncoder to produce the temb that AdaLN injects.
  2. Reversed direction convention: Pi0 uses x_t = t·noise + (1-t)·actions. GR00T flips this to x_t = (1-t)·noise + t·actions, which gives velocity = actions - noise. Inference moves linearly from t=0 (pure noise) to t=1 (clean actions) with step dt = +1/N (see gr00t_n1d6.py:316).

graph TB
    subgraph FMTheory["Flow Matching convention (GR00T)"]
        FM["x_t = (1 - t) * noise + t * actions<br/>velocity = actions - noise<br/><br/>t = 0: x_0 = noise (start)<br/>t = 1: x_1 = actions (end)<br/><br/>Training target: predict velocity"]
    end
    subgraph TSample["Training timestep sampling"]
        BETA["sample ~ Beta(α=1.5, β=1.0)"]
        FLIP["t = (1 - sample) · noise_s<br/>(noise_s = 0.999)"]
    end
    subgraph NoiseProc["Noising"]
        ACT_GT["Ground-truth actions<br/>[B, 16, 29]"]
        NS["noise ~ N(0,1)<br/>[B, 16, 29]"]
        XT_TRAIN["x_t = (1-t)·noise + t·actions"]
        UT["velocity = actions - noise"]
    end
    subgraph LossC["Loss"]
        PRED["v_t = action_decoder(DiT(...))"]
        MASK["action_mask<br/>(masks out padded action dims)"]
        L["loss = sum((v_t - velocity)² · mask) / sum(mask)"]
    end
    BETA --> FLIP
    FLIP -->|"t"| XT_TRAIN
    ACT_GT --> XT_TRAIN
    NS --> XT_TRAIN
    ACT_GT --> UT
    NS --> UT
    UT --> L
    PRED --> L
    MASK --> L
    style FMTheory fill:#e3f2fd,stroke:#2196F3
    style TSample fill:#fff3e0,stroke:#FF9800
    style NoiseProc fill:#f3e5f5,stroke:#9C27B0
    style LossC fill:#fce4ec,stroke:#E91E63
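
The timestep sampling and noising steps can be written out as a short NumPy sketch. The function names sample_time and make_training_pair are hypothetical; the formulas follow the conventions stated in this section (Beta(1.5, 1.0) sample, flipped and scaled by noise_s, then a linear interpolation between noise and actions).

```python
import numpy as np


def sample_time(rng, batch_size, alpha=1.5, beta=1.0, noise_s=0.999):
    """t = (1 - sample) * noise_s with sample ~ Beta(alpha, beta), so t stays in [0, 0.999]."""
    sample = rng.beta(alpha, beta, size=batch_size)
    return (1.0 - sample) * noise_s


def make_training_pair(actions, rng, num_timestep_buckets=1000):
    """actions: [B, horizon, dim] -> noisy input x_t, velocity target, and timestep bucket."""
    t = sample_time(rng, actions.shape[0])
    noise = rng.standard_normal(actions.shape)
    t_b = t[:, None, None]                      # broadcast over [B, horizon, dim]
    x_t = (1.0 - t_b) * noise + t_b * actions   # t=0 is pure noise, t=1 is clean actions
    velocity = actions - noise                  # regression target
    t_discretized = (t * num_timestep_buckets).astype(np.int64)
    return x_t, velocity, t_discretized, t


rng = np.random.default_rng(0)
actions = rng.uniform(-1.0, 1.0, size=(8, 16, 29))
x_t, velocity, t_discretized, t = make_training_pair(actions, rng)
```

Since Beta(1.5, 1.0) puts more mass near 1, the flipped t is drawn more often near 0, the noisy end of the path. The identity x_t + (1 - t)·velocity = actions holds for every sample, which is a quick sanity check on the sign convention.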

Comparison with Pi0 / standard diffusion:

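| Property | Pi0 (Flow Matching) | GR00T N1.6 (Flow Matching) |
| --- | --- | --- |
| Linear path direction | x_t = t·noise + (1-t)·actions | x_t = (1-t)·noise + t·actions |
| Inference integration direction | t: 1.0 → 0.0 (dt=-0.1, 10 steps) | t: 0.0 → 1.0 (dt=+0.25, 4 steps) |
| Timestep representation | Sinusoidal, periods [4e-3, 4.0], injected into the sequence | 1000 discrete buckets + AdaLN injection into every layer |
| Beta sampling | Beta(1.5, 1.0) used directly | Beta(1.5, 1.0), then (1-sample)·0.999 |
| Time injection location | Concatenated as one token in the sequence | Injected into every DiT block via AdaLN |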

2.5 State-Relative Action Chunks (New N1.6 Convention)

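For most embodiments, N1.6 has the model predict actions as deltas relative to the current state (state-relative) rather than absolute joint angles or end-effector poses. The switch is Gr00tN1d6Config.use_relative_action, and the transform is handled on the data side by StateActionProcessor in gr00t/data/: during training it converts ground-truth actions to relative form, and at inference it adds the base state back to the model output before the commands are sent to the robot.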
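
The round trip between absolute and state-relative actions can be sketched as follows. This is a simplified illustration that assumes the state and action share the same joint-position layout, so the transform is a plain subtraction and addition. The function names are hypothetical, and the real StateActionProcessor may treat individual keys (for example rotations) differently.

```python
import numpy as np


def to_relative(actions_abs, base_state):
    """Training side: express every step of the chunk as a delta from the current state."""
    return actions_abs - base_state[None, :]


def to_absolute(actions_rel, base_state):
    """Inference side: add the base state back before sending commands."""
    return actions_rel + base_state[None, :]


base_state = np.array([0.10, -0.20, 0.30])            # current joint positions (toy values)
actions_abs = np.array([[0.11, -0.20, 0.30],
                        [0.12, -0.20, 0.30]])         # a 2-step absolute action chunk

actions_rel = to_relative(actions_abs, base_state)    # what the model would be trained on
recovered = to_absolute(actions_rel, base_state)      # what would be sent to the robot
```

Every step in the chunk is taken relative to the same base state (the state at prediction time), not to the previous step in the chunk.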

The benefits of this change are:


3. Training Pipeline

graph TB
    subgraph DataPrep["Data preparation (LeRobot v2 schema)"]
        RAW["Raw episode<br/>(video, state, action) triples"] --> SAP["StateActionProcessor<br/>normalization + optional state-relative transform"]
        SAP --> COLL["Gr00tN1d6DataCollator<br/>(processing_gr00t_n1d6.py)<br/>builds vlm_content + state + action"]
    end
    subgraph Inputs["Model inputs"]
        VLMC["vlm_content<br/>(images + language already arranged)"]
        STATE_I["state [B, ≤29]"]
        ACT_I["action [B, 16, ≤29]<br/>(may be relative)"]
        AM["action_mask [B, 16, ≤29]"]
        EID2["embodiment_id [B]"]
    end
    subgraph BackPass["Backbone forward"]
        VLMC --> EAGLE_F["EagleBackbone(<br/>input_ids, attention_mask,<br/>pixel_values)"]
        EAGLE_F --> BFEAT["backbone_features [B, S, 2048]<br/>backbone_attention_mask<br/>image_mask"]
    end
    subgraph HeadPass["Action Head forward"]
        BFEAT --> VLLN_OP["VLLN LayerNorm"]
        STATE_I --> SE_OP["state_encoder + optional dropout/noise"]
        EID2 --> SE_OP
        EID2 --> AE_OP
        ACT_I --> NOISY["sample t ~ flip(Beta(1.5,1)) · 0.999<br/>noise ~ N(0,1)<br/>x_t = (1-t)·noise + t·actions<br/>velocity = actions - noise"]
        NOISY --> AE_OP["action_encoder<br/>(W1/concat-sin/W2-Swish/W3)"]
        SE_OP --> CAT3["sa_embs = concat(state, action)<br/>+ position_embedding"]
        AE_OP --> CAT3
        CAT3 --> DIT_F["AlternateVLDiT (32 layers)<br/>cross-attends to VLLN(backbone)"]
        VLLN_OP --> DIT_F
        DIT_F --> DEC["action_decoder → v_t"]
    end
    subgraph LossB["Loss + backward"]
        DEC --> MSE["F.mse_loss(v_t, velocity, reduction='none')<br/>· action_mask"]
        AM --> MSE
        NOISY --> MSE
        MSE --> SUM["loss = sum(action_loss) / (sum(mask)+1e-6)"]
        SUM --> BACK["Backpropagation<br/>(HuggingFace Trainer + DeepSpeed)"]
    end
    style DataPrep fill:#e8f4fd,stroke:#2196F3
    style Inputs fill:#fff3e0,stroke:#FF9800
    style BackPass fill:#f3e5f5,stroke:#9C27B0
    style HeadPass fill:#e8f5e9,stroke:#4CAF50
    style LossB fill:#fce4ec,stroke:#E91E63
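
The masked loss at the end of the pipeline is a per-element squared error, zeroed on padded action dimensions and normalized by the number of valid elements. A minimal NumPy sketch, where masked_velocity_loss is a hypothetical name and the toy shapes are chosen for illustration:

```python
import numpy as np


def masked_velocity_loss(pred_velocity, target_velocity, action_mask, eps=1e-6):
    """Mean squared error over valid (unmasked) action elements only."""
    per_element = (pred_velocity - target_velocity) ** 2  # like mse_loss(reduction='none')
    masked = per_element * action_mask                    # padded dims contribute zero
    return masked.sum() / (action_mask.sum() + eps)


# Toy case: horizon 2, padded to 4 action dims, of which only the first 3 are real.
pred = np.zeros((1, 2, 4))
target = np.ones((1, 2, 4))
target[..., 3] = 100.0            # garbage in the padded dim must not affect the loss
mask = np.ones((1, 2, 4))
mask[..., 3] = 0.0

loss = masked_velocity_loss(pred, target, mask)
```

Dividing by the mask sum instead of the full tensor size keeps the loss scale comparable between embodiments with few action dimensions and those that use all 29.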

Training setup notes:


4. Inference Pipeline

graph TB
    subgraph PreProc["Input preparation"]
        OBS["Observation<br/>{video.*: ndarray<br/>state.*: ndarray<br/>annotation.*: str}"] --> POL["Gr00tPolicy.get_action()<br/>(gr00t/policy/gr00t_policy.py)"]
        POL --> VLMG["build vlm_content<br/>(processing_gr00t_n1d6)"]
        POL --> NORM_S["normalize state"]
    end
    subgraph Backbone["Backbone (single forward pass)"]
        VLMG --> EAGLE_I["EagleBackbone forward<br/>output hidden_states[-1]"]
        EAGLE_I --> BF["backbone_features [B, S, 2048]<br/>backbone_attention_mask<br/>image_mask"]
    end
    subgraph EncOnce["State encoding (once)"]
        NORM_S --> SE_I["state_encoder"]
        EID3["embodiment_id"] --> SE_I
        SE_I --> SFI["state_features [B, state_horizon, 1536]"]
    end
    subgraph Denoise["4-step Flow Matching denoising"]
        INIT["x_0 = randn(B, 16, 29)<br/>dt = 1 / 4 = 0.25"]
        STEP["for step in range(4):<br/>t_cont = step / 4<br/>t_discretized = int(t_cont · 1000)"]
        subgraph DSTEP["Single step"]
            X_T["x_t (current actions)"] --> AE_I["action_encoder(x_t, t, eid)"]
            AE_I --> SAE_I["sa_embs = cat(state_features, action_features)<br/>+ pos_emb"]
            SAE_I --> DIT_I["AlternateVLDiT.forward<br/>cross-attends to backbone"]
            BF --> DIT_I
            DIT_I --> DEC_I["action_decoder → pred"]
            DEC_I --> VT["v_t = pred[:, -16:]"]
        end
        VT --> UPD["x_{t+dt} = x_t + dt · v_t"]
        UPD --> X_T
    end
    subgraph PostProc["Post-processing"]
        UPD --> UN["Denormalize"]
        UN --> REL{"use_relative_action ?"}
        REL -->|"yes"| ADD["+ base state → absolute actions"]
        REL -->|"no"| KEEP["output as-is"]
        ADD --> OUT_A["action chunk<br/>[16, action_dim_real]"]
        KEEP --> OUT_A
    end
    INIT --> X_T
    STEP --> AE_I
    style PreProc fill:#e3f2fd,stroke:#2196F3
    style Backbone fill:#fff3e0,stroke:#FF9800
    style EncOnce fill:#f3e5f5,stroke:#9C27B0
    style Denoise fill:#e8f5e9,stroke:#4CAF50
    style PostProc fill:#fce4ec,stroke:#E91E63
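
The 4-step denoising loop can be exercised in isolation by swapping the network for a stand-in velocity function. In the sketch below, denoise and oracle_velocity are hypothetical names. The oracle returns the exact velocity of the straight path toward a known target, which is not what the trained model does, but it makes the Euler update and the timestep bucketing easy to verify.

```python
import numpy as np


def denoise(velocity_fn, shape, num_steps=4, num_buckets=1000, seed=0):
    """Euler integration from t=0 (pure noise) to t=1 with dt = 1/num_steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)               # x_0 ~ N(0, 1)
    dt = 1.0 / num_steps
    buckets = []
    for step in range(num_steps):
        t_cont = step / num_steps
        t_discretized = int(t_cont * num_buckets)
        buckets.append(t_discretized)
        v = velocity_fn(x, t_cont, t_discretized)
        x = x + dt * v                           # x_{t+dt} = x_t + dt * v_t
    return x, buckets


target = np.full((1, 16, 29), 0.5)


def oracle_velocity(x, t_cont, t_discretized):
    # Stand-in for action_decoder(DiT(...)). On the straight path
    # x_t = (1-t)*noise + t*target, the velocity target - noise equals (target - x_t) / (1 - t).
    # The real model would be conditioned on t_discretized instead of t_cont.
    return (target - x) / (1.0 - t_cont)


actions, buckets = denoise(oracle_velocity, target.shape)
print(buckets)
```

The loop evaluates the model at buckets 0, 250, 500, and 750 and never at t=1, so the last Euler step is what carries the sample onto the clean-action end of the path.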

Inference optimization notes:


5. Key Hyperparameters

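Model structure parameters

| Parameter | Value | Description |
| --- | --- | --- |
| model_name | nvidia/Eagle-Block2A-2B-v2 | Eagle backbone (Cosmos-Reason-2B variant) |
| backbone_embedding_dim | 2048 | Backbone output dimension |
| select_layer | 16 | Number of LLM layers kept after truncation |
| tune_top_llm_layers | 4 | Number of top LLM layers unfrozen during training |
| tune_llm / tune_visual | False / False | By default the vision model and the lower LLM layers stay frozen |
| hidden_size | 1024 | Hidden dimension after decoding |
| input_embedding_dim | 1536 | Dimension of the encoded state/action; also the DiT inner_dim |
| max_seq_len | 1024 | Maximum length of the position embedding |
| max_num_embodiments | 32 | Number of embodiments in the CategorySpecific layers |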

Diffusion Transformer (AlternateVLDiT)

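| Parameter | Value | Description |
| --- | --- | --- |
| num_layers | 32 | Number of DiT layers (16 in N1.5) |
| num_attention_heads | 32 | Number of attention heads |
| attention_head_dim | 48 | Dimension per head |
| inner_dim | 1536 | = 32 × 48 |
| norm_type | ada_norm | AdaLayerNorm |
| dropout | 0.2 | DiT dropout |
| interleave_self_attention | True | Odd blocks use self-attention, even blocks use cross-attention |
| attend_text_every_n_blocks | 2 | Text/image cross-attention switching period (of every 4 cross blocks, 2 attend to text and 2 to images) |
| output_dim | 1024 | Projects back to hidden_size |
| use_vlln | True | Applies LayerNorm to the backbone output |
| add_pos_embed | True | nn.Embedding(1024, 1536) position embedding |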

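Action generation

| Parameter | Value | Description |
| --- | --- | --- |
| action_horizon | 16 | Number of action steps predicted per call |
| max_action_dim | 29 | Upper bound on the action dimension (used for padding) |
| max_state_dim | 29 | Upper bound on the state dimension |
| use_relative_action | False (config default) | Set to True on the data side for most embodiments, producing state-relative output |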

Flow Matching

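| Parameter | Value | Description |
| --- | --- | --- |
| num_inference_timesteps | 4 | Number of denoising steps at inference |
| noise_beta_alpha | 1.5 | Beta sampling α |
| noise_beta_beta | 1.0 | Beta sampling β |
| noise_s | 0.999 | Timestep scaling (avoids t=1) |
| num_timestep_buckets | 1000 | Number of timestep discretization buckets |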

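Training regularization

| Parameter | Value | Description |
| --- | --- | --- |
| state_dropout_prob | 0.0 | State feature dropout probability (may be >0 for finetuning) |
| state_additive_noise_scale | 0.0 | Scale of the additive noise on state features |
| attn_dropout | 0.2 | Attention dropout |
| model_dtype | bfloat16 | Training precision |
| use_flash_attention | True | Forces flash attention |
| backbone_trainable_params_fp32 | True | Casts trainable parameters back to fp32 |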

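Inference performance (from the README)

| Device | Mode | Data | Backbone | Action Head | E2E | Frequency |
| --- | --- | --- | --- | --- | --- | --- |
| RTX 5090 | torch.compile | 2 ms | 18 ms | 16 ms | 37 ms | 27.3 Hz |
| H100 | torch.compile | 4 ms | 23 ms | 11 ms | 38 ms | 26.3 Hz |
| RTX 4090 | torch.compile | 2 ms | 25 ms | 17 ms | 44 ms | 22.8 Hz |
| Jetson Thor | torch.compile | 5 ms | 39 ms | 61 ms | 105 ms | 9.5 Hz |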

6. Key Source Files

| Component | Class / Function | File path |
| --- | --- | --- |
| Main model | Gr00tN1d6 | gr00t/model/gr00t_n1d6/gr00t_n1d6.py:411 |
| Action Head | Gr00tN1d6ActionHead | gr00t/model/gr00t_n1d6/gr00t_n1d6.py:19 |
| Config | Gr00tN1d6Config | gr00t/configs/model/gr00t_n1d6.py:13 |
| VL Backbone | EagleBackbone | gr00t/model/modules/eagle_backbone.py:8 |
| Diffusion Transformer | AlternateVLDiT / DiT | gr00t/model/modules/dit.py:289 / :172 |
| Embodiment linear layers | CategorySpecificLinear / MLP | gr00t/model/modules/embodiment_conditioned_mlp.py:44 / :128 |
| Action encoder | MultiEmbodimentActionEncoder | gr00t/model/modules/embodiment_conditioned_mlp.py:162 |
| Timestep sinusoidal encoding | SinusoidalPositionalEncoding | gr00t/model/modules/embodiment_conditioned_mlp.py:11 |
| Data collator | Gr00tN1d6DataCollator | gr00t/model/gr00t_n1d6/processing_gr00t_n1d6.py |
| Inference policy | Gr00tPolicy | gr00t/policy/gr00t_policy.py:46 |
| Server | run_gr00t_server.py | gr00t/eval/run_gr00t_server.py |
| Training entry point | launch_train.py / launch_finetune.py | gr00t/experiment/ |
| Trainer wrapper | Gr00tTrainer | gr00t/experiment/trainer.py |