REBEL-Quad

High Utilization, Low Power, Proven at Rack Scale.

Scaling Frontier LLMs with UCIe-Advanced & HBM3E

Built to serve frontier LLMs with high utilization and low power
Powered by unified mixed precision cores, predictive DMA, and UCIe interconnect
Rack-scale performance. Modular flexibility. Ready for deployment.

Architecture

SoC with 4 homogeneous chiplets, based on UCIe-Advanced

Compute (Dense)

1,024 TFLOPS (FP16)
2,048 TFLOPS (FP8)

External Memory

HBM3E 144GB 4.8TB/s

Chiplet Interface (UCIe-A)

16Gbps
1TB/s per channel

Host Connection

2x (64GB/s + 64GB/s)
2x PCIe Gen5 x16

Power Consumption

Up to 600W

Software

Native support for PyTorch 2.x, vLLM, and Triton
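The headline figures above imply a few derived numbers that can help with back-of-envelope sizing. The sketch below is plain arithmetic on the listed specifications. The even split across the four chiplets is an assumption suggested by the "homogeneous" description, and the 600W maximum is used as a worst-case power figure. The PCIe line uses the standard Gen5 rate of 32 GT/s per lane and ignores encoding overhead.

```python
# Back-of-envelope figures derived from the listed specifications.
# Assumption: resources are split evenly across the four homogeneous chiplets.

CHIPLETS = 4
FP16_TFLOPS = 1024          # dense
FP8_TFLOPS = 2048           # dense
HBM_CAPACITY_GB = 144
HBM_BANDWIDTH_TBPS = 4.8
MAX_POWER_W = 600           # "up to" figure, used here as a worst case
PCIE_GEN5_GT_PER_LANE = 32  # PCIe Gen5 signalling rate per lane
PCIE_LANES = 16


def derived_specs():
    return {
        "fp16_tflops_per_chiplet": FP16_TFLOPS / CHIPLETS,
        "fp8_tflops_per_chiplet": FP8_TFLOPS / CHIPLETS,
        "hbm_gb_per_chiplet": HBM_CAPACITY_GB / CHIPLETS,
        "hbm_tbps_per_chiplet": HBM_BANDWIDTH_TBPS / CHIPLETS,
        # FLOPs per byte of HBM traffic at which a kernel stops being
        # memory-bound (the roofline ridge point).
        "fp16_ridge_flop_per_byte": FP16_TFLOPS / HBM_BANDWIDTH_TBPS,
        "fp8_ridge_flop_per_byte": FP8_TFLOPS / HBM_BANDWIDTH_TBPS,
        "fp16_tflops_per_watt": FP16_TFLOPS / MAX_POWER_W,
        # Raw per-direction rate of one x16 link, before encoding overhead.
        "pcie_x16_gbytes_per_s_per_dir": PCIE_GEN5_GT_PER_LANE * PCIE_LANES / 8,
    }


if __name__ == "__main__":
    for name, value in derived_specs().items():
        print(f"{name}: {value:.2f}")
```

The PCIe result matches the 64GB/s per direction listed under Host Connection.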

REBEL-Quad vs. B200 SXM

Throughput: 1.4
Efficiency: 1.6
Power Consumption: 0.9
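
One way to read the three figures is as ratios normalized to B200 SXM = 1.0, with efficiency defined as throughput per unit of power. Both readings are assumptions, since the units and workload behind the comparison are not stated here. Under them, the numbers are mutually consistent, as this short check suggests:

```python
# Assumption: values are ratios relative to B200 SXM (= 1.0), and
# "efficiency" means throughput per unit of power.
throughput_ratio = 1.4
power_ratio = 0.9


def efficiency_ratio(throughput, power):
    return throughput / power


implied = efficiency_ratio(throughput_ratio, power_ratio)
print(f"implied efficiency ratio: {implied:.2f}")
```

The implied ratio is about 1.56, which rounds to the listed 1.6.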

One Engine.
Mixed Precision.

REBEL-Quad executes FP8 and FP16 in a single, mixed-precision pipeline—no need for separate blocks or recompiled kernels. This delivers 2.8x higher compute density and 16% higher utilization vs. ATOM™.

Prefetch smarter.
Decode faster.

REBEL-Quad uses a predictive, software-controlled DMA engine tightly coupled with an on-chip mesh to prefetch KV data proactively. This enables 2.7TB/s effective bandwidth and reduces token-level latency in 32K+ context LLMs.
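
To illustrate why KV bandwidth matters for long-context decoding, the sketch below estimates how long it takes to stream one sequence's KV cache once per generated token at the quoted 2.7TB/s. The model shape (80 layers, 8 KV heads, head dimension 128, FP8 cache) is a hypothetical example chosen for illustration, not a configuration named by the source. The estimate also ignores weight reads, compute, and any overlap the DMA engine achieves, so it is a rough lower bound rather than a measured latency.

```python
# Hypothetical model shape, for illustration only.
LAYERS = 80
KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 1                  # FP8 KV cache (assumption)

EFFECTIVE_BW_BYTES_PER_S = 2.7e12   # 2.7TB/s, from the text above


def kv_cache_bytes(context_tokens):
    # The factor of 2 covers keys and values.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM * context_tokens


def kv_stream_time_us(context_tokens):
    # Time to read the whole KV cache once, in microseconds.
    return kv_cache_bytes(context_tokens) / EFFECTIVE_BW_BYTES_PER_S * 1e6


if __name__ == "__main__":
    for ctx in (4_096, 32_768, 131_072):
        gib = kv_cache_bytes(ctx) / 2**30
        print(f"{ctx:>7} tokens: {gib:6.2f} GiB KV, "
              f"{kv_stream_time_us(ctx):8.1f} us per token")
```

For this hypothetical shape, a 32K context holds 5 GiB of KV data, which takes roughly 2 ms to stream at 2.7TB/s. The cost grows linearly with context length, which is why prefetching ahead of the compute becomes more valuable as contexts get longer.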

Modular Architecture.
Monolithic Efficiency.

REBEL-Quad extends a full-chip mesh over UCIe-Advanced interconnects, offering 1TB/s of bi-directional bandwidth per channel with just 11ns latency. Chiplets operate as one virtual die: no software changes, no I/O bottlenecks.
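
The two link figures can be combined into a rough cost model for a chiplet-to-chiplet transfer. The sketch below assumes a simple "fixed latency plus serialization time" model and treats the full 1TB/s as available to a single transfer. Both are simplifications, and the source does not say whether the bandwidth figure is per direction or aggregate.

```python
LINK_BW_BYTES_PER_S = 1e12   # 1TB/s per channel, from the text above
LINK_LATENCY_S = 11e-9       # 11ns, from the text above


def transfer_time_ns(num_bytes):
    # Simple model (assumption): fixed latency plus serialization time.
    return (LINK_LATENCY_S + num_bytes / LINK_BW_BYTES_PER_S) * 1e9


def break_even_bytes():
    # Message size at which serialization time equals the fixed latency.
    return LINK_LATENCY_S * LINK_BW_BYTES_PER_S


if __name__ == "__main__":
    for size in (64, 4_096, 1_048_576):
        print(f"{size:>9} bytes: {transfer_time_ns(size):10.1f} ns")
    print(f"break-even message size: {break_even_bytes():.0f} bytes")
```

Under this model, messages smaller than about 11KB are dominated by the 11ns latency, while larger ones are dominated by bandwidth.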

No Stalls.
Full Throughput.

REBEL-Quad implements hardware-accelerated, full-mesh synchronization across 256 routers. This avoids stalls in sparse or imbalanced workloads, sustaining high utilization across all chiplets and model phases.