Accelerate AI with Over 300 Supported Models, Effortlessly.
Discover how to quickly deploy your AI models on Rebellions' NPU using RBLN SDK.
You can find detailed information on our compiler, runtime, model zoo, and serving frameworks.
Get Started with Frameworks
Hugging Face
RBLN SDK supports Hugging Face transformer and diffusion models through the Optimum RBLN library. Deploy the latest models, such as Llama3-8B and SDXL, straight from the Hugging Face Hub.
💡 Run Hugging Face models on Rebellions hardware.
- Compilation and inference with Hugging Face models optimized for Rebellions’ hardware.
- Efficient, developer-friendly API using RBLN Runtime.
- Multi-chip support for Llama and SDXL models.
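For example, a model from the Hub can be compiled and run in a few lines. A minimal sketch, assuming the optimum-rbln package and an RBLN NPU are available; the model ID and generation settings are illustrative:

```python
# Compile a Hugging Face checkpoint for the RBLN NPU and run generation.
from optimum.rbln import RBLNLlamaForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# export=True triggers compilation for Rebellions' hardware.
model = RBLNLlamaForCausalLM.from_pretrained(model_id, export=True)
model.save_pretrained("llama3-8b-rbln")  # reuse the compiled artifacts later

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello, Rebellions!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```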
PyTorch
RBLN SDK supports PyTorch 2.0. Accelerate your PyTorch-trained NLP, speech, and vision models on Rebellions’ hardware.
💡 RBLN SDK integrates PyTorch models.
- Compilation of PyTorch models optimized for Rebellions’ hardware.
- Efficient, developer-friendly API using RBLN Runtime.
- Run PyTorch 2.0 models without prior tuning and build a powerful serving pipeline (see the sketch below).
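A minimal sketch of that flow, assuming the rebel-compiler package; the `rebel.compile_from_torch` entry point and input description follow the SDK's PyTorch tutorials, but check the docs for the exact signature:

```python
# Compile a TorchVision model and run it through the RBLN Runtime.
import rebel  # RBLN Compiler / Runtime package (rebel-compiler)
import torch
import torchvision

model = torchvision.models.resnet50(weights="DEFAULT").eval()

# Describe each model input as (name, shape, dtype).
compiled_model = rebel.compile_from_torch(
    model, [("input", [1, 3, 224, 224], torch.float32)]
)
compiled_model.save("resnet50.rbln")

# Execute on the NPU via the RBLN Runtime.
runtime = rebel.Runtime("resnet50.rbln")
output = runtime.run(torch.randn(1, 3, 224, 224).numpy())
```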
TensorFlow
RBLN SDK supports TensorFlow. Optimize inference for models such as LLMs, ImageNet classifiers, and YOLO.
💡 RBLN SDK integrates TensorFlow models.
- Inference with a multitude of pre-trained Keras Applications.
- Efficient, developer-friendly API using RBLN Runtime.
- Run TensorFlow models without prior tuning and build a powerful serving pipeline (see the sketch below).
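A hedged sketch of the TensorFlow path with a Keras Application; the `rebel.compile_from_tf_function` entry point is an assumption modeled on the PyTorch example above, so verify the exact API against the TensorFlow tutorial:

```python
# Compile a Keras Application for the RBLN NPU (entry point assumed).
import rebel
import tensorflow as tf

model = tf.keras.applications.ResNet50(weights="imagenet")

# Wrap the model in a concrete tf.function with a fixed input signature.
func = tf.function(lambda x: model(x)).get_concrete_function(
    tf.TensorSpec([1, 224, 224, 3], tf.float32)
)
compiled_model = rebel.compile_from_tf_function(func)  # assumed API
compiled_model.save("resnet50_tf.rbln")
```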
Rebellions’ Software Stack
Rebellions' software stack is built to extract maximum performance from our hardware.
Machine Learning Framework
Machine Learning (ML) frameworks are essential tools in the development and deployment of AI models, including NLP, Vision, Speech, and Generative models. Currently, the most popular frameworks are TensorFlow, PyTorch, and Hugging Face, each offering unique features and capabilities that cater to different aspects of machine learning development and deployment.
Compiler
The RBLN Compiler transforms models into executable instructions for ATOM™. It comprises two main components: the Frontend Compiler and the Backend Compiler. The Frontend Compiler abstracts deep learning models into Intermediate Representations (IRs), optimizing them before handing them off to the Backend Compiler. The Backend Compiler further optimizes these IRs and produces the Command Stream, the Program Binary for the hardware to execute the tasks, and serialized weights.
Compute Library
The Compute Library includes a comprehensive suite of highly optimized low-level operations, which are essential for model inference. These low-level operations form the programmable components of the arithmetic logic units within the Neural Engines. The Compute Library prepares the Program Binary at the Compiler’s command. The RBLN SDK supports low-level operations for both traditional Convolutional Neural Networks (CNNs) and state-of-the-art GenAI models. This includes hundreds of General Matrix Multiply (GEMM), normalization, and nonlinear activation functions. Thanks to the flexibility of the Neural Engines, the list of supported low-level operations continues to expand, enabling acceleration across a wide range of AI applications.
Runtime Module
The Runtime Module acts as the intermediary between the compiled model and the hardware, managing the actual execution of programs. It prepares executable instructions generated by the Compiler, manages data transfer between memory and the Neural Engines, and monitors performance to optimize the execution process.
Driver
The Driver, consisting of the Kernel-Mode Driver (KMD) and User-Mode Driver (UMD), provides efficient, safe, and flexible access to the hardware. The KMD allows the operating system to recognize the hardware and exposes APIs to the UMD. It also delivers the Command Stream from the Compiler stack to the device. The UMD, running in user space, intermediates between the application software and the hardware, managing their interactions.
Firmware
The Firmware is the lowest-level software component on ATOM™, serving as the final interface between software and hardware. It controls the tasks of the Command Processor, which orchestrates ATOM™’s operations. Located on the SoC, the Command Processor manages the Command Stream (the actual AI workloads) across multiple layers of the memory architecture and monitors the hardware’s health status.
Rebellions Hardware
Rebellions’ ATOM™ is an AI accelerator engineered specifically for AI inference, manufactured on Samsung’s advanced 5nm process. It delivers 32 tera floating-point operations per second (TFLOPS) for FP16 and 128 trillion operations per second (TOPS) for INT8, driven by eight Neural Engines and 64 MB of on-chip SRAM. Its multi-level memory architecture is engineered for high performance and peak efficiency.
Frequently Asked Questions
Can’t find what you’re looking for? Contact us here!
We are continuously improving compatibility with major AI frameworks through regular updates.
In most cases, you can use the RBLN SDK with minimal code changes.
- For officially supported Model Zoo models, you can use the provided example code right away.
- Other models can also be compiled by referring to the Model Zoo code.
Check the list of supported operations (Supported Ops) in advance.
To maximize the performance of transformer-based models, consider the following:
- Set the `rbln_tensor_parallel_size` value appropriately to utilize NPU parallelism (see the sketch below).
- Tune the input sequence length and batch size.
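For example, with optimum-rbln the tensor parallel size is passed at compile time. A sketch; the parallel size, batch size, and sequence length values below are illustrative and should be tuned for your workload:

```python
# Shard a Llama model across 4 NPUs at compilation time.
from optimum.rbln import RBLNLlamaForCausalLM

model = RBLNLlamaForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    export=True,
    rbln_tensor_parallel_size=4,  # number of NPUs to shard across
    rbln_batch_size=1,            # tune against throughput/latency goals
    rbln_max_seq_len=8192,        # tune to your input sequence lengths
)
```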
The RBLN SDK provides a C/C++ runtime binding for applications where a Python runtime is unavailable or extremely low latency is required.
Please refer to the C/C++ guide for more information.
The RBLN SDK and Compiler are regularly updated to maintain API compatibility with the latest versions of major frameworks.
For details, please refer to the respective Release Notes.
RBLN SDK offers high compatibility with PyTorch-based models.
- `torch.compile()` Support: Fully compatible with PyTorch 2.0’s `torch.compile()` feature, including models compiled through the TorchDynamo and TorchInductor backends (see the sketch below).
- Extensive Operator Support: The RBLN Compiler supports most PyTorch operators; the full list is in Supported Ops. It covers major operators for Vision, NLP, and Audio, making it suitable for a wide range of deep learning models.
- PyTorch Model Zoo Compatibility: Popular models such as ResNet, YOLO, LLaMA, and BERT are supported. See the PyTorch Model Zoo page for more details.
- JIT/Scripted Model Support: Models converted using TorchScript can also be processed by the RBLN Compiler.
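For instance, a model can be routed through `torch.compile()`. A minimal sketch, assuming the SDK registers an `rbln` backend; check the torch.compile guide for the exact backend name:

```python
# Compile a model through PyTorch 2.0's torch.compile() path.
import torch
import torchvision

model = torchvision.models.resnet50().eval()
compiled = torch.compile(model, backend="rbln")  # backend name assumed

with torch.no_grad():
    out = compiled(torch.randn(1, 3, 224, 224))  # first call triggers compilation
```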
The RBLN Driver can be installed using the provided `.deb` or `.rpm` installation files and requires root privileges. During installation, you must ensure that the kernel version is compatible with the driver.
In most cases, we provide an environment with the Driver pre-installed. If installation is required, please refer to the Installation Guide.
The RBLN SDK can be easily installed in a Python environment as follows:
```bash
pip3 install --extra-index-url https://pypi.rbln.ai/simple rebel-compiler==<latest-version> optimum-rbln==<latest-version> vllm-rbln==<latest-version>
```
To check the latest package versions, refer to the Release Notes. Depending on your environment, additional Python package dependencies may be required.
Python 3.9 or higher is recommended, and there are key package dependencies such as numpy, torch, and onnx.
Please refer to the Support Matrix page for the supported OS and Python versions.
Required packages may vary by model, so refer to the `requirements.txt` file included in the Model Zoo code for details.
Currently, RBLN SDK only supports Linux. Windows support will be determined based on our technical roadmap.
More details on the supported OS and Python version can be found on the Support Matrix page.
The RBLN SDK supports distributed inference based on tensor parallelism, called RSD (Rebellions Scalable Design).
Please first check the Model List for models that support multi-device execution, and refer to the provided examples for compilation instructions.
The optimal batch size may vary depending on the type of NPU used, server configuration, and service requirements.
We recommend using the Profiler tool and conducting various experiments for fine-tuning.
RBLN SDK includes the RBLN Profiler for performance bottleneck analysis, collecting key metrics such as execution time, memory usage, and operation dependencies.
- Trace files in `.pb` format can be visualized with Perfetto.
- You can analyze bottlenecks, inter-operation dependencies, and layer-by-layer latency to identify optimization directions.
For detailed usage, refer to the Profiler Guide.
To process video files, you can use libraries like OpenCV (cv2) to extract each frame from an `.mp4` file as an image, and then feed those frames into the model for inference.
For example, when using an object detection model like YOLOX, the typical procedure is as follows:
1. Load the video file using `cv2.VideoCapture`
2. Extract frames one by one
3. Preprocess each frame to match the model’s input format
4. Perform object detection using the model
5. Visualize the results and either save them or display them in real time (see the sketch below)
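A sketch of steps 1–5, assuming a YOLOX model already compiled to `yolox.rbln`; real YOLOX pre/post-processing (letterboxing, NMS, box drawing) is omitted for brevity:

```python
# Frame-by-frame video inference with OpenCV and the RBLN Runtime.
import cv2
import numpy as np
import rebel

runtime = rebel.Runtime("yolox.rbln")  # compiled model (assumed path)
cap = cv2.VideoCapture("input.mp4")    # 1. load the video file

while cap.isOpened():
    ok, frame = cap.read()             # 2. extract frames one by one
    if not ok:
        break
    # 3. preprocess: resize to the model input size and reorder to NCHW
    img = cv2.resize(frame, (640, 640)).astype(np.float32)
    preds = runtime.run(img.transpose(2, 0, 1)[None])  # 4. detect objects
    # 5. decode boxes / apply NMS here, then draw and display the results
    cv2.imshow("detections", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```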
Both are AI inference NPUs developed by Rebellions, but REBEL is a next-generation product designed with a chiplet-based architecture. A detailed comparison chart is available on the product page.
Yes. You can use Rebellions AI processor resources via the Kubernetes Plugin.
- Kubernetes Device Plugin: Supports RBLN NPUs in Kubernetes cluster environments.
- NPU Feature Discovery: Labels Kubernetes nodes with RBLN NPUs for scheduling.
- RBLN Metrics Exporter: Exposes NPU metrics (temperature, power, DRAM, utilization) in Prometheus format for Grafana dashboards.
RBLN SDK is compatible with vLLM, NVIDIA Triton Inference Server, and TorchServe. Container-based deployment also supports integration with Kubernetes.
GPUs were originally designed for graphics rendering but have been widely adopted for AI training and high-performance computing (HPC) due to their large-scale parallel processing capabilities. They typically use FP32/FP16 operations and support various types of computation through CUDA cores and Tensor Cores.
NPUs are processors specialized for AI and deep learning, designed to perform efficient computations at low power. They are optimized for low-bit operations such as INT8 and FP16 and include dedicated hardware architectures that accelerate neural network computations.
Rebellions devices are designed exclusively for inference, and fine-tuning is not currently supported.
To maximize inference performance, we recommend the following optimization strategies:
- Use Mixed Precision and Quantization: Improve memory efficiency and compute speed by using FP16 or INT8 quantized models.
- Adjust Batch Size: Find the optimal batch size based on model characteristics and input data to increase throughput.
- Refactor Model Architecture: Simplify the computation graph through layer fusion and removal of redundant operations to boost performance.
- Double Buffering: Utilize double buffering in `AsyncRuntime` to improve execution efficiency.
- Apply Continuous Batching for LLM Serving: For large language model (LLM) serving, maximize hardware utilization by applying continuous batching techniques using `vllm-rbln` (see the sketch after this list).
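For the continuous batching item, `vllm-rbln` plugs into the standard vLLM API. A sketch; the model path and engine arguments are illustrative, and RBLN-specific options may differ (see the vLLM serving guide):

```python
# Serve a compiled model with vLLM's continuous batching scheduler.
from vllm import LLM, SamplingParams

llm = LLM(
    model="llama3-8b-rbln",  # previously compiled model (assumed path)
    max_num_seqs=8,          # cap on requests batched together
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["What is continuous batching?"], params)
print(outputs[0].outputs[0].text)
```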
You can ask questions or discuss technical issues on the Rebellions Dev Forum, or reach out to us directly here.
The SDK is updated approximately every month, and the driver is updated every three months, although the schedule is subject to change.
For detailed information, please refer to the latest Release Notes.
Currently, for officially supported models listed in the RBLN Model Zoo, you can use the provided compilation and inference example code.
If you’re using a modified model or a model not included in the Model Zoo, technical support may be limited, and compilation may fail.
First, check the error code to identify the cause. If further assistance is required, please reach out via the Rebellions Dev Forum.
Please check the following items:
- Memory Usage: If the system runs out of memory during compilation, the process may fail.
- NPU Configuration: Ensure that the value of `rbln_tensor_parallel_size` is not greater than the actual number of devices installed in your system. You can verify the number of devices by running the `rbln-stat` command in your terminal.
- Docker Environment: Refer to the Docker Guide for more details.
You can limit the number of CPU threads used during inference by setting the `RBLN_NUM_THREADS` environment variable. Specifying an appropriate number of threads can reduce CPU load and help stabilize performance. Please refer to this document for more details.
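For example, in Python the variable can be set before the runtime is created; the value 4 is illustrative and should be tuned for your host:

```python
# Cap the CPU threads used by the RBLN runtime via RBLN_NUM_THREADS.
import os
os.environ["RBLN_NUM_THREADS"] = "4"  # set before creating a Runtime

import rebel
runtime = rebel.Runtime("model.rbln")
```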
Issues may arise due to version mismatches between the driver and compiler.
- Refer to the Release Notes of the RBLN SDK to ensure that all components are installed with compatible versions.
- After aligning all libraries to their compatible versions, try recompiling the model.