Enhancing Serving and Application Performance of LLMs
Large language models (LLMs) have become common parlance in the tech industry. For these powerful models to truly reach and benefit a broader audience, they must be accessible through intuitive, user-friendly applications. However, as LLMs grow in parameter count and complexity, their computational demands have escalated, making efficient inference increasingly challenging. Achieving optimal LLM serving requires advanced, application-level optimizations from providers to enhance speed, memory usage, and overall efficiency. At Rebellions, we are committed not only to supporting the development of LLMs but also to meeting the application-level requirements needed to deliver seamless and scalable AI solutions.
Tensor Parallelism
Distributing the inference workload across multiple devices through parallelism is advantageous in two ways: it allows the acceleration of larger LLMs that cannot fit within a single device, and it increases compute capacity, which improves throughput and latency.
Tensor parallelism, a form of parallelism particularly effective for LLMs, involves intelligently partitioning the model’s tensors. The RBLN Compiler is designed to fully optimize this process by performing meticulous analyses of data dependencies, movement patterns, and the scheduling of tasks across multiple devices. By leveraging these analyses, the RBLN Compiler supports tensor parallelism across up to 16 devices, ensuring high performance for demanding workloads in larger deployments, such as our Rebellions Scalable Design (RSD).
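Conceptually, tensor parallelism shards the large weight matrices inside each layer so that every device computes only a slice of each output. The toy NumPy sketch below illustrates column-wise partitioning of a single linear layer; it is purely illustrative, as the RBLN Compiler performs this partitioning and scheduling automatically.

import numpy as np

# Toy example: y = x @ W, with W split column-wise across `tp` devices.
tp = 8                            # tensor_parallel_size
x = np.random.randn(4, 4096)      # activations (batch, hidden)
W = np.random.randn(4096, 11008)  # full weight matrix

shards = np.split(W, tp, axis=1)      # each device holds one shard
partials = [x @ w for w in shards]    # computed in parallel, one per device
y = np.concatenate(partials, axis=1)  # gather the output slices

assert np.allclose(y, x @ W)  # identical to the unsharded computation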
Below is an example of applying tensor parallelism across eight ATOM devices with the RBLN SDK.
from optimum.rbln import RBLNLlavaNextForConditionalGeneration

# Compile LLaVA-NeXT, sharding its language model across 8 ATOM devices.
model = RBLNLlavaNextForConditionalGeneration.from_pretrained(
    model_id="llava-hf/llava-v1.6-mistral-7b-hf",
    export=True,  # compile the model for RBLN devices
    rbln_config={
        "language_model": {
            "tensor_parallel_size": 8,  # partition weights across 8 devices
            "max_seq_len": 32768,
            "use_inputs_embeds": True,
            "batch_size": 8,
        },
        "vision_feature_select_strategy": "default",
    },
)
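In this configuration, tensor_parallel_size is set to 8 so that the language model’s weights are partitioned across eight ATOM devices, while max_seq_len and batch_size determine the sequence length and batch dimensions the model is compiled to serve.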
Weight-Only Quantization
Weight-only quantization (WOQ) addresses the challenge of high memory consumption by reducing weight-transfer overhead in LLMs. For example, running an entire 70B-parameter model in FP16 demands approximately 140 GB of memory, which can degrade latency even with tensor parallelism in place. WOQ enables the efficient deployment of LLMs with little compromise in accuracy or performance.
This technique reduces the precision of model weights to a lower bit width (here, from FP16 to INT4), decreasing the memory traffic incurred when transferring parameters from memory to the Neural Engines. The weights are then upcast back before computation to maintain accuracy. Through a partnership with SqueezeBits, we successfully compressed the Llama3-70B model’s 16-bit weights to 4-bit weights with minimal performance degradation across eight zero-shot tasks through ATOM-aware quantization. This achievement lays a crucial foundation for efficient LLM deployment.
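The sketch below shows the general mechanism in PyTorch using symmetric per-output-channel quantization; it is a minimal illustration of the idea, not the ATOM-aware quantizer developed with SqueezeBits.

import torch

def quantize_int4(w: torch.Tensor):
    # Symmetric per-output-channel quantization to the INT4 range [-8, 7].
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    return q, scale  # real kernels pack two 4-bit values per byte

w = torch.randn(4096, 4096)  # a full-precision weight matrix
q, scale = quantize_int4(w)

# Only the 4-bit weights (plus per-channel scales) travel from memory;
# they are upcast back to higher precision just before computation.
w_up = q.to(w.dtype) * scale

x = torch.randn(1, 4096)
y = x @ w_up.T  # compute runs at the original precision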
vLLM Integration
vLLM is a widely recognized open-source serving framework designed to optimize LLM serving by enhancing inference performance and efficiency. Its advanced memory management and execution techniques effectively optimize computational workflows, leading to faster processing and reduced latency. In line with our commitment to delivering cutting-edge AI solutions, Rebellions has integrated vLLM into our software stack. We have already implemented continuous batching and are actively developing support for PagedAttention and FlashAttention to further elevate LLM serving capabilities.
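For reference, here is a minimal offline-inference example using vLLM’s standard Python API; the model id is a placeholder, and RBLN-specific installation and device setup are omitted.

from vllm import LLM, SamplingParams

# vLLM's engine applies continuous batching across these prompts automatically.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # placeholder model id
sampling_params = SamplingParams(temperature=0.8, max_tokens=128)

prompts = [
    "What is tensor parallelism?",
    "Explain weight-only quantization in one paragraph.",
]
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)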
Rebellions Model Zoo
The RBLN Model Zoo (PyTorch/TensorFlow) provides a comprehensive selection of pre-trained machine learning and deep learning models, with ongoing additions to expand its offerings, all supported by detailed, user-friendly documentation. It includes leading LLMs built on the transformers and diffusers libraries, optimized for seamless integration with Rebellions hardware.
Optimum RBLN, adapted from Hugging Face’s Optimum library, is designed to optimize the performance of machine learning models for deployment on Rebellions’ hardware. Optimum RBLN is what enables the optimization techniques introduced above, acting as the interface between Rebellions’ hardware backend and the models.
With an extensive list of supported models that now includes vision-language models (VLMs) like LLaVA and crucial components of Retrieval-Augmented Generation (RAG) pipelines, such as embedding and reranker models, Rebellions provides a seamless experience for developers deploying LLMs to power their applications.
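As a rough sketch, deploying an embedding model for the retrieval stage of a RAG pipeline follows the same from_pretrained pattern shown above. The class name RBLNBertModel and the tokenization details below are assumptions based on Optimum RBLN’s naming convention; consult the RBLN Model Zoo documentation for the exact interface.

from optimum.rbln import RBLNBertModel  # class name assumed from the naming pattern
from transformers import AutoTokenizer

# Compile a BERT-style embedding model for RAG retrieval (illustrative).
model_id = "BAAI/bge-small-en-v1.5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = RBLNBertModel.from_pretrained(model_id=model_id, export=True)

inputs = tokenizer(
    "What is weight-only quantization?",
    padding="max_length",  # RBLN models are compiled with static shapes
    return_tensors="pt",
)
embedding = model(**inputs).last_hidden_state[:, 0]  # CLS-token embedding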
Stay Tuned!
The proof of the pudding is in the eating, and the true test of LLMs lies in their ability to deliver effective inference in real-world applications. At Rebellions, we go beyond merely providing powerful LLMs; we equally prioritize developing and optimizing the application-level features that ensure seamless integration, scalability, and performance across diverse use cases. By focusing on both the models and their deployment environments, Rebellions is committed to enabling meaningful, high-impact AI solutions. Stay tuned and don’t miss the latest news on our technology!