ML Hyperpolyglot / ML Inference Engines

a side-by-side reference sheet

general | core features | supported hardware

Contributions welcome on GitHub.

General
| | vLLM | SGLang | llama.cpp | TensorRT-LLM | TGI | DeepSpeed-MII | MLC LLM |
|---|---|---|---|---|---|---|---|
| GitHub | vllm-project/vllm | sgl-project/sglang | ggerganov/llama.cpp | NVIDIA/TensorRT-LLM | huggingface/text-generation-inference | microsoft/DeepSpeed-MII | mlc-ai/mlc-llm |
| Primary Language | Python / C++ / CUDA | Python / CUDA | C / C++ | C++ / Python / CUDA | Rust / Python | Python / CUDA | C++ / Python / Metal / CUDA / Vulkan |
| License | Apache 2.0 | Apache 2.0 | MIT | Apache 2.0 | Apache 2.0 (mostly) | Apache 2.0 | Apache 2.0 |
Core Features
| | vLLM | SGLang | llama.cpp | TensorRT-LLM | TGI | DeepSpeed-MII | MLC LLM |
|---|---|---|---|---|---|---|---|
| Continuous Batching | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| PagedAttention | Yes (Original) | Yes | Yes | Yes | Yes | Yes | Yes |
| Speculative Decoding | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Quantization Support | AWQ, FP8, GPTQ, SqueezeLLM | AWQ, FP8, GPTQ | GGUF (K-Quants), IQ-Quants | AWQ, FP8, INT8, INT4 | AWQ, EETQ, GPTQ, bitsandbytes | No (uses DeepSpeed-Inference) | 4-bit, 8-bit, AWQ |
| Multi-GPU Support | Ray, PyTorch Distributed | Ray, PyTorch Distributed | Native | MPI | PyTorch Distributed | DeepSpeed | Yes |
Supported Hardware
| | vLLM | SGLang | llama.cpp | TensorRT-LLM | TGI | DeepSpeed-MII | MLC LLM |
|---|---|---|---|---|---|---|---|
| NVIDIA GPU | Yes | Yes | Yes (CUDA) | Yes (Optimized) | Yes | Yes | Yes |
| AMD GPU | Yes (ROCm) | Yes (ROCm) | Yes (HIP) | No | Yes (ROCm) | Yes (ROCm) | Yes (ROCm) |
| CPU | Yes | Limited | Yes (Highly Optimized) | No | Yes | No | Yes (Vulkan) |
| Apple Silicon | No | No | Yes (Metal) | No | No | No | Yes (Metal) |
| TPU | Yes | No | No | No | No | No | No |
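
To shortlist engines for a given deployment target, the hardware rows above can be transcribed into a small lookup. The sketch below is only an illustration: `ENGINES`, `SUPPORT`, and `engines_for` are hypothetical names, not part of any engine's API, and the values are copied by hand from this sheet, so they are only as current as the table itself.

```python
# Hand-transcribed copy of the Supported Hardware rows (illustrative, not an API).
ENGINES = ["vLLM", "SGLang", "llama.cpp", "TensorRT-LLM", "TGI", "DeepSpeed-MII", "MLC LLM"]

# One list per table row, in ENGINES order; each value is "yes", "no", or "limited".
SUPPORT = {
    "NVIDIA GPU":    ["yes", "yes", "yes", "yes", "yes", "yes", "yes"],
    "AMD GPU":       ["yes", "yes", "yes", "no", "yes", "yes", "yes"],
    "CPU":           ["yes", "limited", "yes", "no", "yes", "no", "yes"],
    "Apple Silicon": ["no", "no", "yes", "no", "no", "no", "yes"],
    "TPU":           ["yes", "no", "no", "no", "no", "no", "no"],
}


def engines_for(*targets, allow_limited=False):
    """Return the engines the table lists as supporting every target given."""
    accepted = {"yes", "limited"} if allow_limited else {"yes"}
    return [
        name
        for i, name in enumerate(ENGINES)
        if all(SUPPORT[target][i] in accepted for target in targets)
    ]


if __name__ == "__main__":
    print(engines_for("Apple Silicon"))
    print(engines_for("AMD GPU", "CPU"))
    print(engines_for("AMD GPU", "CPU", allow_limited=True))
```

Passing several targets returns only engines listed as supporting all of them; `allow_limited=True` also admits entries marked "Limited".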