ML Hyperpolyglot / ML Inference Engines
a side-by-side reference sheet
Contributions welcome on GitHub.
| General | |||||||
|---|---|---|---|---|---|---|---|
| | vLLM | SGLang | llama.cpp | TensorRT-LLM | TGI | DeepSpeed-MII | MLC LLM |
| GitHub | vllm-project/vllm | sgl-project/sglang | ggerganov/llama.cpp | NVIDIA/TensorRT-LLM | huggingface/text-generation-inference | microsoft/DeepSpeed-MII | mlc-ai/mlc-llm |
| Primary Language | Python / C++ / CUDA | Python / CUDA | C / C++ | C++ / Python / CUDA | Rust / Python | Python / CUDA | C++ / Python / Metal / CUDA / Vulkan |
| License | Apache 2.0 | Apache 2.0 | MIT | Apache 2.0 | Apache 2.0 (mostly) | Apache 2.0 | Apache 2.0 |
| Core Features | |||||||
| | vLLM | SGLang | llama.cpp | TensorRT-LLM | TGI | DeepSpeed-MII | MLC LLM |
| Continuous Batching | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| PagedAttention | Yes (Original) | Yes | Yes | Yes | Yes | Yes | Yes |
| Speculative Decoding | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Quantization Support | AWQ, FP8, GPTQ, SqueezeLLM | AWQ, FP8, GPTQ | GGUF (K-Quants), IQ-Quants | AWQ, FP8, INT8, INT4 | AWQ, EETQ, GPTQ, bitsandbytes | No (uses DeepSpeed-Inference) | 4-bit, 8-bit, AWQ |
| Multi-GPU Support | Ray, PyTorch Distributed | Ray, PyTorch Distributed | Native | MPI | PyTorch Distributed | DeepSpeed | Yes |
| Supported Hardware | |||||||
| | vLLM | SGLang | llama.cpp | TensorRT-LLM | TGI | DeepSpeed-MII | MLC LLM |
| NVIDIA GPU | Yes | Yes | Yes (CUDA) | Yes (Optimized) | Yes | Yes | Yes |
| AMD GPU | Yes (ROCm) | Yes (ROCm) | Yes (HIP) | No | Yes (ROCm) | Yes (ROCm) | Yes (ROCm) |
| CPU | Yes | Limited | Yes (Highly Optimized) | No | Yes | No | Yes (Vulkan) |
| Apple Silicon | No | No | Yes (Metal) | No | No | No | Yes (Metal) |
| TPU | Yes | No | No | No | No | No | No |
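
To shortlist engines for a given deployment target, the Supported Hardware rows can be transcribed into a small lookup. The sketch below is only an illustration: the names `ENGINES`, `HARDWARE`, and `engines_for` are hypothetical, the cell values are copied from the table with the backend notes (CUDA, ROCm, Metal, and so on) dropped, and "Limited" is kept as its own value rather than being counted as full support.

```python
# Engine order matches the table columns.
ENGINES = ("vLLM", "SGLang", "llama.cpp", "TensorRT-LLM",
           "TGI", "DeepSpeed-MII", "MLC LLM")

# One entry per hardware row; each tuple follows ENGINES order.
# Values mirror the table cells: "yes", "no", or "limited".
HARDWARE = {
    "NVIDIA GPU":    ("yes", "yes", "yes", "yes", "yes", "yes", "yes"),
    "AMD GPU":       ("yes", "yes", "yes", "no", "yes", "yes", "yes"),
    "CPU":           ("yes", "limited", "yes", "no", "yes", "no", "yes"),
    "Apple Silicon": ("no", "no", "yes", "no", "no", "no", "yes"),
    "TPU":           ("yes", "no", "no", "no", "no", "no", "no"),
}


def engines_for(*targets, allow_limited=False):
    """Return engines (in table order) that support every listed target."""
    accepted = {"yes", "limited"} if allow_limited else {"yes"}
    return [
        engine
        for index, engine in enumerate(ENGINES)
        if all(HARDWARE[target][index] in accepted for target in targets)
    ]


if __name__ == "__main__":
    print(engines_for("Apple Silicon"))   # ['llama.cpp', 'MLC LLM']
    print(engines_for("AMD GPU", "CPU"))  # ['vLLM', 'llama.cpp', 'TGI', 'MLC LLM']
```

Passing `allow_limited=True` also admits engines whose cell reads "Limited", which for the CPU row adds SGLang to the result. Because the data is a snapshot of this sheet, it should be re-checked against each project's current documentation before it informs a deployment decision.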