ML Hyperpolyglot / Accelerators
a side-by-side reference sheet
general | memory | compute (peak) | interconnect & scale
Contributions welcome on GitHub.
| General | TPU v3 | TPU v4 | TPU v5e | TPU v5p | TPU v6e (Trillium) | TPU7x (Ironwood) | Nvidia V100 | Nvidia A100 | Nvidia H100 | Nvidia H200 | Nvidia L4 | Nvidia B200 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Release Year | 2018 | 2021 | 2023 | 2023 | 2024 | 2025 | 2017 | 2020 | 2022 | 2023 | 2023 | 2024 |
| Architecture | MXU | MXU | TensorCore | TensorCore | Trillium | Ironwood | Volta | Ampere | Hopper | Hopper | Ada Lovelace | Blackwell |
| Memory | TPU v3 | TPU v4 | TPU v5e | TPU v5p | TPU v6e (Trillium) | TPU7x (Ironwood) | Nvidia V100 | Nvidia A100 | Nvidia H100 | Nvidia H200 | Nvidia L4 | Nvidia B200 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HBM Capacity | 32 GB | 32 GB | 16 GB | 95 GB | 32 GB | 192 GB | 32 GB | 80 GB | 80 GB | 141 GB | 24 GB (GDDR6) | 192 GB |
| HBM Bandwidth (GB/s) | 900 | 1200 | 819 | 2765 | 1638 | 7380 | 900 | 2039 | 3350 | 4800 | 300 | 8000 |
| SMEM/VMEM Capacity | 16 MiB (per core) | 16 MiB (per core) | 128 MiB | (unpublished) | (unpublished) | (unpublished) | 6 MiB | 40 MiB | 50 MiB | 60 MiB | | |
| Compute (Peak) | TPU v3 | TPU v4 | TPU v5e | TPU v5p | TPU v6e (Trillium) | TPU7x (Ironwood) | Nvidia V100 | Nvidia A100 | Nvidia H100 | Nvidia H200 | Nvidia L4 | Nvidia B200 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BF16 / FP16 TFLOPS | 123 | 275 | 197 | 459 | 918 | 2307 | 125 | 312 | 1979 | 1979 | 121 | 2250 |
| FP8 TFLOPS | | | 197 | 459 | 918 | 4614 | | | 3958 | 3958 | 242 | 4500 |
| INT8 TOPS | | 275 | 393 | 918 | 1836 | | 62 | 624 | 3958 | 3958 | 242 | 4500 |
| FP4 TFLOPS | | | | | | | | | | | | 9000 |
| Interconnect & Scale | TPU v3 | TPU v4 | TPU v5e | TPU v5p | TPU v6e (Trillium) | TPU7x (Ironwood) | Nvidia V100 | Nvidia A100 | Nvidia H100 | Nvidia H200 | Nvidia L4 | Nvidia B200 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ICI Bandwidth (Per Chip) (GB/s) | 656 | 300 | 200 | 1200 | 400 | | 300 | 600 | 900 | 900 | | 1800 |
| DCN Bandwidth (Per Chip) | ~10 GB/s (100 Gbps) | ~12.5 GB/s (100 Gbps) | ~12.5 GB/s (100 Gbps) | ~25 GB/s (200 Gbps) | ~25 GB/s (200 Gbps) | | ~12.5 GB/s (100 Gbps) | ~25 GB/s (200 Gbps) | ~50 GB/s (400 Gbps) | ~50 GB/s (400 Gbps) | ~50 GB/s (400 Gbps) | ~50-100 GB/s (400-800 Gbps) |
| Max Scale (ICI Domain) | 1024 chips | 4096 chips | 256 chips | 8960 chips | 256 chips | 9216 chips | 8 (Server) | 16 (NVSwitch) | 256 (SuperPOD) | 256 (SuperPOD) | 8 (Server) | 576 (NVLink) |
General
Memory
SMEM/VMEM Capacity
- TPU VMEM: On-chip Vector Memory (SRAM) local to each TensorCore. Values are per-core unless shared.
- GPU L2: on GPUs, the L2 cache is the closest architectural equivalent to the TPU's on-chip scratchpad; the GPU values listed are L2 cache sizes.
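As a quick way to use these numbers, here is a minimal sketch (the helper name and tile shapes are illustrative, not from the sheet) that checks whether a bf16 matmul tile fits entirely in the on-chip scratchpad sizes listed above:

```python
# Hypothetical helper: does a bf16 matmul tile (A: m x k, B: k x n,
# C: m x n, all resident on-chip) fit in the scratchpad?
def tile_fits(m, n, k, scratchpad_bytes, dtype_bytes=2):
    needed = (m * k + k * n + m * n) * dtype_bytes
    return needed <= scratchpad_bytes

VMEM_V5E = 128 * 2**20  # TPU v5e VMEM: 128 MiB (from the table)
L2_H100 = 50 * 2**20    # H100 L2: 50 MiB (from the table)

print(tile_fits(512, 512, 512, VMEM_V5E))    # True  (~1.5 MiB needed)
print(tile_fits(4096, 4096, 4096, L2_H100))  # False (~96 MiB needed)
```

This is the basic sizing check behind tiled kernels: tiles that overflow the scratchpad spill to HBM and pay the bandwidth cost instead.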
Compute (Peak)
BF16 / FP16 TFLOPS
- TPU7x/Ironwood: First dual-chiplet architecture, exposing 2 devices per chip.
- Nvidia Tensor Cores: Figures are for Tensor Core performance (with sparsity where applicable for newer gens).
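The compute and memory tables combine into a simple roofline check. A rough sketch (function name mine; numbers from the tables above) of the ridge point, i.e. the arithmetic intensity needed to be compute-bound rather than HBM-bound:

```python
# Ridge point of the roofline: FLOPs per HBM byte needed to saturate
# the matrix units. Below this intensity, the kernel is memory-bound.
def ridge_point(peak_tflops, hbm_gbps):
    return (peak_tflops * 1e12) / (hbm_gbps * 1e9)

# H100: 1979 TFLOPS bf16 (table figure, with sparsity), 3350 GB/s HBM.
print(round(ridge_point(1979, 3350)))  # 591 FLOPs/byte
# TPU v5p: 459 TFLOPS bf16, 2765 GB/s HBM.
print(round(ridge_point(459, 2765)))   # 166 FLOPs/byte
```

Kernels with lower arithmetic intensity than the ridge point (most elementwise ops, small matmuls) are limited by the HBM bandwidth row, not the TFLOPS row.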
Interconnect & Scale
DCN Bandwidth (Per Chip)
- DCN (Data Center Network): Network bandwidth for inter-rack communication, typically Ethernet or InfiniBand. Values are approximate and vary by deployment.
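The DCN row quotes both units; the conversion is simply bits to bytes, as this one-liner shows:

```python
# Network gear is specced in gigabits/s; the tables' GB/s figures are /8.
def gbits_to_gbytes(gbps):
    return gbps / 8

print(gbits_to_gbytes(400))  # 50.0 GB/s, e.g. a 400 Gbps NIC
print(gbits_to_gbytes(100))  # 12.5 GB/s
```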
Max Scale (ICI Domain)
- ICI (Inter-Chip Interconnect): Dedicated fabric for chip-to-chip communication (TPU ICI or Nvidia NVLink).
- ICI Domain: Maximum number of chips connectable via the specialized low-latency fabric before requiring Ethernet/InfiniBand.
- TPU v4/v5p: Feature 3D Torus interconnects for massive scale.
- TPU 7x docs
- Nvidia NVLink: Historically server-scale, now scaling to hundreds with NVSwitch (e.g., NVL72/576).
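To see why the ICI-domain size and per-chip bandwidth matter together, here is a back-of-envelope sketch (the standard ring all-reduce cost model, not something the sheet specifies; function name mine) estimating collective time from the interconnect table:

```python
# Lower bound for a ring all-reduce: each chip moves about
# 2 * (n - 1) / n * array_bytes over its per-chip link bandwidth.
def allreduce_seconds(array_bytes, n_chips, link_gbytes_per_s):
    payload = 2 * (n_chips - 1) / n_chips * array_bytes
    return payload / (link_gbytes_per_s * 1e9)

# 1 GiB of gradients across 256 H100s at 900 GB/s NVLink (table values):
t = allreduce_seconds(2**30, 256, 900)
print(f"{t * 1e3:.2f} ms")  # → 2.38 ms
```

Once a job outgrows the ICI domain, the same payload crosses the far slower DCN links instead, which is why the max-scale row is a hard cliff for communication-heavy training.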