ML Hyperpolyglot / Accelerators

a side-by-side reference sheet

general | memory | compute (peak) | interconnect & scale

Contributions welcome on GitHub.

General

|  | TPU v3 | TPU v4 | TPU v5e | TPU v5p | TPU v6e (Trillium) | TPU7x (Ironwood) | Nvidia V100 | Nvidia A100 | Nvidia H100 | Nvidia H200 | Nvidia L4 | Nvidia B200 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Release Year | 2018 | 2021 | 2023 | 2023 | 2024 | 2025 | 2017 | 2020 | 2022 | 2023 | 2023 | 2024 |
| Architecture | MXU | MXU | TensorCore | TensorCore | Trillium | Ironwood | Volta | Ampere | Hopper | Hopper | Ada Lovelace | Blackwell |
Memory

|  | TPU v3 | TPU v4 | TPU v5e | TPU v5p | TPU v6e (Trillium) | TPU7x (Ironwood) | Nvidia V100 | Nvidia A100 | Nvidia H100 | Nvidia H200 | Nvidia L4 | Nvidia B200 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HBM Capacity | 32 GB | 32 GB | 16 GB | 95 GB | 32 GB | 192 GB | 32 GB | 80 GB | 80 GB | 141 GB | 24 GB (GDDR6) | 192 GB |
| HBM Bandwidth (GB/s) | 900 | 1200 | 819 | 2765 | 1638 | 7380 | 900 | 2039 | 3350 | 4800 | 300 | 8000 |
| SMEM/VMEM Capacity | 16 MiB (per core) | 16 MiB (per core) | 128 MiB | (unpublished) | (unpublished) | (unpublished) | 6 MiB | 40 MiB | 50 MiB | 60 MiB | - | - |

("-" = not published or not applicable.)
Compute (Peak)

|  | TPU v3 | TPU v4 | TPU v5e | TPU v5p | TPU v6e (Trillium) | TPU7x (Ironwood) | Nvidia V100 | Nvidia A100 | Nvidia H100 | Nvidia H200 | Nvidia L4 | Nvidia B200 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BF16 / FP16 TFLOPS | 123 | 275 | 197 | 459 | 918 | 2307 | 125 | 312 | 1979 | 1979 | 121 | 2250 |
| FP8 TFLOPS | - | - | 197 | 459 | 918 | 4614 | - | - | 3958 | 3958 | 242 | 4500 |
| INT8 TOPS | - | 275 | 393 | 918 | 1836 | - | 62 | 624 | 3958 | 3958 | 242 | 4500 |
| FP4 TFLOPS | - | - | - | - | - | - | - | - | - | - | - | 9000 |
Interconnect & Scale

|  | TPU v3 | TPU v4 | TPU v5e | TPU v5p | TPU v6e (Trillium) | TPU7x (Ironwood) | Nvidia V100 | Nvidia A100 | Nvidia H100 | Nvidia H200 | Nvidia L4 | Nvidia B200 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ICI Bandwidth (Per Chip, GB/s) | 656 | 300 | 200 | 1200 | 400 | - | 300 | 600 | 900 | 900 | - | 1800 |
| DCN Bandwidth (Per Chip) | ~12.5 GB/s (100 Gbps) | ~12.5 GB/s (100 Gbps) | ~12.5 GB/s (100 Gbps) | ~25 GB/s (200 Gbps) | ~25 GB/s (200 Gbps) | - | ~12.5 GB/s (100 Gbps) | ~25 GB/s (200 Gbps) | ~50 GB/s (400 Gbps) | ~50 GB/s (400 Gbps) | ~50 GB/s (400 Gbps) | ~50-100 GB/s (400-800 Gbps) |
| Max Scale (ICI Domain) | 1024 chips | 4096 chips | 256 chips | 8960 chips | 256 chips | 9216 chips | 8 (Server) | 16 (NVSwitch) | 256 (SuperPOD) | 256 (SuperPOD) | 8 (Server) | 576 (NVLink) |
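
Reading the compute and memory tables together gives a quick roofline estimate: peak FLOPs divided by HBM bandwidth is the arithmetic intensity (FLOPs per byte of HBM traffic) at which a kernel stops being memory-bound. A minimal sketch using a few rows from the sheet; the chip selection and dict layout are illustrative, and the H100 entry is the sparsity-inflated peak quoted above:

```python
# Ridge point = peak FLOPs / HBM bandwidth: the arithmetic intensity
# (FLOPs per HBM byte) at which a kernel becomes compute-bound.
# Values taken from the tables above (BF16 TFLOPS, HBM GB/s).
CHIPS = {
    "TPU v5p":     (459, 2765),
    "TPU v6e":     (918, 1638),
    "Nvidia H100": (1979, 3350),
    "Nvidia B200": (2250, 8000),
}

for name, (tflops, gbs) in CHIPS.items():
    ridge = tflops * 1e12 / (gbs * 1e9)
    print(f"{name:>12}: ~{ridge:.0f} FLOPs/byte to be compute-bound")
```

The spread is wide: a kernel that is comfortably compute-bound on v5p (~166 FLOPs/byte) can still be memory-bound on H100 (~590).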

Memory

SMEM/VMEM Capacity

  • TPU VMEM: On-chip vector memory (SRAM) local to each TensorCore. Values are per core unless noted otherwise.
  • GPU L2: For GPUs, the L2 cache is the closest architectural equivalent to the TPU's on-chip scratchpad, though it is managed by hardware rather than by the program. A tile-fit check against these capacities is sketched below.
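
Why these capacities matter: a matmul tile only streams at full speed if its working set stays on-chip. A minimal fit-check sketch, assuming bf16 operands; the function names, tile shape, and capacity dict are illustrative, not any library's API:

```python
# Does one (m, k) @ (k, n) matmul tile, plus its (m, n) accumulator,
# fit in the on-chip memory sizes quoted in the table above (MiB)?
ON_CHIP_MIB = {"TPU v4 VMEM": 16, "A100 L2": 40, "H100 L2": 50}

def tile_bytes(m: int, n: int, k: int, bytes_per_elem: int = 2) -> int:
    """Working set of one tile with bf16 (2-byte) elements."""
    return (m * k + k * n + m * n) * bytes_per_elem

for chip, mib in ON_CHIP_MIB.items():
    fits = tile_bytes(2048, 2048, 2048) <= mib * 2**20
    print(f"{chip}: {'fits' if fits else 'spills to HBM'}")
```

With a 2048-cubed tile (~24 MiB), the sketch shows the tile spilling on a 16 MiB VMEM but fitting in a 40 or 50 MiB L2.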

Compute (Peak)

BF16 / FP16 TFLOPS

  • TPU7x/Ironwood: The first dual-chiplet TPU; each physical chip is exposed as two devices.
  • Nvidia Tensor Cores: Figures are Tensor Core peaks. For newer generations they include 2:4 structured sparsity where applicable (e.g., H100's 1979 TFLOPS is the sparse figure; the dense rate is half that). A quick benchmark against these peaks is sketched below.
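
A rough way to relate quoted peaks to achieved throughput, as a minimal JAX microbenchmark. PEAK_TFLOPS below is the H100 sparse figure from the table and should be swapped for your chip's number; a dense matmul cannot reach a sparsity-inflated peak, so expect well under 50% of it:

```python
import time
import jax
import jax.numpy as jnp

PEAK_TFLOPS = 1979  # from the table; substitute your accelerator's figure

def bench_matmul(n: int = 8192, iters: int = 10) -> float:
    """Achieved TFLOPS for an (n, n) @ (n, n) bf16 matmul."""
    ka, kb = jax.random.split(jax.random.PRNGKey(0))
    a = jax.random.normal(ka, (n, n), dtype=jnp.bfloat16)
    b = jax.random.normal(kb, (n, n), dtype=jnp.bfloat16)
    f = jax.jit(lambda x, y: x @ y)
    f(a, b).block_until_ready()            # compile and warm up
    t0 = time.perf_counter()
    for _ in range(iters):
        out = f(a, b)
    out.block_until_ready()
    dt = (time.perf_counter() - t0) / iters
    return 2 * n**3 / dt / 1e12            # 2n^3 FLOPs per matmul

tflops = bench_matmul()
print(f"achieved ~{tflops:.0f} TFLOPS "
      f"({100 * tflops / PEAK_TFLOPS:.0f}% of quoted peak)")
```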

Interconnect & Scale

DCN Bandwidth (Per Chip)

  • DCN (Data Center Network): Network bandwidth for inter-host and inter-rack communication, typically Ethernet or InfiniBand. Values are approximate and vary by deployment; a back-of-envelope sync-time estimate is sketched below.
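
For intuition, an estimate of gradient-sync time over DCN using the standard 2(n-1)/n ring all-reduce cost model; the model size, worker count, and per-chip bandwidth below are illustrative:

```python
# Ring all-reduce of bf16 gradients over DCN.
def allreduce_seconds(params: float, n_workers: int, gb_per_s: float,
                      bytes_per_param: int = 2) -> float:
    bytes_moved = 2 * (n_workers - 1) / n_workers * params * bytes_per_param
    return bytes_moved / (gb_per_s * 1e9)

# 8B parameters across 64 workers at ~25 GB/s per chip (the 200 Gbps tier).
print(f"{allreduce_seconds(8e9, 64, 25):.2f} s per full gradient sync")  # ~1.26 s
```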

Max Scale (ICI Domain)

  • ICI (Inter-Chip Interconnect): Dedicated fabric for chip-to-chip communication (TPU ICI or Nvidia NVLink).
  • ICI Domain: Maximum number of chips connectable via the specialized low-latency fabric before requiring Ethernet/InfiniBand.
  • TPU v4/v5p: Use a 3D torus ICI topology, which is what enables the multi-thousand-chip domains in the table.
  • Nvidia NVLink: Historically limited to a single server; NVSwitch-based systems now extend the NVLink domain to hundreds of GPUs (e.g., NVL72/576). A minimal sharding sketch over one domain follows.
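
To make the ICI domain concrete in code: within one domain, a framework can treat all chips as a single sharded device pool, and collectives ride the ICI/NVLink fabric rather than the DCN. A minimal JAX sketch; it runs on however many local devices are visible, and the axis name "data" is an arbitrary choice:

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# One mesh axis spanning every locally visible chip in the ICI domain.
devices = np.array(jax.devices())
mesh = Mesh(devices, axis_names=("data",))

# Shard the batch axis across the domain.
batch = jnp.ones((128 * len(devices), 1024))
batch = jax.device_put(batch, NamedSharding(mesh, P("data", None)))

@jax.jit
def global_mean(x):
    return x.mean()  # implicit cross-chip reduction over the mesh

print(global_mean(batch))  # one scalar, reduced across all chips
```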