This note summarises details of some of the new silicon chips for machine
intelligence. Its aim is to distil the most important implementation and
architectural details (at least those that are currently available), to highlight the
main differences between them. I’m focusing on chips designed for training
since they represent the frontier in performance and capability. There are many
chips designed for inference, but these are typically intended for use in
embedded or edge deployments.
The Cerebras Wafer-Scale Engine (WSE) is undoubtedly the boldest and most
innovative design to appear recently. Wafer-scale integration is not a new
idea, but issues to do with yield, power delivery and thermal
expansion have made it difficult to commercialise (see the 1989 Anamartic 160 MB solid state disk). Cerebras use this approach to integrate 84
chips with high-speed interconnect, uniformly scaling the 2D-mesh based
interconnect to huge proportions. This provides a machine with a large amount
of memory (18 GB) distributed among a large amount of compute (3.3 Peta FLOPs
peak). It is unclear how this architecture scales beyond single WSEs; the
current trend in neural nets is to larger networks with billions of weights,
which will necessitate such scaling.
Announced August 2019.
46,225 mm2 wafer-scale integrated system (215 mm x 215 mm) on TSMC 16 nm.
Many individual chips: a total of 84 (12 wide by 7 tall).
18 GB total of SRAM memory, distributed among cores.
426,384 simple compute cores.
Silicon defects can be repaired by using redundant cores and links to bypass a faulty area. It appears that each column includes one redundant core, leaving 410,592 functional cores.
Speculated clock speed of ~1 GHz and 15 kW power consumption.
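Cerebras have not published how the repair remapping works; the following is a toy sketch (the function name and interface are hypothetical) of how a one-spare-per-column scheme like the one described above could remap logical cores around a defect:

```python
# Hypothetical sketch of the repair scheme described above: each column of
# cores carries one spare, so a single faulty core in a column can be
# bypassed by shifting the logical-to-physical row mapping past it.
def map_logical_rows(num_physical, faulty_rows):
    """Return a logical->physical row map, or None if the column has more
    faults than the single spare can absorb."""
    if len(faulty_rows) > 1:           # only one spare per column (assumed)
        return None
    mapping = [r for r in range(num_physical) if r not in faulty_rows]
    return mapping[:num_physical - 1]  # logical rows = physical rows - 1 spare

# A 5-core column with core 2 faulty still exposes 4 logical cores.
print(map_logical_rows(5, {2}))  # [0, 1, 3, 4]
```

Links are presumably rerouted in the same way, so software sees a uniform, defect-free mesh.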
Interconnect and IO:
Interconnections between chips, across scribe lines, with wiring added in post-processing steps after conventional wafer manufacturing.
IOs brought out on east and west edges of wafer, which is limited by the pad density on each edge. It is unlikely there are any high-speed SerDes since these would need to be integrated in every chip, wasting wafer area in all but the chips on the periphery.
2D mesh-based interconnect, supports single-word messages. According to their whitepaper: “The Cerebras software configures all the cores on the WSE to support the precise communication required”, indicating that the interconnect is statically configured to support a fixed communication pattern.
Zeros not transmitted on the interconnect to optimise for sparsity.
Each core:
Is ~0.1 mm2 of silicon.
Has 47 kB SRAM memory.
Zeros not loaded from memory and zeros not multiplied.
Assumed FP32 precision and scalar execution (can’t filter zeros from memory with SIMD).
FMAC datapath (peak 8 operations per cycle).
Tensor control unit to feed the FMAC datapath with strided accesses from memory or inbound data from links.
Has four 8 GB/s bidirectional links to its neighbours.
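The zero-skipping behaviour above (zeros neither transmitted, loaded nor multiplied) can be illustrated with a toy scalar dot product; this is not Cerebras's implementation, just the idea:

```python
# Illustrative sketch of zero-skipping: operands are checked before the
# multiply, so unstructured sparsity in activations or weights translates
# directly into skipped work on a scalar datapath.
def sparse_dot(xs, ws):
    """Dot product that never multiplies by a zero operand.

    Returns the accumulated result and the number of multiplies performed."""
    acc = 0.0
    work = 0
    for x, w in zip(xs, ws):
        if x == 0.0 or w == 0.0:
            continue                # zero not loaded / not multiplied
        acc += x * w
        work += 1
    return acc, work

acc, work = sparse_dot([0.0, 2.0, 0.0, 3.0], [1.0, 4.0, 5.0, 0.0])
# Only one multiply (2.0 * 4.0) is actually performed.
```

This kind of filtering is only practical with scalar execution, which is consistent with the assumption above that the cores are not SIMD.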
Each chip:
Is 17 mm x 30 mm = 510 mm2 of silicon.
Has 225 MB SRAM memory.
Has 54 x 94 = 5,076 cores (two cores per row/column possibly unused due to repair scheme leaving 4,888 usable cores).
Peak FP32 performance of 40 Tera FLOPs.
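The per-chip figures can be cross-checked against the wafer-level headline numbers, assuming the speculated ~1 GHz clock and 8 operations per cycle per core:

```python
# Cross-checking the wafer-level numbers against the per-chip figures above.
chips = 12 * 7                  # 84 dies on the wafer
cores_per_chip = 54 * 94        # 5,076 fabricated cores per chip
usable_per_chip = 4_888         # after the repair scheme
total_cores = chips * cores_per_chip
usable_cores = chips * usable_per_chip
sram_gb = chips * 225 / 1024    # 225 MB SRAM per chip
# Assumed ~1 GHz clock and 8 FMAC-datapath ops/cycle per core:
peak_pflops = usable_cores * 8 * 1e9 / 1e15

print(total_cores)              # 426384
print(usable_cores)             # 410592
print(round(sram_gb, 1))        # 18.5
print(round(peak_pflops, 2))    # 3.28
```

The results line up with the quoted 18 GB of SRAM and 3.3 PFLOPs peak, which lends some support to the assumed clock speed.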
Google TPU 3
Few details are available on the specifications of the TPU 3, but it is likely
an incremental improvement to the TPU 2: doubling the performance, and adding
HBM2 memory to double the capacity and bandwidth.
Announced May 2018.
Likely to be 16nm or 12nm.
200W estimated TDP.
105 TFLOPs of BFloat16, likely from doubling the MXUs to four.
Each MXU has dedicated access to 8 GB of memory.
Integrated in four-chip modules.
32 GB HBM2 integrated memory with access bandwidth of 1200 GBps (assumed).
PCIe-3 x8 assumed at 8 GBps.
Google TPU 2
The TPU 2 is designed for training and inference. It improves over the TPU 1
with floating point arithmetic and enhanced memory capacity and bandwidth with HBM integrated memory.
Announced May 2017.
Likely to be 20nm.
200-250W estimated TDP.
45 TFLOPs of BFloat16.
Two cores with scalar and matrix units.
Also supports FP32.
Integrated in four-chip modules.
Each core:
128x128x32b systolic matrix unit (MXU) with BFloat16 multiplication and FP32 accumulation.
8 GB of dedicated HBM with access bandwidth of 300 GBps.
Peak throughput of 22.5 TFLOPs of BFloat16.
16 GB HBM integrated memory at 600 GBps bandwidth (assumed).
PCIe-3 x8 (8 GBps).
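BFloat16 is simply the top 16 bits of an IEEE FP32 value: the sign and 8-bit exponent are kept, and the mantissa is cut to 7 bits. This keeps FP32's dynamic range while halving storage and bandwidth, which is why it pairs naturally with FP32 accumulation in the MXU. A minimal sketch of the conversion:

```python
import struct

def to_bfloat16_bits(x: float) -> int:
    """Truncate an FP32 value to its top 16 bits (BFloat16), no rounding."""
    (f32_bits,) = struct.unpack('<I', struct.pack('<f', x))
    return f32_bits >> 16

def from_bfloat16_bits(b: int) -> float:
    """Widen BFloat16 bits back to FP32 by zero-filling the low mantissa."""
    (x,) = struct.unpack('<f', struct.pack('<I', b << 16))
    return x

# The 8-bit exponent survives, so even tiny magnitudes round-trip exactly;
# only mantissa precision is lost.
y = from_bfloat16_bits(to_bfloat16_bits(1.2345678))  # 1.234375
```

Hardware implementations typically round rather than truncate; truncation is used here only to keep the sketch short.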
Google TPU 1
Google’s first generation TPU was designed for inference only and supports only
integer arithmetic. It provides acceleration to a host CPU by being sent
instructions across PCIe-3, to perform matrix multiplications and apply
activation functions. This is a significant simplification which would have
saved much time in design and verification.
Announced in 2016.
331 mm2 die on 28nm process.
Clocked at 700 MHz and 28-40W TDP.
28 MB on-chip SRAM memory: 24 MB for activations and 4 MB for accumulators.
Proportions of the die area: 35% memory, 24% matrix multiply unit, 41%
remaining area for logic.
256x256x8b systolic matrix multiply unit (64K MACs/cycle).
INT8 and INT16 arithmetic (peak 92 and 23 TOPs/s respectively).
8 GBDDR3-2133 DRAM accessible via two ports at 34 GB/s.
PCIe-3 x 16 (14 GBps).
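The arithmetic performed by the TPU 1's systolic matrix unit can be modelled functionally as an INT8 matrix multiply with wide accumulators; this is not a cycle-accurate model of the systolic dataflow, just the maths it implements:

```python
# A minimal functional model of the systolic MXU's arithmetic: INT8 inputs,
# 32-bit accumulation. A full 256x256 array retires 64K of these
# multiply-accumulates every cycle.
def mxu_matmul(a, b):
    """a: MxK, b: KxN lists of INT8 values; returns MxN accumulator values."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0                  # wide accumulator, never overflows INT8
            for kk in range(k):
                assert -128 <= a[i][kk] <= 127 and -128 <= b[kk][j] <= 127
                acc += a[i][kk] * b[kk][j]
            c[i][j] = acc
    return c

print(mxu_matmul([[1, 2]], [[3], [4]]))  # [[11]]
```

The systolic arrangement exists to feed these MACs with high operand reuse: each input word is read once from SRAM and flows through the array, rather than being re-fetched per multiply.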
Graphcore IPU
DISCLAIMER: I work at Graphcore, and all of the information given here is
lifted directly from the linked references below.
The Graphcore IPU architecture is highly parallel with a large collection of
simple processors with small memories, connected by a high-bandwidth all-to-all
‘exchange’ interconnect. The architecture operates under a bulk-synchronous
parallel (BSP) model, whereby execution of a program proceeds as a sequence of
compute and exchange phases. Synchronisation is used to ensure all processes
are ready to start exchange. The BSP model is a powerful programming
abstraction because it precludes concurrency hazards, and BSP execution allows
the compute and exchange phases to make full use of the chip’s power resources.
Larger systems of IPU chips can be built by connecting the 10 inter-IPU links.
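The BSP schedule described above can be illustrated with a toy host-side simulation (ordinary Python threads standing in for tiles; the variable names are invented): all workers must reach the barrier before the exchange phase begins, so compute never overlaps exchange and there are no read/write races.

```python
# Toy BSP illustration: compute phase (local memory only), sync, exchange
# phase (write to a neighbour's mailbox), sync, then it is safe to read.
from threading import Barrier, Thread

N = 4
mailboxes = [0] * N
barrier = Barrier(N)

def worker(tile, steps=3):
    local = tile
    for _ in range(steps):
        local = local * 2 + 1               # compute: touches local state only
        barrier.wait()                      # sync: all tiles finish compute
        mailboxes[(tile + 1) % N] = local   # exchange: distinct destinations
        barrier.wait()                      # sync: all tiles finish exchange
        local += mailboxes[tile]            # safe: no writer is active now

threads = [Thread(target=worker, args=(t,)) for t in range(N)]
for t in threads: t.start()
for t in threads: t.join()
```

Because writes in the exchange phase target distinct mailboxes and reads only happen after the second barrier, the result is deterministic regardless of thread scheduling, which is the concurrency-hazard-free property the text refers to.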
16 nm, 23.6 bn transistors, ~800 mm2 die size.
1216 processor tiles.
125 TFLOPs peak FP16 arithmetic with FP32 accumulation.
300 MB total on-chip memory, distributed among processor cores, providing an
aggregate access bandwidth of 45 TBps.
All model state held on chip, there is no directly-attached DRAM.
Habana Gaudi
Habana’s Gaudi AI training processor shares similarities with contemporary
GPUs, particularly wide SIMD parallelism and HBM2 memory. The chip integrates
ten 100G Ethernet links which support remote direct memory access (RDMA). This IO capability allows large systems to be built with commodity networking
equipment, in contrast to proprietary interconnects such as Nvidia’s NVLink or OpenCAPI.
Announced June 2019.
TSMC 16 nm with CoWoS, assumed die size ~500 mm2.
Heterogeneous architecture with:
a GEMM operations engine;
8 Tensor Processing Cores (TPCs);
a shared SRAM memory (software managed and accessible via RDMA).
200W TDP for PCIe card and 300W TDP for the mezzanine card.
Unknown total on-chip memory.
Explicit memory management between chips (no coherency).
Transcendental functions: Sigmoid, Tanh, Gaussian error linear unit (GeLU).
Tensor addressing and strided access.
Unknown local memory per TPC.
4x HBM2-2000 DRAM stacks providing 32 GB at 1 TBps.
10x 100GbE interfaces are integrated on-chip, supporting RDMA over Converged Ethernet (RoCE v2).
IOs are implemented with 20x 56 Gbps PAM4 Tx/Rx SerDes and can also be configured as 20x 50 GbE. This allows up to 64 chips to be connected with non-blocking throughput.
PCIe-4 x16 host interface.
Huawei Ascend 910
Huawei’s Ascend also bears similarities to the latest GPUs, with wide SIMD
arithmetic, a 3D matrix unit comparable to Nvidia’s Tensor Cores, and an
(assumed) coherent 32 MB shared L2 on-chip cache. The chip includes
additional logic for 128 channel video decoding engines for H.264/265. In their
Hot Chips presentation, Huawei described overlapping the cube and vector
operations to obtain high efficiency, and the challenge of the memory
hierarchy: the ratio of bandwidth to compute throughput drops by 10x for L1
cache (in the core), 100x for L2 cache (shared between cores), and 2000x for external DRAM.
Announced August 2019.
456 mm2 logic die on a 7+ nm EUV process.
Copackaged with four 96 mm2 HBM2 stacks and ‘Nimbus’ IO processor chip.
32 DaVinci cores.
Peak 256 TFLOPs (32 x 4096 x 2) FP16 performance, double that for INT8.
32 MB shared on-chip SRAM (L2 cache).
Interconnect and IO:
Cores interconnected in a 6 x 4 2D mesh packet-switched network, providing
128 GBps bidirectional bandwidth per core.
4 TBps access to L2 cache.
1.2 TBps HBM2 access bandwidth.
3x 30 GBps inter-chip IOs.
2x 25 GBps RoCE networking interfaces.
Each DaVinci core:
3D 16x16x16 matrix multiply unit providing 4,096 FP16 MACs and 8,192 INT8 MACs.
2,048 bit SIMD vector operations for FP32 (x64), FP16 (x128) and INT8 (x256).
Support for scalar operations.
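The 16x16x16 cube primitive fuses a full 16x16 by 16x16 multiply-accumulate (4,096 MACs) into one operation, and larger matrix multiplies are tiled into these ops. A functional sketch (a tiny tile size is used here purely for readability; the hardware's is 16):

```python
T = 2  # tile edge; the Ascend cube uses T = 16 (16x16x16 = 4,096 MACs)

def cube_op(c, a, b):
    """c (TxT) += a (TxT) @ b (TxT): the unit the cube retires at once."""
    for i in range(T):
        for j in range(T):
            c[i][j] += sum(a[i][k] * b[k][j] for k in range(T))

def tiled_matmul(a, b, n):
    """n x n matmul (n a multiple of T) decomposed into cube ops."""
    c = [[0] * n for _ in range(n)]
    for bi in range(0, n, T):
        for bj in range(0, n, T):
            for bk in range(0, n, T):           # accumulate over the k tiles
                ca = [row[bk:bk + T] for row in a[bi:bi + T]]
                cb = [row[bj:bj + T] for row in b[bk:bk + T]]
                cc = [[0] * T for _ in range(T)]
                cube_op(cc, ca, cb)
                for i in range(T):
                    for j in range(T):
                        c[bi + i][bj + j] += cc[i][j]
    return c
```

Each cube op touches 2·T² input values but performs T³ MACs, so the larger the cube, the more arithmetic per byte fetched; this is the same intensity argument behind the memory-hierarchy ratios above.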
Intel NNP-T
This chip is Intel’s second attempt at an accelerator for machine learning,
following the Xeon Phi. Like the Habana Gaudi chip, it
integrates a small number of wide vector cores, with HBM2 integrated memory and
similar 100 Gbit IO links.
27 bn transistors.
688 mm2 die on TSMC 16FF+ with CoWoS packaging.
32 GB HBM2-2400 in four 8 GB stacks integrated on a 1200 mm2 passive silicon interposer.
60 MB on-chip SRAM memory distributed among cores and ECC protected.
Up to 1.1 GHz core clock.
24 Tensor Processing Cluster (TPC) cores.
TPCs connected in a 2D mesh network topology.
Separate networks for different types of data: control, memory and inter-chip communication.
Support for multicast.
119 TOPs peak performance.
1.22 TBps HBM2 bandwidth.
64 lanes of SerDes with peak 3.58 Tbps aggregate bandwidth (28 Gbps each direction in each lane) for inter-chip IOs.
x16 PCIe-4 host interface (also supports OAM, Open Compute).
Nvidia Volta
128 KB L1 data cache/shared memory and four 16K 32-bit registers per SM.
32 GB HBM2 DRAM, at 900 GBps bandwidth.
NVLink 2.0 at 300 GBps.
Nvidia Turing
Turing is an architectural revision of Volta, manufactured on the same TSMC
12 nm FFN process, but with fewer CUDA and Tensor cores. It consequently has a smaller
die size and lower power envelope. Apart from ML tasks, it is designed to
perform real-time ray tracing, for which it also uses the Tensor Cores.
Announced September 2018.
TSMC 12nm FFN, 754 mm2 die, 18.6 bn transistors.
260 W TDP.
72 SMs, each containing 64 FP32 cores, 64 INT32 cores and 8 Tensor cores
(4,608 FP32 cores, 4,608 INT32 cores and 576 TCs in total).