
Mysterious NVIDIA ‘GPU-N’ Could Be Next-Gen Hopper GH100 In Disguise…

NVIDIA Hopper GPUs Featuring MCM Technology Rumored To Tape Out Soon

A mysterious NVIDIA GPU known as GPU-N, which could be our first look at the next-gen Hopper GH100 chip, has been revealed in a new research paper published by the green team (as discovered by Twitter user Redfire).

NVIDIA Research Paper Talks ‘GPU-N’ With MCM Design & 8576 Cores, Could This Be Next-Gen Hopper GH100?

The research paper ‘GPU Domain Specialization via Composable On-Package Architecture’ presents a next-generation GPU design as the most practical solution for maximizing low-precision matrix math throughput to boost Deep Learning performance. The ‘GPU-N’ and its respective COPA designs are discussed along with their possible specifications and simulated performance results.

The ‘GPU-N’ is said to feature 134 SM units (vs 108 SM units on the A100). This makes up a total of 8576 cores, a 24% increase over the current Ampere A100 solution. The chip has been modeled at 1.4 GHz, the same theoretical clock speed as the Ampere A100 and Volta V100 (not to be confused with the final clocks). Other specifications include a 60 MB L2 cache, a 50% increase over the Ampere A100, and a DRAM bandwidth of 2.68 TB/s that can scale up to 6.3 TB/s. The HBM2e DRAM capacity is 100 GB and can be expanded up to 233 GB with the COPA implementations. It is configured around a 6144-bit bus interface at pin speeds of 3.5 Gbps.
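The headline figures above can be cross-checked with some simple arithmetic. The sketch below is only an illustration: the function names are made up for this example, the 64 FP32 cores per SM is an assumption carried over from the A100, and the A100 SM count of 108 is taken from the comparison table in this article.

```python
# Assumption: 64 FP32 CUDA cores per SM, as on the Ampere A100.
CORES_PER_SM = 64


def cuda_cores(sm_count: int, cores_per_sm: int = CORES_PER_SM) -> int:
    """Total FP32 cores for a given number of SMs."""
    return sm_count * cores_per_sm


def dram_bandwidth_gbs(bus_width_bits: int, pin_rate_gbps: float) -> float:
    """Peak DRAM bandwidth in GB/s: bus width (bits) x per-pin rate (Gbps) / 8."""
    return bus_width_bits * pin_rate_gbps / 8


gpu_n_cores = cuda_cores(134)   # GPU-N: 134 SMs
a100_cores = cuda_cores(108)    # A100: 108 SMs
core_gain_pct = (gpu_n_cores / a100_cores - 1) * 100
bandwidth = dram_bandwidth_gbs(6144, 3.5)  # 6144-bit bus at 3.5 Gbps per pin

print(f"GPU-N cores: {gpu_n_cores}")
print(f"Gain over A100: {core_gain_pct:.1f}%")
print(f"Peak DRAM bandwidth: {bandwidth:.0f} GB/s")
```

Under these assumptions the numbers line up with the quoted 8576 cores, the roughly 24% core increase, and a bandwidth of about 2.68 TB/s.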

Configuration | NVIDIA V100 | NVIDIA A100 | GPU-N
SMs | 80 | 108 | 134
GPU frequency (GHz) | 1.4 | 1.4 | 1.4
FP32 (TFLOPS) | 15.7 | 19.5 | 24.2
FP16 (TFLOPS) | 125 | 312 | 779
L2 cache (MB) | 6 | 40 | 60
DRAM BW (GB/s) | 900 | 1,555 | 2,687
DRAM Capacity (GB) | 16 | 40 | 100

Coming to the performance numbers, the ‘GPU-N’ (presumably Hopper GH100) produces 24.2 TFLOPs of FP32 (a 24% increase over A100) and 779 TFLOPs of FP16 (a 2.5x increase over A100), which sounds really close to the 3x gains that were rumored for GH100 over A100. Compared to AMD’s CDNA 2 ‘Aldebaran’ GPU on the Instinct MI250X accelerator, the FP32 performance is less than half (95.7 TFLOPs vs 24.2 TFLOPs) but the FP16 performance is 2.15x higher.
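For readers who want to verify the generational gains quoted above, here is a minimal sketch using only the TFLOPs figures from this article. The dictionary layout and variable names are illustrative choices for this example, not anything taken from the paper.

```python
# TFLOPs figures as quoted in this article.
a100 = {"fp32": 19.5, "fp16": 312.0}
gpu_n = {"fp32": 24.2, "fp16": 779.0}
mi250x_fp32 = 95.7

# Generational gain of GPU-N over A100, expressed as a multiplier.
fp32_gain = gpu_n["fp32"] / a100["fp32"]
fp16_gain = gpu_n["fp16"] / a100["fp16"]

# GPU-N FP32 throughput as a fraction of the MI250X figure.
fp32_vs_mi250x = gpu_n["fp32"] / mi250x_fp32

print(f"FP32 gain over A100: {fp32_gain:.2f}x")
print(f"FP16 gain over A100: {fp16_gain:.2f}x")
print(f"FP32 relative to MI250X: {fp32_vs_mi250x:.2f}x")
```

The FP32 multiplier comes out at roughly 1.24x and the FP16 multiplier at roughly 2.5x, while the FP32 figure sits at about a quarter of the MI250X number, which is consistent with the "less than half" comparison.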

From previous information, we know that NVIDIA’s H100 accelerator would be based on an MCM solution and use TSMC’s 5nm process node. Hopper is supposed to have two next-gen GPU modules, so we are looking at 288 SM units in total. We can't give a rundown on the core count yet since we don’t know the number of cores featured in each SM, but if it sticks to 64 cores per SM, then we get 18,432 cores, which is 2.25x the count of the full GA100 GPU configuration. NVIDIA could also leverage more FP64, FP16 & Tensor cores within its Hopper GPU, which would drive up performance immensely. And that's going to be a necessity to rival Intel’s Ponte Vecchio, which is expected to feature 1:1 FP64.
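The speculative core count works out as follows. This is a rough sketch resting on several assumptions: two modules of 144 SMs each (matching the rumored 288 SM total), 64 FP32 cores per SM as on Ampere, and a full GA100 die with 128 SMs. None of these Hopper numbers are confirmed, and the constant names are invented for this example.

```python
# All Hopper values below are rumored or assumed, not confirmed specifications.
MODULES = 2              # rumored two-module MCM design
SMS_PER_MODULE = 144     # 288 SMs in total across both modules
CORES_PER_SM = 64        # assumption: same as Ampere
GA100_FULL_SMS = 128     # full GA100 die (the A100 product enables 108)

total_sms = MODULES * SMS_PER_MODULE
hopper_cores = total_sms * CORES_PER_SM
ga100_full_cores = GA100_FULL_SMS * CORES_PER_SM
ratio = hopper_cores / ga100_full_cores

print(f"Total SMs: {total_sms}")
print(f"Hypothetical FP32 cores: {hopper_cores}")
print(f"Ratio vs full GA100: {ratio:.2f}x")
```

If NVIDIA changes the per-SM core count for Hopper, only `CORES_PER_SM` needs adjusting to see the effect on the total.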

It is likely that the final configuration will come with 134 of the 144 SM units enabled on each GPU module and, as such, GPU-N is likely a single GH100 die in action. But it is unlikely that NVIDIA would reach the same FP32 or FP64 FLOPs as the MI200 without using GPU Sparsity.

But NVIDIA may have a secret weapon up its sleeve, and that would be the COPA-based GPU implementation of Hopper. NVIDIA talks about two Domain-Specialized COPA-GPUs based on the next-generation architecture, one for the HPC segment and one for the DL segment. The HPC variant features a very standard approach which consists of an MCM GPU design and the respective HBM/MC+HBM (IO) chiplets, but the DL variant is where things start to get interesting. The DL variant houses a huge cache on an entirely separate die that is interconnected with the GPU modules.

Architecture Configuration | LLC Capacity (MB) | DRAM BW (TB/s) | DRAM Capacity (GB)
GPU-N | 60 | 2.7 | 100
COPA-GPU-1 | 960 | 2.7 | 100
COPA-GPU-2 | 960 | 4.5 | 167
COPA-GPU-3 | 1,920 | 2.7 | 100
COPA-GPU-4 | 1,920 | 4.5 | 167
COPA-GPU-5 | 1,920 | 6.3 | 233
Ideal L2 | infinite | infinite | infinite

Several variants have been outlined with up to 960 / 1920 MB of LLC (Last-Level-Cache), HBM2e DRAM capacities of up to 233 GB, and bandwidth of up to 6.3 TB/s. These are all theoretical, but given that NVIDIA has discussed them now, we may see a Hopper variant with such a design during the full unveil at GTC 2022.
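To put the COPA variants in perspective, the sketch below expresses each configuration from the table as a multiple of the baseline GPU-N. The `MemoryConfig` class and `scale_vs_baseline` helper are hypothetical names made up for this illustration; the values are the ones listed in the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryConfig:
    name: str
    llc_mb: int          # last-level cache capacity in MB
    dram_bw_tbs: float   # DRAM bandwidth in TB/s
    dram_gb: int         # DRAM capacity in GB


BASELINE = MemoryConfig("GPU-N", 60, 2.7, 100)

COPA_CONFIGS = [
    MemoryConfig("COPA-GPU-1", 960, 2.7, 100),
    MemoryConfig("COPA-GPU-2", 960, 4.5, 167),
    MemoryConfig("COPA-GPU-3", 1920, 2.7, 100),
    MemoryConfig("COPA-GPU-4", 1920, 4.5, 167),
    MemoryConfig("COPA-GPU-5", 1920, 6.3, 233),
]


def scale_vs_baseline(cfg: MemoryConfig, base: MemoryConfig = BASELINE) -> dict:
    """Return each resource of cfg as a multiple of the baseline configuration."""
    return {
        "llc": cfg.llc_mb / base.llc_mb,
        "dram_bw": cfg.dram_bw_tbs / base.dram_bw_tbs,
        "dram_capacity": cfg.dram_gb / base.dram_gb,
    }


scales = {cfg.name: scale_vs_baseline(cfg) for cfg in COPA_CONFIGS}
for name, s in scales.items():
    print(f"{name}: LLC x{s['llc']:.0f}, "
          f"DRAM BW x{s['dram_bw']:.2f}, capacity x{s['dram_capacity']:.2f}")
```

Seen this way, the largest variant pairs a 32x bigger last-level cache with roughly 2.3x the DRAM bandwidth and capacity of the baseline GPU-N.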

NVIDIA Hopper GH100 ‘Preliminary Specs’:

NVIDIA Tesla Graphics Card | Tesla K40 | Tesla M40 | Tesla P100 | Tesla P100 (SXM2) | Tesla V100 (SXM2) | NVIDIA A100 (SXM4) | NVIDIA H100 (SXM4?)
GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) | GP100 (Pascal) | GV100 (Volta) | GA100 (Ampere) | GH100 (Hopper)
Process Node | 28nm | 28nm | 16nm | 16nm | 12nm | 7nm | 5nm
Transistors | 7.1 Billion | 8 Billion | 15.3 Billion | 15.3 Billion | 21.1 Billion | 54.2 Billion | TBD
GPU Die Size | 551 mm2 | 601 mm2 | 610 mm2 | 610 mm2 | 815 mm2 | 826 mm2 | TBD
SMs | 15 | 24 | 56 | 56 | 80 | 108 | 134 (Per Module)
TPCs | 15 | 24 | 28 | 28 | 40 | 54 | TBD
FP32 CUDA Cores Per SM | 192 | 128 | 64 | 64 | 64 | 64 | 64?
FP64 CUDA Cores / SM | 64 | 4 | 32 | 32 | 32 | 32 | 32?
FP32 CUDA Cores | 2880 | 3072 | 3584 | 3584 | 5120 | 6912 | 8576 (Per Module); 17152 (Complete)
FP64 CUDA Cores | 960 | 96 | 1792 | 1792 | 2560 | 3456 | 4288 (Per Module)?; 8576 (Complete)?
Tensor Cores | N/A | N/A | N/A | N/A | 640 | 432 | TBD
Texture Units | 240 | 192 | 224 | 224 | 320 | 432 | TBD
Boost Clock | 875 MHz | 1114 MHz | 1329 MHz | 1480 MHz | 1530 MHz | 1410 MHz | ~1400 MHz
TOPs (DNN/AI) | N/A | N/A | N/A | N/A | 125 TOPs | 1248 TOPs; 2496 TOPs with Sparsity |
FP16 Compute | N/A | N/A | 18.7 TFLOPs | 21.2 TFLOPs | 30.4 TFLOPs | 312 TFLOPs; 624 TFLOPs with Sparsity | 779 TFLOPs (Per Module)?; 1558 TFLOPs with Sparsity (Per Module)?
FP32 Compute | 5.04 TFLOPs | 6.8 TFLOPs | 10.0 TFLOPs | 10.6 TFLOPs | 15.7 TFLOPs | 19.4 TFLOPs; 156 TFLOPs With Sparsity | 24.2 TFLOPs (Per Module)?; 193.6 TFLOPs With Sparsity?
FP64 Compute | 1.68 TFLOPs | 0.2 TFLOPs | 4.7 TFLOPs | 5.30 TFLOPs | 7.80 TFLOPs | 19.5 TFLOPs (9.7 TFLOPs standard) | 24.2 TFLOPs (Per Module)? (12.1 TFLOPs standard)?
Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 6144-bit HBM2e | 6144-bit HBM2e
Memory Size | 12 GB GDDR5 @ 288 GB/s | 24 GB GDDR5 @ 288 GB/s | 16 GB HBM2 @ 732 GB/s; 12 GB HBM2 @ 549 GB/s | 16 GB HBM2 @ 732 GB/s | 16 GB HBM2 @ 900 GB/s | Up To 40 GB HBM2 @ 1.6 TB/s; Up To 80 GB HBM2 @ 1.6 TB/s | Up To 100 GB HBM2e @ 3.5 Gbps
L2 Cache Size | 1536 KB | 3072 KB | 4096 KB | 4096 KB | 6144 KB | 40960 KB | 81920 KB
TDP | 235W | 250W | 250W | 300W | 300W | 400W | ~450-500W

The post Mysterious NVIDIA ‘GPU-N’ Could Be Next-Gen Hopper GH100 In Disguise With 134 SMs, 8576 Cores & 2.68 TB/s Bandwidth, Simulated Performance Benchmarks Shown by Hassan Mujtaba appeared first on Wccftech.