CHINA HAS ALREADY REACHED EXASCALE – ON TWO SEPARATE SYSTEMS

Nicole Hemsoth

Native CPU and accelerator architectures that have been in play on China’s previous large systems have been stepped up to make China first to exascale on two fronts. 

The National Supercomputing Center in Wuxi is set to unveil some striking news based on quantum simulation results on a forthcoming homegrown Sunway supercomputer.

The news is notable not just for the calculations, but also for the possible architecture and sheer scale of the new machine. And of course, all of this is notable because the United States and China are in a global semiconductor arms race, and that changes the nature of how we traditionally compare global supercomputing might. We have been contemplating China’s long road to datacenter compute independence, of which HPC is but one workload, and these are some big steps.

The supercomputing community has long been used to public results on the Top 500 list of the world’s most powerful systems, with countries actively vying for supremacy. However, with tensions at a peak and the entity list haunting the spirit of international competition, we can expect China to remain mum about some dramatic system leaps, including the fact that the country already broke the (true/LINPACK) exascale barrier in 2021, on more than one machine.

We have it on outstanding authority (under condition of anonymity) that LINPACK was run in March 2021 on the Sunway “Oceanlite” system, which is the follow-on to the #4-ranked Sunway TaihuLight machine. The results yielded 1.3 exaflops peak performance and 1.05 exaflops sustained performance in the ideal 35 megawatt power sweet spot.
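To put those figures in perspective, here is a quick back-of-the-envelope sketch in Python. It reads the sustained figure as 1.05 exaflops and assumes the 35 megawatts applies to the full LINPACK run; the function names are ours, purely for illustration, and nothing here comes from Wuxi.

```python
# Reported figures for the Sunway "Oceanlite" LINPACK run (from our source).
PEAK_EXAFLOPS = 1.3        # theoretical peak
SUSTAINED_EXAFLOPS = 1.05  # sustained, read here as exaflops
POWER_MEGAWATTS = 35       # assumed to cover the whole run


def linpack_efficiency(sustained_exaflops: float, peak_exaflops: float) -> float:
    """Fraction of theoretical peak actually delivered on LINPACK."""
    return sustained_exaflops / peak_exaflops


def gigaflops_per_watt(sustained_exaflops: float, power_megawatts: float) -> float:
    """Energy efficiency in gigaflops per watt."""
    flops = sustained_exaflops * 1e18   # exaflops -> flops
    watts = power_megawatts * 1e6       # megawatts -> watts
    return flops / watts / 1e9          # flops per watt -> gigaflops per watt


if __name__ == "__main__":
    eff = linpack_efficiency(SUSTAINED_EXAFLOPS, PEAK_EXAFLOPS)
    gfw = gigaflops_per_watt(SUSTAINED_EXAFLOPS, POWER_MEGAWATTS)
    print(f"LINPACK efficiency: {eff:.1%}")
    print(f"Energy efficiency:  {gfw:.1f} gigaflops per watt")
```

Under those assumptions, the machine would be delivering roughly 81 percent of peak at about 30 gigaflops per watt.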

We’ve already published what little we knew about the Sunway Oceanlite architecture. Earlier this year, and still now in the absence of verified system information, our conjecture was that this new machine was a die shrink allowing 2X the elements and 2X the performance per socket, and that with a doubling of sockets (and other engineering, of course) Wuxi could create an exascale system. Clearly, Wuxi has.

Wuxi is using the new machine’s 42 million cores for sustained exascale supercomputing in full-scale quantum simulation production, which we learned today via a preview ahead of the annual Supercomputing Conference (SC21). The TaihuLight follow-on is capable of running a quantum simulation that can be parallelized across the entire machine. This simulation also bodes well for AI/ML training and inference workloads, as it highlights extensive use of mixed-precision math, including 16-bit floating point performance of a reported 4.4 exaflops.

Without delving into all the quantum details, the Wuxi team, along with collaborators at Tsinghua University and the Shanghai Research Center for Quantum Sciences, has developed a tensor-based simulator for random quantum circuits that is optimized for compute density and can “reduce the simulation sampling time of Google Sycamore to 304 seconds from the previously claimed 10,000 years.” This is just a preview abstract and there aren’t a lot of details on this result, but it’s worth mentioning to tee up what we find out in mid-November when a paper is released detailing the simulation.
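For a sense of what that claim implies, the arithmetic is simple enough to sketch in Python. This assumes a 365.25-day year and takes both quoted numbers at face value; the constant and function names are ours, not anything from the abstract.

```python
# Figures quoted in the preview abstract.
CLAIMED_YEARS = 10_000     # previously claimed classical simulation time
REPORTED_SECONDS = 304     # sampling time reported by the Wuxi team

# Assumption: a 365.25-day year; the abstract does not say which year length is meant.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def speedup(claimed_years: float, reported_seconds: float) -> float:
    """Ratio of the previously claimed time to the newly reported time."""
    return claimed_years * SECONDS_PER_YEAR / reported_seconds


if __name__ == "__main__":
    print(f"Implied speedup: {speedup(CLAIMED_YEARS, REPORTED_SECONDS):.2e}x")
```

That works out to a reduction of roughly nine orders of magnitude, which is why the forthcoming paper will deserve a careful read.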

But let’s get back to fully benchmarked (LINPACK) exascale systems in China. The same authority confirmed that a second exascale run in China, this time on the Tianhe-3 system, which we previewed back in May 2019, reached almost identical performance, with 1.3 exaflops peak and enough sustained to be functional exascale. We do not have a power figure for this, but we were able to confirm this machine is based on the FeiTeng line of processors from Phytium, which is Arm-based with a matrix accelerator. (For clarity, FeiTeng is kind of like “Xeon”: it’s a brand of CPUs from Phytium.)

This is not a new architecture. Here’s the analysis from 2015 when we first got wind of Phytium’s HPC ambitions, and here is a follow-on deep dive into the “Mars” 64-core FT-2000/64 architecture. The “Mars” processor was always intended for use in China’s supercomputers but, of course, has had to evolve with the times. The matrix engine that adds the real “oomph” to these devices is still based on an updated variant of the Matrix 2000 DSP accelerator we saw in Tianhe-2A (another top supercomputer of its day), which is called the Matrix-2000+ accelerator. The whole software stack for Tianhe-2A took major footwork to tune to the DSP. It was never likely that the National University of Defense Technology would throw away all of that effort on an architecture that performed quite well, especially on LINPACK.

Recall that the emergence of Phytium and of the Matrix 2000 DSP accelerators for the Tianhe-2A system came about because China couldn’t use Intel Xeon Phi many-core processors as planned, due to trade restrictions at the time.

From what we can tell on these two exascal…...more here

Copyright 2021 Hiram's 1555 Blog
