Sunway TaihuLight supercomputer makes its appearance



Jack Dongarra
The Sunway TaihuLight system [1] is impressive, with over 10 million cores and a peak performance of just over 125 Pflop/s. The Top500 list ranks the Sunway TaihuLight as the fastest computer. It is almost three times (2.75 times) as fast and three times as efficient as the system it displaces from the number one spot. In fact, the combined performance of the computers in positions 2 through 7 is just barely greater than the performance of the Sunway system. The HPL benchmark result of 93 Pflop/s, or 74% of theoretical peak performance, is impressive, and the efficiency of 6 Gflops per watt is the highest of all the top computers. The HPCG benchmark result, at only 0.3% of peak performance, shows the weakness of the Sunway TaihuLight architecture: slow memory and modest interconnect performance. The ratio of floating-point operations to bytes of data moved from memory on the SW26010 is 22.4 Flops(DP)/Byte, which shows an imbalance, that is, an overcapacity of floating-point operations relative to data transfer from memory. By comparison, the Intel Knights Landing processor has a ratio of 7.2 Flops(DP)/Byte. For many 'real' applications, therefore, the performance on the TaihuLight will be nowhere near the peak rate. The primary memory of the system is also on the low side at 1.3 PB (Tianhe-2 has 1.4 PB, and the Cray Titan system at Oak Ridge National Laboratory has 0.71 PB).
The Sunway TaihuLight system, based on a homegrown processor, demonstrates the significant progress that China has made in designing and manufacturing large-scale computation systems. A compute node of this machine is based on one manycore processor chip, the SW26010. Each processor is composed of four Management Processing Elements (MPEs), four clusters of Computing Processing Elements (CPEs) with 64 CPEs per cluster (a total of 260 cores), four memory controllers (MCs), and a Network on Chip connected to the System Interface. Each group of one MPE, one CPE cluster, and one MC has access to 8 GB of DDR3 memory (see Fig. 1). There are 40 960 nodes in the complete system.
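The headline figures above can be cross-checked with a little arithmetic. The following is a minimal sketch, not part of the original report: the constant and function names are illustrative, the inputs are the rounded values quoted in the text, decimal units are assumed (1 PB = 10^6 GB), and 32 GB per node is assumed from reading the text as four 8 GB DDR3 banks per processor. The per-node memory bandwidth and the HPL power draw are quantities implied by those inputs rather than figures stated in the text.

```python
# Rounded figures quoted in the text.
PEAK_PFLOPS = 125.0       # theoretical peak, Pflop/s ("just over 125")
HPL_PFLOPS = 93.0         # HPL benchmark result, Pflop/s
NODES = 40_960            # nodes in the complete system
CORES_PER_NODE = 260      # cores per SW26010 processor
MEM_PER_NODE_GB = 4 * 8   # assumption: four 8 GB DDR3 banks per node
FLOPS_PER_BYTE = 22.4     # DP flops per byte moved from memory
GFLOPS_PER_WATT = 6.0     # HPL energy efficiency


def derived_figures():
    """Return quantities implied by the quoted figures."""
    node_peak_flops = PEAK_PFLOPS * 1e15 / NODES
    return {
        "total_cores": NODES * CORES_PER_NODE,
        # Decimal units assumed: 1 PB = 1e6 GB.
        "total_memory_pb": NODES * MEM_PER_NODE_GB / 1e6,
        "hpl_efficiency": HPL_PFLOPS / PEAK_PFLOPS,
        "node_peak_tflops": node_peak_flops / 1e12,
        # Memory bandwidth per node implied by the flops-per-byte ratio, GB/s.
        "node_mem_bandwidth_gbs": node_peak_flops / FLOPS_PER_BYTE / 1e9,
        # Power implied by the HPL rate (in Gflop/s) and Gflops per watt, MW.
        "hpl_power_mw": HPL_PFLOPS * 1e6 / GFLOPS_PER_WATT / 1e6,
    }


if __name__ == "__main__":
    for name, value in derived_figures().items():
        print(f"{name}: {value:,.3f}")
```

Under these assumptions the results agree with the text: roughly 10.6 million cores, about 1.3 PB of memory, and an HPL efficiency of about 74%. Because the inputs are rounded, the implied bandwidth and power values are only approximate.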
The Sunway TaihuLight system uses Sunway Raise OS 2.0.5, which is based on Linux, as its operating system. The basic software stack for the manycore processor includes basic compiler components, such as C/C++ and Fortran compilers, an automatic vectorization tool, and basic math libraries. There is also Sunway OpenACC, a customized parallel compilation tool that supports OpenACC 2.0 syntax and targets the SW26010 manycore processor.
The Gordon Bell Prize [2] is awarded each year to recognize outstanding achievement in high-performance computing. The purpose of the award is to track the progress of parallel computing over time, with particular emphasis on rewarding innovation in applying high-performance computing to applications in science, engineering, and large-scale data analytics. Financial support for the $10 000 award is provided by Gordon Bell, a pioneer in high-performance and parallel computing. Three of the finalist submissions for the Gordon Bell Prize at SC16 are based on the new Sunway TaihuLight system. These three applications are as follows: (1) a fully implicit non-hydrostatic dynamic solver for cloud-resolving atmospheric simulation; (2) a highly effective global surface wave numerical simulation with ultrahigh resolution; and (3) a large-scale phase-field simulation for coarsening dynamics based on the Cahn-Hilliard equation with degenerated mobility. The fact that sizeable applications, including Gordon Bell contenders, are running on the system is impressive and shows that the system is capable of running real applications and is not just a 'stunt machine'.
China has made a big push into high-performance computing. In 2001, no supercomputers in China appeared on the Top500 list [3]. On the June 2016 Top500 list, China has 167 systems compared to 165 systems in the USA (see Fig. 2). China has one-third of the systems, while the number of systems in the USA has fallen to its lowest point since the Top500 list was created. No other nation has seen such rapid growth. According to the Chinese national plan for the next generation of high-performance computers, China will develop an exascale computer during the 13th Five-Year-Plan period (2016-2020). It is clear that China is on a path that will take it to an exascale computer by 2020, well ahead of the US plans for reaching exascale by 2023.

Figure 1. Basic layout of a node, the SW26010.