TH Express-2 reaches new heights for supercomputer interconnects

Modern supercomputers depend heavily on parallel processing to boost their performance, and thus are all effectively parallel computers. The basic architecture of a parallel computer is composed of two parts: a single-processor architecture and a communication architecture [1]. The latter is mainly implemented and supported by the interconnection network. During the design of an interconnection network, the following aspects must be considered carefully:
(i) The communication requirements of the system, and the type of communication abstraction that will be delivered to the programmer.
(ii) The performance requirements of the network, in terms of both point-to-point and collective communication, to ensure that the overall system is balanced with respect to communication speed and computation speed.
(iii) The interconnection topology, which should reduce the average distance between nodes (and thus the communication latency) while maintaining a moderate number of links (enabling high communication bandwidth without making the implementation overly complex).
(iv) The communication protocol, including the routing algorithm (how to deliver messages from one node to another) and the flow control mechanism (how to handle congestion when it occurs in network traffic).
(v) The implementation of the network with appropriate building blocks, including host interfaces, switches, routers, and electrical or optical cables.

The design of the interconnection network of the Tianhe-2 supercomputer is particularly challenging because of the enormous scale and leading-edge performance of the entire system [2]. Each compute node in Tianhe-2 has a peak performance of 3.4 teraflops, so a very high-speed link is required to connect the node to the network; otherwise, the compute node will be starved of data. The current Tianhe-2 system uses 16 000 compute nodes to achieve a peak performance of 54.9 petaflops, and the future system will be larger still when its performance is upgraded to 100 petaflops. The interconnection network must therefore scale to more than 10 000 nodes, and must support high-bandwidth, low-latency communication among them.
A proprietary interconnection network called TH Express-2 was designed for Tianhe-2 to address the considerations and challenges described above. TH Express-2 uses the well-known fat-tree topology to attain the highest possible bisection bandwidth, that is, the aggregate bandwidth available when half of the system's nodes send messages to the other half.
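To give a rough sense of why the fat-tree topology is attractive here, the sketch below computes the host count and full-bisection bandwidth of a textbook three-level fat-tree built from k-port switches. The radix (24 ports) and link rate (112 Gbps) in the example are simply the figures quoted later in this article for the TH Express-2 router and network interface; they are illustrative inputs, not the actual topology parameters of TH Express-2.

```python
def fat_tree_hosts(k):
    """Hosts supported by a classic 3-level folded-Clos fat-tree
    built entirely from k-port switches: k**3 / 4."""
    return k ** 3 // 4

def bisection_bandwidth_gbps(k, link_gbps):
    """In a full-bisection fat-tree, half of the hosts can send to
    the other half at the full link rate simultaneously."""
    return (fat_tree_hosts(k) // 2) * link_gbps

# Illustrative numbers only: 24-port switches and 112 Gbps host links.
print(fat_tree_hosts(24))                 # 3456 hosts
print(bisection_bandwidth_gbps(24, 112))  # 193536 Gbps
```

The key property is that bandwidth grows in proportion to host count: doubling the machine does not halve the per-node share of the bisection, which is what keeps communication and computation balanced at scale.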
A previous study of the message types in interconnection network traffic indicated that most of the messages sent are short, while most of the data are carried in large messages [1]. To support both types efficiently, TH Express-2 implements the mini-packet to transmit short messages with low latency, and uses remote direct memory access (RDMA) to transfer large blocks of data with high bandwidth. TH Express-2 also supports collective optimization for synchronization and global communications.
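The two-path design above can be sketched as a simple size-based dispatch. The 128-byte threshold here is hypothetical; the article does not state the actual mini-packet cutoff used by TH Express-2.

```python
# Hypothetical cutoff -- the actual TH Express-2 threshold is not given.
MINI_PACKET_MAX_BYTES = 128

def choose_transport(size_bytes):
    """Short messages take the low-latency mini-packet path;
    bulk data moves via high-bandwidth RDMA."""
    return "mini-packet" if size_bytes <= MINI_PACKET_MAX_BYTES else "rdma"

print(choose_transport(64))       # mini-packet
print(choose_transport(1 << 20))  # rdma
```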
At the application level, in addition to the standard Message Passing Interface, TH Express-2 provides Galaxy Express-2, a user-level communication infrastructure that fully exploits the hardware's communication capability and avoids operating-system interference. At a lower level, TH Express-2 supports both deterministic and adaptive routing to balance the network traffic. Reliable link-level transmission is implemented using credit-based flow control, a link-level cyclic redundancy check, and packet retransmission.
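The credit-based flow control mentioned above can be sketched generically: the sender holds one credit per free buffer slot at the receiver and stalls when its credits are exhausted, so receiver buffers can never overflow. This is a toy model of the general technique, not the actual TH Express-2 hardware logic.

```python
class CreditLink:
    """Toy model of credit-based link-level flow control
    (illustrative; not the TH Express-2 implementation)."""

    def __init__(self, receiver_buffer_slots):
        self.credits = receiver_buffer_slots  # one credit per free slot
        self.in_flight = []

    def try_send(self, packet):
        if self.credits == 0:
            return False              # stall: receiver buffers are full
        self.credits -= 1             # each packet consumes one credit
        self.in_flight.append(packet)
        return True

    def on_credit_return(self, n=1):
        # The receiver drained n packets and returned n credits.
        self.credits += n

link = CreditLink(receiver_buffer_slots=2)
assert link.try_send("p0") and link.try_send("p1")
assert not link.try_send("p2")   # blocked until credits come back
link.on_credit_return()
assert link.try_send("p2")
```

Because a packet is sent only when a buffer slot is guaranteed, losses on a healthy link come only from bit errors, which the cyclic redundancy check detects and packet retransmission repairs.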
Two very-large-scale integrated circuits, a network interface chip and a router chip, were designed by the National University of Defense Technology team to implement the TH Express-2 network. The network interface chip connects the Peripheral Component Interconnect Express (PCIe) 3.0 interface of each compute node to the network using eight serializer/deserializer (SERDES) lanes, each operating at 14 Gbps (112 Gbps in total). The router chips perform all message switching and forwarding, and interconnect with each other to form the backbone of the interconnection network. Each router chip switches among 24 network ports, with a total throughput of 5376 Gbps.
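The quoted figures are mutually consistent, as a quick check shows. Note that the full-duplex reading of the per-port rate (112 Gbps each way) is an inference from the arithmetic, not something stated explicitly in the text.

```python
# NIC: eight SERDES lanes at 14 Gbps each
nic_gbps = 8 * 14            # 112 Gbps, matching the stated total

# Router: 5376 Gbps aggregate throughput across 24 ports
per_port_gbps = 5376 / 24    # 224 Gbps per port,
                             # i.e. 112 Gbps in each direction (full duplex)

print(nic_gbps, per_port_gbps)
```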

RESEARCH HIGHLIGHT
A successful high-end interconnect depends not only on technical innovation, but also on effective integration and implementation of sophisticated technologies and processes. TH Express-2 offers its own innovations, including improved logic design for user-level operation to reduce point-to-point latency, reliable end-to-end communication to simplify scalable message-passing services, and network interface controller (NIC)-assisted collective operations to overlap communication with computation. The outstanding performance of the system (as shown in Table 1) is mainly a result of excellent engineering implementation. One meaningful lesson learned from the practical aspects of TH Express-2 is that it is highly important to strike a good balance between innovative new technology and technologies that have proven effective.

[Table 1. Main features of typical interconnection networks deployed in Top 500 supercomputers: TH Express-2, Tofu of the K Computer [4], Cray Gemini [5], Cray Aries [5], and InfiniBand EDR.]

To date, the sustained performance of Tianhe-2 has been ranked No. 1 in the worldwide Top 500 supercomputer list (http://www.top500.org/) for almost 3 years, and this performance is inseparable from that of TH Express-2. Table 1 lists the main features of some typical interconnection networks that have been deployed in Top 500 supercomputers; TH Express-2 clearly has its own performance advantages. After the target of 100 petaflops is reached in 2016, the next goal for supercomputers is performance in the exaflops (1000 petaflops) range. Interconnection networks will continue to play an important role in these exascale systems.