Extreme-Scale Computer Architecture

Increased transistor integration will soon allow us to build processor chips with over 1,000 cores. In constructing such a chip, energy and power consumption are the most formidable obstacles; hence, we need to design it from the ground up for energy efficiency. First, we want to operate at low voltage, since this is the point of maximum energy efficiency. In such an environment, however, we have to deal with process variation, so it is important to design techniques to tolerate it. At the architecture level, we require simple cores organized in a hierarchy of clusters. Moreover, we also need techniques to reduce the leakage of on-chip memories and to lower the voltage guardbands of logic. Finally, data movement should be minimized, through both hardware and software techniques. With a systematic approach that cuts across multiple layers of the computing stack, we can deliver much higher energy efficiencies.


I. INTRODUCTION
As transistor sizes continue to scale down, we are about to witness extraordinary levels of chip integration. Sometime early in the next decade, as we reach 7 nm, we will be able to integrate, for example, 1,000 sizable cores and substantial memory on a single die. There are many unknowns as to how to build a general-purpose architecture in such an environment. However, we know that the main challenge will be to make it highly energy efficient. Energy and power consumption have emerged as the main obstacles to designing more capable architectures.
Given this energy-efficiency challenge, researchers have coined the term Extreme-Scale Computer Architecture to refer to computer organizations that, loosely speaking, are 100-1,000 times more capable than current systems for the same power consumption and physical footprint. For example, these organizations should deliver a datacenter that provides exascale performance (10^18 operations per second) for 20 MW; a departmental server that provides petascale performance (10^15 operations per second) for 20 kW; and a portable device that provides sustained terascale performance (10^12 operations per second) for 20 W. Extreme-scale computing is concerned with technologies that are applicable to all machine sizes, not just high-end systems.
Extreme-scale computer architectures need to be designed for energy efficiency from the ground up. They need to have efficient support for concurrency, since only massive parallelism will deliver this performance. They should also minimize data transfers, since moving data around is a major source of energy consumption. Finally, they need to leverage new technologies that will be developed in the next few years. These technologies include low supply voltage (V_dd) operation, 3D die stacking, resistive memories, and photonic interconnects.
‡ This work was supported in part by the National Science Foundation under grant CCF-1012759, DARPA under PERFECT Contract Number HR0011-12-2-0019, and DOE ASCR under Award Numbers DE-FC02-10ER2599 and DE-SC0008717.
In this paper, we outline some of the challenges that appear at different layers of the computing stack, and some of the techniques that can be used to address them.

II. BACKGROUND
For several decades, the processor industry has seen steady growth in CPU performance, driven by Moore's Law [1] and classical (or Dennard) scaling [2]. Under classical scaling, the power density remains constant across semiconductor generations. Specifically, consider the dynamic power (P_dyn) consumed by a certain number of transistors that fit in a chip area A. The dynamic power is proportional to C × V_dd² × f, where C is the capacitance of the devices and f is the frequency of operation. Hence, the power density is proportional to C × V_dd² × f / A. As one moves to the next fabrication generation, the linear dimension of a device gets multiplied by a factor close to 0.7. The same is the case for V_dd and C, while f gets multiplied by 1/0.7. Moreover, the area of the transistors is now 0.7² × A. If we compute the new power density, we have 0.7C × (0.7V_dd)² × (f/0.7) / (0.7² × A). Consequently, the power density remains constant.
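This bookkeeping can be checked numerically. A minimal sketch, where the 0.7 per-generation scale factor comes from the text and the baseline values are arbitrary normalized units:

```python
# Dennard (classical) scaling: check that power density P_dyn / A stays
# constant across one generation. s = 0.7 is the linear scale factor.
s = 0.7

# Arbitrary baseline values (normalized units).
C, Vdd, f, A = 1.0, 1.0, 1.0, 1.0

density_old = (C * Vdd**2 * f) / A

# Next generation: C and Vdd scale by s, f by 1/s, area by s^2.
density_new = ((s * C) * (s * Vdd)**2 * (f / s)) / (s**2 * A)

ratio = density_new / density_old
print(f"power density ratio across one generation: {ratio:.3f}")  # ~1.0
```

The s and 1/s factors cancel exactly, which is the whole point of classical scaling: every generation packs more transistors into the same power envelope.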
Unfortunately, as the feature size decreased below 130 nm over a decade ago, classical scaling ceased to apply for two reasons. First, V_dd could not be decreased as fast as before. In fact, in recent years, it has stagnated around 1 V, mostly due to the fact that, as V_dd gets smaller and closer to the threshold voltage (V_th) of the transistor, the transistor's switching speed decreases fast. The second reason is that static power became significant. The overall result is that, under real scaling, the power density of a set of transistors increases rapidly with each generation, making it progressively harder to feed the needed power and extract the resulting heat.
In addition, the large amount of power that needs to be provided causes concerns at both ends of the computing spectrum. At the high end, data centers are faced with large energy bills while, at the low end, handheld devices are limited by the capacity of their batteries. Overall, all of these trends motivate the emergence of research on extreme-scale computing.

III. ENERGY-EFFICIENT CHIP SUBSTRATE
To realize extreme-scale computing systems, devices and circuits need to be designed to operate at low V_dd. This is because V_dd reduction is the best lever available to increase the energy efficiency of computing. V_dd reduction induces a quadratic reduction in dynamic energy, and a larger-than-linear reduction in static energy. As a result, an environment with V_dd ≈ 500 mV is much more energy efficient than one with the conventional V_dd ≈ 0.9 V. It potentially consumes 40 times less power [3], [4].
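The quadratic term alone is easy to quantify. A small sketch with the two voltages from the text; the capacitance is a normalized placeholder, and the full 40x power figure cited above additionally reflects frequency and static-power effects that this sketch does not model:

```python
# Dynamic energy per switching event: E_dyn ∝ C * Vdd^2.
# Comparing conventional (~0.9 V) and near-threshold (~0.5 V) operation.
C = 1.0  # normalized capacitance (placeholder)

def e_dyn(vdd, c=C):
    return c * vdd**2

ratio = e_dyn(0.9) / e_dyn(0.5)
print(f"dynamic energy per op is {ratio:.2f}x lower at 0.5 V")  # 3.24x
# The ~40x *power* reduction cited in the text also folds in the lower
# clock frequency (fewer switching events per second) and the reduced
# static power at low Vdd; the quadratic term above is only one factor.
```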
This substantial power reduction implies that many more cores can now be placed on a given power-constrained chip. Unfortunately, there are well-known drawbacks of low V_dd. They include a lower switching speed and a large increase in process variation, the result of V_dd being close to V_th. It is possible that researchers will find ways of delivering low-V_dd devices of acceptable speed. However, the issue of dealing with high process variation is especially challenging.

A. The Effects of Process Variation
Process variation is the deviation of the values of device parameters (such as a transistor's V_th, channel length, or channel width) from their nominal specification. Such variation causes variation in the switching speed and the static power consumption of nominally-similar devices in a chip. At the architectural level, this effect translates into cores and on-chip memories that are slower and consume more static power than they would otherwise do.
To see why, consider Figure 1. Chart (a) shows a hypothetical distribution of the latencies of dynamic logic paths in a pipeline stage. The X axis shows the latency, while the Y axis shows the number of paths with that latency. Without process variation (taller curve), the pipeline stage can cycle at a frequency 1/τ_NOM. With variation (shorter curve), some paths become faster, while others become slower. The pipeline stage's frequency is determined by the slower paths, and is now only 1/τ_VAR.
Figure 1(b) shows the effect of process variation on the static power (P_sta). The X axis of the figure shows the V_th of different transistors, and the Y axis the transistors' P_sta. The P_sta of a transistor is related to its V_th exponentially, with P_sta ∝ e^(−V_th). Due to this exponential relationship, the static power saved by high-V_th transistors is less than the extra static power consumed by low-V_th transistors. Hence, integrating over all of the transistors in the core or memory module, total P_sta goes up with variation.
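This net increase follows from the convexity of the exponential: the average of e^(−V_th) over a spread of V_th values exceeds e^(−V_th) at the mean. A Monte Carlo sketch with an illustrative leakage model; the nominal V_th, spread, and exponential slope are assumed values, not measured data:

```python
import math
import random

random.seed(0)

# Static power per transistor modeled as P_sta ∝ exp(-Vth / V_SLOPE).
# All three parameters are illustrative assumptions.
VTH_NOM, SIGMA, V_SLOPE = 0.30, 0.03, 0.026

def p_sta(vth):
    return math.exp(-vth / V_SLOPE)

# Without variation: every transistor sits at the nominal Vth.
p_nominal = p_sta(VTH_NOM)

# With variation: Vth drawn from a Gaussian around the nominal value.
samples = [random.gauss(VTH_NOM, SIGMA) for _ in range(100_000)]
p_average = sum(p_sta(v) for v in samples) / len(samples)

print(f"static power under variation: {p_average / p_nominal:.2f}x nominal")
```

The transistors that got a lower V_th leak exponentially more than the high-V_th ones save, so the chip-wide total rises even though the V_th distribution is symmetric.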
Process variation has a systematic component that exhibits spatial correlation. This means that nearby transistors will typically have similar speed and power consumption properties. Hence, due to variation within a chip, some regions of the chip will be slower than others, and some will be more leaky than others. If we need to set a single V_dd and frequency for the whole chip, we need to set them according to the slowest and leakiest neighborhoods of the chip. This conservative design is too wasteful for extreme-scale computing.

B. Multiple Voltage Domains
Low-V_dd chips will be large and heavily affected by process variation. To tolerate process variation within a chip, the most appealing idea is to have multiple V_dd and frequency domains. A domain encloses a region with similar values of variation parameters. In this environment, we want to set a domain with slow transistors to a higher V_dd, to make timing. On the other hand, we want to set a domain with fast, leaky transistors to a lower V_dd, to save energy. For this reason, extreme-scale low-V_dd chips are likely to have multiple, possibly many, V_dd and frequency domains.
However, current designs for V_dd domains are energy inefficient [5]. First, on-chip Switching Voltage Regulators (SVRs) that provide the V_dd for a domain have a high power loss, often in the 10-15% range. Wasting so much power in an efficiency-first environment is hardly acceptable. In addition, small V_dd domains are more susceptible to variations in the load offered to the power grid, because they lack the averaging effect of a whole-chip V_dd domain. These variations in the load induce V_dd droops that need to be protected against with larger V_dd guardbands [6], which is also hardly acceptable in an efficiency-first environment. Finally, conventional SVRs take up a lot of area and, therefore, including several of them on chip is unappealing.

C. What is Needed
To address these limitations, several techniques are needed. First, an extreme-scale chip needs to be designed with devices whose parameters are optimized for low-V_dd operation [7]. Simply utilizing conventional device designs can result in slow devices.
Voltage regulators need to be designed for high energy efficiency and modest area. One possible approach is to organize them in a hierarchical manner [8]. The first level of the hierarchy is composed of one or a handful of SVRs, potentially placed on a stacked die, with devices optimized for the SVR inductances. The second level is composed of many on-chip low-drop-out (LDO) voltage regulators. Each LDO is connected to one of the first-level SVRs and provides the V_dd for a core or a small number of cores. LDOs have high energy efficiency if the ratio of their output voltage (V_o) to their input voltage (V_i) is close to 1. Thanks to systematic process variation, the LDOs in a region of the chip need to provide a similar V_o to the different cores of the region. Since these LDOs take their V_i from the same first-level SVR and their V_o is similar, their efficiency can be over 90%. In addition, their area is negligible: they reuse the hardware of a power-gating circuit, which is likely to be already present in the chip to power-gate the core.
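The efficiency claim can be sketched with the standard LDO approximation (efficiency ≈ V_o / V_i, ignoring quiescent current). The SVR output and per-core target voltages below are illustrative assumptions, not values from the text:

```python
# An LDO dissipates (Vi - Vo) * I_load as heat, so its peak efficiency
# is approximately Vo / Vi (quiescent current ignored).
def ldo_efficiency(v_in, v_out):
    assert v_out <= v_in
    return v_out / v_in

# Illustrative hierarchical setup: one first-level SVR feeds 0.55 V to a
# chip region; per-core LDOs trim it to nearby values dictated by the
# (spatially correlated) variation profile of that region.
v_svr = 0.55
core_vdds = [0.50, 0.51, 0.52, 0.53]  # assumed per-core targets

for v in core_vdds:
    print(f"LDO to {v:.2f} V: efficiency {ldo_efficiency(v_svr, v):.0%}")
# All above 90%, because Vo/Vi is close to 1 for every core in the region.
```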
To minimize energy waste, the chip should have extensive power-gating support. This is important at low V_dd because leakage accounts for the largest fraction of energy consumption. Ideally, power gating should be done at fine granularities, such as groups of cache lines or groups of functional units. Fine granularities lead to high potential savings, but complicate circuit design.

IV. ENERGY-EFFICIENT ARCHITECTURE

A. Simple Organization
For highest energy efficiency, an extreme-scale architecture should be mostly composed of many simple, throughput-oriented cores, and rely on highly-parallel execution. Low-V_dd operation substantially reduces the power consumption, which can then be leveraged by increasing the number of cores that execute in parallel, as long as the application can exploit the parallelism. Such cores should avoid speculation and complex hardware structures as much as possible.
Cores should be organized in clusters. Such an organization is energy efficient because process variation has spatial correlation and, therefore, nearby cores and memories have similar variation parameter values, which can be exploited by the scheduler.
To further improve energy efficiency, a cluster typically contains a heterogeneous group of compute engines. For example, it can contain one wide superscalar core (also called a latency core) to run sequential or critical sections fast. The power delivery system should be configured such that this core can run at high V_dd in a turbo-boost manner. Moreover, some of the cores may have special instructions, such as special synchronization or transcendental operations.

B. Minimizing Energy in On-Chip Memories
A large low-V_dd chip can easily contain hundreds of Mbytes of on-chip memory. To improve memory reliability and energy efficiency, it is likely that SRAM cells will be redesigned for low V_dd [9]. In addition, to reduce leakage, such memory will likely operate at a higher V_dd than the logic. However, even accounting for this fact, the on-chip memories will incur substantial energy losses due to leakage. To reduce this waste, the chip may support power gating of sections of the memory hierarchy, e.g., individual on-chip memory modules, individual ways of a memory module, or groups of lines in a memory module. In principle, this approach is appealing because a large fraction of such a large memory is likely to contain unneeded data at any given time. Unfortunately, this approach may be too coarse-grained to make a significant impact on the total power consumed: to power gate a memory module, we need to be sure that none of the data in the module will be used soon. This situation may be rare in the general case. Instead, we need a fine-grained approach where we power on only the individual on-chip memory lines that contain data that will be accessed very soon.
To come close to this ideal scenario, we can use eDRAM rather than SRAM for the last levels of the cache hierarchy, either on- or off-chip. eDRAM has the advantage that it consumes much less leakage power than SRAM, which saves substantial energy. However, eDRAM needs to be refreshed. Fortunately, refresh is done at the fine-grained level of a cache line, and we can design intelligent refresh schemes [10], [11].
One intelligent refresh technique is to try to identify the lines that contain data that is likely to be used in the near future by the processors, and only refresh such lines in the eDRAM cache. The other lines are not refreshed and are marked as invalid, after being written back to the next level of the hierarchy if they were dirty. To identify such lines, we can dynamically use the history of line accesses [10] or programmer hints.
Another intelligent refresh technique is to refresh different parts of the eDRAM modules at different frequencies, exploiting the different retention times of different cells. This approach relies on profiling the retention times of different on-chip eDRAM modules or regions. For example, one can exploit the spatial correlation of the retention times of the eDRAM cells [11]. With this technique, we may refresh most of the eDRAM with long refresh periods, and only a few small sections with the conventional, short refresh periods.
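A sketch of such profile-guided refresh; the per-region retention times and the guardband factor are hypothetical values for illustration:

```python
# Profile-guided eDRAM refresh: each region gets the longest refresh
# period its weakest profiled cell allows. All numbers are illustrative.
GUARDBAND = 0.8  # refresh at 80% of the measured minimum retention

# Hypothetical profiled minimum retention time per region (microseconds).
region_retention_us = {"R0": 400, "R1": 380, "R2": 45, "R3": 420}

def refresh_period_us(retention_us):
    return GUARDBAND * retention_us

periods = {r: refresh_period_us(t) for r, t in region_retention_us.items()}

# Most regions end up with long refresh periods; only the weak region
# (R2 here) needs the conventional, short refresh period.
for region, period in sorted(periods.items()):
    print(f"{region}: refresh every {period:.0f} us")
```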

C. Minimizing Energy in the On-Chip Network
The on-chip interconnection network in a large chip is another significant source of energy consumption. Given the importance of communication and the relative abundance of chip area, a good strategy is to have wide links and routers, and power gate the parts of the hardware that are not in use at a given time. Hence, good techniques to monitor and predict network utilization are important.
On-chip networks are especially vulnerable to process variation. This is because the network connects distant parts of the chip. As a result, it has to work in the areas of the chip that have the slowest transistors, as well as in those with the leakiest transistors.
To address this problem, we can divide the network into multiple V_dd domains, each one including a few routers. Due to the systematic component of process variation, the routers in the same domain are likely to have similar values of process variation parameters. Then, a controller can gradually reduce the V_dd of each domain dynamically, while monitoring for timing errors in the messages being transmitted. Such errors are detected and handled with mechanisms that already exist in the network. When the controller observes an error rate in a domain that is higher than a certain threshold, it increases the V_dd of that domain slightly. In addition, the controller periodically decreases the V_dd of all the domains slightly, to account for changes in workloads and temperatures. Overall, with this approach, the V_dd of each domain converges to the lowest value that is still safe (without changing the frequency). We call this scheme Tangle [12].
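The control loop can be sketched as follows. The step size, error threshold, and the toy error model (a hard cliff at an unknown safe voltage) are illustrative stand-ins; real hardware would use the network's existing error detection:

```python
# Sketch of a Tangle-style controller for one network Vdd domain.
V_STEP, ERR_THRESHOLD = 0.01, 0.05

def error_rate(vdd, v_safe):
    # Toy model: errors appear abruptly below some minimum safe voltage.
    return 0.0 if vdd >= v_safe else 0.5

def converge(v_start, v_safe, steps=100):
    vdd = v_start
    for _ in range(steps):
        if error_rate(vdd, v_safe) > ERR_THRESHOLD:
            vdd += V_STEP      # back off: too many timing errors
        else:
            vdd -= V_STEP      # probe downward to save energy
    return vdd

v = converge(v_start=0.70, v_safe=0.55)
print(f"domain settled near {v:.2f} V")  # hovers just around v_safe
```

The controller never needs to know v_safe; it discovers it by probing downward and retreating on errors, which is why the scheme adapts to per-domain variation, workload, and temperature.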

V. REDUCING DATA MOVEMENT
As technology scales, data movement contributes an increasingly large fraction of the energy consumption in the chip [13]. Consequently, we need to devise approaches to minimize the amount of data transferred. In this section, we discuss a few ways to do so.
One approach is to organize the chip in a hierarchy of clusters of cores with memories. Then, the system software can co-locate communicating threads and their data in the same cluster. This reduces the total amount of data movement needed.
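One way the system software might implement such co-location is a greedy placement pass over a thread communication graph. The graph, message counts, and cluster capacity below are hypothetical inputs for illustration:

```python
# Greedy co-location sketch: place heavily-communicating threads in the
# same cluster so their traffic stays on short intra-cluster wires.
CLUSTER_SIZE = 4

# (thread_a, thread_b, messages exchanged) -- hypothetical profile data.
comm = [("t0", "t1", 90), ("t0", "t2", 80), ("t1", "t2", 70),
        ("t3", "t4", 60), ("t4", "t5", 50), ("t3", "t5", 40),
        ("t2", "t3", 5)]

def place(comm):
    cluster_of, clusters = {}, []
    # Handle the heaviest-communicating pairs first.
    for a, b, _ in sorted(comm, key=lambda e: -e[2]):
        for t, partner in ((a, b), (b, a)):
            if t in cluster_of:
                continue
            c = cluster_of.get(partner)   # join the partner's cluster...
            if c is None or len(clusters[c]) >= CLUSTER_SIZE:
                clusters.append([])       # ...or open a new one if full
                c = len(clusters) - 1
            cluster_of[t] = c
            clusters[c].append(t)
    return cluster_of

placement = place(comm)
intra = sum(w for a, b, w in comm if placement[a] == placement[b])
print(f"intra-cluster messages: {intra} of {sum(w for _, _, w in comm)}")
```

With these inputs, only the light t2-t3 edge crosses clusters; the heavy traffic stays local. A real scheduler would also weigh load balance and the variation profile of each cluster.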
Another technique consists of using a single address space in the chip and directly managing, in software, the movement of the application's data in the cache hierarchy. Many of the applications that will run on an extreme-scale 1,000-core chip are likely to have relatively simple control and data structures, e.g., performing many of their computations in regular loops with analyzable array accesses. As a result, it is conceivable that a smart compiler performing extensive program analysis [14], possibly with help from the programmer, will be able to manage (and minimize) the movement of data in the on-chip memory hierarchy.
In this case, the architecture would support simple instructions to manage the caches, rather than providing programmer-transparent hardware cache coherence. Such instructions can perform cache entry invalidation and cache entry writeback to the next level of the hierarchy. Plain writes do not invalidate other cached copies of the data, and plain reads return the closest valid copy of the data. While the machine is now certainly harder to program, it may eliminate some data movement inefficiencies associated with a hardware cache coherence protocol, such as false sharing, or moving whole lines when only a fraction of the data in the line is used. In addition, by providing a single address space, we eliminate the need to copy data on communication, as in message-passing models.
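The resulting producer/consumer discipline can be illustrated with a toy memory model. The `writeback` and `invalidate` methods below stand in for the hypothetical cache-management instructions; this is a sketch of the programming model, not of real hardware:

```python
# Toy single-address-space model where software, not hardware, keeps
# caches coherent.
shared_mem = {0x40: 0}

class Cache:
    def __init__(self):
        self.lines = {}  # addr -> (value, dirty)

    def write(self, addr, value):
        # Plain write: stays local, does NOT invalidate other caches.
        self.lines[addr] = (value, True)

    def read(self, addr):
        # Plain read: returns the closest valid copy.
        if addr not in self.lines:
            self.lines[addr] = (shared_mem[addr], False)
        return self.lines[addr][0]

    def writeback(self, addr):
        # Explicit instruction: push a dirty line to the next level.
        value, dirty = self.lines[addr]
        if dirty:
            shared_mem[addr] = value
        self.lines[addr] = (value, False)

    def invalidate(self, addr):
        # Explicit instruction: drop the local copy.
        self.lines.pop(addr, None)

producer, consumer = Cache(), Cache()
consumer.read(0x40)          # consumer caches the stale value 0
producer.write(0x40, 7)      # update is visible only to the producer
producer.writeback(0x40)     # software makes it globally visible
consumer.invalidate(0x40)    # software discards the stale local copy
print(consumer.read(0x40))   # -> 7
```

Omitting either the writeback or the invalidate leaves the consumer reading stale data, which is exactly the burden (and the opportunity for optimization) that this model shifts from hardware to software.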
A third way of reducing the amount of data transfers is to use Processing in Memory (PIM) [15]. The idea is to add simple processing engines close to or embedded in the main memory of the machine, and use them to perform some operations on the nearby data in memory, hence avoiding the round trip from the main processor to the memory.
While PIM has been studied for at least 20 years, we may now see it become a reality. Companies are building 3D stacks that contain multiple memory dies on top of a logic die [16]. Currently, the logic die only includes advanced memory controller functions plus self-test and error detection, correction, and repair. However, it is easy to imagine how to augment the capabilities of the logic die to support Intelligent Memory Operations [17]. These can consist of preprocessing the data as it is read from the DRAM stack into the processor chip. They can also involve performing operations in place on the DRAM data.
Finally, another means of reducing communication is to support efficient communication and synchronization hardware primitives, such as those that avoid spinning over the network. These may include dynamic hierarchical hardware barriers, or efficient point-to-point synchronization between two cores using hardware full-empty bits [18].
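Full-empty-bit synchronization can be sketched in software. Here a condition variable stands in for the hardware's ability to block a waiting core instead of having it spin over the on-chip network:

```python
import threading

class FullEmptyWord:
    """Sketch of a memory word with a full/empty bit, in the style of
    hardware full-empty-bit synchronization. The condition variable
    stands in for blocking in hardware rather than network spinning."""
    def __init__(self):
        self._cv = threading.Condition()
        self._full = False
        self._value = None

    def write_full(self, value):
        # Write the word and mark it full, waking any blocked reader.
        with self._cv:
            self._value = value
            self._full = True
            self._cv.notify()

    def read_empty(self):
        # Block until the word is full, then read it and mark it empty.
        with self._cv:
            self._cv.wait_for(lambda: self._full)
            self._full = False
            return self._value

word = FullEmptyWord()
result = []
t = threading.Thread(target=lambda: result.append(word.read_empty()))
t.start()
word.write_full(42)   # producer core hands one value to the consumer
t.join()
print(result[0])      # -> 42
```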

VI. PROGRAMMING EXTREME-SCALE MACHINES
The system software in extreme-scale machines has to be aware of the process variation profile of the chip. This includes knowing, for each cluster, the V_dd and frequency it can support and the leakage power it dissipates. With this information, the system software can make scheduling decisions that maximize energy efficiency. Similarly, the system software should monitor different aspects of the hardware components, such as their usage, the energy consumed, and the temperature reached. With this information, it can make decisions on what components to power gate, or what V_dd and frequency settings to use, possibly with help from application hints.
Application software is likely to be harder to write for extreme-scale architectures than for conventional machines. This is because, to save energy in data transfers, the programmer has to carefully manage locality and minimize communication. Moreover, the use of low V_dd requires more concurrency to attain the same performance.
An important concern is how users will program these extreme-scale architectures. In practice, there are different types of programmers based on their expertise. Some are experts, in which case they will be able to map applications to the best clusters, set the V_dd and frequency of the clusters, and manage the data in the cache hierarchy well. They will obtain good energy efficiency. However, many programmers will likely be relatively inexperienced. Hence, they need a high-level programming model that is simple to program and allows them to express locality. One such model is Hierarchical Tiled Arrays (HTA) [19], which allows the computation to be expressed in recursive blocks or tiles. Another possible model is Concurrent Collections [20], which expresses the program in a dataflow-like manner. These are high-level models, and the compiler has to translate them into efficient machine code. For this, the compiler may have to rely on program auto-tuning to find the best code mapping on these complicated machines.
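The idea behind HTA can be illustrated conceptually. This is a sketch of hierarchical tiling in plain Python, not the actual HTA API; the tile sizes are arbitrary:

```python
# Conceptual sketch of hierarchical tiling: an array is split recursively
# into tiles, so locality is explicit at every level of the hierarchy.
def tile(xs, size):
    """Split a flat list into consecutive tiles of `size` elements."""
    return [xs[i:i + size] for i in range(0, len(xs), size)]

data = list(range(16))

# Two-level hierarchy: 2 outer tiles (one per cluster), each holding
# 2 inner tiles of 4 elements (one per core).
hta = [tile(outer, 4) for outer in tile(data, 8)]

# A reduction written against the hierarchy: inner sums can run on the
# cores of one cluster; the outer sum combines per-cluster results.
per_core = [[sum(inner) for inner in cluster] for cluster in hta]
per_cluster = [sum(core_sums) for core_sums in per_core]
total = sum(per_cluster)
print(per_cluster, total)  # [28, 92] 120
```

Because the tiling mirrors the cluster hierarchy of the machine, a compiler can map each level of the computation to the matching level of the hardware, keeping data movement local.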

VII. CONCLUSION
Attaining the 100-1,000x improvement in energy efficiency required for extreme-scale computing involves rethinking the whole computing stack from the ground up for energy efficiency. In this paper, we have outlined some of the techniques that can be used at different levels of the computing stack. Specifically, we have discussed the need to operate at low voltage, provide multiple voltage domains, and support simple cores organized in clusters. Memories and networks can be optimized by reducing leakage and minimizing the guardbands of logic. Finally, data movement can be minimized by managing the data in the cache hierarchy, processing in memory, and utilizing efficient synchronization. A major issue that remains in these machines is the challenge of programmability.

Fig. 1. Effect of process variation on the speed (a) and static power consumption (b) of architecture structures.