Darwin3: A large-scale neuromorphic chip with a Novel ISA and On-Chip Learning

Spiking Neural Networks (SNNs) are gaining increasing attention for their biological plausibility and potential for improved computational efficiency. To match the high spatial-temporal dynamics in SNNs, neuromorphic chips that execute SNNs directly in hardware-based neuron and synapse circuits are highly desirable. This paper presents a large-scale neuromorphic chip named Darwin3 with a novel instruction set architecture (ISA), which comprises 10 primary instructions and a few extended instructions. It supports flexible neuron model programming and local learning rule designs. The Darwin3 chip architecture is designed as a mesh of computing nodes with an innovative routing algorithm. We used a compression mechanism to represent synaptic connections, significantly reducing memory usage. The Darwin3 chip supports up to 2.35 million neurons, making it the largest of its kind in neuron scale. The experimental results show that code density is improved by up to 28.3x in Darwin3, and that neuron core fan-in and fan-out are improved by up to 4096x and 3072x through connection compression compared to the physical memory depth. Darwin3 also provides memory savings of 6.8x to 200.8x when mapping convolutional spiking neural networks (CSNNs) onto the chip, demonstrating state-of-the-art accuracy and latency compared to other neuromorphic chips.


INTRODUCTION
Spiking neural networks (SNNs) have garnered significant attention from researchers due to their ability to process spatial-temporal information in an efficient event-driven manner. To exploit the capabilities of SNNs, several spiking neural network simulation platforms have been introduced, such as Brian2 [1], NEST [2], and SPAIC [3]. Nevertheless, the dependence of these platforms on extensive GPU and CPU resources to mimic spiking dynamics over a high count of time steps potentially diminishes the intrinsic advantages of SNNs. Neuromorphic chips are designed for efficient execution of spiking neural networks and have demonstrated promising performance in brain simulation and specific ultra-low-power scenarios. However, several limitations prevent them from fully leveraging the advantages of spiking neural networks.
To better leverage the benefits of SNN models, we should emphasize three aspects when designing neuromorphic chips. Flexibility of Neural Models: One of the key functions of neuromorphic chips is to simulate diverse biological neurons and synapses. However, many neuromorphic chips only support a single type of neuron model, as evidenced in platforms like Neurogrid [4], which is based on analog neuron circuits. Some works introduce a degree of configurability to accommodate various neuron models. Loihi [5] achieved enhanced learning capabilities through configurable sets of traces and delays. FlexLearn [6] conceived a versatile data path that amalgamates key features from diverse models. Moreover, endeavors have been undertaken to develop fully configurable neuronal models using instructions. SpiNNaker's multi-core processors [7], based on conventional ARM cores, provide significant flexibility, but at the cost of reduced performance and energy efficiency compared to other accelerators. Loihi2 [8] presents an instruction set incorporating logical and mathematical operations similar to RISC instructions. However, instruction sets designed for conventional processors lack efficiency for SNNs despite their flexibility.
Synapse Density: To further unlock the potential of SNNs, neuromorphic chips need to support the representation of large-scale SNNs with more complex topologies [9]. However, current neuromorphic chips pay less attention to this aspect, concentrating primarily on simulating the behavior of neurons and synapses. For instance, TrueNorth [10] employs a crossbar design for synaptic connections, but it suffers from limited and fixed fan-in/fan-out capacity. Loihi [5] takes a different approach, using axon indexes to encode topology and thereby enhancing flexibility. Loihi2 [8] proposes optimizations for convolutional and factorized connections but gives less attention to other connection types. Unicorn [11] introduces a technique for merging synapses from multiple cores to extend the synaptic scale of a single core. Thus, improving synapse density for various topologies under limited storage conditions is crucial for optimizing the cost-effectiveness of the chip.
On-chip Learning Ability: Learning capability is a critical feature of biological neural networks. Currently, only a few neuromorphic chips support on-chip learning, and among those, the supported learning rules are quite restricted. For instance, BrainScaleS-2 [12] only accommodates fixed learning algorithms. Loihi [5] supports programmable rules for pre-, post-, and reward traces. Loihi2 [8] extends its programmable rules to pre-, post-, and generalized "third-factor" traces. However, even with the enhanced flexibility exhibited by Loihi2 [8], it cannot accommodate novel learning rules that might emerge. The latest research achievements in the field of electrochemical memory arrays [13] also provide new reference solutions.
In this paper, we design a large-scale neuromorphic chip with a domain-specific Instruction Set Architecture (ISA), named Darwin3, to support model flexibility, system scalability, and on-chip learning. Darwin3 is the third generation of our Darwin [14] family of neuromorphic chips and was successfully taped out and lit up in December 2022. Our main contributions are as follows.
1) We propose a domain-specific instruction set architecture (ISA) for neuromorphic systems, capable of efficiently describing diverse models and learning rules, including the leaky integrate-and-fire (LIF) family [15], Izhikevich [16], and STDP [17], among others. The proposed architecture excels in achieving high parallelism during computational operations, including loading parameters and updating state variables such as membrane potential and weights.
2) We design a novel mechanism to represent the topology of SNNs. This mechanism effectively compresses the information required to describe synaptic connections, thereby reducing overall memory usage.
The article is organized as follows: First, we introduce the topic and briefly overview the article's contents. Next, we present the neuromorphic-computing domain-specific ISA. Then, we describe the overall architecture of the neuromorphic chip and the implementation of each part, including the architecture of neuron cores and the mechanism of topology representation. Lastly, we demonstrate the experimental results.

Model Abstraction of Neurons, Synapses, and Learning
Many neuron models have been proposed in the field of computational neuroscience. The leaky integrate-and-fire (LIF) family [18] [19] [20] [15] is a group of spiking neuron models that can be described by one- or two-dimensional differential equations and has been widely implemented on hardware accelerators. These models have been developed for a variety of real-world applications. The Hodgkin-Huxley model [21] [22] is considered biologically plausible and accurately captures the intricacies of neuron behavior with four-dimensional differential equations that represent the transfer of ions across the neuron membrane. However, this model incurs very high computational cost. The Izhikevich model [16], specifically designed to replicate the bursting and spiking behaviors observed in the Hodgkin-Huxley model, is represented with two-dimensional differential equations.
All these neuron models are represented using systems of differential equations, with variations occurring only in the number of equations and the variables and parameters in each equation. The primary operators needed to solve them are the same. Therefore, a practical approach is to identify the common features shared by complex LIF models and use them to construct more complex models by introducing additional state variables and computation steps. We chose the Adaptive Leaky Integrate-and-Fire (AdLIF) model [20] as the baseline, since it has relatively more variables and parameters. Mathematically, it can be expressed by Equation 1, which captures the dynamics of the model and its adaptation properties.
(if V_m > V_th): V_m ← V_0, I_w ← I_w + b. Where V_m is the membrane potential, τ_m is the membrane time constant, E_L is the leak reversal potential, g is the synapse conductance, I_w is the adaptation current, τ_w is the time constant of the adaptation current, a is the sensitivity to the sub-threshold fluctuations of the membrane potential, b is the increment of I_w produced by a spike, V_0 is the reset potential after a spike, and I_s is the synaptic spike current.
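To make the dynamics concrete, here is a minimal forward-Euler sketch of an AdLIF neuron step. The symbol names follow the description above (membrane potential V_m, adaptation current I_w, reset V_0); all numeric constants and the exact form of the right-hand side are illustrative assumptions, not Darwin3's calibrated model.

```python
# Illustrative AdLIF neuron step (forward Euler); constants are hypothetical
# placeholders, not Darwin3 parameters.
def adlif_step(v, i_w, i_s, dt=1.0,
               tau_m=20.0, e_l=-65.0, tau_w=100.0,
               a=0.5, b=2.0, v_th=-50.0, v_0=-70.0):
    """One discrete update of membrane potential v and adaptation current i_w."""
    dv = (-(v - e_l) - i_w + i_s) / tau_m      # leak, adaptation, synaptic drive
    di_w = (a * (v - e_l) - i_w) / tau_w       # sub-threshold adaptation
    v, i_w = v + dt * dv, i_w + dt * di_w
    spike = v > v_th
    if spike:                                  # reset and spike-triggered adaptation
        v, i_w = v_0, i_w + b
    return v, i_w, spike

# Drive the neuron with a constant input current for 100 steps.
v, i_w = -65.0, 0.0
for _ in range(100):
    v, i_w, spike = adlif_step(v, i_w, i_s=40.0)
```

With zero input the neuron rests exactly at E_L; with sufficient input it integrates up to V_th, fires, resets to V_0, and accumulates adaptation current that slows subsequent firing.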
Similar to the various designs of neuron models with different computational complexity, there are also multiple synapse models, such as the delta and alpha synapse models [23]. One complex and commonly used model is the conductance-based (COBA) dual-exponential model [24] [23], as shown in Equation 2, which has a computational complexity similar to that of Equation 1. We chose this model as our representative synapse model.
Where δ(t − t_0) is a spike at time t_0, h is the gating variable of the ion channel, g is the synapse conductance, τ_d is the time constant of the synaptic decay phase, τ_r is the time constant of the synaptic rise phase, I_s is the synaptic spike current, V_m is the membrane potential, and E_L is the leak reversal potential.
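A minimal discretized sketch of the COBA dual-exponential synapse follows. The two-stage structure (spike-driven rise variable h feeding a decaying conductance g, with a conductance-based current) is taken from the description above; the time constants, the unit spike weight, and the exact discretization are illustrative assumptions.

```python
# Sketch of a conductance-based dual-exponential synapse, forward Euler.
# tau_r/tau_d and the spike weight w are illustrative, not chip parameters.
def coba_step(h, g, v_m, spike_in, dt=1.0,
              tau_r=2.0, tau_d=10.0, e_l=-65.0, w=1.0):
    """Update gating variable h and conductance g; return synaptic current."""
    h = h + dt * (-h / tau_r) + (w if spike_in else 0.0)  # rise phase, spike-driven
    g = g + dt * (-g / tau_d + h)                          # decay phase, driven by h
    i_s = g * (e_l - v_m)                                  # conductance-based current
    return h, g, i_s
```

A single input spike produces a conductance transient that rises quickly (τ_r) and decays slowly (τ_d), the characteristic dual-exponential shape.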
Synaptic plasticity [25], the ability of synapses to change their strength, was first proposed as a mechanism of learning and memory by Donald Hebb [26], and numerous learning rules have been proposed since. The STDP rule [17] and its variants are the most widely used. One relatively complex variant considers triplet [27] interactions and is reward-modulated [28]. We select this model as the baseline; through the selection of different state variables and parameters, the same equation can describe most STDP rules and their variants. The rule can be expressed mathematically as Equation 3.
Where x_0 is the pre-synaptic spike trace, y_0 is the first post-synaptic spike trace, y_1 is the second post-synaptic spike trace, r is the reward that modulates the synaptic traces, and τ_* are the time constants of x_0, y_0, y_1, and r.
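The trace machinery behind such a rule can be sketched as follows: each trace decays exponentially and is bumped by its associated event, and the weight update combines the traces gated by the reward trace. The trace names (x_0, y_0, y_1, r) follow the description above; the time constants, bump sizes, learning rates, and the exact combination in the weight update are illustrative assumptions, not the chip's rule.

```python
# Sketch of reward-modulated triplet-STDP traces; all constants illustrative.
def decay_bump(trace, tau, event, dt=1.0, bump=1.0):
    """Exponential decay plus a unit bump when the trace's event occurs."""
    return trace - dt * trace / tau + (bump if event else 0.0)

def stdp_traces(x0, y0, y1, r, pre, post, reward,
                tau_x=20.0, tau_y0=20.0, tau_y1=40.0, tau_r=200.0):
    x0 = decay_bump(x0, tau_x, pre)     # pre-synaptic trace
    y0 = decay_bump(y0, tau_y0, post)   # first (fast) post-synaptic trace
    y1 = decay_bump(y1, tau_y1, post)   # second (slow) post-synaptic trace
    r = decay_bump(r, tau_r, reward)    # reward trace ("third factor")
    return x0, y0, y1, r

def weight_update(w, x0, y0, y1, r, pre, post, a_plus=0.01, a_minus=0.012):
    # Potentiate on post spikes (gated by pre and slow post traces),
    # depress on pre spikes (gated by fast post trace); reward-modulated.
    if post:
        w += r * a_plus * x0 * y1
    if pre:
        w -= r * a_minus * y0
    return w
```

A pre-before-post pairing under positive reward then yields potentiation, because the pre trace x_0 is still elevated when the post spike triggers the update.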
Equations 1 to 3 describe three representative models. To implement the models using digital circuits, we need to convert the differential equations to a discrete form. By applying the Euler method, Equations 1 and 2 are converted to Equations 4 and 5.
Where p_0 to p_6 are fixed coefficient parameters and c_0 to c_3 are constants. Equations 4, 5, and 6 reveal that both complex LIF models and STDP variants can be expressed as polynomials involving multiple multiplication and addition operations. To implement these polynomial computations in digital circuits, we map them to corresponding data paths for further analysis. Figure 1 (a), (b), and (c) illustrate that the data paths of the state variables in Equations 4 and 5 and of the weight in Equation 6 are almost identical, except for different control signals from selectors and input sources. This allows us to efficiently implement these computations in circuits using a unified data path, where parameters can be preconfigured statically and state variables are updated continuously over time steps. For more complex cases such as the Izhikevich model [16], the parameter that multiplies the state variables in the computation process is itself a state variable. Therefore, we obtain the unified data path shown in Figure 1 (d).
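The unified-data-path observation can be made concrete with a small sketch: every update is a sum of products over selected operands, so one multiply-accumulate structure with selector signals covers neuron, synapse, and weight updates alike. The coefficient names below are illustrative, and the function is an abstraction of the hardware, not its implementation.

```python
# Sketch of a unified sum-of-products data path: selector signals decide
# which parameters/state variables feed each product term.
def unified_datapath(terms):
    """Sum of products; each term is a tuple of factors (parameters or
    state variables), mirroring a MAC array with static selectors."""
    acc = 0.0
    for factors in terms:
        prod = 1.0
        for f in factors:
            prod *= f
        acc += prod
    return acc

# Example: a discretized LIF membrane update V' = p0*V + p1*I + c0,
# expressed as three selected product terms (coefficients illustrative).
p0, p1, c0 = 0.95, 0.05, 0.5
v, i_syn = -60.0, 10.0
v_next = unified_datapath([(p0, v), (p1, i_syn), (c0,)])
```

The same call with a three-factor term, e.g. `unified_datapath([(1.0, w), (lr, x0, y1)])`, covers updates where a state variable multiplies another state variable, as in the Izhikevich case.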

The Proposed Darwin3 ISA
To effectively manage the data path and maximize performance and concurrency, it is crucial to design an efficient controller. First, we map state variables and parameters onto a set of registers, as indicated in Table 1 (a). It covers the state variables and parameters related to neurons and synapses, in which constants are static parameters. To provide users with the flexibility to implement different models, we propose a specialized instruction set architecture (ISA), as shown in Table 1. The core principle of this ISA is to amalgamate common operations into a single instruction, taking into account the computational characteristics of SNNs. By doing so, it not only reduces the memory usage required for instructions but also minimizes the time needed for instruction decoding during the computation process. We defined the set of instructions outlined in Table 1 (b). This instruction set comprises ten primary instructions. The first group, which focuses on load and store operations, consists of LSIS, LDIP, LSLS, and LDLP. Specifically, LSIS and LSLS cater to loading or writing back state variables for the inference and learning processes, respectively, executed in parallel. LDIP and LDLP are designated for loading parameters in parallel for the inference and learning phases, respectively. The second group, tailored for updating state variables, includes UPTIS, UPTVM, UPTLS, UPTWT, and UPTTS. Among these, UPTIS updates state variables excluding the membrane potential. UPTVM is exclusively for adjusting the membrane potential. UPTLS emphasizes state variables specific to the learning stage, while UPTWT manages the adjustments of synaptic weights. UPTTS oversees the updating of temporary state variables. Lastly, the GSPRS instruction is dedicated to generating spikes. With these instructions, we can effectively manage the computing units and support the process required for constructing flexible SNN models.
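The batching idea behind these instructions can be illustrated with a toy interpreter: a multi-hot operand field lets one instruction load several parameters at once, and a single update instruction performs a whole multiply-accumulate step. The encodings, field widths, and semantics below are simplified sketches for intuition only, not the actual Darwin3 ISA.

```python
# Toy interpreter for a Darwin3-style three-instruction LIF time step.
# Encodings and semantics are simplified illustrations, not the real ISA.
def run(program, params, state):
    spike = False
    for op, field in program:
        if op == "LDIP":                   # multi-hot parallel parameter load
            state["regs"] = [params[i] for i, bit in enumerate(field) if bit]
        elif op == "UPTVM":                # MAC update of the membrane potential
            k0, k1 = state["regs"][:2]
            state["v"] = k0 * state["v"] + k1 * state["i"]
        elif op == "GSPRS":                # threshold compare, fire, reset
            if state["v"] > state["v_th"]:
                spike, state["v"] = True, state["v_0"]
    return spike

# A minimal LIF time step in three instructions (illustrative values).
params = [0.9, 0.1, 0.0, 0.0]
state = {"v": -55.0, "i": 80.0, "v_th": -50.0, "v_0": -70.0, "regs": []}
lif_step = [("LDIP", [1, 1, 0, 0]), ("UPTVM", None), ("GSPRS", None)]
fired = run(lif_step, params, state)
```

Where a RISC-style encoding would need one load per parameter plus separate multiply and add instructions, the multi-hot fields collapse the step into three instructions, which is the code-density effect the ISA targets.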
Models such as AdEx [20] and HH [21] necessitate intricate operations, including exponential and division functions, which are not directly supported by the instructions outlined in Table 1 (b). The design enables users to perform division and exponentiation using computational units such as shift, multiplication, and lookup tables, thus conserving hardware resources. Consequently, we have augmented our instruction set with several instructions typically found in reduced instruction set architectures, as detailed in Table 1. In Table 2, we present a range of code examples beyond basic loading and storing operations to illustrate the efficacy of the proposed ISA. This selection demonstrates that simple LIF models and more complex triplet STDP rules can be concisely represented using minimal instructions. Additionally, specialized rules such as SDSP [29] can be efficiently encoded through a strategic combination of instructions. This versatility underscores the high flexibility and effectiveness of our instruction set design, establishing it as a viable tool for researchers and developers.

The operand fields of the primary instructions are encoded as follows:

LSIS — Reserve, NHIS (6-bit): LS indicates whether the operation is a load or a store, and NHIS is a 6-hot code corresponding to the inference state variables 0-5, indicating whether each state variable needs to be loaded or stored during the operation.

LDIP — NHIP (8-bit), NHIC (3-bit): NHIP is an 8-hot code corresponding to the parameters p_0-p_7, and NHIC is a 3-hot code corresponding to the constants c_0-c_2, indicating whether each needs to be loaded during the operation.

LSLS — LS (1-bit), NHLS (10-bit): The LS bit indicates whether the operation is a load or a store, and the 10-hot code NHLS, corresponding to the learning state variables 0-9, indicates whether each state variable needs to be loaded or stored during the operation.

LDLP — NHLP (7-bit), NHLC (4-bit): A 7-hot code NHLP corresponding to the learning parameters 0-6 and a 4-hot code NHLC corresponding to the learning constants 0-3 indicate whether each needs to be loaded during the operation.

UPTIS — Reserve, OHIS (3-bit), NHIP (6-bit): OHIS is a one-hot code indicating whether g, h, or I_w is to be calculated, and NHIP is a 6-hot code corresponding to p_3-p_7 and c_1, indicating whether each participates in the calculation according to Equations 4 and 5.

UPTVM — Reserve, NHVM (4-bit): NHVM is a 4-hot code determining whether V_m, g, I_w, or c_0 participates in the calculation to update V_m according to Equations 4 and 5.

UPTTS — k, l, m, n: These fields determine the selected temporary state variable x to update according to the equation x(t+1) = a*x(t) + b.

GSPRS — Reserve, NHSP (4-bit): A 4-hot code NHSP respectively determines whether to fire a spike, whether to perform threshold comparison, whether to involve the adaptive operation, and whether the membrane potential needs to reset to V_0.

THE DARWIN3 CHIP ARCHITECTURE

Overview
The Darwin3 chip architecture is characterized by a two-dimensional mesh of computing nodes, forming a 24*24 grid, interconnected via a Network on Chip (NoC), as shown in Figure 2 (a). The node at position (0,0) features a RISC-V processing core for chip management, while the other nodes, functioning as neuron cores, handle the majority of computations, with each supporting up to 4096 spiking neurons. Inter-chip communication modules are placed at the four edges of the chip, connected to peripheral routers and acting as compression and decompression units. This design enables the NoC to extend connectivity beyond a single chip.

This work implements a low-latency NoC architecture, which employs XY routing as detailed in [31]. The design is further improved by integrating the CXY [32] and OE-FAR [33] routing strategies to tackle congestion. Additionally, a new routing algorithm is introduced in this work that uses the relative offsets between the source and destination addresses as the basis for its routing scheme. This strategic decision simplifies the transmission of data packets to neighboring chips, eliminating the need for complex routing protocols and address translation.
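The offset-based idea can be sketched as dimension-ordered routing driven by a signed (dx, dy) pair carried in the packet: each hop consumes one unit of offset, so no absolute-address translation is needed when a packet crosses into a neighboring chip. This is a simplified illustration of the scheme's routing decision, not the chip's router logic (which additionally applies the congestion-aware strategies mentioned above).

```python
# Sketch of relative-offset, dimension-ordered (XY) routing: the packet
# carries (dx, dy); each router forwards along X until dx is exhausted,
# then along Y. Simplified illustration, no congestion handling.
def route(dx, dy):
    """Return the list of hop directions from source to destination."""
    hops = []
    while dx != 0:                      # X dimension first
        step = 1 if dx > 0 else -1
        hops.append("E" if step > 0 else "W")
        dx -= step
    while dy != 0:                      # then Y dimension
        step = 1 if dy > 0 else -1
        hops.append("N" if step > 0 else "S")
        dy -= step
    return hops
```

For example, a destination two columns east and one row south of the source yields the hop sequence E, E, S; the hop count is always |dx| + |dy|.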
The asynchronous communication interfaces, denoted by green circles in the figure, interconnect local synchronous modules, establishing Darwin3 as a globally asynchronous, locally synchronous (GALS) system. This enhances the capability of each node on the chip to operate independently at a high performance level.

Architecture of Neuron Cores
The architecture of a neuron core, illustrated in Figure 2 (b), comprises five components: the controller unit, the model execution unit, the time management unit, the register and memory units, and the spike event processing unit.
The controller unit is responsible for fetching, decoding, and execution flow control. The model execution unit can perform various arithmetic and logical operations. As defined in Table 1 (a), registers store state variables, parameters, constants, and temporary variables. The time management unit has two primary responsibilities. First, it generates an internal tick signal based on global time-step information, indicating the progression of time steps within the core. Second, it implements time-division multiplexing for 1-4096 logical neurons based on configuration information.
Each neuron core has dedicated memories for axon-in, axon-out, neuron state variables, synapse state variables, and instructions. Instructions describe only the behavior of neurons and synapses. A neuron's ID determines the address of its instructions and related state variables. The axon-in and axon-out memories store how neurons are connected; their organization is shown in Figure 2 (d).
When the chip starts, we need to set up the memories for the working nodes.Configuration data will be transported from the external controller (e.g., a PC or an FPGA) to the corresponding nodes through the Inter-Chip Communication module.
Unlike conventional processors that fetch instructions every cycle under a clock, Darwin3's neuron cores are driven by spike events. When a neuron receives a spike, AER IN queries the corresponding axon-in entry to find the neuron ID and weight, calculating the state variable h. When a time step advances, the controller unit performs computations for each neuron's inference or learning stage based on the instructions. If a neuron fires a spike, AER OUT gets the address and ID of the post-synaptic neuron from axon-out and packages this into a spike data packet.
The dashed lines in Figure 2 (b) illustrate the process of a neuron core receiving, processing, generating, and transmitting spikes. Multiplication operations take two cycles, while addition operations take one cycle. For example, the commonly used LIF neuron model requires four cycles (two multiplication and two addition operations), while the CUBA Delta model requires three cycles (one multiplication and one addition operation). Transmission delay is expressed as 2*N + 2*(N+1), where 2*N is the delay through N routers and 2*(N+1) is the delay through N+1 asynchronous interconnections between the pre-synaptic and post-synaptic neurons.

In the inference mode, the controller unit updates the state variables of each neuron described by the instructions within the current time step. In the learning mode, the controller unit extracts learning parameters and state variables to execute the necessary calculations and updates, calculating new weights. The axon-in memory area is reconfigured to accommodate learning-related parameters and state variables to optimize hardware resources.
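The delay model 2*N + 2*(N+1) can be packaged as a small calculator; adding the receiving neuron's compute cycles on top is our own simplified reading of the end-to-end latency, not a figure stated in the text.

```python
# Spike transmission latency per the stated model: 2 cycles per router,
# 2 cycles per asynchronous interconnection (N routers -> N+1 interfaces).
# Adding compute cycles at the receiver is an illustrative assumption.
def spike_latency_cycles(n_routers, compute_cycles=0):
    transmission = 2 * n_routers + 2 * (n_routers + 1)
    return transmission + compute_cycles
```

For a path through 3 routers ending at a LIF neuron (4 cycles), this gives 6 + 8 + 4 = 18 cycles; neighboring cores (one router) cost 2 + 4 = 6 cycles of pure transmission.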

Representation of Neuronal Connections
A flexible connection representation mechanism is essential for neuromorphic chips that aim to support complex networks, and several connection topologies find frequent application in SNNs. Upon a comprehensive examination of commonly employed connection expression mechanisms (summarized in Table 3), we found that the approach used by Loihi [5] stands out for its exceptional flexibility, featuring substantial fan-in and fan-out capability. Combining these advantageous attributes, we introduce a novel scheme that enables a highly compressed representation of connection topology, as depicted in Figure 2 (d).

To efficiently represent the topology of connections, each neuron core has independent memories for axon-out and axon-in within this framework. Spikes are conveyed using the address-event representation (AER) method. After a pre-synaptic neuron generates a spike, it accesses axon-out to retrieve the target node's address and axon-in index. The AER OUT module then encapsulates and transmits this information to the router through the local connection port, and the router directs the data packet toward the designated target node. Upon reception of the data packet, the target node queries axon-in to acquire the pertinent information concerning the target neuron and connection weights.

Within the axon-out structure, each operational neuron is associated with a Linker. The Linker's entries retain the address of the entry containing detailed connection information and the specific index of the neuron. This index distinguishes cases in which multiple neurons are connected to the same target node. This structural configuration optimizes the compression of information for connection types extending beyond point-to-point scenarios:
1. The last flag (LF) is set to 0 when a neuron connects to multiple nodes, indicating non-terminal entries and facilitating efficient compression of redundant information. As shown in Figure 2 (d), this situation is represented by 1#.
2. When multiple neurons connect to the same node(s), a single entry suffices for their connectivity information. As shown in Figure 2 (d), this situation is represented by 2#.
Each received axon-in index aligns with a corresponding Linker within the axon-in structure. The Linker's entries encapsulate the address of the entry housing detailed connection information and the type of connection. This structure is strategically designed to optimize information compression, particularly for connection types extending beyond point-to-point scenarios. For example, when all 4096 neurons within the node are connected to one pre-synaptic neuron, there is no need to store neuron indexes individually; in such cases, 4096 weights can be stored sequentially, as depicted in Figure 2 (d). This structure also facilitates the incorporation of weights with different bit widths, allowing diverse weights to be accommodated within a shared entry, consequently improving storage density.
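The axon-out side of this scheme can be sketched as a Linker that points at a run of connection records terminated by the last flag (LF): one neuron fans out to many nodes with a single Linker entry. The field names and record layout below are illustrative assumptions, not the silicon format.

```python
# Sketch of the axon-out lookup: a firing neuron's Linker entry points at
# a run of records; LF=0 records continue the run, LF=1 terminates it.
# Field names and layout are illustrative, not the hardware format.
def targets_of(neuron_id, linker, records):
    """Collect (node_address, axon_in_index) pairs for one firing neuron."""
    addr = linker[neuron_id]            # start of this neuron's record run
    out = []
    while True:
        rec = records[addr]
        out.append((rec["node"], rec["axon_in_idx"]))
        if rec["lf"]:                   # LF=1 marks the terminal record
            return out
        addr += 1

linker = {7: 0}                         # neuron 7's run starts at record 0
records = [
    {"node": (1, 2), "axon_in_idx": 5, "lf": 0},   # non-terminal: more follow
    {"node": (1, 3), "axon_in_idx": 5, "lf": 0},
    {"node": (4, 0), "axon_in_idx": 9, "lf": 1},   # terminal record
]
fanout = targets_of(7, linker, records)
```

Because the per-neuron Linker is a single entry while the run length is bounded only by the record memory, fan-out can greatly exceed the Linker memory depth, which is the effect the compression figures describe.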

EXPERIMENTAL RESULTS
To evaluate the proposed ISA and architecture, we first implemented the entire architecture in Verilog at the RTL level. Using the GlobalFoundries 22nm FDSOI process, we generated a GDSII file that meets the sign-off requirements after completing physical design and verification.

Notes to Table 4: #1 Darwin3 allows nodes to operate at different frequencies, with internal modules typically running at 300-400MHz. #2 A test chip contains 2 QPEs with 8 PEs, while a full SpiNNaker2 has 38 QPEs with 152 PEs. #3 Loihi2 has been implemented in Intel 4, equivalent to the 7nm process. #4 This data is obtained through a digital synthesis flow, not from the final silicon tape-out data. #5 Loihi2's "Axon Routing", which refers to fan-out or fan-in, has a topology compression of 256x. #6 Its output layer consists of ten integrate-and-fire (IF) neurons. #7 A minimum SOP energy of 23.6 pJ at 0.75 V is extracted from pre-silicon simulations.
After the initial Chip-on-Board (COB) testing in December 2022, the chip was repackaged using a Flip-Chip BGA, and a dedicated test system board was assembled, featuring a Xilinx 7-Series FPGA. Figure 3 illustrates the system board, chip layout, and main blocks. The chip's structure is organized into a grid of 6*6 groups, each consisting of 4*4 tiles. Each tile comprises a node connected to a router. Except for the RISC-V node, all nodes on the chip are neuron cores, collectively driving its computational functions. Notably, two distinct tile types exist, primarily differing in the size of their axon-in memory. We first compare some important metrics with the current state-of-the-art works, and then we run some application demonstrations to verify the chip's functionality and performance.

Comparison with The State-of-the-Art Neuromorphic Chips
Table 4 summarizes the performance and specifications of state-of-the-art neuromorphic chips. Mixed-signal designs with analog neuron and synapse computation and high-speed digital peripherals are grouped on the left [4] [34] [12], and digital designs, including Darwin3, are grouped on the right [7]. The critical metrics for efficient spiking neuromorphic hardware platforms are the scale of neurons and synapses, model construction capabilities, synaptic plasticity, and energy per synaptic operation.

Neuron Number
The quantity of neurons and synapses directly determines the size and complexity of the spiking neural network that a neuromorphic chip can support, which is extremely important. However, a direct comparison with the SpiNNaker chips [7] [35] is not feasible due to their use of ARM processors, where the scale is tied to the size of the off-chip memory. Among other chips, NeuroGrid [4] has the largest number of neurons in a single neuron core, reaching 64K. Loihi [5], Unicorn [11], Loihi2 [8], and Darwin3 are at a similar level, boasting neuron counts exceeding 1K per core. At the chip level, Darwin3 can support up to 2.35 million neurons, surpassing the scale of TrueNorth and Loihi2 by more than two times.

Synapse Capacity
The fan-in and fan-out capabilities within each neuron core profoundly impact the chip's overall synapse capacity, as detailed in Table 3. Darwin3 distinguishes itself with its adaptive axon-out and axon-in memory configuration, coupled with efficient compression mechanisms, enabling remarkable fan-in and fan-out capacities of up to (D_1 - 1) * M * N and (D_2 - N) * N^2, respectively (symbols as defined in Table 3). With this compression mechanism, Darwin3 achieves a maximum fan-in improvement of 1024x and a maximum fan-out improvement of 2048x compared to the physical memory depth.
While the previous discussion delved into fan-in and fan-out capabilities, focusing on the potential synaptic connectivity, the challenge of efficiently storing synaptic weight parameters remains crucial. In Figure 4 (a), we present a comparative analysis of weight storage requirements, highlighting the stark contrast between Darwin3 and existing approaches when applied to typical networks converted from Convolutional Neural Networks (CNNs). Chips lacking specialized compression mechanisms exhibit dense weight matrices, with memory usage 6.8 to 200.8 times larger than our approach. In crossbar designs, neurons consistently occupy their unique space, contributing to additional inefficiencies.
Darwin3 employs a versatile mechanism by classifying convolutional connections into weight-sharing multi-to-multi forms and obtains storage parity with the initial parameters, thereby achieving efficiency comparable to Loihi2 [8]. Importantly, this advantage extends to non-convolutional connections featuring shared weight parameters. Darwin3 enables instructional access to the complete axon-in, thus realizing the factorized attribute, which is also supported by Loihi2 [8], through multiplication operations. Furthermore, Darwin3 offers compatibility with diverse weight bit widths, enhancing its adaptability and storage efficiency.

Code Density
Code density is a meaningful ISA metric, so we compare the code density of Darwin3 with the SpiNNaker chips [7] [35] and Loihi2 [8], the outstanding ISA-based neuromorphic chips. We use C code to describe a model and the SpiNNaker tools integrated by sPyNNaker [47] to generate assembly code for the SpiNNaker chips. We then compare the lengths of the assembly code in Figure 4 (b). Loihi2's RISC instruction set is similar to ARM's Thumb, where spike instructions help curtail spike-related instruction code, offering a slight edge over SpiNNaker. Darwin3 shows an advantage in code density because of our proposed instructions. The instruction set concurrently loads parameters and expedites multiplication and addition with multiple parameters. Overall, Darwin3 achieves a 2.2x to 28.3x code density advantage across distinct models.

Inference and Learning Performance
For researchers working on SNNs, after finalizing the model, the primary focus lies on evaluating the chip's performance during application execution, with particular attention to latency and accuracy. To evaluate the capabilities of Darwin3, we conducted several experiments under two distinct scenarios: inference and learning. Table 5 (b) compares the performance of Darwin3 to state-of-the-art neuromorphic chips in typical applications. These applications were SNNs converted from trained and quantized ANNs. We implemented the same type of network models on Darwin3, and the performance metrics indicate that Darwin3 is in the leading position regarding accuracy and latency. The accuracy is up to 6.76% higher, and the latency is up to 4.5x better than the others. Darwin3 exhibits these advantages because it has a flexible and efficient connection construction ability, which suits converted convolutional networks well. Due to the high efficiency of connection storage, it does not add redundant spike transmission latency. The asynchronous interconnection method employed by Darwin3 significantly reduces the communication delay between neuron cores. Darwin3 utilizes click elements [48] to construct a cross-clock-domain structure, enabling cross-clock-domain data transfer in just two cycles. Furthermore, the related topological structures can be split and computed in parallel across more neuron cores. We attribute the observed discrepancies to the quantization operations performed while mapping these models to hardware, as the quantization methods employed may vary among different approaches. It is important to note that there is still room for improving latency performance by optimizing the mapping approach.
To further evaluate the on-chip learning capability of Darwin3, we constructed a network based on the architecture proposed by Diehl and Cook [49]. We added a supervision layer, which provides positive or negative rewards based on comparing the network's output with the target during training, achieving an overall implementation of the R-STDP rule. The network

Energy Efficiency
The energy consumption per synaptic operation (SOP) is the most critical energy metric for neuromorphic chips. We measured the energy consumption of the Darwin3 chip while running a two-layer neural network in which the neurons in the first layer fire spikes without inputs and the neurons in the second layer receive those spikes and perform calculations. We adopt the common approach [39] to evaluate energy consumption, as detailed in Equation 7.
where P_idle is the power dissipated by a Darwin3 chip after power-up with no applications configured, P_baseline is the baseline power dissipated when all nodes are enabled but no neurons run on them, P_neuron is the power required to simulate one LIF neuron with a 1 ms time-step, n is the total number of neurons, E_SOP is the energy consumed per synaptic event (activation of a neural connection), and s is the total number of synaptic events. The chip operates at a frequency of 333 MHz with a core voltage supply of 0.8 V and an IO voltage supply of 1.8 V, as shown in Table 4. The measured average energy consumption is 5.47 pJ/SOP. This metric is directly influenced by factors such as the manufacturing process, supply voltage, and operating frequency, making fair comparison challenging; nevertheless, based on the data released by prior works under typical scenarios, Darwin3 achieves a leading result. Darwin3's advantage lies in its internal asynchronous interconnection circuit, which lets the chip consume very little power when there is no spike transmission or computation. Additionally, all memories of Darwin3 shut down during idle phases, further reducing power consumption.
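In the spirit of Equation 7, the per-SOP energy can be backed out of the measured powers roughly as follows. This is a sketch with our own function and symbol names, and the numbers below are illustrative, not measured values from the chip:

```python
def energy_per_sop(p_total, p_idle, p_baseline, p_neuron, n_neurons,
                   n_sops, duration_s):
    """Attribute the residual power, after subtracting idle power, baseline
    power, and neuron-update power, to synaptic operations, then divide the
    resulting energy by the number of synaptic events."""
    p_synaptic = p_total - p_idle - p_baseline - n_neurons * p_neuron
    return p_synaptic * duration_s / n_sops

# Illustrative numbers: 0.4 W of residual synaptic power sustained for 1 s,
# spread across 4e10 synaptic events -> roughly 10 pJ/SOP
e = energy_per_sop(p_total=1.0, p_idle=0.2, p_baseline=0.3,
                   p_neuron=1e-6, n_neurons=100_000,
                   n_sops=4e10, duration_s=1.0)
```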

Applications with Millions of Neurons
To further illustrate the chip's efficacy, we developed two extensive applications on Darwin3: spiking VGG-16 ensembling and directly trained [50] SNN-based maze solving, shown in Figure 5. We ensembled the outputs of five VGG-16 models obtained through ANN2SNN conversion [51] using a voting mechanism [52], yielding a composite model comprising approximately 1.05 million neurons with 8-bit weight precision. We applied random transformations to the input and used five independent VGGs in the hidden layers for the classification tasks; the voting layer produces the final classification outcome from the collective votes of the individual hidden-layer outputs. Compared to the original single VGG-16, accuracy on the CIFAR-10 dataset increased from 92.98% to 93.48%. We also developed a maze-solving application. We mapped the maze onto a set of neurons, where excitatory neurons represent free-walking grid points and inhibitory neurons represent obstacles. The interconnected excitatory neurons transmit spikes in sequence, and under the STDP rule the synaptic weights are continuously increased until they form a stable synaptic strength. Synapses connected to inhibitory neurons, however, cannot be strengthened, so spike transmission terminates at inhibitory neurons. After learning, the model can quickly find the path by observing the route along which the spikes propagate. We conducted experiments using mazes of different sizes, comparing the time taken to search for a path on our chip versus on a CPU server. A maze of size 1534x1534 requires over 2.35 million neurons, approaching the upper limit of neurons Darwin3 can simulate. The results are shown in Table 6. With the STDP-based SNN method, the time consumed increases linearly with maze size; in contrast, the traditional search method on the CPU server consumes far more time because it relies on many recursive operations.
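The maze procedure can be sketched in software as follows. This is our own simplified, event-driven rendering of the idea, not the chip's actual implementation: the spike wavefront spreading through excitatory neurons behaves like a breadth-first search, and the STDP-potentiated synapses act as back-pointers from which the path is read out.

```python
from collections import deque

def solve_maze(grid, start, goal):
    """grid[r][c] == 0: free cell (excitatory neuron); 1: obstacle
    (inhibitory neuron). A spike wavefront starting at `start` spreads
    through excitatory neurons only; each synapse that carries a spike is
    'potentiated' (recorded in `parent`), and following the potentiated
    synapses back from `goal` recovers the path."""
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}          # potentiated synapse: post -> pre neuron
    wavefront = deque([start])
    while wavefront:
        r, c = wavefront.popleft()
        if (r, c) == goal:
            break
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in parent):
                parent[(nr, nc)] = (r, c)   # spike transmitted, synapse strengthened
                wavefront.append((nr, nc))
    if goal not in parent:
        return None                 # spikes never reached the goal
    path, cell = [], goal
    while cell is not None:
        path.append(cell)
        cell = parent[cell]
    return path[::-1]

maze = [[0, 1, 0],
        [0, 1, 0],
        [0, 0, 0]]
path = solve_maze(maze, (0, 0), (0, 2))
```

Because each neuron spikes at most once per search, the work grows with the number of reachable cells, which matches the linear scaling with maze size observed on the chip.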

CONCLUSION
This article proposes a new instruction set and a connectivity compression mechanism to create a chip that can support large-scale neural networks. The chip is designed to be more efficient, in terms of the number of neurons it can accommodate and its synaptic computing performance, than existing works. The experimental results show that the chip reaches the same leading level as state-of-the-art works on accuracy and latency metrics, in both inference and learning modes.
The practical effectiveness of the chip has also been demonstrated by running a maze-searching application on it. Owing to Darwin3's versatile inter-chip communication mechanism, multiple Darwin3 chips can be integrated onto a single board, several boards can be interconnected into a chassis, and chassis can in turn be interconnected through a network infrastructure. Coupled with suitable software frameworks, this configuration supports the construction of extensive SNNs.

Figure 1: Typical Data Path. (a)-(c) The data paths of individual state variables. (d) The common data path for state variables.

UPTLS k(3-bit) l(3-bit) m(3-bit) n(2-bit): the fields k, l, m, and n determine the selected state variable that needs updating according to an equation of the form x(t+1) = a * x(t) + b. UPTWT m(2-bit) n(9-bit): a binary code m and a one-hot code n determine the synaptic weight to update according to the equation WT

Figure 2(c) shows the architecture tailored for inference and learning based on the proposed design. (Figure legend: time needed for processing spike events; time needed for computation of synapse states; time needed for computation of neuron states; time needed for communication of spike events.)

Figure 2: The Architecture of The Chip Top and Main Blocks. (a) The Top Architecture of The Proposed Chip. (b) The Architecture of a Neuron Core. (c) The Architecture for Inference and Learning Process. (d) The Architecture of The Synapses.

D1 represents the fan-in memory depth (commonly associated with axon-in structures), D2 represents the fan-out memory depth (commonly linked to axon-out structures), M represents the number of neuron cores, and N represents the number of neurons within a neuron core. R and C represent the dimensions of the crossbar, with R equivalent to D1 (rows) and C equivalent to D2 (columns).

Figure 3: The Test Chip and System Board

Figure 4: Comparison of Code Density and Memory Usage. (a) Comparison of Required Weight Memory Across Typical Networks. (b) Comparison of Code Density.

Table 1: Registers and Instruction Set. (a) Registers for State Variables and Parameters: the membrane potential; two presynaptic spike traces; the synaptic conductance g; a flag for spikes from the pre-synapse; the synaptic current I; two postsynaptic spike traces; the gating variable h; a flag for spikes from the post-synapse; the adaptive voltage; traces for reward and punishment; the threshold; eight parameters for the inference stage, corresponding to a0-a7 in Equations 4 and 5; and temporary registers for parameters and states.

Table 2: Code Examples for Widely Used Models

Table 3: Different Connectivity Mechanisms

Table 4: Performance and Specifications of State-of-The-Art Neuromorphic Chips

Table 5: Performance Comparison with Other Chips. (a) Inference Mode. The weight precision here refers to the precision of the network run in the experiment rather than the maximum weight precision of the chip. #2 The synaptic weights of the ten neurons in the output layer are 10-bit.

Table 6: Time Cost for Maze Solving Application. The mazes are randomly generated, and the running time is an average of five measurements.