Elastic circuits
Jordi Cortadella
Universitat Politècnica de Catalunya, Barcelona
EMicro 2013

Goals
• Convince ourselves that:
  – designing an asynchronous circuit is easy
  – synchronous and asynchronous circuits are similar
  – asynchronous circuits bring new advantages
• Not to cover exotic asynchronous schemes
• Elasticity can also be synchronous

Clocking
• How to distribute the clock?
• How to determine the clock frequency?
• How to implement robust communications?
• How to reduce and manage energy?
Nvidia Kepler™ GK110: 28 nm, 7.1B transistors, 550 mm², 2688 CUDA cores, base clock: 836 MHz, memory clock: 6 GHz

Outline
• Synchronous and Source-synchronous circuits
• Completion detection
• Handshaking
• Performance analysis
• Why asynchronous?
• Design automation
• Synchronous elasticity
• Globally-asynchronous Locally-synchronous

Synchronous and Source-Synchronous

Synchronous circuit
[Figure: synchronous circuit clocked from a PLL]

Synchronous circuit
[Figure: PLL, two clock tree branches (CLKtree1, CLKtree2) and combinational logic (CL)]
Two competing paths:
• Launching path
• Capturing path
Launching path < Capturing path + Period
CLKtree1 + CL < CLKtree2 + Period
CL < Period (no clock skew)

Source-synchronous
[Figure: CLK gen followed by a chain of matched delays; launching path and capturing path]
• No global clock required
• More tolerance to PVT variations
• Period > longest combinational path
• Good for acyclic pipelines

Source-synchronous with forks and joins
[Figure: CLK gen driving a pipeline with forks and joins]
How to synchronize incoming events?
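The usual answer to this question is the Muller C-element introduced on the next slides: its output changes only when both inputs agree and holds its previous value otherwise. A minimal behavioral sketch in Python follows. The names `CElement`, `majority` and `join_trace` are illustrative and do not come from the slides, and the model ignores gate delays and hazards.

```python
class CElement:
    """Behavioral model of a 2-input Muller C-element (illustrative name)."""

    def __init__(self, initial: int = 0) -> None:
        self.c = initial  # stored output value

    def update(self, a: int, b: int) -> int:
        """Apply inputs a and b (0 or 1) and return the resulting output."""
        if a == b:
            # Both inputs agree: the output follows them.
            self.c = a
        # Inputs disagree: the output keeps its previous value.
        return self.c


def majority(a: int, b: int, c: int) -> int:
    """3-input majority function.

    Feeding the output back as the third input, C = MAJ(A, B, C),
    gives the same next-state behavior as CElement.update.
    """
    return (a & b) | (a & c) | (b & c)


def join_trace(events):
    """Feed a sequence of (a, b) input pairs to a fresh C-element."""
    gate = CElement()
    return [gate.update(a, b) for a, b in events]


if __name__ == "__main__":
    # The output rises only after both inputs have risen,
    # and falls only after both have fallen.
    print(join_trace([(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]))
```

In a join, each incoming event drives one input, so the outgoing event is produced only once all incoming events have arrived.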
C element (Muller 1959)
A  B | C
0  0 | 0
0  1 | C (holds its value)
1  0 | C (holds its value)
1  1 | 1
[Figure: C element symbol with inputs A, B and output C; one implementation is a majority gate (MAJ) with the output C fed back as the third input. Many implementations exist.]

Completion detection

Completion detection
[Figure: CLK gen with a fixed delay alongside the logic]
The fixed delay must be longer than the worst-case logic delay (plus variability)
Q: could we detect when a computation has completed ASAP?

Delay-insensitive codes: Dual Rail
• Dual rail: every bit encoded with two signals
A.t  A.f | A
 0    0  | Spacer (SP)
 0    1  | 0
 1    0  | 1
 1    1  | Not used
[Waveform: A.t and A.f encoding a sequence of values of A separated by spacers (SP)]

Dual Rail AND gate
A   B  | C
SP  SP | SP
0   -  | 0
-   0  | 0
SP  1  | SP
1   SP | SP
1   1  | 1
[Figure: AND gate with dual-rail inputs A.t, A.f, B.t, B.f and outputs C.t, C.f]

Dual Rail Inverter
A  | Z
SP | SP
0  | 1
1  | 0
[Figure: the inverter swaps the two rails between A.t, A.f and Z.t, Z.f]

Dual Rail AND/OR gate
[Figure: dual-rail AND and OR gate implementations with inputs A.t, A.f, B.t, B.f and outputs C.t, C.f]

Dual rail: completion detection
[Figure: dual-rail logic followed by a completion detection tree of C elements that produces "done"]

Multi-input C element
[Figure: tree of 2-input C elements combining a1 ... a7 into a single output c]

Dual rail: completion detection
[Figure: dual-rail netlist (INV, AND, OR, AND) with CLK gen; in the second step a C element closes the loop to the CLK gen]

Dual rail: operation
[Figure: the same netlist alternating between Compute and Reset phases]
For a correct operation, all internal signals should be reset before the compute phase:
• Use a more complex implementation of dual-rail (e.g., DIMS), or
• Have internal completion detection, or
• Use timing assumptions

Other DI codes
• There are many DI codes:
  – k-out-of-n, Berger, Knuth, …
• Example: 1-out-of-4
  – 2 bits with 4 wires
  – Same wire efficiency as DR
  – Less power consuming
  – Good for communication
  – Bad for logic
Wires  | Value
0000   | Spacer
0001   | 0
0010   | 1
0100   | 2
1000   | 3
others | not used

Single rail data vs. dual rail
Some back-of-the-envelope estimations:
            | Area | Delay | Static power | Dynamic power
Single rail |  1   |  1    |      1       |    < 0.2
Dual rail   |  2   | << 1  |      2       |      2
Dual rail:
• Good for speed
• Large area
• High power consumption

Handshaking

Handshaking
[Figure: CLK gen and a circuit with unknown delay]
Assume that the source module can provide data at any rate:
• When should the CLK generator send an event if the internal delays of the circuit are unknown?
Solution: handshaking

Handshaking
[Figure: Data channel with two control wires]
• Request: "I have data"
• Acknowledge: "I want data"

Asynchronous elastic pipeline
[Figure: chain of C elements between ReqIn/AckOut and ReqOut/AckIn]
• David Muller's pipeline (late 50's)
• Sutherland's Micropipelines (Turing Award lecture, 1989)

Multiple inputs and outputs
[Figure: stages with multiple inputs and outputs; the Req and Ack signals are combined with a C element]

Channel-based communication
• A channel contains data and handshake wires
  – Single-rail: Data, Req, Ack
  – Dual-rail: Data, Ack

Push/pull channels
[Figure: Sender and Receiver connected by a single-rail channel; push channel: Data, Req (push), Ack; pull channel: Data, Ack, Req (pull)]
• Push: the sender initiates the communication
• Pull: the receiver initiates the communication

Four-phase protocol
[Waveform: Req, Ack and Data (Data 1, Data 2, Data 3), one data transfer per Req/Ack cycle]
• Valid data on the active edge of Req
• Req/Ack must return to zero before the next transfer
• Different variations of the 4-phase protocol exist

Two-phase protocol
[Waveform: Req, Ack and Data (Data 1, Data 2, Data 3), one data transfer per edge]
• Every edge is active
• It may require double-edge triggered flip-flops or pulse generators

How to memorize?
[Figure: latches (L) around combinational logic, with a matched delay and C elements; open question: 2-phase or 4-phase?]

How to memorize?
[Figure: 2-phase implementation: latches (L) around combinational logic, driven through a pulse generator, with a matched delay and C elements]

How to memorize?
[Figure: 4-phase implementation: latches (L) around combinational logic, with a matched delay and C elements]

Performance analysis

Ring oscillators
[Figure: rings built from C elements and inverters, numbered 1 to 7]
• Every ring requires an odd number of inverters
• The cycle period is determined by the slowest ring
• The cycle period is adapted to the operating conditions (temperature, voltage)

Global Rings
[Figure: global rings built from C elements]

Global Rings
[Animation over four slides: a ring with circulating tokens; the throughput shown on successive slides is Th = 1/6, Th = 2/6, Th = 3/6 and Th = 1/6]
• Ramamoorthy and Ho, Performance evaluation of asynchronous concurrent systems with Petri nets, 1980.
• T. Williams et al., A self-timed chip for division, 1987.
• Greenstreet and Steiglitz, Bubbles can make self-timed pipelines fast, 1990.
• Manohar and Martin, Slack elasticity in concurrent computing, 1998.
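The throughput values quoted on the Global Rings slides are consistent with a simple model in which every token needs a bubble ahead of it to advance, so a ring of N stages holding k tokens runs at Th = min(k, N - k) / N. The sketch below computes that value. The function name `ring_throughput` and the unit-delay, one-token-per-stage model are assumptions made for illustration, not something the slides state explicitly.

```python
from fractions import Fraction


def ring_throughput(stages: int, tokens: int) -> Fraction:
    """Normalized throughput of a ring under a simplified model.

    Assumption (not taken from the slides): each stage holds at most one
    token, all stages have the same delay, and a token can only move into
    a bubble. The ring is then limited by whichever is scarcer.
    """
    if stages <= 0 or not 0 <= tokens <= stages:
        raise ValueError("need stages > 0 and 0 <= tokens <= stages")
    bubbles = stages - tokens
    return Fraction(min(tokens, bubbles), stages)


def limiting_resource(stages: int, tokens: int) -> str:
    """Tell whether the ring is token limited or bubble limited."""
    bubbles = stages - tokens
    if tokens < bubbles:
        return "token limited"
    if tokens > bubbles:
        return "bubble limited"
    return "balanced"


if __name__ == "__main__":
    for k in range(0, 7):
        print(k, ring_throughput(6, k), limiting_resource(6, k))
```

Under this model a 6-stage ring gives 1/6, 2/6 and 3/6 for one, two and three tokens, and drops back to 1/6 when only one bubble is left (five tokens). The maximum of 1/2 is reached with N/2 tokens.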
Global Rings
[Plot: throughput Th versus number of tokens (0 to N); Th peaks at 1/2 with N/2 tokens; the ring is token limited below N/2 tokens and bubble limited above]

A latch-based view of synchronous circuits
Flip-flop = Master + Slave

Multiple Rings
[Figure: several coupled rings labelled 2/4, 2/5, 2/4 and 5/7?; the last ring actually runs at 2/7]
It's bubble limited!

Slack matching
[Figure: the same rings (2/4, 2/5, 2/4); the slow ring's label 2/7 is replaced by 4/9?]
• We can add as many bubbles as we want (but not tokens!)
• Slack matching can be solved optimally in polynomial time
• Slack matching is conceptually equivalent to buffer (FIFO) sizing or recycling

Performance analysis
[Figure: network of C elements; the cycle period is given by the Mean Cycle Ratio]

Latch-based design
[Figure: latches L1, L2, L3, L4 with the launching path and the capturing path, and the corresponding timing diagram]

Matched delays can be adjustable
[Figure: latches L1 to L4 with matched delays controlled by a delay selection input]
Delays can be adjusted:
• At testing/boot time (to adjust to static variability)
• At runtime (to compensate dynamic variability)

Why asynchronous?
Exploiting elasticity
[Figure: rigid clock (CLK) versus elastic operation, trading off high performance and low energy]

Exploiting elasticity
[Plot: Voltage (0.7 V, 0.8 V, 0.9 V, 1 V) versus Performance (500 MHz, 1 GHz, 2 GHz); a rigid clock fixes one operating point, voltage scaling moves between high performance and low energy]

Voltage scaling and power savings
3 ARM926 cores on the same die
[Figure: power savings of -14% and -24%]

Tracking variability
[Figure: circuit with a matched delay]

Tracking variability
[Plot: delay of the matched delay versus delay of the logic for best, typ and worst conditions]
Good correlation for:
• Process variability (systematic)
• Global voltage fluctuations
• Temperature
• Aging (partially)

Margins
[Figure: components of the cycle period]
• Rigid clocks: gate and wire delays (typ) + margins for P, V, T, Aging, Skew and PLL Jitter
• Elastic clocks: gate and wire delays (typ) + reduced margins for P, V, T, Aging and Skew
• Margin reduction: speed-up / power savings

Clock elasticity
• Rigid clock: cycle period = computation time + wasted time
• Elastic clock: cycle period = computation time

Design Automation

Design automation paradigms
• Synthesis of asynchronous controllers
  – Logic synthesis from Petri nets or asynchronous FSMs
• Syntax-directed translation
  – Correct-by-construction composition of handshake components
• De-synchronization
  – Automatic transformation from synchronous to asynchronous

Synthesis of asynchronous controllers
[Figure: VME Bus Controller between the Bus (DSr, DSw, DTACK), the Data Transceiver (D) and the Device (LDS, LDTACK); timing diagram of the Read Cycle for DSr, LDS, LDTACK, D, DTACK]

Synthesis of asynchronous controllers
[Figure: Signal Transition Graph of the VME Bus Controller with transitions DSr+, LDS+, LDTACK+, D+, DTACK+, DSr-, D-, DTACK-, LDS-, LDTACK-]

Synthesis of asynchronous controllers
[Figure: the Signal Transition Graph and the synthesized circuit with signals D, DTACK, LDS, DSr, LDTACK]
Cortadella et al., Petrify

Syntax-directed translation
    int = type [0..255]
    & gcd: main proc (in? chan <<int,int>> & out! chan int)
    begin x, y: var int
    | forever do
        in?<<x,y>>
      ; do x <> y then
          if x < y then y:=y-x
                   else x:=x-y
          fi
        od
      ; out!x
      od
    end
[Figure: handshake circuit generated from the program, with variables x and y, MUX/DMX components, sequencers and the operators <>, < and -]
Sources:
• P.A. Beerel, R.O. Ozdag and M. Ferretti. A Designer's Guide to Asynchronous VLSI, Cambridge University Press, 2010.
• J. Kessels and A. Peeters. DESCALE: A Design Experiment for a Smart Card Application Consuming Low Energy, in Principles of Asynchronous Circuit Design, A Systems Perspective, Eds., J. Sparso and S. Furber, Kluwer Academic Publishers, 2001.

De-synchronization
• Strategy: substitute the clock tree by local clocks and handshakes
• Combinational logic and latches are not modified
• More tolerance to variability
  – Similar area, less power and/or more speed
• Cortadella, Kondratyev, Lavagno and Sotiriou. Desynchronization: Synthesis of asynchronous circuits from synchronous specifications. IEEE TCAD, Oct 2006.

Synchronous operation
[Figure: synchronous circuit driven by a CLK gen]
Transforming a synchronous circuit into asynchronous (automatically)

De-synchronization
[Figure: the same circuit with the clock tree replaced by local handshake controllers]
Transforming a synchronous circuit into asynchronous (automatically)

System-level de-synchronization
[Figure, three steps: a system with a global CLK whose blocks are progressively de-synchronized]

Synchronous elasticity

Different flavors of elasticity
[Figure: an adder (+) processing streams of operands in three settings: Rigid, Elastic and Synchronous Elastic]
Carloni et al., Latency-insensitive systems.
Asynchronous elasticity
[Figure: pipeline stages connected by req and ack handshake signals]

Synchronous elasticity
[Figure: pipeline stages connected by valid and stop signals; CLK generated by a PLL or a ring oscillator]

Latch-based elasticity
[Figure: sender and receiver connected through latches with Data, enable (En), Valid (V) and Stop signals]

Elastic netlists
[Figure: elastic buffers (EB) connected through Fork, Join and Join/Fork controllers; the controllers generate the enable signals to the data latches]

Variable Latency Units
[Figure: unit taking [0 - k] cycles, with go, done and clear signals and Valid/Stop (V/S) interfaces]

Globally-asynchronous Locally-synchronous
GALS

SoC design with GALS
[Figure: SoC with DSP, P and Mem on a Fast Bus and a Slow Bus, clocks CLK1, CLK2, CLK3, and CDC bridges]
• Most IPs are synchronous
• Different components may have different operating frequencies
• Some components have variable latencies (e.g., cache hit/miss latency)
• Multiple clock domains are essential

Multiple clock domains
• Independent clocks: CLK0, CLK1, CLK2, CLK3
• Rational clock frequencies: CLK (f0) with ratios f1/f0, f2/f0, f3/f0
• Single clock (mesochronous), with controllable skew

Synchronous handshakes
[Figure: Sender (CLK1) and Receiver (CLK2) connected by Data, Valid and Ack]
• The arrival of data is unpredictable
• Handshakes solve the problem

The problem: metastability
[Figure: flip-flop D/Q clocked by ФR receiving data launched by ФT; D changes inside the setup/hold window and Q is undetermined]

How long does it take to resolve metastability?
[Figure: resolution of a metastable state over time]
MTBF: Mean Time Between Failures

Classical synchronous solution
[Figure: chain of flip-flops (D/Q) between the ФT and ФR clock domains]
Mean Time Between Failures:
MTBF = e^(tr/τ) / (W · fФ · fD)
• fФ: frequency of the clock
• fD: frequency of the data
• tr: resolve time available
• W: metastability window
• τ: resolve time constant
Example:
# FFs | MTBF
1 FF  | 15 min
2 FF  | 9 days
3 FF  | 23 years

Handshake with synchronizers
[Figure: Sender (CLK1) and Receiver (CLK2) connected by Data, Valid and Ack through synchronizers]
• Simple solution
• Throughput can be highly degraded: a long round trip for every transaction

Asynchronous FIFOs
[Figure: circular buffer with FIFO control between the Clk In and Clk Out domains; Data, Valid and Ack on each side]
• Ack is issued as soon as data has been delivered
• No impact on throughput (1 token/cycle)
• Min latency determined by the internal synchronizers
• Some tricky structures for the FIFO pointers (e.g. Gray encoding)

SoC design with GALS
[Figure: SoC with DSP, P and Mem on a Fast Bus and a Slow Bus, clocks CLK1, CLK2, CLK3, and CDC bridges]
• Bridges for Clock Domain Crossing usually contain asynchronous FIFOs
• Latency cost only when interfacing with synchronous domains
• No latency penalty between asynchronous domains

Conclusions
• Elasticity offers flexibility in time
  – Modularity
  – Dynamic adaptability
  – Tolerance to variability
• Better optimization of power/performance
• Why isn't it an important trend in circuit design?
  – Lack of commercial EDA support (timing sign-off)
  – Designers do not feel comfortable with "unpredictable" timing
  – Other aspects: testing, verification, …
• De-synchronization might be a viable solution

Bibliography
• Carmona, Cortadella, Kishinevsky and Taubin, Elastic Circuits, IEEE Trans. on CAD, Oct. 2009.
• Beerel, Ozdag and Ferretti, A Designer's Guide to Asynchronous VLSI, Cambridge University Press, 2010.
• Sparso and Furber, Principles of Asynchronous Circuit Design: A Systems Perspective, Kluwer, 2001.
• Myers, Asynchronous Circuit Design, John Wiley & Sons, 2001.