First Session (9/4/02)
-------------

1: Interlock in Synchronous Design
   Modern non-clock-gated microprocessors use 80% of total power
     90% in local clock buf's + latches
   Clock gating saves 60-80%
     work at pipeline stage for best savings
   ---> valid   Forward interlock
   <--- stall   Backward interlock
   Back interlock has a problem: data is overwritten on stall,
     so need a bubble to hold incoming data
     but bubble moves backwards at speed of clock; not efficient!
   Solution 1: "optional" buffers in parallel which can be bypassed
     when pipeline not stalled; not at all efficient!
   Solution 2: use features of master-slave flip-flops, which already
     have a bubble inside anyway.  Switch latches between being opaque
     and being transparent.  Arrange for holes to combine with stall
     signals so valid values can catch up; more efficient.
   Generalisable
     Fork/Join/Branch/Select
   Savings:
     Unit level:        64%
     Local valid:       77%
     Local valid/stall: 82%
   Way to transfer between synch and bundled data async
   Can precalculate the valid bit

2: QDI Asynchronous Pipelines for High Speed
   [This was a rather low-level talk; didn't manage to take notes so
    see proceedings for details.]

3: Energy Efficient Pipelines
   E\tau^\alpha
     \alpha=2            V-insensitive measure
     \alpha->\infinity   optimise for throughput
     \alpha->0           optimise for energy-efficiency
   Work at pipeline level for generalisation and tractability
   \tau = pipeline cycle time
   See paper for details

[Notes of overheard talk at lunch]
   P.Riocrieux talked w. E.Grass about poss. collaboration on mixing
   async w. high speed wireless comms ("Hiperlink"?)
   IHP have lots of Si-Ge process experience, and AMU have lots of
   async experience

Second Session
--------------

4: Negative-Overhead Self-Timed Pipeline [BEST PAPER]
   Surfing Pipelines
   zero- or less-delay overhead
   Use analogue features to boost performance
   Hold pulses on a wave, speeding slow and slowing fast
     (well, not really; set wave freq.
      to slightly accelerate fastest)
   Variable delay gates; "crummy buffer"

5: Event-Spacing Experiment [MY FAVOURITE]
   Why bursting in self-timed pipeline rings?
   Standard timing models unrealistic and/or sloppy
   Charlie diagram better, but can't exhibit bursting!
   Drafting allows for slip-streaming of signals; soon signals tend
     to catch up (so generate 3D charlie surface; new axis = time
     between pulses)
   Do analysis to tell where attractors are on charlie surface
   Constant forward delay (maintains bursts) produces 2D graph; also
     convert so that first signal always arrives at time 0 (wlog)
   Diagram in high-drafting case
     [Diagram: curve with two example pulse trains,
        ...o...........o...o...   <-- burst mode
        ......o.....o..........   <-- evenly-spaced mode
      -ve gradient = stable inter-pulse timing
      +ve gradient = unstable inter-pulse timing]
   Can have metastable timing w. medium drafting
   There is hysteresis about
   Testing theory: control drafting with crummy buffers
        burst   |   either   |   even
              107uA        132uA        <- ctrl current
   Also unstable bursts when approaching from burst mode

KEYNOTE
-------

6: ARM Arch: Startup -> Global Standard
   Started: "Want to be Global RISC Standard"
   Had to escape Apple+Acorn
   Get profit habit early
   Long hours
   Keep innovating but iterations/derivatives of old products: new
     products drive profit
   R+D:                   60%
   Sales+Marketing:       27%
   Other (General+Admin): 13%
   ARM chip was small (low power) die size + lowest cost
   S(trengths)W(eaknesses)O(pportunities)T(hreats)
   Higher Perf / Lower S??? cost / Shorten design time / ?
   Communication + Teamwork w. Global Partners
   ARM all over: wireless, imaging, set-top boxes, digital audio,
     new storage apps, 32-bit Secure Processor leader (smart card
     cores), Networking, Automotive
   Platform strategy: not fabless chip
   PrimeXsys integrated-platform solution, partners offer ???
   Tools can be sold to customers' customers
            Licensing   Royalties   Dev.Sys
     2000      45%         25%        14%
     2001      52%         19%        16%
   China v.
     important market (due to size)

Third Session
-------------

7: Clock Synch using Handshakes
   GALS = Globally Asynchronous, Locally Synchronous
   2 types:
     Data sync
       metastable safety vs low latency
       consistent multibit words
     Clock sync
       pausible ring oscillator generates clocks
   v. Arb ??? HS passive active pausible low phase either ?
   Bus + Xbar configurations possible (see proceedings)

8 - 10: missed

KEYNOTE (10/4/02)
-------

11: Nanomagnetic Logic Devices
    (One possible future technology)
    Microelectronics without transistors
    Magnetism:
      All-metallic - smaller physical length scales (in limit)
      Non-volatile
      Radiation hard
      Low power
    "Spintronics"
      Memory => near to production: MRAM
      Logic  => ?
    Quantum Cellular Automata
      [Diagrams: cell containing single electrons; the two cell
       states, labelled 0 and 1; cells couple to neighbours; can do
       chaining; majority gate]
      Problems: needs to be very cold!
                logic is not directed; results can flow back!
    Magnetic QCA:
      60nm nanomagnets can act as QCA
      Very strong dipoles; can work at room temp.
      Soliton transmission; info moves virtually effortlessly
      Operating margin too small; too hard to make
    Global rotating magnetic clock
    Domain Wall Pipeline
      [Diagram: track with corners]
      Signals can only propagate round corners with same chirality
        as field rotation
    NOT gate:
      [Diagram: NOT-gate track shape; labels: 60, 200nm track
       width, 1um]
      Adds 0.5 cycle delay
    Chain to get shift register
    Can do crossover (w. slight offset)
    Can do fanout (w.
      correct tapering)
    Good: All metal => good scaling
          Single plane => cheap to make
          Any substrate => flexible
          Have gain
          Full logic (soon)
          Energy efficiency better than CMOS
          Reasonable heg(??) density
    Bad:  not very fast
          difficult to apply mag field efficiently
          no topological invariance of circuits
          edge roughness needs controlling

Fourth Session
--------------

12: Probabilistic Timing Analysis etc.
    Mean+variance -> moments of delays (abstraction of real
      distribution, but is two vals instead of a full distribution!)
    Timing separations -> interface verification
                          performance analysis
                          synthesis optimisations
    (Bounded delays: min+max sep)
    Probabilistic delays: prob. that sep exceeds value
    Usually done with Monte Carlo simulation
    Determining gate distributions is *hard*
    Can do better with just limited info?
    Focus on acyclic sequence charts (unroll cyclic graphs a few
      times!)
    Want to keep arguments to max() independent
    Eye "diagram" says when impossible (non-terminal reachable by
      two routes)
      [Diagram: eye-shaped graph]
    Can convert using forward splitting or backward splitting
      Fwd:  dup upper node+extra edge to end
      Back: dup lower node+extra edge from start
    Forward tends to overexpand.
    Better to use intermediate result (like backward split)
    Forward split is like backward split on reversed graph.
    MC can get closer to reality if you know distributions, but
      needs more info.

13: Fine-Grained Timing Constraints
    DI / QDI / SDI
          |
          V
    fast paths + slow paths order relations
    Fine grain pipelining reduces overhead for idle
    Constraint 1: data read at least once
    Constraint 2: data read at most once
    Deriving rules for automatically deriving timing rules

14: Relative Timing Based Verifications of Timed Circuits + Systems
                                           ^^^^^^^^^^^^^^
                                           anything with timing constraints
    RTV
    Reduce complex
    (this speaker talked too fast to follow!)

Fifth Session
-------------

15: Asynch.
    Circuit Synthesis by Direct Mapping: Interfacing to Environment
    [David Cells: HW models of petri-net places]
    Do STGs, not petri-nets
    Read arcs simpler than consuming arcs
      use read arcs
    ripe. signals [what does this mean!!?]
    Based on STG decomp:
      output exposition (elementary cycles)
      environment tracking (explicit context)
      tracking system optimisation

16: Architectural Design of FIFO
    The baroque buffer!
    High-level talk
    Math heavy: refer to paper
    LBuf = linear buffer, RBuf = rectangular buffer
    Conclusions:
      Simple calculus to verify functional correctness
      Accurate performance analysis based on seq function
      Rectangular buffers are (\kappa,\delta)-optimal

17: Checking Delay-Insensitivity: 10**4 Gates and Beyond
    Using NULL-convention logic
    Local Fundamental Mode simplifies
    No dynamic analysis
    SAT-solvers can handle large combinatorial logic blocks

KEYNOTE (11/4/02)
-------

18: Mobile Applications
    (given by Director of Strategy for Orange)
    (Structured as A-Z; lame and didn't make for particularly good
     note taking; but an enjoyable person to listen to, particularly
     for his asides...)
    Avatar (Ananova)
      Speech is intuitive interface
    Business Apps (Orange focus on SMEs + SOHO)
      more latitude w. business cases; help direct what consumers
      are offered
    Comms+Communities (core business is voice)
    Design/Interface (poor interface limits SMS usage)
    Environment Control (wireless capability in white/brown goods)
      ubiquity important
    Financial Services; wireless operator bridges gap to customer
      people terrified of change
    Games; good, but tricky because of large games companies
      phones not good general data device
    Health; continual monitoring (T-shirt with bluetooth)
    Infotainment; not profitable except through transport charges
    Junk filtering
    Killer environment - part of fabric of society
    Learning
    Music; distribution to any device
    New media; multimodal is crucial (gave financial example)
    Open ecosystem
    Peer-to-Peer
    Questions (search engines/question answering engine) - good
      interface
    Remote control for life
    SMS (instant messaging too)
    Time management
    User-friendly
    Voice driven - HARD
    Wireless internet
    X marks the spot - location *sensitive* services are good, but
      location *based* are not so good (too intrusive); push is
      obviously not popular, opt-in only
    Why? Make money! [He was running out of ideas here...]
    Zzzzz.... [and worse here!]

Sixth Session
-------------

19: Adding Synchronous and LSSD Modes to Asynchronous Circuits
    Test Requirements
    IP Integration
      At best 99% (single) stuck at coverage
      Bridging fault patterns post-layout
    Q100
      98% stuck-at
      95% SA + random iddq
      90% SA + dedicated iddq
    targeted at rest
    only synch clocked feedback
    only FF as storage elements
    avoid redundant gate (?)
    avoid gated FFs
    Never U /\ D
    Comb. if U = ¬D, seq.
      otherwise
    make enabled by
      U --> U /\ \phi
      D --> D /\ \phi
    (See diagrams in paper)
    C-element
      X = S + (R * Z)

20: Testing of Asynchronous Designs by "Inappropriate" Means:
    Synchronous Approach
    Simple timing }
    HDL frontend  } NCL still feasible
    Simple tools  }
    BAD:  Not high speed, significant area overhead
    GOOD: Low EMI, High security, Low power (?)
    Acyclic NCL pipelines are 100% testable
      Free observability, 0 overhead
    Cyclic
      all SF(?) stall at acks
      but has overhead

21: Functional Test Methodology for GALS Systems
    Local clock generators }
    Bus support            } Areas to improve
    Testability            }
    Test has problems: glitches / redundancy / difficult to halt in
      a particular state
    WANT
      Enough fault test for wrappers
      Verify synch island / wrapper comms
      Verify intermodule comms (GALS)
    Testing method
      No big changes to synchronous areas
      Synchronous fallback model
      Stuck at fault model (adapted)
    Probs:
      Difficult pat. gen.
      Performance penalty
    Delay fault test
      Hazard free
    Functional test - watch what's happening
    Stuck at fault based approach impractical
    Functional test methodology
      Design+test automation script
      v. modular design, easily integrated

22: Timing Measurement + Test
    Testers very expensive for deriving setup+hold times
    Ver+test of timing constraints
      OK on chip?  W. temp?  If not, can we change them?
    Need
      adjustable delay
        - inverter chains
        - inverter matrix better
      accurate time measurement
      test strategies
    | MUTEX 90% within 5ps
    | add resistors, slowing but balancing
    | combine w.
      delays to get better time measurement
    Delays set to 10% inverter delay
    Timing accuracy better

Seventh Session
---------------

23: Improving Smart Card Security Using Self-Timed Circuits [BEST TALK]
    Attack Technologies
      Invasive - leaves tamper evidence
        -> design, algorithmic content + keys
        Focussed Ion Beam
        Optical Reverse Engineering
        Litigation Attack
        Hence, only hide the key
      Non-invasive - side-channel attacks
        power emissions timing EMI
        fault induction (power/clock glitches)
          clock glitches can convert branches into NOPs
          power glitches can stop EEPROM writes
        no evidence left
    Defense Technologies
      Use dual-rail encoding of signals, with 11 (both high) to
        indicate and propagate tampering.  This also balances gates.
      Insert random delays to stop averaging detection
      Memory: encrypt with session key for data (can be done with
        cost similar to memory access time) and use 4-rail for
        address bus
      Balanced dual-rail
        less data dependent power
      Some data dependent timing if care not taken

24: SPA
    Written in Balsa
    CHAIN
      not bus but network-on-chip
    Security *costs* power, performance and area
    Balsa extremely productive

KEYNOTE
-------

25: From Research to Mainstream
    Tech vision:
      full custom perf in ASIC design time
      arbitrary large designs w. hierarchy
    Business vision:
      build sustainable business in async
    Business model:
      fabless chip company
    Liberal licensing can lead to more tech transfer
    University resources
      - networking
      - source of employees
    Establish core values
    Brainstorm+filter
      engage with customers
    GET NOTICED
      Industry traction
      Customer traction
      Investor traction
    Finance
      - Capital
          - raise more than you need!
          - seek repeat investors
          - right VCs worth it
      - Control burn
          - reduce project scope
          - invent min. (use commercial tools)
          - don't ramp up fast
      - Takes 5 years to build chip co. at least
    Shippable silicon
      - QA
      - Functionality
      - Performance
      - Product engineering post tape
          3mo characterise
          12-18mo validate
      - Manufact(???)
          Test not critical, but must control
    Design:
      CAST lasy[?]
        spans multiple levels of abstraction
      CSP decomposition
      2D pipelining
      CSP -> PRS (production rules?)
      PRS -> SPICE netlist
      Semi-custom layout
      CAST cells have multiple descriptions (at several levels)
        include tests for the cells
    Globally DI, Locally QDI
    Interested in formal verification, CSP vs. CSP vs. Production Rules
    www.fulcrummicro.com
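
[Sketch, not from any talk] To pin down what "production rules" (PRS) look like, here is a minimal example using the C-element noted under talk 19. The two rules (a & b -> c+, ~a & ~b -> c-) are the usual textbook C-element. Reading the talk-19 equation X = S + (R * Z) as S = a AND b, R = a OR b, Z = previous output is an assumption, and all function names below are made up for illustration.

```python
from itertools import product


def c_element_prs(a: bool, b: bool, c: bool) -> bool:
    """One evaluation of the two production rules.

    a & b   -> c+   (pull the output up when both inputs are high)
    ~a & ~b -> c-   (pull the output down when both inputs are low)
    Otherwise neither rule fires and the output holds its state.
    """
    if a and b:
        return True
    if not a and not b:
        return False
    return c


def c_element_eq(a: bool, b: bool, z: bool) -> bool:
    """Next-state equation X = S + (R * Z).

    Assumed reading: S = a AND b (set), R = a OR b (hold),
    Z = previous output.
    """
    s = a and b
    r = a or b
    return s or (r and z)


def run(inputs, c=False):
    """Feed a sequence of (a, b) pairs through the C-element."""
    trace = []
    for a, b in inputs:
        c = c_element_prs(a, b, c)
        trace.append(c)
    return trace


if __name__ == "__main__":
    # The two descriptions agree on all 8 input/state combinations.
    for a, b, z in product((False, True), repeat=3):
        assert c_element_prs(a, b, z) == c_element_eq(a, b, z)
    # Output rises only when both inputs are high, falls only when both low.
    print(run([(True, False), (True, True), (False, True), (False, False)]))
```

Under that reading the rule form and the equation form describe the same gate; the equation makes the state-holding feedback explicit, while the rules only say when the output is driven.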