

Presentation to 2004 Workshop on Extreme Supercomputing Panel:

## **Roadmap and Change**

#### How Much and How Fast

#### **Thomas Sterling**

California Institute of Technology and NASA Jet Propulsion Laboratory

October 12, 2004



#### 2<sup>9</sup> Years Ago Today






## Linpack Zettaflops in 2032





#### **Architectures / Systems**



![](_page_4_Picture_0.jpeg)

## The Way We Were: 1974

- IBM 370 market mainstream
  - Approx. 1 Mflops
- DEC PDP-11 geeks' delight
- Seymour Cray started working on Cray-1
  - Approx. 100 Mflops
- 2<sup>nd</sup> generation microprocessor
  - e.g. Intel 8008
- Core memory
- 1103 1Kx1 DRAM chips
- Punch cards, paper tapes, teletypes, selectrics
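As a rough sketch of "how much and how fast", the snippet below computes the doubling time implied by two figures from the slides: approximately 100 Mflops for the Cray-1 and a Zettaflops in 2032. Treating the Cray-1 figure as the 1974 high-end reference point, and assuming smooth exponential growth between the two endpoints, are simplifications made here for illustration. The function name `doubling_time_years` is hypothetical.

```python
import math


def doubling_time_years(start_flops, end_flops, start_year, end_year):
    """Years per performance doubling, assuming pure exponential growth."""
    doublings = math.log2(end_flops / start_flops)
    return (end_year - start_year) / doublings


# Endpoints taken from the slides (assumed to be comparable peak figures):
# ~100 Mflops in 1974 and 1 Zettaflops (1e21 flops) in 2032.
implied = doubling_time_years(100e6, 1e21, 1974, 2032)
print(f"implied doubling time: {implied:.2f} years")
```

Under these assumptions the implied doubling time comes out somewhat above one year, which gives a feel for the sustained rate of change the roadmap requires.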

![](_page_4_Picture_12.jpeg)

![](_page_4_Picture_13.jpeg)

![](_page_5_Picture_0.jpeg)

## **What Will Be Different**

- Moore's Law will have flatlined
- Nano-scale atomic level devices
  - Assuming we solve lithography problem
- Local clock rates ~100 GHz
  - Fastest today is > 700 GHz
- Local actions strongly preferential to global actions
- Non-conventional technologies may be employed
  - Optical
  - Quantum dots
  - Rapid Single Flux Quantum (RSFQ) gates

![](_page_5_Picture_12.jpeg)

![](_page_5_Picture_13.jpeg)

![](_page_6_Picture_0.jpeg)

### What we will need

- 1 nano-watt per Megaflops
  - Energy received from Tau Ceti (per m<sup>2</sup>)
- Approximately 1 square meter for 1 Zettaflops ALUs
  - 10 billion execution sites
- > 10 billion-way parallelism
- Including memory and communications: 2000 m<sup>2</sup>
- 3-D packaging (4m)<sup>3</sup>
- Global latency of ~ 10,000 cycles
- Including average latency, => 1 trillion-way parallelism
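The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes one floating-point result per execution site per clock cycle at the ~100 GHz local clock rate quoted earlier; that assumption is not stated in the slides. The "implied average latency" is simply the ratio of the stated trillion-way parallelism to the number of sites, an inference rather than a figure from the talk.

```python
import math

TARGET_FLOPS = 1e21             # 1 Zettaflops
CLOCK_HZ = 100e9                # ~100 GHz local clock (from the slides)
WATTS_PER_MFLOPS = 1e-9         # 1 nano-watt per Megaflops (from the slides)
ALU_AREA_M2 = 1.0               # ~1 square meter of ALUs (from the slides)
GLOBAL_LATENCY_CYCLES = 10_000  # global latency (from the slides)
STATED_PARALLELISM = 1e12       # "1 trillion-way" (from the slides)

# Assumption: each execution site retires one flop per clock cycle.
sites = TARGET_FLOPS / CLOCK_HZ

# Total power if every Megaflops costs one nano-watt.
power_watts = TARGET_FLOPS / 1e6 * WATTS_PER_MFLOPS

# Side length of the square patch available to each execution site.
site_side_m = math.sqrt(ALU_AREA_M2 / sites)

# Average latency (in cycles) that would make the stated parallelism
# exactly cover the waiting time; an inference, not a quoted figure.
implied_avg_latency_cycles = STATED_PARALLELISM / sites

# Parallelism needed if every operation paid the full global latency.
worst_case_parallelism = sites * GLOBAL_LATENCY_CYCLES

print(f"execution sites:         {sites:.3g}")
print(f"total power (W):         {power_watts:.3g}")
print(f"site side length (m):    {site_side_m:.3g}")
print(f"implied avg latency:     {implied_avg_latency_cycles:.3g} cycles")
print(f"worst-case parallelism:  {worst_case_parallelism:.3g}")
```

Under these assumptions the 10 billion execution sites and the trillion-way parallelism are mutually consistent if the average latency is about 100 cycles, well below the 10,000-cycle global figure, which fits the earlier point that local actions are strongly preferred to global ones.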

![](_page_6_Picture_11.jpeg)

![](_page_7_Picture_0.jpeg)

#### Parcel Simulation Latency Hiding Experiment

![](_page_7_Figure_2.jpeg)

![](_page_8_Picture_0.jpeg)

#### Latency Hiding with Parcels with respect to System Diameter in cycles

Sensitivity to Remote Latency and Remote Access Fraction, 16 Nodes

deg\_parallelism in red (pending parcels @ t=0 per node)

![](_page_8_Figure_3.jpeg)


![](_page_9_Picture_0.jpeg)

#### Latency Hiding with Parcels Idle Time with respect to Degree of Parallelism

Idle Time/Node (number of nodes in black)

![](_page_9_Figure_3.jpeg)
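The qualitative trend in these plots (idle time falling as more parcels are pending per node) can be illustrated with a simple saturation model of the kind commonly used for multithreaded processors. This is not the simulation behind the figures; the function `idle_fraction` and all parameter values below are hypothetical and chosen only for illustration.

```python
def idle_fraction(parallelism, work_cycles, remote_fraction, remote_latency):
    """Fraction of cycles a node sits idle in a simple saturation model.

    Each parcel computes for work_cycles and then waits, on average,
    remote_fraction * remote_latency cycles for a remote access. With
    `parallelism` parcels pending, the node can be busy for at most
    parallelism * work_cycles out of every (work + expected wait) cycles.
    """
    period = work_cycles + remote_fraction * remote_latency
    utilization = min(1.0, parallelism * work_cycles / period)
    return 1.0 - utilization


# Hypothetical parameters: 10 cycles of work per parcel, 25% remote
# accesses, 1000-cycle remote latency.
for pending in (1, 4, 16, 64):
    idle = idle_fraction(pending, 10, 0.25, 1000)
    print(f"pending parcels = {pending:3d}  idle fraction = {idle:.3f}")
```

In this model idle time drops roughly linearly with the number of pending parcels until the node saturates, after which additional parallelism hides no further latency.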

![](_page_10_Picture_0.jpeg)

### **Architecture Innovation**

- Extreme memory bandwidth
- Active latency hiding
- Extreme parallelism
- Message-driven split-transaction computations (parcels)
- PIM
  - e.g. Kogge, Draper, Sterling, ...
  - Very high memory bandwidth
  - Lower memory latency (on chip)
  - Higher execution parallelism (banks and row-wide)
- Streaming
  - Dally, Keckler, ...
  - Very high functional parallelism
  - Low latency (between functional units)
  - Higher execution parallelism (high ALU density)
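A toy sketch of message-driven split-transaction computation follows. It assumes a parcel carries a destination, an action to perform on data at that destination, and an optional continuation saying where the result should go; the `Parcel`, `Node`, and `run` names and the two actions are hypothetical and are not taken from any actual parcel specification. The requester never blocks: it emits a parcel, and the result arrives later as another parcel.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Parcel:
    dest: int                                  # node that owns the target data
    action: str                                # operation to run at the destination
    addr: str                                  # name of the target datum
    operand: float = 0.0
    reply_to: Optional[Tuple[int, str]] = None  # continuation: (node, addr)


class Node:
    def __init__(self, node_id, memory):
        self.node_id = node_id
        self.memory = dict(memory)

    def handle(self, parcel):
        """Apply the parcel's action to local data; return a reply parcel or None."""
        if parcel.action == "add":
            self.memory[parcel.addr] += parcel.operand
        elif parcel.action == "store":
            self.memory[parcel.addr] = parcel.operand
        else:
            raise ValueError(f"unknown action: {parcel.action}")
        if parcel.reply_to is None:
            return None
        node, addr = parcel.reply_to
        # The second half of the split transaction: ship the result onward.
        return Parcel(dest=node, action="store", addr=addr,
                      operand=self.memory[parcel.addr])


def run(nodes, parcels):
    """Deliver parcels in FIFO order until no work remains."""
    queue = deque(parcels)
    while queue:
        parcel = queue.popleft()
        reply = nodes[parcel.dest].handle(parcel)
        if reply is not None:
            queue.append(reply)


nodes = {0: Node(0, {"result": 0.0}), 1: Node(1, {"x": 40.0})}
run(nodes, [Parcel(dest=1, action="add", addr="x", operand=2.0,
                   reply_to=(0, "result"))])
print(nodes[0].memory["result"])
```

The work moves to the data on node 1, and node 0 is free to process other parcels while the request is in flight, which is the property that lets parcels hide latency.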

![](_page_11_Picture_0.jpeg)

# **Continuum Computer Architecture**

- Merges state, logic, and communication in single building block
- Parcel driven computation
  - Fine grain split transaction computing
  - Move data through vectors of instructions in store
  - Move instruction stream through vector of data
  - Gather-scatter an intrinsic
  - Very efficient *Futures* for producer/multi-consumer computing
- Combines strengths of PIM and Streaming
  - All register architecture (fully associative)
  - Functional units within a cycle of neighbors
  - Extreme parallelism
  - Intrinsic latency hiding
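The *futures* idea mentioned above can be sketched in software as a cell that records which consumers asked for a value before it was produced and releases all of them when the producer writes it. This is a minimal illustration of the general concept, not a description of how CCA implements futures; the `FutureCell` class and its `get`/`put` methods are hypothetical names.

```python
class FutureCell:
    """A write-once cell with a list of consumers waiting for its value."""

    def __init__(self):
        self._ready = False
        self._value = None
        self._waiting = []  # consumers that arrived before the value

    def get(self, consumer):
        """Deliver the value to `consumer` now, or queue it until put()."""
        if self._ready:
            consumer(self._value)
        else:
            self._waiting.append(consumer)

    def put(self, value):
        """Producer side: store the value and release every waiting consumer."""
        if self._ready:
            raise RuntimeError("future already resolved")
        self._ready = True
        self._value = value
        waiting, self._waiting = self._waiting, []
        for consumer in waiting:
            consumer(value)


results = []
cell = FutureCell()
cell.get(lambda v: results.append(("a", v)))      # arrives early, is queued
cell.get(lambda v: results.append(("b", v * 2)))  # arrives early, is queued
cell.put(21)                                      # single producer resolves it
cell.get(lambda v: results.append(("c", v + 1)))  # arrives late, runs at once
print(results)
```

No consumer ever polls or blocks: early requests are parked at the data and completed by the producer's write, which matches the one-producer, many-consumer pattern in the list above.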


Cover of Communications of the ACM, March 2001, Volume 44, Number 3: "The Next 1,000 Years". Thomas Sterling is among the listed contributors.

![](_page_12_Figure_1.jpeg)

![](_page_12_Figure_2.jpeg)

#### Continuum Computer Architecture for Exaflops Computation

THOMAS STERLING

The ultimate computers in our long-term future will deliver exaflops-scale performance (or greater) and will look very different from today's microprocessors and massively parallel computers. Ironically, however, their alien structures and operational behavior can be inferred from the same technology trends driving development of today's conventional computing systems.

A vision of future computer architectures that are direct extrapolations of current trends is easily inspired by the explosive growth of today's computer performance, price-performance, and applications (driven by Moore's Law for device technology), as well as the more dramatic paradigm shifts brought on by the Internet, the Web, and grids. Yet an examination of these trends also reveals the possibility of something quite different in how we'll organize, design, and fabricate our largest computers in the future. They even set the stage for a revolution in computer architecture that may displace the venerable and highly successful "von Neumann model" and its predominance over the past 50 years.

One class of innovative computing system being explored today by computer scientists at the California Institute of Technology's Center for Advanced Computing Research is the continuum computer architecture (CCA), an ultra-fine-grain uniform structure that approximates a continuous 3D execution medium enabled through next-generation submicron logic devices. Future computers (whether major exaflops engines used to design and simulate everything from controlled fusion reactors to rapid-response medicines, to compact low-power robot brains for autonomous control of spacecraft, airplanes, automobiles, and homes, to embedded smartware in our clothes and bodies) may look less like today's microprocessors and much more like CCAs.

Several concurrent trends in semiconductor and other technologies will force a rethinking of the physical structure and logical operation of parallel computer systems. Lithographic feature size will be driven below 0.05 microns

Communications of the ACM, March 2001, Vol. 44, No. 3, p. 78

![](_page_12_Figure_10.jpeg)

![](_page_12_Figure_11.jpeg)

![](_page_12_Figure_12.jpeg)


![](_page_13_Picture_0.jpeg)

### Conclusions

- Zettaflops at nano-scale technology is possible
  - Size requirements tolerable
    - But packaging is a challenge
  - Latency challenge does not sink the idea
- Major obstacles
  - Power
  - Latency
  - Parallelism
  - Reliability
  - Programming
- Architecture can address many of these
- Continuum Computer Architecture
  - Combines advantages of PIM and streaming
  - Strong candidate for future Zettaflops computer

![](_page_14_Picture_0.jpeg)