# CS 356 Unit 0

Class Introduction and Basic Hardware Organization

## What is This Course About?

- Introduction to Computer Systems
  - a.k.a. Computer Organization or Architecture
- Filling in the "systems" details
  - How is software generated (compilers, libraries) and executed (OS, etc.)?
  - How does computer hardware work and how does it execute the software I write?
- Lays a foundation for future CS courses
  - CS 350 (Operating Systems), ITP/CS 439 (Compilers), CS 353/EE 450 (Networks), EE 457 (Computer Architecture)

## Today's Digital Environment

Figure: a stack of abstraction layers. Labels recovered: Applications; Networks; C++ / Java; Algorithms; "Our Focus in CS 356"; GPU / FPGAs; Digital Logic; Transistors / Circuits; Voltage / Currents.

## Why is System Knowledge Important?

- Increase productivity
  - Debugging
  - Build/compilation
- High-level language abstractions break down at certain points
- Improve performance
  - Take advantage of hardware features
  - Avoid pitfalls presented by the hardware
- Basis of understanding security and exploits



## Reality 1: Abstraction vs. Reality

- Abstraction is good until reality intervenes
  - Bugs can result
  - It is important to understand the underlying HW implementations
  - Sometimes abstractions don't provide the control or performance you need
- ints are not integers and floats aren't reals
- Is $x^2 \ge 0$?
  - Floats: Yes
  - Ints: Not always
    - 40,000 \* 40,000 = 1,600,000,000
    - 50,000 \* 50,000 = -1,794,967,296
- Is $(x+y)+z = x+(y+z)$?
  - Ints: Yes
  - Floats: Not always
    - (1e20 + -1e20) + 3.14 = 3.14
    - 1e20 + (-1e20 + 3.14) = around 0

## Reality 2: Knowing some assembly is critical

- You'll probably never write much (any?) code in assembly, as compilers are often better than even humans at optimizing code
- But knowing assembly is critical when
  - Tracking down some bugs
  - Taking advantage of certain HW features that a compiler may not be able to use
  - Implementing system software (OS/compilers/libraries)
  - Understanding security and vulnerabilities

## Reality 3: Memory matters!

- Memory is not infinite
- Memory can impact performance more than computation for many applications
- Source of many bugs, both for single-threaded and especially parallel programs
- Source of many security vulnerabilities
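The two Reality 1 questions can be checked directly. The following is a minimal sketch, not course material: Python integers never overflow, so the hypothetical helper `mul_int32` uses `ctypes.c_int32` to emulate the wraparound of 32-bit two's-complement hardware ints. Python floats are IEEE 754 doubles, so the float lines behave like a C `double`.

```python
import ctypes


def mul_int32(a: int, b: int) -> int:
    """Multiply a and b, keeping only the low 32 bits as a signed value.

    ctypes.c_int32 does no overflow checking, so the result wraps the way
    32-bit two's-complement arithmetic does (helper name is hypothetical).
    """
    return ctypes.c_int32(a * b).value


# Is x*x >= 0? Not always for 32-bit ints.
print(mul_int32(40_000, 40_000))  # 1600000000 (fits in 32 bits)
print(mul_int32(50_000, 50_000))  # -1794967296 (exceeds 2**31 - 1 and wraps)

# Is (x + y) + z == x + (y + z)? Not always for floats.
left = (1e20 + -1e20) + 3.14      # the large values cancel first
right = 1e20 + (-1e20 + 3.14)     # 3.14 is lost when added to -1e20
print(left, right)                # 3.14 0.0
```

The float result differs because 3.14 is far smaller than the spacing between adjacent doubles near 1e20, so adding it to -1e20 changes nothing.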



- There's more to performance than asymptotic complexity
  - Constant factors matter!
  - Even operation counts do not predict performance
    - How long an instruction takes to execute is not deterministic: it depends on what other instructions have been executed before it
  - Understanding how to optimize for the processor organization and memory can lead to up to an order of magnitude performance increase
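To illustrate why operation counts alone do not predict performance, the sketch below runs two traversals of the same array through a toy direct-mapped cache model and counts misses. Everything here is an assumption made for illustration: the function `count_misses`, the 64-line by 64-byte cache, and the 256x256 array of 4-byte elements are hypothetical, and real processors are more complex.

```python
def count_misses(addresses, num_lines=64, line_bytes=64):
    """Count misses in a toy direct-mapped cache (sizes are illustrative)."""
    tags = [None] * num_lines          # block number currently held by each line
    misses = 0
    for addr in addresses:
        block = addr // line_bytes     # which memory block holds this byte
        index = block % num_lines      # the one cache line this block may occupy
        if tags[index] != block:       # miss: fetch the block into that line
            tags[index] = block
            misses += 1
    return misses


ROWS, COLS, ELEM_BYTES = 256, 256, 4   # hypothetical row-major array of 4-byte ints

# Same elements, same number of accesses, different visiting order.
row_order = [(r * COLS + c) * ELEM_BYTES for r in range(ROWS) for c in range(COLS)]
col_order = [(r * COLS + c) * ELEM_BYTES for c in range(COLS) for r in range(ROWS)]

row_misses = count_misses(row_order)
col_misses = count_misses(col_order)
print(len(row_order), row_misses)      # 65536 4096
print(len(col_order), col_misses)      # 65536 65536
```

In this model both loops do identical work, yet the column-order walk misses 16 times as often, because each access lands in a different memory block than the one its cache line last held.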

Drivers and Trends

USC Viterbi School of Engineering

# COMPUTER ORGANIZATION AND ARCHITECTURE


## Computer Components

- Processor
  - Executes the program and performs all the operations
- Main Memory
  - Stores data and program (instructions)
  - Different forms:
    - RAM = read and write but volatile (lose values when power off)
    - ROM = read-only but non-volatile (maintains values when power off)
  - Significantly slower than the processor
- Input / Output Devices
  - Generate and consume data from the system
  - MUCH, MUCH slower than the processor




## Architecture Issues

- Fundamentally, computer architecture is all about the different ways of answering the question: "What do we do with the ever-increasing number of transistors available to us?"
- The goal of a computer architect is to take the increasing transistor budget of a chip (i.e., Moore's Law) and produce an equivalent increase in computational ability


## Moore's Law, Computer Architecture & Real-Estate Planning

- Moore's Law = Number of transistors able to be fabricated on a chip grows exponentially with time
- Computer architects decide, "What should we do with all of this capability?"
- Similarly, real-estate developers ask, "How do we make best use of the land area given to us?"
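As a rough sketch of what "grows exponentially with time" means for a transistor budget, the function below projects a count forward under a fixed doubling period. The two-year doubling period is the commonly quoted figure and is an assumption here, not a number from these slides; the function name and the starting values are hypothetical.

```python
def projected_transistors(start_count, start_year, year, doubling_years=2.0):
    """Exponential growth: the count doubles every `doubling_years` years.

    The two-year default is an assumed, commonly quoted doubling period.
    """
    return start_count * 2 ** ((year - start_year) / doubling_years)


# Hypothetical chip with 1 million transistors in 2000.
print(projected_transistors(1_000_000, 2000, 2010))  # 32000000.0 (5 doublings)
print(projected_transistors(1_000_000, 2000, 2020))  # 1024000000.0 (10 doublings)
```

Ten years gives a 32x larger budget and twenty years gives about 1000x, which is why the question of how to spend the transistors keeps coming back.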



USC University Park Development Master Plan http://re.usc.edu/docs/University%20Park%20Development%20Project.pdf

## **Transistor Physics**

- Cross-section of transistors on an IC
- Moore's Law is founded on our ability to keep shrinking transistor sizes
  - Gate/channel width shrinks
  - Gate oxide shrinks
- Transistor feature size is referred to as the implementation "technology node"






## Technology Nodes

Figure: Process Technology Node Progression. Feature size (nm, log scale) vs. year (1985 to 2015). Labeled nodes: 1000, 600, 350, 250, 130, 90, 65, 45, 32, 22.
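The link between shrinking feature sizes and transistor counts can be sketched with simple arithmetic. The idealized assumption, which is mine and not stated on the slide, is that the area of a transistor scales with the square of the feature size, so halving the feature size fits about four times as many transistors in the same area. The function name is hypothetical, and the node list is taken from the labels in the technology node figure.

```python
def relative_density(old_nm, new_nm):
    """Idealized gain in transistors per unit area when features shrink.

    Assumes transistor area scales with the square of the feature size.
    """
    return (old_nm / new_nm) ** 2


# Feature sizes (nm) as labeled in the technology node figure.
nodes_nm = [1000, 600, 350, 250, 130, 90, 65, 45, 32, 22]

for old, new in zip(nodes_nm, nodes_nm[1:]):
    print(f"{old} nm -> {new} nm: about {relative_density(old, new):.1f}x density")

print(round(relative_density(nodes_nm[0], nodes_nm[-1])))  # 2066
```

Under this assumption, going from the 1000 nm node to the 22 nm node allows roughly 2000 times as many transistors in the same chip area.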

## Growth of Transistors on Chip

Figure: transistor count (thousands, log scale) vs. year (1975 to 2010). Labeled points: Intel '386 (275K), Intel '486 (1.2M), Pentium (3.1M), (5.5M), Pentium (28M).

## Implications of Moore's Law

- What should we do with all these transistors?
  - Put additional simple cores on a chip
  - Use transistors to make cores execute instructions faster
  - Use transistors for more on-chip cache memory
    - Cache is an on-chip memory used to store data the processor is likely to need
    - Faster than main memory (RAM), which is on a separate chip and much larger (thus slower)

## Memory Wall Problem

- Processor performance is increasing much faster than memory performance

Figure: processor and memory performance (log scale) over time, with the performance gap between them marked. Source: Computer Architecture: A Quantitative Approach (2003) © 2003 Elsevier Science (USA). All rights reserved.

## Cache Example

- Small, fast, on-chip memory to store copies of recently-used data
- When the processor attempts to access data, it will check the cache first
  - If the cache has the desired data, it can supply it quickly
  - If the cache does not have the desired data, it must go to the main memory (RAM) to access it

Figure: two panels, "Cache has desired data" and "Cache does not have desired data", each showing the processor, cache, system bus, and RAM.

## Pentium 4

Figure: Pentium 4, with L2 Cache, L1 Data, and L1 Instruc. labeled.
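The lookup sequence from the cache example (check the cache first, go to RAM only on a miss) can be sketched as a small class. This is a software model for illustration only: `ToyCache`, its four-entry capacity, and the least-recently-used eviction rule are assumptions, and a plain dict stands in for main memory.

```python
from collections import OrderedDict


class ToyCache:
    """Small cache in front of a slower main memory (a dict stands in for RAM)."""

    def __init__(self, ram, capacity=4):
        self.ram = ram
        self.capacity = capacity
        self.lines = OrderedDict()   # address -> data, least recently used first
        self.hits = 0
        self.misses = 0

    def read(self, address):
        if address in self.lines:            # cache has the desired data
            self.hits += 1
            self.lines.move_to_end(address)  # mark as most recently used
            return self.lines[address]
        self.misses += 1                     # cache does not have it: go to RAM
        data = self.ram[address]
        self.lines[address] = data           # keep a copy for next time
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)   # evict the least recently used copy
        return data


ram = {addr: addr * 10 for addr in range(16)}   # hypothetical memory contents
cache = ToyCache(ram)
for addr in [0, 1, 2, 0, 1, 2, 0, 1, 2]:        # a loop re-reading three addresses
    cache.read(addr)
print(cache.hits, cache.misses)                 # 6 3
```

Only the first touch of each address goes to RAM. Every later read is served from the cache, which is why keeping copies of recently-used data pays off for loops.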





## Progression to Parallel Systems

- If power begins to limit clock frequency, how can we continue to achieve more and more operations per second?
  - By running several processor cores in parallel at lower frequencies
  - Two cores @ 2 GHz vs. 1 core @ 4 GHz yield the same theoretical maximum ops./sec.
- For various applications like graphics and computationally intensive workloads this is taken to an extreme by GPUs
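The two-core comparison above is simple arithmetic, sketched here. The function name and the default of one operation per cycle per core are assumptions for illustration.

```python
def peak_ops_per_sec(cores, clock_hz, ops_per_cycle=1):
    """Theoretical maximum operations per second across all cores.

    Assumes every core completes `ops_per_cycle` operations each clock cycle.
    """
    return cores * clock_hz * ops_per_cycle


one_fast = peak_ops_per_sec(cores=1, clock_hz=4e9)   # 1 core @ 4 GHz
two_slow = peak_ops_per_sec(cores=2, clock_hz=2e9)   # 2 cores @ 2 GHz
print(one_fast, two_slow)                            # 4000000000.0 4000000000.0
```

The two peaks match only on paper: the two-core figure is reached only when the work can actually be split across both cores.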

## **GPU Chip Layout**

- 2560 Small Cores
- Upwards of 7.2 billion transistors
- 8.2 TFLOPS
- 320 Gbytes/sec
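A quick calculation puts the figures above in proportion. This sketch assumes the 320 Gbytes/sec figure is memory bandwidth and that the 8.2 TFLOPS peak is spread evenly over the 2560 cores. Neither assumption is stated on the slide.

```python
CORES = 2560
PEAK_FLOPS = 8.2e12          # 8.2 TFLOPS
MEM_BYTES_PER_SEC = 320e9    # 320 Gbytes/sec (assumed to be memory bandwidth)

flops_per_core = PEAK_FLOPS / CORES
flops_per_byte = PEAK_FLOPS / MEM_BYTES_PER_SEC

print(f"{flops_per_core / 1e9:.2f} GFLOPS per core")
print(f"{flops_per_byte:.3f} FLOPs per byte of memory traffic")
```

Each core is modest on its own (about 3.2 GFLOPS), and the chip can do roughly 25 floating-point operations in the time it takes to move one byte from memory.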




Source: NVIDIA

Photo: http://www.theregister.co.uk/2010/01/19/nvidia\_gf100/

