## EE457: Computer Systems Organization Summer 2020 - Midterm Exam 06/23/20, 1:30PM - 3:30PM (Submit by 4:30 p.m)

| Name:            |          |  |
|------------------|----------|--|
| Student ID:      |          |  |
| Email:           | @usc.edu |  |
| Lecture section: |          |  |
| Redekopp         |          |  |
| 1:30 p.m.        |          |  |

[5 points: Complete all the information in the box above.]

| Page | Ques. | Your score | Max score | Recommended<br>Time |
|------|-------|------------|-----------|---------------------|
| 1    | -     |            |           | 0 min.              |
| 2    | 1     |            | 14        | 15 min.             |
| 3-5  | 2     |            | 36        | 45 min.             |
| 6    | 3     |            | 15        | 15 min.             |
| 7    | 4     |            | 8         | 15 min.             |
| 8    | 5     |            | 6         | 15 min.             |
| 9    | 6     |            | 9         | 15 min.             |
|      | Total |            | 88        |                     |

Note: The last page is blank and can be used for scratch paper.

Please turn it in with your exam (for in person exams only)

| 1. | the | question or the selection that makes the state                                                                                                                                                                | pipelined processor, a stall may be required if the                              |  |  |  |  |
|----|-----|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------|--|--|--|--|
|    | b.  | A state in a state diagram has 2 outgoing transitions and 3 incoming transition. If using a 1-hot implementation, the flip-flop for that state in the NSL will be produced by an C gate with how many inputs? |                                                                                  |  |  |  |  |
|    | c.  | ( <b>Dynamic / Static</b> ) instruction program occupies when executing.                                                                                                                                      | count will determine how much memory the                                         |  |  |  |  |
|    | d.  | ( <b>True / False</b> ) Suppose half of the instructions executed by a program are ADD instructions which take 2 cycles. Reducing the ADD instructions by half will cut the overall execution time in half.   |                                                                                  |  |  |  |  |
|    | e.  | ( <b>True / False</b> ) In the single-cycrequired to execute an <b>JUMP</b> instruction is like                                                                                                               | cle CPU, reducing the latency of the datapath kely to decrease the clock period. |  |  |  |  |
|    | f.  | ( <b>True / False</b> ) In the single-cycrequired to execute an <b>LW</b> instruction is likely                                                                                                               | cle CPU, reducing the latency of the datapath y to decrease the clock period.    |  |  |  |  |
|    | g.  | What is the minimum number function using only 2-input OR gates?                                                                                                                                              | of gate delays required to implement a 6-input OR                                |  |  |  |  |
|    | h.  | Signals that must be valid <u>duri</u> <b>Moore</b> ) style.                                                                                                                                                  | ng the clock are best produced using (Mealy /                                    |  |  |  |  |
|    | i.  | (True / False) For signed num subtracting and then simply checking if the M                                                                                                                                   | bers, the comparison of A < B can be performed by ISB of the result is a 1.      |  |  |  |  |
|    | j.  | system. Answers are limited to 8-bits (not 9-                                                                                                                                                                 | of subtraction.) Finally, state whether overflow has                             |  |  |  |  |
|    |     | a.) 2's comp. system                                                                                                                                                                                          | b.) Unsigned system                                                              |  |  |  |  |
|    |     | 11010101 <sub>2</sub><br>- 01011110 <sub>2</sub>                                                                                                                                                              | 01100101 <sub>2</sub><br>+ 01010100 <sub>2</sub>                                 |  |  |  |  |
|    |     |                                                                                                                                                                                                               |                                                                                  |  |  |  |  |

Overflow: Y/N

Overflow: Y/N

2. **(36 pts.) Arithmetic, Datapath, and Control Unit Design**. Recall that sign-truncation refers to removing copies of the sign bit without altering the number's value. For example, **11111010** binary can be truncated by **4 bits** and reduced to **1010** without affecting the value.

Given an array of 32 (8-bit) **signed (2's complement) binary numbers** stored in a memory, scan through the array, and record the MAX number of bits that can be truncated from any number in the array (i.e. find the number with the most copies of the sign-bit and record how many copies that number had.) In addition, by the end of the scan, set a FLAG bit if ANY number in the array had **4 or more** leading sign bits that could be truncated.

Below is a sample of 4 (though you will examine 32) numbers and how the MAX and FLAG variables would update based on the computation of each number. The final result would be MAX = 6 and FLAG = 1.

| Maman, Cantanta               | Addr. (i) | 0                                                | 1                | 2                 | 3        |
|-------------------------------|-----------|--------------------------------------------------|------------------|-------------------|----------|
| Memory Contents               | M[i]      | 1 <b>1</b> 011101                                | 0 <b>0000010</b> | 1 <b>111111</b> 0 | 10110111 |
| <b>S</b> = (Sign-bits that ca | an be     | 1                                                | 5                | 6                 | 0        |
| Truncated from M              | [i]       | (remember you must retain the original sign bit) |                  |                   |          |
| MAX, FLAG                     |           | 1,0                                              | 5 , 1            | 6 , 1             | 6 , 1    |

While there are many ways to accomplish this computation, we want you to scan through each number and examine the bits of the current number 1 clock cycle at a time (see the datapath below). While we could go through all 8-bits, since we only care about how many leading sign bits the number has we should not require 8 cycles per number if there are fewer leading sign bits. For example, given the number 11100101, we should be able to stop counting the sign bits after 2 clock cycles since there are only 2 leading copies of the sign bit. This is a performance requirement. So if a number has n leading copies of the sign bit, you may only use n cycles to count the sign bits.

To produce the answers, we will keep two counters: the **i-counter** to track which array element we are examining and the **B-counter** to select (via a mux) one of the bits of the number per cycle. For this B-counter you will need to complete its connections with the ADDER/SUBTRACTOR and mux to ensure the appropriate select bit sequence is generated. We will use a third counter (**S-counter**) to count how many leading copies of the sign bit there are. Finally, we will use a register to store the **MAX** number of leading sign bits from ANY number and a separate 1-bit register for the **FLAG** to indicate if ANY number had **4 or more copies** of the sign bit.

We will use a Mealy-style state machine. After an initial state (**INIT**) we will move to the BCNT (bit counting) state where we will count the number of leading sign bits in the current array element. Once we determine that value, we will enter a SCMP (S compare) state to update the MAX and FLAG outputs appropriately. We can then return to the BCNT state to process the next number and repeat this process until we are done.

Examine the incomplete datapath on the next page and the skeleton of the state machine on the following page. Complete the state machine and then the data path by filling in the shaded regions, taking care to read the instructions for what is expected. Finally, complete the NSL and OFL.

a.) [1 pt] How many bits should the i-counter be?

b.) [7 pts] Complete the datapath connections below by using constants, drawing wires, or simply using signal name labels to the inputs in the shaded boxes. The signals in the dotted circles will be generated by the OFL of the state machine or other signals from the datapath below. Those will be dealt with on the next page.



c.) [12 pts] Complete both transitions & operations that should happen in the BCNT and SCMP states.



e.) Now suppose we implemented the state machine from the previous part and gave you the one-hot state outputs: QI, QB, QS, and QD. Use these signals plus <u>others that are defined in the datapath</u> to define the control signals indicated below: Just write logic equations using logic operations

| MEN =  |  |
|--------|--|
| SEN =  |  |
| BSEL = |  |
| BEN =  |  |
| EN =   |  |
| SCLR = |  |

In the datapath, the 1-bit comparator producing MATCH can be implemented with a single logic gate. What type of gate would be correct: \_\_\_\_\_ (AND / XOR / XNOR / NOR) (6 pts.)

f.) Below is a <u>subset</u> of the state flip-flops. Assuming a <u>1-hot state assignment</u>, complete the NEXT STATE LOGIC equations and SET/CLR inputs of the state memory. You need not draw the gates but just write the equations for the specified D inputs. Then in the shaded boxes show what to connect to the **SET/CLR inputs** (Vdd and GND are always available)

DI = \_\_\_\_\_\_\_

DB = \_\_\_\_\_\_

(3 pts.)

g.) Using /RST, constants and other signals, show how to connect the /SET and /CLR inputs of the following state flip-flops.



3. (15 pts.) ISA and Single-Cycle CPU Datapath: We want to add support for a new instruction 'BMNE' (Branch Memory Address if Not Equal), while not affecting any other instructions. This new instruction has the format shown below and will branch to the value loaded from memory given by the address in \$rt if the value in \$rs is not equal to a sign-extended immediate (i.e. \$rs != imm), otherwise it continues executing sequentially (the next instruction). Implement any changes to the datapath and control signals to support this new BMNE instruction on the single-cycle CPU. Assume when this instruction executes a new BMNE control signal will be generated and set to '1'.

This instruction will use the machine code I-format:

| BMNE: | opcode | rs | rt | 16-bit signed immediate            |
|-------|--------|----|----|------------------------------------|
|       | 6      | 5  | 5  | (immediate to be compared to \$rs) |



- a. What operation should the ALU perform? \_\_\_\_\_
- b. Sketch the additions/changes to the **datapath** above that would be needed to support this new instruction and its operations (provide a brief description in the space below if the sketch is unclear). <u>Use as little additional logic as possible.</u>

c. Show the values of the following control signals to implement this new instruction.

|                      | BMNE | MemToReg | Branch | ALUSrc | RegWrite | MemRead | MemWrite |
|----------------------|------|----------|--------|--------|----------|---------|----------|
| <b>Value</b> (0,1,X) | 1    |          |        |        |          |         |          |
|                      |      |          |        |        |          |         |          |

## 4. (8 pts.) Performance

Assume a program requires the execution of the following number of instructions with the given CPI. The processor uses a 2 GHz clock period.

| Instruction Type | Instruction Count                | CPI |
|------------------|----------------------------------|-----|
| ADD              | 100*10 <sup>6</sup> instructions | 1   |
| Branch           | 50*10 <sup>6</sup> instructions  | 2   |
| Load             | 50*10 <sup>6</sup> instructions  | 3   |
| Store            | 25*10 <sup>6</sup> instructions  | 3   |
| MUL              | 25*10 <sup>6</sup> instructions  | 5   |

| MUL                                | 25*10 <sup>6</sup> instructions                                           | 5                                         |
|------------------------------------|---------------------------------------------------------------------------|-------------------------------------------|
| a.) What is the execution time of  | of this program (in milliseconds)?                                        |                                           |
| Time: ms (show                     | work)                                                                     |                                           |
|                                    |                                                                           |                                           |
|                                    |                                                                           |                                           |
| b.) Is it possible to achieve a 1. | 1x speedup by improving the CPI                                           | of only the BRANCH instructions?          |
| Yes / No: (show work)              |                                                                           |                                           |
|                                    |                                                                           |                                           |
|                                    |                                                                           |                                           |
|                                    |                                                                           |                                           |
| c ). If we want the program to ru  | in 1 2v faster by simply reducing t                                       | he <b>number of Load</b> instructions, by |
|                                    | to reduce the instruction count of                                        |                                           |
| Factor of reduction (i.e. ol       | d # of LOADS / new # of LOADS                                             | S): (show work)                           |
|                                    |                                                                           |                                           |
|                                    |                                                                           |                                           |
|                                    |                                                                           |                                           |
|                                    |                                                                           |                                           |
| d ) Assuma asah MIII instructio    | on is converted to 4 ADD instructi                                        | one Assuming we execute that              |
| updated program on the pip         | elined processor discussed in cla                                         | ss with full-forwarding, 1 cycle stall    |
|                                    | ndent instructions and 1 cycle brai<br>the runtime of the program be in t |                                           |
| Time: ms (show                     | . •                                                                       |                                           |
| 1 11116 1113 (3/10W                | WOIN                                                                      |                                           |

## 5. (5 pts) Pipelining Performance

Consider the <u>final pipelined CPU implementation we arrived at in class</u> with **full forwarding** where possible, **stalling logic** for a LW followed by a dependent instruction, and branch determination in the DECODE stage (<u>i.e. early branch determination</u>). Look at the code segment below which implements a loop that starts at L1 and will exit when the 'bne' instruction is true but continue looping when 'bne' is false and thus the 'beq' will jump back to the top of the loop.

a.) Suppose the bne is false twice and then is true on the 3<sup>rd</sup> iteration of the loop. How many total bubbles (stall cycles + flushed instructions) would occur during execution of those 3 iterations?

| Orig | inal Co | ode S | egme  | ent 1 |  |  |  |  |
|------|---------|-------|-------|-------|--|--|--|--|
| L1:  | add     | \$4,  | \$8,  | \$9   |  |  |  |  |
|      | lw      | \$3,  | 20(   | \$4)  |  |  |  |  |
|      | sub     | \$6,  | \$3,  | \$5   |  |  |  |  |
|      | addi    | \$4,  | \$4,  | 8     |  |  |  |  |
|      | bne     | \$4,  | \$9,  | L2    |  |  |  |  |
|      | SW      | \$8,  | 0(\$4 | 4)    |  |  |  |  |
|      | beq     | \$0,  | \$0,  | L1    |  |  |  |  |
| L2:  | or      | \$7,  | \$6,  | \$9   |  |  |  |  |
|      |         | -     | -     |       |  |  |  |  |
|      |         |       |       |       |  |  |  |  |

Now assume the hardware designer REMOVES the hazard detection unit and flushing logic (forwarding logic is still present) and instead declares <u>1 delay slot after a LW</u> followed by a dependent instruction and <u>1 delay slot after a branch</u>. Answer the questions below about how to reorder the code as necessary to <u>ensure correctness</u> while also achieving the <u>BEST POSSIBLE performance</u> (i.e. CPI). Recall that 'nop' instructions are always available for use to ensure correctness.

| b.) | ) What instruction could be moved into the <u>delay slot</u> for the <b>bne</b> (answer 'nop' if | no other |
|-----|--------------------------------------------------------------------------------------------------|----------|
|     | instruction can be moved)?                                                                       |          |

c.) What instruction could be moved into the <u>delay slot</u> for the **beq** (answer 'nop' if no other instruction can be moved)?

\_\_\_\_\_

## 6. (9 pts.) Pipelining and Arithmetic

Consider the problem of calculating an approximation of **e**<sup>x</sup> using the following formula:

$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!}$$

Given the datapath below assume the following delays:

| Addition unit       | 200 ps |
|---------------------|--------|
| Multiplication unit | 500 ps |
| Division unit       | 800 ps |

a. What is the total delay of the circuit as shown below? \_\_\_\_\_\_.



b. Now suppose we have an array of many values for X (e.g. X[1000]) and we want to compute e<sup>X[i]</sup> for each value of X[i]. Go back to the circuit above and consider how we can pipeline the design. We want to achieve a **clock period of at most 800 ps** (i.e. the divider delay) with a latency of 4 clock cycles and a throughput of 1 result every clock cycle in the steady state. Show where to add pipeline stage registers by drawing a THICK HORIZONTAL LINE (See the example diagram below for how to draw your pipeline registers) through the wires were a pipeline register should be inserted.

Sample for how to draw a pipeline register on the design above:



| Intentionally blank for scratch work. | Please turn it in with your exam: |
|---------------------------------------|-----------------------------------|
| Name:                                 | Section time:                     |