COMP 425 MidTerm Review Notes 11 – 23

Lecture 11 -12 More Complex Pipelines

A software could resolve data conflict by inserting a NOP. (DELAY) We could still have a structural hazards when two instructions try to write to the data register at the same time. It can be solved by adding a write port to the register file.

A more complex pipeline can be constructed with 5 stages.

  • Divide the original execution stage into Execute, Memory, Write.
  • Move register read to into the decode stage

Lecture 14 More Data Dependencies

Bypass implementation in hardware is complicated. Forwarding is insufficient to completely resolve the RAW data hazard. It does help to reduce the total penalty.

Why we can have a CPI > 1? full bypassing might be expensive. Loads take two cycles. Branches cause bubbles in the pipeline. CPI depends on the instruction set and the architecture implementation.

Approaches to improve complex systems

  • separate instruction and data memory ports
  • Multilevel cache systems
  • Interleaved memory
  • memory systems with out-of-order responses

Lecture 15-16 Data Dependencies and a Prelude to Scoreboarding

General characteristics of Pipelined Function Units

  • Operands are latched as they come in
  • Internal registers hold intermediate results and control signals
  • They may, or may not, have variable latency
    • Problems:
      • This could lead to out-of-order write hazards
      • structural hazards (write at the same time)
    • Solution
      • Equalizing Latencies as a strategy for dealing with structural hazards

Data Hazards

  • Read-After-Write
  • Write-After-Read
  • Write-After-Write

Execution order

  • In order: Begin execution of the instruction in the order listed
  • Out-of-order: Complete the instructions as soon as they can without violating dependencies or latencies

Instruction Scheduling: Out of order execution can be faster

Lecture 17 Score-boarding

A scoreboard is a hardware data structure to detect and mange instruction-issue dynamically. It is a data-dependence analysis algorithm implemented in hardware.

  • Is the required function unit available
  • Is the input data (register) available
  • Is it safe to write the destination
  • Is there a structural conflict at the WB stage

Lecture 18 Overcoming Branches

Branch instruction: [cond][offset]

Conditional instructions such as “ADDNE” can be used to avoid the pipeline bubble of some branches.

Lecture 19 Memory Hierarchy and Caches

The most important data memory: registers

The area of a register file is proportional to

  • The number of bits in a register
  • The number of registers in a register file
  • The square of the number of ports

Since different memory components have different latencies. A memory hierarchy is deployed to minimize the latency of memory operations.

Cache’s efficiency depends on the principle of locality

  • Temporal Locality
    • Recently accessed items are likely to be accessed again in the near future
  • Spatial locality
    • Items whose addresses are near one another tend to be referenced close together in time

Lecture 20 – 23 Caches

Structure of Cache

Cache size = Number of Sets (S) * Number of Lines/ Set * Block size

Address of word [t bits](tag)  [s bits](set index)  [b bits](block offset)

Direct-mapped cache: one line per set


  1. Locate set
  2. Check if the line in set has matching tag
  3. Yes + line valid; hit
  4. Locate data starting at offset

2-way set-associative cache: two lines per set

  1. Locate set with set index
  2. Check if any line in set has matching tag(check the tag of multiple lines in parallel)
  3. Yes + line valid: hit
  4. Locate data starting at offset

Write operations

  1. Write-hit:
    1. Write-through: write to memory immediately
    2. Write-back: defer write to memory until replacement of line
  2. Write-miss
    1. No-write-allocate: write immediately to memory
    2. Write-allocate: load into cache, update line in cache

Cache Performance Metrics

  • Miss Rate
    • Fraction of memory references not found in cache
  • Hit Time
    • Time to deliver a line in the cache to the processor
  • Miss Penalty 
    • Additional time required because of a miss

Note: there is a huge difference between miss and hit, increasing the miss rate by a tiny bit could significantly boost the performance

Causes for Cache Misses:

  1. Compulsory: first reference to a block
  2. Capacity: cache is too small
  3. Conflict: miss that occur because of collisions due to block-placement poicy

Miss rate analysis for Matrix Multiplication

Row major order allows spatial locality to help reduce the miss rate when it is being accessed sequentially. In the row-major order, each row would be stored in a “Way” (column) in cache.

As a result, if we choose the right dimension, we could minimize the cache miss rate.

This entry was posted in Architecture, Class. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s