Lecture 11-12 More Complex Pipelines
Software can resolve a data hazard by inserting NOPs (delays). Even so, a structural hazard can still occur when two instructions try to write to the register file at the same time; this can be solved by adding a write port to the register file.
A more complex pipeline can be constructed with 5 stages.
- Divide the original execution stage into Execute, Memory, and Write-back stages
- Move register read into the Decode stage
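The stages above can be sketched with a minimal timing model. This is an illustrative sketch assuming an ideal pipeline (no hazards, single issue), not a description of any specific machine:

```python
# Ideal 5-stage pipeline: instruction i enters Fetch in cycle i and
# occupies one stage per cycle, finishing Write-back in cycle i + 4.
STAGES = ["Fetch", "Decode", "Execute", "Memory", "Write-back"]

def pipeline_diagram(n_instructions):
    """Return {instr: {stage: cycle}} for an ideal hazard-free pipeline."""
    return {i: {stage: i + s for s, stage in enumerate(STAGES)}
            for i in range(n_instructions)}

diagram = pipeline_diagram(3)
# Instruction 0 writes back in cycle 4; instruction 2 in cycle 6.
```

Once the pipeline is full, one instruction completes per cycle even though each instruction takes five cycles end to end.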
Lecture 14 More Data Dependencies
Implementing bypassing in hardware is complicated. Forwarding alone is insufficient to completely resolve RAW data hazards (a load followed immediately by a use of its result still stalls), but it does reduce the total penalty.
Why can CPI be greater than 1? Full bypassing may be too expensive. Loads take two cycles. Branches cause bubbles in the pipeline. CPI depends on the instruction set and on the implementation of the architecture.
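These stall sources can be combined into an effective CPI estimate. A hedged sketch follows; the frequencies and penalties below are made-up illustrative numbers, not measurements:

```python
# Effective CPI = base CPI + sum over stall sources of
# (frequency of the event) * (stall cycles per event).
base_cpi = 1.0
load_freq, load_stall = 0.25, 1      # loads take two cycles -> 1 stall cycle
branch_freq, branch_stall = 0.15, 2  # assumed 2-cycle branch bubble

cpi = base_cpi + load_freq * load_stall + branch_freq * branch_stall
# cpi == 1.0 + 0.25 + 0.30 == 1.55
```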
Approaches to improving the memory system in complex pipelines
- separate instruction and data memory ports
- Multilevel cache systems
- Interleaved memory
- memory systems with out-of-order responses
Lecture 15-16 Data Dependencies and a Prelude to Scoreboarding
General characteristics of Pipelined Function Units
- Operands are latched as they come in
- Internal registers hold intermediate results and control signals
- They may or may not have variable latency
- Variable latency can lead to out-of-order write (WAW) hazards
- Structural hazards (two units writing back in the same cycle)
- Equalizing Latencies as a strategy for dealing with structural hazards
- In-order: begin execution of the instructions in program order
- Out-of-order: Complete the instructions as soon as they can without violating dependencies or latencies
Instruction Scheduling: out-of-order execution can be faster
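The in-order/out-of-order contrast can be shown with a toy timing model. This is a hedged sketch: the instruction names, latencies, and the single-issue assumption are all illustrative:

```python
# Each instruction is (name, latency in cycles, names it depends on).
instrs = [("mul", 4, []),          # long-latency multiply
          ("add1", 1, ["mul"]),    # depends on the multiply's result
          ("add2", 1, [])]         # independent instruction

def in_order_finish(instrs):
    """Single-issue, in-order: each instruction starts no earlier than one
    cycle after its predecessor starts, and only once its operands are ready."""
    finish, next_start = {}, 0
    for name, lat, deps in instrs:
        start = max([next_start] + [finish[d] for d in deps])
        finish[name] = start + lat
        next_start = start + 1
    return finish

def out_of_order_finish(instrs):
    """Dataflow limit: an instruction starts as soon as its operands are ready."""
    finish = {}
    for name, lat, deps in instrs:
        start = max([0] + [finish[d] for d in deps])
        finish[name] = start + lat
    return finish
```

In-order, add2 is stuck behind add1 waiting on the multiply (done at cycle 6); out-of-order, add2 runs during the multiply and everything finishes by cycle 5.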
Lecture 17 Score-boarding
A scoreboard is a hardware data structure that detects hazards and manages instruction issue dynamically. It is a data-dependence analysis algorithm implemented in hardware. Before issuing an instruction, the scoreboard checks:
- Is the required function unit available?
- Is the input data (registers) available?
- Is it safe to write the destination?
- Is there a structural conflict at the WB stage?
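The four checks can be sketched as a single issue predicate. This is a simplified assumption-laden model (sets of busy units, pending registers, and reserved write-back cycles), not a faithful Thornton/CDC 6600 scoreboard:

```python
def can_issue(fu, srcs, dst, wb_cycle, fu_busy, reg_pending, wb_reserved):
    """Return True iff an instruction may issue under scoreboard rules."""
    if fu in fu_busy:                            # function unit in use
        return False
    if any(r in reg_pending for r in srcs):      # RAW: operand not ready
        return False
    if dst in reg_pending:                       # WAW: destination pending
        return False
    if wb_cycle in wb_reserved:                  # structural conflict at WB
        return False
    return True
```

An instruction that fails any check simply stalls at issue and retries the following cycle.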
Lecture 18 Overcoming Branches
Branch instruction: [cond][offset]
Conditional (predicated) instructions such as ARM's ADDNE can be used to avoid the pipeline bubble caused by some short branches.
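The transformation can be illustrated in Python as folding the condition into the data path instead of branching (which is what an ADDNE-style conditional instruction does in hardware); the function names here are made up for illustration:

```python
def with_branch(x, a):
    # Branchy form: a mispredicted branch here costs a pipeline bubble.
    if x != 0:
        a += 1
    return a

def predicated(x, a):
    # Predicated form: the condition becomes a 0/1 operand of the add,
    # so there is no control-flow branch at all.
    return a + (x != 0)   # bool promotes to 0 or 1
```

Both forms compute the same result; the predicated one trades a possible branch penalty for always executing the add.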
Lecture 19 Memory Hierarchy and Caches
The most important data memory: registers
The area of a register file is proportional to
- The number of bits in a register
- The number of registers in a register file
- The square of the number of ports
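The proportionality above can be written as a formula. A minimal sketch, where k is an arbitrary technology-dependent constant (an assumption, not a real value):

```python
def regfile_area(bits_per_reg, n_regs, n_ports, k=1.0):
    # area ~ bits * registers * ports^2: each port adds both a word line
    # and a bit line per cell, so wiring grows quadratically in port count.
    return k * bits_per_reg * n_regs * n_ports ** 2
```

Doubling the port count quadruples the area, which is one reason designers limit register-file ports or split the file into banks.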
Since different memory components have different latencies, a memory hierarchy is deployed to minimize the average latency of memory operations.
A cache's efficiency depends on the principle of locality:
- Temporal Locality
- Recently accessed items are likely to be accessed again in the near future
- Spatial locality
- Items whose addresses are near one another tend to be referenced close together in time
Lecture 20 – 23 Caches
Structure of Cache
Cache size = Number of Sets (S) * Lines per Set (E) * Block size (B)
Address of a word: [t bits](tag) [s bits](set index) [b bits](block offset)
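The address split can be sketched directly from the field widths. A hedged example assuming power-of-two S sets and B-byte blocks (the concrete parameters below are illustrative):

```python
def split_address(addr, n_sets, block_size):
    """Split an address into (tag, set index, block offset)."""
    b = block_size.bit_length() - 1        # b offset bits since B = 2^b
    s = n_sets.bit_length() - 1            # s set-index bits since S = 2^s
    offset = addr & (block_size - 1)       # low b bits
    set_index = (addr >> b) & (n_sets - 1) # next s bits
    tag = addr >> (b + s)                  # remaining high bits
    return tag, set_index, offset

# 64 sets, 32-byte blocks: s = 6, b = 5
tag, set_index, offset = split_address(0x1234, 64, 32)
```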
Direct-mapped cache: one line per set
- Locate set
- Check if the line in set has matching tag
- Yes + line valid: hit
- Locate data starting at offset
2-way set-associative cache: two lines per set
- Locate set with set index
- Check if any line in set has matching tag(check the tag of multiple lines in parallel)
- Yes + line valid: hit
- Locate data starting at offset
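The lookup steps above can be sketched as a tiny simulator. This is a hedged model with LRU replacement and illustrative parameters (4 sets, 2 ways, 16-byte blocks), not any particular hardware design:

```python
from collections import OrderedDict

class SetAssocCache:
    def __init__(self, n_sets=4, n_ways=2, block_size=16):
        self.n_sets, self.n_ways, self.block = n_sets, n_ways, block_size
        # One OrderedDict per set: maps tag -> None, oldest (LRU) first.
        self.sets = [OrderedDict() for _ in range(n_sets)]

    def access(self, addr):
        """Return True on a hit, False on a miss (filling the line)."""
        block_no = addr // self.block
        idx = block_no % self.n_sets          # locate set with set index
        tag = block_no // self.n_sets
        ways = self.sets[idx]
        if tag in ways:                       # matching tag: hit
            ways.move_to_end(tag)             # mark most recently used
            return True
        if len(ways) == self.n_ways:          # set full: evict LRU line
            ways.popitem(last=False)
        ways[tag] = None                      # fill the line
        return False
```

Repeated accesses to the same block hit; a third distinct block mapping to a full set evicts the least recently used line.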
Write-hit policies:
- Write-through: write to memory immediately
- Write-back: defer the write to memory until the line is replaced
Write-miss policies:
- No-write-allocate: write directly to memory, bypassing the cache
- Write-allocate: load the block into the cache, then update the line in the cache
Cache Performance Metrics
- Miss Rate
- Fraction of memory references not found in cache
- Hit Time
- Time to deliver a line in the cache to the processor
- Miss Penalty
- Additional time required because of a miss
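The three metrics combine into the average memory access time (AMAT). A short sketch; the 1-cycle hit, 5% miss rate, and 100-cycle penalty are illustrative numbers only:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# amat(1, 0.05, 100) -> 1 + 0.05 * 100 = 6.0 cycles on average
```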
Note: since the miss penalty is much larger than the hit time, even a small increase in the miss rate can significantly hurt performance (and, equivalently, a small reduction in miss rate can significantly boost it).
Causes for Cache Misses:
- Compulsory: first reference to a block
- Capacity: cache is too small
- Conflict: misses that occur because of collisions due to the block-placement policy
Miss rate analysis for Matrix Multiplication
In row-major order the elements of each row are stored contiguously, so accessing a row sequentially exploits spatial locality: a single miss brings in a block containing several consecutive elements, and the following accesses hit.
As a result, if we choose the right loop order (and, where needed, block/tile sizes), we can minimize the cache miss rate.
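The effect of traversal order can be demonstrated with a tiny direct-mapped cache model. All parameters here (64x64 matrix of 8-byte elements, 64-byte blocks, 16 cache lines) are illustrative assumptions, not a real machine:

```python
def misses(order, n=64, elem=8, block=64, n_lines=16):
    """Count misses for traversing a row-major n x n matrix."""
    lines = {}                      # line index -> block number currently held
    count = 0
    for a in range(n):
        for b in range(n):
            i, j = (a, b) if order == "row" else (b, a)
            blk = ((i * n + j) * elem) // block   # row-major address -> block
            idx = blk % n_lines                   # direct-mapped placement
            if lines.get(idx) != blk:             # miss: fetch the block
                lines[idx] = blk
                count += 1
    return count
```

Row-wise, 8 consecutive elements share a block, so only about 1 access in 8 misses; column-wise, every access jumps a full row ahead and conflicts evict each block before it is reused, so essentially every access misses.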