Chapter 4

Overview

In this chapter we will take a look at processor hardware and the instructions that the hardware can support, called instruction set architecture.

One important concept is that ISA ‘s logic can differ from the actual way the processor works. The ISA will appear to be sequential but sometimes the different instructions are executed simultaneously.

We want to have clever tricks that improve the performance while maintaining the functionality of a simpler and more abstract model

The processor we are working with :Y86-64

It is functionally correct and sequential. It executes a complete instruction on every clock cycle and the clock is slow to allow an entire series of actions to complete within one cycle.

Then we will apply transformations to create a pipelines processor. It will break each execution into five stages, each will be handled by a separate section of hardware. Instructions progress through the stages of the pipeline with one instruction entering the pipeline in one clock cycle. As a result, the processor can process up to 5 instructions at the same time.

The Y86-64 instruction set

Defining a instruction set means defining the different components of its state, set of instructions and their encodings, a set of programming convention and exception handling.

programmer-visible state.

Each instruction in Y86-64 can read and modify some part of the processor. This is the visible state. The 15 registers, condition codes(ZF,SF,OF) about the most recent arithmetic or logical ops, PC, memory and stat, which indicates the state of the program execution.

Y86-64 instruction

It is largely a subset of X86 and has fewer addressing mode and only 8-byte integer operations. Here are a few “quirks” of our Y86:

  1. movq is split into four instructions: irmovq, rrmovq, mrmovq and rmmovq, where i stands for immediate, r stands for register and m stands for memory. It will become useful when we implement them.
  2. As with x86, we cant move from memory to memory, and can’t move immediate to memory.
  3. We have four operations and they can only operate on register data. But on x86, they can operate on memory as well. These instruction will set the three conditional code.

Instruction encoding

The byte encoding of the instructions are between 1 to 10 bytes, depending on what fields are required. All instructions have a initial byte that indicate the instruction type. The high-order part is the icode, the low-order part of the byte is the ifunc that will differentiate between different functions under the same code. For example, 0x6 for icode is operations, and 0x0 for ifunc is addq, 0x1 for ifunc is subq and so on.

One-byte long instruction

Some instructions are only 1 byte long because they don’t need any other infos. They are ret, nop and halt.

For other instructions, we will need more bytes because they require operands. The next bytes inline is the register specifer byte. We have 15 registers in Y86 and the high-order and low-order specify the two registers. The value from 0x0 to 0xE are associated with %rax to %r14. The value 0xF is reserved for no register. We use 0xF for operations that only need one register such as pushq.

We might also need a 8-byte constant word. It can serve as the data for irmovq and displacement for rmmovq, and destination of calls. BEWARE the address are absolute (because it is simplier) and is in byte-reverse order, the little-endian encoding.

Exception

The status code in stat has four possible value. AOK(normal), HLT(halt), ADR(address invalid) and INS(invalid instruction. Any read or write attempt to an address out of the addrses limit will trigger the ADR.

In y86, when exception happens, we will just stop.

Two unusual instructions

pushq will decrement the stack pointer by 8 and writes a register value to the memory. So what will pushq %rsp do? Two possibility:

  1. push the original value
  2. Push the decremented value

Let’s first try putting a code together

movq %rsp, $rax pushq $rsp popq $rdx subq %rdx, %rax ret

The return value is always 0. So it means we pushed the undecremented value onto the stack. The same question can be raised for popq $rsp. A similar code can be used to determine the effect. We found popq $rsp will put the value of ($rsp) to $rsp. It is equivalent to mrmovq ($rsp), $rsp.

Logic design and Hardware control language HCL

We need three components to implement a digital system.

  1. combinational logic to computer functions on bits
  2. Memory element to store bits
  3. clock signal to regulate the update of memory

nowaday don’t have to draw diagrams anymore and use a textual notation. The most common languages are Verilog and VHDL. In 1980s, logic synthesis is developed to generate efficient circuit from HCL.

Logic gates

They are basic electronic component that generates output based on some boolean functions of inputs. Logic gates are always active and if some input changes, within a short amount of time, the output will change accordingly. Putting logic gates together, we got combinational circuits. Several restrictions apply:

  1. inputs must be either system input, output of memory our output of logic gates
  2. outputs of two gates can’t be connected together
  3. The network must be acyclic. No path can form a loop. Else there will be ambiguity.

Multiplexer

It will select input based on the bit pattern of the selection bit. We will use case expressions in HCL.

[ select1 : expr1; select2 : expr2; …

]

where select is a Boolean expression, indicating when to select the expression.

Example: If we have two selection bits, we can select 2^2 = 4 inputs.

[ !s1 && !s2 : A; !s1 : B; !s0 : C; !1 : D;

]

We can also simplify the select statement.

ALU

arithmetic logic unit can perform ops on data.It has two data input and one control input.

Memory and clocking

Combinational circuit, by definition, doesn’t store information and will simply react to signal inputs. To create sequential circuits that can have states and perform computations on the state, we must have devices that store information. Our storage devices will all be controlled by clock, a periodical signal that determins when new values will be loaded into the devices. The registers will only change according to inputs when the clock signal rises.

Sequential Y86 implementations

In this version, our sequential processor,SEQ, will perform all the steps required to process a complete instruction. It will be slow.

Stages

  1. Fetch: reads the bytes of an instruction from memory, using the PC as the memory address. It will grab the icode, ifunc and possibly the register specifier rA and rB and an 8-byte constant word valC. In the end, it will compute valP to be the address of the next instruction.
  2. Decode: reads up to two operands from the register file, and give the values stored in the rA and rB, giving valA and valB.
  3. Execute: the ALU will perform the operations specified by icode and ifunc, or computers the effective address of memory reference. The conditional code is possibly set and conditional jump/move will be determined to move/jump or not.
  4. Memory: write data to memory or read data from memory
  5. Writeback: write back to up to 2 result to register file.
  6. PC update: PC is set to the address of the next instruction.

OPq

An example, we will work through OP rA, rB.

Fetch: icode: ifunc <- M1[PC] // read 1 byte and get instruction code and function.

rA:rB <- M1[PC + 1] // read the target register

valP <- PC + 2 // find the next instruction after OP

Decode: ValA <- R[rA], ValB <- R[rB]

Execute: ValE <- ValB OP ValA, set CC //conditional code

We put ValB in front of ValA because we want division to have B divided by A.

Memory: No memory instruction

Write back R[rB] <- ValE

PC update: PC <- ValP

rmmovq

We can also use ALU to get the effective address of memory, as can be seen in move instructions

Example: rmmovq rA, D(rB)

Fetch: icode: ifunc <- M1[PC], rA:rB <-M1[PC + 1], ValC <- M8[PC + 2], ValP <- PC + 10 // we need access to displacement, it is an 8 bytes const after register specifying byte.

Decode: ValA <- R[rA], ValB <- R[rB].

Execute <- ValE <- ValB + ValC //displacement D

Memory M[ValE] <- ValA

Writeback: no writeback

PC update <- PC <- ValP

If we have rmmovq, where there is no displacement, in execute stage we will have : ValE <- 0 + ValA.

I guess this is the case because we need to pass ValA to writeback and it has to go through ALU. Kind of like zero-padding.

Push and Pop

It is among the most difficult instructions in Y86 because we both need to access memory and also update the stack pointer.

push rA

Fetch: icode: ifunc <- M1[PC], rA <-M1[PC + 1], ValP <- PC + 2

Decode: valA <- R[rA], ValB <- R[$rsp]

Execute: ValE <- ValB – 8

Memory: M[ValE] <- ValA

Writeback: R[%rsp] <- ValE

PC update : PC <- ValP

we can see ValB has the value of the old $rsp. We observed the rule that we should update the value of $rsp before writing to it is followed. (even though the actual update happened after the memory operation.)

pop rA

Fetch:R[$rsp] , ValP <- PC + 2

Decode: valA <- R[$rsp], ValB <- R[$rsp]

Execute: ValE <- ValB + 8

Memory: ValM <- M[ValA]

Writeback: R[$rsp] <- ValE, R[rA] <- ValM

PC update: PC <- ValP

We can read two copies of the stack pointer. Because doing so can make the flow of the instruction much similar to instructions.

We used the incremented value to update the stack pointer and the un-incremented value as the memory access.

Call and ret

call dest

Fetch: icode: ifunc <- M1[PC], valC<- M8[PC + 1], ValP <- PC + 9

Decode: valB <-R[$rsp]

Execute: ValE <- ValB – 8

Memory:M[ValE] <- ValP

Writeback: R[$rsp] <-ValE

PC <- ValC

For call, the address of the function is the 8 byte right after instruction byte. We also need to use ValB to store the stack pointer and decrement it to store the return address. Then the PC is set to the destination.

ret

Fetch: icode, ifunc <- M[PC]

Decode:ValA <-R[$rsp], ValB < R[$rsp]

Execute: ValE <- ValB + 8

Memory: ValM <-M8[ValA]

Writeback: R[$rsp] <- ValE

PC update: PC<- ValM

Ret also keeps two copy of unincremented $rsp. One is used for instruction update, another one is used for memory access

These two instructions are very similar to push and pop but we push and pop the program counters.

The hardware structure

A single clock signal transition trigger a flow through combinational logic to execute an entire instruction. We can treat the RAM units as another combinational circuit; it output word based on the address input, just like the circuit. So we are left with four hardware to control: The PC, the conditional code register, the data memory and the register file.  The PC is loaded with the new instruction every clock cycle. The condition code is loaded when an integer opration is executed. The data memory is written only when rmmovq, pushq or call is executed. The two write ports of the register file allow two program registers to be updated on every cycle. All of these state update happened simultaneously but But they achieve the same effect as the sequential execution in the previous section. This is achieved because we organized the computation in a way that follows some principle:

PRINCIPLE 1: No reading back. The processor never needs to read back updated state in order to complete other instruction. Let’s first start with a counter-example: suppose in pushq, we will first decrement the $rsp then use the updated $rsp to store the pushed value. This would violate the principle. We will need to go back to register file to retrieve the updated $rsp. On the other hand, our implementation generate the decremented value ValE and use it both as the memory write address and also the update value for $rsp, so we can do memory write and register update **simultaneously **. Along the line of this principle, we shouldn’t both set and then read the condition codes in any instruction. Even the condition codes are not set until the next cycle, they will be updated at the very beginning of the next cycle before any instruction can read it.

  As we can see, at the beginning of the cycle, the instruction hasn’t been executed and at the end of cycle 3, all combinational circuits have reacted to the given inputs and the update values have been generated.

Stage implementation

We will devise HCL descriptions for the control logic blocks required for SEQ.

Fetch stage

The fetch will read 10 bytes from memory and the first byte is split into 2 4 bits section as icode and ifun. Then compute the instruction codes or, in the event the instruction is invalid, the value is corresponded to a nop instruction. Based on the icode, we can compute three 1 bit signals(in dashed line)

  1. instr_valid: Is the instruction valid?
  2. need_regids: Does the instruction need register id?
  3. need_valC: Does this instruction need constant word?

The signals inst_valid * and *imem_error ( when the instruction address is out of bound). are used to generate the status code in the memory stage.

The HCL for need_regid is :

<br />bool need_regids = {
    icode in {RRMOVQ, RMMOVQ, MRMOVQ, PUSH, POP, IRMOV}
}

and the HCL code for need_valC is

bool need_valC = {
    icode in {IRMOVQ, RMMOVQ, MRMOVQ, JXX, CALL}
}

The remaining 9 bytes are processed by the “align” hardware into the register fields and the constant word. Byte 1 is split into 2 register specifying fields and if need_regids is 0, both register specifying fields will be 0xF(RNONE). Depends on if we have register specifying fields, the constant field will either be 1-8 bytes or 2-9 bytes.

The PC incrementer hardware unit will generate valP. Depends on need_regids and need_valC value, valP will vary. The formula is : PC + r + 8p, where r and p will be either zero or one, indicating the status of *need_regids, need_valC.

Principles of pipelining

Pipeling is like the Ford model-T production line; each unit of the processor will do their jobs on different instructions and they don’t have to wait till the instruction finished.

For the sequential, the clock can’t be too quick otherwise, we the clock will erase the registers before an instruction finished. We have five stages(fetch, decode…) and we have to let the clock’s ticking time to be longer than the time it takes for the signal to propagate.

But for pipelined processors with pipeling registers, we can run each stage of an instruction as soon as the previous instruction left that stage. For example, Instruction just finished fetch and stored all the info it needs in a pipeling register, instruction B can immediately uses the fetch part of the processor after the clock ticks.(Else it has to wait instruction A for the next few stages.)

The throughput(the number of instruction in given time) will increase. But the latency(the time required to finish an individual instruction, will increase as well.

Formula:

  • Through put = instruction / time for cycling the clock.

The first image is a pipelined processor. We can see we have divided the computation in three stages.And each instruction will require three clock cycles to finish.

The second image shows how three instructions will be finished. As soon as I1 is done with combinational logic A, I2 can start using A. Ignoring the time to transfer in and out of the pipelining registers and assume each combinational logic takes unit time, the sequential will take 9 clock cycles to finish the job but the pipelined will take 5 clock cycles.

 

When the inputs for B is ready for I1, the clock’s rising will load the input of the first pipeline register.

Slowing the clock will not change the behavior but making the clock too fast will be bad; The needed information hasn’t been calculated out by the circuit and the clock tick will let whatever is in the pipeling register output to the next combinational logic. BAD.

Limit of pipeling

Nonuniform partitioning

The rate of the clock is determined by the delay of the slowest stage.

We can see that the delay in each stages are different and stage A will have to be idle when for I2 since it has to wait for I1’s B to finish first. If we clock A too fast, we will let I2 enter stage B prematurely and erase I1’s B stage information.

Pipelining a system with feedback.

We might run into dependency between successive instructions.

  1. data dependency, where the next instruction might need the result of the current one.
  2. control dependency,the outcome of a conditional test will determine the next instruction. A good example is conditional jump. We might need to change the system’s behavior in order to do pipeling for system with feedback.

Pipelined Y86-64 implementations

We first shift the computation of the PC into the fetch stage. Then we add pipeline registers between the stages.

Rearranging the computation stages.

We move the PC update stages to the beginning of an instruction, so it computes the **current ** instruction. (For our previous model, the PC is at the end and will compute the instruction for the next one.

In our new PC update stage, we have state registers to hold the signals from the previous computation and when the clock rises, we will select the appropriate PC through the exact same logic we use to compute PC in the previous SEQ model. There is no hardware register storing the program counter. The PC is dynamically computed. (We can implement the processor different from the conceptual model.

The shift of stat element is called circuit retiming, which change the implementation/representation of the system without changing its logical behavior. It is often used to balance the delay of different stages.

The figure shows the shifting of the PC computation.

Pipeline register

To create a pipelined processor, we first need insert pipeline registers. The pipeline registers are in blue boxes, each containing fields that represent results from the stages. They are actual hardware components.

  • F holds a predicted value of the program counter
  • D holds info about the instruction for the decode stage.
  • E holds info for the execute stages.
  • M holds info for the memory stage and will also hold information about branch condition and branch target(info from execution stage).
  • W holds info for writing back to the register file and the return address to the PC selection for ret.

The graph above shows how a pipelined processor would hold information. At stage 5, I1 is about to write, I2 is about to do memory operations. This graph give us another justification for drawing the processor in a way that the instructions flow from bottom to top. The expanded listing of cycle 5 shows Fetch at the bottom and write at the top, just as the hardware.

In the sequential processor SEQ, we have values such as ValC, ValE that are unique. But for pipelined processor, we will have multiple version of these values for different instructions in the processor.

As we can see in a pipelined processor graph above, there are 4 stat fields in 4 different pipelined registers. Confusing one stat with another will cause error. So we use capital prefixes. D_stat for stat of the decode pipeline register and so on.

We also need to name some signals that have just been computed within a stage. We will prefix it with lowercase pipeline register’s first letter. The output of D_stat will be d_stat.

The actual stat is computed by a block in the writeback stage based on the stat in W.

Saving Dest till the write

For both sequential and pipelined processor, we will get dstE and dstM in the fetch stage. For sequential, we can directly output these signals to the register file write ports. But for sequential, we will need to hold it in each pipeline register till that instruction reaches write back stage. We want to make sure the write port address and data inputs hold values for the same instruction. Otherwise we will write to the register file with the valM and valE from the writeback stage register, but the address from the decode pipeline register.

Select A block

It is not present in the sequential processor. It will generate valA for pipeline register E and M. It selects from either valP or valA from the read port. It is to reduce the amount of state that must be carried forward to pipeline E and M Only call requires valP in the memory stage(writing the returned address onto the stack) and only *jump * requires valP in the execute stage. (In the event that the jump is not taken) None of these event requires value read from the register file When we need valP, valA won’t be needed so valP and valA can share the field in pipeline register E and M.

Next PC prediction

Our goal is to issue a new instruction on every clock cycle so in each clock cycle we will get a new instruction and a new instruction will proceed into the execute stage. If we can do this for all instruction, we will have a throughput of 1 instruction per cycle. But we can’t. Conditional jump and ret will need information that we can only get by completing previous instruction. (Not exactly, for ret, we can get back once the previous instruction writes the address to the memory in memory stage)

But other than conditional jump and ret, we can determine the address of the next instruction based on the info in fetch stage. If the instruction is unconditional jump or call, then the next instruction will be valC, the constant word for address and all others will be valP, the add for the next instruction. So we can issue new instruction almost every cycle.

But for branch prediction, we have to make a guess. If we take it, we will use valC, else we will use valP. The subject is called branch prediction.

To predict the ret is more difficult since we have an “unbounded” number of address that could be the ret address. So we will just simply hold off till the ret instruction passes through the writeback stage.

Aside: For most programs, prediction the ret address is easy since procedures call and return often comes in pairs and the software will just return to the instruction after the call. Some processors will include a hardware stack within the fetch stage to hold the return address generated by call.

Pipeline hazard

  1. Data dependency: the results of one instruction are used as the data for a following instruction
  2. Control dependencies: one instruction will determine the location of the following instruction

As we can see in the image, the third, forth and fifth instructions are all nop(no operations) so we will have enough clock cycles to allow 3 to pass to $rax.

If we don’t have enough ops to allow I1 and I2 to pass through to registers, we will run into error; We can see the addl instruction will execute at the same time when irmovel $3, %rax is writing to the register. So this will cause the addl use an outdated value of the register $rax.

To avoid it, we will use stalling, where the processor holds back one or more instruction until the hazard dependency is no longer valid. When the addq is in the decode stage, the pipeline control logic detects that at least one of the instructions in the execute, memory and write (all stages that are ahead of decode) will update the register. For our example, we will hold the addq instruction at decode stage through injecting bubble, a form of dynamically generated nop. We will also hold the halt at fetch stage.

A could potentially arise when the program will alter PC, register file, memory and the conditional code.

  1. register file: already discuss
  2. PC: for a conditional jump, if we predicted the wrong PC, we will have to roll back.
  3. Memory: reading and writing both occur in memory stage. By the time we want to read from memory for one instruction, previous instructions have finished writing. But we can have interference when one instruction tries to write in the memory stage while another instruction tries to fetch. This will occur when we have self-modifying code that writes to a portion of memory from which instructions are later fetched.
  4. Condition code registers: Cond registers are altered by integer operations in execution stage and will be used for conditional move in the execute stage and will be used by conditional jump in the memory stage. By the time the next instruction reaches to execute or memory, the previous instruction will already finish the update for condition code register. NO HAZARD.

Avoid data hazards by forwarding

When there are a pending write a register file and at the same time a pending load from decode stage on the same register, a hazard might occur. We could stall but we could also forward the value that’s about the be written directly to the input for pipeline register E as the source operand.

In cycle 6, the decode stage logic detects there is a pending write to one of the source operand $rax in the writeback stage. So it will use this value for the source operand valB rather than read from the register file.

In Prog3, the decode at cycle 5 detect there is a pending write to $rdx and a pending write to $rax in memory stage. It will use W_valE for valA and M_valE for valB.

We can even remove the only nop since we can fetch E_valE, the newly computed value, and forward it to the decode stage

Every value that will be written back to the register file has their register ID. The logic will compare the ID of write and the ID for sources (srcA, srcB) to detect cases for forwarding. If we have multiple destination registers ID matching sources, we need to establish a priority.

We can see from the final pipelined implementation, we have two extra units. Sel+Fwd A and Fwd B that will handle the forwarding. Sel+Fwd A combines the roll of selA with forward logic. So valA for pipeline register E can either be the incremented PC valP, the value read from register file, or the forwarded values. Fwd B implements the forwarding logic for source operand valB.

Load/Use data hazards

Memory reads occur late in the pipeline. So if we want to read a value into the register and then change it, we will have problems.

In cycle 8, mrmovq will load $rax, but in cycle 7, the addq instruction that follows right after mrmovq has already decoded the $rax value. mrmovq is too late.

We must stall the instruction. The decode stage detects that addq requires a result from a memory read and will inject a bubble into the execution stage. The injection will stop once the previous instruction has reached memory stage.

Avoiding control hazard

It arises when the processor can’t reliable find the next instruction. It could only occurs for ret and conditional jump.

For ret, we will inject bubbles until the ret has reached write-back stage and the PC selection logic will set the PC to to the return address.

For conditional jump, we will cancel the instructions after the mispredicted jump through injecting bubble at execution and stages after execution. Because from fetching conditional jump till finding out the jump is incorrect only takes 2 cycles, any instructions after conditional jump won’t proceed till execution so they won’t alter any programmer-visiable state. So simply make the instruction “disappear” through bubble.

Exception handling

Events that will cause exception to break the current program execution are

  • Internal(such as overflow, division by zero). Our processor has 3 ways:(a halt, (b invalid instruction, (c attempt to read, write to invalid address, both in memory stage or instruction fetch.
  • External. such as mouse move and internet package arrival.

After an excepting instruction, non of the following instruction should have any effect on the programmer-visible stage.

Three subtleties.

(1 It is possible to have multiple exceptions happening at the same time. EG: Fetching a halt instruction and an out-of-bound read occurs in the same cycle. We must establish priority. The basic rule is to give the instruction that is furthest along the pipeline as the priority. Because the memory access will be before the halt statement in machine learning, so only this exception should report to the OS.

(2 Instruction is fetched, executed, causes an exception and later it is canceled due to a mispredicted branch. EG: In a conditional jump, we have a jump to invalid instruction and the decode stage will find out the it is invalid and throw an exception. But the pipeline later detects that it shouldn’t take the branch and discard the previous instruction. We want to avoid raising an exception in this case.

(3 An instruction following one excepting instruction to alter some part of the state.

xorq will zero the $rsp and pushq will decrement the stack pointer. This exception is detected in the memory stage but on the same cycle, the execution stage has the addq and will set the conditional code to new values.
This violate the requirement that after an excepting event, we shouldn’t alter any programmer-visible state.

If an instruction has caused a exception, the status code will indicate the reason for exception and the status will propagate till writeback stage and the control logic will detect the exception and stop.

To avoid updating programmer-visible states, the pipeline logic will disable any update to conditional code, memory when an instruction in the memory or write-back stages has caused an exception.

We can see that the exception doesn’t alter the flow of the execution, except when we have to disable update for programmer-visible state for the following instructions. The structure of the processor makes sure the first exception will be the first to report to the write-back and to the system. If some instruction is fetched but later canceled, the exception is canceled as well.

PIPE stage implementations.

Random finding

CISC and RISC

CISC stands for complex instruction set and RISC means reduced Instruction set computer. CISC has very complex instruction set and x86 uses it. RISC believes efficient code could be generated by simpler form of instruction and it uses less hardware and can use efficient pipeline. PowerPC and ARM uses RISC.

But now everything blends together and the line between CISC and RISC blurs.

Leave a Reply

Your email address will not be published. Required fields are marked *

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax