# Author: admin

## Chapter 5, Randomized algorithm

# Randomized?

A randomized algorithm is an algorithm whose behaviors are not only determined by inputs but also by a random number generator.

# Hiring problem

You are the HR of a company and you need to recruit assistants. You will fire the current one if you find a better one. To do so, you need to get help from recruiting company who will charge for each interview and if that person is hired.

The cost of this problem highly depends on the order of the interview. The best case is to interview the best at the very front so the total cost would be where n is the number of interviewee. The worst case is when the ability of the candidates are in strictly ascending order. So you have to hire and fire every one except the last guy, yielding a cost of .

We have to find the **Average running time**, basically averaging the running time over all possible input. But to do so, we need knowledge, or **ASSMUPTION** about the distribution of the input.

## Prereq for analyze the problem

For hiring problem, we can assume the interviewee came in random order. There are n candidates so there are ways to order them(permutation).

if A occurs otherwise = 0.

EG: We define to be the indicator variable of a coin toss. and if head, else .

(1)

We can guess the expected value of a indicator variable of A is the probability of A.

Lemma 1:Given a sample space S and an event A in the sample space S, let . Then Pf

(2)

## Analyze the hiring problem

Define to be the number of hiring made in the process, we want to find . is interviewing n people and it is consists of , which are indicator variable about whether the th person is hired or not. and

, the expected value of the i-th person being hired means that the i th person must be better than all people ahead of him. For a random distribution, the chance of the i-th person being the best is . So .

(3)

So the average hiring cost will be , which is significantly better than .

# Randomized algorithms

For the hiring problem, we assume that our input follows uniform distribution. But sometimes we don’t have that knowledge. Another fact about our previous algorithm is that for a particular input, the time cost will always be the same. In other word, the algorithm is deterministic and the randomness rests in the input’s distribution.

But we want to know the cost even if we don’t know the distribution. We can **impose ** a distribution on the input. In the hiring problem, we permute the input. *No particular input elicits its worst-case behavior*.

The only thing to change

```
randomize_hire(n)
randomly permute the order of the candidate
best = 0
for i = 1 to n
interview candidate i
if i.value > best
best = i.value
hire i
```

To differentiate this from our previous notation, we call the cost of this algorithm expected cost, instead of average cost.

## Randomly permuting arrays

One way of permuting the array is to give each entry a random priority and sort it base on the priority.

```
Permute_by_sorting(A)
n = A.length
P = [1..n] //randomly create an array
for i = 1 to n
P[i] = Random(1, n**3)
sort A, using P as a sort key
```

We know sorting is O(nlgn).

We need to prove now that **the array is uniform random permutation**, it is likely to produce every permutation of 1 through n.

Lemma 4: Our procedure produce a uniform random permutation of inputs, assuming all priorities are distinct.

pf: we want to show that the probability of any permutation showing up is . For any array (1,2,…n), we generate permutation (perm(1), perm(2)… perm(n)), let be the event that 1 receives the perm(i)th smallest priority then we want to find the probability that all i, event occurs:

(4)

We know this probability joint probability will give us

# Quick Math fact

## Sum of expectation

Linearity of expectation is the property that the expected value of the sum of random variables is equal to the sum of their individual expected values, regardless of whether they are independent.

A better way to do this is to generate the array in place.

```
generate_in_place(A)
n = A.length
for i = 1 to n
A[i] = randomly pick from A[i...n]
```

We will use a loop invariant to show that at the end, the algorithm will produce a uniform random permutation.

Before we start, it is good to know what is a k-permutation. On a set n, a k-perm is a sequence containing k of the n elements. There are such possible k-permutations. And we know n-permutation will be n!

Lemma 5: randomize_in_place will give a uniform random permutation pf:

- Initialization: at the beginning, for i = 1, we know A[1] will be picked out of A[1]. So we have 1-permutation and the n possibility.
- Maintenance: assume at just before ith iteration and our assumption that for th, A[i – 1] follows a uniform distribution and each possible permutation will appear in A[0..i -1] with possibility . Now adding one more element to the subarray, we got A[0..i] and there are possible element to pick for A[i], so the probability for any permutation for subarray A[1..i] is , which is the formula for i-permutation.

Let’s make it more formalize. Let’s call the event in which the first i – 1 iterations have created the *particular* (i – 1) permutation in A[1..i – 1]. By the loop invariant, . Let’s denote be the event that the ith iteration put a particular in position i. . So if we want to know the the prob of permutation , we want to know .

.

# Advanced random stuff

Talked about brithday paradox There is the classic way then there is a way using indicator variable. It is simpler but an approximation. We define for , = I{person I and J have the same birthyday} = 1 if they do else 0.

Let X be the variable that counts the number of pairs of individuals having the same birthday,

Taking the expected value of it;

= k choose 2 .

We can think of it like this: The outer loop pick one person who are on the ith position, then the inner loop has ith to kth position to be chosen, effectively making it a k choose 2.

So the expected number of pairs that will have the same birthday will be . We want at least 1 pair so . We got k = 28.

I should talk about some distribution, Geometric, Bernoulli, Poisson and so on in another post.

#

## Chapter 3 Growth of function

# Asymptotic notation

This chapter talks about the rigorous mathematics formula behind the Big-O, Big-Omega and other notation. Since this chapter is skipped in the MIT open course, I will be brief.

Basically means is tightly bounded by n within a constant factors. There exist positive constants such that to the right of , the value of will always lies between .

Big-O gives us an upper bound in the same fashion. Big-Omega gives us the asymptotic lower bound. means passing certain , the value of is always above for some constant

When we have a asymptotic notation in the function, we will interpret it as some anonymous function that we don’t care to name.

We also have little-o and little-omega to non-asymptotically tight bounds. If we have it means the bound holds for all constants c > 0. Intuitively, for large enough n, becomes insignificant.

# Math magic

Then we talk about some basic mathematics functions such as polynomial and exponential.

## Exponential

so So the exponential with a base strictly bigger than 1, will grow faster than any polynomial function.

Then we Taylor Expand to talk about some property.

## Factorial

There isn’t a clear formula for factorial, but the Stirling’s approximation is a good approximate for factorial.

## Fibonacci number

We have proven a formula for related to the golden ration

Where is the conjugate of the golden ration.

We can manipulate a bit to find that . Which means the ith Fib is equal to rounded to the nearest integer.

## Chapter 4

# Overview

In this chapter we will take a look at processor hardware and the instructions that the hardware can support, called *instruction set architecture*.

One important concept is that *ISA* ‘s logic can differ from the actual way the processor works. The *ISA* will appear to be sequential but sometimes the different instructions are executed simultaneously.

**We want to have clever tricks that improve the performance while maintaining the functionality of a simpler and more abstract model**

## The processor we are working with :Y86-64

It is functionally correct and sequential. It executes a complete instruction on every clock cycle and the clock is **slow** to allow an entire series of actions to complete within one cycle.

Then we will apply transformations to create a *pipelines processor*. It will break each execution into five stages, each will be handled by a separate section of hardware. Instructions progress through the stages of the pipeline with one instruction entering the pipeline in one clock cycle. As a result, the processor can process up to **5 instructions** at the same time.

# The Y86-64 instruction set

Defining a instruction set means defining the different components of its state, set of instructions and their encodings, a set of programming convention and exception handling.

## programmer-visible state.

Each instruction in Y86-64 can read and modify some part of the processor. This is the visible state. The 15 registers, condition codes(ZF,SF,OF) about the most recent arithmetic or logical ops, PC, memory and stat, which indicates the state of the program execution.

## Y86-64 instruction

It is largely a subset of X86 and has fewer addressing mode and only 8-byte integer operations. Here are a few “quirks” of our Y86:

- movq is split into four instructions: irmovq, rrmovq, mrmovq and rmmovq, where i stands for immediate, r stands for register and m stands for memory. It will become useful when we implement them.
- As with x86, we cant move from memory to memory, and can’t move immediate to memory.
- We have four operations and they can only operate on
**register data**. But on x86, they can operate on memory as well. These instruction will set the three conditional code.

## Instruction encoding

The byte encoding of the instructions are between 1 to 10 bytes, depending on what fields are required. All instructions have a initial byte that indicate the instruction type. The high-order part is the *icode*, the low-order part of the byte is the *ifunc* that will differentiate between different functions under the same code. For example, 0x6 for icode is operations, and 0x0 for ifunc is addq, 0x1 for ifunc is subq and so on.

### One-byte long instruction

Some instructions are only 1 byte long because they don’t need any other infos. They are *ret, nop* and *halt*.

For other instructions, we will need more bytes because they require operands. The next bytes inline is the *register specifer byte*. We have 15 registers in Y86 and the high-order and low-order specify the two registers. The value from 0x0 to 0xE are associated with %rax to %r14. The value 0xF is reserved for *no register*. We use 0xF for operations that only need one register such as *pushq*.

We might also need a 8-byte constant word. It can serve as the data for irmovq and displacement for rmmovq, and destination of calls. BEWARE the address are absolute (because it is simplier) and is in byte-reverse order, the little-endian encoding.

## Exception

The status code in stat has four possible value. AOK(normal), HLT(halt), ADR(address invalid) and INS(invalid instruction. Any read or write attempt to an address out of the addrses limit will trigger the ADR.

In y86, when exception happens, we will just stop.

## Two unusual instructions

pushq will decrement the stack pointer by 8 and writes a register value to the memory. So what will pushq %rsp do? Two possibility:

- push the original value
- Push the decremented value

Let’s first try putting a code together

movq %rsp, $rax pushq $rsp popq $rdx subq %rdx, %rax ret

The return value is always 0. So it means we pushed the undecremented value onto the stack. The same question can be raised for popq $rsp. A similar code can be used to determine the effect. We found popq $rsp will put the value of ($rsp) to $rsp. It is equivalent to mrmovq ($rsp), $rsp.

# Logic design and Hardware control language HCL

We need three components to implement a digital system.

- combinational logic to computer functions on bits
- Memory element to store bits
- clock signal to regulate the update of memory

nowaday don’t have to draw diagrams anymore and use a textual notation. The most common languages are Verilog and VHDL. In 1980s, logic synthesis is developed to generate efficient circuit from HCL.

## Logic gates

They are basic electronic component that generates output based on some boolean functions of inputs. Logic gates are always active and if some input changes, within a short amount of time, the output will change accordingly. Putting logic gates together, we got combinational circuits. Several restrictions apply:

- inputs must be either system input, output of memory our output of logic gates
- outputs of two gates can’t be connected together
- The network must be acyclic. No path can form a loop. Else there will be ambiguity.

### Multiplexer

It will select input based on the bit pattern of the selection bit. We will use case expressions in HCL.

[ select1 : expr1; select2 : expr2; …

]

where select is a Boolean expression, indicating when to select the expression.

Example: If we have two selection bits, we can select 2^2 = 4 inputs.

[ !s1 && !s2 : A; !s1 : B; !s0 : C; !1 : D;

]

We can also simplify the select statement.

### ALU

arithmetic logic unit can perform ops on data.It has two data input and one control input.

## Memory and clocking

Combinational circuit, by definition, doesn’t store information and will simply react to signal inputs. To create sequential circuits that can have **states and perform computations on the state**, we must have devices that store information. Our storage devices will all be controlled by clock, a *periodical* signal that determins when new values will be loaded into the devices. The registers will only change according to inputs when the clock signal rises.

# Sequential Y86 implementations

In this version, our sequential processor,SEQ, will perform all the steps required to process a complete instruction. It will be slow.

## Stages

- Fetch: reads the bytes of an instruction from memory, using the PC as the memory address. It will grab the icode, ifunc and possibly the register specifier rA and rB and an 8-byte constant word valC. In the end, it will compute valP to be the address of the next instruction.
- Decode: reads up to two operands from the register file, and give the values stored in the rA and rB, giving valA and valB.
- Execute: the ALU will perform the operations specified by icode and ifunc, or computers the effective address of memory reference. The conditional code is possibly set and conditional jump/move will be determined to move/jump or not.
- Memory: write data to memory or read data from memory
- Writeback: write back to up to 2 result to register file.
- PC update: PC is set to the address of the next instruction.

### OPq

An example, we will work through OP rA, rB.

Fetch: icode: ifunc <- M1[PC] // read 1 byte and get instruction code and function.

rA:rB <- M1[PC + 1] // read the target register

valP <- PC + 2 // find the next instruction after OP

Decode: ValA <- R[rA], ValB <- R[rB]

Execute: ValE <- ValB OP ValA, set CC //conditional code

We put ValB in front of ValA because we want division to have B divided by A.

Memory: No memory instruction

Write back R[rB] <- ValE

PC update: PC <- ValP

### rmmovq

We can also use ALU to get the effective address of memory, as can be seen in move instructions

Example: rmmovq rA, D(rB)

Fetch: icode: ifunc <- M1[PC], rA:rB <-M1[PC + 1], ValC <- M8[PC + 2], ValP <- PC + 10 // we need access to displacement, it is an 8 bytes const after register specifying byte.

Decode: ValA <- R[rA], ValB <- R[rB].

Execute <- ValE <- ValB + ValC //displacement D

Memory M[ValE] <- ValA

Writeback: no writeback

PC update <- PC <- ValP

If we have rmmovq, where there is no displacement, in execute stage we will have : ValE <- 0 + ValA.

I guess this is the case because we need to pass ValA to writeback and it has to go through ALU. Kind of like zero-padding.

### Push and Pop

It is among the most difficult instructions in Y86 because we both need to access memory and also update the stack pointer.

push rA

Fetch: icode: ifunc <- M1[PC], rA <-M1[PC + 1], ValP <- PC + 2

Decode: valA <- R[rA], ValB <- R[$rsp]

Execute: ValE <- ValB – 8

Memory: M[ValE] <- ValA

Writeback: R[%rsp] <- ValE

PC update : PC <- ValP

we can see ValB has the value of the old $rsp. We observed the rule that we should update the value of $rsp before writing to it is followed. (even though the actual update happened after the memory operation.)

pop rA

Fetch:R[$rsp] , ValP <- PC + 2

Decode: valA <- R[$rsp], ValB <- R[$rsp]

Execute: ValE <- ValB + 8

Memory: ValM <- M[ValA]

Writeback: R[$rsp] <- ValE, R[rA] <- ValM

PC update: PC <- ValP

We can read two copies of the stack pointer. Because doing so can make the flow of the instruction much similar to instructions.

We used the incremented value to update the stack pointer and the un-incremented value as the memory access.

### Call and ret

call dest

Fetch: icode: ifunc <- M1[PC], valC<- M8[PC + 1], ValP <- PC + 9

Decode: valB <-R[$rsp]

Execute: ValE <- ValB – 8

Memory:M[ValE] <- ValP

Writeback: R[$rsp] <-ValE

PC <- ValC

For call, the address of the function is the 8 byte right after instruction byte. We also need to use ValB to store the stack pointer and decrement it to store the return address. Then the PC is set to the destination.

ret

Fetch: icode, ifunc <- M[PC]

Decode:ValA <-R[$rsp], ValB < R[$rsp]

Execute: ValE <- ValB + 8

Memory: ValM <-M8[ValA]

Writeback: R[$rsp] <- ValE

PC update: PC<- ValM

Ret also keeps two copy of unincremented $rsp. One is used for instruction update, another one is used for memory access

These two instructions are very similar to push and pop but we push and pop the program counters.

## The hardware structure

A single clock signal transition trigger a flow through combinational logic to execute an entire instruction. We can treat the RAM units as another combinational circuit; it output word based on the address input, just like the circuit. So we are left with four hardware to control: The PC, the conditional code register, the data memory and the register file. The PC is loaded with the new instruction every clock cycle. The condition code is loaded when an integer opration is executed. The data memory is written only when rmmovq, pushq or call is executed. The two write ports of the register file allow two program registers to be updated on every cycle. All of these state update happened simultaneously but **But they achieve the same effect as the sequential execution in the previous section**. This is achieved because we organized the computation in a way that follows some principle:

PRINCIPLE 1: No reading back. The processor never needs to read back updated state in order to complete other instruction. Let’s first start with a counter-example: suppose in pushq, we will first decrement the $rsp then use the updated $rsp to store the pushed value. This would violate the principle. We will need to go back to register file to retrieve the updated $rsp. On the other hand, our implementation generate the decremented value ValE and use it both as the memory write address and also the update value for $rsp, so we can do memory write and register update **simultaneously **. Along the line of this principle, we shouldn’t both set and then read the condition codes in any instruction. Even the condition codes are not set until the next cycle, they will be updated at the very beginning of the next cycle before any instruction can read it.

As we can see, at the beginning of the cycle, the instruction hasn’t been executed and at the end of cycle 3, all combinational circuits have reacted to the given inputs and the update values have been generated.

## Stage implementation

We will devise HCL descriptions for the control logic blocks required for SEQ.

### Fetch stage

The fetch will read 10 bytes from memory and the first byte is split into 2 4 bits section as icode and ifun. Then compute the instruction codes or, in the event the instruction is invalid, the value is corresponded to a nop instruction. Based on the icode, we can compute three 1 bit signals(in dashed line)

- instr_valid: Is the instruction valid?
- need_regids: Does the instruction need register id?
- need_valC: Does this instruction need constant word?

The signals *inst_valid * and *imem_error* ( when the instruction address is out of bound). are used to generate the status code in the memory stage.

The HCL for need_regid is :

```
<br />bool need_regids = {
icode in {RRMOVQ, RMMOVQ, MRMOVQ, PUSH, POP, IRMOV}
}
```

and the HCL code for *need_valC* is

```
bool need_valC = {
icode in {IRMOVQ, RMMOVQ, MRMOVQ, JXX, CALL}
}
```

The remaining 9 bytes are processed by the “align” hardware into the register fields and the constant word. Byte 1 is split into 2 register specifying fields and if *need_regids* is 0, both register specifying fields will be 0xF(RNONE). Depends on if we have register specifying fields, the constant field will either be 1-8 bytes or 2-9 bytes.

The PC incrementer hardware unit will generate valP. Depends on *need_regids* and *need_valC* value, valP will vary. The formula is : PC + r + 8*p, where r and p will be either zero or one, indicating the status of *need_regids, need_valC*.

# Principles of pipelining

Pipeling is like the Ford model-T production line; each unit of the processor will do their jobs on different instructions and they don’t have to wait till the instruction finished.

For the sequential, the clock can’t be too quick otherwise, we the clock will erase the registers before an instruction finished. We have five stages(fetch, decode…) and we have to let the clock’s ticking time to be longer than the time it takes for the signal to propagate.

But for pipelined processors with pipeling registers, we can run each stage of an instruction as soon as the previous instruction left that stage. For example, Instruction just finished fetch and stored all the info it needs in a pipeling register, instruction B can immediately uses the fetch part of the processor after the clock ticks.(Else it has to wait instruction A for the next few stages.)

The throughput(the number of instruction in given time) will increase. But the latency(the time required to finish an individual instruction, will increase as well.

Formula:

- Through put = instruction / time for cycling the clock.

The first image is a pipelined processor. We can see we have divided the computation in three stages.And each instruction will require three clock cycles to finish.

The second image shows how three instructions will be finished. As soon as I1 is done with combinational logic A, I2 can start using A. Ignoring the time to transfer in and out of the pipelining registers and assume each combinational logic takes unit time, the sequential will take 9 clock cycles to finish the job but the pipelined will take 5 clock cycles.

When the inputs for B is ready for I1, the clock’s rising will load the input of the first pipeline register.

Slowing the clock will not change the behavior but making the clock too fast will be bad; The needed information hasn’t been calculated out by the circuit and the clock tick will let whatever is in the pipeling register output to the next combinational logic. BAD.

## Limit of pipeling

### Nonuniform partitioning

The rate of the clock is determined by the delay of the slowest stage.

We can see that the delay in each stages are different and stage A will have to be idle when for I2 since it has to wait for I1’s B to finish first. If we clock A too fast, we will let I2 enter stage B prematurely and erase I1’s B stage information.

## Pipelining a system with feedback.

We might run into dependency between successive instructions.

- data dependency, where the next instruction might need the result of the current one.
- control dependency,the outcome of a conditional test will determine the next instruction. A good example is conditional jump. We might need to change the system’s behavior in order to do pipeling for system with feedback.

# Pipelined Y86-64 implementations

We first shift the computation of the PC into the fetch stage. Then we add pipeline registers between the stages.

## Rearranging the computation stages.

We move the PC update stages to the beginning of an instruction, so it computes the **current ** instruction. (For our previous model, the PC is at the end and will compute the instruction for the next one.

In our new PC update stage, we have state registers to hold the signals from the previous computation and when the clock rises, we will **select** the appropriate PC through the exact same logic we use to compute PC in the previous SEQ model. There is no hardware register storing the program counter. The PC is dynamically computed. (We can implement the processor different from the conceptual model.

The shift of stat element is called *circuit retiming*, which change the implementation/representation of the system without changing its logical behavior. It is often used to balance the delay of different stages.

The figure shows the shifting of the PC computation.

## Pipeline register

To create a pipelined processor, we first need insert pipeline registers. The pipeline registers are in blue boxes, each containing fields that represent results from the stages. They are actual hardware components.

- F holds a
*predicted value of the program counter* - D holds info about the instruction for the decode stage.
- E holds info for the execute stages.
- M holds info for the memory stage and will also hold information about branch condition and branch target(info from execution stage).
- W holds info for writing back to the register file and the return address to the PC selection for
*ret*.

The graph above shows how a pipelined processor would hold information. At stage 5, I1 is about to write, I2 is about to do memory operations. This graph give us another justification for drawing the processor in a way that the instructions flow from bottom to top. The expanded listing of cycle 5 shows Fetch at the bottom and write at the top, just as the hardware.

## In the sequential processor SEQ, we have values such as ValC, ValE that are unique. But for pipelined processor, we will have multiple version of these values for different instructions in the processor.

As we can see in a pipelined processor graph above, there are 4 stat fields in 4 different pipelined registers. Confusing one stat with another will cause error. So we use capital prefixes. D_stat for stat of the decode pipeline register and so on.

We also need to name some signals that have just been computed within a stage. We will prefix it with lowercase pipeline register’s first letter. The output of D_stat will be d_stat.

The actual stat is computed by a block in the writeback stage based on the stat in W.

### Saving Dest till the write

For both sequential and pipelined processor, we will get dstE and dstM in the fetch stage. For sequential, we can directly output these signals to the register file write ports. But for sequential, we will need to hold it in each pipeline register till that instruction reaches write back stage. We want to make sure the write port address and data inputs hold values for the same instruction. Otherwise we will write to the register file with the valM and valE from the writeback stage register, but the address from the decode pipeline register.

### Select A block

It is not present in the sequential processor. It will generate valA for pipeline register E and M. It selects from either valP or valA from the read port. It is to **reduce the amount of state that must be carried forward to pipeline E and M** Only *call* requires valP in the memory stage(writing the returned address onto the stack) and only *jump * requires valP in the execute stage. (In the event that the jump is not taken) **None of these event requires value read from the register file** When we need valP, valA won’t be needed so valP and valA can share the field in pipeline register E and M.

### Next PC prediction

Our goal is to *issue* a new instruction on every clock cycle so in each clock cycle we will get a new instruction and a new instruction will proceed into the execute stage. If we can do this for all instruction, we will have a throughput of 1 instruction per cycle. But we can’t. Conditional jump and ret will need information that we can only get by completing previous instruction. (Not exactly, for ret, we can get back once the previous instruction writes the address to the memory in memory stage)

But other than conditional jump and ret, we can determine the address of the next instruction based on the info in fetch stage. If the instruction is unconditional jump or call, then the next instruction will be valC, the constant word for address and all others will be valP, the add for the next instruction. So we can issue new instruction almost every cycle.

But for branch prediction, we have to make a guess. If we take it, we will use valC, else we will use valP. The subject is called branch prediction.

To predict the ret is more difficult since we have an “unbounded” number of address that could be the ret address. So we will just simply hold off till the ret instruction passes through the writeback stage.

Aside: For most programs, prediction the ret address is easy since procedures call and return often comes in pairs and the software will just return to the instruction after the call. Some processors will include a hardware stack within the fetch stage to hold the return address generated by call.

## Pipeline hazard

- Data dependency: the results of one instruction are used as the data for a following instruction
- Control dependencies: one instruction will determine the location of the following instruction

As we can see in the image, the third, forth and fifth instructions are all nop(no operations) so we will have enough clock cycles to allow 3 to pass to $rax.

If we don’t have enough ops to allow I1 and I2 to pass through to registers, we will run into error; We can see the addl instruction will execute at the same time when irmovel $3, %rax is writing to the register. So this will cause the addl use an outdated value of the register $rax.

To avoid it, we will use **stalling**, where the processor holds back one or more instruction until the hazard dependency is no longer valid. When the addq is in the decode stage, the pipeline control logic detects that at least one of the instructions in the execute, memory and write (all stages that are ahead of decode) will update the register. For our example, we will hold the addq instruction at decode stage through injecting bubble, a form of dynamically generated nop. We will also hold the halt at fetch stage.

A could potentially arise when the program will alter PC, register file, memory and the conditional code.

- register file: already discuss
- PC: for a conditional jump, if we predicted the wrong PC, we will have to roll back.
- Memory: reading and writing both occur in memory stage. By the time we want to read from memory for one instruction, previous instructions have finished writing. But we can have interference when one instruction tries to write in the memory stage while another instruction tries to fetch. This will occur when we have self-modifying code that writes to a portion of memory from which instructions are later fetched.
- Condition code registers: Cond registers are altered by integer operations in execution stage and will be used for conditional move in the execute stage and will be used by conditional jump in the memory stage. By the time the next instruction reaches to execute or memory, the previous instruction will already finish the update for condition code register. NO HAZARD.

### Avoid data hazards by forwarding

When there are a pending write a register file and at the same time a pending load from decode stage on the same register, a hazard might occur. We could stall but we could also forward the value that’s about the be written directly to the input for pipeline register E as the source operand.

In cycle 6, the decode stage logic detects there is a pending write to one of the source operand $rax in the writeback stage. So it will use this value for the source operand valB rather than read from the register file.

In Prog3, the decode at cycle 5 detect there is a pending write to $rdx and a pending write to $rax in memory stage. It will use W_valE for valA and M_valE for valB.

We can even remove the only nop since we can fetch E_valE, the newly computed value, and forward it to the decode stage

Every value that will be written back to the register file has their register ID. The logic will compare the ID of write and the ID for sources (srcA, srcB) to detect cases for forwarding. If we have multiple destination registers ID matching sources, we need to establish a **priority**.

We can see from the final pipelined implementation, we have two extra units. Sel+Fwd A and Fwd B that will handle the forwarding. Sel+Fwd A combines the roll of selA with forward logic. So valA for pipeline register E can either be the incremented PC valP, the value read from register file, or the forwarded values. Fwd B implements the forwarding logic for source operand valB.

### Load/Use data hazards

Memory reads occur late in the pipeline. So if we want to read a value into the register and then change it, we will have problems.

In cycle 8, mrmovq will load $rax, but in cycle 7, the addq instruction that follows right after mrmovq has already decoded the $rax value. mrmovq is too late.

We must stall the instruction. The decode stage detects that addq requires a result from a memory read and will inject a bubble into the execution stage. The injection will stop once the previous instruction has reached memory stage.

### Avoiding control hazard

It arises when the processor can’t reliable find the next instruction. It could only occurs for ret and conditional jump.

For ret, we will inject bubbles until the ret has reached write-back stage and the PC selection logic will set the PC to to the return address.

For conditional jump, we will cancel the instructions after the mispredicted jump through injecting bubble at execution and stages after execution. Because from fetching conditional jump till finding out the jump is incorrect only takes 2 cycles, any instructions after conditional jump won’t proceed till execution so they won’t alter any programmer-visiable state. So simply make the instruction “disappear” through bubble.

## Exception handling

Events that will cause exception to break the current program execution are

- Internal(such as overflow, division by zero). Our processor has 3 ways:(a halt, (b invalid instruction, (c attempt to read, write to invalid address, both in memory stage or instruction fetch.
- External. such as mouse move and internet package arrival.

After an excepting instruction, non of the following instruction should have any effect on the programmer-visible stage.

#### Three subtleties.

(1 It is possible to have multiple exceptions happening at the same time. EG: Fetching a halt instruction and an out-of-bound read occurs in the same cycle. We must establish priority. The basic rule is to give the instruction that is furthest along the pipeline as the priority. Because the memory access will be before the halt statement in machine learning, so only this exception should report to the OS.

(2 Instruction is fetched, executed, causes an exception and later it is canceled due to a mispredicted branch. EG: In a conditional jump, we have a jump to invalid instruction and the decode stage will find out the it is invalid and throw an exception. But the pipeline later detects that it shouldn’t take the branch and discard the previous instruction. We want to avoid raising an exception in this case.

(3 An instruction following one excepting instruction to alter some part of the state.

xorq will zero the $rsp and pushq will decrement the stack pointer. This exception is detected in the memory stage but on the same cycle, the execution stage has the addq and will set the conditional code to new values.

This violate the requirement that after an excepting event, we shouldn’t alter any programmer-visible state.

If an instruction has caused a exception, the status code will indicate the reason for exception and the status will propagate till writeback stage and the control logic will detect the exception and stop.

To avoid updating programmer-visible states, the pipeline logic will disable any update to conditional code, memory when an instruction in the **memory or write-back** stages has caused an exception.

We can see that the exception doesn’t alter the flow of the execution, except when we have to disable update for programmer-visible state for the following instructions. The structure of the processor makes sure the first exception will be the first to report to the write-back and to the system. If some instruction is fetched but later canceled, the exception is canceled as well.

## PIPE stage implementations.

# Random finding

## CISC and RISC

CISC stands for complex instruction set and RISC means reduced Instruction set computer. CISC has very complex instruction set and x86 uses it. RISC believes efficient code could be generated by simpler form of instruction and it uses less hardware and can use efficient pipeline. PowerPC and ARM uses RISC.

But now everything blends together and the line between CISC and RISC blurs.

## So many bit tricks

While reading the CMU intro to system course, I realized how many bit tricks are out there that could potentially boast the speed of the program. I will use this post as a place to record all the bit tricks I have seen either from this course or from other places suck as hacker’s delight.

## Chapter 2(representing and manipulating information) notes

# Info storage

## Hex

We want to see bit pattern. Binary is too cumbersome and decimal needs to be converted so we have Hax. The val of a single byte can range from

Convert from decimal to hex: (1)Divide the decimal number by 16 repeatedly: (2)Then r will be the least significant digit of the hex. (3) repeat this process on q

Convert from hex to decimal: Multiply each digit of the hex by the appropriate power of 16.

## Word

Word size: the normal size of int or pointer. Very important because it dictates the size of the virtual address space. Since pointer is as long as word length, so the address range from .

Pointers need to use the full size of the word so for a 64-bit machine the pointer will be 8 bytes long. But for 32-bit machine it will only need 4 bytes. Potentials for bugs since some 32-bit programs would store pointer in int type but that won’t work for 64bits.

## Address for multi bytes objects

Multi-bytes object is stored on contiguous sequences of bytes and the address is the smallest address of the bytes used.

## Byte ordering(Important, came up in Network)

0x87654321 Little Endian: The beginning(smallest) address will store the least significant bit(on the right most of the number) 12, 34,56,78

Big endian: start with storing most significant to least(Left to right, natural for us)

It matters in

(a)Network: If little communicate with big through a string of bytes, the bytes will be received in reverse order.

(b)Reading bytes generated by little endian: Need to reverse it. Least on the left and most sig on the right.

## Strings

A string in C is an array of chars terminated by the null. Each char is represented by some encoding (ASCII).

Text data is more independent than binary data.

## Boolean algebra

Multiplication distributes over addition, so does AND distributes over OR. We can’t distribute addition over multiplication, But we can distribute or over AND. a | (b & c) = (a | b) & (a | c)

Each element is its own inverse in bool algebra. a ^ a = 0 where 0 is an all-zero vector. This holds even we rearrange and combine them in different orders. (a ^ b) ^ a = b. An very interesting application is in place swap

```
void inplace_swap(int *x, int *y){
*y = *x ^ *y;
*x = *x ^ *y;
*y = *x ^ *y;
}
```

If the x stores a and y stores b, then

step & *x & *y

initial & a & b

1 & a & a ^ b

2 & a ^ a ^ b = b & a ^ b

3 & b & a ^ b ^ b = b

(The Latex tabular is broken lol)

## Bit masking

using bit patter to indicate a selected set of bits within a word.

x & 0xFF will get the least significant byte of x.

~0 will get a mask of all ones, regardless of word size(portable)

## Logic operatior && || and ~

OR, AND, NOT in logic and any non-zero arg is TRUE and the operation will return True or False. It has a short circuit, if the result of the expression can be determined by evaluating the first arg, then the operation won’t eval the second args. So a && 5 / a won’t ever cause a division by 0 and p && *p++ won’t dereferencing a null pointer

A way to do x == y is to

```
!( x ^ y)
```

x ^ y will give us all zero if all the bits match. Then the NOT will return 1 only if all bits are zeros.

## shift operation in C.

x = , we apply x << k, shifting k bits to the left and drop off the k most significant bits, filling the right with k zeros. We got . Shift associate from left to right. x<<j<<y = (x<<j)<<y

Right shift. Two kinds. 1:arithmatic, 2: logic. The logic one is similar to left shift, shifting k elements to the right and leave k zeros on the left. Arithmetic will shift to the right by k, but instead of filling k zeros on the left, it will have k , the most significant bit.

# Integer representation

## Two’s complement

We can see the most significant bit(sign bit) is a negative weight. When the sign bit is 1, the whole number has to be negative.

The range of two’s complement is asymmetric; because half of the bit patterns are for negative and half are for non-negative, which includes zero so the will be 1 smaller than .

## Other representations:

One’s complement: Similar to two’s complement, except the negative weight of the most sig dig is .

Sign magnitude. The sig dig determine the sign.

Both method has two ways to representation zero. NOT cool. Note the different position of apostrophes: Two’s complement versus Ones’ complement. The term “two’s complement” arises from the fact that for nonnegative x we compute a w-bit representation of −x as 2w − x (a single two). The term “ones’ complement” comes from the property that we can compute −x in this notation as [111 . . . 1]− x (multiple ones).

## Conversion between signed and unsigned.

When convert between sign and unsigned whose word sizes are the same, the numerical value might change but the bit values stay identical. (

```
short int v = -12345;
unsigned short uv = (unsigned short) v;
```

v = -12345,where uv = 53191 . Because now we interpret the two’s complement as the unsigned. The numerical value difference between unsigned and two’s complement with the same bit pattern is 0 if is if (negative)

When an operation is performed on an int and an unsigned int, the int will be dragged to to unsigned, assuming the number is nonnegative.

Problem for relational operators < and >. eg: -1 < 0U. -1 will be thought as nonnegative so -1 is actually 4294….. and 4293…. < 0 is false.

## Expanding bit representation

Type promotion, expand space. For unsign we use zero expansion, adding zeros to the left. For two’s complement, we use sign extension, adding copies of the most significant bit to the left. Case in point: 101 = -3, 1101 = -8 + 4 + 1 = -3. The newly added 4th bit will cancel out half with the previous 3rd bit, therefore we will subtract the same number. (To prove formally, use induction)

If we are doing

```
short sx = -12345;
unsigned uy = sx;
```

uy = 4294954951

Which means we first convert short to (unsigned)(int) sx.

## Truncating

w-bit number to k-bit number , we will drop the leading w – k bits. For unsigned number x, the numerical value is equivalent to . Because . (4 mod 2 = 0, 8 mod 2 = 0).

For two’s complement, it is very similar. We first convert it to unsign, then unsign to two’s complement

# Integer Arithmetic

Word size inflation: To to addition for two k bits number might require k + 1 bts. can not place any bound on the word size required to fully represent the result. If the leading bit of k + 1 bit is 0, so the numerical value of this int addition is the same as the exact math addition.

If , then overflows happened, and we discard the leading k + 1 bit,(truncate) and this is equivalant to .

To check if overflow happens or not for , we can check . if this is true, then overflow didn’t happened. Because so . On the other hand if overflow happens, and so . So if overflows happened,

## Floating point

### format

For IEEE floating point, we have . Where s stands a single sign bit. Significand Mis a fractional binary number that ranges either between 1 and 2 – or between 0 and 1 – . The exponent E weights the value by a ( possibly negative) power of 2.

For single precision, s, exp, and frac are 1, k = 8, n = 23.

For double precision, s, exp, and frac are 1, k = 11, n = 52.

### Three cases of floating point

#### Normalized value.

When exp is neither all 0 nor all 1(255). It is interpreted as a sign integer in biased form. where is the unsigned number with bit representation and is a bias value equal to .(127 for single precision). So the exponent ranges from -126 to 127.

##### Implied leading bit: the leading bit is always one so we can ignore it and use the leftover bit for one bit more precision. To see that,we use an example:

. Same with binary, we can always shift the exponent so the leading bit is one.

#### Denormalized values

When the exponent field is all ZEROS. Exponent value is E = 1 – Bias and the significand value is M = f. The value of the fraction field is without an implied leading 1.

##### It can reps 0 since a normalized value’s M is always M > 1.

## Floating points rounding.

Find the closest matching number in floating point number format. The design decision for rounding is when the result is between two possible result. For example, If we want to round to the closest integer and our number is 2.5, we have 2 and 3 to choose from.

The default is round-to-even, which rounds the number either up or down so the least significant digit is 0(even). To do so will avoid statistical bias.

Only bit pattern XXX.YYYY100…, where X and Y are arbitrary numbers, will need round-to-even because it is half way between two results.

## Floating ops.

Floating additions are communicative but not associative.

Floating point multiplication is communicative but not associative due to possible overflow or losee of precision.

# Random findings

(a) Floating point arithmetic is not associative! A case in point is 3.14 + (1e20 – 1e20) = 3.14 but for most computer

(b) The C++ use the same exact numerical representations and ops as C. But Java uses a new set of standards! Java is more specific on the formats.

(c)

```
zhenk14 > gcc -std=c99 prog.c
```

is using ISO C99 standard, which took over ANSI (c89)

(d) using a cast to allow an object to be referenced according to a different data type from which it is created. Discouraged but used in system programming.

(e) long in 32-bit machine is as long as an int?

IA32 floating point arithmetic use a special 80 bit extended precision floating point format. All single and double precision floating point number are converted to 80 bits to perform arith ops. After the operations, the results are converted back to store in the memory.

# About C

## We can dereference with the array notation

```
byte_pointer start;
start[3] //This will access the 4th elem stored in the array pointed by the byte_pointer.
```

In general, start[i] will pointed i position beyond the loc pointed by start.

## type

Typedef can name type and is very similar to var declaration.

## pointer creation and deref

The C “address of” operator & creates a pointer.

```
int *byte_pointer = &x //x is an int
```

Will create a pointer that points to the location that holds x. The type of the pointer depends on the type of x.

## Typecast

```
(byte_pointer ) &x //x is an int
```

will cast the pointer to x, which is an int, to type byte_pointer and the program will reference it that way. Nothing change in the byte level, just the compiler to refer the data according to byte_pointer type.

## Strlen() doesn’t count the null terminator

## shift operations

If we are trying to shift k bits when the data is w and then in C it is not specified but generally it should perform . JUST DONT DO IT Also shift has lower precedence than addition.

## Floating point ops

C doesn’t require IEEE. But it is is math.h.

## Paper worth reading

## It has been four years!

It has been almost four years since I left high school for college. While I was cleaning my room and hard drives for move-out, I found the graduation speech I gave four years ago. It’s fun to read it again four years later and it serves as a good reminder of what and where I should be striving.

I will post the speech below, unedited. Enjoy!

Graduation speech

05/16/2014

Kedi Zheng

Before I get started, I want to thank and recognize my guardians and parents, who have supported me through out the years. Special thanks to Ms Patience nave, who introduced me to ORMA, cultivated me during my early year at ORMA, and Ms Vicki Miller, my second generation guardian, who has been encouraging and caring for me ever since.

I remain forever grateful for their nourishing.

The board, President Nobles, distinguished guests, faculty, staff, families, and fellow cadets, good morning. My name is Kedi Zheng and I represent one of the many voices of this graduating Class of 2014.

The only thing better than completing high school is the chance to convey my thought and experience to a captive audience through a lengthy speech, a chance I am now ready to take full advantage of.

I have been thinking of what I should say to interest many, amuse some, amaze a few, and, most importantly, to make this process as energetic as possible to keep us all awake and away from our cellphones. So I did what every student does; I plagiarized on Google and YouTube. But the only thing I found after cruising seas of data is that cut and paste could only produce second-class speech for the audience and verbal ordeal for the speaker.

So I decided to speak my mind and the first thing came up to me is appreciation.

Of all the commencement speeches this podium has ever held, nine out ten are about brotherhood, courage, civil responsibility and grand future. I certainly have enjoyed and valued these things our past cadets cherished. In fact, one of my best memories at Oak Ridge is doing the obstacle course with Sergeant H and my fellow cadets. But this is not only a time for fruition, a time for excessive self-promotion, but also a time for reflection and appreciation. I would be gravely remiss if I addressed the class of 2014 without mentioning our guides, mentors, and people who have facilitated and improved our life on and off campus.

There is a group of people who have been faithful to the development and evolution of our school. They have been indispensable to our cadets daily life, from morning inspection by TAC officers to classes taught by teachers with professionalism. They impart knowledge, wisdom and leadership, they build the platform for student leadership programs and guide us to stride forward. They are on the stage of our academy, but have often not been recognized in this finale. They have stood the vicissitude of time to build up the foundation on which we take off for the future. They are our beloved teachers, faculty and TAC officers. He is Maj B with a broad knowledge base and a distinctive pedagogy, teaching from bell to bell. She is Ms Wray, charming and erudite, interweaving Maths with fun facts and anecdotes to boost our interest and prowess in Maths. She is Ms Freeman, working hours on our college applications before she goes to coach a soccer game. He is Mr Morris with benevalent smile and good-will. She is First sergeant, the mum of Oak Ridge. They are chief duff’s national anthem, Master Sergeant’s eight count body builder, Sergeant Major and gunnies’ cadre training and Colonel Nobels’ and Colonel Lucas’s squared away uniform and invigorating speeches.

The list goes on and on and I would like to recognize them one by one, but time limit does not allow me to do so.

Without their understanding and support, I, as a kid who was new to this country and this language two years ago, could not even imagine the feats I have achieved. From a boy who couldn’t even construct a proper English sentence two years ago, to a man with a brawny body and strong character who is about to matriculate to wake forest, I have ventured a long way.

This scholastic triumph is trivial compared with soft skills I have learned here, because education should not, and is not confined only in the academic arena. Remembering one plus one and the 16^{th} president is not the essence of education ( By the way , It is 2 and Lincoln) Albert Einstein said that education is what remains after one has forgotten everything learned at school. This school is a microcosm of the real world, where we could experience and learn

As class representative, I also want to express our deepest gratitude to parents and classmates on behalf of the graduating class of 2014.

Classmates, thanks for the vital role you have played in our life and for the precious gift of friendship

Parents, you have devoted big chunks of your life into what seems to be the most onerous job in the world, at least in my parents case. Thank you for investing time, money and care on us. We are forever grateful to you. And we promise you your investment will pay back, with a boat load of interest and love.

Most of you know at least something about me more than my three syllables, exotic Chinese name. I spent my two years here, being a smarty pants, know-it-all and a person full of special ideas, so dwelling on formality is not my style. But I do have some words for our graduating class and underclassman, Not only for the sake of formality.

For the graduating class of 2014 , congratulations to all of you again, for graduating successfully and for the new adventure we are going to face together. Now we are at our first watershed of life, so we are entitled to be cocky, pompous and a little bit crazy. But don’t lose sight of what Oak Ridge has been cultivating since the first day we arrived. Because going to college is not a justifiable excuse to get loose. At Oak Ridge, everything is semi-collegiate, from class structure to dorm life, so the college freshman year is just “senior year plus one”. And don’t forget the purpose of going to college and why we are here: to master a body of knowledge and to prepare for the bigger world.

For underclassman, from Mr Whimet, the youngest potential leader, to Battalion commander of the 163th corps of cadet, captain Freeman, I hope you excel in this place where generations of leaders have started their journeys with skills instilled by this institute. Some of you may have struggled here, It may be discipline or academic. Seek help from you peers, your teachers, TAC officers, like what I have been doing since my first day.. Help is out there, not far, not close. Assistance and help are tangible, but they won’t just fall in your lap. Take Initiative to pursue them.

For the whole corps. This has been a meaningful year, we have changed a lot, and in general, the school is steadily improving. Sir Issac Newton once said : If I have seen further it is by standing on the shoulder of the giant. Past Alumni and the school has provided an excellent foundation and harbor to the future. I will be pleased, but not surprised, to hear that my high school friends become generals, successful business people or great community leaders, for our school deserves such excellent alumni.

There is a time for everything, and a season for every activity under the heavens:a time to be born and a time to die, a time to plant and a time to uproot, a time to kill and a time to heal, a time to tear down and a time to build, a time to weep and a time to laugh. Now, It is time to bid farewell. Parting is such a sweet sorrow that I would like to continue this commencement until today becomes tomorrow.

I wish you well, and in the unlikely event that my plane crashes and I couldn’t see you again, have a good day, a good year and a good life.

## picked up a system book

It is final time and my old habit of picking up random books during final weeks started sneaking in again. This time I picked the CMU system course. I have heard a friend talked very highly about it so I want to try it out myself. Good get away from dry final reviews. I will keep editing this page to reflect the progress I have made. Let’s hope it all go well.

## Hello, World!

Hey there!

After a long hiatus, the website is back! I took it down earlier this year because I realize how insecure it is to use DDNS to expose my own home network to the whole world. There are better places to use the Raspberry Pi.

I won’t migrate the old post from Pi since I didn’t use any framework such as world press, migrating is not as convenient. On top of that, who would want to look at websites that are hand-coded and look like they are from the 90′.

This blog will serve as a notebook for me to jot down ideas, fun projects and reveries. I have had a fews notebooks where I put my ideas. But they are scattered all over the places and it is hard to retrieve past information. And frankly, my handwriting is just not that aesthetic lol. On top of these reasons, putting my projects and ideas in the public will motivate me to put more thoughts in my words and be more precise and meticulous. After all, no one wants to get confused!

So that is! Hello World again!

Kedi right before his last final exam period at Wake Forest.