Network programming

Client-server model

In this model, an application consists of one server and multiple client. Server manages resources and provide services to client. 1. Client request services 2. Server manipulate resources to satisfy the request 3. Server sends a response to the client

Networks

A network is a hierarchal system of wires and boxes that’s organized by geography proximity. EG: LAN, WAN.

LAN

Lowest Level is LAN. It is consisted of a bunch of hosts connected by wires to a hub. Each host has a 48 bits address. (MAC address) EG:00:16:ea:e3:54:e6.

Host sends each other via frame. Broadcast, so everybody sees every bits.

Next level: internet(lowercase)

internet stands for inter-connected network that connects incompatible LANs together.

internets is usuallyy ad hoc so there is no set topology and we send data via hops across routers.

protocols

Protocols are a set of rules that govern how hosts and routers should communicate. It smoothes out differences between network

Internet Protocols(IP) does: 1. Provide naming scheme such that each host and router can be identified by a unique address. The uniform format is called host address.

  1. Provide deliver mechanisms: Provide the standard unit of transmission called packet that’s consisted of header and payload. Header consists of the packet size, source host address and so on.

The Internet

The Internet can be thought of as a world collection of hosts with the following properities:

  1. Hosts are mapped to 32 bits address.
  2. IP address is mapped to a set of identifier called domain name.
  3. A process on the Internet can communicate with another process via connection.

IP address

IP addresses are stored in IP address struct. The struct contains 1 unsigned int

 /* Internet address structure */
struct in_addr {
    uint32_t  s_addr; /* network byte order (big-endian) */
};

EG:0x8002C2F2

We can convert the hex to human readable dotted decimal form:

0x8002C2F2 = 128.2.194,242

Domain name

The Internet maintains a mapping between IP addresses to domain names called DNS domain name server.

The mapping is not one-to-one. A domain name can map to 1 address, multiple addresses or no address.

The node of the tree represent domain names formed by the path to the root.

Internet connection

Client and server communicates by sending streams of byte over a connection. Each connection is full duplex and point-to-point

A socket is an endpoint of a connection.

A port is a 16 bits that identifies a process: 1. ephemeral port: ports automatically assigned by kernel when client makes a request. 2. Well-known port : Associated with some service provided by the server. EG port 80 for http

Each connection is defined by a socket pair

(client ip: port, server ip: port)

# Socket 

For Linux Kernel, socket is an endpoint of communication. For program, it is just a file with a file descriptor.

A socket address is a struct that contain info about the ip address and port. We have to use a generic socket struct then cast it to different socket address struct since when socket was designed, we can't use generic pointer `void*`.

<img src="http://kedizheng.com/wp-content/uploads/2019/08/Screen-Shot-2019-08-19-at-4.20.59-PM.png" alt="" width="1424" height="490" class="alignnone size-full wp-image-957" />

<img src="http://kedizheng.com/wp-content/uploads/2019/08/Screen-Shot-2019-08-19-at-4.21.51-PM.png" alt="" width="1416" height="616" class="alignnone size-full wp-image-959" />

Above is an IPV4 socket address. For functions to use socket address correctly, we need to first cast `struct sockaddr_in*` to `struct sockaddr *`

## Convert from string representation of sockaddr to struct
we use `getaddrinfo` to do so.


```c
int getaddrinfo(const char *host,            /* Hostname or address */
                const char *service,         /* Port or service name */
                const struct addrinfo *hints,/* Input parameters */
                struct addrinfo **result);   /* Output linked list */

Given host and port number, returns result that points to a linked list of addrinfo structs. In each addrinfo contains a pointer to sockaddr that can be used to bind, accept and listen.

In hints, we specify more control about the connection. Such as IP type and socket types.

The reason getddrinfo returns linked list of addresses structure is because the default behavior is to return 3 structs for 3 different types of connection. 1 for connection, 1 for raw socket and 1 for data gram.

Also a host might have multiple IP address.

It is a linked list because an address for a host might have multiple sockets.

EG: Twitter might have multiple sockets listening?

We can get the host info from struct with

int getnameinfo(const SA *sa, socklen_t salen, /* In: socket addr */
char *host, size_t hostlen,/* Out: host */
char *serv, size_t servlen,/* Out: service */
int flags); /* optional flags */


Socket interface

Creating socket

int socket(int domain, int type, int protocol);

Both client and server uses this function to create socket. EG:

int sockfd = socket(AF_INET, SOCK_STREAM, 0);

AF_INET means we use IPV4 and SOCK_STREAM means this is an endpoint of a connection.

Better to use getaddrinfo() to get info for socket since it will be protocol independent.

 struct addrinfo *p = ...;
int clientfd = socket(p->ai_family, p->ai_socktype, p->ai_protocol);

Connect (by client)

A client establishes connection via

int connect(int clientfd, SA *addr, socklen_t addrlen);

It will block until a connection is successfully established. If successful, we can now read/write via clientfd. The connection is characterized by a pair (x:y,addr.sin_addr:addr.sin_port).

Bind

int bind(int sockfd, SA *addr, socklen_t addrlen);

Where typdef struct sockaddr SA

It will ask the kernel to associate the socket address struct with the socket file descriptor.

After bind, process can read and write from sockfd.

Listen

By default, program will assume a socket opened by socket(...) is an active socket from the client end. Use listen will turn the socket from active socket * to *listening socket.

 int listen(int sockfd, int backlog);

backlog is an hint for the server about how many requests can be queued before rejecting further requests.

Accept

int accept(int listenfd, SA *addr, int *addrlen);

accept(...) will wait for connection request bounded to socket at listenfd, then fill in the client address in addr and the size of socket address length at addrlen.

If successful, it will return a connected descriptor to communicate with the client via UNIX IO.

Connected fd is different from listening socket. A listen socket is created once and will listen for Connection requests. A connected socket is created for the connection and exists only as long as it takes for the server to service the client.

We have such distinct so we can create concurrent server to connect with many client. EG: when a new request comes, we fork a child to handle.

Web server

System level I/O with Unix/Linux

UNIX I/O

A linux file is a series of m bytes /[b_0, b_1…b_{m – 1} ]

All I/O devices are such as terminal, Network, printers are modeled as file.

Openning a file

An application opens a file by request a file from Kernel. The Kernel returns a small non-negative integer called File descriptor. The kernel keeps info about the file, and the app only keeps track of the descriptor.

Change file position

Kernel keeps a current position k for each file and it starts at 0. App can explicitly change it via sleek.

Read/write file

Read a file copies n bytes from the file to memory starting at k. It will then increment the file position k by n.

If the file size is m and m < n then Kernel will trigger a condition called EOF. There is no explicit EOF character associated with the file though…

Writing is similar. It will write n bytes to the file starting at k and increment the current position k by n.

File type

Regular file

contains arbitrary data

Directory file

A file contains multiple links to other file, mapping file names to files, including directory files. Each directory files contain 2 files at least. The first is “.”, which points to itself. The other is “..”, which points to the parent.

Socket

File used to communicate with other process across networks.

Open and close a file

Openning a file

Open will takes a filename and flags and user mode and return the smallest int that’s not used in by the current processor as the file descriptor. For example, the first file a process open will has fd == 3. (0 is stdin, 1 is stdout and 2 is stderr).

int open(char *filename, int flags, mode_t mode);

flags indicate how to access the file. The flag can also be ored to create bit mask that gives additional info. We can do

open('temp.txt, O_RDONLY | O_TRUNC, 0);

Close file

int close(int fd);

Will close the file associated with the given fd. Closing a closed file is error.

Always check return codes, even for seemingly benign functions such as close()

Read/Write file

ssize_t read(int fd, void* buf, size_t n);

will read at most n byte from fd starting at the current position and store it into buf. A return value of -1 indicates Error. 0 indicates EOF. Else the return value indicates the bytes read.

ssize_t write(int fd, const void* buf, size_t n);

copy at most n bytes from buf to fd at the current position.

Short Count

For both read and write, the return value might be smaller than n. This is called Short count. It can happen due to three things: 1. Encounter EOF 2. Reading from text line in terminal. If we are reading from a terminal, no matter how big the n is, read will just read a text line and return the length of the text. 3. Reading from socket.

Standard I/O

C standard library libc.h has high level file functions such as fopen, fclose, and so on. Open and close file: fopen, fclose Read and write bytes; fread, fwrite Read and write text line fget, fput Format read write fprintf, fscanf

standard IO models file as stream, abstraction for file descriptor and a buffer.

Applications often read/write one byte at a time and doing read, write is expensive because it requires kernel call. So we use buffered IO that put contents into a buffer first.

EG: stdout uses buffer. It will flush when we call fflush(stdout) or when the function exits.

Most of the time, we would prefer standard IO but it has problem with: 1. Can not access meta data 2. Not async-data safe (not safe for signal handling) 3. Not suitable for network socket 4. Restrictions

Standard IO uses full duplex so we can use the stream both to read and write but the buffer causes problem. More on standard IO restriction: 1. Can’t use input right after output because of buffer. If we want to use it, we must flush it with fflush 2. An output can’t follow input function without calling fseek or rewind. It causes problem for network since we can’t fseek on a socket.

We can open two streams one for read and one for write but we need to close them

FILE* fpin, *fpout;
fpin = fopen(socked, "r");
fpout = fopen(socked, "w");

We need to close both streams to release the resource. But since both streams have the same underlying socket, the second close will cause problem.

When to use Unix IO

  1. Inside signal handler
  2. When performance matter a lot

What not to use for binary file

Text-oriented IO

such as fget

string function

Such as strlen because ‘\0’ will be interpreted as null terminator

How kernel represent open files

To represent files in Unix kernel, we need three tables:

Descriptor table: Each process keeps a descriptor table where each entry is an open file. Each entry points to an entry in the file table.

File table: File table keeps track of the current position offset of each file, and a reference count of the file and a pointer to entry to v-node table. Closing a file will decrement the reference count of the associated file table. File table is shared across all processes

v-node table: Keeps stat info about that table.

Two file descriptor represents two different files.

Two file descriptors that point to the same file. EG, do two open on the same file name.

Forking will let children inherit files opened by parents
Before forking:

After forking:

We also increase the reference count by 1 for each file.

IO redirection

Linux IO allows user to redirect standard output to file on disk

$ ls > foo.txt

will redirect the output of ls to foo.txt.

It achieves this by using dup2(int oldfd, int newfd);
For each process, dup2 will copy oldfd’s table entry to that of the new file descriptor.

EG: dup2(4,1) will copy the pointer pointed to by file descriptor 4 to the entry of file descriptor 1 in the file descriptor table

if newfd is open, then we will close it first(and decrement the refcount).

Issues with Concurrency

Thread (un)safety

4 types of thread-unsafe functions

Access unprotected shared data

Easy to fix, add lock

Function that keeps state across multiple invocation.

Such as rand. solution: stop using state via static variable. Only solution is to change the function such that it doesn’t use static and provide that value via variable.

function that return pointer to static

function that use thread-unsafe function

Thread reentrant

Thread-reentrant is thread function that doesn’t rely on any shared data.
Not the same as thread safe. A function can be thread-safe but not reentrant.

Race

Race condition occurs when the correctness of the program depends on the order of execution.

Deadlock

A pair (or multiple) threads that are blocked because they need resources from other threads.
For mutex, we can prevent it by locking it in one order and unlocking it in reverse order.

chapter3 Machine-level representation of programs

History prospective

I386

The first chips to have 32 bits architecture in Intel. It has flat address mode, which Linux still uses today. When we compile with GCC, it will assume it is compiling for i386.

Pentium 4E

Added hyperthreading, a method to run two programs in one processor.

Core2

The first multi-core processor. Didn’t support multi-threading

Corei7

Up to 4 cores.

x86 comes from I386. Makes you wonder how “severe” the back-compatability issues are if they can’t even shake off the old name lol

Program encoding

gcc -01 -o p p1.c p2.c

gcc is the default c compiler.

-01 is the level of optimization

-o will specify the name of the compiled program, in this case p.

First the C preprocessor will expand the source code by including the in #include and macros defined by #define.

Then the compiler will generate assembly code of the expanded c code p1.s and p2.s. Next.

Data Format

ATT format

Also known as the GAS format. Instructions with multiple operands will be in reverse order.

Controls

Condition codes

CPU has a set of single bit condition registers. The famous ones are

CF

Carry flag. Will be one if the most recent ops generated a carry out.

ZF

zero flag. Recent ops generated 0

SF

Signed flag. If most recent ops has negative result.

OF

If most recent ops generates overflow in two’s complement.

Fun commands

CMP and TEST

Almost all ops (except leal, since it only copies address) will change condition registers.

We will concern ourselves with CMP(compare) and TEST. These two are ops that will set conditions code without altering other registers.

CMP behaves the same way as SUB. TEST behaves the same way as AND.

SET

SET is a class of ops that will set a bit to 0 or 1, usually one of the flag register or the lower byte of general register, based on combinations of the condition codes.

For example, setl(setwhen signed <) will effectively do

D <- SF ^ OF

Based on the calculation t = a – b.

When a < b and a - b < 0, SF = 1 and OF = 0.

When a < b but a – b > 0, an overflow has occured and OF = 1, SF = 0.

So SF ^ OF = 1 if a < b and D will be set by setl.

For a comprehensive list of set ops, please check out the book.

TEST

Test is the almost the same as AND but it doesn’t change the destination register. It will change ZF when the result is 0, and sign flag when the result’s sign bit is set . We usually do

<br />TEST %eax %eax

Then check the sign flag.

Control structure

The code that C translate to Assembly will use a lot of GOTO style code.

WE use different jump statements.

##JMP

JMP will just jump to a labeled location. It jumps unconditionally.

1: Direct jump. The jump target is encoded as part of the instruction.

2: Indirect jump. Jump target is read from a register or a memory location.

Indirect jump use ‘*’, followed by an operand specifier.

<br />jmp *%eax

use the value in reigister %eax as the jump target.

<br />jmp *(%eax)

will read the jump target from memory.

Conditional jump

They can only be direct. The way to jump to target are directed by PC relative. It directs the difference between the address of the target instruction and the address of the instruction immediately following the jump.  This PC relative applies when we disassembled the assembly code.

(I would love to post some assembly code here but there is no code tag for it)

If we want to jump to byte 0x17 through a jump, and our next byte is 0xa,

then the jump value is 0x17 – 0xa = 0xd.

THE PROGRAM COUNTER WOULD HAVE THE ADDRESS OF THE INSTRUCTION FOLLOWING THE JUMP WHEN PERFORMING PC RELATIVE ADDRESSING. Because in the old day, updating the PC with the next time is the first step in executing an instruction.

Loops

Do while, while all use conditional jump.

Most of compiler generate loop based on do-while form of a loop.

Do-while

do

body-statement

while (test-expr);

We translate it using goto, which resembles the assembly code:

loop:

body-statement;

if (test-expr)

goto loop;

while loop

while(test-expr)

body-statement

It might terminate the loop before the first execution. To translate it to do-while we need to add a conditional jump to skip the first execution of the body.

if(!test-expr)

goto done;

loop:

body-statement;

if(test-expr)

goto loop;

done:

For loop

for (init-expr; test-expr; update-expr)

body-expr;

First translate it to while

i = init-expr;

while(test-expr)

body-expr;

update-expr;

Then translate it to go-to…

i = init-expr;

if (!text-expr) goto done;

loop:

body-statement;

update-statement;

done:

 

Continue

Continue will jump directly to the end of the current loop. Need to jump to the update-section.

 

Conditional move

It will test a condition and move the data according to that condition. Because it is really hard to predict where if statement will jump to, so pipelineing’s ability to do calculation beforehand is limited. A misprediction of where the next code will be will let CPU discard much of the work done before.

int func(int x, int y){
if (x - y > 0)
    return x - y;
else
    return x + y;

If the machine predicts the sign of x – y wrong, (eg, predicting it to be positive but it is negative), then it has to discard the result of x – y and calculate x + y and return it. But for conditional moves, it will check conditions and then based on the result, move the appropriate data into appropriate register. No jumps so no works will be discarded. But we have to calculate both x – y and x + y.

int func_cond_move(int x , int y){
int tval = x - y;
int rval = x + y;
int test = x > y;
if(test) rval = tval; //cond move
return rval;

But if one of the conditional expression might lead to a error or side effect, we CAN’T use conditional moves.

int cread(int *xp){
return( xp? *xp: 0);

This code will return *xp if xp is anything but zero. So even given an invalid pointer that is not 0, we will deference that address since in conditional move, we evaluate all statements

Switch

Switch has a multi-way branching ability and the time it takes to perform the switch is INDEPENDENT OF THE CASES.

In the assembly code of the switch code, we will first “bring down” the cases number to start at zero. For example, if you have cases ranging from 100 to 106, then GCC will bring it back to 0-6. Then GCC use a switch table, an array that stores code for different cases.

jmp *.L7(, %eax, 4)

We use an indirect jump to jump to the location pointed by L7 + 4*%eax, where %eax stores the case number. L7 is the start of the switch table. We multiple by 4 because the word size is 4 bytes.

In the C code, the switch table is

<br />static void *jt[7]= {

&&loc_A, &&loc_def, &&loc_b...

}

And it stores address of codes through the && that creates a pointer for code location.

Procedure

A procedure call involves both data and control transfer. And it must allocates for local vars and deallocated them when the procedure ends. This is implemented through the program stack.

Stack frame

The protion of the stack allocated for a single procedure is called stack frame. The stack grow from higher address to lower address. The stack top has the lowest address and is pointed to by %rsp, the stack pointer.The stack bottom is pointed to by %rbp, the frame poitner.( you can think of the b as “base”) Since the top can be always moving, we always access information relative to the frame pointer %rbp.

The stack grows down so the top of stack will always have lower or equal address than the bottom. In other word, RSP <= RBP. And to allocate space on the stack for the current procedure, we will subtract RSP. To deallocate, we add to RSP.

When P( the caller) calls Q( the calee) the argument to Q are contained in P. On top of that, the return address within P where to program will resume execution, is pushed onto the stack, forming the end of P. (Before Q pushes anything onto the stack) The stack frame of Q will start with the saved value of the frame pointer and copies of any other saved register values.

Q would also uses the stack for any local vars that can’t be stored in registered. 1. Not enough register 2. Local vars are arrays so must be accessed by array references 3. &, the address operator is applied and we must be able to generate an address for it.

Stack frame is also used for storing arguments to any procedure it calls. We can allocate space for stack by decrementing the stack pointer( the top of the stack) then deallocate it later by incrementing it.

Leaf procedure

x86-64 will only allocate stack space if needed so if a procedure has fewer than 6 arguments(since x86 will store the first 6 args on the registers) and doesn’t call other function(calling other functions require the caller push the return address onto the stack), that procedure won’t need stack frame.

Control transfer

When procedure P calls Q, x86-64 calls instruction Call Q and it will push an address A onto the stack and sets the PC to the beginning of Q. The address A is called return address * and is computed as the instruction right next to the *call instruction.

The counterpart is called ret and will pop A off the stack and sets the PC to A.

When P calls Q, at least 8 bytes of storage has to be allocated by subtracting 8 from RSP.

Data Transfer

Usually data transfer is implemented through registers. Args are passed through RDI, RSI and so on. Return value is passed back through RAX. x86-64 use up to six register to pass integral(pointer and integer) arguments in per-defined orders. For 64 bits, the order is : rdi, rsi, rdx, rcx, r8 and r9. If the function has more then 6 arguments, x86 will push the rest of the arguments onto the stack. The 7th argument will be on the top of the stack(last to push on) because we will pop it first.

If we call a function with 8 arguments, we will see the 7th argument at RSP + 16 and the 6th at RSP + 8. (RSP is the beginning of the return address).

Local storage on the stack.

Sometimes we might need to store things onto the stack. Three situations are likely to cause this: 1. Not enough register. 2. The address operator is applied to a local variable and we need to generate an address for that variable.(Registers don’t have address). 3.We encountered an array or structure.

For example, if we want to swap two values through pointers, The calling code of this function has to allocate a stack frame because of the presence of the address operator.

<br />TEST %eax %eax
void swap(int *xp, int*up){ 
    long x = *xp; 
    long y = *yp; 
    *xp = y; 
    *yp = x; 
}

Local storage in the registers

Program registers are resources shared by all procedures and only ONE procedure can be active at a given time. So we must prevent overwrite when one function calls another function. There are two kinds of register in terms of storage. The first one is callee-saved. As the name indicates, the called function will save or won’t touch the value stored in callee-saved registers. The second kind of register is Caller-saved register.

By convention %rbx % rbp and %r12-%r15 are callee-saved registers. If the called function will alter the register, the function will push the original values onto the stack and before it returns, pop them off to the respective registers. They will be stored in a portion of the stack frame called “Saved register”. The data in callee-saved registers won’t be currupted by the callee

All other registers (except %rsp) are Caller-saved register. It is on the caller to save the values before calling callee.

We often use callee-saved register to hold results from previous calculation. After we push the callee-saved registers value onto the stack and finish the work in the function, we will need to pop them off into the register in reverse order to follow the stack convention.

Recursive calls

The stack convention allows each procedure to have their own private space and local variables in multiple outstanding procedures won’t interfere with each other. With this, we can do recursive calls. The book uses Factorial as an example. You should try to write the assembly for Fib to try out.

Array

Array in C aggregrigate scalar data into larger data type. But C generages pointer to element within the array and can perform pointer arithmatics. If we declear an array that stores n data of size L, the total space allocated will be n * L. If the starting address of the array is X_a, the address of the ith element will be X_a + i*L.

Random Finding

LEA

LEA doesn’t change any flag register so it is used a lot by compiler to do simple arithmatics to help the schedulor.

Let’s assume a at %ecx, b at % edx.

LEAL (%ecx, %edx), %ebx

will store a + b in %ebx

We can also use LEA to do left shift

LEAL (%edx, %edx), %edx

Will shift( %eax) to the left by 1 bit and store at %eax.

Test non negative

We will use

TEST %eax, %eax

To see if eax is non-negative since TEST will flip the zero and sign bit based on the result of %ead & % eax.

Calculating Parity

We can use XOR to calculate parity.

<br />int parity(unsigned x){

int val = 0;

while (x != 0){

val ^= x;

x >> 1;

}

return 0x1 & val;

}

This code will calculate the parity of x. It will shift x’s bit pattern to the right one by one and XOR it against the previous x bit pattern, effectively go through all digits to check parity at the least significant bit.

The least significant bit in value will start with 0 and flip to 1 through XOR if x has 1 in the least significant bit. The XOR result will be the next val.

The next val will be check against the right-shifted x. If the second to last bit of x is 1, which will become the least significant bit of the right-shifted x, will be XOR against val. If x = 11, we will have 11 XOR 01 = 10. A 0 at the least significant bit means the parity is even.

Flipping bits

It is just amazing how the book does it!!!

Chapter 4

Overview

In this chapter we will take a look at processor hardware and the instructions that the hardware can support, called instruction set architecture.

One important concept is that ISA ‘s logic can differ from the actual way the processor works. The ISA will appear to be sequential but sometimes the different instructions are executed simultaneously.

We want to have clever tricks that improve the performance while maintaining the functionality of a simpler and more abstract model

The processor we are working with :Y86-64

It is functionally correct and sequential. It executes a complete instruction on every clock cycle and the clock is slow to allow an entire series of actions to complete within one cycle.

Then we will apply transformations to create a pipelines processor. It will break each execution into five stages, each will be handled by a separate section of hardware. Instructions progress through the stages of the pipeline with one instruction entering the pipeline in one clock cycle. As a result, the processor can process up to 5 instructions at the same time.

The Y86-64 instruction set

Defining a instruction set means defining the different components of its state, set of instructions and their encodings, a set of programming convention and exception handling.

programmer-visible state.

Each instruction in Y86-64 can read and modify some part of the processor. This is the visible state. The 15 registers, condition codes(ZF,SF,OF) about the most recent arithmetic or logical ops, PC, memory and stat, which indicates the state of the program execution.

Y86-64 instruction

It is largely a subset of X86 and has fewer addressing mode and only 8-byte integer operations. Here are a few “quirks” of our Y86:

  1. movq is split into four instructions: irmovq, rrmovq, mrmovq and rmmovq, where i stands for immediate, r stands for register and m stands for memory. It will become useful when we implement them.
  2. As with x86, we cant move from memory to memory, and can’t move immediate to memory.
  3. We have four operations and they can only operate on register data. But on x86, they can operate on memory as well. These instruction will set the three conditional code.

Instruction encoding

The byte encoding of the instructions are between 1 to 10 bytes, depending on what fields are required. All instructions have a initial byte that indicate the instruction type. The high-order part is the icode, the low-order part of the byte is the ifunc that will differentiate between different functions under the same code. For example, 0x6 for icode is operations, and 0x0 for ifunc is addq, 0x1 for ifunc is subq and so on.

One-byte long instruction

Some instructions are only 1 byte long because they don’t need any other infos. They are ret, nop and halt.

For other instructions, we will need more bytes because they require operands. The next bytes inline is the register specifer byte. We have 15 registers in Y86 and the high-order and low-order specify the two registers. The value from 0x0 to 0xE are associated with %rax to %r14. The value 0xF is reserved for no register. We use 0xF for operations that only need one register such as pushq.

We might also need a 8-byte constant word. It can serve as the data for irmovq and displacement for rmmovq, and destination of calls. BEWARE the address are absolute (because it is simplier) and is in byte-reverse order, the little-endian encoding.

Exception

The status code in stat has four possible value. AOK(normal), HLT(halt), ADR(address invalid) and INS(invalid instruction. Any read or write attempt to an address out of the addrses limit will trigger the ADR.

In y86, when exception happens, we will just stop.

Two unusual instructions

pushq will decrement the stack pointer by 8 and writes a register value to the memory. So what will pushq %rsp do? Two possibility:

  1. push the original value
  2. Push the decremented value

Let’s first try putting a code together

movq %rsp, $rax pushq $rsp popq $rdx subq %rdx, %rax ret

The return value is always 0. So it means we pushed the undecremented value onto the stack. The same question can be raised for popq $rsp. A similar code can be used to determine the effect. We found popq $rsp will put the value of ($rsp) to $rsp. It is equivalent to mrmovq ($rsp), $rsp.

Logic design and Hardware control language HCL

We need three components to implement a digital system.

  1. combinational logic to computer functions on bits
  2. Memory element to store bits
  3. clock signal to regulate the update of memory

nowaday don’t have to draw diagrams anymore and use a textual notation. The most common languages are Verilog and VHDL. In 1980s, logic synthesis is developed to generate efficient circuit from HCL.

Logic gates

They are basic electronic component that generates output based on some boolean functions of inputs. Logic gates are always active and if some input changes, within a short amount of time, the output will change accordingly. Putting logic gates together, we got combinational circuits. Several restrictions apply:

  1. inputs must be either system input, output of memory our output of logic gates
  2. outputs of two gates can’t be connected together
  3. The network must be acyclic. No path can form a loop. Else there will be ambiguity.

Multiplexer

It will select input based on the bit pattern of the selection bit. We will use case expressions in HCL.

[ select1 : expr1; select2 : expr2; …

]

where select is a Boolean expression, indicating when to select the expression.

Example: If we have two selection bits, we can select 2^2 = 4 inputs.

[ !s1 && !s2 : A; !s1 : B; !s0 : C; !1 : D;

]

We can also simplify the select statement.

ALU

arithmetic logic unit can perform ops on data.It has two data input and one control input.

Memory and clocking

Combinational circuit, by definition, doesn’t store information and will simply react to signal inputs. To create sequential circuits that can have states and perform computations on the state, we must have devices that store information. Our storage devices will all be controlled by clock, a periodical signal that determins when new values will be loaded into the devices. The registers will only change according to inputs when the clock signal rises.

Sequential Y86 implementations

In this version, our sequential processor,SEQ, will perform all the steps required to process a complete instruction. It will be slow.

Stages

  1. Fetch: reads the bytes of an instruction from memory, using the PC as the memory address. It will grab the icode, ifunc and possibly the register specifier rA and rB and an 8-byte constant word valC. In the end, it will compute valP to be the address of the next instruction.
  2. Decode: reads up to two operands from the register file, and give the values stored in the rA and rB, giving valA and valB.
  3. Execute: the ALU will perform the operations specified by icode and ifunc, or computers the effective address of memory reference. The conditional code is possibly set and conditional jump/move will be determined to move/jump or not.
  4. Memory: write data to memory or read data from memory
  5. Writeback: write back to up to 2 result to register file.
  6. PC update: PC is set to the address of the next instruction.

OPq

An example, we will work through OP rA, rB.

Fetch: icode: ifunc <- M1[PC] // read 1 byte and get instruction code and function.

rA:rB <- M1[PC + 1] // read the target register

valP <- PC + 2 // find the next instruction after OP

Decode: ValA <- R[rA], ValB <- R[rB]

Execute: ValE <- ValB OP ValA, set CC //conditional code

We put ValB in front of ValA because we want division to have B divided by A.

Memory: No memory instruction

Write back R[rB] <- ValE

PC update: PC <- ValP

rmmovq

We can also use ALU to get the effective address of memory, as can be seen in move instructions

Example: rmmovq rA, D(rB)

Fetch: icode: ifunc <- M1[PC], rA:rB <-M1[PC + 1], ValC <- M8[PC + 2], ValP <- PC + 10 // we need access to displacement, it is an 8 bytes const after register specifying byte.

Decode: ValA <- R[rA], ValB <- R[rB].

Execute <- ValE <- ValB + ValC //displacement D

Memory M[ValE] <- ValA

Writeback: no writeback

PC update <- PC <- ValP

If we have rmmovq, where there is no displacement, in execute stage we will have : ValE <- 0 + ValA.

I guess this is the case because we need to pass ValA to writeback and it has to go through ALU. Kind of like zero-padding.

Push and Pop

It is among the most difficult instructions in Y86 because we both need to access memory and also update the stack pointer.

push rA

Fetch: icode: ifunc <- M1[PC], rA <-M1[PC + 1], ValP <- PC + 2

Decode: valA <- R[rA], ValB <- R[$rsp]

Execute: ValE <- ValB – 8

Memory: M[ValE] <- ValA

Writeback: R[%rsp] <- ValE

PC update : PC <- ValP

we can see ValB has the value of the old $rsp. We observed the rule that we should update the value of $rsp before writing to it is followed. (even though the actual update happened after the memory operation.)

pop rA

Fetch:R[$rsp] , ValP <- PC + 2

Decode: valA <- R[$rsp], ValB <- R[$rsp]

Execute: ValE <- ValB + 8

Memory: ValM <- M[ValA]

Writeback: R[$rsp] <- ValE, R[rA] <- ValM

PC update: PC <- ValP

We can read two copies of the stack pointer. Because doing so can make the flow of the instruction much similar to instructions.

We used the incremented value to update the stack pointer and the un-incremented value as the memory access.

Call and ret

call dest

Fetch: icode: ifunc <- M1[PC], valC<- M8[PC + 1], ValP <- PC + 9

Decode: valB <-R[$rsp]

Execute: ValE <- ValB – 8

Memory:M[ValE] <- ValP

Writeback: R[$rsp] <-ValE

PC <- ValC

For call, the address of the function is the 8 byte right after instruction byte. We also need to use ValB to store the stack pointer and decrement it to store the return address. Then the PC is set to the destination.

ret

Fetch: icode, ifunc <- M[PC]

Decode:ValA <-R[$rsp], ValB < R[$rsp]

Execute: ValE <- ValB + 8

Memory: ValM <-M8[ValA]

Writeback: R[$rsp] <- ValE

PC update: PC<- ValM

Ret also keeps two copy of unincremented $rsp. One is used for instruction update, another one is used for memory access

These two instructions are very similar to push and pop but we push and pop the program counters.

The hardware structure

A single clock signal transition trigger a flow through combinational logic to execute an entire instruction. We can treat the RAM units as another combinational circuit; it output word based on the address input, just like the circuit. So we are left with four hardware to control: The PC, the conditional code register, the data memory and the register file.  The PC is loaded with the new instruction every clock cycle. The condition code is loaded when an integer opration is executed. The data memory is written only when rmmovq, pushq or call is executed. The two write ports of the register file allow two program registers to be updated on every cycle. All of these state update happened simultaneously but But they achieve the same effect as the sequential execution in the previous section. This is achieved because we organized the computation in a way that follows some principle:

PRINCIPLE 1: No reading back. The processor never needs to read back updated state in order to complete other instruction. Let’s first start with a counter-example: suppose in pushq, we will first decrement the $rsp then use the updated $rsp to store the pushed value. This would violate the principle. We will need to go back to register file to retrieve the updated $rsp. On the other hand, our implementation generate the decremented value ValE and use it both as the memory write address and also the update value for $rsp, so we can do memory write and register update **simultaneously **. Along the line of this principle, we shouldn’t both set and then read the condition codes in any instruction. Even the condition codes are not set until the next cycle, they will be updated at the very beginning of the next cycle before any instruction can read it.

  As we can see, at the beginning of the cycle, the instruction hasn’t been executed and at the end of cycle 3, all combinational circuits have reacted to the given inputs and the update values have been generated.

Stage implementation

We will devise HCL descriptions for the control logic blocks required for SEQ.

Fetch stage

The fetch will read 10 bytes from memory and the first byte is split into 2 4 bits section as icode and ifun. Then compute the instruction codes or, in the event the instruction is invalid, the value is corresponded to a nop instruction. Based on the icode, we can compute three 1 bit signals(in dashed line)

  1. instr_valid: Is the instruction valid?
  2. need_regids: Does the instruction need register id?
  3. need_valC: Does this instruction need constant word?

The signals inst_valid * and *imem_error ( when the instruction address is out of bound). are used to generate the status code in the memory stage.

The HCL for need_regid is :

<br />bool need_regids = {
    icode in {RRMOVQ, RMMOVQ, MRMOVQ, PUSH, POP, IRMOV}
}

and the HCL code for need_valC is

bool need_valC = {
    icode in {IRMOVQ, RMMOVQ, MRMOVQ, JXX, CALL}
}

The remaining 9 bytes are processed by the “align” hardware into the register fields and the constant word. Byte 1 is split into 2 register specifying fields and if need_regids is 0, both register specifying fields will be 0xF(RNONE). Depends on if we have register specifying fields, the constant field will either be 1-8 bytes or 2-9 bytes.

The PC incrementer hardware unit will generate valP. Depends on need_regids and need_valC value, valP will vary. The formula is : PC + r + 8p, where r and p will be either zero or one, indicating the status of *need_regids, need_valC.

Principles of pipelining

Pipeling is like the Ford model-T production line; each unit of the processor will do their jobs on different instructions and they don’t have to wait till the instruction finished.

For the sequential, the clock can’t be too quick otherwise, we the clock will erase the registers before an instruction finished. We have five stages(fetch, decode…) and we have to let the clock’s ticking time to be longer than the time it takes for the signal to propagate.

But for pipelined processors with pipeling registers, we can run each stage of an instruction as soon as the previous instruction left that stage. For example, Instruction just finished fetch and stored all the info it needs in a pipeling register, instruction B can immediately uses the fetch part of the processor after the clock ticks.(Else it has to wait instruction A for the next few stages.)

The throughput(the number of instruction in given time) will increase. But the latency(the time required to finish an individual instruction, will increase as well.

Formula:

  • Through put = instruction / time for cycling the clock.

The first image is a pipelined processor. We can see we have divided the computation in three stages.And each instruction will require three clock cycles to finish.

The second image shows how three instructions will be finished. As soon as I1 is done with combinational logic A, I2 can start using A. Ignoring the time to transfer in and out of the pipelining registers and assume each combinational logic takes unit time, the sequential will take 9 clock cycles to finish the job but the pipelined will take 5 clock cycles.

 

When the inputs for B is ready for I1, the clock’s rising will load the input of the first pipeline register.

Slowing the clock will not change the behavior but making the clock too fast will be bad; The needed information hasn’t been calculated out by the circuit and the clock tick will let whatever is in the pipeling register output to the next combinational logic. BAD.

Limit of pipeling

Nonuniform partitioning

The rate of the clock is determined by the delay of the slowest stage.

We can see that the delay in each stages are different and stage A will have to be idle when for I2 since it has to wait for I1’s B to finish first. If we clock A too fast, we will let I2 enter stage B prematurely and erase I1’s B stage information.

Pipelining a system with feedback.

We might run into dependency between successive instructions.

  1. data dependency, where the next instruction might need the result of the current one.
  2. control dependency,the outcome of a conditional test will determine the next instruction. A good example is conditional jump. We might need to change the system’s behavior in order to do pipeling for system with feedback.

Pipelined Y86-64 implementations

We first shift the computation of the PC into the fetch stage. Then we add pipeline registers between the stages.

Rearranging the computation stages.

We move the PC update stages to the beginning of an instruction, so it computes the **current ** instruction. (For our previous model, the PC is at the end and will compute the instruction for the next one.

In our new PC update stage, we have state registers to hold the signals from the previous computation and when the clock rises, we will select the appropriate PC through the exact same logic we use to compute PC in the previous SEQ model. There is no hardware register storing the program counter. The PC is dynamically computed. (We can implement the processor different from the conceptual model.

The shift of stat element is called circuit retiming, which change the implementation/representation of the system without changing its logical behavior. It is often used to balance the delay of different stages.

The figure shows the shifting of the PC computation.

Pipeline register

To create a pipelined processor, we first need insert pipeline registers. The pipeline registers are in blue boxes, each containing fields that represent results from the stages. They are actual hardware components.

  • F holds a predicted value of the program counter
  • D holds info about the instruction for the decode stage.
  • E holds info for the execute stages.
  • M holds info for the memory stage and will also hold information about branch condition and branch target(info from execution stage).
  • W holds info for writing back to the register file and the return address to the PC selection for ret.

The graph above shows how a pipelined processor would hold information. At stage 5, I1 is about to write, I2 is about to do memory operations. This graph give us another justification for drawing the processor in a way that the instructions flow from bottom to top. The expanded listing of cycle 5 shows Fetch at the bottom and write at the top, just as the hardware.

In the sequential processor SEQ, we have values such as ValC, ValE that are unique. But for pipelined processor, we will have multiple version of these values for different instructions in the processor.

As we can see in a pipelined processor graph above, there are 4 stat fields in 4 different pipelined registers. Confusing one stat with another will cause error. So we use capital prefixes. D_stat for stat of the decode pipeline register and so on.

We also need to name some signals that have just been computed within a stage. We will prefix it with lowercase pipeline register’s first letter. The output of D_stat will be d_stat.

The actual stat is computed by a block in the writeback stage based on the stat in W.

Saving Dest till the write

For both sequential and pipelined processor, we will get dstE and dstM in the fetch stage. For sequential, we can directly output these signals to the register file write ports. But for sequential, we will need to hold it in each pipeline register till that instruction reaches write back stage. We want to make sure the write port address and data inputs hold values for the same instruction. Otherwise we will write to the register file with the valM and valE from the writeback stage register, but the address from the decode pipeline register.

Select A block

It is not present in the sequential processor. It will generate valA for pipeline register E and M. It selects from either valP or valA from the read port. It is to reduce the amount of state that must be carried forward to pipeline E and M Only call requires valP in the memory stage(writing the returned address onto the stack) and only *jump * requires valP in the execute stage. (In the event that the jump is not taken) None of these event requires value read from the register file When we need valP, valA won’t be needed so valP and valA can share the field in pipeline register E and M.

Next PC prediction

Our goal is to issue a new instruction on every clock cycle so in each clock cycle we will get a new instruction and a new instruction will proceed into the execute stage. If we can do this for all instruction, we will have a throughput of 1 instruction per cycle. But we can’t. Conditional jump and ret will need information that we can only get by completing previous instruction. (Not exactly, for ret, we can get back once the previous instruction writes the address to the memory in memory stage)

But other than conditional jump and ret, we can determine the address of the next instruction based on the info in fetch stage. If the instruction is unconditional jump or call, then the next instruction will be valC, the constant word for address and all others will be valP, the add for the next instruction. So we can issue new instruction almost every cycle.

But for branch prediction, we have to make a guess. If we take it, we will use valC, else we will use valP. The subject is called branch prediction.

To predict the ret is more difficult since we have an “unbounded” number of address that could be the ret address. So we will just simply hold off till the ret instruction passes through the writeback stage.

Aside: For most programs, prediction the ret address is easy since procedures call and return often comes in pairs and the software will just return to the instruction after the call. Some processors will include a hardware stack within the fetch stage to hold the return address generated by call.

Pipeline hazard

  1. Data dependency: the results of one instruction are used as the data for a following instruction
  2. Control dependencies: one instruction will determine the location of the following instruction

As we can see in the image, the third, forth and fifth instructions are all nop(no operations) so we will have enough clock cycles to allow 3 to pass to $rax.

If we don’t have enough ops to allow I1 and I2 to pass through to registers, we will run into error; We can see the addl instruction will execute at the same time when irmovel $3, %rax is writing to the register. So this will cause the addl use an outdated value of the register $rax.

To avoid it, we will use stalling, where the processor holds back one or more instruction until the hazard dependency is no longer valid. When the addq is in the decode stage, the pipeline control logic detects that at least one of the instructions in the execute, memory and write (all stages that are ahead of decode) will update the register. For our example, we will hold the addq instruction at decode stage through injecting bubble, a form of dynamically generated nop. We will also hold the halt at fetch stage.

A could potentially arise when the program will alter PC, register file, memory and the conditional code.

  1. register file: already discuss
  2. PC: for a conditional jump, if we predicted the wrong PC, we will have to roll back.
  3. Memory: reading and writing both occur in memory stage. By the time we want to read from memory for one instruction, previous instructions have finished writing. But we can have interference when one instruction tries to write in the memory stage while another instruction tries to fetch. This will occur when we have self-modifying code that writes to a portion of memory from which instructions are later fetched.
  4. Condition code registers: Cond registers are altered by integer operations in execution stage and will be used for conditional move in the execute stage and will be used by conditional jump in the memory stage. By the time the next instruction reaches to execute or memory, the previous instruction will already finish the update for condition code register. NO HAZARD.

Avoid data hazards by forwarding

When there are a pending write a register file and at the same time a pending load from decode stage on the same register, a hazard might occur. We could stall but we could also forward the value that’s about the be written directly to the input for pipeline register E as the source operand.

In cycle 6, the decode stage logic detects there is a pending write to one of the source operand $rax in the writeback stage. So it will use this value for the source operand valB rather than read from the register file.

In Prog3, the decode at cycle 5 detect there is a pending write to $rdx and a pending write to $rax in memory stage. It will use W_valE for valA and M_valE for valB.

We can even remove the only nop since we can fetch E_valE, the newly computed value, and forward it to the decode stage

Every value that will be written back to the register file has their register ID. The logic will compare the ID of write and the ID for sources (srcA, srcB) to detect cases for forwarding. If we have multiple destination registers ID matching sources, we need to establish a priority.

We can see from the final pipelined implementation, we have two extra units. Sel+Fwd A and Fwd B that will handle the forwarding. Sel+Fwd A combines the roll of selA with forward logic. So valA for pipeline register E can either be the incremented PC valP, the value read from register file, or the forwarded values. Fwd B implements the forwarding logic for source operand valB.

Load/Use data hazards

Memory reads occur late in the pipeline. So if we want to read a value into the register and then change it, we will have problems.

In cycle 8, mrmovq will load $rax, but in cycle 7, the addq instruction that follows right after mrmovq has already decoded the $rax value. mrmovq is too late.

We must stall the instruction. The decode stage detects that addq requires a result from a memory read and will inject a bubble into the execution stage. The injection will stop once the previous instruction has reached memory stage.

Avoiding control hazard

It arises when the processor can’t reliable find the next instruction. It could only occurs for ret and conditional jump.

For ret, we will inject bubbles until the ret has reached write-back stage and the PC selection logic will set the PC to to the return address.

For conditional jump, we will cancel the instructions after the mispredicted jump through injecting bubble at execution and stages after execution. Because from fetching conditional jump till finding out the jump is incorrect only takes 2 cycles, any instructions after conditional jump won’t proceed till execution so they won’t alter any programmer-visiable state. So simply make the instruction “disappear” through bubble.

Exception handling

Events that will cause exception to break the current program execution are

  • Internal(such as overflow, division by zero). Our processor has 3 ways:(a halt, (b invalid instruction, (c attempt to read, write to invalid address, both in memory stage or instruction fetch.
  • External. such as mouse move and internet package arrival.

After an excepting instruction, non of the following instruction should have any effect on the programmer-visible stage.

Three subtleties.

(1 It is possible to have multiple exceptions happening at the same time. EG: Fetching a halt instruction and an out-of-bound read occurs in the same cycle. We must establish priority. The basic rule is to give the instruction that is furthest along the pipeline as the priority. Because the memory access will be before the halt statement in machine learning, so only this exception should report to the OS.

(2 Instruction is fetched, executed, causes an exception and later it is canceled due to a mispredicted branch. EG: In a conditional jump, we have a jump to invalid instruction and the decode stage will find out the it is invalid and throw an exception. But the pipeline later detects that it shouldn’t take the branch and discard the previous instruction. We want to avoid raising an exception in this case.

(3 An instruction following one excepting instruction to alter some part of the state.

xorq will zero the $rsp and pushq will decrement the stack pointer. This exception is detected in the memory stage but on the same cycle, the execution stage has the addq and will set the conditional code to new values.
This violate the requirement that after an excepting event, we shouldn’t alter any programmer-visible state.

If an instruction has caused a exception, the status code will indicate the reason for exception and the status will propagate till writeback stage and the control logic will detect the exception and stop.

To avoid updating programmer-visible states, the pipeline logic will disable any update to conditional code, memory when an instruction in the memory or write-back stages has caused an exception.

We can see that the exception doesn’t alter the flow of the execution, except when we have to disable update for programmer-visible state for the following instructions. The structure of the processor makes sure the first exception will be the first to report to the write-back and to the system. If some instruction is fetched but later canceled, the exception is canceled as well.

PIPE stage implementations.

Random finding

CISC and RISC

CISC stands for complex instruction set and RISC means reduced Instruction set computer. CISC has very complex instruction set and x86 uses it. RISC believes efficient code could be generated by simpler form of instruction and it uses less hardware and can use efficient pipeline. PowerPC and ARM uses RISC.

But now everything blends together and the line between CISC and RISC blurs.

Chapter 2(representing and manipulating information) notes

Info storage

Hex

We want to see bit pattern. Binary is too cumbersome and decimal needs to be converted so we have Hax. The val of a single byte can range from 00_{16} - FF_{16}

Convert from decimal to hex: (1)Divide the decimal number by 16 repeatedly: x = q \cdot 16 + r (2)Then r will be the least significant digit of the hex. (3) repeat this process on q

Convert from hex to decimal: Multiply each digit of the hex by the appropriate power of 16. 0x 7AF = 7 * 16^2 + 10 * 16^1 + 15*16^0

Word

Word size: the normal size of int or pointer. Very important because it dictates the size of the virtual address space. Since pointer is as long as word length, so the address range from 0 - 2^w = 1.

Pointers need to use the full size of the word so for a 64-bit machine the pointer will be 8 bytes long. But for 32-bit machine it will only need 4 bytes. Potentials for bugs since some 32-bit programs would store pointer in int type but that won’t work for 64bits.

Address for multi bytes objects

Multi-bytes object is stored on contiguous sequences of bytes and the address is the smallest address of the bytes used.

Byte ordering(Important, came up in Network)

0x87654321 Little Endian: The beginning(smallest) address will store the least significant bit(on the right most of the number) 12, 34,56,78

low \rightarrow high

Big endian: start with storing most significant to least(Left to right, natural for us)

It matters in

(a)Network: If little communicate with big through a string of bytes, the bytes will be received in reverse order.

(b)Reading bytes generated by little endian: Need to reverse it. Least on the left and most sig on the right.

Strings

A string in C is an array of chars terminated by the null. Each char is represented by some encoding (ASCII).

Text data is more independent than binary data.

Boolean algebra

Multiplication distributes over addition, so does AND distributes over OR. We can’t distribute addition over multiplication, But we can distribute or over AND. a | (b & c) = (a | b) & (a | c)

Each element is its own inverse in bool algebra. a ^ a = 0 where 0 is an all-zero vector. This holds even we rearrange and combine them in different orders. (a ^ b) ^ a = b. An very interesting application is in place swap


void inplace_swap(int *x, int *y){
    *y = *x ^ *y;
    *x = *x ^ *y;
    *y = *x ^ *y;
}

If the x stores a and y stores b, then

step & *x & *y

initial & a & b

1 & a & a ^ b

2 & a ^ a ^ b = b & a ^ b

3 & b & a ^ b ^ b = b

(The Latex tabular is broken lol)

Bit masking

using bit patter to indicate a selected set of bits within a word.

x & 0xFF will get the least significant byte of x.

~0 will get a mask of all ones, regardless of word size(portable)

Logic operatior && || and ~

OR, AND, NOT in logic and any non-zero arg is TRUE and the operation will return True or False. It has a short circuit, if the result of the expression can be determined by evaluating the first arg, then the operation won’t eval the second args. So a && 5 / a won’t ever cause a division by 0 and p && *p++ won’t dereferencing a null pointer

A way to do x == y is to

!( x ^ y)

x ^ y will give us all zero if all the bits match. Then the NOT will return 1 only if all bits are zeros.

shift operation in C.

x = [x_{n -1},x_{n -2},...,x_{0}] , we apply x << k, shifting k bits to the left and drop off the k most significant bits, filling the right with k zeros. We got [x_{n k -1},x_{n k -2},...,x_{0}, 0 ... k zeros ...]. Shift associate from left to right. x<<j<<y = (x<<j)<<y

Right shift. Two kinds. 1:arithmatic, 2: logic. The logic one is similar to left shift, shifting k elements to the right and leave k zeros on the left. Arithmetic will shift to the right by k, but instead of filling k zeros on the left, it will have k x_{n - 1}, the most significant bit.

Integer representation

Two’s complement

B2T_w (x) = -x_{w + 1}2^{w - 1}+ \sum_{i = 0}^{w - 2}x_i 2^i We can see the most significant bit(sign bit) is a negative weight. When the sign bit is 1, the whole number has to be negative.

The range of two’s complement is asymmetric; |Tmin| = |Tmax| + 1 because half of the bit patterns are for negative and half are for non-negative, which includes zero so the |Tmax| will be 1 smaller than Tmin.

Other representations:

One’s complement: Similar to two’s complement, except the negative weight of the most sig dig is 2^{w - 1} - 1.

Sign magnitude. The sig dig determine the sign.

Both method has two ways to representation zero. NOT cool. Note the different position of apostrophes: Two’s complement versus Ones’ complement. The term “two’s complement” arises from the fact that for nonnegative x we compute a w-bit representation of −x as 2w − x (a single two). The term “ones’ complement” comes from the property that we can compute −x in this notation as [111 . . . 1]− x (multiple ones).

Conversion between signed and unsigned.

When convert between sign and unsigned whose word sizes are the same, the numerical value might change but the bit values stay identical. (

short int v = -12345;
unsigned short uv = (unsigned short) v;

v = -12345,where -12345 = - 2^16 + 53191 uv = 53191 . Because now we interpret the two’s complement as the unsigned. The numerical value difference between unsigned and two’s complement with the same bit pattern is 0 if x >=0 is x = 2^w if x_{w - 1} = 1(negative)

When an operation is performed on an int and an unsigned int, the int will be dragged to to unsigned, assuming the number is nonnegative.

Problem for relational operators < and >. eg: -1 < 0U. -1 will be thought as nonnegative so -1 is actually 4294….. and 4293…. < 0 is false.

Expanding bit representation

Type promotion, expand space. For unsign we use zero expansion, adding zeros to the left. For two’s complement, we use sign extension, adding copies of the most significant bit to the left. [x_{w - 1}, x_{w - 2}....x_1] ] [x_{w - 1},x_{w - 1},x_{w - 1},x_{w - 1},x_{w - 2}, ...x_1] Case in point: 101 = -3, 1101 = -8 + 4 + 1 = -3. The newly added 4th bit will cancel out half with the previous 3rd bit, therefore we will subtract the same number. (To prove formally, use induction)

If we are doing

short sx = -12345;
unsigned uy = sx;

uy = 4294954951

Which means we first convert short to (unsigned)(int) sx.

Truncating

w-bit number to k-bit number w > k, we will drop the leading w – k bits. For unsigned number x, the numerical value is equivalent to x mod 2^k. Because 2^k mod 2^k = 0 , i > k. (4 mod 2 = 0, 8 mod 2 = 0).

For two’s complement, it is very similar. We first convert it to unsign, then unsign to two’s complement

Integer Arithmetic

Word size inflation: To to addition for two k bits number might require k + 1 bts. can not place any bound on the word size required to fully represent the result. If x + y < 2^k the leading bit of k + 1 bit is 0, so the numerical value of this int addition is the same as the exact math addition.

If 2^{k } < x + y < 2^{k + 1} , then overflows happened, and we discard the leading k + 1 bit,(truncate) and this is equivalant to x + y - 2^k.

To check if overflow happens or not for s = a + b, we can check s >= b. if this is true, then overflow didn’t happened. Because a + b \geq a so s \geq a. On the other hand if overflow happens, s = a + b - 2^k and a < 2^k so a - 2^k < 0 ,s = (a -2^k) + b < b. So if overflows happened, s < b, s < a

Floating point

format

For IEEE floating point, we have V = (-1)^s * M * 2^E. Where s stands a single sign bit. Significand Mis a fractional binary number that ranges either between 1 and 2 – \episilon or between 0 and 1 – \episilon. The exponent E weights the value by a ( possibly negative) power of 2.

For single precision, s, exp, and frac are 1, k = 8, n = 23.

For double precision, s, exp, and frac are 1, k = 11, n = 52.

Three cases of floating point

Normalized value.

When exp is neither all 0 nor all 1(255). It is interpreted as a sign integer in biased form. E = e - Bias where e is the unsigned number with bit representation e_{k-1}....e_{1}e_{0} and Bias is a bias value equal to 2^{k - 1} - 1.(127 for single precision). So the exponent ranges from -126 to 127.

Implied leading bit: the leading bit is always one so we can ignore it and use the leftover bit for one bit more precision. To see that,we use an example:

0.0123 = 1.23 * 10^{-2}. Same with binary, we can always shift the exponent so the leading bit is one.

Denormalized values

When the exponent field is all ZEROS. Exponent value is E = 1 – Bias and the significand value is M = f. The value of the fraction field is without an implied leading 1.

It can reps 0 since a normalized value’s M is always M > 1.

Floating points rounding.

Find the closest matching number in floating point number format. The design decision for rounding is when the result is between two possible result. For example, If we want to round to the closest integer and our number is 2.5, we have 2 and 3 to choose from.

The default is round-to-even, which rounds the number either up or down so the least significant digit is 0(even). To do so will avoid statistical bias.
Only bit pattern XXX.YYYY100…, where X and Y are arbitrary numbers, will need round-to-even because it is half way between two results.

Floating ops.

Floating additions are communicative but not associative.
Floating point multiplication is communicative but not associative due to possible overflow or losee of precision.

Random findings

(a) Floating point arithmetic is not associative! A case in point is 3.14 + (1e20 – 1e20) = 3.14 but for most computer (3.14 + 1e20) - 1e20 = 0

(b) The C++ use the same exact numerical representations and ops as C. But Java uses a new set of standards! Java is more specific on the formats.

(c)

zhenk14 > gcc -std=c99 prog.c

is using ISO C99 standard, which took over ANSI (c89)

(d) using a cast to allow an object to be referenced according to a different data type from which it is created. Discouraged but used in system programming.

(e) long in 32-bit machine is as long as an int?

IA32 floating point arithmetic use a special 80 bit extended precision floating point format. All single and double precision floating point number are converted to 80 bits to perform arith ops. After the operations, the results are converted back to store in the memory.

About C

We can dereference with the array notation

byte_pointer start;
start[3] //This will access the 4th elem stored in the array pointed by the byte_pointer.

In general, start[i] will pointed i position beyond the loc pointed by start.

type

Typedef can name type and is very similar to var declaration.

pointer creation and deref

The C “address of” operator & creates a pointer.

int *byte_pointer = &x //x is an int

Will create a pointer that points to the location that holds x. The type of the pointer depends on the type of x.

Typecast

(byte_pointer ) &x //x is an int

will cast the pointer to x, which is an int, to type byte_pointer and the program will reference it that way. Nothing change in the byte level, just the compiler to refer the data according to byte_pointer type.

Strlen() doesn’t count the null terminator

shift operations

If we are trying to shift k bits when the data is w and w < K then in C it is not specified but generally it should perform k mod w. JUST DONT DO IT Also shift has lower precedence than addition.

Floating point ops

C doesn’t require IEEE. But it is is math.h.

picked up a system book

It is final time and my old habit of picking up random books during final weeks started sneaking in again. This time I picked the CMU system course. I have heard a friend talked very highly about it so I want to try it out myself. Good get away from dry final reviews. I will keep editing this page to reflect the progress I have made. Let’s hope it all go well.