Logging and recovery

Big picture

Why we want logs

The goal is atomicity: either all of a transaction's modifications are applied, or none are.
To achieve this, we first output the intended change to stable storage (the log) before modifying the database itself.

Why we need logs

A log lets us ensure that all modifications made by committed transactions are reflected in the database, possibly by redoing them during recovery, and that no change made by an aborted transaction survives.


The log is a series of log records that record the update activity on the database.

Update log

It usually has these 4 items:

  1. Transaction id
  2. Data item id, usually a disk location plus an offset
  3. Old value
  4. New value

Old value allows us to undo

In case of an aborted transaction, we can undo the change.

New value allows us to redo

When a transaction committed but its changes didn’t make it to the disk, we can redo the transaction.
Just replace the data item with the new value.
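The undo/redo roles of the old and new values can be sketched with a toy in-memory "database" (a plain dict; the record layout and all names here are invented for illustration, not a real DBMS):

```python
# Toy sketch: an update log record as (txn_id, item_id, old_value, new_value)
# applied to an in-memory "database" (a dict).
db = {"A": 100}

def undo(db, record):
    txn, item, old, new = record
    db[item] = old              # undo: restore the old value

def redo(db, record):
    txn, item, old, new = record
    db[item] = new              # redo: reapply the new value

rec = ("T1", "A", 100, 150)     # T1 changed A from 100 to 150
redo(db, rec)                   # after a post-commit crash, redo reapplies it
assert db["A"] == 150
undo(db, rec)                   # for an aborted transaction, undo restores it
assert db["A"] == 100
```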

Start log

Commit log

Transaction Ti has committed

Abort log

Transaction Ti has aborted

How a transaction happens


  1. The transaction performs some computation
  2. The transaction modifies the data block in the disk buffer
  3. DB executes page write to write the data block back to disk

When to modify

Deferred modification

A transaction doesn’t update till it has completed.

Immediate modification

modify the database while the transaction is still active

Crashing scenario

Not all changes made it to the disk

Transaction committed, but modifications not finished

We need to set the data item to the new value in the log record (redo).

Transaction modified the db but then aborted

We need to erase the change made to the db: set the data item to the old value in the log record (undo).

Orders matter during recovery

The order of updates applied by redo is important; otherwise we might end up with the wrong value.
Recovery therefore scans the log from the beginning and performs the action for each log record in order, preserving the original order of updates.


Commit means we have output the <Ti, commit> log record to stable storage.

It ensures transaction atomicity

If the system crashes before <Ti, commit> is written, we will need to roll back Ti.

Another look at REDO and UNDO


UNDO is needed during recovery if a transaction Ti has <Ti, start> in the log but neither <Ti, commit> nor <Ti, abort>. We need to roll back all its changes and also write <Ti, abort> to indicate the undo has completed.
This makes sure every transaction in the log is complete, ending in either a commit or an abort.


Recall that we need to REDO during recovery if we find a transaction Ti that has <Ti, start> and also has <Ti, commit> or <Ti, abort>.

Why do we need REDO when we see an abort

Usually when a transaction aborts, we don’t need to redo it. But we REDO it anyway, because the <Ti, abort> record might have been written by an UNDO during an earlier recovery; in that case we still need to repeat the writes that finished rolling back that transaction.
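The decision rule above can be sketched in a few lines of Python (the log format and names are invented for illustration): transactions with a start record and a commit or abort record go in the redo set; transactions with only a start record go in the undo set.

```python
# Sketch of the redo/undo decision over a log given as a list of tuples:
# ("start", T), ("update", T, item, old, new), ("commit", T), ("abort", T).
def recovery_sets(log):
    started, finished = set(), set()
    for rec in log:
        if rec[0] == "start":
            started.add(rec[1])
        elif rec[0] in ("commit", "abort"):
            finished.add(rec[1])
    redo_set = started & finished    # committed or aborted: REDO
    undo_set = started - finished    # incomplete: UNDO, then log <T, abort>
    return redo_set, undo_set

log = [("start", "T1"), ("update", "T1", "A", 1, 2), ("commit", "T1"),
       ("start", "T2"), ("update", "T2", "B", 5, 9)]
assert recovery_sets(log) == ({"T1"}, {"T2"})
```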


Recall that during recovery we would otherwise scan from the beginning of the log file, which can take a long time. Checkpoints bound this scan.

Strict checkpoint

  • doesn’t permit any update
  • outputs all modified buffer to disk

It follows these steps:

  1. Output all log records currently in main memory to stable storage
  2. Output all modified blocks in the buffer to disk
  3. Output a <checkpoint L> record, where L is a list of transactions active at the time of the checkpoint

Any database modification that happened before the checkpoint needs neither redo nor undo. A transaction Ti that has its <Ti, commit> or <Ti, abort> before the <checkpoint L> record must have had its changes written to disk, because the checkpoint outputs all modified blocks. So there is no need to perform REDO for it.

REDO and UNDO only need to be applied to transactions in L (the transactions active during checkpointing) and to transactions that started after the checkpoint.

Fuzzy checkpoint

We could have done a fuzzy checkpoint.
during fuzzy checkpoint, transactions are allowed to perform updates.

SQL cheat sheet

Relational Algebra

A procedural language that tells the machine “How” to get what we want


Selection (select rows), Projection (keep required columns), Cartesian product (pair every tuple of one relation with every tuple of the other), and others…

Relational Calculus

A non-procedural language that tells the machine “What” to get.

A basic SQL clause


It usually looks like

SELECT A1, A2, ..., An
FROM r1, ..., rm
WHERE P;

SELECT is used to list the desired attributes of the result (A1 ... An are attributes). FROM is the list of relations we need to access. The WHERE clause is a predicate (condition) involving attributes of the relations in the FROM clause.

SELECT doesn’t eliminate duplicates!

Does FROM internally calculates a Cartesian product?

for each tuple t1 in relation r1
    for each tuple t2 in relation r2
        ...
        for each tuple tm in relation rm
            concatenate t1, ..., tm into a single tuple t
            add t into the result relation

But we rarely use the raw product. Instead we use WHERE to restrict the combinations. If we have

SELECT a1, a2
FROM R1, R2
WHERE R1.a3 = R2.a3

we would only combine each tuple in R1 with those tuples in R2 that have the same a3. We only match tuples with the same a3.
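A runnable illustration using Python's sqlite3 (the tables and values are made up): the WHERE clause keeps only the pairs from the Cartesian product that share a3.

```python
import sqlite3

# Hypothetical tables R1(a1, a3) and R2(a2, a3), joined on a3 via WHERE.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R1 (a1 TEXT, a3 INTEGER);
    CREATE TABLE R2 (a2 TEXT, a3 INTEGER);
    INSERT INTO R1 VALUES ('x', 1), ('y', 2);
    INSERT INTO R2 VALUES ('p', 1), ('q', 3);
""")
rows = conn.execute(
    "SELECT a1, a2 FROM R1, R2 WHERE R1.a3 = R2.a3").fetchall()
assert rows == [('x', 'p')]   # only the pair sharing a3 = 1 survives
```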

Natural Join

Natural join will join two relations together. But will only concatenate tuples that have the same values for all attributes that appear in the schemas of both relations.

If R1 has a1, a2, a3 and R2 has a2, a3, a4. R1 NATURAL JOIN R2 will only concatenate tuple with identical a2 and a3. It will return relation with R3 = a1, a2, a3, a4.

Specifying what column should be equated

The above example shows that natural join can be error prone. To specify which columns should be equated, use

R1 JOIN R2 USING (a2)

This will equate only a2 as the condition for concatenating tuples in R1 and R2.


Reasons for rename

Identical attribute names from two relations

Arithmetic expression will result in attribute with no name

Just wanna change it

As clause

The syntax for AS is old_name AS new_name. AS can appear in both SELECT and FROM. In SELECT, we can rename attributes:

SELECT name as person_name

We can also rename relations to simplify

SELECT A.name, B.id
FROM banana_whatever_is_good as A, no_apple_is_the_best as B
where A.id = B.id;
Very helpful to compare tuple in the same relation.

EG: Find all students whose completed credit hours are greater than those of at least one senior at our college.

SELECT S.name
FROM student as S, student as C
WHERE S.hour_complete > C.hour_complete and C.year = 'senior'

But C is not a copy of S. It is just an alias.

String ops

Strings are in single quote.

Pattern matching

% matches any substring

'hello%' will match any string that starts with hello. '%hello%' will match any string that contains hello.

_ (underscore) matches any single character

'___' will match any string of exactly 3 characters. '___%' will match any string at least 3 characters long.
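These patterns can be tried directly with sqlite3 (the table and strings are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (s TEXT)")
conn.executemany("INSERT INTO t VALUES (?)",
                 [("hello world",), ("say hello",), ("hi",), ("abc",)])
# 'hello%'   : starts with hello
starts = [r[0] for r in conn.execute("SELECT s FROM t WHERE s LIKE 'hello%'")]
# '%hello%'  : contains hello anywhere
contains = [r[0] for r in conn.execute("SELECT s FROM t WHERE s LIKE '%hello%'")]
# '___'      : exactly three characters
three = [r[0] for r in conn.execute("SELECT s FROM t WHERE s LIKE '___'")]
assert starts == ["hello world"]
assert contains == ["hello world", "say hello"]
assert three == ["abc"]
```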

Order by

The result tuples will be in sorted order (numerical, alphabetical, …). Sorting can also be performed on multiple attributes.

SELECT * FROM student ORDER BY birthday desc, name asc;

This lists all students' info in descending birthday order. In case people have the same birthday, they are ordered ascendingly by name.

Set ops


UNION

Combines tuples from two relations that have the same set of attributes.

No duplicates.

If you want duplicates, use UNION ALL.

INTERSECT

Finds all tuples that exist in both relations.

No duplicates as well.

EXCEPT

Finds all tuples that exist in the first relation but not the second.


Arithmetic with null

For + – * /, if one of the operands is null, the result is null.


If any comparison has one operand that is NULL, the comparison evaluates to unknown. This creates a third logical value besides true and false.

In a where clause

true AND unknown = unknown. false AND unknown = false. true OR unknown = true, false OR unknown = unknown. NOT unknown = unknown

If the WHERE clause evaluates to false or unknown for a tuple, that tuple won’t be included.

Aggregate functions

They take a collection of values and return a single value. Such as AVG, MIN, MAX, SUM, COUNT

To count unique student ids, we need to use DISTINCT: COUNT(DISTINCT ID).


Group by

We want to aggregate over groups of tuples, for example calculating the average GPA for each class standing in the college. We use GROUP BY. The attributes given in the GROUP BY clause are used to form groups: tuples with the same value on all attributes in the GROUP BY clause are placed in one group.

SELECT class_name, avg(GPA) as avg_GPA
FROM student
GROUP BY class_name

The attributes in SELECT outside of aggregate functions must appear in GROUP BY.


HAVING states a condition on the groups formed by GROUP BY. It is applied after GROUP BY, once the groups have been formed. The predicate for HAVING is often an aggregate.

select department, avg (GPA)
from student
group by department
having avg (GPA) > 3;
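A quick sqlite3 sketch of GROUP BY plus HAVING (hypothetical student table and values):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (name TEXT, department TEXT, GPA REAL);
    INSERT INTO student VALUES
        ('a', 'CS', 4.0), ('b', 'CS', 3.0),
        ('c', 'MUSIC', 2.5), ('d', 'MUSIC', 2.5);
""")
# groups: CS (avg 3.5) and MUSIC (avg 2.5); HAVING keeps only CS
rows = conn.execute("""
    SELECT department, AVG(GPA)
    FROM student
    GROUP BY department
    HAVING AVG(GPA) > 3
""").fetchall()
assert rows == [('CS', 3.5)]
```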

Nested query

in WHERE clause

A common use of subqueries is to perform tests for set membership, make set comparisons, and determine set cardinality, by nesting subqueries in the where clause.

IN operator

We can use IN to specify multiple values in a WHERE clause. It is also a shorthand for multiple ORs.

SELECT column_name(s)
FROM table_name
WHERE column_name IN (value1, value2, ...);
SELECT name
FROM student
WHERE name IN ('TOM', 'JOHN')

Find all the TOM and JOHN in the student body.

We can also put a subquery inside IN:

SELECT column_name(s)
FROM table_name
WHERE column_name IN (subquery);

If we want to find all the courses that are taught both in Fall 2009 and in Spring 2010, we can use intersect.

(SELECT course_id
FROM section
WHERE semester = 'Fall' AND year = 2009)
INTERSECT
(SELECT course_id
FROM section
WHERE semester = 'Spring' AND year = 2010);

(This example is from DB systems concept 6e)

We can also use IN

SELECT course_id
FROM section
WHERE year = 2009 and course_id in (SELECT course_id
                                    FROM section
                                    WHERE year = 2010)

Set comparison


For a query like “Find the names of all instructors whose salary is greater than at least one instructor in the Biology department.” We can use alias to refer to the same relation.

select distinct T.name
from instructor as T, instructor as S
where T.salary > S.salary and S.dept_name = 'Biology';

The same query can be written with >SOME:

select name
from instructor
where salary > some (select salary
                     from instructor
                     where dept_name = 'Biology');

The subquery will generate all the salary of biology professor. Then the >SOME comparison is TRUE if the salary of that professor is greater than at least one member of all the salary values.


select name
from instructor
where salary > all (select salary
               from instructor
               where dept_name = 'Biology');

This query will find instructor’s name whose salary is greater than all salary in the bio department.

consider the query “Find the departments that have the highest average salary.”

select dept_name 
from prof_list
group by dept_name
having avg(salary) >= all (select avg(salary) from prof_list group by dept_name)

We first write the sub-query to find the average salary of every department. Then we select the groups whose average salary is at least as high as every department's average.

Subqueries in FROM

A select-from-where expression in SQL returns a relation as a result. So we can insert select-from-where(a subquery) anywhere a relation can appear.

In a previous example, we tried to find the departments whose students have an average GPA higher than 3.6. We used HAVING on the groups generated by GROUP BY.

select dept_name, avg(GPA) as avg_gpa
from student_list
group by dept_name
having avg(GPA) > 3.6

We can also express this with a sub-query in the from clause.

select dept_name, avg_gpa
from (select dept_name, avg(GPA) as avg_gpa
      from student_list
      group by dept_name) as dept_gpa
where avg_gpa > 3.6

In this query, the subquery first generates every department with its average GPA.
Then we use a WHERE clause to filter out the needed tuples.
The HAVING predicate has moved into the WHERE clause.

The attributes of the sub-query can be used in the outer query, as is the case for dept_name and avg_gpa here.

This sub-query in FROM can often be replaced by HAVING. But in some situations HAVING can't help, notably when we want a second level of aggregation over the per-group results (for example, the max over per-group sums): HAVING can only filter groups, it cannot aggregate over them.

EG: We want to find the maximum total salary spending over all departments.

select max(total_salary)
from (select dept_name, sum(salary) as total_salary
      from prof_list
      group by dept_name) as dept_total

With clause

The WITH clause defines a temporary relation whose definition is available only to the query in which the WITH clause occurs. It gives a sub-query block a name.

An example from geeksforgeeks:
WITH temporaryTable (averageValue) as
    (SELECT avg(Attr1)
     FROM Table)
SELECT Attr1
FROM Table, temporaryTable
WHERE Table.Attr1 > temporaryTable.averageValue;
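A runnable version of the same idea using sqlite3 (the table name and values are made up; note the temporary relation also appears in FROM):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (attr1 REAL);
    INSERT INTO t VALUES (1), (2), (3), (10);
""")
# name the subquery computing the average, then use it in the main query
rows = conn.execute("""
    WITH avg_t(average_value) AS (SELECT AVG(attr1) FROM t)
    SELECT attr1 FROM t, avg_t WHERE t.attr1 > avg_t.average_value
""").fetchall()
assert rows == [(10.0,)]   # the average is 4, only 10 exceeds it
```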

Modification of DB


Delete

We can only delete whole tuples. We can not delete the values of particular attributes.

delete from relation
where P

we will delete all tuples that satisfy the predicate.

We can make the predicate P very complex. An example borrowed from the book:
delete all instructors whose department is located in building X, where instructors are stored in the instructor table and departments in the department table.

delete from instructor
where dept_name in (select dept_name
                    from department
                    where dept_loc = 'X')

delete can only delete from one relation at a time,
but we can refer to the relation under deletion in the WHERE clause. EG: delete all students with below-average gpa.

delete from student
where gpa < (select avg(gpa) 
             from student);

We will first calculate the avg gpa. Then do the comparisons and delete all tuples that fail the test.
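This delete-with-subquery behavior can be checked with sqlite3 (toy table and values):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (name TEXT, gpa REAL);
    INSERT INTO student VALUES ('a', 2.0), ('b', 3.0), ('c', 4.0);
""")
# the average (3.0) is computed over the original table, then rows
# failing the test are deleted
conn.execute(
    "DELETE FROM student WHERE gpa < (SELECT AVG(gpa) FROM student)")
left = [r[0] for r in conn.execute("SELECT name FROM student")]
assert left == ['b', 'c']   # only 'a' was below average
```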


Insert

Either specify the tuple to be inserted, or write a query that returns a set of tuples to be inserted.

simple insert

insert into course
values (’CS-437’, ’Database Systems’, ’Comp. Sci.’, 4);

query insert

EG: add all students who have higher than 3.5 GPA to honor group

insert into honor
       select ID, name, GPA
       from student
       where GPA > 3.5

The select statement gives a set of tuples, which is then inserted into the relation.


Update

Update some values in a tuple without changing all of them.

update student
set grade = grade * 1.5
where grade < 60;

Give students who need help a curve.

update student
set grade = grade * 1.5
where grade < (select avg(grade) from student);

We can also use CASE for an if-elseif-else style update.

update student
set grade = case
            when grade < 60 then grade * 1.5
            when  ....      then ...
            else grade * 1.3
            end
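A runnable sketch with sqlite3 (toy table; in this version the ELSE branch simply leaves the grade unchanged):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (name TEXT, grade REAL);
    INSERT INTO student VALUES ('a', 40), ('b', 80);
""")
# curve grades below 60 by 1.5x, leave the rest as-is
conn.execute("""
    UPDATE student
    SET grade = CASE
                WHEN grade < 60 THEN grade * 1.5
                ELSE grade
                END
""")
rows = conn.execute("SELECT name, grade FROM student").fetchall()
assert rows == [('a', 60.0), ('b', 80.0)]
```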

Database index and hashing

We want to locate where our records are. We might resort to an index, just as we go to the index at the back of a book when trying to find a keyword.

Basic concepts:


Indices are used to locate records. Attributes that are used to look records up are called the search key. They don’t have to be a key or super key.

Ordered indices: it is based on sorted order of the records

Hash indices

operations and performance affected by choice of indices

Access type:

We can have specific attribute look up or range look up

Space overhead



Ordered indices

Ordered indices store index entries sorted on the search-key value.

Clustering indices

Records in the file might themselves be sorted in some order. A clustering index is an index whose search key also defines the sequential order of the file.
They are also called primary indices.

Nonclustering indices

The index order is different from the sequential order of the records; the index is a second list that has pointers to the physical rows.

Index entry

It is a search-key value and a pointer to one or more records that have that search-key value.

Dense index

Index entry appears for every search key value in the file.

File organization in DB

A DB is mapped into files maintained by OS.

Layers of a file


File is organized logically as a sequence of records.


Each file is logically partitioned into fixed-length storage units called blocks, usually of size 4 to 8 KB.


A block might contain several records.

Assumptions about records

  1. No record is larger than a block. If the record has variable length objects, we use pointers.
  2. Each record is contained in only one block. This is good for speeding up access.


Fixed length record

type person = record
              name varchar(10);
              id numeric(8,2);
              end

Assume the above record takes 20 bytes.

How many records in a block

We use block_size // record_size (integer division) as the number of records per block. Some space at the end of the block might be left unused.


When deleting a record, we could move all records after it forward, but that is too much movement.

Move the last record to the place of the deleted record.

Leave open and wait for subsequent insertion using free list

We need to record where each empty record is. We do so by adding a header in front of the block that records the first deleted record(at the beginning, point it to NULL). After deleting the first record, we point the pointer in the header to the beginning of the deleted block. If we delete again, we will store the location of the second deleted record in the first deleted record.

Upon insertion, we will insert the new record into the place pointed to by the header pointer, and update the header pointer to the next available free location (that was stored in the location of the first deleted record).

The deleted records are like a linked list. It is called free list.
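The free list can be simulated in a few lines of Python. This toy version links newly freed slots at the head of the list (a common LIFO variant of the scheme described above); all names are invented:

```python
# Toy free list over a fixed array of record slots. The block "header"
# is just the index of the first free slot; each free slot stores the
# index of the next free slot (None terminates the list).
class Block:
    def __init__(self, nslots):
        self.slots = [f"rec{i}" for i in range(nslots)]
        self.free_head = None            # header pointer: no free slots yet

    def delete(self, i):
        self.slots[i] = self.free_head   # freed slot stores the old head
        self.free_head = i               # header now points at this slot

    def insert(self, record):
        i = self.free_head
        self.free_head = self.slots[i]   # pop the next free location
        self.slots[i] = record
        return i

b = Block(4)
b.delete(1)
b.delete(3)                  # free list: 3 -> 1 -> None
assert b.insert("new") == 3  # reuses the most recently freed slot
assert b.free_head == 1
```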

Variable length records

Where do they come from

  1. Storing multiple record types in a file
  2. Records that have variable-length attributes

Representation of variable length record

Usually it has two parts: the fixed-length attributes, followed by the variable-length attributes. (Figure: variable-length record representation.)

The last three fields are the variable-length fields. Before them, the beginning of the record holds information about each attribute's position and size as (offset, length) pairs, whether the attribute is variable or fixed length.

In the middle we have a null bitmap, indicating which attributes have a null value. In some DBs the null bitmap is stored at the beginning, and null attributes store no data at all, saving space.

Storing varlen record in a block using slotted page structure

At the beginning of each block there is a header that has

  1. Number of records
  2. End of free space
  3. An array that has the location and size of each record

The records are allocated continuously in the block starting from the end of the block. The free space in the block is contiguous, between the header and the first record.

Insert record in slotted page structure

Insert at the end of the free space, so the records are still contiguous.


Delete record in slotted page structure

We set the entry in the array to deleted (for example, set the size to -1). Then we move records to keep the free space contiguous between the header and the records, updating the array entries accordingly. (Figure: slotted page structure.)

How to organize records in a file

Heap file organization

Any record can be placed anywhere in the file. No ordering. Typically one file only contains one relation.

Sequential file organization

Order the records based on a search key, and store them physically in search-key order (or as close to it as possible). Insertion creates a problem: if we can't find free space in the right block, we insert the record into an overflow block and adjust the pointer chain of the preceding record to point to this record.

Might need to reorder the records.


Hashing file organization

Use a hash function computed on some attribute of each record to locate the block it resides in.

What large database does with file

Large DBs usually don’t use the OS's file management. They just ask the OS for one big file and put all records in it.

Multitable clustering file organization

If we store multiple relations in a single block, it might be advantageous for some join query.

select dept_name, student_name, dept_location
from dept natural join student

For each tuple of department, the DB must locate the student tuples with the same department name and transfer them to main memory. In the worst case, each record is in a different block, and we have to do one block read per record. But if we store related records in the same block, one block read can satisfy the join.

In the previous example, Assume department looks like

dept_name, dept_loc, dept_chair
CS         Klaus     Buzz       
MUSIC      NYC       Shostakovitch 

and the student looks like

id, name, dept_name, year
1    kd    CS        2020

Then we would store these two relation in a multitable like:

cs    | klaus | Buzz
1     | kd    | 2020
2     | ted   | 2021
music | nyc   | Shostakovitch
33    | tom   | 2019

Thus a join will require fewer block seeks.

Buffer manager

Since we can only alter data in memory, we need to bring pages from disk into main memory. Each page brought in is stored in a frame of the buffer pool. When we need space for more pages, we evict pages from the buffer pool based on some predefined criterion. If an evicted page is “dirty”, we also need to write the dirty page back to disk.
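A toy buffer manager sketch in Python, assuming LRU eviction (the class, the dict standing in for the disk, and all names are invented for illustration):

```python
from collections import OrderedDict

# Toy buffer manager: fixed number of frames, LRU eviction, and a dict
# standing in for the disk. Dirty pages are written back on eviction.
class BufferPool:
    def __init__(self, nframes, disk):
        self.nframes, self.disk = nframes, disk
        self.frames = OrderedDict()                # page_id -> (data, dirty)

    def get(self, pid):
        if pid in self.frames:
            self.frames.move_to_end(pid)           # mark most recently used
        else:
            if len(self.frames) >= self.nframes:   # evict the LRU victim
                victim, (data, dirty) = self.frames.popitem(last=False)
                if dirty:
                    self.disk[victim] = data       # write dirty page back
            self.frames[pid] = (self.disk[pid], False)
        return self.frames[pid][0]

    def put(self, pid, data):
        self.get(pid)                              # bring page in if needed
        self.frames[pid] = (data, True)            # modify in memory, mark dirty
        self.frames.move_to_end(pid)

disk = {1: "p1", 2: "p2", 3: "p3"}
pool = BufferPool(2, disk)
pool.put(1, "p1-modified")
pool.get(2)
pool.get(3)                      # evicts page 1, which is dirty
assert disk[1] == "p1-modified"  # dirty page was written back on eviction
```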

Database hardware

Memory hierarchy (from primary, faster memory to secondary, slower memory)


Main memory

Flash memory

Non volatile

Flash memory keeps data even when it is powered off.


Cheaper than main memory for a given storage capacity. Used a lot in cameras and phones.

USB(Universal serial bus)

A lot of USB sticks use flash memory.


Solid state drives.


Optical storage(CD, DVD)

Use laser to read.


Magnetic tape

Can only be read sequentially.

Magnetic disk


Each magnetic disk contains several platters. Platters are flat and circular, with surfaces covered in magnetic material; they are usually made of glass or metal.

units on disk

Each disk platter is divided into tracks, and then divided into sectors.


Usually a platter contains 50,000 – 100,000 tracks. Inner tracks are shorter and hold a smaller number of sectors.


Sectors are usually 512 bytes. A track usually has 500 to 1000 sectors in the inner tracks. Sectors are the smallest “addressable” unit on the disk.

Read-write head:

A motor spins the disk at 60, 90, or 120 rps (revolutions per second).

The read-write head stores information by reversing the direction of the magnetic field.

The head floats above the disk

The head doesn’t touch the platter; the spinning creates a breeze that lifts the head.

Head crash

A head crash scrapes off the recording medium. Furthermore, the scraped-off particles fly around, causing further crashes.

Disk controller

The interface between the disk and the computer system. It usually sits within the disk drive. It accepts read-write commands over sectors; the controller also moves the heads around.

Remapping of bad sectors.

If the controller sees that a sector has gone bad, an attempt to write to that sector makes the controller logically map it to a different physical sector.

Interfaces to OS

SATA(serial ATA)

ATA is a connection standard from 80’s

SCSI (small-computer-system interconnect)

SAS (serial attached SCSI)

Fibre channel

SAN (storage area network)

Large numbers of disks are connected by a high-speed network. The disks are usually organized using RAID (redundant arrays of independent disks) to give the logical view of one very large disk.

NAS (network attached storage)

It provides a file system interface using networked file system protocols such as NFS

Optimizing disk-block access

Requests for Disk I/O are from vm manager and file system


A block is a logical unit consisting of a fixed number of contiguous sectors. Sometimes “page” is used to refer to a block.

Access disks


Sequential access

For successive block numbers, the disk seeks to the first block; successive requests then require no further seeks, which is a lot faster.

Random access

Just as the name implies.


Buffering

Blocks that are read are stored in an in-memory buffer to serve future requests.

Read ahead

When reading a block, read in consecutive blocks from the same track.(great for sequential)


Scheduling

When different blocks are requested, we might benefit from reordering the block requests.

Elevator algorithm

File organization

We can organize blocks such that it corresponds closely to the way the data will be accessed. OS will often allocate multiple consecutive blocks at a time to a file to increase the chance of sequential write.


A file that receives many small appends might become fragmented. The system can make a backup copy of the data and restore the entire disk; the restoration writes files back sequentially, decreasing fragmentation.

Log disk

Use a disk devoted to writing a sequential log, so several consecutive blocks can be written at once. Since the actual in-place writes can be performed later, the DB doesn’t have to wait for them to complete.

SQL cheat sheets 2


Joins types

Inner join

An inner join combines tuples from two relations that satisfy the join predicate, returning a relation where each tuple is the concatenation of the two tuples that satisfy the predicate.

ON condition

We can use ON for a general predicate over the relations being joined. EG:

select * 
from student join classes_taken on student_id = taken_id

This ON keyword specifies that a student matches a class if their ids match. It is very similar to using natural join, but using ON will list the id column twice; natural join keeps the id attribute only once.

Difference between WHERE and ON

For an inner join, putting the predicate in WHERE or in ON is effectively the same. The difference appears with outer joins: ON is part of the join specification and decides which tuples are matched (and thus which get null-padded), while WHERE filters tuples after the join has produced its result.

Recall that an outer join adds null-padded tuples only for tuples that don’t match (i.e., don’t contribute to the inner join). The ON condition is part of the outer join specification.

select *
from student left outer join takes on student.ID = takes.ID

Assume we have a student named kd with ID 1001, but there is no takes tuple with ID 1001.
Then, using the query above, since there is no match we will get a tuple
(kd, 1001, ..., NULL, NULL, NULL), where the NULLs fill in takes's attributes.
But if we move the predicate to WHERE:
select *
from student left outer join takes on true
where student.id = takes.id

The ON TRUE is evaluated as true for every pair of tuples from student and takes, so we get the Cartesian product of the two relations. Since there is no takes tuple with ID 1001, every tuple with name “kd” in the join result has student.id different from takes.id, so the WHERE clause eliminates all tuples with name “kd”.
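The ON-vs-WHERE difference can be reproduced with sqlite3 (toy tables mirroring the student/takes example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (id INTEGER, name TEXT);
    CREATE TABLE takes (id INTEGER, course TEXT);
    INSERT INTO student VALUES (1, 'amy'), (1001, 'kd');
    INSERT INTO takes VALUES (1, 'db');
""")
# predicate in ON: kd is preserved and null-padded
with_on = conn.execute("""
    SELECT name, course FROM student
    LEFT OUTER JOIN takes ON student.id = takes.id
""").fetchall()
# trivial ON, predicate in WHERE: the filter runs after the join
with_where = conn.execute("""
    SELECT name, course FROM student
    LEFT OUTER JOIN takes ON 1
    WHERE student.id = takes.id
""").fetchall()
assert with_on == [('amy', 'db'), ('kd', None)]  # kd kept, null-padded
assert with_where == [('amy', 'db')]             # WHERE filters kd out
```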

outer join

An outer join preserves the tuples that don’t get matched by creating result tuples containing null values. EG: we want a table of students with their courses, but some students might not have taken any classes; then we need an outer join.

left outer join

Preserves tuples in the relation named before the OUTER JOIN clause.

right outer join

Preserve tuples in the relation named after the OUTER JOIN clause

Full outer join

Preserves tuples in both directions.

Primary key of outer join.

Join conditions





Why view

It is not desirable for all users to see the entire logical model.


Virtual relation

In SQL, we can define a virtual relation by a query and the relation contains the result of the query. It is usually not stored. Any such relation that is not part of the logical model is called a view.



create view v as <query expression>;

We can also specify the name of the attributes explicitly

create view dept_avg_gpa(dept_name, avg_gpa) as
select dept_name, avg(gpa)
from student
group by dept_name;

Views are usually computed on demand, not stored

Because views are the result of a query, if the underlying relations change, a stored view could become out of date.

Materialized views

We can store it and update it when the underlying relations change.


We do this if we need fast response, or if we repeatedly perform large aggregates.

Update views

Rule of thumb: don’t do it, since the DB might have to alter the underlying relations.

create view instructor_info as
select ID, name, building
from instructor, department
where instructor.dept_name = department.dept_name

insert into instructor_info
values('1001', 'kd', 'Taylor building')

But what if there is no tuple in instructor with id 1001 and name ‘kd’, and no building called Taylor? If we naively insert (1001, ‘kd’, NULL, ...) into the instructor relation and (‘Taylor building’, NULL, ...) into the department relation, it is no good: the instructor_info view still won’t have the desired tuple, since dept_name in both newly created tuples is NULL.

When can you do updatable views.

  1. The FROM clause has only one relation.
  2. SELECT contains only attribute names: no aggregates, no DISTINCT.


Transactions

A transaction consists of a sequence of query and/or update statements.

Two outcomes

Commit work

updates performed become permanent in the DB

Rollback work

All the updates performed by the SQL statements are undone. The DB is restored to what it was before the transaction.


Transactions are atomic because of rollback: a transaction is indivisible, either done entirely or not done at all. No half-done states.

How big is the transaction?

By default, each query is a transaction and gets committed each time it is finished.
We can use begin atomic…end to group multiple SQL statements into one transaction.
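Python's sqlite3 can demonstrate both outcomes (toy account table; by default sqlite3 implicitly opens a transaction before an update):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE acct (name TEXT, bal INTEGER)")
conn.execute("INSERT INTO acct VALUES ('a', 100)")
conn.commit()

conn.execute("UPDATE acct SET bal = bal - 50 WHERE name = 'a'")
conn.rollback()          # abort: the debit is undone
assert conn.execute("SELECT bal FROM acct").fetchone()[0] == 100

conn.execute("UPDATE acct SET bal = bal - 50 WHERE name = 'a'")
conn.commit()            # commit: the debit is now permanent
assert conn.execute("SELECT bal FROM acct").fetchone()[0] == 50
```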

Database index and hashing (continued)

Dense index

An index entry appears for every search-key value in the file. If multiple records have the same search key, the index’s pointer points to the first such record. But for a non-clustering index, each index entry needs to store pointers to all records with that search-key value (a sequential scan from the pointer won’t work, since the records are not stored in search-key order).

Sparse index

A sparse index doesn’t store an index entry for every search key in the records. It can only be used with a clustering index, because to locate a record we find the largest index entry that is smaller than or equal to the search key we are looking for, then do a sequential scan.

Comparisons and compromise

A dense index is fast but takes up more space; a sparse index is slower but takes less space. Compromise: use a sparse index with one entry per block, since the dominant cost of processing in a DB is bringing blocks from disk into memory.

Multilevel index

Store an index on the original index. For comparison purposes, let’s compare this two-level index with binary search.

For a clustering index we could use binary search, taking O(log2 N) block reads. So for an index of 10,000 blocks, it will take up to 14 block reads to find the desired index entry. Also, if overflow blocks are used, binary search is not even applicable.

We can construct a multilevel index, treating the index just a regular sequential file and construct a sparse outer index on the original index. (Because the original index is sorted, we can use sparse outer index).

To find a record, we first find the largest outer-index entry that is smaller than or equal to the search key we want. The pointer in that entry leads to a block of inner index entries, which we scan sequentially.

Let’s assume the 10,000 blocks of original index need 100 blocks of outer sparse index. Since 100 blocks is small enough to keep in memory, locating an entry costs at most one disk read for the inner index block. This is a huge improvement over the 14 block reads required by binary search.
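A toy sparse-index lookup in Python (block contents invented): find the largest index entry that is at most the key using binary search, then scan that block sequentially.

```python
from bisect import bisect_right

# Toy sparse index over a sorted file: one index entry per block holding
# the block's first search key.
blocks = [[(1, "a"), (3, "b")], [(5, "c"), (7, "d")], [(9, "e")]]
sparse = [blk[0][0] for blk in blocks]      # first key of each block

def lookup(key):
    i = bisect_right(sparse, key) - 1       # largest index entry <= key
    if i < 0:
        return None                         # key precedes all blocks
    for k, v in blocks[i]:                  # sequential scan within block
        if k == key:
            return v
    return None

assert lookup(7) == "d"
assert lookup(1) == "a"
assert lookup(4) is None
```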


Index update

We need to update an index when a record is inserted or deleted. We also need to update the index when a record’s search-key attribute is altered.


In every case, we first need to look up the search key.

Dense index insertion

If the search key is not in the index, the DB inserts an index entry at the appropriate place. Otherwise:


  1. If the index entry stores pointers to all records that have the same search key value, append this record's pointer to the index entry.
  2. Else, if the index entry stores only a pointer to the first record with that search key value, the DB places the record being inserted after the other records that have the same search key value.
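The first variant can be sketched in a few lines, assuming (purely for illustration) that the dense index is a sorted Python list of (key, pointer-list) pairs:

```python
import bisect

def dense_insert(index, key, rec_ptr):
    """Insert into a dense index that stores, per search key,
    the pointers of ALL records with that key (variant 1)."""
    keys = [k for k, _ in index]
    i = bisect.bisect_left(keys, key)
    if i < len(index) and index[i][0] == key:
        index[i][1].append(rec_ptr)        # key already present: append pointer
    else:
        index.insert(i, (key, [rec_ptr]))  # new key: new index entry

idx = []
dense_insert(idx, 10, "r1")
dense_insert(idx, 5, "r2")
dense_insert(idx, 10, "r3")
# idx == [(5, ["r2"]), (10, ["r1", "r3"])]
```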

Sparse index insertion

Assume the sparse index stores one entry per block.

  1. If our record is stored in a newly created block, we add an index entry with its search key value pointing to that block.
  2. Else if the new record has the smallest search key value in its block, making it the first record of its block, we update that block's index entry to point to our record.
  3. Else we do not change the index.
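A minimal sketch of the three cases, assuming the sparse index is a sorted list of (first-key, block-id) pairs and that this hypothetical helper is called after the record has already been placed in its block:

```python
def sparse_insert_update(index, block_id, block_keys, new_key, is_new_block):
    """Update a one-entry-per-block sparse index after inserting new_key
    into block block_id (block_keys = the block's keys after insertion)."""
    if is_new_block:
        index.append((new_key, block_id))  # case 1: a new block gets an entry
        index.sort()
    elif new_key == min(block_keys):       # case 2: new smallest key of the block
        index[:] = [(new_key if b == block_id else k, b) for k, b in index]
        index.sort()
    # case 3: otherwise the index is unchanged

idx = [(5, 0)]
sparse_insert_update(idx, 0, [3, 5, 9], 3, is_new_block=False)
# idx == [(3, 0)]
sparse_insert_update(idx, 1, [20], 20, is_new_block=True)
# idx == [(3, 0), (20, 1)]
```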

Delete for dense index

First, look up the search key.

If the deleted record is the only record with that search key, remove both the record and its index entry.

If the dense index stores all the pointers to records with the same search key, remove the pointer to the to-be-deleted record from the entry.

Else if the dense index stores only the pointer to the first record with that search key value, and the to-be-deleted record is that first record, update the pointer to point to the next record with the same search key value.
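Deletion for the variant that stores all pointers per key can be sketched similarly, again assuming a sorted list of (key, pointer-list) pairs as the index layout:

```python
import bisect

def dense_delete(index, key, rec_ptr):
    """Delete rec_ptr from a dense index that stores all
    record pointers per search key."""
    keys = [k for k, _ in index]
    i = bisect.bisect_left(keys, key)
    if i == len(index) or index[i][0] != key:
        return                    # key not indexed: nothing to do
    index[i][1].remove(rec_ptr)   # drop this record's pointer
    if not index[i][1]:           # last record with this key:
        del index[i]              # remove the whole index entry

idx = [(5, ["r2"]), (10, ["r1", "r3"])]
dense_delete(idx, 10, "r1")
# idx == [(5, ["r2"]), (10, ["r3"])]
dense_delete(idx, 5, "r2")
# idx == [(10, ["r3"])]
```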

Delete for sparse

If the index doesn't contain the search key value of the to-be-deleted record, do nothing. Otherwise:


  1. If the deleted record is the only record with that search key: if the next record in search-key order doesn't already have an index entry pointing at it, update the index entry to point to that next record; if it already has one, delete the index entry instead.
  2. Else if multiple records have the same search key value, update the index entry to point to the next record that has the same search key value.

B tree and B+ tree

The B tree was invented to store data on disk. It has a high branching factor to lower the depth of the tree, so fewer disk reads are needed. In each node we have data pointers (pointers to the records we want on disk) and keys.

Definition of B tree

A B tree of order m is a search tree in which each nonleaf node has up to m children. The actual elements are stored in the leaves of the tree; the internal nodes only contain keys and child pointers. An internal node with m child pointers holds m − 1 keys used as separators: all elements smaller than the 1st key are stored in the subtree of the first pointer, all elements larger than the 1st key and smaller than the 2nd key in the subtree of the second pointer, and so on.

B tree invariant

Every path from the root to leaf has the same length

The tree is balanced

If a node has n children, it contains n – 1 keys

We will see why this is the case in the next invariant

The elements stored in a given subtree all have keys that are between keys in the parent node on either side of the subtree pointer.

And it is left inclusive.

(figure from Cornell slides)

We can see that the second leaf child [11, 12] is between the keys in the parent node 11 and 15.

Every node (except the root) is at least half full

The root has at least two children if it is not a leaf

Look up in B tree

It is like binary search. We start from the root node and use binary or linear search within each node to decide where to traverse down: if the value we are looking for is not in the current node, the separator keys tell us which child to follow. In a B tree, data and pointers are stored in internal nodes as well as in leaves. For example, in the picture above, if I want to find 16, I start from the root. The root key is 20, so 16 must be in the left child of the root. In the next level, 16 is to the right of 15, so we follow the pointer stored to the right of 15. Then we are at a leaf and can do a sequential scan to find 16.
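The traversal can be sketched as follows; the dict-based node layout here is an assumption for illustration, not an on-disk format:

```python
def btree_search(node, key):
    """Search a B-tree node of the form
    {"keys": [...], "children": [...] or None (leaf)}.
    Returns True if key is present."""
    i = 0
    while i < len(node["keys"]) and key > node["keys"][i]:
        i += 1                                   # linear scan within the node
    if i < len(node["keys"]) and node["keys"][i] == key:
        return True                              # found in this node
    if node["children"] is None:
        return False                             # leaf reached without a match
    return btree_search(node["children"][i], key)  # descend into the i-th subtree

leaf1 = {"keys": [11, 12], "children": None}
leaf2 = {"keys": [16, 18], "children": None}
root = {"keys": [15], "children": [leaf1, leaf2]}
# btree_search(root, 16) -> True; btree_search(root, 13) -> False
```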

Insertion in B tree

We first find the leaf node where the inserted element belongs. If there is space in that node, we just insert. If the node is full, we must split it into two leaves; one of them receives the new value (the pointer to the record).

Then we update the parent to contain a new key and a child pointer. If the parent is also full, we do the same thing, possibly all the way up to the root. If the root is split into two, we create a new root whose two children are the halves of the old root, increasing the height by one.
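The split step can be illustrated in isolation; `split_leaf` is a hypothetical helper that returns the two halves plus the separator key pushed into the parent:

```python
def split_leaf(keys, new_key, order):
    """Split a full node of capacity order-1 after adding new_key.
    Returns (left_keys, separator, right_keys); the separator
    is the middle key, which gets promoted into the parent."""
    keys = sorted(keys + [new_key])
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]

left, sep, right = split_leaf([10, 20, 30, 40], 25, order=5)
# left == [10, 20], sep == 25, right == [30, 40]
```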

Deletion in B tree

Deletion works in the opposite direction. We remove the element from the leaf. If the leaf becomes empty after the removal, we need to remove the key for that node from its parent. We don't want to break the invariant that every node is at least half full, so we may have to re-partition the parent node's values and keys with a sibling; otherwise we need to combine nodes, removing a key another level up. This can propagate all the way up to the root.

If the root has two children and they are combined, the combined node becomes the new root, reducing the height by one.

B + tree

Data stores in leaves, search key in internal nodes

The B+ tree is a modification of the B tree. A B+ tree stores only search keys in internal nodes and stores data only in leaf nodes. This simplifies insertion and deletion.

Leaf node storage

A B+ tree's leaf nodes are chained into an ordered linked list, while a B tree's are not. This makes it possible to traverse all data in a B+ tree in sorted order.

Since leaf nodes and internal nodes store different things, they can have two different orders: internal nodes have order ‘a’ (no internal node has more than a children), and leaf nodes have order ‘b’.

structure of leaf node

It contains n pointers and n − 1 search-key values, sorted K1 < K2 < … < Kn−1: <P1, K1, P2, K2, …, Pn−1, Kn−1, Pn>. If the B+ tree index is dense, every search-key value must appear in some leaf node. Each leaf must also hold at least ceil((n − 1) / 2) values, just like in a B tree. Pn points to the next leaf node; since all the Ki are ordered, chaining leaf nodes allows fast iteration. Each of P1 through Pn−1 points to data on disk.
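The leaf layout can be sketched as a small data class; the names are illustrative, not from any real implementation:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class BPlusLeaf:
    """Leaf with up to n-1 sorted keys; pointers[i] is the record
    pointer for keys[i], and next_leaf plays the role of Pn."""
    keys: list = field(default_factory=list)      # K1 < K2 < ... < Kn-1
    pointers: list = field(default_factory=list)  # P1..Pn-1: record pointers
    next_leaf: Optional["BPlusLeaf"] = None       # Pn: chains leaves in key order

l2 = BPlusLeaf(keys=[15, 16], pointers=["r3", "r4"])
l1 = BPlusLeaf(keys=[11, 12], pointers=["r1", "r2"], next_leaf=l2)
# following next_leaf visits all keys in sorted order: 11, 12, 15, 16
```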

Structure of internal node

It is very similar to a leaf node, except that all pointers point to tree nodes: Pi points to the subtree containing search-key values less than Ki and greater than or equal to Ki−1.

Search in B+ tree

We want to find the pointer for a search key X. Intuitively, we traverse down the tree to the leaf that would contain X if it exists, then scan that leaf.

<P1, K1, P2, K2, …, Pn−1, Kn−1, Pn>

We repeat the process below until we reach a leaf: examine the current node and find the smallest i such that Ki is greater than or equal to X. If Ki = X, we move to the node pointed to by Pi+1.

Else if Ki > X, we move to the node pointed to by Pi.

If no Ki ≥ X exists, we move to Pn, the last pointer in the node (this means X > Kn−1).

In a nutshell

  1. If Ki−1 < X < Ki, next_node = Pi
  2. If X == Ki, next_node = Pi+1
  3. If X > all Ki, next_node = Pn

    def find(V):
        cur_node = root
        while cur_node is not a leaf node:
            i = smallest i such that V <= cur_node.K_i
            if no such i: cur_node = last non-null pointer in cur_node
            elif V == cur_node.K_i: cur_node = cur_node.P_(i+1)
            else: cur_node = cur_node.P_i
        # cur_node is now a leaf node
        i = smallest i such that V == cur_node.K_i
        return (cur_node, i) if such an i exists else null

At the leaf, if there is a search key == X, go fetch the record from disk. Else return null.

But if multiple records have the same search key, we may need to keep scanning leaf nodes until we hit the first record with a search key larger than V.

done = False
(L, i) = find(V)
if (L, i) == null: return
while not done and L != null:
  while i <= number of keys in L and L.K_i == V:
    print record pointed by L.P_i
    i = i + 1
  if i > number of keys in L:  # more records with the same key may be in the next node
    L = L.P_n
    i = 1
  else:
    done = True

We can also do range queries: find(V, U) works the same way, except we keep scanning leaf nodes while L.K_i <= U.
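A sketch of such a range scan over the chained leaves; the dict layout for leaves is assumed for illustration, and `leaf` stands for the leaf that find(lo) lands on:

```python
def range_scan(leaf, lo, hi):
    """Walk the Pn chain from `leaf`, collecting record pointers
    for keys in [lo, hi]. Leaves are dicts
    {"keys": [...], "ptrs": [...], "next": ...}."""
    out = []
    while leaf is not None:
        for k, p in zip(leaf["keys"], leaf["ptrs"]):
            if k > hi:
                return out        # past the upper bound: stop
            if k >= lo:
                out.append(p)     # key inside [lo, hi]
        leaf = leaf["next"]       # follow Pn to the next leaf
    return out

l2 = {"keys": [15, 16], "ptrs": ["r3", "r4"], "next": None}
l1 = {"keys": [11, 12], "ptrs": ["r1", "r2"], "next": l2}
# range_scan(l1, 12, 15) -> ["r2", "r3"]
```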

The path from root to leaf is no longer than ceil(log_{ceil(n/2)}(N)), where N is the number of search-key values in the file.

Search cost in real life

Typically a node is the size of a disk block, usually 4 KB. If each search key is 12 bytes and each disk pointer is 32 bytes, we can store about 100 key-pointer pairs per node. With 1 million search-key values in a file, a lookup takes at most ceil(log_50(1,000,000)) = 4 node reads, using the minimum fanout of n/2 = 50.
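The same arithmetic, checked directly; the sizes are the ones assumed in the text:

```python
import math

BLOCK = 4096       # node = one 4 KB disk block
KEY, PTR = 12, 32  # bytes per search key and per disk pointer

fanout = BLOCK // (KEY + PTR)  # 93, rounded to ~100 key-pointer pairs in the text
height = math.ceil(math.log(1_000_000, 100 // 2))  # minimum fanout n/2 = 50
# fanout == 93, height == 4
```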


You can find the B tree reading here