B tree and B+ tree

B tree is invented to store data on disk. It has a high branching factor to lower than depth of the tree and make sure fewer disk read are needed. In each node, we have data pointer, the pointer to the thing we want on disk, and strong text

Definition of B tree

A B tree of order m is a search tree in which each nonleaf node has up to m children. The actual element are stored in the leaves of the tree. The internal nodes only contain keys(pointer). A internal leaf will usually have m children pointer, plus the m – 1 elements used as a separation. All children smaller than the 1st element is stored in the first pointer, all elements larger than 1st element and smaller than 2nd element is stored in the second pointer and so on

B tree invariant

Every path from the root to leaf has the same length

The tree is balanced

If a node has n children, it contains n – 1 keys

We will see why this is the case in the next invariant

The elements stored in a given subtree all have keys that are between keys in the parent node on either side of the subtree pointer.

And it is left inclusive.

Cornell slides

We can see that the second leaf child [11, 12] is between the keys in the parent node 11 and 15.

Every node (except the root) is at least half full

The root has at least two children if it is not a leaf

Look up in B tree

It is like a binary search. We start from the root node and do binary search and linear search to traverse down. If the value we are looking for in not in the current node, we know where to traverse. In B tree, data and pointers are stored in internal nodes as well as leaves For example, in the picture above, if I want to find 16, I start from the root. The root is 20 so 16 must be in the left child of root. In the next level, 16 is at the right of 15, so we go to the pointer stored at right of 15. Then we are at leaf value and we can do sequential scan to find 15.

Insertion in B tree

We first find the appropriate node that the inserted elements will fall. If there are space in that node, we can just insert. If the current node is full, we then must split the current node into two leaves. One of the node will have the new value ( the pointer to the file).

Then we update the parent to contain a new key and a child pointer. If the parent is full, the we do the same thing. We might reach to the root. If root is split into two, then we just create a new root and increase the height by one. The two children of the root will be the original split root.

Deletion in B tree

It works in the opposite way. We removed the element from the leaf. If after the removal, the leav brcomes empty, we need to remove the key to that node from the parent leaf. We don’t want break the invariant that every node is at least half full, so we might have to re-partitioned the parent node’s values and keys. Else we need to combine the parent nodes, removing a key another level up. This might go up till root.

If the root has two children and they are combined, the combined node will become the new root, reducing the level by one.

B + tree

Data stores in leaves, search key in internal nodes

B + tree is a modification of B tree. In B +tree, it only stores search key in the internal node and stores data only in leaf nodes. This helps deletion and insertions.

Leaf node storage

B+ tree’s leaf nodes are ordered as a sequential linked list. But B tree is not. This make B+ tree able to traverse all data.

Since the leaf node and the internal nodes store different things, they will have two orders. The internal nodes will have have order ‘a’. (all internal nodes will not have more than a. The leaf node will have order ‘b’.

structure of leaf node

It contains n pointers and n – 1 search key value. The search key values are sorted: K1 < K2 … < Kn – 1. <P1, K1, P2, K2…Pn-1, Kn-1 P`n> If the B+ tree index is dense, then every search-key value must appear in some leaf node. We also need to maintain ceil( (n – 1) / 2). Just like B tree. We use Pn to point to the next leaf node. Since all the Ki are ordered, chaining leaf nodes allow fast iteration. Each pointer points to data on the disk.

Structure of internal node

It is very similar to leaf node, except all pointers are pointer to tree nodes. pi will point to tree node that contain search key value less than Ki and greater than ki – 1.

Search in B+ tree

We want to find the pointer of a search key X. Intuitively, we traverse down the tree to a leaf that would contain X if it exists. Then we scan all nodes.

<P1, K1, P2, K2…Pn-1, Kn-1 Pn>

We repeat the process below till we traverse down to a leaf: Examine the current node and find the smallest i such that Ki is greater or equal to X. If Ki = X, then we point the current node to the node pointed by Pi + 1.

Else if Ki > X, we point the current node to the node pointed by Pi.

If no Ki > X, then we point the current node to Pn, the last internal node.(this meansX > kn – 1)

In a nutshell

  1. If Ki – 1 < X < Ki, next_node = Pi
  2. If X == Ki, next_node = Pi+1
  3. If X > all K, next_node = Pn

    def find(value V):

    cur_node = root while(cur_node != leaf node) i = smallest num such that V <= cur_node.K_i if no such an i: set P_m be the last non null pointer in the node cur_node = P_m elif(V = cur_node.K_i) cur_node = cur_node.P_i + 1 else cur_node = cur_node.p_i end while

    Now cur_node is a leaf node

    i = smallest value such that V = K_i if (i exists) return (C, i) else return null

At leaf value, if there is a search key == X, then go fetch the record from disk. Else return null.

But if there are multiple records that have the same search key, we might need to traverse in the leaf node until we hit the next record with search key larger than V.

done = False
set(L, i ) = find(V)
if (L, i) == Null return
while (not done or L not null)
  while (i > number of keys in L or L.Ki > V) 
    print record pointed by L.Pi
    i = i + 1
  if(i > number of keys in L) #possible that more records with same key in next node
    then L = L.Pn
    else done = True

We can also do range queries, find(V, U), we only need to traverse leaf nodes till while(L.Ki < U)

The path is no longer than floor(log_{n/2}&{N}.

Search cost in real life

Typically a node is of the same size of a disk block, which is usually 4 kb. If each search key is 12 bytes and a disk-ptr is 32 bytes, we can store 100 key ptr pairs. If we have 1 million search value in a file, look up takes: log_50^1,000,000 = 4 nodes.

Insert

You can find the B tree reading here

Leave a Reply

Your email address will not be published. Required fields are marked *

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax