Data Structures with .NET - Part 4: Building a Better Binary Search Tree

** An Extensive Examination of Data Structures **

** Part 4: Building a Better Binary Search Tree **

Scott Mitchell
4GuysFromRolla.com

February 9, 2004

** Summary: ** This article, the fourth in the series, begins with a quick examination of AVL trees and red-black trees, which are two different self-balancing, binary search tree data structures. The remainder of the article examines the skip list data structure, an ingenious data structure that turns a linked list into a data structure that offers the same running time as the more complex self-balancing tree data structures. (31 printed pages)

** Note??? ** This article assumes the reader is familiar with C# and the data structure topics discussed previously in this article series.

Download the BuildingBetterBST.msi sample file .

** Contents **

Introduction
Self-Balancing Binary Search Trees
A Quick Primer on Linked Lists
Skip Lists: A Linked List with Self-Balancing BST-Like Properties
Conclusion
Books

** Introduction **

In Part 3 of this article series, we looked at the general tree data structure. A tree is a data structure that consists of nodes, where each node has some value and an arbitrary number of children nodes. Trees are common data structures because many real-world problems exhibit tree-like behavior. For example, any sort of hierarchical relationship among people, things, or objects can be modeled as a tree.

A binary tree is a special kind of tree, which limits each node to no more than two children. A binary search tree , or BST, is a binary tree whose nodes are arranged such that for every node n , all of the nodes in n 's left subtree have a value less than n , and all nodes in n 's right subtree have a value greater than n . As we discussed, in the average case BSTs offer log 2 n asymptotic time for inserts, deletes, and searches. (log 2 n is often referred to as sublinear because it outperforms linear asymptotic times.)

The disadvantage of BSTs is that in the worst-case their asymptotic running time is reduced to linear time. This happens if the items inserted into the BST are inserted in order or in near-order. In such a case, a BST performs no better than an array. As we discussed at the end of Part 3, there exist self-balancing binary search trees that ensure that regardless of the order of the data inserted, the tree maintains a log 2 n running time. In this article, we'll briefly discuss two self-balancing binary search trees—AVL trees and red-black trees. Following that, we'll take an in-depth look at skip lists. Skip lists are a really cool data structure that is much easier to implement than AVL trees or red-black trees, yet still guarantees a running time of log 2 n .

** Self-Balancing Binary Search Trees **

Recall that new nodes are inserted into a binary search tree at the leaves. That is, adding a node to a binary search tree involves tracing down a path of the binary search tree, taking lefts and rights based on the comparison of the value of the current node, and the node being inserted, until the path reaches a dead end. At this point, the newly inserted node is plugged into the tree at this reached dead end. Figure 1 illustrates the process of inserting a new node into a BST.

** Figure 1. Inserting a new node into a BST **

As Figure 1 shows, when making the comparison at the current node, the node to be inserted travels down the left path if its value is less than the current node, and down the right if its value is greater than the current node's value. Therefore, the structure of the BST is relative to the order with which the nodes are inserted. Figure 2 depicts a BST after nodes with values of 20, 50, 90, 150, 175, and 200 have been added. Specifically, these nodes have been added in ascending order. The result is a BST with no breadth. That is, its topology consists of a single line of nodes rather than having the nodes fanned out.

** Figure 2. A BST after nodes with values of 20, 50, 90, 150, 175, and 200 have been added **

BSTs—which offer sublinear running time for insertions, deletions, and searches—perform optimally when their nodes are arranged in a fanned-out manner. This is because when searching for a node in a BST, each single step down the tree reduces the number of nodes that need to be potentially checked by one half. However, when a BST has a topology similar to the one in Figure 2, the running time for the BST's operations are much closer to linear time. To see why, consider what must happen when searching for a particular value, such as 175. Starting at the root, 20, we must navigate down through each right child until we hit 175. That is, there is no savings in nodes that need to be checked at each step. Searching a BST like the one in Figure 2 is identical to searching an array. Each element must be checked one at a time. Therefore, such a structured BST exhibits a linear search time.

What is important to realize is that the running time of a BST's operations is related to the BST's height . The height of a tree is defined as the length of the longest path starting at the root. The height of a tree can be defined recursively as follows:

· ???????????????????? The height of a node with no children is 0

· ???????????????????? The height of a node with one child is the height of that child plus one

· ???????????????????? The height of a node with two children is one plus the greater height of the two children

To compute the height of a tree, start at its leaf nodes and assign them a height of 0. Then move up the tree using the three rules outlined to compute the height of each leaf nodes' parent. Continue in this manner until every node of the tree has been labeled. The height of the tree, then, is the height of the root node. Figure 3 shows a number of binary trees with their height computed at each node. For practice, take a second to compute the heights of the trees yourself to make sure your numbers match up with the numbers presented in the figure below.

** Figure 3. Example binary trees with their height computed at each node **

A BST exhibits log 2 n running times when its height, when defined in terms of the number of nodes, n , in the tree, is near the floor of log 2 n . (The floor of a number x is the greatest integer less than x . So the floor of 5.38 would be 5; the floor of 3.14159 would be 3. For positive numbers x , the floor of x can be found by simply truncating the decimal part of x , if any.) Of the three trees in Figure 3, tree (b) has the best height to number of nodes ratio, as the height is 3 and the number of nodes present in the tree is 8. As we discussed in Part 1 of this article series, log _ a _ b = y is another way of writing a y = b . log 2 8, then, equals 3 since 2 3 = 8. Tree (a) has 10 nodes and a height of 4. log 2 10 equals 3.3219 and change, the floor of that being 3. So, 4 is not the ideal height. Notice that by rearranging the topology of tree (a)—by moving the far-bottom right node to the child of one of the non-leaf nodes with only one child—we could reduce the tree's height by one, thereby giving the tree an optimal height to node ratio. Finally, tree (c) has the worst height to node ratio. With its 5 nodes it could have an optimal height of 2, but due to its linear topology is has a height of 4.

The challenge we are faced with, then, is ensuring that the topology of the resulting BST exhibits an optimal ratio of height to the number of nodes. Since the topology of a BST is based upon the order with which the nodes are inserted, intuitively you might opt to solve this problem by ensuring that the data that's added to a BST is not added in near-sorted order. While this is possible if you know the data that will be added to the BST beforehand, it might not be practical. If you are not aware of the data that will be added—like if it's added based on user input, or added as it's read from a sensor—then there is no hope of guaranteeing the data is not inserted in near-sorted order. The solution, then, is not to try to dictate the order with which the data is inserted, but to ensure that after each insertion the BST remains balanced . Data structures that are designed to maintain balance are referred to as self-balancing binary search trees .

A balanced tree is a tree that maintains some predefined ratio between its height and breadth. Different data structures define their own ratios for balance, but all have it close to log 2 n . A self-balancing BST, then, exhibits log 2 n running time. There are numerous self-balancing BST data structures in existence, such as AVL trees, red-black trees, 2-3 trees, 2-3-4 trees, splay trees, B-trees, and others. In the next two sections, we'll take a brief look at two of these self-balancing trees—AVL trees and red-black trees.

** Examining AVL Trees **

In 1962 Russian mathematicians G. M. A ndel'son- V el-skii and E. M. L andis invented the first self-balancing BST, called an AVL tree. AVL trees must maintain the following balance property: for every node n , the height of n 's left and right subtrees can differ by at most 1. The height of a node's left or right subtree is the height computed for its left or right node using the technique discussed in the previous section. If a node has only one child, say a left child, but no right child, then the height of the right subtree is -1.

Figure 4 shows, conceptually, the height-relationship each node in an AVL tree must maintain. Figure 5 provides three examples of BSTs. The numbers in the nodes represent the nodes' values; the numbers to the right and left of each node represent the height of the nodes' left and right subtrees. In Figure 5, trees (a) and (b) are valid AVL trees, but trees (c) and (d) are not because not all nodes adhere to the AVL balance property.

** Figure 4. The height of left and right subtrees in an AVL tree cannot differ by more than one. **

** Figure 5. Example trees, where (a) and (b) are valid AVL trees, but (c) and d are not. **

** Note??? ** Realize that AVL trees are binary search trees, so in addition to maintaining a balance property, an AVL tree must also maintain the binary search tree property.

When creating an AVL tree data structure, the challenge is to ensure that the AVL balance remains regardless of the operations performed on the tree. That is, as nodes are added or deleted, it is vital that the balance property remains. AVL trees maintain the balance through rotations . A rotation slightly reshapes the tree's topology such that the AVL balance property is restored and, equally important, the binary search tree property is maintained as well.

Inserting a new node into an AVL tree is a two-stage process. First, the node is inserted into the tree using the same algorithm for adding a new node to a BST. That is, the new node is added as a leaf node in the appropriate location to maintain the BST property. After adding a new node, it might be the case that adding this new node caused the AVL balance property to be violated at some node along the path traveled down from the root to where the newly inserted node was added. To fix any violations, stage two involves traversing back up the access path, checking the height of the left and right subtree for each node along this return path. If the heights of the subtrees differ by more than 1, a rotation is performed to fix the anomaly.

Figure 6 illustrates the steps for a rotation on node 3. Notice that after stage 1 of the insertion routine, the AVL tree property was violated at node 5 because node 5's left subtree's height was two greater than its right subtree's height. To remedy this violation, a rotation was performed on node 3, the root of node 5's left subtree. This rotation fixed the balance inconsistency and also maintained the BST property.

** Figure 6. AVL trees stay balanced through rotations **

In addition to the simple, single rotation shown in Figure 6, there are more involved rotations that are sometimes required. A thorough discussion of the set of rotations potentially needed by an AVL tree is beyond the scope of this article. What is important to realize is that both insertions and deletions can disturb the balance property or which AVL trees must adhere. To fix any perturbations, rotations are used.

** Note??? ** To familiarize yourself with insertions, deletions, and rotations from an AVL tree, check out the AVL tree applet at http://www.seanet.com/users/arsen/avltree.html . This Java applet illustrates how the topology of an AVL tree changes with additions and deletions.

By ensuring that all nodes' subtrees' heights differ by at most 1, AVL trees guarantee that insertions, deletions, and searches will always have an asymptotic running time of log 2 n, regardless of the order of insertions into the tree.

** A Look at Red-Black Trees **

Rudolf Bayer, a computer science professor at the Technical University of Munich, invented the red-black tree data structure in 1972. In addition to its data and left and right children, the nodes of a red-black tree contain an extra bit of information—a color, which can be either red or black. Red-black trees are complicated further by the concept of a specialized class of node referred to as NIL nodes. NIL nodes are pseudo-nodes that exist as the leaves of the red-black tree. That is, all regular nodes—those with some data associated with them—are internal nodes. Rather than having a NULL pointer for a childless regular node, the node is assumed to have a NIL node in place of that NULL value. This concept can be understandably confusing, and hopefully the diagram in Figure 7 clears up any confusion.

** Figure 7. Red-black trees add the concept of a NIL node. **

Red-black trees are trees that have the following four properties:

1. ????????????????? Every node is colored either red or black.

2. ????????????????? Every NIL node is black.

3. ????????????????? If a node is red, then both of its children are black.

4. ????????????????? Every path from a node to a descendant leaf contains the same number of black nodes.

The first three properties are self-explanatory. The fourth property, which is the most important of the four, simply states that starting from any node in the tree, the number of black nodes from that node to any leaf (NIL), must be the same. In Figure 7 take the root node as an example. Starting from 41 and going to any NIL, you encounter the same number of black nodes—3. For example, taking a path from 41 to the left-most NIL node, we start on 41, a black node. We then travel down to node 9, then node 2, which is also black, then node 1, and finally the left-most NIL node. In this journey we encountered three black nodes—41, 2, and the final NIL node. In fact, if we travel from 41 to any NIL node, we'll always encounter precisely three black nodes.

Like the AVL tree, red-black trees are another form of self-balancing binary search tree. Whereas the balance property of an AVL tree was explicitly stated as a relationship between the heights of each node's left and right subtrees, red-black trees guarantee their balance in a more conspicuous manner. It can be shown that a tree that implements the four red-black tree properties has a height that is always less than 2 * log 2 ( n +1), where n is the total number of nodes in the tree. For this reason, red-black trees ensure that all operations can be performed within an asymptotic running time of log 2 n .

Like AVL trees, any time a red-black tree has nodes inserted or deleted, it is important to verify that the red-black tree properties have not been violated. With AVL trees, the balance property was restored through rotations. With red-black trees, the red-black tree properties are restored through re-coloring and rotations. Red-black trees are notoriously complex in their re-coloring and rotation rules, requiring the nodes along the access path to make decisions based upon their color in contrast to the color of their parents and uncles. (An uncle of a node n is the node that is n 's parent's sibling node.) A thorough discussion of re-coloring and rotation rules is far beyond the scope of this article.

To view the re-coloring and rotations of a red-black tree as nodes are added and deleted, check out the red-black tree applet, which can also be accessed viewed at http://www.seanet.com/users/arsen/avltree.html .

** A Quick Primer on Linked Lists **

One common data structure we've yet to discuss is the linked list . Since the skip list data structure we'll be examining next is the mutation of a linked list into a data structure with self-balanced binary tree running times, it is important that before diving into the specifics of skip lists we take a moment to discuss linked lists.

Recall that with a binary tree, each node in the tree contains some bit of data and a reference to its left and right children. A linked list can be thought of as a unary tree. That is, each element in a linked list has some data associated with it, and a single reference to its neighbor. As Figure 8 illustrates, each element in a linked list forms a link in the chain. Each link is tied to its neighboring node, which is the node on its right.

** Figure 8. A four-element linked list **

When we created a binary tree data structure in Part 3 , the binary tree data structure only needed to contain a reference to the root of the tree. The root itself contained references to its children, and those children contained references to their children, and so on. Similarly, with the linked list data structure, when implementing a structure we only need to keep a reference to the head of the list since each element in the list maintains a reference to the next item in the list.

Linked lists have the same linear running time for searches as arrays. That is, to determine if the element Sam is in the linked list in Figure 8, we have to start at the head and check each element one by one. There are no shortcuts as with binary trees or hashtables. Similarly, deleting from a linked list takes linear time because the linked list must first be searched for the item to be deleted. Once the item is found, removing it from the linked list involves reassigning the deleted item's left neighbor's neighbor reference to the deleted item's neighbor. Figure 9 illustrates the pointer reassignment that must occur when deleting an item from a linked list.

** Figure 9. Deleting an element from a linked list **

The asymptotic time required to insert a new element into a linked list depends on whether or not the linked list is a sorted list. If the list's elements need not be sorted, insertion can occur in constant time because we can add the element to the front of the list. This involves creating a new element, having its neighbor reference point to the current linked list head, and, finally, reassigning the linked list's head to the newly inserted element.

If the linked list elements need to be maintained in sorted order, then when adding a new element the first step is to locate where it belongs in the list. This is accomplished by exhaustively iterating from the beginning of the list to the element until the spot where the new element belongs. Let e be the element immediately before the location where the new element will be added. To insert the new element, e 's processor reference must now point to the newly inserted element, and the new element's neighbor reference needs to be assigned to e 's old neighbor. Figure 10 illustrates this concept graphically.

** Figure 10. Inserting elements into a sorted linked list **

Notice that linked lists do not provide direct access, like an array. That is, if you want to access the i th element of a linked list, you have to start at the front of the list and walk through i links. With an array, though, you can jump straight to the i th element. Given this, along with the fact that linked lists do not offer better search running times than arrays, you might wonder why anyone would want to use a linked list.

The primary benefit of linked lists is that adding or removing items does not involve messy and time-consuming re-dimensioning. Recall that array's have fixed size. If an array needs to have more elements added to it than it has capacity, the array must be re-dimensioned. Granted, the ArrayList class hides the code complexity of this, but re-dimensioning still carries with it a performance penalty. In short, an array is usually a better choice if you have an idea on the upper bound of the amount of data that needs to be stored. If you have no conceivable notion as to how many elements need to be stored, then a link list might be a better choice.

In summary, linked lists are fairly simple to implement. The main challenge comes with the threading or rethreading of the neighbor links with insertions or deletions, but the complexity of adding or removing an element from a linked list pales in comparison to the complexity of balancing an AVL or red-black tree.

** Skip Lists: A Linked List with Self-Balancing BST-Like Properties **

Back in 1989 William Pugh, a computer science professor at the University of Maryland, was looking at sorted linked lists one day thinking about their running time. Clearly a sorted linked list takes linear time to search because each element may potentially be visited, one right after the other. Pugh thought to himself that if half the elements in a sorted linked list had two neighbor references—one pointing to its immediate neighbor, and another pointing to the neighbor two elements ahead—while the other half had one, then searching a sorted linked list could be done in half the time. Figure 11 illustrates a two-reference sorted linked list.

** Figure 11. A skip list **

The way such a linked list saves search time is due in part to the fact that the elements are sorted, as well as the varying heights. To search for, say, Dave, we'd start at the head element , which is a dummy element whose height is the same height as the maximum element height in the list. The head element does not contain any data. It merely serves as a place to start searching.

We start at the highest link because it lets us skip over lower elements. We begin by following the head element's top link to Bob. At this point we can ask ourselves, does Bob come before or after Dave? If it comes before Dave, then we know Dave, if he's in the list, must exist somewhere to the right of Bob. If Bob comes before Dave, then Bob must exist somewhere between where we're currently positioned and Bob. In this case, Dave comes after Bob alphabetically, so we can repeat our search again from the Bob element. Notice that by moving onto Bob, we are skipping over Alice. At Bob, we repeat the search at the same level. Following the top-most pointer we reach Dave. Since we have found what we are looking for, we can stop searching.

Now, imagine that we wanted to search for Cal. We'd begin by starting at the head element, and then moving onto Bob. At Bob, we'd start by following the top-most reference to Dave. Since Dave comes after Cal, we know that Cal must exist somewhere between Bob and Dave. Therefore, we move down to the next lower reference level and continue our comparison.

The efficiency of such a linked list arises because we are able to move two elements over every time instead of just one. This makes the running time on the order of n /2, which, while better than a regular sorted link list, is still an asymptotically linear running time. Realizing this, Pugh wondered what would happen if rather than limiting the height of an element to 2, it was instead allowed to go up to log 2 n for n elements. That is, if there were 8 elements in the linked list, there would be elements with height up to 3. If there were 16 elements, there would be elements with height up to 4. As Figure 12 shows, by intelligently choosing the heights of each of the elements, the search time is reduced to log 2 n .

** Figure 12. By increasing the height of each node in a skip list, better searching performance can be gained. **

Notice that in the nodes in Figure 12, every 2 _ i _ th node has references 2 i elements ahead (that is, the 2 0 element). Alice has a reference to 2 0 elements ahead—Bob. The 2 1 element, Bob, has a reference to a node 2 1 elements ahead—Dave. Dave, the 2 2 element, has a reference 2 2 elements ahead—Frank. Had there been more elements, Frank—the 2 3 element—would have a reference to the element 2 3 elements ahead.

The disadvantage of the approach illustrated in Figure 12 is that adding new elements or removing existing ones can wreak havoc on the precise structure. That is, if Dave is deleted, now Ed becomes the 2 2 element, and Gil the 2 3 element, and so on. This means all of the elements to the right of the deleted element need to have their height and references readjusted. The same problem crops up with inserts. This redist