Abstract

An essential text on the analysis, development, and implementation of fundamental data structures and algorithms.

Lessons

1.3 Bags, Queues, and Stacks (120)

Understanding basic ADTs is essential since they are the building blocks upon which more complex ADTs and efficient algorithms are built.

Collections and Generics

A collection is a set of values, typically represented as objects.

Generics allow a data structure to be reused across many different concrete types. Java doesn't support generic array creation.

Autoboxing is Java's ability to convert automatically between a primitive type and its corresponding wrapper/reference type on the fly.

Linked Lists

A linked list is a recursive data structure that is either empty (null) or a reference to a node having a generic item and a reference to a linked list.
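As a sketch (class and field names are illustrative, not the book's), such a node can be written as a nested generic class, with a helper that walks the links:

```java
// Minimal singly linked list node: a list is either null (empty)
// or a first node whose next field references the rest of the list.
class LinkedListSketch {
    static class Node<Item> {
        Item item;       // the generic value stored in this node
        Node<Item> next; // reference to the rest of the list (null at the end)
    }

    // Count the nodes by following next references until null.
    static <Item> int length(Node<Item> first) {
        int n = 0;
        for (Node<Item> x = first; x != null; x = x.next) n++;
        return n;
    }

    public static void main(String[] args) {
        Node<String> second = new Node<>();
        second.item = "world";
        Node<String> first = new Node<>();
        first.item = "hello";
        first.next = second;
        System.out.println(length(first)); // 2
    }
}
```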

Linked lists enable implementations of bags, queues, and stacks with performance gains that would otherwise be unobtainable.

Loitering (holding a reference to an object that is no longer needed) is a common problem with linked lists and can be challenging to debug. Java's GC reclaims an object's memory only once all references to it are removed, e.g., by overwriting them with null.

Linked lists are useful for representing data in collection ADTs (abstract data types). Linked lists represent sequences of items. Arrays can do this, but linked lists tend to be more performant at removing or inserting into a sequence.

The standard solution to enable arbitrary insertion or deletion is to maintain a doubly linked list, a link in each direction.

Bags, Stacks and Queues

Bag, Queue, and Stack differ in that they define different algorithms for which object is to be removed or examined next. Each one can significantly affect the efficiency of algorithms.

Bag: Removing items is not supported. Clients can test whether it is empty and query the number of items it contains. Order of iteration is unspecified.

Queue: In a FIFO queue, items are processed in the order in which they were added.

Stack: LIFO/pushdown stacks process items in the reverse of the order in which they were added. A fixed-capacity stack stores its items in an array in insertion order. Push/pop operations should take time independent of the stack size.
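A minimal linked-list pushdown stack sketch (class name hypothetical) illustrating why push and pop take constant time: both touch only the head of the list.

```java
// Pushdown stack backed by a singly linked list.
class StackSketch<Item> {
    private Node first; // top of the stack
    private int n;      // number of items

    private class Node {
        Item item;
        Node next;
    }

    boolean isEmpty() { return first == null; }
    int size() { return n; }

    void push(Item item) {      // insert at the head
        Node oldFirst = first;
        first = new Node();
        first.item = item;
        first.next = oldFirst;
        n++;
    }

    Item pop() {                // remove from the head (LIFO)
        Item item = first.item; // assumes non-empty stack
        first = first.next;     // old first node becomes garbage
        n--;
        return item;
    }
}
```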

2.1 Elementary Sorts (244)

Rules

Java’s Comparable interface makes it easy to compare and sort items that have a natural order. The compareTo() method (returning a negative integer, zero, or a positive integer) must implement a total order: antisymmetric, transitive, and total.
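A minimal sketch of the convention, using a hypothetical Version type whose natural order compares major then minor numbers:

```java
// Hypothetical Version type implementing Comparable: compareTo
// returns a negative, zero, or positive integer, defining a
// total order on (major, minor).
class Version implements Comparable<Version> {
    final int major, minor;

    Version(int major, int minor) {
        this.major = major;
        this.minor = minor;
    }

    @Override
    public int compareTo(Version that) {
        if (this.major != that.major) return Integer.compare(this.major, that.major);
        return Integer.compare(this.minor, that.minor);
    }
}
```

Because Version has a natural order, java.util.Arrays.sort() can sort a Version[] with no extra arguments.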

There are four things to consider when evaluating sorting algorithms.

Selection Sort (248)

~N^2/2 compares and ~N exchanges (linear). Works by repeatedly selecting the smallest remaining item. Uses two nested loops; the inner loop compares each candidate against the smallest item found so far.
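A sketch of selection sort on an int array (class and method names illustrative):

```java
// Selection sort: on each pass, select the smallest remaining
// entry and exchange it into position i. ~N^2/2 compares, N exchanges.
class SelectionSortSketch {
    static void sort(int[] a) {
        int n = a.length;
        for (int i = 0; i < n; i++) {
            int min = i;                    // index of smallest found so far
            for (int j = i + 1; j < n; j++)
                if (a[j] < a[min]) min = j; // the compare
            int t = a[i]; a[i] = a[min]; a[min] = t; // one exchange per pass
        }
    }
}
```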

Insertion Sort (250)

~N^2/4 compares and ~N^2/4 exchanges on average; the worst case is ~N^2/2 of each, and the best case is N-1 compares and 0 exchanges. Works well for partially sorted arrays where the number of inversions is less than a constant multiple of the array size, and is much more efficient for them. It can be sped up by shortening the inner loop: shift the larger entries one position to the right rather than doing full exchanges.
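A sketch of insertion sort on an int array, with the half-exchange improvement described above:

```java
// Insertion sort with half-exchanges: instead of full exchanges,
// larger entries shift right one position, and the saved entry is
// written once into its final slot.
class InsertionSortSketch {
    static void sort(int[] a) {
        int n = a.length;
        for (int i = 1; i < n; i++) {
            int v = a[i];                   // entry being inserted
            int j = i;
            while (j > 0 && a[j - 1] > v) { // shift larger entries right
                a[j] = a[j - 1];
                j--;
            }
            a[j] = v;                       // single write, not an exchange
        }
    }
}
```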

Comparing Algorithms (254)

Compare algorithms by implementing and debugging, analyzing basic properties, formulating a hypothesis, and running experiments.

Insertion and selection sort are both roughly quadratic, though insertion sort is about twice as fast. The more trials, the more accurate the estimate, so run experiments to validate the hypothesis on the data at hand. Applications with significant numbers of equal keys require a more careful analysis.

Shellsort (258)

Shellsort is an extension of insertion sort that gains speed by exchanging array entries that are far apart, producing partially ordered arrays that insertion sort can finish efficiently. The increment sequence selected (the schedule of gaps by which entries are compared) has a significant impact on the performance of this algorithm, but no provably best sequence has been found. It performs well on arrays that are in arbitrary order.
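A sketch using the 3x+1 increment sequence (1, 4, 13, 40, ...), one common choice among the sequences discussed:

```java
// Shellsort: h-sort the array for decreasing increments h,
// finishing with h = 1 (plain insertion sort on a nearly
// ordered array).
class ShellSortSketch {
    static void sort(int[] a) {
        int n = a.length;
        int h = 1;
        while (h < n / 3) h = 3 * h + 1;  // largest increment below n
        while (h >= 1) {
            for (int i = h; i < n; i++)   // insertion sort with stride h
                for (int j = i; j >= h && a[j] < a[j - h]; j -= h) {
                    int t = a[j]; a[j] = a[j - h]; a[j - h] = t;
                }
            h /= 3;                       // step down the sequence
        }
    }
}
```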

Its performance isn’t quadratic; with common increment sequences the worst case looks something like ~N^(3/2) compares, depending on the sequence.

2.2 Mergesort (270)

Abstract In Place Merge

The abstract in-place merge uses an auxiliary array. Given an array whose first half and second half are each already sorted, it copies the entries to the auxiliary array and merges them back into the original, so it presents an in-place interface but needs extra space proportional to N.

Top-down Mergesort

Mergesort is a divide-and-conquer algorithm built on the abstract in-place merge. It works based on the proof that if it sorts the two halves of an array, then merging them sorts the whole array. It uses between 1/2 N lg N and N lg N compares and at most 6N lg N array accesses. A few improvements can be made, though when addressing a new problem you should first use the most straightforward implementation and then refine it if a bottleneck is discovered.

To improve it, use insertion sort for small subarrays (fewer than ~15 items), since insertion sort is faster in the small case. You can also skip the merge when the two halves are already in order.
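The top-down scheme plus the merge can be sketched as follows (int keys for brevity; the improvements above are omitted):

```java
// Top-down mergesort: recursively sort each half, then merge using
// an auxiliary array (the abstract in-place merge).
class MergeSortSketch {
    static void sort(int[] a) {
        int[] aux = new int[a.length];  // allocate once, reuse in merges
        sort(a, aux, 0, a.length - 1);
    }

    private static void sort(int[] a, int[] aux, int lo, int hi) {
        if (hi <= lo) return;
        int mid = lo + (hi - lo) / 2;
        sort(a, aux, lo, mid);          // sort left half
        sort(a, aux, mid + 1, hi);      // sort right half
        merge(a, aux, lo, mid, hi);
    }

    // Merge sorted runs a[lo..mid] and a[mid+1..hi] back into a[].
    private static void merge(int[] a, int[] aux, int lo, int mid, int hi) {
        for (int k = lo; k <= hi; k++) aux[k] = a[k];
        int i = lo, j = mid + 1;
        for (int k = lo; k <= hi; k++) {
            if (i > mid) a[k] = aux[j++];            // left run exhausted
            else if (j > hi) a[k] = aux[i++];        // right run exhausted
            else if (aux[j] < aux[i]) a[k] = aux[j++];
            else a[k] = aux[i++];                    // ties take left: stable
        }
    }
}
```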

Bottom-up Mergesort

1/2 N lg N to N lg N compares and at most 6N lg N array accesses to sort an array of length N. When the array length is a power of two, top-down and bottom-up perform exactly the same number of compares and array accesses. Bottom-up mergesort is the method of choice for sorting linked lists, since it rearranges the links and sorts the list in place. Whenever you consider using one of these algorithms, it is worth considering the other.
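A bottom-up sketch; the only change from top-down is replacing the recursion with passes that merge runs of doubling size:

```java
// Bottom-up mergesort: merge runs of size 1, 2, 4, ... with a pass
// over the whole array for each run size. No recursion.
class BottomUpMergeSketch {
    static void sort(int[] a) {
        int n = a.length;
        int[] aux = new int[n];
        for (int sz = 1; sz < n; sz *= 2)               // run size
            for (int lo = 0; lo < n - sz; lo += 2 * sz) // start of left run
                merge(a, aux, lo, lo + sz - 1, Math.min(lo + 2 * sz - 1, n - 1));
    }

    private static void merge(int[] a, int[] aux, int lo, int mid, int hi) {
        for (int k = lo; k <= hi; k++) aux[k] = a[k];
        int i = lo, j = mid + 1;
        for (int k = lo; k <= hi; k++) {
            if (i > mid) a[k] = aux[j++];
            else if (j > hi) a[k] = aux[i++];
            else if (aux[j] < aux[i]) a[k] = aux[j++];
            else a[k] = aux[i++];
        }
    }
}
```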

The Complexity of Sorting (279)

Computational complexity: compare-based algorithms require at least lg(N!) ~ N lg N compares in the worst case. Proof sketch: the compare tree for N distinct keys must have at least N! leaves (one per permutation), and a tree of height h has at most 2^h leaves, so N! <= number of leaves <= 2^h, giving h >= lg(N!).

~N lg N compares in the worst case is also the upper bound, since that is what mergesort provides; mergesort is therefore an asymptotically optimal compare-based sorting algorithm. Good upper bounds let devs make performance guarantees, whereas good lower bounds protect us from striving after performance that isn’t attainable.

2.3 Quicksort (288)

The Basic Algorithm

Quicksort is a divide-and-conquer algorithm that sorts in place in time proportional to N lg N on average. It divides the target array in two around a partitioning item that sets the dividing line between the two partitions; once the two parts are sorted, the whole array is sorted (an induction argument shows the recursive method constitutes a proper sort). The array is randomly shuffled first, making quicksort a randomized algorithm with consistent performance.
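A sketch of the basic algorithm with a shuffle and the standard two-pointer partition (int keys for brevity):

```java
// Quicksort: shuffle, then partition around a[lo] so smaller entries
// end up to its left and larger entries to its right; recurse on each side.
class QuickSortSketch {
    static void sort(int[] a) {
        shuffle(a);                    // randomize to avoid worst-case inputs
        sort(a, 0, a.length - 1);
    }

    private static void sort(int[] a, int lo, int hi) {
        if (hi <= lo) return;
        int j = partition(a, lo, hi);  // a[j] is in its final position
        sort(a, lo, j - 1);
        sort(a, j + 1, hi);
    }

    private static int partition(int[] a, int lo, int hi) {
        int v = a[lo];                 // partitioning item
        int i = lo, j = hi + 1;
        while (true) {
            while (a[++i] < v) if (i == hi) break; // scan right for entry >= v
            while (v < a[--j]) if (j == lo) break; // scan left for entry <= v
            if (i >= j) break;
            swap(a, i, j);
        }
        swap(a, lo, j);                // put v into position j
        return j;
    }

    private static void shuffle(int[] a) {
        java.util.Random r = new java.util.Random();
        for (int i = a.length - 1; i > 0; i--) swap(a, i, r.nextInt(i + 1));
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}
```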

Performance Characteristics

Quicksort’s performance depends on how well partitioning divides the array, which depends on the partitioning key. The ideal scenario is when partitioning splits the array exactly in half; on average the partition falls in the middle. Randomizing the array helps avoid consistently unbalanced partitions. Average ~2N ln N compares; worst case ~N^2/2 compares.

Running time will be within a constant factor of 1.39 N lg N. Quicksort is typically faster than mergesort since it does less data movement.

Algorithmic Improvements (295)

Quicksort was developed in 1960 by C.A.R. Hoare. Attempts to improve it can often lead to unexpected side effects. However, certain improvements can be made. One such improvement is using insertion sort for small sub-arrays. Median of three partitioning offers better and more consistent partitioning.

Entropy-optimal sorting (3-way quicksort) offers significant improvements for arrays with large numbers of duplicate keys, and it is an ideal choice for a library/package sort function since it works well in a variety of real-world applications. The algorithm is based on Dijkstra’s solution to the Dutch national flag problem. The 3-way partitioning refined by J. Bentley and D. McIlroy is asymptotically faster than mergesort for such inputs. The worst case occurs when keys are all distinct; for arrays with large numbers of duplicate keys, 3-way quicksort reduces the sort time from linearithmic to linear.
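Dijkstra's 3-way partitioning can be sketched as follows (int keys; the Bentley-McIlroy refinements are omitted):

```java
// 3-way quicksort: maintain a[lo..lt-1] < v, a[lt..gt] == v, and
// a[gt+1..hi] > v, so equal keys never get recursed on again.
class Quick3WaySketch {
    static void sort(int[] a) { sort(a, 0, a.length - 1); }

    private static void sort(int[] a, int lo, int hi) {
        if (hi <= lo) return;
        int lt = lo, gt = hi, i = lo + 1;
        int v = a[lo];                  // partitioning value
        while (i <= gt) {
            if (a[i] < v)      swap(a, lt++, i++);
            else if (a[i] > v) swap(a, i, gt--);
            else               i++;     // equal to v: leave in place
        }
        sort(a, lo, lt - 1);            // keys < v
        sort(a, gt + 1, hi);            // keys > v
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}
```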

You can use the entropy of the array to derive both lower and upper bounds on compares. Compare-based sorting algorithms require at least NH - N compares, where H is the Shannon entropy of the key distribution. 3-way quicksort uses ~(2 ln 2) NH compares. H = lg N when keys are all distinct, recovering the N lg N bound.

2.4 Priority Queues

Heapsort is an in-place sorting algorithm guaranteed to run in N log N time. In practice heapsort isn’t used as much since its inner loop is longer, it makes poor use of cache memory (its array references for a large array are far apart and jump all over), and it is not stable.

Priority queues can be used as the logic for sorting algorithms. A priority queue typically supports two operations, remove the maximum (or minimum) and insert, plus a comparison helper such as less() or greater(). To sort in increasing order, change removeMax() to remove the smallest item instead.

Priority queues assist with sorting problems where the input is too large to keep in memory but only the top M items are needed (i.e., the priority queue holds just the M items of interest).

The lazy method of implementing a priority queue is to use an unordered sequence; the eager approach is to use an ordered sequence. Either way, these elementary implementations take linear time for one of the operations in the worst case, whereas stacks and queues take constant time. Binary heaps can be represented with explicit links or implicitly in an array. The non-explicitly-linked (array) method aids implementation since parent and child positions can be computed with arithmetic: the parent of position k is k/2 and its children are 2k and 2k+1. The height of a complete binary tree with N nodes is ⌊lg N⌋.

Algorithms on Heaps

The heap priority queue implementations of insert and remove the maximum (swim and sink, respectively) are guaranteed to run in logarithmic time: at most 1 + lg N compares for insert and at most 2 lg N compares for remove the maximum.
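A minimal array-backed max-priority-queue sketch with swim and sink (int keys; a small initial capacity that doubles, as an illustrative simplification):

```java
// Binary heap max-priority queue in a 1-based array: parent of k is
// k/2, children are 2k and 2k+1. insert() swims the new item up;
// delMax() swaps the root with the last item and sinks it down.
class MaxPQSketch {
    private int[] pq = new int[8]; // pq[0] unused (1-based heap)
    private int n;

    void insert(int v) {
        if (n + 1 == pq.length) pq = java.util.Arrays.copyOf(pq, 2 * pq.length);
        pq[++n] = v;
        swim(n);                   // at most 1 + lg N compares
    }

    int delMax() {
        int max = pq[1];           // assumes non-empty queue
        swap(1, n--);
        sink(1);                   // at most 2 lg N compares
        return max;
    }

    private void swim(int k) {     // restore heap order going up
        while (k > 1 && pq[k / 2] < pq[k]) { swap(k / 2, k); k /= 2; }
    }

    private void sink(int k) {     // restore heap order going down
        while (2 * k <= n) {
            int j = 2 * k;
            if (j < n && pq[j] < pq[j + 1]) j++; // pick the larger child
            if (pq[k] >= pq[j]) break;
            swap(k, j);
            k = j;
        }
    }

    private void swap(int i, int j) { int t = pq[i]; pq[i] = pq[j]; pq[j] = t; }
}
```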

An index minimum priority queue is like an array of keys that additionally allows fast access to the smallest key and lets clients refer to entries by index. Most operations take at most log N time.

Heapsort (323)

Using a priority queue implemented with an unordered array to enqueue/dequeue all the items corresponds to doing a selection sort; using an ordered array corresponds to doing an insertion sort. There are two phases to heapsort: heap construction and sortdown (pulling the items out in decreasing order), using a maximum priority queue. For construction it is more efficient to use sink() and build sub-heaps as you go; this method uses fewer than 2N compares and fewer than N exchanges.

Heapsort sortdown (326)

The process is similar to selection sort but uses far fewer compares: fewer than 2N lg N + 2N compares and half that many exchanges. During sortdown, most reinserted items sink all the way to the bottom. Time can be saved by skipping the position check on the way down, always promoting the larger of the two children until the bottom is reached, and then moving back up the heap to the item's proper position.
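A heapsort sketch using sink() for both phases, with 1-based heap indices mapped onto a 0-based Java array (the Floyd bounce-back optimization above is omitted):

```java
// Heapsort: build a max-heap with sink() from the halfway point,
// then repeatedly exchange the maximum into place and re-sink.
class HeapSortSketch {
    static void sort(int[] a) {
        int n = a.length;
        for (int k = n / 2; k >= 1; k--) // heap construction, bottom up
            sink(a, k, n);
        while (n > 1) {                  // sortdown
            swap(a, 1, n--);             // max goes to the end
            sink(a, 1, n);
        }
    }

    // 1-based heap position k lives at array slot a[k-1].
    private static void sink(int[] a, int k, int n) {
        while (2 * k <= n) {
            int j = 2 * k;
            if (j < n && a[j - 1] < a[j]) j++; // pick the larger child
            if (a[k - 1] >= a[j - 1]) break;
            swap(a, k, j);
            k = j;
        }
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i - 1]; a[i - 1] = a[j - 1]; a[j - 1] = t;
    }
}
```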

Heapsort is very performant and memory conscious, using ~2N lg N compares and constant extra space. However, it is rarely used today because of its poor cache performance. Priority queues, however, play an essential role since they give excellent performance for large numbers of interleaved insert and remove-the-maximum operations.

3. Searching (361)

A symbol table is an abstract mechanism for saving a value and retrieving it later by a key. Symbol tables are sometimes called dictionaries or indices. These classic data structures can be used to support a symbol table implementation: binary search trees, red-black trees, and hash tables.

3.1 Symbol Tables

A symbol table is a data structure for associating keys with values that supports two operations: insert() and search(). It is the associative-array abstraction found in many languages.

Ordered Symbol Tables (366)

An ordered symbol table takes advantage of the ability to compare keys. In languages that support generics, keys are comparable objects. It is similar to an index minimum priority queue except that equal keys are allowed in Priority Queues but not in Symbol Tables. Symbol Tables support a much greater breadth of operations. They are characterized by their underlying data structure and their implementations of get/put.

Sequential Search in an Unordered Linked List (374)

Analyzing Symbol Table algorithms is more complicated than analyzing sorting algorithms because of the difficulty of characterizing the sequence of operations that might be invoked by a given client. Search misses and insertions of new keys require N compares; a random search hit requires ~N/2 compares on average. Inserting N distinct keys uses ~N^2/2 compares. The linked-list implementation is too slow for large problems.

Algorithm 3.2: Binary Search in an Ordered Array (378)

Binary search in an ordered array is exceptionally fast for searching and ranking (at most lg N + 1 compares), but put/insert operations are linear in the worst case, so building a table of N keys takes quadratic time. For static tables it is worthwhile: far better than sequential search and the method of choice for small practical applications. If you need both fast search and fast insert, this isn’t the choice. Pros and cons of Symbol Table implementations are reviewed (386).

Binary Search’s rank(key) method maintains low and high index trackers and computes a midpoint from them. It iteratively compares the provided key to the key at the midpoint of the keys array. If the mid key is greater, reduce the high tracker to one less than mid and repeat; if the mid key is less, increase the low tracker to mid plus one. Once the comparison returns equal, return the midpoint. If the key is not found, return the low tracker, which is the number of keys smaller than the search key. Because rank() keeps halving the search interval, it runs in logarithmic time.

Binary Search’s get(key) first checks whether the table is empty and returns null if it is. It then computes the rank for the key (keys and values are kept in separate parallel arrays). Rank returns either the position of an exact match or the insertion point for a missing key; in the latter case get() returns null, and otherwise the value stored at that rank is returned.


Binary Search’s put(key, value) first finds the rank of the provided key. If the key is found there, its value is replaced. Otherwise, all larger keys and values are shifted one position to the right, working back from the end of the arrays, and the new key and value are inserted at the rank position. The arrays are resized as needed.
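The three methods can be sketched together (a simplification with int keys, String values, and a fixed capacity; the book's version is generic and resizes):

```java
// Ordered-array symbol table: parallel keys[]/vals[] arrays kept
// sorted by key, with rank() as the workhorse for get() and put().
class BinarySearchSTSketch {
    private int[] keys = new int[16];     // fixed capacity for brevity
    private String[] vals = new String[16];
    private int n;

    // Number of keys strictly less than key, by halving [lo, hi].
    int rank(int key) {
        int lo = 0, hi = n - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            if (key < keys[mid]) hi = mid - 1;
            else if (key > keys[mid]) lo = mid + 1;
            else return mid;              // exact match
        }
        return lo;                        // insertion point if not found
    }

    String get(int key) {
        if (n == 0) return null;
        int i = rank(key);
        if (i < n && keys[i] == key) return vals[i];
        return null;                      // rank pointed at the nearest slot
    }

    void put(int key, String val) {
        int i = rank(key);
        if (i < n && keys[i] == key) { vals[i] = val; return; } // replace
        for (int j = n; j > i; j--) {     // shift larger entries right
            keys[j] = keys[j - 1];
            vals[j] = vals[j - 1];
        }
        keys[i] = key;
        vals[i] = val;
        n++;
    }
}
```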

3.2 Binary Search Trees

A Symbol Table based on a Binary Search Tree is one of the most fundamental algorithms. A Binary Search Tree is similar to a Linked List in that it has links to other nodes (or null), but in this case it has links to right and left children which are themselves trees. Every node is only pointed to by its parent. Each node has a key and value with an ordering restriction to support efficient search.

size() is implemented so that: size(x) = size(x.left) + size(x.right) + 1. An int size is maintained on each node. search() is efficient because the search interval shrinks with each step down the tree, by about half on average for random keys.

put(Node x, Key key, Value val) works by first checking whether the current node is null. If it is, a new node is created and returned, which handles insertion once a null link is reached. Otherwise, the passed-in key is compared to the current node's key, and that comparison decides whether to recurse left or right; if the keys match, the current value is replaced. Finally, the subtree size is recomputed and stored, and the node is returned.
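A sketch of get() plus the recursive put() with subtree counts (int keys and String values for brevity):

```java
// BST symbol table: put() recurses down, creates a node at the
// first null link, and recomputes subtree sizes on the way back up.
class BSTSketch {
    private Node root;

    private static class Node {
        int key; String val;
        Node left, right;
        int size = 1;                   // nodes in this subtree
        Node(int key, String val) { this.key = key; this.val = val; }
    }

    int size() { return size(root); }
    private static int size(Node x) { return x == null ? 0 : x.size; }

    String get(int key) {
        Node x = root;
        while (x != null) {
            if (key < x.key) x = x.left;
            else if (key > x.key) x = x.right;
            else return x.val;
        }
        return null;                    // reached a null link: miss
    }

    void put(int key, String val) { root = put(root, key, val); }

    private Node put(Node x, int key, String val) {
        if (x == null) return new Node(key, val);  // insertion point
        if (key < x.key) x.left = put(x.left, key, val);
        else if (key > x.key) x.right = put(x.right, key, val);
        else x.val = val;                          // replace value
        x.size = size(x.left) + size(x.right) + 1; // restore invariant
        return x;
    }
}
```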

Only keys on the path from the root to the sought or inserted key are examined, so as the BST grows, the fraction of nodes examined shrinks. Search hits require about ~2 ln N compares on average; insertions and search misses require about the same. These approximations become more accurate as N increases: roughly 22 compares for 21 million records instead of N. Amazing.

Order-based Methods and Deletion (406)

Binary Search Trees are widely used since they allow us to keep the keys in order. To find the minimum key in a BST, keep moving left; if the root's left link is null, the root holds the smallest key. To find the maximum, you move right instead of left.

If a key is less than the root's key, then the floor of the key must be in the left subtree; if it is greater, then the ceiling must be in the right subtree.

select(k) (find the key of rank k) works by first checking whether the input node is null. If not, it computes the size t of the node's left subtree: if t is greater than k, it recurses left; if t is less than k, it recurses right looking for rank k - t - 1; otherwise (t equals k) the current node is returned.

deleteMin() works by traversing left until a node with a null left link is found, then replacing the link to that node with the node's right link; deleteMax() is symmetric. Deleting a node with one or no children works similarly.
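A minimal deleteMin() sketch over int keys (insert() and min() are included only so there is a tree to delete from):

```java
// deleteMin on a BST: recurse left until a node with a null left
// link, then return its right link to the parent, dropping the node.
class BSTDeleteMinSketch {
    static class Node {
        int key;
        Node left, right;
        Node(int key) { this.key = key; }
    }

    static Node insert(Node x, int key) {
        if (x == null) return new Node(key);
        if (key < x.key) x.left = insert(x.left, key);
        else if (key > x.key) x.right = insert(x.right, key);
        return x;
    }

    static Node deleteMin(Node x) {          // assumes x is non-null
        if (x.left == null) return x.right;  // minimum found: unlink it
        x.left = deleteMin(x.left);
        return x;
    }

    static int min(Node x) {                 // leftmost key is smallest
        while (x.left != null) x = x.left;
        return x.key;
    }
}
```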

delete() (Hibbard deletion) searches from the root, recursing downward until the node with the target key is found. A node with two children is replaced by its successor, the minimum node in its right subtree; that successor is removed from the right subtree with deleteMin(), and the links and subtree counts are then adjusted to reorganize the Binary Search Tree.

Range Queries

The keys(lo, hi) method compares the low and high keys to the current node's key. If the node's key falls within the range, it is enqueued. The method recurses into the left subtree when lo is less than the node's key and into the right subtree when hi is greater.

Performance for range-query functions is proportional to the height of the tree (plus, for keys(), the number of keys returned), which for random keys is usually around 3 ln N. Performance is predicated on having sufficiently random keys; the worst case is poor.

3.3 Balanced Search Trees

2-3 Search Trees

A node can have more than one key: it can have one, like a typical Binary Search Tree node, or two. The 2-3 designator comes from the fact that a 2-node has two links (left, right) and one key, whereas a 3-node has three links (left, middle, right) and two keys. Search generalizes the algorithm for a regular binary search tree.