Abstract

An essential text on the analysis, development, and implementation of fundamental data structures and algorithms.

Lessons

1.3 Bags, Queues, and Stacks (120)

Understanding basic ADTs is essential since they are the building blocks upon which more complex ADTs and efficient algorithms are built.

Collections and Generics

A collection is a set of values, each of which is an object.

Generics (parameterized types) allow a single data-structure implementation to be reused with many different concrete types. Java doesn’t support generic array creation.
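
A sketch of the standard workaround (the class name GenericArrayDemo is just for illustration): allocate an Object array and cast it to the generic type, which compiles with an unchecked-cast warning.

```java
public class GenericArrayDemo<Item> {
    private final Item[] a;

    @SuppressWarnings("unchecked")
    public GenericArrayDemo(int capacity) {
        // Java forbids `new Item[capacity]`, so allocate an
        // Object array and cast it to the generic type.
        a = (Item[]) new Object[capacity];
    }
}
```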

Autoboxing is Java’s automatic, on-the-fly conversion between a primitive type and its corresponding wrapper (reference) type.
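
For example (using java.util.Stack just to have a generic container handy):

```java
import java.util.Stack;

public class AutoboxDemo {
    public static void main(String[] args) {
        Stack<Integer> stack = new Stack<>();
        stack.push(17);          // autoboxing: int -> Integer
        int i = stack.pop();     // auto-unboxing: Integer -> int
        System.out.println(i);   // prints 17
    }
}
```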

Linked Lists

A linked list is a recursive data structure that is either empty (null) or a reference to a node having a generic item and a reference to a linked list.
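
A minimal sketch of that recursive structure, with an illustrative addFirst() showing constant-time insertion at the front (class and method names are just for this example):

```java
public class LinkedListDemo<Item> {
    // A linked list is either null (empty) or a node holding an
    // item and a reference to the rest of the list.
    private class Node {
        Item item;
        Node next;
    }

    private Node first;   // head of the list

    // Insert a new item at the front in constant time.
    public void addFirst(Item item) {
        Node oldFirst = first;
        first = new Node();
        first.item = item;
        first.next = oldFirst;
    }
}
```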

Linked lists enable implementations of bags, queues, and stacks with performance gains that would otherwise be unobtainable.

Loitering (holding a reference to an item that is no longer needed) is a common pitfall. Java’s GC reclaims an object’s memory only once all references to it are gone, so implementations should overwrite orphaned references with null.

Linked lists are useful for representing sequences of items in collection ADTs (abstract data types). Arrays can represent sequences too, but linked lists tend to be faster at inserting into or removing from the middle of a sequence, since no items need to be shifted.

The standard solution to enable arbitrary insertion or deletion is to maintain a doubly linked list, a link in each direction.

Bags, Stacks, and Queues

Bag, Queue, and Stack differ in the discipline they impose for which object is removed or examined next. That choice can significantly affect the efficiency of algorithms built on them.

Bag: removing items is not supported. Clients can test whether it is empty and get the number of items it contains. The order of iteration is unspecified.

Queue: with a FIFO queue, items are processed in the order they were added to the queue.

Stack: LIFO (last-in, first-out) pushdown stacks process items in the reverse of the order in which they were added. A fixed-capacity, array-based stack keeps its items in the array in insertion order. Push/pop operations should take time independent of the stack size.
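
A sketch of such a fixed-capacity, array-based pushdown stack (the class name and details here are illustrative): items sit in the array in insertion order, push and pop touch only the top in constant time, and pop nulls out the vacated slot to avoid loitering.

```java
public class ArrayStack<Item> {
    private Item[] a;
    private int n;   // number of items on the stack

    @SuppressWarnings("unchecked")
    public ArrayStack(int capacity) {
        a = (Item[]) new Object[capacity];   // fixed capacity for brevity
    }

    public boolean isEmpty() { return n == 0; }
    public int size()        { return n; }

    // Push and pop touch only a[n]: constant time, independent of size.
    public void push(Item item) { a[n++] = item; }

    public Item pop() {                  // assumes non-empty in this sketch
        Item item = a[--n];
        a[n] = null;                     // avoid loitering: let GC reclaim it
        return item;
    }
}
```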

2.1 Elementary Sorts (244)

Rules

Java’s Comparable interface makes it easy to compare and sort items that have a natural order. The compareTo() method (conventionally returning -1, 0, or +1) must implement a total order, adhering to the reflexive, antisymmetric, and transitive rules.
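
A small illustration (the Temperature type is made up for this example) of a compareTo() that implements a total order:

```java
public class Temperature implements Comparable<Temperature> {
    private final double degrees;

    public Temperature(double degrees) { this.degrees = degrees; }

    // Total order: returns -1, 0, or +1 as this temperature is
    // less than, equal to, or greater than that one.
    public int compareTo(Temperature that) {
        if (this.degrees < that.degrees) return -1;
        if (this.degrees > that.degrees) return +1;
        return 0;
    }
}
```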

Consider several factors when evaluating sorting algorithms: the number of compares and exchanges (or array accesses), the extra memory used, and how the algorithm behaves on the kinds of input that arise in practice.

Selection Sort (248)

~N^2/2 compares and N (linear) exchanges. Works by repeatedly selecting the smallest remaining item and exchanging it into place. Uses two nested loops: the inner loop compares each remaining entry against the smallest found so far.
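
A sketch following the book’s selection sort, with the usual less()/exch() helpers:

```java
public class Selection {
    // Repeatedly select the smallest remaining item and exchange
    // it into position: ~N^2/2 compares, N exchanges.
    public static void sort(Comparable[] a) {
        int n = a.length;
        for (int i = 0; i < n; i++) {
            int min = i;   // index of the smallest item to the right of i
            for (int j = i + 1; j < n; j++)
                if (less(a[j], a[min])) min = j;
            exch(a, i, min);
        }
    }

    private static boolean less(Comparable v, Comparable w) {
        return v.compareTo(w) < 0;
    }

    private static void exch(Comparable[] a, int i, int j) {
        Comparable t = a[i]; a[i] = a[j]; a[j] = t;
    }
}
```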

Insertion Sort (250)

~N^2/4 compares and ~N^2/4 exchanges on average. Worst case is ~N^2/2 compares and exchanges; best case is N-1 compares and 0 exchanges. Works well for partially sorted arrays, where the number of inversions is less than a constant multiple of the array size; it is much more efficient for these inputs. Can be sped up by shortening the inner loop: shift the larger entries one position to the right (half exchanges) rather than doing full exchanges.
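
A sketch of insertion sort in the same style; the half-exchange improvement mentioned above is left out for clarity:

```java
public class Insertion {
    // Insert each item into place among the items to its left.
    // Linear time on nearly sorted input, ~N^2/4 compares on average.
    public static void sort(Comparable[] a) {
        int n = a.length;
        for (int i = 1; i < n; i++)
            for (int j = i; j > 0 && less(a[j], a[j - 1]); j--)
                exch(a, j, j - 1);
    }

    private static boolean less(Comparable v, Comparable w) {
        return v.compareTo(w) < 0;
    }

    private static void exch(Comparable[] a, int i, int j) {
        Comparable t = a[i]; a[i] = a[j]; a[j] = t;
    }
}
```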

Comparing Algorithms (254)

Compare algorithms by implementing and debugging, analyzing basic properties, formulating a hypothesis, and running experiments.

Insertion and selection sort are both roughly quadratic, though insertion sort is about twice as fast. The more trials, the more accurate the estimate, so run experiments to validate the hypothesis on the data at hand. Applications with significant numbers of equal keys require a more careful analysis.

Shellsort (258)

Shellsort is an extension of insertion sort that gains speed by exchanging array entries that are far apart, producing partially ordered arrays that a final insertion-sort pass finishes efficiently. The increment sequence selected (the gaps by which the array is h-sorted) has a significant impact on performance, but no provably best sequence has been found. It performs well on arrays that are in arbitrary order.

Its performance isn’t quadratic: with the 3x+1 increment sequence the worst case is ~N^(3/2) compares, and in general the bound depends on the increment sequence.
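
A sketch of shellsort using the 3x+1 increment sequence (1, 4, 13, 40, ...), one common choice:

```java
public class Shell {
    // Insertion sort with decreasing gaps (h-sorting).
    public static void sort(Comparable[] a) {
        int n = a.length;
        int h = 1;
        while (h < n / 3) h = 3 * h + 1;   // 1, 4, 13, 40, 121, ...
        while (h >= 1) {
            // h-sort the array: every h-th subsequence becomes sorted
            for (int i = h; i < n; i++)
                for (int j = i; j >= h && less(a[j], a[j - h]); j -= h)
                    exch(a, j, j - h);
            h /= 3;
        }
    }

    private static boolean less(Comparable v, Comparable w) {
        return v.compareTo(w) < 0;
    }

    private static void exch(Comparable[] a, int i, int j) {
        Comparable t = a[i]; a[i] = a[j]; a[j] = t;
    }
}
```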

2.2 Mergesort (270)

Abstract In Place Merge

The abstract in-place merge, merge(a, lo, mid, hi), assumes the two halves a[lo..mid] and a[mid+1..hi] are each sorted and merges them into a single sorted range. It requires an auxiliary array: copy the range to the auxiliary array, then merge back into the original by moving items around.
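
A sketch of the merge, following the book’s approach of copying to an auxiliary array and merging back (the MergeDemo class name is illustrative):

```java
public class MergeDemo {
    // Merge the sorted halves a[lo..mid] and a[mid+1..hi].
    public static void merge(Comparable[] a, Comparable[] aux,
                             int lo, int mid, int hi) {
        for (int k = lo; k <= hi; k++)
            aux[k] = a[k];                   // copy to auxiliary array

        int i = lo, j = mid + 1;
        for (int k = lo; k <= hi; k++) {     // merge back
            if      (i > mid)              a[k] = aux[j++];  // left exhausted
            else if (j > hi)               a[k] = aux[i++];  // right exhausted
            else if (less(aux[j], aux[i])) a[k] = aux[j++];  // take smaller
            else                           a[k] = aux[i++];
        }
    }

    private static boolean less(Comparable v, Comparable w) {
        return v.compareTo(w) < 0;
    }
}
```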

Top-down Mergesort

Mergesort is a divide-and-conquer algorithm built on the abstract in-place merge: by induction, if it sorts the two sub-arrays, then merging them sorts the whole array. It uses between 1/2 N lg N and N lg N compares and at most 6N lg N array accesses. A few improvements can be made, though when addressing a new problem you should first write the most straightforward implementation and refine it only if a bottleneck is discovered.

To improve it, use insertion sort for small sub-arrays (fewer than ~15 items), since it is faster in the small case. You can also skip the merge when the two halves are already in order; see the sketch below.
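
A sketch combining both improvements (the cutoff of 15 is an assumed threshold; it reuses MergeDemo.merge() from the sketch above):

```java
public class TopDownMerge {
    private static final int CUTOFF = 15;   // assumed small-array threshold

    public static void sort(Comparable[] a) {
        Comparable[] aux = new Comparable[a.length];
        sort(a, aux, 0, a.length - 1);
    }

    private static void sort(Comparable[] a, Comparable[] aux, int lo, int hi) {
        // Improvement 1: insertion sort is faster for tiny sub-arrays.
        if (hi <= lo + CUTOFF - 1) { insertionSort(a, lo, hi); return; }
        int mid = lo + (hi - lo) / 2;
        sort(a, aux, lo, mid);
        sort(a, aux, mid + 1, hi);
        // Improvement 2: skip the merge if the halves are already in order.
        if (less(a[mid + 1], a[mid]))
            MergeDemo.merge(a, aux, lo, mid, hi);
    }

    private static void insertionSort(Comparable[] a, int lo, int hi) {
        for (int i = lo + 1; i <= hi; i++)
            for (int j = i; j > lo && less(a[j], a[j - 1]); j--) {
                Comparable t = a[j]; a[j] = a[j - 1]; a[j - 1] = t;
            }
    }

    private static boolean less(Comparable v, Comparable w) {
        return v.compareTo(w) < 0;
    }
}
```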

Bottom-up Mergesort

1/2 N lg N to N lg N compares and at most 6N lg N array accesses to sort an array of length N. When the array length is a power of two, top-down and bottom-up perform exactly the same number of compares and array accesses. Bottom-up is the method of choice for sorting linked lists, since it sorts the list in place (using only constant extra space). Whenever you consider using one of these algorithms, it is worth considering the other.
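
A sketch of the bottom-up variant, again reusing MergeDemo.merge() from above:

```java
public class BottomUpMerge {
    // lg N passes of pairwise merges: sub-arrays of size 1, 2, 4, ...
    // No recursion, which is why it adapts naturally to linked lists.
    public static void sort(Comparable[] a) {
        int n = a.length;
        Comparable[] aux = new Comparable[n];
        for (int sz = 1; sz < n; sz += sz)                // sz: sub-array size
            for (int lo = 0; lo < n - sz; lo += sz + sz)  // lo: sub-array start
                MergeDemo.merge(a, aux, lo, lo + sz - 1,
                                Math.min(lo + sz + sz - 1, n - 1));
    }
}
```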

The Complexity of Sorting (279)

Computational complexity: no compare-based sorting algorithm can guarantee fewer than lg(N!) ~ N lg N compares. The argument uses the compare tree: sorting N distinct keys requires at least N! leaves (one per permutation), and a binary tree of height h has at most 2^h leaves, so N! <= number of leaves <= 2^h, which gives h >= lg(N!).
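
As a worked check that lg(N!) really is ~ N lg N (a math sketch, not from the text):

```latex
% From the compare-tree argument: N! \le \text{leaves} \le 2^h, so h \ge \lg(N!).
% Stirling's approximation gives the asymptotics:
\lg(N!) = \sum_{k=1}^{N} \lg k = N \lg N - N \lg e + O(\lg N) \sim N \lg N
```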

~N lg N compares in the worst case is also the upper bound, since that is what mergesort provides; mergesort is therefore an asymptotically optimal compare-based sorting algorithm. Good upper bounds let developers make performance guarantees, whereas good lower bounds protect us from striving after performance that isn’t attainable.

2.3 Quicksort (288)

The Basic Algorithm

Quicksort is a divide-and-conquer algorithm that sorts in place in time proportional to N lg N on average. It partitions the target array into two parts around a partitioning item, which sets the dividing line; when the two parts are sorted, the whole array is sorted (proof by induction that the recursive method constitutes a proper sort). The array is randomly shuffled first, making it a randomized algorithm with consistent expected performance.
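
A sketch of the basic algorithm with two-pointer partitioning around a[lo]; the shuffle uses java.util.Collections for convenience:

```java
import java.util.Arrays;
import java.util.Collections;

public class QuickDemo {
    public static void sort(Comparable[] a) {
        // Randomize to make worst-case inputs vanishingly unlikely.
        Collections.shuffle(Arrays.asList(a));
        sort(a, 0, a.length - 1);
    }

    private static void sort(Comparable[] a, int lo, int hi) {
        if (hi <= lo) return;
        int j = partition(a, lo, hi);   // a[j] is in its final position
        sort(a, lo, j - 1);
        sort(a, j + 1, hi);
    }

    // Partition around v = a[lo] so that afterwards
    // a[lo..j-1] <= a[j] <= a[j+1..hi].
    private static int partition(Comparable[] a, int lo, int hi) {
        Comparable v = a[lo];
        int i = lo, j = hi + 1;
        while (true) {
            while (less(a[++i], v)) if (i == hi) break;   // scan right
            while (less(v, a[--j])) if (j == lo) break;   // scan left
            if (i >= j) break;                            // pointers cross
            exch(a, i, j);
        }
        exch(a, lo, j);   // put the partitioning item into position
        return j;
    }

    private static boolean less(Comparable v, Comparable w) {
        return v.compareTo(w) < 0;
    }

    private static void exch(Comparable[] a, int i, int j) {
        Comparable t = a[i]; a[i] = a[j]; a[j] = t;
    }
}
```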

Performance Characteristics

Quicksort’s performance depends on how well partitioning divides the array, which in turn depends on the partitioning item. The ideal case is when each partition splits the array exactly in half; on average, the partitioning item falls in the middle. Randomizing the array helps avoid unbalanced partitions. Average case: ~2N ln N compares. Worst case: ~N^2/2 compares.

Running time will be within a constant factor of 1.39 N lg N (since 2N ln N ≈ 1.39 N lg N). Quicksort is typically faster than mergesort since it does much less data movement.

Algorithmic Improvements (295)

Quicksort was developed in 1960 by C. A. R. Hoare. Attempts to improve it can often lead to unexpected side effects; however, certain improvements are worthwhile. One is using insertion sort for small sub-arrays. Another is median-of-three partitioning, which offers better and more consistent partitioning.

Entropy-optimal sorting (3-way quicksort) offers significant improvements for arrays with large numbers of duplicate keys, reducing sort time from linearithmic to linear on such inputs; its worst case occurs when all keys are distinct. Dijkstra’s solution to the Dutch National Flag problem is the classic 3-way partitioning method (see the sketch below). The refined 3-way partitioning developed by J. Bentley and D. McIlroy makes quicksort asymptotically faster than mergesort on inputs with duplicate keys, and it is an ideal choice for a library/package sort function since it performs well on a wide variety of real-world inputs.
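
A sketch of Dijkstra’s 3-way partitioning quicksort:

```java
public class Quick3way {
    public static void sort(Comparable[] a) { sort(a, 0, a.length - 1); }

    // Dijkstra's 3-way partitioning (Dutch National Flag): maintain
    //   a[lo..lt-1] < v,  a[lt..i-1] == v,  a[gt+1..hi] > v,
    // with a[i..gt] not yet examined.
    private static void sort(Comparable[] a, int lo, int hi) {
        if (hi <= lo) return;
        int lt = lo, i = lo + 1, gt = hi;
        Comparable v = a[lo];
        while (i <= gt) {
            int cmp = a[i].compareTo(v);
            if      (cmp < 0) exch(a, lt++, i++);
            else if (cmp > 0) exch(a, i, gt--);
            else              i++;
        }
        sort(a, lo, lt - 1);   // keys equal to v are already in place
        sort(a, gt + 1, hi);
    }

    private static void exch(Comparable[] a, int i, int j) {
        Comparable t = a[i]; a[i] = a[j]; a[j] = t;
    }
}
```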

You can use the entropy of the array to derive both lower and upper bounds on compares: compare-based sorting algorithms require at least NH - N compares, where H is the Shannon entropy of the key distribution, and 3-way quicksort uses ~(2 ln 2) N H compares. H = lg N when the keys are all distinct, which recovers the usual N lg N bounds.

2.4 Priority Queues

Heapsort is the in-place sorting algorithm guaranteed to run in N log N time. In practice it isn’t used much: its inner loop is longer than quicksort’s, it makes poor use of cache memory (its array references are scattered rather than local), and it is not stable.

Priority queues can be used as the logic for sorting algorithms. A priority queue typically supports two operations, insert and remove the maximum (or minimum), along with a comparison helper such as less(). To support ascending sorts, replace removeMax() with an operation that removes the smallest item.

Priority queues assist with sorting and selection problems where the input is too large to keep in memory: to find the M largest items in a huge input stream, keep a minimum-oriented priority queue holding only the M largest items seen so far.
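
A sketch of that idea using java.util.PriorityQueue (min-oriented by default); the TopM class name and int keys are just for illustration:

```java
import java.util.PriorityQueue;

public class TopM {
    // Keep the m largest items of a stream in a min-oriented PQ of size m.
    public static PriorityQueue<Integer> topM(int[] stream, int m) {
        PriorityQueue<Integer> pq = new PriorityQueue<>();   // min-heap
        for (int x : stream) {
            pq.offer(x);
            if (pq.size() > m) pq.poll();   // discard the smallest of m+1
        }
        return pq;   // holds the m largest items seen
    }
}
```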

The lazy approach to implementing a priority queue is an unordered sequence (do the work at removal); the eager approach is an ordered sequence (do the work at insertion). Either way, these elementary implementations take linear time for one of the operations in the worst case, whereas stack and queue operations take constant time. Binary heaps can be represented with explicit links or implicitly in an array. The implicit (array) representation aids implementation since it allows operations to navigate the tree with index arithmetic. The height of a complete binary tree with N nodes is ⌊lg N⌋.
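
The index arithmetic, for a heap stored in pq[1..N] with the root at index 1 (a tiny illustrative helper):

```java
public class HeapIndex {
    // Implicit (array) representation: navigation is pure index
    // arithmetic, no links needed.
    static int parent(int k) { return k / 2; }
    static int left(int k)   { return 2 * k; }
    static int right(int k)  { return 2 * k + 1; }
}
```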

Algorithms on Heaps

The heap priority queue implementations of insert and remove the maximum (via swim() and sink()) are guaranteed to run in logarithmic time: at most 1 + lg N compares for insert and at most 2 lg N compares for remove the maximum. See the sketch below.
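
A sketch of a heap-based MaxPQ with swim() and sink(), in the book’s style (fixed capacity for brevity):

```java
public class MaxPQ<Key extends Comparable<Key>> {
    private Key[] pq;   // heap-ordered complete binary tree in pq[1..n]
    private int n;

    @SuppressWarnings("unchecked")
    public MaxPQ(int capacity) { pq = (Key[]) new Comparable[capacity + 1]; }

    public void insert(Key v) {
        pq[++n] = v;
        swim(n);              // at most 1 + lg N compares
    }

    public Key delMax() {
        Key max = pq[1];
        exch(1, n--);
        pq[n + 1] = null;     // avoid loitering
        sink(1);              // at most 2 lg N compares
        return max;
    }

    // Bottom-up reheapify: exchange with the parent until heap order holds.
    private void swim(int k) {
        while (k > 1 && less(k / 2, k)) { exch(k / 2, k); k /= 2; }
    }

    // Top-down reheapify: exchange with the larger child until heap order holds.
    private void sink(int k) {
        while (2 * k <= n) {
            int j = 2 * k;
            if (j < n && less(j, j + 1)) j++;
            if (!less(k, j)) break;
            exch(k, j);
            k = j;
        }
    }

    private boolean less(int i, int j) { return pq[i].compareTo(pq[j]) < 0; }
    private void exch(int i, int j)    { Key t = pq[i]; pq[i] = pq[j]; pq[j] = t; }
}
```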

An index minimum priority queue behaves like an array of keys but also allows fast access to the minimum key and lets clients refer to items by an integer index. Most operations take at most log N time.

Heapsort (323)

Using a priority queue backed by an unordered array to insert and then remove all the items corresponds to doing a selection sort; using an ordered array corresponds to doing an insertion sort. Heapsort has two phases: heap construction and sortdown (pull the items out in decreasing order), using a maximum-oriented priority queue. For construction it is most efficient to proceed right to left, using sink() to make sub-heaps as you go: fewer than 2N compares and fewer than N exchanges.

Heapsort sortdown (326)

The process is similar to selection sort but uses far fewer compares: fewer than 2N lg N + 2N compares, and half that many exchanges. During sortdown, most reinserted items sink all the way back to the bottom, so time can be saved by skipping the position check: promote the larger of the two children until the bottom is reached, then move back up the heap to the item’s proper position. See the full sketch below.
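
A sketch of the full heapsort, heap construction followed by sortdown; the less()/exch() helpers translate the 1-based heap indices to the 0-based Java array:

```java
public class Heap {
    public static void sort(Comparable[] a) {
        int n = a.length;
        // Heap construction: sink() right to left builds sub-heaps
        // as we go (fewer than 2N compares, fewer than N exchanges).
        for (int k = n / 2; k >= 1; k--)
            sink(a, k, n);
        // Sortdown: repeatedly exchange the max into place and
        // restore heap order on the rest.
        while (n > 1) {
            exch(a, 1, n--);
            sink(a, 1, n);
        }
    }

    private static void sink(Comparable[] a, int k, int n) {
        while (2 * k <= n) {
            int j = 2 * k;
            if (j < n && less(a, j, j + 1)) j++;   // pick the larger child
            if (!less(a, k, j)) break;
            exch(a, k, j);
            k = j;
        }
    }

    // 1-based heap arithmetic over a 0-based Java array.
    private static boolean less(Comparable[] a, int i, int j) {
        return a[i - 1].compareTo(a[j - 1]) < 0;
    }

    private static void exch(Comparable[] a, int i, int j) {
        Comparable t = a[i - 1]; a[i - 1] = a[j - 1]; a[j - 1] = t;
    }
}
```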

Heapsort is very performant and memory-conscious, using ~2N lg N compares and constant extra space. However, it is rarely used today because of its poor cache performance. Priority queues themselves remain essential, since they give excellent performance for large numbers of mixed insert and remove-the-maximum operations.

3. Searching (361)

A symbol table is an abstract mechanism to save a value and retrieve it by a key. They’re sometimes called dictionaries or indices. These classic data structures can be used to support the symbol table implementation: binary search trees, red-black trees, and hash tables.

3.1 Symbol Tables

A symbol table is a data structure that associates keys with values and supports two operations: insert() and search(). It is the associative-array abstraction used by many languages.
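
A minimal sketch of such an API (the book’s symbol tables use put() and get() for insert and search; the interface here is illustrative):

```java
// Basic symbol-table API: associate a value with a key, then
// retrieve the value by the key.
public interface ST<Key, Value> {
    void put(Key key, Value val);   // insert: associate val with key
    Value get(Key key);             // search: value for key, or null
    boolean contains(Key key);      // is there a value for key?
    int size();                     // number of key-value pairs
}
```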