Concepedia

Abstract

Data compression is when you take a big chunk of data and crunch it down to fit into a smaller space. That data is put on ice; you have to un-crunch the compressed data to get at it. Data optimization, on the other hand, is when you take a chunk of data plus a collection of operations you can perform on that data, and crunch it into a smaller space while retaining the ability to perform the operations efficiently. This thesis investigates the problem of data optimization for some fundamental static data types, concentrating on linked data structures such as trees. I chose to restrict my attention to static data structures because they are easier to optimize since the optimization can be performed off-line. Data optimization comes in two different flavors: concrete and abstract. Concrete optimization finds minimal representations within a given implementation of a data structure; abstract optimization seeks implementations with guaranteed economy of space and time. I consider the problem of concrete optimization of various pointer-based implementations of trees and graphs. The only legitimate use of a pointer is as a reference, so we are free to map the pieces of a linked structure into memory as we choose. The problem is to find a mapping that maximizes overlap of the pieces, and hence minimizes the space they occupy. I solve the problem of finding a minimal representation for general unordered trees where pointers to children are stored in a block of consecutive locations. The algorithm presented is based on weighted matching. I also present an analysis showing that the average number of cons-cells required to store a binary tree of n nodes as a minimal binary DAG is asymptotic to $n/({\pi\over 8}$ lg $n)\sp{1/2}$. Methods for representing trees of n nodes in $O$($n$) bits that allow efficient tree-traversal are presented. I develop tools for abstract optimization based on a succinct representation for ordered sets that supports ranking and selection. These tools are put to use in a building an $O$($n$)-bit data structure that represents n-node planar graphs, allowing efficient traversal and adjacency-testing.