Compressed representations of sequences and full-text indexes

Abstract

Given a sequence S = s_1 s_2 … s_n of integers smaller than r = O(polylog(n)), we show how S can be represented using nH_0(S) + o(n) bits, so that we can retrieve any s_q, as well as answer rank and select queries on S, in constant time. H_0(S) is the zero-order empirical entropy of S, and nH_0(S) provides an information-theoretic lower bound on the bit storage of any sequence S via a fixed encoding of its symbols. This extends previous results on binary sequences and improves previous results on general sequences, where those queries are answered in O(log r) time. For larger r, we can still represent S in nH_0(S) + o(n log r) bits and answer queries in O(log r / log log n) time. Another contribution of this article is to show how to combine our compressed representation of integer sequences with a compression boosting technique to design compressed full-text indexes that scale well with the size of the input alphabet Σ. Specifically, we design a variant of the FM-index that indexes a string T[1, n] within nH_k(T) + o(n) bits of storage, where H_k(T) is the kth-order empirical entropy of T. This space bound holds simultaneously for all k ≤ α log_|Σ| n, constant 0 < α < 1, and |Σ| = O(polylog(n)). This index counts the occurrences of an arbitrary pattern P[1, p] as a substring of T in O(p) time; it locates each pattern occurrence in O(log^(1+ε) n) time for any constant 0 < ε < 1; and it reports a text substring of length ℓ in O(ℓ + log^(1+ε) n) time. Compared to all previous works, our index is the first that removes the alphabet-size dependence from all query times; in particular, counting time is linear in the pattern length. Still, our index uses essentially the same space as the kth-order entropy of the text T, which is the best space obtained in previous work. We can also handle larger alphabets of size |Σ| = O(n^β), for any 0 < β < 1, by paying o(n log |Σ|) extra space and multiplying all query times by O(log |Σ| / log log n).
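
To make the quantities in the abstract concrete, the following is a minimal Python sketch of the zero-order empirical entropy H_0(S), of rank and select queries, and of FM-index-style counting by backward search. It is an illustration only and not the article's data structure: rank and select are naive linear scans rather than constant-time operations on a compressed representation, the Burrows-Wheeler transform is built by sorting rotations, and all function names, as well as the NUL terminator, are assumptions introduced here.

```python
import math
from collections import Counter


def zero_order_entropy(seq):
    """H_0(S) in bits per symbol: sum over symbols c of (n_c/n) * log2(n/n_c)."""
    n = len(seq)
    return sum((cnt / n) * math.log2(n / cnt) for cnt in Counter(seq).values())


def rank(seq, c, i):
    """Number of occurrences of c in seq[0:i] (naive scan, not constant time)."""
    return sum(1 for x in seq[:i] if x == c)


def select(seq, c, j):
    """0-based position of the j-th occurrence (j >= 1) of c, or -1 if none."""
    seen = 0
    for pos, x in enumerate(seq):
        if x == c:
            seen += 1
            if seen == j:
                return pos
    return -1


def bwt(text):
    """Burrows-Wheeler transform via sorted rotations.

    Assumes the NUL character does not occur in text; it is appended as a
    unique smallest terminator.
    """
    t = text + "\0"
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rot[-1] for rot in rotations)


def count_occurrences(text, pattern):
    """Count occurrences of pattern in text by backward search.

    Each pattern character costs two rank queries on the transformed text,
    so the number of rank queries is linear in the pattern length.
    """
    last = bwt(text)
    counts = Counter(last)
    # first[c] = number of symbols in the transformed text smaller than c
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    sp, ep = 0, len(last)  # half-open range of matching sorted rotations
    for c in reversed(pattern):
        if c not in first:
            return 0
        sp = first[c] + rank(last, c, sp)
        ep = first[c] + rank(last, c, ep)
        if sp >= ep:
            return 0
    return ep - sp
```

For example, under these assumptions count_occurrences("abracadabra", "abra") performs four backward-search steps and returns 2. With constant-time rank on the transformed text, as the abstract describes for alphabets of polylogarithmic size, the same loop would run in O(p) time overall.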
