Concepedia

Publication | Open Access

A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices

60

Citations

42

References

2002

Year

Abstract

Abstract. Given two strings ofsize n over a constant alphabet, the classical algorithm for computing the similarity between two sequences [D. Sankoff and J. B. Kruskal, eds., Time Warps, String Edits, and Macromolecules; Addison–Wesley, Reading, MA, 1983; T. F. Smith and M. S. Waterman, J. Molec. Biol., 147 (1981), pp. 195–197] uses a dynamic programming matrix and compares the two strings in O(n 2) time. We address the challenge ofcomputing the similarity of two strings in subquadratic time for metrics which use a scoring matrix of unrestricted weights. Our algorithm applies to both local and global similarity computations. The speed-up is achieved by dividing the dynamic programming matrix into variable sized blocks, as induced by Lempel–Ziv parsing ofboth strings, and utilizing the inherent periodic nature ofboth strings. This leads to an O(n 2 / log n), algorithm for an input of constant alphabet size. For most texts, the time complexity is actually O(hn 2 / log n), where h ≤ 1 is the entropy ofthe text. We also present an algorithm for comparing two run-length encoded strings oflength m and n, compressed into m ′ and n ′ runs, respectively, in O(m ′ n + n ′ m) complexity. This result extends to all distance or similarity scoring schemes that use an additive gap penalty.

References

YearCitations

Page 1