Publication | Open Access
Extensive Simulations for Longest Common Subsequences: Finite Size Scaling, a Cavity Solution, and Configuration Space properties
15
Citations
12
References
1998
Year
Numerical AnalysisEngineeringCavity SolutionCavity QedResidual EntropyComputational ComplexityComputational MechanicsLongest SequenceString-searching AlgorithmMany-body ProblemString ProcessingNumerical SimulationKolmogorov ComplexityPhysicsProbability TheoryComputer ScienceFinite Size ScalingExtensive SimulationsLongest Common SubsequenceNatural SciencesCombinatorial Pattern MatchingRandomized AlgorithmCritical PhenomenonMultiscale Modeling
The Longest Common Subsequence (LCS) Problem asks for the longest sequence of (non-contiguous) matches between two given strings of characters. Using extensive Monte Carlo simulations, we find a finite size scaling law of the form E(L)/N =C + A/(N^1/2 ln N)+... for the mean LCS length of two random strings of size N over S letters. We provide precise estimates of C for S between 2 and 15. We consider also a related Bernoulli Matching model where the different entries of an N times M array are independently occupied with probability 1/S. In that case we find the expression of the limit of L(N,M)/N as N grows to infinity, as a function of r=M/N. This expression provides a very good approximation for the Random String model, which gets more and more accurate as S increases. The question of the ``universality class'' of the LCS problem is also considered. For the Bernoulli Matching model we find very good agreement with recent scaling predictions of Hwa and Lassig for Needleman-Wunsch sequence alignment. We find however that the variance of the LCS length has a different scaling different in the Random String model, suggesting that long-ranged correlations among the matches are relevant in this model. We finally study the ``ground state'' properties of this problem. We find in particular that the number of solutions typically grows exponentially with N, i.e. this system has a residual entropy at T=0. Also the overlap between two LCSs chosen at random is found to be self averaging and to aproach a definite value q(S)<1 as N grows.
| Year | Citations | |
|---|---|---|
Page 1
Page 1