U50: A New Metric for Measuring Assembly Output Based on Non-Overlapping, Target-Specific Contigs

Abstract

Advances in next-generation sequencing technologies enable routine genome sequencing, generating millions of short reads. A crucial step in full genome analysis is de novo assembly, and the performance of different assembly methods is currently measured by a metric called N50. However, the N50 value can produce skewed, inaccurate results when complex data are analyzed, especially for viral and microbial datasets. To provide a better assessment of assembly output, we developed a new metric called U50. The U50 identifies unique, target-specific contigs by using a reference genome as a baseline, aiming to circumvent some limitations that are inherent to the N50 metric. Specifically, the U50 program uses a mask array to remove sequence that overlaps among multiple contigs, so the performance of the assembly is measured only by unique contigs. We compared simulated and real datasets by using U50 and N50, and our results demonstrated that U50 has the following advantages over N50: (1) reducing erroneously large N50 values due to a poor assembly, (2) eliminating overinflated N50 values caused by large measurements from overlapping contigs, (3) eliminating diminished N50 values caused by an abundance of small contigs, and (4) allowing comparisons across different platforms or samples based on the new percentage-based metric UG50%. The use of the U50 metric allows for a more accurate measure of assembly performance by analyzing only the unique, non-overlapping contigs. In addition, most viral and microbial sequencing datasets have high background noise (i.e., host and other non-target sequences), which contributes to a skewed, misrepresented N50 value; U50 corrects for this. Also, the UG50% can be used to compare assembly results from different samples or studies, a cross-comparison that cannot be performed with N50.
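
To make the contrast between N50 and U50 concrete, the following is a minimal Python sketch, not the authors' U50 program. It assumes that each contig has already been aligned to the reference and is given as a half-open (start, end) interval in reference coordinates, that non-target contigs simply have no alignment, and that contigs are processed longest first so a boolean mask array can record which reference positions are already covered. The function names (`half_length`, `unique_lengths`) and the toy data are hypothetical. The abstract does not spell out the UG50% formula; the sketch assumes it is UG50 (the half-point taken against the reference length) expressed as a percentage of the reference length.

```python
from typing import List, Optional, Tuple


def half_length(lengths: List[int], total: Optional[int] = None) -> int:
    """Return the smallest length L such that items of length >= L
    together reach at least half of `total` (default: sum of lengths).
    With raw contig lengths this is the usual N50."""
    if total is None:
        total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # half of `total` is never reached


def unique_lengths(alignments: List[Tuple[int, int]], genome_length: int) -> List[int]:
    """Count, per contig, only the reference positions not already covered
    by a longer contig. `alignments` holds half-open (start, end) intervals."""
    mask = [False] * genome_length  # True once a position has been counted
    result = []
    for start, end in sorted(alignments, key=lambda a: a[1] - a[0], reverse=True):
        new_positions = 0
        for pos in range(start, end):
            if not mask[pos]:
                mask[pos] = True
                new_positions += 1
        if new_positions > 0:  # fully overlapping contigs contribute nothing
            result.append(new_positions)
    return result


# Toy data (hypothetical): a 1000 bp reference, four target contigs, and one
# 5000 bp non-target (e.g. host) contig that does not align to the reference.
genome_length = 1000
target_alignments = [(0, 400), (100, 400), (400, 600), (600, 650)]
all_contig_lengths = [5000, 400, 300, 200, 50]

n50 = half_length(all_contig_lengths)
unique = unique_lengths(target_alignments, genome_length)
u50 = half_length(unique)
ug50 = half_length(unique, total=genome_length)
ug50_percent = 100.0 * ug50 / genome_length  # assumed definition, see lead-in

print(n50, unique, u50, ug50, ug50_percent)
```

In this toy case the single non-target contig drives N50 to 5000, while the mask array discards both that contig and the 300 bp contig that lies entirely inside a longer one, leaving unique lengths of 400, 200, and 50 and a U50 of 400. Because the assumed UG50% is normalized by the reference length, it stays comparable between samples with different genome sizes.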
