SOAPdenovo2: an empirically improved memory-efficient short-read <i>de novo</i> assembler

TLDR

De novo genome assembly from NGS short reads is expanding rapidly, yet challenges in efficiency, accuracy, continuity, coverage, and repeat resolution persist. The authors developed SOAPdenovo2, the successor to SOAPdenovo, with a new algorithm design that reduces memory use in graph construction, resolves more repeat regions in contig assembly, increases coverage and length in scaffold construction, improves gap closing, and is optimized for large genomes. Benchmarking on the Assemblathon1 and GAGE datasets shows that SOAPdenovo2 greatly surpasses its predecessor and is competitive with other assemblers in assembly length and accuracy. Applied to the YH genome, it yields a contig N50 3-fold longer and a scaffold N50 50-fold longer than the first published version, raises genome coverage from 81.16% to 93.91%, and lowers peak memory consumption by about two-thirds.

Abstract

De novo genome assembly using next-generation sequencing (NGS) short reads is increasing rapidly; however, several major challenges must still be overcome for it to be efficient and accurate. SOAPdenovo has been successfully applied to assemble many published genomes, but it still needs improvement in continuity, accuracy and coverage, especially in repeat regions. To overcome these challenges, we have developed its successor, SOAPdenovo2, which has a new algorithm design that reduces memory consumption in graph construction, resolves more repeat regions in contig assembly, increases coverage and length in scaffold construction, improves gap closing, and is optimized for large genomes. Benchmarking on the Assemblathon1 and GAGE datasets showed that SOAPdenovo2 greatly surpasses its predecessor SOAPdenovo and is competitive with other assemblers in both assembly length and accuracy. We also provide an updated assembly of the 2008 Asian (YH) genome using SOAPdenovo2. The contig and scaffold N50 of this YH assembly were ~20.9 kbp and ~22 Mbp, respectively, which are 3-fold and 50-fold longer than in the first published version. Genome coverage increased from 81.16% to 93.91%, and peak memory consumption was ~2/3 lower.
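The contig and scaffold N50 values quoted above follow the usual definition: the length L such that sequences of length L or longer together make up at least half of the total assembly length. A minimal sketch of that calculation is given below for orientation; the function name `n50` and the example lengths are illustrative assumptions, not code or data from SOAPdenovo2.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths.

    Hypothetical helper for illustration; not part of SOAPdenovo2.
    """
    ordered = sorted(lengths, reverse=True)  # longest sequences first
    if not ordered:
        raise ValueError("at least one length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first length where the cumulative sum reaches
        # half of the total assembly length.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for a non-empty input")


# Made-up lengths, purely to show the calculation.
example_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(example_lengths))
```

Because the sum is accumulated from the longest sequence downward, a few very long scaffolds can dominate the statistic, which is why N50 is usually reported alongside genome coverage, as in the abstract.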
