Bayesian multi-domain learning for cancer subtype discovery from next-generation sequencing count data

Abstract

Precision medicine aims for personalized prognosis and therapeutics by utilizing recent genome-scale high-throughput profiling techniques, including next-generation sequencing (NGS). However, translating NGS data faces several challenges. First, NGS count data are often overdispersed, requiring appropriate modeling. Second, compared to the number of involved molecules and system complexity, the number of available samples for studying a complex disease, such as cancer, is often limited, especially considering disease heterogeneity. The key question is whether we may integrate available data from all different sources or domains to achieve reproducible disease prognosis based on NGS count data. In this paper, we develop a Bayesian Multi-Domain Learning (BMDL) model that derives domain-dependent latent representations of overdispersed count data based on hierarchical negative binomial factorization for accurate cancer subtyping even if the number of samples for a specific cancer type is small. Experimental results from both our simulated and NGS datasets from The Cancer Genome Atlas (TCGA) demonstrate the promising potential of BMDL for effective multi-domain learning without the "negative transfer" effects often seen in existing multi-task learning and transfer learning methods.