TLDR

Generative pre-trained models have achieved remarkable success in NLP and computer vision, and pairing large-scale, diverse datasets with pre-trained transformers has emerged as a promising approach to building foundation models. This study investigates whether foundation models can be applied to cellular biology and genetics by treating genes as analogous to words and cells as analogous to texts. We built scGPT, a generative pre-trained transformer trained on over 33 million single-cell profiles, and fine-tuned it via transfer learning for tasks such as cell-type annotation, multi-batch and multi-omic integration, perturbation prediction, and gene-network inference. scGPT successfully extracts critical biological insights about genes and cells, and its codebase is publicly available at https://github.com/bowang-lab/scGPT.

Abstract

Generative pre-trained models have achieved remarkable success in various domains such as natural language processing and computer vision. Specifically, the combination of large-scale diverse datasets and pre-trained transformers has emerged as a promising approach for developing foundation models. Drawing a parallel between language and cellular biology (texts are composed of words, and cells are likewise characterized by genes), our study probes the applicability of foundation models to cellular biology and genetics research. Using the rapidly growing body of single-cell sequencing data, we have constructed a foundation model for single-cell biology, scGPT, a generative pre-trained transformer trained on a repository of over 33 million cells. Our findings illustrate that scGPT effectively distills critical biological insights concerning genes and cells. With further fine-tuning via transfer learning, scGPT can be optimized to achieve superior performance across diverse downstream applications. These include cell-type annotation, multi-batch integration, multi-omic integration, genetic perturbation prediction, and gene network inference. The scGPT codebase is publicly available at https://github.com/bowang-lab/scGPT.
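To make the text-and-words analogy concrete, the following is a minimal sketch of how one cell's expression profile could be turned into a token sequence, much as a sentence is turned into word ids. It is an illustration only and is not taken from the scGPT codebase: the function names `build_vocab` and `tokenize_cell`, the equal-width binning of expression values, and the example gene list are all assumptions made for this example.

```python
from typing import Dict, List, Tuple


def build_vocab(gene_names: List[str]) -> Dict[str, int]:
    """Map each gene name to an integer id, reserving 0 for padding."""
    vocab = {"<pad>": 0}
    for name in gene_names:
        if name not in vocab:
            vocab[name] = len(vocab)
    return vocab


def tokenize_cell(
    expression: Dict[str, float],
    vocab: Dict[str, int],
    n_bins: int = 5,
) -> List[Tuple[int, int]]:
    """Turn one cell's profile into (gene_id, value_bin) pairs.

    Assumption for illustration: genes with zero expression are dropped,
    and nonzero values are placed into n_bins equal-width bins computed
    over this cell's own range of expressed values.
    """
    expressed = {g: v for g, v in expression.items() if v > 0 and g in vocab}
    if not expressed:
        return []
    lo, hi = min(expressed.values()), max(expressed.values())
    # Fall back to a width of 1.0 when every expressed value is identical.
    width = (hi - lo) / n_bins or 1.0
    tokens = []
    for gene, value in expressed.items():
        # Bin ids run from 1 to n_bins; the maximum value is clamped into the top bin.
        bin_id = min(int((value - lo) / width), n_bins - 1) + 1
        tokens.append((vocab[gene], bin_id))
    return tokens


# Hypothetical four-gene vocabulary and a single cell.
vocab = build_vocab(["GATA1", "CD3E", "MS4A1", "ACTB"])
cell = {"GATA1": 0.0, "CD3E": 2.0, "ACTB": 10.0}
tokens = tokenize_cell(cell, vocab, n_bins=4)
```

In this sketch the cell plays the role of a text and each expressed gene plays the role of a word, with the binned expression level attached as extra information per token. A transformer could then consume such sequences in the same way it consumes word ids, though the actual input representation used by scGPT should be checked against its repository.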
