Publication | Open Access
Efficient Test and Visualization of Multi-Set Intersections
Citations: 458
References: 33
Year: 2015
Identifying shared features across multiple sets is essential in many fields, yet no method exists to assess the statistical significance of intersections among three or more sets, and current visualization approaches lack scalability. This study develops a theoretical framework and efficient procedure for computing exact probabilities of multi‑set intersections and designs scalable visualization techniques. Using combinatorial theory, the authors implemented the framework and visualization methods in the R package SuperExactTest, offering multiple efficient, scalable techniques. Simulation and analyses of seven cancer gene sets and six GWAS‑derived gene sets demonstrate the utility of SuperExactTest, suggesting broad applicability across scientific disciplines.
Abstract

Identification of sets of objects with shared features is a common operation in all disciplines. Analysis of intersections among multiple sets is fundamental for in-depth understanding of their complex relationships. However, no method has so far been developed to assess the statistical significance of intersections among three or more sets. Moreover, the state-of-the-art approaches for visualization of multi-set intersections are not scalable. Here, we first developed a theoretical framework, based upon combinatorial theory, for computing the statistical distributions of multi-set intersections, and then designed a procedure to efficiently calculate the exact probabilities of multi-set intersections. We further developed multiple efficient and scalable techniques to visualize multi-set intersections and the corresponding intersection statistics. We implemented both the theoretical framework and the visualization techniques in a unified R software package, SuperExactTest. We demonstrated the utility of SuperExactTest through an intensive simulation study and a comprehensive analysis of seven independently curated cancer gene sets as well as six disease- or trait-associated gene sets identified by genome-wide association studies. We expect that SuperExactTest will have a broad range of applications in scientific data analysis in many disciplines.
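
To make the idea of an exact multi-set intersection probability concrete, here is a minimal Python sketch. It is not the SuperExactTest implementation, and it is not taken from the paper. It assumes a null model in which each set is drawn independently and uniformly at random, without replacement, from a common population of `n` items. Under that assumption, intersecting the running overlap with one more set is a hypergeometric draw, so the distribution can be built up one set at a time. The function names `hypergeom_pmf`, `intersection_pmf`, and `upper_tail` are hypothetical.

```python
from math import comb


def hypergeom_pmf(x, n, m, draws):
    """P(X = x) when `draws` items are sampled without replacement
    from a population of n items, m of which are marked."""
    if x < max(0, draws + m - n) or x > min(m, draws):
        return 0.0
    return comb(m, x) * comb(n - m, draws - x) / comb(n, draws)


def intersection_pmf(n, sizes):
    """Distribution of |A_1 ∩ ... ∩ A_k| under the assumed null model:
    each A_i is an independent uniform random subset of size sizes[i]."""
    # dist[m] = probability that the running intersection has size m
    dist = {sizes[0]: 1.0}
    for size in sizes[1:]:
        nxt = {}
        for m, p in dist.items():
            # Overlap of a fixed m-subset with a random subset of `size` items
            for x in range(max(0, size + m - n), min(m, size) + 1):
                nxt[x] = nxt.get(x, 0.0) + p * hypergeom_pmf(x, n, m, size)
        dist = nxt
    return dist


def upper_tail(n, sizes, observed):
    """P(intersection size >= observed): a one-sided enrichment p-value."""
    pmf = intersection_pmf(n, sizes)
    return sum(p for x, p in pmf.items() if x >= observed)
```

With two sets this reduces to the ordinary hypergeometric test. As a small check, for `n = 4` and three sets of size 2, `upper_tail(4, [2, 2, 2], 2)` is the probability that all three sets coincide, which is 1/36. The nested loops favor readability over speed and are not tuned for the population sizes typical of gene-set analyses.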