Genome-wide Association Studies

BioAI Systems Lab develops computational and statistical methods for large genome-wide association studies (GWAS), with a particular focus on complex traits that cannot be explained well by single-marker analyses alone. We study how to identify informative combinations of genetic variants, how to model associations across multiple diseases, and how to extract biological signal from large and heterogeneous genomic datasets.

High-throughput SNP genotyping and large public datasets have made it possible to investigate genotype-phenotype relationships at an unprecedented scale. At the same time, these data introduce major computational and statistical challenges. Interaction effects are often weak, the search space grows combinatorially with interaction order (m SNPs yield C(m, k) candidate k-SNP combinations), and different diseases may share overlapping genetic architectures. Our work addresses these issues through Bayesian modeling, clustering, summary-statistics methods, and high-performance computing.

Across this area, the lab has developed tools such as DCHE, DAM, MSCD, JS-MA, BAM, and TS for detecting genome-wide interactions and disease-associated genes under different study designs and data availability settings.

Research Focus Areas

High-Order Epistatic Interaction Detection

Many complex traits are influenced by interactions among multiple loci, but exhaustive searches for high-order epistasis quickly become computationally infeasible. We develop methods that reduce the search burden while retaining power to discover multilocus effects that may be missed by conventional single-locus GWAS pipelines.
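To give a sense of scale, the number of candidate k-SNP sets among m SNPs is the binomial coefficient C(m, k). The following is a minimal Python sketch of that count. The SNP total of 500,000 is an illustrative assumption, not a figure from any specific study, and `n_combinations` is a hypothetical helper name.

```python
from math import comb


def n_combinations(n_snps: int, order: int) -> int:
    """Number of distinct SNP sets of the given interaction order."""
    return comb(n_snps, order)


# Illustrative assumption: a genotyping array with 500,000 SNPs.
N_SNPS = 500_000

for order in (1, 2, 3):
    # Each candidate set would need at least one association test.
    print(f"order {order}: {n_combinations(N_SNPS, order):.3e} candidate sets")
```

Each step up in interaction order multiplies the number of candidate sets by roughly m / k, which is why exhaustive testing stops being practical beyond low orders.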

This direction includes tools such as DCHE and MSCD, which were designed to search for informative SNP combinations efficiently. These methods explore how clustering, combinatorial screening, and scalable computation can make multilocus discovery practical on genome-wide data.

Association Mapping Across Multiple Diseases

Related diseases may share common genetic mechanisms, and analyzing them jointly can reveal signals that are hard to detect in isolation. We study models that leverage information across multiple case groups to identify disease-specific and shared association patterns more effectively than separate single-disease analyses.

Representative work in this area includes DAM, BAM, and JS-MA, which explore Bayesian and information-theoretic approaches for mapping associations across multiple diseases. This line of research is motivated by the idea that complex disorders often exhibit pleiotropy, heterogeneous effects, and nontrivial overlap in their underlying variant sets.
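As a rough illustration of the information-theoretic idea, the sketch below computes the Jensen-Shannon divergence between the genotype frequency distributions of a single SNP in two case groups. It is a generic textbook computation under assumed inputs, not the JS-MA procedure itself. The genotype counts and the function names are hypothetical.

```python
from math import log2


def genotype_freqs(counts):
    """Convert genotype counts (e.g. AA, Aa, aa) to relative frequencies."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("genotype counts must sum to a positive number")
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded in [0, 1]."""
    # The mixture m is positive wherever p or q is, so the KL terms are finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical genotype counts (AA, Aa, aa) for one SNP in two case groups.
disease_a = genotype_freqs([120, 60, 20])
disease_b = genotype_freqs([60, 90, 50])
print(f"JS divergence: {js_divergence(disease_a, disease_b):.4f} bits")
```

A value near 0 indicates similar genotype distributions in the two groups at that SNP. Larger values flag SNPs whose distributions differ and that may merit a formal test.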

Summary-Statistics-Based Gene Discovery

Many modern studies make summary statistics publicly available even when individual-level genotype data cannot be shared. We are interested in methods that can use these summary results to identify novel disease-associated genes and prioritize promising candidates for follow-up analysis.

Our related work includes TS and other statistical methods for detecting disease-associated genes from publicly available GWAS summary data. These approaches help expand the practical value of existing datasets by making secondary analysis possible without access to raw subject-level measurements.
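To illustrate how SNP-level summary p-values might be combined into a gene-level signal, here is a minimal sketch of a generic truncated product statistic with a Monte Carlo null. It is not the TS method. It assumes independent SNP p-values, which linkage disequilibrium within a gene would violate in practice. The threshold `tau`, the example p-values, and all function names are hypothetical.

```python
import math
import random


def truncated_log_product(p_values, tau=0.05):
    """Log of the product of all p-values at or below the threshold tau."""
    return sum(math.log(p) for p in p_values if p <= tau)


def gene_p_value(p_values, tau=0.05, n_sim=10_000, seed=0):
    """Monte Carlo gene-level p-value, assuming independent uniform nulls."""
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    rng = random.Random(seed)
    observed = truncated_log_product(p_values, tau)
    n_snps = len(p_values)
    hits = 0
    for _ in range(n_sim):
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        null = [1.0 - rng.random() for _ in range(n_snps)]
        # Smaller (more negative) statistics indicate stronger evidence.
        if truncated_log_product(null, tau) <= observed:
            hits += 1
    # The +1 correction keeps the estimate strictly positive.
    return (hits + 1) / (n_sim + 1)


# Hypothetical SNP-level p-values mapped to one gene.
snp_p_values = [1e-6, 0.002, 0.3, 0.7]
print(f"gene-level p-value: {gene_p_value(snp_p_values):.4g}")
```

Truncation limits the statistic to the SNPs that show some evidence, so a gene with many null SNPs is not diluted by them. A realistic analysis would need to account for correlation among SNPs, for example by using a reference panel.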

Scalable Computation for Combinatorial Genomics

Genome-wide interaction discovery is fundamentally a large-scale computing problem. Even when the underlying statistical model is well powered, the number of candidate SNP combinations can be too large to enumerate exhaustively. We therefore study not only association models themselves, but also the software and computational strategies needed to deploy them on real data.

Cloud and parallel computing are an important part of this effort. Projects such as DCHE demonstrate how distributed computation can accelerate high-order epistasis detection and make genome-wide searches more feasible in practice.

Selected Software and Resources

DCHE: cloud computing for detecting high-order genome-wide epistatic interactions via dynamic clustering.
DAM: a Bayesian method for detecting genome-wide associations on multiple diseases.
MSCD: the Multi-SNP Combination Set Detector for high-order SNP combination discovery.
Selected Publications: the lab's broader publication list, including GWAS-related papers.

Representative Publications

TS: a powerful truncated test to detect novel disease associated genes using publicly available GWAS summary data, BMC Bioinformatics, 2020.
JS-MA: A Jensen-Shannon Divergence Based Method for Mapping Genome-Wide Associations on Multiple Diseases, Frontiers in Genetics, 2020.
BAM: A block-based Bayesian method for detecting genome-wide associations with multiple diseases, Tsinghua Science and Technology, 2020.
Powerful statistical method to detect disease-associated genes using publicly available genome-wide association studies summary data, Genetic Epidemiology, 2019.
Cloud computing for detecting high-order genome-wide epistatic interaction via dynamic clustering, BMC Bioinformatics, 2014.

These methods have been applied to challenging scenarios involving multilocus effects, multiple related diseases, and summary-statistics analysis. Across the lab's GWAS work, a recurring goal is to make association discovery both more powerful and more computationally practical for complex biological questions.