Research

BioAI Systems Lab develops AI-native computational methods for biological discovery, with research spanning AI for Science, omics data analysis, genomics, metagenomics, and brain imaging.

High-throughput omics technologies are revolutionizing many aspects of modern biology. We are entering the era of Big Data in biological research. The massive sample sizes and high dimensionality of these data introduce unique computational and statistical challenges, including storage bottlenecks, measurement errors, noise accumulation, and spurious correlations.

We study data mining, machine learning, big data analysis, and data fusion. The primary focus of our research is to develop advanced frameworks and algorithms that help biologists solve practical problems in systems biology, biomedicine, and the natural sciences, and enable them to make full use of massive, high-dimensional data for a wide range of biological inquiries.

Research Areas

AI for Science

Modern biology is rapidly becoming an AI-driven science. Large-scale experiments now generate massive and heterogeneous datasets, including genome sequences, proteomics measurements, metabolomics profiles, imaging data, and scientific literature. Although these data contain rich signals about biological systems, transforming them into mechanistic insight, testable hypotheses, and actionable knowledge remains a major challenge. In this project, we develop AI-native computational methods that can accelerate scientific discovery by integrating multimodal biological data, extracting interpretable patterns, and supporting data-driven reasoning.

We are particularly interested in building intelligent systems that assist scientists throughout the discovery workflow, including result interpretation, hypothesis generation, experiment prioritization, and biological knowledge integration. By combining machine learning, data mining, and systems biology, we aim to create trustworthy and interpretable AI frameworks that help researchers move from complex data to scientific understanding.

Metaproteomics Search

Microbial communities drive nutrient cycling in aquatic and terrestrial ecosystems and influence the health of human, animal, and plant hosts. The metabolic activities of a microbial community can be inferred from the proteomes of its constituent microorganisms. In a typical metaproteomics experiment, total proteins are extracted from environmental samples of a microbial community and then measured by liquid chromatography-tandem mass spectrometry (LC-MS/MS) using a shotgun proteomics approach. Because complex microbial communities may contain thousands of organisms, metaproteomics analyses often yield far fewer peptide and protein identifications than comparable proteomics analyses of single organisms. In this project, we aim to develop novel database searching and filtering algorithms that increase the number of peptide and protein identifications in metaproteomics.

Software: WinnowNet, SEMQuant, MetaLP, IDIA, CloudProteoAnalyzer, Transformer-DIA, DeepFilter, Sipros Ensemble.

Genome-wide Association Studies

Taking advantage of high-throughput single nucleotide polymorphism (SNP) genotyping technology, large genome-wide association studies (GWAS) have shown great promise for unraveling complex relationships between genotype and phenotype. At present, traditional single-locus methods are often insufficient for detecting multilocus interactions, which are widespread in complex traits. In addition, statistical tests for high-order epistatic interactions involving more than two SNPs pose substantial computational and analytical challenges because the number of SNP combinations increases exponentially with interaction order. In this project, we design fast and powerful methods to detect genome-wide multilocus epistatic interactions across multiple case groups.
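To give a sense of the scale involved, the number of candidate SNP combinations at a given interaction order is the binomial coefficient "n choose k". The short Python sketch below simply counts those combinations; the figure of 500,000 SNPs and the function name count_snp_combinations are illustrative assumptions, not values or code taken from our software.

```python
from math import comb


def count_snp_combinations(num_snps: int, order: int) -> int:
    """Number of distinct SNP sets of size `order` drawn from `num_snps` SNPs."""
    return comb(num_snps, order)


if __name__ == "__main__":
    # Hypothetical panel size, chosen only for illustration.
    num_snps = 500_000
    for order in range(1, 5):
        total = count_snp_combinations(num_snps, order)
        print(f"order {order}: {total:.3e} candidate combinations")
```

Each additional SNP in the interaction multiplies the search space by several orders of magnitude, which is why exhaustive testing quickly becomes impractical at higher orders.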

Software: DCHE, DAM, MSCD.

Sequence Assembly

Next-generation sequencing platforms have not only reduced the cost of metagenomic data analysis, but also greatly increased the size of metagenomic sequence datasets. A common limitation of available assemblers is that they do not adequately address the trade-off between contig noise and the gains in sequence length that improve annotation. This is especially true in large-scale sequencing projects whose datasets have moderate coverage and a large number of nonoverlapping contigs. To address this limitation and improve both accuracy and efficiency, we develop novel metagenomic sequence assembly frameworks and algorithms by leveraging high-performance computing and cyberinfrastructure.

Software: DISCO, DIME.

fMRI Analysis

An intriguing question in brain science is what origins and principles underlie the brain's functional architectures, which to a great extent define who and what we are. Functional magnetic resonance imaging (fMRI) is one of the most widely used approaches for exploring whole-brain functional activity. After decades of active research, substantial evidence has shown that brain function emerges from the interactions of multiple concurrent neural processes or brain networks. We design novel statistical models to analyze functional brain dynamics among nodes in brain networks over time.

Software: BCCPM.