Computational Proteomics

BioAI Systems Lab develops computational methods for interpreting large-scale mass spectrometry data, with a particular emphasis on metaproteomics. Our goal is to improve how peptides and proteins are identified, quantified, and biologically interpreted in complex microbial and environmental samples.

In a typical metaproteomics experiment, total proteins are extracted from a microbial community and measured by liquid chromatography-tandem mass spectrometry (LC-MS/MS) using a shotgun proteomics workflow. Unlike in single-organism proteomics, the reference databases for these studies can be extremely large, incomplete, and highly redundant. As a result, peptide-spectrum matching, protein inference, quantification, and false discovery rate (FDR) control all become much more difficult. We address these bottlenecks through machine learning, statistical modeling, optimization, and scalable computing.

Our current work spans both data-dependent and data-independent acquisition workflows, deep learning for peptide identification, quantitative metaproteomics, and software systems that make advanced proteomics analysis easier to run at scale.

Research Focus Areas

Peptide Identification and PSM Filtering

One major focus of the lab is improving peptide-spectrum match (PSM) ranking after database search. Modern metaproteomics studies often generate enormous search spaces, which makes it hard to separate true peptide assignments from false positives. We develop machine learning methods that learn directly from measured and theoretical spectra, complementing conventional search scores with data-driven evidence.

This line of work includes DeepFilter and WinnowNet, which were designed to improve peptide identifications in complex metaproteomic samples. Recent lab work has shown that curriculum learning and order-invariant neural models can improve identifications at matched false discovery rates while also reducing the need for ad hoc sample-specific tuning.
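
To make the phrase "identifications at matched false discovery rates" concrete, the following is a minimal sketch of target-decoy q-value estimation, the standard way a ranked list of PSMs is cut at a chosen FDR. It is not the DeepFilter or WinnowNet implementation. The `PSM` record, its `score` field (assumed to be "higher is better", for example the output of a rescoring model), and the function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PSM:
    # Hypothetical record; field names are illustrative assumptions.
    spectrum_id: str
    peptide: str
    score: float      # assumed: higher score means a more confident match
    is_decoy: bool    # True if the peptide comes from the decoy database


def q_values(psms):
    """Return (psm, q_value) pairs sorted by descending score.

    FDR at each rank is estimated as (#decoys) / (#targets) among all PSMs
    scoring at least as well. The q-value is the smallest FDR at which the
    PSM would still be accepted.
    """
    ranked = sorted(psms, key=lambda p: p.score, reverse=True)
    targets = decoys = 0
    fdrs = []
    for p in ranked:
        if p.is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # Walk from the worst score upward, keeping the running minimum FDR.
    q = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q.append(running_min)
    q.reverse()
    return list(zip(ranked, q))


def accepted_targets(psms, fdr_threshold=0.01):
    """Target PSMs whose q-value is at or below the threshold."""
    return [p for p, qv in q_values(psms)
            if qv <= fdr_threshold and not p.is_decoy]
```

Under this scheme, a better scoring model is one that yields more accepted target PSMs at the same `fdr_threshold`, which is the sense in which two filtering methods can be compared at a matched FDR.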

Protein Inference and Quantitative Metaproteomics

After peptides are identified, proteins must still be inferred from them, and many peptides are shared among multiple proteins (degenerate peptides). This challenge becomes especially severe in microbial mixtures, where homologous proteins from related organisms share much of their sequence. We study methods that integrate taxonomic context, search evidence, and quantitative signals to generate more reliable protein-level results from shotgun proteomics data.

Representative efforts in this direction include MetaLP for protein inference and SEMQuant for quantitative metaproteomics with match-between-runs support. We are also interested in taxonomy-aware statistical control and in computational strategies for proteomic stable isotope probing, which helps reveal which microbial populations are actively assimilating labeled substrates in complex communities.
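
The shared-peptide problem can be illustrated with a small sketch. The code below uses a simple greedy parsimony heuristic (a greedy set cover), which is a deliberately simpler stand-in and not the linear programming formulation used by MetaLP. The input mapping and the function name are hypothetical.

```python
def greedy_protein_inference(peptide_to_proteins):
    """Pick a small set of proteins that explains all identified peptides.

    peptide_to_proteins: dict mapping each identified peptide to the set of
    protein ids that contain it (a hypothetical input format).
    Returns the selected protein ids in the order they were chosen.
    """
    # Invert the mapping: protein -> peptides it could explain.
    protein_to_peptides = {}
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            protein_to_peptides.setdefault(protein, set()).add(peptide)

    unexplained = set(peptide_to_proteins)
    selected = []
    while unexplained:
        # Choose the protein covering the most unexplained peptides.
        # Iterating over sorted ids makes tie-breaking deterministic.
        best = max(
            sorted(protein_to_peptides),
            key=lambda prot: len(protein_to_peptides[prot] & unexplained),
        )
        covered = protein_to_peptides[best] & unexplained
        if not covered:
            break  # remaining peptides map to no protein
        selected.append(best)
        unexplained -= covered
    return selected


if __name__ == "__main__":
    evidence = {
        "pep1": {"A"},
        "pep2": {"A", "B"},   # shared (degenerate) peptide
        "pep3": {"B", "C"},   # shared (degenerate) peptide
        "pep4": {"C"},
    }
    print(greedy_protein_inference(evidence))
```

In this toy example protein B has no unique peptide, so the heuristic explains all four peptides with A and C alone. A greedy choice like this ignores match confidence, taxonomy, and abundance, which is one reason more integrative formulations are of interest.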

Data-Independent Acquisition and De Novo Analysis

Data-independent acquisition (DIA) offers broad and reproducible sampling, but its spectra are highly multiplexed and require strong computational deconvolution. We develop tools for extracting pseudo-spectra, sequencing peptides, and identifying proteins from DIA data, especially in applications where spectral libraries are incomplete or the sample complexity is high.

Our related projects include IDIA, which extracts pseudo-spectra from DIA data, and Transformer-DIA, which explores transformer-based de novo peptide sequencing for DIA mass spectrometry. Together, these efforts aim to make DIA workflows more accurate, more flexible, and more informative for complex proteomics studies.
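
One common idea behind pseudo-spectrum extraction is that fragment ions from the same peptide co-elute with their precursor, so their extracted ion chromatograms (XICs) have similar shapes. The sketch below groups fragments with a precursor by Pearson correlation of elution profiles. It is an illustration of that general idea only and does not describe the IDIA algorithm. The function names, the input layout, and the `min_corr` threshold are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length intensity traces."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat trace carries no shape information
    return cov / sqrt(var_x * var_y)


def build_pseudo_spectrum(precursor_xic, fragment_xics, min_corr=0.9):
    """Assemble a pseudo-spectrum for one precursor.

    precursor_xic: precursor intensities over consecutive scans.
    fragment_xics: dict mapping fragment m/z to intensities over the same
                   scans (hypothetical input layout).
    Returns (m/z, apex intensity) pairs sorted by m/z, keeping only
    fragments whose elution profile correlates with the precursor.
    """
    peaks = []
    for mz, xic in fragment_xics.items():
        if pearson(precursor_xic, xic) >= min_corr:
            peaks.append((mz, max(xic)))
    return sorted(peaks)
```

The resulting peak list can then be handed to a conventional database search or a de novo sequencing model as if it were a data-dependent MS/MS spectrum. Real DIA data would also require peak detection, retention-time alignment, and handling of interfering precursors, none of which is shown here.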

Scalable and Reproducible Proteomics Computing

Proteomics data volumes continue to grow, and practical adoption depends on workflows that are both accurate and accessible. We build scalable software systems that combine web access, high-performance computing, and reproducible execution to support real biological studies rather than isolated benchmark runs.

CloudProteoAnalyzer is one example of this effort. It provides an end-to-end cloud-based workflow for proteomics analysis and demonstrates how parallel computing can make large proteomics and metaproteomics studies easier to process and compare.

Selected Software and Resources

WinnowNet: deep learning-based filtering for peptide-spectrum matches in metaproteomics.
SEMQuant: quantitative metaproteomics with match-between-runs support.
MetaLP: protein inference in metaproteomics using an integrative linear programming framework.
IDIA: pseudo-spectrum extraction for data-independent acquisition proteomics.
CloudProteoAnalyzer: scalable cloud-based proteomics data analysis.
Transformer-DIA: transformer-based de novo peptide sequencing for DIA mass spectrometry.
DeepFilter: deep learning for peptide identification in metaproteomics.
NIH/NLM R15 Project: lab project page for computational framework development in metaproteomics using DIA.
Selected Publications: full publication list for the lab.

Representative Publications

Enhancing peptide identification in metaproteomics through curriculum learning in deep learning, Nature Communications, 2025.
Proteomic stable isotope probing with an upgraded Sipros algorithm for improved identification and quantification of isotopically labeled proteins, Microbiome, 2024.
CloudProteoAnalyzer: scalable processing of big data from proteomics using cloud computing, Bioinformatics Advances, 2024.
MetaLP: An integrative linear programming method for protein inference in metaproteomics, PLOS Computational Biology, 2022.
Deep learning for peptide identification from metaproteomics datasets, Journal of Proteomics, 2021.
IDIA: An Integrative Signal Extractor for Data-Independent Acquisition Proteomics, IEEE BIBM, 2022.

Students interested in this area can also see our undergraduate research assistantship page for opportunities related to mass spectrometry and computational proteomics.