Abstract:
Studying microbes is essential because they influence nearly every aspect of our planet’s ecosystems and living organisms. Microbiomes exist across diverse environments, including marine and soil habitats, the human body, and even the International Space Station (ISS). The human gut microbiome alone contains approximately 22 million genes, reflecting its immense complexity and functional potential. Traditional culture-based techniques are often ineffective because most microbes cannot be cultured. Advancements in sequencing technologies have enabled faster and more cost-effective analysis of microbial genetic information. Amplicon sequencing provides taxonomic insights, while whole-genome shotgun sequencing (metagenomics) allows simultaneous analysis of taxonomic composition and functional diversity.
Alignment-based methods, which align reads to reference databases and bin them into taxonomic and functional categories, are powerful tools. However, these methods face significant computational challenges due to the exponential growth of sequencing data and reference databases. They can also generate terabyte-scale files for large projects, underscoring the need for efficient data management and accessibility. To study microbiomes comprehensively, it is necessary to move beyond sequencing and adopt multi-omics approaches, such as metabolomics, which provide a deeper understanding of microbiome functionality. Integrating such diverse datasets remains challenging and requires methods that are efficient and that researchers can use without cumbersome workflows. The same holds true for bioinformatics methods beyond microbiome research, where simplifying complex workflows is equally important.
Addressing these challenges, this dissertation focuses on the key question: How can we optimize methods to enable users to easily analyze metagenomic data, seamlessly integrate different types of data, and ensure accessibility for analysis, even with limited computational resources or expertise?
This question is divided into four aims. For Aim I, explored in Chapter I, we focused on optimizing databases for alignment-based approaches. The NCBI-nr protein database, which contains protein sequences from all domains of life, continues to grow (reaching ≈ 812 million proteins as of October 2024) and has become a significant computational bottleneck for alignment-based methods. We explored this challenge in the context of the DIAMOND+MEGAN approach, which determines the taxonomic composition and functional potential of microbial communities and allows exploratory analysis of the results in the MEGAN GUI. We addressed the challenge in two scenarios: (1) when researchers are interested only in the prokaryotic content of a metagenomic sample, and (2) when exploration extends beyond the prokaryotic fraction.
In the first case, we investigated AnnoTree, a protein database based on GTDB, and compared it with the prokaryotic portion of NCBI-nr. Our results showed that AnnoTree maintained similar alignment and assignment rates, while outperforming NCBI-nr by assigning more reads to KEGG functional categories. AnnoTree demonstrated comparable specificity, with minor trade-offs, and was twice as fast when tested on metagenomic datasets.
In the second case, we evaluated UniRef clustered versions (100, 90, 50) and NCBI-nr clustered versions (90, 50), which include protein sequences from all domains of life, for broader exploration beyond prokaryotic content. Our results showed that higher-resolution clustered databases exhibited alignment and assignment rates similar to the full NCBI-nr database, with slightly improved taxonomic assignment rates while maintaining agreement and specificity, albeit with trade-offs. Furthermore, all clustered versions demonstrated better assignment rates for functional categories, with a minor exception. Importantly, these optimized databases were significantly smaller and provided a substantial speedup (NCBI-nr50, for instance, achieved a 17-fold speedup), thus greatly reducing computational costs.
For Aim II, explored in Chapter II, we focused on providing novel methods and tools for efficient metagenomic data management and accessibility. Alignment files, such as DIAMOND Alignment Archives (DAAs), can reach terabyte scale for large projects because a single read may generate multiple hits in a database. Files of this size limit accessibility when users need to view alignments or perform further analyses. To address this challenge, particularly in the context of the DIAMOND+MEGAN approach, we developed MeganServer, a solution that eliminates the need for local file access.
MeganServer serves these files through a REST API, allowing users to access and analyze data directly from the server where it was processed, without downloading it locally. Users can interact with the data through any programming language, a web browser, or, most efficiently, the MEGAN GUI, which acts as a client to the server. With the MEGAN GUI, users can perform comprehensive analyses and explore large metagenomic datasets without requiring programming expertise. Our results demonstrate that this approach is highly effective, significantly improving data accessibility and streamlining metagenomic analysis workflows.
For Aim III, explored in Chapter III, we focused on developing innovative approaches to integrate different types of data, a critical component of microbiome-based analysis. To this end, we introduced the Microbiome Metabolome Integration Platform (MMIP), a user-friendly web resource designed to integrate datasets at different levels. MMIP takes taxonomic profiles generated from amplicon sequencing data, derives functional profiles from them, and predicts metabolic potential. It thereby integrates taxonomic, functional, and metabolic potential data. It also enables users to correlate predicted metabolic profiles with real-time metabolomic data and identify those that are positively or negatively correlated.
MMIP leverages methods such as Community-wide Metabolic Potential (CMP) for metabolic potential generation and employs learning-based approaches to identify important features. The platform also supports a range of microbiome analyses, including taxonomic comparisons, α- and β-diversity assessments, functional evaluations, and feature identification. Our results on validation datasets were comparable to those reported in published studies. With its intuitive web interface, MMIP eliminates the need for programming or advanced integration expertise, making comprehensive microbiome data integration accessible and straightforward for researchers.
Finally, Aim IV focuses on designing user-friendly web resources to support diverse bioinformatics analyses, explored in Chapter IV. I developed the web resources for PLaBAse (for plant growth promotion), DeepToA (for predicting metagenomic sample sources), and MuLan-Methyl (for methylation prediction). While the conceptualization, background methods, and analyses were performed by my colleagues Sascha Patz (PLaBAse) and Wenhuan Zeng (DeepToA, MuLan-Methyl), I implemented the entire web architecture and backend workflows/pipelines to ensure these tools are easily accessible to users.
These resources provide access to the novel Plant Growth-Promoting Traits (PGPT) ontology developed by Sascha Patz and to the deep learning- and language-model-based approaches implemented by Wenhuan Zeng. By enabling efficient and user-friendly bioinformatics analyses, these tools eliminate the need for high-end computational resources or advanced programming skills, making complex analyses more accessible to a broader range of users.
Taken together, the aims addressed in this thesis provide optimized methods that enable users to easily analyze metagenomic data, seamlessly integrate different types of data, and ensure accessibility for analysis, even with limited computational resources or expertise. These contributions advance microbiome research and related fields by delivering freely available resources, databases, tools, integrated systems, and web platforms that streamline bioinformatics analyses. Researchers can leverage these resources directly or build upon them, deepening our understanding of microbial communities and other research areas while paving the way for future discoveries.