Predicting Age from Microbiome Data: Benchmarking Multi-Source Machine Learning Methods
Ishraq, Shadman (2024-12-30)
This publication is subject to copyright regulations. The work may be read and printed for personal use. Use for commercial purposes is prohibited.
Open
The permanent address of the publication is:
https://urn.fi/URN:NBN:fi-fe202502039230
Abstract
The microbiome holds significant potential as a predictor of biological processes, including age, due to its dynamic interaction with human health. This study addressed the challenge of predicting age from microbiome data by benchmarking tree-based machine learning models such as Random Forest (RF), Gradient Boosting Machine (GBM), and Extreme Gradient Boosting (XGBoost), in addition to the IntegratedLearner method. The LifeLines DEEP dataset was used, incorporating relative abundance, marker abundance, and pathway abundance data to predict age. Both single-omic and multi-omics models were developed in order to evaluate the impact of data integration on predictive performance. The results demonstrated that multi-omics models outperformed single-omic models, with GBM trained on multi-omics datasets and the stacked model used by the IntegratedLearner method achieving the highest predictive accuracy. Functional datasets, particularly pathway abundance, exhibited stronger correlations with age than the taxonomic dataset, underscoring their significance for age prediction. Despite challenges posed by sparse, zero-inflated data and limited microbial diversity, the findings suggest that multi-omics integration enhances model performance and provides valuable insights into age-related biological processes.
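The comparison described in the abstract (single-omic models, a model trained on concatenated layers, and a stacked model) can be illustrated with a minimal sketch. It is not the thesis code and not the IntegratedLearner implementation. It assumes scikit-learn, synthetic data in place of LifeLines DEEP, a linear meta-learner for stacking, and hypothetical names (`make_synthetic_layers`, `oof_predictions`, `benchmark`) and hyperparameters chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_predict


def make_synthetic_layers(n_samples=150, seed=0):
    """Create three fake feature layers with different amounts of age signal.

    Assumption: purely synthetic stand-ins for relative abundance, marker
    abundance, and pathway abundance tables.
    """
    rng = np.random.default_rng(seed)
    age = rng.uniform(20, 80, size=n_samples)
    layers = {}
    for name, signal in [("taxa", 0.3), ("markers", 0.5), ("pathways", 0.8)]:
        X = rng.normal(size=(n_samples, 20))
        # Only the first feature of each layer carries age information.
        X[:, 0] += signal * (age - 50) / 15
        layers[name] = X
    return layers, age


def oof_predictions(X, y, cv):
    """Out-of-fold predictions from a small gradient boosting model."""
    model = GradientBoostingRegressor(n_estimators=50, max_depth=2, random_state=0)
    return cross_val_predict(model, X, y, cv=cv)


def benchmark(layers, y, n_splits=5):
    """Return cross-validated R^2 per layer, for concatenation, and for stacking."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores, oof = {}, {}

    # Single-omic models: one model per layer.
    for name, X in layers.items():
        oof[name] = oof_predictions(X, y, cv)
        scores[name] = r2_score(y, oof[name])

    # Multi-omics by concatenation: all layers side by side in one matrix.
    X_concat = np.hstack(list(layers.values()))
    scores["concatenated"] = r2_score(y, oof_predictions(X_concat, y, cv))

    # Multi-omics by stacking: a meta-learner on the per-layer predictions.
    # Simplification: a fully nested scheme would be stricter about leakage.
    Z = np.column_stack([oof[name] for name in layers])
    stacked = cross_val_predict(LinearRegression(), Z, y, cv=cv)
    scores["stacked"] = r2_score(y, stacked)
    return scores


if __name__ == "__main__":
    layers, age = make_synthetic_layers()
    for name, score in benchmark(layers, age).items():
        print(f"{name:>12}: R^2 = {score:.3f}")
```

Because the data are synthetic, the scores say nothing about the thesis results; the sketch only shows how the three modelling setups differ in what the model is given as input.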