TY - JOUR
T1 - Machine learning approaches in microbiome research
T2 - challenges and best practices
AU - Papoutsoglou, Georgios
AU - Tarazona, Sonia
AU - Lopes, Marta B.
AU - Klammsteiner, Thomas
AU - Ibrahimi, Eliana
AU - Eckenberger, Julia
AU - Novielli, Pierfrancesco
AU - Tonda, Alberto
AU - Simeon, Andrea
AU - Shigdel, Rajesh
AU - Béreux, Stéphane
AU - Vitali, Giacomo
AU - Tangaro, Sabina
AU - Lahti, Leo
AU - Temko, Andriy
AU - Claesson, Marcus J.
AU - Berland, Magali
N1 - Publisher Copyright: Copyright © 2023 Papoutsoglou, Tarazona, Lopes, Klammsteiner, Ibrahimi, Eckenberger, Novielli, Tonda, Simeon, Shigdel, Béreux, Vitali, Tangaro, Lahti, Temko, Claesson and Berland.
PY - 2023
Y1 - 2023
AB - Predictive analysis of microbiome data within a machine learning (ML) workflow presents numerous domain-specific challenges involving preprocessing, feature selection, predictive modeling, performance estimation, model interpretation, and the extraction of biological information from the results. To assist decision-making, we offer a set of recommendations on algorithm selection and on pipeline creation and evaluation, stemming from the COST Action ML4Microbiome. We compared the suggested approaches on a multi-cohort shotgun metagenomics dataset of colorectal cancer patients, focusing on their performance in disease diagnosis and biomarker discovery. We demonstrated that using compositional transformations and filtering methods as part of data preprocessing does not always improve the predictive performance of a model. In contrast, multivariate feature selection, such as the Statistically Equivalent Signatures algorithm, was effective in reducing the classification error. When validated on a separate test dataset, this algorithm, in combination with random forest modeling, provided the most accurate performance estimates. Lastly, we showed how linear modeling by logistic regression, coupled with visualization techniques such as Individual Conditional Expectation (ICE) plots, can yield interpretable results and offer biological insights. These findings are significant for clinicians and non-experts alike in translational applications.
KW - AutoML
KW - colorectal cancer
KW - feature selection
KW - machine learning methods
KW - microbiome data analysis
KW - model selection
KW - predictive modeling
KW - preprocessing
UR - https://www.scopus.com/pages/publications/85173758787
U2 - 10.3389/fmicb.2023.1261889
DO - 10.3389/fmicb.2023.1261889
M3 - Review article
AN - SCOPUS:85173758787
VL - 14
JO - Frontiers in Microbiology
JF - Frontiers in Microbiology
M1 - 1261889
ER -