
Machine learning approaches in microbiome research: challenges and best practices

  • Georgios Papoutsoglou
  • Sonia Tarazona
  • Marta B. Lopes
  • Thomas Klammsteiner
  • Eliana Ibrahimi
  • Julia Eckenberger
  • Pierfrancesco Novielli
  • Alberto Tonda
  • Andrea Simeon
  • Rajesh Shigdel
  • Stéphane Béreux
  • Giacomo Vitali
  • Sabina Tangaro
  • Leo Lahti
  • Andriy Temko
  • Marcus J. Claesson
  • Magali Berland
  • University of Crete
  • Foundation for Research and Technology-Hellas
  • Polytechnic University of Valencia
  • NOVA University Lisbon
  • University of Innsbruck
  • University of Tirana
  • University of Bari
  • National Institute for Nuclear Physics
  • Université Paris-Saclay
  • CNRS
  • University of Novi Sad
  • University of Bergen
  • University of Turku

Research output: Contribution to journal, Review article, peer-review

Abstract

Predictive analysis of microbiome data within a machine learning (ML) workflow presents numerous domain-specific challenges involving preprocessing, feature selection, predictive modeling, performance estimation, model interpretation, and the extraction of biological information from the results. To assist decision-making, we offer a set of recommendations on algorithm selection and on pipeline creation and evaluation, stemming from the COST Action ML4Microbiome. We compared the suggested approaches on a multi-cohort shotgun metagenomics dataset of colorectal cancer patients, focusing on their performance in disease diagnosis and biomarker discovery. We demonstrate that using compositional transformations and filtering methods as part of data preprocessing does not always improve the predictive performance of a model. In contrast, multivariate feature selection, such as the Statistically Equivalent Signatures algorithm, was effective in reducing the classification error. When validated on a separate test dataset, this algorithm, in combination with random forest modeling, provided the most accurate performance estimates. Lastly, we showed how linear modeling by logistic regression, coupled with visualization techniques such as Individual Conditional Expectation (ICE) plots, can yield interpretable results and offer biological insights. These findings are significant for clinicians and non-experts alike in translational applications.
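
As an illustration of what a compositional transformation does in the preprocessing step, the sketch below applies the centered log-ratio (CLR) transform to a single sample. The abstract does not state which transformations were compared, so CLR is used here only as one widely known example; the function name `clr_transform`, the pseudocount of 0.5, and the sample counts are assumptions made for this illustration, not values from the paper.

```python
import math


def clr_transform(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's taxon counts.

    The pseudocount (an assumed value, not taken from the paper) is added
    to every count so that zero counts do not produce log(0).
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log equals dividing by the geometric mean.
    return [value - mean_log for value in logs]


# Hypothetical sample: read counts for four taxa.
sample = [120, 30, 0, 850]
clr_values = clr_transform(sample)
```

The transformed values of a sample sum to zero, and the ranking of taxa by abundance within the sample is preserved. Whether such a transformation helps a downstream classifier is an empirical question, which is the point the abstract makes.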

Original language: English
Article number: 1261889
Journal: Frontiers in Microbiology
Volume: 14
DOIs
Publication status: Published - 2023

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 3 - Good Health and Well-being

Keywords

  • AutoML
  • colorectal cancer
  • feature selection
  • machine learning methods
  • microbiome data analysis
  • model selection
  • predictive modeling
  • preprocessing
