Combination of Principal Component Analysis and Genetic Algorithm for Microbial Biomarker Identification in Obesity

File version

Accepted Manuscript (AM)

Author(s)
Zhang, P
West, N
Chen, PY
Cripps, A
Cox, A
Editor(s)

Zheng, H
Callejas, Z
Griol, D
Wang, H
Hu, X
Schmidt, H
Baumbach, J
Dickerson, J
Zhang, L

Date
2019
Location

Madrid, Spain

Abstract

Background: A large number of microbial species have been detected in human faecal samples, and many of these species are highly correlated with each other. Principal component analysis (PCA) is often used to find characteristic patterns associated with certain diseases by reducing the number of variables before a predictive model is built, particularly when some variables are correlated. Usually, the first two or three components from PCA are used to see whether individuals can be clustered into two predetermined classification groups: control and disease. However, a combination of other components might better distinguish diseased individuals from healthy controls. Genetic algorithms (GA) can be useful and efficient for searching for the best combination of variables with which to build a prediction model. This study aimed to develop a prediction model that combines PCA and GA to identify sets of bacterial species associated with high body mass.

Results: GA selected subsets of the principal components (PCs) produced by PCA. The prediction models built with these PCs produced much higher area under the curve (AUC) values than the models built using the top PCs, which explained the most variance in the sample. The combinatorial effect of the identified bacterial species that contributed the most to the selected PCs may be associated with body mass.

Conclusions: The proposed algorithm overcomes the limitation of using PCA for prediction modelling. Applying the algorithm to an obesity study has shown the value of using GA to select PC subsets from PCA to improve prediction models. The variables included in the PCs selected by GA can be combined flexibly for potential clinical applications. The algorithm can be useful for many biological studies in which high-dimensional data with highly correlated variables are collected.
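The idea summarised in the abstract, letting a GA choose which PCs enter the prediction model rather than taking the top-variance PCs by default, can be illustrated with a small sketch. This is not the authors' implementation. It assumes PC scores have already been computed, uses synthetic toy data in which a low-variance PC carries the class signal, and scores each candidate subset with the in-sample AUC of a simple class-mean-difference score. All function names, parameter values, and the toy data are hypothetical choices made for illustration only.

```python
import random

def auc(scores, labels):
    """Rank-based AUC: chance that a case outranks a control (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fitness(mask, pcs, labels):
    """In-sample AUC of a class-mean-difference score on the PCs switched on in mask.
    Assumption: a real study would use a proper classifier with cross-validation."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0
    pos = [r for r, y in zip(pcs, labels) if y == 1]
    neg = [r for r, y in zip(pcs, labels) if y == 0]
    # Weight of each selected PC = difference between the two class means.
    w = {j: sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg)
         for j in cols}
    return auc([sum(w[j] * r[j] for j in cols) for r in pcs], labels)

def ga_select(pcs, labels, pop_size=30, generations=40, p_mut=0.05, seed=0):
    """Minimal GA over binary masks: elitist selection, one-point crossover,
    bit-flip mutation. Hyperparameters are illustrative, not tuned."""
    rng = random.Random(seed)
    n = len(pcs[0])
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=lambda m: fitness(m, pcs, labels), reverse=True)
        elite = ranked[: pop_size // 2]           # keep the better half unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n)             # one-point crossover
            child = a[:cut] + b[cut:]
            children.append([bit ^ (rng.random() < p_mut) for bit in child])
        pop = elite + children
    best = max(pop, key=lambda m: fitness(m, pcs, labels))
    return best, fitness(best, pcs, labels)

def make_toy_pcs(n=80, n_pcs=8, informative=5, seed=1):
    """Synthetic 'PC scores': variance shrinks with PC index, and only one
    low-variance PC (index `informative`) is shifted in the case group."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n)]
    pcs = []
    for y in labels:
        row = [rng.gauss(0.0, 3.0 / (j + 1)) for j in range(n_pcs)]
        row[informative] += 1.5 * y
        pcs.append(row)
    return pcs, labels

pcs, labels = make_toy_pcs()
top2_auc = fitness([1, 1, 0, 0, 0, 0, 0, 0], pcs, labels)  # top-variance PCs only
best_mask, best_auc = ga_select(pcs, labels)
print("AUC with first two PCs:", round(top2_auc, 3))
print("GA-selected PCs:", [j for j, b in enumerate(best_mask) if b],
      "AUC:", round(best_auc, 3))
```

In this toy setting the first two PCs explain the most variance but carry no class signal, so a subset found by the search can separate the groups better than the default choice. Because fitness is measured on the same data used to set the weights, the reported AUC is optimistic; a held-out or cross-validated AUC would be the safer objective in practice.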

Conference Title

Proceedings - 2018 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2018

Rights Statement

© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Subject

Genetics
