Identification of known and novel recurrent viral sequences in data from multiple patients and multiple cancers

File version
Version of Record (VoR)
Author(s)
Friis-Nielsen, Jens
Kjartansdottir, Kristin Ros
Mollerup, Sarah
Asplund, Maria
Mourier, Tobias
Jensen, Randi Holm
Hansen, Thomas Arn
Rey-Iglesia, Alba
Richter, Stine Raith
Nielsen, Ida Broman
Alquezar-Planas, David E.
Olsen, Pernille V. S.
Vinner, Lasse
Fridholm, Helena
Nielsen, Lars Peter
Willerslev, Eske
Sicheritz-Ponten, Thomas
Lund, Ole
Hansen, Anders Johannes
Izarzugaza, Jose M. G.
Brunak, Soren
Year published
2016
Abstract
Virus discovery from high-throughput sequencing data often follows a bottom-up approach in which taxonomic annotation takes place prior to association with disease. Although effective in some cases, this approach fails to detect novel pathogens and remote variants not present in reference databases. We have developed a species-independent pipeline that utilises sequence clustering to identify nucleotide sequences that co-occur across multiple sequencing data instances. We applied the workflow to 686 sequencing libraries from 252 cancer samples of different cancer and tissue types, 32 non-template controls, and 24 test samples. Recurrent sequences were statistically associated with biological, methodological or technical features with the aim of identifying novel pathogens or plausible contaminants that may be associated with a particular kit or method. We provide examples of identified inhabitants of the healthy tissue flora as well as experimental contaminants. Unmapped sequences that co-occur with high statistical significance potentially represent the unknown sequence space where novel pathogens can be identified.
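The abstract describes two steps: finding sequence clusters that recur across samples, and statistically associating them with a feature such as a kit or method. The following is a minimal illustrative sketch of that idea, not the authors' pipeline. The abstract does not name the statistical test used, so the one-sided Fisher's exact test here is an assumption, and all names (`recurrent_clusters`, `fisher_enrichment_p`, the sample and cluster labels) and the toy data are hypothetical.

```python
from math import comb


def recurrent_clusters(cluster_to_samples, min_samples=2):
    """Keep clusters whose member sequences were seen in at least min_samples samples."""
    return {
        cluster: samples
        for cluster, samples in cluster_to_samples.items()
        if len(samples) >= min_samples
    }


def fisher_enrichment_p(cluster_samples, feature_samples, all_samples):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    Returns the probability of seeing at least the observed overlap between
    the samples containing a cluster and the samples carrying a feature,
    if the cluster were distributed independently of the feature.
    """
    cluster_samples = cluster_samples & all_samples
    feature_samples = feature_samples & all_samples
    n_total = len(all_samples)
    n_feature = len(feature_samples)
    n_cluster = len(cluster_samples)
    overlap = len(cluster_samples & feature_samples)
    upper = min(n_feature, n_cluster)
    favourable = sum(
        comb(n_feature, k) * comb(n_total - n_feature, n_cluster - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(n_total, n_cluster)


# Hypothetical toy data: eight samples, four of them prepared with "kit A".
all_samples = {"s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"}
kit_a_samples = {"s1", "s2", "s3", "s4"}

# Hypothetical clustering result: cluster id -> samples containing its sequences.
cluster_to_samples = {
    "c1": {"s1", "s2", "s3", "s4"},  # seen only in kit A samples
    "c2": {"s1", "s5"},              # seen in one sample of each group
    "c3": {"s6"},                    # seen in a single sample only
}

recurrent = recurrent_clusters(cluster_to_samples, min_samples=2)
p_values = {
    cluster: fisher_enrichment_p(samples, kit_a_samples, all_samples)
    for cluster, samples in recurrent.items()
}

for cluster in sorted(p_values):
    print(cluster, round(p_values[cluster], 4))
```

In this toy setting, a cluster confined to the samples sharing one kit receives a small p-value and would be flagged as a plausible kit-associated contaminant, while a cluster spread evenly across groups would not. A real analysis over hundreds of libraries would also need multiple-testing correction, which this sketch omits.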
Journal Title
Viruses
Volume
8
Issue
2
Copyright Statement
© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons by Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).
Subject
Microbiology not elsewhere classified
Microbiology