Evaluating performance of metagenomic characterization algorithms using in silico datasets generated with FASTQSim

Abstract

Background
In silico bacterial, viral, and human truth datasets were generated to evaluate available metagenomics algorithms. Sequenced datasets include background organisms, creating ambiguity in the true source organism for each read. Bacterial and viral datasets were created with even and staggered coverage to evaluate the organism identification, read mapping, and gene identification capabilities of available algorithms. These truth datasets are provided as a resource for the development and refinement of metagenomic algorithms. Algorithm performance on these truth datasets can inform decision makers about the strengths and weaknesses of available algorithms and how the results may best be leveraged for bacterial and viral organism identification and characterization.

Source organisms were selected to mirror communities described in the Human Microbiome Project as well as the emerging pathogens listed by the National Institute of Allergy and Infectious Diseases. The six in silico datasets were used to evaluate the performance of six leading metagenomics algorithms: MetaScope, Kraken, LMAT, MetaPhlAn, MetaCV, and MetaPhyler.

Results
Algorithms were evaluated on runtime, true positive organisms identified to the genus and species levels, false positive organisms identified to the genus and species levels, read mapping, relative abundance estimation, and gene calling. No algorithm outperformed the others in all categories, and the algorithm or algorithms of choice depend strongly on the analysis goals: MetaPhlAn excels for bacteria and LMAT for viruses. The algorithms were ranked by overall performance using a normalized weighted sum of the above metrics, and MetaScope emerged as the overall winner, followed by Kraken and LMAT.
Conclusions
Simulated FASTQ datasets with well-characterized truth data about microbial community composition reveal numerous insights about the relative strengths and weaknesses of the metagenomics algorithms evaluated. The simulated datasets are available for download from the Sequence Read Archive (SRP062063).
Background

[…] the combination with an optimized database and another version of the least common ancestors algorithm. MetaPhlAn [9] relies on unique clade-specific marker genes identified from 3,000 reference genomes. The Livermore Metagenomic Analysis Toolkit (LMAT) exploits genetic relationships between different organisms by pre-computing the occurrence of each short sequence across the entire reference database and storing the evolutionarily conserved sequence patterns [10-12]. MetaCV translates nucleotide sequences into six-frame peptides, which are then decomposed into k-mers. The k-mer frequency is computed over a protein reference database and used to assign k-mer weights [13]. Finally, MetaPhyler uses a precomputed database of reference phylogenetic marker genes to build a sequence classifier.

The ART simulator was successfully used by the 1000 Genomes Project Consortium to examine the effects of read length and PE insert size on a read's ability to map to the human genome [19].

In this study, six in silico datasets were simulated with the FASTQsim tool. Figure S1 illustrates the composition of each dataset. These datasets contained sequences from reference bacterial and viral genomes, as most human pathogens are members of these taxa. The HMP Even and HMP Staggered datasets were generated to include sequences from the 20 organisms from the Human Microbiome Project [20] (Supplementary Table 1).
The HMP organisms were selected for inclusion after an attempt to benchmark the performance of MetaScope with the HMP dataset revealed potential contamination in the dataset. As the HMP benchmark dataset was generated by sequencing organisms cultured in vitro, there was no absolute truth for any background contaminant organisms in the dataset, and it was not possible to determine whether the contamination was real or whether MetaScope was calling false positive organisms.

The bacterial dataset (Supplementary […]

Improvements to FASTQsim
The FASTQsim toolkit was augmented to annotate gene information for simulated reads [27]. The "FASTQmapGenes" functionality was added, allowing users to specify NCBI accession IDs to use for annotating gene information in simulated reads. The FASTQsim toolkit uses the Entrez and SeqIO libraries from BioPython [28] to download the specified files from GenBank in .gb format. The GenbankParser [29] Java application is then used to parse the .gb files and extract all information encoded in the CDS and Gene tags. These gene and CDS annotations are appended to the headers within the simulated FASTQ files generated by FASTQsim, such that all reads that fall within a CDS or gene region are annotated with the corresponding CDS and gene information.

In silico data generation
The FASTQsim toolkit was used to generate six in silico datasets. All were generated with the Illumina error and read-length profile included with FASTQsim version 2.0, with no host background added. Specifically, a read length of 150 bases was used, with single-base mutation, insertion, and deletion rates as specified in the FASTQsim v. […] Two in silico datasets were generated, "HMP Even" and "HMP Staggered" (Supplementary Table 1).
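The annotation workflow described under "Improvements to FASTQsim" (download .gb records via BioPython's Entrez and SeqIO, then tag reads falling within CDS or gene regions) can be sketched in Python. This is an illustrative reimplementation, not FASTQsim's actual code; the coordinate-overlap rule and the tuple layout are assumptions.

```python
from Bio import Entrez, SeqIO  # BioPython, as used by FASTQsim

Entrez.email = "user@example.org"  # NCBI requires a contact address

def fetch_gene_annotations(accession):
    """Download a GenBank record and collect (start, end, gene, product)
    tuples for its CDS and gene features."""
    handle = Entrez.efetch(db="nucleotide", id=accession,
                           rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()
    annotations = []
    for feature in record.features:
        if feature.type in ("CDS", "gene"):
            annotations.append((int(feature.location.start),
                                int(feature.location.end),
                                feature.qualifiers.get("gene", [""])[0],
                                feature.qualifiers.get("product", [""])[0]))
    return annotations

def annotate_read(read_start, read_end, annotations):
    """Return the annotations overlapping a simulated read's coordinates,
    mirroring the FASTQ-header tagging step described in the text."""
    return [a for a in annotations if read_start < a[1] and read_end > a[0]]
```

In the paper's pipeline the overlapping annotations would be appended to the corresponding FASTQ read headers; here they are simply returned.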
For the HMP Even dataset, FASTQsim was executed to provide an equal number of reads for each species (approximately 60,000 reads per species), with one exception: 559 reads for Streptococcus agalactiae were added to simulate a low-level contaminant organism. Version 2.0 of the FASTQsim algorithm probabilistically simulated read counts and error distributions based on a provided model. Due to the probabilistic nature of the algorithm, coverage levels deviated slightly from the specified 60,000 reads, with the largest deviation observed for the E. faecalis organism (52,290 reads). For the HMP Staggered dataset, coverage levels varied from 11.3x (217,512 reads) for Actinomyces odontolyticus to 0.001x (2 reads) for Neisseria meningitidis. The goal of the staggered dataset was to evaluate the ability of metagenomic algorithms to detect organisms present at very low concentrations, i.e., fewer than 5 reads.

The bacterial dataset included reads from the genera Yersinia, Coxiella, Brucella, Salmonella, and Escherichia.

Runtime in seconds, true positive genus and species identifications, false positive genus and species identifications, and false negative species calls were determined for each of the metagenomic algorithms (Figure 1). Among the algorithms evaluated, only MetaScope mapped a small number of reads in our datasets to a taxon rank below species. Consequently, although the initial focus of the bacterial dataset was to assess the ability of the algorithms to distinguish between different strains of the same species, it was decided to evaluate both true and false positives at the species and genus levels.

[bioRxiv preprint, https://doi.org/10.1101/046532. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.]
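The coverage levels quoted above follow the standard relation between read count, read length, and genome length; a minimal sketch, where the ~2.89 Mb genome length is back-calculated from the quoted numbers rather than stated in the paper:

```python
def fold_coverage(n_reads, read_len_bp, genome_len_bp):
    """Average fold coverage: total sequenced bases over genome length."""
    return n_reads * read_len_bp / genome_len_bp

# 217,512 reads of 150 bp at 11.3x back-implies a genome of roughly
# 217512 * 150 / 11.3 ≈ 2.89 Mb (an inference, not a figure from the paper).
cov = fold_coverage(217512, 150, 2_887_000)
```

The same relation explains why 2 reads can correspond to a coverage on the order of 0.001x for a bacterial-scale genome.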
To determine an overall rank of the algorithms across the datasets, the area occupied by each in the radar plot was computed (Table 1). When the polygon area was calculated using the MATLAB polyarea function and summed across all datasets, MetaScope emerged as the winner, with the largest overall area. Kraken and LMAT were the runners-up, and MetaPhyler performed the worst. In addition to the algorithms' overall rank, several trends can be noted in the individual performance categories.

The algorithms diverged in runtime by several orders of magnitude. […] In addition to its high speed, MetaPhlAn also achieved the highest accuracy, defined as the ratio of true […] with fewer than 100 reads assigned to each. Kraken has a similar false positive profile; it reports 1,266 species that account for <1% of the reads in the dataset. MetaCV reports 2,998 false organisms with low read counts, and LMAT reports 1,118 species that account for less than 0.01% of the reads. MetaPhyler does not report results more specific than the Class taxonomy level for the Human dataset, in line with the conservative approach of this algorithm. MetaPhlAn crashes with a segmentation fault on the Human dataset, which is most likely an artifact of the non-host-filtering approach used by this algorithm.

The algorithms were evaluated based on their ability to correctly map reads and predict the relative abundance of the organisms in the data (Figures 2, 3). For the bacterial datasets, Kraken and MetaScope classified the highest number of reads correctly at both the genus and species levels, and cluster closest to the truth in the dendrogram. However, for the viral datasets, LMAT performed best, classifying the most reads correctly.
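The radar-plot ranking described above reduces to a polygon-area computation. A Python equivalent of MATLAB's polyarea (the shoelace formula), applied to a radar polygon; the metric scores below are hypothetical placeholders, not values from the paper:

```python
import math

def polyarea(xs, ys):
    """Shoelace formula: area of a simple polygon from vertex coordinates
    (the single-polygon behavior of MATLAB's polyarea)."""
    n = len(xs)
    s = sum(xs[i] * ys[(i + 1) % n] - xs[(i + 1) % n] * ys[i]
            for i in range(n))
    return abs(s) / 2.0

def radar_area(scores):
    """Area of a radar-plot polygon for normalized scores in [0, 1],
    one axis per metric, axes spaced evenly around the circle."""
    n = len(scores)
    xs = [r * math.cos(2 * math.pi * i / n) for i, r in enumerate(scores)]
    ys = [r * math.sin(2 * math.pi * i / n) for i, r in enumerate(scores)]
    return polyarea(xs, ys)

# Hypothetical normalized scores on six metrics (e.g., runtime, TP genus,
# TP species, FP rate, read mapping, gene calling) for two algorithms:
a = radar_area([0.9, 0.8, 0.85, 0.7, 0.9, 0.6])
b = radar_area([0.5, 0.6, 0.4, 0.9, 0.5, 0.3])
# Under the paper's scheme, the larger summed area ranks first.
```

Note that radar-plot area is sensitive to the ordering of axes (adjacent high scores multiply), which is one reason the paper also reports per-category results.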
Although the Actinomyces odontolyticus (NZ_DS264586.1) organism had the highest coverage (11.3x, 217,512 reads) in the HMP Staggered dataset, the algorithms on the whole did not perform well on this organism. It was not identified by the Kraken, MetaCV, and MetaPhyler algorithms, and was called at a low level by MetaScope (153 reads) (Figure 2g). MetaCV mapped the most reads correctly, 108,211 (49.7%), and MetaPhlAn was second best, identifying 22,647 (10.4%) of the reads. None of the algorithms identified any of the 2,045 A. odontolyticus genes (Figure 5b). This poor performance likely results from the fact that the A. odontolyticus genome annotation in GenBank is incomplete [36]. Conversely, at the species level, five of the six algorithms mapped a high number of reads to Streptococcus agalactiae for both the HMP Even and HMP Staggered datasets (Figure 2f, 2g), but only a small number of reads for this organism were present in the truth data. The relative abundance of Streptococcus mutans is lower in the algorithm calls than in the truth, while the relative abundance of Streptococcus agalactiae is higher, suggesting that a number of the reads called for S. agalactiae are actually from S. mutans (Figure 3b, 3d). This implies difficulty distinguishing between closely related species. Similarly, a high number of reads are assigned correctly to the Yersinia and Escherichia genera by Kraken and MetaScope (Figure 2c). However, the algorithms under-assign reads for Escherichia albertii and over-assign reads for Yersinia pseudotuberculosis, which indicates difficulty in distinguishing between these species (Figure 2h).

Overall, the algorithms were equally able to identify organisms in the staggered datasets as in the even datasets, suggesting that accurate read mapping depends more on the database supplied to the algorithm than on the abundance of the organism in the dataset.
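The relative-abundance comparison used above amounts to normalizing per-organism read counts against the dataset total; a minimal sketch with hypothetical counts chosen to echo the S. mutans / S. agalactiae pattern (not the paper's actual call counts):

```python
def relative_abundance(read_counts):
    """Normalize per-organism read counts to fractions summing to 1."""
    total = sum(read_counts.values())
    return {org: n / total for org, n in read_counts.items()}

# Hypothetical truth vs. algorithm-call counts for two close relatives:
truth = relative_abundance({"S. mutans": 60000, "S. agalactiae": 559})
calls = relative_abundance({"S. mutans": 40000, "S. agalactiae": 20559})
# An inflated S. agalactiae fraction paired with a deflated S. mutans
# fraction is the signature of reads leaking between closely related species.
```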
Additionally, for the bacterial datasets, Kraken, MetaScope, LMAT, and MetaPhlAn generally agreed on read mapping assignments. However, for the viral datasets, the algorithms missed different sets of organisms; e.g., in Figure 3i, LMAT failed to map reads for HIV-1, Influenza A virus, Marburg virus, and Machupo virus, whereas MetaScope and Kraken correctly mapped reads for these organisms. Conversely, MetaScope and Kraken both failed to map reads for Human papillomavirus 5, SARS coronavirus, Human papillomavirus 32, and Canine papillomavirus 3, while LMAT succeeded in mapping reads for these organisms. This suggests that for viral datasets, it might be worthwhile to execute both LMAT and one of Kraken or MetaScope, and calculate the union of the results.

The algorithms were also evaluated based on false positive hits (Figure 4). MetaCV and LMAT have diffuse error profiles: small numbers of reads are mapped to a high number of false positive organisms. Our past experience with the MetaScope algorithm suggests that this false positive profile indicates an algorithm has difficulty classifying organisms that are not present in the reference database. Ideally, when an algorithm encounters a novel organism, it should regress up the taxonomic tree until a nearest neighbor for the unknown organism can be established. However, the algorithm may instead report all reference organisms that match the unknown sample to a certain threshold. In contrast, Kraken has a highly concentrated error profile; fewer than 20 false positive organisms are reported, but several thousand reads are mapped to each of them, suggesting high-confidence calls.
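The union strategy suggested above for viral datasets is a simple set operation over each tool's organism calls; a sketch, using the organisms named in the text as example inputs:

```python
def combine_viral_calls(calls_a, calls_b):
    """Union of organism calls from two classifiers, as suggested for
    viral datasets (e.g., LMAT plus one of Kraken or MetaScope)."""
    return set(calls_a) | set(calls_b)

lmat = {"Human papillomavirus 5", "SARS coronavirus",
        "Human papillomavirus 32", "Canine papillomavirus 3"}
kraken = {"HIV-1", "Influenza A virus", "Marburg virus", "Machupo virus"}
combined = combine_viral_calls(lmat, kraken)
# The union recovers organisms that either tool alone would miss.
```

Note that a plain union also accumulates both tools' false positives, so in practice a read-count or abundance threshold per call may be warranted before merging.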
[…] presence of the Methanolobus genus, which is a member of the kingdom Archaea and is distantly related to the bacteria in the truth data.

For the viral datasets, MetaCV returned a high number of false positives and exhibited poor performance. Kyasanur Forest disease virus, a close relative of the true positive Omsk hemorrhagic fever virus, was the sole false positive for LMAT, and MetaScope did not report any false positive organisms for either viral dataset.

Finally, the gene calling capabilities of the algorithms were evaluated (Figure 5). […]

Figure 2. Number of reads correctly assigned to each organism at the genus and species level. Heatmap color scales are log10(number of correctly assigned reads). The "Truth" column indicates the number of reads spiked into the FASTQ input file for the specified genus or species. a.-e. Reads mapped correctly to the genus level for the HMP even, HMP staggered, bacteria, virus even, and virus staggered datasets, respectively. f.-k. Reads mapped correctly to the species level for the HMP even, HMP staggered, bacteria, virus even, virus staggered, and human datasets, respectively.

Figure 3. Relative abundance of organisms at the species and genus level. The "Truth" column indicates the relative abundance of genera and species added to the in silico FASTQ input file.
Tables

Table 1 […]
