Validation of Solu Platform's Phylogeny Pipeline Using a Salmonella Bareilly Outbreak Dataset
Abstract
This study uses a 2012 Salmonella Bareilly benchmark outbreak dataset to validate Solu Platform's phylogeny pipeline. The outbreak, linked to a frozen raw yellowfin tuna product, infected 425 people in the U.S. The platform identified every sample as Salmonella Bareilly and detected associated antimicrobial resistance, stress response, and virulence genes. The Solu phylogeny pipeline produced a tree that closely matched the gold-standard tree and was closer to it, on two tree-distance metrics, than a tree produced by another popular pipeline.
Introduction
In 2012, Salmonella Bareilly and Salmonella Nchanga caused a food-borne outbreak in the U.S., infecting 425 people. Salmonella Bareilly caused 96% of the cases. Public health agencies connected the outbreak to a frozen raw yellowfin tuna product ("tuna scrape") originating from a tuna processing company in India [1].
Timme et al. have used the well-documented data related to this Salmonella outbreak to create a benchmark dataset for validating phylogeny pipelines [2, 3]. In this case study, we examine this benchmark dataset using Solu Platform and compare the results with the previous findings.
Dataset
The dataset contains 23 Salmonella Bareilly samples collected from various countries between 2003 and 2012. The data have been deposited in the European Nucleotide Archive (ENA) at EMBL-EBI under accession number PRJNA170556. The table below summarizes the countries and collection times of the samples.
The 18 samples collected in 2012 are part of the outbreak, while the remaining five samples represent outgroups.
Methods
We entered the samples' SRA accession numbers into Solu Platform. The platform automatically downloaded the raw reads, assembled them, and ran a variety of genomic characterization and phylogenetic analyses. For further information about the platform's methodology, please visit our methodology description.
Results
Species and antimicrobial resistance
The platform accurately identified each sample as Salmonella Bareilly. It also detected numerous genes associated with antimicrobial resistance, stress response to metals, and virulence [5].
Phylogenetic analysis
Solu Platform automatically constructed a phylogenetic tree for the dataset based on SNP differences. It also detected two clusters, Se1 and Se2, containing 19 and 2 samples, respectively. The phylogenetic tree and the color-coded clusters are shown in the image below.
In the phylogenetic tree, the samples collected before the 2012 outbreak (SRR500493, SRR500494, SRR498373, SRR498369 and SRR498276) branch off from the outbreak samples. However, SRR498276, the closest outgroup sample, has an average SNP distance of only 14 to the outbreak samples, despite having been collected in 2003. Because of this low SNP distance, it was clustered with the 2012 outbreak samples. This illustrates the problem of defining an outbreak by an SNP cutoff alone and highlights the need to validate the outbreak against the phylogenetic tree.
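The cutoff problem described above can be illustrated with a small sketch. It assumes single-linkage clustering over pairwise SNP distances, which is one common approach and not necessarily what Solu Platform uses. The sample names, distances, and cutoff values are hypothetical and chosen only to mimic the pattern reported here: a tight outbreak group, one outgroup about 14 SNPs away, and one distant outgroup.

```python
def single_linkage_clusters(samples, distances, cutoff):
    """Group samples connected by any chain of pairwise distances <= cutoff."""
    parent = {s: s for s in samples}

    def find(s):
        # Follow parent links to the cluster representative (with path halving).
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for (a, b), d in distances.items():
        if d <= cutoff:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for s in samples:
        clusters.setdefault(find(s), set()).add(s)
    # Largest cluster first, then alphabetical, so the output is deterministic.
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


# Hypothetical toy data, not the real benchmark distances.
samples = ["outbreak_1", "outbreak_2", "outbreak_3",
           "near_outgroup", "far_outgroup"]
distances = {
    ("outbreak_1", "outbreak_2"): 2,
    ("outbreak_1", "outbreak_3"): 5,
    ("outbreak_2", "outbreak_3"): 6,
    ("near_outgroup", "outbreak_1"): 13,
    ("near_outgroup", "outbreak_2"): 14,
    ("near_outgroup", "outbreak_3"): 15,
    ("far_outgroup", "outbreak_1"): 310,
    ("far_outgroup", "outbreak_2"): 312,
    ("far_outgroup", "outbreak_3"): 311,
    ("far_outgroup", "near_outgroup"): 305,
}

loose = single_linkage_clusters(samples, distances, cutoff=20)
strict = single_linkage_clusters(samples, distances, cutoff=10)
print(loose)   # near_outgroup joins the outbreak cluster
print(strict)  # near_outgroup is left on its own
```

With the looser cutoff the older, nearby sample is absorbed into the outbreak cluster, while the stricter cutoff separates it. Neither answer is obviously right from distances alone, which is why the tree topology is worth checking as well.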
Interestingly, SRR498276 also originated from India, only 8 kilometers from the tuna facility. This most likely explains its low evolutionary distance to the outbreak samples and shows the traceback capabilities of whole-genome sequencing (WGS). The other samples in cluster Se1 are within 0 to 6 SNPs of each other, similar to the SNP distances reported by Hoffmann et al. [3].
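For readers less familiar with the metric, a pairwise SNP distance is essentially the number of aligned positions at which two genomes differ. The sketch below is a minimal illustration under stated assumptions: the sequences are already aligned to the same coordinates, positions with `N` or `-` in either sequence are skipped, and the sample names and sequences are invented toy data. It does not describe how Solu Platform calls SNPs or computes distances.

```python
from itertools import combinations

# Characters treated as "no call"; skipping them is an assumption of this sketch.
AMBIGUOUS = set("N-")


def snp_distance(seq_a, seq_b):
    """Count aligned positions where both sequences have a call and differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in AMBIGUOUS and b not in AMBIGUOUS
    )


def pairwise_snp_distances(alignment):
    """Return {(name_x, name_y): distance} for every unordered pair of samples."""
    return {
        (x, y): snp_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }


# Hypothetical toy alignment (10 sites), not real Salmonella data.
alignment = {
    "sample_1": "ACGTACGTAC",
    "sample_2": "ACGTACGTAC",
    "sample_3": "ACGAACGTNC",
    "sample_4": "TCGAACCTAC",
}

matrix = pairwise_snp_distances(alignment)
for pair, distance in matrix.items():
    print(pair, distance)
```

In practice the input would be a core-genome or reference-based SNP alignment with thousands of sites, and the handling of missing data can change the counts noticeably.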
Validating the phylogenetic tree
To validate the phylogenetic tree generated by the platform, we compared it with the reference tree produced by Timme et al. We used two methods for comparing trees: Kendall–Colijn distance and Robinson–Foulds distance [6, 7]. We applied the KendallColijn() and RobinsonFoulds() functions from the TreeDist R package using the default parameters [8].
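To make the two metrics concrete, here is a minimal Python sketch of what they measure. It is not the TreeDist implementation used in this study: the Newick parser is simplified (no quoted labels), both metrics use topology only, the Robinson–Foulds value is the raw count of unshared splits on the unrooted topology, and the Kendall–Colijn value is the topology-only form (λ = 0) on the tree as rooted in the Newick string. The four-tip trees are hypothetical toy inputs.

```python
import math
from itertools import combinations


def parse_newick(text):
    """Parse Newick into nested tuples of tip names (lengths and inner labels dropped)."""
    text = text.strip().rstrip(";")
    pos = 0

    def read_label():
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos].split(":")[0].strip()

    def read_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # consume "("
            children = [read_node()]
            while text[pos] == ",":
                pos += 1
                children.append(read_node())
            pos += 1  # consume ")"
            read_label()  # skip internal label / branch length
            return tuple(children)
        return read_label()

    return read_node()


def tips(tree):
    return [tree] if isinstance(tree, str) else [t for c in tree for t in tips(c)]


def splits(tree):
    """Non-trivial bipartitions, each stored as the side that lacks a fixed anchor tip."""
    all_tips = frozenset(tips(tree))
    anchor = min(all_tips)
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(walk(child) for child in node))
        side = all_tips - clade if anchor in clade else clade
        if 2 <= len(side) <= len(all_tips) - 2:
            found.add(side)
        return clade

    walk(tree)
    return found


def robinson_foulds(t1, t2):
    """Number of splits present in exactly one of the two trees."""
    return len(splits(t1) ^ splits(t2))


def mrca_depths(tree):
    """Map each tip pair to the number of edges from the root to the pair's MRCA."""
    depths = {}

    def walk(node, depth):
        if isinstance(node, str):
            return [node]
        groups = [walk(child, depth + 1) for child in node]
        for g1, g2 in combinations(groups, 2):
            for a in g1:
                for b in g2:
                    depths[frozenset((a, b))] = depth
        return [t for g in groups for t in g]

    walk(tree, 0)
    return depths


def kendall_colijn(t1, t2):
    """Euclidean distance between the two trees' MRCA-depth vectors (topology only)."""
    d1, d2 = mrca_depths(t1), mrca_depths(t2)
    if d1.keys() != d2.keys():
        raise ValueError("trees must have the same tip labels")
    return math.sqrt(sum((d1[p] - d2[p]) ** 2 for p in d1))


# Hypothetical toy trees with tips A-D.
balanced = parse_newick("((A:0.1,B:0.2):0.3,(C:0.1,D:0.1):0.2);")
regrouped = parse_newick("((A,C),(B,D));")
caterpillar = parse_newick("(((A,B),C),D);")

print(robinson_foulds(balanced, regrouped), kendall_colijn(balanced, regrouped))
print(robinson_foulds(balanced, caterpillar), kendall_colijn(balanced, caterpillar))
```

In the second comparison the two trees share the same unrooted topology, so the Robinson–Foulds distance is 0, while the Kendall–Colijn distance is non-zero because the root sits in a different place. This is one reason to report both metrics, and to make sure both trees are rooted consistently before comparing them.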
We exported the phylogenetic tree from the Solu Platform in Newick format and used both this and the gold-standard tree in our calculations. For reference, we also computed the same distances for a tree that was constructed by Libuit et al. for the same dataset [9]. The table below shows the results.
The results suggest that the phylogenetic tree built by Solu Platform aligns more closely with the reference tree than the tree produced by the Theiagen kSNP3 workflow [9] does.
Discussion
Our results show that Solu Platform accurately identifies Salmonella Bareilly samples and related genes, and that accurate results can be obtained with an automated, easy-to-use pipeline.
Comparing the phylogenetic tree produced by the platform with the gold-standard tree using tree comparison metrics further demonstrates the accuracy of Solu Platform's phylogeny pipeline.
References
1. Centers for Disease Control and Prevention. 2012 Salmonella Outbreak Associated with a Raw Scraped Ground Tuna Product. 2012. https://archive.cdc.gov/#/details?url=https://www.cdc.gov/salmonella/bareilly-04-12/index.html Accessed 23 May 2024.
2. Timme RE, Rand H, Shumway M, et al. Benchmark datasets for phylogenomic pipeline validation, applications for foodborne pathogen surveillance. PeerJ. 2017;5:e3893. doi:10.7717/peerj.3893
3. Hoffmann M, Luo Y, Monday SR, et al. Tracing Origins of the Salmonella Bareilly Strain Causing a Food-borne Outbreak in the United States. J Infect Dis. 2016;213(4):502-508. doi:10.1093/infdis/jiv297
4. Alcock BP, Huynh W, Chalil R, et al. CARD 2023: expanded curation, support for machine learning, and resistome prediction at the Comprehensive Antibiotic Resistance Database. Nucleic Acids Res. 2023;51(D1):D690-D699. doi:10.1093/nar/gkac920
5. Feldgarden M, Brover V, Gonzalez-Escalona N, et al. AMRFinderPlus and the Reference Gene Catalog facilitate examination of the genomic links among antimicrobial resistance, stress response, and virulence. Sci Rep. 2021;11(1):12728. doi:10.1038/s41598-021-91456-0
6. Kendall M, Colijn C. Mapping phylogenetic trees to reveal distinct patterns of evolution. Mol Biol Evol. 2016;33(10):2735-2743.
7. Robinson DF, Foulds LR. Comparison of phylogenetic trees. Math Biosci. 1981;53(1-2):131-147.
8. Smith MR. TreeDist: Distances between Phylogenetic Trees. R package version 2.7.0. Comprehensive R Archive Network. doi:10.5281/zenodo.3528124
9. Libuit KG, Doughty EL, Otieno JR, et al. Accelerating bioinformatics implementation in public health. Microb Genom. 2023;9(7):mgen001051. doi:10.1099/mgen.0.001051