Validating Solu’s cgMLST Implementation
A reproducibility and concordance validation across seven core-genome MLST schemes
Purpose
This whitepaper documents how the Solu platform implements core-genome multilocus sequence typing (cgMLST) and shows that the implementation produces valid clustering results. We validated it in two ways:
- External reproducibility. For each schema we took a study that defines epidemiologically verified transmission clusters or population-structure clades, uploaded the paper’s raw reads to Solu, and verified whether our pipeline recovers the same groupings the authors reported.
- Internal concordance. Every workspace also runs Solu’s peer-reviewed SNP-based phylogeny. While SNP-based phylogeny can’t be treated as ground truth, comparing the two methods on the same isolates tests whether the cgMLST pipeline agrees with an established second method.
Summary of results
Across seven cgMLST schemes and eleven bacterial species, Solu's implementation recovers the transmission clusters and population-structure clades reported in published outbreak studies, often at or near 100%, and agrees with Solu's own independent SNP phylogeny. Where the two methods diverge, the differences align with what the source studies report and never merge genomically distinct lineages, consistent with a correct implementation.
1. Species and schemes
The platform types eleven species across seven cgMLST schemes. Two genus-level schemes serve multiple species: one Pasteur Klebsiella species-complex (KpSC) schema covers four Klebsiella species (and contributes to a fifth), and one PubMLST Campylobacter schema covers both C. jejuni and C. coli. For K. pneumoniae we combine Pasteur’s 629-locus KpSC schema with a more accurate, K. pneumoniae-specific 2752-locus schema, giving a combined schema of 3381 loci.
Allele-difference thresholds are inclusive: two isolates join a cluster when a chain of pairwise distances, each at most the threshold, connects them (single linkage).
2. Adapting Pasteur and PubMLST schemes to chewBBACA
Each schema is sourced from a published, externally curated allele database (either Institut Pasteur’s BIGSdb-Pasteur or PubMLST) and adapted into the layout chewBBACA expects. Building on curated upstream schemes keeps our calls interpretable against the same nomenclature the community already uses.
We adapted all schemes with chewBBACA (github.com/B-UMMI/chewBBACA) version 3.5.3 PrepExternalSchema.
The per-species build manifest (training genomes, scheme URLs, locus counts, dropped alleles) is in Supplementary Data.
Source curation quality
Few alleles were rejected at the CDS-validation step of schema adaptation. The two Pasteur-curated schemes reject almost nothing: zero alleles for Listeria, and 50 of 809409 (0.006%) for the Klebsiella KpSC schema. The PubMLST schemes reject more, up to about one percent for M. abscessus (0.991%).
Failure modes differ by source (Supplementary Table S3). Indel-containing alleles, whose length is not a multiple of three, dominate the rejections in Campylobacter. M. abscessus is the opposite case: every one of its rejections is a translation-boundary problem (a missing or premature stop codon), not a length error. S. aureus and V. cholerae show small, mixed counts across both categories.
We drop a rejected allele from the schema but keep the surviving alleles in the same locus. Adaptation drops no locus across any of the six schemas: input and output locus counts are identical for every schema (Supplementary Table S1).
3. The cgMLST analysis pipeline
Every assembly passes through two stages: per-sample allele calling, then a multi-sample comparison that produces a tree, a distance matrix, and clusters.
3.1 Per-sample allele calling
chewBBACA AlleleCall (version 3.5.4) runs once per assembly against that species’ adapted schema:
Two flags to note:
- --no-inferred keeps novel alleles out of the schema. The schema is shared across all customers and stays identical for the lifetime of the scheme, which makes calls reproducible across runs and customers.
- --hash-profiles sha256 also emits the results in a hashed format, replacing each call with the SHA-256 of its allele DNA. Because we call each sample independently, chewBBACA’s per-run INF-<n> identifiers would not match across runs. Hashing makes allele identity content-addressed: two genomes carrying the same novel sequence produce the same value.
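The content-addressing idea can be illustrated with a minimal sketch. It uses Python's standard hashlib rather than chewBBACA itself. The function name and the upper-casing step are assumptions made for illustration, and how chewBBACA normalizes sequences before hashing is not described here.

```python
import hashlib


def allele_hash(dna: str) -> str:
    """Return the SHA-256 hex digest of an allele's DNA sequence."""
    # Assumption: sequences are upper-cased before hashing so that letter
    # case alone never changes the identifier.
    return hashlib.sha256(dna.upper().encode("ascii")).hexdigest()


# The same novel sequence seen in two independent runs gets the same value,
# whereas a per-run counter such as INF-1 / INF-7 would not line up.
run_a = allele_hash("ATGGCTAAATAA")
run_b = allele_hash("ATGGCTAAATAA")
other = allele_hash("ATGGCTAAGTAA")  # differs by one base
```

Because the identifier depends only on the sequence, independently produced profiles can be compared without a shared allele registry.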
3.2 Multi-sample comparison
The comparison runs once at least four same-species samples have successful AlleleCall results. A converter reads the per-sample hashed TSVs and writes one integer profile in the format GrapeTree and cgmlst-dists expect:
- missing and error codes, which --hash-profiles has already collapsed to -, become 0;
- each 64-character SHA-256 becomes its first 32 bits masked to 31, a positive int32 that fits cgmlst-dists’ signed cells.
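The two conversion rules can be sketched as follows. This is a hedged illustration, not Solu's converter: the function and constant names are hypothetical, and the input is assumed to be either "-" or a 64-character hexadecimal digest.

```python
MISSING = 0  # value written for missing or error calls


def hashed_call_to_int(call: str) -> int:
    """Map one hashed allele call to a 31-bit integer that fits a signed int32."""
    if call == "-":
        return MISSING
    first_32_bits = int(call[:8], 16)   # 8 hex characters = 32 bits
    return first_32_bits & 0x7FFFFFFF   # clear the top bit so the value is never negative
```

One caveat of this sketch: a digest whose masked value happens to be 0 would be indistinguishable from a missing call. The sketch does not handle that case, and the text above does not say how the production converter treats it.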
Tree. GrapeTree builds a neighbour-joining tree (NJ), then BioPython roots it at its midpoint.
Distances and clusters
cgmlst-dists computes the pairwise allele-difference matrix. -x 10000000 lifts its default early-exit clip at 999 so it reports distant pairs in full rather than censoring them. Solu’s clustering tool then groups samples into named clusters at a threshold of t allele differences by single linkage.
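For readers unfamiliar with allele-difference distances, the sketch below shows the quantity being computed. It is a plain-Python illustration, not the cgmlst-dists code, and it assumes that a locus with a missing call (0) in either profile is skipped instead of being counted as a difference. The profiles are invented.

```python
from typing import Dict, List


def allele_distance(a: List[int], b: List[int]) -> int:
    """Count loci where both profiles have a call and the calls differ."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same loci")
    # Assumption: 0 marks a missing call and is ignored in the comparison.
    return sum(1 for x, y in zip(a, b) if x != 0 and y != 0 and x != y)


def distance_matrix(profiles: Dict[str, List[int]]) -> Dict[str, Dict[str, int]]:
    """Build the full symmetric matrix of pairwise allele differences."""
    names = list(profiles)
    return {i: {j: allele_distance(profiles[i], profiles[j]) for j in names}
            for i in names}


# Invented four-locus profiles, for illustration only.
profiles = {
    "s1": [11, 22, 33, 44],
    "s2": [11, 22, 35, 44],
    "s3": [11, 0, 35, 47],
}
matrix = distance_matrix(profiles)
```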
The clustering tool is the peer-reviewed tool Solu already uses and has validated for SNP-based clustering. It runs single-linkage agglomerative clustering at an absolute threshold of t (allele differences in cgMLST mode, SNPs in SNP mode): two isolates share a cluster when a chain of pairwise distances, each at most t, connects them. The threshold is inclusive.
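Single linkage at an inclusive threshold is equivalent to finding connected components in a graph whose edges are the pairs at distance t or less. The sketch below does this with a small union-find. It is an illustration of the technique under that reading, not Solu's clustering tool, and the function name and sample data are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def single_linkage(names: List[str],
                   pair_distances: Iterable[Tuple[str, str, int]],
                   t: int) -> List[List[str]]:
    """Group samples connected by a chain of pairwise distances, each <= t."""
    parent: Dict[str, str] = {n: n for n in names}

    def find(n: str) -> str:
        # Follow parent links to the root, shortening the path on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b, d in pair_distances:
        if d <= t:  # inclusive threshold
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Invented distances: a-b and b-c are within 5, a-c is not, d is far from all.
names = ["a", "b", "c", "d"]
pairs = [("a", "b", 5), ("b", "c", 5), ("a", "c", 9),
         ("a", "d", 40), ("b", "d", 41), ("c", "d", 42)]
at_5 = single_linkage(names, pairs, t=5)
at_4 = single_linkage(names, pairs, t=4)
```

In the example, a and c end up together at t=5 even though they are 9 alleles apart, because b links them. That chaining is the defining property of single linkage.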
4. Validation
For each schema below we name the published source dataset, the schema page, and the publicly accessible Solu workspace, then report two comparisons: cluster recovery against the paper's published clusters (external reproducibility) and concordance against Solu's own SNP phylogeny (internal concordance). In both, the published or SNP clustering is treated as the reference.
We report five agreement metrics, all computed from the same input: each isolate's two cluster labels (e.g. the paper's clustering and Solu's).
Adjusted Rand Index (ARI). Measures how often the two clusterings agree on whether a pair of isolates belongs together, corrected for chance: 0 is chance-level agreement and 1 is identical clusterings (agreement worse than chance scores below 0).
Adjusted Mutual Information (AMI). Measures how much knowing an isolate's label under one clustering tells you about its label under the other (the information the two labelings share), rescaled so 0 is chance and 1 is identical.
Homogeneity. Asks whether each Solu cluster is "pure": do all its isolates come from a single reference group? It drops when one Solu cluster pools isolates from two or more reference groups, i.e. when cgMLST merges things the reference kept apart. 1.0 = every cluster is pure.
Completeness. The mirror image: does each reference group stay inside a single Solu cluster? It drops when one reference group is scattered across two or more Solu clusters, i.e. when cgMLST splits something the reference kept together. 1.0 = no reference group is split.
V-measure. A single score combining the two: the harmonic mean of homogeneity and completeness. 1.0 means clusters are both pure and whole.
ARI and AMI are chance-corrected, so 0 = chance-level agreement. The V-measure family (homogeneity, completeness, V-measure) is not chance-corrected: 1.0 is still perfect, but 0 isn't a chance baseline.
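To make the definitions concrete, here is a minimal standard-library sketch of ARI, homogeneity, completeness and V-measure, following the standard formulas (AMI is omitted for brevity). It is not the code used to produce the figures in this paper, and the labels in the example are invented.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(ref, pred):
    """Chance-corrected pair-counting agreement between two labelings."""
    pairs_both = sum(comb(c, 2) for c in Counter(zip(ref, pred)).values())
    pairs_ref = sum(comb(c, 2) for c in Counter(ref).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = pairs_ref * pairs_pred / comb(len(ref), 2)
    maximum = (pairs_ref + pairs_pred) / 2
    if maximum == expected:  # degenerate case, e.g. both labelings trivial
        return 1.0
    return (pairs_both - expected) / (maximum - expected)


def _entropy(labels):
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target, given):
    """H(target | given) from the joint label counts."""
    n = len(target)
    given_counts = Counter(given)
    joint = Counter(zip(target, given))
    return -sum(c / n * log(c / given_counts[g]) for (_, g), c in joint.items())


def homogeneity_completeness_v(ref, pred):
    h_ref, h_pred = _entropy(ref), _entropy(pred)
    homogeneity = 1.0 if h_ref == 0 else 1 - _conditional_entropy(ref, pred) / h_ref
    completeness = 1.0 if h_pred == 0 else 1 - _conditional_entropy(pred, ref) / h_pred
    total = homogeneity + completeness
    v = 0.0 if total == 0 else 2 * homogeneity * completeness / total
    return homogeneity, completeness, v


# Invented example: the test clustering merges reference groups A and B.
ref = ["A", "A", "B", "B", "C", "C"]
pred = [1, 1, 1, 1, 2, 2]
ari = adjusted_rand_index(ref, pred)
hom, comp, v = homogeneity_completeness_v(ref, pred)
```

In this example completeness stays at 1.0 because no reference group is split, while homogeneity drops because one cluster pools two reference groups. This is the merge direction described above.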
4.1 Klebsiella: KASPAH hospital surveillance (Gorrie et al. 2022)
- Study: doi.org/10.1038/s41467-022-30717-6
- Schema: BIGSdb-Pasteur Klebsiella scheme 18 (KpSC cgMLST, 629 loci) for all KpSC species; K. pneumoniae additionally runs scheme 19 (K. pneumoniae sensu stricto, 2,752 loci), for 3,381 loci combined.
- Workspace: solu-klebsiella-cgmlst
- Threshold: 28 allele differences (K. pneumoniae, combined scheme); 5 allele differences (other KpSC species); single linkage.
Threshold note: the K. pneumoniae threshold scales the KpSC threshold by the 5.6x mean increase in pairwise allelic distance the Pasteur Kpn-cgMLST paper reports, when the K. pneumoniae scheme is added on top of the KpSC scheme (5 × 5.6 = 28).
Summary
Solu recovers all 12 paper-reported transmission clusters from the KASPAH study intact, with an ARI of 0.917 over the 41 transmission-cluster members and 0.927 over the 123 non-singleton-cluster isolates. 34 of 35 Solu multi-sample clusters are lineage-pure by the paper’s definition; the single mixed cluster is a sample-selection difference, not a clustering disagreement. 357 of 364 uploaded isolates (98.1%) appear in the Solu cgMLST output. Against Solu’s SNP phylogeny, completeness is 1.0000 for every species.
Source dataset
The KASPAH study ran one year of prospective genomic surveillance on K. pneumoniae species complex (KpSC) clinical isolates at the Alfred Hospital, Melbourne. Raw Illumina reads are deposited under ENA BioProjects PRJEB6891 and PRJNA351909. We uploaded all 364 isolates with a non-empty Reads accession (Illumina) value in the paper’s Supplementary Data 1. The paper reports 12 nosocomial transmission clusters covering 41 infection episodes, built from a maximum-likelihood core-gene SNV phylogeny. Transmission clusters additionally required ≤25 pairwise SNVs, ≤45 days between samples, plausible epidemiological links, and manual review.
Recovery against published clusters
Solu recovers all 12 paper-reported transmission clusters intact: every member of each cluster lands in a single Solu cgMLST cluster, and no member splits off.
Recovery: 12 of 12 paper transmission clusters (100%). Size is the paper transmission cluster’s isolate count. The Solu cgMLST cluster it maps to often contains additional same-lineage isolates the paper did not designate as transmission.
Solu cluster labels are namespaced per species, so cluster “4” under K. pneumoniae and cluster “4” under K. quasipneumoniae are distinct clusters. Within K. pneumoniae, two of the twelve paper clusters (cluster 5, ST323, n=3; and cluster 74, ST323, n=4) are each recovered intact but land in the same Solu cluster (cluster 4). The 28-allele threshold groups these two closely related, same-lineage ST323 transmission clusters together.
Adjusted Rand Index
Agreement at the Solu-cluster level. Of the 35 Solu cgMLST clusters with two or more members, 34 (97%) contain only isolates that share a single paper LineageNumber, which makes them lineage-pure by the paper’s definition.
The one mixed cluster, K. variicola cluster 2, is a sample-selection difference, not a clustering disagreement. It contains nine ST681 isolates that the paper places in two categories:
- six isolates assigned to paper transmission cluster 215 (lineage 44);
- three isolates (KSB1_10E, KSB1_4F, KSB1_8D) tagged Not in infections tree.
The three Not in infections tree isolates are gut-colonization samples from the paper’s KSB cohort. The authors deliberately excluded colonization samples from the infection-isolate phylogeny that defines their lineages and clusters, so these three lack a LineageNumber. Solu’s cgMLST result groups the three colonization isolates with the cluster 215 infection isolates, consistent with the paper’s own broader finding that gut colonization is a frequent source of subsequent infection. This single apparent disagreement is a sample-selection difference, and it dominates the per-species K. variicola ARI (0.590 with the three samples, 0.921 without).
Solu clusters beyond the paper's transmission set. Solu finds 24 additional multi-sample cgMLST clusters (sizes 2 to 5) that the paper did not designate as transmission clusters. In every one of these 24 clusters, all members share the same paper LineageNumber.
Coverage
357 of the 364 uploaded isolates appear in the Solu cgMLST output (98.1%). The 7 that do not:
Concordance against Solu's SNP phylogeny
Solu’s SNP phylogeny clusters at a 20-SNP threshold. Comparing the two methods on the same isolates measures consistency: where SNP forms a cluster, does cgMLST agree?
Completeness is 1.0000 for every species: every SNP cluster’s members land entirely inside a single cgMLST cluster, so the cgMLST pipeline never splits a sample group that SNP brought together. Homogeneity is 1.0000 for K. variicola and K. quasipneumoniae, and 0.9785 for K. pneumoniae. The K. pneumoniae drop comes from cgMLST clusters that contain SNP-clustered samples plus additional samples that SNP left as singletons or placed in a smaller separate cluster. The largest example is cgMLST cluster 4 (20 samples): SNP places 16 in SNP cluster 14, leaves two as singletons, and places two in SNP cluster 24; cgMLST groups all 20. This is the expected direction of divergence: cgMLST operates on the allele profile across the schema’s core loci and recovers relationships that mapped-SNV approaches can miss when local coverage is uneven or repeat regions are masked. Every multi-sample SNP cluster survives intact inside a cgMLST cluster; the extra members are samples SNP could not assign with confidence.
The K. variicola case shows the same pattern at smaller scale. SNP finds 2 clusters, cgMLST finds 3. The first cgMLST cluster (9 samples, ST681) matches the largest SNP cluster exactly and includes the three KSB-cohort colonization isolates. The second (4 samples) matches the smaller SNP cluster exactly. The third (3 samples: INF181, INF345, INF352, all ST347) consists of samples SNP left as singletons but that the KASPAH paper places in a single lineage (lineage 45): a grouping that cgMLST recovers, the SNP method misses, and the paper’s broader phylogeny independently supports.
4.2 Mycobacteroides abscessus (Diricks et al. 2022)
- Study: doi.org/10.1038/s41467-022-32122-5
- Schema: PubMLST M. abscessus scheme (2904 loci)
- Workspace: solu-m-abscessus-cgmlst
- Threshold: 25 allele differences, single linkage.
Nomenclature note: Diricks et al. and the PubMLST sequence-definition database refer to the organism as Mycobacterium abscessus. We use the current name Mycobacteroides abscessus (the genus was reclassified) throughout; the two names refer to the same taxon. Citations and the source paper’s title are quoted with their original spelling. The schema's allele definitions transferred between the two databases without incident; only the genus name failed to.
Summary
Solu recovers 13 of 15 Diricks et al. paper transmission clusters as a single cgMLST cluster, with a paper-cluster-vs-Solu-cluster ARI of 0.934 over the 76-isolate outbreak/transmission set. For the 69 cystic-fibrosis patients with sequential isolates, the pipeline places all isolates from 68 of 69 patients (98.6%) in the same cluster. Against Solu’s SNP phylogeny, ARI is 0.708 and V-measure 0.955.
Source dataset
The Diricks et al. public read set (Supplementary Data 3) comprises 372 Illumina runs drawn from nine prior studies, covering all three subspecies (171 massiliense, 151 abscessus, 50 bolletii) and serving three roles:
The 76 outbreak/transmission isolates fall into 15 clusters (Supplementary Data 7): Brazil (30), mass_C1 (9), A4 (5), Tattoo (4), A1 (3), B1 (3), mabs_d25_cluster1 (3), mabs_d25_cluster2 (3), Seattle (3), Pediatric (3), M1 (2), M2 (2), A2 (2), A3 (2), mass_C2 (2). These clusters come from prior whole-genome SNV phylogenies (25 to 30 SNP thresholds in the original studies; Diricks et al. re-analyzed with cgSNP via MTBseq and cgMLST at 25 alleles) plus epidemiological evidence (shared CF centre, procedure room, tattoo parlour, surgical disinfectant). We uploaded all 372 isolates.
Recovery against published clusters
Solu recovers 13 of 15 paper transmission clusters as a single cgMLST cluster. We discuss the two exceptions below.
*A4 expands to 11 isolates once we include 6 sequential samples from the same 5 patients (5 in the outbreak/transmission set, 6 sequential). The other 14 clusters use the outbreak/transmission-set count. We compute the ARI below on the 76-isolate outbreak/transmission set, with A4 contributing 5.
Where two paper clusters share a Solu cluster (A1 + mabs_d25_cluster1 → Solu 11; A4 + mabs_d25_cluster2 → Solu 10), both members share the same ST and DCC. The 25-allele threshold places these geographically distinct outbreaks in the same lineage (finer than ST or DCC, broader than per-outbreak transmission), exactly as Diricks et al. observe for the ST9/DCC2 backbone shared by Frankfurt CF (mabs_d25_cluster2) and Italian CF (A4).
Adjusted Rand Index
The paper cluster vs Solu cluster ARI of 0.934 shows that Solu’s clusters with the chosen threshold closely correspond to the paper’s clusters.
The two non-trivial splits
mass_C1 (Papworth CF outbreak, 9 isolates). Eight isolates land in Solu cluster 4. The ninth, sample 14a (accession ERR115012, patient 14), lands as a singleton 150 alleles from its nearest neighbour; the other eight patient-14 isolates sit in cluster 4. 14a is the lowest-depth isolate from patient 14 (25.9x, against 58x to 125x for the others); at that depth the assembly fragments into 26 contigs (against 10 to 13 for siblings) and QUAST reports about 300 additional single-base mismatches across the 5.1-Mb genome. Those ~300 differences distribute across 150 cgMLST loci. chewBBACA marks any locus whose CDS is not identical to a schema allele as novel, and the metric counts a sibling’s exact match against 14a’s near-match as one allele difference. Of the 150 differing loci, 121 are exact-match in a sibling and inferred-novel in 14a.
A4 (Italian CF, 11 isolates incl. 6 sequential). Nine isolates land in Solu cluster 10; two (patient FI, isolates FI1 and FI2) land in Solu cluster 44. Diricks et al. flag this cluster in Supplementary Data 7, noting that in the original Tortoli 2017 study certain isolates are within the cluster but not connected at <25 SNPs to the other members. The paper thus reports A4 as a loose grouping whose members do not all meet the 25-SNP threshold; the cgMLST split matches the paper’s caveat.
Within-patient consistency. For the 69 cystic-fibrosis patients with sequential isolates (2 to 18 isolates each, 291 total), the pipeline places all isolates from 68 of 69 patients in the same Solu cluster (98.6%). The single exception is patient GA (5 isolates): four ST97 isolates cluster together, one ST5 isolate clusters separately. The paper places the four ST97 isolates outside any transmission cluster and assigns the ST5 isolate to cluster A1, a two-strain co-infection the paper also separates.
Concordance against Solu's SNP phylogeny
Comparing the two clusterings on the 343 samples both methods clustered:
The ARI (0.7076) reads lower than the V-measure (0.9546) because ARI counts isolate pairs and one cluster dominates that budget: SNP cluster 2 (98 of its 99 isolates are in the 343-sample comparison; the 99th is the cgMLST-unclustered patient-14 isolate) holds 82% of all SNP same-cluster pairs, and cgMLST splits it into three finer clusters that keep the paper’s distinct outbreaks within this massiliense lineage apart: Papworth CF C1 (cluster 4), Papworth CF C2 (cluster 20), and Seattle (cluster 1). That single, paper-corroborated split accounts for essentially the entire ARI gap; the information-based metrics score it as the pure, finer refinement it is. The two methods agree on lineage structure and differ only in granularity at their respective thresholds.
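The pair-counting effect is easy to see with a short calculation. The sketch below uses hypothetical cluster sizes chosen only to illustrate the mechanism. They are not the cluster sizes in this dataset.

```python
from math import comb


def same_cluster_pair_share(sizes):
    """Fraction of all same-cluster isolate pairs contributed by each cluster."""
    pairs = [comb(s, 2) for s in sizes]
    total = sum(pairs)
    return [p / total for p in pairs]


# Hypothetical sizes, for illustration only: one 98-isolate cluster
# plus ten clusters of 10 isolates each.
sizes = [98] + [10] * 10
shares = same_cluster_pair_share(sizes)
isolate_share = sizes[0] / sum(sizes)
```

With these made-up sizes, the large cluster holds just under half of the isolates but over 90% of the same-cluster pairs, because the pair count grows quadratically with cluster size. Any split of such a cluster therefore moves a pair-based score like ARI far more than an information-based one.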
Completeness sits high but below homogeneity, meaning cgMLST splits SNP clusters more often than SNP splits cgMLST clusters. The largest example is SNP cluster 2 (99 samples), which cgMLST splits into cluster 4 (59), cluster 20 (36), cluster 1 (3), and one singleton (the mass_C1 patient-14 isolate). 96 of the 99 are ST33 DCC3a massiliense; the other 3 are ST223 DCC3 Quebec_OM “Pediatric” isolates, which both the SNP phylogeny and cgMLST place alongside the Papworth CF C1 set in cluster 4. cgMLST resolves the lineage into the separate outbreaks the paper reports, Papworth CF C1 (cluster 4), Papworth CF C2 (cluster 20), and Seattle (cluster 1), rather than pooling them into one. Among the SNP clusters that move cleanly, SNP cluster 11 (31) maps 1:1 to cgMLST cluster 2 (31, the Brazil outbreak). Five cgMLST clusters each pull in a small handful from a second SNP cluster: cgMLST 30 (SNP 14 plus 3 from SNP 27), cgMLST 10 (SNP 15 plus 3 from SNP 21 plus 1 SNP-unclustered), cgMLST 11 (SNP 39 plus 2 from SNP 55), cgMLST 38 (SNP 43 plus 2 from SNP 4 plus 1 SNP-unclustered), and cgMLST 40 (SNP 5 plus 2 from SNP 42). The 0.976 homogeneity reflects exactly this: cgMLST clusters mostly sit inside a single SNP cluster, with 5 of 55 spanning two. cgMLST is equal to or finer than SNP at scale.
4.3 Listeria monocytogenes: multiclonal listeriosis outbreak (Lüth et al. 2020)
- Study: doi.org/10.1080/22221751.2020.1784044
- Schema: BIGSdb-Pasteur Listeria scheme (cgMLST1748, 1748 loci)
- Workspace: solu-listeria-cgmlst
- Threshold: 10 allele differences, single linkage.
Summary
We re-analysed the 312 L. monocytogenes isolates from a published German listeriosis outbreak investigation. The platform reproduced the outbreak structure: its largest cluster (ST5, n=157) matched the paper’s primary outbreak cluster (CC5) exactly on the two fully recovered source categories (20 food isolates and 91 food-processing-environment isolates). All clinically sourced isolates the platform clustered fell into exactly two clusters, matching the paper’s two outbreak clusters. The SNP cross-check gave ARI 0.9998 and V-measure 0.9960. The platform’s clinical-isolate count is lower than the paper’s because 26 clinical isolates were single-end short-read runs, an input type the platform does not support, and did not assemble.
Source dataset
Lüth et al. analysed 312 isolates collected in Germany 2013 to 2018: 77 clinical and 235 non-clinical (210 food-processing-environment, 25 food). Genomes are public in ENA under PRJEB37718 (food and environmental, plus outbreak-linked) and PRJEB24496 (clinical). The paper reported 7 cgMLST clusters and 12 singletons. The two outbreak-associated clusters were cluster 1 (176 isolates: 65 clinical, 20 food, 91 environment; CC5) and cluster 2 (24 isolates: 12 clinical, 12 environment; CC7), together holding all 77 sequenced clinical isolates.
Recovery against published clusters
The platform recovered the same cluster architecture: one dominant outbreak cluster, a second smaller outbreak cluster, secondary non-clinical clusters, and singletons. The lower outbreak-cluster counts reflect the 26 clinical isolates that did not assemble; since the paper places all clinical isolates in its two outbreak clusters, their absence depresses the clinical counts but leaves the structure intact.
Outbreak recovery. Cluster 1: exact food and environment match. The platform’s largest cluster (ST5, n=157) corresponds to the paper’s primary outbreak cluster (CC5):
The food count (20) and environment count (91) are identical between the two analyses. These come from independent sources (the paper’s published clusters on one side, the platform’s own cgMLST membership joined to ENA-recorded isolation source on the other), so the agreement is a genuine cross-check, not a circular comparison. The clinical shortfall (46 vs 65) reflects the unassembled clinical isolates.
Clinical isolates fall into exactly two clusters. Using only the public ENA isolation-source labels, every clinically sourced isolate the platform clustered fell into one of two clusters: 46 in cluster 1 (ST5/CC5) and 4 in cluster 2 (ST691/CC7). No clinical isolate landed in any other cluster or became a singleton, reproducing the paper’s central finding that the clinical cases split between exactly two outbreak clusters.
Second outbreak cluster: ST691 is CC7. All 17 members of the platform’s cluster 2 carry ST691, which the Institut Pasteur Listeria MLST database assigns to CC7, lineage II (ST691 differs from the CC7 founder ST7 at a single MLST locus, dapE, the single-locus-variant relationship defining clonal-complex membership). The paper assigns its second outbreak cluster to CC7 and the clinical co-clustering supports the same correspondence independently.
Coverage
Of 312 uploaded isolates, 286 produced a cgMLST result; 26 finished with errors. All 26 were single-end runs, and all 26 were clinical isolates.
Concordance against Solu's SNP phylogeny
Over the 274 isolates both methods clustered, treating SNP clustering as reference:
Completeness of 1.0000 means cgMLST never split a SNP cluster across two cgMLST clusters. Homogeneity of 0.9920 falls just short of 1.0 for a single reason: the SNP analysis separated a five-isolate ST6 group into two pairs (SNP clusters 6 and 7) plus one unclustered isolate, whereas cgMLST grouped all five into one cluster (cgMLST cluster 6): a merge by cgMLST of clusters SNP left separate, the expected direction when one method has finer resolution at a given threshold. No splits occurred in the surprising direction. This cross-check measures internal consistency between two Solu methods and is a separate quantity from any paper-versus-Solu comparison.
4.4 Campylobacter jejuni and C. coli: US PulseNet outbreak panel (Joseph et al. 2023)
- Study: doi.org/10.1099/mgen.0.001012
- Schema: PubMLST/Oxford Campylobacter scheme (1343 loci; Cody et al. 2017)
- Workspace: solu-campylobacter-cgmlst
- Threshold: 5 allele differences, single linkage.
Summary
We re-analyzed the 315-isolate Campylobacter panel from Joseph et al. (2023) in Solu. Solu cgMLST agrees closely with the SNP method (ARI 0.84 for C. jejuni; 1.00 for C. coli over only 5 co-clustered isolates from a single outbreak). Its clustering tracks the study’s own genomic distances: it recovers the tight outbreaks and splits the outbreaks the study itself found diffuse or polyclonal rather than forcing them together. The two methods reach the same recover-or-split verdict on 15 of 16 outbreaks under a strict definition and 14 of 16 under a lenient one.
Source dataset
Joseph et al. evaluated whether genomic typing can detect Campylobacter outbreaks in the United States. The dataset holds 315 isolates from the US CDC PulseNet network, public as raw Illumina reads under NCBI SRA BioProject PRJNA239251: 16 epidemiologically defined outbreaks (2008 to 2021) plus 73 sporadic isolates, mostly C. jejuni with a few C. coli. The study used the PubMLST/Oxford 1343-locus schema.
Solu uses the same Oxford/PubMLST schema with chewBBACA. We processed all 315 isolates (306 C. jejuni, 9 C. coli). We compared Solu cgMLST clusters against the internal SNP-phylogeny clusters, and both against the study’s outbreak and sporadic labels.
Recovery against published clusters
We classify each outbreak by the study’s own within-clade cgMLST distances: tight when its isolates fall within 5 alleles and a single clade, diffuse when they span more than 5 alleles or multiple clades the study itself defined.
- Of 8 tight outbreaks, Solu cgMLST recovered 6 with every isolate in one cluster, recovered 1 more (1802VADBR-1) among its clustered isolates (2 of 3 grouped, one singleton), and missed 1 outright (1612OHDBR-1, all 3 singletons, which the SNP method also failed to cluster).
- Of 7 diffuse or polyclonal outbreaks, Solu cgMLST split all 7 rather than forcing them into one cluster, falsely merging zero.
- The study reported no within-clade distance for one outbreak (1609CODBR-1), which we treat separately.
Recovery totals. Strict recovery requires every isolate of an outbreak in one cluster; lenient recovery requires every clustered isolate in one cluster, ignoring singletons.
- Solu cgMLST: 6 of 16 strictly, 8 of 16 leniently.
- Internal SNP phylogeny: 7 of 16 strictly, 8 of 16 leniently.
- The two methods agree on the recover-or-split verdict for 15 of 16 outbreaks under the strict definition and 14 of 16 under the lenient one.
This cross-method agreement indicates each outbreak’s biology drives the clustering behaviour, not a cgMLST-specific artifact.
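The strict and lenient definitions above can be sketched as a small classification function. This is an illustration, not the script behind the totals: the function name and cluster labels are hypothetical, and None is used here to mark an isolate left as a singleton.

```python
from typing import Dict, List, Optional


def recovery_verdict(assignments: List[Optional[str]]) -> Dict[str, bool]:
    """Classify one outbreak from its isolates' cluster labels (None = singleton)."""
    clustered = {a for a in assignments if a is not None}
    one_cluster = len(clustered) == 1
    return {
        # Strict: every isolate of the outbreak sits in the same cluster.
        "strict": one_cluster and None not in assignments,
        # Lenient: every clustered isolate sits in the same cluster; singletons are ignored.
        "lenient": one_cluster,
    }


all_together = recovery_verdict(["c7", "c7", "c7"])
one_dropout = recovery_verdict(["c7", "c7", None])    # 2 of 3 grouped, one singleton
all_singletons = recovery_verdict([None, None, None])  # nothing clustered
split = recovery_verdict(["c1", "c2", "c2"])
```

Under this reading an outbreak whose isolates are all singletons counts as recovered under neither definition, which matches how the outright miss is described above.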
Sporadic discrimination. Solu cgMLST placed no sporadic isolate in the same cluster as its epidemiologically matched outbreak: zero such cases across all 73 sporadic isolates at the 5-allele threshold. (This depends on an epidemiological sample-to-outbreak mapping not present in the shipped comparison files, so the claim is not reproducible from the shipped output alone.) The study itself found 68 of 73 sporadic isolates distinguishable from outbreak isolates, with the remaining 5 closely related (within about 11 cgMLST alleles) to an outbreak.
Cross-outbreak grouping. The only Solu cgMLST clusters spanning more than one outbreak label join the two pet-store-puppy outbreaks (1708FLDBR-1 and 1906NVDBR-1), which the paper confirms are the same source, plus one two-isolate ST8 pair. In every case the SNP phylogeny groups the same isolates, so these reflect genuine genomic relatedness, not a cgMLST error.
Concordance against Solu's SNP phylogeny
Metrics use only the isolates both methods placed in a cluster (163 of 306 C. jejuni, 5 of 9 C. coli):
Agreement is high for C. jejuni. The C. coli figures all read 1.0000 but over only 5 co-clustered isolates from a single outbreak. The substantive finding is that cgMLST resolves some SNP groupings more finely (completeness 0.90 for C. jejuni): it is the more conservative, higher-resolution method here.
Limitations
Solu clusters at 5 allele differences, which may be tighter than necessary. We also tested a 10-allele threshold: it did not produce more clusters, it flagged more false positives, and it was not a significant improvement over the 5-allele threshold. The main problem is a lack of masking in fragmented assemblies. We are separately working on improving our assembly for such cases, which would also improve the cgMLST results.
Conclusion
On the 315-isolate panel, Solu cgMLST agreed with an internal SNP phylogeny on the recover-or-split verdict for 15 of 16 outbreaks (strict) and 14 of 16 (lenient), merged no two genomically distinct SNP lineages, and placed no sporadic isolate with its matched outbreak. It recovered the tight outbreaks (apart from one isolate-level drop-out and one outright miss that the SNP method flags identically) and split the diffuse or polyclonal outbreaks rather than forcing them together. Reproducing the study’s qualitative conclusion through an independent caller and a SNP cross-check is consistent with a correct implementation.
4.5 Vibrio cholerae: Cox’s Bazar outbreak (Taylor-Brown et al. 2023)
- Study: doi.org/10.1038/s41467-023-39415-3
- Schema: PubMLST V. cholerae scheme (2443 loci)
- Workspace: solu-cholerae-cgmlst
- Threshold: 7 allele differences, single linkage.
Summary
We validated the platform’s cgMLST clustering against 223 V. cholerae isolates sequenced during a cholera outbreak in Cox’s Bazar, Bangladesh. Of the 221 isolates that received a cgMLST cluster, every one matched its published population-structure clade, giving an ARI of 1.000 against the authors’ clade assignments and zero discordant isolates. The platform’s independent SNP phylogeny clustered the same isolates identically (ARI 1.000). This dataset is a clonal outbreak that resolves into two deeply divergent lineages, so it demonstrates correct lineage-level recovery rather than fine within-lineage resolution.
Source dataset
The authors sequenced 223 isolates by Illumina from cholera cases in Cox’s Bazar:
- Clade 5: 171 isolates. 7PET Wave 3, restricted to South and Southeast Asia; ctxB1 cholera-toxin allele; subclades 5.19 and 5.20 carry ICETET (IDH_1986) with tetracycline resistance (tetR).
- Clade 3: 51 isolates. 7PET Wave 3, globally distributed; ctxB7 allele; ICEGEN element with florfenicol resistance (floR); includes subclade 3.9.
- Wave 1 (Clade 4) outlier: 1 isolate. A single divergent genome.
Recovery against published clusters
Of the 223 isolates, 222 entered the comparison and 221 received a cgMLST cluster. The 222nd is the single Wave 1 (Clade 4) outlier, correctly left unclustered. The one isolate that did not enter the comparison was published as Clade 5: ERR11850155 (a contaminated deposit excluded from cgMLST typing, see below). cgMLST produced exactly two clusters: cluster 1 (51 isolates, all Clade 3) and cluster 2 (170 isolates, all Clade 5).
Concordance over the 221 clustered isolates (the Clade 4 singleton is excluded because it has no cluster label): ARI 1.000 against the authors’ clade assignments, with zero discordant isolates.
Reconciliation of all 223 isolates. 170 Clade 5 → cluster 2; 51 Clade 3 → cluster 1; 1 Clade 4 outlier → singleton; 1 contaminated deposit excluded. One of the paper’s 171 Clade 5 isolates is therefore not clustered here (the contaminated isolate), and none is misassigned. The platform’s species and contamination QC identified the contaminated deposit (ERR11850155) as Pseudomonas aeruginosa at roughly 84% contamination, with an assembly of about 10 Mb against the ~4 Mb expected for V. cholerae. The paper labels it Clade 5, but the deposited reads are not V. cholerae, so exclusion by species gating is the correct outcome.
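The reconciliation and the reported ARI can be checked from the counts alone. The sketch below rebuilds label vectors from those counts and computes the adjusted Rand index with its standard pair-counting formula; the function name and label strings are illustrative, and this is not necessarily how the platform computes the metric.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index from the contingency table of two labelings."""
    n = len(labels_a)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case (for example, a single cluster on both sides).
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Counts taken from the reconciliation above.
clustered = {"Clade 3": 51, "Clade 5": 170}
clade4_singleton = 1
contaminated_excluded = 1
accounted_for = sum(clustered.values()) + clade4_singleton + contaminated_excluded

# Label vectors over the 221 clustered isolates.
published = ["Clade 3"] * 51 + ["Clade 5"] * 170
cgmlst = ["cluster 1"] * 51 + ["cluster 2"] * 170
ari = adjusted_rand_index(published, cgmlst)

# Hypothetical counter-example: one Clade 5 isolate placed in cluster 1.
cgmlst_one_error = ["cluster 1"] * 52 + ["cluster 2"] * 169
ari_one_error = adjusted_rand_index(published, cgmlst_one_error)

print(accounted_for, ari, round(ari_one_error, 3))
```

The counter-example shows the metric is sensitive at this scale: a single misassigned isolate among 221 already pulls the ARI below 1.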
Independent marker corroboration. The two cgMLST clusters align with the independent markers the paper reports per clade: cluster 1 (Clade 3) carries ctxB7 and ICEGEN/floR; cluster 2 (Clade 5) carries ctxB1, with ICETET/tetR in the relevant subclades.
Concordance against Solu's SNP phylogeny
The platform’s SNP-based phylogeny produced the same two groups (51 and 170) and the same singleton. Over the 221 clustered isolates, the two independent methods cluster the isolates identically (ARI 1.000).
Limitations
This result shows cgMLST recovers the known lineage structure, does not split a lineage spuriously, and does not merge across the major divide. But the dataset is a clonal 7PET outbreak resolving into two deeply divergent, well-separated lineages, so a two-way split is a relatively coarse test: it demonstrates lineage-level recovery without stress-testing fine within-lineage transmission resolution.
4.6 Staphylococcus aureus: ST22 MRSA community cluster (Toleman et al. 2017)
- Study: doi.org/10.1093/cid/cix539
- Schema: PubMLST S. aureus scheme (1716 loci)
- Workspace: solu-s-aureus-cgmlst
- Threshold: 27 allele differences, single linkage.
Summary
Solu’s pipeline reproduces the 15-patient transmission cluster from Toleman et al. (2017). Of 34 isolates from 22 patients, the pipeline places all 27 paper-cluster isolates in one Solu cluster and the 7 paper-outside isolates as singletons. Binary in-cluster / out-of-cluster classification agrees one-to-one with the paper.
Source dataset
The paper describes 34 MRSA isolates from 22 patients registered at one Cambridgeshire GP surgery. The authors built a maximum-likelihood phylogeny from 1715 ST22 reference isolates, applied a 50-SNP cutoff, and identified a 15-patient transmission cluster. Paper Table 1 classifies each isolate within or outside the phylogenetic cluster: 27 cluster isolates (patients P01 to P15) and 7 non-cluster isolates (P16 to P22).
Numbering convention. Table 1 and Supplementary Table S1 use different per-patient labels for P12 to P15. This document uses S1 numbering, because that numbering attaches to the ERS accessions used for upload (under S1: P12 has 2 isolates, P13 has 3, P14 and P15 have 1 each). Both labelings cover the same 27 cluster and 7 non-cluster isolates.
Recovery against published clusters
Solu returned one multi-sample cluster (cluster 1, 27 isolates) and 7 singletons. All 27 paper-cluster isolates fall in cluster 1, and all 7 paper-outside isolates are singletons.
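The binary in-cluster / out-of-cluster comparison reduces to a two-by-two tally. The sketch below rebuilds it from the reported counts; the per-isolate ordering is a placeholder that assumes the one-to-one agreement stated in the summary, and the variable names are illustrative.

```python
from collections import Counter

# Paper Table 1 classification: 27 cluster isolates, 7 non-cluster isolates.
paper = ["in"] * 27 + ["out"] * 7
# Solu cgMLST: cluster 1 membership counts as "in", singletons as "out".
solu = ["in"] * 27 + ["out"] * 7

confusion = Counter(zip(paper, solu))
true_in = confusion[("in", "in")]       # paper cluster, Solu cluster 1
true_out = confusion[("out", "out")]    # paper outside, Solu singleton
missed = confusion[("in", "out")]       # paper cluster, Solu singleton
extra = confusion[("out", "in")]        # paper outside, Solu cluster 1

sensitivity = true_in / (true_in + missed)
specificity = true_out / (true_out + extra)
print(true_in, true_out, missed, extra, sensitivity, specificity)
```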
Concordance against Solu's SNP phylogeny
Solu’s SNP phylogeny (20-SNP threshold) returns three groupings for these 34 samples:
- SNP cluster 1: 4 isolates (P03_1, P03_2, P05_1, P12_2)
- SNP cluster 2: 21 isolates
- Unclustered: 9 isolates (P04_1, P12_1, and the 7 paper-outside isolates)
The cgMLST cluster combines the two SNP clusters and includes 2 samples that SNP clustering left as singletons.
The SNP analysis fragments the paper cluster into two SNP clusters plus two unclustered isolates instead of recovering it as one; the paper-outside isolates remain unclustered in both methods. The thresholds differ, which matters here: the paper used a 50-SNP cutoff, whereas Solu’s SNP pipeline uses 20 SNPs, tuned for tight transmission clusters. We do not report metric numbers here because the SNP pipeline splits the paper cluster, so SNP-as-reference metrics would misrepresent a correct result.
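The relationship between the two partitions (the SNP result splits the cgMLST cluster but never joins isolates that cgMLST keeps apart) can be verified mechanically. The sketch below uses the isolate names given above where the text lists them; the `snp2_*` and `outside_*` identifiers are placeholders for isolates not named here, and the helper functions are hypothetical.

```python
from collections import Counter

snp_cluster_1 = ["P03_1", "P03_2", "P05_1", "P12_2"]
snp_unclustered_in_paper_cluster = ["P04_1", "P12_1"]
snp_cluster_2 = [f"snp2_{i:02d}" for i in range(1, 22)]   # 21 placeholder IDs
paper_outside = [f"outside_{i}" for i in range(1, 8)]     # 7 placeholder IDs

# None marks an unclustered isolate (singleton).
cgmlst = {iso: "cgMLST cluster 1"
          for iso in snp_cluster_1 + snp_cluster_2 + snp_unclustered_in_paper_cluster}
cgmlst.update({iso: None for iso in paper_outside})

snp = {iso: "SNP cluster 1" for iso in snp_cluster_1}
snp.update({iso: "SNP cluster 2" for iso in snp_cluster_2})
snp.update({iso: None for iso in snp_unclustered_in_paper_cluster + paper_outside})


def split_of(coarse, fine, coarse_label):
    """Describe how one cluster of `coarse` is divided by `fine`."""
    members = [iso for iso, label in coarse.items() if label == coarse_label]
    sizes = Counter(fine[iso] for iso in members if fine[iso] is not None)
    unclustered = sorted(iso for iso in members if fine[iso] is None)
    return dict(sizes), unclustered


def clusters_spanning(fine, coarse):
    """List clusters of `fine` whose members fall in more than one `coarse` group."""
    spans = {}
    for iso, label in fine.items():
        if label is not None:
            # An isolate unclustered in `coarse` counts as its own group.
            spans.setdefault(label, set()).add(coarse[iso] or iso)
    return sorted(label for label, groups in spans.items() if len(groups) > 1)


sizes, unclustered = split_of(cgmlst, snp, "cgMLST cluster 1")
crossing = clusters_spanning(snp, cgmlst)
print(sizes, unclustered, crossing)
```

An empty `crossing` list means no SNP cluster reaches outside the cgMLST cluster, so the two methods differ only in how finely they divide the same 27 isolates.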
Limitations
28 of the 34 isolates are ST22 plus one ST1539 single-locus variant; the validation tests within-lineage discrimination (the 50-SNP paper cluster vs the two outside ST22 isolates) and between-lineage separation (ST6, ST45), but does not exercise mixed-lineage or polyclonal outbreaks.
Supplementary data: schema build
All schemes were built with chewBBACA 3.5.3.
Supplementary Table S1: Per-schema build manifest
The V. cholerae training accession is GenBank (GCA_) rather than RefSeq (GCF_) because the chosen complete N16961 assembly is published under GenBank only. Identical input/output locus counts confirm that CDS validation dropped no locus.
Supplementary Table S2: Allele inventory per locus
Computed from each schema’s chewBBACA-emitted schema_seed_summary_stats.tsv. “Min” and “Max” describe how unevenly alleles distribute across loci within a schema: variable loci accumulate thousands of alleles; conserved loci stay in single digits.
Supplementary Table S3: Invalid-allele failure-mode breakdown
We derive counts by parsing each schema’s chewBBACA-emitted schema_seed_invalid_alleles.txt. Each category comes from the reported sense-frame failure when present, otherwise from the bare “sequence length is not a multiple of 3” annotation. We count each rejected allele once and compute the rejection rate against total alleles in Table S2.
References
- Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, de Hoon MJL. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25(11):1422-1423. https://doi.org/10.1093/bioinformatics/btp163
- Cody AJ, Bray JE, Jolley KA, McCarthy ND, Maiden MCJ. Core genome multilocus sequence typing scheme for stable, comparative analyses of Campylobacter jejuni and C. coli human disease isolates. Journal of Clinical Microbiology. 2017;55(7):2086-2097. https://doi.org/10.1128/JCM.00080-17
- Diricks M, Merker M, Wetzstein N, Kohl TA, Niemann S, Maurer FP. Delineating Mycobacterium abscessus population structure and transmission employing high-resolution core genome multilocus sequence typing. Nature Communications. 2022;13:4936. https://doi.org/10.1038/s41467-022-32122-5
- Gorrie CL, Mirčeta M, Wick RR, Judd LM, Lam MMC, Gomi R, Abbott IJ, Thomson NR, Strugnell RA, Pratt NF, Garlick JS, Watson KM, Hunter PC, Pilcher DV, McGloughlin SA, Spelman DW, Wyres KL, Jenney AWJ, Holt KE. Genomic dissection of Klebsiella pneumoniae infections in hospital patients reveals insights into an opportunistic pathogen. Nature Communications. 2022;13:3017. https://doi.org/10.1038/s41467-022-30717-6
- Hennart M, Guglielmini J, Bridel S, Maiden MCJ, Jolley KA, Criscuolo A, Brisse S. A dual barcoding approach to bacterial strain nomenclature: genomic taxonomy of Klebsiella pneumoniae strains. Molecular Biology and Evolution. 2022;39(7):msac135. https://doi.org/10.1093/molbev/msac135
- Joseph LA, Griswold T, Vidyaprakash E, Im SB, Williams GM, Pouseele HA, Hise KB, Carleton HA. Evaluation of core genome and whole genome multilocus sequence typing schemes for Campylobacter jejuni and Campylobacter coli outbreak detection in the USA. Microbial Genomics. 2023;9(5):mgen001012. https://doi.org/10.1099/mgen.0.001012
- Lüth S, Halbedel S, Rosner B, Wilking H, Holzer A, Roedel A, Dieckmann R, Vincze S, Prager R, Flieger A, Al Dahouk S, Kleta S. Backtracking and forward checking of human listeriosis clusters identified a multiclonal outbreak linked to Listeria monocytogenes in meat products of a single producer. Emerging Microbes & Infections. 2020;9(1):1600-1608. https://doi.org/10.1080/22221751.2020.1784044
- Seemann T. cgmlst-dists: calculate distance matrix from cgMLST allele call tables. GitHub software repository. https://github.com/tseemann/cgmlst-dists
- Silva M, Machado MP, Silva DN, Rossi M, Moran-Gilad J, Santos S, Ramirez M, Carriço JA. chewBBACA: a complete suite for gene-by-gene schema creation and strain identification. Microbial Genomics. 2018;4(3):e000166. https://doi.org/10.1099/mgen.0.000166
- Taylor-Brown A, Afrad MH, Khan AI, Lassalle F, Islam MT, Tanvir NA, Thomson NR, Qadri F. Genomic epidemiology of Vibrio cholerae during a mass vaccination campaign of displaced communities in Bangladesh. Nature Communications. 2023;14:3773. https://doi.org/10.1038/s41467-023-39415-3
- Toleman MS, Watkins ER, Williams T, Blane B, Sadler B, Harrison EM, Coll F, Parkhill J, Nazareth B, Brown NM, Peacock SJ. Investigation of a cluster of sequence type 22 methicillin-resistant Staphylococcus aureus transmission in a community setting. Clinical Infectious Diseases. 2017;65(12):2069-2077. https://doi.org/10.1093/cid/cix539
- Tortoli E, Kohl TA, Trovato A, Baldan R, Campana S, Cariani L, Colombo C, Costa D, Cristadoro S, Di Serio MC, Manca A, Pizzamiglio G, Rancoita PMV, Rossolini GM, Taccetti G, Teri A, Niemann S, Cirillo DM. Mycobacterium abscessus in patients with cystic fibrosis: low impact of inter-human transmission in Italy. European Respiratory Journal. 2017;50(1):1602525. https://doi.org/10.1183/13993003.02525-2016
- Zhou Z, Alikhan NF, Sergeant MJ, Luhmann N, Vaz C, Francisco AP, Carriço JA, Achtman M. GrapeTree: visualization of core genomic relationships among 100,000 bacterial pathogens. Genome Research. 2018;28(9):1395-1404. https://doi.org/10.1101/gr.232397.117
