Component Parts of Bacteriophage Virions Accurately Defined by a Machine-Learning Approach Built on Evolutionary Features

ABSTRACT Antimicrobial resistance (AMR) continues to evolve as a major threat to human health, and new strategies are required for the treatment of AMR infections. Bacteriophages (phages) that kill bacterial pathogens are being identified for use in phage therapies, with the intention to apply these bactericidal viruses directly into the infection sites in bespoke phage cocktails. Despite the great unsampled phage diversity for this purpose, an issue hampering the roll out of phage therapy is the poor quality annotation of many of the phage genomes, particularly for those from infrequently sampled environmental sources. We developed a computational tool called STEP3 to use the “evolutionary features” that can be recognized in genome sequences of diverse phages. These features, when integrated into an ensemble framework, achieved a stable and robust prediction performance when benchmarked against other prediction tools using phages from diverse sources. Validation of the prediction accuracy of STEP3 was conducted with high-resolution mass spectrometry analysis of two novel phages, isolated from a watercourse in the Southern Hemisphere. STEP3 provides a robust computational approach to distinguish specific and universal features in phages to improve the quality of phage cocktails and is available for use at http://step3.erc.monash.edu/. IMPORTANCE In response to the global problem of antimicrobial resistance, there are moves to use bacteriophages (phages) as therapeutic agents. Selecting which phages will be effective therapeutics relies on interpreting features contributing to shelf-life and applicability to diagnosed infections. However, the protein components of the phage virions that dictate these properties vary so much in sequence that best estimates suggest failure to recognize up to 90% of them. We have utilized this diversity in evolutionary features as an advantage, to apply machine learning for prediction accuracy for diverse components in phage virions. We benchmark this new tool showing the accurate recognition and evaluation of phage component parts using genome sequence data of phages from undersampled environments, where the richest diversity of phage still lies.

A ntimicrobial resistance (AMR) has risen to prominence as a major threat to human health (1, 2), and new strategies are required for the treatment of AMR infections (3)(4)(5).For example, the Centers for Disease Control and Prevention have identified several species of microbes as "Urgent" threats to human health by virtue of their AMR phenotypes, including Escherichia coli and Enterococcus faecalis.As another prime example of one of these, the carbapenem-resistant Enterobacteriaceae (CRE), Klebsiella pneumoniae infections represent a key target for new therapeutics to treat AMR infections (3)(4)(5).Bacteriophages (phages) that kill bacterial pathogens such as Klebsiella are being identified for use in phage therapies, with the intention to apply these bactericidal viruses directly into the infection sites.Careful consideration is needed in selecting the phages for use in therapeutic cocktails (4)(5)(6), considerations made difficult because annotation of phage genomes is poor (7,8), potentially obscuring phages with therapeutic potential.For example, while structural motifs are now known (9) that will promote phage virion stability (i.e., shelf-life), only with correct annotation of the major capsid, minor capsid, and other proteins involved can structural motifs be identified and evaluated.
Phage therapy has reemerged because of its potential treatment for antimicrobialresistant infections, and a common protocol for treatments is to select two or more phages for combination into a treatment cocktail (4)(5)(6).An ongoing issue is the establishment of criteria used for selection of appropriate phages for a cocktail, to enhance production and maximize efficacy, and to circumvent issues of phage resistance and collateral induction of further drug resistance in the infection sites (4,6).The phages used for phage therapy are Caudovirales conforming to a blueprint of an icosahedral protein capsid housing the phage genome and a tail composed of 20 to 40 protein components (10).The tails of these phages can be considered a complex piece of molecular machinery, with component parts of the tail recognizing and docking to a species-specific receptor on the host bacterium (11,12).Penetration of the host cell envelope depends on other components of the tail, which can have enzymatic functions to locally hydrolyze each of the distinct layers of the bacterial envelope (12)(13)(14).An ultimate goal for the development of personalized phage therapy is the recognition of all of these components from genome sequence data, so that bespoke phage could be selected for specific therapeutic purposes (5,6).However, the annotation of phage genomes is poor, potentially obscuring important features contributed by some component parts such as contributions to virion stability and shelf-life, host range, and bacterial cell lysis (7,8,15).

RESULTS AND DISCUSSION
Currently, phage genomes are assessed by tools such as multiPhATE (15) which provides a bioinformatics pipeline for functional annotation using sequence-based queries.The annotation accuracy of multiPhATE is limited by the extreme sequence diversity in phage genomes, likely due to the rapid evolutionary rates of phages (16).This limitation has been addressed to some extent with a neural network-based predictor iVIREONS (17) and further tools such as PVPred (18), PVP-SVM (19), PhagePred (20), Pred-BVP-Unb (21), and PVPred-SCM (22).However, recent evaluation of these tools in phage protein prediction showed less than satisfactory performance (23).We developed an ensemble predictor, STEP 3 , to accurately call the protein components of phage virions and visualize their predicted function-based relationships (Fig. 1).
STEP 3 extracted information from position-specific scoring matrix (PSSM) data (Fig. 1a), an approach that tracks protein evolutionary histories (24,25).In machinelearning evaluation of protein sequences, "evolutionary features" refer to information within the amino acid sequences that conceptually traces the evolutionary history of proteins, and their use often identifies highly informative patterns (24,25).STEP 3  includes data visualization capabilities to document relationships between virion components where the sequence similarity is sufficiently strong to identify high confidence homologs from other phages (Fig. 1b; see Fig. S1 in the supplemental material).59), and a MEDP (61) model.The five individual models were trained based on five balanced subsets, and their prediction scores were averaged to obtain an ensemble model.Finally, five baseline models (corresponding to five evolutionary features) were further integrated as the final ensemble model of STEP 3 through averaging their prediction scores.Support vector machine (SVM) with a radial basis function kernel was used to train each model.This ultimately provides a prediction of a "virion protein" which would be a structural component of the phage virion.(b) STEP3 data visualization provides a means to document relationships between a protein of interest.The example given is the protein component gpE from phage l, which shows clear similarity to major capsid proteins from other phages.Structural studies confirm that despite limited sequence similarity, gpE is part of a family of major capsid proteins (9).Alternative visualization features are available in STEP 3 (see Fig. S1 in the supplemental material).
There is power in integrating individual models within an ensemble framework for more robust and stable predictions: trained with an individual model alone (AAC-PSSM), predictions perform well with the fivefold cross-validation test (Fig. 2a; see Fig. S2 and Table S1 in the supplemental material) but ranked only fourth using the independent test (Fig. 2b and Table S2).In contrast, combined with other models into the ensemble model of STEP 3 , to draw on the best elements from all of the individual models (Fig. 1a), the overall best prediction performance ranking was achieved (Fig. 2a and b and Tables S1  and S2).In benchmarking against other available predictors, the ensemble STEP 3 achieved an improved performance, with the highest sensitivity (SN = 0.896), accuracy (ACC = 0.891), F-value (0.891), and Matthews correlation coefficient (MCC = 0.781) using the independent test (Fig. 2c and Table S3).The superior performance of STEP 3 can be attributed to the integration of more informative evolutionary features, as well as the comprehensive and upto-date training data set using experimentally verified inputs.It is worth noting that the BLAST-based predictor, which represents the mode used for genome annotation had the lowest accuracy (ACC) and F-value.This prediction bias is reflected by the extremely unbalanced sensitivity (the lowest) and specificity (the highest) scores, so that the BLAST-based predictor tended to predict positive samples as being negative.This quantifies and offers evidence for past observations that pairwise sequence matching methods struggle to predict phage proteins (25).
For initial case studies, we drew on three accounts published after STEP 3 was trained, where phages had been discovered, the genome sequence data deposited for public access, and the protein composition virions had been determined by mass spectrometry.The mass spectrometry data are crucial, as it enables discrimination between false-positive (FP; predicted but not present by mass spectrometry of the virion) and true-positive (TP; predicted and found present by mass spectrometry of the virion) results.Phage vB_EfaS_271 infects Enterococcus faecalis (26), phage vB_PatM_CB7 infects Pectobacterium atrosepticum (27), and phage vB_Eco4M-7 infects enteropathogenic Escherichia coli (28).STEP 3 was benchmarked against equivalent predictors: PVPred, PVP-SVM, Pred-BVP-Unb, and PVPred-SCM (Fig. 3).STEP 3 provided the greatest set of true-positive predictions for each of the three phages, predicting 9 of the 15 virion components for phage vB_EfaS_271, 23 of the 26 protein components for phage vB_PatM_CB7, and 24 out of 33 components of the phage vB_Eco4M-7 virions.Making low FP predictions on each phage, STEP 3 maintained a good balance between TP and FP results and showcased robust prediction performance across the test cases.In the case of phage vB_PatM_CB7, where mass spectrometry data had shown the number of nonvirion proteins is more than eight times as many as that of virion proteins, STEP 3 generated equal numbers of FP results and TP results.In this extreme case, STEP 3 correctly predicts 23 out of 26 virion proteins with a false-positive rate of 10.1% (23/227).
Oftentimes candidate phages that kill pathogens are isolated from hospital wastewater sources for their use in phage therapy (29,30).This raises the issue of potential oversampling of a common environmental source (i.e., wastewater) for phages, potentially limiting discovery of other, valuable phages and also potentially biasing the capability of predictors like STEP 3 .Therefore, as a further proof-of-principle test for STEP 3 , we sampled a natural watercourse with a strain of drug-resistant and hypervirulent Klebsiella pneumoniae as the host.The Merri Creek, which forms a part of the larger Merri catchment, lies within Wurundjeri Woi wurrung people's traditional homelands.Phages isolated from two separate sampling sites were characterized initially by genome sequencing and named in Woi wurrung language Merri-merri-uth nyilam marranatj (MMNM) and Merri-merri baany-a bundha-natj (MMBB); these names translate as "Dangerous Merri lurker" and "Merri water biter," respectively, in English.
Comparative genomic analysis revealed Klebsiella phages MMNM (Fig. 4) and MMBB (Fig. S3) to be distinct from previously sampled phages.In the case of MMNM, some similarities can be seen to phages belonging to the Jedunavirus genus according to the most recent International Committee on Taxonomy of Viruses (ICTV) classification, but the branch lengths on the tree designate diversity within this small group, comprising only eight phages in the NCBI database (Fig. 4a).Relatives of MMNM, isolated from hospital wastewater in Russia, showed considerable diversity in gene content and arrangement (Fig. 4b).Most notably, MMNM encodes several genes that are absent in many of the other sequenced jedunaviruses, including previously uncharacterized proteins MMNM_5, MMNM_6, MMNM_45, MMNM_51, MMNM_56, MMNM_57, and the putative polynucleotide kinase protein MMNM_50.Conversely, MMNM lacks the putative NHN endonucleaselike protein encoded by both vB_KpnM_FZ14 and vB_KpnM_KpV52.Sequence annotations suggest that MMNM has a tail structure characteristic of Myoviridae, including a baseplate protein (MMNM_21), a baseplate J-like protein (MMNM_23), and the baseplate wedge protein (MMNM_26).In high-resolution structural analyses of the Myoviridae phage T4, each virion has six molecules of each of these proteins and one to three molecules per virion of the hub proteins to which the baseplate is attached (31,32).
MMBB belongs to the Webervirus genus, a group of phages that exclusively target Klebsiella species (Fig. S3).MMBB is distinct from the other phages in this genus, with its closest relationship being to a phage isolated in China called vB_KpnS_GH-K3 (also called phage GH-K3) (33).Highlighting their differences, MMBB and GH-K3 show regions of diversity in gene content and arrangement; this is observed for the gene encoding MMBB_16, a putative AP2/HNH endonuclease previously found only in a small number of other Siphoviridae phages, including the Escherichia phage vB_EcoS_ESCO41 and Escherichia phage CJ19 (Fig. S3).Additional differences are seen in a contiguous cluster of four genes encoding hypothetical proteins (MMBB_45 to MMBB_48) that are absent in GH_K3.
Phenotypic characterization of the phages on lawns of K. pneumoniae (see Materials and Methods) showed that the plaque size for MMNM was smaller than that for MMBB (Fig. 5a) and with liquid cultures of K. pneumoniae (Materials and Methods) that  3 and other tools for vB_PatM_CB7 virion proteins defined by mass spectrometry (27).(e) For phage vB_Eco4M-7, the bar chart shows the numbers of virion proteins correctly retrieved as TP and nonvirion proteins mistakenly predicted as FP.(f) Detailed predictions from STEP 3 and other tools for vB_PatM_CB7 virion proteins defined by mass spectrometry (28).
MMNM had a shorter latent period (L) before host cell death as determined by onestep growth curves (Fig. 5b).Electron microscopy revealed that MMNM has an icosahedral head and a tail tube of ;54 nm capped with an ;30-nm baseplate to generate thick and straight tails (Fig. 5c).The baseplate structure evident in MMNM (Fig. 5c) is similar to that seen for the T4 phage (31), which serves as a paradigm for the Myoviridae (34) (Fig. 5d).In contrast, MMBB has ;200-nm-long, slender, and flexible tails (Fig. 5c).The flexible, noncontractile tail tube designate MMBB as a phage of Siphoviridae-like viruses (Fig. 5d), consistent with genome annotation data.
To directly test STEP 3 prediction capability on the novel phages MMNM and MMBB, the protein components contributing structurally to the virions were determined by high-performance mass spectrometry (35,36).To this end, samples of each virion were purified using cesium chloride gradients.The MMNM virion is composed of 25 protein components (Table S4).Assuming a similar stoichiometry between MMNM virions and the paradigm for Myoviridae, phage T4 virions, the identification of the lytic transglycosylase MMNM_19 suggests that the proteomic analysis is sensitive enough to detect three or fewer molecules per virion (31).From evaluation of the predicted proteins within the phage genomes, together with these mass spectrometry data, the MMNM genome encodes 25 structural proteins that serve as components of the virion and 42 proteins that would be expressed after infection of the host to drive phage replication (Fig. 6a).
STEP 3 successfully predicted 22 out of the 25 MMNM virion proteins (Fig. 6b and Table S5).The other predictors gave poorer outcomes with these diverse protein sequences.For example, second to STEP 3 was iVIREONS which identified 19 virion proteins, but iVIREONS also generated the largest number of false-positive results, 14, consistent with its high false-positive prediction rate in the independent tests (Table S3).In one case, the initial STEP 3 analysis made a false-negative prediction that was highly informative.The phage polynucleotide kinase (PNK) is an enzyme that has been previously assumed to be a nonvirion protein, and the sequence was therefore included in that (nonvirion) data set from which STEP 3 was trained.However, mass spectrometry identified the putative PNK protein MMNM_50 as a component of the virion (Table S4).Note that an equivalent result was achieved with the prediction for MMBB: protein MMBB_64 was detected by mass spectrometry (Table S7) but not selected by STEP 3  (Table S4 and Table S6).We suggest that for some phages the PNK remains associated with the packaged genome and is thereby incorporated within the capsid.This suggestion explains the proteomics data herein, reconciles the false-negative prediction by STEP 3 , and is consistent with the recent observation that the "gp44 ejection protein" is a virion protein in a Staphylococcus phage 80a bound to genome ends and functioning as a putative PNK would to protect the DNA from degradation upon phage entry into its host (37).
High-resolution mass spectrometry of the MMBB virions showed them to be composed of 29 protein components (Table S7).Thus, the MMBB genome encodes 29 proteins contributing structurally to the virions and 50 nonvirion proteins expressed only after infection in the host bacterium (Fig. 6c).For MMBB, STEP 3 and iVIREONS retrieved 20 and 18 virion proteins, respectively (Fig. 6d and Table S6).The other predictors achieved unsatisfactory prediction results, retrieving less than half of the 29 virion proteins.
The evolutionary features drawn on by STEP 3 and iVIREONS are structure informed, in that the patterns that they recognize are reflections of secondary and tertiary structure, and these patterns can also be used to suggest protein function.For example, a characteristic of the Webervirus has been suggested to be the presence of tail spike proteins with polysaccharide degrading activity (38), and the sequence of MMBB_78 is suggestive of such a protein, as summarized in Fig. S3.Conversely, pairwise sequence assessment is a poor means for recognition and characterization of virion proteins.For both MMNM and MMBB, sequence conservation alone proved the least satisfactory method for predicting phage virion proteins: the BLAST-based predictor recognized only three and six virion proteins, respectively (Fig. 6b and d and Tables S5 and S6).This confirmed the independent test results that the BLAST-based methods commonly used for annotations are a poor means of recognizing and classifying sequence-diverse phage proteins.Some estimates put the number of phage virions in the world at 10 31 , suggesting that there is a huge pool of phages that we know little about (39).This encourages a move toward informed bioprospecting for potentially useful phages from undersampled environments.The effective use of these for therapy and other applications depends on a number of factors, not least of which is the sequence-based choices that must be made to identify novel phages warranting further characterization and potential development into phage therapy.We suggest that application of STEP 3 will assist in distinguishing the specific and universal features in phages isolated from underrepresented (undersampled) geographical locations, with impact on the quality of future phage cocktails.Particularly in phages that might be highly divergent in their sequence characteristics, such as the MMNM and MMBB case studies here, STEP 3 can predict the component parts of the virions with a confidence level well above other computational tools.The STEP 3 toolbox is available at http://step3.erc.monash.edu/.
Phage isolation and infection of Klebsiella.Water samples were collected from catchment locations along the Merri Creek in Melbourne, Australia (Reservoir, postcode 3073, yielded MMNM, and Pascoe Vale, postcode 3044, yielded MMBB).Samples were centrifuged at 10,000 Â g for 10 min and filtered through a 0.45-mm cutoff filter.Water samples (45 ml) were subsequently mixed with 5 ml of 10Â concentrated Luria-Bertani (LB) medium and 1 ml of a K. pneumoniae B5055 DompK36 overnight culture and grown for a further 16 h at 37°C.Cellular debris was pelleted by centrifugation at 10,000 Â g for 10 min, and the resulting supernatant was passed through a 0.45-mm filter.To monitor phage activity, 20 ml of the supernatant was then spotted onto LB agar plates containing a top layer of soft agar (4 ml LB and 0.35% [wt/vol] agar) and 200 ml of bacterial culture and incubated overnight at 37°C.
For liquid infections, the filtered supernatant was serially diluted with SM buffer (100 mM NaCl, 8 mM MgSO 4 , 10 mM Tris [pH 7.5]) and added to 200 ml of K. pneumoniae B5055 DompK36.Cultures were incubated for 20 min at 37°C to allow phage adsorption and were then added to soft agar and poured using the double overlay method.Plaques with distinct morphologies were isolated from the top agar, serially diluted in SM buffer, and incubated with the bacterial host as described above.This was repeated five times to obtain pure phage stocks.
Phage amplification and purification.For large amplification of the phages MMNM and MMBB, infections were performed using 14-cm petri dishes with 60 ml of phage preparation (10 24 dilution) added to 500 ml of an overnight culture and incubated for 20 min at 37°C.Ten milliliters of soft agar was then added to the culture and poured using the double agar layer method and incubated overnight at 37°C.Ten milliliters of SM buffer were added to each plate and incubated at room temperature for 10 min.The soft agar layer was scraped off using a disposable spreader, and chloroform was subsequently added (1 ml/100 ml) to lyse bacterial cells to release the phages.The sample was then subjected to vigorous shaking, before the agar and bacterial cell debris were removed by centrifugation at 11,000 Â g for 40 min (4°C).The supernatant containing the phages was collected, and DNase (1 mg/ml) and RNase (1 mg/ml) were subsequently added to the supernatant and incubated for 30 min at 4°C.NaCl (1 M final concentration) was added and incubated at 4°C for 1 h with gentle mixing.Phages were precipitated from the medium by adding polyethylene glycol (PEG) 8000 (10% final concentration) and incubated at 4°C overnight.Precipitated phage particles were collected by centrifugation at 11,000 Â g for 20 min at 4°C and resuspended in SM buffer (1.6 ml/100 ml of precipitated supernatant).An equal volume of chlo-  roform was added to the resuspended phage suspension to remove residual PEG and cell debris and vortexed for 30 s.The organic and aqueous phases were separated by centrifugation at 3,000 Â g for 15 min at 4°C.For purification on cesium chloride (CsCl) gradients, the aqueous phase containing the phages was removed and added to CsCl (0.5 g/ml of bacteriophage suspension) and mixed gently to dissolve the CsCl.The suspension was layered onto a discontinuous CsCl gradient (2 ml of 1.70 g/ml, 1.5 ml of 1.50 g/ ml, and 1.5 ml of 1.45 g/ml in SM buffer) in a Beckman SW41 centrifuge tube.Gradients were centrifuged at 22,000 rpm for 2 h (4°C).Phage particles were collected from the gradient by piercing the side of the centrifuge tube with a syringe and removing the visible band in the gradient.Residual nucleic acid was removed from the phage preparation using floatation gradient centrifugation.Equal volumes of phage suspension (500 ml) and 7.2 M CsCl SM buffer were mixed and added to the bottom of a Beckman SW41 centrifuge tube.CsCl solutions (3 ml of 5 M and 7.5 ml of 3 M) were overlaid on top of the phage sample and centrifuged at 22,000 rpm for 2 h (4°C).Phage particles were collected (;500 ml) using a syringe as described above.CsCl was dialyzed out of the phage stock twice with 2 liters of SM buffer overnight at 4°C.
Phage growth.One-step growth curve experiments were performed on K. pneumoniae as previously described (29).Mid-log-phase cultures were adjusted to an optical density at 600 nm (OD 600 ) of 0.5, pelleted, and suspended in 0.1 volume of SM buffer.Phage lysate was subsequently added at a multiplicity of infection (MOI) of 0.01 and was allowed to adsorb for 10 min at 37°C.Following centrifugation at 12,000 Â g for 4 min, the pellet was washed twice with SM buffer, resuspended with 30 ml of fresh LB broth, and incubated at 37°C.Samples were collected at 10-min intervals for 120 min and titrated to determine PFU per milliliter.Growth experiments were performed in biological triplicates.
Electron microscopy.From the CsCl purifications, phage preparations (4 ml) were added to freshly glow-discharged CF200-Cu Carbon Support Film 200 Mesh Copper grids (ProScieTech) for 30 s.The sample was blotted from the grid using Whatman filter paper, and samples were subsequently stained with 4 ml of Nano W methylamine tungstate (Nanoprobes) for 30 s and blotted again.Grids were imaged using a 120 keV Tecnai Spirit G2 transmission electron microscope (Tecnai).
A BLAST-based predictor was implemented during the evaluation of STEP 3 .It ran using blast-2.2.261.For a query protein, the BLAST-based predictor will predict it to be positive if there is a BLAST hit against the training positive samples with a specified E value.The E value was set at 0.01 in this study, optimized on the independent data set with a range of values, 0.001, 0.01, 0.1, 1, and 10.
Mass spectrometry.Each CsCl-purified phage sample was solubilized in sodium dodecyl sulfate (SDS) lysis buffer (4% SDS, 100 mM HEPES [pH 8.5]) and sonicated to assist protein extraction.The protein concentration was determined using a BCA kit (Thermo Scientific).SDS was removed as previously described (52), and the proteins were proteolytically digested with trypsin (Promega) and purified using OMIX C18 Mini-Bed tips (Agilent Technologies) prior to liquid chromatography coupled to tandem mass spectrometry (LC-MS/MS) analysis.Using a Dionex UltiMate 3000 RSLCnano system equipped with a Dionex UltiMate 3000 RS autosampler, an Acclaim PepMap RSLC analytical column (75 mm Â 50 cm, nanoViper, C 18 , 2mm, 100 Å; Thermo Scientific), and an Acclaim PepMap 100 trap column (100 mm Â 2 cm, nanoViper, C 18 , 5mm, 100 Å; Thermo Scientific), the tryptic peptides were separated by increasing concentrations of 80% acetonitrile20.1% formic acid at a flow of 250 nl/min for 120 min and analyzed with a QExactive Plus mass spectrometer (Thermo Scientific) using in-house optimized parameters to maximize the number of peptide identifications.To obtain peptide sequence information, the raw files were searched with Byonic v3.0.0 (ProteinMetrics) against the K. pneumoniae B5055 (derivative str.Kp52.145)GenBank file FO834906 that was appended with the phage protein sequences.Only proteins falling within a false discovery rate (FDR) of 1% based on a decoy database were considered for further analysis.
Homology modeling.Structural homologs were selected by querying the MMBB_78 sequence via the BLASTp webserver against the Protein Databank (PDB).In addition, this same sequence was probed FIG 6 Legend (Continued) spectrometry.The green circles represent a successful hit by a predictor.The green stars denote the proteins that have not previously been identified in phages.The red stars denote those with activities that have previously been identified in phages but not previously found as protein components of purified virions.(c) Prediction statistics for MMBB.(d) Detailed predictions from STEP 3 and other tools for MMBB virion proteins defined by mass spectrometry.using the Phyre2 software suite to identify local homology (53).Residues 186 to 872 of MMBB_78 were modeled against the enzymatic domain of the bacteriophage CBA120 tail spike protein (PDB ID 5W6P [54]).MODELLER v9.19 (55) was used with custom in-house scripts to generate 1,000 potential models.These models were validated and sorted by their discrete optimized protein energy (DOPE) score, followed by visual inspection.An additional atomic model was calculated by the predictive software GalaxyTBM using the full-length MMBB_78 sequence, as part of the GalaxyWEB (56) software suite.Construction of STEP 3 .(i) Data set construction.A total of 481 phage virion proteins were collected from the UniProt database with the "reviewed" tag and from the NCBI database following extensive literature searches.Redundant sequences were removed using the CD-HIT program (57) at a cutoff threshold of 0.4.As a result, 339 virion proteins with less than 40% sequence similarity were obtained.These proteins were further divided into two parts as positive samples: 243 in the training data set and 96 in the independent data set.For negative samples, we downloaded all 1,335 reviewed phage nonvirion proteins (with keywords "NOT Virion" and organism="phage" and fragment="no") from the UniProt database.After sequence redundancy reduction using the cutoff threshold of 0.4 within the negative samples and against positive samples, 790 phage nonvirion proteins were obtained to make up the final negative training (694) and independent (96) data sets, respectively.Finally, a training data set (243 positive samples and 694 negative samples) and an independent data set (96 positive samples and 96 negative samples) were obtained, where each had less than 40% sequence similarity against each other.Three very recently reported phage genomes vB_EfaS_271 (26), vB_PatM_CB7 (27), and vB_Eco4M-7 (28), as well as two newly sequenced phage genomes MMNM and MMBB in this study, were used to validate the prediction capability of STEP 3 in practical scenarios.
(ii) PSSM generation.PSSM is an L Â 20 matrix, where L is the length of its original protein sequence and 20 is the number of amino acids.The (i, j)-th element (1 # i # L, 1 # j # 20) in an PSSM corresponds to the probability of the jth amino acid to appear in the ith position of its protein sequence.To generate an PSSM, blast-2.2.26 resource (https://ftp.ncbi.nlm.nih.gov/blast/executables/) was used to search the protein sequence against the UniRef50 data set (https://www.uniprot.org/help/uniref)with an E value of 0.001 and the iteration of 3.
(iii) Feature encoding.Instead of extracting features directly from the protein sequences, evolutionary features mine patterns from a more informative profile in the format of PSSM.Five types of evolutionary features were generated using the POSSUM toolkit (58), including AAC-PSSM ( 59), PSSM composition (60), DPC-PSSM (59), AADP-PSSM (59), and MEDP (61).For a given PSSM, their calculations are briefly described as follows.(i) AAC-PSSM generates a 20-dimensional vector through summing up and averaging all rows of the PSSM (59).(ii) PSSM composition further divides PSSM rows into 20 groups according to their corresponding amino acids in the original protein sequence (60).The rows in each group are summed up and normalized, and as a result, the PSSM are transformed into a 20 Â 20 matrix.Converting this matrix into a vector by row, PSSM composition finally generates a 400-dimensional vector.
Additionally, four commonly used features were additionally implemented for comparison purpose, including the amino acid composition (AAC), dipeptide composition (DPC), QSOrder (62), and PAAC (63).AAC and DPC count the frequencies of residues and dipeptides in a protein sequence, respectively.QSOrder and PAAC extract features from a protein sequence as well, incorporating the physicochemical properties of its individual amino acids.Among them, QSOrder adopts Schneider-Wrede physicochemical distance matrix (64) and Grantham's distance matrix (65), while PAAC takes hydrophobicity values from Tanford (66) and from Hopp and Woods (67), as well as amino acid side chains.
(iv) Model training on imbalanced data.Our imbalanced training data set is to reflect the fact that the number of virion proteins is usually smaller than that of the nonvirion proteins in a phage isolate.The ratio of positive and negative samples in our training data set is 1:2.86 (i.e., 243:694), which falls in the general range of ratios (usually from 1.5 to 3) between virion and nonvirion proteins in any given phage genome.To avoid prediction bias of models directly trained on imbalanced data, we applied the undersampling technique to generate multiple balanced data sets for model training.Specifically, we combined all of the virion proteins with the same number of randomly selected nonvirion proteins to generate a new balanced subset.This procedure was repeated five times to generate five balanced subsets.For each feature, five individual models were trained based on five balanced subsets, and their prediction scores were averaged to obtain an ensemble model as the baseline model.Support vector machine (SVM) with a radial basis function kernel was used to train each model, implemented by the e1071 package (https://CRAN.R-project .org/package=e1071) in the R language (https://www.r-project.org/).The two parameters of SVM, including the Cost and Gamma, were optimized by a grid search between 2 210 and 2 10 with a step of 2 1 using the same R package.
(v) Model integration.Training a model with each of the features and then integrating them as an ensemble model usually have better and more robust performance, compared with simply training a model with all features (68).Accordingly, the five baseline models (corresponding to five evolutionary Bacteriophage Component Parts May/June 2021 Volume 6 Issue 3 e00242-21 msystems.asm.org 13 features) were further integrated as the final ensemble model of STEP 3 through averaging their prediction scores (Fig. 1a).
(vi) Performance evaluation.The STEP 3 predictor was extensively validated, with the baseline models and existing state-of-the-art tools on the fivefold cross-validation and independent tests.Five performance metrics were used, including sensitivity (SN), specificity (SP), accuracy (ACC), F-value, and Matthews correlation coefficient (MCC) (69).For each model, fivefold cross-validation tests were conducted five times based on the five balanced training data sets, and then the performance metrics were averaged as the final performance result.The other tools compared to STEP 3 were iVIREONS (https:// vdm.sdsu.edu/ivireons),PVPred (http://lin-group.cn/server/PVPred),PVP-SVM (http://www.thegleelab.org/PVP-SVM/PVP-SVM.html), PVPred-SCM (http://camt.pythonanywhere.com/PVPred-SCM),and Pred-BVP-Unb (21).With no available tool for Pred-BVP-Unb, we developed one based on our training data set by strictly following its methods, including its synthetic minority oversampling technique (SMOTE) to cope with the imbalance data set, feature encodings, feature selection (a more generalized method GainRatio used), and the same grid search for parameter optimization.The prediction threshold for Pred-BVP-Unb is a standard cutoff of 0.5, which is the same as STEP 3 .
(vii) Server construction and usage.The STEP 3 server contains a client web interface and a server backend.The client web interface was implemented by the JAVA server development suite, JSP, CSS, jQuery, Bootstrap, and their extension packages.The server backend was used by the Perl CGI (https:// metacpan.org/pod/CGI).For visualization purposes, the blast 2.8.11 (https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.8.1/) was used to search each predicted virion protein against known virion proteins to generate sequence similarities, which was visualized by BlasterJS (70).The MAFFT v7.271 (https://mafft.cbrc.jp/alignment/software/)was used to generate multiple alignment results between each predicted virion protein and known virion proteins, which was visualized by jsPhyloSVG (71).The all-against-all BLAST (version blast-2.2.26) was used to generate the sequence similarity network, visualized by ECharts (https://echarts.apache.org/).A queuing system was implemented using the Gearman framework (http://gearman.org/) to store the jobs the client deposits and dispatch them to idle threads maintained in the server backend.In this way, it links the two parts of STEP 3 but decouples the prompt response required in a client web interface and the time-consuming server backend for better user experience.To use the STEP 3 server, users submit their protein sequences in FASTA format and obtain a unique link to track the prediction progress or obtain the results once finished.In default mode, i.e., "For normal use," the known virion proteins were marked with "exp." with an external link to the UniProt or NCBI database, while the predicted virion proteins were marked with "pred."with detailed annotations and options for visualization.Through interactive visualization, users could tentatively annotate the putative virion proteins with their potential subtype or functions, based on the sequence similarity or phylogenetic analysis considerations.For users who want to benchmark the STEP 3 server, a "For benchmarking test" option is available to obtain prediction scores for all their sequences.We declare that we have no competing interests.

FIG 1
FIG 1Construction and workflow for STEP3 .(a) Graphic summarizing the construction and prediction process of STEP3 .A set of experimentally validated virion proteins and nonvirion proteins was compiled, and sequence data were fed into five PSSM models, including AAC-PSSM (59), PSSM composition (60), DPC-PSSM (59), AADP-PSSM(59), and a MEDP (61) model.The five individual models were trained based on five balanced subsets, and their prediction scores were averaged to obtain an ensemble model.Finally, five baseline models (corresponding to five evolutionary features) were further integrated as the final ensemble model of STEP3 through averaging their prediction scores.Support vector machine (SVM) with a radial basis function kernel was used to train each model.This ultimately provides a prediction of a "virion protein" which would be a structural component of the phage virion.(b) STEP3 data visualization provides a means to document relationships between a protein of interest.The example given is the protein component gpE from phage l, which shows clear similarity to major capsid proteins from other phages.Structural studies confirm that despite limited sequence similarity, gpE is part of a family of major capsid proteins(9).Alternative visualization features are available in STEP 3 (see Fig.S1in the supplemental material).

FIG 2
FIG 2 Performance validation of STEP 3 .(a) Performance evaluation on the fivefold cross-validation test.(b) Performance evaluation on the independent test.(c) Performance comparison with existing tools on the independent test.

FIG 3
FIG 3 Prediction details from STEP 3 and other tools.(a) For phage vB_EfaS_271, horizontal bars denote the number of virion and nonvirion proteins.The bar chart shows the numbers of the virion proteins correctly retrieved as true-positive results (TP), i.e., confirmed by mass spectrometry (26), and nonvirion proteins mistakenly predicted as virion proteins (denoted by false-positive results [FP]).(b) For each protein in the phage vB_EfaS_271 virion defined by mass spectrometry, a green circle represents a successful hit by a predictor.(c) For phage vB_PatM_CB7, the bar chart shows the numbers of virion proteins correctly retrieved as TP and nonvirion proteins mistakenly predicted as FP.(d) Detailed predictions from STEP3 and other tools for vB_PatM_CB7 virion proteins defined by mass spectrometry(27).(e) For phage vB_Eco4M-7, the bar chart shows the numbers of virion proteins correctly retrieved as TP and nonvirion proteins mistakenly predicted as FP.(f) Detailed predictions from STEP 3 and other tools for vB_PatM_CB7 virion proteins defined by mass spectrometry(28).

FIG 4
FIG 4 Comparative genome analysis of Klebsiella phage MMNM.(a) Proteomic tree analysis of Myoviridae that infect Gammaproteobacteria.The branch lengths represent genomic similarity based on normalized pairwise sequence similarity scores plotted on a logarithmic scale.The tree was constructed using sequences from the default ViPTree data set and the following selected Klebsiella phage genomes: vB KpnM KpV79 (GenBank accession no.NC_042041), vB KpnM FZ14 (MK521906), vB KpnM KpV52 (NC_041900), 1611E-K2-1 (MG197810), vB KpnM IME346 (MK685667), vB KpnM 15-38 KLPPOU148 (MN689778), PEAT2 (NC_044940), and MMNM (MT894004).Viral subfamilies or genera are indicated by the colored bars.Gray bars represent phages that are currently unclassified.All known members of the Jedunavirus, including Klebsiella phage MMNM (*), are highlighted in red.(b) Whole-genome alignment of Klebsiella phage MMNM, vB_KpnM_FZ14, and vB_KpnM_KpV52.Each genome has been oriented to start with the gene encoding the putative tape measure protein.The sequences are linked by colored bars highlighting sequence identity values as shown in the key.

FIG 5
FIG 5 Morphological characterization of phages MMNM and MMBB.(a) Plaque morphology analysis was performed using the double overlay method.Plaque morphology analysis was performed using (Continued on next page)

FIG 5
FIG 5 Legend (Continued)    the double overlay method after liquid infections of B5055 DompK36 with serially diluted MMNM and MMBB.Plaque morphologies of MMNM and MMBB were determined after overnight incubation at 37°C.Bars, 10 mm.(b) One-step growth curve of MMNM (left) and MMBB (right) was performed by coincubation with the host strain for 10 min at 37°C for phage adsorption, after which the mixture was subjected to centrifugation to remove free phage particles.The resuspended cell-phage pellets were incubated at 37°C and sampled at 10-min intervals for 120 min.L, latent period; B, burst size.Data points are the means of three biologically independent samples, and the error bars are the standard deviations.(c) Transmission electron micrographs of MMNM (left) and MMBB (right).Bars, 100 nm.(d) Based on electron microscopy (EM) micrographs, illustrations of MMNM (left) and MMBB (right) show the cognate features in Myoviridae and Siphoviridae with annotation.

FIG 6
FIG 6  Prediction details from STEP3 and other tools applied to MMNM and MMBB.(a) The statistics of the prediction results on MMNM.Horizontal bars on top describe the number of virion and nonvirion proteins in the phage isolates.The bar chart shows the numbers of virion proteins correctly retrieved (denoted by true-positive results [TP], i.e., confirmed by mass spectrometry) and nonvirion proteins mistakenly predicted as virion proteins (denoted by false-positive results [FP]).(b) Detailed predictions from STEP 3 and other tools for MMNM for the virion proteins defined by mass (Continued on next page) Thung et al.May/June 2021 Volume 6 Issue 3 e00242-21 msystems.asm.org 12 of their Country with distinct rights and will ensure the equitable sharing of resources, including any commercial benefits realized from the development of Wurundjeri Woi wurrung resources.We acknowledge Jordan Smith and Karmen Jobling of the Wurundjeri Woi wurrung Cultural Heritage Aboriginal Corporation's Water Unit for their stewardship in shaping the MoU between Wurundjeri Woi wurrung Cultural Heritage Aboriginal Corporation and Monash Centre to Impact AMR.We are grateful to Richard Strugnell, Department of Microbiology and Immunology, University of Melbourne for access to his collection of Klebsiella isolates.W.D. was a visiting MSc student at Monash University, supported by the study abroad program for graduate student of Guilin University of Electronic Technology (GDYX2019010).Research was supported by a seed grant from the Monash-Warwick Alliance (to T.L., E.J., and S.M.), and the initial phase of the project was supported by the Australian Research Council (FL130100038).T.Y.T., M.E.W., and J.J.W. performed biological experiments.W.D. and Y.Z.performed computational experiments.D.W., R.S.B., and S.M. performed structural calculations and modeling, and R.S.B. performed electron microscopy.E.J., J.J.B., A.R., C.J.S., C.H., and R.S. analyzed data.J.W., R.A.D., and T.L. supervised the project, analyzed data, and wrote the paper.All authors read and approved the final version.