This invention generally relates to infectious diseases and microbial genomics. In alternative embodiments, provided are products of manufacture and kits, and methods, that comprise or comprise use of inhibitory molecules for treating or ameliorating, or providing clinical decision support for the treatment of, a bacterial infection, for example, a Mycobacterium tuberculosis (MTB) infection, or an infection by any member species of the M. tuberculosis (MTBC) complex. In alternative embodiments, the inhibitory molecules inhibit the expression of one or more of the 38 PE/PPE MTB genes as identified herein, or one or more of the 366 identical MTB genes and proteins common to a global set of MTB isolates as identified herein, or the inhibitory molecules inhibit the expression of a transcript or a polypeptide gene product thereof, or inhibit the changing (and hence evading the immune system) of one or more of the 52 special immune-evading MTB PE/PPE genes as identified herein.
Mycobacterium tuberculosis has recently surpassed HIV as the deadliest pathogen in the world1. The success of M. tuberculosis lies in its exceptional ability to evade host immunity and resist antibiotic treatment. However, M. tuberculosis is assumed to have low genetic diversity between strains, often reported to be caused by the lack of horizontal acquisition of genetic material3 and low recombination rates4. Gene gain is an important source of diversity, providing a way for organisms to adapt to shifting environmental conditions5. Gene gain mechanisms include acquisition of external genetic content, such as horizontal gene transfer (HGT), and duplication5. These gain events are freed from selective constraints and can accumulate mutations that produce several fates: maintenance of exact copies, partitioned function, new function, or loss of function6. Because M. tuberculosis strains lack HGT3, gene loss is a proposed mechanism of diversity in M. tuberculosis7. However, gene duplications have been observed in several M. tuberculosis strains that have influenced various phenotypes8,9, suggesting gene loss is not a dominating source of genetic diversity.
Multigene families are created by repeated duplication events10 and are susceptible to recombination due to high sequence homology among members. Recombination hotspots11 and gene conversion12 have been described in two protein families unique to Mycobacteria, PE and PPE13. These proteins (characterized by Pro-Glu (PE) at codons 7 and 8 or Pro-Pro-Glu (PPE) at codons 7, 8, and 913) are considered antigens in M. tuberculosis14 and represent popular vaccine candidates15. However, the mechanisms driving antigenic variation in the PE/PPE proteins is poorly characterized. Proteins in a PE subfamily, PE_PGRS (polymorphic GC-rich sequences), contain a hypervariable region (PGRS) expected to interact with immune cells, particularly T cells16. Yet, the PGRS domain is not involved in evasion of T cell recognition due to a significant lack of T cell epitopes in this hypervariable domain17. Conversely, antigens with tandem repeats have been proposed to elicit a T-cell-independent response18, which stimulate a B cell response independent of T cell activation19. The PGRS domain has a tandemly repetitive amino acid motif20, indicating a possible presence of B cell epitopes. Furthermore, recombination may be altering these epitopes to promote evasion, similar to several other pathogens21. Previous studies have developed creative methods to elucidate these mechanisms of diversity in M. tuberculosis9,11 but have had several limitations and lack resolution. This is largely attributed to the wide-scale use of short-read sequencing, which cannot confidently capture repetitive22 or GC-rich23 regions (well-known genomic signatures of M. tuberculosis13).
Additionally, traditional processing of short-read sequencing relies on a reference-based alignment24, which assumes a similar genome structure as the reference25. Long-read sequencing allows for de novo assembly of a bacterial genome into a single, contiguous sequence, which has revealed genomic changes in a limited number of M. tuberculosis genomes26-29. However, these studies lack the sample size to describe the diversity of M. tuberculosis in these regions.
In alternative embodiments, provided are methods for treating, preventing or ameliorating infection by a member of the Mycobacterium tuberculosis complex (MTBC), comprising administering to an individual in need thereof at least one inhibitory molecule or composition of:
In alternative embodiments, of methods as provided herein:
the inhibitory molecule or composition acts as a NAC (Non-Antibiotic Chemotherapeutic) molecule that inhibits the MTBC PE/PPE genes listed in 1(c) from changing and evading the immune system, wherein optionally the inhibitory molecule acts as a NAC (Non-Antibiotic Chemotherapeutic) molecule that inhibits targets comprising: the antigenic variation of one or more of 52 immune-evading MTBC PE/PPE genes as identified in TABLE B (see
the inhibitory molecule or composition, or the formulation or pharmaceutical composition, is contained in or on, or expressed on, or is carried in: a nanoparticle, a particle, a micelle or a liposome or lipoplex, a polymersome, a polyplex, a phage or a dendrimer;
the inhibitory nucleic acid is contained in a nucleic acid construct or a chimeric or a recombinant nucleic acid, or an expression cassette, vector, plasmid, phagemid or artificial chromosome, optionally stably integrated into a TB cell's chromosome, or optionally stably episomally expressed in a TB cell;
In alternative embodiments, provided are kits for or treating, preventing or ameliorating an MTBC infection, comprising an inhibitory molecule or composition used in any of the preceding claims, and optionally further comprising instructions for practicing a method as provided herein.
In alternative embodiments, provided are uses of an inhibitory molecule or composition as provided herein for: treating, preventing or ameliorating an MTBC infection, or, for the manufacture of a medicament for treating, preventing or ameliorating a TB infection.
In alternative embodiments, provided are inhibitory molecules or composition as provided herein for use in treating, preventing or ameliorating an MTBC infection.
In alternative embodiments, provided are methods for selecting environmentally-derived or a chimeric genetically engineered phage for formulating a phage or a cocktail of phages that can act as a therapeutic for treating, preventing or ameliorating an MTBC infection,
wherein the environmentally-derived phage or chimeric genetically engineered phage recognizes at least one peptide or a polypeptide encoded by:
wherein optionally the at least one peptide or a polypeptide recognized by the environmentally-derived phage or chimeric genetically engineered phage is expressed on the cell surface of an MTBC, or the at least one peptide or a polypeptide recognized by the environmentally-derived or chimeric genetically engineered phage is not expressed on the cell surface of an MTBC bacterium,
and optionally the at least one peptide or a polypeptide is responsible for synthesis and/or localization of surface molecules and targeted by the phage, optionally a mycobacteriophage,
wherein the method comprises: (i) selecting a set of environmentally-derived or chimeric genetically engineered phages that can attack a desired gene target(s), and/or using the gene targets as a guide to engineer new phages that can target the desired genes; (ii) combinatorial selecting subsets of the selected phages for therapy; (iii) delivering the phage subsets to an MTBC-infected tissue for elimination or amelioration of TB infection, or delivering the subsets to healthy tissues for prevention of an MTBC infection; and, (iv) selecting the effective or most effective subsets for therapeutic application.
In alternative embodiments, the environmentally-derived phage or chimeric genetically engineered phage is a lytic phage or a non-lytic phage, optionally a mycobacteriophage, or the phage therapeutic comprises a plurality of phages, comprising environmentally-derived phage or phages, and/or chimeric genetically engineered phage or phages, that are synergistically effective for treating, prevention or ameliorating TB or MTBC infection.
In alternative embodiments, the phage encodes or comprises a protein toxic to or inhibitory to: (a) one or more of the 38 Pro-Glu (PE)/Pro-Pro-Glu (PPE) TB genes as identified in TABLE A, or a transcript or a polypeptide encoded by a gene as identified in TABLE A; (b) one or more of the 366 identical TB genes as identified in Table C (see
In alternative embodiments, of methods as provided herein:
In alternative embodiments, provided are methods for analyzing sequencing data that can be assembled, de novo, into single contig genomes from DNA isolated from samples of a subspecies, species, collection of species, or genus of interest to identify and prioritize genomic elements as targets for various research and industrial applications where genomic elements' suitability as targets for the application of interest are defined by their essentiality for survival of the organisms,
wherein the method comprises providing or having provided (for example from publicly available databases) DNA sequencing data, such as that obtained from long-read single molecule sequencing of DNA isolated from a collection of samples of interest isolated for a subspecies, species, collection of species, or genus of interest, and for each sample among the set, the method further comprises:
(i) assembling the DNA sequence or sequences that comprise the genome from the sequencing data without reference to a previously known genome structure,
(ii) refining the assembled genome by correcting probable errors through mapping of sequencing reads onto the assembled DNA sequence and correcting the consensus where a majority of mapped reads contradict the consensus sequence; and iteratively repeating this process until convergence (no more consensus corrections) or oscillation between consensus states, or true heterogeneity is otherwise supported. In regions of heterogeneity, alternative consensus sequences are offered representing differences between subpopulations.
(iii) inferring coordinates defining genomic elements within the assembled DNA sequence through a hierarchical annotation prioritizing transfer of coordinates from the annotated genome of a related, well-characterized strain, and secondarily integrating annotations assigned through ab initio and/or orthology-based approaches, and
(iv) determining the core and accessory sets of genomic elements of the set of genomes under examination.
In alternative embodiments, of methods as provided herein:
In alternative embodiments of methods as provided herein:
elements that are functionally identical, but different in sequence, and collectively invariably present (one is always present), but invariably, or nearly invariably, mutually exclusive within any single genome;
and optionally the industrial application is to efficiently or specifically yield a chemical species, and optionally the chemical species comprises or is an antimicrobial compound,
and optionally the industrial application is to efficiently or specifically degrade a chemical species, and optionally the chemical species is or comprises a pollutant, and optionally the pollutant is or comprises a petroleum-derived hydrocarbon.
In alternative embodiments, provide are vaccines or pharmaceutical formulation for providing immunity against a Mycobacterium tuberculosis complex (MTBC) infection, comprising an immunologically protective dose of an inoculum and a vehicle,
wherein optionally the vehicle is for subdural, intravenous, intranasal, aerosol, or intramuscular delivery, administration to an immunocompetent individual,
wherein the inoculum or dosage unit comprises at least one of:
(a) an acellular vaccine or pharmaceutical formulation comprising a peptide or a polypeptide encoded by at least one of:
(i) the 38 PE/PPE MTBC genes as identified in TABLE A; or
(ii) the 366 identical MTBC genes as identified in TABLE C (see
(b) a live attenuated vaccine comprising attenuated Mycobacterium tuberculosis (MTB) engineered through functional deletion in the M. tuberculosis (MTB) one or more of:
In alternative embodiments, provided are methods for engineering phage therapeutics comprising use of any one of the 456 genes in claim 1(a), (b), or (c) for engineering phage therapeutics, the method comprising:
(a) confirming essentiality of the genes among the 38 listed in Claim 1a and the 366 listed in Claim 1b that by deleting the targeted gene using CRISPR or equivalent and culturing the resulting cell, and selecting for the next steps those genes whose deletion causes cell death; and
(b) engineering phages that can target (specifically bind to) at least one of the genes identified in (a), or claim 1(c), using homologous recombineering, bacteriophage recombineering of electroporated DNA (BRED), CRISPR-cas-based phage engineering, rebooting phages using assembled phage genomic DNA, or any equivalents thereof.
In alternative embodiments, provided are methods for sensitive heterogeneity analysis for detection of small but clinically relevant subpopulations, comprising:
In alternative embodiments, provided are methods of identifying heterogeneity based on oscillatory behavior during the iterative consensus sequence refinement described in the de novo assembly polishing step as provided herein.
In alternative embodiments, provided are methods of identifying heterogeneous genomic elements through annotating heterogeneous segments of a genome and genomic elements that are affected by the heterogeneous segments through methods as provided herein.
In alternative embodiments, provided are methods of determining the clinical relevance of the subpopulations harboring the heterogeneous genomic element variations, as provided herein.
In alternative embodiments, provided are methods for prognosing emergence of new phenotype or phenotype-conferring sequence variations in pathogens for Clinical Decision Support (CDS) comprising the steps of:
In alternative embodiments of methods as provided herein:
The details of one or more exemplary embodiments of the invention are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the invention will be apparent from the description and drawings, and from the claims.
All publications, patents, patent applications cited herein are hereby expressly incorporated by reference for all purposes.
The drawings set forth herein are illustrative of exemplary embodiments provided herein and are not meant to limit the scope of the invention as encompassed by the claims.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
Like reference symbols in the various drawings indicate like elements.
In alternative embodiments, provided are products of manufacture and kits, and methods, that comprise or comprise use of inhibitory molecules for treating or ameliorating a Mycobacterium tuberculosis (TB) infection. In alternative embodiments, the inhibitory molecules inhibit the expression of one or more of the 38 PE/PPE TB genes as identified herein (see Table A), or one or more of the 366 identical TB genes and proteins (see Table C) common to a global set of TB isolates as identified herein, or the inhibitory molecules inhibit the expression of a transcript or a polypeptide gene product thereof or inhibit the changing (and hence evading the immune system) of one or more of the 52 special immune-evading TB PE/PPE genes as identified herein (see Table B).
As described in Example, 1 we produced 97 M. tuberculosis genomes, the most reference-quality genomes and methylomes of M. tuberculosis published to-date. A hybrid annotation was applied and all protein-coding sequences were compiled (pangenome analysis) to quantify gene gain (gene duplication) and classify gene loss (pseudogenes). This analysis further revealed hundreds of genes not present in the primary reference strain, H37Rv, which indicates significant sequence divergence in the species. Our data also confirm significantly more genome contraction (pseudogenization) than amplification (duplication). Investigation into PE/PPE genes identified by a more accurate sequence motif confirmed T cell epitopes primarily reside in the PE domain of PE_PGRS proteins, possibly serving as decoy antigens. Finally, we observed strong evidence of gene conversion within PE_PGRS proteins that significantly altered B cell epitopes.
Provided herein is a comprehensive method of DNA sequencing analysis for identifying genomic elements, for example, genes or other genomic elements defined by genomic coordinates, as targets for medical and industrial applications.
In alternative embodiments, term “genomic elements” encompasses genomic elements as described in Kanduri et al, Bioinformatics, Vol 35, Issue 9, 1 May 2019, Pages 1615-1624, or genomic features as described in https://docs.patricbrc.org/user_guides/data/data_types/genomic_features.html.
We have developed a method to identify genomic elements as high-confidence targets for various industrial applications in a given species. In alternative embodiments, methods as provided herein can be used for drug and vaccine development and for accelerating development of industrial applications; however, methods are provided herein are not limited in scope to applications for drug and vaccine targets.
In alternative embodiments, methods are provided herein are applied to the analysis of genomic sequences using extant, nascent, and yet to be developed technologies for ascertaining the sequence of nucleotides comprising the genome of an organism/sample of interest. Methods as provided herein can be applied to other techniques of ascertaining the DNA sequence and producing complete genomes, and assembling genomes de novo, for example, methods as provided herein can be applied to DNA sequencing that utilizes single-molecule sequencing technologies, for example, as defined by in the Encyclopedia of Biophysics, which states “Single-molecule sequencing refers to techniques that can read the base sequence directly from individual strands of DNA or RNA present in a sample of interest”.
In alternative embodiments, methods are provided herein are used with single-molecule sequencing technologies that can comprise multiple techniques, such as:
Sequencing-by-Synthesis:
Nanopore-Sequencing Technologies:
Like many industrial applications, drug and vaccine development is hindered by lack of appropriate targets mostly because current techniques involve extensive and time consuming laboratory experimentation implementing a broad blind search approach to identify target candidates. In alternative embodiments, methods as provided herein quickly provide such appropriate targets, for example, methods as provided herein can determine which genomic elements meet the requirements of an application.
For example, in alternative embodiments, methods as provided herein address the problem of having an efficient way to identify drug and vaccine targets, which can be an exceedingly slow and expensive process, partly due to lack of targets that meet the vaccine or drug requirements, for example conservation and accessibility. Conservation, almost always, is a required attribute for a target since its absence in any variety of the species will make the application not broadly applicable. The next most frequently important attribute is the variability of a target. Many applications require that the target be highly stable across the varieties of the species. An example of such applications is the development of antibiotics where the target needs to be present, essential, and as stable as possible. Other applications require that the target be partly hypervariable in a specific way.
Provided herein are in silico methods to quickly identify a comprehensive set of high-potential targets for a variety of industrial applications including vaccine and drug development that are conserved, their variability profile determined, and their essentiality deduced from their conservation and variability without lengthy laboratory experiments (which significantly increases the time and cost of a project). In alternative embodiments, these in silico methods comprise:
The general methodology of steps 1 to 4 is depicted as a workflow schematic in
Steps 5-8 are written above as applied to genes as the class of genomic element under analysis, but the method is designed to be applied to any class or combination of classes of genomic elements. Expanded to multiple classes, instead of the accessory genome basis as defined in the steps above, we would have the basis of the accessory set of genomic elements and defined analogously according to the steps above applied to all genomic elements.
In alternative embodiments, for some applications, not all the steps outlined above are required, and some can be skipped. For example, for the case of efficient chemical conversion by bacteria, engineering species with minimal genomes is important. In such an application, genomic elements identified in steps 4 and 6 would be sufficient for engineering a minimal genome species.
Provided herein is a rapid, and more accurate method for prediction of essentiality. Essentiality is typically determined by experiments and survival assays measuring whether the cells of interest continue replicating when a genomic element is deleted or disrupted while cultured in an artificial environment. This artificial environment fails to recapitulate the constraints levied in the true environment and produces essentiality classifications that are misleading when extrapolated to the organism's native environment. In alternative embodiment, provided herein is a rapid and accurate method for prediction of essentiality that relies on an element's hyperconservation, or the combination of its variability profile and its gene exclusivity profile (step 7 above) across samples of the species. In alternative embodiment, this method has numerous applications that spans multiple sectors, for example:
These exemplary methods are outlined in detailed and validated in Example 1. The results of its application to 97 different strains of M. tuberculosis is also described Example 1, where some useful drug and vaccine targets have been identified.
Pathogens are successful in causing infections mostly due to their ability to manipulate and adapt their environment. These abilities allow pathogens to evade host immunity and survive chemotherapy. Pathogens' ability to dynamically adapt to immune and drug pressure is primarily attributed to their ability to change their genomes in response to the pressures that they encounter. For most pathogens, the complete map and nature of these changes are poorly understood due to limitations of popular sequencing technologies (short-read Whole Genome Sequencing, or srWGS). SrWGS imposes limitations on assembling the pathogen's genome, preventing a comprehensive genetic analysis. Because of these limitations, srWGS reads are mapped onto the genomic structure of a well-studied reference genome to identify changes with respect to the reference. The srWGS approach offers several shortcuts but also has a critical drawback. It assumes that the pathogen's genomic structure remains constant across members of the species. In fact, this assumption is invalid, and pathogens change their genomic structure in order to evade the immune system and chemotherapy pressures.
To properly study such genomic changes (for example for comprehensive understanding of mechanisms of antibiotic resistance), the genome must be assembled de novo. De novo assembly of srWGS presents several challenges, both in initial assembly and in downstream analyses:
While some of the methods described have been used for infectious diseases, our method of annotation and the larger method of identifying novel targets for vaccines and drug targets is novel and non-obvious.
Beyond infectious diseases, the two-step, hybrid annotation approach of methods as provided herein, which include a prediction of the accessory genome, identification of alternative sets, and the method as a whole, have utility for identifying subsets of essential genes in the native environment from which samples are derived. In alternative embodiments, this addresses a fundamental problem in determining gene essentiality. In alternative embodiments, methods as provided herein are rapid, more accurate, and not rely on direct killing assays in an artificial environment optimized for growth.
In alternative embodiments of methods as provided herein, there are numerous applications in clinical, industrial, and synthetic microbiology for which identifying the set of essential genes in an organism's natural environment would accelerate or newly enable innovation. For example, in the application of engineering bacteria that breakdown petroleum, applying exemplary methods as provided herein to multiple strains would identify the set of highly conserved genes that would serve as a scaffold to construct a minimal genome of essential genes as a starting point for engineering strains of bacteria that efficiently breakdown petroleum into less toxic metabolites, accelerating oil spill recovery efforts, and reducing harm to the environment. In alternative embodiments, exemplary methods as provided herein also can be used for engineering strains of bacteria for other types of chemical conversion, including biofuel production, nanoparticle synthesis, and metal contaminant and waste remediation. The methods as provided herein are emerging possibilities that hold large promise for combatting challenging environmental issues of the 21st century. Engineering efficient strains for these applications using natively essential genes as a starting point as provided herein has not been previously described. The prevalence and utility of exemplary methods as provided herein can be used for improvements to engineering tools to enable more precise genetic manipulation to engineer solutions to issues of excess waste, clean energy generation, and numerous other applications of bacteria-driven chemical conversion.
We have developed a method to more easily identify high-potential genomic targets for a number of industrial applications including vaccine and drug development, biofuel generation, toxic spill cleanup, and many other environmental challenges of the 21st century. This will significantly reduce drug development cost and accelerate the development timeline. Methods as provided herein can also advance genomic analysis in general for research. Methods as provided herein can be used for the assembling of genomes exactly as they are without assumption, annotating them for further analysis, and downstream analysis for target identification.
We have demonstrated the application of exemplary methods as provided herein to tuberculosis, the deadliest human infection killing 1.5 million people a year. As a result, exemplary methods as provided herein can be used with novel non-antibiotic therapeutics and new vaccine development.
Exemplary methods as provided herein are also broadly applicable to identify sets of genes essential for survival of a specie in its native environment. Identification of such genes has numerous potential applications in synthetic biology, improving engineering of bacteria to clean up our landfills and for efficiently synthesizing complex materials or materials expensive to synthesize through current methods.
Provided herein are Non-Antibiotic Chemotherapeutics (NACs) that neither sterilize (for example, as bactericidal antibiotics) nor stop bacterial growth (for example, as bacteriostatic antibiotics). Rather, these NAC drugs disable the immune-evasion capabilities of the bacteria making them easier for the immune system to identify and eradicate.
In alternative embodiments, provided are effective targets for NACs which will help in engineering NAC compounds. Our findings identify a unique set of genes with evolutionary dynamics previously unknown in M. tuberculosis. These evolutionary dynamics offer opportunity for development of NACs.
In alternative embodiments, provided are effective targets for phage therapy of mycobacterial infections. After its initial introduction phage therapy has stalled mostly due to lack of effective targets for the phages to attack. Provided herein are phage targets that can be used to screen against targets of known phages and for directed engineering of novel phages.
By their nature, NACs are less susceptible to emergence of resistance and will offer a more permanent solution to TB control than with classical antibiotics alone. Phage resistance has been observed, but we believe that this has more to do with the targets selected for phages. For the first time, we have identified targets that are identical across all 97 strains of M. tuberculosis suggesting that change in these targets are detrimental to the bacteria. As such, these targets cannot change to resist phage treatment. Provided are the targets identified from finished M. tuberculosis genomes, which offer promising avenues to developing novel alternative therapeutic classes with lessened probability of developing resistance.
In alternative embodiments, provided are targets in M. tuberculosis for NAC and phage therapeutics:
Through our novel methods (described in detail in Example 1), we have fully captured these gene families for the first time in clinical isolates. We have applied this method to a diverse set of clinical M. tuberculosis isolates collected from 97 patients from six countries in Asia, Africa, and Europe. This set represents the five major lineages of M. tuberculosis. We have captured the full set of PE/PPE genes in each genome. Through the analysis described in Example 1, we have identified a subset of 38 PE/PPE genes (see Table A) that are conserved in all 97 clinical isolates. This, for the first time, identifies a stable set of candidate vaccine targets.
Accordingly, in alternative embodiments, provided are: identification of 456 TB genes useful for developing NACs; the use of these 456 genes for selecting environmentally-derived phages for formulating cocktails of phage therapeutics; the use of these 456 genes for engineering phage therapeutics; the use of these 456 genes above developing drug/phage combination therapies; and, the use of these 456 genes above for engineering phage therapeutics.
In alternative embodiments, methods as provided herein overcome several barriers for the development of NACs and for the use of the specified genes for phage therapeutics:
In alternative embodiment, NACs as identified using methods as provided herein can be treated as co-drugs along with existing antibiotics. More effective NACs can function as single drug therapeutics in immunocompetent patients. An additional advantage of NACs lies in the more challenging path that pathogens have to take to evolve resistance to NACs. As such, NAC resistance is not expected to emerge, or to emerge far less frequently than traditional antibiotics.
In alternative embodiments, provided herein are genomic targets for drug and vaccine development, where the identified targets can be classified into three classes, segregated by the type of protein and their genomic stability:
In alternative embodiments, identified genomic targets, which can be used for live attenuated vaccine development, acellular vaccine development, and antibiotic development, have at least one or more of the following five attributes:
In alternative embodiments, methods as provided herein comprise the following steps for developing NACs:
Common steps:
In alternative embodiments, provided are environmentally-derived phages to treat TB:
The first instance of engineered phages curing a human infection was recently reported, and was engineered against a mycobacterial lung infection, providing precedent that justifies this approach as a viable development strategy for TB phage therapeutics:
In alternative embodiments, provided are phage/antibiotic combination therapies; these work best when mechanisms of resistance to phage and drug are mutually antagonistic:
Identify synergistic drug/phage pairs. In alternative embodiments, two method as provided herein for identifying effective phage/drug combinations for further development comprise:
Stable M. tuberculosis genes are identified herein that are used for developing phage therapeutics, for example, as described herein, with significant therapeutic potential. M. tuberculosis is a target for developing phage therapeutics, for several reasons:
In studying nearly 100 clinical isolates of M. tuberculosis from around the world, the bacteria usually responsible for human TB infections have been analyzed with long-read sequencing technology and, unexpectedly, we have uncovered a large trove of genes that are identical across all strains, making these promising targets for developing new therapeutics. In addition, we also identified 52 genes that are very different across strains in a manner that evades the immune system but have conserved portions that do not change.
Provided herein are exemplary methods for using these identified genes as targets for developing novel therapeutic approaches to TB:
One exemplary method as provided herein provides a non-antibiotic chemotherapeutics (NACs) approach that does not kill or stop their growth like typical antibiotics do. Rather, NACs prevent the bacterial cells to evade the human immune system through their frequent B Cell epitope changes. In this way, the immune system will easily identify the infecting cells and destroy them.
A second exemplary method as provided herein provides for the phage therapy of a mycobacterial infection by providing new phage targets. In alternative embodiments, provided herein are targets for phage therapy and/or for engineering new phages.
In alternative embodiments, this new set of genomic targets as provided herein has enabled a three-pronged pipeline for phage therapeutics: First, where naturally occurring phages are turned on the TB-causing bacteria; second, where phages are engineered in a lab to kill the bacteria; and, a third where phages are combined in special ways with traditional anti-TB drugs that makes it far more challenging for the bacteria to develop resistance phage or bacteria. These new exemplary approaches use stable targets in the TB-causing bacteria to offer new solutions that are desperately needed against TB.
In alternative embodiments, this new set of genomic targets can be used for:
Live attenuated vaccine development: In alternative embodiments, provided are a class of vaccines or pharmaceutical formulations formulated using strains of M. tuberculosis engineered to be avirulent to build immunity against virulent M. tuberculosis strains. In alternative embodiments, provided are methods for developing M. tuberculosis live attenuated vaccine, and optionally steps in M. tuberculosis live attenuated vaccine development comprise:
In alternative embodiments, provides are methods for M. tuberculosis acellular vaccine development, which can comprise the following steps:
Tuberculosis (TB) is now the most successful human infectious disease. It has killed over a billion people in the last 200 years. Its virulence and survival in the human host is largely a mystery. Despite decades of research, a TB vaccine has proven to be elusive. The PE/PPE family of genes are thought to interact with the human immune system in order to evade it.
For the first time, we have identified 38 PE/PPE genes (see Table A) (the TB family PE/PPE family of genes are thought to interact with the human immune system in order to evade it), 52 special immune-evading PE/PPE genes (see Table B, and 366 identical proteins common to a global set of isolates (see Table C) (97). The manipulation or knocking out of these genes will result is strains that are avirulent.
These avirulent strains hold promise for developing a vaccine, which has been challenging for TB. The 366 identical genes provide ideal targets for development of novel antibiotics. The fact that these genes remain identical across so many strains from across the world is an indication that bacteria are sensitive to changes in these genes and hence they provide the perfect target for antibiotic attack.
In alternative embodiments, methods as provided herein comprise the following steps for prognosing emergence of new phenotypes, such as resistance to chemotherapeutics.
This will determine which specific genetic or epigenetic variants are important at each time point for emergence of the new phenotype, drawing a spatiotemporal pattern of emergence for each new phenotype.
In alternative embodiments, methods as provided herein, depicted in
In the example on the left, the patient continues on the original regimen until resistance emerges and treatment fails, as current phenotypic measures using only known resistance-conferring variants (stars, highlighted by red rectangle) did not identify resistance until it had already manifested phenotypically. In the scenario on the right, the level of heterogeneity (darkness of ovals) of harbinger mutations cued the prognostic CDS tool to recommend changing the treatment regimen after sample 4, 2 samples prior to emergence of resistance in the alternative scenario, leading to treatment success. The CDS output comes in the form of a 0-1 computed probability of emergent resistance (Prob), and one of three suggestions based on that probability: “continue treatment”, “Monitor closely”, which means Prob has been increasing and samples should be taken more frequently if possible, and “change regimen”, indicating high probability of impending resistance emergence if the current regimen continues. Prob can be computed using a recursive artificial neural network (rANN) trained to recognize spatiotemporal (variants across the genome and time) patterns predictive of future emergence of phenotypic resistance.
Provided are products of manufacture and kits for practicing methods as provided herein, and optionally further comprising instructions for practicing methods as provided herein.
Table A: Pro-Glu (PE)/Pro-Pro-Glu (PPE) TB or MTBC genes
The names of the following Pro-Glu (PE)/Pro-Pro-Glu (PPE) TB or MTBC genes (of Table A) can are their common names and are described for example, in known databases such as GenBank:
PPE18; PE33; PE_PGR; S1; PPE44; PPE11; PPE53; PE14; PPE51; PE25; PPE10; PE17; PE13; PE_PGRS59; PE15, PPE20; PPE22; PE12; PE5; PPE3; PE_PGRS18; PPE1; PPE45; PPE12; PPE19; PE4; PPE4; PE_PGRS44; PPE41; PE23; PE35; PPE68; PE16; lipX; PPE31; PE20; PPE32; PPE14; PPE15.
Table B: conserved region of the 52 special immune-evading MTBC PE/PPE genes
The names of the conserved region of the 52 special immune-evading MTBC PE/PPE genes of Table B, as illustrated in
Table C: MTBC genes
The names of the MTBC genes of Table C, as illustrated in
Any of the above aspects and embodiments can be combined with any other aspect or embodiment as disclosed here in the Summary and/or Detailed Description sections.
As used in this specification and the claims, the singular forms “a,” “an” and “the” include plural referents unless the context clearly dictates otherwise.
Unless specifically stated or obvious from context, as used herein, the term “or” is understood to be inclusive and covers both “or” and “and”.
Unless specifically stated or obvious from context, as used herein, the term “about” is understood as within a range of normal tolerance in the art, for example within 2 standard deviations of the mean. About can be understood as within 10%, 9%, 8%, 7%, 6%, 5%, 4%, 3%, 2%, 1%, 0.5%, 0.1%, 0.05%, or 0.01% of the stated value. Unless otherwise clear from the context, all numerical values provided herein are modified by the term “about.”
The entirety of each patent, patent application, publication and document referenced herein hereby is incorporated by reference. Citation of the above patents, patent applications, publications and documents is not an admission that any of the foregoing is pertinent prior art, nor does it constitute any admission as to the contents or date of these publications or documents. Incorporation by reference of these documents, standing alone, should not be construed as an assertion or admission that any portion of the contents of any document is considered to be essential material for satisfying any national or regional statutory disclosure requirement for patent applications. Notwithstanding, the right is reserved for relying upon any of such documents, where appropriate, for providing material deemed essential to the claimed subject matter by an examining authority or court.
Modifications may be made to the foregoing without departing from the basic aspects of the invention. Although the invention has been described in substantial detail with reference to one or more specific embodiments, those of ordinary skill in the art will recognize that changes may be made to the embodiments specifically disclosed in this application, and yet these modifications and improvements are within the scope and spirit of the invention. The invention illustratively described herein suitably may be practiced in the absence of any element(s) not specifically disclosed herein. Thus, for example, in each instance herein any of the terms “comprising”, “consisting essentially of”, and “consisting of” may be replaced with either of the other two terms. Thus, the terms and expressions which have been employed are used as terms of description and not of limitation, equivalents of the features shown and described, or portions thereof, are not excluded, and it is recognized that various modifications are possible within the scope of the invention. Embodiments of the invention are set forth in the following claims.
The invention will be further described with reference to the examples described herein; however, it is to be understood that the invention is not limited to such examples.
Unless stated otherwise in the Examples, all recombinant DNA techniques are carried out according to standard protocols, for example, as described in Sambrook et al. (1989) Molecular Cloning: A Laboratory Manual, Second Edition, Cold Spring Harbor Laboratory Press, NY and in Volumes 1 and 2 of Ausubel et al. (1994) Current Protocols in Molecular Biology, Current Protocols, USA. Other references for standard molecular biology techniques include Sambrook and Russell (2001) Molecular Cloning: A Laboratory Manual, Third Edition, Cold Spring Harbor Laboratory Press, NY, Volumes I and II of Brown (1998) Molecular Biology LabFax, Second Edition, Academic Press (UK). Standard materials and methods for polymerase chain reactions can be found in Dieffenbach and Dveksler (1995) PCR Primer: A Laboratory Manual, Cold Spring Harbor Laboratory Press, and in McPherson at al. (2000) PCR—Basics: From Background to Bench, First Edition, Springer Verlag, Germany.
This example describes how to make and use alternative embodiments, as provided herein.
Mycobacterium tuberculosis is described to have low genetic diversity corroborated by consistent use of short-read sequencing. To elucidate the true genetic diversity, we present 97 reference-quality, circularized genomes sequenced on Pacific Biosciences from independent clinical M. tuberculosis isolates. A novel hybrid annotation approach was applied to all genomes, preserving reference annotations in regions of high homology but filling in gaps with an ab initio method. The resulting pangenome (6160 genes including 2061 not in H37Rv) revealed an accessory genome enriched for host interaction functions. The core genome (2556 genes) was enriched for housekeeping functions and contained 366 identical protein-coding genes in all genomes. Intriguingly, several toxin-antitoxin proteins were absolutely identical across all genomes, indicating high selective pressure to maintain these typically hypervariable genes. Sixty unique gene duplications were discovered, representing an important mechanism of gene gain in the species. However, gene loss through pseudogenization was significantly more prevalent than gene duplication, indicating more genome contraction than amplification. A per-genome calculation revealed an average of 1.5% of genes not in H37Rv, encompassing several PE/PPE genes. Remarkably, gene conversion was found to alter B cell epitopes in the PGRS domain of PE_PGRS proteins, denoting a potential mechanism of immune evasion in M tuberculosis. Although M. tuberculosis is considered a clonal bacterium, the differences we observed indicate a rapidly adapting genomic landscape that contributes to its success as a deadly pathogen.
Strict Quality Control Protocols Produce the Most Reference-Quality Genomes of M. tuberculosis To-Date
We compiled all available M. tuberculosis genomes (single patient isolates) that were sequenced on PacBio on the Sequence Read Archive database30 (n=24) and sequenced an additional 154 genomes on PacBio RS II™. All genomes were assembled de novo and the quality of the consensus sequence was evaluated using strict exclusion criteria associated with each of the four assembly steps (described in “Genome Assembly with Quality Control” section). These criteria were used to exclude genomes with uncertainty at any point of their assembly (typically caused by single-base insertions or deletions). For statistics of excluded and included genomes, please refer to
Applying these exclusion criteria identified 97 circularized genomes with a polished consensus sequence and no detectable mis-assemblies (Table S1, or
Genome sizes varied (minimum 4368671 bp, maximum 4450340 bp)) with over 80 kb difference between the largest and smallest genomes (
A phylogenetic tree was also produced to ensure sequencing errors did not affect the expected evolution of the included M. tuberculosis genomes. Similar to previous studies (Kato-Maeda et al. 2013; Stucki et al. 2012), a SNP tree was generated by, first, identifying all SNPs compared to reference genome, M. tuberculosis H37Rv; then a phylogenetic tree was created with a maximum likelihood method. MIRU-VNTR and spoligotyping data was referenced for accuracy.
These reference-quality genomes were then queried for new genetic content through a hybrid annotation approach.
We sought to preserve annotations from a well-studied reference strain (H37Rv32, implemented by RATT33) and annotate new content ab initio (implemented by Prokka34). Relying on a well-characterized reference annotation lowers the possibility of mis-annotations and false positive annotations, common in ab initio methods35. The annotation transfer resulted in an average of 85% concordance with the H37Rv proteome across 97 genomes (Figure S3). However, relying solely on this type of annotation removes the possibility of identifying new genes with respect to the reference. An ab initio approach was used to annotate new genes not present in the reference, along with two custom filtering steps to avoid mis-annotations and false positive annotations. The first step minimized mis-annotation of gene coordinates (characterized by an incorrect protein-coding sequence (CDS) start position) by recalibrating the gene coordinates of CDSs likely to have an incorrect start coordinate (see Supplemental Information). The second step removed potential false positive annotations in genomic regions that are unannotated by RATT by excluding genes with unknown function (see Supplemental Information for details). These steps, a CDS sequence clustering step, and orthologous functional annotation with eggNOG comprise a custom annotation pipeline for M. tuberculosis genomes, Annotate TUBerculosis (AnnoTUB).
AnnoTUB identified an average of 4020 CDSs (range 3958-4051) in 97 genomes as well as the avirulent M. tuberculosis reference strain, H37Ra (CP016972.1), and re-annotation of H37Rv (NC000962.3) (Table 51, or
M. tuberculosis Pangenome Reveals Hidden Genetic Diversity and Hyperconservation
M. tuberculosis has been reported to have an open pangenome36 (additional genomes would be needed to identify all genes in the species37), which indicates an expanding and divergent accessory genome (CDSs present in less than 99% of the population). To see if finished assemblies corroborate this, we constructed a pangenome using Roary38 from annotations of the 97 de novo assembled PacBio-only genomes, H37Rv, and H37Ra. The resulting pangenome (n=6160 CDSs) revealed an accessory genome (n=3604 CDSs) enriched for evasion of host defenses and a core genome (conserved genes, n=2556 CDSs) enriched for housekeeping functions specific to the species (
Hyperconserved Proteins as Prudent Drug Targets for Infections by M. tuberculosis
The M. tuberculosis genome sequence was published 22 years ago1 and cited by over 8,500 works, yet no drug informed by M. tuberculosis genomic analyses has been put to market. While target essentiality has been a popular criterion for target discovery, gene essentiality studies on virulent type strain H37Rv are typically done in artificial media. Others have noted that a more relevant measure of essentiality may be conservation of genes over evolutionary time, since it implies essentiality in the native environment2.
Conservation over long evolutionary time periods has been used previously as an alternative measure of essentiality of M. tuberculosis, but focused on metabolic genes3, and used Illumina sequencing, which, while largely reliable, has areas of systematic bias that preclude identification of invariable genes in many portions of the genome. Third generation sequencing technologies bring new fervor to this endeavor, however, by permitting unbiased, comprehensive genomic characterization of M. tuberculosis strains, including PE/PPE genes, which have eluded full genomic characterization during the predominance of short-read sequencing. Here, we have analyzed the proteomes of 97 finished-grade de novo assembled M. tuberculosis clinical genomes and identified invariable protein sequences within them. Next. we evaluate these hyperconserved proteins according to their essentiality, chromosomal distribution, and functional composition. We then integrate these findings with data collected in situ and ex vivo to develop a list of targets particularly robust against resistance development and with high potential for efficacy during infection.
Replisome Conflicts with Transcriptional Machinery Biases Protein Conservation Across the MTBC.
First, we analyzed how invariably conserved proteins distribute across the M. tuberculosis chromosome. Head-on collisions between the replication and transcriptional machinery bias genes on the lagging strand toward higher mutability in prokaryotes and specifically in M. tuberculosis4. We asked whether this bias penetrated over the long time-course between the common ancestors of these diverse strains to affect identical gene distribution. Identical gene density is higher on the leading strand than on the lagging strand (
We reasoned that lagging strands genes that are hyperconserved despite their biased mutability must be under strong purifying selection and thus likely essential for pathogenesis. Eight operons meet this criterion (Table Y) and successfully recover targets with in vivo essentiality or promise for developing vaccine and therapeutics. The eight operons include multidrug efflux pumps (p55 and mmr) and an operon currently in trials as a vaccine candidate and drug target (Rv1410c-lprG). The Rv1410c-lprG operon has demonstrated in vivo essentiality5 and regulates import of triacylglycerol, a key carbon source during infection and energy reservoir during dormancy6. The operon is also essential for drug export via the P55 efflux pump7 and for evading immunity through inhibiting phagosome-lysosome fusion8. Encouragingly, small molecule inhibitors efficacious against LprG were recently identified9. Our results add to the evidence for this operon as a promising therapeutic target.
CadI-Rv2642, a cadmium-inducible gene and an ArsR-family transcription factor that regulates its transcription, form part of the “arsenic detoxification operon” which, unexpectedly, were the only consistently downregulated genes across all attenuated MTB10, adding to the mounting evidence that metal ion homeostasis factors critically into M. tuberculosis pathogenesis11. Transcriptional regulators are significantly overrepresented in the operons (p=0.008, odds ratio=6.03, 95% CI: 1.40-20.1) and regulate their co-operonic protein in several cases. These include Rv3066, which regulates mmr, a multidrug efflux protein underpinning intrinsic resistance to multiple drugs, and whiB1, an important coordinator of the transcriptional response to hypoxia.
Identical genes distribute into regional (
Second, we assessed the functional distribution of invariable M. tuberculosis proteins. The As expected, considering their fundamental role in life, 505 (rplBEFKPRTUW and rpmAEF) and 30S ribosomal proteins (rpsBFHKPS/N2/R1/R2) were abundant among invariable genes. TA genes were not restricted to the clusters highlighted in
The importance of iron M. tuberculosis pathogenesis showed in their conservation. Genes encoding proteins mediating Iron acquisition (mbtDLN and mmpS4/5), storage (brfAB), transcriptional regulation (ideR), and FeS clusters (sufT, whiB1, whiB3) are invariably conserved. As were genes encoding proteins for other metals, including copper resistance (mymT, mctB, and ricR) and molybdenum processing (moaC2/E2 and mog) proteins.
These identical genes were also significantly enriched for essential genes (determined in H37Rv in an in vitro system), but not within each lineage (Table S5, or
The over-representation of regulatory, virulence, and hypothetical proteins aligns with the functional composition of the enduring hypoxic response stimulon19, supporting the notion that these genes fill conserved roles required for enduring the stressors imposed by host immunity.
Integrative analysis with in situ expression data highlights non-essential transcriptional factors as priority drug targets. To distill targets most likely accessible throughout infection out of this unexpectedly large trove of invariably conserved non-essential genes, we integrated invariable proteins with the genes comprising the “core transcriptome” of M. tuberculosis active and inactive macrophage20 (Table Z). The genes in this intersection were remarkably overrepresented for non-essential transcription factors (4/11, p=0.019, odds ratio=10.34, 95% CI: 2.20-41.1). Yet this remains an underestimate of the degree of their enrichment; Four of the seven genes not annotated as TFs in the Rustad dataset, have since characterized had transcriptional regulatory roles characterized in M. tuberculosis (abmR21, Rv232722, Rv2028c, and sufT) or in homologs in other species.
Like RpoB, the target of first-line antitubercular therapeutic rifampin, HspR (protein folding & stability) and PrfA (protein syntheses) mediate fundamental processes, making them intriguing therapeutic candidates. However, prfA has a SNP specific to sublineage 4.2.2.1 (a sublineage absent from our isolates). The hspR gene developed mutations in a PafE (which degrades HspR) mutant23, but it is unclear whether this variation would occur naturalistically during infection or if its permissibility is restricted to the artificial experimental environment.
Several hits in Table Z are known players in drug resistance. Rv2327 is upregulated upon BDQ exposure24, and its M. smegmatis homolog is a redox-sensing MarR-type repressor that confers resistance to rifampicin, erythromycin, and hypochlorite stress in M. smegmatis through regulation of a multidrug efflux pump in response to oxidative stress22. These reports suggest a potential mechanism of tolerance to BDQ, or perhaps multiple drugs that could be targeted for synergistic effects. Other hits mediate immune subversion of host defenses or are involved in stress response. These are interesting targets for combinatorial therapy with more direct drugs with metabolic targets.
Transcription factor abmR is required for regulation of central carbon metabolism genes by sRNA Mcr1125 and appears to mediate adherence and invasion of host cells21. These roles in metabolic adaptation and infectivity make abmR an appealing therapeutic target. One intriguing direction would be targeting AbmR as a prophylactic measure, given its potential essentiality for transmission.
Hyperconserved PE/PPE proteins in Mycobacterium tuberculosis. Next, we examined invariable PE/PPE genes. The lack of systematic bias in SMRT-sequencing permits bona fide de novo analysis of PE/PPE gene composition, which has been difficult to resolve with short read sequencing technologies26. To our knowledge, this is first report of their conservation across a global set of finished genomes. While they are starkly under-represented among invariable proteins (5/366, p=0.00239, odds ratio=0.291), four of the five invariable PE/PPE proteins have clear important roles, substantiating the validity of our screen for inferring phenotypic significance during infection. pe25 and ppe41 are co-operonic and together encode products to form the PE25/PPE41 heterodimer. PE25/PPE41 is secreted out of M. tuberculosis and acts as a highly immunogenic immune effector″, inducing macrophage death by necrosis28, though the precise mechanisms are not yet teased out. Also known as pe11, lipX encodes a functional lipase, LipX, that influences the lipid composition of the mycobacterial membrane and propensity for biofilm formation29. PE5 lies within the esx-3 region of the genome and is secreted with PPE4 by the ESX-3 secretion system. Once secreted, PE5 scavenges iron and suppresses host-generated nitric oxide stress. Its expression is upregulated more than two-fold following exposure to multiple stressors, including the recently repurposed TB antimicrobial Clofazamine30.
PE_PGRS40, on the other hand, is functionally uncharacterized. At only 60 AA it is extremely short (compared to 176-1,901 AA of other PE_PGRS genes31) and reported to have 0 T-cell epitopes, unlike others in the PE_PGRS family, which have up to 5231.
The rigid conservation of clear functions of LipX, and PE25, and PPE41 suggests their functions are conserved mechanisms of host subversion and stress response, rather than population-specific adaptations or mechanisms to generate variability. This is consistent with recent reports of PE/PPE genes as mediating specific functions like acquiring iron32 and acting as a selective porin for nutrient uptake33. Given the recent discovery that some, and likely many, PE/PPE genes act as selective nutrient gates33, PE/PPE proteins stable across the phylogeny would imply mechanisms of interfacing with the host milieu that are conserved across humanity. Disrupting such processes thus make attractive therapeutic candidates.
Through interrogating finished genomes for hyperconservation, and filtering invariant proteins according to their (i) resilience to mutagenic replication-transcription conflict and (ii) consistent transcription in situ, we capture 7 targets therapeutic or vaccine targets under active development and highlight 22 genes as new high-priority targets (Table Z, above, and the Table in
The 11 target proteins distilled in Table Z provide a short list of conserved gene targets expressed during infection. The criteria implemented in our approach was rather strict; proteins need not be entirely invariable to make good drug targets. The overlap of known genes mediating M. tuberculosis stress response, drug tolerance, and pathogenesis recovered by this approach is encouraging, and suggests the less well-known genes may play similar roles, yet undiscovered.
Gene Loss (Pseudogenization) is More Prevalent than Gene Gain (Duplication)
Gene loss, measured by prevalence of pseudogenes, is proposed to be the primary mechanism to generate diversity in M. tuberculosis (Bolotin et al. 2015). We sought to confirm this observation by using a homology-based approach which has been described previously (Carlier et al. 2013; Danneels et al. 2018). On average, there were 82 pseudogene regions per genome (min=75, max=94), which contributed to 3.37% of the genome (bp) (min=2.81%, max=4.09%), while gene duplications contributed only 0.135% of the genome (bp) (min=0.036, max=0.619). Based on these statistics and previous reports (Singh and Cole 2011; Bolotin et al. 2015), we postulated that there would be more genome contraction (pseudogenization) than amplification (duplication). To account for lineage being a confounding factor, an ANOVA test was performed based on the proportional differences between pseudogene length and duplication length (lineage 7 was excluded from statistical analyses as only one genome represented the lineage). Thep-value was 0.236, indicating that genome contraction vs amplification is not more pronounced in specific lineages. A paired Student's t-test was then used to determine if there was significantly more genome contraction than amplification per lineage. We observed significantly more genome contraction than amplification across all lineages (Table 1).
A GO enrichment was performed on the functional homologs of pseudogenes. This resulted in several functions related to host evasion (Table 2). Loss of evasion mechanisms is likely related to the ever-adapting landscape of antigens.
The M. tuberculosis Pangenome Contains Over 1000 Genes not Present in H37Rv
Comparing the size of the pangenome (6160) to the size of the H37Rv proteome (4099 CDSs re-annotated by AnnoTUB), we identified 2061 genes that were not observed in H37Rv (Table S3). The 2061 genes were enriched for avoidance of host defenses and negative regulation of host response (
We postulated that the genes not in H37Rv may be diverged homologs of genes in H37Rv. A hierarchy was developed to categorize these genes into mutually exclusive groups. The order of the hierarchy was: RvD1-6/TbD1, in CDC1551, transposases, duplications, pseudogenes, contained frameshift mutations, truncations, extensions, and partial alignments (Table 3).
The first category listed (RvD1-6/TbD1) represents previously described regions within the H37Rv genome that are deleted but present in other members of the M. tuberculosis complex (Zheng et al. 2008; Brosch et al. 1999). A BLASTP query in the 2061 genes not in H37Rv to each of these RvD1-6/TbD1 regions revealed 44 genes with at least 99% amino acid sequence identity (Table S3).
Second, a BLASTP search of the remaining 2017 genes to the CDC1551 proteome identified 120 homologs not in H37Rv. This included pbp-1 (mt0930), which encodes a penicillin-binding protein 4. In the current dataset, four genomes had pbp-1 in addition to the H37Rv penicillin-binding protein homologs, ponA1 and ponA2. This protein has been reported to be secreted in two proteomics studies of M. tuberculosis CDC1551 (Sinha et al. 2005; Non et al. 2006). Although penicillin-binding protein 4 homologs have been linked to penicillin resistance in several pathogenic bacteria (Zawadzka-Skomial et al. 2006; Stefanova et al. 2004; Finan et al. 2001), the role of this protein has yet to be elucidated in M. tuberculosis.
After classifying genes not in H37Rv within the RvD1-6/TbD1 and CDC1551 categories, the remaining genes were subsequently categorized. Truncations and pseudogenes were excluded from further analysis of the categories because they were merely shortened or mutated genes present in H37Rv, which reduced the number of genes not in H37Rv to 730 (11.9% of the pangenome).
These results indicate that across M. tuberculosis strains, there still appears to be a large number of differences in genetic content. Therefore, the number of genes not in H37Rv were identified and summed for each genome and a proportion was calculated based on the total number of CDSs per genome. There was an average of 1.5% of genes not present in H37Rv per genome (median=1.6%, minimum=0.71%, maximum=2.1%, standard deviation=0.33%). Although this aligns with previous reports of the highly clonal nature of M. tuberculosis strains, the percentages were still higher than these reports (0.4% of genes in M. tuberculosis CDC1551 not present in H37Rv (Fleischmann et al. 2002)).
To identify a possible homolog in H37Rv, we aligned (blastn (Altschul et al. 1990)) all 730 to the nucleotide proteome of H37Rv. H37Rv homologs were identified based on lowest Evalue among all blastn hits (maximum Evalue of 0.01). Genes not meeting criteria for any category were categorized as having no homolog in H37Rv (did not align to any H37Rv gene or had an Evalue above the 0.01 threshold). Of the 730 genes, 705 aligned to a homolog in H37Rv. Additionally, 192 were considered PE/PPE genes (26.3%). Partial alignments represented the most PE/PPE genes not present in H37Rv compared to other categories (
Gene Duplication is the Primary Mechanism of Gene Gain in M. tuberculosis
To identify duplicated genes, we used CDHIT42 to find clusters of more than one sequence in the proteome of each isolate. We also queried for genes annotated as the same gene name from AnnoTUB annotations within all genomes. Sixty-three unique genes were duplicated at least once within the current dataset (Table S6, or
After excluding transposases (n=10), embR2, moaA3, and rv1319c were the top three most frequent gene duplications. All three duplications events have been described to occur in M. tuberculosis CDC1551 and lineage 4 strains (Azhikina et al. 2006; Stavrum et al. 2008; Gautam et al. 2017), the duplications moaA3 and rv1319c (mt1360) have been reported additionally in lineage 2 strains (Stavrum et al. 2008).
The single homolog in H37Rv (rv1319c) was described previously to be a result of a deletion-fusion event where the ancestor had two copies as demonstrated in CDC1551 (mt1360 and mt1361) but were fused into a single CDS (Fleischmann et al. 2002). M. tuberculosis Haarlem strain and M. bovis also demonstrated this fused CDS (Fleischmann et al. 2002). However, our investigation revealed duplications of rv1319c existed across the phylogeny, with the exception of all lineage 1 strains and the single lineage 7 strain (
Seventy-nine strains had a previously reported duplication event involving moaA3 and embR2 (Fang et al. 1999), known as RvD5 (Brosch et al. 1999). This region contains moaX-moaB3-moaA3-embR, similar to M. bovis and M. tuberculosis CDC1551 (Sarojini et al. 2005). The remaining 20 genomes did not have moaA3 or embR2: two strains matched H37Rv in this region (moaX-moaB3-rv3325-rv3326-rv3327) and the remaining had 2-4 copies of IS elements in various orientations. The twenty strains represented lineages 1, 2, 3, and 4, indicating that this variation is not lineage-specific. Since the moa duplication was postulated to occur in the ancestor of tubercle baccilli (Levillain et al. 2017), the duplication of embR2 may have also been duplicated in the ancestor. A BLAST search of embR2 in M. canettii strains in the BLAST nr database returned an identical embR2, indicating M. canettii also has two homologs of embR. Since both M. bovis and M. canettii strains have RvD5 intact (moaX-moaB3-moaA3-embR2), the expanse of IS elements in the 20 M. tuberculosis clinical strains likely occurred independently. This was not unexpected as this region is a known “hotspot” for IS element activity (IS6110 preferential locus [ipl]) (Fang and Forbes 1997; Ho et al. 2000).
Gene Conversion Alters B Cell Epitopes in M. tuberculosis PE_PGRS Proteins
Partial alignments represented the most genes not present in H37Rv (
We initially attempted to use the originally described PE and PPE motifs13, but these motifs detected PE/PPE genes in the H37Rv proteome with low sensitivity and specificity (Table S7, or
While these large numbers of PE/PPE genes within the accessory genome suggests active diversification within these families, their frequency in individual isolates varied within a tight range (142-162, median of 154). Furthermore, there was an average of 25 PE/PPE genes not present in H37Rv per genome (range 4-39, median of 25). These numbers suggest that extensive gene amplification is not occurring in these protein families even with the high amount of sequence divergence. This apparent limited amplification, together with the enrichment of PE/PPE genes not in H37Rv within partial alignments (
We investigated this hypothesis further and focused on gene conversion, which has been previously suggested to occur in PE_PGRS genes12. Gene conversion is a biased homologous recombination mechanism that results in intragenomic donation of genetic information46. Detecting gene conversion requires existence of non-converted versions of the two genes originally involved in the event (
One gene conversion event was discovered in lineage 1, 3, 4, and 7; two events were identified in lineage 2 (Table S10, or
The representative isolates from all lineages 2, 3, and 4 had predicted conversion events where M. bovis BCG PE_PGRS54 was the putative acceptor and M. bovis BCG PE_PGRS57 was the putative donor (Table S10, or
This consistent observation of the same parental genes converting but at different breakpoints suggests increased opportunity for recombination in different areas of the gene. A non-B DNA structure—G triplet/quartet quadruplexes (G4s)—could be acting as signals for DNA strand nicking48, which is the first step required for gene conversion. This is supported by significantly more G4s in PE_PGRS and frequent G4s observed in PE_PGRS54 (Table S11, or
PE_PGRS genes are known antigens that contain GC-rich, repetitive motifs but lack T cell epitopes within the PGRS domain17, which we support (
Noting that gene conversion events affected PGRS regions, we hypothesized that the resulting recombinants were selected for because they altered B cell epitopes, affecting antibody recognition and creating antigenic variation. To test this hypothesis, recombination breakpoints within each gene conversion event were compared to B cell epitope scores in each acceptor, donor, and recombinant. Importantly, the regions surrounding the recombined region within the acceptor and resulting recombinant aligned almost exactly (blue dotted boxes in
We sought to elucidate genetic diversity in M. tuberculosis clinical strains beyond small chromosomal changes, which are thought to be the main contributors of diversification in the species49. Yet, genetic events such as duplications and recombination events contribute to genetic diversity far more than small changes. Duplications described in the M. tuberculosis W/Beijing sublineage contribute to the constitutive overexpression of the dormancy regulon, increasing virulence (Domenech et al. 2010). Tandem gene duplications described in Vibrio species provide a protein dosage effect, which elevates respiration under anaerobic conditions50. Meanwhile, recombination via large segmental gene conversion has been shown to drive antigenic variation in pathogens like Neisseria gonorrhoeae51, Borrelia species52, and African Trypanosoma species53. However, due to the widespread use of short-read sequencing on M. tuberculsosis genomes, these events have been obscured by the systematic biases that exist with this type of sequencing. The most commonly used method to process short reads—reference alignment—precludes detection of new genetic content not present in the reference. Another bias—GC amplification—can obscure recombination hotspots, as these are most commonly found in GC rich regions54.
By de novo assembling long reads produced by PacBio, which have none of the systematic biases described above, we generated the largest number of reference-quality M. tuberculosis clinical isolate genomes reported to-date. If strict QC measures are not implemented, random sequencing errors (single-base indels55) propagate into downstream analyses, which can introduce inaccurate gene coordinates in genome annotation. To mitigate this effect of erroneous single-base indels, we developed a comprehensive quality control protocol, carefully verified every genome, and confirmed the consensus sequence. However, these strict criteria reduced the sample size for analysis by almost 50%. Those that were excluded contained a higher rate of error (
These genomes allowed annotation of genetic content absent from the most studied reference strain, H37Rv. Nevertheless, this reference strain has the most experimentally-verified CDS functions and coordinates among all M. tuberculosis reference strains56. Therefore, a hybrid annotation approach (AnnoTUB) was developed to preserve annotations from this well-studied reference while carefully annotating new genetic content ab initio. This resulted in a tight range of CDS number across the population (
Owing to systematic exclusion of these genes from short-read sequencing studies, this work is the first to assemble PE/PPE genes and study their conservation across a global collection of clinical isolates. A pangenome analysis compiled all annotations and revealed an unexpected abundance of absolutely identical genes across all 97 clinical genomes. These genes were enriched for essential genes but over 200 were considered not essential, which is unsurprising for an intracellular pathogen. Because the dataset referenced to analyze essentiality was performed in an in vitro environment39, it's possible these non-essential but identical genes are essential in vivo. To confirm their essentiality in vivo, transposon insertion mutants grown in an in vivo model would be needed. Several unexpected non-essential genes were several toxin-antitoxins (TAs), which typically mutate at high frequencies and are often lost through pseudogenization (Soo et al. 2014). Observing these identical TAs across all 97 strains indicates that a few genes within this family are not mutating or being lost. Although any of the identical proteins could be potential drug targets, toxin-antitoxin proteins are particularly attractive as they lack human homologs57, are active during dormancy41, and may be essential for its induction through inhibiting protein synthesis41. Outside of identical genes, the pangenome also uncovered an unprecedented number of genes not present in H37Rv (
Gene duplications represent an important source of gene gain in M. tuberculosis due to lack of HGT in the species (Boritsch et al. 2016). Gene duplication is hypothesized to be the safest way to acquire adaptive mutations rather than increasing mutation and recombination rates genome-wide (Francino 2005). However, once a duplication event has occurred, they are expected to be lost through genetic drift, random mutation, or recombination (Lynch and Conery 2000; Lynch and Force 2000). Retention of extra gene copies would likely be a recent evolutionary event or a selective force maintains both or all copies that originated from the ancestor, restricting the freedom to diverge (Bergthorsson et al. 2007). One such duplication in M. tuberculosis (moaA3-embR2), which represents the RvD5 region (deletion of this region in H37Rv) (Zheng et al. 2008), has been reported in several clinical strains (Sekar et al. 2009; Fang et al. 1999; Sarojini et al. 2005) including M. tuberculosis CDC1551 (Gautam et al. 2017), M. bovis (Sarojini et al. 2005), and M. canettii (Levillain et al. 2017). This region was created via an inverted gene duplication of moa1 operon and subsequent fusion of moaE1 and moaD1 (moaX), within the ancestor of tubercle (TB) bacilli (Levillain et al. 2017). embR2 may have also been an inverted duplication event within the ancestor of TB as M. bovis and M. canettii have embR2 adjacent to moaA3. This region appears to be conserved as the majority of strains in the present study had the ancestral moaA3-embR2 region intact. Another frequent gene duplication was rv1319c. In H37Rv, this region contains three adenylyl cyclases (rv1318c-rv1319c-rv1320c) but in CDC1551 there are four adenylyl cyclases (mt1359-mt1360-mt1361-mt1362). mt1359, mt1361, and mt1362 are considered homologs of rv1318c, rv1319c, and rv1320c, respectively, with mt1360 having close homology to mt1361/rv1319c (Fleischmann et al. 2002). It was previously suggested that rv1319c was created by a fusion between mt1360 and mt1361 and a subsequent deletion of mt1360 (Fleischmann et al. 2002). However, the orientation of three cyclases was more likely a result of the duplicated homolog (mt1360) being lost, not a deletion-fusion event. This hypothesis is supported by two clades in lineage 2 and lineage 4 in which the ancestors of the clades seemed to have had a functional mt1360 homolog but had accumulated mutations that introduced premature stop codons (
When a gene is duplicated, one of the two copies is freed from selective constraints and can accumulate neutral mutations6. However, these mutations can often be deleterious and result in a loss-of-function, producing a pseudogene. Gene loss is expected to occur more frequently than gene gain because of the expected genome contraction of intracellular organisms60. Gene loss has been reported to be the primary source of diversity in M. tuberculosis but this conclusion could not be substantiated, as duplications were excluded from the analysis7. Our analyses demonstrated a slightly higher rate of gene loss measured by pseudogene frequency, but frequent gene gain indicates gene loss is not the sole source of diversity in this pathogen.
Recombination is another important source of genetic diversity, generating antigenic variation in many pathogens21. PE/PPE proteins are thought to play a role in evasion of immune responses61 and have been suggested to be undergoing gene conversion12, indicating a potential mechanism of antigenic variation. We confirm this hypothesis, though gene conversion was observed only in the PE_PGRS subfamily in the hypervariable PGRS domain (
The consistent recombination in PE_PGRS genes indicates a benefit for survival in the human host, possibly linked to immune evasion due to their putative function as antigens32. Hypervariable regions of antigens are expected to mutate to evade recognition by immune cells69. However, a previous report17 and the data in the present study (
In contrast, B cell epitopes were predicted within the PGRS domain indicating a potential interaction with B cells. In pathogens like N. gonorrhoeae, antigens undergo gene conversion to create new combinations of B cell epitopes, significantly affecting the binding affinity of existing antibodies. This mechanism requires a donor sequence that is often a pseudogene, which always invades the acceptor sequence through homologous recombination, resulting in a novel chimera of acceptor and donor epitopes (Deitsch et al. 2009; Palmer et al. 2016). This chimera requires the immune system to create new antibodies with high specificity to the mutated epitopes. We hypothesize the pgrs domain of pe_pgrs genes are undergoing a similar mechanism of recombination to create chimeras of the donor and acceptor sequences to promote immune evasion. This mechanism has been experimentally verified in several species (Palmer et al. 2016) but would need to be confirmed in M. tuberculosis with immune assays. Additionally, secretion is a required aspect of antigens to interact with immune cells. Several reports have noted that PE_PGRS proteins are exported to the surface of the cell (Banu et al. 2002; Brennan et al. 2001), while others have noted certain lineages lack the ability to secrete PE_PGRS proteins (Ates et al. 2018). Secretion of PE_PGRS proteins across all M. tuberculosis lineages would need to be confirmed.
Although care was taken in producing reference-quality genomes and annotation, results presented here and in other studies (Bolotin et al. 2015; Yang et al. 2018b) are affected by sample size and population structure. Additionally, M. africanum is known to cause human tuberculosis but was not included in the current study. A similar study on PacBio-sequence M. africanum genomes may elucidate clinically relevant genetic content. Diverse samples representing all M. tuberculosis lineages proportionally as well as isolated from several geographic regions would further support the conclusions made here. Additionally, M. bovis BCG may not contain all non-converted PE/PPE genes which can increase false negatives in a gene conversion detection analysis. A reference-quality, annotated genome from an ancestral strain closely related to M. tuberculosis would more accurately represent non-converted PE/PPE genes. Finally, accuracy in detecting linear B cell epitopes is low compared to detection of T cell epitopes. To identify more true positives, a higher threshold was chosen to identify more confident epitope locations, but this also increased false negatives. Experimental verification of antibody recognition to the inferred epitopes in PE_PGRS with predicted gene conversion would be needed to confirm the findings presented here.
This distraction and the predicted dearth of B cell epitopes observed in the PGRS domain (
Finally, accuracy in detecting linear B cell epitopes is low compared to detection of T cell epitopes. To identify more true positives, a higher threshold was chosen to identify more confident epitope locations, but this also increased false negatives. Experimental verification of antibody recognition to the inferred epitopes in PE_PGRS with predicted gene conversion would be needed to confirm the findings presented here.
This study demonstrates that relying solely on small variations compared to M. tuberculosis H37Rv limits resolution when studying genetic diversity within the species. We show that by producing reference-quality genomes and carefully annotating each genome, the genetic diversity of M. tuberculosis is evolving and expanding through gene gain, gene loss, and gene conversion. Overall, this study opens the door to understanding the remarkable genetic diversity of this deadly pathogen.
One hundred and seventy-eight genomes were either re-sequenced due to low coverage in a previous dataset (n=113)74, chosen from the strain collections at the World Health Organization Supranational References Laboratories in Stockholm, Sweden and Antwerp, Belgium based on their drug susceptibility profile and discordant genotype (n=41), or downloaded from the Sequence Read Archive (n=24).
Isolate selection for the Global Consortium for Drug-resistant tuberculosis Diagnostics (GCDD) population was described previously (Hillery et al. 2014). We selected 113 isolates from the GCDD to re-sequence on Pacific Biosciences polymerase 6-chemistry 4 (P6C4) that had the lowest coverage genome-wide in a previous sequencing run (Bioproject PRJNA353873). We chose an additional 41 isolates from World Health Organization Supranational References Laboratories in Stockholm, Sweden and Antwerp, Belgium to sequence based on their drug susceptibility profile and a corresponding discordant susceptibility genotyping result (i.e. resistant to amikacin but did not observe rrs A1401G on the MTBDRsl platform); or, they were chosen because mono-resistance was detected. Each strain was isolated from a single patient.
Additionally, publicly available PacBio sequences were queried within the SRA (Leinonen et al. 2011) database. FASTQ formatted files do not contain the required quality scores to run any SMRTAnalysis protocol. We therefore downloaded the HDF5 (. bax.h5) files, which contain all necessary sequencing quality metrics including the interpulse duration (IPD) for subsequent methylation analysis. The HDF5 files were downloaded using wget for the SRA accession numbers within the following Bioproject accession numbers: PRJEB21888, PRJBA270004, and PRJEB8783. Knockout strains and non-tuberculosis Mycobacteria were excluded from this study. This resulted in 19 genomes from public datasets. Two reference strains were additionally included: M. tuberculosis H37Rv (NC000962.3) and M. tuberculosis H37Ra (CP016972.1).
For isolates re-sequenced from the GCDD, DNA was extracted as described previously (Elghraoui et al. 2017). Briefly, cells were cultured on Middlebrook 7H11 agar plates and incubated until growth of a full lawn. Extraction was performed using Genomic-tips (Qiagen Inc., Hilden, Germany) following manufacturer's protocol. B1/RNase buffer solution was introduced to the culture, which was then homogenized by vortex mixing and inactivated at 80° C. for 15 minutes. Lysozyme was added and incubated for 30 minutes at 37° C. then proteinase K was added and incubated at 37° C. for 60 minutes. Wide-bore pipet tips were used to recover large DNA fragments and DNA purity and concentration were analyzed on Nanodrop 1000 Spectrophotometer (ThermoFischer Scientific, San Diego, Calif., USA).
Sequencing was also performed as previously described (Elghraoui et al. 2017). Briefly, each clinical strain was sequenced on the PacBio RS II platform with P6C4 and a 20 kb insert library preparation using a single SMRT cell per strain.
DNA extraction and sequencing for the publicly available HDF5 files were described previously (Berney et al. 2015; Zhu et al. 2015; Phelan et al. 2018).
Genome Assembly with Quality Control
Assembly: For isolates that were sequenced on multiple SMRT cells (for example multiple SRA entries sharing the same isolate name), all SMRT cell sequencing runs were combined. Hierarchical Genome Assembly Process (Chin et al. 2013) (HGAP) version 2 (RS_HGAP_Assembly.2 protocol) was used to assemble raw reads from HDF5 formatted files using default parameters. If HGAP2 was unable to assemble the genome (for example multiple contigs or no contig could be circularized [see #0]) or a misassembly was detected (see #0), the assembler Canu (Koren et al. 2017) (v1.6) was used with default parameters with the exception of the -pacbio-raw flag. Canu (Koren et al. 2017) was confirmed to be more accurate in assembly than several assemblers and was, therefore, the assembler of choice for isolates sequenced in 2017 (the publication year of Canu). Any genome that failed assembly using Canu was excluded from this study.
Circularization: For isolates sequenced prior to 2017, a custom Perl script was written to reset the genome start to the origin of replication with dnaA as the first gene (Brzostek et al. 2009); minimus2 from AMOS (http://amos.sourceforge.net) was then used to circularize each genome using default parameters, as described previously (Elghraoui et al. 2017). However, minimus2 could not successfully circularize all genomes due to multiple contigs or no detection of dnaA. For these cases, circulator (Hunt et al. 2015) was used to circularize using the ‘all’ command and the --genes_fa flag set to the H37Rv dnaA sequence to reset the genome order, default parameters were used otherwise. All isolates sequenced in 2017 (the publication year of circlator) or later were circularized with circlator. Any genome that failed to circularize after this step was excluded from this study.
Consensus error correction/polishing: Consensus error correction is an important step in producing a final consensus sequence of high-accuracy, particularly since HGAP2 and Canu are not sensitive enough to call consensus in repetitive regions. Error correction (or “polishing”) was performed as previously described (Elghraoui et al. 2017). Briefly, the circularized sequence was polished using BLASR+Quiver (RS_Resequencing protocol) in SMRTAnalysis (v2.3) to achieve a complete consensus sequence; three rounds were determined to be sufficient to attain a consensus sequence. The maximum coverage parameter in Quiver was set to 1000, default parameters were used otherwise. If a consensus sequence could not be achieved after three rounds of polishing, it was excluded from this study.
Assembly Quality Control: In order to validate the assembly, each genome was analyzed for potential breaks using PBHoney (English et al. 2014) and errors in consensus calling (see #3 above). We considered regions as “breaks” in PBHoney when the variant was supported by at least 10% of the total coverage at the location in the genome. Assemblies that failed this step (had breaks with at least 10% support) were excluded from further analysis in this study. Verifying the final consensus sequence was included in QC since iterating over polishing rounds is not a convergent process for repetitive genomes). A threshold of four variants in the final error correction run was chosen based on preliminary analyses indicating a high rate of single-base deletions (Error! Reference source not found.7) and significant breaks in the annotation when more than four variants were detected (see Sequence QC through Detection of Unique Genes below).
A custom annotation pipeline was created and designed specifically to annotate M. tuberculosis genomes. This annotation pipeline (AnnoTUB) first transfers annotations from a well-characterized reference, H37Rv, where sequence homology is high; next, performs ab initio annotation on regions where the reference annotation could not be transferred; followed by merging these two annotation methods; identifies orthologous annotations where the transfer and ab initio methods could not infer function, and, lastly, identifies PE and PPE genes. The characteristics of the M. tuberculosis genome allow for specific assumptions to be made (for example high genome conservation) which are discussed in detail below. The full pipeline of AnnoTUB is presented in
Annotation Transfer: The first step in the annotation pipeline was transferring genes from a well-characterized reference. We prioritized this transfer step over ab initio methods because the reference annotation contains a larger number of experimentally verified functions, as compared to databases referenced by ab initio software (see Ab initio Annotation). In addition to a required high amino acid sequence identity, synteny is considered during transfer, which is absent in ab initio algorithms. Because of the high sequence conservation across M. tuberculosis strains, resulting in a 98.8% transfer of CDSs in H37Rv to M. tuberculosis F11 in a previous report (Otto et al. 2011), we assumed that the majority of the reference annotation will transfer to each clinical isolate's genome, with the exception of a few corner cases (see Merge section).
We transferred an in-house curated M. tuberculosis H37Rv annotation to all genomes included in this study. The custom curation was part of a comprehensive study of all hypothetical and ambiguously annotated proteins in H37Rv that included literature curation and function inferred from shared protein structure (Modlin et al. 2018). The Rapid Annotation Transfer Tool (RATT) (Otto et al. 2011) (v18) was used to perform the transfer. Although RATT allows for the configuration of custom start codons, start codons were not reset due to a lack of prioritization of the configured codons; we therefore relied on experimentally and computationally predicted start codons in the H37Rv annotation. The codons TGA, TAA, and TAG were configured as stop codons. We set the transfer type to be Strain (Transfer between strains), which has been tested on transferring annotations in H37Rv to clinical M. tuberculosis genomes (Otto et al. 2011). Within this setting, a 95% nucleotide sequence identity was implemented.
Ab initio Annotation: Although M. tuberculosis has high sequence conservation across strains (Otto et al. 2011), there are areas of the genome that continually change due to recombination, duplication, or spontaneous mutation (Phelan et al. 2016). In these regions, RATT will not be able to transfer annotations since the sequence identity has altered significantly or the sequence does not exist in the reference. When these regions were encountered, we relied on Prokka (Seemann 2014) (v2.6) to annotate. Prokka utilizes the gene prediction algorithm, Prodigal (Hyatt et al. 2010), to identify candidate gene coordinates and annotates these genes, prioritizing accuracy to minimize over-annotation, then references several databases to infer function based on sequence homology, domain search, and motif discovery. We ran Prokka twice on each genome; the two executions of Prokka differed by a single parameter, the --proteins flag with the custom H37Rv annotation as the argument (“Prokka-reference”). This flag prioritizes the user-defined input/database in annotating CDSs, then queries RefSeq, UniProt, and InterPro (for domains); however, we observed several corner cases (see below in
Merge Transfer and Ab Initio Annotations-Identifying partial annotations). We performed another run of Prokka without the --proteins argument, allowing Prokka to first query RefSeq for all CDSs (see below
Merge-Annotations with Prokka-no-reference). For both executions of Prokka, common parameters were as follows: --genus was set to Mycobacterium, --kingdom was set to bacteria, --rfam was flagged, --rnammer was flagged, --gram was set to pos, --usegenus was flagged, --centre was set to C, and --locus_tag was set to the isolate ID.
Merge Transfer and Ab Initio Annotations: We created a custom software, annomerge (implemented in Python), that merges the output of RATT (Otto et al. 2011) and Prokka (Seemann 2014) into a Genbank format file. annomerge accepts annotations in a hierarchical manner: first, taking confident annotations from the output of RATT; second, identifying confident annotations from Prokka-reference; third, filling in remaining annotations from Prokka-no-reference, with strict criteria (see below).
The following describes how corner cases were identified and resolved:
Non-contiguous CDSs: If a CDS annotation from RATT spanned non-contiguous regions (typically caused by a frameshift or ribosomal slippage in the coding region (Tatusova et al. 2016; National Center for Biotechnology Information) and represented as join's in the output), this annotation was not accepted and potential annotations of this region were queried for in Prokka.
Resolving Overlapping Annotations between RATT and Prokka-reference: In cases where a CDS overlaps between RATT and Prokka-reference, we presumed that either Prodigal did not set the start codon correctly or, because we did not allow RATT to reset the start codons, it's possible the gene starts with a rare codon and may be misannotated. If the two annotations did not share the same open-reading frame (i.e. did not share a start or stop codon), both RATT and Prokka annotations were accepted.
However, in cases where the frame was the same (typically shared the same stop codon), we compared the nucleotide sequence of the promoter region (assuming a leaderless transcript, 40 bp upstream of the gene start) from Prokka-reference coordinates to the sequence of the same region in H37Rv (AL123456.3) using blastn with default parameters. If a mutation was observed in the promoter region in the clinical isolate, the gene coordinates obtained from the Prokka-reference annotation were maintained, otherwise, the gene start was modified to match the relative start of the gene in H37Rv and the gene product annotation was transferred.
Verification of gene coordinates: Prodigal, executed in Prokka, is an unsupervised machine learning algorithm that uses the input genome sequence to learn gene structure properties such as codon statistics, ribosome binding site (RBS) motif usage, start codon frequencies (only ATG, GTG and TTG) and predicts gene coordinates in this genome (Hyatt et al. 2010). Although utilization of genome-specific metrics such as codon statistics make Prodigal an unbiased tool for gene predictions, some implicit biases may occur due to restriction of start codons to ATG, GTG and TTG. If a rare codon is the start, Prodigal would propagate until one of the three start codons was identified. This bias can be corroborated with the results of gene coordinate prediction of 62 experimentally verified genes in Mycobacterium tuberculosis H37Rv using Prodigal, where all 62 genes were identified with 58 having the correct start coordinate, indicating that the start site prediction by Prodigal has an accuracy of 93.6% for the set of 62 experimentally verified genes (Hyatt et al. 2010).
Because Prodigal was trained on a specific set of prokaryotic genomes and contains general rules about gene structure in prokaryotes (Hyatt et al. 2010), we executed Prodigal on the H37Rv genome and compared the gene coordinates to those predicted in the H37Rv annotation available on NCBI Protein database (AL123456.3). This resulted in an 88.09% concordance, with 480 genes having incorrect coordinates predicted by Prodigal. The incorrect gene coordinate annotations for the 480 genes were then used to recalibrate start codons for these genes in the clinical isolates.
When any of these 480 genes were encountered in a genome, the length of each gene was compared to the same gene in H37Rv. If the lengths were different, the nucleotide sequence upstream (40 bp) of this gene (the start was considered the same, relative start as that in H37Rv) was compared to the same upstream region in H37Rv to check for any mutations that may have caused Prodigal to choose a different start coordinate. If a mutation in the upstream region of the gene was found, the gene coordinates predicted by Prokka were accepted; otherwise, the coordinates were recalibrated to match that of H37Rv.
Identifying partial annotations: A partial sequence was classified as such if the alignment between the query (isolate) and subject (database hit in RefSeq or UniProt) satisfied the identity threshold (>95%) but was below the coverage threshold (<95%, subject or query coverage). Often, due to the use of an E-value rather than identity and alignment coverage, Prokka will annotate a CDS with the same functional information as the subject sequence even though the query coverage was extremely low. However, preliminary analyses indicated that these cases were often caused by a mutation, abolishing a stop codon, which extends the protein length; or, the mutation introduced a stop codon, prematurely, thereby truncating the protein. We assumed that in either case, the new sequence encodes an entirely different protein. If a CDS annotated by Prokka-reference was a partial sequence from a gene in H37Rv then the CDS, but not the gene name, was accepted and a downstream clustering or orthology-based method (mentioned below) was used to assign functional annotation.
CDSs with only domain annotation: If a CDS was annotated by Prokka (reference or no-reference) solely based on domain information (i.e. using InterPro alone), then a downstream clustering method (discussed in Clustering) was used to assign functional annotation for this CDS instead of direct transfer of functional annotation from Prokka. The InterPro results were maintained as a note.
Annotations with Prokka-no-reference: During the incorporation of CDSs from Prokka-no-reference, CDS functional annotations categorized as hypothetical protein that were not annotated with a gene name or Rv locus tag were excluded. These CDSs were not annotated as they were putative protein-coding genes and evidence of the proteins translated from these coding sequences have not been reported in the Mycobacterium genus yet(Seemann 2014). This was to ensure reduction of false positive annotations in this pipeline.
Annotation of non-CDS genomic elements: In addition to protein-coding genes, other genes and genomic features were incorporated as part of the annotation. Although these non-CDS genomic elements were incorporated into the annotation of clinical isolate for completeness, they were not used in any analyses in this study. Genomic features such as transfer RNAs (tRNAs), ribosomal RNAs (rRNAs), and signal peptides were annotated in each genome. Prokka executes RNAmmer(Lagesen et al. 2007) (v1.2) to identify and annotate rRNAs. tRNAs can be annotated by both RATT and Prokka; RATT uses tRNAscan-SE (Lowe and Eddy 1997) and Prokka uses ARAGORN(Laslett and Canback 2004) (v1.2) for the annotation of tRNAs. However, because ARAGORN is reported to have higher sensitivity compared to tRNAscan-SE (Laslett and Canback 2004), only tRNAs annotated by Prokka were accepted. Signal peptides were also accepted from Prokka.
To identify genes that were not assigned a locus tag or gene name (annotated by Prokka) and identify which genes were the same across isolates, we clustered sequences first using CD-HIT (Li and Godzik 2006; Fu et al. 2012) then the Markov Cluster (MCL) (Enright et al. 2002) algorithm.
CD-HIT sorted a given set of sequences (here all CDSs present in the population) in descending length and the longest sequence was chosen as the representative sequence for the first cluster. Each sequence was then compared to this representative and considered part of the same cluster based on a 99% to 98% amino acid sequence identity and 100% coverage. If the query sequence did not meet these thresholds, it was considered the representative sequence of a new cluster. This process continued for all protein sequences in all the clinical isolates. The identity threshold was then reduced by 0.005% and the process began again based on the previous set of clusters, which created a cluster of clusters that was input into an all-against-all BLAST.
An all-against-all blastp (Altschul et al. 1990) alignment was implemented to create a matrix. The matrix represents a connection graph with sequences or CDSs as nodes and sequence identity as edges (greater than 95% identity and query and subject coverage). Weights were calculated by taking the logarithmic value (base 10) of the E-value from the blastp alignment between those two nodes or CDSs.
The matrix was then passed into MCL, which performs iterative rounds of matrix multiplication (expansion) and inflation (contraction) until there was no net change. To determine when there was no net change, the inflation parameter, which determines the granularity or tightness of the clusters, was set to 1.5 in our clustering step. The lower the inflation threshold the tighter the clusters (range of 1 to 10). This tighter threshold was chosen based on the assumed high genome conservation across M. tuberculosis strains.
Based on the clusters created above, we isolated clusters that contained unnamed sequences (i.e. CDSs with randomly assigned locus tags), which resulted in three types of clusters: i) clusters containing only unnamed sequences, ii) clusters with unnamed sequences and named sequences with the same name, and iii) clusters with unnamed sequences and named sequences with different names. Here, named sequences were those with an Rv locus tag or a fully characterized gene name; unnamed sequences were those that Prokka identified as a CDS but was unable to identify a gene name (see
Genome Annotation (AnnoTUB)—
Annotation Transfer)
Steps to assign unnamed sequences for each case were as follows:
1. Clusters containing only unnamed sequences: The representative sequence of this cluster was compared to all CDSs in H37Rv (AL123456.3) using blastp to see if the unnamed sequence was homologous to a gene in H37Rv. Thresholds for this consideration were at least 95% amino acid sequence identity and 95% query and subject coverage. If no hit was identified the H37Rv proteome, the cluster was assigned a new gene name with the syntax ‘MTBxxxx’, where the ‘x’ indicates a digit, starting at MTB0001 assigned in order of the clusters output from MCL.
2. Clusters with unnamed sequences and named sequences with the same name: All unnamed sequences in these types of clusters were assigned the gene name given to the representative sequence of this cluster, if the representative was annotated with an Rv locus tag or a gene name. If the representative sequence was not annotated with a gene name or Rv locus, the first named sequence was identified within the cluster and all other unnamed sequences, including the representative sequence, were annotated with this gene name.
3. Clusters with unnamed sequences and named sequences with different names: Each unnamed sequence was compared to all named sequences in the cluster using blastp to determine which named gene the unnamed sequence was most identical to using a 95% threshold of identity and coverage. The unnamed sequences were annotated with a gene name based on the best hit to a named sequence. If the thresholds were not met, the gene was assigned a ‘MTB’ gene name.
All clusters that only contained gene names (either multiple or just one and no unnamed sequences), were not modified or processed further.
Annotations were updated with the newly assigned gene names using a custom Python script.
Thresholds used in previous steps were designed to be strict to offer confidence in the annotation. All thresholds were decided based on RATT and Prokka defaults. However, this leaves approximately 3300 genes without any functional annotation across 99 genomes. We therefore sought to annotate these unannotated genes using orthology via the eggNOG-Mapper software (Huerta-Cepas et al. 2017), which parses the eggNOG database based on an input multisequence FASTA file. This software utilizes two annotation mapping algorithms: DIAMOND and HMMER. The DIAMOND algorithm performs a BLASTP across all the protein sequences in the eggNOG database to identify the best seed ortholog match for the query sequence. HMMER on the other hand, parses all Hidden Markov Models (HMMs) of orthologous groups in the eggNOG database and identifies the best match. Once the orthologous group is identified, phmmer is used to identify the best sequence hit within the orthologous group. In both instances, the seed ortholog, which is the best sequence hit to the query, is identified. This seed ortholog is then used to infer fine-grained orthologs, based on the pre-computed Cluster of Orthologous Group (COG) this seed ortholog belongs to in the eggNOG 4.5 database (Huerta-Cepas et al. 2017). An E-value threshold of 10−6 was set to avoid transfer of functional annotation from distant orthologs. Functional annotation was transferred from orthologs that were taxonomically closest to the query (this parameter was automatically adjusted for each query by eggNOG-mapper), minimizing false/mis-annotations.
We chose to first process all unannotated genes (identified in the
Clustering: step) using the DIAMOND algorithm since this algorithm is more time-efficient and has a higher sensitivity as HMMER when the source organism of the query proteins is well-represented in the eggNOG database (Powell et al. 2012). We set the E-value threshold to 10−6 (the default in Prokka). Genes that could not be annotated with the DIAMOND algorithm were passed into the HMMER algorithm. The annotation of the seed ortholog was then picked as the best annotation. Additional functional information such as experimentally verified Gene Ontology (GO) terms and KEGG pathways were transferred from the orthologs that were taxonomically closest to the query.
The gene feature was updated with the gene name predicted by eggNOG apart from PE/PPE genes. Genes annotated by eggNOG-mapper as a PE or PPE family proteins were not updated because these genes more likely aligned across the conserved region, not by the variable region, which is what defines different PE/PPE genes within each family (Cole et al. 1998). PE/PPE gene assignments is described in-text. (Identifying PE and PPE Genes). Seqret v.6.6.0.0 (http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/seqret.html) was used to convert Genbank format to GFF3 for input into Roary (see below Error! Reference source not found.).
AnnoTUB and all corresponding modules are available at https://gitlab.com/LPCDRP/annotub.
Although we implemented a strict quality control protocol post-assembly, we sought to further QC each assembly by: 1) annotating each genome that passed assembly and consensus QC; 2) execute a pangenome analysis (see Error! Reference source not found. below for details); 3) tabulate the number of unique genes per genome. Error! Reference source not found. a demonstrates an outlier with a higher than average number of unique genes as compared to their closest relative (determined by SNP Hamming distance). We postulated that these outliers may have a high amount of single-base deletions. We, therefore, calculated the deletion-to-insertion ratio and each outlier had two times higher the amount of single-base deletions than insertions. This outlier was excluded due to potential sequencing errors (
To create pan, core, and accessory genomes, we used the software Roary (Page et al. 2015). We separated isolates into five populations: all isolates in the population, Euro-American isolates, East Asian isolates, East African-Indian isolates, and Indo-Oceanic isolates. The ‘all’ isolates population provided a global perspective of genes in clinical isolates while the lineage separation allowed analysis of lineage-specific effects, genome-wide. The pangenome is defined as the unique union of all genes in the given population; the core genome is considered the compendium of genes that are represented in 99% of the population; and the accessory genome is the remaining genes that did not meet the core genome threshold but existed in more than one genome. A 99% threshold for the core genome was chosen since the population consists of the same species of a clonal bacteria, which is also recommended by Roary. We split inparalogs to identify copies of genes that may be part of the accessory genome; we flagged the −e option to create a core alignment using PRANK; default parameters were used otherwise (apart from the custom edits described below).
Custom Edits to Roary: A few modifications were implemented to the Roary source code to account for the clonality and high sequence similarity of M. tuberculosis genomes. Specific details to the modifications were as follows:
More Strict Thresholds in blastp Parallel All-Against-All: We modified the criteria within the parallel all-against-all blastp execution to be stricter when identifying aligned pairs. Our criteria were: at least 95% identity, query coverage, and subject coverage to pass. This was implemented to match thresholds used within AnnoTUB and provide matching results in the cluster output.
“Collapsing” Clusters that Share Gene Names: We favored the annotation of a given CDS over the clustering generated by Roary. This translates to iterating over the output of MCL (after splitting paralogs) and identifying clusters that were annotated as the same gene name (except for PE/PPE genes annotated by eggNOG, see
Genome Annotation (AnnoTUB)—
Orthologous Annotation). Clusters that were not considered inparalogs were collapsed if they shared a gene name.
We utilized the H37Rv gene classified as essential or otherwise from DeJesus et al (DeJesus et al. 2017). This study had five classifications for essentiality in H37Rv: essential (ES), essential domain (ESD), growth defect (GD), growth advantage (GA), and non-essential (NE). We classified ES and ESD as essential and GD, GA, and NE as non-essential. We developed a custom Python script to compare each lineage's core genome to the list of essential and non-essential genes, assuming essential genes are more likely conserved across strains.
Classifying genes not in H37Rv: We created four categories to classify these genes:
1. Within RvD1-6/TbD1
2. In M. tuberculosis CDC1551
3. Transposases
4. Duplications
5. Pseudogenes
6. Genes that containing a single-base indel or contained an indel in a homopolymer region
7. Gene truncations
8. Gene extensions
9. Partial alignments
To identify the closest H37Rv gene relative for all genes in the not in H37Rv gene list, we performed a pairwise all-against-all BLASTN for all genes not in H37Rv against all genes in H37Rv. We chose the closest gene relative based on the following criteria. Of all BLASTN hits, choose the hit with
The HGAP275 protocol from SMRT Analysis v2. 3™ was used to assemble each genome (isolates sequenced on multiple SMRT cells were combined). If HGAP2 could not produce a quality genome, Canu76 was used to attempt to resolve the genome into a single contig. Due to the similarity of the algorithms between the two assemblers and a previously reported minimal difference in consensus sequences of E. coli K12 between the two assemblers77, we did not expect many of these genomes to be resolved by Canu. Each genome was circularized with either minimus2 (http://amos.sourceforge.net) (if HGAP2 was the assembler) or circlator78 (if canu was the assembler). Consensus polishing was performed over three rounds of alignment and variant calling to obtain a final consensus sequence. Several quality control measures were implemented: successful circularization, accuracy of the final consensus sequence, and/or misassemblies detected by PBHoney79. Detailed parameters and thresholds are discussed in the Supplemental Information.
A custom annotation pipeline, Annotate TUBerculosis (AnnoTUB) (https://gitlab.com/LPCDRP/annotub), was designed to annotate M. tuberculosis genomes. To reduce false positive annotations, we relied on an initial annotation transfer step, performed by the Rapid Annotation Transfer Tool (RATT) (Welcome Trust Sanger Institute), which has been previously validated in M. tuberculosis genomes33. Because H37Rv has the most well-characterized functional CDSs as well as experimentally confirmed gene coordinates, we used H37Rv as the reference annotation to transfer. We used RATT's Strain flag, which implements a 95% identity to identify conserved genomic regions. Large stretches of unannotated regions were observed; therefore, we attempted to fill these regions with an ab initio approach, implemented by Prokka first using a reference database (H37Rv) and then prioritizing public databases such as UniProt and InterPro/PFAM. An Evalue threshold was set to 10−6, the default in Prokka. We further verified any annotation to a gene in H37Rv using a 95% amino acid identity and subject and query coverage. For any predicted CDSs lacking functional annotation, we queried EggNOG80 (European Molecular Biology Laboratory (EMBL)) using the eggnog-mapper81 for orthologous genes, requiring H37Rv to be the seed orthologue. We first utilized the DIAMOND method using an Evalue threshold of 10−6 to match the threshold set in Prokka (Victorian Bioinformatics Consortium); we then queried any remaining CDSs lacking functional annotation using the HMM method. A custom Python script (https://gitlab.com/LPCDRP/annomerge) merged all of these annotations into a single annotation file for each genome, incorporating information in a hierarchical manner: i) transferred CDSs from H37Rv, ii) ab initio annotations, iii) EggNOG orthologues.
rRNAs were annotated using Prokka's implementation of HMMER82. tRNAs were annotated using Prokka's implementation of ARAGORN83.
We further sought to score each assembly based on a previously defined set of criteria used to quantify assembly quality31, including sequence quality score, rRNA score, tRNA score, and Pfam-A essentiality score. Because the first three scores were 1.0 across all genomes, calculating an overall quality score using deviations in variance was not applicable. Instead, all four scores were averaged for each genome.
For isolates that were re-sequenced, lineage information was obtained by inserting the MIRU-VNTR and spoligotype patterns determined previously84 into TBInsight85. For all other genomes, a custom script, MiruHero (https://gitlab.com/LPCDRP/miru-hero), determined lineage.
First, SNPs were called using dnadiff (Marcais et al. 2018) (v1.3) with M. tuberculosis H37Rv (NC_000962.3) as the reference and called SNPs and small indels with default parameters. A custom Perl script converted the dnadiff SNP output into a VCF v4.0 file and the Variant Effect Predictor (McLaren et al. 2016) (v87) determined the consequence of each variant.
Next, a multisequence FASTA file was created by concatenating the SNP calls present in the VCF of each genome. If a SNP was not called in a genome, the reference base was inserted. This resulted in 20549 SNPs.
Finally, the SNP FASTA was uploaded to the CIPRES Science Gateway (Miller et al. 2010). RAxML (Stamatakis 2014) v8.2.10 created a maximum likelihood phylogenetic tree using the GTR+GAMMA substitution model, 100 bootstraps, and ascertainment bias correction with Lewis. The SNP tree was rooted using Mycobacterium bovis (AM408590.1) and Mycobacterium canetti (NC_019951.1) as outgroups.
Tree visualization and analysis were performed with the Interactive Tree of Life (iTOL) (Letunic and Bork 2016).
Gene duplications were identified using two methods: i) sequence homology clustering and ii) annotated as the same gene name from AnnoTUB. The first was performed by executing CDHIT on each proteome using a 95% amino acid identity and overall coverage to cluster sequences. Any cluster with more than one gene was considered a duplication event. For the second, gene names were compared across each genome and if there were more than one CDS with the same gene name, the name and copy number was noted.
To search for embR2 in M. canettii, EmbR2 from M. tuberculosis CDC1551 was queried across all M. canettii strains available in the nr database using BLASTP with default parameters.
Gene Ontology (GO) enrichment analysis
All known protein coding genes in H37Rv (GenBank accession AL123456.3) and all genes annotated by AnnoTUB were compiled in multisequence amino acid FASTA file. This file was input into the eggNOG-Mapper (Huerta-Cepas et al. 2017) software to retrieve experimentally verified GO terms for each protein coding gene. GO annotations associated with a protein sequence have evidence codes as additional information (Gene Ontology Consortium 2017). The evidence code corresponds to varying degrees of confidence and the source for inference of function of the protein. The evidence for GO terms range from the most confident source of annotation, ‘inferred from experiment’, to the least confident ‘inferred from electronic annotation’. For this study, only GO terms denoting functions inferred from experimental methods (including, direct enzymatic assays, mutant phenotype, gene expression pattern) were included from eggNOG-Mapper (Huerta-Cepas et al. 2018, 2017) for each protein-coding gene. Additionally, a custom Python script was used to retrieve GO term predictions based on structural homology, as part of the study of all proteins of unknown function in H37Rv (Modlin et al. 2018). A dataset of these GO terms, either known to be experimentally verified for the genes, or annotated based on structural homology, was used as the custom annotation for GO enrichment analysis.
GO term for candidate set of genes was performed using a custom R script using the GOfuncR package (Grote 2018) (https://github.com/sgrote/GOfuncR). For all candidate sets, the pangenome was used as the background dataset to determine enrichment.
Pan, core, and accessory genomes were created using Roary38. Roary was executed on all isolates. Details on parameters for Roary are described in the Supplemental Information.
Identical genes were isolated from the core genome using CDHIT86. First, all protein alleles for core genes were compiled across all isolates. Next, CDHIT was executed for each gene requiring 100% amino acid sequence identity, 100% query coverage, and 100% subject coverage (100% identical). A gene was considered identical if and only if all alleles were 100% identical in at least all 97 M. tuberculosis clinical genomes. A gene was not excluded if H37Rv and/or H37Ra were also 100% identical.
To determine enrichment of essential genes, genes were compared to the essential genes described in DeJesus et al39. Genes listed as Essential or Growth Defect were considered essential; genes listed as Not Essential or Growth Advantage were considered not essential. Enrichment of essential genes was determined with a Fisher's Exact test with a p-value of less than 0.05 to be considered significant. Functional categories were assigned to genes that were identical but considered not essential including transcription factors87, hypothetical proteins (described as “conserved hypothetical s” in Mycobrowser88), toxin/antitoxins89, and ribosomal proteins (if the annotation had the term “ribosomal” or “ribosome”). A functional enrichment was performed for functional categories transcription factors, hypothetical proteins, and toxin/antitoxins using a Fisher's Exact test. A p-value of less than 0.05 was considered significant.
Briefly, the pangenome was parsed for genes that did not exist in H37Rv. Criteria for the categories (duplications, transposases, frameshifted, truncations, extensions, and pseudogenes) are described in full in the Supplemental Information. The closest gene relative in H37Rv was considered the best matching based on lowest E-value and E-value was less than 0.1. Full details are described in the Supplemental Information.
Pseudogene classification was based on methods by previous authors (Danneels et al. 2018; Carlier et al. 2013). Briefly, intergenic regions between annotated CDS features were queried against the pangenome protein database using the NCBI blastx+ program (Altschul et al. 1990) (v2.6.0) with default parameters. A pseudogene was identified in an intergenic region if it met the following criteria: i) the pairwise alignment had an E-value less than or equal to 10E-6; ii) the alignment query coverage and identity was greater than or equal to 50%; iii) putative pseudogenes shared synteny with the functional homolog in at least one clinical isolate; iv) the putative pseudogene had a frameshift or premature stop codon that affected more than 20% of the gene (using the tfasty (Pearson 2000) program (v36) with default parameters).
Conversely, for annotated CDSs, a blastp (v2.6.0) pairwise all-against-all of the pangenome was executed. If a gene aligned to multiple CDSs, the best hit was chosen based on lowest E-value. An annotated CDS was classified as a pseudogene if i) aligned less than 80% and greater than 10% the length of the respective homolog in the pangenome ii) shared amino acid identity of at least 50%, iii) the alignment Evalue less than or equal to 10E-6, iv) the query length must be less than the subject length, v) the putative pseudogene had a frameshift or premature stop codon that affected more than 20% of the gene (using the fastx (Pearson 2000) program (v36) with default parameters)(Carlier et al. 2013). Finally, similar to a previous study (Bolotin et al. 2015), pseudogenes must be present in less than 75% of the study population, known as the “near core” genome.
Statistical tests were then performed to determine if there was more genome contraction (pseudogenes) than amplification (duplications). To account for lineage being a confounding factor, for each genome, the nucleotide lengths of all classified pseudogenes and, separately, duplications were summed and a proportion to the genome size was calculated. The difference between the proportion of pseudogene length and duplication length was calculated per genome and differences were grouped based on lineage. An ANOVA test was performed to determine if genome contraction vs amplification was more pronounced in specific lineages. These same calculations were performed for the frequency of pseudogenes and duplications and the proportion to the number of CDSs per genome was calculated.
A paired t-test was then used to determine if there was significantly more pseudogene than amplification (by length and frequency) for each lineage. Both the ANOVA and paired t-test were calculated using the scipy (Virtanen et al. 2019). A p-value of less than 0.05 was considered significant.
We sought to identify a more sensitive motif than the originally described “PE” at codons 8 and 9, and “PPE” at codons 7, 8, and 9, by searching for the longest motifs that were present within the PE and PPE proteins in H37Rv. All PE/PPE protein sequences from H37Rv were compiled. Several PE/PPE protein sequences were excluded from the creation of the motif (Table S8, or
For the PE motif discovery using MEME (Bailey et al. 2015) (v5.0.5), -minw was set to 40, -maxw was set to 70, and -nsites was set to 83. For the PPE motif discovery using MEME, -minw was set to 70, the -maxw was set to 180, and the -nsites was set to 61. This resulted in one PE motif that was 56 amino acids long, with an E-value of 7.2e-2536; and one PPE motif that was 84 amino acids long with an E-value of 8.6e-2422 (
To determine a threshold of high confidence when searching for these motifs in the clinical strains, we queried for these motifs using Finding Individual Motif Occurrences (FIMO) in the H37Rv proteome (AL123456.3). We then iteratively calculated the false positive rate and the false negative rate in a custom R script to find a corresponding p-value threshold. We remained conservative by favoring a higher false negative rate. For PE, a threshold of 2.14e-16 yielded a false positive rate of 0.0% and a false negative rate of 6.06%. For PPE, a threshold of 1.21e-33 yielded a false positive rate of 0.0% and a false negative rate of 7.25%. The genes that were called false negative were among the sequences excluded from the creation of the motif (Table S8, or
To identify all possible PE/PPE genes in the 99 genomes included in this study, we queried for our created motifs in all pangenome protein sequences, using the determined p-value threshold for each family.
Sublineage motifs were reported previously (Gey van Pittius et al. 2006) and were converted into regular expressions. A custom Python script iterated across all PE or PPE amino acid sequences and identified which sublineage motifs existed in the sequence. The entire protein sequence was queried since the sublineage motifs exist in the variable region of these genes.
All code developed to analyze the PE/PPE gene families is available at https://gitlab.com/LPCDRP/pe-ppe.
To detect gene conversion, non-converted versions of the genes must be present in the analysis. PE/PPE genes were identified using the same method described above in M. bovis BCG Pasteur (Genbank accession AM408590.1) and subfamily motifs were also identified as described in the Supplemental Information. Three isolates were randomly chosen to represent each lineage, with the exception of lineage 7 in which only one isolate represented this lineage in the study. Ab initio predicted subfamily genes were isolated for each representative isolate and concatenated with all genes within the respective subfamily from M. bovis BCG Pasteur, creating a multisequence FASTA for each comparison (n=78).
Each multi-sequence FASTA was aligned with CLUSTALO94 (v1.2.1) using default parameters, and then uploaded into RDP445. The following methods were chosen to evaluate potential recombination events: RDP95, GENECONV96, Chimaera97, MaxChi98, Bootscan99, SiScan100, and 3SEQ101. Default parameters were used with the following exceptions: highest acceptable P-value was set to 0.01; GENECONV g-scale was set to 2; Bootscan was set to “Use Neighbor-Joining trees” with the cutoff percentage set to 90, and the substitution model was set to “Kimura Model”; MaxChi was set to “Variable window size”. To confidently call a recombination event and reduce false positives, at least five methods were required to detect the event. Additionally, RDP4 creates two neighbor-joining trees to assess whether the recombinant sequences were correctly identified. The first tree was built using the portion of sequences outside of the recombinant region. In this tree, the putative acceptor gene and the putative recombinant gene should share homology and exist in the same clade separate from other input sequences (
Each predicted recombinant event that passed our thresholds was assessed in this manner, and subsequently considered to be correctly identified. Recombination breakpoints were manually corrected to MaxChi-identified statistically optimal positions, as recommended by the developer of RDP4. All recombination events meeting these criteria were recorded (Table S10, or
After identifying conversion events in the representative isolate set, we sought to identify genes in the same syntenic region in isolates not directly tested for conversion. To this end, the genomic location of the gene directly after the converted gene was identified. As the location of these genes was highly conserved across the genomes, we searched for the location of this conserved gene and identified the gene directly before it. This was performed for all remaining isolates within each lineage in which gene conversion was found.
G4Hunter104 was used to identify guanine triplet or quartet quadruplexes (G4s) (https://github.com/AnimaTardeb/G4Hunter). The score threshold was set to 1.5 and the window was set to 20 based on previous accuracy calculations104. A two-proportion Chi-square test105 was used to determine if the number of G4s in a given subfamily of PE or PPE was significantly greater than other subfamilies.
T Cells
T cell epitope prediction was replicated as previously described17. Briefly, the Immune Epitope Database (IEDB)106 command-line epitope prediction program using the NetMHCpan107 (v4.0) and NetMHCIIpan108 (v3.1) methods were used for predicting CD8+ and CD4+ T cell, respectively. For MHC class I prediction, the HLA-A and HLA-B 9-peptide long alleles described previously were used17. For MHC class II prediction, the HLA-DR alleles described previously were used17. Threshold cutoff values corresponding to IC50 values of <50 nM for high affinity binding and <500 nM for intermediate affinity binding were used for both MHC class II and class I predictions. All PE_PGRS protein sequences (n=282) were included in this analysis.
B cells
Enrichment of linear B-cell epitopes was queried for within all PE_PGRS proteins using iBCE-EL109 with default parameters. Proteins with enrichment scores meeting or exceeding a score of 0.35 or higher were considered enriched for B cell epitopes. Linear B cell epitopes queried within PE_PGRS proteins that had gene conversion and were enriched for B cell epitopes determined by iBCE-EL using Bepipred 2.0110 in IEDB106 with default parameters. A threshold of 0.55 was chosen to increase specificity, reducing false positives, though sensitivity was sacrificed.
Complete genomes for 97 M. tuberculosis clinical strains are deposited under BioProject accessions PRJNA555636 and PRJEB8783.
CODE AVAILABILITY: Custom code including AnnoTUB and associated pangenome analyses are located at https://gitlab.com/LPCDRP/
A number of embodiments of the invention have been described. Nevertheless, it can be understood that various modifications may be made without departing from the spirit and scope of the invention. Accordingly, other embodiments are within the scope of the following claims.
This Patent Convention Treaty (PCT) International Application claims benefit of priority of U.S. Provisional Application Ser. No. (USSN) 63/000,014 filed Mar. 26, 2020. The aforementioned application is expressly incorporated herein by reference in its entirety and for all purposes.
This invention was made with government support under grant No. R01AI105185-06 awarded by NIH. The government has certain rights in the invention.
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/US2021/024528 | 3/26/2021 | WO |
| Number | Date | Country | |
|---|---|---|---|
| 63000014 | Mar 2020 | US |