Two identical CDs identified as “Copy 1 of 2” and “Copy 2 of 2,” containing the sequence listing for the present invention, are included as a sequence listing appendix.
The present invention is related to the design and development of recombinant, synthetic, and DNA vaccines and, in particular, to the design and development of conserved-element vaccines that prevent mutational escape, by viruses that replicate rapidly and with relatively low fidelity, from the targeted adaptive-immune response elicited by the conserved-element vaccines.
Recombinant, synthetic, and DNA vaccines, prepared by polypeptide or polynucleic-acid synthesis and by transforming microorganisms to produce epitope-containing polypeptides or epitope-encoding polynucleic acids, respectively, have been successfully developed for immunizing various hosts, including humans, against various pathogens, including the hepatitis-B and human papilloma viruses. Recombinant, synthetic, and DNA vaccines are particularly useful for targeting pathogens for which live or attenuated-virus vaccines are impractical or pose potential risks to vaccine recipients. Recombinant, synthetic, and DNA vaccines are also potentially more economically designed and manufactured, and can be used to address a wider range of pathogens than can be targeted by live-virus and attenuated-virus vaccines. However, the methods of the present invention may also be used in combination with virus-based or poxvirus-based delivery methods.
The human immunodeficiency virus (“HIV”), a retrovirus that causes the acquired immunodeficiency syndrome disease (“AIDS”), is one of the primary targets for current vaccine-development efforts. HIV infection in humans is now pandemic, and represents a severe and continuing health risk throughout the world. Although researchers and vaccine developers were initially hopeful of producing an effective vaccine for HIV, many years of research and development efforts have so far failed. HIV poses a number of difficult hurdles. For one thing, HIV infects the very lymphatic cells within humans that serve to help mount an immune response to destroy viral pathogens and virally infected cells. Another problem is that HIV replicates with relatively low fidelity, leading to frequent mutations and to a corresponding plethora of variant viruses within both individuals and the population as a whole. HIV can thus readily escape, by mutation, a specifically targeted immune response elicited by the prototype vaccines that have so far been prepared and tested.
Because AIDS remains a continuing and critical health threat, and because traditional vaccine design and development methods have failed to produce effective HIV vaccines, researchers and vaccine developers, public health officials, governmental agencies, health-care providers, and many health-conscious individuals have all recognized the need for new approaches to designing and developing an effective HIV vaccine. In addition, viral, bacterial, and parasitic threats continue to arise, including various strains of avian flu virus, for which vaccines may need to be developed quickly, on a massive scale, to prevent health and economic disasters. However, effective methods for controlling many well-known viruses, bacteria, and parasites have not yet been developed, despite great effort and investment. Vaccine developers, health care professionals, and the general population are acutely aware of the need for fast, economically efficient methods for developing vaccines to address fast-arising viral, bacterial, and parasitic threats.
Embodiments of the present invention include conserved-element vaccines and methods for designing and producing conserved-element vaccines. A conserved-element vaccine (“CEVac”) is a recombinant, synthetic, and/or DNA vaccine that incorporates highly conserved sequences from an observed set of pathogen variants. In the case of a recombinant and synthetic CEVac, the conserved sequences are polypeptide sequences that are incorporated in one or more viral protein components, including viral structural and envelope proteins, proteases, transcriptases, and integrases, accessory and regulatory proteins, and other such protein and polypeptide viral components. In the case of a DNA CEVac, the sequences are nucleic-acid sequences that encode conserved protein and polypeptide viral components.
In disclosed embodiments of the present invention, the conserved sequences are identified computationally by considering biopolymer sequences, such as concatenated polypeptide sequences that together represent a pathogen proteome, corresponding to an observed set of pathogen variants, and computationally selecting, from the considered biopolymer sequences, conserved subsequences according to a number of subsequence-selection criteria. These subsequence-selection criteria may include a minimum conserved-subsequence length, a threshold frequency of occurrence of a particular monomer at each conserved, single-monomer position within a conserved subsequence, a threshold combined occurrence for a set of allowable variant monomers at a particular conserved, variable position within a conserved subsequence, and a maximum number of variable positions within a subsequence. A set of conserved subsequences identified according to the subsequence-selection criteria is then filtered to remove subsequences identical to, or too similar to, naturally occurring host subsequences, to remove subsequences that may be immunodominant with respect to conserved subsequences more effective in eliciting an immune response, and to remove subsequences that fail, for other reasons, to effectively elicit a protective immune response or that elicit undesired responses. The filtered set of conserved subsequences is then assembled into expression vectors for incorporation into microbial hosts for biosynthesis of a recombinant or DNA CEVac, or assembled into one or more synthetic constructs for a synthetic CEVac.
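As a rough illustration of the length and variable-position criteria described above, the following C++ sketch scans a per-position classification of an alignment and reports candidate conserved subsequences. The names “PositionClass,” “Segment,” and “findConservedSegments” are hypothetical, and the handling of runs with too many variable positions (they are simply rejected whole) is a simplifying assumption, not the disclosed method.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-position classification of an alignment column.
enum class PositionClass { Conserved, Variable, Unconserved };

// Half-open range [begin, end) of alignment positions.
struct Segment {
    std::size_t begin;
    std::size_t end;
};

// Returns every maximal run of conserved/variable positions that is at least
// minLength long and contains at most maxVariable variable positions.
// Simplifying assumption: a run with too many variable positions is rejected
// whole rather than trimmed.
std::vector<Segment> findConservedSegments(
    const std::vector<PositionClass>& map,
    std::size_t minLength,
    std::size_t maxVariable)
{
    assert(minLength > 0);
    std::vector<Segment> result;
    std::size_t runStart = 0;
    std::size_t variableCount = 0;
    for (std::size_t i = 0; i <= map.size(); ++i) {
        const bool atEnd = (i == map.size());
        if (atEnd || map[i] == PositionClass::Unconserved) {
            // Close the current run [runStart, i) and test it against the criteria.
            if (i - runStart >= minLength && variableCount <= maxVariable) {
                result.push_back({runStart, i});
            }
            runStart = i + 1;
            variableCount = 0;
        } else if (map[i] == PositionClass::Variable) {
            ++variableCount;
        }
    }
    return result;
}
```

The host-similarity and immunodominance filters described above would operate on the segments returned by such a scan and are not modeled here.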
The present invention is directed to conserved-element vaccines and methods for designing and producing conserved-element vaccines. In the following discussion, an embodiment of the present invention directed to CEVac vaccines directed to HIV is discussed. However, it should be noted that the present invention is applicable to designing and producing recombinant, synthetic, and DNA vaccines directed to any of a large number of pathogen targets for use in any of a large number of animal and human hosts.
Of the nine HIV genes, two genes, gag and env, encode the structural proteins for the viral particle. The gag gene encodes structural proteins, including, among others, p24 and p17. The env gene encodes a gp160 protein that is cleaved by a viral enzyme to produce the gp120 and gp41 proteins that together make up the protrusions 120 and 121. The pol gene encodes viral reverse transcriptase, integrase, and an RNase, whereas the remaining genes encode auxiliary and regulatory molecules needed to orchestrate viral replication and other functions. A gene spanning the gag-pol gene border encodes the viral protease.
HIV infects various immune-system cells, including macrophages and CD4+ T-cells. In a first step, shown in
Cytotoxic T-cells (“CTL”) 212, also known as killer T-cells, represent a subgroup of the T lymphocytes, a type of white-blood cell, capable of killing virally infected host cells or transformed host cells. CTL cells are produced in the bone marrow and migrate to the thymus, where they undergo complex genetic recombination to produce a large variety of different types of CTL cells bearing specific receptors 214. CTL cells with stable antigen-specific receptors and CD8 co-receptors are selected for maturation and release by the thymus. The thymus selects CTL cells that exhibit positive binding to foreign antigens as well as weak or no binding to host-cell biopolymer subsequences, so that the mature CTL cells released by the thymus specifically target APCs presenting foreign polypeptides rather than normal host polypeptides and other host molecules. A molecule that elicits an immune response, such as a foreign protein that is cleaved into peptide fragments that are presented by APCs recognized by CTL cells, is referred to as an “epitope.”
When a mature CTL cell bearing a particular antigen-specific receptor 212 binds to a specific foreign peptide complementary to its receptor 214, and upon further, stable binding via a CD8 co-receptor, the CTL cell undergoes clonal expansion to vastly increase the number of circulating CTL cells 216 bearing the particular antigen-specific receptor. These circulating CTL cells can then migrate throughout host tissues to search for, and kill, APCs presenting the foreign antigen specifically recognized by the CTL cells. When a CTL cell recognizes an APC presenting the foreign antigen complementary to the CTL cell receptor 218, the CTL cell releases the cytotoxins perforin and granulysin 220 that cause formation of pores in the APC's plasma membrane that eventually lead to lysis of the APC. Killer-T-cell recognition of pathogen-infected host cells may be enhanced by circulating antibodies, produced by B lymphocytes, that are activated by an MHC-Class-II antigen-presentation mechanism, which bind to foreign antigens and which are, in turn, recognized by killer cells and phagocytes.
MHC-Class-II molecules present peptide fragments derived from intravesicular pathogens and extracellular pathogens to CD4-receptor-containing T cells. One type of CD4-containing T cell, the TH1 helper T cell, recognizes antigen bound to MHC-Class-II molecules on the surface of a macrophage, and activates the macrophage to engulf and kill bacteria that produce the antigen. TH1 T cells may also release cytokines and chemokines to attract macrophages to a site of infection. Another type of CD4-containing T cell, the TH2 helper T cell, recognizes antigen bound to MHC-Class-II molecules on the surface of B cells, and activates the B cell to proliferate and differentiate into antibody-producing plasma cells. Antibodies produced by antibody-producing plasma cells circulate in the blood plasma. Antibodies comprise four polypeptides that aggregate together and are linked by disulphide bonds. A portion of an antibody molecule is complementary to, and binds, a particular antigen. By binding to antigen-containing bacteria and viruses, antibodies facilitate their neutralization and/or destruction. Neutralization occurs when the antibody binds to a bacterium or virus and thereby interferes with the ability of the bacterium or virus to infect host cells. However, in general, bound antibodies elicit destruction of their targets by phagocytes, either directly, or by recruiting complement molecules to coat the target. In certain cases, recruited complement can directly kill bacteria.
The human genes for the principal MHC-Class-I and MHC-Class-II component molecules are located on chromosome 6. These genes are often referred to as the human leukocyte antigen (“HLA”) genes. Each MHC molecule comprises a number of component polypeptides, and there are multiple genes for each of these component polypeptides, each encoding different versions of the component polypeptides. As a result, in each individual, there are multiple different MHC-Class-I and MHC-Class-II molecules, each with different peptide-binding properties, and each thus capable of presenting a different range of antigen fragments. Furthermore, the MHC genes are polymorphic, with many different variants present within the human population, leading to a quite broad range of antigen-presenting characteristics within the human population. The MHC-Class-I and MHC-Class-II molecules present in a given individual may thus differ from those of another individual in antigen-fragment-presentation characteristics. As a result, a single polypeptide-based vaccine may elicit different immune responses in different individuals, due, in part, to the differences in antigen-fragment presentation by the MHC-Class-I and MHC-Class-II molecules in the different individuals. In other words, a given foreign molecule or foreign-molecule fragment may only elicit an immune response in those individuals with particular MHC-Class-I and/or MHC-Class-II molecules that can bind to the foreign molecule or foreign-molecule fragment. For a vaccine to be useful, it should be directed to a target bacterium or virus to effectively raise an immune response to a particular foreign target molecule across a range of individuals selected from the human population.
The vaccine generally needs to contain a sufficient number of target-molecule fragments that generate peptide fragments with specific affinities to particular MHC-Class-I and/or MHC-Class-II molecule variants to ensure that an MHC-Class-I and/or MHC-Class-II molecule variant in each individual can present a peptide fragment derived from the target-molecule fragments. Alternatively, it needs to contain fewer, more broadly effective target-molecule fragments, peptide fragments of which can be presented by many different MHC-Class-I and/or MHC-Class-II molecule variants. Because of the large numbers of MHC-Class-I and/or MHC-Class-II molecule variants, more broadly effective target-molecule fragments are desirable.
Although MHC class I alleles are extremely polymorphic, with more than 800 alleles for HLA-A and HLA-B already reported in humans, at the functional level, most HLA class I A and B alleles can be classified into 9 different groups, or supertypes. The supertypes are characterized by overlapping peptide-binding motifs and repertoires. Thus, selecting peptide fragments effective with respect to the binding motifs known for all 9 supertypes can provide a CEVac with wide coverage of the population.
Certain antibody-producing B-cells and antigen-recognizing T-cells, once activated, can persist within the host to remember specific pathogens previously recognized by the host during its lifetime, so that a strong immune response can be quickly mustered should the pathogen be again detected by the host's immune system. Vaccines elicit long-term B-cell-mediated and T-cell-mediated antigen memory within a host's immune system by introducing foreign molecules into the host that are recognized as foreign molecules by the host immune system and that elicit clonal expansion of B cells and T cells.
A number of host cells infected with a particular virus may present many hundreds or thousands of different foreign polypeptides for recognition by T-cells that, in turn, lead to foreign-polypeptide-specific immune responses. For example, many hundreds of 9-amino-acid and larger polypeptides may be cleaved from the nine HIV gene products and presented by MHC Class I molecules. However, it is observed in many cases that, of the many hundreds or thousands of different possible polypeptides presented by APC cells, generally only a small number lead to clonal expansion of antigen-specific T-cells. In other words, only a portion of the many possible presented foreign antigens obtained by proteolysis of viral proteins appear to raise a strong, specifically-targeted adaptive-immune-system response at any given time. This phenomenon is known as immunodominance.
Immunodominance may not be a problem when the constrained immune response raised by a few immunodominant epitopes is sufficient to suppress and destroy a relatively static target organism. However, in the case of a rapidly evolving target, such as HIV, the immunodominance phenomenon may lead to focusing of the immune response on a limited number of viral sequences that are relatively evolutionarily plastic, or that, in other words, can mutate to alternative, variant sequences without sufficiently impacting viral fitness to inhibit viral escape from the immune response. Thus, while the initial immune response constrained by immunodominance might be initially effective against the virus, mutation of the small number of immunodominant viral sequences to variant sequences that do not elicit an immune response leads to variant virus that can infect cells and proliferate, despite the initial immune response. Such mutable, immunodominant sequences are referred to as “decoy sequences.”
Prominent information-containing biopolymers include deoxyribonucleic acid (“DNA”), ribonucleic acid (“RNA”), including messenger RNA (“mRNA”), and proteins.
A gene is a subsequence of deoxyribonucleotide subunits within one strand of a double-stranded DNA polymer. One type of gene can be thought of as an encoding that specifies, or is a template for, construction of a particular protein.
In eukaryotic organisms, including humans, each cell contains a number of extremely long, DNA-double-strand polymers called chromosomes. Each chromosome can be thought of, abstractly, as a very long deoxyribonucleotide sequence. Each chromosome contains hundreds to thousands of subsequences, many subsequences corresponding to genes. The exact correspondence between a particular subsequence identified as a gene, in the case of protein-encoding genes, and the protein or RNA encoded by the gene can be somewhat complicated, for reasons outside the scope of the present invention. However, for the purposes of describing embodiments of the present invention, a chromosome may be thought of as a linear DNA sequence of contiguous deoxyribonucleotide subunits that can be viewed as a linear sequence of DNA subsequences. In certain cases, the subsequences are genes, each gene specifying a particular protein or RNA. Similarly, the HIV viral RNA, transcribed by reverse transcriptase into vDNA, represents the single genetic sequence, or genome, for the HIV virus.
There are many different types of mutations. Deletion and insertion mutations may lead to frame shifts within a DNA sequence, in turn leading to changes in all or a large portion of the amino-acid monomers downstream from the amino-acid monomer corresponding to the location of the mutation. In the case of either base-substitution mutations, such as that illustrated in
As briefly noted above, HIV reverse transcriptase is a relatively low-fidelity viral-RNA-to-vDNA transcription mediator. Viral reverse transcriptase has a relatively high error rate, incorporating the wrong base into the complementary DNA in about one out of every 3000 nucleotide bases transcribed. This high transcription error rate leads to frequent and diverse mutations within the vDNA. Because HIV is characterized by a relatively fast replication cycle, producing as many as 10^10 or more virions per day in a human host, a single infected patient typically develops a large number of different mutation-generated HIV variant viruses, each having a viral genome different from those of the other variants.
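To make the arithmetic behind this error rate concrete, the following C++ sketch computes the expected number of transcription errors per genome copy. It treats every base as copied independently at the same rate, which is a simplifying assumption, and the function names are hypothetical. The genome length used in the note that follows (roughly 9,700 bases) is an approximate figure assumed for illustration and is not stated in the text above.

```cpp
#include <cassert>
#include <cmath>

// Expected number of misincorporated bases in one copy of a genome,
// assuming each base is copied independently with the same error rate.
double expectedErrorsPerCopy(double genomeLength, double errorRatePerBase)
{
    assert(genomeLength >= 0.0);
    assert(errorRatePerBase >= 0.0 && errorRatePerBase <= 1.0);
    return genomeLength * errorRatePerBase;
}

// Probability that a copy contains at least one error under the same
// independence assumption: 1 - (1 - rate)^length.
double probabilityOfAnyError(double genomeLength, double errorRatePerBase)
{
    assert(genomeLength >= 0.0);
    assert(errorRatePerBase >= 0.0 && errorRatePerBase <= 1.0);
    return 1.0 - std::pow(1.0 - errorRatePerBase, genomeLength);
}
```

With the one-in-3000 rate quoted above and the assumed 9,700-base genome, expectedErrorsPerCopy(9700.0, 1.0 / 3000.0) evaluates to roughly three errors per copy, so under these assumptions nearly every newly transcribed genome would differ from its template.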
HIV is characterized by an enormous diversity both within hosts and at the population level, exemplified by the identification of multiple subtypes and an expanding number of circulating recombinant forms. Since HIV-1 sequences can vary by up to 30% in the envelope gene when considering only subtype B sequences, there is a considerable challenge in developing a vaccine suited for the universe of circulating strains, and there are practical limitations to the variability that can be incorporated in a vaccine.
In general, a large number of the mutant viral genomes may correspond to less viable or completely defective viruses which cannot continue the infection cycle, and therefore represent dead ends in the evolutionary tree of mutation-produced variants. For example, mutations in sequences of structural-protein domains that interface with complementary domains in other structural proteins to form macromolecular complexes, such as viral coats and capsids, may tend to be more deleterious than mutations in sequences that do not interface with other proteins, because the architecture of binding and interface domains may be strongly constrained by that of the complementary domains, as well as by overall molecular conformation. Similarly, mutations within the sequences of the active sites of enzymes may be far more deleterious than mutations in non-catalytic domains. Mutations therefore may span a range of detrimental consequences, from innocuous, silent mutations to invariably fatal mutations. Because of the large number of infected cells and fast replication cycle, a sufficient number of viable, variant viruses with less detrimental mutations are produced by relatively low-fidelity viral transcription to generally overwhelm the host immune response. Although the host immune system may recognize and react strongly to some number of viral epitopes presented by host APCs, the high viral mutation rate generally leads to viable variants lacking the epitopes initially recognized by the immune system. Thus, HIV continues to escape host immune response directed to specific epitopes. Although many mutations may lead to virus variants that reproduce less efficiently than native virus, less fit virus variants that can nonetheless reproduce and continue the infection cycle allow the virus population to adapt to the host immune system, and avoid destruction. 
Additionally, less-fit viral variants may, through further mutation, revert to native virus when the immune response subsequently weakens, or may continue to evolve to produce increasingly fit variants.
Only very limited sequence variation can be tolerated in some structurally and functionally important regions, such as the capsid protein of HIV-1, and mutations are rare in these regions. Mutations that do arise are likely to incur a substantial fitness cost. Immune-escape mutations in the corresponding epitopes are therefore unlikely to be sustained in a host, and are likely to revert after transmission to a host lacking the particular restricting allele. These mutations sometimes appear in conjunction with flanking mutations that are compensatory in function, restoring fitness or preventing proper cleavage and presentation on MHC.
Because of the HLA-restriction, HLA-polymorphism, and immunodominance phenomena, discussed above, specific vaccines directed to HIV generally elicit only a relatively small number of strong, epitope-directed immune responses within a given host. This allows HIV to eventually escape the immune response by producing variant viruses lacking the small number of epitopes to which the immune response is directed. Although the immune response may recognize new epitopes of variant viruses, and may continue to respond to viral mutation, the immune response lags viral escape through mutation, contributing, in most individuals, to the eventual overwhelming of the individual's immune system.
Conserved-element vaccines (“CEVacs”) and methods for designing and producing CEVacs, both embodiments of the present invention, may theoretically block HIV escape from the immune-system response. In general, certain portions of a viral genome, or of any genome, are more stable towards mutation than others. For example, subsequences of critical portions of structural proteins that are recognized by other structural proteins in order to coalesce to form a viral capsid or protrusion, or that bind to host-cell receptors, may be far more critical to viral reproduction and infectivity than polypeptide domains that do not interact with other polypeptide domains or host molecules. Mutations to these critical regions most often result in defective and non-viable viral particles. Using viral gene sequence data, segments of the viral proteins that do not, or only rarely, mutate can be identified. These segments represent candidates for immutable viral function, i.e., candidate segments for epitopic recognition that is more likely to play a protective role in HIV infection. Were it possible to develop a vaccine capable of raising a strong immune response to all, or a very large proportion of, these critical regions, it is possible that viral-mutation-directed escape from the immune-system response may be entirely prevented. In the face of a strong immune response directed to all, or a large portion of, the critical-region epitopes, a virus would need a relatively large number of simultaneous mutations in order to escape the immune response. However, as the number of mutations needed to escape the immune response increases, the likelihood of a virus incorporating the needed mutations and remaining viable exponentially decreases.
Mutation-directed immune-response escape can be thought of as a path search within a huge forest of possible sequence mutations, a successful path representing only a tiny fraction of the possible mutational pathways, the overwhelming majority of which lead to defective-virus dead ends. When a virus can search the sequence space one mutation at a time, the virus, because of the huge number of parallel searches made possible by the large number of infected host cells, can efficiently search the sequence space for a path of non-defective mutations leading to a sequence that escapes the immune response. However, if multiple simultaneous mutations are needed, the sequence-space search becomes intractable, because of the enormous number of possible multiple-mutation defective sequences separating a viable sequence from a next viable sequence. Thus, CEVacs may represent the best possible approach to eliciting effective immune-system control of rapidly mutating viruses, such as HIV, and may also represent the best approach to quickly and economically subduing any of a multitude of human pathogens via recombinant and synthetic vaccines.
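The exponential decline described above can be illustrated with a deliberately simple toy model in C++. The model assumes that each required mutation arises independently with the same per-site probability and that each mutant remains viable with a fixed probability. The function names and the viability figure used in the note that follows are hypothetical and are chosen only to show the shape of the effect.

```cpp
#include <cassert>
#include <cmath>

// Toy model: probability that one replication produces all k required
// mutations at once, each arising independently with probability
// perSiteRate and each leaving the virus viable with probability viability.
double simultaneousEscapeProbability(int k, double perSiteRate, double viability)
{
    assert(k >= 0);
    assert(perSiteRate >= 0.0 && perSiteRate <= 1.0);
    assert(viability >= 0.0 && viability <= 1.0);
    return std::pow(perSiteRate * viability, k);
}

// Expected number of escape variants among virionsPerDay newly produced virions.
double expectedEscapeVariantsPerDay(double virionsPerDay, int k,
                                    double perSiteRate, double viability)
{
    assert(virionsPerDay >= 0.0);
    return virionsPerDay * simultaneousEscapeProbability(k, perSiteRate, viability);
}
```

Using the figures quoted earlier (10^10 virions per day, one error per 3000 bases) and an assumed viability of 0.1 per mutation, this model predicts hundreds of thousands of single-mutation escape variants per day but fewer than one per day once three simultaneous mutations are required.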
Effective CEVac design embodies a number of principles. First, a CEVac needs to target only conserved elements identified in target organism molecules. As a corollary, segments that can easily mutate, referred to as “decoys,” should be excluded. Decoys provide escape pathways for a virus or other pathogen, allowing the pathogen to escape the immune system by altering mutable sequences to evade an immune response directed to the current decoy sequence. Moreover, sequences that can mutate to forms resulting in a less fit, but still viable, pathogen need to be eliminated, so that a pathogen cannot temporarily trade fitness or optimal function for survivability, and then, subsequently, revert to a more optimal sequence after the immune response to the more optimal sequence has subsided. An effective CEVac needs to target conserved elements present within all, or as many as possible, native viral variants currently infecting the human population. The conserved elements targeted by an effective CEVac need to be sequences that, when mutated, confer extremely deleterious or fatal consequences on the mutant virus, in order to avoid inadvertently including decoy sequences in the CEVac. The conserved elements included in an effective CEVac need to elicit an immune response across the various polymorphic MHC-Class-I and MHC-Class-II molecules present in the human population. A broad response may be obtained by broadly immunogenic conserved elements, or by including a sufficient number of less broadly effective conserved elements to elicit an immune response across a range of MHC-Class-I supertypes and MHC-Class-II molecule polymorphisms present in the human population, or within large subpopulations for which specific vaccines can be developed.
Identifying conserved elements with the above-described characteristics for a CEVac is a first step. However, CEVac design also involves packaging conserved elements effectively into one or more vaccine molecules, such as polypeptides or DNA sequences, in order to prevent inadvertent generation of host-like constructs that might lead to autoimmune reactions, to prevent inadvertent generation of decoy sequences, and to ensure that the conserved elements lead to effective presentation of immunogenic peptide fragments to elicit specific immune responses to the conserved elements. The packaging step may involve selecting linker sequences, positioning conserved elements correctly within the vaccine molecules and correctly with respect to one another, including different numbers of conserved-element copies, and other such considerations.
The following C++-like pseudocode provides an illustration of one embodiment of the present invention. The C++-like pseudocode is meant to illustrate one approach to implementing a conserved-element analysis program for analyzing sequences in order to find conserved elements, but is not intended to define the invention or in any way limit the scope of the claims.
First, a number of constants and an enumeration are provided:
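const int maxPositionsPerSequence = 60;    // maximum monomers per sequence
const int maxNumSequences = 100;           // maximum number of sequences analyzed
const char NULL_CHAR = 'z' + 1;            // gap placeholder inserted during alignment
const int numAminoAcids = 27;              // number of distinct amino-acid identifiers
const int maxFreqPerPos = 10;              // size of the per-position frequency table
enum posType {conserved, variable, unconserved};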
The constants “maxPositionsPerSequence” and “maxNumSequences” specify the maximum number of amino-acid monomers allowed per sequence and the maximum number of sequences that can be analyzed, respectively. The relatively small numbers used in the pseudocode are not reflective of the sizes of sequences, and numbers of sequences, that would be analyzed in an actual implementation. In the pseudocode implementation provided below, static data structures are employed, and thus relatively small sequences and numbers of sequences are used. In a more practical, robust implementation, dynamic memory allocation is employed, to provide more flexible memory usage, and the ability to dynamically allocate memory on an as-needed basis. In general, thousands of sequences may be analyzed, each of which has thousands, tens of thousands, hundreds of thousands, or millions of sequence positions. In the pseudocode embodiment, it is assumed that polypeptide sequences having amino-acid identifiers at each position are analyzed, but, in alternative embodiments, nucleic-acid sequences may be similarly analyzed, and, in yet further embodiments, various other biopolymers may be analyzed by alternative sequence-analysis routines.
The constant “NULL_CHAR” represents a null, or blank, character that is inserted into sequences during alignment in order to insert one or more placeholders, or gaps, into the sequences. The constant “numAminoAcids” represents the number of different amino acids numerically identified for insertion into sequences and for other purposes. In general, there are 20 commonly occurring amino acids, but certain additional amino acids may be found in certain polypeptides found in various organisms. The constant “maxFreqPerPos” defines the size of a sequence-position/amino-acid-occurrence-frequency table, discussed below. The enumeration “posType” represents the classification of a position within a one-dimensional map representing the aligned sequences corresponding to the original sequences supplied for alignment, with the possible types of positions being “conserved,” “variable,” or “unconserved.”
Next, a declaration for a type of structure, “Amino_Acid_Frequency,” is defined. This structure contains a floating-point value indicating the frequency of occurrence of an amino acid, along with an integer value defining the particular amino acid.
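The following C++ sketch shows one way such a structure might look, together with a hypothetical helper that fills it from one column of an alignment. Only the structure name comes from the description above. The member names, the helper “columnFrequencies,” the use of character codes as amino-acid identifiers, and the use of standard-library containers in place of static tables are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Sketch of the structure described above; member names are assumptions.
struct Amino_Acid_Frequency {
    double frequency;  // fraction of sequences with this amino acid at the position
    int aminoAcid;     // integer identifier of the amino acid (here, its character code)
};

// Computes the frequency of each symbol found at one position of a set of
// aligned, equal-length sequences, ordered from most to least frequent.
std::vector<Amino_Acid_Frequency> columnFrequencies(
    const std::vector<std::string>& aligned, std::size_t position)
{
    assert(!aligned.empty());
    std::map<int, int> counts;
    for (const std::string& s : aligned) {
        assert(position < s.size());
        ++counts[static_cast<unsigned char>(s[position])];
    }
    std::vector<Amino_Acid_Frequency> result;
    for (const auto& entry : counts) {
        result.push_back(
            {static_cast<double>(entry.second) / aligned.size(), entry.first});
    }
    std::sort(result.begin(), result.end(),
              [](const Amino_Acid_Frequency& a, const Amino_Acid_Frequency& b) {
                  return a.frequency > b.frequency;
              });
    return result;
}
```

A table of such entries, one short list per alignment position, is the kind of input from which conserved and variable positions can subsequently be assigned.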
Next, the class “compatibleAminoAcids” is declared:
An instance of the class “compatibleAminoAcids” contains a number of amino-acid-identifying integers. The amino-acid identifiers included within an instance of the class “compatibleAminoAcids” represent a set of amino acids that can be substituted for one another at a variable position within a sequence. For example, it may be desirable to restrict variable positions within conserved elements to include only related amino acids, such as substitutions of valine for isoleucine or other non-polar amino acids. This class includes function members for writing or deleting particular amino acids from the set represented by an instance of the class, as well as the function member “in,” declared above on line 11, which returns a Boolean value indicating whether a particular amino acid provided as an argument is included in the set of amino acids represented by the instance of the class “compatibleAminoAcids.”
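A minimal C++ sketch of a class with the behavior described above might look as follows. Only the class name and the member name “in” come from the description. The member names “add” and “remove,” and the use of std::set in place of a static array, are assumptions made for illustration.

```cpp
#include <cassert>
#include <set>

// Sketch of the set-of-interchangeable-amino-acids class described above.
// Only the member name "in" comes from the text; "add" and "remove" are
// assumed names, and std::set replaces the static arrays of the pseudocode.
class compatibleAminoAcids {
public:
    // Adds an amino-acid identifier to the set (no effect if already present).
    void add(int aminoAcid)
    {
        assert(aminoAcid >= 0);
        members_.insert(aminoAcid);
    }

    // Removes an amino-acid identifier from the set (no effect if absent).
    void remove(int aminoAcid) { members_.erase(aminoAcid); }

    // Returns true when the identifier belongs to the set.
    bool in(int aminoAcid) const { return members_.count(aminoAcid) != 0; }

private:
    std::set<int> members_;
};
```

For example, a set holding the identifiers for valine and isoleucine would report in('V') as true and in('D') as false when one-letter character codes are used as identifiers, which is itself an assumption of this sketch.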
Next, the class “positionAssignmentParameters” is provided:
An instance of the class “positionAssignmentParameters” contains numerical parameters that specify a particular search for, or sequence analysis for discovering, conserved elements. These parameters include: (1) “conservedThreshold,” the lowest frequency of occurrence of an amino acid at a particular position needed to consider the position conserved; (2) “numVariablePositions,” the number of variable positions allowed within a conserved element; (3) “numAAsAtVariablePosition,” the number of different amino acids that may occur in a single variable position; (4) “variableThreshold,” the minimum combined frequency of occurrences of the amino acids that occur at a variable position that allow the position to be considered to be a variable position; and (5) “thresholdCELength,” the minimum length, in amino-acid residues, of a conserved element. The class “positionAssignmentParameters” includes function members that allow these parameters to be entered into, and to be retrieved from, an instance of the class “positionAssignmentParameters.” It should be noted that many additional parameters, and types of constraints, may be defined in more fully specified conserved-element analysis programs representing alternative embodiments of the present invention. The five parameters chosen to define conserved-element searches in this pseudocode implementation are meant merely to illustrate the process and coding conventions by which such parameters may be defined and used to tailor a search for conserved elements.
Next, a declaration for the class “sequence” is provided:
An instance of the class “sequence” is simply a sequence of amino-acid identifiers, or an array of amino-acid identifiers. A sequence has a length and an ordered sequence of amino-acid identifiers, which may include the NULL_CHAR representing a gap, or space, in the sequence, and which can be set and retrieved using the function members declared in the declaration of the class “sequence,” above.
Next, the declaration for the class “sequences” is provided:
The class “sequences” is essentially an array of sequences. An instance of the class “sequences” may, for example, be used to contain all the original sequences to be analyzed for conserved elements, aligned versions of the original sequences, and the conserved elements identified in a conserved-element search. The function members declared for the class “sequences” include function members to add sequences to an instance of the class “sequences,” retrieve sequences from an instance of the class “sequences,” obtain the number of sequences in an instance of the class “sequences,” and to reinitialize the instance of the class “sequences” to the empty set. A special instance of the class “sequence” is declared as: sequence NULL_SEQ. This sequence is used as a return value in several member functions of the class “sequences” to indicate that no further sequences are available in a set of sequences.
Next, a declaration for the class “aligner” is provided:
The class “aligner” represents alignment functionality for aligning sequences prior to searching the aligned sequences for conserved elements. There are many different possible techniques and methods for aligning sequences. Many of these techniques and methods are quite sophisticated and employ a vastly more complex set of considerations than the alignment functionality provided in this pseudocode example. The techniques and methods employed for aligning sequences for a conserved-element search may significantly impact the results of the search, so alignment methods need to be chosen appropriately and carefully. The alignment method encapsulated in the class “aligner” in this pseudocode example is meant only to illustrate one simple approach to alignment. Many other alignment methods and techniques may be alternatively used for a conserved-element search. In certain embodiments of the present invention, no alignment is carried out, but, instead, all of the sequences to be analyzed are computationally cleaved into small subsequences that are analyzed to find conserved elements.
Alignment is carried out by the single public function member “align,” declared above on line 18. This function member takes two arguments: (1) “orig,” a pointer to a set of sequences containing the sequences to be aligned; and (2) “aligned,” a pointer to an empty set of sequences that the alignment routine populates with aligned versions of the sequences in the set of sequences referenced by the argument “orig.” The alignment routine employs the private function members “findBest” and “score,” declared on lines 10-11, to identify the best average sequence from among the original sequences. The alignment routine then, in pairwise fashion, aligns each of the remaining sequences to this best sequence via the function member “pairwiseAlign,” declared on line 15. This “pairwiseAlign” function member calls the function member “computeIRuns,” declared on line 14, and recurses to align the next sequence to the reference sequence, or best sequence. In alignment, null characters, or gaps, may need to be inserted into either the reference sequence, via the private function member “insertNullsAllExcept,” or into the sequence currently being aligned via the private function member “insertNullsOnce.”
Next, the class “CE_Generator” is declared:
The class “CE_Generator” represents the conserved-element analysis logic that, in turn, represents one embodiment of the present invention. The class “CE_Generator” includes six public function members, declared above on lines 32-39: (1) “get,” a function member that returns the ith original sequence; (2) “set,” a function member that allows the amino-acid identity for a position within the original sequences to be set; (3) “filter,” a function that allows for further processing of conserved elements, an implementation for which is not provided in the pseudocode; (4) “aGet,” a function member that retrieves the ith aligned sequence; (5) “aSet,” a function member that allows the amino acid at a particular position in a particular aligned sequence to be set; and (6) “getCEs,” the main function member of the class “CE_Generator” that is called to carry out a search for conserved elements within a set of sequences. The parameters to the public function member “getCEs” include: (1) “sqs,” a pointer to an instance of the class “sequences” that includes the identified conserved elements and that represents the results of a conserved-element search; (2) “orig,” a pointer to an instance of the class “sequences” that contains the original sequences to be analyzed for conserved elements; (3) “aligned,” a pointer to an instance of the class “sequences” that contains aligned versions of the original sequences; (4) “c,” an instance of the class “positionAssignmentParameters” that specifies the various parameter values that control the conserved-element search; (5) “a,” a pointer to an array of instances of the class “compatibleAminoAcids” which specify the allowed amino-acid substitutions at variable positions within conserved elements; and (6) “numCA,” an integer value specifying the number of instances of the class “compatibleAminoAcids” in the array referenced by argument “a.”
Next, implementations for a number of the function members of the classes “compatibleAminoAcids,” “sequence,” and “sequences” are provided. These implementations are straightforward, and are not further described or annotated:
Next, implementations for function members of the class “aligner” are discussed. As mentioned above, there are a variety of different alignment methods and technologies that may be used for sequence alignment. The logic included in the class “aligner” is extremely simplistic and straightforward, but may provide adequate alignment in certain cases. It is included in the pseudocode for completeness and to illustrate an example of alignment, but is in no way intended to define or limit the present invention or the types of alignment techniques and methodologies that may be chosen for conserved-element analysis.
Implementations for the aligner function members “findBest” and “score” are next provided:
The function member “score” simply computes the number of positions in two sequences, identified by the indexes i and j, which contain identical amino-acid identifiers. The function member “findBest” computes all possible pairwise scores among the set of original sequences, and selects, as the best sequence, the sequence with the best, or highest, cumulative score.
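The scoring and reference-selection steps described above can be sketched in C++ as follows. This is an illustrative sketch only, not the original listing: it assumes that sequences are held as strings of one-letter residue codes rather than integer identifiers, that only the overlapping portion of two sequences of unequal length is compared, and that ties are resolved in favor of the earliest sequence.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Number of positions at which two sequences hold the same residue.
// Only the overlapping prefix of the two sequences is compared.
int score(const std::string& a, const std::string& b) {
    const std::size_t n = std::min(a.size(), b.size());
    int matches = 0;
    for (std::size_t k = 0; k < n; ++k) {
        if (a[k] == b[k]) ++matches;
    }
    return matches;
}

// Index of the sequence with the highest cumulative pairwise score
// against all other sequences; ties go to the earliest sequence.
std::size_t findBest(const std::vector<std::string>& seqs) {
    assert(!seqs.empty());
    std::size_t best = 0;
    long bestTotal = -1;
    for (std::size_t i = 0; i < seqs.size(); ++i) {
        long total = 0;
        for (std::size_t j = 0; j < seqs.size(); ++j) {
            if (i != j) total += score(seqs[i], seqs[j]);
        }
        if (total > bestTotal) {
            bestTotal = total;
            best = i;
        }
    }
    return best;
}
```

For the three sequences “ACDE,” “ACDF,” and “GCDF,” the cumulative scores are 5, 6, and 5, so the second sequence would be selected as the reference.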
Next, implementations for the function members “insertNullsOnce” and “insertNullsAllExcept” are provided:
The function member “insertNullsOnce” inserts a null character at a specified position within the sequence that is being aligned. By contrast, the function member “insertNullsAllExcept” inserts null characters at the same position within the reference sequence, or best sequence, and all already aligned sequences. In certain cases, null characters are inserted into the sequence being currently aligned during the alignment process, while, in other cases, null characters are inserted into the reference, or best, sequence and all already aligned sequences.
Next, an implementation for the function member “computeIRuns” is provided:
The function member “computeIRuns” attempts to find the longest string of amino-acid identifiers common to a currently considered portion of the reference sequence and a currently considered portion of a sequence currently being aligned to the reference sequence. In addition, the function member “computeIRuns” attempts to find a best-aligned common sequence of amino-acid identifiers. As the alignment of a run degrades, that is, as the offset between the starting positions of the common run in the two sequences increases, the run is penalized more heavily. In the outer nested while-loops of the function member “computeIRuns,” beginning on lines 15 and 16, the function member “computeIRuns” tries all possible starting positions within the two sequences “s” and “ref” being compared and aligned. In the inner while-loop, on lines 23-30, pointers are iteratively advanced from the currently considered starting positions as long as the sequence positions referenced by the pointers in the two compared sequences contain the same amino-acid identifier. At the end of this while-loop, the size of any detected, commonly shared run of amino-acid identifiers is computed, along with the difference in alignment of the runs in the two sequences, or offset between starting positions of the commonly shared subsequence, and a metric is computed, on line 36, to balance length and alignment. If the value of the metric is better than the best metric so far computed, then a number of variables are set, on lines 39-42, to indicate that a new best commonly shared run of amino-acid identifiers, or commonly shared subsequence, has been found in the two sequences.
Next, an implementation of the function member “pairwiseAlign” is provided:
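A search of this kind for the best commonly shared run may be sketched in C++ as shown below. The sketch is illustrative and is not the listing to which the line numbers above refer. The metric is not specified in this description, so the sketch assumes, purely for illustration, a metric equal to twice the run length minus a penalty proportional to the offset between starting positions; the names “Run,” “bestRun,” and “offsetPenalty” are hypothetical, and for-loops are used in place of the while-loops described above.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Describes a commonly shared run of residues in two sequences.
struct Run {
    int refStart;  // start of the run in the reference sequence
    int seqStart;  // start of the run in the sequence being aligned
    int length;    // 0 means that no common run was found
    int metric;    // assumed metric: 2 * length - penalty * offset
};

// Tries every pair of starting positions, extends each match as far as
// the residues agree, and keeps the run with the best metric.
Run bestRun(const std::string& ref, const std::string& s, int offsetPenalty = 1) {
    assert(offsetPenalty >= 0);
    const int refLen = static_cast<int>(ref.size());
    const int seqLen = static_cast<int>(s.size());
    Run best{0, 0, 0, 0};
    for (int i = 0; i < refLen; ++i) {
        for (int j = 0; j < seqLen; ++j) {
            int len = 0;
            while (i + len < refLen && j + len < seqLen &&
                   ref[i + len] == s[j + len]) {
                ++len;
            }
            if (len == 0) continue;
            const int metric = 2 * len - offsetPenalty * std::abs(i - j);
            if (best.length == 0 || metric > best.metric) {
                best = Run{i, j, len, metric};
            }
        }
    }
    return best;
}
```

With the reference “MKVLAAG” and the sequence “KVLAAGT,” the run “KVLAAG” is selected: it begins at position 1 of the reference and position 0 of the sequence, has length 6, and receives a metric of 11 under the assumed formula.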
This function member recursively aligns the sequence specified by index “s” to the reference, or best, sequence identified by index “ref.” On line 7, the function member “pairwiseAlign” calls the function member “computeIRuns” to find the best length of matching identical amino-acid identifiers in the two sequences, and then recursively calls itself, on lines 20 and 21, to align portions of the two sequences following and prior to the identified best run.
Next, an implementation of the function member “align” is provided:
The function member “align” determines the reference, or best, sequence, on line 7, via a call to the function member “findBest,” and then proceeds to align all sequences in the set of original sequences prior to the reference sequence, in the for-loop of lines 11-16, and then aligns all the sequences following the reference sequence in the for-loop of lines 17-22.
Next, implementations for function members of the class “CE_Generator” are provided. No implementation is provided for the function member “filter,” which is intended to illustrate that, following initial identification of conserved elements, additional considerations may be employed to discard certain of the identified conserved elements for various criteria. For example, initially identified conserved elements may be compared to host sequences in order to eliminate conserved elements similar or identical to native host sequences that, if included in a vaccine polymer, might elicit an autoimmune response. As another example, conserved elements that are known to be strongly immunodominant, and less than optimally effective in eliciting a desired, protective immune response, may also be eliminated or somehow identified for special positioning or inclusion at a special multiplicity within the vaccine. Other considerations may also be applied by the filter function. No implementation is provided for this function because the implementation generally depends on extraneous databases and other information, accessible through specialized interfaces that are beyond the scope of the present discussion, and may also be vaccine-type and host-type dependent.
Next, implementations for the function members “clearTable” and “generateTable” are provided:
These function members initialize and generate a table that includes the amino-acid-occurrence frequencies at each position within the set of aligned sequences. In other words, the table is a matrix of amino-acid-frequency of occurrence with respect to sequence position, with one axis, or index, spanning the possible amino acids, and another axis, or index, spanning all of the positions within the aligned set of sequences. Note that, after alignment, all aligned sequences have equal length. The frequencies range from 0 to 1, and are floating-point values computed by dividing the number of occurrences of each amino acid at each position by the total number of sequences, on line 14 of the function member “generateTable.” Again, as with many aspects of the pseudocode implementation, many different design choices and alternative algorithms are possible. For example, frequencies might be adjusted downward in the case that a position is only sparsely populated or, in other words, the null character is frequently observed at the position.
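The construction of such a frequency table can be sketched in C++ as follows. The sketch is illustrative only and does not reproduce the original listing. It assumes string sequences of one-letter residue codes, the character “-” as the null character, and a per-position map in place of a fixed-size matrix; the type name “FrequencyTable” is hypothetical. Null characters are not counted as residues, but the divisor remains the total number of sequences, as described above.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One map per aligned position: residue -> fraction of sequences that
// contain that residue at the position.
using FrequencyTable = std::vector<std::map<char, double>>;

FrequencyTable generateTable(const std::vector<std::string>& aligned) {
    assert(!aligned.empty());
    const std::size_t length = aligned.front().size();
    FrequencyTable table(length);

    // Count occurrences of each residue at each position.
    for (const std::string& seq : aligned) {
        assert(seq.size() == length);  // aligned sequences have equal length
        for (std::size_t pos = 0; pos < length; ++pos) {
            if (seq[pos] != '-') table[pos][seq[pos]] += 1.0;
        }
    }

    // Convert counts to frequencies in the range 0 to 1.
    const double total = static_cast<double>(aligned.size());
    for (std::map<char, double>& column : table) {
        for (auto& entry : column) entry.second /= total;
    }
    return table;
}
```

For the four aligned sequences “AC-,” “AD-,” “ACG,” and “AC-,” the first position holds A at frequency 1.0, the second holds C at 0.75 and D at 0.25, and the third holds G at 0.25.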
Next, implementations for the CE_Generator member functions “listClear” and “listAdd” are provided:
The list that is created using these routines is a list of amino-acid occurrences at a particular position within the aligned sequences. A list is created for each position, with the ten most frequently occurring amino acids, if ten or more amino acids occur at that position, maintained in the list in order of decreasing frequency of occurrence. This list is used to determine whether a position is a variable position and, if so, to determine a minimal set of amino acids with a combined frequency of occurrence greater than the variable threshold.
Next, an implementation for the CE_Generator function member “compatible” is provided:
The function member “compatible” determines whether an amino acid proposed to be included in the set of amino acids that together comprise a variable position is compatible with the other amino acids already included in the variable position.
Next, an implementation for the CE_Generator function member “varPos” is provided:
The function member “varPos” recursively examines the ordered list of amino-acid frequencies prepared for a particular position in the aligned sequences to determine if there is a set of amino acids sized less than or equal to the maximum number of amino acids allotted a variable position with a combined frequency of occurrence greater than or equal to the threshold frequency of occurrence for a variable position. This function member returns a Boolean result indicating whether or not a particular position within the aligned sequences is a variable position.
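The variable-position test can be sketched in C++ as shown below. This is a simplified illustration rather than the recursive routine described above: it assumes that no compatibility restriction is applied, in which case accumulating the most frequent residues in decreasing order of frequency yields the smallest qualifying set. The function name “isVariablePosition” and its signature are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Returns true when at most maxAAs residues reach a combined frequency
// of at least variableThreshold. On success, "chosen" holds the minimal
// set in order of decreasing frequency; on failure it is left empty.
bool isVariablePosition(const std::map<char, double>& column,
                        std::size_t maxAAs, double variableThreshold,
                        std::vector<char>& chosen) {
    assert(variableThreshold > 0.0 && variableThreshold <= 1.0);

    // Order residues by decreasing frequency, breaking ties by residue.
    std::vector<std::pair<char, double>> ordered(column.begin(), column.end());
    std::sort(ordered.begin(), ordered.end(),
              [](const std::pair<char, double>& a,
                 const std::pair<char, double>& b) {
                  return a.second != b.second ? a.second > b.second
                                              : a.first < b.first;
              });

    chosen.clear();
    double combined = 0.0;
    for (std::size_t k = 0; k < ordered.size() && k < maxAAs; ++k) {
        chosen.push_back(ordered[k].first);
        combined += ordered[k].second;
        if (combined >= variableThreshold) return true;
    }
    chosen.clear();
    return false;
}
```

When compatibility sets are enforced, a candidate residue would additionally need to pass a check such as the function member “compatible” before being added, and a recursive search over alternatives, as described above, becomes necessary.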
Next, an implementation of the function member “mapPos” is provided:
The function member “mapPos” creates a one-dimensional map of the aligned sequence positions, indicating for each position whether the position is conserved, variable, or unconserved. For variable positions, identities of the amino acids at those positions are preserved in an instance of the class “sequences,” “conservedAAs.”
Next, an implementation of the CE_Generator function member “contains” is provided:
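The classification of positions can be sketched in C++ as follows. The sketch is illustrative only and rests on several assumptions: each position is supplied as a list of residue frequencies already sorted in decreasing order, compatibility restrictions are ignored, and the structure “Params” is a hypothetical reduction of the class “positionAssignmentParameters” to the three values needed here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

enum class PosType { Conserved, Variable, Unconserved };

// Hypothetical subset of the search parameters.
struct Params {
    double conservedThreshold;
    double variableThreshold;
    std::size_t numAAsAtVariablePosition;
};

// Classifies each position. Each column lists the residue frequencies
// observed at one position, sorted in decreasing order.
std::vector<PosType> mapPositions(const std::vector<std::vector<double>>& columns,
                                  const Params& p) {
    std::vector<PosType> map;
    map.reserve(columns.size());
    for (const std::vector<double>& column : columns) {
        assert(std::is_sorted(column.begin(), column.end(), std::greater<double>()));
        PosType type = PosType::Unconserved;
        if (!column.empty() && column.front() >= p.conservedThreshold) {
            // A single residue dominates the position.
            type = PosType::Conserved;
        } else {
            // Otherwise, test whether a few residues together dominate.
            double combined = 0.0;
            const std::size_t limit =
                std::min(column.size(), p.numAAsAtVariablePosition);
            for (std::size_t k = 0; k < limit; ++k) combined += column[k];
            if (combined >= p.variableThreshold) type = PosType::Variable;
        }
        map.push_back(type);
    }
    return map;
}
```

With a conserved threshold of 0.875, a variable threshold of 0.75, and two residues allowed per variable position, a position with frequencies {1.0} is conserved, one with {0.5, 0.25, 0.25} is variable, and one with {0.25, 0.25, 0.25, 0.25} is unconserved.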
The function member “contains” determines whether a conserved element identified during conserved-element analysis has already been included in a set of conserved elements already found during the conserved-element analysis.
Next, an implementation of the function member “enterCE” is provided:
The function member “enterCE” enters a next identified conserved element into the set of conserved elements that represents the result of conserved-element analysis. When the next identified conserved element includes one or more variable positions, all possible related sequences, obtained by substitution of the various amino acids that occur at the variable positions, are generated and entered.
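The generation of all related sequences from a conserved element containing variable positions amounts to a Cartesian product over the residues allowed at each position, which can be sketched in C++ as follows. The function name “expandVariants” and the representation of each position as a string of allowed one-letter residue codes are assumptions made for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Each element of "positions" lists the residues allowed at one position
// of a conserved element: a single residue for a conserved position,
// several residues for a variable position. Returns every sequence
// obtainable by choosing one residue per position.
std::vector<std::string> expandVariants(const std::vector<std::string>& positions) {
    std::vector<std::string> variants{""};
    for (const std::string& allowed : positions) {
        assert(!allowed.empty());  // every position needs at least one residue
        std::vector<std::string> next;
        next.reserve(variants.size() * allowed.size());
        for (const std::string& prefix : variants) {
            for (char residue : allowed) {
                next.push_back(prefix + residue);
            }
        }
        variants.swap(next);
    }
    return variants;
}
```

For the positions “G,” “VI,” and “L,” the sketch yields the two sequences “GVL” and “GIL.” The number of generated sequences grows multiplicatively with the number of variable positions, which is one reason for limiting both the number of variable positions and the number of residues allowed at each.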
Next, an implementation for the CE_Generator function member “getCEs” is provided:
This is the main function member of the class “CE_Generator.” First, on line 11, the table of amino-acid-occurrence frequencies is generated. Then, on line 19, the one-dimensional map of the aligned-sequence positions, indicating whether each position is conserved, variable, or unconserved, is generated via a call to the function member “mapPos.” Finally, in the for-loop of lines 21-34, the one-dimensional map is exhaustively searched for conserved elements that meet all of the thresholds and parameters, including the length threshold, number of variable positions threshold, number of amino acids allowed at a variable position threshold, and other parameters. Each identified conserved element not already entered into the results set is entered into the results set via a call to “enterCE” on line 32.
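The search of the one-dimensional map can be sketched in C++ as shown below. The sketch is one possible reading of the search rather than the original listing: from each starting position it extends a candidate element as far as the map allows, and it reports the candidate only when it is long enough and is not wholly contained in the previously reported element. The names “Segment” and “findSegments” are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class PosType { Conserved, Variable, Unconserved };

// A candidate conserved element within the position map.
struct Segment {
    std::size_t start;
    std::size_t length;
};

// Finds runs of conserved and variable positions that contain no
// unconserved position, at most maxVariable variable positions, and at
// least minLength positions in total.
std::vector<Segment> findSegments(const std::vector<PosType>& map,
                                  std::size_t minLength,
                                  std::size_t maxVariable) {
    assert(minLength > 0);
    std::vector<Segment> found;
    std::size_t lastEnd = 0;  // end of the most recently reported segment
    for (std::size_t start = 0; start < map.size(); ++start) {
        std::size_t end = start;
        std::size_t variable = 0;
        while (end < map.size() && map[end] != PosType::Unconserved) {
            if (map[end] == PosType::Variable) {
                if (variable == maxVariable) break;  // variable budget spent
                ++variable;
            }
            ++end;
        }
        // Skip short segments and segments nested in the previous one.
        if (end - start >= minLength && end > lastEnd) {
            found.push_back(Segment{start, end - start});
            lastEnd = end;
        }
    }
    return found;
}
```

Each reported segment would then be converted to one or more residue sequences and entered into the results set, with duplicates suppressed, in the manner of the function members “enterCE” and “contains.”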
Finally, a truncated version of an exemplary program for searching a set of sequences for conserved elements is provided:
In an actual program, sequences are added to an instance of class “sequences,” “orig,” through calls to the sequences function member “addSeq,” and compatible sets of amino acids are similarly added to an instance of the class “compatibleAminoAcids.”
Again, there are an essentially unlimited number of different implementations of the conserved-element analysis logic that represent embodiments of the present invention. There are many different design choices, additional parameters and constraints that may be considered, different analytical techniques with different computational efficiencies, that may all be considered when addressing particular problem domains, including particular types of vaccines, particular hosts, and particular pathogens. For example, vaccines may be targeted to eukaryotic parasite pathogens, bacterial pathogens, and complex viral pathogens, with much larger genomes and corresponding proteomes than HIV, perhaps requiring different computational strategies and additional criteria for selecting conserved elements. Certain of the above-described features of the C++-like pseudocode may be omitted, without significantly impacting conserved-element analysis.
A Perl program actually used for generating conserved sequences for the HIV virus is provided in
In addition, as suggested above, various embodiments of the present invention may avoid aligning sequences altogether. Instead, the set of sequences to be analyzed may be decomposed computationally into small subsequences that are then computationally re-assembled to identify conserved elements. Many other computational approaches are also possible.
Application of the above-described method for selecting conserved elements (“CEs”) from aligned sequences has produced a set of CE peptide sequences with very high conservation and a set with slightly less conservation from large sets of aligned HIV gene sequences. The analysis was done on a gene-by-gene basis, using the following numbers of HIV-gene variants: (1) gag—619; (2) pol—615; (3) vif—967; (4) vpr—835; (5) tat—1225; (6) rev—938; (7) vpu—925; (8) env—871; (9) nef—1474. Highly conserved CEs are included below in Table 1:
An additional set of less highly conserved CE peptide sequences has been identified from large sets of aligned HIV gene sequences by relaxing certain of the threshold constraints:
Alternative embodiments of the conserved-element identifying methods of the present invention may produce additional conserved elements. In addition, analysis of a greater number of HIV sequences from additional strains may lead to modification of the final set of conserved elements for HIV.
Once conserved elements are identified, they are used to construct one or more biopolymers used directly as a vaccine, or used in intermediate steps of vaccine development. The combination of conserved elements to produce vaccine biopolymers, or intermediate biopolymers used to produce vaccines, is a complex process that may involve many considerations, constraints, and use of linker sequences and other sequences in addition to the conserved elements. The problem of combining CEs to produce vaccine-related biopolymers may be parameterized, just as CE-identification methods are parameterized. For example, the problem of combining CEs to produce a vaccine-related biopolymer may optimize variables, including the number of copies of each CE to include in the biopolymer, the relative positions of CEs, the length and types of linker sequences used to join the CEs together, the number of discrete biopolymers to use for the vaccine, or as intermediate biopolymers, and other such parameters. Optimization constraints and goals may include the frequency of display of CEs by antigen-presenting cells, the effective concentration, or copy number, of displayed CEs, the effectiveness of the immune response elicited by the vaccine, avoidance of inadvertent generation of undesirable sequence fragments displayed by antigen-presenting cells, overall size constraints for a usable vaccine biopolymer, and other such constraints and goals.
Although the present invention has been described in terms of particular embodiments, it is not intended that the invention be limited to these embodiments. Modifications within the spirit of the invention will be apparent to those skilled in the art. For example, it should be noted that, although certain embodiments of the present invention are described for identifying conserved elements of viral proteomes, alternative method embodiments may be directed to identifying conserved viral RNA subsequences or vDNA subsequences, and designing CEVacs based on conserved viral RNA subsequences or vDNA subsequences. As discussed above, any of a vast number of different subsequence-selection criteria may be applied in order to identify conserved elements. Once the conserved elements within the two-dimensional viral proteome array, discussed with reference to
The foregoing description, for purposes of explanation, used specific nomenclature to provide a thorough understanding of the invention. However, it will be apparent to one skilled in the art that the specific details are not required in order to practice the invention. The foregoing descriptions of specific embodiments of the present invention are presented for purpose of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Obviously many modifications and variations are possible in view of the above teachings. The embodiments are shown and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the following claims and their equivalents:
Relation | Number | Date | Country
---|---|---|---
Parent | 11713474 | Mar 2007 | US
Child | 13097592 | | US