Sources for, and types of, insecticidally active proteins, and polynucleotides that encode the proteins

BACKGROUND

Billions of dollars are spent each year to control insects, and additional billions of dollars are lost because of crop damage inflicted by insects. Synthetic organic chemical insecticides have been the primary tools used to control insects, but biological insecticides are playing an important role in some areas. Insect-resistant plants transformed with insecticidal protein genes, such as the insecticidal proteins derived from Bacillus thuringiensis (B.t.), have revolutionized modern agriculture and heightened the importance and value of insecticidal proteins and their genes.

Toxin Complex (TC) proteins and genes, found primarily in bacteria of the genera Photorhabdus and Xenorhabdus (but also in other bacterial genera such as Serratia, Pseudomonas, and Paenibacillus), are an important, relatively new source of insecticidal proteins and genes. There are at least three distinct classes of TC proteins. Native Class A TC proteins are approximately 280 kDa in size and possess insecticidal activity, although this activity is relatively low. Class B TC proteins (approximately 170 kDa) and Class C TC proteins (approximately 107 kDa), in combination, enhance the insecticidal potency of Class A TC proteins but possess little to no insecticidal activity in the absence of a Class A TC protein. That is to say, Class B and Class C TC proteins in combination potentiate the insecticidal activity of Class A TC proteins: when a Class A TC protein is combined with a Class B and a Class C TC protein, they form a complex that is much more potent than the Class A TC protein alone. See, e.g., US-2004-0208907 and WO 2004/067727 for a more detailed review of the art.

Unlike Bacillus thuringiensis, Xenorhabdus, and Photorhabdus, which are organisms that are known to be insecticidal and to have insecticidal proteins, organisms such as Fusarium graminearum (now known as Gibberella zeae) and Methanosarcina were not known to be insecticidal and were not known to produce insecticidally active proteins.

BRIEF SUMMARY

The subject invention provides new classes and types of toxin complex (“TC”) proteins, and exciting new sources for TC proteins. The subject invention also includes polynucleotides that encode the subject proteins. The subject invention further provides vectors and cells comprising these polynucleotides. The subject invention also provides novel methods of controlling insects.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 presents the graphical output of a search using the artificial fusion protein sequence of SEQ ID NO:6 in a standard protein-protein BLAST search of the NCBI nonredundant protein database, using the following default values: Filter set to low complexity; Expect 10; Word size 3; Matrix BLOSUM62; Gap Costs: Existence 11, Extension 1.
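The search settings listed for FIG. 1 correspond to options of the NCBI BLAST+ `blastp` program, and the mapping can be sketched as below. This is an illustrative sketch only: the query file name `seq_id_6.fasta`, the database name `nr`, and the helper `build_blastp_command` are assumptions introduced here, and the disclosure does not state how the original search was launched.

```python
def build_blastp_command(query_fasta, database="nr"):
    """Return an argument list for NCBI BLAST+ blastp mirroring the FIG. 1 settings.

    The file and database names are placeholders (assumptions), not values
    taken from the disclosure.
    """
    return [
        "blastp",
        "-query", query_fasta,
        "-db", database,
        "-seg", "yes",           # filter set to low complexity
        "-evalue", "10",         # Expect 10
        "-word_size", "3",       # Word size 3
        "-matrix", "BLOSUM62",   # Matrix BLOSUM62
        "-gapopen", "11",        # Gap cost: existence 11
        "-gapextend", "1",       # Gap cost: extension 1
    ]


command = build_blastp_command("seq_id_6.fasta")
print(" ".join(command))
```

The list form can be passed directly to `subprocess.run` on a machine where BLAST+ and a local copy of the database are installed.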

FIG. 2 shows the amino acid sequence of SEQ ID NO:2 for the natural BC fusion in Tannerella, where underlined amino acids show the spvB (Salmonella virulence plasmid B protein) domain using the standard spvB-ls.hmm (“hmm” = “hidden Markov model”) model; amino acids with double underlining show the FG-GAP domain using the BMode17.hmm model; amino acids in bold show the RHS (recombination hot spot) domains using the Pfam rhs_ls.hmm model (where “ls” mode requires a sequence model to match the entire HMM profile); and amino acids in italics show the HVR mapped by lack of homology to other proteins.

FIG. 3 shows the amino acid sequence of SEQ ID NO:4 for the hypothetical protein FG10566.1 fused BC toxin protein [Gibberella zeae PH-1], where the underlined amino acids indicate the spvB domain using the standard spvB-ls.hmm model; amino acids with double underlining indicate three FG-GAP domains found with the BModels3.hmm model; amino acids in bold indicate RHS domains using the Pfam rhs_ls.hmm model; and amino acids in italics indicate the HVR mapped by lack of homology to other proteins.

FIGS. 4A-D show the global alignment of the two BC fused toxin proteins from Tannerella of SEQ ID NO:2 and Gibberella of SEQ ID NO:4.

BRIEF DESCRIPTION OF THE SEQUENCES

SEQ ID NO:1 is the native genomic DNA sequence tcp1_Gz which encodes the protein of SEQ ID NO:2.

SEQ ID NO:2 shows the native amino acid sequence of the Tcp1_Gz protein (including readthrough of putative intron).

SEQ ID NO:3 shows the native, hypothetical cDNA sequence with the putative intron removed. This sequence encodes the protein of SEQ ID NO:4.

SEQ ID NO:4 is the native amino acid sequence of the Tcp1_Gz protein with the intron-encoded sequence removed.

SEQ ID NO:5 is an E. coli-optimized polynucleotide sequence which codes for the Tcp1_Gz protein of SEQ ID NO:2.

SEQ ID NO:6 is an example of a fusion protein generated from the amino acid sequences of TcaC (GenBank Accession AAC38625.1) and TccC1 (GenBank Accession AAL18473.1) (both from Photorhabdus luminescens strain W-14).

SEQ ID NO:7 is the genomic sequence from Methanosarcina acetivorans strain C2A that codes for a two-domain toxin complex protein.

SEQ ID NO:8 is the amino acid sequence encoded by SEQ ID NO:7.

SEQ ID NO:9 is the genomic sequence (from Gibberella zeae PH-1 strain PH-1; NRRL 31084 chromosome 1) that encodes a Class A TC protein.

SEQ ID NO:10 is the amino acid sequence encoded by SEQ ID NO:9.

SEQ ID NO:11 is the full length sequence of the Class B/C fusion gene in Tannerella forsythensis (ATCC 43037).

SEQ ID NO:12 is the protein encoded by SEQ ID NO:11.

SEQ ID NO:13 is primer P1 used for PCR according to the subject invention.

SEQ ID NO:14 is primer P2 used for PCR according to the subject invention.

SEQ ID NO:15 is primer P3 used for PCR according to the subject invention.

SEQ ID NO:16 is primer P4 used for PCR according to the subject invention.

SEQ ID NO:17 is primer P5 used for PCR according to the subject invention.

SEQ ID NO:18 is primer P6 used for PCR according to the subject invention.

SEQ ID NO:19 is the nucleotide sequence of fusion 8884 (TcdB2/Tcp1_GzC). Nucleotides 1-4422 encode TcdB2; nucleotides 4423-4464 encode the TcdB2/Tcp1_GzC linker peptide; and nucleotides 4465-7539 encode Tcp1_GzC.

SEQ ID NO:20 is the amino acid sequence of the 8884 TcdB2/Tcp1_GzC fusion peptide encoded by SEQ ID NO:19. Amino acids 1-1474: TcdB2; amino acids 1475-1488: TcdB2/Tcp1_GzC linker peptide; amino acids 1489-2513: Tcp1_GzC.

SEQ ID NO:21 is the nucleotide sequence of fusion 8883 (tcp1_GzB/tccC3). Nucleotides 1-4536 encode Tcp1_GzB; nucleotides 4537-4575 encode the Tcp1_GzB/TccC3 linker peptide; and nucleotides 4576-7455 encode TccC3.

SEQ ID NO:22 is the amino acid sequence of the 8883 fusion protein Tcp1_GzB/TccC3 encoded by SEQ ID NO:21. Amino acids 1-1512: Tcp1_GzB; amino acids 1513-1525: Linker; amino acids 1526-2485: TccC3.
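The nucleotide and amino acid coordinates given for fusions 8884 and 8883 are related by simple codon arithmetic, which the following sketch makes explicit. The function names and the dictionary `FUSION_8884` are illustrative assumptions; the coordinates themselves are those stated for SEQ ID NOs:19-22, and all ranges are treated as 1-based, inclusive, and in frame.

```python
def nt_range_to_aa_range(nt_start, nt_end):
    """Convert a 1-based inclusive, codon-aligned nucleotide range to a residue range."""
    if (nt_start - 1) % 3 != 0 or nt_end % 3 != 0:
        raise ValueError("nucleotide range is not codon-aligned")
    return ((nt_start + 2) // 3, nt_end // 3)


def aa_range_to_nt_range(aa_start, aa_end):
    """Convert a 1-based inclusive residue range to the nucleotide range encoding it."""
    return ((aa_start - 1) * 3 + 1, aa_end * 3)


# Segments of fusion 8884 as listed for SEQ ID NO:19 (nucleotide coordinates).
FUSION_8884 = {
    "TcdB2": (1, 4422),
    "linker": (4423, 4464),
    "Tcp1_GzC": (4465, 7539),
}

for name, (start, end) in FUSION_8884.items():
    print(name, nt_range_to_aa_range(start, end))

# Linker of fusion 8883: residues 1513-1525 of SEQ ID NO:22.
print(aa_range_to_nt_range(1513, 1525))
```

Under these assumptions the three segments of fusion 8884 map to residues 1-1474, 1475-1488, and 1489-2513, matching SEQ ID NO:20, and the 13-residue linker of fusion 8883 corresponds to nucleotides 4537-4575, the interval between the stated Tcp1_GzB and TccC3 coding ranges.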

SEQ ID NO:23 is the plant-optimized nucleotide sequence encoding a variant of Gibberella zeae fused Class B/Class C Tcp1_Gz protein.

SEQ ID NO:24 is the variant of Gibberella zeae fused Class B/Class C Tcp1_Gz protein encoded by SEQ ID NO:23.

SEQ ID NO:25 is the nucleotide sequence extracted from AContig12 of Fusarium verticillioides. The Threonine codon (ACG) that serves as the beginning of the open reading frame for the coding region of the first segment of the deduced putative TC Class A protein is noted as a misc_feature at nucleotides 21-23. The AAA Lysine codon which serves as the start of the open reading frame for the second portion of the deduced putative TC Class A protein is noted as a misc_feature at nucleotides 3022-3024.

SEQ ID NO:26 is the first segment of the deduced putative TC Class A protein encoded by SEQ ID NO:25.

SEQ ID NO:27 is the second segment of the deduced putative TC Class A protein encoded by SEQ ID NO:25.

SEQ ID NO:28 is the nucleotide sequence extracted from AContig34 of Fusarium verticillioides. The beginning of the coding region corresponding to the first Asparagine of the putative TC Class A encoded protein in SEQ ID NO:29 is noted as a misc_feature at nucleotides 20-22. The second part of the open reading frame starts 4 bases downstream of the TGA stop codon, comprises 690 bases, and encodes the 230 amino acids shown in SEQ ID NO:30. The third portion of the open reading frame starts 11 bases downstream of the TAA stop codon, comprises 1122 bases, and encodes the 374 amino acids shown in SEQ ID NO:31. A large gap in the DNA sequence, indicated as a string of 2098 n's, is noted in a misc_feature at nucleotides 3299-5396. The portion of the DNA sequence following the n's comprises a fourth portion of the deduced putative TC Class A protein open reading frame, and encodes the 1273 amino acids shown in SEQ ID NO:32. The GGA codon for the first Glycine of this portion of the deduced putative TC Class A protein is indicated as a misc_feature at nucleotides 5451-5453.
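Sequencing gaps of the kind noted above, recorded as runs of n's, can be located programmatically. The sketch below is a minimal illustration on a toy sequence; the function names and the toy sequence are assumptions and are not taken from the sequence listing. Coordinates are reported 1-based and inclusive, matching the misc_feature convention used here.

```python
import re


def find_gap_runs(sequence, min_length=1):
    """Return (start, end, length) for each run of n/N, 1-based inclusive."""
    runs = []
    for match in re.finditer(r"[nN]+", sequence):
        length = match.end() - match.start()
        if length >= min_length:
            runs.append((match.start() + 1, match.end(), length))
    return runs


def span_length(start, end):
    """Length of a 1-based inclusive coordinate span."""
    return end - start + 1


toy_sequence = "ATGAAAnnnnnGGATAA"  # hypothetical example, not a real contig
print(find_gap_runs(toy_sequence))
print(span_length(3299, 5396))
```

The second call checks the stated gap coordinates: a span from nucleotide 3299 to 5396 contains 2098 positions, in agreement with the stated number of n's.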

SEQ ID NO:29 is the first portion of the putative TC Class A protein encoded by SEQ ID NO:28.

SEQ ID NO:30 is the second portion of the putative TC Class A protein encoded by SEQ ID NO:28.

SEQ ID NO:31 is the third portion of the putative TC Class A protein encoded by SEQ ID NO:28.

SEQ ID NO:32 is the fourth portion of the putative TC Class A protein encoded by SEQ ID NO:28.

SEQ ID NO:33 is the nucleotide sequence extracted from BCContig12 of Fusarium verticillioides. The beginning of the coding region corresponding to the first Alanine of the encoded putative TC fused ClassB/Class C protein in SEQ ID NO:34 (GCC) is noted as a misc_feature from nucleotides 22-24. A large gap in the DNA sequence, indicated as a string of 659 n's, is noted as a misc_feature from nucleotides 5483-6141. The in-frame Histidine codon (CAT) that starts the second portion of the putative TC fused ClassB/Class C protein is noted as a misc_feature from nucleotides 6203-6205.

SEQ ID NO:34 is the first portion of the putative fused TC ClassB/Class C protein encoded by SEQ ID NO:33.

SEQ ID NO:35 is the second portion of the putative fused TC ClassB/Class C protein encoded by SEQ ID NO:33.

SEQ ID NO:36 is the nucleotide sequence extracted from BCContig6 of Fusarium verticillioides. The beginning of the coding region corresponding to the first Glutamine of the deduced putative TC fused Class B/Class C protein (CAG) is noted as a misc_feature from nucleotides 20-22. The Aspartic Acid codon (GAT) that starts the second portion of the putative TC fused Class B/Class C protein is noted as a misc_feature at nucleotides 619-621.

SEQ ID NO:37 is the first portion of the putative fused TC Class B/Class C protein encoded by SEQ ID NO:36.

SEQ ID NO:38 is the second portion of the putative fused TC Class B/Class C protein encoded by SEQ ID NO:36.

SEQ ID NO:39 is the nucleotide sequence extracted from BCContig46 from BCContig6 of Fusarium verticillioides. The beginning of the coding region (GAG), corresponding to the first Glutamic Acid of the first portion of the deduced putative TC fused Class B/Class C protein, is noted as a misc_feature at nucleotides 21-23. A large gap in the DNA sequence, indicated as a string of 1009 n's, is noted as a misc_feature from nucleotides 3424-4432. The TTG codon that specifies the first Leucine of the second portion of the deduced TC fused Class B/Class C protein following the n's is noted as a misc_feature at nucleotides 4435-4437.

SEQ ID NO:40 is the first portion of the putative fused TC Class B/Class C protein encoded by SEQ ID NO:39.

SEQ ID NO:41 is a second portion of the putative fused TC Class B/Class C protein encoded by SEQ ID NO:39.

SEQ ID NO:42 is the ClustalX sequence alignment for the FG_GAP Domain 4 extracted from the protein (represented under the GenBank Accession number 16416891).

SEQ ID NO:43 is the ClustalX sequence alignment for the FG_GAP Domain 4 extracted from the protein (represented under the GenBank Accession number 66047263).

SEQ ID NO:44 is the ClustalX sequence alignment for the FG_GAP Domain 3 extracted from the protein (represented under the GenBank Accession number 16416891).

SEQ ID NO:45 is the ClustalX sequence alignment for the FG_GAP Domain 3 extracted from the protein (represented under the GenBank Accession number 66047263).

SEQ ID NO:46 is the ClustalX sequence alignment for the FG_GAP Domain 6 extracted from the protein (represented under the GenBank Accession number 16416891).

SEQ ID NO:47 is the ClustalX sequence alignment for the FG_GAP Domain 6 extracted from the protein (represented under the GenBank Accession number 66047263).

SEQ ID NO:48 is the ClustalX sequence alignment for the FG_GAP Domain 2 extracted from the protein (represented under the GenBank Accession number 16416891).

SEQ ID NO:49 is the ClustalX sequence alignment for the FG_GAP Domain 2 extracted from the protein (represented under the GenBank Accession number 66047263).

SEQ ID NO:50 is the ClustalX sequence alignment for the FG_GAP Domain 5 extracted from the protein (represented under the GenBank Accession number 16416891).

SEQ ID NO:51 is the ClustalX sequence alignment for the FG_GAP Domain 5 extracted from the protein (represented under the GenBank Accession number 66047263).

SEQ ID NO:52 is the ClustalX sequence alignment for the FG_GAP Domain 1 extracted from the protein (represented under the GenBank Accession number 16416891).

SEQ ID NO:53 is the ClustalX sequence alignment for the FG_GAP Domain 1 extracted from the protein (represented under the GenBank Accession number 66047263).

SEQ ID NO:54 is the ClustalX sequence alignment of the RHS domain_1 extracted from the protein (represented under the GenBank Accession number 27479639).

SEQ ID NO:55 is the ClustalX sequence alignment of the RHS domain_1 extracted from the protein (represented under the GenBank Accession number 37524966).

SEQ ID NO:56 is the ClustalX sequence alignment of the RHS domain_1 extracted from the protein (represented under the GenBank Accession number 45441893).

SEQ ID NO:57 is the ClustalX sequence alignment of the RHS domain_1 extracted from the protein (represented under the GenBank Accession number 51596557).

SEQ ID NO:58 is the ClustalX sequence alignment of the RHS domain_2 extracted from the protein (represented under the GenBank Accession number 48730374).

SEQ ID NO:59 is the ClustalX sequence alignment of the RHS domain_2 extracted from the protein (represented under the GenBank Accession number 48730376).

SEQ ID NO:60 is the ClustalX sequence alignment of the RHS domain_2 extracted from the protein (represented under the GenBank Accession number 28871477).

SEQ ID NO:61 is the ClustalX sequence alignment of the RHS domain_1 extracted from the protein (represented under the GenBank Accession number 66047265).

SEQ ID NO:62 is the ClustalX sequence alignment of the RHS domain_1 extracted from the protein (represented under the GenBank Accession number 28871480).

SEQ ID NO:63 is the ClustalX sequence alignment of the RHS domain_2 extracted from the protein (represented under the GenBank Accession number 28868442).

SEQ ID NO:64 is the ClustalX sequence alignment of the RHS domain_2 extracted from the protein (represented under the GenBank Accession number 45443601).

SEQ ID NO:65 is the ClustalX sequence alignment of the RHS domain_2 extracted from the protein (represented under the GenBank Accession number 66044304).

SEQ ID NO:66 is the ClustalX sequence alignment of the RHS domain_3 extracted from the protein (represented under the GenBank Accession number 66045648).

SEQ ID NO:67 is the ClustalX sequence alignment of the RHS domain_3 extracted from the protein (represented under the GenBank Accession number 66047260).

SEQ ID NO:68 is the ClustalX sequence alignment of the RHS domain_2 extracted from the protein (represented under the GenBank Accession number 66043853).

SEQ ID NO:69 is the ClustalX sequence alignment of the RHS domain_1 extracted from the protein (represented under the GenBank Accession number 66045648).

SEQ ID NO:70 is the ClustalX sequence alignment of the RHS domain_1 extracted from the protein (represented under the GenBank Accession number 66047260).

SEQ ID NO:71 is the ClustalX sequence alignment of the RHS domain_1 extracted from the protein (represented under the GenBank Accession number 66047259).

SEQ ID NO:72 is the ClustalX sequence alignment of the RHS domain_1 extracted from the protein (represented under the GenBank Accession number 66047264).

SEQ ID NO:73 is the ClustalX sequence alignment of the RHS domain_3 extracted from the protein (represented under the GenBank Accession number 27479639).

SEQ ID NO:74 is the ClustalX sequence alignment of the RHS domain_4 extracted from the protein (represented under the GenBank Accession number 27479683).

SEQ ID NO:75 is the ClustalX sequence alignment of the RHS domain_2 extracted from the protein (represented under the GenBank Accession number 27479669).

SEQ ID NO:76 is the ClustalX sequence alignment of the RHS domain_4 extracted from the protein (represented under the GenBank Accession number 27479677).

SEQ ID NO:77 is the ClustalX sequence alignment of the RHS domain_4 extracted from the protein (represented under the GenBank Accession number 45441893).

SEQ ID NO:78 is the ClustalX sequence alignment of the RHS domain_3 extracted from the protein (represented under the GenBank Accession number 28871480).

SEQ ID NO:79 is the ClustalX sequence alignment of the RHS domain_4 extracted from the protein (represented under the GenBank Accession number 66047265).

SEQ ID NO:80 is the ClustalX sequence alignment of the RHS domain_4 extracted from the protein (represented under the GenBank Accession number 28868442).

SEQ ID NO:81 is the ClustalX sequence alignment of the RHS domain_4 extracted from the protein (represented under the GenBank Accession number 66043853).

SEQ ID NO:82 is the ClustalX sequence alignment of the RHS domain_2 extracted from the protein (represented under the GenBank Accession number 27479683).

SEQ ID NO:83 is the ClustalX sequence alignment of the RHS domain_3 extracted from the protein (represented under the GenBank Accession number 51597848).

SEQ ID NO:84 is the ClustalX sequence alignment of the RHS domain_2 extracted from the protein (represented under the GenBank Accession number 27479677).

SEQ ID NO:85 is the ClustalX sequence alignment of the RHS domain_2 extracted from the protein (represented under the GenBank Accession number 37524950).

SEQ ID NO:86 is the ClustalX sequence alignment of the RHS domain_3 extracted from the protein (represented under the GenBank Accession number 66047264).

SEQ ID NO:87 is the ClustalX sequence alignment of the RHS domain_4 extracted from the protein (represented under the GenBank Accession number 48730374).

SEQ ID NO:88 is the ClustalX sequence alignment of the RHS domain_4 extracted from the protein (represented under the GenBank Accession number 28871477).

SEQ ID NO:89 is the ClustalX sequence alignment of the RHS domain_3 extracted from the protein (represented under the GenBank Accession number 66047259).

SEQ ID NO:90 is the ClustalX sequence alignment of the ref NP_995139.1 from the protein (represented under the GenBank Accession number 45443600).

SEQ ID NO:91 is the ClustalX sequence alignment of the RHS domain_1 extracted from the protein (represented under the GenBank Accession number 51597848).

SEQ ID NO:92 is the ClustalX sequence alignment of the RHS domain_1 extracted from the protein (represented under the GenBank Accession number 27479639).

SEQ ID NO:93 is the ClustalX sequence alignment of the RHS domain_1 extracted from the protein (represented under the GenBank Accession number 37524966).

SEQ ID NO:94 is the ClustalX sequence alignment of the RHS domain_1 extracted from the protein (represented under the GenBank Accession number 45441893).

SEQ ID NO:95 is the ClustalX sequence alignment of the RHS domain_1 extracted from the protein (represented under the GenBank Accession number 51596557).

SEQ ID NO:96 is the ClustalX sequence alignment of the RHS domain_2 extracted from the protein (represented under the GenBank Accession number 48730374).

SEQ ID NO:97 is the ClustalX sequence alignment of the RHS domain_2 extracted from the protein (represented under the GenBank Accession number 48730376).

SEQ ID NO:98 is the ClustalX sequence alignment of the RHS domain_2 extracted from the protein (represented under the GenBank Accession number 28871477).

SEQ ID NO:99 is the ClustalX sequence alignment of the RHS domain_1 extracted from the protein (represented under the GenBank Accession number 66047265).

SEQ ID NO:100 is the ClustalX sequence alignment of the RHS domain_1 extracted from the protein (represented under the GenBank Accession number 28871480).

SEQ ID NO:101 is the ClustalX sequence alignment of the RHS domain_2 extracted from the protein (represented under the GenBank Accession number 28868442).

SEQ ID NO:102 is the ClustalX sequence alignment of the RHS domain_2 extracted from the protein (represented under the GenBank Accession number 45443601).

SEQ ID NO:103 is the ClustalX sequence alignment of the RHS domain_2 extracted from the protein (represented under the GenBank Accession number 66044304).

SEQ ID NO:104 is the ClustalX sequence alignment of the RHS domain_3 extracted from the protein (represented under the GenBank Accession number 66045648).

SEQ ID NO:105 is the ClustalX sequence alignment of the RHS domain_3 extracted from the protein (represented under the GenBank Accession number 66047260).

SEQ ID NO:106 is the ClustalX sequence alignment of the RHS domain_2 extracted from the protein (represented under the GenBank Accession number 66043853).

SEQ ID NO:107 is the ClustalX sequence alignment of the RHS domain_1 extracted from the protein (represented under the GenBank Accession number 66045648).

SEQ ID NO:108 is the ClustalX sequence alignment of the RHS domain_1 extracted from the protein (represented under the GenBank Accession number 66047260).

SEQ ID NO:109 is the ClustalX sequence alignment of the RHS domain_1 extracted from the protein (represented under the GenBank Accession number 66047259).

SEQ ID NO:110 is the ClustalX sequence alignment of the RHS domain_1 extracted from the protein (represented under the GenBank Accession number 66047264).

SEQ ID NO:111 is the ClustalX sequence alignment of the RHS domain_3 extracted from the protein (represented under the GenBank Accession number 27479639).

SEQ ID NO:112 is the ClustalX sequence alignment of the RHS domain_4 extracted from the protein (represented under the GenBank Accession number 27479683).

SEQ ID NO:113 is the ClustalX sequence alignment of the RHS domain_2 extracted from the protein (represented under the GenBank Accession number 27479669).

SEQ ID NO:114 is the ClustalX sequence alignment of the RHS domain_4 extracted from the protein (represented under the GenBank Accession number 27479677).

SEQ ID NO:115 is the ClustalX sequence alignment of the RHS domain_4 extracted from the protein (represented under the GenBank Accession number 45441893).

SEQ ID NO:116 is the ClustalX sequence alignment of the RHS domain_3 extracted from the protein (represented under the GenBank Accession number 28871480).

SEQ ID NO:117 is the ClustalX sequence alignment of the RHS domain_4 extracted from the protein (represented under the GenBank Accession number 66047265).

SEQ ID NO:118 is the ClustalX sequence alignment of the RHS domain_4 extracted from the protein (represented under the GenBank Accession number 28868442).

SEQ ID NO:119 is the ClustalX sequence alignment of the RHS domain_4 extracted from the protein (represented under the GenBank Accession number 66043853).

SEQ ID NO:120 is the ClustalX sequence alignment of the RHS domain_2 extracted from the protein (represented under the GenBank Accession number 27479683).

SEQ ID NO:121 is the ClustalX sequence alignment of the RHS domain_3 extracted from the protein (represented under the GenBank Accession number 51597848).

SEQ ID NO:122 is the ClustalX sequence alignment of the RHS domain_2 extracted from the protein (represented under the GenBank Accession number 27479677).

SEQ ID NO:123 is the ClustalX sequence alignment of the RHS domain_2 extracted from the protein (represented under the GenBank Accession number 37524950).

SEQ ID NO:124 is the ClustalX sequence alignment of the RHS domain_3 extracted from the protein (represented under the GenBank Accession number 66047264).

SEQ ID NO:125 is the ClustalX sequence alignment of the RHS domain_4 extracted from the protein (represented under the GenBank Accession number 48730374).

SEQ ID NO:126 is the ClustalX sequence alignment of the RHS domain_4 extracted from the protein (represented under the GenBank Accession number 28871477).

SEQ ID NO:127 is the ClustalX sequence alignment of the RHS domain_3 extracted from the protein (represented under the GenBank Accession number 66047259).

SEQ ID NO:128 is the ClustalX sequence alignment of the ref NP_995139.1 from the protein (represented under the GenBank Accession number 45113600).

SEQ ID NO:129 is the ClustalX sequence alignment of the RHS domain_1 extracted from the protein (represented under the GenBank Accession number 51597848).

SEQ ID NO:130 is the ClustalX sequence alignment of the FG_GAP domain 7 of the protein represented under the GenBank Accession number 48862345.

SEQ ID NO:131 is the ClustalX sequence alignment of the FG_GAP domain 5 of the protein represented under the GenBank Accession number 13475700.

SEQ ID NO:132 is the ClustalX sequence alignment of the FG_GAP domain 3 of the protein represented under the GenBank Accession number 48862345.

SEQ ID NO:133 is the ClustalX sequence alignment of the FG_GAP domain 4 of the protein represented under the GenBank Accession number 48862345.

SEQ ID NO:134 is the ClustalX sequence alignment of the FG_GAP domain 1 of the protein represented under the GenBank Accession number 48862345.

SEQ ID NO:135 is the ClustalX sequence alignment of the FG_GAP domain 2 of the protein represented under the GenBank Accession number 48862345.

SEQ ID NO:136 is the ClustalX sequence alignment of the FG_GAP domain 4 of the protein represented under the GenBank Accession number 13475700.

SEQ ID NO:137 is the ClustalX sequence alignment of the FG_GAP domain 5 of the protein represented under the GenBank Accession number 48862345.

SEQ ID NO:138 is the ClustalX sequence alignment of the FG_GAP domain 9 of the protein represented under the GenBank Accession number 48862345.

SEQ ID NO:139 is the ClustalX sequence alignment of the FG_GAP domain 8 of the protein represented under the GenBank Accession number 48862345.

SEQ ID NO:140 is the ClustalX sequence alignment of the FG_GAP domain 3 of the protein represented under the GenBank Accession number 13475700.

SEQ ID NO:141 is the ClustalX sequence alignment of the FG_GAP domain 6 of the protein represented under the GenBank Accession number 48862345.

SEQ ID NO:142 is the ClustalX sequence alignment of the FG_GAP domain 1 of the protein represented under the GenBank Accession number 13475700.

SEQ ID NO:143 is the ClustalX sequence alignment of the FG_GAP domain 2 of the protein represented under the GenBank Accession number 13475700.

SEQ ID NO:144 is a synthetic peptide as discussed in Example 25.

SEQ ID NO:145 is a synthetic peptide as discussed in Example 25.

DETAILED DESCRIPTION

The subject invention relates in part to the surprising discovery that new types of TC proteins can be obtained from a widely diverse phylogenetic spectrum of organisms including, most notably, eukaryotic fungi. This is the first known disclosure of anti-insect toxins in, for example, Gibberella zeae (formerly known as Fusarium graminearum) and Methanosarcina. These organisms were not known to be insecticidal and were not, heretofore, suspected of possessing genomic segments that encode insecticidally active proteins.

This discovery broadens the scope of organisms in which TC-like genes have been found. Thus, the subject invention generally relates to TC-like proteins obtainable from such species, to methods of screening these species for such proteins, and the like.

Considering the “role” that some source organisms play in nature can also lead to novel approaches for discovering additional TC proteins and genes. For example, Gibberella zeae (formerly known as Fusarium graminearum) is a known plant pathogen. Having the benefit of the subject disclosure, one theory is that microbes that use crops, such as corn, as a food source evolved anti-insect toxins to help them outcompete insects that also feed on the crops. Thus, the subject invention can include methods of screening plant-pathogenic microbes for anti-insect proteins and the like.

This is also the first known discovery of naturally occurring, functionally active two-domain toxin complex (“TC”) proteins, wherein one domain has functional and some level of sequence relatedness to “Class B” TC proteins (as discussed in more detail below), and the other domain has functional and some level of sequence relatedness to “Class C” TC proteins (as discussed in more detail below). As used herein, “B domains,” “B segments,” “C domains,” and “C segments” refer to polypeptide domains or segments having structural and functional similarities to “Class B” and “Class C” TC proteins as discussed in detail in, for example, US-2004-0208907 and WO 2004/067727. Likewise, “Class A” proteins of the subject invention are discussed generally in, for example, US-2004-0208907 and WO 2004/067727.

Although the sequence of the Gibberella zeae genome (for example) was published in GENBANK, heretofore, there was no prior suggestion or expectation that the subject proteins would be active like known TC proteins. For example, the presently identified domains have a very low degree of sequence relatedness to known TC proteins and have unique conformations. The same is true for the bacterial sequences disclosed herein. There was not even a motivation to test these genomic sequences for any activity of hypothetically encoded proteins, considering, for example, the low degree of sequence relatedness, the unique conformations of the proteins, and the organisms possessing the genomic sequences. There was no reason to expect TCs in these sources, let alone active, naturally “fused” proteins like Tcp1_Gz. There was certainly no motivation to clone these genes into plant cells, for example. There was also no motivation to screen culture collections of isolates of these species to determine if the subject genes are more widely present in various strains of these organisms.

One exemplified anti-insect protein (a potentiator of Class A toxins) is referred to herein as Tcp1_Gz. For ease of reference, these two-domain proteins of the subject invention are sometimes referred to herein as “natural fusions” and as Tcp1_Gz-like proteins. The subject invention thus includes these new classes and types of TC proteins. The subject invention also includes polynucleotides that encode the subject proteins. The subject invention further provides vectors and cells comprising these polynucleotides. In some preferred embodiments, the subject invention also provides novel methods of controlling insects and other like pests, using the novel toxin protein of the subject invention.

Having discovered and shown that naturally occurring (but not heretofore “isolated”) two-domain TC proteins of the subject invention are active, one will now be motivated to test and use other naturally occurring, two-domain TC proteins. Methanosarcina is preferred for such embodiments. In addition to Methanosarcina and Gibberella, novel source organisms for use according to the subject invention include species of the genera Treponema, Leptospira, Microbulbifer, Burkholderia, and Nitrosospira.

The subject invention also relates to screening new source organisms for novel Class A type proteins and genes, as disclosed herein. Eukaryotes, fungi, Gibberella, Fusarium, and Aspergillus are some preferred sources, as are bacteria of the genus Burkholderia.

Tcp1_Gz-like (natural fusion) proteins of the subject invention are typically in the molecular weight range of approximately 220 kDa to approximately 295 kDa, although this is just an approximate size range. A preferred weight, for example, is in the approximate range of 280-285 kDa. Another example of a naturally occurring, two-domain/BC-type toxin complex protein is obtainable from Methanosarcina acetivorans str. C2A. The sequences of the native gene and protein are set forth in SEQ ID NOs:7-8.

Another surprising feature of the exemplified Tcp1_Gz protein is that it has an apparent intron. Thus, the subject invention includes isolated TC proteins comprising an intron sequence. The subject invention also includes searching for, identifying, and/or screening for TC proteins that contain intron-like sequences.

The subject invention provides exciting new sources for surprising, new types of toxin complex (“TC”) proteins. Thus, the subject invention relates generally to Gibberella, Fusarium, and Methanosarcina species, for example, that have active TC proteins. The subject invention also includes methods of screening these and other species (some of which are identified herein) for these new classes of TC genes and proteins (as well as for known Class A-, Class B-, and Class C-type TC proteins). The subject invention also includes methods of isolating and/or purifying TC proteins from these species and testing them for toxin activity as disclosed herein. The subject invention further includes preparing and screening libraries of genes cloned (or otherwise produced) from these organisms. In some preferred embodiments, the organisms are eukaryotic. The subject proteins and genes of eukaryotic origin are particularly promising for high levels of expression in plants.

This is the first known report of these organisms having functionally active TC-like proteins, of any kind. This discovery was even more surprising because of the unique, two-domain conformation of proteins of the subject invention. Thus, the subject invention relates to methods of screening these species for TC-like genes and proteins. These pioneering observations have broad implications and thus enable one skilled in the art to screen appropriate species of bacteria and fungi for unique operons of the subject invention.

Tcp1_Gz-like proteins of the subject invention are shown herein to be useful to enhance or potentiate the activity of “stand-alone” Xenorhabdus and/or Photorhabdus “Class A” toxin proteins, for example. One or more TC proteins of the subject invention can be used as a novel element combined with techniques known in the art. See e.g. US-2004-0208907 and WO 2004/067727.

The subject invention also provides novel “Class A”-type TC proteins, which, as a Class, have “stand-alone” toxin activity. See e.g. US-2004-0208907 and WO 2004/067727 for a more detailed explanation. One exemplified Class A gene and protein of this type can be derived from the Gibberella organism disclosed herein. See SEQ ID NOs:9-10.

While the subject TC-like proteins have some sequence relatedness to, and characteristics in common with, TC proteins of Xenorhabdus and Photorhabdus for example, the sequences of the subject TC-like proteins are distinct from previously known TC proteins. Thus, the subject application provides new classes of TC-like proteins and genes that encode these proteins, which are obtainable from bacterial and fungal genera identified and suggested herein.

Other objects, advantages, and features of the subject invention will be apparent to one skilled in the art having the benefit of the subject disclosure.

Administration of the Subject Proteins, and Function, Activity, and Utility Thereof. Individual Class A, Class B, and Class C TC proteins, as the term is used herein, were known in the art. Such proteins include stand-alone toxins (Class A TC proteins) and potentiators (Class B and C TC proteins). Bacteria known to produce TC proteins include those of the following genera: Photorhabdus, Xenorhabdus, Paenibacillus, Serratia, and Pseudomonas. See, e.g., Pseudomonas syringae pv. syringae B728a (GenBank Accession Numbers gi:23470933 and gi:23472543).

As mentioned above in the Background section, although “Toxin A” proteins have some insecticidal activity, alone, the high insecticidal potency of the “A+B+C” complex is much preferred for commercial applications of TC proteins. However, the exact mechanism(s) of action of TC proteins remains unknown. Likewise, it is unknown exactly how (and if) each of the A, B, and C components interact with each other. Thus, there was no way to a priori predict whether proteins of the subject invention would allow for proper functioning in the insect gut.

It came as a surprise that the subject proteins were found to be highly effective for controlling insects. There was no expectation that the subject natural fusion proteins would be active (i.e., toxic in combination with a Class A TC protein) after ingestion by the target insect. It is shown herein that the subject proteins surprisingly function quite well in the insect gut.

The subject invention can be performed in many different ways. For example, a plant can be engineered to produce one or more types of Class A TC proteins together with a Tcp1_Gz-type protein of the subject invention, the latter of which potentiates the activity of the Class A TC protein. Every cell of the plant, or every cell in a given type of tissue (such as roots or leaves) can be designed to have genes to encode the A proteins and the Tcp1_Gz-type protein. Alternatively, different cells of the plant can produce only one (or more) of each of these proteins. In this situation, when an insect bites and eats tissues of the plant, it could eat a cell that produces a first Class A TC protein, another cell that produces a second Class A TC protein, and another cell that produces the Tcp1_Gz-type protein. Thus, the plant (not necessarily each plant cell) can produce one or more types of Class A TC proteins and the Tcp1_Gz-type protein of the subject invention so that insect pests eat all these types of proteins when they eat tissue of the plant.

Aside from transgenic plants, there are many other ways of administering the proteins, in a combination of the subject invention, to the target pest. Spray-on applications are known in the art. Some or all of the Class A and Tcp1_Gz-type proteins can be sprayed (the plant could produce one or more of the proteins and the others could be sprayed). Various types of bait granules for soil applications, for example, are also known in the art and can be used according to the subject invention.

The present invention provides easily administered, functional proteins. The present invention also provides a method for delivering insecticidal proteins that are functionally active and effective against many orders of insects, preferably lepidopteran and/or coleopteran insects. By “functional activity” (or “active against”) it is meant herein that the proteins function as orally active insect control agents (alone or in combination with other proteins), that the proteins have a toxic effect (alone or in combination with other proteins), or are able to disrupt or deter insect growth and/or feeding which may or may not cause death of the insect. When an insect comes into contact with an “effective amount” of an “insecticidal protein” of the subject invention delivered via transgenic plant expression, formulated protein composition(s), sprayable protein composition(s), a bait matrix or other delivery system, the results are typically death of the insect, inhibition of the growth and/or proliferation of the insect, and/or prevention of the insects from feeding upon the source (preferably a transgenic plant) that makes the proteins available to the insects.

Thus, insects that ingest an effective amount of a Class A TC protein and a Tcp1_Gz-type protein, for example, can be deterred from feeding, have their growth stunted, and/or be killed, for example. A Tcp1_Gz-type protein of the invention has “functionality” or toxin activity if it enhances the functional activity of a Class A TC protein when used in combination therewith.

Complete lethality to feeding insects is preferred, but is not required to achieve functional activity. If an insect avoids the protein or ceases feeding, that avoidance will be useful in some applications, even if the effects are sublethal or lethality is delayed or indirect. For example, if insect resistant transgenic plants are desired, the reluctance of insects to feed on the plants is as useful as lethal toxicity to the insects because the ultimate objective is avoiding insect-induced plant damage.

Transfer of the functional activity to plant, bacterial, or other systems typically requires nucleic acid sequences, encoding the amino acid sequences for the toxins, integrated into a protein expression vector appropriate to the host in which the vector will reside. One way to obtain a nucleic acid sequence encoding a protein with functional activity is to isolate the native genetic material from the native source species that produces the toxins, using information deduced from the toxin's amino acid sequence, as disclosed herein. The native sequences can be optimized for expression in plants, for example, as discussed in more detail below. Optimized polynucleotides can also be designed based on the protein sequence.

There are many other ways in which TC proteins can be incorporated into an insect's diet. For example, it is possible to adulterate the larval food source with the toxic protein by spraying the food with a protein solution, as disclosed herein. Alternatively, a gene encoding the protein could be genetically engineered into an otherwise harmless bacterium, which could then be grown in culture, and either applied to the food source or allowed to reside in the soil in an area in which insect eradication was desirable. Also, DNA for producing the protein could be genetically engineered directly into an insect food source. For instance, the major food source for many insect larvae is plant material. Therefore the genes encoding toxins can be transferred to plant material so that said plant material produces the toxin of interest.

When Tcp1_Gz-type proteins of the subject invention are said to have two domains, it should be noted that this does not exclude the existence of various subdomains, regions, and protein motifs, for example, in each of the two main domains. In addition, as the two main domains have homology to Class B and Class C TC proteins, respectively, and given that Tcp1_Gz-type proteins of the subject invention are shown herein to function like and to be useful like Class B and Class C TC proteins, the subject invention includes the use of either or both domains of the Tcp1_Gz-type proteins individually. That is, the Class C-like domain of a Tcp1_Gz-type protein can be used with a Xenorhabdus or Photorhabdus Class B protein, for example. The same is true for the Class B-like domain of Tcp1_Gz-type protein. Various methods for cutting proteins and corresponding DNA, to isolate and re-ligate fragments of interest, are described below in the section entitled “Modification of genes and proteins” for example. (Such DNA and protein fragments, for example, are within the scope of the subject invention.) A great number of possible combinations and utilities can be envisioned. For example, in some embodiments, a fragment of a Tcp1_Gz-type protein (preferably a B domain fragment or a C domain fragment) can be isolated (from the remaining fragment), swapped (fused or unfused) and “mixed and matched” according to the teachings of US-2004-0208907 and WO 2004/067727. (Any of the Class B and Class C sequences, for example, disclosed therein can also be used to define embodiments of the subject invention. For example, in the full-length sequences exemplified herein, the Class B and Class C domains can be identified, by comparison to sequences in US-2004-0208907 and WO 2004/067727, and used separately accordingly.) As discussed below, a Class C domain of the subject invention can also be synthetically ligated to a Class B TC protein. Likewise, a Class B domain can be synthetically ligated to a Class C TC protein.

Ligations and Other Terminology and Definitions. Tcp1_Gz-type proteins of the invention can be ligated to Class A TC proteins. See e.g. U.S. Ser. No. 60/549,516, filed Mar. 2, 2004. As mentioned above, other possibilities are that a Class B and/or a Class C domain of the subject invention (corresponding fragments of two-domain proteins of the subject invention) can be synthetically ligated to another TC protein. See e.g. U.S. Ser. No. 60/549,502, filed Mar. 2, 2004. As used herein, it is understood that ligation of normally separate proteins or protein domains can be brought about as a consequence of translation of a polynucleotide containing coding sequence regions that encode the amino acid sequences of the normally separate proteins or protein domains.

As used herein, the terms “linker” and “linker sequence” refer to nucleotides used to join a first protein coding region to a subsequent, immediately following protein coding region, such that both the first and second (and/or subsequent) protein coding regions form a single longer protein coding region in the +1 reading frame, as defined by the open reading frame of the first protein coding region. Such linker or linker sequence therefore cannot include translation termination codons in the +1 reading frame. As a consequence of translation of the linker or linker sequence, the protein encoded by the first protein coding region is joined by one or more amino acids to the protein encoded by the second protein coding region. A linker is optional, as the polypeptide components can be ligated directly, without a linker sequence.
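By way of illustration only, the reading-frame requirement stated above for a linker can be sketched in Python. The function name `linker_keeps_frame` is hypothetical and does not appear elsewhere in this disclosure. The stop codons are those of the standard genetic code, and the sketch assumes that the first protein coding region ends on a complete codon with its own termination codon removed.

```python
# Stop codons of the standard genetic code (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def linker_keeps_frame(linker):
    """Return True if the linker preserves the +1 reading frame.

    Assumption: the upstream coding region ends on a complete codon, so the
    first base of the linker is the first base of a codon.
    """
    linker = linker.upper()
    # A length that is not a multiple of three would shift the downstream frame.
    if len(linker) % 3 != 0:
        return False
    codons = [linker[i:i + 3] for i in range(0, len(linker), 3)]
    # No translation termination codon is permitted in the +1 frame.
    return not any(codon in STOP_CODONS for codon in codons)


print(linker_keeps_frame("GGTGGCGGTTCT"))  # in-frame, no stop codon
print(linker_keeps_frame("GGTTAAGGT"))     # contains an in-frame TAA
```

An empty linker passes this check, consistent with the statement that a linker is optional. A stop triplet that occurs only out of frame (for example, the TAA spanning two codons in GTAAGG) does not cause rejection.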

As used herein, reference to “isolated” polynucleotides and/or proteins, and “purified” proteins refers to these molecules when they are not associated with the other molecules with which they would be found in nature. Thus, reference to “isolated” and/or “purified” signifies the involvement of the “hand of man” as described herein. For example, a bacterial or fungal polynucleotide (or “gene”) of the subject invention put into a plant for expression is an “isolated polynucleotide.” Likewise, a protein of the subject invention when produced by a plant is an “isolated protein.” The term “ligated” can also be used to signify involvement of the “hand of man.” That is, one polypeptide component (such as a Tcp1_Gz-type protein) can be synthetically joined or “ligated” to another polypeptide component (such as a Class A protein) to form a fusion protein of the subject invention.

A “recombinant” molecule refers to a molecule that has been recombined. When made in reference to a nucleic acid molecule, the term refers to a molecule that is comprised of nucleic acid sequences that are joined together by means of molecular biological techniques. The term “recombinant” when made in reference to a protein or a polypeptide refers to a protein molecule that is produced using one or more recombinant nucleic acid molecules.

The term “heterologous” when made in reference to a nucleic acid sequence refers to a nucleotide sequence that is ligated to, or is manipulated to become ligated to, a nucleic acid sequence to which it is not joined in nature, or to which it is joined at a different location in nature. The term “heterologous” therefore indicates that the nucleic acid molecule has been manipulated using genetic engineering, i.e. by human intervention. Thus, a gene of the subject invention can be operably linked to a heterologous promoter (or a “transcriptional regulatory region” which means a nucleotide sequence capable of mediating or modulating transcription of a nucleotide sequence of interest, when the transcriptional regulatory region is operably linked to the sequence of interest). Preferred heterologous promoters can be plant promoters. A promoter and/or a transcriptional regulatory region and a sequence of interest are “operably linked” when the sequences are functionally connected so as to permit transcription of the sequence of interest to be mediated or modulated by the transcriptional regulatory region. In some embodiments, to be operably linked, a transcriptional regulatory region may be located on the same strand as the sequence of interest. The transcriptional regulatory region may in some embodiments be located 5′ of the sequence of interest. In such embodiments, the transcriptional regulatory region may be directly 5′ of the sequence of interest or there may be intervening sequences between these regions. The operable linkage of the transcriptional regulatory region and the sequence of interest may require appropriate molecules (such as transgenic activator proteins) to be bound to the transcriptional regulatory region; the invention therefore encompasses embodiments in which such molecules are provided, either in vitro or in vivo.

There are a number of methods for obtaining the proteins for use according to the subject invention. For example, antibodies to the proteins disclosed herein can be used to identify and isolate other proteins from a mixture. Specifically, antibodies may be raised to the portions of the proteins that are most constant and most distinct from other proteins. These antibodies can then be used to specifically identify equivalent proteins with the characteristic activity by immunoprecipitation, enzyme linked immunosorbent assay (ELISA), or immuno-blotting. Antibodies to the proteins disclosed herein, or to equivalent proteins, or to fragments of these proteins, can be readily prepared using standard procedures. Such antibodies are an aspect of the subject invention. Proteins of the subject invention can be obtained from a variety of sources/source microorganisms.

One skilled in the art would readily recognize that proteins (and genes) of the subject invention can be obtained from a variety of sources. A protein “from” or “obtainable from” any of the subject isolates referred to or suggested herein means that the protein (or a similar protein) can be obtained from the exemplified isolate or some other source, such as another fungal or bacterial strain, or a plant (for example, a plant engineered to produce the protein). “Derived from” also has this connotation, and includes polynucleotides (and proteins) obtainable from a given type of fungus or bacterium wherein the polynucleotide is modified for expression in a plant, for example. One skilled in the art will readily recognize that, given the disclosure of a microbial gene and protein, a plant can be engineered to produce the protein. Antibody preparations, nucleic acid probes (DNA and RNA), and the like may be prepared using the polynucleotide and/or amino acid sequences disclosed herein and used to screen and recover other protein genes from other (natural) sources.

Identification of Proteins and Genes of the Subject Invention. Proteins and genes for use according to the subject invention can be identified and obtained by using oligonucleotide probes, for example. These probes are detectable nucleotide sequences which may be detectable by virtue of an appropriate label or may be made inherently fluorescent as described in International Application No. WO 93/16094. The probes (and the polynucleotides of the subject invention) may be DNA, RNA, or PNA. In addition to adenine (A), cytosine (C), guanine (G), thymine (T), and uracil (U; for RNA molecules), synthetic probes (and polynucleotides) of the subject invention can also have inosine (a neutral base capable of pairing with all four bases; sometimes used in place of a mixture of all four bases in synthetic probes). Thus, where a synthetic, degenerate oligonucleotide is referred to herein, and “N” or “n” is used generically, “N” or “n” can be G, A, T, C, or inosine. Ambiguity codes as used herein are in accordance with standard IUPAC naming conventions as of the filing of the subject application (for example, R means A or G, Y means C or T, etc.).
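For illustration, the standard IUPAC ambiguity codes referred to above can be used to enumerate the unambiguous sequences represented by a degenerate oligonucleotide. The following Python sketch is not part of the exemplified methods, and the names `expand_degenerate` and `degeneracy` are hypothetical. Inosine is deliberately not expanded, because it is a single base rather than a mixture of four.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes (DNA alphabet).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_degenerate(oligo):
    """Return every unambiguous sequence covered by a degenerate oligo."""
    pools = [IUPAC[base] for base in oligo.upper()]
    return ["".join(bases) for bases in product(*pools)]


def degeneracy(oligo):
    """Return the number of distinct sequences in the degenerate pool."""
    count = 1
    for base in oligo.upper():
        count *= len(IUPAC[base])
    return count


print(expand_degenerate("ARY"))  # R is A or G; Y is C or T
print(degeneracy("ATGN"))
```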

As is well known in the art, if a probe molecule hybridizes with a nucleic acid sample, it can be reasonably assumed that the probe and sample have substantial homology/similarity/identity. Preferably, hybridization of the polynucleotide is first conducted followed by washes under conditions of low, moderate, or high stringency by techniques well-known in the art, as described in, for example, Keller, G. H., M. M. Manak (1987) DNA Probes, Stockton Press, New York, N.Y., pp. 169-170. For example, as stated therein, low stringency conditions can be achieved by first washing with 2×SSC (Standard Saline Citrate)/0.1% SDS (Sodium Dodecyl Sulfate) for 15 minutes at room temperature. Two washes are typically performed. Higher stringency can then be achieved by lowering the salt concentration and/or by raising the temperature. For example, the wash described above can be followed by two washings with 0.1×SSC/0.1% SDS for 15 minutes each at room temperature followed by subsequent washes with 0.1×SSC/0.1% SDS for 30 minutes each at 55° C. These temperatures can be used with other hybridization and wash protocols set forth herein and as would be known to one skilled in the art (SSPE can be used as the salt instead of SSC, for example). The 2×SSC/0.1% SDS can be prepared by adding 50 ml of 20×SSC and 5 ml of 10% SDS to 445 ml of water. 20×SSC can be prepared by combining NaCl (175.3 g; 0.150 M at 1×), sodium citrate (88.2 g; 0.015 M at 1×), and water, adjusting pH to 7.0 with 10 N NaOH, then adjusting the volume to 1 liter. 10% SDS can be prepared by dissolving 10 g of SDS in 50 ml of autoclaved water, then diluting to 100 ml.
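The wash-buffer recipe above follows from the ordinary dilution relationship (stock concentration × stock volume = final concentration × final volume). As a minimal sketch only, the arithmetic can be written in Python as follows; the function name `dilution_volumes` and its argument layout are hypothetical.

```python
def dilution_volumes(final_ml, components):
    """Return volumes (ml) of each stock, and of water, for a dilution.

    components maps a stock name to (stock_concentration, final_concentration),
    both expressed in the same units for that stock (for example, "x" for SSC
    and percent for SDS).
    """
    volumes = {}
    for name, (stock, final) in components.items():
        volumes[name] = final_ml * final / stock
    # Water makes up the remainder of the final volume.
    volumes["water"] = final_ml - sum(volumes.values())
    return volumes


# 500 ml of 2xSSC / 0.1% SDS from 20xSSC and 10% SDS stocks.
wash = dilution_volumes(500, {"20xSSC": (20, 2), "10% SDS": (10, 0.1)})
print({name: round(ml, 1) for name, ml in wash.items()})
```

For a 500 ml final volume this gives the 50 ml, 5 ml, and 445 ml volumes stated in the recipe. The same function could be applied to the 0.1×SSC/0.1% SDS wash by changing the final SSC concentration.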

Detection of the probe provides a means for determining in a known manner whether hybridization has been maintained. Such a probe analysis provides a rapid method for identifying toxin-encoding genes of the subject invention. The nucleotide segments which are used as probes according to the invention can be synthesized using a DNA synthesizer and standard procedures. These nucleotide sequences can also be used as PCR primers to amplify genes of the subject invention.

Hybridization with a given polynucleotide is a technique that can be used to identify, find, and/or define proteins and genes of the subject invention. As used herein, “stringent” conditions for hybridization refers to conditions which achieve the same, or about the same, degree of specificity of hybridization as the conditions employed by the current applicants. Specifically, hybridization of immobilized DNA on Southern blots with ³²P-labeled gene-specific probes was performed by standard methods (see, e.g., Maniatis, T., E. F. Fritsch, J. Sambrook [1982] Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory, Cold Spring Harbor, N.Y.). In general, hybridization and subsequent washes were carried out under conditions that allowed for detection of target sequences. For double-stranded DNA gene probes, hybridization was carried out overnight at 20-25° C. below the melting temperature (Tm) of the DNA hybrid in 6×SSPE, 5× Denhardt's solution, 0.1% SDS, 0.1 mg/ml denatured DNA. The melting temperature is described by the following formula (Beltz, G. A., K. A. Jacobs, T. H. Eickbush, P. T. Cherbas, and F. C. Kafatos [1983] Methods in Enzymology, R. Wu, L. Grossman and K. Moldave [eds.] Academic Press, New York 100:266-285):

Tm = 81.5° C. + 16.6 Log[Na+] + 0.41(% G+C) − 0.61(% formamide) − 600/length of duplex in base pairs.

Washes are typically carried out as follows:

- 1) Twice at room temperature for 15 minutes in 1×SSPE, 0.1% SDS (low stringency wash).
- 2) Once at Tm-20° C. for 15 minutes in 0.2×SSPE, 0.1% SDS (moderate stringency wash).
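The melting-temperature formula given above for double-stranded DNA probes can be evaluated as in the following Python sketch, which is offered for illustration only. It assumes that “Log” denotes the base-10 logarithm and that [Na+] is expressed in moles per liter. The function names and the input values are hypothetical and are not taken from the experiments reported herein.

```python
import math


def duplex_tm(na_molar, gc_percent, length_bp, formamide_percent=0.0):
    """Melting temperature (deg C) of a DNA duplex per the formula above.

    Assumptions: Log is log10, and na_molar is the monovalent cation
    concentration in mol/l.
    """
    return (81.5
            + 16.6 * math.log10(na_molar)
            + 0.41 * gc_percent
            - 0.61 * formamide_percent
            - 600.0 / length_bp)


def hybridization_window(tm):
    """Return (low, high) for hybridization at 20-25 deg C below Tm."""
    return (tm - 25.0, tm - 20.0)


# Hypothetical probe: 600 bp, 50% G+C, 1.0 M Na+, no formamide.
tm = duplex_tm(na_molar=1.0, gc_percent=50.0, length_bp=600)
low, high = hybridization_window(tm)
moderate_wash = tm - 20.0  # temperature of the moderate stringency wash
print(round(tm, 1), round(low, 1), round(high, 1), round(moderate_wash, 1))
```

Lowering the salt concentration or adding formamide lowers the computed Tm, so the hybridization and wash temperatures shift with it.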

For oligonucleotide probes, hybridization was carried out overnight at 10-20° C. below the melting temperature (Tm) of the hybrid in 6×SSPE, 5× Denhardt's solution, 0.1% SDS, 0.1 mg/ml denatured DNA. Tm for oligonucleotide probes was determined by the following formula: Tm (° C.)=2(number T/A base pairs)+4(number G/C base pairs) (Suggs, S. V., T. Miyake, E. H. Kawashima, M. J. Johnson, K. Itakura, and R. B. Wallace [1981] ICN-UCLA Symp. Dev. Biol. Using Purified Genes, D. D. Brown [ed.], Academic Press, New York, 23:683-693).
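As a minimal sketch, the oligonucleotide formula above (2° C. per T/A base pair plus 4° C. per G/C base pair) can be computed as follows in Python. The function names and the example sequence are hypothetical, and the sketch assumes an unambiguous probe containing only A, C, G, and T.

```python
def oligo_tm(seq):
    """Tm (deg C) of an oligonucleotide: 2 per A/T plus 4 per G/C."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc != len(seq):
        raise ValueError("sequence must contain only A, C, G, and T")
    return 2 * at + 4 * gc


def oligo_hybridization_window(tm):
    """Return (low, high) for hybridization at 10-20 deg C below Tm."""
    return (tm - 20, tm - 10)


probe = "ATGCATGCATGCATGCAT"  # hypothetical 18-mer
tm = oligo_tm(probe)
print(tm, oligo_hybridization_window(tm))
```

Degenerate positions or inosine would need separate handling, which is why the sketch rejects them rather than guessing a contribution.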

Washes were typically carried out as follows:

- 1) Twice at room temperature for 15 minutes in 1×SSPE, 0.1% SDS (low stringency wash).
- 2) Once at the hybridization temperature for 15 minutes in 1×SSPE, 0.1% SDS (moderate stringency wash).

In general, salt and/or temperature can be altered to change stringency. With a labeled DNA fragment >70 or so bases in length, the following conditions can be used:

- Low: 1 or 2×SSPE, room temperature
- Low: 1 or 2×SSPE, 42° C.
- Moderate: 0.2× or 1×SSPE, 65° C.
- High: 0.1×SSPE, 65° C.

Duplex formation and stability depend on substantial complementarity between the two strands of a hybrid, and, as noted above, a certain degree of mismatch can be tolerated. Therefore, the probe sequences of the subject invention include mutations (both single and multiple), deletions, insertions of the described sequences, and combinations thereof, wherein said mutations, insertions and deletions permit formation of stable hybrids with the target polynucleotide of interest. Mutations, insertions, and deletions can be produced in a given polynucleotide sequence in many ways, and these methods are known to an ordinarily skilled artisan. Other methods may become known in the future.

PCR technology. Polymerase Chain Reaction (PCR) is a repetitive, enzymatic, primed synthesis of a nucleic acid sequence. This procedure is well known and commonly used by those skilled in this art (see Mullis, U.S. Pat. Nos. 4,683,195, 4,683,202, and 4,800,159; Saiki, Randall K., Stephen Scharf, Fred Faloona, Kary B. Mullis, Glenn T. Horn, Henry A. Erlich, Norman Arnheim [1985] “Enzymatic Amplification of β-Globin Genomic Sequences and Restriction Site Analysis for Diagnosis of Sickle Cell Anemia,” Science 230:1350-1354). PCR is based on the enzymatic amplification of a DNA fragment of interest that is flanked by two oligonucleotide primers that hybridize to opposite strands of the target sequence. The primers are oriented with the 3′ ends pointing towards each other. Repeated cycles of heat denaturation of the template, annealing of the primers to their complementary sequences, and extension of the annealed primers with a DNA polymerase result in the amplification of the segment defined by the 5′ ends of the PCR primers. The extension product of each primer can serve as a template for the other primer, so each cycle essentially doubles the amount of DNA fragment produced in the previous cycle. This results in the exponential accumulation of the specific target fragment, up to several million-fold in a few hours. By using a thermostable DNA polymerase such as Taq polymerase, isolated from the thermophilic bacterium Thermus aquaticus, the amplification process can be completely automated. Other enzymes that can be used are known to those skilled in the art.
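The doubling per cycle described above can be expressed numerically. The following Python sketch is illustrative only. The `efficiency` parameter is an assumption introduced here (1.0 corresponds to perfect doubling each cycle), and the sketch ignores the plateau that limits real reactions.

```python
import math


def fold_amplification(cycles, efficiency=1.0):
    """Fold increase in target after a number of cycles.

    efficiency is the fraction of templates copied per cycle (assumption:
    constant across cycles); 1.0 means the amount doubles each cycle.
    """
    return (1.0 + efficiency) ** cycles


def cycles_needed(target_fold, efficiency=1.0):
    """Smallest whole number of cycles reaching at least target_fold."""
    return math.ceil(math.log(target_fold) / math.log(1.0 + efficiency))


print(fold_amplification(20))     # ideal doubling for 20 cycles
print(cycles_needed(1e6))         # cycles for a million-fold increase
print(cycles_needed(1e6, 0.9))    # same target at 90% efficiency
```

Under ideal doubling, about 20 cycles suffice for a million-fold increase, which is consistent with the several million-fold accumulation described above.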

The DNA sequences of the subject invention can be used as primers for PCR amplification. In performing PCR amplification, a certain degree of mismatch can be tolerated between primer and template. Therefore, mutations, deletions, and insertions (especially additions of nucleotides to the 5′ end) of the exemplified primers fall within the scope of the subject invention. Mutations, insertions, and deletions can be produced in a given primer by methods known to an ordinarily skilled artisan.

Modification of genes and proteins. The genes and proteins useful according to the subject invention include not only the specifically exemplified full-length sequences, but also portions, segments and/or fragments (including internal and/or terminal deletions compared to the full-length molecules) of these sequences, variants, mutants, chimerics, and fusions thereof. Proteins used in the subject invention can have substituted amino acids so long as they retain the characteristic pesticidal/functional activity of the proteins specifically exemplified herein. “Variant” genes have nucleotide sequences that encode the same proteins or equivalent proteins having functionality equivalent to an exemplified protein. The terms “variant proteins” and “equivalent proteins” refer to proteins having the same or essentially the same biological/functional activity as the exemplified proteins. As used herein, reference to an “equivalent” sequence refers to sequences having amino acid substitutions, deletions, additions, or insertions that improve or do not adversely affect functionality. Fragments retaining functionality are also included in this definition. Fragments and other equivalents that retain the same or similar function as a corresponding fragment of an exemplified protein are within the scope of the subject invention. Changes, such as amino acid substitutions or additions, can be made for a variety of purposes, such as increasing (or decreasing) protease stability of the protein (without materially/substantially decreasing the functionality of the protein).

Variations of genes may be readily constructed using standard techniques for making point mutations, for example. In addition, U.S. Pat. No. 5,605,793, for example, describes methods for generating additional molecular diversity by using DNA reassembly after random fragmentation. Variant genes can be used to produce variant proteins; recombinant hosts can be used to produce the variant proteins. Using these “gene shuffling” techniques, equivalent genes and proteins can be constructed that comprise any 5, 10, or 20 contiguous residues (amino acid or nucleotide) of any sequence exemplified herein.

Fragments of full-length genes can be made using commercially available exonucleases or endonucleases according to standard procedures. For example, enzymes such as Bal31, or techniques such as site-directed mutagenesis, can be used to systematically cut off nucleotides from the ends of these genes. Also, genes that encode active fragments may be obtained using a variety of restriction enzymes. Proteases may be used to directly obtain active fragments of these proteins.

It is within the scope of the invention as disclosed herein that TC proteins may be truncated and still retain functional activity. By “truncated protein” it is meant that a portion of a protein may be cleaved and yet still exhibit activity after cleavage. Cleavage can be achieved by proteases inside or outside of the insect gut. Furthermore, effectively cleaved proteins can be produced using molecular biology techniques wherein the DNA bases encoding said protein are removed either through digestion with restriction endonucleases or other techniques available to the skilled artisan. After truncation, said proteins can be expressed in heterologous systems such as Escherichia coli, baculoviruses, plant-based viral systems, yeast and the like and then placed in insect assays as disclosed herein to determine activity. It is well-known in the art that truncated proteins can be successfully produced so that they retain functional activity while having less than the entire, full-length sequence. It is well known in the art that B.t. toxins can be used in a truncated (core toxin) form. See, e.g., Adang et al., Gene 36:289-300 (1985), “Characterized full-length and truncated plasmid clones of the crystal protein of Bacillus thuringiensis subsp kurstaki HD-73 and their toxicity to Manduca sexta.” There are other examples of truncated proteins that retain insecticidal activity, including the insect juvenile hormone esterase (U.S. Pat. No. 5,674,485 to the Regents of the University of California). As used herein, the term “toxin” is also meant to include functionally active truncations.

Because of the degeneracy/redundancy of the genetic code, a variety of different DNA sequences can encode the amino acid sequences disclosed herein. It is well within the skill of a person trained in the art to create alternative DNA sequences that encode the same, or essentially the same, toxins. These variant DNA sequences are within the scope of the subject invention.
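To illustrate the extent of this degeneracy, the number of distinct DNA sequences that encode a given amino acid sequence can be counted by multiplying the number of codons available for each residue. The Python sketch below uses the codon counts of the standard genetic code; the function name `count_encodings` and the example peptides are hypothetical.

```python
# Number of codons per amino acid in the standard genetic code
# (one-letter amino acid codes; the counts sum to 61 sense codons).
CODON_COUNTS = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def count_encodings(peptide):
    """Return the number of distinct coding sequences for a peptide."""
    total = 1
    for residue in peptide.upper():
        total *= CODON_COUNTS[residue]
    return total


print(count_encodings("MKL"))  # 1 x 2 x 6 codon choices
print(count_encodings("LRS"))  # three six-codon residues
```

Because the count grows multiplicatively with length, a full-length protein of the sizes discussed herein can be encoded by an astronomically large number of alternative DNA sequences, which is what permits codon choices to be tailored to a particular expression host.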

The subject invention includes, for example:

1) proteins obtained from wild type organisms;

2) variants arising from mutations;

3) variants designed by making conservative amino acid substitutions; and

4) variants produced by random fragmentation and reassembly of a plurality of different sequences that encode the subject TC proteins (DNA shuffling). See e.g. U.S. Pat. No. 5,605,793.

The DNA sequences encoding the subject proteins can be wild type sequences, mutant sequences, or synthetic sequences designed to express a predetermined protein. DNA sequences designed to be highly expressed in plants by, for example, avoiding polyadenylation signals, and using plant preferred codons, are particularly useful.

Certain proteins and genes have been specifically exemplified herein. As these proteins and genes are merely exemplary, it should be readily apparent that the subject invention comprises use of variant or equivalent proteins (and nucleotide sequences coding for equivalents thereof) having the same or similar functionality as the exemplified proteins. Equivalent proteins will have amino acid similarity (and/or homology) with an exemplified TC protein. Preferred polynucleotides and proteins of the subject invention can be defined in terms of narrower identity and/or similarity ranges. For example, the identity and/or similarity of the Class A, B, and/or C TC protein can be 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, or 99% as compared to a sequence exemplified or suggested herein and, the identity and/or similarity of the Class C TC protein can be 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, or 99% as compared to a sequence exemplified or suggested herein. Any number listed above can be used to define the upper and lower limits. For example, a protein of the subject invention can be defined as having 50-90% identity, for example, with an exemplified protein.

Unless otherwise specified, as used herein, percent sequence identity and/or similarity of two nucleic acids is determined using the algorithm of Karlin and Altschul (1990), Proc. Natl. Acad. Sci. USA 87:2264-2268, modified as in Karlin and Altschul (1993), Proc. Natl. Acad. Sci. USA 90:5873-5877. Such an algorithm is incorporated into the NBLAST and XBLAST programs of Altschul et al. (1990), J. Mol. Biol. 215:402-410. BLAST nucleotide searches are performed with the NBLAST program, score=100, wordlength=12. Gapped BLAST can be used as described in Altschul et al. (1997), Nucl. Acids Res. 25:3389-3402. When utilizing BLAST and Gapped BLAST programs, the default parameters of the respective programs (NBLAST and XBLAST) are used. See NCBI/NIH website. The scores can also be calculated using the methods and algorithms of Crickmore et al. as described in the Background section, above.

To obtain gapped alignments for comparison purposes, the AlignX function of Vector NTI Suite 8 (InforMax, Inc., North Bethesda, Md., U.S.A.) was used with the default parameters. These were: a Gap opening penalty of 15, a Gap extension penalty of 6.66, and a Gap separation penalty range of 8. Two or more sequences can be aligned and compared in this manner or using other techniques that are well-known in the art. By analyzing such alignments, relatively conserved and non-conserved areas of the subject polypeptides can be identified. This can be useful, for example, for assessing whether changing a polypeptide sequence by modifying or substituting one or more amino acid residues can be expected to be tolerated.
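
For illustration, once two sequences have been aligned (by AlignX, BLAST, or any other tool), a percent identity can be computed from the gapped alignment and compared against a chosen range such as 50-90%. The sketch below is not the AlignX or BLAST computation; it assumes that identity is counted over all alignment columns except those that are gaps in both sequences, which is only one of several conventions in use. The aligned strings are made-up examples.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity of two pre-aligned, equal-length sequences.

    Assumption for this sketch: the denominator is the number of alignment
    columns, excluding columns that are gaps in both sequences.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    columns = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap and y == gap:
            continue  # a column that is a gap in both rows carries no information
        columns += 1
        if x == y:
            matches += 1  # identical residues (x and y cannot both be gaps here)
    if columns == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / columns


def within_range(value, lower, upper):
    """True if `value` lies in the inclusive range [lower, upper]."""
    return lower <= value <= upper


# Made-up aligned fragments; the first carries a single-residue gap.
identity = percent_identity("ACDE-G", "ACDEFG")
in_claimed_range = within_range(identity, 50, 90)
```

Because the result depends on the denominator convention and on the alignment parameters, values from different tools are not directly interchangeable.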

The amino acid homology/similarity/identity will typically (but not necessarily) be highest in regions of the protein that account for its activity or that are involved in the determination of three-dimensional configurations that are ultimately responsible for the activity. In this regard, certain amino acid substitutions are acceptable and can be expected to be tolerated. For example, these substitutions can be in regions of the protein that are not critical to activity. Analyzing the crystal structure of a protein, and software-based protein structure modeling, can be used to identify regions of a protein that can be modified (using site-directed mutagenesis, shuffling, etc.) to actually change the properties and/or increase the functionality of the protein.

Various properties and three-dimensional features of the protein can also be changed without adversely affecting the toxin activity/functionality of the protein. Conservative amino acid substitutions can be expected to be tolerated/to not adversely affect the three-dimensional configuration of the molecule. Amino acids can be placed in the following classes: non-polar, uncharged polar, basic, and acidic. Conservative substitutions whereby an amino acid of one class is replaced with another amino acid of the same class fall within the scope of the subject invention so long as the substitution is not adverse to the biological activity of the compound. Table 1 provides a listing of examples of amino acids belonging to each class.

TABLE 1

Classes of amino acids.

Class of Amino Acid    Examples of Amino Acids
Nonpolar               Ala, Val, Leu, Ile, Pro, Met, Phe, Trp
Uncharged Polar        Gly, Ser, Thr, Cys, Tyr, Asn, Gln
Acidic                 Asp, Glu
Basic                  Lys, Arg, His
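
The classes of Table 1 can be expressed as a simple lookup, sketched below, that reports whether a proposed substitution stays within one class. The function name is an assumption for illustration. The check reflects only the class membership listed in Table 1; whether a particular substitution is actually tolerated must still be confirmed by assaying the activity of the resulting protein.

```python
# Amino acid classes exactly as listed in Table 1 (three-letter codes).
AA_CLASSES = {
    "Nonpolar": {"Ala", "Val", "Leu", "Ile", "Pro", "Met", "Phe", "Trp"},
    "Uncharged Polar": {"Gly", "Ser", "Thr", "Cys", "Tyr", "Asn", "Gln"},
    "Acidic": {"Asp", "Glu"},
    "Basic": {"Lys", "Arg", "His"},
}

# Reverse lookup: amino acid -> class name.
CLASS_OF = {aa: cls for cls, members in AA_CLASSES.items() for aa in members}


def is_conservative(original, replacement):
    """True if both residues fall in the same Table 1 class."""
    return CLASS_OF[original] == CLASS_OF[replacement]


leu_to_ile = is_conservative("Leu", "Ile")   # both Nonpolar
asp_to_lys = is_conservative("Asp", "Lys")   # Acidic versus Basic
```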

In some instances, non-conservative substitutions can also be made. The critical factor is that these substitutions must not significantly detract from the functional/biological/toxin activity of the protein.

Equivalent TC proteins and/or genes encoding these equivalent proteins can be obtained/derived from wild-type or recombinant bacteria and/or from other wild-type or recombinant organisms using the teachings provided herein. Various species of fungi and bacteria can now be used as source isolates, as disclosed herein.

Optimization of sequence for expression in heterologous organisms. To obtain high expression of heterologous genes in plants, for example, it may be preferred to reengineer said genes so that they are more efficiently expressed in plant cells. Maize is one such plant where it may be preferred to re-design the heterologous gene(s) prior to transformation to increase the expression level thereof in said plant. Therefore, an additional step in the design of genes encoding a bacterial or fungal toxin, for example, is reengineering of a heterologous gene for optimal expression in a different type of organism. Guidance regarding the production of synthetic genes that are optimized for plant expression can be found in, for example, U.S. Pat. No. 5,380,831. A sequence optimized for expression in E. coli is also exemplified, as discussed in the Examples below.
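
As a hedged sketch of such reengineering, the following back-translates a short peptide with two alternative codon choices and scans each result for a putative polyadenylation signal. Both codon tables are hypothetical placeholders chosen for this illustration and do not represent measured codon usage for maize or any other plant. The motif AATAAA is used here only as one commonly cited example of a polyadenylation signal, and the peptide is invented.

```python
# HYPOTHETICAL codon choices for three amino acids; placeholders only,
# not real codon-usage data for any organism.
CODON_CHOICE_A = {"M": "ATG", "N": "AAT", "K": "AAA"}
CODON_CHOICE_B = {"M": "ATG", "N": "AAC", "K": "AAG"}


def back_translate(protein, codon_choice):
    """Encode `protein` (one-letter codes) using one codon per amino acid."""
    return "".join(codon_choice[aa] for aa in protein)


def find_motif(dna, motif="AATAAA"):
    """Return every 0-based start position of `motif` in `dna`."""
    hits = []
    start = dna.find(motif)
    while start != -1:
        hits.append(start)
        start = dna.find(motif, start + 1)
    return hits


peptide = "MNK"  # invented example peptide
design_a = back_translate(peptide, CODON_CHOICE_A)
design_b = back_translate(peptide, CODON_CHOICE_B)
hits_a = find_motif(design_a)
hits_b = find_motif(design_b)
```

Both designs encode the same peptide, but only the first contains the scanned motif; a practical design workflow would apply the same idea with real codon-usage data and a fuller list of sequences to avoid.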

Transgenic hosts. The genes encoding Toxin Complex proteins of the subject invention can be introduced into a wide variety of microbial or plant hosts. In preferred embodiments, transgenic plant cells and plants are used. Preferred plants (and plant cells) are corn (maize), cotton, canola, sunflowers, and soybeans.

In preferred embodiments, expression of the gene results, directly or indirectly, in the intracellular production (and maintenance) of the protein. Plants can be rendered insect-resistant in this manner. When transgenic/recombinant/transformed/transfected host cells (or contents thereof) are ingested by the pests, the pests will ingest the toxin. This is the preferred manner in which to cause contact of the pest with the toxin. The result is control (killing or making sick) of the pest. Sucking pests can also be controlled in a similar manner. Alternatively, suitable microbial hosts, e.g., Pseudomonas such as P. fluorescens, can be applied where target pests are present; the microbes can proliferate there, and are ingested by the target pests. The microbe hosting the toxin gene can be treated under conditions that prolong the activity of the toxin and stabilize the cell. The treated cell, which retains the toxic activity, can then be applied to the environment of the target pest. The subject invention also includes the administration of a combination of cells, with some cells expressing one or more types of proteins and other cells expressing other types of proteins (such as some cells producing Class A toxin proteins and other cells producing “potentiating” Tcp1_Gz-type proteins of the subject invention, for example).

Where the toxin gene is introduced via a suitable vector into a microbial host, and said host is applied to the environment in a living state, certain host microbes should be used. Microorganism hosts are selected which are known to occupy the “phytosphere” (phylloplane, phyllosphere, rhizosphere, and/or rhizoplane) of one or more crops of interest. These microorganisms are selected so as to be capable of successfully competing in the particular environment (crop and other insect habitats) with the wild-type microorganisms, provide for stable maintenance and expression of the gene expressing the polypeptide pesticide, and, desirably, provide for improved protection of the pesticide from environmental degradation and inactivation.

A large number of microorganisms are known to inhabit the phylloplane (the surface of the plant leaves) and/or the rhizosphere (the soil surrounding plant roots) of a wide variety of important crops. These microorganisms include bacteria, algae, and fungi. Of particular interest are microorganisms, such as bacteria, e.g., genera Pseudomonas, Erwinia, Serratia, Klebsiella, Xanthomonas, Streptomyces, Rhizobium, Rhodopseudomonas, Methylophilus, Agrobacterium, Acetobacter, Lactobacillus, Arthrobacter, Azotobacter, Leuconostoc, and Alcaligenes; fungi, particularly yeast, e.g., genera Saccharomyces, Cryptococcus, Kluyveromyces, Sporobolomyces, Rhodotorula, and Aureobasidium. Of particular interest are such phytosphere bacterial species as Pseudomonas syringae, Pseudomonas fluorescens, Serratia marcescens, Acetobacter xylinum, Agrobacterium tumefaciens, Rhodopseudomonas spheroides, Xanthomonas campestris, Rhizobium meliloti, Alcaligenes eutrophus, and Azotobacter vinelandii; and phytosphere yeast species such as Rhodotorula rubra, R. glutinis, R. marina, R. aurantiaca, Cryptococcus albidus, C. diffluens, C. laurentii, Saccharomyces rosei, S. pretoriensis, S. cerevisiae, Sporobolomyces roseus, S. odorus, Kluyveromyces veronae, and Aureobasidium pullulans. Also of interest are pigmented microorganisms.

Insertion of genes to form transgenic hosts. One aspect of the subject invention is the transformation/transfection of plants, plant cells, and other host cells with polynucleotides of the subject invention that express proteins of the subject invention. Plants transformed in this manner can be rendered resistant to attack by the target pest(s).

A wide variety of methods are available for introducing a gene encoding a protein into the target host under conditions that allow for stable maintenance and expression of the gene. These methods are well known to those skilled in the art and are described, for example, in U.S. Pat. No. 5,135,867.

For example, a large number of cloning vectors comprising a replication system in E. coli and a marker that permits selection of the transformed cells are available for preparation for the insertion of foreign genes into higher plants. The vectors comprise, for example, pBR322, pUC series, M13mp series, pACYC184, etc. Accordingly, the sequence encoding the toxin can be inserted into the vector at a suitable restriction site. The resulting plasmid is used for transformation into E. coli. The E. coli cells are cultivated in a suitable nutrient medium, then harvested and lysed. The plasmid is recovered. Sequence analysis, restriction analysis, electrophoresis, and other biochemical-molecular biological methods are generally carried out as methods of analysis. After each manipulation, the DNA sequence used can be cleaved and joined to the next DNA sequence. Each plasmid sequence can be cloned in the same or other plasmids. Depending on the method of inserting desired genes into the plant, other DNA sequences may be necessary. If, for example, the Ti or Ri plasmid is used for the transformation of the plant cell, then at least the right border, but often the right and the left border of the Ti or Ri plasmid T-DNA, has to be joined as the flanking region of the genes to be inserted. The use of T-DNA for the transformation of plant cells has been intensively researched and described in EP 120 516; Hoekema (1985) In: The Binary Plant Vector System, Offset-drukkerij Kanters B.V., Alblasserdam, Chapter 5; Fraley et al., Crit. Rev. Plant Sci. 4:1-46; and An et al. (1985) EMBO J. 4:277-287.

A large number of techniques are available for inserting DNA into a plant host cell. Those techniques include transformation with T-DNA using Agrobacterium tumefaciens or Agrobacterium rhizogenes as transformation agent, fusion, injection, biolistics (microparticle bombardment), or electroporation as well as other possible methods. If Agrobacterium are used for the transformation, the DNA to be inserted has to be cloned into special plasmids, namely either into an intermediate vector or into a binary vector. The intermediate vectors can be integrated into the Ti or Ri plasmid by homologous recombination owing to sequences that are homologous to sequences in the T-DNA. The Ti or Ri plasmid also comprises the vir region necessary for the transfer of the T-DNA. Intermediate vectors cannot replicate themselves in Agrobacterium. The intermediate vector can be transferred into Agrobacterium tumefaciens by means of a helper plasmid (conjugation). Binary vectors can replicate themselves both in E. coli and in Agrobacterium. They comprise a selection marker gene and a linker or polylinker which are framed by the right and left T-DNA border regions. They can be transformed directly into Agrobacterium (Holsters et al. [1978] Mol. Gen. Genet. 163:181-187). The Agrobacterium used as host cell is to comprise a plasmid carrying a vir region. The vir region is necessary for the transfer of the T-DNA into the plant cell. Additional T-DNA may be contained. The bacterium so transformed is used for the transformation of plant cells. Plant explants can advantageously be cultivated with Agrobacterium tumefaciens or Agrobacterium rhizogenes for the transfer of the DNA into the plant cell. Whole plants can then be regenerated from the infected plant material (for example, pieces of leaf, segments of stalk, roots, but also protoplasts or suspension-cultivated cells) in a suitable medium, which may contain antibiotics or biocides for selection. The plants so obtained can then be tested for the presence of the inserted DNA. No special demands are made of the plasmids in the case of injection and electroporation. It is possible to use ordinary plasmids, such as, for example, pUC derivatives.

The transformed cells grow inside the plants in the usual manner. They can form germ cells and transmit the transformed trait(s) to progeny plants. Such plants can be grown in the normal manner and crossed with plants that have the same transformed hereditary factors or other hereditary factors. The resulting hybrid individuals have the corresponding phenotypic properties.

In some preferred embodiments of the invention, genes encoding the toxin are expressed from transcriptional units inserted into the plant genome. Preferably, said transcriptional units are recombinant vectors that are capable of stable integration into the plant genome and that enable selection of transformed plant lines expressing mRNA encoding the proteins.

Once the inserted DNA has been integrated in the genome, it is relatively stable there (and does not come out again). It normally contains a selection marker that confers on the transformed plant cells resistance to a biocide or an antibiotic, such as kanamycin, G418, bleomycin, hygromycin, or chloramphenicol, inter alia. The individually employed marker should accordingly permit the selection of transformed cells rather than cells that do not contain the inserted DNA. The gene(s) of interest are preferably expressed either by constitutive or inducible promoters in the plant cell. Once expressed, the mRNA is translated into proteins, thereby incorporating amino acids of interest into protein. The genes encoding a toxin expressed in the plant cells can be under the control of a constitutive promoter, a tissue-specific promoter, or an inducible promoter.

Several techniques exist for introducing foreign recombinant vectors into plant cells, and for obtaining plants that stably maintain and express the introduced gene. Such techniques include the introduction of genetic material coated onto microparticles directly into cells (U.S. Pat. No. 4,945,050 to Cornell and U.S. Pat. No. 5,141,131 to DowElanco, now Dow AgroSciences, LLC). In addition, plants may be transformed using Agrobacterium technology, see U.S. Pat. No. 5,177,010 to University of Toledo; U.S. Pat. No. 5,104,310 to Texas A&M; European Patent Application 0131624B1; European Patent Applications 120516, 159418B1 and 176,112 to Schilperoort; U.S. Pat. Nos. 5,149,645, 5,469,976, 5,464,763 and 4,940,838 and 4,693,976 to Schilperoort; European Patent Applications 116718, 290799, 320500 all to Max Planck; European Patent Applications 604662 and 627752, and U.S. Pat. No. 5,591,616, to Japan Tobacco; European Patent Applications 0267159 and 0292435, and U.S. Pat. No. 5,231,019, all to Ciba Geigy, now Novartis; U.S. Pat. Nos. 5,463,174 and 4,762,785, both to Calgene; and U.S. Pat. Nos. 5,004,863 and 5,159,135, both to Agracetus. Other transformation technology includes whiskers technology. See U.S. Pat. Nos. 5,302,523 and 5,464,765, both to Zeneca. Electroporation technology has also been used to transform plants. See WO 87/06614 to Boyce Thompson Institute; U.S. Pat. Nos. 5,472,869 and 5,384,253, both to Dekalb; and WO 92/09696 and WO 93/21335, both to Plant Genetic Systems. Furthermore, viral vectors can also be used to produce transgenic plants producing the protein of interest. For example, monocotyledonous plants can be transformed with a viral vector using the methods described in U.S. Pat. No. 5,569,597 to Mycogen Plant Science and Ciba-Geigy, now Novartis, as well as U.S. Pat. Nos. 5,589,367 and 5,316,931, both to Biosource.

As mentioned previously, the manner in which the DNA construct is introduced into the plant host is not critical to this invention. Any method that provides for efficient transformation can be employed. For example, various methods for plant cell transformation are described herein and include the use of Ti or Ri-plasmids and the like to perform Agrobacterium mediated transformation. In many instances, it will be desirable to have the construct used for transformation bordered on one or both sides by T-DNA borders, more specifically the right border. This is particularly useful when the construct uses Agrobacterium tumefaciens or Agrobacterium rhizogenes as a mode for transformation, although T-DNA borders may find use with other modes of transformation. Where Agrobacterium is used for plant cell transformation, a vector may be used which may be introduced into the host for homologous recombination with T-DNA or the Ti or Ri plasmid present in the host. Introduction of the vector may be performed via electroporation, tri-parental mating and other techniques for transforming gram-negative bacteria which are known to those skilled in the art. The manner of vector transformation into the Agrobacterium host is not critical to this invention. The Ti or Ri plasmid containing the T-DNA for recombination may be capable or incapable of causing gall formation, and is not critical to said invention so long as the vir genes are present in said host.

In some cases where Agrobacterium is used for transformation, the expression construct being within the T-DNA borders will be inserted into a broad spectrum vector such as pRK2 or derivatives thereof as described in Ditta et al. (PNAS USA (1980) 77:7347-7351) and EPO 0 120 515, which are incorporated herein by reference. Included within the expression construct and the T-DNA will be one or more markers as described herein which allow for selection of transformed Agrobacterium and transformed plant cells. The particular marker employed is not essential to this invention, with the preferred marker depending on the host and construction used.

For transformation of plant cells using Agrobacterium, explants may be combined and incubated with the transformed Agrobacterium for sufficient time to allow transformation thereof. After transformation, the Agrobacterium are killed by selection with the appropriate antibiotic and plant cells are cultured with the appropriate selective medium. Once calli are formed, shoot formation can be encouraged by employing the appropriate plant hormones according to methods well known in the art of plant tissue culturing and plant regeneration. However, a callus intermediate stage is not always necessary. After shoot formation, said plant cells can be transferred to medium which encourages root formation thereby completing plant regeneration. The plants may then be grown to seed and said seed can be used to establish future generations. Regardless of transformation technique, the gene encoding a toxin is preferably incorporated into a gene transfer vector adapted to express said gene in a plant cell by including in the vector a plant promoter regulatory element, as well as 3′ non-translated transcriptional termination regions such as Nos and the like.

In addition to numerous technologies for transforming plants, the type of tissue that is contacted with the foreign genes may vary as well. Such tissue would include but would not be limited to embryogenic tissue, callus tissue types I, II, and III, hypocotyl, meristem, root tissue, tissues for expression in phloem, and the like. Almost all plant tissues may be transformed during dedifferentiation using appropriate techniques described herein.

As mentioned above, a variety of selectable markers can be used, if desired. Preference for a particular marker is at the discretion of the artisan, but any of the following selectable markers may be used along with any other gene not listed herein which could function as a selectable marker. Such selectable markers include but are not limited to aminoglycoside phosphotransferase gene of transposon Tn5 (Aph II) which encodes resistance to the antibiotics kanamycin, neomycin and G418, as well as those genes which encode for resistance or tolerance to glyphosate; hygromycin; methotrexate; phosphinothricin (bialaphos); imidazolinones, sulfonylureas and triazolopyrimidine herbicides, such as chlorsulfuron; bromoxynil, dalapon and the like.

In addition to a selectable marker, it may be desirable to use a reporter gene. In some instances a reporter gene may be used with or without a selectable marker. Reporter genes are genes which are typically not present in the recipient organism or tissue and typically encode for proteins resulting in some phenotypic change or enzymatic property. Examples of such genes are provided in K. Weising et al. Ann. Rev. Genetics, 22, 421 (1988). Preferred reporter genes include the beta-glucuronidase (GUS) of the uidA locus of E. coli, the chloramphenicol acetyl transferase gene from Tn9 of E. coli, the green fluorescent protein from the bioluminescent jellyfish Aequorea victoria, and the luciferase genes from firefly Photinus pyralis. An assay for detecting reporter gene expression may then be performed at a suitable time after said gene has been introduced into recipient cells. A preferred such assay entails the use of the gene encoding beta-glucuronidase (GUS) of the uidA locus of E. coli as described by Jefferson et al., (1987 Biochem. Soc. Trans. 15, 17-19) to identify transformed cells.

In addition to plant promoter regulatory elements, promoter regulatory elements from a variety of sources can be used efficiently in plant cells to express foreign genes. For example, promoter regulatory elements of bacterial origin, such as the octopine synthase promoter, the nopaline synthase promoter, the mannopine synthase promoter; promoters of viral origin, such as the cauliflower mosaic virus (35S and 19S), 35T (which is a re-engineered 35S promoter, see U.S. Pat. No. 6,166,302, especially Example 7E) and the like may be used. Plant promoter regulatory elements include but are not limited to ribulose-1,5-bisphosphate (RUBP) carboxylase small subunit (ssu), beta-conglycinin promoter, beta-phaseolin promoter, ADH promoter, heat-shock promoters, and tissue specific promoters. Other elements such as matrix attachment regions, scaffold attachment regions, introns, enhancers, polyadenylation sequences and the like may be present and thus may improve the transcription efficiency or DNA integration. Such elements may or may not be necessary for DNA function, although they can provide better expression or functioning of the DNA by affecting transcription, mRNA stability, and the like. Such elements may be included in the DNA as desired to obtain optimal performance of the transformed DNA in the plant. Typical elements include but are not limited to Adh-intron 1, Adh-intron 6, the alfalfa mosaic virus coat protein leader sequence, the maize streak virus coat protein leader sequence, as well as others available to a skilled artisan. Constitutive promoter regulatory elements may also be used thereby directing continuous gene expression in all cell types and at all times (e.g., actin, ubiquitin, CaMV 35S, and the like). Tissue specific promoter regulatory elements are responsible for gene expression in specific cell or tissue types, such as the leaves or seeds (e.g., zein, oleosin, napin, ACP, globulin and the like) and these may also be used.

Promoter regulatory elements may also be active during a certain stage of the plant's development as well as active in plant tissues and organs. Examples of such include but are not limited to pollen-specific, embryo-specific, corn-silk-specific, cotton-fiber-specific, root-specific, seed-endosperm-specific promoter regulatory elements and the like. Under certain circumstances it may be desirable to use an inducible promoter regulatory element, which is responsible for expression of genes in response to a specific signal, such as: physical stimulus (heat shock genes), light (RUBP carboxylase), hormone (Em), metabolites, chemical, and stress. Other desirable transcription and translation elements that function in plants may be used. Numerous plant-specific gene transfer vectors are known in the art.

Standard molecular biology techniques may be used to clone and sequence the genes (and toxins) described herein. Additional information may be found in Sambrook, J., Fritsch, E. F., and Maniatis, T. (1989), Molecular Cloning, A Laboratory Manual, Cold Spring Harbor Press, which is incorporated herein by reference.

Resistance Management. With increasing commercial use of insecticidal proteins in transgenic plants, one consideration is resistance management. That is, there are numerous companies using Bacillus thuringiensis toxins in their products, and there is concern about insects developing resistance to B.t. toxins. One strategy for insect resistance management would be to combine the TC insecticidal proteins produced by Xenorhabdus, Photorhabdus, Gibberella, and the like with toxins such as B.t. crystal toxins, soluble insecticidal proteins from Bacillus strains (see, e.g., WO 98/18932 and WO 99/57282), or other insect toxins. The combinations could be formulated for a sprayable application or could be molecular combinations. Plants could be transformed with genes that produce two or more different insect toxins (see, e.g., Gould, 38 Bioscience 26-33 (1988) and U.S. Pat. No. 5,500,365; likewise, European Patent Application 0 400 246 A1 and U.S. Pat. Nos. 5,866,784; 5,908,970; and 6,172,281 also describe transformation of a plant with two B.t. crystal toxins). Another method of producing a transgenic plant that contains more than one insect resistance gene would be to first produce two plants, with each plant containing an insect resistance gene. These plants could then be crossed using traditional plant breeding techniques to produce a plant containing more than one insect resistance gene. Thus, it should be apparent that the phrase “comprising a polynucleotide” as used herein means at least one polynucleotide (and possibly more, contiguous or not) unless specifically indicated otherwise.

Formulations and Other Delivery Systems. Formulated bait granules containing cells and/or proteins of the subject invention (including recombinant microbes comprising the genes described herein) can be applied to the soil. Formulated product can also be applied as a seed-coating or root treatment or total plant treatment at later stages of the crop cycle. Plant and soil treatments of cells may be employed as wettable powders, granules or dusts, by mixing with various inert materials, such as inorganic minerals (phyllosilicates, carbonates, sulfates, phosphates, and the like) or botanical materials (powdered corncobs, rice hulls, walnut shells, and the like). The formulations may include spreader-sticker adjuvants, stabilizing agents, other pesticidal additives, or surfactants. Liquid formulations may be aqueous-based or non-aqueous and employed as foams, gels, suspensions, emulsifiable concentrates, or the like. The ingredients may include rheological agents, surfactants, emulsifiers, dispersants, or polymers.

As would be appreciated by a person skilled in the art, the pesticidal concentration will vary widely depending upon the nature of the particular formulation, particularly whether it is a concentrate or to be used directly. The pesticide will be present in at least 1% by weight and may be 100% by weight. The dry formulations will have from about 1-95% by weight of the pesticide while the liquid formulations will generally be from about 1-60% by weight of the solids in the liquid phase. The formulations will generally have from about 10² to about 10⁴ cells/mg. These formulations will be administered at about 50 mg (liquid or dry) to 1 kg or more per hectare.
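
As a worked illustration of these figures, the cell density and the application rate can be multiplied to estimate the number of cells applied per hectare. The function name is an assumption for this sketch, and because the stated upper rate is "1 kg or more," the upper figure computed here is not a ceiling.

```python
MG_PER_KG = 1_000_000  # 1 kg = 10^6 mg


def cells_per_hectare(cells_per_mg, mg_per_hectare):
    """Cells applied per hectare = cell density x mass applied per hectare."""
    return cells_per_mg * mg_per_hectare


# Lowest stated density (10^2 cells/mg) at the lowest stated rate (50 mg/ha).
low_estimate = cells_per_hectare(10**2, 50)

# Highest stated density (10^4 cells/mg) at a rate of 1 kg/ha.
high_estimate = cells_per_hectare(10**4, 1 * MG_PER_KG)
```

Under these figures the applied dose spans roughly 5 x 10³ to 10¹⁰ cells per hectare.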

The formulations can be applied to the environment of the pest, e.g., soil and foliage, by spraying, dusting, sprinkling, or the like.

Another delivery scheme is the incorporation of the genetic material of toxins into a baculovirus vector. Baculoviruses infect particular insect hosts, including those desirably targeted with the toxins. Infectious baculovirus harboring an expression construct for the toxins could be introduced into areas of insect infestation to thereby intoxicate or poison infected insects.

Insect viruses, or baculoviruses, are known to infect and adversely affect certain insects. The effect of the viruses on insects is slow, and viruses do not immediately stop the feeding of insects. Thus, viruses are not viewed as being optimal as insect pest control agents. However, combining the toxin genes into a baculovirus vector could provide an efficient way of transmitting the toxins. In addition, since different baculoviruses are specific to different insects, it may be possible to use a particular toxin to selectively target particularly damaging insect pests. A particularly useful vector for the toxins genes is the nuclear polyhedrosis virus. Transfer vectors using this virus have been described and are now the vectors of choice for transferring foreign genes into insects. The virus-toxin gene recombinant may be constructed in an orally transmissible form. Baculoviruses normally infect insect victims through the mid-gut intestinal mucosa. The toxin gene inserted behind a strong viral coat protein promoter would be expressed and should rapidly kill the infected insect.

In addition to an insect virus or baculovirus or transgenic plant delivery system for the protein toxins of the present invention, the proteins may be encapsulated using Bacillus thuringiensis encapsulation technology such as but not limited to U.S. Pat. Nos. 4,695,455; 4,695,462; 4,861,595 which are all incorporated herein by reference. Another delivery system for the protein toxins of the present invention is formulation of the protein into a bait matrix, which could then be used in above and below ground insect bait stations. Examples of such technology include but are not limited to PCT Patent Application WO 93/23998, which is incorporated herein by reference.

Plant RNA viral based systems can also be used to produce an anti-insect toxin protein. In so doing, the gene encoding a toxin can be inserted into the coat promoter region of a suitable plant virus that will infect the host plant of interest. The toxin can then be expressed thus providing protection of the plant from insect damage. Plant RNA viral based systems are described in U.S. Pat. No. 5,500,360 to Mycogen Plant Sciences, Inc. and U.S. Pat. Nos. 5,316,931 and 5,589,367 to Biosource Genetics Corp.

In addition to producing a transformed plant, there are other delivery systems where it may be desirable to engineer the toxin-encoding gene(s). For example, a protein toxin can be constructed by fusing together a molecule attractive to insects as a food source with a toxin. After purification in the laboratory such a toxic agent with “built-in” bait could be packaged inside standard insect trap housings.

Mutants. Mutants of bacterial and fungal isolates (and other organisms) can be made by procedures that are well known in the art. For example, mutants can be obtained through ethylmethane sulfonate (EMS) mutagenesis of an isolate. The mutants can be made using ultraviolet light and nitrosoguanidine by procedures well known in the art.

Examples of Various Specific Embodiments. (As used in this specification, the terms “a” and “an” signify at least one, unless specifically indicated otherwise.) The subject invention can include, but is not limited to, an isolated, eukaryotic protein comprising a B domain and a C domain, wherein said protein potentiates the insecticidal activity of a Class A toxin complex protein, wherein said B domain comprises an spvB subdomain followed by at least one FG-GAP subdomain, and said C domain comprises at least one RHS subdomain followed by a hyper-variable region. (It should be noted that these subdomains are sometimes referred to herein as “domains.” It should be understood that such “domains” can be subparts of fusion proteins, of the subject invention, having two main domains—B and C with each main domain having domains of their own). In some embodiments, said protein can further comprise a transmembrane domain (or subdomain). In some embodiments, said protein is a fungal protein, including a Gibberella protein. Some preferred proteins have a molecular weight of approximately 200-300 kDa. In some other embodiments, said protein has the domains as described above but is prokaryotic or archaeal (not eukaryotic) obtainable from a naturally occurring organism of a genus selected from Methanosarcina, Treponema, Leptospira, Tannerella, and Microbulbifer. Novel bacterial sources (other than Xenorhabdus, Photorhabdus, and the like) are also described herein. Variations (such as conservative substitutions) of the naturally occurring amino acid sequences are also possible.

The subject invention includes an isolated polynucleotide that encodes any of these proteins. Some preferred polynucleotides have a codon composition optimized for expression in a plant. The invention includes a transgenic cell comprising any of these polynucleotides. In some preferred embodiments, the transgenic cell further comprises a nucleic acid molecule that encodes a Class A toxin.

The subject invention further includes a method of screening polynucleotide sequences for a polynucleotide that encodes a protein mentioned (and/or suggested) above, wherein said method comprises providing a reference sequence, comparing said reference sequence to a sequence database using an algorithm, assigning a score to a sequence in said database, selecting a minimum value, identifying said polynucleotide in said database that has a said score above said minimum value, producing a protein that is encoded by said polynucleotide, and assaying said protein for ability to potentiate the activity of a Class A toxin complex protein.

Still further included is a method of screening a culture of naturally occurring eukaryotic cells for a protein selected from the group of a Class A toxin complex protein and a BC fusion protein mentioned above. A naturally occurring organism of a genus selected from Methanosarcina, Treponema, Leptospira, Tannerella, and Microbulbifer can be substituted for the eukaryote in such screening methods.

The subject invention also includes a method of identifying a BC fusion protein, mentioned above, from a naturally occurring organism wherein said method comprises analyzing a sequence for a sequence of subdomains discussed above and elsewhere herein.

The subject invention also includes a method of screening a plurality of naturally occurring (microbial) isolates for an approximately 220 kDa to approximately 295 kDa protein that potentiates anti-insect toxin activity of a Class A toxin complex toxin protein wherein said method comprises obtaining protein from said isolates and screening said protein for said potentiating activity, said protein comprising a B domain and a C domain, wherein said B domain and said C domain comprise subdomains discussed above and elsewhere herein. Said microbes can be fungal. Said microbes can also be selected from Gibberella, Methanosarcina, Treponema, Leptospira, Tannerella, and Microbulbifer.

The subject invention also includes a method of screening a plurality of Gibberella isolates for a gene that encodes an approximately 220 kDa to approximately 295 kDa protein that potentiates anti-insect toxin activity of a Class A toxin complex toxin protein wherein said method comprises obtaining nucleic acid molecules from said isolates and contacting said nucleic acid molecules with a polynucleotide that hybridizes with said gene. The step of obtaining DNA from said culture can comprise creating a library of clones from said DNA and assaying at least one of said clones for the presence of said gene. The step of assaying said clone for the presence of said polynucleotide can comprise assaying said clone for lepidopteran toxin activity, thereby indicating the presence of said polynucleotide. The step of assaying said DNA can comprise performing polymerase chain reaction with at least one primer that is designed to indicate the presence of said gene. The step of assaying said DNA can comprise hybridizing a nucleic acid probe to said DNA wherein said probe is designed to indicate the presence of said gene. The method can also comprise assaying for said protein, comprising (for example) immunoreacting an antibody for said protein with a protein sample wherein said antibody is designed to indicate the presence of said protein.

The subject invention further provides a method of controlling an insect, or like pest, wherein said method comprises the step of contacting said insect with a BC fusion protein discussed above and a Class A toxin complex toxin protein. Also included is a method of potentiating the toxin activity of a Class A toxin complex toxin protein wherein said method comprises providing a BC fusion protein discussed above to an insect for ingestion. Novel B and/or C domains from these novel fusion proteins can also be used individually (not in the form of a fusion).

The subject invention still further includes a synthetic BC fusion protein comprising a B or a C domain obtainable from (or derived from) a novel source (naturally occurring) organism discussed above, wherein said B domain or said C domain is fused to a heterologous C domain or B domain. Various combinations are possible. Such synthetic fusions can also be fused to a Class A toxin complex toxin protein.

In some other embodiments, the subject invention includes a method of producing a transgenic cell, wherein said method comprises inserting a polynucleotide into said cell, wherein said polynucleotide encodes an approximately 220 kDa to approximately 295 kDa Gibberella (or other subject organism) protein, wherein said protein potentiates anti-insect toxin activity of a Class A toxin complex toxin protein. The subject invention also includes a transgenic cell comprising a heterologous polynucleotide from a culture of a Gibberella (or other subject organism) isolate, wherein said polynucleotide encodes an approximately 220 kDa to approximately 295 kDa protein that potentiates anti-insect toxin activity of a Class A toxin complex toxin protein. The subject invention includes a method of screening a plurality of Gibberella isolates for an approximately 220 kDa to approximately 295 kDa protein that potentiates anti-insect toxin activity of a Class A toxin complex toxin protein wherein said method comprises obtaining protein from said isolates and screening said protein for said potentiating activity. The subject invention still further includes a method of screening a plurality of Gibberella isolates for a gene that encodes an approximately 220 kDa to approximately 295 kDa protein that potentiates anti-insect toxin activity of a Class A toxin complex toxin protein wherein said method comprises obtaining nucleic acid molecules from said isolates and contacting said nucleic acid molecules with a polynucleotide that hybridizes with said gene.

Other organisms, proteins, and genes can be substituted in the above methods and embodiments. For example, Methanosarcina anti-insect proteins and genes can be identified or used according to the above methods. Likewise, Class A TC proteins and genes from Gibberella can be identified or used according to the above methods. Novel strains identified using the subject methods are also within the scope of the subject invention.

Various approaches can be used for performing the above methods. For example, a library of clones can be constructed in performing some of the above methods. Some of these methods can include a step of performing polymerase chain reaction with at least one primer that is designed to indicate the presence of a gene of interest. The above methods can include a step of hybridizing a nucleic acid probe to DNA of interest wherein said probe is designed to indicate the presence of said gene. Proteins can be assayed by immunoreacting an antibody with said protein wherein said antibody is designed to indicate the presence of said protein.

The subject invention also includes an isolated protein that potentiates anti-insect toxin activity of a Class A toxin complex toxin wherein a polynucleotide sequence that encodes said protein hybridizes under stringent conditions with the complement of a sequence selected from SEQ ID NOs:1, 3, 5, 7, 9, and 11. In some preferred embodiments, the protein comprises an amino acid sequence selected from SEQ ID NO:2, SEQ ID NO:4, SEQ ID NO:6, SEQ ID NO:8, SEQ ID NO:10, and SEQ ID NO:12. An isolated polynucleotide that encodes any of these proteins is also within the scope of the subject invention, as are transgenic cells (such as microbial and plant cells) comprising said polynucleotides. The subject invention still further includes a method of controlling an insect pest wherein said method comprises the step of contacting said pest with a protein of the subject invention.

Again, other organisms, proteins, and genes can be substituted in the above methods and embodiments. This includes, for example, Methanosarcina anti-insect proteins and genes, and Class A TC proteins and genes from Gibberella.

All patents, patent applications, provisional applications, and publications referred to or cited herein are incorporated by reference in their entirety to the extent they are not inconsistent with the explicit teachings of this specification.

Following are examples that illustrate procedures for practicing the invention. These examples should not be construed as limiting. All percentages are by weight and all solvent mixture proportions are by volume unless otherwise noted.

EXAMPLE 1
Discovery of Class B and Class C Gene Homologs in Gibberella zeae

A DNA sequence encoding a hypothetical protein with similarity to the Photorhabdus luminescens Toxin Complex TcaC (Class B) and TccC1 (Class C) proteins (GenBank Accession Numbers AAC38625.1 and AAL18473.1, respectively) was discovered by tblastn analysis of the Gibberella zeae genome. The analysis was done using the NCBI (National Center for Biotechnology Information) genomic BLAST algorithm at the World Wide Web site (ncbi.nlm.nih.gov/sutils/genom_table.cgi), using the following default values:

- Expect 10;
- Filter Default.

One hit for each protein was found within GenBank Accession Number AACM01000442 using tblastn. Both of the hits mapped to a single hypothetical protein which was annotated as follows:

- CDS; join (52114 . . . 56781, 56863 . . . 59514);
- locus_tag=“FG10566.1”;
- codon_start=1;
- product=“hypothetical protein”;
- protein_id=“EAA68452.1”;
- db_xref=“GI:42545609”.

The DNA sequence from AACM01000442 was translated using DNA Translator (a program that allows the user to select start and stop parameters for protein coding regions). The resulting predicted translation products were used to search a nonredundant local protein database using blastp. A similar blastp analysis was performed with all the proteins annotated in GenBank within the AACM01000442 sequence. In both cases, a single polypeptide (EAA68452.1) was identified that has significant homology to both the TcaC and TccC1 Photorhabdus Toxin Complex proteins.

Further analysis of the relationship between the EAA68452.1 protein and the TcaC and TccC1 proteins was performed with the program “Blast 2 sequences” which contains the blastp comparison algorithm [Tatiana A. Tatusova, Thomas L. Madden (1999), “Blast 2 sequences—a new tool for comparing protein and nucleotide sequences”, FEMS Microbiol Lett. 174:247-250]. The default search/comparison parameters were used, as listed below:

- Matrix Blosum62;
- Open Gap 11;
- Extension Gap 1;
- Gap x dropoff 50;
- Expect value 10;
- Word size 3;
- Filter off.
  
The “Blast 2 sequences” comparison results for the TcaC protein are given below:

- Length=2439;
- Score=318 bits (814);
- Expect=9e-85;
- Identities= 333/1291 (25%);
- Positives= 527/1291 (40%);
- Gaps= 187/1291 (14%)

Analysis of the protein-protein alignment identified in this search revealed that the region of homology between TcaC and the Gibberella zeae EAA68452.1 hypothetical protein comprises amino acids 72-1266 of EAA68452.1.

The “Blast 2 sequences” comparison results for the TccC1 protein are given below:

- Length=2439;
- Score=192 bits (489);
- Expect=3e-47;
- Identities= 198/723 (27%);
- Positives= 317/723 (43%);
- Gaps= 89/723 (12%)
  
Analysis of the protein-protein alignment identified in this search revealed that the region of homology between TccC1 and the Gibberella zeae EAA68452.1 hypothetical protein comprises amino acids 1557-2239 of EAA68452.1. Thus, it is apparent that the Gibberella zeae EAA68452.1 hypothetical protein comprises two consecutive domains, the first having some homology to the Class B TcaC protein, and a second domain having some homology to the Class C TccC1 protein.
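The percentages in the two comparison reports can be checked against the reported counts. The following is a minimal sketch in Python, not part of the original analysis; the function name is hypothetical. It shows that the reported values are consistent with integer truncation rather than rounding (for example, 333/1291 is about 25.8% but is reported as 25%).

```python
def truncated_percent(count, aligned_length):
    """Integer percentage, truncated toward zero (hypothetical helper)."""
    return count * 100 // aligned_length

# Counts copied from the two comparison reports above.
tcac = {"identities": 333, "positives": 527, "gaps": 187}
tccc1 = {"identities": 198, "positives": 317, "gaps": 89}

tcac_percent = {k: truncated_percent(v, 1291) for k, v in tcac.items()}
tccc1_percent = {k: truncated_percent(v, 723) for k, v in tccc1.items()}

print(tcac_percent)   # values to compare with the TcaC report
print(tccc1_percent)  # values to compare with the TccC1 report
```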

GenBank Accession AACM01000442 is a 95095 base linear DNA sequence obtained via whole genome shotgun sequencing of the Gibberella zeae strain PH1 (NRRL 31084) chromosome 1, and has a deposit date of Feb. 13, 2004. It is noted that the CDS annotation above suggests the presence of an intron (intervening) sequence within the genomic sequence, comprising bases 56782 to 56862. Although annotated as an intron sequence, it should be noted that all the bases comprising the putative intron are in the +1 reading frame relative to exon1 preceding the intron (i.e. bases 52114 to 56781 of Accession AACM01000442). Therefore, an uninterrupted open reading frame extends from base 52114 to base 59514 of Accession AACM01000442. The predicted translation product provided by the DNA Translator program, and which was used in the above searches and comparisons, was larger than that annotated within AACM01000442, since the putative in-frame intron was not removed.

The sequence of the Gibberella zeae DNA which encodes a hypothetical protein with homology to toxin complex potentiator proteins, complete with translated intron, is shown in SEQ ID NO:1. The sequence is also referred to herein as tcp1_Gz (toxin complex potentiator 1 of Gibberella zeae). The translation of SEQ ID NO:1 is shown in SEQ ID NO:2, and is referred to herein as Tcp1_Gz. The DNA sequence of tcp1_Gz without the intron is shown in SEQ ID NO:3. The translation of SEQ ID NO:3 is shown in SEQ ID NO:4. According to some embodiments of the subject invention, the Tcp1_Gz protein can potentiate the activities of the TC Class A proteins TcdA and XptA2 against their respective target insects. This observation is surprising and previously unexpected, since the Tcp1_Gz protein has as its source a eukaryotic organism, and the TcdA and XptA2 proteins are derived from bacterial sources.

EXAMPLE 2
Design and Synthesis of a Tcp1_Gz-Encoding Gene for Expression in Bacteria

This example teaches the design of a new DNA sequence that encodes the Tcp1_Gz protein of SEQ ID NO:2, but is optimized for expression in Escherichia coli cells. Table 2, Columns D and H, present the distributions (in % of usage for all codons for that amino acid) of synonymous codons for each amino acid, as found in the coding regions of Class II genes of E. coli. [Class II genes are those that are highly expressed during the exponential growth phase of E. coli cells, as reported in: Henaut, A. and Danchin, A. (1996) Escherichia coli and Salmonella typhimurium cellular and molecular biology, vol. 2, pp. 2047-2066 (Neidhardt, F., Curtiss III, R., Ingraham, J., Lin, E., Low, B., Magasanik, B., Reznikoff, W., Riley, M., Schaechter, M. and Umbarger, H., eds.), American Society for Microbiology, Washington, D.C.] It is evident that some synonymous codons for some amino acids are present only rarely in those highly expressed genes (e.g. Leucine codon CTA and Arginine codon CGG). In the design process of creating a protein-encoding DNA sequence that approximates the codon distribution of highly expressed E. coli genes, any codon that is used infrequently relative to the other synonymous codons for that amino acid was not included (indicated by NA in Columns C and G of Table 2). Usually, a codon was considered to be rarely used if it was represented in the Class II genes about 18% of the time or less to encode the relevant amino acid.

To balance the distribution of the remaining codon choices for an amino acid, a weighted average representation for each codon was calculated, using the formula:

Weighted % of C1 = [% C1 / (% C1 + % C2 + % C3 + etc.)] × 100

where C1 is the codon in question, C2, C3, etc. represent the remaining synonymous codons, and the % values for the relevant codons are taken from columns D and H of Table 2 (ignoring the rare codon values in bold font). The Weighted % value for each codon is given in Columns C and G of Table 2.
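As a worked illustration of the formula above, the following Python sketch renormalizes the Class II usage values over the retained codons for alanine. The function name is hypothetical, and the set of excluded codons is supplied by hand from the NA entries of Table 2 rather than derived from a cutoff.

```python
def weighted_percentages(class2_usage, excluded):
    """Renormalize Class II usage (%) over the retained synonymous codons."""
    retained = {c: p for c, p in class2_usage.items() if c not in excluded}
    total = sum(retained.values())
    return {c: round(p / total * 100, 1) for c, p in retained.items()}

# Alanine values from Table 2, Column D; GCC is marked NA (not used).
ala_class2 = {"GCA": 24.0, "GCC": 16.1, "GCG": 32.3, "GCT": 27.5}
ala_weighted = weighted_percentages(ala_class2, excluded={"GCC"})

print(ala_weighted)  # compare with the Weighted % column for Ala in Table 2
```

For alanine, the retained values sum to 83.8, so GCA becomes 24.0/83.8 × 100, or about 28.6%, in agreement with Column C.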

Design of the E. coli-optimized DNA sequence was initiated by reverse translation of the protein sequence of SEQ ID NO:2 using a codon bias table constructed from Table 2, Columns C and G. The initial sequence was then modified by compensating codon changes (while retaining overall weighted average representation) to remove or add restriction enzyme recognition sites, remove highly stable intrastrand secondary structures, and other sequences that might be detrimental to cloning manipulations or expression of the engineered gene. An example of such detrimental sequence to avoid within a coding region is a 16S ribosomal RNA binding sequence (“Shine-Dalgarno sequence”) such as AGGAGG, which could encode, for example, two consecutive arginine amino acids, but which might also serve as an intragenic (and therefore undesirable) translation initiation signal.
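The check for internal Shine-Dalgarno-like sequences described above can be sketched as a simple motif scan. This Python example is illustrative only: the function name and both demonstration sequences are hypothetical and are not taken from SEQ ID NO:5. It also shows that the motif can arise out of frame (here across Gln-Glu-Gly codons) even when the individual codons are all permitted by the bias table, and that a synonymous change (GAG to GAA) removes it.

```python
def find_motif(seq, motif="AGGAGG"):
    """Return (1-based position, frame offset) for each motif occurrence."""
    seq = seq.upper()
    hits = []
    start = seq.find(motif)
    while start != -1:
        hits.append((start + 1, start % 3))  # offset 0 means codon-aligned
        start = seq.find(motif, start + 1)   # allow overlapping matches
    return hits

# Hypothetical demo: ATG CAG GAG GGT TAA (Met-Gln-Glu-Gly-stop).
before = "ATGCAGGAGGGTTAA"
# Same peptide after a compensating codon change, GAG -> GAA (both Glu).
after = "ATGCAGGAAGGTTAA"

hits_before = find_motif(before)
hits_after = find_motif(after)
print(hits_before, hits_after)
```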

The E. coli-biased DNA sequence that encodes the protein of SEQ ID NO:2 is given as bases 23-7420 of SEQ ID NO:5. To facilitate cloning and to ensure efficient translation initiation, a 5′ terminal XbaI restriction enzyme recognition sequence (TCTAGA) and Shine-Dalgarno sequence (AAGAAGGAG) were placed upstream of the ATG translation start codon (bases 1-22 of SEQ ID NO:5). Also to facilitate cloning, and to ensure proper translation termination, bases encoding two TAA translation stop codons and an XhoI restriction enzyme recognition site (CTCGAG) were included at the 3′ end of the coding region (bases 7421-7440 of SEQ ID NO:5). Synthesis of a DNA fragment comprising SEQ ID NO:5 was performed by a commercial supplier (Entelechon GmbH, Regensburg, Germany).

It is presently noted that the Gibberella zeae genomic DNA sequence tcp1_Gz, disclosed in SEQ ID NO:1, as annotated in GenBank Accession AACM01000442, comprises a putative intron sequence (bases 4669-4749 of SEQ ID NO:1). Analysis of the open reading frame of SEQ ID NO:1 reveals that the bases comprising the putative intron maintain the +1 reading frame initiated by the ATG start codon at bases 1-3. In other words, SEQ ID NO:1 comprises a single open reading frame of 7398 bases that encodes a theoretical protein of 2466 amino acids. Thus, if the primary transcript derived from the DNA of SEQ ID NO:1 is not spliced (i.e. the intron sequence is not cleaved from the mRNA), translation would produce the Tcp1_Gz protein disclosed in SEQ ID NO:2. On the other hand, if the primary transcript is spliced (that is, the intron sequences are removed), the mRNA would have a sequence corresponding to SEQ ID NO:3, and translation would produce the 2439 amino acid protein disclosed as SEQ ID NO:4.
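The coordinate and length relationships stated above can be verified with simple arithmetic. The Python sketch below assumes that base 1 of SEQ ID NO:1 corresponds to base 52114 of Accession AACM01000442, which is consistent with the intron positions given in the text; the variable and function names are hypothetical.

```python
# Genomic coordinates from the CDS annotation: join(52114..56781, 56863..59514)
CDS_START = 52114
EXON1_END = 56781
EXON2_START = 56863

def to_local(genomic_pos, start=CDS_START):
    """Convert a genomic position to a 1-based position in SEQ ID NO:1
    (assumes SEQ ID NO:1 starts at CDS_START)."""
    return genomic_pos - start + 1

intron_local = (to_local(EXON1_END + 1), to_local(EXON2_START - 1))
intron_len = intron_local[1] - intron_local[0] + 1
exon1_len = EXON1_END - CDS_START + 1

# The intron preserves the reading frame only if exon 1 and the intron
# are each a whole number of codons.
in_frame = exon1_len % 3 == 0 and intron_len % 3 == 0

unspliced_aa = 7398 // 3                    # full open reading frame
spliced_aa = unspliced_aa - intron_len // 3  # intron codons removed
coding_len_seq5 = 7420 - 23 + 1             # coding bases within SEQ ID NO:5

print(intron_local, intron_len, in_frame, unspliced_aa, spliced_aa)
```

The 81-base intron corresponds to 27 codons, which accounts for the difference between the 2466 and 2439 amino acid translation products.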

For the purposes of this example, an E. coli-biased DNA sequence encoding the theoretical protein derived from the entire 7398 base open reading frame of SEQ ID NO:1 was designed and synthesized. The encoded protein of 2466 amino acids is identical in sequence to SEQ ID NO:2 (Tcp1_Gz) and thus includes the amino acids encoded by the putative intron identified in the genomic sequence. As seen in the further Examples, this protein, derived from a eukaryotic organism, has the surprising activity of potentiating the insect toxicity of bacterially derived Class A TC proteins.

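TABLE 2

Synonymous codon representation in highly expressed genes of E. coli, and calculation of a biased codon representation set for E. coli-optimized synthetic gene design.

| A: Amino Acid | B: Codon | C: Weighted % | D: Class II Genes % | E: Amino Acid | F: Codon | G: Weighted % | H: Class II Genes % |
|---|---|---|---|---|---|---|---|
| Ala (A) | GCA | 28.6 | 24.0 | Leu (L) | CTA | NA | 0.8 |
| | GCC | NA | 16.1 | | CTC | NA | 8.0 |
| | GCG | 38.5 | 32.3 | | CTG | 100.0 | 76.7 |
| | GCT | 32.8 | 27.5 | | CTT | NA | 5.6 |
| Arg (R) | AGA | NA | 0.6 | | TTA | NA | 3.4 |
| | AGG | NA | 0.3 | | TTG | NA | 5.5 |
| | CGA | NA | 1.1 | Lys (K) | AAA | 78.6 | 78.6 |
| | CGC | 33.9 | 33.0 | | AAG | 21.5 | 21.5 |
| | CGG | NA | 0.8 | Met (M) | ATG | 100.0 | 100.0 |
| | CGT | 66.1 | 64.3 | Phe (F) | TTC | 70.9 | 70.9 |
| Asn (N) | AAC | 100.0 | 82.8 | | TTT | 29.1 | 29.1 |
| | AAT | NA | 17.3 | Pro (P) | CCA | 17.5 | 15.3 |
| Asp (D) | GAC | 54.0 | 54.0 | | CCC | NA | 1.6 |
| | GAT | 46.1 | 46.1 | | CCG | 82.5 | 71.9 |
| Cys (C) | TGC | 61.2 | 61.2 | | CCT | NA | 11.2 |
| | TGT | 38.9 | 38.9 | Ser (S) | AGC | 29.2 | 24.3 |
| END | TAA | 100.0 | | | AGT | NA | 4.5 |
| | TAG | | | | TCA | NA | 4.8 |
| | TGA | | | | TCC | 31.9 | 26.6 |
| Gln (Q) | CAA | 18.7 | 18.7 | | TCG | NA | 7.4 |
| | CAG | 81.4 | 81.4 | | TCT | 38.9 | 32.4 |
| Glu (E) | GAA | 75.4 | 75.4 | Thr (T) | ACA | NA | 4.7 |
| | GAG | 24.7 | 24.7 | | ACC | 64.8 | 53.6 |
| Gly (G) | GGA | NA | 2.0 | | ACG | NA | 12.7 |
| | GGC | 45.7 | 42.8 | | ACT | 35.2 | 29.1 |
| | GGG | NA | 4.4 | Trp (W) | TGG | 100.0 | 100.0 |
| | GGT | 54.3 | 50.8 | Tyr (Y) | TAC | 64.8 | 64.8 |
| His (H) | CAC | 70.2 | 70.2 | | TAT | 35.2 | 35.2 |
| | CAT | 29.8 | 29.8 | Val (V) | GTA | 23.1 | 20.0 |
| Ile (I) | ATA | NA | 0.6 | | GTC | NA | 13.5 |
| | ATC | 66.3 | 65.9 | | GTG | 31.0 | 26.8 |
| | ATT | 33.7 | 33.5 | | GTT | 46.0 | 39.8 |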

EXAMPLE 3
Engineering of the Synthetic Bacterial DNA Encoding Tcp1_Gz

The synthetic, bacterial biased DNA (SEQ ID NO:5) encoding the protein Tcp1_Gz, was engineered for expression by inserting it into two different E. coli expression vectors to assist in optimizing expression conditions. The first was vector pBT (U.S. application Ser. No. 10/754,115, filed Jan. 7, 2003), which uses a standard E. coli promoter. The pBT based plasmid was designated pDAB8828. The second was a pET expression vector (Novagen, Madison Wis.), which utilizes a bacteriophage T7 promoter. The pET based expression plasmid was designated pDAB8829. Each of these expression plasmids was constructed using standard molecular biology techniques. The engineering was done in such a way as to maintain appropriate bacterial transcription and translation signals. Both plasmids pDAB8828 and pDAB8829 encode the protein Tcp1_Gz, yet differ in promoter, selectable marker and various other features of the vector backbones.

EXAMPLE 4
Expression Conditions of pDAB8828 and Lysate Preparations

The expression plasmids pBT (empty vector control described in U.S. application Ser. No. 10/754,115, filed Jan. 7, 2003), and pDAB8828 were transformed into the E. coli expression strain BL21 (Novagen, Madison, Wis.) using standard methods. Expression cultures were initiated with 10-200 freshly transformed colonies placed into 250 mL LB medium containing 50 μg/mL antibiotic and 75 μM IPTG (isopropyl-β-D-thiogalactopyranoside). The cultures were grown at 28° C. for 48 hours at 180-200 rpm, and the cells were collected by centrifugation at 5,000×g for 20 minutes at 4° C. Cell pellets were suspended in 4-4.5 mL Butterfield's Phosphate solution (Hardy Diagnostics, Santa Maria, Calif.; 0.3 mM potassium phosphate pH 7.2), transferred to 50 mL polypropylene screw cap centrifuge tubes with 1 mL of 0.1 mm diameter glass beads (Biospec, Bartlesville, Okla., catalog number 1107901), and then were chilled on ice. The cells were lysed by sonication with two 45 second bursts using a 2 mm probe with a Branson Sonifier 250 (Danbury, Conn.) at an output of ˜30, chilling completely between bursts. The lysates were transferred to 2 mL Eppendorf tubes and centrifuged 10 minutes at 16,000×g. Supernatants were collected and the protein concentration measured. Bio-Rad Protein Dye Assay Reagent was diluted 1:5 with H₂O and 1 mL was added to 10 μL of a 1:10 dilution of each sample and to bovine serum albumin (BSA) at concentrations of 5, 10, 15, 20 and 25 μg/mL. The optical densities of the samples were then read at 595 nm wavelength in a SpectraMax Plus spectrophotometer (Sunnyvale, Calif.). The lysates were assayed fresh.

EXAMPLE 5
Expression Conditions of pDAB8829 and Lysate Preparations

The expression plasmids pET (empty vector control), pDAB8920 and pDAB8829 were transformed into the E. coli T7 expression strain BL21(DE3) STAR (Invitrogen, Carlsbad, Calif.) using standard methods. Plasmid pDAB8920 was used as a positive potentiation control. It contains a fused potentiator gene consisting of the Photorhabdus luminescens genes tcdB2 and tccC3 fused via a 14 amino acid linker. Plasmid pDAB8920 is the subject of a separate application (U.S. Ser. No. 60/549,516, filed Mar. 2, 2004). Expression cultures were initiated with 10-200 freshly transformed colonies placed into 250 mL LB medium containing 50 μg/mL antibiotic and 75 μM IPTG (isopropyl-β-D-thiogalactopyranoside). The cultures were grown, lysed, and otherwise processed as described in Example 4 above.

EXAMPLE 6
Bioassay Conditions for pDAB8828 and pDAB8829 Lysates

Insect bioassays were conducted with neonate larvae on artificial diets in 128-well trays specifically designed for insect bioassays (C-D International, Pitman, N.J.). The species assayed were the southern corn rootworm, Diabrotica undecimpunctata howardi (Barber), and the corn earworm, Helicoverpa zea (Boddie).

Bioassays were incubated under controlled environmental conditions (28° C., ˜40% relative humidity, 16 h:8 h [Light:Dark]) for 5 days, at which point the total number of insects in the treatment, the number of dead insects, and the weights of surviving insects were recorded.

The biological activity of the crude lysates alone or with added Toxin Complex Class A proteins TcdA or XptA2_xwi was assayed as follows. Crude E. coli lysates (40 μL) (3-21 mg/mL) of either control cultures or those expressing potentiator proteins were applied to the surface of artificial diet in 8 wells of a bioassay tray. The average surface area of treated diet in each well was ˜1.5 cm². The TcdA or XptA2_xwi proteins were added as highly purified fractions from bacterial cultures heterologously expressing the individual proteins. The final concentrations of XptA2_xwi and TcdA on the diet were 250 ng/cm² and 50 ng/cm², respectively. At these doses, these proteins have essentially no significant effect on the growth of the test insect larvae.
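For orientation, the surface concentrations above can be converted to approximate per-well amounts. The following Python sketch is a derived illustration, not data from the assays; it assumes the stated average treated area of about 1.5 cm² applies to every well, and the names are hypothetical.

```python
WELL_AREA_CM2 = 1.5  # approximate treated diet surface per well (from the text)

def per_well_ng(dose_ng_per_cm2, area_cm2=WELL_AREA_CM2):
    """Approximate mass of Class A protein applied to one well, in ng."""
    return dose_ng_per_cm2 * area_cm2

xpta2_ng = per_well_ng(250)  # XptA2_xwi at 250 ng/cm2
tcda_ng = per_well_ng(50)    # TcdA at 50 ng/cm2

# Lysate protein per well: volume (uL) x concentration (mg/mL) gives ug.
LYSATE_UL = 40
lysate_ug_range = (LYSATE_UL * 3, LYSATE_UL * 21)

print(xpta2_ng, tcda_ng, lysate_ug_range)
```

Under these assumptions, each well receives on the order of hundreds of nanograms of Class A protein against roughly 120-840 μg of total lysate protein.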

EXAMPLE 7
Bioassay Results for pDAB8828 Lysates

Table 3 shows the bioassay results for lysates of cells programmed to express the Tcp1_Gz protein from plasmid pDAB8828, as compared to control cell lysates. Examination of the data shows that TcdA (coleopteran active) and XptA2_xwi (lepidopteran active) had negligible impact when mixed with vector only control lysates. It should be noted that the amount of TcdA and XptA2_xwi added to the lysates was adjusted to highlight the potentiation effect of the potentiator-encoding genes. Lysates from cells containing pDAB8828 did not kill insects. However, when mixed with TcdA or XptA2_xwi proteins, significant growth inhibition was noted, with the expected spectrum of activity. Analysis of the various lysates by SDS-PAGE (sodium dodecyl sulfate polyacrylamide gel electrophoresis) showed the presence of a ˜280 kDa band in pDAB8828 samples but not in the control samples. The migration of the band is consistent with the calculated size of Tcp1_Gz (i.e. 277.7 kDa). These results demonstrate that the plasmid pDAB8828 produces the protein Tcp1_Gz and this protein exhibits the surprising function of potentiating the activity of the Class A proteins TcdA and XptA2 against their target insects.

TABLE 3

Response of coleopteran and lepidopteran species to E. coli lysates and purified proteins. Seven to nine insects used per replicate.

| Sample | Lysate Tested | Southern Corn Rootworm | Corn Earworm |
|---|---|---|---|
| pBT | Control | 0 | + |
| pBT + TcdA | Control | + | nt |
| pBT + XptA2_Xwi | Control | nt | 0 |
| pDAB8828 | Tcp1_Gz | 0 | 0 |
| pDAB8828 + TcdA | Tcp1_Gz | ++++ | nt |
| pDAB8828 + XptA2_Xwi | Tcp1_Gz | nt | ++++ |

Data are for two independent replicates.

Growth Inhibition Scale: 0 = 0-20%; + = 21-40%; ++ = 41-60%; +++ = 61-80%; ++++ = 81-100%; nt = not tested.
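The growth inhibition scale used in Table 3 can be expressed as a small lookup. The Python sketch below is illustrative only; the function name is hypothetical, and the treatment of non-integer values between bins (each bin taken to include its upper bound) is an assumption, since the scale is stated in whole percentages.

```python
def inhibition_symbol(percent):
    """Map percent growth inhibition to the table symbol; None means untested."""
    if percent is None:
        return "nt"  # not tested
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Assumption: each bin includes its upper bound (e.g. 20.5 maps to "+").
    bins = ((20, "0"), (40, "+"), (60, "++"), (80, "+++"), (100, "++++"))
    for upper, symbol in bins:
        if percent <= upper:
            return symbol

print([inhibition_symbol(p) for p in (0, 20, 21, 60, 85, None)])
```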

EXAMPLE 8
Bioassay Results for pDAB8829 Lysates

Table 4 shows the bioassay results for lysates of cells programmed to express the Tcp1_Gz protein from plasmid pDAB8829, as compared to control cell lysates, and lysates of cells programmed to express the fused potentiator 8920. Examination of the data shows that TcdA (coleopteran toxin) and XptA2_xwi (lepidopteran toxin) had negligible impact when mixed with vector only control lysates. It should be noted that the amount of TcdA and XptA2_xwi added to the lysates was adjusted to highlight the potentiation effect of the TcdB2 and TccC3 encoding genes. Lysates from pDAB8920-containing cells alone did not kill insects. However, when mixed with TcdA or XptA2_xwi, significant insect inhibition was noted with the expected spectrum. Surprisingly, lysates of cells programmed to produce the Tcp1_Gz protein exhibited a similar activity profile as the 8920 potentiator. Analysis of the various lysates by SDS-PAGE showed the presence of a ˜280 kDa band in pDAB8829 samples as compared to vector lysates. The migration of the band is consistent with the predicted molecular weight of Tcp1_Gz. These results demonstrate that the plasmid pDAB8829 produces the protein Tcp1_Gz and this protein potentiates the activity of the insect active Class A TcdA and XptA2 proteins.

TABLE 4

Response of coleopteran and lepidopteran species to E. coli lysates and purified proteins. Seven to nine insects used per replicate.

Sample                  Lysate Tested         Southern Corn Rootworm   Corn Earworm
pET                     Control               +                        +
pET + TcdA              Control               0                        nt
pET + XptA2             Control               nt                       0
pDAB8920                8920 (TcdB2/TccC3)    0                        0
pDAB8920 + TcdA         8920 (TcdB2/TccC3)    ++++                     nt
pDAB8920 + XptA2_Xwi    8920 (TcdB2/TccC3)    nt                       ++++
pDAB8829                Tcp1_Gz               0                        0
pDAB8829 + TcdA         Tcp1_Gz               ++++                     nt
pDAB8829 + XptA2_Xwi    Tcp1_Gz               nt                       ++++

Data are for two independent replicates.

Growth Inhibition: 0 = 0-20%; + = 21-40%; ++ = 41-60%; +++ = 61-80%; ++++ = 81-100%; nt = not tested.

EXAMPLE 9
Identification of Other Naturally Occurring Fused Class B/Class C Proteins

This example provides a further illustration of an approach that could be used to search a protein database for candidate proteins having homology to Class B and Class C TC proteins. An artificially generated fusion protein sequence was first constructed using a DNA/Protein analysis program [Vector NTI (Informax, Inc.)]. One skilled in the art would recognize that several other DNA/Protein analysis programs could alternatively be used. An example of such a fusion protein, generated from the amino acid sequences of TcaC (GenBank Accession AAC38625.1) and TccC1 (GenBank Accession AAL18473.1) (both from Photorhabdus luminescens strain W-14), is disclosed in SEQ ID NO:6. This artificial fusion protein sequence was used in a standard protein-protein BLAST search of the NCBI nonredundant protein database, using default values as listed below:

- Filter set to low complexity;
- Expect 10;
- Word size 3;
- Matrix BLOSUM62
- Gap Costs: Existence 11, Extension 1

FIG. 1 presents the graphical output of such a search. [The actual output of the computer search is presented in colors on a computer monitor; it is understood that the figure, as printed, is not exactly the same as the computer monitor output. This does not limit the interpretations presented herein.] Across the top of the figure is a bar with differently shaded segments to represent Alignment Scores calculated from different amounts of amino acid sequence homology of the query sequence to a sequence identified in the search. The values shown are: <40, 40-50, 50-80, 80-200, and >=200. The next horizontal line under the Alignment Score bar represents the amino acid sequence of the artificial fusion query sequence of 2858 amino acids, with divisions of 500 amino acids. This artificial fusion protein used as the query sequence is composed of TcaC amino acids from residues 1 to 1485, and TccC1 amino acids from residues 1486 to 2858. The horizontal lines in the data portion of FIG. 1 represent individual proteins, identified by the BLAST algorithm as possessing an amino acid sequence related (within the parameters of the search) to the query sequence. For clarity and ease of reference, numbers have been added to certain of the landmark lines; such numbers are not a part of the original output. Inspection of FIG. 1 reveals that there were 64 lines representing protein sequences identified as having significant homology regions to the query sequence. It is noted that several of the horizontal lines do not represent single proteins. For example, the larger, left-hand portion of line 1 was identified by the output as “gi|3265037|gb|AAC38625.1| insecticidal toxin complex protein TcaC [Photorhabdus luminescens]” (i.e., part of the query sequence), whereas the right-hand, smaller portion of line 1 is identified as “>gi|53693249|ref|ZP_00127870.2| COG3209: Rhs family protein [Pseudomonas syringae pv. syringae B728a].” The gap in line 1 between the left-hand and right-hand portions of the line indicates that the two homology regions belong to separately encoded proteins.
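The coordinates of the artificial fusion query described above can be illustrated with a short Python sketch. The residue numbers (1485 and 2858) are taken from the text; the function names and the idea of labelling a query position by the parent protein it derives from are illustrative assumptions, not part of the disclosed method.

```python
TCAC_LENGTH = 1485   # TcaC occupies query residues 1-1485 (from the text)
QUERY_LENGTH = 2858  # total length of the artificial fusion query (from the text)


def make_fusion(class_b_seq, class_c_seq):
    """Concatenate a Class B and a Class C amino acid sequence into one query."""
    return class_b_seq + class_c_seq


def query_segment(position, boundary=TCAC_LENGTH, total=QUERY_LENGTH):
    """Return the parent protein for a 1-based residue position in the query."""
    if not 1 <= position <= total:
        raise ValueError("position lies outside the query sequence")
    return "TcaC" if position <= boundary else "TccC1"


# Length of the TccC1-derived portion (residues 1486-2858).
tccc1_length = QUERY_LENGTH - TCAC_LENGTH

if __name__ == "__main__":
    print(tccc1_length, query_segment(1485), query_segment(1486))
```

A hit whose homology regions fall on both sides of residue 1485 is, under these assumptions, a candidate for having both Class B-like and Class C-like regions.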

In some instances, however, the homology lines represent a single protein. For example, line 53 has the left-hand and right-hand homology regions joined by a slashed line. The BLAST output identified this protein as the subject of this invention: “>gi|42545609|gb|EAA68452.1| hypothetical protein FG10566.1 [Gibberella zeae PH-1]”. Other single proteins with homology regions to the query sequence are identified in Table 5. Although discovered by their homologies to a Class B and a Class C protein, it is understood that the biological functions/activities of the deduced proteins have not been confirmed. However, given the subject disclosure (but not prior to it), one is now motivated to assess the functionality of these proteins for their ability to potentiate the activity of Class A Toxin Complex proteins.

TABLE 5

Hypothetical fusion proteins identified by BLAST search of NCBI nonredundant protein database.

Output    Identified Sequence Listing

Line 55   >gi|42527066|ref|NP_972164.1| YD repeat protein [Treponema denticola ATCC 35405]
          Length = 3320
          Left-hand portion:
            Score = 49.7 bits (117), Expect = 0.001
            Identities = 37/106 (34%), Positives = 54/106 (50%), Gaps = 19/106 (17%)
          Right-hand portion:
            Score = 116 bits (290), Expect = 1e-23
            Identities = 199/931 (21%), Positives = 324/931 (34%), Gaps = 208/931 (22%)

Line 58   >gi|45657896|ref|YP_001982.1| cytoplasmic membrane protein [Leptospira interrogans serovar Copenhageni str. Fiocruz L1-130]
          Length = 2554
          Left-hand portion:
            Score = 62.8 bits (151), Expect = 1e-07
            Identities = 55/214 (25%), Positives = 93/214 (43%), Gaps = 37/214 (17%)
          Right-hand portion:
            Score = 84.3 bits (207), Expect = 4e-14
            Identities = 185/818 (22%), Positives = 309/818 (37%), Gaps = 141/818 (17%)

Line 60   >gi|24214465|ref|NP_711946.1| Rhs family protein [Leptospira interrogans serovar Lai str. 56601]
          Length = 2321
          Left-hand portion:
            Score = 62.4 bits (150), Expect = 2e-07
            Identities = 55/214 (25%), Positives = 93/214 (43%), Gaps = 37/214 (17%)
          Right-hand portion:
            Score = 77.4 bits (189), Expect = 5e-12
            Identities = 177/792 (22%), Positives = 300/792 (37%), Gaps = 145/792 (18%)

Line 63   >gi|20090892|ref|NP_616967.1| hypothetical protein MA2045 [Methanosarcina acetivorans C2A]
          Length = 2217
          Left-hand portion:
            Score = 63.9 bits (154), Expect = 6e-08
            Identities = 174/872 (19%), Positives = 302/872 (34%), Gaps = 219/872 (25%)
          Right-hand portion:
            Score = 63.5 bits (153), Expect = 8e-08
            Identities = 127/542 (23%), Positives = 210/542 (38%), Gaps = 114/542 (21%)

Line 64   >gi|48863870|ref|ZP_00317763.1| COG3209: Rhs family protein [Microbulbifer degradans 2-40]
          Length = 2480
          Left-hand portion:
            Score = 62.0 bits (149), Expect = 2e-07
            Identities = 140/647 (21%), Positives = 233/647 (36%), Gaps = 102/647 (15%)
          Right-hand portion:
            Score = 54.7 bits (130), Expect = 4e-05
            Identities = 177/914 (19%), Positives = 318/914 (34%), Gaps = 188/914 (20%)
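The distinction drawn in this example, between a single subject protein matching both halves of the fusion query and two separately encoded proteins each matching one half, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the input is a list of (subject identifier, query start, query end) tuples such as might be parsed from BLAST output, the assignment of a homology region to a side by its midpoint is an illustrative choice, and the coordinates in the usage example are invented, since Table 5 does not report query coordinates.

```python
from collections import defaultdict

BOUNDARY = 1485  # last TcaC-derived residue of the fusion query (from the text)


def fused_candidates(hsps, boundary=BOUNDARY):
    """Return subject IDs having homology regions on both sides of the boundary.

    hsps: iterable of (subject_id, query_start, query_end), 1-based coordinates.
    Each region is assigned to the Class B side or the Class C side of the
    query by its midpoint (an assumption made for this sketch).
    """
    sides = defaultdict(set)
    for subject_id, q_start, q_end in hsps:
        midpoint = (q_start + q_end) / 2
        sides[subject_id].add("B" if midpoint <= boundary else "C")
    return sorted(sid for sid, seen in sides.items() if seen == {"B", "C"})


if __name__ == "__main__":
    # Hypothetical coordinates, for illustration only.
    example = [
        ("subject_X", 700, 1400),
        ("subject_X", 1600, 2400),
        ("subject_Y", 100, 900),
    ]
    print(fused_candidates(example))
```

Because the regions are grouped by subject identifier, two different proteins drawn on one line of the graphical output (as for line 1 of FIG. 1) are not reported as a single fused candidate.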

EXAMPLE 10
Cloning of a Gene Encoding a Class B/C Fusion Protein of Toxin Complex Potentiators from Tannerella forsythensis

Pfam model analysis and scanning of publicly available DNA and protein sequence databases (NCBI and TIGR Microbial) identified a gene encoding a candidate fused Class B/C Toxin Complex (TC) potentiator protein and another four potential Class C TC genes in the genome of Tannerella forsythensis. (Also known as Bacteroides forsythus. As of the subject filing date, this genome was not known to be available from Entrez, and at the TIGR Microbial Database it was listed as unfinished with no target date for completion.) These Class C TC genes are located downstream of the gene encoding the fused Class B/C TC protein. The putative gene encoding the fused Class B/C TC protein was cloned.

Genomic DNA of Tannerella forsythensis (ATCC 43037) was purchased from the American Type Culture Collection (ATCC, Manassas, Va.). Primers for amplifying various regions of the fused Class B/C TC gene and its flanking sequences were designed based on the sequence in the public database. In the initial PCR reactions, a 4541 bp product corresponding to the region from 431 bp upstream of the Class B/C fusion gene to 4110 bp downstream of the putative start codon (ATG) was obtained using primers P1 and P2 (Table 6) with PfuTurbo hotstart DNA polymerase (Stratagene, La Jolla, Calif.). This PCR product was inserted into the pCRII Blunt TOPO vector (Invitrogen, Carlsbad, Calif.) and the DNA sequence of the insert DNA was determined. The sequencing results showed that the homology between the PCR fragment and the corresponding region of the fused Class B/C gene in the public database was only 97.1%. This implied that the bacterial strain on which the sequence in the public database was based might be different from that which we obtained from the ATCC (i.e. strain 43037). Multiple attempts were made to amplify the 3′ end of the fused Class B/C gene based on the sequence in the public database. Alternative primers were designed to amplify DNA fragments beginning at the 3′ end of the confirmed sequence region and extending to various regions downstream of the Class B/C fusion gene (based on the published sequence). An approximately 6.5 kb PCR fragment was obtained using primers P3 and P4 (Table 6) with Takara EX Taq™ DNA polymerase (Fisher Scientific, Pittsburgh, Pa.). This DNA fragment was cloned into the pCR2.1-TOPO vector (Invitrogen, Carlsbad, Calif.) and partially sequenced. 
The results from sequencing the two ends of this PCR product indicated that, although the forward primer (P3) annealed to the expected location in the Class B/C fusion gene, the reverse primer (P4) had annealed to the 5′ end of the fourth Class C TC related gene downstream of the Class B/C fusion gene. Moreover, the size of the PCR product (~6.5 kb) was smaller than the size predicted from the published genomic sequence (11201 bp), which indicated that there was a deletion or re-arrangement of DNA sequence in that region. The full sequence of the 3′ end of the Class B/C fusion gene represented in the 6.5 kb PCR product was obtained by stepwise walking from the confirmed region at the 5′ end all the way to the first in-frame stop codon. The full length sequence of the Class B/C fusion gene in Tannerella forsythensis (ATCC 43037) is disclosed in SEQ ID NO:11.

In parallel, three GenomeWalking “libraries” were constructed by digesting the genomic DNA of T. forsythensis (ATCC 43037) with Afe I, BsaB I, and Stu I restriction enzymes and using the BD GenomeWalker™ Universal Kit (BD Biosciences, San Jose, Calif.). The first round PCR was performed using primer P5 (Table 6) and AP1 (provided with the kit). The second round PCR was carried out using a pair of nested primers, P6 (Table 6) and AP2 (provided with the kit). Takara LA Taq™ DNA polymerase was used in both rounds of PCR reactions. Specific amplification was obtained from the libraries generated by BsaB I and Stu I digestions. These PCR products were cloned into the pCR2.1 TOPO vector and sequenced. The sequencing results matched with the corresponding region of SEQ ID NO:11 except for a few single nucleotide mutations that were possibly introduced during the PCR process. These results confirmed that the sequence disclosed in SEQ ID NO:11 was the actual sequence of the Class B/C fusion gene in Tannerella forsythensis (ATCC 43037), with very few discrepancies.

To further confirm this result, Southern blot analyses were performed using T. forsythensis genomic DNA digested with Hind III and BsaB I in separate reactions. The blots were probed with a 1030 bp DNA fragment representing part of the coding region for a Class B TC related protein in the Class B/C fusion gene. This probe was obtained by PCR amplification of the genomic DNA from T. forsythensis (ATCC 43037) using primers P2/P3. The results from the Southern blot analysis revealed that the probe hybridized to Hind III and BsaB I fragments of T. forsythensis (ATCC 43037) genomic DNA with the same sizes as those predicted from SEQ ID NO:11 [2792 bp for Hind III digestion, and 3598 bp for BsaB I].

It is noted that the DNA sequence of the T. forsythensis (ATCC 43037) Class B/C fusion gene disclosed as SEQ ID NO:11 was obtained from PCR products amplified from genomic DNA. It is well known in the art that such PCR amplifications can introduce small numbers of base incorporation errors. Thus, it is possible that the actual sequence of the gene as present in the T. forsythensis (ATCC 43037) genome may be slightly different from that disclosed in SEQ ID NO:11. Given that the sequence disclosed in SEQ ID NO:11 was determined from multiple PCR products, it is reasonable to expect that the genomic copy of the Class B/C fusion gene should be at least 99% identical to SEQ ID NO:11. Comparison of SEQ ID NO:11 to the corresponding sequence of the Class B/C fusion gene in the public databases reveals that the two sequences share 97% homology in the region comprising approximately 5.2 kb at the 5′ end, which corresponds to the entire coding region for an amino acid sequence related to a Class B TC protein plus the core region of a Class C TC protein. Downstream of the 5.2 kb region, there is an approximately 460 bp sequence with high homology to the hyper-variable region of the third Class C TC related gene downstream of the Class B/C fusion gene, then there is an additional ˜420 bp at the 3′ end which currently exhibits relatively lower homology (less than 60%) than any other part of the published Tannerella forsythensis (ATCC 43037) partial genome sequence database. This indicated that the sequence of the putative gene encoding the fused Class B/C TC protein cloned from the genomic DNA of Tannerella forsythensis (ATCC 43037) (SEQ ID NO:11) was different from that in the public databases.

TABLE 6

Primers used for PCR

Primer ID   Sequence                        SEQ ID NO:
P1          AGGATCGTACGATGGAACAAGAGG        13
P2          CGACTGTGATGCGTAACGAACAGA        14
P3          GTCCGACGGTCTGTATATGCTTAG        15
P4          CCGAAGAAATCAATGCCTGCCGAT        16
P5          TAATGTCCCCGACGGTAAATGGCTTGAA    17
P6          GCGTCTGTTCGTTACGCATCACAGTCG     18
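The primer sequences of Table 6 can be inspected with a few lines of Python. The sketch below is illustrative: the helper names are hypothetical, and only the sequences themselves come from Table 6. One relationship that such a check makes visible is that primer P6 consists of the bases GCG followed by the reverse complement of primer P2, i.e. the nested genome-walking primer lies at the P2 site but reads in the opposite direction.

```python
PRIMERS = {
    "P1": "AGGATCGTACGATGGAACAAGAGG",
    "P2": "CGACTGTGATGCGTAACGAACAGA",
    "P3": "GTCCGACGGTCTGTATATGCTTAG",
    "P4": "CCGAAGAAATCAATGCCTGCCGAT",
    "P5": "TAATGTCCCCGACGGTAAATGGCTTGAA",
    "P6": "GCGTCTGTTCGTTACGCATCACAGTCG",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def gc_fraction(seq):
    """Return the fraction of G and C bases in a DNA sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    for name, seq in PRIMERS.items():
        print(name, len(seq), round(gc_fraction(seq), 2))
    print(PRIMERS["P6"].endswith(reverse_complement(PRIMERS["P2"])))
```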

EXAMPLE 11
Identification of Further Multi-Domain TC Proteins

In light of the subject disclosure of the activity of naturally fused “BC” toxin complex proteins, one skilled in the art now has the motivation to find other such fusion proteins with the expectation that they can potentiate the insecticidal activity of a Class A toxin complex toxin protein. It is well known in the art that standard BLAST searches of protein databases can be used to identify proteins related to one another by amino acid sequence homology. This example teaches how one may analyze a database of protein sequences and extract those having particular domain structures that predict their function as Class B or Class C Toxin Complex (TC) potentiators. The Class B and Class C TC gene families encode proteins that are relatively large and have distinctive protein domain structures. These two factors can be used in tandem to extract sequences of individual Class B and Class C TC proteins from large protein databases. Similarly, when the Class B and C TC proteins are fused into a single polypeptide, their large size and distinctive combinations of protein domains can be used to set up specific searches to extract sequences of related structure and function from protein databases. This is now possible because of the subject disclosure.

Protein domain searches are routinely performed utilizing a Pfam search algorithm (E.L.L. Sonnhammer, S.R. Eddy, and R. Durbin), either at the Pfam web site (pfam.wustl.edu/), at a “mirror” site (e.g. website at sanger.ac.uk/Software/Pfam/), or with a local installation of the database. While these Pfam models are very helpful, they could miss existing domains, especially if those domains are reasonably divergent from the model. Therefore, to increase sensitivity of domain detection, it is desirable to establish protein domain models specific to the gene families being studied. This can be done with the same set of analysis tools as was used in generating the Pfam families (i.e. HMMER; R. Durbin, S.R. Eddy, A. Krogh, and G. Mitchison) and will often allow identification of protein domains missed by the more general models.

The workflow is conceptually simple. First, a search is performed on a protein database to extract a subset of the database sequences. Second, this subset is tested against an HMM model generated using HMMER. The hits generated against the model, and which have an appropriate significance level, will either encompass the set of proteins that can be selected for experimental characterization, or can serve as a smaller database that is screened against a second HMM model. This screening can be iterated as necessary until the desired level of resolution has been attained.
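The iterated screening described above can be sketched in Python as a chain of filters. This is a minimal illustration under stated assumptions: each "model" is represented by a hypothetical callable that returns an E-value for a sequence (standing in for a search of that sequence against one HMM), and the significance threshold is chosen by the user.

```python
def iterative_screen(database, models, max_evalue):
    """Screen sequences against each model in turn, keeping only the hits.

    database: dict mapping sequence ID -> sequence string.
    models: list of callables, each taking a sequence and returning an
            E-value (hypothetical stand-ins for one HMM search each).
    max_evalue: hits with an E-value at or below this value are retained.
    """
    current = dict(database)
    for score in models:
        # The hits of one round become the smaller database for the next.
        current = {sid: seq for sid, seq in current.items() if score(seq) <= max_evalue}
    return current


if __name__ == "__main__":
    # Toy scoring functions, for illustration only.
    toy_db = {"a": "xxFGxxSPVB", "b": "xxFGxx", "c": "yyyy"}
    fg_model = lambda seq: 1e-50 if "FG" in seq else 5.0
    spvb_model = lambda seq: 1e-80 if "SPVB" in seq else 3.0
    print(iterative_screen(toy_db, [fg_model, spvb_model], 1e-10))
```

Because each round only removes sequences that fail its own model, the final set under this sketch does not depend on the order in which the models are applied.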

The examples below exemplify the utility of this approach for four distinct sets of TC protein families: the single Class B TC proteins, the single Class C TC proteins, and the fused Class B/C TC proteins from eukaryotic and prokaryotic archaeal sources.

Class B TC Proteins. All the Class B TC genes discovered to date come from prokaryotes; therefore, the initial search set was restricted to prokaryotic protein sequences. A protein search at the website ncbi.nlm.nih.gov/ was performed using the search terms: “1400:1600[SLEN] AND Prokaryota”. These terms restricted the search to those proteins that are between 1400 and 1600 amino acids in length and have prokaryotic sources. A total of 3522 protein sequences were identified and downloaded as a searchable database. It should be noted that, while these restrictions are useful in the present context, the interval length of the sequences and the kingdom searched can be modified to meet the parameters of the individual protein set to be examined.
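The length restriction expressed by the search term “1400:1600[SLEN]” can also be applied locally to a downloaded set of sequences. The Python sketch below is an illustration only: it assumes the sequences are available as FASTA-formatted text, and the function names are hypothetical. The taxonomic restriction is not reproduced here, since it depends on annotation that a plain FASTA file need not carry.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of header -> sequence."""
    records, header, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


def filter_by_length(records, min_len=1400, max_len=1600):
    """Keep records whose sequence length lies within [min_len, max_len]."""
    return {h: s for h, s in records.items() if min_len <= len(s) <= max_len}


if __name__ == "__main__":
    toy = ">a\nAAAA\nAA\n>b\nAA\n>c\nAAAAAAAAAA\n"
    print(filter_by_length(read_fasta(toy), min_len=3, max_len=8))
```

As noted in the text, the interval can be changed to suit the protein set being examined, for example to the 800 to 1100 amino acid range used for the Class C TC proteins.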

All known Class B TC proteins examined to date contain two distinct sets of domains. (The domain terminology used herein is taken from the Pfam site and the domains can be searched for by name at that site.) At the amino terminal end is a highly conserved spvB domain. This domain is so well conserved that it was not necessary to construct a more specific HMM domain model, and the general model (spvB_ls.hmm) was downloaded directly from the Pfam web site. See also M. L. Lesnick, N. E. Reiner, J. Fierer, and D. G. Guiney, Mol. Microbiol. March 2001, 39(6):1464-70.

The spvB domain is followed by multiple FG-GAP domains. See, e.g., T. A. Springer, “Folding of the N-terminal, ligand-binding region of integrin alpha-subunits into a beta-propeller domain,” Proc. Natl. Acad. Sci. U.S.A. 1997, 94:65-72. The general Pfam FG-GAP domain model misses many domains within a single protein, and misses all domains in certain proteins, when default Pfam gathering threshold values are used (for example GenBank Accessions 66047263, 28871479 and 48730377, which are identified by our model, see below). If a more relaxed cutoff value is used [E-value=1.0], more domains are found, including those in the above proteins, but some domains may still not be discovered. Therefore, it was necessary to create a tailored FG-GAP HMM model.

Creation of an HMM model of a protein family typically requires three steps. First, a set of domains is chosen as a “seed,” and a multiple sequence alignment is created using ClustalX. Second, the multiple sequence alignment is used as an input for hmmbuild, a program that generates the HMM model. Finally, hmmcalibrate is used to correct the statistics of the particular model (hmmcalibrate, hmmbuild, and hmmsearch are components of the HMMER package). The seed domain set used to generate the HMM model is a key component of the success of the model. It must be diverse enough to capture all the diversity of the relevant domains. However, the seed set cannot contain all the known domain members since the test of the predictive power of the model requires its ability to identify the domain containing members not included in the seed set.
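The three steps above map onto three HMMER programs. The Python sketch below only assembles the corresponding command lines; it does not run them. It assumes a HMMER 2.x installation (the version shown in the search output of this example), in which hmmbuild takes the output HMM file followed by the alignment file, hmmcalibrate takes the HMM file, and hmmsearch takes the HMM file followed by the sequence database. The file names and the function name are hypothetical.

```python
def hmmer_commands(alignment_file, hmm_file, database_file, evalue_cutoff=10):
    """Return the command lines for building, calibrating and searching.

    The alignment file is assumed to have been produced beforehand
    (e.g. with ClustalX); nothing is executed by this function.
    """
    return [
        ["hmmbuild", hmm_file, alignment_file],        # step 2: build the HMM
        ["hmmcalibrate", hmm_file],                    # step 3: calibrate statistics
        ["hmmsearch", "-E", str(evalue_cutoff), hmm_file, database_file],
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for cmd in hmmer_commands("fg_gap_seed.aln", "fg_gap.hmm", "proteins.fasta"):
        print(" ".join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) on a system where the HMMER programs are installed.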

A set of Class B TC proteins to be used as the source of FG-GAP domains was acquired using the BLink resource for GenBank Accession 16416891 (TcaC from Photorhabdus luminescens). This extraction yielded 15 related, nonredundant prokaryotic proteins with scores higher than 2000. [The number of related proteins accessible through BLink of any given protein can vary with time, since GenBank can be a dynamic list.] The GenBank Accession numbers of the extracted proteins were: 16416891, 37524951, 16416930, 37524959, 27479675, 51597844, 22124105, 45443595, 50956508, 14041732, 32699986, 10956817, 66047263, 28871479, and 48730377.

An intermediate HMM model was generated by extracting the FG-GAP domains found in the above proteins using the general Pfam model, supplemented with the known FG-GAP domains from GenBank Accession 16416891 (i.e. TcaC). The domains used in the final model were obtained by extracting the domains from the two protein sequences which had the best and worst scores of proteins containing six FG-GAP domains (GenBank Accessions 16416891 and 66047263, respectively). [Note that six represents the canonical number of FG-GAP domains in most FG-GAP containing proteins.] It is not surprising that GenBank Accession 16416891 is the best hit, since it is part of the model itself. The ClustalX multiple sequence alignment of the six FG-GAP domains in these two proteins is shown below. This alignment can be used with hmmbuild to generate the FG-GAP HMM model used in this example.

CLUSTALX (1.83) multiple sequence alignment of derived FG-GAP domains.

gi|16416891|Domain_4   DARKLVAFSDMLGSGQQHLVEIKAN-RVTCWP-NLGHGRFGQP-   SEQ ID NO: 42
gi|66047263|Domain_4   -STELVAFSDLLGTGQQHLIRIRHN-EIRVWP-NLGRGRFGKG-   SEQ ID NO: 43
gi|16416891|Domain_3   --HPSIQFADLTGAGLSDLVLIGPK-SVRLYA-NQR-NGWRKGE   SEQ ID NO: 44
gi|66047263|Domain_3   --HPQGQMADLVGDGLSDLALIGPR-SVRLYA-NRRADGFAAA-   SEQ ID NO: 45
gi|16416891|Domain_6   -NTCQLQVADIQGLGIASLILTVPHIAPHHWRCDLSLTKPW---   SEQ ID NO: 46
gi|66047263|Domain_6   -RFCQFSAVDLLGLGFSSLVLTVPHMAPRHWSLYYAADRTG---   SEQ ID NO: 47
gi|16416891|Domain_2   --QDNASLMDINGDGQLDWVVTASG-IRGYHS-QQPDGKWTH--   SEQ ID NO: 48
gi|66047263|Domain_2   -APVRQTLTDLTGDGRLDWVVAQPG-MAGFFT-LNPDRSWSK-   SEQ ID NO: 49
gi|16416891|Domain_5   -NPERLFLADIDGSGTTDLIYAQSG-SLLIYL-NQSGNQFDAP-   SEQ ID NO: 50
gi|66047263|Domain_5   -DSSRVRLADLDGSGASDVLYLQAD-GFQVFM-NQGGNGLAAA-   SEQ ID NO: 51
gi|16416891|Domain_1   --QQRYQLVDLRGEGLPGMLYQDRG--AWWYK-APQRQEDGDS-   SEQ ID NO: 52
gi|66047263|Domain_1   --GQQYQLVDLYGDGLPGILYRDDK--AWLYR-EPIRDTAGTA-   SEQ ID NO: 53

This multiple sequence alignment was used with hmmbuild and hmmcalibrate to generate BModels3.hmm. The BModels3.hmm model was then tested against the 15 protein sample set above and was able to identify all expected FG-GAP domains. Conversely, when the BModels3.hmm model was tested against 20 randomly selected proteins, no FG-GAP domains were found. The BModels3.hmm model was then tested with hmmsearch against the 3522 member database containing all prokaryotic proteins between 1400 and 1600 amino acids, yielding the results below.

hmmsearch - search a sequence database with a profile HMM
HMMER 2.3.1 (June 2003)
Freely distributed under the GNU General Public License (GPL)
HMM file: FinalTest/BModels3.hmm [BDomainsModel3Sequences]
Sequence database: FinalTest/Prokaryotic1400-1600.fasta
per-sequence score cutoff: [none]
per-domain score cutoff: [none]
per-sequence Eval cutoff: <=10
per-domain Eval cutoff: [none]
Query HMM: BDomainsModel3Sequences
Accession: [none]
Description: [none]

[HMM has been calibrated; E-values are empirical estimates]

Scores for complete sequences (score includes all domains):

Sequence                              Description       Score   E-value   N
gi|16416891|gb|AAL18451.1|            toxin complex pr  323.9   1.1e-94   6
gi|3265037|gb|AAC38625.1|             insecticidal tox  323.9   1.1e-94   6
gi|66047263|ref|YP_237104.1|          Salmonella virul  323.7   1.3e-94   6
gi|63257970|gb|AAY39066.1|            Salmonella virul  323.7   1.3e-94   6
gi|36783948|emb|CAE12810.1|           Insecticidal tox  321.0   8.3e-94   6
gi|37524524|ref|NP_927868.1|          Insecticidal tox  321.0   8.3e-94   6
gi|28854730|gb|AAO57793.1|            insecticidal tox  312.1   4e-91     6
gi|28871479|ref|NP_794098.1|          insecticidal tox  312.1   4e-91     6
gi|36784385|emb|CAE13264.1|           Insecticidal tox  248.0   8e-72     6
gi|37524959|ref|NP_928303.1|          Insecticidal tox  248.0   8e-72     6
gi|51510118|emb|CAH19031.1|           unnamed protein   244.2   1.1e-70   6
gi|49021690|emb|CAG38449.1|           unnamed protein   244.2   1.1e-70   6
gi|27479675|gb|AAO17202.1|            TcdB2 [Photorhab  244.2   1.1e-70   6
gi|36784377|emb|CAE13256.1|           Insecticidal tox  229.4   3.2e-66   6
gi|37524951|ref|NP_928295.1|          Insecticidal tox  229.4   3.2e-66   6
gi|50956508|gb|AAT90757.1|            putative insecti  229.0   4.2e-66   6
gi|16416930|gb|AAL18487.1|            TcdB1; toxin com  227.6   1.1e-65   6
gi|9963679|gb|AAG09643.1|             SepB [Serratia e  214.1   1.2e-61   6
gi|10956817|ref|NP_065277.1|          SepB [Serratia e  214.1   1.2e-61   6
gi|13444938|emb|CAC34920.1|           unnamed protein   214.1   1.2e-61   6
gi|51591126|emb|CAH22791.1|           insecticidal tox  211.8   5.9e-61   6
gi|51597844|ref|YP_072035.1|          insecticidal tox  211.8   5.9e-61   6
gi|45477499|gb|AAS66065.1|            YO185-like prote  211.8   5.9e-61   6
gi|32699986|gb|AAP57764.1|            TcYF2 [Yersinia   207.9   8.9e-60   6
gi|15981600|emb|CAC93148.1|           insecticidal tox  201.9   6e-58     6
gi|45443595|ref|NP_995134.1|          insecticidal tox  201.9   6e-58     6
gi|25511233|pir||AH0447               insecticidal tox  201.9   6e-58     6
gi|16123821|ref|NP_407134.1|          insecticidal tox  201.9   6e-58     6
gi|22124105|ref|NP_667528.1|          putative toxin s  201.9   6e-58     6
gi|45438465|gb|AAS64011.1|            insecticidal tox  201.9   6e-58     6
gi|21956856|gb|AAM83779.1|AE013618_4  putative toxin s  201.9   6e-58     6
gi|14041732|emb|CAC38403.1|           XptC1 protein [X  178.3   7.7e-51   5
gi|48730377|ref|ZP_00264125.1|        hypothetical pro  138.4   7.5e-39   6
===========================================================================
gi|39576632|emb|CAE80796.1|           hypothetical pro  13.5    0.044     2
gi|42524423|ref|NP_969803.1|          hypothetical pro  13.5    0.044     2
gi|23129574|ref|ZP_00111400.1|        COG2931: RTX tox  5.7     0.39      2
gi|11360961|pir||H82802               fimbrial assembl  2.4     0.99      1
gi|15837080|ref|NP_297768.1|          fimbrial assembl  2.4     0.99      1
gi|28199488|ref|NP_779802.1|          fimbrial assembl  2.4     0.99      1
gi|53800360|ref|ZP_00359737.1|        COG3419: Tfp pil  2.4     0.99      1
gi|9105326|gb|AAF83288.1|AE003897_13  fimbrial assembl  2.4     0.99      1
gi|28057603|gb|AAO29451.1|            fimbrial assembl  2.4     0.99      1
gi|19704886|ref|NP_602381.1|          Fusobacterium ou  -2.0    3.4       1
gi|19712775|gb|AAL93680.1|            Fusobacterium ou  -2.0    3.4       1
gi|67760705|ref|ZP_00499423.1|        COG3321: Polyket  -2.3    3.7       1
gi|67756256|ref|ZP_00495142.1|        COG3321: Polyket  -2.3    3.7       1
gi|67738861|ref|ZP_00489483.1|        COG3321: Polyket  -2.3    3.7       1
gi|67710424|ref|ZP_00480230.1|        COG3321: Polyket  -2.3    3.7       1
gi|67683110|ref|ZP_00477256.1|        COG3321: Polyket  -2.3    3.7       1
gi|52212615|emb|CAH38641.1|           putative non-rib  -2.3    3.7       1
gi|53722201|ref|YP_111186.1|          putative non-rib  -2.3    3.7       1
gi|28199386|ref|NP_779700.1|          hemolysin-type c  -2.4    3.8       1
gi|28057492|gb|AAO29349.1|            hemolysin-type c  -2.4    3.8       1
gi|67157564|ref|ZP_00418813.1|        CobN/magnesium c  -2.7    4.1       1
gi|67085472|gb|EAM04946.1|            CobN/magnesium c  -2.7    4.1       1
gi|49532277|emb|CAG69989.1|           competence facto  -2.7    4.2       1
gi|50086301|ref|YP_047811.1|          competence facto  -2.7    4.2       1
gi|20803923|emb|CAD31501.1|           HYPOTHETICAL PRO  -2.8    4.3       1
gi|52857125|ref|ZP_00341419.1|        COG3419: Tfp pil  -3.2    4.8       1
gi|67546318|ref|ZP_00424233.1|        Amino acid adeny  -3.5    5.1       1
gi|67532466|gb|EAM29252.1|            Amino acid adeny  -3.5    5.1       1
gi|56750225|ref|YP_170926.1|          hypothetical pro  -4.0    6         1
gi|56685184|dbj|BAD78406.1|           hypothetical pro  -4.0    6         1
gi|46129819|ref|ZP_00164431.2|        COG2931: RTX tox  -4.0    6         1
gi|67935982|ref|ZP_00528997.1|        conserved hypoth  -4.7    7.2       1
gi|67775076|gb|EAM34747.1|            conserved hypoth  -4.7    7.2       1
gi|18466717|ref|NP_569524.1|          putative phage t  -4.9    7.5       1
gi|16506033|emb|CAD09919.1|           putative phage t  -4.9    7.5       1
gi|28199311|ref|NP_779625.1|          bacteriocin [Xyl  -5.4    8.7       1
gi|28057417|gb|AAO29274.1|            bacteriocin [Xyl  -5.4    8.7       1
gi|5834691|emb|CAB55188.1|            putative phage t  -5.7    9.5       1
gi|7467447|pir||T14966                phage lambda-rel  -5.7    9.5       1
gi|31795379|ref|NP_857832.1|          host specificity  -5.7    9.5       1
gi|45478595|ref|NP_995451.1|          phage lambda-rel  -5.7    9.5       1
gi|45357248|gb|AAS58642.1|            phage lambda-rel  -5.7    9.5       1
gi|16082788|ref|NP_395342.1|          host specificity  -5.7    9.5       1
gi|3883049|gb|AAC82709.1|             lambda host spec  -5.7    9.5       1

It is noted that there is a very clean break of E-values between gi|48730377| and gi|39576632| (double-underlined for clarity). The proteins with E-values below that of gi|48730377| (7.5e-39) were extracted. This dataset was then searched using the spvB_ls.hmm model and hmmsearch. The results are presented below:
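The “clean break” noted above can also be located programmatically, as the largest jump between neighbouring E-values on a logarithmic scale. The Python sketch below is one illustrative way to do this and is not the procedure used to prepare the example; the function name is hypothetical and all E-values are assumed to be greater than zero. The values in the usage example are the distinct E-values from the upper part of the hmmsearch output above.

```python
import math


def largest_evalue_gap(evalues):
    """Return the pair of neighbouring E-values separated by the largest
    jump in log10 space, as (value_below_gap, value_above_gap)."""
    ordered = sorted(evalues)
    if len(ordered) < 2:
        raise ValueError("at least two E-values are required")
    best_pair, best_gap = None, -1.0
    for lower, upper in zip(ordered, ordered[1:]):
        gap = math.log10(upper) - math.log10(lower)
        if gap > best_gap:
            best_pair, best_gap = (lower, upper), gap
    return best_pair


if __name__ == "__main__":
    observed = [1.1e-94, 1.3e-94, 8.3e-94, 4e-91, 8e-72, 1.1e-70, 3.2e-66,
                4.2e-66, 1.1e-65, 1.2e-61, 5.9e-61, 8.9e-60, 6e-58, 7.7e-51,
                7.5e-39, 0.044, 0.39, 0.99]
    print(largest_evalue_gap(observed))
```

Sequences with E-values at or below the lower member of the returned pair would then be retained for the next round of screening.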

hmmsearch - search a sequence database with a profile HMM
HMMER 2.3.1 (June 2003)
Freely distributed under the GNU General Public License (GPL)
HMM file: FinalTest/spvB_ls.hmm [SpvB]
Sequence database: FinalTest/ProBBModel3Hits.fasta
per-sequence score cutoff: [none]
per-domain score cutoff: [none]
per-sequence Eval cutoff: <=10
per-domain Eval cutoff: [none]
Query HMM: SpvB
Accession: PF03534.3
Description: Salmonella virulence plasmid 65 kDa B protein

[HMM has been calibrated; E-values are empirical estimates]

Scores for complete sequences (score includes all domains):

Sequence                              Description       Score   E-value   N
gi|16416891|gb|AAL18451.1|            toxin complex pr  871.6   1.4e-261  1
gi|3265037|gb|AAC38625.1|             insecticidal tox  871.6   1.4e-261  1
gi|36783948|emb|CAE12810.1|           Insecticidal tox  843.0   5.6e-253  1
gi|37524524|ref|NP_927868.1|          Insecticidal tox  843.0   5.6e-253  1
gi|9963679|gb|AAG09643.1|             SepB [Serratia e  821.5   1.7e-246  1
gi|10956817|ref|NP_065277.1|          SepB [Serratia e  821.5   1.7e-246  1
gi|13444938|emb|CAC34920.1|           unnamed protein   821.5   1.7e-246  1
gi|32699986|gb|AAP57764.1|            TcYF2 [Yersinia   730.4   4.3e-219  1
gi|36784377|emb|CAE13256.1|           Insecticidal tox  586.6   8.5e-176  1
gi|37524951|ref|NP_928295.1|          Insecticidal tox  586.6   8.5e-176  1
gi|16416930|gb|AAL18487.1|            TcdB1; toxin com  578.4   2.5e-173  1
gi|36784385|emb|CAE13264.1|           Insecticidal tox  574.5   3.9e-172  1
gi|37524959|ref|NP_928303.1|          Insecticidal tox  574.5   3.9e-172  1
gi|51510118|emb|CAH19031.1|           unnamed protein   572.1   2e-171    1
gi|49021690|emb|CAG38449.1|           unnamed protein   572.1   2e-171    1
gi|27479675|gb|AAO17202.1|            TcdB2 [Photorhab  572.1   2e-171    1
gi|50956508|gb|AAT90757.1|            putative insecti  472.8   1.6e-141  1
gi|15981600|emb|CAC93148.1|           insecticidal tox  465.9   1.8e-139  1
gi|45443595|ref|NP_995134.1|          insecticidal tox  465.9   1.8e-139  1
gi|25511233|pir||AH0447               insecticidal tox  465.9   1.8e-139  1
gi|16123821|ref|NP_407134.1|          insecticidal tox  465.9   1.8e-139  1
gi|22124105|ref|NP_667528.1|          putative toxin s  465.9   1.8e-139  1
gi|45438465|gb|AAS64011.1|            insecticidal tox  465.9   1.8e-139  1
gi|21956856|gb|AAM83779.1|AE013618_4  putative toxin s  465.9   1.8e-139  1
gi|51591126|emb|CAH22791.1|           insecticidal tox  460.1   1e-137    1
gi|51597844|ref|YP_072035.1|          insecticidal tox  460.1   1e-137    1
gi|45477499|gb|AAS66065.1|            YO185-like prote  460.1   1e-137    1
gi|14041732|emb|CAC38403.1|           XptC1 protein [X  451.3   4.6e-135  1
gi|66047263|ref|YP_237104.1|          Salmonella virul  288.4   5.2e-86   1
gi|63257970|gb|AAY39066.1|            Salmonella virul  288.4   5.2e-86   1
gi|28854730|gb|AAO57793.1|            insecticidal tox  277.8   7.6e-83   1
gi|28871479|ref|NP_794098.1|          insecticidal tox  277.8   7.6e-83   1
gi|48730377|ref|ZP_00264125.1|        hypothetical pro  191.9   5.6e-57   1

The above set was dereplicated to remove duplicates, leaving a set of proteins that are known Class B TC proteins, as previously identified by standard BLAST searches. [The duplications are of two kinds: duplicate entries of the same gene, and identical proteins from closely related strains of the same organism.] The dereplicated list is presented below:
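One simple way to carry out such a dereplication is to collapse entries that share an identical amino acid sequence. The Python sketch below illustrates this under stated assumptions: it treats two entries as duplicates only when their sequences are identical (ignoring letter case), it keeps the first identifier encountered for each sequence, and the identifiers and sequences in the usage example are invented. It is not asserted to be the procedure used to prepare the list in this example.

```python
def dereplicate(records):
    """Keep the first-seen entry for each distinct sequence.

    records: dict mapping sequence ID -> sequence string; insertion
    order determines which of several identical entries is retained.
    """
    unique = {}
    seen = set()
    for seq_id, seq in records.items():
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique[seq_id] = seq
    return unique


if __name__ == "__main__":
    # Invented identifiers and sequences, for illustration only.
    toy = {"entry_1": "MKTLLV", "entry_2": "MKTLLV", "entry_3": "MSTQQA"}
    print(dereplicate(toy))
```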

Sequence                        Description        Score    E-value  N
gi|16416891|gb|AAL18451.1|      toxin complex pr   871.6   1.4e−261  1
gi|36783948|emb|CAE12810.1|     Insecticidal tox   843.0   5.6e−253  1
gi|37524524|ref|NP_927868.1|    Insecticidal tox   843.0   5.6e−253  1
gi|10956817|ref|NP_065277.1|    SepB [Serratia e   821.5   1.7e−246  1
gi|13444938|emb|CAC34920.1|     unnamed protein    821.5   1.7e−246  1
gi|32699986|gb|AAP57764.1|      TcYF2 [Yersinia    730.4   4.3e−219  1
gi|37524951|ref|NP_928295.1|    Insecticidal tox   586.6   8.5e−176  1
gi|16416930|gb|AAL18487.1|      TcdB1; toxin com   578.4   2.5e−173  1
gi|37524959|ref|NP_928303.1|    Insecticidal tox   574.5   3.9e−172  1
gi|27479675|gb|AAO17202.1|      TcdB2 [Photorhab   572.1     2e−171  1
gi|50956508|gb|AAT90757.1|      putative insecti   472.8   1.6e−141  1
gi|15981600|emb|CAC93148.1|     insecticidal tox   465.9   1.8e−139  1
gi|22124105|ref|NP_667528.1|    putative toxin s   465.9   1.8e−139  1
gi|45443595|ref|NP_995134.1|    insecticidal tox   465.9   1.8e−139  1
gi|51597844|ref|YP_072035.1|    insecticidal tox   460.1     1e−137  1
gi|14041732|emb|CAC38403.1|     XptC1 protein [X   451.3   4.6e−135  1
gi|66047263|ref|YP_237104.1|    Salmonella virul   288.4    5.2e−86  1
gi|28871479|ref|NP_794098.1|    insecticidal tox   277.8    7.6e−83  1
gi|48730377|ref|ZP_00264125.1|  hypothetical pro   191.9    5.6e−57  1

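The dereplication step described above can be illustrated with a minimal Python sketch. It assumes that each hit is available as an (identifier, sequence) pair and that two entries count as duplicates when their amino acid sequences are identical. The helper name `dereplicate`, the identifiers, and the short sequences are all hypothetical placeholders, not data from this example.

```python
def dereplicate(records):
    """Keep the first record seen for each distinct amino acid sequence."""
    seen = set()
    unique = []
    for identifier, sequence in records:
        key = sequence.upper()  # compare sequences case-insensitively
        if key not in seen:
            seen.add(key)
            unique.append((identifier, sequence))
    return unique


# Hypothetical entries: entry_A and entry_B stand for two database records
# of the same protein; entry_C is a different protein.
hits = [
    ("entry_A", "MKNFVHSNTPSVTVL"),
    ("entry_B", "MKNFVHSNTPSVTVL"),
    ("entry_C", "MQNSQTFSVGELPLP"),
]

unique_hits = dereplicate(hits)
print([identifier for identifier, _ in unique_hits])
```

Keying on the sequence itself handles both kinds of duplication named above, since a second database entry for the same gene and an identical protein from a closely related strain both carry the same sequence.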

Thus, this example demonstrates the surprising result that, in contrast to standard BLAST searches, which identify proteins related by overall amino acid sequence homology, combinations of protein domain search strategies can identify additional related proteins. These protein domain strategies can be derived from relatively few examples of proteins containing domains that may not be revealed by standard domain search algorithms. For this example, the end result of the search is the same whether the spvB_ls.hmm search is done first or second.
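The order-independence noted above follows from the fact that applying two domain filters in sequence amounts to taking the intersection of their hit sets. A minimal Python sketch, assuming a toy database in which each protein is annotated with the set of domains it contains (the protein names, the annotations, and the `search` helper are hypothetical):

```python
def search(database, domain):
    """Return the entries of `database` that contain `domain`."""
    return {name: domains for name, domains in database.items() if domain in domains}


# Hypothetical domain annotations, for illustration only.
database = {
    "protein_1": {"spvB", "FG-GAP"},
    "protein_2": {"spvB"},
    "protein_3": {"FG-GAP", "RHS"},
    "protein_4": {"RHS"},
}

# Run the two searches in both orders; each second search only sees the
# survivors of the first.
spvb_then_fggap = search(search(database, "spvB"), "FG-GAP")
fggap_then_spvb = search(search(database, "FG-GAP"), "spvB")

print(sorted(spvb_then_fggap))
print(spvb_then_fggap == fggap_then_spvb)
```

In practice the order may still affect run time, since running the more selective model first leaves a smaller dataset for the second search.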

Class C TC Proteins. Class C TC proteins as annotated in GenBank are members of the RHS domain superfamily, whose members are characterized by multiple copies of the RHS domain. [Note: The domain terminology used herein is taken from the Pfam site and the domains can be searched for by name at that site.] See also C.W. Hill, C.H. Sandt, and D.A. Vlazny, “Rhs elements of Escherichia coli: a family of genetic composites each encoding a large mosaic protein,” Mol. Microbiol. June 1994, 12(6):865-71; A.D. Minet, B.P. Rubin, R.P. Tucker, S. Baumgartner, and R. Chiquet-Ehrismann, “Teneurin-1, a vertebrate homologue of the Drosophila pair-rule gene ten-m, is a neuronal protein with a novel type of heparin-binding domain,” J. Cell Sci. 1999, 112:2019-2032. All the Class C TC genes discovered to date come from prokaryotes; therefore, the initial search set was restricted to prokaryotic protein sequences. A protein search at the website ncbi.nlm.nih.gov was performed using the search terms: “800:1100[SLEN] AND Prokaryota”. These terms restricted the search to proteins that are between 800 and 1100 amino acids in length and have prokaryotic sources. A total of 54323 protein sequences were identified and downloaded as a searchable database. It should be noted that, while these restrictions are useful in the present context, the length interval of the sequences and the kingdom searched can be modified to meet the parameters of the individual protein set to be examined.
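The effect of the length restriction can be reproduced on any local FASTA file. The following Python sketch is an illustration only: it assumes the 800 to 1100 interval is inclusive at both ends, and the helper names `read_fasta` and `filter_by_length` and the three artificial records are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def filter_by_length(records, minimum=800, maximum=1100):
    """Keep records whose sequence length lies in [minimum, maximum]."""
    return [(h, s) for h, s in records if minimum <= len(s) <= maximum]


# Artificial records (poly-M placeholders), used only to exercise the filter.
fasta_text = "\n".join([
    ">too_short", "M" * 500,
    ">in_range", "M" * 950,
    ">too_long", "M" * 1500,
])

kept = filter_by_length(read_fasta(fasta_text))
print([header for header, _ in kept])
```

Changing `minimum` and `maximum` corresponds to modifying the length interval for a different protein set, as noted above.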

A set of Class C TC proteins to be used as the source of RHS domains was acquired using the BLink resource for GenBank Accession 27479677 (TccC3 from Photorhabdus luminescens). This extraction yielded 38 related, nonredundant prokaryotic proteins with scores higher than 800. [The number of related proteins accessible through BLink of any given protein can vary with time, since GenBank can be a dynamic list.] The GenBank Accession numbers of the extracted proteins were: 27479677, 27479683, 27479669, 37524966, 37528020, 37528005, 16416915, 27479639, 37524950, 37528309, 42742522, 32699988, 10956818, 51596618, 51596557, 45441893, 45441958, 14041731, 45443601, 51597848, 25511229, 45443600, 50956512, 28871477, 28871480, 66044304, 66043853, 66047265, 66045648, 66047259, 66047264, 66047260, 28868442, 48730374, 48730375, 48730376, 48732572, and 48732573.

An intermediate HMM model was generated by extracting the RHS domains found in the above proteins using the general Pfam RHS model as above. The list of domains was divided into two parts, one to be used as a seed set and one to be used for testing the model. The ClustalX multiple sequence alignment of the RHS domains of the seed set is shown below. This alignment can be used with hmmbuild to generate the RHS HMM model used in this example.

CLUSTAL X (1.83) multiple sequence alignment

gi|27479639|RHS_domain_1
-----------------ADATGALLTQT----DAKGNI------------
SEQ ID NO: 54

gi|37524966|RHS_domain_1
-----------------FDATGALLTQT----DAKSNI------------
SEQ ID NO: 55

gi|45441893|RHS_domain_1
-----------------ADATGAVLTTT----DAKGNL------------
SEQ ID NO: 56

gi|51596557|RHS_domain_1
-----------------ADATGAVLTTT----DAKGNL------------
SEQ ID NO: 57

gi|48730374|RHS_domain_2
-----------------YSPLGAVLTQT----DAGGHQ------------
SEQ ID NO: 58

gi|48730376|RHS_domain_2
-----------------FSAVGALLQTT----DAGGHL------------
SEQ ID NO: 59

gi|28871477|RHS_domain_2
-----------------FNAQGEDLAQT----DANGNV------------
SEQ ID NO: 60

gi|66047265|RHS_domain_1
-----------------FNALGDALAQT----DAMGNT------------
SEQ ID NO: 61

gi|28871480|RHS_domain_1
-----------------FNAQGEVLKQT----DASGNS------------
SEQ ID NO: 62

gi|28868442|RHS_domain_2
-----------------YTVAGLLKSSRL---QMNGQAE-----------
SEQ ID NO: 63

gi|45443601|RHS_domain_2
-----------------YNRAGQLIGSWL---TIKNSAE-----------
SEQ ID NO: 64

gi|66044304|RHS_domain_2
-----------------YDAQQRVVSET----AGNGVI------------
SEQ ID NO: 65

gi|66045648|RHS_domain_3
-----------------YDAQGHVTSET----AGNGVM------------
SEQ ID NO: 66

gi|66047260|RHS_domain_3
-----------------YDAFNQVEQET----AGNGVV------------
SEQ ID NO: 67

gi|66043853|RHS_domain_2
-----------------YDAHGRIESQT----AGNGVI------------
SEQ ID NO: 68

gi|66045648|RHS_domain_1
-----------------YDAQLRPVAII-----ENGRCV-----------
SEQ ID NO: 69

gi|66047260|RHS_domain_1
-----------------YDAQLRPLAIN-----ESGRMT-----------
SEQ ID NO: 70

gi|66047259|RHS_domain_1
-----------------YDSSLRPVSVT-----EQGLVV-----------
SEQ ID NO: 71

gi|66047264|RHS_domain_1
-----------------YDLHLRPTRII-----EQNRCA-----------
SEQ ID NO: 72

gi|27479639|RHS_domain_3
-----------------WTPRGELKQVN----NGPGN-------------
SEQ ID NO: 73

gi|27479683|RHS_domain_4
-----------------WTPRGELKQAN----NSAGN-------------
SEQ ID NO: 74

gi|27479669|RHS_domain_2
-----------------WNTRGELKQVTPVSRESAS--D-----------
SEQ ID NO: 75

gi|27479677|RHS_domain_4
-----------------WNTRGELQQVTLVKRDKGANDD-----------
SEQ ID NO: 76

gi|45441893|RHS_domain_4
-----------------WTARNELLKVTPVVRDGSTD-D-----------
SEQ ID NO: 77

gi|28871480|RHS_domain_3
-----------------WDARNQLQHITTVQREDGSNDD-----------
SEQ ID NO: 78

gi|66047265|RHS_domain_4
-----------------WDVRNQLQHITTVQREDGSSDD-----------
SEQ ID NO: 79

gi|28868442|RHS_domain_4
FDASGNLLALQAGQHLSWDRRNQLQHVRPVIRENGMDDS-----------
SEQ ID NO: 80

gi|66043853|RHS_domain_4
-----------------WDSGNRLIKVDAVTRSEQPEDG-----------
SEQ ID NO: 81

gi|27479683|RHS_domain_2
-----------------YSAAGQ-----KLREEHGNGIV-----------
SEQ ID NO: 82

gi|51597848|ref|RHS_domain_3
-----------------YSAAGQ-----KLREESGNGVI-----------
SEQ ID NO: 83

gi|27479677|RHS_domain_2
-----------------YEPETQRLIGIKTRRPSDTKVL-----------
SEQ ID NO: 84

gi|37524950|RHS_domain_2
------YDSL-------YQLISATGREMANIGQQNNQLP-SPALPS-DNN
SEQ ID NO: 85

gi|66047264|RHS_domain_3
------YDTL-------YQLIEASGREVRNGASHGPALPGLQSLPTIDPC
SEQ ID NO: 86

gi|48730374|RHS_domain_4
------YDTL-------YRLISATGYSDAPPSDR-LGLP-----QSTNPD
SEQ ID NO: 87

gi|28871477|RHS_domain_4
------YDAAGNLLQMRHEGAHNFTRNMHVDPDSNRSLP--------DND
SEQ ID NO: 88

gi|66047259|RHS_domain_3
------YDAAGNLLQMRHEGAHNFTRNMHVDPDSNRSLP--------DDE
SEQ ID NO: 89

gi|45443600|ref|NP_995139.1|
-----------------YDPVGNILAIHN--DAEATRFYR----------
SEQ ID NO: 90

gi|51597848|RHS_domain_1
-----------------YNAFGQLIASR----DPRLEVDN----------
SEQ ID NO: 91

gi|27479639|RHS_domain_1
------QRLAYDVA---GQLKGCWLTLKGQA
SEQ ID NO: 92

gi|37524966|RHS_domain_1
------QRLAYNVA---GQLKGSWLTLKNQSEQV--
SEQ ID NO: 93

gi|45441893|RHS_domain_1
------QRMAYDVA---GLLSGSW-TLKDGTE----
SEQ ID NO: 94

gi|51596557|RHS_domain_1
------QRMAYDVA---GLLSGSWLTLKDGTE----
SEQ ID NO: 95

gi|48730374|RHS_domain_2
------QQSTYDVA---GQLNRVQLQINGQT-----
SEQ ID NO: 96

gi|48730376|RHS_domain_2
------QQSTYDIA---GQLVQVQLQLDGQA-----
SEQ ID NO: 97

gi|28871477|RHS_domain_2
------QRFSHGVA---GQLHAVELTLANTAQRQT-
SEQ ID NO: 98

gi|66047265|RHS_domain_1
------QAFGMTVA---GQLKAAGLT----------
SEQ ID NO: 99

gi|28871480|RHS_domain_1
------QLSTHNLA---GQLHSTDL-----------
SEQ ID NO: 100

gi|28868442|RHS_domain_2
------QVLVSAIQY-DAQERVVSETAGNGVM----
SEQ ID NO: 101

gi|45443601|RHS_domain_2
------QVILRSLTY-SAAGQKLREESGNG------
SEQ ID NO: 102

gi|66044304|RHS_domain_2
------STALYATE--DGRLLALSARRADGLM----
SEQ ID NO: 103

gi|66045648|RHS_domain_3
------TKALHDAA--NGRLIELKGTRADGQL----
SEQ ID NO: 104

gi|66047260|RHS_domain_3
------SRYVYDLQ--DGRLIELSALSADGSV----
SEQ ID NO: 105

gi|66043853|RHS_domain_2
------SCASFDLA--DGRMSELITYRP-GVK----
SEQ ID NO: 106

gi|66045648|RHS_domain_1
------ERRQYGGAD-TQGHNQCNQCIRHDDPAGSR
SEQ ID NO: 107

gi|66047260|RHS_domain_1
------ERFTYGGPA-TAERNQCNQLIRHDDTAGSR
SEQ ID NO: 108

gi|66047259|RHS_domain_1
------ERLAYGGAD-AAEHNQCNQLIRHDDTAGSR
SEQ ID NO: 109

gi|66047264|RHS_domain_1
------ERFTYGQAG-AAAHNQCNQLVRHDDTAGSR
SEQ ID NO: 110

gi|27479639|RHS_domain_3
------EWYRYDSN---GMRQLKVSEQPTQ------
SEQ ID NO: 111

gi|27479683|RHS_domain_4
------EWYRYDSN---GIRQLKVNEQQTQ------
SEQ ID NO: 112

gi|27479669|RHS_domain_2
-----REWYRYGND---GMRRLKVSEQQ--------
SEQ ID NO: 113

gi|27479677|RHS_domain_4
-----REWYRYSGD---GRRMLKINEQQASNNAQT-
SEQ ID NO: 114

gi|45441893|RHS_domain_4
-----SESYRYDAA---SQRILKVSRQKTNT-----
SEQ ID NO: 115

gi|28871480|RHS_domain_3
-----E-RYVYDGQ---GQRCRLISTAQASGRT---
SEQ ID NO: 116

gi|66047265|RHS_domain_4
-----E-RYVYDGQ---GQRCRKISTAQASGRM---
SEQ ID NO: 117

gi|28868442|RHS_domain_4
-----E-RYSYDAS---GQRLRKVRTTQAKT-----
SEQ ID NO: 118

gi|66043853|RHS_domain_4
-----E-HYAYDAS---GQRLR--KTAKA-------
SEQ ID NO: 119

gi|27479683|RHS_domain_2
--TEY--SYEPETQ---RLIGITTRRPSDAK-----
SEQ ID NO: 120

gi|51597848|ref|RHS_domain_3
--TEY--RYEPQTQ---RLIGIKTTRP--AK-----
SEQ ID NO: 121

gi|27479677|RHS_domain_2
--QDL--RYEYDPV---GNV-ISIRNDAEAT-----
SEQ ID NO: 122

gi|37524950|RHS_domain_2
TYTNYTRRYSYDHS---GNL-TQIRHSSSAT-----
SEQ ID NO: 123

gi|66047264|RHS_domain_3
QVSNYTQSYSYDAA---GNL-LQMRHEGA-------
SEQ ID NO: 124

gi|48730374|RHS_domain_4
DRRNYVEHYDYDHG---DNL-VKTIHVRDGTS----
SEQ ID NO: 125

gi|28871477|RHS_domain_4
RYVDF--ATSFDAN---GNL-LQLVRGQT-------
SEQ ID NO: 126

gi|66047259|RHS_domain_3
GEVDF--ATSFDAN---GNL-LQLVRGQT-------
SEQ ID NO: 127

gi|45443600|ref|NP_995139.1|
-----NQKIVPETTYRYDALYQLIEATGREADT---
SEQ ID NO: 128

gi|51597848|RHS_domain_1
------FRYQYSLS---GVPLRTDSVDSGSTL----
SEQ ID NO: 129
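The steps for turning an alignment such as the one above into a searchable model can be sketched in Python. This is a minimal illustration that assumes HMMER 2.x command-line conventions (hmmbuild takes the output model file followed by the alignment file; hmmcalibrate and hmmsearch take the model file first). The alignment file name `rhs_seed.aln` and the helper `hmmer2_commands` are hypothetical; the model and database paths are those reported in the hmmsearch output of this example. The sketch only assembles the command lines and does not execute them.

```python
def hmmer2_commands(alignment_file, model_file, database_file):
    """Return the three HMMER 2.x command lines as argument lists."""
    return [
        ["hmmbuild", model_file, alignment_file],   # build the profile HMM
        ["hmmcalibrate", model_file],               # calibrate E-value statistics
        ["hmmsearch", model_file, database_file],   # search the sequence database
    ]


commands = hmmer2_commands(
    "rhs_seed.aln",  # hypothetical file holding the seed alignment
    "FinalTest/CModel1.hmm",
    "FinalTest/Prokaryotic800-1100.fasta",
)

for argv in commands:
    print(" ".join(argv))
```

Each argument list could be passed to `subprocess.run(argv, check=True)` on a system where HMMER 2.x is installed.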

This multiple sequence alignment was used with hmmbuild and hmmcalibrate to generate CModel1.hmm. The CModel1.hmm model was then tested against the 38-protein sample set above and identified all expected RHS domains. Conversely, when the CModel1.hmm model was tested against 20 randomly selected proteins, no RHS domains were found. The CModel1.hmm model was then tested with hmmsearch against the 54323-member database containing all prokaryotic proteins between 800 and 1100 amino acids in length, yielding the results below.

hmmsearch - search a sequence database with a profile HMM
HMMER 2.3.1 (June 2003)
Freely distributed under the GNU General Public License (GPL)
HMM file: FinalTest/CModel1.hmm [CDomainsUniqueCreate2]
Sequence database: FinalTest/Prokaryotic800-1100.fasta
per-sequence score cutoff: [none]
per-domain score cutoff: [none]
per-sequence Eval cutoff: <=10
per-domain Eval cutoff: [none]
Query HMM: CDomainsUniqueCreate2
Accession: [none]
Description: [none]

[HMM has been calibrated; E-values are empirical estimates]

Scores for complete sequences (score includes all domains):

Sequence
Description
Score
E-value
N

gi|66047260|ref|YP_237101.1|
YD repeat [Pseu
202.3
7e−57
6

gi|63257967|gb|AAY39063.1|
YD repeat [Pseu
202.3
7e−57
6

gi|28854728|gb|AAO57791.1|
insecticidal to
200.6
2.3e−56
6

gi|28871477|ref|NP_794096.1|
insecticidal to
200.6
2.3e−56
6

gi|66047265|ref|YP_237106.1|
YD repeat [Pseu
178.4
1.1e−49
7

gi|63257972|gb|AAY39068.1|
YD repeat [Pseu
178.4
1.1e−49
7

gi|28854731|gb|AAO57794.1|
insecticidal to
171.2
1.6e−47
7

gi|28871480|ref|NP_794099.1|
insecticidal to
171.2
1.6e−47
7

gi|66047264|ref|YP_237105.1|
YD repeat [Pseu
155.3
1e−42
7

gi|63257971|gb|AAY39067.1|
YD repeat [Pseu
155.3
1e−42
7

gi|66047259|ref|YP_237100.1|
YD repeat [Pseu
146.6
4e−40
6

gi|63257966|gb|AAY39062.1|
YD repeat [Pseu
146.6
4e−40
6

gi|51510120|emb|CAH19032.1|
unnamed protein
143.2
4.3e−39
4

gi|49021696|emb|CAG38450.1|
unnamed protein
143.2
4.3e−39
4

gi|27479677|gb|AAO17204.1|
TccC3 [Photorha
143.2
4.3e−39
4

gi|28851680|gb|AAO54756.1|
insecticidal to
135.0
1.3e−36
5

gi|28868442|ref|NP_791061.1|
insecticidal to
135.0
1.3e−36
5

gi|66045648|ref|YP_235489.1|
YD repeat [Pseu
134.9
1.4e−36
4

gi|63256355|gb|AAY37451.1|
YD repeat [Pseu
134.9
1.4e−36
4

gi|15981595|emb|CAC93143.1|
putative insect
134.1
2.3e−36
6

gi|45443601|ref|NP_995140.1|
putative insect
134.1
2.3e−36
6

gi|25511227|pir||AC0447
probable insect
134.1
2.3e−36
6

gi|16123816|ref|NP_407129.1|
putative insect
134.1
2.3e−36
6

gi|22124111|ref|NP_667534.1|
putative toxin
134.1
2.3e−36
6

gi|45438471|gb|AAS64017.1|
putative insect
134.1
2.3e−36
6

gi|21956863|gb|AAM83785.1|AE013619_6
putative toxin
134.1
2.3e−36
6

gi|15981596|emb|CAC93144.1|
putative insect
128.3
1.3e−34
6

gi|45443600|ref|NP_995139.1|
putative toxin
128.3
1.3e−34
6

gi|25511229|pir||AD0447
probable insect
128.3
1.3e−34
6

gi|16123817|ref|NP_407130.1|
putative insect
128.3
1.3e−34
6

gi|22124110|ref|NP_667533.1|
putative toxin
128.3
1.3e−34
6

gi|45438470|gb|AAS64016.1|
putative toxin
128.3
1.3e−34
6

gi|21956862|gb|AAM83784.1|AE013619_5
putative toxin
128.3
1.3e−34
6

gi|51591130|emb|CAH22795.1|
putative insect
127.9
1.7e−34
6

gi|51597848|ref|YP_072039.1|
putative insect
127.9
1.7e−34
6

gi|27479639|gb|AAL18492.2|
TccC2 [Photorha
127.0
3.2e−34
4

gi|36784376|emb|CAE13255.1|
Insecticidal to
124.0
2.6e−33
4

gi|37524950|ref|NP_928294.1|
Insecticidal to
124.0
2.6e−33
4

gi|36784383|emb|CAE13262.1|
Insecticidal to
122.1
9.4e−33
5

gi|37524957|ref|NP_928301.1|
Insecticidal to
122.1
9.4e−33
5

gi|49021701|emb|CAG38451.1|
unnamed protein
120.5
2.9e−32
4

gi|27479669|gb|AAO17196.1|
TccC4 [Photorha
120.5
2.9e−32
4

gi|66044304|ref|YP_234145.1|
YD repeat [Pseu
119.1
7.8e−32
5

gi|63255011|gb|AAY36107.1|
YD repeat [Pseu
119.1
7.8e−32
5

gi|51589839|emb|CAH21471.1|
putative insect
118.6
1.1e−31
5

gi|15980309|emb|CAC91117.1|
putative insect
118.6
1.1e−31
5

gi|45441893|ref|NP_993432.1|
putative insect
118.6
1.1e−31
5

gi|25510685|pir||AI0281
probable insect
118.6
1.1e−31
5

gi|16122536|ref|NP_405849.1|
putative insect
118.6
1.1e−31
5

gi|51596557|ref|YP_070748.1|
putative insect
118.6
1.1e−31
5

gi|22125912|ref|NP_669335.1|
putative compon
118.6
1.1e−31
5

gi|45436756|gb|AAS62309.1|
putative insect
118.6
1.1e−31
5

gi|21958849|gb|AAM85586.1|AE013804_10
putative compon
118.6
1.1e−31
5

gi|15980376|emb|CAC91185.1|
insecticial tox
114.6
1.8e−30
6

gi|45441958|ref|NP_993497.1|
insecticial tox
114.6
1.8e−30
6

gi|25510725|pir||AE0290
insecticial tox
114.6
1.8e−30
6

gi|16122603|ref|NP_405916.1|
insecticial tox
114.6
1.8e−30
6

gi|45436821|gb|AAS62374.1|
insecticial tox
114.6
1.8e−30
6

gi|36784392|emb|CAE13271.1|
Insecticidal to
114.1
2.5e−30
4

gi|37524966|ref|NP_928310.1|
Insecticidal to
114.1
2.5e−30
4

gi|9963680|gb|AAG09644.1|
SepC [Serratia
113.9
2.8e−30
4

gi|10956818|ref|NP_065279.1|
SepC [Serratia
113.9
2.8e−30
4

gi|13444939|emb|CAC34921.1|
unnamed protein
113.9
2.8e−30
4

gi|51589900|emb|CAH21532.1|
insecticial tox
111.8
1.2e−29
6

gi|51596618|ref|YP_070809.1|
insecticial tox
111.8
1.2e−29
6

gi|36787747|emb|CAE16860.1|
Insecticidal to
110.7
2.6e−29
4

gi|37528309|ref|NP_931654.1|
Insecticidal to
110.7
2.6e−29
4

gi|49021704|emb|CAG38452.1|
unnamed protein
107.5
2.4e−28
4

gi|27479683|gb|AAO17210.1|
TccC5 [Photorha
107.5
2.4e−28
4

gi|16416915|gb|AAL18473.1|
toxin complex p
106.5
4.6e−28
5

gi|3265044|gb|AAC38630.1|
insecticidal to
106.5
4.6e−28
5

gi|36787442|emb|CAE16539.1|
Insecticidal to
105.5
9.8e−28
5

gi|37528005|ref|NP_931350.1|
Insecticidal to
105.5
9.8e−28
5

gi|48730374|ref|ZP_00264122.1|
COG3209: Rhs fa
99.0
8.7e−26
6

gi|36787457|emb|CAE16554.1|
Insecticidal to
98.9
9e−26
5

gi|37528020|ref|NP_931365.1|
Insecticidal to
98.9
9e−26
5

gi|36784380|emb|CAE13259.1|
Insecticidal to
96.0
6.9e−25
4

gi|37524954|ref|NP_928298.1|
Insecticidal to
96.0
6.9e−25
4

gi|66043853|ref|YP_233694.1|
YD repeat [Pseu
93.1
5e−24
5

gi|63254560|gb|AAY35656.1|
YD repeat [Pseu
93.1
5e−24
5

gi|32699988|gb|AAP57765.1|
TcYF3 [Yersinia
91.4
1.7e−23
5

gi|66045559|ref|YP_235400.1|
YD repeat [Pseu
86.1
6.8e−22
9

gi|63256266|gb|AAY37362.1|
YD repeat [Pseu
86.1
6.8e−22
9

gi|56414584|ref|YP_151659.1|
Rhs-family prot
83.2
4.9e−21
11

gi|56128841|gb|AAV78347.1|
Rhs-family prot
83.2
4.9e−21
11

gi|16759284|ref|NP_454901.1|
Rhs-family prot
82.2
9.6e−21
11

gi|16501575|emb|CAD08754.1|
Rhs-family prot
82.2
9.6e−21
11

gi|25511763|pir||AB0539
Rhs-family prot
82.2
9.6e−21
11

gi|29142943|ref|NP_806285.1|
Rhs-family prot
82.2
9.6e−21
11

gi|29138575|gb|AAO70145.1|
Rhs-family prot
82.2
9.6e−21
11

gi|58581261|ref|YP_200277.1|
hypothetical pr
76.0
7e−19
9

gi|58425855|gb|AAW74892.1|
conserved hypot
76.0
7e−19
9

gi|50955422|ref|YP_062710.1|
RHS-related pro
72.4
8.4e−18
9

gi|50951904|gb|AAT89605.1|
RHS-related pro
72.4
8.4e−18
9

gi|21223079|ref|NP_628858.1|
putative Rhs pr
72.0
1.2e−17
8

gi|7321289|emb|CAB82067.1|
putative Rhs pr
72.0
1.2e−17
8

gi|50956512|gb|AAT90761.1|
putative insect
69.9
4.8e−17
4

gi|32042609|ref|ZP_00140192.1|
COG3209: Rhs fa
65.7
9.1e−16
9

gi|25496542|pir||B86084
hypothetical pr
64.8
1.7e−15
6

gi|12518845|gb|AAG59134.1|AE005624_10
orf; Unknown fu
64.8
1.7e−15
6

gi|15804530|ref|NP_290570.1|
hypothetical pr
64.8
1.7e−15
6

gi|46164196|ref|ZP_00136739.2|
COG3209: Rhs fa
64.5
2.1e−15
10

gi|14041731|emb|CAC38402.1|
XptB1 protein [
63.4
4.5e−15
5

gi|42742522|gb|AAS45281.1|
TccC1/XptB1 pro
63.4
4.5e−15
5

gi|48730376|ref|ZP_00264124.1|
COG3209: Rhs fa
61.5
1.7e−14
6

gi|48730375|ref|ZP_00264123.1|
COG3209: Rhs fa
60.2
4.1e−14
6

gi|17547189|ref|NP_520591.1|
PROBABLE RHS-RE
58.5
1.3e−13
6

gi|17429491|emb|CAD16177.1|
PROBABLE RHS-RE
58.5
1.3e−13
6

gi|58581262|ref|YP_200278.1|
hypothetical pr
57.7
2.4e−13
6

gi|58425856|gb|AAW74893.1|
conserved hypot
57.7
2.4e−13
6

gi|48732572|ref|ZP_00266315.1|
COG3209: Rhs fa
56.4
5.8e−13
4

gi|48732573|ref|ZP_00266316.1|
COG3209: Rhs fa
51.5
1.8e−11
4

gi|17430552|emb|CAD17236.1|
PUTATIVE RHS-RE
50.3
4e−11
5

gi|17548306|ref|NP_521646.1|
PUTATIVE RHS-RE
50.3
4e−11
5

gi|67660484|ref|ZP_00457830.1|
YD repeat [Burk
49.0
9.9e−11
6

gi|67091947|gb|EAM09510.1|
YD repeat [Burk
49.0
9.9e−11
6

gi|67932920|ref|ZP_00526052.1|
YD repeat [Soli
47.9
2e−10
7

gi|67859831|gb|EAM54893.1|
YD repeat [Soli
47.9
2e−10
7

gi|15981529|emb|CAC93077.1|
putative export
45.7
9.5e−10
5

gi|45443670|ref|NP_995209.1|
hypothetical pr
45.7
9.5e−10
5

gi|25511182|pir||AI0438
probable export
45.7
9.5e−10
5

gi|16123750|ref|NP_407063.1|
hypothetical pr
45.7
9.5e−10
5

gi|22124186|ref|NP_667609.1|
Rhs-like protei
45.7
9.5e−10
5

gi|45438540|gb|AAS64086.1|
putative export
45.7
9.5e−10
5

gi|21956945|gb|AAM83860.1|AE013626_7
Rhs-like protei
45.7
9.5e−10
5

gi|66048214|ref|YP_238055.1|
YD repeat [Pseu
44.1
3e−09
5

gi|63258921|gb|AAY40017.1|
YD repeat [Pseu
44.1
3e−09
5

gi|50841691|ref|YP_054918.1|
RHS-family prot
43.8
3.5e−09
3

gi|50839293|gb|AAT81960.1|
RHS-family prot
43.8
3.5e−09
3

gi|66963410|ref|ZP_00410982.1|
YD repeat [Arth
41.8
1.4e−08
9

gi|66871070|gb|EAL98434.1|
YD repeat [Arth
41.8
1.4e−08
9

gi|17432059|emb|CAD18736.1|
PUTATIVE RHS-RE
36.0
7.7e−07
4

gi|17549804|ref|NP_523144.1|
PUTATIVE RHS-RE
36.0
7.7e−07
4

gi|67757182|ref|ZP_00496060.1|
COG3209: Rhs fa
31.1
2.4e−05
4

gi|17430924|emb|CAD17606.1|
PROBABLE RHS-RE
26.8
0.00045
4

gi|17548676|ref|NP_522016.1|
PROBABLE RHS-RE
26.8
0.00045
4

gi|50841692|ref|YP_054919.1|
RHS-family prot
26.6
0.00054
4

gi|50839294|gb|AAT81961.1|
RHS-family prot
26.6
0.00054
4

gi|67760154|ref|ZP_00498879.1|
COG3209: Rhs fa
24.9
0.0018
5

gi|67756692|ref|ZP_00495573.1|
COG3209: Rhs fa
24.9
0.0018
5

gi|67711658|ref|ZP_00481464.1|
COG3209: Rhs fa
24.9
0.0018
5

gi|67685220|ref|ZP_00479096.1|
COG3209: Rhs fa
24.9
0.0018
5

gi|67671863|ref|ZP_00468647.1|
COG3209: Rhs fa
24.9
0.0018
5

gi|29826985|ref|NP_821619.1|
putative cell w
23.9
0.0036
4

gi|29604082|dbj|BAC68154.1|
putative cell w
23.9
0.0036
4

gi|67738391|ref|ZP_00489054.1|
COG3209: Rhs fa
23.6
0.0042
5

gi|66963411|ref|ZP_00410983.1|
YD repeat [Arth
21.8
0.015
3

gi|66871071|gb|EAL98435.1|
YD repeat [Arth
21.8
0.015
3

gi|28850838|gb|AAO53917.1|
Rhs family prot
21.0
0.026
3

gi|28867603|ref|NP_790222.1|
Rhs family prot
21.0
0.026
3

gi|67941206|ref|ZP_00533425.1|
YD repeat [Chlo
19.5
0.071
2

gi|67912581|gb|EAM62210.1|
YD repeat [Chlo
19.5
0.071
2

gi|28852674|gb|AAO55747.1|
YD repeat prote
17.8
0.24
2

gi|28869433|ref|NP_792052.1|
YD repeat prote
17.8
0.24
2

gi|24112019|ref|NP_706529.1|
putative Rhs-fa
16.9
0.38
2

gi|24050837|gb|AAN42236.1|
putative Rhs-fa
16.9
0.38
2

gi|67158725|ref|ZP_00419586.1|
PAS [Azotobacte
12.9
1.6
1

gi|67084598|gb|EAM04080.1|
PAS [Azotobacte
12.9
1.6
1

gi|26988616|ref|NP_744041.1|
hypothetical pr
12.7
1.7
1

gi|24983394|gb|AAN67505.1|AE016378_4
hypothetical pr
12.7
1.7
1

gi|46311839|ref|ZP_00212441.1|
COG3501: Unchar
10.0
4.3
1

While most of the hits in this list are Class C TC proteins, in contrast to the Class B TC proteins described above, there is no clear-cut E-value defining a boundary between the Class C TC proteins and all other proteins scored. Nevertheless, the vast majority of hits with E-values better than 1e-10 are Class C TC proteins, and there are no known Class C TC proteins below that mark. In fact, the first 80 hits (up to gi|32699988|) are all Class C TC proteins, and at least 75% [88/116] of the proteins with E-values better than 1e-10 are known Class C TC proteins or are annotated as such. None of the 38 selected Class C TC proteins had an E-value worse than 1.8e-11. Therefore, the refinement of search criteria represented in this HMM model is an effective tool to search for Class C TC proteins in a database. The identification of approximately 88 hits out of 54323 proteins represents an increase in search efficiency and stringency by a factor of more than 600. Some uncertainty remains, in that some of the proteins scored by this model but not annotated as Class C TC proteins may, in fact, be Class C TC proteins.
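The figures quoted above can be checked with a short Python sketch. The five rows are copied from the tail of the hit list, and the counts (54323, 116, and 88) are those stated in the text; the helper `parse_evalue` is a hypothetical convenience that also accepts the typographic minus sign used in the listing.

```python
def parse_evalue(text):
    """Convert an E-value string to float, accepting a typographic minus sign."""
    return float(text.replace("\u2212", "-"))


# (identifier, E-value) rows copied from the hit list above.
rows = [
    ("gi|48732573|ref|ZP_00266316.1|", "1.8e−11"),
    ("gi|17430552|emb|CAD17236.1|", "4e−11"),
    ("gi|67091947|gb|EAM09510.1|", "9.9e−11"),
    ("gi|67932920|ref|ZP_00526052.1|", "2e−10"),
    ("gi|15981529|emb|CAC93077.1|", "9.5e−10"),
]

CUTOFF = 1e-10  # the "e-10" mark discussed in the text
better = [name for name, evalue in rows if parse_evalue(evalue) < CUTOFF]

database_size = 54323          # proteins in the 800-1100 residue database
hits_better_than_cutoff = 116  # hits with E-values better than the cutoff
known_class_c = 88             # of those, known or annotated Class C TC proteins

fraction_known = known_class_c / hits_better_than_cutoff
fold_reduction = database_size / known_class_c

print(better)
print(f"{fraction_known:.1%}")
print(f"{fold_reduction:.0f}")
```

The ratio 88/116 is about 75.9%, and 54323/88 is about 617, consistent with the statements "at least 75%" and "more than 600" above.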

The inability of the CModel1.hmm model to generate scores that allow complete separation between Class C TC proteins and all other proteins in the 800-1100 amino acid size class is probably due to the fact that the Class C TC protein family is a member of a larger superfamily of RHS proteins. Although most other members of the superfamily are larger than the Class C TC proteins, there may be sufficient overlap in size to prevent complete discrimination. As will be seen below, the use of an RHS model in conjunction with other domain models is an effective method of distinguishing fused Class B/C TC proteins from other entries in a protein database.

Prokaryotic and Archaeal Fused Class B/C TC Proteins. Proteins corresponding to fused Class B/C Toxin Complex proteins are found in all three kingdoms, and all have the general domain format of an spvB domain followed by multiple FG-GAP domains followed by multiple RHS domains. However, differences in the structure of these domains (or subdomains) do not allow a simple set of HMM models to cover all three cases, so they are analyzed separately.
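The general domain format described above can be expressed as a simple ordering test. The Python sketch below is an illustration under stated assumptions: domain hits are given as (name, start position) pairs, "multiple" is taken to mean at least two, and any domain other than spvB, FG-GAP, or RHS causes the test to fail. The helper names and the coordinates are hypothetical.

```python
import re

# One-letter codes for the three domain classes named in the text.
CODES = {"spvB": "S", "FG-GAP": "F", "RHS": "R"}


def domain_order(domain_hits):
    """Return domain names sorted by their start coordinate."""
    return [name for name, start in sorted(domain_hits, key=lambda hit: hit[1])]


def is_fused_bc_layout(domain_hits):
    """True if the hits read: one spvB, then >=2 FG-GAP, then >=2 RHS."""
    codes = "".join(CODES.get(name, "x") for name in domain_order(domain_hits))
    return re.fullmatch(r"SF{2,}R{2,}", codes) is not None


# Hypothetical coordinates, listed out of order on purpose.
fused_like = [("RHS", 1500), ("spvB", 40), ("FG-GAP", 400),
              ("FG-GAP", 520), ("RHS", 1350)]
rhs_only = [("RHS", 100), ("RHS", 300)]

print(is_fused_bc_layout(fused_like))
print(is_fused_bc_layout(rhs_only))
```

A test of this kind is only as good as the domain models that produce the hits, which is why the kingdom-specific models discussed in this example are needed.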

In contrast to single Class B TC proteins (see above), the spvB domains of fused Class B/C TC proteins from prokaryotes are not highly conserved, so the Pfam spvB model was inappropriate for model generation. Although there are three domain classes represented in the prokaryotic and archaeal fused Class B/C TC proteins, it is demonstrated here that the use of HMM models for only two of them is sufficient to select the fused B/C TC proteins from a data set.

A protein search at the website ncbi.nlm.nih.gov was performed using the search terms: “1700:2800[SLEN] AND Prokaryota”. These terms restricted the search to those proteins that are between 1700 and 2800 amino acids in length and have prokaryotic and archaeal sources. [It is an inherent GenBank feature that a search limited to Prokaryota will also extract Archaeal genes]. A total of 3303 protein sequences were identified and downloaded as a searchable database. It should be noted that, while these restrictions are useful in the present context, the interval length of the sequences and the kingdom searched can be modified to meet the parameters of the individual protein set to be examined.

Neither the Pfam FG-GAP model nor the nonfused (single) Class B TC protein HMM model (above) proved satisfactory for finding FG-GAP domains in fused Class B/C TC proteins from prokaryotes and archaea. A satisfactory new model was therefore created from GenBank Accession numbers 48862345 and 13475700 by taking the FG-GAP domains found by Pfam and setting the cutoff score to 1. The ClustalX multiple sequence alignment of the FG-GAP domains is shown below. This alignment can be used with hmmbuild to generate the FG-GAP HMM model (BModel7.hmm) used in this example.

CLUSTAL X (1.83) multiple sequence alignment

gi|48862345|BDomain7
PESGISVGDINADGLTDVLYHNNGQV----EVYLS--
SEQ ID NO: 130

gi|13475700|BDomain5
ISSGTTIGDFNGDGLPDF-------------------
SEQ ID NO: 131

gi|48862345|BDomain3
PTTQFQPIDINGDGELDVAWL----------------
SEQ ID NO: 132

gi|48862345|BDomain4
--------DYNADGHADVAIY----------------
SEQ ID NO: 133

gi|48862345|BDomain1
--ADYFSGDMNGDREEDYFIRGYQAG----EPAL---
SEQ ID NO: 134

gi|48862345|BDomain2
RRNTIALQDYNLDGRMDLVLLSGVGGY---VVDVI--
SEQ ID NO: 135

gi|13475700|BDomain4
SGTTGVLRDFDNDGKADVVTITG-GG-----IGS---
SEQ ID NO: 136

gi|48862345|BDomain5
----------DVNGDGLTDALTESR------VY----
SEQ ID NO: 137

gi|48862345|BDomain9
EDKVIRLLDVNGDGLLDLVSESKSDSTTKFNVYHW--
SEQ ID NO: 138

gi|48862345|BDomain8
GGANYVFSDVNGDSHTDLITFYDER----LSIH----
SEQ ID NO: 139

gi|13475700|BDomain3
GSGTCVLADVNGDGATDIVRYDGLN----LSAGVWLS
SEQ ID NO: 140

gi|48862345|BDomain6
-----GFGDFNGDGRLDLLVGD---------------
SEQ ID NO: 141

gi|13475700|BDomain1
TEAPREVGDLDFDGRDEIFGDYSEATDQRSGGREGET
SEQ ID NO: 142

gi|13475700|BDomain2
--GASGIG----VG--DFLGN----------GR----
SEQ ID NO: 143

Domain1 from the fused Class B/C TC protein of GenBank Accession number 13475700 was considerably longer than the other domains and was trimmed to improve the model. The BModel7.hmm model was then tested with hmmsearch against the 3303-member database containing all prokaryotic (and archaeal) proteins between 1700 and 2800 amino acids in length, yielding the results below. [The original output list was truncated to remove a large number of proteins with very high E-values.]

hmmsearch - search a sequence database with a profile HMM
HMMER 2.3.1 (June 2003)
Freely distributed under the GNU General Public License (GPL)
HMM file: FinalTest/BModel7.hmm [BDomainsModel7Sequences]
Sequence database: FinalTest/Prokaryotic1700-2800.fasta
per-sequence score cutoff: [none]
per-domain score cutoff: [none]
per-sequence Eval cutoff: <=10
per-domain Eval cutoff: [none]
Query HMM: BDomainsModel7Sequences
Accession: [none]
Description: [none]

[HMM has been calibrated; E-values are empirical estimates]

Scores for complete sequences (score includes all domains):

Sequence
Description
Score
E-value
N

gi|48862345|ref|ZP_00316242.1|
COG3209: Rhs f
236.4
2.2e−68
10

gi|13475700|ref|NP_107267.1|
hypothetical p
124.1
1.5e−34
5

gi|14026456|dbj|BAB53053.1|
mll6838 [Mesor
124.1
1.5e−34
5

gi|27367604|ref|NP_763131.1|
Rhs family pro
101.4
1e−27
7

gi|27359176|gb|AAO08121.1|AE016812_103
Rhs family pro
101.4
1e−27
7

gi|48863870|ref|ZP_00317763.1|
COG3209: Rhs f
95.0
8.6e−26
8

gi|48833214|ref|ZP_00290236.1|
COG2931: RTX t
87.0
2.1e−23
11

gi|45657896|ref|YP_001982.1|
cytoplasmic me
79.7
3.4e−21
7

gi|45601137|gb|AAS70619.1|
cytoplasmic me
79.7
3.4e−21
7

gi|24214465|ref|NP_711946.1|
Rhs family pro
78.2
9.3e−21
7

gi|24195416|gb|AAN48964.1|AE011353_4
Rhs family pro
78.2
9.3e−21
7

gi|45656716|ref|YP_000802.1|
cytoplasmic me
74.8
1e−19
6

gi|45599952|gb|AAS69439.1|
cytoplasmic me
74.8
1e−19
6

gi|67919346|ref|ZP_00512927.1|
Integrins alph
69.1
5.1e−18
6

gi|67783058|gb|EAM42456.1|
Integrins alph
69.1
5.1e−18
6

gi|39576631|emb|CAE80795.1|
hypothetical p
67.9
1.2e−17
8

gi|42524422|ref|NP_969802.1|
hypothetical p
67.9
1.2e−17
8

gi|20090892|ref|NP_616967.1|
hypothetical p
63.0
3.6e−16
6

gi|19915968|gb|AAM05447.1|
hypothetical p
63.0
3.6e−16
6

gi|24216032|ref|NP_713513.1|
hypothetical p
59.6
3.9e−15
7

gi|24197262|gb|AAN50531.1|AE011493_4
conserved hypo
59.6
3.9e−15
7

gi|33632688|emb|CAE07500.1|
conserved hypo
58.7
7.2e−15
10

gi|33865519|ref|NP_897078.1|
hypothetical p
58.7
7.2e−15
10

gi|48893475|ref|ZP_00326711.1|
COG3391: Uncha
38.5
8.7e−09
7

gi|67932250|ref|ZP_00525397.1|
Integrins alph
38.3
1e−08
2

gi|67860476|gb|EAM55523.1|
Integrins alph
38.3
1e−08
2

gi|67762147|ref|ZP_00500850.1|
COG3209: Rhs f
32.2
6.5e−07
4

gi|67739010|ref|ZP_00489616.1|
COG3209: Rhs f
32.2
6.5e−07
4

gi|67670205|ref|ZP_00467015.1|
COG3209: Rhs f
32.2
6.5e−07
4

gi|67653219|ref|ZP_00450636.1|
COG3209: Rhs f
32.2
6.5e−07
4

gi|67648765|ref|ZP_00446993.1|
COG3209: Rhs f
32.2
6.5e−07
4

gi|67642373|ref|ZP_00441130.1|
COG3209: Rhs f
32.2
6.5e−07
4

gi|67636722|ref|ZP_00435666.1|
COG3209: Rhs f
32.2
6.5e−07
4

gi|67629702|ref|ZP_00429560.1|
COG3209: Rhs f
32.2
6.5e−07
4

gi|53718233|ref|YP_107219.1|
putative membr
32.2
6.5e−07
4

gi|52208647|emb|CAH34583.1|
putative membr
32.2
6.5e−07
4

gi|53724907|ref|YP_101869.1|
FG-GAP/YD repe
32.2
6.5e−07
4

gi|52428330|gb|AAU48923.1|
FG-GAP/YD repe
32.2
6.5e−07
4

gi|67713631|ref|ZP_00482992.1|
COG3209: Rhs f
31.0
1.6e−06
4

gi|67683974|ref|ZP_00478003.1|
COG3209: Rhs f
31.0
1.6e−06
4

gi|20089734|ref|NP_615809.1|
cell surface p
27.6
1.7e−05
6

gi|19914667|gb|AAM04289.1|
cell surface p
27.6
1.7e−05
6

gi|20092500|ref|NP_618575.1|
cell surface p
21.7
0.00067
5

gi|19917767|gb|AAM07055.1|
cell surface p
21.7
0.00067
5

gi|48838750|ref|ZP_00295689.1|
COG3291: FOG:
16.4
0.0067
4

gi|32446373|emb|CAD76201.1|
probable fibri
14.6
0.014
2

gi|32475830|ref|NP_868824.1|
probable fibri
14.6
0.014
2

gi|20092499|ref|NP_618574.1|
cell surface p
10.0
0.1
2

gi|19917765|gb|AAM07054.1|
cell surface p
10.0
0.1
2

gi|15597071|ref|NP_250565.1|
hypothetical p
7.9
0.26
1

gi|11349338|pir||A83412
hypothetical p
7.9
0.26
1

gi|32041947|ref|ZP_00139530.1|
COG5295: Autot
7.9
0.26
1

gi|9947864|gb|AAG05263.1|AE004613_8
hypothetical p
7.9
0.26
1

gi|67936217|ref|ZP_00529228.1|
Hemolysin-type
7.6
0.29
3

gi|67774841|gb|EAM34516.1|
Hemolysin-type
7.6
0.29
3

gi|33440441|gb|AAQ19127.1|
putative adhes
7.2
0.35
1

gi|31616734|emb|CAD60101.1|
peptide synthe
7.0
0.38
1

gi|31505496|gb|AAO62586.1|
peptide sythet
7.0
0.38
1

gi|67756274|ref|ZP_00495160.1|
COG3209: Rhs f
6.6
0.45
3

gi|67754231|ref|ZP_00493144.1|
COG3209: Rhs f
6.6
0.45
3

gi|67754221|ref|ZP_00493134.1|
COG3209: Rhs f
6.6
0.45
3

gi|67752361|ref|ZP_00491350.1|
COG3209: Rhs f
6.6
0.45
3

gi|67752064|ref|ZP_00491084.1|
COG3209: Rhs f
6.6
0.45
3

gi|67710451|ref|ZP_00480257.1|
COG3209: Rhs f
6.6
0.45
3

gi|67710442|ref|ZP_00480248.1|
COG3209: Rhs f
6.6
0.45
3

gi|67682395|ref|ZP_00476633.1|
COG3209: Rhs f
6.6
0.45
3

gi|17431402|emb|CAD18081.1|
SKWP PROTEIN 3
6.4
0.5
1

gi|17549151|ref|NP_522491.1|
SKWP PROTEIN 3
6.4
0.5
1

gi|48731764|ref|ZP_00265508.1|
COG2132: Putat
6.3
0.52
1

It is noted that gi|19914667|, with an E-value of 1.7e-5 (double-underlined for clarity), is the last protein entry with 6 FG-GAP domains. Proteins with this E-value or better were extracted from GenBank and were used to create the searchable dataset for the next round of analysis. The general RHS Pfam model rhs_ls.hmm was used in lieu of creating a new RHS HMM model specific to fused B/C proteins. The results of the search are below:

hmmsearch - search a sequence database with a profile HMM
HMMER 2.3.1 (June 2003)
Freely distributed under the GNU General Public License (GPL)
HMM file: FinalTest/rhs_ls.hmm [RHS_repeat]
Sequence database: FinalTest/ProLargeModel7Hits.fasta
per-sequence score cutoff: [none]
per-domain score cutoff: [none]
per-sequence Eval cutoff: <=10
per-domain Eval cutoff: [none]
Query HMM: RHS_repeat
Accession: PF05593.3
Description: RHS Repeat

[HMM has been calibrated; E-values are empirical estimates]

Scores for complete sequences (score includes all domains):

Sequence                                Description      Score   E-value   N
gi|19915968|gb|AAM05447.1|              hypothetical p   153.0   3.5e−45   9
gi|20090892|ref|NP_616967.1|            hypothetical p   153.0   3.5e−45   9
gi|45656716|ref|YP_000802.1|            cytoplasmic me   151.7     9e−45  11
gi|45599952|gb|AAS69439.1|              cytoplasmic me   151.7     9e−45  11
gi|24197262|gb|AAN50531.1|AE011493_4    conserved hypo   145.6   6.3e−43  11
gi|24216032|ref|NP_713513.1|            hypothetical p   145.6   6.3e−43  11
gi|14026456|dbj|BAB53053.1|             mll6838 [Mesor   133.2   3.4e−39   7
gi|13475700|ref|NP_107267.1|            hypothetical p   133.2   3.4e−39   7
gi|45601137|gb|AAS70619.1|              cytoplasmic me   125.5   6.8e−37   8
gi|45657896|ref|YP_001982.1|            cytoplasmic me   125.5   6.8e−37   8
gi|24195416|gb|AAN48964.1|AE011353_4    Rhs family pro   122.8   4.5e−36   8
gi|24214465|ref|NP_711946.1|            Rhs family pro   122.8   4.5e−36   8
gi|67739010|ref|ZP_00489616.1|          COG3209: Rhs f   116.4   3.8e−34   9
gi|53718233|ref|YP_107219.1|            putative membr   116.4   3.8e−34   9
gi|52208647|emb|CAH34583.1|             putative membr   116.4   3.8e−34   9
gi|67713631|ref|ZP_00482992.1|          COG3209: Rhs f   116.4   3.8e−34   9
gi|67683974|ref|ZP_00478003.1|          COG3209: Rhs f   116.4   3.8e−34   9
gi|67653219|ref|ZP_00450636.1|          COG3209: Rhs f   114.0     2e−33   9
gi|67636722|ref|ZP_00435666.1|          COG3209: Rhs f   114.0     2e−33   9
gi|67642373|ref|ZP_00441130.1|          COG3209: Rhs f   114.0     2e−33   9
gi|53724907|ref|YP_101869.1|            FG-GAP/YD repe   114.0     2e−33   9
gi|52428330|gb|AAU48923.1|              FG-GAP/YD repe   114.0     2e−33   9
gi|67762147|ref|ZP_00500850.1|          COG3209: Rhs f   113.4   2.9e−33   9
gi|67648765|ref|ZP_00446993.1|          COG3209: Rhs f   113.4   3.1e−33   9
gi|67629702|ref|ZP_00429560.1|          COG3209: Rhs f   113.4   3.1e−33   9
gi|67670205|ref|ZP_00467015.1|          COG3209: Rhs f   113.3   3.2e−33   9
gi|27359176|gb|AAO08121.1|AE016812_103  Rhs family pro    95.5   7.1e−28   8
gi|27367604|ref|NP_763131.1|            Rhs family pro    95.5   7.1e−28   8
gi|48863870|ref|ZP_00317763.1|          COG3209: Rhs f    67.6   1.8e−19   9
gi|19915968|gb|AAM05447.1|              hypothetical p    63.0   3.6e−16   6
gi|24216032|ref|NP_713513.1|            hypothetical p    59.6   3.9e−15   7
gi|48862345|ref|ZP_00316242.1|          COG3209: Rhs f    53.3   3.6e−15  11
gi|48833214|ref|ZP_00290236.1|          COG2931: RTX t    −2.4      0.41   1
gi|20089734|ref|NP_615809.1|            cell surface p    −3.5      0.57   1
gi|19914667|gb|AAM04289.1|              cell surface p    −3.5      0.57   1
gi|67860476|gb|EAM55523.1|              Integrins alph    −6.1       1.2   1
gi|67932250|ref|ZP_00525397.1|          Integrins alph    −6.1       1.2   1
gi|33632688|emb|CAE07500.1|             conserved hypo    −9.5       3.3   1
gi|33865519|ref|NP_897078.1|            hypothetical p    −9.5       3.3   1
gi|67919346|ref|ZP_00512927.1|          Integrins alph   −10.2         4   1
gi|67783058|gb|EAM42456.1|              Integrins alph   −10.2         4   1
gi|39576631|emb|CAE80795.1|             hypothetical p   −12.6       7.6   1
gi|42524422|ref|NP_969802.1|            hypothetical p   −12.6       7.6   1
gi|48893475|ref|ZP_00326711.1|          COG3391: Uncha   −12.6       7.7   1

It is noted that there is a clean break in the E-values between gi|48862345| (E-value ~e-15) and gi|48833214| (E-value 0.41) (double-underlined for clarity). A dereplicated list of proteins follows below:

Scores for complete sequences (score includes all domains):

Sequence                                Description      Score   E-value   N
gi|20090892|ref|NP_616967.1|            hypothetical p   153.0   3.5e−45   9
gi|45656716|ref|YP_000802.1|            cytoplasmic me   151.7     9e−45  11
gi|24197262|gb|AAN50531.1|AE011493_4    conserved hypo   145.6   6.3e−43  11
gi|14026456|dbj|BAB53053.1|             mll6838 [Mesor   133.2   3.4e−39   7
gi|45601137|gb|AAS70619.1|              cytoplasmic me   125.5   6.8e−37   8
gi|24195416|gb|AAN48964.1|AE011353_4    Rhs family pro   122.8   4.5e−36   8
gi|67739010|ref|ZP_00489616.1|          COG3209: Rhs f   116.4   3.8e−34   9
gi|67653219|ref|ZP_00450636.1|          COG3209: Rhs f   114.0     2e−33   9
gi|67762147|ref|ZP_00500850.1|          COG3209: Rhs f   113.4   2.9e−33   9
gi|67648765|ref|ZP_00446993.1|          COG3209: Rhs f   113.4   3.1e−33   9
gi|67670205|ref|ZP_00467015.1|          COG3209: Rhs f   113.3   3.2e−33   9
gi|27359176|gb|AAO08121.1|AE016812_103  Rhs family pro    95.5   7.1e−28   8
gi|48863870|ref|ZP_00317763.1|          COG3209: Rhs f    67.6   1.8e−19   9
gi|48862345|ref|ZP_00316242.1|          COG3209: Rhs f    53.3   3.6e−15  11

Thus, this example demonstrates that fused Class B/C TC protein sequences within large datasets can be identified from their shared protein domain structures, in the absence of other amino acid sequence information aside from length.
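The clean break in E-values noted above can also be located programmatically rather than by eye. The following is a minimal sketch, assuming the hits have already been parsed into (identifier, E-value) pairs; the function name and the shortened identifiers are illustrative only, and the E-values are the six rows on either side of the break in the rhs_ls.hmm results.

```python
import math

# (sequence id, full-sequence E-value) pairs copied from the rhs_ls.hmm
# results; only the rows around the break are listed here.
hits = [
    ("gi|27359176|", 7.1e-28),
    ("gi|48863870|", 1.8e-19),
    ("gi|48862345|", 3.6e-15),
    ("gi|48833214|", 0.41),
    ("gi|20089734|", 0.57),
    ("gi|67860476|", 1.2),
]


def largest_evalue_break(hits):
    """Return (last_kept, first_rejected, gap) for the widest log10 E-value gap."""
    ranked = sorted(hits, key=lambda h: h[1])
    best = None
    for (id_a, e_a), (id_b, e_b) in zip(ranked, ranked[1:]):
        gap = math.log10(e_b) - math.log10(e_a)
        if best is None or gap > best[2]:
            best = (id_a, id_b, gap)
    return best


last_kept, first_rejected, gap = largest_evalue_break(hits)
print(last_kept, first_rejected, round(gap, 1))
```

Working on a log10 scale is a design choice: E-values span dozens of orders of magnitude, so the break shows up as a gap of about 14 log units, far wider than any gap between neighbouring true hits.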

Eukaryotic Fused Class B/C TC Proteins. Models to identify eukaryotic proteins corresponding to fused Class B/C TC proteins were developed in a slightly different fashion from the Prokaryotic/Archaeal model. The only known example, from Gibberella zeae, has an spvB domain (rarely found in eukaryotic proteins) which closely fits the Pfam model. The G. zeae fused Class B/C TC protein also has FG-GAP domains which are discoverable using the FG-GAP BModels3.hmm model developed for the nonfused Class B TC proteins above. When used together, these two models were powerful enough to select the G. zeae protein from a database, so an RHS HMM model was not developed. One skilled in the art would realize that such an RHS model could easily be developed using these teachings. Such an additional RHS model might be useful, for example, if all of the GenBank proteins were to be searched instead of the subset tested below.

A protein search at the website ncbi.nlm.nih.gov/ was performed using the search terms: “1700:2800[SLEN] AND Eukaryota”. These terms restricted the search to those proteins that are between 1700 and 2800 amino acids in length and have a eukaryotic source. A total of 19550 protein sequences were identified and downloaded as a searchable database. It should be noted that, while these restrictions are useful in the present context, the interval length of the sequences and the kingdom searched can be modified to meet the parameters of the individual protein set to be examined.

The data set was tested against the FG-GAP model first. This model was considered to be the less discriminating of the two used, since eukaryotic proteins containing FG-GAP domains are known to exist. However, as shown below, only a single member of the 19550 proteins of the search set had a significant hit. [GenBank Accessions gi|46138103| and gi|42545609| are duplicate entries.] This result indicates that the FG-GAP model is substantially discriminatory for known proteins within the 1700 to 2800 residue sequence length range.

hmmsearch - search a sequence database with a profile HMM
HMMER 2.3.1 (June 2003)
Freely distributed under the GNU General Public License (GPL)
HMM file: FinalTest/BModels3.hmm [BDomainsModel3Sequences]
Sequence database: FinalTest/EukaryoticGenBank1700-2800.fasta
per-sequence score cutoff: [none]
per-domain score cutoff: [none]
per-sequence Eval cutoff: <=10
per-domain Eval cutoff: [none]
Query HMM: BDomainsModel3Sequences
Accession: [none]
Description: [none]

[HMM has been calibrated; E-values are empirical estimates]

Scores for complete sequences (score includes all domains):

Sequence                                Description           Score   E-value   N
gi|46138103|ref|XP_390742.1|            hypothetical protei    61.6   5.5e−15   3
gi|42545609|gb|EAA68452.1|              hypothetical protei    61.6   5.5e−15   3
gi|55620227|ref|XP_526172.1|            PREDICTED: similar      8.3       1.1   1
gi|34978363|sp|P39061|COIA1_MOUSE       Collagen alpha 1(XV     3.6       3.9   1
gi|7446033|pir||B56101                  collagen alpha 1(XV     3.6       3.9   1
gi|1167906|gb|AAC52903.1|               alpha-1(XVIII) coll     3.6       3.9   1
gi|57916309|ref|XP_555970.1|            ENSANGP00000027590      1.9       6.4   1
gi|55238148|gb|EAL39797.1|              ENSANGP00000027590      1.9       6.4   1
gi|55656175|ref|XP_531500.1|            PREDICTED: hypothet     1.7       6.6   1
gi|40231526|gb|AAR83296.1|              type XVIII collagen     1.7       6.6   1
gi|57112907|ref|XP_549331.1|            PREDICTED: similar      0.3       9.8   1

We tested this protein with the spvB_ls.hmm model to show that this model can find the appropriate protein. The results are below:

hmmsearch - search a sequence database with a profile HMM
HMMER 2.3.1 (June 2003)
Freely distributed under the GNU General Public License (GPL)
HMM file: FinalTest/spvB_ls.hmm [SpvB]
Sequence database: FinalTest/EukBCModel3Hits.fasta
per-sequence score cutoff: [none]
per-domain score cutoff: [none]
per-sequence Eval cutoff: <=10
per-domain Eval cutoff: [none]
Query HMM: SpvB
Accession: PF03534.3
Description: Salmonella virulence plasmid 65 kDa B protein

[HMM has been calibrated; E-values are empirical estimates]

Scores for complete sequences (score includes all domains):

Sequence                                Description                Score   E-value   N
gi|46138103|ref|XP_390742.1|            hypothetical protein FG1   159.7   1.7e−48   1
gi|42545609|gb|EAA68452.1|              hypothetical protein FG1   159.7   1.7e−48   1

Since only a single example of a eukaryotic Class B/C TC fusion protein has been discovered to date, it is not possible to provide a stringent test of this search strategy. However, it is clear that the two-model domain search strategy will be useful to discriminate proteins within the sequence length range that have FG-GAP domains but that are not fused Class B/C TC proteins. In addition, the models provided here can be used to extract Class B/C TC fusion proteins from a wider sequence range, when necessary. As with previous demonstrations, the order in which the models are used does not alter the final results. It is also important to note that an RHS model could be added to the above search criteria if necessary to obtain further discrimination.

The above examples teach that a combination of (1) sequence length filtering and (2) domain searching constitutes a powerful method for extracting Class B, Class C and fused Class B/C TC proteins from protein sequence databases. The domains are taken from the spvB, FG-GAP and RHS domain families, using either general Pfam HMM models or particular HMM domain models tailored to the particular protein class. The sequence length intervals used in these examples were chosen to encompass the known range of these proteins, and to show that these proteins can be separated, not just from all other proteins, but also from other members of these protein families. Since the same HMM models were used for both the prokaryotic Class B TC proteins and the eukaryotic fused Class B/C TC proteins, both result sets would have been extracted together if no sequence length or kingdom limits had initially been placed on the search. Optionally, an RHS model could easily be developed to discriminate between these protein sets. If such discrimination is undesired or unnecessary, the whole GenBank protein data set could be used as input. However, given the large and ever-expanding size of the GenBank databases, this would make the searches significantly slower.
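The two-step strategy summarized above (length filtering followed by domain searching with more than one model) can be sketched in a few lines. This is a minimal illustration, assuming the per-model hit identifiers have already been taken from hmmsearch output; the function names, the toy sequences, and the toy length window are hypothetical and stand in for the real 1700 to 2800 residue window and GenBank records.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_candidates(records, min_len, max_len, hits_per_model):
    """Keep sequences inside the length window that are hits for every model."""
    kept = {h for h, s in records if min_len <= len(s) <= max_len}
    for hit_ids in hits_per_model.values():
        kept &= set(hit_ids)  # set intersection, so model order is irrelevant
    return sorted(kept)


# Toy data: three short sequences and a 5-8 residue window.
toy_fasta = ">seqA\nMKT\nAYI\n>seqB\nMK\n>seqC\nMKTAYIAK\n"
toy_hits = {"FG-GAP": {"seqA", "seqC"}, "spvB": {"seqA", "seqB"}}
candidates = select_candidates(parse_fasta(toy_fasta), 5, 8, toy_hits)
print(candidates)
```

Because the models are combined by set intersection, the result does not depend on the order in which the models are applied, which matches the observation made in the examples.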

For further guidance, see E. L. L. Sonnhammer, S. R. Eddy, and R. Durbin. Proteins 28:405-420, 1997 (describes the Pfam database of multiple sequence alignments and HMMs, and its use in large scale genome analysis), and Richard Durbin, Sean Eddy, Anders Krogh, and Graeme Mitchison (Cambridge University Press, 1998), Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids.

EXAMPLE 12
Localization of spvB, FG-GAP, RHS, and HVR Subdomains in BC Fusion Proteins of Tannerella and Gibberella

FIGS. 2 and 3 illustrate the locations of the above subdomains in the B and C domains of the BC fusion proteins of Tannerella and Gibberella, respectively. For Tannerella (FIG. 2), the spvB domain (standard spvB_ls.hmm model) is illustrated with underlining from residues 51-374 (of FIG. 2 and also SEQ ID NO:12). The FG-GAP domains (using the BModel7.hmm model; there are six in this molecule) are indicated with double underlining and occur at residues 392-421, 453-486, 502-531, 552-581, 604-625, and 650-681. The foregoing features are observable in the B domain. Following a transition into the C domain, eight RHS domains (using Pfam rhs_ls.hmm model) are determinable in the C domain at residues 1048-1085, 1168-1201, 1207-1243, 1248-1285, 1290-1326, 1331-1369, 1447-1482, and 1620-1652. These are indicated in bold in FIG. 2. A hypervariable region (HVR), common in “Class C” TC proteins, is also identified at the C terminus of the molecule (residues 1733-2027). This is indicated in italics in FIG. 2.

Similarly, in FIG. 3 (and SEQ ID NO:4) for Gibberella, an spvB subdomain (shown with underlining in FIG. 3) in the B domain is observable at residues 51-374. Three FG-GAP domains/subdomains are accordingly determined to occur at residues 570-609, 630-669, and 685-700. These are indicated with double underlining in FIG. 3. (The intron of SEQ ID NO:2, residues 1557-1583, is not shown in FIG. 3.) Two RHS domains are located at residues 1738-1774 and 1972-2002 in the C domain (indicated in bold in FIG. 3), as is a HVR at residues 2154-2439 (the C terminal region), indicated in italics in FIG. 3.

For the Gibberella model discussed above, the spvB domain was determined using the standard spvB_ls.hmm model. The three FG-GAP domains were found with the BModels3.hmm model. As only one eukaryotic protein is presently known, it is difficult to test for the best model. As more eukaryotic fused BC toxin proteins are discovered, the model will likely improve. The RHS domains were found using the Pfam rhs_ls.hmm model. As with the FG-GAP domains, two RHS domains were found. As more eukaryotic examples are found, the model is expected to improve. Under a slightly different domain search, NCBI's CD (conserved domain) search, the section from amino acid residues 1493-2153 is labeled as RHSA [Marchler-Bauer A, Bryant SH (2004), “CD-Search: protein domain annotations on the fly”, Nucleic Acids Res. 32:W327-331]. As with the Tannerella model, the HVR was mapped by lack of homology to other proteins. However, HVRs are recognizable in other “Class C” TC proteins. Due to the differences in length of various natural proteins as illustrated above, the exact residue locations for each subdomain cannot be predicted for future proteins. However, the subject invention includes naturally occurring proteins wherein the spvB domain is located in the first half of the molecule, followed by at least one FG-GAP domain, followed by at least one RHS domain in the last two thirds of the molecule, followed by a hypervariable region at the end of the protein. Some software programs may also predict a transmembrane domain. This is the case for the program TMAP (transmembrane detection based on multiple sequence alignment; Karolinska Institutet, Sweden). Thus, it is possible that the subject proteins further comprise a transmembrane domain.

EXAMPLE 13
Alignment and Further Comparison of BC Fused Toxin Proteins from Tannerella and Gibberella

A global alignment of the two BC fused toxin proteins from Tannerella and Gibberella was performed with needle, an EMBOSS program (EMBOSS: The European Molecular Biology Open Software Suite (2000), Rice, P., Longden, I., and Bleasby, A., Trends in Genetics 16(6):276-277) using the Needleman-Wunsch algorithm (same as GCG's GAP). See FIG. 4.

Additional settings used were: Align_format: srspair; Report_file: outfile; Matrix: EBLOSUM62; Gap_penalty: 10.0; Extend_penalty: 0.5.

For a length of 2894 amino acid residues, the following scores were obtained: Identity: 517/2894 (17.9%); Similarity: 796/2894 (27.5%); Gaps: 1322/2894 (45.7%); Score: 441.0.
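The reported percentages follow directly from the counts and the alignment length. A short check, using only the figures given above (the variable and function names are illustrative):

```python
ALIGNMENT_LENGTH = 2894  # aligned columns, including gap positions
counts = {"Identity": 517, "Similarity": 796, "Gaps": 1322}


def percent(count, total):
    """Percentage of aligned columns, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


summary = {name: percent(n, ALIGNMENT_LENGTH) for name, n in counts.items()}
print(summary)
```

Note that all three percentages are expressed relative to the full alignment length of 2894 columns, so the large gap fraction lowers the identity and similarity values accordingly.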

EXAMPLE 14
Construction of a Gene Encoding the 8884 Fusion Protein (TcdB2/Tcp1_GzC)

Fusion protein 8884 is comprised of the entire Photorhabdus TcdB2 (a Class B protein) fused to a portion of the Gibberella zeae Tcp1_Gz protein. The segment of the Tcp1_Gz protein present in the 8884 fusion protein is herein called Tcp1_GzC, to reflect its functional similarity to other Class C proteins.

To construct the coding region for the 8884 fusion protein, the 3′ end of the TcdB2 coding region was modified using standard molecular biology techniques. Likewise, the 5′ end of the coding region of the C-like region of Tcp1_Gz was modified in a multi-step process, and the two modified coding regions were joined with a linker fragment to create a single open reading frame. The novel DNA encoding the 8884 gene fusion is disclosed in SEQ ID NO:19 and encodes polypeptide 8884 (presented in SEQ ID NO:20). Nucleotides 1-4422 of the 8884 fusion protein coding region correspond to the same numbered bases of the Photorhabdus luminescens strain W-14 tcdB2 gene (Genbank Accession AF346500.2) and encode the entire TcdB2 protein. This sequence is followed by 42 bases of linker sequence (encoding 14 amino acids), which is then followed by a DNA sequence corresponding to nucleotides 4346-7423 of a DNA sequence encoding the Tcp1_Gz protein optimized for expression in E. coli cells (SEQ ID NO:5). The fused gene consisting of the coding regions for TcdB2 and Tcp1_GzC (disclosed as SEQ ID NO:19) was cloned into a pET expression plasmid vector (Novagen, Madison Wis.). The construction was done in such a way as to maintain appropriate bacterial transcription and translation signals. The plasmid was designated pDAB8884. The cassette in SEQ ID NO:19 is 7542 nucleotides in length and contains coding regions for TcdB2 (nts 1-4422), the TcdB2/Tcp1_GzC linker peptide (nts 4423-4464) and Tcp1_GzC (nts 4465-7539). The polypeptide encoded by the fused gene in SEQ ID NO:19 is shown in SEQ ID NO:20. The fusion protein is predicted to contain 2,513 amino acids with segments representing TcdB2 (residues 1-1474), the TcdB2/Tcp1_GzC linker peptide (residues 1475-1488), and Tcp1_GzC (residues 1489-2513).
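The nucleotide and residue coordinates given for the 8884 cassette can be cross-checked with simple frame arithmetic. The sketch below uses only the coordinates stated above; the function name is illustrative, and the interpretation of the final three nucleotides as a stop codon is an assumption, since the text does not say so explicitly.

```python
# 1-based, inclusive nucleotide coordinates from the description of SEQ ID NO:19.
segments = {
    "TcdB2": (1, 4422),
    "linker": (4423, 4464),
    "Tcp1_GzC": (4465, 7539),
}
CASSETTE_LENGTH = 7542


def to_residues(nt_start, nt_end):
    """Convert in-frame nucleotide coordinates to 1-based residue coordinates."""
    length = nt_end - nt_start + 1
    if (nt_start - 1) % 3 or length % 3:
        raise ValueError("segment is not in frame")
    return (nt_start - 1) // 3 + 1, nt_end // 3


residues = {name: to_residues(*span) for name, span in segments.items()}
# Nucleotides not accounted for by the three coding segments
# (assumed here to be a stop codon).
trailing_nt = CASSETTE_LENGTH - segments["Tcp1_GzC"][1]
print(residues, trailing_nt)
```

The same check applies unchanged to any other in-frame fusion cassette whose segments are given as inclusive nucleotide ranges.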

EXAMPLE 15
Expression Conditions for pDAB8884 and Lysate Preparations

The Class A TC protein XptA2_Xwi was utilized in a purified form prepared from cultures of Pseudomonas fluorescens heterologously expressing the gene. The expression plasmids pET280 (empty vector control), pDAB8920 (encoding a TcdB2/TccC3 fusion protein), pDAB8829 (encoding the Tcp1_Gz protein) and pDAB8884 were transformed into the E. coli T7 expression strain BL21(DE3) (Invitrogen, Carlsbad, Calif.) using standard methods. Expression cultures were initiated by inoculating 10-200 freshly transformed colonies into 250 mL LB medium containing 50 μg/mL antibiotic and 75 μM IPTG (isopropyl-β-D-thiogalactopyranoside). The cultures were grown at 28° C. for 24 hours at 180-200 rpm. The cells were collected by centrifugation in 500 mL Nalgene bottles at 5,000×g for 20 minutes at 4° C. The pellets were suspended in 4-4.5 mL Butterfield's Phosphate solution (Hardy Diagnostics, Santa Maria, Calif.; 0.3 mM potassium phosphate pH 7.2). The suspended cells were transferred to 50 mL polypropylene screw cap centrifuge tubes with 1 mL of 0.1 mm diameter glass beads (Biospec, Bartlesville, Okla., catalog number 1107901). The cell-glass bead mixture was chilled on ice, then the cells were lysed by sonication with two 45 second bursts using a 2 mm probe with a Branson Sonifier 250 (Danbury Conn.) at an output of ~30, chilling completely between bursts. The lysates were transferred to 2 mL Eppendorf tubes and centrifuged 5 minutes at 16,000×g.

EXAMPLE 16
Bioassay Conditions for the 8884 Lysates

Insect bioassays were conducted with neonate corn earworm larvae (Helicoverpa zea (Boddie)) on artificial diets in 128-well trays specifically designed for insect bioassays (C-D International, Pitman, N.J.). Bioassays were incubated under controlled environmental conditions (28° C., ~40% relative humidity, 16 hr:8 hr [Light:Dark]) for 5 days, at which point the total number of insects in the treatment, the number of dead insects, and the weights of surviving insects were recorded.

The biological activity of the crude lysates alone or with added XptA2_Xwi toxin protein was assayed as follows. Crude E. coli lysates (40 μL) of either control cultures or those expressing Toxin Complex proteins were applied to the surface of artificial diet in 8 wells of a bioassay tray. The average surface area of treated diet in each well was ~1.5 cm². The lysates from bacterial cultures harboring the empty vector control, or lysates from cultures producing the 8920 TcdB2/TccC3 fusion protein, the 8829 Tcp1_Gz protein, and the 8884 TcdB2/Tcp1_GzC fusion protein were applied with and without XptA2_Xwi. The XptA2_Xwi protein added was a highly purified preparation from bacterial cultures heterologously expressing the protein. Additionally, purified XptA2_Xwi without any crude lysate was mixed with Butterfield's Phosphate solution as a control. The final concentration of XptA2_Xwi on the diet was 250 ng/cm².

EXAMPLE 17
Construction of a Gene Encoding the 8883 Fusion Protein (Tcp1_GzB/TccC3)

Fusion protein 8883 is comprised of a portion of the Gibberella zeae Tcp1_Gz protein fused to the entire Photorhabdus TccC3 protein (a Class C protein). The segment of the Tcp1_Gz protein present in the 8883 fusion protein is herein called Tcp1_GzB, to reflect its functional similarity to other Class B proteins.

To construct the 8883 fusion protein coding region, the 3′ end of the coding region for the B-like region of Tcp1_Gz was modified in a multi-step process using standard molecular biology techniques. Likewise, the 5′ end of the TccC3 coding region was modified in a multi-step process, and the two modified coding regions were joined with a linker fragment to create a single open reading frame. The novel DNA encoding the 8883 gene fusion is disclosed in SEQ ID NO:21 and encodes polypeptide 8883 (presented in SEQ ID NO:22). The portion of the Gibberella zeae protein Tcp1_Gz corresponding to the Tcp1_GzB protein is encoded by bases 23-4558 of a DNA sequence encoding the Tcp1_Gz protein optimized for expression in E. coli cells (disclosed in SEQ ID NO:5). This sequence comprises bases 1-4536 of SEQ ID NO:21. These bases are followed by a linker fragment of 39 bases (encoding 13 amino acids), then by the entire coding region for the Photorhabdus luminescens strain W-14 TccC3 protein (a Class C protein; Genbank Accession AF346500.2). [In SEQ ID NO:21, base number 12 (T of the native sequence) was changed to C to accommodate a ClaI restriction enzyme recognition site. This silent base change does not alter the encoded amino acid sequence of the TccC3 protein.] This novel fusion gene is called 8883 (SEQ ID NO:21) and encodes polypeptide 8883 (SEQ ID NO:22).

The fused gene consisting of the coding regions for Tcp1_GzB and TccC3 was engineered as a single open reading frame in a pET expression plasmid vector (Novagen, Madison Wis.). The construction was done in such a way as to maintain appropriate bacterial transcription and translation signals. The plasmid was designated pDAB8883. The DNA sequence of the fused coding region cassette is shown in SEQ ID NO:21. The cassette is 7458 nucleotides in length and contains coding regions of Tcp1_GzB (nts 1-4536), the Tcp1_GzB/TccC3 linker peptide (nts 4537-4575) and TccC3 (nts 4576-7455). The polypeptide encoded by the fused gene in SEQ ID NO:21 is shown in SEQ ID NO:22. The fusion protein is predicted to contain 2,485 amino acids with segments representing Tcp1_GzB (residues 1-1512), the Tcp1_GzB/TccC3 linker peptide (residues 1513-1525), and TccC3 (residues 1526-2485).

Lysates containing the 8883 fusion protein demonstrated excellent functional activity, as shown in the following examples. Thus, this invention demonstrates the retained synergistic activity of a fusion of the Tcp1_GzB peptide, the product of a eukaryotic gene, with TccC3, the product of a prokaryotic gene, when used in combination with the Toxin Complex protein XptA2_Xwi.

EXAMPLE 18
Expression Conditions for pDAB8883 and Lysate Preparations

The Class A TC protein XptA2_Xwi was utilized in a purified form prepared from cultures of Pseudomonas fluorescens heterologously expressing the gene. The expression plasmids pET280 (empty vector control), pDAB8920 (encoding a TcdB2/TccC3 fusion protein), pDAB8829 (encoding the Tcp1_Gz protein), and pDAB8883 were transformed into the E. coli T7 expression strain BL21(DE3) (Invitrogen, Carlsbad, Calif.) using standard methods. Expression cultures were initiated by inoculating 10-200 freshly transformed colonies into 250 mL LB medium containing 50 μg/mL antibiotic and 75 μM IPTG (isopropyl-β-D-thiogalactopyranoside). The cultures were grown at 28° C. for 24 hours at 180-200 rpm. The cells were collected by centrifugation in 500 mL Nalgene bottles at 5,000×g for 20 minutes at 4° C. The pellets were suspended in 4-4.5 mL Butterfield's Phosphate solution (Hardy Diagnostics, Santa Maria, Calif.; 0.3 mM potassium phosphate pH 7.2). The suspended cells were transferred to 50 mL polypropylene screw cap centrifuge tubes with 1 mL of 0.1 mm diameter glass beads (Biospec, Bartlesville, Okla., catalog number 1107901). The cell-glass bead mixture was chilled on ice, then the cells were lysed by sonication with two 45 second bursts using a 2 mm probe with a Branson Sonifier 250 (Danbury Conn.) at an output of ~30, chilling completely between bursts. The lysates were transferred to 2 mL Eppendorf tubes and centrifuged 5 minutes at 16,000×g.

EXAMPLE 19
Bioassay Conditions for the 8883 Lysates

Insect bioassays were conducted with neonate corn earworm larvae, Helicoverpa zea (Boddie), on artificial diets in 128-well trays specifically designed for insect bioassays (C-D International, Pitman, N.J.). Bioassays were incubated under controlled environmental conditions (28° C., ~40% relative humidity, 16 hr:8 hr [Light:Dark]) for 5 days, at which point the total number of insects in the treatment, the number of dead insects, and the weights of surviving insects were recorded.

The biological activity of the crude lysates alone or with added XptA2_Xwi toxin protein was assayed as follows. Crude E. coli lysates (40 μL) of either control cultures or those expressing Toxin Complex proteins were applied to the surface of artificial diet in 8 wells of a bioassay tray. The average surface area of treated diet in each well was ~1.5 cm². The lysates from bacterial cultures harboring the empty vector control, or lysates from cultures producing the 8920 TcdB2/TccC3 fusion protein, the 8829 Tcp1_Gz protein, and the 8883 Tcp1_GzB/TccC3 fusion protein were applied with and without XptA2_Xwi. The XptA2_Xwi protein added was a highly purified preparation from bacterial cultures heterologously expressing the protein. Additionally, purified XptA2_Xwi without any crude lysate was mixed with Butterfield's Phosphate solution as a control. The final concentration of XptA2_Xwi on the diet was 250 ng/cm².

EXAMPLE 20
Bioassay Results for 8883 Tcp1_GzB/TccC3 Fusion Lysates

Table 7 shows the bioassay results for control lysates, lysates of cells programmed to express the 8920 TcdB2/TccC3 fusion protein, lysates of cells programmed to express the 8829 Tcp1_Gz protein, and lysates of cells programmed to express the 8883 Tcp1_GzB/TccC3 fusion protein. All of the lysates were bioassayed plus and minus purified XptA2_Xwi. The data show that control lysates, with and without XptA2_Xwi, had little effect on the insects. Lysates containing only the 8920 TcdB2/TccC3 fusion protein had no effect without added XptA2_Xwi. However, with added XptA2_Xwi, the 8920 lysate was a potent inhibitor of insect growth. Lysates containing only the 8829 Tcp1_Gz protein had no effect without added XptA2_Xwi. However, with added XptA2_Xwi, the 8829 lysate was a potent inhibitor of insect growth. Lysates programmed to express the 8883 Tcp1_GzB/TccC3 fusion protein had no effect without added XptA2_Xwi. However, with added XptA2_Xwi, the 8883 lysate was a potent inhibitor of insect growth. These data demonstrate that when the Tcp1_GzB and TccC3 peptides are fused together they retain a synergistic effect when combined with XptA2_Xwi.

TABLE 7

Response of corn earworm (Helicoverpa zea (Boddie)) neonate larvae to E. coli lysates expressing Toxin Complex proteins.

Sample                  Lysate Tested            Growth Inhibition, Corn Earworm
pET280                  Empty vector control     0
pET280 + XptA2_Xwi      Empty vector control     0
Purified XptA2_Xwi      XptA2_Xwi                0
pDAB8920                8920 (TcdB2/TccC3)       0
pDAB8920 + XptA2_Xwi    8920 (TcdB2/TccC3)       ++++
pDAB8829                8829 (Tcp1_Gz)           0
pDAB8829 + XptA2_Xwi    8829 (Tcp1_Gz)           ++++
pDAB8883                8883 (Tcp1_GzB/TccC3)    0
pDAB8883 + XptA2_Xwi    8883 (Tcp1_GzB/TccC3)    ++++

24 insects used per test.

Growth Inhibition Scale: 0 = 0-20%; + = 21-40%; ++ = 41-60%; +++ = 61-80%; ++++ = 81-100%.
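The growth inhibition scale used in Table 7 amounts to a simple binning of a percentage into five symbols. A minimal sketch follows; the function name is illustrative, and the treatment of non-integer values (each bin is closed at its upper bound, so 20.5% falls in the "+" bin) is an assumption, since the scale is stated only for whole percentages.

```python
def growth_inhibition_symbol(percent):
    """Map a growth inhibition percentage onto the Table 7 scale."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Upper bound of each bin paired with its symbol, in ascending order.
    for upper, symbol in ((20, "0"), (40, "+"), (60, "++"), (80, "+++"), (100, "++++")):
        if percent <= upper:
            return symbol


print([growth_inhibition_symbol(p) for p in (5, 21, 60, 61, 95)])
```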

EXAMPLE 21
Design and Synthesis of a Plant-Optimized Gene Encoding Tcp1_Gz for Expression in Plants

To obtain higher levels of expression of a fungal gene in plants, it may be preferred to re-engineer the protein-encoding sequence of the gene so that it is more efficiently expressed in plant cells. This example teaches the design of a new DNA sequence that encodes the Tcp1_Gz protein of SEQ ID NO: 2, but is optimized for expression in plant cells.

One reason for re-engineering a gene encoding a fungal protein for expression in plants is the non-optimal G+C content of the heterologous gene. For example, the low G+C content of many native fungal gene(s) (and consequent skewing towards high A+T content) results in the generation of sequences mimicking or duplicating plant gene control sequences that are known to be highly A+T rich. The presence of some A+T-rich sequences within the DNA of gene(s) introduced into plants (e.g., TATA box regions normally found in gene promoters) may result in aberrant transcription of the gene(s). On the other hand, the presence of other regulatory sequences residing in the transcribed mRNA (e.g., polyadenylation signal sequences (AAUAAA), or sequences complementary to small nuclear RNAs involved in pre-mRNA splicing) may lead to RNA instability. Therefore, one goal in the design of genes encoding a fungal protein for plant expression, more preferably referred to as plant optimized gene(s), is to generate a DNA sequence having a G+C content close to that of the average of plant gene coding regions. Another goal in the design of the plant optimized gene(s) encoding a fungal protein is to generate a DNA sequence in which the sequence modifications do not hinder translation.
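Both quantities discussed above, overall G+C content and the presence of A+T-rich signal-like motifs, are easy to measure on a candidate coding sequence. The following is a minimal sketch; the function names and the short example sequence are hypothetical, and the motif searched for is the DNA equivalent (AATAAA) of the AAUAAA polyadenylation signal mentioned in the text.

```python
def gc_content(seq):
    """Return the G+C content of a DNA sequence as a percentage."""
    seq = seq.upper()
    gc = sum(seq.count(base) for base in "GC")
    return 100.0 * gc / len(seq)


def find_motif(seq, motif):
    """Return 0-based start positions of every (possibly overlapping) motif hit."""
    seq, motif = seq.upper(), motif.upper()
    return [i for i in range(len(seq) - len(motif) + 1) if seq.startswith(motif, i)]


example = "ATGGCCAATAAAGGC"  # hypothetical fragment, for illustration only
print(round(gc_content(example), 1), find_motif(example, "AATAAA"))
```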

Due to the plasticity afforded by the redundancy/degeneracy of the genetic code (i.e., some amino acids are specified by more than one codon), evolution of the genomes in different organisms, or classes of organisms, has resulted in differential usage of redundant codons. This “codon bias” is reflected in the mean base composition of protein coding regions. For example, organisms with relatively low G+C contents utilize codons having A or T in the third position of redundant codons, whereas those having higher G+C contents utilize codons having G or C in the third codon position. It is thought that the presence of “minor” codons within an mRNA may reduce the absolute translation rate of that mRNA, especially when the relative abundance of the charged tRNA corresponding to the minor codon is low. An extension of this concept is that the diminution of translation rate by individual minor codons would be at least additive for multiple minor codons. Therefore, mRNAs having high relative contents of minor codons would have correspondingly low translation rates. This rate would be reflected by subsequent low levels of the encoded protein.
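The codon bias described above can be measured as the percentage with which each codon is used among the synonymous codons for its amino acid. The sketch below is an illustration under stated assumptions: it uses the standard genetic code, accepts only unambiguous A/C/G/T codons, and is run on a short hypothetical coding fragment rather than on real maize or dicot genes.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, built in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def codon_usage(cds):
    """Return {codon: % usage among synonymous codons} for a coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore any incomplete trailing codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    totals = defaultdict(int)
    for codon, n in counts.items():
        totals[CODON_TABLE[codon]] += n
    return {codon: 100.0 * n / totals[CODON_TABLE[codon]] for codon, n in counts.items()}


usage = codon_usage("GCTGCCGCTGCGAAA")  # hypothetical fragment: Ala x4, Lys x1
print(usage)
```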

To assist in engineering genes encoding a fungal protein for expression in plants, the codon bias of plant genes can be determined. The codon bias for genes of a particular plant is represented by the statistical codon distribution found in the protein coding regions of the plant genes. In Table 8, Columns C, D, I, and J present the distributions (in % of usage for all codons for that amino acid) of synonymous codons for each amino acid, as found in the coding regions of Zea mays (maize) and dicot genes. The codons most preferred by each plant type are indicated in bold font, and the second, third, or fourth choices of preferred codons can be identified when multiple choices exist. It is evident that some synonymous codons for some amino acids are found only rarely in plant genes, and further, that maize and dicot plants differ in codon usage (e.g. Alanine codon GCG occurs more frequently in maize genes, while Arginine codon AGA is more often used in dicot genes). A new DNA sequence which encodes the amino acid sequence of the fungal Tcp1_Gz protein was designed for optimal expression in both maize and dicot plants. The new DNA sequence differs from the native fungal DNA sequence encoding the Tcp1_Gz protein by the substitution of plant (first preferred, second preferred, third preferred, or fourth preferred) codons to specify the appropriate amino acid at each position within the protein amino acid sequence. In the design process of creating a fungal protein-encoding DNA sequence that approximates an average codon distribution of both maize and dicot genes, any codon that is used infrequently relative to the other synonymous codons for that amino acid in either type of plant was not included (indicated by DNU in Columns F and L of Table 8). Usually, a codon was considered to be rarely used if it was represented about 10% or less of the time to encode the relevant amino acid in genes of either plant type (indicated by NA in Columns E and K of Table 8).

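The weighted average % value for each codon retained in the design was calculated using the formula:

Weighted % of C1 = 1/(%C1 + %C2 + %C3 + etc.) × %C1 × 100

where C1 is the codon in question and C2, C3, etc. represent the averages of the % values for maize and dicots of the remaining synonymous codons (average % values for the relevant codons are taken from Columns E and K of Table 8). The Weighted % value for each codon is given in Columns F and L of Table 8.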
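The Weighted % values in Table 8 are consistent with rescaling the maize-dicot averages of the retained codons so that they sum to 100. A minimal Python sketch of that calculation follows. The function name and the use of None to mark NA entries are illustrative assumptions, not part of the disclosed method.

```python
def weighted_percentages(avg_pct):
    """Rescale the averages of the retained codons so they sum to 100.

    avg_pct maps codon -> maize-dicot average % (Column E or K),
    or None for a codon marked NA / DNU (hypothetical convention).
    """
    kept = {codon: value for codon, value in avg_pct.items() if value is not None}
    total = sum(kept.values())
    return {codon: round(100.0 * value / total, 1) for codon, value in kept.items()}


# Alanine averages from Table 8; GCG is excluded from the design (NA / DNU).
alanine = {"GCA": 21.7, "GCC": 30.3, "GCG": None, "GCT": 33.2}
print(weighted_percentages(alanine))  # {'GCA': 25.5, 'GCC': 35.6, 'GCT': 39.0}
```

The printed values match the Alanine entries in the Weighted Average column of Table 8.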

Design of the plant-optimized DNA sequence was initiated by reverse-translation of the protein sequence of SEQ ID NO: 2 using a balanced maize-dicot codon bias table constructed from Table 8 Columns F and L. The initial sequence was then modified by compensating codon changes (while retaining overall weighted average codon representation) to remove or add restriction enzyme recognition sites, and to remove highly stable intrastrand secondary structures and other sequences that might be detrimental to cloning manipulations or expression of the engineered gene in plants.
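The reverse-translation step can be illustrated with a short Python sketch. It is only one plausible way to apply a weighted codon table: codons are drawn at random in proportion to their Weighted % values, which is an assumption and not a procedure stated in this disclosure. The table below holds a few amino acids copied from Table 8, and the peptide MAKFWAAK is a hypothetical test input.

```python
import random

# Weighted % values for a few amino acids, copied from Columns F and L of Table 8.
WEIGHTED = {
    "M": {"ATG": 100.0},
    "A": {"GCA": 25.5, "GCC": 35.6, "GCT": 39.0},
    "K": {"AAA": 30.6, "AAG": 69.4},
    "F": {"TTC": 63.2, "TTT": 36.8},
    "W": {"TGG": 100.0},
}

# Inverse lookup used only to confirm that the designed DNA encodes the input peptide.
CODON_TO_AA = {codon: aa for aa, codons in WEIGHTED.items() for codon in codons}


def reverse_translate(protein, table=WEIGHTED, seed=0):
    """Pick one codon per residue, sampling in proportion to the weighted %."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    picked = []
    for aa in protein:
        choices = table[aa]
        picked.append(rng.choices(list(choices), weights=list(choices.values()))[0])
    return "".join(picked)


def translate(dna):
    """Translate DNA made only of codons present in CODON_TO_AA."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna), 3))


dna = reverse_translate("MAKFWAAK")  # hypothetical peptide
print(dna, translate(dna))
```

Because DNU codons are absent from the table, they can never be selected. Later design steps (restriction site and doublet adjustments) would then swap individual codons for other permitted ones.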

The new sequence was then re-analyzed for restriction enzyme recognition sites that might have been created by the modifications. The identified sites are further modified by replacing the relevant codons with first, second, third, or fourth choice preferred codons. Other sites in the sequence which could affect transcription or translation of the gene of interest include the exon:intron junctions (5′ or 3′), poly A addition signals, or RNA polymerase termination signals. The modified sequence is further analyzed and further modified to reduce the frequency of TA or CG doublets, and to increase the frequency of TG or CT doublets. In addition to these doublets, sequence blocks that have more than about five consecutive residues of [G+C] or [A+T] can affect transcription or translation of the sequence. Therefore, these sequence blocks are also modified by replacing the codons of first or second choice, etc. with other preferred codons of choice. Rarely used codons are not included to a substantial extent in the gene design, being used only when necessary to accommodate a different design criterion than codon composition per se (e.g. addition or deletion of restriction enzyme recognition sites).
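The sequence checks described above (restriction sites, TA/CG and TG/CT doublets, and blocks of more than about five consecutive [G+C] or [A+T] residues) can be sketched in Python as follows. The function names and the test sequences are hypothetical. The NcoI (CCATGG) and SacI (GAGCTC) recognition sequences are the standard ones, and because both are palindromic, scanning one strand is sufficient for them.

```python
import re


def doublet_counts(seq, doublets=("TA", "CG", "TG", "CT")):
    """Count overlapping occurrences of each doublet."""
    seq = seq.upper()
    return {d: sum(seq.startswith(d, i) for i in range(len(seq) - 1)) for d in doublets}


def long_runs(seq, limit=5):
    """Return (1-based start, block) for blocks of more than `limit` G/C or A/T bases."""
    pattern = re.compile(r"[GC]{%d,}|[AT]{%d,}" % (limit + 1, limit + 1))
    return [(m.start() + 1, m.group()) for m in pattern.finditer(seq.upper())]


def find_sites(seq, sites):
    """Return 1-based start positions of each recognition sequence (one strand only)."""
    seq = seq.upper()
    hits = {}
    for name, site in sites.items():
        positions, i = [], seq.find(site)
        while i != -1:
            positions.append(i + 1)
            i = seq.find(site, i + 1)
        hits[name] = positions
    return hits


print(doublet_counts("TATACG"))    # {'TA': 2, 'CG': 1, 'TG': 0, 'CT': 0}
print(long_runs("ACGCGCGCAT"))     # [(2, 'CGCGCGC')]
print(find_sites("AACCATGGTTGAGCTC", {"NcoI": "CCATGG", "SacI": "GAGCTC"}))
```

A non-palindromic recognition sequence would also need to be searched on the reverse complement strand.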

The method described above enables one skilled in the art to design modified gene(s) that are foreign to a particular plant so that the genes are optimally expressed in plants. The method is further described and illustrated in U.S. Pat. No. 5,380,831 and patent application WO 97/13402.

Thus, in order to design plant optimized genes encoding a fungal protein, a DNA sequence is designed to encode the amino acid sequence of said protein utilizing a redundant genetic code established from a codon bias table compiled from the gene sequences for the particular plant or plants. The resulting DNA sequence has a higher degree of codon diversity, a desirable base composition, can contain strategically placed restriction enzyme recognition sites, and lacks sequences that might interfere with transcription of the gene, or translation of the product mRNA. Thus, synthetic genes that are functionally equivalent to the proteins/genes of the subject invention can be used to transform hosts, including plants. Additional guidance regarding the production of synthetic genes can be found in, for example, U.S. Pat. No. 5,380,831.

Once said DNA sequence has been designed on paper or in silico, actual DNA molecules can be synthesized in the laboratory to correspond in sequence precisely to the designed sequence. Such synthetic DNA molecules can be cloned and otherwise manipulated exactly as if they were derived from natural or native sources.

The plant optimized, codon biased DNA sequence that encodes a variant of the Tcp1_gz fusion protein of SEQ ID NO: 2 is given as bases 3-7403 of SEQ ID NO:23 (referred to herein as the 8842 gene). To facilitate cloning and to ensure efficient translation initiation, a 5′ terminal NcoI restriction enzyme recognition sequence (CCATGG) was engineered to include the ATG translation start codon (bases 1-6 of SEQ ID NO:23). This design feature introduced a GCT codon specifying Alanine as the second amino acid of the encoded protein. Thus, the protein encoded by SEQ ID NO:23, as disclosed in SEQ ID NO:24 (referred to herein as the 8842 protein), differs from the native Tcp1_gz protein of SEQ ID NO:2 by the addition of an Alanine at the second residue. Also, to ensure proper translation termination and to facilitate cloning, bases encoding translation stop codons in the six reading frames of double-stranded DNA, plus a SacI restriction enzyme recognition site (GAGCTC), were included at the 3′ end of the coding region (bases 7404-7432 of SEQ ID NO:23). Synthesis of a DNA fragment comprising SEQ ID NO:23 was performed by a commercial supplier (PicoScript, Houston, Tex., USA).
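Two of the design features above lend themselves to a quick computational check: the length of the coding region, and the presence of stop codons in all six reading frames of a 3′ tail. The Python sketch below is illustrative only. The tail TTAATTAATTAA is a hypothetical example chosen because it is its own reverse complement; it is not the actual sequence of bases 7404-7432 of SEQ ID NO:23.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper- or lower-case DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def frames_with_stop(seq):
    """Return the set of (strand, frame) pairs that contain at least one stop codon."""
    found = set()
    for strand, s in (("+", seq.upper()), ("-", revcomp(seq))):
        for frame in range(3):
            codons = (s[i:i + 3] for i in range(frame, len(s) - 2, 3))
            if any(codon in STOPS for codon in codons):
                found.add((strand, frame))
    return found


# Bases 3-7403 inclusive: 7401 bases, i.e. 2467 codons.
coding_bases = 7403 - 3 + 1
print(coding_bases, coding_bases // 3, coding_bases % 3)   # 7401 2467 0

tail = "TTAATTAATTAA"                     # hypothetical tail, not from SEQ ID NO:23
print(len(frames_with_stop(tail)))        # 6
```

The 2467-codon count agrees with the 2467 amino acid length reported for SEQ ID NO:24.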

It is to be noted that the Gibberella zeae genomic DNA sequence tcp1_Gz, disclosed in SEQ ID NO: 1, as annotated in Genbank Accession AACM01000442, comprises a putative intron sequence (bases 4669-4749 of SEQ ID NO:1). The plant-optimized, codon biased DNA sequence that encodes a variant of the Tcp1_gz fusion protein and disclosed in SEQ ID NO:23 has been designed in such a manner that plant intron splice site recognition sequences were removed. Thus, the protein encoded by SEQ ID NO:23 and disclosed in SEQ ID NO:24, and expected to be produced by plant cells, includes amino acids encoded by the putative fungal intron sequences.

TABLE 8

Synonymous codon representation in coding regions of 706 Zea mays genes (Columns C and I), and 154 dicot plant genes (Columns D and J). Values for a balanced-biased codon representation set for a plant-optimized synthetic gene design are in Columns F and L.

A	B	C	D	E	F	G	H	I	J	K	L
Amino Acid	Codon	Maize %	Dicot* %	Maize-Dicot Average	Weighted Average	Amino Acid	Codon	Maize %	Dicot* %	Maize-Dicot Average	Weighted Average
ALA (A)	GCA	18	25	21.7	25.5	LEU (L)	CTA	8	8	NA	DNU
	GCC	34	27	30.3	35.6		CTC	26	19	22.5	34.3
	GCG	24	6	NA**	DNU***		CTG	29	9	NA	DNU
	GCT	24	42	33.2	39.0		CTT	17	28	22.5	34.3
ARG (R)	AGA	15	30	22.4	27.4		TTA	5	10	NA	DNU
	AGG	26	25	25.7	31.5		TTG	15	26	20.6	31.4
	CGA	9	8	NA	DNU	LYS (K)	AAA	22	39	30.6	30.6
	CGC	24	11	17.7	21.7		AAG	78	61	69.4	69.4
	CGG	15	4	NA	DNU	MET (M)	ATG	100	100	100	100
	CGT	11	21	15.8	19.4	PHE (F)	TTC	71	55	63.2	63.2
ASN (N)	AAC	68	55	61.4	61.4		TTT	29	45	36.8	36.8
	AAT	32	45	38.6	38.6	PRO (P)	CCA	26	42	33.8	41.4
ASP (D)	GAC	63	42	52.6	52.6		CCC	24	17	20.7	25.3
	GAT	37	58	47.4	47.4		CCG	28	9	NA	DNU
CYS (C)	TGC	68	56	61.8	61.8		CCT	22	32	27.2	33.3
	TGT	32	44	38.2	38.2	SER (S)	AGC	23	18	20.4	26.0
END	TAA	20	48	33.8			AGT	9	14	NA	DNU
	TAG	21	19	20.1	DNU		TCA	16	19	17.5	22.4
	TGA	59	33	46.1			TCC	23	18	20.6	26.3
GLN (Q)	CAA	38	59	48.4	48.4		TCG	14	6	NA	DNU
	CAG	62	41	51.6	51.6		TCT	15	25	19.9	25.4
GLU (E)	GAA	29	49	38.8	38.8	THR (T)	ACA	21	27	23.8	28.0
	GAG	71	51	61.2	61.2		ACC	37	30	33.6	39.5
GLY (G)	GGA	19	38	28.5	28.5		ACG	22	8	NA	DNU
	GGC	42	16	29.1	29.0		ACT	20	35	27.7	32.5
	GGG	20	12	16.1	16.0	TRP (W)	TGG	100	100	100	100
	GGT	20	33	26.7	26.6	TYR (Y)	TAC	73	57	65.0	65.0
HIS (H)	CAC	62	46	54.1	54.1		TAT	27	43	35.0	35.0
	CAT	38	54	45.9	45.9	VAL (V)	GTA	8	12	NA	DNU
ILE (I)	ATA	14	18	15.9	15.9		GTC	32	20	25.8	28.7
	ATC	58	37	47.6	47.9		GTG	39	29	34.1	38.0
	ATT	28	45	36.4	36.4		GTT	21	39	29.9	33.3

*Murray, E. E., Lotzer, J., & Eberle, M. (1989) Codon usage in plant genes. Nucl. Acids Res. 17:477-498.

**NA = Not Applicable

***DNU = Do Not Use

EXAMPLE 22
Construction of a Binary Plant Expression Vector Containing a First Version of a Gene Expressing the 8842 Protein (Variant Tcp1_Gz)

Protein 8842 comprises a variant of the entire Gibberella zeae Tcp1_Gz protein (a Class B protein fused to a Class C protein). The DNA encoding the 8842 gene fusion as disclosed in SEQ ID NO:23 has been optimized for expression in plants. Nucleotides 3-7403 of SEQ ID NO:23 encode the entire Tcp1_Gz protein variant as disclosed in SEQ ID NO:24.

The 8842 gene was cloned on an NcoI/SacI DNA fragment into an intermediate plasmid by standard molecular biology techniques. The 8842 gene expression cassette in the intermediate vector consisted of (in the 5′ to 3′ direction) a Cassava Vein Mosaic Virus (CsVMV) promoter (essentially bases 7160 to 7678 of Genbank Accession CVU58751), the Nicotiana tabacum osmotin 5′ stabilizing sequence (See US Patent Application Publication US20050102713 A1), the 8842 variant gene coding region, a Nicotiana tabacum osmotin 3′ stabilizing sequence (See U.S. Patent Application Publication US20050102713 A1), and the ORF24 3′ untranslated region from Agrobacterium tumefaciens pTi-15955 (essentially the complement of bases 18621 to 19148 of Genbank Accession ATACH5). The 8842 gene plant expression cassette was then moved via Gateway LR clonase (Invitrogen, Carlsbad, Calif.) into an Agrobacterium tumefaciens plant transformation binary vector, and the resulting plasmid was named pDAB8842. The 8842 gene plant expression cassette in pDAB8842 is directly preceded by the RB7 matrix attachment region (MAR) from Nicotiana tabacum (Hall, Gerald, Jr.; Allen, George C.; Loer, Deborah S.; Thompson, William F.; Spiker, Steven. Nuclear scaffolds and scaffold-attachment regions in higher plants. Proc. Natl. Acad. Sci. USA (1991) 88:9320-9324.) To provide in planta selection for transformed cells, the binary vector included, immediately following the 8842 gene plant expression cassette, a selectable marker gene in the form of an Arabidopsis thaliana ubiquitin 10 promoter (Genbank Accession L05399), a coding region for phosphinothricin acetyl transferase (PAT; Genbank Accession 143995), and a 3′ untranslated region (3′ UTR) from ORF1 of Agrobacterium tumefaciens pTi-15955 (essentially bases 2180 to 2887 of Genbank Accession ATACH5).

The final order of the elements and expression cassettes in the binary plasmid pDAB8842 is as follows: pTi 15955 T-DNA Border B, Nicotiana tabacum RB7 MAR, the gene 8842 expression cassette, the PAT gene expression cassette, and three tandem copies of pTi-15955 T-DNA Border A. The constructions were done in such a way as to maintain appropriate plant transcription and translation signals. For plant transformation, the pDAB8842 plasmid was introduced into cells of Agrobacterium tumefaciens strain LBA4404 by electroporation.

EXAMPLE 23
Construction of a Binary Plant Expression Vector Containing a Second Version of a Gene Expressing the 8842 Protein (Variant Tcp1_Gz)

The 8842 protein coding region of SEQ ID NO:23 was cloned on an NcoI/SacI DNA fragment into an intermediate plasmid by standard molecular biology techniques. The 8842 gene expression cassette in the intermediate vector consisted of (in the 5′ to 3′ direction) an Arabidopsis thaliana Actin 2 promoter (Act2; Genbank Accession U41998), the 8842 variant gene coding region, and the ORF24 3′ untranslated region from Agrobacterium tumefaciens pTi-15955 (essentially the complement of bases 18621 to 19148 of Genbank Accession ATACH5). The 8842 gene plant expression cassette was then moved via Gateway LR clonase (Invitrogen, Carlsbad, Calif.) into an Agrobacterium tumefaciens plant transformation binary vector, and the resulting plasmid was named pDAB8844. In plasmid pDAB8844, all the elements and expression cassettes are present in the same order as described in Example 22 for plasmid pDAB8842, except that the 8842 gene cassette under the control of the CsVMV promoter present in pDAB8842 was replaced by this version of the 8842 gene under the control of the Act2 promoter. For plant transformation, the pDAB8844 plasmid was introduced into cells of Agrobacterium tumefaciens strain LBA4404 by electroporation.

EXAMPLE 24
Transformation of Cotton Cells

Seeds of cotton variety Coker 310 were surface-sterilized in 95% alcohol for 1 minute, rinsed with sterile distilled water, sterilized with 50% commercial bleach for 20 minutes, then rinsed again 3 times with sterile distilled water. Treated seeds were germinated at 28° C. on G-medium [Murashige and Skoog, 1962 (MS) basal salts with B5 vitamins (Gamborg et al., 1965) and 3% sucrose] in Magenta GA-7 vessels maintained under a high light intensity of 40-60 μE/m², with a photoperiod of 16 hrs light and 8 hrs dark.

Cotyledon segments (˜5 mm square) were isolated from 7-10 day old seedlings into liquid M-medium (MS-based medium with 1-5 μM 2,4-Dichlorophenoxyacetic acid and 1-5 μM Kinetin) in Petri plates. For each construct (i.e. pDAB8842 and pDAB8844), 200 cut segments were treated with a recombinant Agrobacterium tumefaciens strain LBA4404 suspension (approximately 10⁶ cells/mL), then transferred to semi-solid M-medium and co-cultivated for 2-3 days (in this and subsequent steps, incubation was carried out under light at 28° C.). Following co-cultivation, segments were transferred to MG5 medium, which contains 5 mg/L glufosinate-ammonium (to select for plant cells that contain the transferred gene) and 500 mg/L carbenicillin (to eliminate residual Agrobacterium tumefaciens cells). After 3 weeks, callus from the cotyledon segments was isolated and transferred to fresh MG5 medium, and a second transfer to MG5 medium was made 3 weeks later. After an additional 3 weeks, callus was transferred to C-medium (MS-based medium containing 10-20 μM Naphthaleneacetic Acid and 5-10 μM Kinetin) containing glufosinate-ammonium and carbenicillin as above, and transferred again to fresh selection medium after 3 weeks. For the pDAB8842 construct, 26 callus lines were obtained, and 25 callus lines were obtained for the pDAB8844 construct.

EXAMPLE 25
Expression of Variant Tcp1_Gz in Cotton Callus

Callus plant tissues (200 mg) isolated following transformation with constructs pDAB8842 and pDAB8844 were frozen at −80° C. The frozen plant material was placed into a 1.2 mL polypropylene tube containing a 0.188 inch diameter tungsten bead with 450 μL of extraction buffer [Phosphate Buffered Saline with 0.1% Triton X-100, 10 mM Dithiothreitol and 5 μL/mL of protease inhibitor cocktail (Sigma Chemical Company, St. Louis, Mo.; catalog number P9599)] and homogenized for four minutes at maximum speed using a Kleco Pulverizer bead mill (Kleco, Visalia, Calif.). The resulting homogenate was centrifuged at 4,000×g for 10 min at 4° C., and the supernatant removed using a pipette. The protein concentration of the supernatant was determined by the method of Bradford [Bradford, M. M., (1976) A rapid and sensitive method for the quantitation of microgram quantities of protein utilizing the principle of protein-dye binding. Anal. Biochem. 72:248-254.]. The volume of supernatant required to provide 2-5 μg of total protein was mixed 4:1 with 4× Tris-HCl, SDS, 2-mercaptoethanol sample buffer (consisting of 0.125M Tris HCl, 10% sucrose, 0.02% bromophenol blue, 2.0% SDS, and 5% 2-mercaptoethanol). The solution was heated to 90° C. for 4 minutes, loaded into wells of a 4-20% Tris-glycine polyacrylamide gel (BioRad, Hercules, Calif.), and the proteins were separated by applying 100 volts of electricity for 60 minutes using the method of Laemmli [Laemmli, U. K., (1970) Cleavage of structural proteins during the assembly of the head of bacteriophage T4. Nature 227:680-685.]

Characterization of the expressed variant Tcp1_Gz protein was done by immunoblot analysis [Towbin, H., Staehelin, T., and Gordon, J., (1979) Electrophoretic transfer of proteins from polyacrylamide gels to nitrocellulose sheets: procedure and some applications. Proc. Natl. Acad. Sci. USA 76:4350-4354.] Briefly, the protein samples separated by SDS-polyacrylamide gel electrophoresis (above) were electrophoretically transferred onto nitrocellulose at 100 V for 1 hour, blocked with 1% non-fat milk, and probed with a 1:3,000 dilution of one of two different primary monoclonal antibodies prepared from different sequences of the Tcp1_Gz protein. One antibody (1184) was derived using a 17 amino acid synthetic peptide containing the sequence of residues 1184-1200 (from SEQ ID NO:2) from Tcp1_Gz (SKTASAAEELKEARKSF, as disclosed in SEQ ID NO:144), in the region corresponding to the “B” protein. The other antibody (1929) was derived from a synthetic 22 amino acid peptide containing the sequence from residues 1929-1950 (from SEQ ID NO:2) (YHYDEKSLLSDDPRVKSNRLSR, as disclosed in SEQ ID NO:145), which resides in the area of the protein corresponding to the “C” protein. The nitrocellulose membrane containing the transferred proteins was allowed to incubate with either the 1184 or 1929 antibody overnight at 4° C. while gently rocking. After extensive washing of the nitrocellulose membrane, proteins related to Tcp1_Gz produced by the callus tissues were detected using anti mouse ECL-conjugated secondary antibody (BioRad) and developed using ECL reagents (Amersham Biosciences, Arlington Heights, IL) according to the supplier's instructions. Relative molecular weights of the protein bands were determined by including a sample of SeeBlue™ prestained protein molecular weight markers (Invitrogen) in one well of the gel. Negative controls consisted of plant tissues from non-transformed callus processed in the same manner as described above. Positive controls consisted of insoluble proteins obtained from an extract of E. coli cells transformed with the E. coli-optimized tcp1_Gz gene (construct pDAB8829).

Fifteen cotton calli from construct pDAB8842 and 13 calli from construct pDAB8844 were analyzed. The protein extracts were blotted in duplicate and probed separately with antibodies 1184 (B-region peptide) and 1929 (C-region peptide). Both antibodies revealed similar, but not identical, banding patterns. The calculated size of the intact variant Tcp1_Gz protein (2467 amino acids; SEQ ID NO:24) is about 278 kDa. In all analyses, the positive control samples showed a smearing of proteins reactive to the antibodies, starting just below the position of the 250 kDa molecular weight standard, while there was no signal observed from the negative control samples of proteins extracted from non-transformed cotton calli. Of the 15 pDAB8842 construct samples (wherein the 8842 variant Tcp1_Gz expression is driven by the CsVMV promoter), 11 showed a positive response, exhibiting a strong protein band at an apparent molecular weight just below 250 kDa, and a second protein band of generally less intensity above the 148 kDa molecular weight marker. Samples prepared from cotton calli transformed with the pDAB8844 construct (wherein the 8842 variant Tcp1_Gz expression is driven by the Act2 promoter) resulted in significantly fewer positive responses compared to the pDAB8842 construct (only 2 of the 13 samples showed a positive response).

Cotton calli that expressed the 8842 variant tcp1_Gz gene demonstrated different banding patterns when probed with the two peptide specific antibodies 1184 (B-region peptide) and 1929 (C-region peptide). Antibody 1184 bound to a single protein species having an apparent molecular weight greater than 148 kDa, but less than 250 kDa. Antibody 1929 bound to either one or two proteins (depending on the callus sample), both having an apparent molecular weight greater than 148 kDa, but less than 250 kDa.

Thus, these results show that proteins that are recognized by antibodies prepared against peptide fragments of the variant Tcp1_Gz protein are produced by these plant tissues. Given the very large size of this protein, and the technical limitations of the resolution of the gel analyses, it is expected that the high molecular weight, immuno-reactive bands observed in both control samples and plant samples represent full-length Tcp1_Gz proteins.

EXAMPLE 26
Toxin Complex Class A and Fused Class B/C Genes of Fusarium verticillioides

This example teaches a method to discover new Class A genes and new fused Class B/Class C genes present in the genome sequence of Fusarium verticillioides (teleomorph Gibberella moniliformis). It is to be noted that one life stage (anamorph) of Gibberella zeae is classified as Fusarium graminearum.

Determination of the DNA sequence of the genome of the fungus Fusarium verticillioides is in progress at the Broad Institute (Cambridge, Mass.) and the partial genome is accessible to the public from the website (broad.mit.edu/annotation/fgi/). The DNA sequences of the Gibberella zeae Class A TC gene (SEQ ID NO:9) and of the Gibberella zeae tcp1_Gz gene (SEQ ID NO:1) were used separately as query sequences in TBLASTN analyses of the partial sequence of the F. verticillioides genome (TBLASTN ver. 2.2.10; Oct. 19, 2004).

These analyses revealed the presence of two sequences corresponding to Class A TC genes, two sequences corresponding to fused Class B/Class C TC genes, as well as a partial Class A TC gene and a partial Class B TC gene. The contig sequences including and flanking these presumptive TC genes were extracted and further analyzed. The extracted contig sequences were named for convenience: AContig12, AContig34, BCContig12, BCContig6, and BCContig46.

The sequence of each contig was translated in silico to identify coding regions for peptides of 100 amino acids or longer (terminator to terminator), and each such putative protein was used as a query sequence in a BLAST analysis (BLASTP ver.2.2.3; Apr. 24, 2002) of the Genbank nonredundant protein database (National Center for Biotechnology Information; Database: db/nr.01; Posted date: Jan. 18, 2006 4:00 PM; Number of letters in database: 111,166,549; Number of sequences in database: 325,447).
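The in silico translation step (six reading frames, peptides of a minimum length between terminators) can be sketched in Python as follows. This is a minimal illustration that uses the standard genetic code; the function names and the short test sequence are hypothetical, and the minimum length is lowered from 100 to 5 in the usage line so that the example stays small. Segments at the very ends of a sequence are reported even though only one side is bounded by a terminator, and any codon containing N is translated as X.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in the conventional TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of a DNA string (N is kept as N)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def six_frame_peptides(dna, min_len=100):
    """Return (strand, frame, peptide) for stop-delimited peptides of at least min_len residues."""
    hits = []
    for strand, s in (("+", dna.upper()), ("-", revcomp(dna))):
        for frame in range(3):
            protein = "".join(
                CODON_TABLE.get(s[i:i + 3], "X") for i in range(frame, len(s) - 2, 3)
            )
            for peptide in protein.split("*"):
                if len(peptide) >= min_len:
                    hits.append((strand, frame, peptide))
    return hits


# Hypothetical toy sequence: TAA ATG GCT AAA TTT TGG TAA
for hit in six_frame_peptides("TAAATGGCTAAATTTTGGTAA", min_len=5):
    print(hit)
```

Each reported peptide would then serve as a BLASTP query, as described above.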

Proteins with significant BLAST scores to a TC Class A, Class B or Class C gene were mapped back to the encoding DNA of the source contig. From each contig belonging to a single TC ClassA or TC ClassBC gene, the whole DNA sequence encompassing the region encoding the protein, plus 20 bp on either side, was extracted. In some cases, it was necessary to reverse and complement the DNA base sequence as present in a native contig in order to obtain a protein coding region in the standard 5′ to 3′ sense orientation.
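Extraction of a coding region with 20 bp of flanking sequence, with reverse complementation when the gene lies on the opposite strand, can be sketched as below. The function name, the coordinate convention (1-based, inclusive, on the contig as deposited), and the toy contig are assumptions made for illustration.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of a DNA string (N is kept as N)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def extract_with_flanks(contig, start, end, strand="+", flank=20):
    """Return contig[start..end] plus `flank` bases on each side, in sense orientation.

    start and end are 1-based inclusive positions on the contig as deposited.
    Flanks are truncated at the contig ends.
    """
    lo = max(start - flank, 1)
    hi = min(end + flank, len(contig))
    region = contig[lo - 1:hi].upper()
    return revcomp(region) if strand == "-" else region


# Hypothetical contig: 30 A, a 9-base coding region, 30 C.
contig = "A" * 30 + "ATGGCTTAA" + "C" * 30
sense = extract_with_flanks(contig, 31, 39)
print(len(sense), sense[20:23])   # 49 ATG
```

With a full 20-base upstream flank, the first codon of the coding region falls at bases 21-23 of the extracted sequence, which is the position reported for several of the sequences discussed in this example.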

The DNA sequence extracted from AContig12 is presented as SEQ ID NO:25. This DNA sequence encodes a putative TC Class A protein in two overlapping segments, the deduced sequence of which is disclosed in SEQ ID NOs:26 and 27. The Threonine codon (ACG) that serves as the beginning of the open reading frame for the coding region of the first segment of the deduced putative TC Class A protein is residues 21-23 in SEQ ID NO:25. There is a probable sequencing error around base 3000 (not uncommon in large-scale genomic sequencing projects such as this), as the open reading frame encoding the first 1002 amino acids of the putative TC Class A protein is terminated by a TGA stop codon. However, the AAA Lysine codon (residues 3022-3024 in SEQ ID NO:25) which serves as the start of the open reading frame for the second portion of the deduced putative TC Class A protein begins 5 bases upstream of the TGA codon. By linking the two encoded peptides, comprising 1002 amino acids (SEQ ID NO:26) and 2057 amino acids (SEQ ID NO:27), and by analogy to the G. zeae genome sequence, it is likely that SEQ ID NO:25 is part of a complete open reading frame that encodes a TC Class A protein of approximately 3000 amino acids. The high relatedness of this deduced FV TC Class A protein to a G. zeae TC Class A protein is reflected by a BLAST score of e-146 for the first 1002 amino acids, and 0.0 for the second 2057 amino acids.
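The coordinates reported for AContig12 can be cross-checked with simple arithmetic. The Python sketch below assumes that the TGA stop codon immediately follows the 1002nd codon of the first segment, which the text implies but does not state as a base number; the variable names are illustrative.

```python
first_start = 21        # first base of the ACG Threonine codon
first_len_aa = 1002     # residues in the first segment
second_start = 3022     # first base of the AAA Lysine codon
second_len_aa = 2057    # residues in the second segment

# Assumed: the TGA stop codon directly follows the last codon of the first segment.
stop_start = first_start + 3 * first_len_aa
upstream_offset = stop_start - second_start
frame_offset = (second_start - first_start) % 3

print(stop_start)                     # 3027
print(upstream_offset)                # 5
print(frame_offset)                   # 1
print(first_len_aa + second_len_aa)   # 3059
```

A non-zero frame offset means the two segments are read in different frames, which is what a single-base insertion or deletion in the sequence data would produce. The combined length of 3059 residues is in line with the estimate of approximately 3000 amino acids.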

The DNA sequence extracted from AContig34 is presented as SEQ ID NO:28. This DNA encodes a second putative TC Class A protein. The DNA sequence that encodes the putative TC Class A protein comprises a first portion of 3298 bases. This sequence is followed by a large gap in the DNA sequence, indicated as a string of 2098 Ns. Finally, the TC-encoding sequence comprises an additional 3773 bases. Residues 20-22 (AAT) at the beginning of the coding region in SEQ ID NO:28 correspond to the first Asparagine of the putative TC Class A encoded protein sequence in SEQ ID NO:29. This first portion of the DNA sequence preceding the Ns likely contains two sequencing errors which interrupt the deduced putative TC Class A protein reading frame into three parts. The first part of the open reading frame comprises 1452 bases and encodes 484 amino acids (SEQ ID NO:29). This part of the open reading frame is terminated with a TGA stop codon. The second part of the open reading frame starts 4 bases downstream of the TGA stop codon, comprises 690 bases, and encodes 230 amino acids (SEQ ID NO:30). This portion of the open reading frame is terminated with a TAA stop codon. The third portion of the open reading frame starts 11 bases downstream of the TAA stop codon, comprises 1122 bases, and encodes 374 amino acids (SEQ ID NO:31). The portion of the DNA sequence following the Ns comprises a fourth portion of the deduced putative TC Class A protein open reading frame, and encodes 1233 amino acids (SEQ ID NO:32). The GGA codon for the first Glycine of this portion of the deduced putative TC Class A protein corresponds to basepairs 5453-5455 of SEQ ID NO:28. The total protein encoded by SEQ ID NO:28 thus is at least 2321 amino acids long (484 + 230 + 374 + 1233). By analogy to the G. zeae genome sequence it is likely that SEQ ID NO:28 is part of a complete open reading frame that encodes a TC Class A protein of approximately 3000 amino acids. The high relatedness of this deduced FV TC Class A protein to a G. zeae TC Class A protein is reflected by a BLAST score of 4e-43 for the first 484 amino acids, 0.001 for the second 230 amino acids, 2e-14 for the next 374 amino acids, and 0.0 for the final 1233 amino acids.

The DNA sequence extracted from BCContig12 is presented as SEQ ID NO:33. This DNA sequence encodes a putative fused TC Class B/Class C protein and comprises a first portion of 5482 bases. This sequence is followed by a large gap in the DNA sequence, indicated as a string of 659 Ns. Finally, the BCContig12 sequence comprises an additional 1563 bases. Basepairs 22-24 (GCC) at the beginning of the coding region in SEQ ID NO:33 correspond to the first Alanine of the encoded putative TC fused Class B/Class C protein in SEQ ID NO:34. This first portion of the encoded protein comprises 1820 amino acids (SEQ ID NO:34). There is a probable sequencing error just after the series of Ns, as the in-frame Histidine codon (CAT, bases 6203-6205 of SEQ ID NO:33) that starts the second portion of the putative TC fused Class B/Class C protein is preceded by 61 out-of-frame bases. The second portion of the encoded putative TC fused Class B/Class C protein comprises 494 amino acids (SEQ ID NO:35). The total protein encoded by SEQ ID NO:33 thus is at least 2314 amino acids long. By analogy to the G. zeae genome sequence it is likely that SEQ ID NO:33 is part of a complete open reading frame that encodes a TC fused Class B/Class C protein of approximately 2400 amino acids. The high relatedness of this deduced FV TC fused Class B/Class C protein to a G. zeae TC fused Class B/Class C protein is reflected by a BLAST score of 0.0 for the first 1820 amino acids and 5e-45 for the final 494 amino acids.

The DNA sequence extracted from BCContig6 is presented as SEQ ID NO:36. This DNA sequence encodes a portion of a putative fused TC Class B/Class C protein and comprises 962 bases. Residues 20-22 (CAG) at the beginning of the coding region in SEQ ID NO:36 correspond to the first Glutamine of the deduced putative TC fused Class B/Class C protein. This first portion of the deduced encoded protein comprises 194 amino acids (SEQ ID NO:37). There is a probable sequencing error just after the Leucine codon (TTG), as a Stop codon (TAG) terminates the reading frame. However, the Aspartic Acid codon (GAT, residues 619-621 of SEQ ID NO:36) that starts the second portion of the putative TC fused Class B/Class C protein is found 14 bases after the TAG codon. The second portion of the deduced putative TC fused Class B/Class C protein comprises 107 amino acids (SEQ ID NO:38). The protein encoded by SEQ ID NO:36 is thus likely to represent a portion of a TC fused Class B/Class C coding region. By analogy to the G. zeae genome sequence, it is likely that SEQ ID NO:36 is part of a complete open reading frame that encodes a TC fused Class B/Class C protein of approximately 2400 amino acids. The high relatedness of this deduced FV TC fused Class B/Class C protein to a Photorhabdus TC Class C protein is reflected by a BLAST score of 1e-11 for the first 194 amino acids. The remaining 107 amino acids have a BLAST score of 1e-10 to a G. zeae TC fused Class B/Class C protein.

The DNA sequence extracted from BCContig46 is presented as SEQ ID NO:39. This DNA sequence encodes a putative fused TC Class B/Class C protein. The DNA sequence that encodes the putative fused TC Class B/Class C protein comprises a first portion of 3423 bases. This sequence is followed by a large gap in the DNA sequence, indicated as a string of 1009 Ns. Finally, the TC-encoding sequence comprises an additional 3810 bases. Bases 21-23 (GAG) at the beginning of the coding region in SEQ ID NO:39 correspond to the first Glutamic Acid of the first portion of the deduced putative TC fused Class B/Class C protein. This first portion of the deduced encoded protein comprises 1134 amino acids (SEQ ID NO:40). The second portion of the deduced putative TC fused Class B/Class C protein comprises 1263 amino acids (SEQ ID NO:41). The TTG codon, which specifies the first Leucine of the second portion of the deduced TC fused Class B/Class C protein following the Ns, corresponds to residues 4435-4437 in SEQ ID NO:39. The protein encoded by SEQ ID NO:39 thus is likely to represent a TC fused Class B/Class C protein of at least 2397 amino acids (1134 + 1263). By analogy to the G. zeae genome sequence it is likely that SEQ ID NO:39 is part of a complete open reading frame that encodes a TC fused Class B/Class C protein of approximately 2400 amino acids. The high relatedness of this deduced FV TC fused Class B/Class C protein to a G. zeae TC fused Class B/Class C protein is reflected by a BLAST score of e-168 for the first 1134 amino acids and e-122 for the final 1263 amino acids.

EXAMPLE 27
Additional Natural B/C Fusions from Burkholderia and Nitrosospira

In light of the findings reported herein, additional BLAST searches (similar to those described in Examples above) were conducted. The results of TBLASTN of Genbank nonredundant nucleotide database with the Gibberella zeae fused Class B/Class C sequence are as follows:

LOCUS CP000125.1 3181762 bp DNA circular BCT 30-Sep.-2005
DEFINITION Burkholderia pseudomallei 1710b chromosome II, complete sequence.
BLAST score: 2e-92
LOCUS CP000103.1 3184243 bp DNA circular BCT 15-Nov.-2005
DEFINITION Nitrosospira multiformis ATCC 25196, complete genome.
BLAST score: 4e-68
LOCUS CP000086.1 3809201 bp DNA circular BCT 05-Jan.-2006
DEFINITION Burkholderia thailandensis E264 chromosome I, complete sequence.
BLAST score: 7e-47
LOCUS BX571965 4074542 bp DNA circular BCT 17-Apr.-2005
DEFINITION Burkholderia pseudomallei strain K96243, chromosome 1, complete sequence.
BLAST score: 1e-39
LOCUS CP000124.1 4126292 bp DNA circular BCT 30-Sep.-2005
DEFINITION Burkholderia pseudomallei 1710b chromosome I, complete sequence.
BLAST score: 3e-39
LOCUS CP000010.1 3510148 bp DNA circular BCT 22-Sep.-2004
DEFINITION Burkholderia mallei ATCC 23344 chromosome 1, complete sequence.
BLAST score: 3e-38

As Burkholderia and Nitrosospira are bacterial genera, these results, taken with the other results reported herein, further confirm that novel BC fusion proteins can be found in other naturally occurring organisms, particularly these novel bacterial sources.
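Hit listings in the LOCUS / DEFINITION / BLAST score layout used above can be tabulated and ranked with a few lines of Python. The sketch below is illustrative: the parser and its field names are hypothetical, and it treats the reported BLAST scores as expect-type values where smaller means a stronger match, which is an interpretation based on their form (e.g. 2e-92). A score written without a leading digit, such as e-146, is read as 1e-146.

```python
def parse_hits(text):
    """Parse LOCUS / DEFINITION / BLAST score triplets into a list of dicts."""
    hits, current = [], {}
    for line in text.strip().splitlines():
        line = line.strip()
        if line.startswith("LOCUS"):
            fields = line.split()
            current = {"accession": fields[1], "length_bp": int(fields[2])}
        elif line.startswith("DEFINITION"):
            current["definition"] = line[len("DEFINITION"):].strip()
        elif line.startswith("BLAST score:"):
            value = line.split(":", 1)[1].strip()
            if value.startswith("e"):      # e.g. "e-146" -> "1e-146"
                value = "1" + value
            current["score"] = float(value)
            hits.append(current)
    return hits


LISTING = """
LOCUS CP000086.1 3809201 bp DNA circular BCT 05-Jan.-2006
DEFINITION Burkholderia thailandensis E264 chromosome I, complete sequence.
BLAST score: 7e-47
LOCUS CP000125.1 3181762 bp DNA circular BCT 30-Sep.-2005
DEFINITION Burkholderia pseudomallei 1710b chromosome II, complete sequence.
BLAST score: 2e-92
LOCUS CP000103.1 3184243 bp DNA circular BCT 15-Nov.-2005
DEFINITION Nitrosospira multiformis ATCC 25196, complete genome.
BLAST score: 4e-68
"""

ranked = sorted(parse_hits(LISTING), key=lambda hit: hit["score"])
for hit in ranked:
    genus = hit["definition"].split()[0]
    print(hit["accession"], genus, hit["score"])
```

Sorting on the parsed value places the Burkholderia pseudomallei 1710b chromosome II hit first and the Burkholderia thailandensis hit last, matching the order in which the hits are listed in this example.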

EXAMPLE 28
Additional Class A Proteins from Burkholderia and Aspergillus

LOCUS AP007171 2505489 bp DNA linear PLN 23-Dec.-2005
DEFINITION Aspergillus oryzae RIB40 genomic DNA, SC011.
BLAST score: 8e-97
LOCUS CP000125.1 3181762 bp DNA circular BCT 30-Sep.-2005
DEFINITION Burkholderia pseudomallei 1710b chromosome II, complete sequence.
BLAST score: 1e-63
LOCUS CP000010.1 3510148 bp DNA circular BCT 22-Sep.-2004
DEFINITION Burkholderia mallei ATCC 23344 chromosome 1, complete sequence.
BLAST score: 3e-08
LOCUS BX571965.1 4074542 bp DNA circular BCT 17-Apr.-2005
DEFINITION Burkholderia pseudomallei strain K96243, chromosome 1, complete sequence.
BLAST score: 3e-08
LOCUS CP000124.1 4126292 bp DNA circular BCT 30-Sep.-2005
DEFINITION Burkholderia pseudomallei 1710b chromosome I, complete sequence.
BLAST score: 3e-08
LOCUS CP000086.1 3809201 bp DNA circular BCT 05-Jan.-2006
DEFINITION Burkholderia thailandensis E264 chromosome I, complete sequence.
BLAST score: 8e-08

As Burkholderia is a bacterial genus, these results, taken with the other results reported herein, are particularly noteworthy because they confirm that Class A proteins can be found, along with novel BC fusion proteins of the subject invention, in novel bacterial sources. As Aspergillus is a (eukaryotic) fungal genus, these results are also particularly noteworthy because they confirm that Class A proteins can be found in various eukaryotic and fungal sources.
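
The overlap between the genomes listed in Example 27 (BC fusion hits) and those listed in this Example (Class A hits) can be made explicit with a set comparison. The Python sketch below is illustrative only. It uses the accession numbers as printed in the two lists and assumes that accessions differing only by a version suffix (for example BX571965 and BX571965.1) refer to the same record. The function names are hypothetical.

```python
# Accessions as printed in Example 27 (BC fusion hits) and Example 28
# (Class A hits).
BC_FUSION_HITS = [
    "CP000125.1", "CP000103.1", "CP000086.1",
    "BX571965", "CP000124.1", "CP000010.1",
]
CLASS_A_HITS = [
    "AP007171", "CP000125.1", "CP000010.1",
    "BX571965.1", "CP000124.1", "CP000086.1",
]


def strip_version(accession):
    """Drop the '.N' version suffix, if present (assumed equivalence)."""
    return accession.split(".")[0]


def compare_hits(bc_hits, class_a_hits):
    """Split accessions into shared and list-specific groups."""
    bc = {strip_version(a) for a in bc_hits}
    class_a = {strip_version(a) for a in class_a_hits}
    return {
        "both": sorted(bc & class_a),
        "bc_only": sorted(bc - class_a),
        "class_a_only": sorted(class_a - bc),
    }


result = compare_hits(BC_FUSION_HITS, CLASS_A_HITS)
for group, accessions in result.items():
    print(group, accessions)
```

With the accessions listed, the five Burkholderia records appear in both lists, the Nitrosospira multiformis genome (CP000103) appears only among the BC fusion hits, and the Aspergillus oryzae record (AP007171) appears only among the Class A hits.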

US Referenced Citations
Number	Name	Date	Kind
20020078478	Ffrench-Constant et al.	Jun 2002	A1
20030215803	Garcia et al.	Nov 2003	A1
20040103455	Ffrench-Constant et al.	May 2004	A1
20040194164	Bintrim et al.	Sep 2004	A1
20040208907	Hey et al.	Oct 2004	A1
20060168683	Hey et al.	Jul 2006	A1

Foreign Referenced Citations
Number	Date	Country
WO 9822595	May 1998	WO
WO 9924581	May 1999	WO
WO 0113731	Mar 2001	WO
WO 2004044217	May 2004	WO
WO 2004067727	Aug 2004	WO
WO 2004067750	Aug 2004	WO

Provisional Applications
Number	Date	Country
60657965	Mar 2005	US
60704533	Aug 2005	US
