The present invention relates to methods and algorithms for identifying, synthesizing and co-assembling combinatorial libraries of polynucleotide variants.
The present invention relates generally to the area of bioinformatics and more specifically to methods and algorithms for computer-aided selection and subsequent synthesis of combinatorial libraries of polynucleotide variants.
Development of biopharmaceuticals requires substantial effort around screening, generation and characterization of variants of the proposed therapeutic as well as required research reagents to identify the best candidates demonstrating appropriate biochemical and biophysical properties. Variants are typically selected from expression libraries generated using random or site-directed PCR-based mutagenesis methods (Kunkel, Proc. Natl. Acad. Sci. USA, 82:488-92, 1985; Weiner et al., Gene, 151:119-23, 1994; Ishii et al., Methods Enzymol., 293:53-71, 1988).
An alternative to PCR-based methods for generating variants is synthetic polynucleotide assembly, as described in, e.g., U.S. Pat. No. 6,521,427 and U.S. Pat. No. 6,670,127. Synthetic polynucleotide assembly allows cost-effective generation of difficult-to-clone genes and of functional or codon-optimized variants for activity studies and protein production, and bypasses sometimes tedious and lengthy mutagenesis and subcloning protocols (Xiong et al., FEMS Microbiol. Rev. 32:522-540, 2008).
Both PCR-based and synthetic polynucleotide assembly methods suffer from their inability to generate libraries of predefined variants without generating additional variation within the variant pool due to the annealing dynamics of the DNA. A need often exists to generate a library of polynucleotide variants having predefined variation, such as the human framework libraries designed for antibody humanization, or computer-aided designed libraries for antibody affinity maturation. Thus, there is a need for methods and algorithms to facilitate synthesis of libraries of predefined variants without generating de novo variation within the preselected variant pool.
One aspect of the invention is a method of identifying a combinatorial library of polynucleotide variants, comprising:
Another aspect of the invention is a method of synthesizing a combinatorial library of polynucleotide variants, comprising:
Another aspect of the invention is a method of selecting combinatorial libraries that can form a co-assembly set, comprising:
All publications, including but not limited to patents and patent applications, cited in this specification are herein incorporated by reference as though fully set forth.
As used herein and in the claims, the singular forms “a,” “an,” and “the” include plural reference unless the context clearly dictates otherwise. Thus, for example, reference to “a polypeptide” is a reference to one or more polypeptides and includes equivalents thereof known to those skilled in the art.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which an invention belongs. Although any compositions and methods similar or equivalent to those described herein can be used in the practice or testing of the invention, exemplary compositions and methods are described herein.
The term “combinatorial library” as used herein refers to a library of sequences of polynucleotide variants wherein (1) for each sequence in the combinatorial library there is at least one other sequence present in the library that differs at only one corresponding parsed oligonucleotide; and (2) the number of sequences in the combinatorial library is equal to the number of unique sequences obtained by synthetic polynucleotide assembly by pooling together all parsed oligonucleotides having unique sequences. A combinatorial library can include one or more variants of a polynucleotide. A combinatorial library can include, for example, 1×10¹, 1×10², 1×10³, 1×10⁴, or 1×10⁵ variants.
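The two conditions of this definition can be checked mechanically. The following is a minimal sketch, assuming each variant is given as a tuple of its parsed oligonucleotide sequences (one entry per corresponding position); the function name and the data layout are illustrative assumptions and are not taken from the specification. Condition (2) is read here as requiring the library size to equal the product of the numbers of distinct oligonucleotides at each position, which is the number of sequences that pooling all unique oligonucleotides would yield.

```python
from math import prod


def is_combinatorial_library(variants):
    """Check conditions (1) and (2) on a list of variants.

    Each variant is a tuple of parsed oligonucleotide sequences, one per
    corresponding position (an assumed, illustrative layout).
    """
    unique = set(variants)
    n_positions = len(variants[0])

    def differs_at_one(a, b):
        # True when exactly one corresponding oligonucleotide differs.
        return sum(x != y for x, y in zip(a, b)) == 1

    # Condition (1): every sequence has at least one sibling in the library.
    has_sibling = all(
        any(differs_at_one(a, b) for b in unique) for a in unique
    )

    # Condition (2): library size equals the number of sequences that
    # pooling all distinct oligonucleotides at each position would produce.
    pool_sizes = [len({v[i] for v in unique}) for i in range(n_positions)]
    return has_sibling and len(unique) == prod(pool_sizes)
```

For instance, four variants built from two alternative oligonucleotides at each of two positions satisfy both conditions, whereas any three of those four do not, because pooling their oligonucleotides would also produce the fourth.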
The term “synthetic polynucleotide assembly” as used herein refers to the method of chemical synthesis of polynucleotides as described in U.S. Pat. No. 6,521,427 and U.S. Pat. No. 6,670,127, which are herein incorporated by reference.
The term “polynucleotide” as used herein means a molecule comprising a chain of nucleotides covalently linked by a sugar-phosphate backbone or other equivalent covalent chemistry. Double and single-stranded DNAs and RNAs are typical examples of polynucleotides. The polynucleotides can be 100, 200, 300, 400, 800, 1,000, 1,500, 2,000, 4,000, 8,000, 10,000, 12,000, 18,000, 20,000, 40,000, 80,000 or more base pairs in length, and can be non-naturally occurring or can originate from bacterial, yeast, viral, mammalian, amphibian, reptilian, or avian genomes. The polynucleotide can include coding regions or non-coding elements such as origins of replication, telomeres, promoters, enhancers, transcription and translation start and stop signals, introns, exon splice sites, chromatin scaffold components and other regulatory sequences.
Short polynucleotides are referred to as “oligonucleotides”. Oligonucleotides can be of various lengths, typically more than two base pairs in length. The exact size of an oligonucleotide depends on many factors, such as the reaction temperature, salt concentration, the presence of denaturants such as formamide, and the degree of complementarity with the sequence to which the oligonucleotide is intended to hybridize. The oligonucleotides can be about 15-150 bases, between about 20-100 bases, between about 25-75 bases, or between about 30-50 bases long. Exemplary oligonucleotides are 24 or 48 bases in length.
The term “variant” as used herein refers to a polynucleotide or oligonucleotide that differs from a reference “wild type” polynucleotide and may or may not retain essential properties. Generally, the sequences of the wild type polynucleotide and the variant are closely similar overall and, in many regions, identical. A variant may differ from the wild type polynucleotide in its sequence by one or more modifications, for example, substitutions, insertions or deletions of nucleotides. A substituted or inserted nucleotide may result in a stop codon, in no change, or in a conservative or non-conservative substitution of the amino acid encoded by the affected codon. A variant of a polynucleotide may be naturally occurring or synthetic, and may have 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, or 99% identity with the wild type polynucleotide. Nucleotides present in the variant polynucleotides include modified bases capable of base pairing with adenine, cytosine, guanine, thymine and uracil. Exemplary modified bases include 8-azaguanine and hypoxanthine.
It is possible to modify the structure or function of the polypeptides encoded by variant polynucleotide sequences for such purposes as enhancing activity, specificity, stability, solubility, and the like. A replacement of a codon encoding leucine with codons encoding isoleucine or valine, a codon encoding aspartate with a codon encoding glutamate, a codon encoding threonine with a codon encoding serine, or a similar replacement of codons encoding structurally related amino acids (i.e., conservative mutations) will, in some instances but not all, not have a major effect on the biological activity of the resulting molecule. Conservative replacements are those that take place within a family of amino acids that are related in their side chains. Genetically encoded amino acids can be divided into four families: (1) acidic (aspartate, glutamate); (2) basic (lysine, arginine, histidine); (3) nonpolar (alanine, valine, leucine, isoleucine, proline, phenylalanine, methionine, tryptophan); and (4) uncharged polar (glycine, asparagine, glutamine, cysteine, serine, threonine, tyrosine). Phenylalanine, tryptophan, and tyrosine are sometimes classified jointly as aromatic amino acids. In similar fashion, the amino acid repertoire can be grouped as (1) acidic (aspartate, glutamate); (2) basic (lysine, arginine, histidine); (3) aliphatic (glycine, alanine, valine, leucine, isoleucine, serine, threonine), with serine and threonine optionally being grouped separately as aliphatic-hydroxyl; (4) aromatic (phenylalanine, tyrosine, tryptophan); (5) amide (asparagine, glutamine); and (6) sulfur-containing (cysteine and methionine) (Stryer (ed.), Biochemistry, 2nd ed, WH Freeman and Co., 1981).
Whether a change in the amino acid sequence of a polypeptide or fragment thereof encoded by a variant polynucleotide results in a functional homolog can be readily determined by assessing the ability of the modified polypeptide or fragment to produce a response in a fashion similar to the unmodified polypeptide or fragment using the assays described herein. Peptides, polypeptides or proteins in which more than one replacement has taken place can readily be tested in the same manner.
The term “wild type” or “WT” refers to a polynucleotide that has the characteristics of that polynucleotide when isolated from a naturally occurring source. An exemplary wild type polynucleotide is a polynucleotide encoding a gene that is most frequently observed in a population and is thus arbitrarily designated the “normal” or “reference” or “wild type” form.
A polynucleotide sequence can be designed in a computer-assisted manner and used to generate a set of parsed oligonucleotides covering the plus (+) (e.g. forward) and minus (−) (e.g. reverse) strand of the sequence. As used herein, the term “parsed” means that a sequence of a polynucleotide variant has been delineated in a computer-assisted manner such that a series of contiguous oligonucleotide sequences are identified. The oligonucleotide sequences are individually synthesized and used in the methods of the invention to design algorithms for appropriate pooling of the polynucleotides and to synthesize identified combinatorial libraries and co-assembly sets. Parsing and subsequent polynucleotide synthesis are done according to methods described in U.S. Pat. No. 6,521,427 and U.S. Pat. No. 6,670,127, which are herein incorporated by reference. Methods of synthesizing oligonucleotides are found, for example, in: Oligonucleotide Synthesis: A Practical Approach, Gait, ed., IRL Press, Oxford (1984).
“Contiguous parsed oligonucleotides”, as used herein, refers to two oligonucleotides wherein the first oligonucleotide ends at a position arbitrarily set at −1 and the second oligonucleotide starts at a position arbitrarily set at 0 along the linear polynucleotide sequence. The oligonucleotides can be of varying length, for example 16-32 nucleotides.
“Corresponding parsed oligonucleotides” as used herein refers to oligonucleotides that start and end at identical positions along the polynucleotide sequence between two or more polynucleotide sequences. Corresponding parsed oligonucleotides can have an identical sequence or they can represent polynucleotide variants as described above.
“Pool of corresponding parsed oligonucleotides” as used herein refers to more than one corresponding parsed oligonucleotide present in one combinatorial library.
“Analyzed pool” as used herein refers to the pool of polynucleotide variants that have been identified as part of a combinatorial library, and are further tested against additional polynucleotide sequences to identify additional sequences forming the combinatorial library.
The term “co-assembly set” as used herein refers to a library of polynucleotide variants that can be synthesized by synthetic polynucleotide assembly in one pool without synthesizing additional variants with unique polynucleotide sequences during the annealing reactions. The term “co-assembled” refers to a library of polynucleotide variants that can form a co-assembly set.
The term “assembly in one pool” or “annealed in one pool” as used herein refers to synthesis of a library of polynucleotides using synthetic polynucleotide assembly and annealing all parsed oligonucleotides in one reaction mixture.
The term “complementary sequence” as used herein refers to a second isolated polynucleotide sequence that is antiparallel to a first isolated polynucleotide sequence and that comprises nucleotides complementary to the nucleotides in the first polynucleotide sequence. Typically, such “complementary sequences” are capable of forming a double-stranded polynucleotide molecule such as double-stranded DNA or double-stranded RNA when combined under appropriate conditions with the first isolated polynucleotide sequence.
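As an illustration of this definition, the antiparallel complement of a DNA strand can be computed as in the minimal sketch below. It assumes unmodified DNA bases (A, C, G, T) only; the function name is hypothetical, and modified bases or RNA would need an extended mapping.

```python
# Mapping of each DNA base to its Watson-Crick partner (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the antiparallel complementary strand, written 5' to 3'."""
    # Complement every base, then reverse so the result reads 5' to 3'.
    return seq.translate(_COMPLEMENT)[::-1]
```

Applying the function twice returns the original sequence, which is consistent with the two strands being complementary to each other.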
The term “vector” means a polynucleotide capable of being duplicated within a biological system or that can be moved between such systems. Vector polynucleotides typically contain elements, such as origins of replication, polyadenylation signal or selection markers, that function to facilitate the duplication or maintenance of these polynucleotides in a biological system. Examples of such biological systems may include a cell, virus, animal, plant, and reconstituted biological systems utilizing biological components capable of duplicating a vector. The polynucleotides comprising a vector may be DNA or RNA molecules or hybrids of these.
The term “expression vector” means a vector that can be utilized in a biological system or a reconstituted biological system to direct the translation of a polypeptide encoded by a polynucleotide sequence present in the expression vector.
The term “polypeptide” means a molecule that comprises at least two amino acid residues linked by a peptide bond to form a polypeptide. Small polypeptides of less than 50 amino acids may be referred to as “peptides”. Polypeptides may also be referred to as “proteins.”
Synthetic polynucleotide assembly has many attractive features including the possibility of preparing, without any significant limitations, any desirable gene sequence. Synthetic polynucleotide assembly consists of two stages: (1) parsing a polynucleotide sequence into forward (F) and reverse (R) oligonucleotide fragments and synthesizing the oligonucleotides; and (2) assembling the synthesized oligonucleotides to generate the desired polynucleotides. During the assembly stage, all forward and reverse oligonucleotides are annealed in one pool, and the nicks are repaired with ligase (U.S. Pat. No. 6,521,427 and U.S. Pat. No. 6,670,127). Because of annealing properties of the polynucleotides, only oligonucleotides having complementary sequences will anneal with each other (
The present invention relates to methods and algorithms of identifying and synthesizing combinatorial libraries and co-assembly sets of polynucleotide variants using synthetic polynucleotide assembly without generating de novo variation within the polynucleotide variant pool due to mis-annealing of used oligonucleotides. The invention is useful in various applications that require screening, generation and characterization of libraries of polynucleotide variants. Exemplary applications are generation of libraries of antibody variable regions or libraries of other therapeutic protein variants.
The concept of polynucleotide co-assembly can be applied to any collection of polynucleotide variants whose sequences differ at corresponding parsed oligonucleotides. However, not every collection of polynucleotide variants can be co-assembled. The computational algorithms developed in the present invention are designed to identify those polynucleotide variants that can form a co-assembly set.
The requirement for co-assembly of any two polynucleotides is that each forward and reverse oligonucleotide hybridize with each other only when the region of complementarity between the sequences demonstrates 100% identity. Otherwise, mis-pairing will occur and new variants will be generated during the annealing process (
The example in
The example in
Because of the complementary nature of double-stranded DNA, only forward parsed oligonucleotides are required for analysis when determining which polynucleotides can form a co-assembly set. The obligatory and sufficient condition for two polynucleotides to form a co-assembly set is that the polynucleotides have contiguous forward parsed oligonucleotides that are non-identical in sequence at each half-oligonucleotide length.
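The condition can be sketched in code under one reading: the forward strands of the two polynucleotides are cut into half-oligonucleotides, and the half-oligonucleotides at which the two sequences differ must form a single contiguous stretch. This reading, the function names, and the assumption that both sequences have the same length are illustrative and are not stated in this form in the specification.

```python
def half_oligo_differences(seq_a, seq_b, m):
    """Return 1 for each half-oligonucleotide (m // 2 bases) that differs, else 0."""
    half = m // 2
    return [
        int(seq_a[i:i + half] != seq_b[i:i + half])
        for i in range(0, len(seq_a), half)
    ]


def count_runs_of_ones(bits):
    """Count maximal contiguous stretches of 1s in a 0/1 list."""
    runs, previous = 0, 0
    for bit in bits:
        if bit == 1 and previous == 0:
            runs += 1
        previous = bit
    return runs


def can_coassemble(seq_a, seq_b, m):
    # Assumed reading: exactly one contiguous stretch of differing
    # half-oligonucleotides. Identical sequences give zero stretches and
    # therefore return False under this literal reading.
    return count_runs_of_ones(half_oligo_differences(seq_a, seq_b, m)) == 1
```

With an oligonucleotide length of 4 bases, two sequences differing only in the second half-oligonucleotide pass the check, while two sequences differing in the second and fourth half-oligonucleotides do not, because the identical third half-oligonucleotide separates the two differing regions.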
The example in
One aspect of the invention is a method of identifying a combinatorial library of polynucleotide variants, comprising
The parsed oligonucleotides are defined as numerical vectors in the mathematical algorithms. Each polynucleotide variant in the collection of variants is parsed into contiguous oligonucleotides M bases long. A unique number is assigned to each corresponding parsed oligonucleotide having a unique sequence within the collection of polynucleotides being analyzed. A parsed polynucleotide variant can be presented as a simple vector representation as shown below:
{F1, F2, F3, . . . Fi, . . . Fn},
Wherein F1=the first corresponding parsed oligonucleotide
Fi=the ith corresponding parsed oligonucleotide
Alternatively, a parsed polynucleotide variant can be presented as an expanded vector representation by dividing each original parsed oligonucleotide into two contiguous half-oligonucleotides M/2 bases long. A unique number is again assigned to each corresponding parsed half-oligonucleotide having a unique sequence within the collection of polynucleotides being analyzed. The expanded vector representation of a polynucleotide variant is shown below:
{F1-1, F1-2, F2-1, F2-2, F3-1, F3-2, . . . , Fi-1, Fi-2, . . . , Fn-1, Fn-2}
Wherein F1-1=the 1st half-oligonucleotide in the first corresponding parsed oligonucleotide
F1-2=the 2nd half-oligonucleotide in the first corresponding parsed oligonucleotide
Fi-1=the 1st half-oligonucleotide in the ith corresponding parsed oligonucleotide
Fi-2=the 2nd half-oligonucleotide in the ith corresponding parsed oligonucleotide
Fn-1=the 1st half-oligonucleotide in the last corresponding parsed oligonucleotide
Fn-2=the 2nd half-oligonucleotide in the last corresponding parsed oligonucleotide
The simple vector representation is typically used to identify polynucleotide variants that constitute a combinatorial library, and the expanded vector representation to identify two groups of polynucleotide variants that can be co-assembled.
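Both representations can be produced by the same numbering step applied at two chunk sizes. The sketch below is illustrative only: it cuts the forward strand into contiguous chunks rather than following the parsing method of the cited patents, it assumes all sequences have the same length, and it numbers distinct chunks independently at each position starting from 1. These choices and all function names are assumptions.

```python
def chunk(seq, size):
    """Cut a sequence into contiguous pieces of the given size."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def number_chunks(sequences, size):
    """Replace each chunk by a number that is unique per position and sequence."""
    tables = None  # one lookup table (chunk -> number) per position
    vectors = []
    for seq in sequences:
        pieces = chunk(seq, size)
        if tables is None:
            tables = [{} for _ in pieces]
        # setdefault assigns the next free number the first time a chunk is seen.
        vectors.append([
            tables[pos].setdefault(piece, len(tables[pos]) + 1)
            for pos, piece in enumerate(pieces)
        ])
    return vectors


def simple_vectors(sequences, m):
    """Simple representation: one number per parsed oligonucleotide of m bases."""
    return number_chunks(sequences, m)


def expanded_vectors(sequences, m):
    """Expanded representation: one number per half-oligonucleotide of m // 2 bases."""
    return number_chunks(sequences, m // 2)
```

Equal numbers at the same position in two vectors indicate identical oligonucleotides (or half-oligonucleotides) at that position, which is the only property the later comparisons rely on.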
An oligonucleotide sibling matrix is constructed and used in subsequent analyses to identify polynucleotide variants that form a combinatorial library. Two genes are considered siblings if they differ at only one corresponding parsed oligonucleotide. For a library consisting of N polynucleotide variants, the corresponding oligonucleotide sibling matrix is of size N×N, and each matrix element Mij in the matrix is defined below:
wherein
An exemplary symmetrical oligonucleotide sibling matrix is shown in
A combinatorial library can be identified from the oligonucleotide sibling matrix by recursively finding new sibling polynucleotide variants starting from a seed polynucleotide variant. For example, a first seed polynucleotide variant can be set to S1. The sibling matrix is scanned along the row for S1, and sibling polynucleotide variants S6, S7 and S8 are added to the combinatorial library. Subsequently, the matrix rows corresponding to S6, S7 and S8 are scanned for new sibling polynucleotide variants. Polynucleotide variant S9 is identified when the row corresponding to S6 is scanned, and variant S10 is identified when the row corresponding to S7 is scanned. The identified siblings are added to the combinatorial library. For the variant collection in
Key parameters in the algorithm:
For (every variant Sj to be analyzed) {
    If (Mseed,j is not equal to 0) {
        If (variant Sj is neither in Clib nor in Vlist)
            Add Sj to Vlist
    }
}
Else {
    Set Sseed to the first sequence in Vlist
    Go to step 3
}
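The sibling matrix and the seed-based search can be sketched together as follows. This is a hedged illustration rather than the algorithm as filed: the matrix entry is assumed to be 1 for siblings and 0 otherwise (the text only requires a non-zero entry for siblings), variants are referred to by their index in the input list, and the names `clib` and `vlist` mirror Clib and Vlist above while the function names are hypothetical.

```python
from collections import deque


def sibling_matrix(vectors):
    """Build a symmetric N x N matrix from simple vector representations.

    Entry [i][j] is 1 when variants i and j differ at exactly one
    corresponding parsed oligonucleotide, else 0 (assumed encoding).
    """
    n = len(vectors)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            differences = sum(a != b for a, b in zip(vectors[i], vectors[j]))
            if differences == 1:
                matrix[i][j] = matrix[j][i] = 1
    return matrix


def grow_library(matrix, seed):
    """Collect every variant reachable from the seed through sibling links."""
    clib = []              # variants already accepted into the library
    vlist = deque([seed])  # variants whose matrix rows still need scanning
    while vlist:
        current = vlist.popleft()
        clib.append(current)
        for j, entry in enumerate(matrix[current]):
            if entry != 0 and j not in clib and j not in vlist:
                vlist.append(j)
    return clib
```

Starting from any seed, the search returns the seed together with all variants linked to it by a chain of siblings; a variant with no siblings yields a result containing only itself.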
Sequences of the polynucleotide variants can be obtained using standard sequencing methods or can be downloaded from public databases. For example, sequences of human antibody germline genes can be downloaded from the ImMunoGeneTics DataBase (imgt.cines.fr). Variants of the human frameworks can be designed by altering residues at positions that may preserve or enhance binding affinity during humanization, such as positions described in U.S. Pat. No. 6,402,213. Rational design can be employed to design variants anticipated to have specific effect on structure or activity of the potential therapeutic proteins.
The obtained sequences of the polynucleotide variants are parsed according to methods described in U.S. Pat. No. 6,521,427.
Another aspect of the invention is a method of selecting combinatorial libraries that can form a co-assembly set, comprising:
For identifying libraries that can be co-assembled, the expanded vector representation of polynucleotide variants is used.
The obligatory and sufficient condition for two polynucleotides to form a co-assembly set is that the polynucleotides have contiguous forward parsed oligonucleotides that are non-identical in sequence at each half-oligonucleotide length. The rule is implemented by introducing a specific XOR operation among expanded vector representations for the two variants as below:
wherein
Vi and Vj are expanded vector representations for sequence Si and Sj.
The meaning of F1-1, F1-2 . . . Fi-1, Fi-2 . . . Fn-1, Fn-2 is the same as defined above for the expanded vector representation.
F′1-1, F′1-2 . . . F′i-1, F′i-2 . . . F′n-1, F′n-2 are defined in the same way as F1-1, F1-2 . . . Fi-1, Fi-2 . . . Fn-1, Fn-2; F′ is used instead of F to denote a different sequence.
wherein
The XOR operation is applied to identify whether two combinatorial libraries form a co-assembly set. The format of the expanded vector representation for combinatorial libraries is very similar to the one for two polynucleotides, except for the presence of multiple variants of half-oligonucleotides at corresponding positions.
Vi
Compared to the expanded vector representation for a single gene, there may be more than one corresponding parsed oligonucleotide at a position. Σ is used to represent one or more.
ΣFk-h⊕ΣGk-h=0 when any half-oligonucleotide is the same between the two sets of half-oligonucleotides (where 1≤k≤n, and h=1 or h=2); otherwise, when all half-oligonucleotides at this position have different sequences, ΣFk-h⊕ΣGk-h=1. Likewise, if the resultant vector contains only one strip of “1”s, the two combinatorial libraries can be co-assembled.
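The library-level XOR rule can be sketched as below, assuming each library is supplied as the expanded vectors of its members and that half-oligonucleotide numbers were assigned consistently across both libraries. Each position of a library is collapsed into the set of numbers present there (the Σ of the text); the function names and data layout are illustrative assumptions.

```python
def library_vector(member_vectors):
    """Collapse member expanded vectors into one set of numbers per position."""
    return [set(column) for column in zip(*member_vectors)]


def xor_libraries(lib_a, lib_b):
    """Return 0 where the two sets share any half-oligonucleotide, else 1."""
    return [0 if a & b else 1 for a, b in zip(lib_a, lib_b)]


def has_single_strip(bits):
    """True when the 0/1 list contains exactly one contiguous strip of 1s."""
    starts = sum(
        1 for i, bit in enumerate(bits)
        if bit == 1 and (i == 0 or bits[i - 1] == 0)
    )
    return starts == 1


def libraries_can_coassemble(members_a, members_b):
    result = xor_libraries(library_vector(members_a), library_vector(members_b))
    return has_single_strip(result)
```

A single polynucleotide is handled by the same functions as a library with one member, so the pairwise rule for two variants is the special case in which every set holds one number.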
Exemplary combinatorial libraries are named E_1, E_2 and E_3, identified using methods described above. The expanded vector representation for each library is:
VE
VE
VE
The results for the XOR operations are:
VE
VE
VE
E_1 is a combinatorial library consisting of 4 polynucleotides, while E_2 and E_3 are combinatorial libraries each having 2 members. Both E_1 and E_2 can be co-assembled with E_3, but E_1 and E_2 cannot be co-assembled. In the co-assembly matrix, if two combinatorial libraries can be co-assembled, their corresponding matrix cell value is 1; otherwise, it is 0. Table 1 shows an exemplary co-assembly matrix for five combinatorial libraries.
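A co-assembly matrix of this kind can be filled in by testing every pair of libraries. The sketch below assumes each library is already given as a list of sets of half-oligonucleotide numbers, one set per position, and it sets the diagonal to 0, which is an assumption because the text does not say how a library is scored against itself. The three libraries in the usage note are hypothetical and are not the E_1, E_2 and E_3 vectors of the specification.

```python
from itertools import combinations, groupby


def pair_can_coassemble(lib_a, lib_b):
    """Apply the XOR rule and require exactly one strip of 1s."""
    bits = [0 if a & b else 1 for a, b in zip(lib_a, lib_b)]
    # groupby merges consecutive equal values, so each group of 1s is one strip.
    strips = sum(1 for value, _ in groupby(bits) if value == 1)
    return strips == 1


def coassembly_matrix(libraries):
    """Return a symmetric 0/1 matrix; 1 means the two libraries can be co-assembled."""
    n = len(libraries)
    matrix = [[0] * n for _ in range(n)]  # diagonal left at 0 by assumption
    for i, j in combinations(range(n), 2):
        if pair_can_coassemble(libraries[i], libraries[j]):
            matrix[i][j] = matrix[j][i] = 1
    return matrix
```

For three hypothetical libraries `x = [{1}, {1}, {1, 2}, {1}]`, `y = [{2}, {1}, {3}, {1}]` and `z = [{3}, {3}, {4}, {1}]`, calling `coassembly_matrix([x, y, z])` marks `x` with `z` and `y` with `z` as compatible, and `x` with `y` as incompatible, which has the same pattern as the relationship described for E_1, E_2 and E_3.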
ALGORITHM2 shows a method to identify combinatorial libraries that can be co-assembled:
Variables used in the algorithm are:
An integrated software package for co-assembly has been developed and implemented using Java and Java Swing. The package automatically 1) reads in a sequence library and identifies unique oligos to be synthesized; 2) generates both simple and expanded vector representations for each gene; 3) calculates the oligo sibling matrix, and identifies combinatorial library of sequences starting from a seed sequence; 4) calculates the co-assembly matrix and identifies all co-assembly sets; and 5) writes out each co-assembly set for gene synthesis/assembly directly.
Tenascin 3rd fibronectin domain (FN3) is representative of a class of Ig-like scaffolds that incorporates CDR-like loops extending from the surface of the molecule, and has been widely used for identification of binding proteins to modulate activity of therapeutic proteins (Lipovsek et al., Antibodies J. Mol. Biol. 368: 1024-41, 2007).
A 144 base pair fragment was identified in the 3rd FN3 loop of human Tenascin (nucleotides 2862-3002 in human Tenascin, GenBank Acc. No. NM_002160, SEQ ID NO: 13), and a library of variants was designed and assembled to validate the co-assembly algorithm. Table 3 shows the sequence of the Tenascin gene fragment used. The library of variants is designed by introducing amino acid changes at the underlined codon positions shown in Table 3. A total of 12 sequences are designed. Their corresponding parsed oligonucleotides (both forward and reverse) are shown in Table 4.
For any variant, the same residue is introduced at each codon position (underlined in Table 3). The sequences are denoted A, E, L, M, N, P, R, S, T, V, W and Y, respectively, according to the introduced amino acid.