Saturation mutagenesis screens such as alanine scans [1] enable protein structure-function studies through directed inquiry into the functional consequences of mutations in individual amino acid residues. However, scaling these approaches to cover entire proteins is laborious and expensive, typically requiring the individual synthesis of mutagenic oligonucleotide primers for each target codon and their use in separate reactions. An alternative is random mutagenesis, e.g. error-prone PCR or doped oligonucleotide synthesis, but these methods fail to generate most amino acid substitutions that require multi-base changes.
Site-directed mutagenesis is an indispensable tool in sequence-structure-function studies, protein-engineering, and directed evolution [3]. The most widely used mutagenesis approaches are derivatives of the Kunkel method [4] and use oligonucleotides synthesized with the desired mutation, which prime on a wild-type template to copy the remaining wild-type sequence of interest. The parental template is marked with deoxyuracil bases (dUTPs) or, in more recent commercial approaches, dam methylation, allowing for its selective degradation [5]. These approaches traditionally target only one site at a time, such that many separate reactions must be performed in order to systematically mutagenize a protein sequence at every position, for instance to perform alanine scanning [1]. Further scaling this strategy to generate multiple distinct mutations at each site is even more labor-intensive. For example, Kato and colleagues individually constructed 2,314 point mutants of the tumor suppressor gene TP53, and serially assayed each mutant for its ability to transactivate a fluorescent reporter gene in a yeast model [6].
Recently introduced deep mutational scanning (DMS) approaches [2] take an alternative approach, in which large mutant libraries are first built in a bulk, randomized fashion. Digital counting via massively parallel DNA sequencing is then used to quantify the enrichment or depletion of individual mutants following functional selection on the complex library of mutants. In order to build the initial library of mutants, these approaches typically use doped oligonucleotide synthesis [7-9], in which a full-length region to be mutagenized is synthesized on controlled-pore glass (CPG) columns, with each phosphoramidite spiked with a small percentage of the other three, such that point mutations are randomly introduced along the length of the synthesized strand. However, single nucleotide deletions are common in CPG oligonucleotide synthesis due to incomplete incorporation in each step, limiting the length of mutant oligonucleotides that can be constructed without an unacceptably high rate of frame-shifting deletions. Furthermore, the minimum level of doping from most commercial vendors is 1%, which may be higher than desired. Error-prone PCR represents an alternative approach, but it requires empirical tuning to reach a desired mutational load and can be prone to bias [10]. Furthermore, a key limitation shared by doped oligonucleotide synthesis and error-prone PCR, when applied to protein-coding sequence, is that only a minority of the codon mutational space can be accessed through single-base mutations (e.g., 31% for p53).
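The limited reach of single-base mutations can be illustrated with a short calculation. The following is a minimal sketch, not the analysis used in this disclosure: it assumes the standard genetic code, counts amino acid substitutions (excluding stop codons) rather than codons, and uses the hypothetical function names `single_base_neighbors` and `accessible_fraction`. It is therefore not expected to reproduce the 31% figure quoted for p53, which may have been computed on a different basis.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def single_base_neighbors(codon):
    """Amino acids reachable from `codon` by changing exactly one base.

    The wild-type residue and stop codons are excluded from the result.
    """
    wild_type = CODON_TABLE[codon]
    reachable = set()
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            reachable.add(CODON_TABLE[mutant])
    return reachable - {wild_type, "*"}


def accessible_fraction(cds):
    """Fraction of the 19 possible substitutions per codon that single-base
    changes can reach, averaged over an in-frame coding sequence of sense codons."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    reachable = sum(len(single_base_neighbors(c)) for c in codons)
    return reachable / (19 * len(codons))


# Example: methionine (ATG) can reach six other residues, tryptophan (TGG) five.
met_neighbors = single_base_neighbors("ATG")
trp_neighbors = single_base_neighbors("TGG")
```

Under these assumptions, even a favorable codon such as ATG reaches only 6 of 19 alternative residues, which is why programmed codon replacement is needed to cover the full substitution space.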
Several recent approaches provide a degree of multiplexing for programmed mutagenesis. An extension of the original Kunkel method, PFunkel [11], uses pooled primer extension on a single-stranded circular phagemid template prepared from an E. coli host permissive of dUTP incorporation during DNA replication. Another approach, EMPIRIC [12], was used to mutagenize nine codons in a single reaction by inserting a cassette excisable by type IIS restriction digestion followed by replacement of this cassette with a mutagenized nine-codon cassette. These improved methods have recently enabled systematic measurement of point mutant fitness landscapes for portions of the yeast ubiquitin protein [13], a hepatitis C virus replication factor [14], and a bacterial beta-lactamase [15]. Despite such successes, these and other saturation codon mutagenesis methods remain laborious and cost-prohibitive, as they require individual synthesis of mutagenic primers or are limited in their scope by targeting only a few residues at a time, requiring serial tiling over the target.
In one aspect, this application relates to a method for multiplexed mutagenesis of a target nucleotide sequence. The method entails the steps of: a) generating, in parallel, a set of mutagenic oligonucleotide primers designed to cover all or part of the target nucleotide sequence; and b) reacting the set of mutagenic oligonucleotide primers with the target sequence in the presence of a polymerase to generate a mutant nucleotide sequence library. In step a), each member of the set of mutagenic oligonucleotide primers: i) is at least substantially complementary to a portion of the target nucleotide sequence, wherein the portion of the target nucleotide sequence is different from each member of the set, ii) includes a 5′ flanking adaptor sequence, a 3′ flanking adaptor sequence, or both, and iii) includes a unique programmed mutation near its center. In step b), each member of the mutant nucleotide sequence library includes a full-length copy of the target nucleotide sequence having a unique programmed mutation derived from one member of the set of mutagenic oligonucleotide primers.
In some embodiments, generating the set of mutagenic oligonucleotide primers entails: synthesizing and releasing a set of mutagenic oligonucleotides from a microarray; and amplifying and retrieving the set of mutagenic oligonucleotides using the 5′ flanking adaptor sequence or the 3′ flanking adaptor sequence. In some embodiments, the set of mutagenic oligonucleotide primers are designed to tile the target nucleotide sequence. In some embodiments, the 5′ flanking adaptor sequence or the 3′ flanking adaptor sequence is used to retrieve one or more copies of the target nucleotide sequence in the presence of the polymerase. In some embodiments, the unique programmed mutation includes one or more base changes, insertions, or deletions relative to the target nucleotide sequence. In other embodiments, the unique programmed mutation is a codon swap. In some embodiments, the target nucleotide sequence is a coding sequence. In some embodiments, the target nucleotide sequence is more than 10 codons in length. In some embodiments, each member of the set of mutagenic oligonucleotide primers is about 10 to 200 nucleotides in length.
In another aspect, this application relates to a method for generating a mutant nucleotide sequence library. The method entails the steps of: a) generating, in parallel, a set of mutagenic oligonucleotide primers that are at least substantially complementary to a portion of the target nucleotide sequence; b) annealing and extending the set of mutagenic oligonucleotide primers to and along a wild type sense template corresponding to the target nucleotide sequence, creating a set of sense megaprimers; c) amplifying the set of sense megaprimers using a pair of primers; d) annealing and extending the amplified set of sense megaprimers to and along a wild type antisense template corresponding to the target nucleotide sequence, creating a mutant nucleotide sequence library; and e) amplifying the members of the mutant nucleotide sequence library. In step a), each member of the set of mutagenic oligonucleotide primers includes: a unique programmed mutation, a 5′ flanking adaptor sequence, and a 3′ flanking adaptor sequence. In step b), the wild type sense template is marked for selective degradation. In step c), one of the primers recognizes and binds to the 5′ flanking adaptor sequence. In step d), each member of the mutant nucleotide sequence library includes a full-length copy of the target nucleotide sequence having a unique programmed mutation derived from one member of the set of mutagenic oligonucleotide primers.
In some embodiments, generating the set of mutagenic oligonucleotide primers entails: synthesizing and releasing a library of mutagenic oligonucleotides from a microarray, and amplifying and retrieving a subset of the library of mutagenic oligonucleotides using the 3′ flanking adaptor sequence, wherein the subset is the set of mutagenic oligonucleotide primers. In some embodiments, the wild type sense template and the wild type antisense template are degraded using a selective degradation agent following steps b) and d), respectively. In some embodiments, the method further includes removing the 3′ flanking adaptor sequence from the set of mutagenic oligonucleotide primers after step a) and before step b). In some embodiments, the method further includes removing the 5′ flanking adaptor sequence from the amplified set of sense megaprimers after step c) and before step d). In some embodiments, the set of mutagenic oligonucleotide primers are designed to tile the target nucleotide sequence. In some embodiments, the unique programmed mutation includes one or more base changes, insertions, or deletions relative to the target nucleotide sequence. In some embodiments, the unique programmed mutation is placed near the center of each mutagenic oligonucleotide primer. In some embodiments, the target nucleotide sequence is a coding sequence such as a coding sequence that is more than 10 codons in length. In some embodiments, each member of the set of mutagenic oligonucleotide primers is about 10 to 200 nucleotides in length.
In another aspect, this application relates to a method for generating a mutant protein library. The method entails the steps of: generating a mutant nucleotide sequence library using the method disclosed above, cloning each member of the mutant nucleotide sequence library into an expression plasmid, and expressing a mutated protein from each member of the mutant nucleotide sequence library that is cloned into an expression plasmid to generate the mutant protein library. In some embodiments, the mutant protein library is used to screen for the function of a mutant protein.
Methods for multiplexed mutagenesis of a target nucleotide sequence are provided herein. Such methods may be used to generate mutant nucleotide sequence libraries and mutant protein libraries for use in other applications such as sequence-structure-function studies, protein-engineering, directed evolution, or any other suitable application.
To overcome the limitations of previously used methods, a method which may be referred to herein as PALS (“programmed allelic series”) was developed, which combines low-cost, microarray-based DNA synthesis of alleles with single-tube overlap extension mutagenesis in order to introduce one and only one mutation per cDNA template in a massively parallel fashion.
According to the embodiments described herein, this application relates to a method for multiplexed mutagenesis of a target nucleotide sequence. The target nucleotide sequence may be any suitable DNA or RNA sequence, including any portion of a gene or RNA molecule (e.g., mRNA, tRNA, siRNA, shRNA, miRNA), and may include a coding sequence, a non-coding sequence, or a sequence that includes a portion of a coding sequence and a portion of a non-coding sequence.
The target nucleotide sequence may be of any length. In certain embodiments, the target nucleotide sequence is more than 10 codons in length. In other embodiments, the target nucleotide sequence is more than 20 codons in length, more than 30 codons in length, more than 40 codons in length, more than 50 codons in length, more than 60 codons in length, more than 70 codons in length, more than 80 codons in length, more than 90 codons in length, more than 100 codons in length, more than 150 codons in length, more than 200 codons in length, more than 250 codons in length, more than 300 codons in length, more than 350 codons in length, more than 400 codons in length, more than 450 codons in length, more than 500 codons in length, or more than 1000 codons in length.
The method includes a step of generating, in parallel, a set of mutagenic oligonucleotide primers designed to cover all or part of the target nucleotide sequence. In certain aspects, the step of generating the set of mutagenic oligonucleotide primers is accomplished by synthesizing the entire set of primers in parallel, i.e., at the same time during a single reaction instead of one-by-one or in small groups. In some embodiments, the synthesis is by microarray.
Each member of the set of mutagenic oligonucleotide primers is at least substantially complementary to a portion of the target nucleotide sequence. “Substantially complementary,” as used herein, means that a first nucleic acid molecule is (1) entirely or traditionally complementary to a second nucleic acid molecule, i.e., when the first and second nucleic acid molecules hybridize to each other to form base pairs between traditional nucleotide bases: adenine is matched to thymine (DNA) or uracil (RNA), and guanine is matched to cytosine, or (2) substantially or non-traditionally complementary to a second nucleic acid molecule, wherein one or more bases are paired with a base that is not a traditional pairing (e.g. IUPAC matched bases) or a non-traditional or synthetic base when the two molecules hybridize to one another.
Each mutagenic oligonucleotide primer is generally between about 10 to about 200 nucleotides (nt) in length, but may be any suitable length, including, but not limited to, about 10 to about 100 (nt) in length, about 50 to about 100 (nt) in length, about 60 to about 100 (nt) in length, about 70 to about 100 (nt) in length, about 80 to about 100 (nt) in length, about 90 to about 100 (nt) in length, about 50 to about 150 (nt) in length, about 100 to about 200 (nt) in length, about 100 to about 150 (nt) in length, about 150 to about 200 (nt) in length, about 10 (nt) in length, about 20 (nt) in length, about 30 (nt) in length, about 40 (nt) in length, about 50 (nt) in length, about 60 (nt) in length, about 70 (nt) in length, about 80 (nt) in length, about 90 (nt) in length, about 100 (nt) in length, about 110 (nt) in length, about 120 (nt) in length, about 130 (nt) in length, about 140 (nt) in length, about 150 (nt) in length, about 160 (nt) in length, about 170 (nt) in length, about 180 (nt) in length, about 190 (nt) in length, about 200 (nt) in length, or any other suitable length.
Although not a requirement, the primer is generally shorter than the target sequence. Thus, each mutagenic oligonucleotide primer should correspond to a different portion of the target nucleotide sequence. This allows the set of primers to cover the entire target nucleotide sequence. In certain aspects, the primers are designed to tile the target nucleotide sequence.
A mutagenic oligonucleotide primer generated in accordance with the methods described herein also includes a 5′ flanking adaptor sequence, a 3′ flanking adaptor sequence, or both a 5′ flanking adaptor sequence and a 3′ flanking adaptor sequence. These adaptor sequences are not present in the target nucleotide sequence, but are used to retrieve a set or subset of nucleotides generated using the methods described herein. For example, the adaptor sequences are used to retrieve one or more copies of the target nucleotide sequence in the presence of a polymerase (e.g., during a PCR reaction), or may be used to retrieve a set or subset of mutagenic oligonucleotides that have been synthesized in parallel.
Further, each mutagenic oligonucleotide primer generated in accordance with the methods described herein includes a unique programmed mutation such that each primer has a different mutation. In some aspects, the mutation is near the center of the primer. The mutation may include, but is not limited to, one or more base changes, insertions, or deletions relative to the target nucleotide sequence. In one aspect the unique programmed mutation is a codon swap.
The methods described herein also include a step of reacting the set of mutagenic oligonucleotide primers generated in the previous step with the target sequence in the presence of a polymerase to generate a mutant nucleotide sequence library.
The reactions between the set of mutagenic oligonucleotide primers and the target sequence may include reactions (e.g., PCR, which is a reaction in the presence of a polymerase) with an antisense strand of the target nucleotide sequence, a sense strand of the target nucleotide sequence, or both. In certain embodiments, these reactions are described in
In one embodiment, a method for generating a mutant nucleotide sequence library is provided. The method includes the steps of: a) generating, in parallel, a set of mutagenic oligonucleotide primers that are at least substantially complementary to a portion of the target nucleotide sequence; b) annealing and extending the set of mutagenic oligonucleotide primers to and along a wild type sense template corresponding to the target nucleotide sequence, creating a set of sense megaprimers; c) amplifying the set of sense megaprimers using a pair of primers; d) annealing and extending the amplified set of sense megaprimers to and along a wild type antisense template corresponding to the target nucleotide sequence, creating a mutant nucleotide sequence library; and e) amplifying the members of the mutant nucleotide sequence library. In step a), each member of the set of mutagenic oligonucleotide primers includes: a unique programmed mutation, a 5′ flanking adaptor sequence, and a 3′ flanking adaptor sequence. In step b), the wild type sense template is marked for selective degradation. In step c), one of the primers recognizes and binds to the 5′ flanking adaptor sequence. In step d), each member of the mutant nucleotide sequence library includes a full-length copy of the target nucleotide sequence having a unique programmed mutation derived from one member of the set of mutagenic oligonucleotide primers. This method is illustrated in
In some embodiments, generating the set of mutagenic oligonucleotide primers includes: synthesizing and releasing a library of mutagenic oligonucleotides from a microarray, and amplifying and retrieving a subset of the library of mutagenic oligonucleotides using the 3′ flanking adaptor sequence, wherein the subset is the set of mutagenic oligonucleotide primers. In some embodiments, the wild type sense template and the wild type antisense template are degraded using a selective degradation agent following steps b) and d), respectively. In some embodiments, the method further includes removing the 3′ flanking adaptor sequence from the set of mutagenic oligonucleotide primers after step a) and before step b). In some embodiments, the method further includes removing the 5′ flanking adaptor sequence from the amplified set of sense megaprimers after step c) and before step d).
In another embodiment, the methods described herein are illustrated generally in
To assess coverage of the programmed mutations and the off-target mutation rate, the PALS library resulting from this method is sequenced. Provided sufficient depth, shotgun sequencing of the complex library of mutant clones may sensitively detect all the introduced mutations. However, existing sequencing technologies still produce reads that are too short to cover full-length ORFs or even individual domains, such that one is unable to phase multiple mutations on the same clone when they are separated by more than the read insert size. Consequently, for instance, a neutral allele could be wrongly counted as highly deleterious when coupled to a loss-of-function allele elsewhere on the same clone. Some sequencing platforms (e.g., Pacific Biosciences) are capable of longer reads but these currently come at the expense of high per-base error rates (up to 15%), such that they are not readily suited to unambiguously identifying clones that contain only a programmed mutation.
For these reasons, tag-directed hierarchical sequencing or sub-assembly [16] was adopted as a way to validate the composition and quality of PALS libraries. In this approach (
The methods described herein may be performed with reagents and/or platforms that may be assembled in a kit, or available separately. For example, the reagents and materials described in the methods below may be formulated and assembled in a single kit to allow a user to perform the method by purchasing everything that is needed in a single place.
The mutant nucleotide sequence library generated in accordance with the methods described above may be used to generate a mutant protein library that can be used to assess the function of mutant proteins as discussed in the examples below. As such, a method for generating a mutant protein library is provided in accordance with the methods described above. Such a method includes steps of: generating a mutant nucleotide sequence library using the method disclosed above, cloning each member of the mutant nucleotide sequence library into an expression plasmid, and expressing a mutated protein from each member of the mutant nucleotide sequence library that is cloned into an expression plasmid to generate the mutant protein library.
The following examples are intended to illustrate various embodiments of the invention. As such, the specific embodiments discussed are not to be construed as limitations on the scope of the invention. It will be apparent to one skilled in the art that various equivalents, changes, and modifications may be made without departing from the scope of invention, and it is understood that such equivalent embodiments are to be included herein. Sequence data reported herein have been deposited in the Sequence Read Archive (SRA), www.ncbi.nlm.nih.gov/sra (accession code SRA169378). Further, all references cited in the disclosure are hereby incorporated by reference in their entirety, as if fully set forth herein.
Saturation mutagenesis screens such as alanine scans [1] enable protein structure-function studies through directed inquiry into the functional consequences of mutations in individual amino acid residues. However, scaling these approaches to cover entire proteins is laborious and expensive, typically requiring the individual synthesis of mutagenic oligonucleotide primers for each target codon and their use in separate reactions. An alternative is random mutagenesis, e.g. error-prone PCR or doped oligonucleotide synthesis, but these methods fail to generate most amino acid substitutions that require multi-base changes. To overcome these challenges, PALS (“programmed allelic series”), a highly multiplexed, site-directed mutagenesis approach that leverages massively parallel oligonucleotide synthesis on microarrays was developed. PALS is demonstrated by using single reactions to introduce every possible single-codon mutation into the DNA-binding domain (DBD) of the yeast transcription factor Gal4 (64 amino acid residues) and the human tumor suppressor p53 (393 residues). Full-length, haplotype-resolved sequencing of the resulting 1.35 million clones identified 99.9% and 93.5% of the programmed mutations as singleton mutations on an otherwise wild-type background in each respective gene. Subjecting the Gal4 PALS library to an in vivo selection for transcriptional activation demonstrated that nearly a third of the DBD is intolerant to mutation. Additionally, several mutations in the linker domain that increased function are identified in the assay, possibly by orienting the flanking domains more favorably for transcriptional activation. Fully covering codon mutation space with single amino acid changes facilitates a more finely resolved landscape of protein-coding functional constraint. This method may also be useful for massively multiplexed biochemical characterization of clinically observed missense variants of unknown significance in disease associated genes.
Methods
Mutagenic Primer Preparation.
Mutagenic primers were electrochemically synthesized on a 12,432-feature programmable DNA microarray and released into solution by CustomArray, Inc. [34]. For Gal4 (GI #6325008), codons 2-65 were each replaced with the optimal codon in S. cerevisiae corresponding to one of the 19 other amino acids (Nakamura, 2000), a stop codon (TAA), or an in-frame deletion, for a total of 1,344 oligos each spotted in duplicate. For p53 (GI #120407068), codons 1-393 were replaced with fully degenerate bases (“NNN”), such that each primer molecule synthesized within a single spot on the array includes a different one of 64 randomized codons, with each of the 393 oligos spotted in triplicate.
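The oligo counts stated above follow from simple arithmetic, sketched below. The variable names are illustrative only, and the tally covers just the Gal4 and p53 designs described here; no claim is made about how the remaining array features were used.

```python
ARRAY_FEATURES = 12_432

# Gal4: codons 2-65 inclusive, each with 19 other amino acids,
# one stop codon (TAA) and one in-frame deletion.
gal4_codons = len(range(2, 66))            # 64 codons
gal4_designs = gal4_codons * (19 + 1 + 1)  # distinct oligo designs
gal4_features = gal4_designs * 2           # spotted in duplicate

# p53: one degenerate (NNN) oligo design per codon for codons 1-393.
p53_designs = len(range(1, 394))           # 393 designs
p53_features = p53_designs * 3             # spotted in triplicate
# Each NNN design represents 64 codons, one of which is the wild-type codon.
p53_codon_variants = p53_designs * 64
p53_non_wild_type_codons = p53_designs * 63

features_used = gal4_features + p53_features
fraction_of_array = features_used / ARRAY_FEATURES
```

With these inputs the Gal4 design count is 1,344 (matching the stated total), and the two libraries together occupy 3,867 of the 12,432 features.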
Each primer was designed as a 90mer, including 15-base flanking adaptor sequences (i.e., 5′ flanking adaptor sequence and 3′ flanking adaptor sequence), except for the Gal4 in-frame codon deletion primers, which were designed as 87mers. Each primer was synthesized sense to the gene, with 33 upstream bases, followed by the codon replacement, and 24 downstream bases. To allow for specific retrieval, a different flanking adaptor pair was used for each subset of mutagenic primers on the array. Gal4 primers were flanked by adaptor sequences “truncL_GAL4DBD” and “truncR_GAL4DBD” (see
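The primer layout described above (15-base adaptor, 33 upstream bases, replacement codon, 24 downstream bases, 15-base adaptor) can be sketched as a small design function. This is an illustrative sketch only: the function name `design_primer`, the toy template, and both adaptor sequences are hypothetical placeholders and are not the sequences used in the examples.

```python
def design_primer(template, cds_start, codon_number, replacement,
                  left_adaptor, right_adaptor, upstream=33, downstream=24):
    """Build one sense-strand mutagenic primer.

    template     -- sense-strand sequence including context flanking the ORF
    cds_start    -- 0-based offset of the first base of codon 1 in `template`
    codon_number -- 1-based codon to replace
    replacement  -- new codon, "NNN" for a degenerate codon, or "" for an
                    in-frame deletion
    """
    codon_start = cds_start + 3 * (codon_number - 1)
    if codon_start - upstream < 0 or codon_start + 3 + downstream > len(template):
        raise ValueError("template lacks enough flanking sequence for this codon")
    left = template[codon_start - upstream:codon_start]
    right = template[codon_start + 3:codon_start + 3 + downstream]
    return left_adaptor + left + replacement + right + right_adaptor


# Hypothetical inputs for illustration only.
LEFT_ADAPTOR = "GACTGACTGACTGAC"   # placeholder 15-mer
RIGHT_ADAPTOR = "CTAGCTAGCTAGCTA"  # placeholder 15-mer
template = "ACGT" * 30             # toy 120-nt template
cds_start = 36

substitution = design_primer(template, cds_start, 2, "GCT", LEFT_ADAPTOR, RIGHT_ADAPTOR)
deletion = design_primer(template, cds_start, 2, "", LEFT_ADAPTOR, RIGHT_ADAPTOR)
```

With 15-base adaptors, a codon substitution yields a 90-mer and an in-frame deletion yields an 87-mer, consistent with the lengths given above. Calling the function once per codon and per replacement would produce a tiled primer set.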
The resulting oligo pools were further amplified with adaptors modified to contain a deoxyuracil base at the 3′ terminus. This second-round amplification was carried out in 50 ul reactions, using 1 ul of the previous amplification reaction (at a 1:4 dilution in dH2O) as template, following cycling program “ADO_KR”. Each reaction included 25 ul Kapa Robust Hot Start ReadyMix (which is not inhibited by uracil-containing templates), amplification primers at 500 nM each (“L_GAL4DBD”/“R_GAL4DBD_U” or “L_TP53”/“R_TP53_U”) (see
Wild-Type Template Preparation.
The full-length Gal4 open reading frame was amplified from genomic DNA of S. cerevisiae strain BY4741 and directionally cloned into the yeast shuttle vector p416CYC, a single-copy CEN plasmid with the CYC1 promoter [35], by digestion with SmaI and ClaI (New England Biolabs), using the InFusion cloning kit (Clontech). Subsequently, an N-terminal truncation was prepared by amplifying residues 1-196 from the original clone using the primer pair GAL4_CLONE_F and GAL4_NTERM_R (see
To prepare wild-type sense and antisense strands to serve as templates for mutagenic primer extension, the desired fragments were amplified from plasmid clones by PCR with several modifications. To select for the sense strand, the reverse primer was phosphorylated to allow for its later degradation by lambda exonuclease, and to select the antisense strand, the forward primer was instead phosphorylated. Furthermore, to minimize undesired carry-through of wild-type copies, in some cases long synthetic tails (38 or 40 nt) were placed on the phosphorylated primer to prevent the resulting 3′ ends of the selected strands from acting as primers during subsequent extension steps. Primers were either ordered with a 5′ phosphate or were enzymatically phosphorylated in 10 ul reactions containing 1 ul of 100 uM primer stock, 7 ul H2O, 1 ul 10× T4 Ligase Buffer with ATP (NEB), and 10 U T4 polynucleotide kinase (NEB) and incubated for 30 minutes at 37° C., followed by heat inactivation for 20 minutes at 65° C. and one minute at 95° C. Wild-type fragments were amplified in 50 ul PCR reactions with forward and phosphorylated reverse primers using Kapa HiFi U+ HotStart Ready Mix (“KHF U+ HS RM”) supplemented with dUTPs to a final concentration of 200 nM. Primers for wild-type template preparation are listed in
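As a worked check on the phosphorylation reaction described above, the final primer concentration follows from the usual dilution relation C1·V1 = C2·V2. The helper below is a minimal sketch with a hypothetical name; it assumes the stated 10 ul total reaction volume.

```python
def final_concentration(stock_conc, stock_volume, total_volume):
    """Dilution relation C2 = C1 * V1 / V2 (any consistent units)."""
    return stock_conc * stock_volume / total_volume


# 1 ul of 100 uM primer stock in a 10 ul kinase reaction.
kinase_primer_uM = final_concentration(100.0, 1.0, 10.0)

# Amount of primer in the reaction: uM x ul = pmol.
kinase_primer_pmol = 100.0 * 1.0
```

Under these assumptions the phosphorylated primer is at 10 uM (100 pmol in total).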
Mutagenic Primer Extension.
Next, 2 ng of each primer pool was combined with 3 ng of its respective sense-strand template, raised to 12.5 ul with dH2O, and mixed with 12.5 ul of KHF U+HS RM for extension along the dUTP-containing wild-type template by the annealed mutagenic primers. The reaction was subjected to one round of denaturation, annealing, and extension (cycling conditions “PALS_EXTEND”; see
The resulting strand extension products were enriched via PCR using the KHF U+ HS RM in 25 ul reactions using the cycling program PALS_AMPLIFY (see
The reverse primer in the preceding amplification step carried a 3′-terminal dUTP, allowing for adaptor excision by treatment with 1 U USER enzyme for 15 minutes at 37° C. This reaction was cleaned by Zymo column and eluted in 11.8 ul buffer EB. Next, the respective forward primer was added (0.75 ul at 10 uM) followed by 12.5 ul of KHF HS RM to create sense-strand mutagenized megaprimers with one round of cycling conditions “PALS_EXTEND” (see
Sense-strand megaprimers were then purified by Zymo column, annealed to the wild-type antisense strand, and extended to form full-length copies. Each extension reaction contained 3 ng of the sense-stranded megaprimer pool, 1 ng of the wild-type dUTP-containing antisense strand, and was performed with KHF U+ HS RM, followed by column cleanup, USER treatment (1.5 U for 10 min at 37° C.), and a second column cleanup, as during the initial mutagenic strand extension reaction. Finally, the full-length mutagenized copies were enriched by PCR using fully external primers (“OUTER_F”/“GAL4_OUTER_R” or “OUTER_F”/“P53_ANTISENSE_R”) (see
PALS Library Cloning.
Gal4 DBD PALS libraries were cloned into p416CYC-bc, a pre-tagged library of vectors derived from p416CYC, in which each clone contains a random 16mer tag. To prepare p416CYC-bc, a pair of unique restriction sites was placed downstream of the CYC1 terminator by digesting p416CYC with KpnI-HF (NEB) and inserting a duplex of oligos (“P416CYC_AGEMFE_TOP”/“P416CYC_AGEMFE_BTM”) (see
Tagged p53 PALS libraries were created in the reverse order: the PALS-mutagenized amplicon was cloned first, and the library was expanded and tags inserted second. The p53 library was cloned into pCMV6-AC-GFP (Origene) by standard directional cloning in two separate cloning reactions using NotI-HF/BamHI-HF or NotI-HF/KpnI-HF (NEB). Libraries were transformed into 10-beta electrocompetent cells (NEB), combined, expanded overnight and purified by midiprep as for Gal4. Subsequently, the cloned p53 libraries were linearized at the AgeI site downstream of the hGH poly-A signal: 2.5 ug of plasmid DNA was digested with 10 U AgeI (NEB) in 50 ul for 1 hr at 37° C., and purified by Zymo column. A tag cassette containing a randomized 20mer was synthesized (“P53_BC_CAS”) (see
Clone Subassembly Sequencing.
To bring the tag cassette into proximity with the mutagenized Gal4 coding sequence (
To prepare Illumina sequencer-ready subassembly libraries, tag-linked amplicons from the previous step were fragmented and adaptor-ligated using the Nextera v2 library preparation kit (Illumina), with the following modifications to the manufacturer's directions: for each reaction, 1.0 ul Tn5 enzyme “TDE” was combined with 2.0 ul H2O, 5 ul Buffer 2× TD, and 2 ul of the post-recircularization PCR product. Longer insert sizes were obtained by diluting enzyme TDE up to 1:10 in 1× Buffer TD (a 1:4 dilution was used for the libraries sequenced here). Tagmentation was carried out by incubating for 10 minutes at 55° C., followed by library enrichment PCR to add Illumina flowcell sequences. Libraries were amplified by KHF RM 2× mastermix in 25 ul using a forward primer of NEXV2_AD1 and one of the indexed reverse primers “SHARED_BC_REV_###” (see
Tag-Directed Clone Subassembly.
Subassembly libraries were pooled and subjected to paired-end sequencing on Illumina MiSeq and HiSeq instruments, with a long forward read directed into the clone insert (101 bp for HiSeq runs, 325 or 375 bp for MiSeq runs) and a reverse read into the clone tag. Tag-flanking adaptor sequences were trimmed using cutadapt (obtained from https://code.google.com/p/cutadapt/), and read pairs without recognizable tag-flanking adaptors were excluded from further analysis. Insert-end reads were aligned to the Gal4 or p53 wild-type clone sequence using bwa mem (with arguments “-z 1 -M”) [36], and alignments were sorted and grouped by their corresponding clone tag. To properly align the programmed in-frame codon deletions included in the Gal4 PALS library, bwa alignments were realigned using a custom implementation of Needleman-Wunsch global alignment with a reduced gap opening penalty at codon start positions (match score=1, mismatch score=−1, gap open in coding frame=−2, gap open elsewhere=−3, gap extend=−1). A consensus haplotype sequence was determined for each tag-defined read group by incorporating variants present in the group's aligned reads at sufficient depth. Spurious mutations created by sequencing errors, or mutations present at low allele frequency arising from linking two haplotypes to the same tag were flagged and discarded by requiring the major allele at each position (either wild-type or mutant) to be present with a frequency of ≧80%, ≧75% and ≧66%, for read depths≧20, 10-19, or 3-10, respectively, considering only bases with quality score≧20. Tag groups with fewer than three reads (Gal4 DBD) or 20 reads (p53) were discarded, as were groups not meeting the major allele frequency threshold across the entire target (Gal4 DBD) or a minimum of 1 kbp (p53). Consensus haplotypes were validated by Sanger sequencing of individual colonies from each tagged plasmid library (
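The depth-dependent major-allele filter described above can be sketched per position as follows. This is an illustrative sketch rather than the pipeline actually used: the function names are hypothetical, depth is assumed to be counted after the base-quality filter, and because the stated depth ranges overlap at 10, a depth of exactly 10 is assumed here to fall in the 10-19 bin.

```python
from collections import Counter

MIN_BASE_QUALITY = 20


def required_major_fraction(depth):
    """Minimum major-allele frequency for a given read depth, or None if too shallow."""
    if depth >= 20:
        return 0.80
    if depth >= 10:   # assumption: depth 10 is treated as part of the 10-19 bin
        return 0.75
    if depth >= 3:
        return 0.66
    return None


def call_position(observations):
    """Consensus base at one position of a tag-defined read group.

    observations -- list of (base, quality) pairs from the aligned reads.
    Returns the major allele, or None if the position fails the filter.
    """
    bases = [base for base, quality in observations if quality >= MIN_BASE_QUALITY]
    threshold = required_major_fraction(len(bases))
    if threshold is None:
        return None
    major, count = Counter(bases).most_common(1)[0]
    return major if count / len(bases) >= threshold else None
```

For example, a position covered by 16 reads of A and 4 of C at depth 20 passes (80%), while 7 reads of A and 3 of C at depth 10 fails the 75% requirement and would cause the position to be flagged.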
Gal4 Functional Selections.
Gal4 DBD PALS libraries were transformed into chemically competent S. cerevisiae strain PJ69-4alpha prepared using a modified LiAc-PEG protocol, as previously described [9, 37]. After transformation, cells were allowed to recover for 80 minutes at 30° C. with shaking at 250 rpm. To select for transformants, cultures were spun down at 2000×g for 3 min, resuspended and grown overnight at 30° C. in 40 ml SC media lacking uracil. Plating 0.25% of the recovery culture prior to outgrowth indicated a library titer of ~2×10^5 transformants. Following overnight outgrowth, glycerol stocks were prepared from the transformation culture and stored at −80° C.
Frozen stocks of yeast carrying the Gal4 DBD PALS library were thawed and recovered overnight in 50 ml SC media lacking uracil. An aliquot of 1 ml (~1.8×10^6 cells) was pelleted and frozen as the baseline input sample, and equal aliquots were used to inoculate each of four 40 ml cultures of SC media either lacking uracil (nonselective) or lacking both uracil and histidine and optionally containing the competitive inhibitor 3-AT (selective,
Input and post-selection cultures were pelleted at 16000×g and frozen at −20° C. Gal4 plasmids were recovered by spheroplast preparation and alkaline lysis miniprep using the Yeast Plasmid Miniprep II kit as directed (Zymo Research). Two-stage PCR was then used to amplify and prepare sequencing libraries to count the plasmid-tagging tags. In the first step, 2.5 ul of miniprep product was used as template in 25 ul reactions with KHF RM HS, with primers flanking the tag cassette (“GAL4_BC_AMP_F”/“GAL4_BC_AMP_R”) (see
Gal4 Enrichment Scores.
Tag reads were demultiplexed to the corresponding sample using a 9 bp index read, allowing for up to two mismatches. Tag reads lacking the proper flanking sequences or containing ambiguous ‘N’ base calls were discarded, and per-barcode histograms were prepared by counting the number of occurrences of each of the remaining tags. Tags were required to exactly match the tag of a single subassembled haplotype, and were then normalized to account for differing coverage over each library by dividing by the sum of tag counts.
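As an illustration of the demultiplexing and normalization steps just described, the following Python sketch assigns index reads to samples with up to two mismatches and converts raw tag counts to per-library fractions. The function names and the data layout are hypothetical. Treating an index read that matches more than one sample as unassigned is an assumption, since the text does not say how such cases were handled.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index_read, sample_indexes, max_mismatches=2):
    """Return the sample whose index is within max_mismatches of the read.

    sample_indexes: dict mapping sample name -> expected index sequence.
    Returns None when no sample, or more than one sample, matches (assumption).
    """
    hits = [name for name, index in sample_indexes.items()
            if hamming(index_read, index) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


def normalized_tag_counts(tag_reads, known_tags):
    """Count tags that exactly match a subassembled tag and divide by their sum."""
    counts = Counter(tag for tag in tag_reads
                     if "N" not in tag and tag in known_tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()} if total else {}
```

Dividing by the summed tag count puts libraries sequenced to different depths on a common scale before mutants are compared.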
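An effect score was calculated for each amino acid mutation by summing the read counts of tags corresponding to all the subassembled clones carrying that mutation as a singleton, dividing by the equivalent sum for wild-type clones, and taking the log-ratio between the selection and input samples, as shown in Equation 1 below:
$$\log_2 E = \log_2\left(\frac{\sum_{j \in \mathrm{MUT}} r_{\mathrm{SEL},j} \,/\, \sum_{j \in \mathrm{WT}} r_{\mathrm{SEL},j}}{\sum_{j \in \mathrm{MUT}} r_{\mathrm{INPUT},j} \,/\, \sum_{j \in \mathrm{WT}} r_{\mathrm{INPUT},j}}\right) \qquad (1)$$
where rSEL,j and rINPUT,j are the read counts of tag j in the selected and input samples, respectively, MUT is the set of tags linked to clones carrying the mutation as a singleton, and WT is the set of tags linked to wild-type clones.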
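A minimal Python sketch of this calculation follows, written from the verbal definition: the ratio of summed mutant to summed wild-type tag counts in the selected sample, divided by the same ratio in the input sample, on a log2 scale. The function name and the dictionary-based inputs are hypothetical. Returning None when any of the four sums is zero is an assumption, since the text does not describe how empty counts were treated.

```python
import math


def effect_score(mutant_tags, wildtype_tags, selected_counts, input_counts):
    """log2 enrichment of one mutation relative to wild type.

    mutant_tags / wildtype_tags: iterables of tag sequences linked by
    subassembly to single-mutant and wild-type clones, respectively.
    selected_counts / input_counts: dicts mapping tag -> read count.
    """
    sel_mut = sum(selected_counts.get(t, 0) for t in mutant_tags)
    sel_wt = sum(selected_counts.get(t, 0) for t in wildtype_tags)
    in_mut = sum(input_counts.get(t, 0) for t in mutant_tags)
    in_wt = sum(input_counts.get(t, 0) for t in wildtype_tags)
    if 0 in (sel_mut, sel_wt, in_mut, in_wt):
        return None  # assumption: score is undefined without counts
    return math.log2((sel_mut / sel_wt) / (in_mut / in_wt))
```

Because the score is a ratio of ratios, any per-library normalization constant cancels, so raw or normalized counts give the same value. For example, a mutant at half the wild-type abundance after selection, but at equal abundance in the input, scores -1.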
Evolutionarily conserved residues in Zn2/Cys6 domains were identified by querying HHblits with Gal4 residues 1-70 [38], and were displayed using Weblogo [39]. To compare core and outward-facing residues within the dimerization helix, residues 51-65 were each scored for distance to the overall structure's solvent-exposed surface predicted using MSMS [43] (using the Gal4(1-100) crystal structure, PDB accession 3COQ). Residues with above-median distance to the surface were considered ‘core’, and those with below-median distance were considered ‘exposed’, and the log2 E values of the two subsets were compared by the Mann-Whitney U test.
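The median split and rank comparison can be sketched as follows. This is an illustrative outline only: the function names and input layout are hypothetical, and the U statistic is computed by direct pair counting rather than by the software actually used. No p-value is derived here. In practice a statistics library such as SciPy (`scipy.stats.mannwhitneyu`) would supply one.

```python
from statistics import median


def split_core_exposed(distance_by_residue):
    """Partition residues by distance to the solvent-exposed surface.

    distance_by_residue: dict mapping residue number -> distance.
    Residues exactly at the median fall in neither group, following the
    'above-median' and 'below-median' wording of the text.
    """
    med = median(distance_by_residue.values())
    core = [r for r, d in distance_by_residue.items() if d > med]
    exposed = [r for r, d in distance_by_residue.items() if d < med]
    return core, exposed


def mann_whitney_u(x, y):
    """U statistic for sample x: pairs with x_i > y_j count 1, ties count 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in x for b in y)
```

The two U statistics, one per sample, always sum to the product of the sample sizes, which provides a quick consistency check.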
Gal4 Validations.
For qualitative validation of Gal4 missense mutation effects, specific alleles (C14Y, K17E, K25W, K25P, L32P, K43P, K45I, and V57M) were individually introduced into p416CYC-Gal4Wt-1-196 using the QuikChange mutagenesis kit (Agilent) following the manufacturer's directions. Mutant colonies were miniprepped and verified by capillary sequencing, and transformed into PJ69-4alpha by LiAc treatment. Following transformation, a single yeast colony transformed by mutant or wild-type Gal4 constructs was picked and expanded in overnight culture, then back-diluted to OD 0.2 and allowed to return to mid-log phase before spotting ten-fold dilutions starting with an equal number of cells onto nonselective plates (SC lacking uracil) or selective plates (SC lacking uracil and histidine, supplemented with 5 mM 3-AT).
Results
As a proof-of-principle, a PALS library was first constructed for the DNA-binding domain (DBD) of Gal4, an archetypal yeast transcription factor. Each codon of the DBD (residues 2-65) was targeted for replacement either by the yeast-optimized codon for each of the 19 other amino acids, or by a premature STOP. The Gal4 PALS library was cloned into a yeast expression vector, followed by tagging and subassembly requiring a minimum coverage of three reads at each nucleotide across the entire cloned ORF. Of the resulting sequence-verified consensus haplotypes, ~47% carried one and only one programmed mutation on an otherwise wild-type background (Table 1). Among these “clean” clones, 99.9% (n=1,342) of the programmed single-codon replacements were observed at least once and 99.8% were observed at least five times. In addition, the ability of PALS to program more complex mutations was investigated by including a tiling set of in-frame deletion variants targeting each codon; all single-codon deletions were found within the resulting library. To validate the accuracy of these subassemblies, 40 clones were randomly picked and capillary sequencing was performed on the mutagenized gene insert and its accompanying tag, confirming the subassembly-derived haplotype without any additional mutations (
To assess the scalability of this approach to full-length human genes, PALS mutagenesis was performed on the entire coding sequence of the human tumor suppressor p53. In contrast to Gal4, for which each mutant codon was explicitly specified, p53 codons were targeted for replacement by degenerate (“NNN”) triplets, reducing the number of required microarray features to the total number of codons (393 for p53) and allowing access to synonymous variants. Given its greater length, there was greater potential for incidental secondary mutations in p53 due to PCR error or chimerism. Accordingly, a lower rate of sequence-verified single-mutant haplotypes than for Gal4 (27%, n=177,841) was observed. Despite the lower purity of this library and despite sequencing fewer clones per residue, 93.4% (n=7,345) of the desired amino acid substitutions were still observed as clean, single-mutant clones.
The uniformity and coverage of mutations introduced by PALS was examined using these full-length clone sequences. For comparison of performance, a random mutagenesis library constructed by randomized doped oligonucleotide synthesis of a 102 amino-acid fragment of the mouse E3 ubiquitin ligase gene Ube4b [8] was concurrently analyzed. This library comprised 1.12 million full-length clone sequences, of which 16.6% contained a single codon mutation. Codon substitutions requiring 2-bp and 3-bp changes, which were abundantly represented within PALS libraries, were almost entirely absent from the random mutagenesis library at a comparable depth of coverage from sequenced clones (
To demonstrate the utility of PALS mutagenesis for deep mutational scanning, the Gal4 DBD library was subjected to selection for its ability to transcriptionally activate a yeast reporter gene. The mutagenized Gal4 DBD library was cloned into a low-level expression vector along with the wild-type sequence encoding an additional 131 amino acids. The resulting 196 amino-acid N-terminal fragment retains the same DNA-binding specificity as full-length Gal4, and is sufficient for transcriptional activation [17], but it lacks the cellular toxicity due to expression of full-length Gal4, which is likely caused by sequestration of the transcriptional machinery [18]. The Gal4(1-196) PALS library was transformed into the yeast two-hybrid reporter strain PJ69-4alpha [19], which is deleted for GAL4 and has expression of the HIS3 gene under the control of the GAL1 promoter. Thus, growth of yeast on media lacking histidine was conditional upon the ability of the introduced Gal4 allele to bind to and activate HIS3. Selection stringency was modulated by addition of 3-amino-1,2,4-triazole (3-AT), a competitive inhibitor of His3. After selection for Gal4 function, deep sequencing was performed to quantify the enrichment or depletion of each Gal4 mutant. Rather than resequencing the mixed population of full-length inserts, the 16-mer tags, which are individually associated with full-length inserts via subassembly, were sequenced and their frequencies tabulated.
296.5 million tag reads were collected across the input library and six selection time points (
The resulting profile of functional constraint (
To validate these effect size measurements, eight individual alleles were re-created by conventional site-directed mutagenesis and assayed for growth defects by a spotting assay (
Superimposed on the crystal structure of Gal4 residues 1-100 [22] (
The linker (residues 41-50) tracks alongside the DNA major groove, making extensive contacts with the negatively charged backbone. A bend at proline 48 aids in positioning the dimerization helix over the DNA minor groove [21], and notably, either of two nearby lysine residues within the linker, K43 and K45, could be mutated to proline without deleterious effects and possibly with a marginal increase in activity (
The strategy presented here enables near-comprehensive, single amino acid mutagenesis of a protein-coding sequence in a single reaction, yielding a library that is readily compatible with massively parallel functional analysis. By using primers synthesized in parallel on DNA microarrays, PALS reduces reagent costs by nearly two orders of magnitude compared with previous approaches that require individual oligonucleotide synthesis (
Other functional screens exploiting microarray-derived oligonucleotide libraries have been limited to dense mutagenesis of relatively short sequence elements, due to the length constraints of microarray synthesis (100-200 nt). PALS overcomes this constraint by combining microarray synthesis of short primers with highly multiplexed overlap extension PCR using a wild-type template.
The suitability of PALS libraries for deep mutational scanning was demonstrated by profiling the functional landscape of the Gal4 DBD. PALS provided near complete coverage of codon replacements requiring 2 or 3-bp changes as well as in-frame codon deletions, mutations that would be essentially impossible to obtain at appreciable frequency with randomized mutagenesis strategies. Given its ability to incorporate multibase mutations including indels, PALS could be adapted to other types of screens, for instance to create tiling deletions of long cis-regulatory elements or to recode multiple adjacent codons.
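To make the distinction between 1-bp, 2-bp and 3-bp codon replacements concrete, the following Python sketch computes the minimum number of base changes needed to convert a wild-type codon into any codon for a target amino acid, using the standard genetic code. The function names are hypothetical. Counting only the 19 alternative amino acids (excluding STOP) in the fraction is an assumption, and the sketch is not claimed to reproduce any specific percentage reported here.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def min_base_changes(codon, target_aa):
    """Fewest single-base changes turning codon into a codon for target_aa."""
    return min(sum(a != b for a, b in zip(codon, other))
               for other, aa in CODON_TABLE.items() if aa == target_aa)


def single_base_fraction(coding_sequence):
    """Fraction of amino acid substitutions reachable by one base change.

    Assumption: each codon has 19 possible substitutions (STOP excluded).
    """
    reachable = total = 0
    for i in range(0, len(coding_sequence) - 2, 3):
        codon = coding_sequence[i:i + 3]
        wild_type = CODON_TABLE[codon]
        for aa in set(AMINO_ACIDS) - {wild_type, "*"}:
            total += 1
            reachable += min_base_changes(codon, aa) == 1
    return reachable / total
```

For a methionine codon (ATG), only six of the 19 alternative amino acids are one base change away, and tryptophan requires two. Random point mutagenesis therefore leaves many substitutions unsampled, whereas a programmed codon replacement reaches them directly.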
In addition to broadening the scope of sequence variation addressable by large-scale screens, PALS libraries had a lower overall fraction of indel-bearing clones compared to libraries constructed from doped oligonucleotides (13.2%-18.2% versus 28.6%). The resulting improvement in efficiency will be beneficial as deep mutational scanning studies move from being strictly in vitro (e.g., using phage display) into yeast [9] or mammalian tissue culture models. Such studies would ideally use site-specific chromosomal integration, but it remains technically challenging to integrate highly complex libraries, putting a premium on generating as few wasted clones as possible.
There remains room for future improvement towards the goal of ‘pure’ libraries of single-mutant clones. Secondary mutations appear to be dominated by PCR chimeras and synthesis errors. These factors were estimated to account for 52% and 24% of the secondary mutations in the Gal4 DBD library, respectively, by counting clones bearing two programmed mutations, or one programmed mutation and secondary mutations within the boundaries of the corresponding mutagenic primer. Chimerism is a technical challenge commonly encountered while amplifying libraries of homologous sequences [31], arising when incomplete strand extension products in one cycle of amplification act as primers in the subsequent cycle. Future optimization efforts will be directed at quantifying and mitigating this phenomenon by manipulating input template concentration and minimizing amplification cycles, or alternatively using droplet PCR [32]. To reduce the impact of synthesis errors, PALS uses short oligonucleotides (90 nt), but it will nevertheless benefit from ongoing developments in high-fidelity synthesis [33]. In addition, as single-base deletions are the dominant synthesis error mode [34], stringently size-selecting primer libraries may further enrich for primers lacking undesirable secondary mutations. Another strategy would fuse libraries in-frame to a selectable marker in the bacterial cloning host, although preliminary observations suggest that such selection is inefficient for proteins that do not fold or express well in E. coli.
The combination of PALS mutagenesis and tag-directed subassembly sequencing provides an efficient way to quantify the functional impacts of specific variant alleles within a population of cDNA clones being subjected to multiplex functional analysis. In particular, the sequencing of short tags, each of which is unambiguously associated with a single mutant haplotype, reduces the required sequencing effort, as a short, single-end read contributes a single count for its corresponding mutant haplotype. By contrast, shotgun methods must uniformly cover the entire target gene with reads in order to measure a single count, increasing the cost and introducing additional sampling variability. Sequencing tags rather than inserts also mitigates the impact of sequencing errors, which would otherwise be falsely counted as novel alleles. Even the more accurate sequencing platforms currently available still suffer from considerable error rates near the ends of reads (e.g., the Illumina MiSeq reads used in this study had per-base error rates of ˜2% after 200 bp), necessitating aggressive trimming to avoid encountering sequencing-derived mutations. Emerging long-read sequencing platforms such as Pacific Biosciences or nanopore sequencing may replace subassembly for the pairing of tags with clone inserts, but deep tag counting rather than insert sequencing is likely to remain the most straightforward and accurate method of quantifying effect sizes in deep mutational scans.
PALS mutagenesis holds promise for future deep mutational scans of protein-coding genes, both for basic structure-function studies and for classifying clinically observed alleles as pathogenic or benign, i.e. “pre-measuring” the consequences of variants of uncertain significance before they are observed in a germline or cancer genome. Comprehensively surveying all of codon mutation space, even for replacements that are unlikely to occur naturally, may go beyond identifying key residues to help illuminate potential functional mechanisms or sites of post-translational modification. For proteins that are challenging to crystallize, such as ion channels, structural inferences could be made directly from these scans, supplementing co-evolutionary contact probability models [33]. In sum, this approach, which couples massively parallel synthesis and sequencing to functional selection, provides a general framework to dissect the allelic heterogeneity of human oligogenic disorders and a path toward functional annotation of the rapidly growing catalogs of variants of unknown significance.
The references, patents and published patent applications listed below, and all references cited in the specification above are hereby incorporated by reference in their entirety, as if fully set forth herein.
This application claims priority to U.S. Provisional Patent No. 62/025,936, filed Jul. 17, 2014, the subject matter of which is hereby incorporated by reference as if fully set forth herein.
Number | Date | Country
---|---|---
62025936 | Jul 2014 | US