The Sequence Listing for this application is labeled “UHK273X.xml”, which was created on May 9, 2023 and is 95,761 bytes. The content of the sequence listing is incorporated herein by reference in its entirety.
The CRISPR-associated protein 9 (Cas9) has become an important tool for genome editing. In the CRISPR system, Cas9 specificity is guided by a single guide RNA (sgRNA) complementary to the target DNA site, while a protospacer adjacent motif (PAM) lying proximal to the target DNA site is required for sequence-specific recognition. In particular, Streptococcus pyogenes Cas9 (SpCas9) is a well-characterized enzyme that is widely used for genome editing because of its high average editing efficiency and its short PAM, 5′-NGG-3′, which is advantageous for broader genome editing applications.
However, there are concerns that its relatively high off-target effects may reduce editing accuracy. Previous studies have modified SpCas9 to optimize editing accuracy and reduce constraints on PAM recognition1-10. Nevertheless, it is very challenging to reduce the large size of SpCas9, which limits its application in in vivo genome editing, where adeno-associated viruses, which have a packaging limit of about 4.5 kb, are commonly used for clinical gene therapy.
Therefore, researchers have turned to the better characterized smaller Cas9 variants with activities comparable to SpCas9, such as the Staphylococcus aureus Cas9 (SaCas9)11. Although SaCas9 is desirable for the packaging of genetic therapeutics, it also has certain drawbacks, such as a longer PAM, 5′-NNGRRT-3′, and reduced genome coverage, leaving room for improvement in activity and specificity.
At present, most of the optimized Cas9 variants possess 2 to 7 mutations spanning multiple protein domains1-9,12-15, and each unique mutation combination has contributed to comparable performance and editing fidelity. For example, >30 and >17 different amino-acid sites are engineered among the >13 SpCas9 and >8 SaCas9 variants, respectively. Nevertheless, these represent only a small proportion of the amino-acid sites interacting with the sgRNA-DNA complex16,17, each of which is a potential candidate for optimization. However, a systematic experimental screen across all the candidate amino-acid positions to identify the best-performing Cas9 variants is both labor-intensive and prohibitively expensive. For instance, antibody maturation and viral capsid diversification involve full saturation mutagenesis at 9 to 28 amino-acid sites. The number of variants to be evaluated at that scale far exceeds what is experimentally feasible, even with massively parallel experiments.
Machine learning (ML) is advantageous in reducing the burden of experimental screening in protein engineering, and in silico screens have shown great success in identifying high-performance variants of enzymes18, optogenetic proteins19, binders20, and viral capsids21. Previous studies have shown that the ML approach allows reliable prediction of the fitness of a full virtual library covering 10^5 to 10^12 variants based on a small sub-sample of empirical fitness data of 10^3 to 10^4 variants or even fewer20,22. Aiming to minimize screening efforts, ML-guided approaches such as the machine learning-assisted approach to directed evolution (MLDE)23,24 extrapolate in silico from the experimentally determined fitness of a small sample of variants from a combinatorial mutant library to predict the fitness of the full variant space covered by the multi-site saturation mutagenesis library. Moreover, such an approach is highly compatible with existing screening platforms, which use fluorescence-activated cell sorting and next-generation sequencing as readouts, making it possible to evaluate the functionality of protein variants in a pooled library setting.
Although prior studies have focused on modifying the PAM-interacting (PI) domain around the PAM duplex region14, there has been little investigation of modifications of the WED domain of SaCas9.
There continues to be a need in the art for improved Cas9 protein, and/or improved designs and techniques for methods and systems for a machine learning guided approach to meet the challenges of the optimization of Cas9.
In certain embodiments, the subject invention pertains to a Cas9 protein, according to SEQ ID NOs: 3 or 4 with an amino acid mutation at residues 888, 889, or a combination thereof of a WED domain and/or residues 988, 989, or a combination thereof of a PI domain. In certain embodiments, the mutation at residue 888 is N to Q, according to SEQ ID NO: 40; the mutation at residue 888 is N to Q and at residue 889 is A to S, according to SEQ ID NO: 41; the mutation at residue 888 is N to H and at residue 889 is A to Q, according to SEQ ID NO: 42; the mutation at residue 888 is N to S and at residue 889 is A to Q, according to SEQ ID NO: 43; the mutation at residue 888 is N to R and at residue 889 is A to Q, according to SEQ ID NO: 44; and/or the mutation at residue 888 is N to G, according to SEQ ID NO: 50. The subject invention can further pertain to a Cas9 protein with mutations at amino acid positions N986, D987, L988, L989, or any combination thereof.
Embodiments of the subject invention pertain to machine learning assisted methods and systems for engineering activity-enhanced Staphylococcus aureus Cas9's KKH variants for genome editing.
According to an embodiment of the subject invention, a method of machine learning-based in silico screens for genome editing is provided. The method comprises populating a predictive machine learning model with an input dataset comprising empirical measurements of on-target activities of sgRNAs, running the predictive machine learning model with predefined parameters, and evaluating performance of the predictive machine learning model. Moreover, enrichment scores of the empirical measurements are min-max normalized to scaled fitness scores ranging between 0 and 1. The input dataset includes different percentages of the empirical measurements, generated to test the minimal number of inputs needed for effective selection of top variants from the predictions of the machine learning model. Populating the predictive machine learning model with an input dataset further comprises generating a plurality of replicates of the input dataset based on a randomized selection scheme or a diverse selection scheme for variants. Generating the plurality of replicates based on the randomized selection scheme comprises randomly selecting a pre-defined number of enrichment scores. Generating the plurality of replicates based on the diverse selection scheme comprises repeating random sampling of variants with available enrichment scores until no variants sharing more than p 1-mismatch neighbors and q 2-mismatch neighbors are present in the input dataset. Further, the predefined parameters comprise Bepler and Georgiev embeddings of full-length amino-acid sequences of SpCas9 (UniProtKB—Q99ZW2 (CAS9_STRP1)) (SEQ ID NO: 3) and SaCas9 (UniProtKB—J7RUA5 (CAS9_STAAU)) (SEQ ID NO: 4) substituted with the designated variant's amino-acid residue combination. The performance of the predictive machine learning model includes precision, specificity, and sensitivity of the embeddings of the predictive machine learning model.
In addition, the evaluating performance of the predictive machine learning model comprises counting numbers of true positives, true negatives, false positives, and false negatives for each result and deriving metrics of the performance of the predictive machine learning model based on the numbers counted.
In another embodiment of the subject invention, a method combining machine learning-based in silico screens for genome editing with downstream structure-guided rational design is provided. The method comprises populating a predictive machine learning model with an input dataset comprising empirical measurements of on-target activities of sgRNAs, running the predictive machine learning model with predefined parameters, evaluating performance of the predictive machine learning model, constructing a plasmid, culturing and transducing cells, conducting fluorescent protein disruption assays, performing immunoblot analysis, performing a T7 endonuclease I assay, performing GUIDE-seq, and performing molecular dynamics simulations on the variants.
In certain embodiments of the subject invention, a computer program product is provided and comprises a non-transitory computer-executable storage device having computer readable program instructions embodied thereon that when executed by a computer cause the computer to perform machine learning-based in silico screens for genome editing. The computer-executable program instructions comprise populating a predictive machine learning model with an input dataset comprising empirical measurements of on-target activities of sgRNAs, running the predictive machine learning model with predefined parameters, and evaluating performance of the predictive machine learning model. Moreover, enrichment scores of the empirical measurements are min-max normalized to scaled fitness scores ranging between 0 and 1. The input dataset includes different percentages of the empirical measurements, generated to test the minimal number of inputs needed for effective selection of top variants from the predictions of the machine learning model. Populating the predictive machine learning model with an input dataset further comprises generating a plurality of replicates of the input dataset based on a randomized selection scheme or a diverse selection scheme for variants. Generating the plurality of replicates based on the randomized selection scheme comprises randomly selecting a pre-defined number of enrichment scores. Generating the plurality of replicates based on the diverse selection scheme comprises repeating random sampling of variants with available enrichment scores until no variants sharing more than p 1-mismatch neighbors and q 2-mismatch neighbors are present in the input dataset. The predefined parameters comprise Bepler and Georgiev embeddings of full-length amino-acid sequences of SpCas9 (UniProtKB—Q99ZW2 (CAS9_STRP1)) (SEQ ID NO: 3) and SaCas9 (UniProtKB—J7RUA5 (CAS9_STAAU)) (SEQ ID NO: 4) substituted with the designated variant's amino-acid residue combination.
The performance of the predictive machine learning model includes precision, specificity, and sensitivity of the embeddings of the predictive machine learning model. The evaluating performance of the predictive machine learning model comprises counting numbers of true positives, true negatives, false positives, and false negatives for each result and deriving metrics of the performance of the predictive machine learning model based on the numbers counted.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
SEQ ID NO: 1: dsODN oligonucleotide
SEQ ID NO: 2: dsODN oligonucleotide
SEQ ID NO: 3: SpCas9 (UniProtKB—Q99ZW2 (CAS9_STRP1)) amino acid sequence
SEQ ID NO: 4: SaCas9 (UniProtKB—J7RUA5 (CAS9_STAAU)) amino acid sequence
SEQ ID NO: 5: GFPsg1 protospacer
SEQ ID NO: 6: GFPsg2 protospacer
SEQ ID NO: 7: GFPsg3 protospacer
SEQ ID NO: 8: GFPsg4 protospacer
SEQ ID NO: 9: GFPsg5 protospacer
SEQ ID NO: 10: GFPsg6 protospacer
SEQ ID NO: 11: GFPsg7 protospacer
SEQ ID NO: 12: GFPsg8 protospacer
SEQ ID NO: 13: EMX1_sg1 protospacer
SEQ ID NO: 14: EMX1_sg4 protospacer
SEQ ID NO: 15: EMX1_sg6 protospacer
SEQ ID NO: 16: EMX1_sg10 protospacer
SEQ ID NO: 17: EMX1_sg2 protospacer
SEQ ID NO: 18: EMX1_sg7 protospacer
SEQ ID NO: 19: VEGFA_sg8 protospacer
SEQ ID NO: 20: AAVS1_sg4 protospacer
SEQ ID NO: 21: CCR5_sg2 protospacer
SEQ ID NO: 22: EMX1_sg1 forward primer
SEQ ID NO: 23: EMX1_sg1 reverse primer
SEQ ID NO: 24: EMX1_sg4 forward primer
SEQ ID NO: 25: EMX1_sg4 reverse primer
SEQ ID NO: 26: EMX1_sg6 forward primer
SEQ ID NO: 27: EMX1_sg6 reverse primer
SEQ ID NO: 28: EMX1_sg10 forward primer
SEQ ID NO: 29: EMX1_sg10 reverse primer
SEQ ID NO: 30: EMX1_sg2 forward primer
SEQ ID NO: 31: EMX1_sg2 reverse primer
SEQ ID NO: 32: EMX1_sg7 forward primer
SEQ ID NO: 33: EMX1_sg7 reverse primer
SEQ ID NO: 34: VEGFA_sg8 forward primer
SEQ ID NO: 35: VEGFA_sg8 reverse primer
SEQ ID NO: 36: AAVS1_sg4 forward primer
SEQ ID NO: 37: AAVS1_sg4 reverse primer
SEQ ID NO: 38: CCR5_sg2 forward primer
SEQ ID NO: 39: CCR5_sg2 reverse primer
SEQ ID NO: 40: Cas9 Protein with the N888Q mutation
SEQ ID NO: 41: Cas9 Protein with the N888Q and A889S mutations
SEQ ID NO: 42: Cas9 Protein with the N888H and A889Q mutations
SEQ ID NO: 43: Cas9 Protein with the N888S and A889Q mutations
SEQ ID NO: 44: Cas9 Protein with the N888R and A889Q mutations
SEQ ID NO: 45: Nucleotide sequence encoding Cas9 Protein with the N888Q mutation
SEQ ID NO: 46: Nucleotide sequence encoding Cas9 Protein with the N888Q and A889S mutations
SEQ ID NO: 47: Nucleotide sequence encoding Cas9 Protein with the N888H and A889Q mutations
SEQ ID NO: 48: Nucleotide sequence encoding Cas9 Protein with the N888S and A889Q mutations
SEQ ID NO: 49: Nucleotide sequence encoding Cas9 Protein with the N888R and A889Q mutations
SEQ ID NO: 50: Cas9 Protein with the N888G mutation
SEQ ID NO: 51: Nucleotide sequence encoding Cas9 Protein with the N888G mutation
Embodiments of the subject invention are directed to machine learning assisted methods and systems for engineering activity-enhanced Staphylococcus aureus Cas9's KKH variants for genome editing.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items. As used herein, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well as the singular forms, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, and/or groups thereof.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one having ordinary skill in the art to which this invention pertains. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the present disclosure and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
When the term “about” is used herein, in conjunction with a numerical value, it is understood that the value can be in a range of 90% of the value to 110% of the value, i.e. the value can be +/−10% of the stated value. For example, “about 1 kg” means from 0.90 kg to 1.1 kg.
The term “nucleic acid” or “polynucleotide” refers to deoxyribonucleic acids (DNA) or ribonucleic acids (RNA) and polymers thereof in either single- or double-stranded form. Unless specifically limited, the term encompasses nucleic acids containing known analogs of natural nucleotides that have similar binding properties as the reference nucleic acid and are metabolized in a manner similar to naturally occurring nucleotides. Unless otherwise indicated, a particular nucleic acid sequence also implicitly encompasses conservatively modified variants thereof (e.g., degenerate codon substitutions), alleles, orthologs, single nucleotide polymorphisms (SNPs), and complementary sequences as well as the sequence explicitly indicated. Specifically, degenerate codon substitutions may be achieved by generating sequences in which the third position of one or more selected (or all) codons is substituted with mixed-base and/or deoxyinosine residues (Batzer et al., Nucleic Acid Res. 19:5081 (1991); Ohtsuka et al., J. Biol. Chem. 260:2605-2608 (1985); and Rossolini et al., Mol. Cell. Probes 8:91-98 (1994)). The term nucleic acid is used interchangeably with gene, cDNA, and mRNA encoded by a gene.
In this application, the terms “polypeptide”, “peptide”, and “protein” are used interchangeably herein to refer to a polymer of amino acids. The terms apply to amino acid polymers in which one or more amino acid residues are artificial chemical mimetic of a corresponding naturally occurring amino acids, as well as to naturally occurring amino acid polymers and non-naturally occurring amino acid polymers. As used herein, the terms encompass amino acid chains of any length, including full-length proteins, wherein the amino acid residues are linked by covalent peptide bonds.
As used herein, the terms “identical” or percent “identity”, in the context of describing two or more polynucleotide or amino acid sequences, refer to two or more sequences or subsequences that are the same or have a specified percentage of amino acid residues or nucleotides that are the same (for example, a variant protein used in the method of this invention has at least 80% sequence identity, preferably 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identity, to a reference sequence), when compared and aligned for maximum correspondence over a comparison window, or designated region as measured using one of the following sequence comparison algorithms or by manual alignment and visual inspection. Such sequences are then said to be “substantially identical”. With regard to polynucleotide sequences, this definition also refers to the complement of a test sequence. The comparison window, in certain embodiments, refers to the full-length sequence of a given polypeptide, for example a specific enzyme.
In describing the invention, it will be understood that a number of techniques and steps are disclosed. Each of these has individual benefits and each can also be used in conjunction with one or more, or in some cases all, of the other disclosed techniques. Accordingly, for the sake of clarity, this description will refrain from repeating every possible combination of the individual steps in an unnecessary fashion. Nevertheless, the specification and claims should be read with the understanding that such combinations are entirely within the scope of the invention and the claims.
Machine learning (ML) can be applied to a focused library derived from structure-guided design. Such a focused library generally targets multiple sites that are key to the protein functionality, for example, eight sites for SpCas9 optimization, with deliberate mutations that are restricted to a few residues per site. It is demonstrated that the ML-based in silico screens are efficient and accurate in independent Cas9 optimization tasks, resulting in a reduction of the wet-lab labor by as much as 90%. Further, activities of SaCas9 are boosted whilst broader PAM specificities are obtained. The modifications based on the E782K/N968K/R1015H SaCas9 variant (KKH-SaCas9) lead to activities comparable with wild-type SaCas9 and recognition of an expanded PAM 5′-NNNRRT-3′13.
By combining ML-based and combinatorial mutagenesis screens with downstream structure-guided rational design and wet-lab validations, changes in the WED domain can provide stronger interactions with the PI domain, thereby increasing the DNA-binding ability of the KKH-SaCas9 protein. The results reveal that modifications of the WED domain may enhance the protein's activity more often than changes in the PI domain. In addition, the same set of mutations can be tested with a high-fidelity SaCas9 variant, KKH-SaCas9-SAV2, indicating that the mutations may have wide applications. The workflow and associated parameters of the ML approach can be configured to maximize its effectiveness in subsequent screens for engineering other components of the Cas9 system and for gene editing.
In one embodiment, a method of machine learning-based in silico screens for genome editing is provided. The method comprises populating a predictive machine learning model with an input dataset comprising empirical measurements of on-target activities of sgRNAs; running the predictive machine learning model with predefined parameters; and evaluating performance of the predictive machine learning model. The enrichment scores of the empirical measurements are min-max normalized to scaled fitness scores ranging between 0 and 1. The input dataset includes different percentages of the empirical measurements, generated to test the minimal number of inputs needed for effective selection of top variants from the predictions of the machine learning model. Populating the predictive machine learning model with an input dataset further comprises generating a plurality of replicates of the input dataset based on a randomized selection scheme or a diverse selection scheme for variants. Generating the plurality of replicates based on the randomized selection scheme comprises randomly selecting a pre-defined number of enrichment scores. Generating the plurality of replicates based on the diverse selection scheme comprises repeating random sampling of variants with available enrichment scores until no variants sharing more than p 1-mismatch neighbors and q 2-mismatch neighbors are present in the input dataset. The predefined parameters comprise Bepler and Georgiev embeddings of full-length amino-acid sequences of SpCas9 (UniProtKB—Q99ZW2 (CAS9_STRP1)) (SEQ ID NO: 3) and SaCas9 (UniProtKB—J7RUA5 (CAS9_STAAU)) (SEQ ID NO: 4) substituted with the designated variant's amino-acid residue combination. The performance of the predictive machine learning model includes precision, specificity, and sensitivity of the embeddings of the predictive machine learning model.
The evaluating performance of the predictive machine learning model comprises counting numbers of true positives, true negatives, false positives, and false negatives for each result and deriving metrics of the performance of the predictive machine learning model based on the numbers counted.
In another embodiment, a method combining machine learning-based in silico screens for genome editing with downstream structure-guided rational design is provided. The method comprises populating a predictive machine learning model with an input dataset comprising empirical measurements of on-target activities of sgRNAs; running the predictive machine learning model with predefined parameters; evaluating performance of the predictive machine learning model; constructing a plasmid; culturing and transducing cells; conducting fluorescent protein disruption assays; performing immunoblot analysis; performing a T7 endonuclease I assay; performing GUIDE-seq; and performing molecular dynamics simulations on the variants.
In another embodiment, a computer program product comprising a non-transitory computer-executable storage device having computer readable program instructions embodied thereon that when executed by a computer cause the computer to perform machine learning-based in silico screens for genome editing is provided. The computer-executable program instructions comprise populating a predictive machine learning model with an input dataset comprising empirical measurements of on-target activities of sgRNAs; running the predictive machine learning model with predefined parameters; and evaluating performance of the predictive machine learning model. The enrichment scores of the empirical measurements are min-max normalized to scaled fitness scores ranging between 0 and 1. The input dataset includes different percentages of the empirical measurements, generated to test the minimal number of inputs needed for effective selection of top variants from the predictions of the machine learning model. Populating the predictive machine learning model with an input dataset further comprises generating a plurality of replicates of the input dataset based on a randomized selection scheme or a diverse selection scheme for variants. Generating the plurality of replicates based on the randomized selection scheme comprises randomly selecting a pre-defined number of enrichment scores. Generating the plurality of replicates based on the diverse selection scheme comprises repeating random sampling of variants with available enrichment scores until no variants sharing more than p 1-mismatch neighbors and q 2-mismatch neighbors are present in the input dataset. The predefined parameters comprise Bepler and Georgiev embeddings of full-length amino-acid sequences of SpCas9 (UniProtKB—Q99ZW2 (CAS9_STRP1)) (SEQ ID NO: 3) and SaCas9 (UniProtKB—J7RUA5 (CAS9_STAAU)) (SEQ ID NO: 4) substituted with the designated variant's amino-acid residue combination.
The performance of the predictive machine learning model includes precision, specificity, and sensitivity of the embeddings of the predictive machine learning model. The evaluating performance of the predictive machine learning model comprises counting numbers of true positives, true negatives, false positives, and false negatives for each result and deriving metrics of the performance of the predictive machine learning model based on the numbers counted. The plasmid is obtained by polymerase chain reaction (PCR), restriction enzyme digestion, ligation, one-pot ligation, Gibson assembly, or a combination thereof.
Previously published SpCas9 data8 surveying the on-target activity of sg50N (650 empirical data points), which targets a red fluorescent protein (RFP) sequence, are used as the input data for the MLDE model. The enrichment scores (E-scores) are min-max normalized to scaled fitness scores ranging between 0 and 1.
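The min-max normalization of E-scores can be illustrated with a minimal Python sketch. The function name, the variant labels, and the numeric values below are hypothetical and chosen only for illustration; the original analysis is described as being carried out in R.

```python
def min_max_normalize(e_scores):
    """Linearly rescale raw enrichment scores onto the interval [0, 1]."""
    lo = min(e_scores.values())
    hi = max(e_scores.values())
    if hi == lo:
        # Scaling is undefined when every variant has the same score.
        raise ValueError("all E-scores are identical")
    return {variant: (score - lo) / (hi - lo)
            for variant, score in e_scores.items()}


# Hypothetical E-scores keyed by variant label (illustrative values only).
example_scores = {"variant_A": -1.5, "variant_B": 0.0, "variant_C": 2.5}
fitness = min_max_normalize(example_scores)
```

In this sketch the lowest-scoring variant maps to 0, the highest-scoring variant maps to 1, and every other variant falls proportionally in between.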
In one embodiment, input datasets including 10%, 20%, 50%, and 70% of the empirical measurements are generated to test the minimal number of inputs needed for effective selection of top variants from the MLDE prediction, corresponding to datasets of 65, 130, 325, and 445 empirically measured on-target activities. Three replicates are generated for each size, using either the randomized or the diverse selection scheme for variants. To generate the randomized dataset, the sample_n() function from dplyr in R is used to randomly select the pre-defined number of E-scores. To generate the diverse dataset, random sampling of variants with available E-scores is repeated until no variants sharing more than p 1-mismatch neighbors and q 2-mismatch neighbors are present in the input dataset. The thresholds p and q for each dataset can be found in Table 1 below.
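The two selection schemes can be sketched in Python as follows. This is only an illustration under stated assumptions: variants are represented as equal-length strings of the residues at the mutated sites, a "k-mismatch neighbor" is taken to mean another sampled variant at Hamming distance k, and a sample is accepted only when every variant has at most p 1-mismatch neighbors and at most q 2-mismatch neighbors. The function names, the toy library, the retry limit, and the values of p and q are hypothetical and are not taken from Table 1.

```python
import random
from itertools import combinations, product


def hamming(a, b):
    """Number of positions at which two equal-length residue strings differ."""
    return sum(x != y for x, y in zip(a, b))


def random_subset(variants, n, rng):
    """Randomized scheme: draw n variants uniformly without replacement."""
    return rng.sample(variants, n)


def is_diverse(subset, p, q):
    """Assumed rule: each variant has <= p 1-mismatch and <= q 2-mismatch neighbors."""
    one = dict.fromkeys(subset, 0)
    two = dict.fromkeys(subset, 0)
    for a, b in combinations(subset, 2):
        d = hamming(a, b)
        if d == 1:
            one[a] += 1
            one[b] += 1
        elif d == 2:
            two[a] += 1
            two[b] += 1
    return all(one[v] <= p and two[v] <= q for v in subset)


def diverse_subset(variants, n, p, q, rng, max_tries=10_000):
    """Diverse scheme: redraw random subsets until the neighbor thresholds hold."""
    for _ in range(max_tries):
        subset = rng.sample(variants, n)
        if is_diverse(subset, p, q):
            return subset
    raise RuntimeError("thresholds not met; relax p or q, or raise max_tries")


# Hypothetical toy library: 3 sites with 4 allowed residues each (64 variants).
library = ["".join(combo) for combo in product("AKQR", repeat=3)]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
randomized = random_subset(library, 8, rng)
diverse = diverse_subset(library, 8, p=2, q=4, rng=rng)
```

Redrawing the whole sample is the simplest reading of the repeated-sampling description; an implementation could instead replace only the offending variants, which would typically converge faster for strict thresholds.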
The MLDE model is run according to the default parameters. The Bepler and Georgiev embeddings of the full-length amino-acid sequences of SpCas9 (UniProtKB—Q99ZW2 (CAS9_STRP1)) and SaCas9 (UniProtKB—J7RUA5 (CAS9_STAAU)), substituted with the designated variant's amino-acid residue combination, are applied. The MLDE GenerateEncodings.py script is modified such that it processes a customized input fasta file containing the protein sequences of all the variants designed in the SpCas9 and SaCas9 datasets rather than generating the full set of saturation mutagenesis variants. The MLDE ExecuteMlde.py script is run on the Bepler and Georgiev embeddings with two different sets of parameters, designated parameters 1 and 2. Parameter 1 uses the neural network models such as “OneHidden”, “TwoHidden”, “OneConv” and “TwoConv” available in the MLDE models, each with 20 rounds of hyperparameter optimization, while parameter 2 uses less complex models such as “Linear-Tweedie”, “RandomForestRegressor”, “LinearSVR” and “ElasticNet”, each with 50 rounds of hyperparameter optimization. All other parameters are left at their defaults, including 5-fold cross validation and averaging of the top 3 models to obtain the final prediction results.
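Preparing the customized fasta input can be sketched as below, assuming each variant is defined by residue substitutions at 1-based positions of a reference sequence. The helper names, the short toy reference sequence, and the toy positions are hypothetical stand-ins; the actual full-length sequences are those of SEQ ID NOs: 3 and 4, and this sketch does not reproduce the modified GenerateEncodings.py script.

```python
def apply_substitutions(reference, substitutions):
    """Return the reference with residues replaced at 1-based positions.

    substitutions maps position -> (expected_wild_type, new_residue); the
    wild-type residue is checked to catch off-by-one numbering errors.
    """
    seq = list(reference)
    for pos, (wild_type, new) in substitutions.items():
        if seq[pos - 1] != wild_type:
            raise ValueError(
                f"expected {wild_type} at position {pos}, found {seq[pos - 1]}")
        seq[pos - 1] = new
    return "".join(seq)


def to_fasta(records, width=60):
    """Format (name, sequence) pairs as FASTA text with wrapped sequence lines."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Hypothetical 9-residue toy reference; positions 4 and 5 stand in for 888 and 889.
toy_reference = "MKTNAGLLE"
variant = apply_substitutions(toy_reference, {4: ("N", "R"), 5: ("A", "Q")})
fasta_text = to_fasta([("toy_N4R_A5Q", variant)])
```

Writing one record per designed variant keeps the encoding step limited to the variants actually present in the dataset rather than the full combinatorial space.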
The performance of the parameters and embeddings of the ML algorithm, in terms of precision, specificity, and sensitivity, is then evaluated. In particular, variants with at least 70% of the wild-type activity are assigned as positives and the rest as negatives. Thus, actual positives are variants showing at least 70% of the wild-type activity when empirically tested with the sgRNA; all other variants are actual negatives. For each MLDE result, the predicted positives and negatives are also labelled using the 70% wild-type activity threshold. Then, the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) are counted for each result and the performance metrics are derived according to the formulas below:
Precision = TP / (TP + FP)
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
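The counting step and the derived metrics can be sketched in Python using the standard definitions of precision, sensitivity, and specificity. The function names and the fitness values are hypothetical, and the sketch assumes that the same cutoff (70% of the wild-type fitness) is applied to both measured and predicted values.

```python
import math


def confusion_counts(measured, predicted, cutoff):
    """Count TP, TN, FP, FN by applying one cutoff to measured and predicted fitness."""
    tp = tn = fp = fn = 0
    for variant, actual in measured.items():
        is_positive = actual >= cutoff
        called_positive = predicted[variant] >= cutoff
        if is_positive and called_positive:
            tp += 1
        elif is_positive:
            fn += 1
        elif called_positive:
            fp += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def safe_ratio(numerator, denominator):
    """Return NaN instead of raising when a metric's denominator is zero."""
    return numerator / denominator if denominator else math.nan


def performance(tp, tn, fp, fn):
    return {
        "precision": safe_ratio(tp, tp + fp),
        "sensitivity": safe_ratio(tp, tp + fn),
        "specificity": safe_ratio(tn, tn + fp),
    }


# Hypothetical scaled fitness values; wild-type fitness is taken as 1.0 here.
measured = {"v1": 0.90, "v2": 0.80, "v3": 0.40, "v4": 0.20, "v5": 0.75}
predicted = {"v1": 0.85, "v2": 0.50, "v3": 0.72, "v4": 0.10, "v5": 0.90}
cutoff = 0.7 * 1.0  # 70% of wild-type activity
counts = confusion_counts(measured, predicted, cutoff)
result = performance(*counts)
```

With these toy values, v1 and v5 are true positives, v3 is a false positive, v2 is a false negative, and v4 is a true negative.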
Another performance metric, enrichment, proposed by Sarfati et al.42 is also applied. The enrichment, as determined by the equation below, compares how often the true top 5% of hits are identified when using the ML prediction with how often they are identified by random selection (“the null background”),
where N is the total size of the test set, which in this case is the number of all the variants in the prediction.
The input data handling, statistical analyses and graph plotting are carried out by R programs using packages ggplot2, tidyverse, readxl, Cairo, and stringdist.
2. Plasmid Construction
The plasmids generated from the test results as shown in Table 2 below are obtained by standard molecular cloning techniques such as polymerase chain reaction (PCR), restriction enzyme digestion, ligation, one-pot ligation, or Gibson assembly. Customized oligonucleotides are ordered through Genewiz. Vectors are transformed into E. coli strain DH5α competent cells and selected with ampicillin (for example, 100 mg/ml, USB) or carbenicillin (for example, 50 mg/ml, Teknova). DNAs are extracted and purified by Plasmid Mini (for example, from Takara and Tiangen) or Midi preparation (for example, from QIAGEN) kits and sequences of the vectors are verified by Sanger sequencing.
Next, storage vectors AWp28 (for example, Addgene #73850) and AWp112 are used to assemble the sgRNA chosen to target a specific gene; the sgRNA sequences employed are listed in Table 3 below. Oligonucleotide pairs of the sgRNA target sequences with BbsI sticky ends are then synthesized, annealed, and cloned into the BbsI-digested storage vector using T4 DNA ligase (for example, from New England Biolabs).
To prepare the lentiviral vector for SaCas9 variant expression, the AWp124 vector is modified via Gibson assembly to remove all existing Esp3I enzyme sites. Esp3I sites are then re-introduced flanking the PI and WED regions to allow incorporation of the intended mutations, giving the DTp2 vector. To insert the sgRNA expression cassettes, they are amplified from the storage vector with flanking BamHI and EcoRI (for example, from Thermo Fisher Scientific) sites and ligated with the digested lentiviral vector DTp2. To generate the PI and WED mutations, oligonucleotides carrying the WED domain mutations are pooled at a 1:1 ratio as the forward primer, and the same procedure is applied to the PI domain for the reverse primer. PCR amplifications are carried out with the pooled forward and reverse primers on the original KKH-SaCas9 template to create the pooled mutations. By a one-pot ligation method, the pooled mutations are inserted into the Esp3I sites of DTp2. Moreover, the EFS promoter drives the SaCas9 expression, together with expression of a fluorescent protein from the downstream T2A-BFP. To create SaCas9-KKH-SAV2-plus (DTp47A), Esp3I sites are incorporated into SaCas9-KKH-SAV2 (DTp52) via Gibson assembly, as done for DTp2, and the ‘plus’ mutations, N888R/A889Q, are then inserted by one-pot ligation. For saturation mutagenesis of positions 888 and 889, amplifications are carried out using oligonucleotides designed with ‘NNS’ nucleotides at both positions, and the products are incorporated into the lentivectors with the appropriate gRNAs using a technique similar to that described above.
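The coverage provided by ‘NNS’ degenerate codons (N = A, C, G, or T; S = C or G) at two positions can be checked with a short Python sketch based on the standard genetic code. The variable names are illustrative only, and the sketch enumerates theoretical codon combinations rather than any measured library composition.

```python
from itertools import product

# Standard genetic code, laid out in TCAG order for each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# NNS: any base at positions 1 and 2, and C or G at position 3.
nns_codons = [a + b + c for a, b in product("ACGT", repeat=2) for c in "CG"]

amino_acids = sorted({CODON_TABLE[c] for c in nns_codons} - {"*"})
stop_codons = [c for c in nns_codons if CODON_TABLE[c] == "*"]

# Two adjacent NNS positions, as for residues 888 and 889.
codon_pairs = list(product(nns_codons, repeat=2))
residue_pairs = {
    (CODON_TABLE[x], CODON_TABLE[y])
    for x, y in codon_pairs
    if "*" not in (CODON_TABLE[x], CODON_TABLE[y])
}
```

Each NNS position comprises 32 codons that encode all 20 amino acids with TAG as the only stop codon, so two NNS positions give 1,024 codon combinations covering all 400 possible residue pairs.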
HEK293T cells obtained from American Type Culture Collection (ATCC) and MHCC97L-Luc cells are maintained in Dulbecco's Modified Eagle Medium (DMEM) supplemented with 1×antibiotic-antimycotic and 10% FBS (for example, from Thermo Fisher Scientific). OVCAR8-ADR cells are maintained in RPMI 1640 medium supplemented with 10% FBS (for example, from Gibco). The HEK293T cells are used for lentiviral production for KKH-SaCas9 variant expression and for generating stable cell lines. The OVCAR8-ADR cells are transduced with a pAWp9 vector (for example, Addgene #73851) expressing RFP and GFP genes, driven by the hUbCp and CMV promoters, respectively, for the initial screening of KKH-SaCas9 pooled variants and for further validation. OVCAR8-ADR cells are also transduced with lentiviruses encoding RFP and GFP genes expressed from UBC and CMV promoters, respectively, and a tandem U6 promoter-driven expression cassette of sgRNA targeting the GFP site. For the initial screening, the KKH-SaCas9 variants are expressed with sgRNA targeting GFP using EFS and U6 promoters, respectively, followed by a T2A-BFP to determine KKH-SaCas9 expression. The cells are sorted with a Becton Dickinson BD Influx cell sorter. For the mutational screening, the selected KKH-SaCas9 variants are transduced into the stable OVCAR8-ADR cell lines harboring the GFP and RFP genes and the sgRNA. The MHCC97L-Luc cell lines are transduced to create stable expression of the selected KKH-SaCas9 variants for the T7E1 and GUIDE-seq experiments. The cells are regularly tested and are negative for mycoplasma contamination. Lentivirus production and transduction are carried out as previously described8.
Fluorescent protein disruption assays are conducted to determine DNA cleavage and indel-mediated disruption at the target site of the fluorescent protein, GFP, by the KKH-SaCas9 variants expressed together with the gRNAs, resulting in loss of cell fluorescence. The stable cell lines integrated with the GFP and RFP reporter genes and expressing the SaCas9 variants and sgRNA are washed, then resuspended in 1×PBS supplemented with 2% heat-inactivated FBS, and analyzed with a Becton Dickinson LSR Fortessa Analyzer or an ACEA NovoCyte Quanteon. Cells are gated on forward and side scatter, and at least 1×10^4 cells are recorded per sample for each data set.
Immunoblots are carried out as previously described8. Anti-SaCas9 (for example, 1:1,000, Cell Signaling #85687) and anti-GAPDH (for example, 1:5,000, Cell Signaling #2118) primary antibodies are used, followed by HRP-linked anti-mouse IgG (for example, 1:10,000, Cell Signaling #7076) and HRP-linked anti-rabbit IgG (for example, 1:20,000, Cell Signaling #7074) secondary antibodies.
T7 endonuclease I assay is performed as previously described to quantify the Cas9-induced mutagenesis in endogenous loci8. The targeted loci are amplified from 15-30 ng of genomic DNA extracted using DNeasy Blood and Tissue Kit (for example, from QIAGEN) using the primers as listed in Table 4 below. Quantification is based on relative band intensities measured using ImageJ. The editing efficiency is estimated by the formula 100 × (1 − (1 − (b + c)/(a + b + c))^(1/2)), as previously described43, where a is the integrated intensity of the uncleaved PCR product, and b and c are the integrated intensities of each cleavage product, respectively.
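By way of a non-limiting illustration, the editing-efficiency estimate can be sketched in Python as follows. The sketch assumes that the exponent in the formula is 1/2 (a square root), as in the commonly used T7E1 estimate; the function name and the example band intensities are hypothetical and are not taken from the experiments described herein.

```python
import math


def t7e1_editing_percent(a: float, b: float, c: float) -> float:
    """Estimate percent editing from T7E1 gel band intensities.

    a    -- integrated intensity of the uncleaved PCR product
    b, c -- integrated intensities of the two cleavage products
    """
    if min(a, b, c) < 0 or (a + b + c) <= 0:
        raise ValueError("intensities must be non-negative with a positive sum")
    fraction_cleaved = (b + c) / (a + b + c)
    # Assumption: the exponent is 1/2, i.e. a square root of the uncleaved fraction.
    return 100.0 * (1.0 - math.sqrt(1.0 - fraction_cleaved))


if __name__ == "__main__":
    # Hypothetical ImageJ intensities, for illustration only.
    print(round(t7e1_editing_percent(a=64.0, b=18.0, c=18.0), 2))
```

The square root reflects that a cleavable heteroduplex forms only when at least one of the two re-annealed strands carries an indel, so the cleaved fraction overstates the mutant allele fraction unless corrected.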
GUIDE-seq is performed as previously described8. Approximately 1.6 million MHCC97L cells stably expressing the KKH-SaCas9 variants are transduced with sgRNAs. After 72 hours, electroporation is conducted according to the manufacturer's protocol using 1,100 pmol freshly annealed end-protected dsODN with 100 μl Neon tips (for example, from Thermo Fisher Scientific). The dsODN oligonucleotides used are 5′-P-G*T*TTAATTGAGTTGTCATATGTTAATAACGGT*A*T-3′ (SEQ ID NO: 1) and 5′-P-A*T*ACCGTTATTAACATATGACAACTCAATTAA*A*C-3′ (SEQ ID NO: 2), where P represents 5′ phosphorylation and the asterisks indicate phosphorothioate linkages. Electroporation voltage, width and number of pulses are set to be 1100 V, 20 ms, and 3 pulses, respectively. Cells are harvested at day 7 post transduction of the sgRNA. Genomic DNA is extracted using DNeasy Blood and Tissue Kit (for example, from QIAGEN) according to the manufacturer's protocol. The gDNA collected for the SaCas9 variant and the sgRNA are sequenced on Illumina NextSeq System and analyzed by GUIDE-seq software44.
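As a simple sanity check before ordering or annealing the dsODN, one may confirm that the two oligonucleotides are exact reverse complements of each other, so that they anneal into a blunt-ended duplex. The following Python sketch uses the two sequences as written above; the helper names are hypothetical.

```python
# Complement table for unmodified DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Sequences as written in the text; '*' marks a phosphorothioate linkage.
DSODN_SENSE = "G*T*TTAATTGAGTTGTCATATGTTAATAACGGT*A*T"      # SEQ ID NO: 1
DSODN_ANTISENSE = "A*T*ACCGTTATTAACATATGACAACTCAATTAA*A*C"  # SEQ ID NO: 2


def bases_only(annotated: str) -> str:
    """Remove the '*' linkage marks, leaving only the base letters."""
    return annotated.replace("*", "")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a plain A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def anneals_blunt(sense: str, antisense: str) -> bool:
    """True if the two oligos are full-length reverse complements."""
    return reverse_complement(bases_only(sense)) == bases_only(antisense)


if __name__ == "__main__":
    print(anneals_blunt(DSODN_SENSE, DSODN_ANTISENSE))
    print(len(bases_only(DSODN_SENSE)), "nt per strand")
```

The same check can be reused for the annealed sgRNA oligonucleotide pairs, provided that the sticky-end overhangs are trimmed first.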
Molecular dynamics simulations are conducted on the variants using DynaMut37. The variant mutations are entered individually into the webserver and the structural outputs are then aligned with the crystal structure of SaCas9 (PDB: 5CZZ) in PyMOL. The predicted rotamer of each mutation as indicated by DynaMut is subsequently used to replace the amino acid at the corresponding position on the SaCas9 crystal structure. The predicted interactions determined by DynaMut and PyMOL are indicated on the crystal structure to provide a putative representation of the SaCas9 variants.
For protein engineering, it is challenging to investigate the vast combinatorial mutational space. Machine learning-based methods allow efficient exploration of the functional impact of mutations and break through the experimental limits of testing a great number of combinatorial mutants. Whether an ML-based in silico screen can be applied to Cas9 optimization can be determined from a small fraction of variants with experimentally determined activities from a combinatorial mutant library. In particular, using the previously published combinatorial mutagenesis data on SpCas98, the minimal sample size sufficient for accurately predicting which variants possess top enzyme activities for the library can be readily determined.
In embodiments of the subject invention, an MLDE model that predicts activities of variants from multi-site saturation mutagenesis libraries based on a small sample of variants is employed. The MLDE model offers numerous embeddings and model parameters, and the simple Georgiev embeddings25 and the learnt embedding from Belper et al26 are selected to combine with more complex neural network models (parameter 1) or with an ensemble of simpler models such as random forests and SVM (parameter 2) to model the activities of SpCas9.
Different input sizes including 10%, 20%, 50%, and 70% of randomly down-sampled empirical data points from the library of 650 variants are utilized as the training data for testing the SpCas9 activity. Since a previous study has shown that sampling diverse samples improves the ML performance27, whether using a sample with high diversity may improve accuracy needs to be determined.
Deciding which characteristic of the data is most useful as the training data facilitates design of the library for building variants for empirical testing. To this end, more dissimilar variants are selected by reducing the numbers of variants sharing merely one and two sequence mismatches included in the input dataset.
In particular, when there is a limited number of input data points, for example, 10% and 20%, it is observed that restricting the number of one-mismatch and two-mismatch counterparts of each variant boosts the number of variants harboring five to seven mismatches from each other in the dataset by, for example, 8% to 16%. When the sample size increases, such a selective scheme does not confer more dissimilarities among variants compared to the random selection. Overall, the diversity is preserved in the down-sampling.
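The diversity-restricted down-sampling described above may be sketched as follows. This is an illustrative greedy scheme only: the exact selection procedure is not specified herein, and the cap on close neighbors (`max_close`), the mismatch cutoff (`close_dist`), and the toy library are assumptions introduced for illustration.

```python
import random
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length variants."""
    return sum(x != y for x, y in zip(a, b))


def diverse_sample(variants, n, max_close=1, close_dist=2, seed=0):
    """Pick up to n variants while limiting one- and two-mismatch neighbors.

    A shuffled variant is accepted only if at most `max_close` of the
    already chosen variants lie within `close_dist` mismatches of it.
    If the cap leaves fewer than n variants, skipped ones fill the rest.
    """
    rng = random.Random(seed)
    pool = list(variants)
    rng.shuffle(pool)
    chosen, skipped = [], []
    for v in pool:
        if len(chosen) == n:
            break
        n_close = sum(1 for c in chosen if hamming(v, c) <= close_dist)
        (chosen if n_close <= max_close else skipped).append(v)
    chosen.extend(skipped[: n - len(chosen)])  # top up if the cap was too strict
    return chosen


if __name__ == "__main__":
    # Toy library: 6 sites with 3 hypothetical residues each (729 variants).
    library = ["".join(p) for p in product("ACD", repeat=6)]
    picked = diverse_sample(library, n=73)
    print(len(picked), "variants selected")
```

A purely random down-sample corresponds to `random.Random(seed).sample(library, n)`; comparing the pairwise mismatch distributions of the two selections indicates whether the restriction adds diversity at a given sample size.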
The MLDE model is run on all the datasets to calculate metrics such as precision, specificity, and sensitivity for predicting variants with at least 70% of wild-type activity. Consistent with the small increase in diversity described above, it is found that the diverse dataset yields slightly more variants with >70% of wild-type activity (i.e., greater sensitivity), but with a small compromise of more false-positive discoveries (i.e., lower precision and specificity), compared to the randomized selection as shown in
In the ML model runs, it is found that the prediction on SpCas9 activity achieves good precision and specificity as shown in
The Belper with parameter 1 configuration also exhibits high enrichment of functional variants among the top 5% hits in the prediction. With 10% and 20% of input, 81.6% and 85.2% of variants are functional among the top 5% hits from the predictions, which correspond to a 5.46-fold and 5.88-fold enrichment of finding a functional variant compared to the null background, respectively, as shown in
Therefore, functional variants with high on-target activities can be readily isolated in silico based on the MLDE model with Belper embedding and modelling parameter 1, when empirical measurements of, for example, 10-20%, of variants are provided as input.
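The evaluation metrics referred to above can be computed as in the following Python sketch. It assumes that a variant is called functional when its measured activity is at least 70% of wild-type, and that the fold enrichment is the fraction of functional variants among the top-ranked predictions divided by the fraction of functional variants in the whole measured set; the precise null background used herein is not specified, so this definition, like the function names and example values, is an assumption.

```python
def _ratio(num: float, den: float) -> float:
    """Safe division that returns NaN for an empty denominator."""
    return num / den if den else float("nan")


def classification_metrics(is_functional, predicted_functional):
    """Precision, sensitivity and specificity from paired boolean calls."""
    pairs = list(zip(is_functional, predicted_functional))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    return {
        "precision": _ratio(tp, tp + fp),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
    }


def top_hit_enrichment(measured, predicted, threshold=0.7, top_fraction=0.05):
    """Hit rate among the top-ranked predictions and its fold enrichment.

    measured  -- activities relative to wild-type (1.0 = wild-type level)
    predicted -- model scores used only for ranking
    """
    n_top = max(1, int(len(measured) * top_fraction))
    order = sorted(range(len(measured)), key=lambda i: predicted[i], reverse=True)
    hit_rate = sum(measured[i] >= threshold for i in order[:n_top]) / n_top
    base_rate = sum(m >= threshold for m in measured) / len(measured)
    return hit_rate, _ratio(hit_rate, base_rate)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    measured = [0.9, 0.8, 0.75, 0.7, 0.1, 0.95] + [0.2] * 14
    predicted = [20, 19, 18, 17, 16, 1] + [2] * 14
    print(top_hit_enrichment(measured, predicted, top_fraction=0.25))
```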
Based on the parameters that yield a good prediction of SpCas9's on-target activity, the MLDE model can be applied for optimization of SaCas9. The test results show that the editing activity of KKH-SaCas9 is augmented, suggesting that introducing additional non-base-specific interactions between KKH-SaCas9 and the PAM duplex of the target DNA can increase the efficiency of the enzyme. Such a strategy is effective in compensating for the reduced DNA base-specific interactions of an engineered SpCas9 variant with broadened PAM compatibility and in restoring the enzyme's activity28. For SaCas9, Nishimasu et al. have illustrated in the crystal structure (5CZZ) the amino acid residues that directly contact the target DNA backbone of the PAM duplex17.
In one embodiment, eight amino acid residues (located within the WED and PI domains of KKH-SaCas9) that interact with and surround the PAM duplex for combinatorial mutagenesis are selected as shown in Table 5 below. Based on a rational design, up to three amino-acid alternatives to the wild-type residue are selected for each site, leading to a total of 1,296 variant combinations.
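The size of such a combinatorial library is the product of the number of residue choices at each site. The following Python sketch enumerates a library of this shape. The site positions are those discussed herein, but the split of choices per site (four sites with two choices and four with three) and the labels "wt", "alt1", and "alt2" are placeholders chosen only so that the product equals 1,296; they do not reproduce Table 5.

```python
from itertools import product
from math import prod

# Placeholder choices per site: "wt" plus hypothetical alternatives.
# The real alternatives are listed in Table 5 and are not reproduced here.
SITE_OPTIONS = {
    887: ("wt", "alt1"),
    888: ("wt", "alt1", "alt2"),
    889: ("wt", "alt1", "alt2"),
    985: ("wt", "alt1"),
    986: ("wt", "alt1"),
    988: ("wt", "alt1", "alt2"),
    989: ("wt", "alt1", "alt2"),
    991: ("wt", "alt1"),
}


def library_size(options) -> int:
    """Number of combinations = product of the choices at every site."""
    return prod(len(choices) for choices in options.values())


def enumerate_library(options):
    """Yield every variant as a {site: choice} mapping."""
    sites = sorted(options)
    for combo in product(*(options[s] for s in sites)):
        yield dict(zip(sites, combo))


if __name__ == "__main__":
    print(library_size(SITE_OPTIONS), "variant combinations")
    print(next(enumerate_library(SITE_OPTIONS)))
```

Any other split of one to four choices per site whose product is 1,296 would give a library of the same size.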
Moreover, 300 out of the 1,296 (23%) variants are randomly picked, generating empirical data from a screening library as the training set input, and the MLDE model is run with the Belper embedding and the modelling parameter 1 to predict functional variants that have activities comparable to wild-type, for example, at least 70%, from the full variant space. The generated in silico prediction results are then confirmed by the experimental screening data, validating that the MLDE model predicts KKH-SaCas9's activity with high accuracy.
In one embodiment, a full-coverage screening library of 1,296 variants is assembled and the library is delivered by lentiviruses into reporter cell lines that stably express GFP and an sgRNA targeting the GFP gene sequence as shown in
The experimental screening results reveal that variants harboring mutations at residues 888 and 889 of the WED domain and 988 and 989 of the PI domain are frequently detected among the top 5%-ranked variants with high on-target activities, while those carrying wild-type sequences at 887 of the WED domain and 985, 986, and 991 of the PI domain more likely confer the enzyme with higher activity as shown in
Comparison between the in silico prediction results and experimental screen data indicates that the MLDE model accurately predicts KKH-SaCas9's activity. It is found that the three independent sets of activity measurements on KKH-SaCas9 variants yield predictions consistent with the experimental screen data, for example those as shown in
To further verify the editing efficiencies of the identified variants with increased activity over KKH-SaCas9, individual validation assays are performed. The validation results are consistent with the screening data, revealing that the N888Q and N888Q/A889S variants exhibit increased editing activities over KKH-SaCas9, when paired with sg1 and sg3 sgRNAs as shown in
Based on the above identified activity-enhanced variants, structure-guided engineering is employed to further improve the editing activity of KKH-SaCas9. Protein structure analyses indicate that N888 and A889 at the WED domain of SaCas9 are positioned close to its PI domain and the DNA backbone of the PAM duplex17. Previous modelling also revealed that while N888Q removes its contact with the DNA backbone of the PAM duplex, it could increase its proximity to and add interactions with L989 at the PI domain as shown in
In one embodiment, tests are performed to confirm that switching N888 and A889 to other residues could strengthen the interactions between WED and PI domains and also enhance KKH-SaCas9's activity. Four more combined mutation variants are engineered on these positions, which are selected based on predicted contact gains with the PI domain via N986, D987, L988, and/or L989 as shown in
Among the variants tested, the one harboring N888R/A889Q mutations (hereafter designated as “KKH-SaCas9-plus”) exhibits the greatest editing activity, for example, 122% of the activity of KKH-SaCas9 averaged from 3 sgRNAs targeting GFP as shown in
Moreover, the modelling of KKH-SaCas9-plus shows that it contacts the PI domain via N986, D987, L988, and/or L989 residues and has three contacts with the DNA backbone as shown in
It is determined that the addition of N888R/A889Q can improve the activity of high-fidelity variants of KKH-SaCas9, such as the newly engineered SAV2. Moreover, it is found that the N888R/A889Q enhances the on-target activity of SAV2. For example, 125% of KKH-SaCas9's activity averaged from sgRNAs targeting 8 loci is observed as shown in
There have been tremendous efforts in designing Cas9 proteins to boost gene editing efficiency and purge undesired off-target editing at the same time by maintaining a delicate balance between interacting and non-interacting amino-acid side chains of the Cas9 protein with the sgRNA-DNA complex. Dozens of variants possessing different mutation combinations have been reported thus far, each representing one of the many optimal solutions for the trade-off between Cas9 activity and precision.
Considering that any of the amino-acid sites of SaCas9 in spatial proximity to the sgRNA-DNA complex are potential sites for optimization, which could reach as many as 40 sites17, the number of combinatorial variants, for example, 2^40 = 1.1×10^12, to screen through for optimization is prohibitively high for wet-lab experiments, even if each site is restricted to two (wild-type or mutated) amino-acid residues.
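The scale of this search space can be checked with a short calculation. The Python sketch below assumes that every site offers the same number of residue choices; the function name is hypothetical.

```python
def combinatorial_space(n_sites: int, residues_per_site: int) -> int:
    """Number of variants when every site offers the same number of residues."""
    return residues_per_site ** n_sites


if __name__ == "__main__":
    two_state = combinatorial_space(n_sites=40, residues_per_site=2)
    # Exact count, followed by the same number in scientific notation.
    print(f"{two_state:,} variants (about {two_state:.1e})")
```

Because the count grows exponentially with the number of sites, each additional two-state site doubles the library, which is why restricting both the sites and the candidate residues per site is needed to keep a screen experimentally tractable.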
Previous studies have shown that with a rational design, each site can be limited to 4-5 candidate residues and that a targeted mutagenesis library can be generated to reduce screening efforts. SpCas9 variants with both high activity and fidelity have been successfully identified from a combinatorial screen of 952 variants8.
In embodiments of the subject invention, a rational design-based screen with machine learning is adopted for optimization of the Cas9 proteins. Particularly, the ability of ML to further downsize the experimental screen via extrapolation from a handful of variants with experimentally determined fitness values is assessed. It is found that an ML-based in silico screen greatly facilitates the search for more efficient Cas9 variants. In the ML runs on the SpCas9 dataset using as little as 10% of variants as input training data, an 81.6% chance of capturing functional variants among the top 5% of variants predicted is achieved. Shortlisting a few candidate residues on selected amino-acid sites via structure-guided rational design of SpCas9 significantly enhances the chances of finding better variants from the previously published combinatorial mutant library. Similarly, the results of the MLDE model suggest that focus should be placed on surveying diverse sequence spaces deemed to contain functional variants24. In an independent Cas9 optimization task, it is further demonstrated that the MLDE model performs strongly in predicting KKH-SaCas9 variants' activities on three sgRNAs and subsequently succeeds in identifying useful novel variants in the KKH-SaCas9 screen. When the combined approach of structure-guided design, targeted mutagenesis library screen, and ML is employed to identify activity-enhanced KKH-SaCas9 variants, the path to identify the top variants is significantly shortened.
The best-performing variant, KKH-SaCas9-plus, harbors N888R/A889Q mutations, which improve its editing activity. The molecular modelling provides structural insights that these mutations may strengthen the interactions between KKH-SaCas9's WED and PI domains located near the PAM duplex to anchor the target DNA in the SaCas9-sgRNA-target DNA complex. While N888R/A889Q increases the on-target activity, the mutations only minimally affect the off-target activity of SAV2, which is a high-fidelity derivative of KKH-SaCas9. The result affirms that the abilities of KKH-SaCas9 to bind the DNA and to distinguish base mismatches between the sgRNA and the DNA target probably act through distinct mechanisms, and thus its activity and specificity could be engineered independently. It is possible that N888R/A889Q is also compatible with other dSaCas9-derived genome perturbation tools including gene activators31,32, base editors33,34, and prime editors35 to increase their abilities to bind the DNA and thus their activities. The N888R/A889Q mutations on the WED domain represent a useful building block for further engineering of various genome perturbation tools to achieve both high activity and high specificity.
To discover the activity-enhanced KKH-SaCas9 variants, a smaller pool of, for example, about a thousand variants is initially tested based on the structure-guided design. It is noted that the selection of suitable sgRNAs, for example, sg50N for SpCas9 and sg1, sg2, and sg3 for KKH-SaCas9, allows the MLDE model to generate more reliable predictions in subsequent screens. The MLDE-based workflow is tested and validated based on the experimental screening data and the required number and diversity of the input combinations are defined for in silico predictions. The results lead to screening of more combinatorial mutations by creation of a directed library on a manageable experimental scale. Continual efforts in advancing ML methods for protein structure modelling, including incorporating structural descriptors36 into the learnt representation, lead to improvement of the prediction on variants' activities for in silico screens.
Nevertheless, only mutation combinations from amino acid residues selected by a rational design are investigated, without exploration of the performance of the MLDE model on a virtual fully saturated mutagenesis screen. Creating a more comprehensive screening strategy by designing a library enriched with diverse but not “dead” variants remains challenging. One could examine possible structural changes of the designed variants predicted using other in silico tools such as DynaMut37, Rosetta38,39, and PyMOL to further filter for candidate mutations. For example, experimental screening of a computationally designed library of ubiquitin variants was shown to be successful in identifying variants with strong protein-binding ability40.
Moreover, increasing the number of amino-acid sites is desirable. It would be particularly useful for repurposing a protein to use another substrate on which the wild-type has essentially no activity. For example, obtaining a “PAMless” SaCas9 involves engineering multiple sites beyond the PI and WED domains. The number of targeted mutagenesis sites to be incorporated is still a limiting factor in combinatorial library construction. For example, commercial oligo synthesis of a 100 bp DNA fragment accommodates at most 10 sites of NNN/NNK degenerate codons or trinucleotide pools. Thus, the MLDE model is advantageous in transcending such physical limitations by building a combined in silico screen supplied with empirical data from multiple smaller focused libraries. For example, multiple focused screens may be performed, with MLDE combining sites that overlap modestly between libraries, such that each library covers up to 6 sites with 5 amino-acid residues per site and shares 1-2 sites with another library, as shown in
Comparison of MLDE Performance on Predicting SpCas9 Activity with That on Predicting sg50N and sg80N sgRNAs
The performance of the MLDE model in predicting SpCas9's activity is compared between the datasets of two sgRNAs8, namely, sg50N (650 empirical datapoints) and sg80N (729 empirical datapoints), both of which target a red fluorescent protein (RFP) sequence, as the input data.
Similar to the approach adopted for testing sg50N described herein, input datasets including 10%, 20%, 50%, and 70% of empirical measurements are generated to test the minimal input for effective selection of top variants from the MLDE prediction, corresponding to datasets of 73, 146, 365, and 510 empirically measured on-target activities for sg80N.
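The training-set sizes quoted above follow from taking the stated percentages of the 729 measured variants. The Python sketch below reproduces them under the assumption that fractional sizes are rounded half up; the rounding rule, the function names, and the seed are assumptions for illustration.

```python
import random


def subset_size(n_variants: int, percent: int) -> int:
    """Size of a percent-sized subset, rounded half up using integer math."""
    return (n_variants * percent + 50) // 100


def training_sizes(n_variants: int, percents=(10, 20, 50, 70)):
    """Subset sizes for each down-sampling percentage."""
    return [subset_size(n_variants, p) for p in percents]


def draw_training_subset(variants, percent: int, seed: int = 0):
    """Randomly down-sample the measured variants without replacement."""
    pool = list(variants)
    return random.Random(seed).sample(pool, subset_size(len(pool), percent))


if __name__ == "__main__":
    print(training_sizes(729))  # sizes for the 729 measured variants
    print(training_sizes(650))  # same percentages for the 650-variant dataset
```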
When the datasets of the two sgRNAs are compared, it is found that the prediction on sg50N activity achieves precision and specificity that are higher than those of the sg80N activity. While using merely 10% input is sufficient to identify the three clusters of variants with high sg50N activity as shown in
Accordingly, the performance of the MLDE model on sg80N is much lower than that on sg50N as shown in
Thus, sg80N is a challenging dataset for ML models. Nonetheless, the MLDE model performs strongly in predicting the sg50N activities of SpCas9 variants and succeeds in identifying useful novel variants in the KKH-SaCas9 screen.
When facing such a phenomenon resulting from an sgRNA-specific effect, the MLDE model may be limited in its applications for identifying variants with improved performance. It is also observed in previous studies that some sgRNAs may be more susceptible to losing editing activity when the functional dose of Cas9 (or the Cas9:sgRNA molar ratio) is reduced2,3. Since the reasons accounting for such an sgRNA-specific effect are not yet known, it may be desirable to test multiple conditions (i.e., more sgRNAs) and select suitable ones, for example, sg50N for SpCas9 and sg1, sg2, and sg3 for KKH-SaCas9, allowing the MLDE model to generate reliable predictions in subsequent screens.
In embodiments of the subject invention, the genome-editing Cas9 protein uses multiple amino-acid residues on its sequence to bind the target DNA. Even when only the residues in proximity to the target DNA are considered as potential sites to optimize Cas9's activity, the number of combinatorial variants to screen through is too massive for a wet-lab experiment. It is demonstrated that a machine learning-coupled combinatorial mutagenesis approach reduces the experimental screening burden by as much as 90%, while achieving 87% prediction precision and 97% specificity, for Cas9 engineering. Using this approach, mutations that enhance the editing activity of the protospacer adjacent motif-relaxed KKH variant of Cas9 nuclease from Staphylococcus aureus (KKH-SaCas9) are discovered. The mutations located at SaCas9's WED domain are modelled to strengthen contacts with the PI domain and sandwich the protospacer adjacent motif-proximal DNA duplex. Following structure-guided engineering, one of the variants, named KKH-SaCas9-plus, shows as much as 30% enhancement of editing activity at multiple loci without compromising high genome-wide targeting specificity, when combined with mutations that confer KKH-SaCas9 with high accuracy. In addition to generating a KKH-SaCas9 nuclease with efficiency exceeding its wild-type counterpart, a readily applicable workflow is established, leveraging the machine learning-assisted paradigm to accelerate engineering of genome editors.
All patents, patent applications, provisional applications, and publications referred to or cited herein are incorporated by reference in their entirety, including all figures and tables, to the extent they are not inconsistent with the explicit teachings of this specification.
It should be understood that the examples and embodiments described herein are for illustrative purposes only and that various modifications or changes in light thereof will be suggested to persons skilled in the art and are to be included within the spirit and purview of this application. In addition, any elements or limitations of any invention or embodiment thereof disclosed herein can be combined with any and/or all other elements or limitations (individually or in any combination) or any other invention or embodiment thereof disclosed herein, and all such combinations are contemplated within the scope of the invention without limitation thereto.
This application claims the benefit of U.S. Provisional Application Ser. No. 63/268,745, filed Mar. 1, 2022, which is hereby incorporated by reference in its entirety including any tables, figures, or drawings.
Number | Date | Country
---|---|---
63268745 | Mar 2022 | US