The Sequence Listing for this application is labeled “CUHK208.xml” which was created on May 11, 2023 and is 10,958 bytes. The entire content of the sequence listing is incorporated herein by reference in its entirety.
Computational Protein Design, the design of proteins with specific structures or functions [1], has been a powerful tool for exploring sequence or topology space not yet visited by the evolutionary process [2-4] and for discovering proteins with improved properties [5]. It has enabled successes in membrane protein design [6] and enzyme design [7]. Inverse Protein Folding (IPF), one of the sub-tasks of Computational Protein Design, tackles the problem of finding amino acid sequences that can fold into a given three-dimensional (3D) backbone structure [8] and is of great importance because hosting a particular function often presupposes acquiring a specific backbone structure.
The ways to model and utilize residue interactions have been the focus of various IPF algorithm development efforts. In traditional methods, energy functions are designed to approximate backbone-sequence compatibility. Residue-pair interaction modeling is usually derived from databases by leveraging statistical preferences for particular residue pairs in a simplified local environment to estimate inter-residue energies [5, 9, 10]. However, increasing computational complexity limits the statistical estimation of multi-residue interactions that are conditional on a more fine-grained representation of the local environment [10, 11].
In recent years, deep learning has been widely and successfully applied to protein structure modeling and prediction [12, 13], due to its ability to automatically learn complex non-linear many-body interactions from data. Various research projects have been carried out to investigate deep learning-assisted IPF [4, 14, 15]. Early methods often model protein structures as sequences of independent residues [16, 17] or atom point clouds [4, 15] and adopt a non-autoregressive decoding scheme as demonstrated in
There continues to be a need in the art for improved methods and systems providing a machine-learning assisted approach to meet the challenges of protein structure modeling and prediction.
According to an embodiment of the subject invention, a machine-learning based method for protein sequence design is provided, comprising receiving information of residues of a protein sequence; determining if any residue of the protein sequence is known; performing entire sequence design if it is determined that no residue of the protein sequence is known; and performing partial sequence design if it is determined that at least one residue of the protein sequence is known. The performing entire sequence design comprises performing an entropy-based prediction-selection method in combination with a base model to remove noise in input residue context. Moreover, the performing an entropy-based prediction-selection method comprises computing an entropy of predicted distributions at each position, retaining residues having entropies lower than or equal to a threshold value, and masking other residues having entropies greater than the threshold value. The base model can be a GVP-GNN model, a ProteinMPNN model, a ProteinMPNN-C model, or an ESM model.
In another embodiment of the subject invention, another machine-learning based method for protein sequence design is provided. The method comprises generating a portion of a sequence and removing noise in input residue context; encoding and processing the portion of the sequence and the backbone structure to obtain graph features; performing memory-efficient global graph attention layers to propagate the graph features and learn global residue interactions; and generating the entire sequence non-iteratively. The performing memory-efficient global graph attention layers comprises enabling each residue node to learn residue interactions and gather information from the entire sequence while maintaining memory efficiency. Moreover, the performing memory-efficient global graph attention layers comprises constructing a K-nearest neighbor graph from the backbone structure, with node features containing dihedral angles and forward and backward vectors. Edge features of the memory-efficient global graph attention layers include interatomic distances and direction vectors. The edge features are encoded by GVP layers to obtain structural embeddings. In each layer of the memory-efficient global graph attention layers, every residue node globally attends to the other residues. An attention score is calculated from both the node and the edge features to determine the amount of information that a target node gathers from another node. For node pairs that are not directly connected by an edge, a learnable pseudo edge feature is configured for attention calculation. In addition, each layer learns a separate pseudo edge feature that is shared by all non-existing edges. The attention score is then used to weight and sum up the node and the edge features, generating updated node features. The edge features are updated by the updated node features. Further, the generating the entire sequence non-iteratively comprises generating the entire sequence from the node features of the last layer non-iteratively.
In certain embodiments of the subject invention, a computer program product is provided, comprising a non-transitory computer-executable storage device having computer readable program instructions embodied thereon that, when executed by a computer, cause the computer to perform a machine-learning based method for protein sequence design, the computer-executable program instructions comprising generating a portion of a sequence and removing noise in input residue context; encoding and processing the portion of the sequence and the backbone structure to obtain graph features; performing memory-efficient global graph attention layers to propagate the graph features and learn global residue interactions; and generating the entire sequence non-iteratively.
Embodiments of the subject invention are directed to a machine-learning based method and system for protein sequence design based on selection of high-quality residue interactions.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items. As used herein, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well as the singular forms, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, and/or groups thereof.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one having ordinary skill in the art to which this invention pertains. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the present disclosure and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
When the term “about” is used herein, in conjunction with a numerical value, it is understood that the value can be in a range of 90% of the value to 110% of the value, i.e. the value can be +/−10% of the stated value. For example, “about 1 kg” means from 0.90 kg to 1.1 kg.
As used herein, the term “nucleic acid” or “polynucleotide” refers to deoxyribonucleic acids (DNA) or ribonucleic acids (RNA) and polymers thereof in either single- or double-stranded form. Unless specifically limited, the term encompasses nucleic acids containing known analogs of natural nucleotides that have similar binding properties as the reference nucleic acid and are metabolized in a manner similar to naturally occurring nucleotides. Unless otherwise indicated, a particular nucleic acid sequence also implicitly encompasses conservatively modified variants thereof (e.g., degenerate codon substitutions), alleles, orthologs, single nucleotide polymorphisms (SNPs), and complementary sequences as well as the sequence explicitly indicated. Specifically, degenerate codon substitutions may be achieved by generating sequences in which the third position of one or more selected (or all) codons is substituted with mixed-base and/or deoxyinosine residues (Batzer et al., Nucleic Acids Res. 19:5081 (1991); Ohtsuka et al., J. Biol. Chem. 260:2605-2608 (1985); and Rossolini et al., Mol. Cell. Probes 8:91-98 (1994)). The term nucleic acid is used interchangeably with gene, cDNA, and mRNA encoded by a gene.
In this disclosure, the term “isolated nucleic acid” molecule means a nucleic acid molecule that is separated from other nucleic acid molecules that are usually associated with the isolated nucleic acid molecule. Thus, an “isolated nucleic acid molecule” includes, without limitation, a nucleic acid molecule that is free of nucleotide sequences that naturally flank one or both ends of the nucleic acid in the genome of the organism from which the isolated nucleic acid is derived (e.g., a cDNA or genomic DNA fragment produced by PCR or restriction endonuclease digestion). Such an isolated nucleic acid molecule is generally introduced into a vector (e.g., a cloning vector or an expression vector) for convenience of manipulation or to generate a fusion nucleic acid molecule. In addition, an isolated nucleic acid molecule can include an engineered nucleic acid molecule such as a recombinant or a synthetic nucleic acid molecule. A nucleic acid molecule existing among hundreds to millions of other nucleic acid molecules within, for example, a nucleic acid library (e.g., a cDNA or genomic library) or a gel (e.g., agarose, or polyacrylamide) containing restriction-digested genomic DNA, is not an “isolated nucleic acid”.
In this application, the terms “polypeptide”, “peptide”, and “protein” are used interchangeably herein to refer to a polymer of amino acids. The terms apply to amino acid polymers in which one or more amino acid residues are artificial chemical mimetics of corresponding naturally occurring amino acids, as well as to naturally occurring amino acid polymers and non-naturally occurring amino acid polymers, including those comprising post-translational modifications. As used herein, the terms encompass amino acid chains of any length, including full-length proteins, wherein the amino acid residues are linked by covalent peptide bonds, as well as multi-subunit proteins wherein two or more covalently linked chains of amino acids are associated by covalent bonds or non-covalent interactions.
In describing the invention, it will be understood that a number of techniques and steps are disclosed. Each of these has individual benefits and each can also be used in conjunction with one or more, or in some cases all, of the other disclosed techniques. Accordingly, for the sake of clarity, this description will refrain from repeating every possible combination of the individual steps in an unnecessary fashion. Nevertheless, the specification and claims should be read with the understanding that such combinations are entirely within the scope of the invention and the claims.
According to the embodiments of the subject invention, the machine-learning based method and system for protein sequence design comprises two design components.
(1) It is assumed that a portion of the sequence is already known, and the model takes a partial sequence as input as well as the backbone structure during both training and inference. During the training phase, explicitly introducing ground truth residue information can help the model learn and exploit residue interactions more effectively, increasing native sequence recovery by up to 8% as shown by the experimental results. During the inference phase, a prediction-selection method is configured to utilize high-quality residue information. Referring to
(2) A memory-efficient global graph attention model is adopted to fully utilize the denoised context, allowing each residue node to learn residue interactions and gather information from the entire sequence while maintaining the memory efficiency.
Partial Sequence Input Method: A portion of the sequence is assumed to be known. When training the model, some residues are randomly selected and their types are made visible. During the evaluation, two design settings, which are shown in
In one embodiment, when some residues are unknown and need to be designed, the other known residues can serve as an oracle partial input sequence.
In another embodiment, when no residues are known and the entire sequence needs to be designed, an entropy-based prediction-selection method is adopted to utilize high-quality predictions from the existing models. Specifically, given a sequence generated by a conventional model, which is referred to as the base model, the entropy of the predicted distribution at each position is computed. The residues with low entropy, for example, the residues having entropies lower than or equal to a threshold value, are retained, and the other residues having entropies greater than the threshold value are masked, on the assumption that models are more confident in low-entropy predictions. Results of experiments demonstrate that precisions of around 99% can be achieved for the residues with the lowest 10% entropy. Thus, the method can effectively remove a significant amount of noise in the input residue environment.
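For illustration, the entropy-based prediction-selection described above can be sketched in Python as follows. The function names, the use of natural-log entropy, and the retain-the-lowest-fraction threshold rule are assumptions made for this sketch rather than the exact implementation:

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

def select_partial_sequence(predicted_dists, sequence, keep_fraction=0.1):
    """Keep the residues whose predicted distributions have the lowest
    entropy; mask the rest with None.

    predicted_dists[i] is the base model's distribution over residue
    types at position i, and sequence[i] is its predicted residue type.
    keep_fraction (the lowest 10% here) is an illustrative choice.
    """
    entropies = [entropy(d) for d in predicted_dists]
    n_keep = max(1, int(len(sequence) * keep_fraction))
    # Threshold = entropy of the n_keep-th most confident position.
    threshold = sorted(entropies)[n_keep - 1]
    return [res if en <= threshold else None
            for res, en in zip(sequence, entropies)]
```

The masked positions (None) are treated as the special “unknown” residue type when the partial sequence is fed to the model.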
Moreover, different from the conventional methods, the method of the subject invention generates the sequence in one shot, without iterative design or dependence on previous predictions, resulting in faster sequence design and ensuring higher-quality residue interactions. Accordingly, the model of the subject invention generates sequences with high native recovery even with relatively poor base models and achieves state-of-the-art native sequence recovery when better partial sequences are available as input. The strategy for partial input construction is applicable to any existing protein design model, as there are no requirements on the base model architecture.
Memory-efficient Global Graph Attention Method: To learn and leverage residue interactions, the attention method, which has been proven effective in modeling global dependencies for sequential data, is adopted. However, adapting attention to the graph domain can be challenging. Existing works either abandon the original global view [18, 26, 27] or require fully connected graphs to support the global attention calculation [28, 29], which raises the memory complexity from linear to quadratic in node numbers. Further, edge features are not fully leveraged in these existing methods, as they only participate in attention calculations but cannot be updated or used to update node features [18, 26, 28, 30]. Nevertheless, it has been shown that the edge features are crucial for protein structure modeling [20].
To address these issues, according to the embodiments of the subject invention, a model with memory-efficient global graph attention layer is provided as shown in
Each layer of the model learns a separate pseudo edge feature that is shared by all non-existing edges. The attention score is then used to weight and sum up the node and the edge features, producing the updated node features. The edge features are also updated by the new node features. Finally, the model generates the sequence from the node features from the last layer in one shot. Therefore, the memory-efficient global graph attention layer allows for global residue attention while eliminating the need for fully connected graph construction by learning the pseudo edge features. The residues are able to leverage the global interactions and the whole-structure features.
The experiment results demonstrate that the method according to the embodiments of the subject invention is effective in handling both entire sequence design and partial sequence design settings. Higher native sequence recovery is achieved on several sequence design benchmarks. In particular, for partial sequence design, the model is validated on the task of single-point mutant design of transposon-associated transposase B. This is a special case of partial sequence design where only one residue can be modified. The model can generate stable and high-quality sequences and successfully identifies six variants with improved gene editing activity out of the 20 mutants recommended by the model.
The model according to the embodiments of the subject invention is trained on the CATH 4.2 training set containing 18,204 structures.
A base model is employed when designing the entire sequence. The method according to the embodiments of the subject invention performed in combination with each base model is tested for inverse protein folding.
GVP-GNN is trained on the same training set mentioned above.
ProteinMPNN is trained on selected PDB structures clustered into 25,361 clusters. The same model architecture trained on the same training set as above is denoted ProteinMPNN-C and is used for fair comparison.
ESM is trained on the CATH 4.3 training set with 16,153 structures and 12 million additional structures predicted by AlphaFold2.
Table 1 below provides median sequence recovery rate and nssr scores of the test results of the method according to the embodiments of the subject invention performed in combination with different base models for three benchmarks. The model according to the embodiments of the subject invention can obtain good performance with relatively poor base models and achieve the highest recovery and nssr with better base models.
The experiments are conducted on the following three benchmarks.
CATH: CATH 4.2 dataset is a standard dataset for IPF training and evaluation. Its test split of 1,120 structures is evaluated.
TS50: TS50 is a benchmark set of 50 protein chains proposed by [17]. It has been used by a number of previous works [15, 31, 32].
Latest PDB: The latest published structures in the PDB are collected as another benchmark to validate the model's ability to generalize to new structures. Protein structures released after Jan. 1, 2022 with a single chain of length less than 500 and resolution better than 2.5 Å are selected, resulting in 1,975 protein structures.
Two metrics are reported on all benchmarks: sequence recovery and native sequence similarity recovery (nssr) [33]. A pair of residues is considered similar and contributes to the nssr score if their BLOSUM62 score is greater than zero. Compared with recovery, where only residue identity is considered, nssr takes residue similarity into account and provides a more specific comparison between two sequences.
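A minimal sketch of the two metrics follows, assuming residue pairs are “similar” when their BLOSUM62 score is greater than zero. Only an illustrative subset of BLOSUM62 entries is embedded here; a real implementation would load the full 20x20 matrix (e.g., via Biopython's substitution matrices):

```python
# Illustrative subset of BLOSUM62 scores (symmetric); NOT the full matrix.
BLOSUM62_SUBSET = {
    ("I", "L"): 2, ("I", "V"): 3, ("K", "R"): 2,
    ("D", "E"): 2, ("S", "T"): 1, ("F", "W"): 1,
    ("A", "G"): 0, ("N", "D"): 1,
}

def blosum62(a, b):
    """Look up a BLOSUM62 score from the illustrative subset.

    Identity pairs are given a placeholder positive score of 4; actual
    BLOSUM62 diagonal entries range from 4 to 11 depending on residue.
    Unknown pairs default to -1 for this sketch.
    """
    if a == b:
        return 4
    return BLOSUM62_SUBSET.get((a, b), BLOSUM62_SUBSET.get((b, a), -1))

def recovery_and_nssr(native, designed):
    """Fraction of identical residues (recovery) and fraction of residue
    pairs with BLOSUM62 score > 0 (nssr)."""
    assert len(native) == len(designed)
    n = len(native)
    rec = sum(a == b for a, b in zip(native, designed)) / n
    nssr = sum(blosum62(a, b) > 0 for a, b in zip(native, designed)) / n
    return rec, nssr
```

For example, a designed sequence that replaces every native residue with a biochemically similar one scores low on recovery but high on nssr.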
In Table 1 above, the median recovery rates and nssr scores when using partial sequences from different base models are reported, and the results of the models according to the embodiments of the subject invention are compared with the results of the base models. Among all the base models, the ESM model achieves the best performance, highlighting the effectiveness of data augmentation. The method according to the embodiments of the subject invention, performed on different base models, consistently achieves high recovery rates and nssr scores even with relatively poor base models, demonstrating its ability to denoise input residues and recover native sequences from them. Additionally, when partial sequences with higher quality are available, the method according to the embodiments of the subject invention performed on different base models outperforms the state-of-the-art model that uses 12 million additional structures for augmentation. To validate the effectiveness of entropy-based masking in selecting high-quality residue predictions, the masking operation is removed and the entire predicted sequence is given to the model. The resulting recovery rate on the CATH test set is plotted in
Whether the designed sequences can fold into the target backbones is assessed by predicting their structures with AlphaFold2. The TM-score and root-mean-square deviation (RMSD) for the backbone heavy atoms are computed.
For partial sequence design, the models are evaluated on the following two benchmarks.
EnzBench: EnzBench is a standard sequence recovery benchmark comprising 51 proteins [36]. Design methods are required to recover the native residues on protein design shells with the other residues fixed. This benchmark is designed to test a method's ability to model protein binding and overall stability.
BR_EnzBench: BR_EnzBench [33] aims to test a method's ability to remodel the chosen protein structure. It randomly selects 16 proteins from the EnzBench benchmark and uses the Backrub server to create an ensemble of 20 near-native conformations for each protein. To further increase the design difficulty, all residues on the design shell are mutated to alanine, and the conformations are then energy-minimized. When evaluated on EnzBench and BR_EnzBench, the types of residues not on design shells are fixed and available to the models.
For partial sequence design, as base models are not required, the method is compared with the GVP-GNN model and the ProteinMPNN-C model, which are also trained on the same training set for fair comparison. Each model is provided with the same partial input sequence, and recovery rates and nssr scores for residues on design shells are reported as shown in Table 2. The model according to the embodiments of the subject invention achieves the highest recovery and nssr on both benchmarks. The recovery rates on EnzBench for different amino acids and secondary structures are further analyzed as shown in
Transposon-associated transposase B (TnpB) is considered to be an evolutionary precursor to CRISPR-Cas system effector protein [38]. TnpB (408 amino acids) in the Deinococcus radiodurans ISDra2 element has been demonstrated to function as a hypercompact programmable RNA-guided DNA endonuclease [39], and its miniature size is suitable for adeno-associated virus-based delivery. However, TnpB exhibits moderate gene editing activity in mammalian cells, limiting its therapeutic application.
The editing activity of TnpB is improved through the design of single-point mutations. Here, single-point mutant design is cast as a partial sequence design, with only one residue being designable and all others being fixed. With the empirical intuition that a more positively charged surface may potentially improve activity, the mutation target is restricted to the most positively charged amino acid, arginine (R), and the candidate mutation sites are restricted to surface residues. The model according to the embodiments of the subject invention is configured to compute a quality score for every candidate site, as illustrated in
The model according to the embodiments of the subject invention and ProteinMPNN-C are employed for mutant design following the above procedure. The 20 mutation points recommended by the two models are displayed in
The experiment demonstrates that the model according to the embodiments of the subject invention is effective at modeling residue interactions within a structural environment and generating sequences that best fit a given 3D context. Thus, it can be employed in combination with other property measures for redesigning existing proteins and improving their stability or other qualities that depend on protein stability.
Several key designs in the model according to the embodiments of the subject invention are investigated. For the global attention layers, it is found that the layers can learn and exploit meaningful residue interactions. Specifically, the residues that a target residue most attends to, that is, the residues with the highest attention score are examined. It is observed that many important inter-residue interactions are well learned and represented by the attention operation.
Further validation is performed on the effectiveness of two key components of the method according to the embodiments of the subject invention: the global attention method and the partial sequence input method. Accordingly, two ablated models are trained: (1) a model without the global attention view, where residues only attend to their neighbors on graphs and pseudo edge features are therefore not used; and (2) a model without partial sequence input during training, where all residue types are set to unknown. The recovery rates of these ablated models are compared with the full model on EnzBench and BR_EnzBench as shown in Table 3 above, and it is found that removing either component results in a drop in model performance. Notably, the model without partial sequence input exhibits a significantly larger performance degradation on BR_EnzBench, indicating that when input structures are not accurate, the ability to utilize sequence information becomes more important, whereas training without partial input inhibits the model from learning to encode and utilize context residues. The robustness of these models against input sequence noise is also tested in the entire sequence design setting. For sequences generated by the base model ESM, different percentages of the residues with the highest entropy are masked, and the models' median recovery rates on the CATH benchmark are plotted in
According to the embodiments of the subject invention, better modeling and learning of many-body interactions within protein structures are studied for inverse protein folding. The two-pronged approach incorporating a partial sequence input method and a memory-efficient global graph attention method is adopted, and the two components work jointly to achieve effective selection and utilization of high-quality inter-residue interactions. The experiment results demonstrate that the approach is able to capture meaningful inter-residue interactions and achieves state-of-the-art sequence recovery on several protein design benchmarks. The model can be applied to redesign TnpB, resulting in the successful discovery of six mutants with enhanced editing activity. The results demonstrate great potential for applying the model to protein design with improved functional properties.
A protein structure is represented as a proximity graph G = (V, E), where V = {v_1, v_2, ..., v_N} denotes the residue nodes and E = {e_ij} denotes the directed edges from v_j to v_i, where residue v_j is among the k = 30 nearest neighbors of v_i in terms of Cα distance. Each node v_i has the following features:
Each edge e_ij has the following features:
Then, two geometric vector perceptron (GVP) layers are employed to embed the extracted features.
The input partial sequence is embedded as node features. Masked residues with unknown residue type are treated as a special type “unknown”. The sequence node features are concatenated with the structural node features. The resulting node features are denoted H^0 ∈ R^{N×d}, where h_i^0 ∈ R^d denotes the feature of v_i. The resulting edge features are denoted E^0 ∈ R^{N×k×d}, where E_i^0 ∈ R^{k×d} contains the features of the k neighbors of v_i and e_ij^0 ∈ R^d denotes the feature of the edge from v_j to v_i.
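The k-nearest neighbor graph construction described above can be sketched as follows. This is a minimal illustration using Cα coordinates only; dihedral angle, orientation vector, and GVP feature extraction are omitted, and the function names are hypothetical:

```python
import math

def knn_graph(ca_coords, k=30):
    """Build directed edges (j -> i) connecting each residue i to its k
    nearest neighbors j, measured by C-alpha distance.

    ca_coords is a list of (x, y, z) C-alpha positions. Returns a list
    of (i, j) pairs meaning residue j is among the k nearest of i.
    """
    n = len(ca_coords)
    edges = []
    for i in range(n):
        # Sort all other residues by Euclidean distance to residue i.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(ca_coords[i], ca_coords[j]))
        for j in others[:k]:
            edges.append((i, j))
    return edges
```

With k = 30 as in the text, each node has at most 30 incoming edges, so the edge set grows linearly in the number of residues rather than quadratically.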
Attention was first introduced in the Transformer model [25]. Let H ∈ R^{N×d} denote the d-dimensional features of the input sequence of length N. The self-attention module updates the input features according to the following equations:
where W_K ∈ R^{d×d_K}, W_Q ∈ R^{d×d_K}, and W_V ∈ R^{d×d_V} are parameters that map H to keys, queries, and values.
The Transformer architecture may be employed for learning on graphs, with nodes treated as sequence tokens. To utilize the global view provided by the original self-attention on graphs with edge features, previous works generally incorporate edge features into the attention matrix A:
where E ∈ R^{N×N×d} is the d-dimensional edge feature tensor for each pair of nodes, ϕ estimates the correlations of node pairs from edge features, which may be a linear transformation [28] or a more sophisticated function [30], and f is an aggregation function, which may be element-wise addition [28, 30] or multiplication [28]. These conventional methods have two limitations, though. First, to construct the edge feature matrix E, fully connected graphs are required as input, and the memory complexity is therefore O(N^2). Second, the edge features are not fully leveraged: they are only involved in attention computation and cannot be used to update node features, or vice versa.
The model according to the embodiments of the subject invention comprises a stack of L memory-efficient global graph attention layers. In each layer, nodes can globally attend to all other nodes, and edge features between node pairs serve as additive attention bias terms. For non-existing edges, one solution is to convert arbitrary graphs to fully connected graphs before entering the encoder, after which Equation (4) may be applied. This can be done by setting k to a large enough number or by using a fixed masking value or vector for non-existing edges. However, this operation increases memory complexity from O(N×k) to O(N^2). To avoid the conversion, a learnable pseudo edge feature is adopted in each layer. Let H^l and E^l denote the input features of the l-th layer, with H^0 and E^0 being the inputs of the first layer. The attention is computed based on Equations (5)-(7) below:
where B^l is the attention bias, W_K^l ∈ R^{d×d}, W_Q^l ∈ R^{d×d}, and w_B^l ∈ R^d are parameters, and β^l ∈ R^d is the pseudo edge feature in layer l. Learning a pseudo edge feature for each layer is more adaptive and flexible than using one fixed masking value across all layers, providing a better approximation of fully connected graphs.
The attention score A^l is then utilized to aggregate the node features as well as the edge features. Node features are weighted and summed as in the vanilla Transformer. The edge features are aggregated with normalized weights and concatenated with the aggregated node features. Finally, a linear layer maps the concatenated feature to dimension d:
where W_V^l ∈ R^{d×d} and W_N^l ∈ R^{2d×d} are parameters, ∥ denotes the concatenation operation, and γ_i^l is a normalization term that normalizes the sum of the edge weights to 1.
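For illustration, the attention computation with a learnable pseudo edge feature can be sketched as follows. This is a single-head sketch under simplifying assumptions: the learned W_Q and W_K projections are omitted (node features stand in for queries and keys), the edge aggregation and output projection are not shown, and all names are hypothetical:

```python
import math

def global_attention_scores(h, edges, w_b, beta, edge_feats):
    """Row-wise softmax attention over ALL nodes, with an additive bias
    from edge features.

    h: list of node feature vectors (used directly as queries and keys).
    edges: set of (i, j) pairs that exist in the k-NN graph.
    edge_feats: dict mapping an existing (i, j) edge to its feature.
    w_b: bias projection vector; beta: the shared learnable pseudo edge
    feature substituted for every non-existing edge.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    d = len(h[0])
    scores = []
    for i in range(len(h)):
        # Raw score: scaled dot-product plus edge-feature bias; missing
        # edges fall back to the shared pseudo edge feature beta.
        row = []
        for j in range(len(h)):
            e = edge_feats[(i, j)] if (i, j) in edges else beta
            row.append(dot(h[i], h[j]) / math.sqrt(d) + dot(w_b, e))
        # Numerically stable softmax over all nodes j.
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        scores.append([x / z for x in exps])
    return scores
```

Because only the k real edge features plus one shared pseudo feature per layer are stored, the extra memory stays linear in N even though every node attends to every other node.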
Next, a residual connection and layer normalization are applied to output the final updated node features:
The edge features are then updated as follows, with W_E^l ∈ R^{3d×d} being parameters:
To leverage the edge features under the global attention method, the memory-efficient global graph attention needs only O(N×k+L) additional memory, compared with O(N^2) for previous works, allowing operation on longer sequences.
The output node features from the last layer are mapped to a distribution over the 20 residue types through a linear layer with parameter W_P ∈ R^{d×20}:
Negative log-likelihood loss is utilized during the training processes.
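The per-position training loss can be illustrated as follows, a minimal sketch of the negative log-likelihood of the target residue type under a softmax over the 20-way logits, computed in a numerically stable way:

```python
import math

def nll_loss(logits, target_idx):
    """Negative log-likelihood of the target residue type under a
    softmax over the logits produced by the final linear layer.

    Uses the log-sum-exp trick: subtracting the max logit before
    exponentiating avoids overflow without changing the result.
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    # NLL = log Z - logit of the target class.
    return log_z - logits[target_idx]
```

In training, this loss would be averaged over all residue positions of a sequence.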
Entire Sequence Design with Base Model
During training, residues are randomly sampled and their types are set visible to construct an input partial sequence. During inference in the entire sequence design setting, an entropy-based partial sequence construction method is employed. Let p_i^b be the probability distribution of v_i predicted by a base model; the entropy en_i^b of distribution p_i^b is computed by:
Residues with the lowest entropy are selected and retained, while the others are masked. The partial sequence is fed into the model to obtain the probability predictions p_i with entropies en_i. Finally, the predictions from the base model and the model according to the embodiments of the subject invention are weighted by their entropies and fused together:
The final predicted residue type is the argmax of the fused distribution p̂_i.
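For illustration only, one plausible instantiation of the entropy-weighted fusion is sketched below. The exact fusion equation is given by the referenced equations and is not reproduced here, so the inverse-entropy weighting used in this sketch is an assumption:

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

def fuse_predictions(p_base, p_model):
    """Fuse base-model and model distributions at one position.

    ASSUMPTION for this sketch: each distribution is weighted by the
    OTHER distribution's entropy, so the lower-entropy (more confident)
    prediction dominates the fused distribution.
    Returns the index of the final predicted residue type (the argmax).
    """
    en_b, en_m = entropy(p_base), entropy(p_model)
    w_b = en_m / (en_b + en_m) if (en_b + en_m) > 0 else 0.5
    w_m = 1.0 - w_b
    fused = [w_b * pb + w_m * pm for pb, pm in zip(p_base, p_model)]
    return max(range(len(fused)), key=fused.__getitem__)
```

Under this weighting, a confident model prediction can override an uncertain base-model prediction, and vice versa.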
Plasmid Vector Construction: The TnpB gene is optimized for expression in human cells through codon optimization, and the optimized sequence is synthesized for vector construction (Sangon Biotech, Shanghai, China). The final optimized sequence is inserted into a pST1374 vector, which contains a CMV promoter and nuclear localization signal sequences at both the 5′ and 3′ termini. The reRNA sequences are synthesized and cloned into a pGL3-U6 vector. Spacer sequences (EMX1: 5′-ctgtttctcaggatgtttgg-3′ (SEQ ID NO: 11)) are cloned into the vectors by digesting the vectors with BsaI restriction enzyme (New England BioLabs, Ipswich, MA) for 2 hours at 37° C. The resulting vector constructs are verified by Sanger sequencing to ensure accuracy.
TnpB Engineering: The construction of TnpB mutants is carried out by site-directed mutagenesis. PCR amplifications are performed using Phanta Max Super-Fidelity DNA Polymerase (Vazyme, Nanjing, Jiangsu, China). The PCR products are then ligated using 2× MultiF Seamless Assembly Mix (ABclonal, Woburn, MA). Next, ligated products are transformed into DH5α E. coli cells. The success of the mutagenesis is confirmed by Sanger sequencing. The modified plasmid vectors are purified using a TIANpure Midi Plasmid Kit (TIANGEN, Beijing, China).
Cell Culture and Transfection: HEK293T cells are maintained in Dulbecco's modified Eagle medium (Gibco, Waltham, MA) supplemented with 10% fetal bovine serum (Gemini, West Sacramento, California) and 1% penicillin-streptomycin (Gibco) in an incubator (37° C., 5% CO2). HEK293T cells are transfected at 80% confluency with a density of approximately 1×105 cells per 24-well using ExFect Transfection Reagent (Vazyme). For indel analysis, 500 ng of TnpB plasmid plus 500 ng of reRNA plasmid is transfected into 24-well cells.
DNA Extraction and Deep Sequencing: The genomic DNA of HEK293T cells is extracted using QuickExtract DNA Extraction Solution (Lucigen, Middleton, WI). Samples are incubated at 65° C. for 60 minutes and at 98° C. for 2 minutes. The lysate is used as a PCR template. The first round PCR (PCR1) is conducted with barcoded primers to amplify the genomic region of interest using Phanta Max Super-Fidelity DNA Polymerase (Vazyme). The products of PCR1 are pooled in equal moles and purified for the second round of PCR (PCR2). The PCR2 products are amplified using index primers (Vazyme) and purified by gel extraction for sequencing on the Illumina NovaSeq platform. The specific barcoded primers used in PCR1 are listed in Table 4.
All patents, patent applications, provisional applications, and publications referred to or cited herein are incorporated by reference in their entirety, including all figures and tables, to the extent they are not inconsistent with the explicit teachings of this specification.
It should be understood that the examples and embodiments described herein are for illustrative purposes only and that various modifications or changes in light thereof will be suggested to persons skilled in the art and are to be included within the spirit and purview of this application. In addition, any elements or limitations of any invention or embodiment thereof disclosed herein can be combined with any and/or all other elements or limitations (individually or in any combination) of any other invention or embodiment thereof disclosed herein, and all such combinations are contemplated within the scope of the invention without limitation thereto.