The present application contains a Sequence Listing which has been submitted in XML format via Patent Center and is hereby incorporated by reference in its entirety. Said XML file, created on Jan. 3, 2024, is named 217863-703301_xml and is 4,096 bytes in size.
High-throughput screening (HTS) technologies have transformed biomedical sciences, and many of these technologies have been effectively commercialized and impact clinical care. Most of these technologies identify cell components such as DNA, RNA, or protein species, and some assess intermolecular interactions. CRISPR/Cas9 and RNAi genome-wide screens can connect a gene with a cellular or organismal process. Pathways and networks are predicted from the resulting data, but these experiments only indicate a role for a gene during a cell process, and do not conclusively assess mechanism. Previous advances in assay or screening throughput include DNA and RNA sequencing, Y2H, phage display, microarrays, and the like.
There is no high-throughput assay to broadly assess molecular functions in the context of human or mammalian cells. Molecular functions are the key to understanding mechanism, disease etiology, and development of therapeutic drugs. Phage or yeast display, yeast 1- or 2-hybrid, DNA-encoded libraries (DELs), and affinity mass spectrometry assess one general type of function, molecular interactions, although these do not assess interactions in live mammalian cells. More recently, the use of lethality selection screens of mutant libraries in mammalian cells (called deep mutational scanning) has been growing rapidly as a way to identify loss- or gain-of-function mutants. For example, many cells with p53 mutants will not survive after two weeks of culture. The presence of variants sequenced before compared to after culturing can be used as a survival screen to identify positive mutants with some activity and to infer negatives for those mutants that do not survive the screening procedure.
High-throughput assay systems have had a disproportionally large impact on uncovering how cells function, as well as how misregulation can lead to disease. As such, there is a need in the art for high-throughput assay systems that can systematically address how mutations impact molecular functions or cell processes in human cells. The current invention addresses this need.
As described herein, the current invention relates to high-throughput methods of assaying cellular molecular function. As such, in one aspect, the invention provides a method of analyzing results of a high-throughput assay system for molecular functions and cell processes, the method comprising:
In certain embodiments, the gene of interest is a transcription factor.
In certain embodiments, the reporter element encodes a promoter element that drives expression of a cDNA encoding a fluorescent protein.
In certain embodiments, the fluorescent protein is GFP.
In certain embodiments, the variant calling step further comprises: performing Burrows-Wheeler indexing and alignment to a wild type reference sequence for each group of related barcodes to generate a sequence alignment and map (SAM) file; converting the SAM file to a binary alignment and map (BAM) file; sorting and indexing the contents of the BAM file; aligning the contents of the BAM file to generate alignments of the variants to the wildtype; and generating a variant call file from the alignments.
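The variant calling steps above can be sketched as an ordered list of command lines. This is a minimal illustration only: the disclosure does not name specific software, so the use of the `bwa`, `samtools`, and `bcftools` command-line tools, the file naming scheme, and the function name `variant_calling_commands` are all assumptions made for this example. The pileup step stands in for the step of aligning the BAM contents against the wild type.

```python
def variant_calling_commands(reference, reads, prefix):
    """Build the commands for one group of related barcodes.

    Assumed tools (not named in the disclosure): bwa, samtools, bcftools.
    `reference` is the wild type FASTA, `reads` is a FASTQ file for the
    barcode group, and `prefix` is a hypothetical output file prefix.
    """
    sam = f"{prefix}.sam"
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    pileup = f"{prefix}.pileup.bcf"
    vcf = f"{prefix}.vcf"
    return [
        # Burrows-Wheeler indexing of the wild type reference
        ["bwa", "index", reference],
        # Alignment to the reference, written as a SAM file
        ["bwa", "mem", "-o", sam, reference, reads],
        # SAM -> BAM conversion
        ["samtools", "view", "-b", "-o", bam, sam],
        # Sort, then index, the BAM contents
        ["samtools", "sort", "-o", sorted_bam, bam],
        ["samtools", "index", sorted_bam],
        # Pile up aligned reads against the wild type
        ["bcftools", "mpileup", "-f", reference, "-O", "u", "-o", pileup, sorted_bam],
        # Emit variant sites only, as a variant call file
        ["bcftools", "call", "-m", "-v", "-o", vcf, pileup],
    ]


if __name__ == "__main__":
    for command in variant_calling_commands("wt.fa", "group1.fq", "group1"):
        print(" ".join(command))
```

Each argument list could be passed to `subprocess.run(command, check=True)` in order; the sketch only prints the commands so that it runs without the tools installed.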
In certain embodiments, the barcode comprises 32 or more nucleotides.
In certain embodiments, the cells are mammalian cells. In certain embodiments, the cells are an immortalized cell line.
In certain embodiments, the cell sorting is accomplished by flow cytometry.
In certain embodiments, the population of cells comprises an average of hundreds of separately randomly barcoded cDNA molecules assigned to each cell group for each mutant.
In certain embodiments, each mutant has a read depth of about 2,000× to about 90,000× sequencing coverage.
In certain embodiments, the method of the above aspects or any aspect or embodiment disclosed herein further comprises validating the method by comparing the activity of a subset of mutants analyzed by the method to previously determined results.
In certain embodiments, the method of the above aspects or any aspect or embodiment disclosed herein further comprises validating the method by comparing true negatives as determined by the method to true negatives determined by an independent method.
In certain embodiments, the method of the above aspects or any aspect or embodiment disclosed herein further comprises validating the method by comparing the activity of a subset of mutants analyzed by the method to independent testing of a set of separate clones.
In certain embodiments, the method of the above aspects or any aspect or embodiment disclosed herein further comprises validating the method by comparing method results among different samples.
In certain embodiments, the method of the above aspects or any aspect or embodiment disclosed herein further comprises validating the method by comparing method results in two different cell lines.
In certain embodiments, the statistical significance of mutants analyzed in the method has a median p value of p < 9×10^-20.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention pertains. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, the preferred materials and methods are described herein. In describing and claiming the present invention, the following terminology will be used.
It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting. The articles “a” and “an” are used herein to refer to one or to more than one (i.e., to at least one) of the grammatical object of the article. By way of example, “an element” means one element or more than one element.
“About” as used herein when referring to a measurable value such as an amount, a temporal duration, and the like, is meant to encompass variations of ±20% or ±10%, more preferably ±5%, even more preferably ±1%, and still more preferably ±0.1% from the specified value, as such variations are appropriate to perform the disclosed methods.
“Effective amount” and “therapeutically effective amount” are used interchangeably herein, and refer to an amount of a compound, formulation, material, or composition, as described herein, effective to achieve a particular biological result or to provide a therapeutic or prophylactic benefit. Such results may include, but are not limited to, anti-tumor activity as determined by any means suitable in the art.
“Encoding” refers to the inherent property of specific sequences of nucleotides in a polynucleotide, such as a gene, a cDNA, or an mRNA, to serve as templates for synthesis of other polymers and macromolecules in biological processes having either a defined sequence of nucleotides (i.e., rRNA, tRNA and mRNA) or a defined sequence of amino acids and the biological properties resulting therefrom. Thus, a gene encodes a protein if transcription and translation of mRNA corresponding to that gene produces the protein in a cell or other biological system. Both the coding strand, the nucleotide sequence of which is identical to the mRNA sequence and is usually provided in sequence listings, and the non-coding strand, used as the template for transcription of a gene or cDNA, can be referred to as encoding the protein or other product of that gene or cDNA.
As used herein “endogenous” refers to any material from or produced inside an organism, cell, tissue or system.
As used herein, the term “exogenous” refers to any material introduced from or produced outside an organism, cell, tissue or system.
The term “expression” as used herein is defined as the transcription and/or translation of a particular nucleotide sequence driven by its promoter.
“Expression vector” refers to a vector comprising a recombinant polynucleotide comprising expression control sequences operatively linked to a nucleotide sequence to be expressed. An expression vector comprises sufficient cis-acting elements for expression; other elements for expression can be supplied by the host cell or in an in vitro expression system. Expression vectors include all those known in the art, such as cosmids, plasmids (e.g., naked or contained in liposomes) and viruses (e.g., Sendai viruses, lentiviruses, retroviruses, adenoviruses, and adeno-associated viruses) that incorporate the recombinant polynucleotide.
“Homologous” as used herein, refers to the subunit sequence identity between two polymeric molecules, e.g., between two nucleic acid molecules, such as two DNA molecules or two RNA molecules, or between two polypeptide molecules. When a subunit position in both of the two molecules is occupied by the same monomeric subunit, e.g., if a position in each of two DNA molecules is occupied by adenine, then they are homologous at that position. The homology between two sequences is a direct function of the number of matching or homologous positions, e.g., if half (e.g., five positions in a polymer ten subunits in length) of the positions in two sequences are homologous, the two sequences are 50% homologous; if 90% of the positions (e.g., 9 of 10) are matched or homologous, the two sequences are 90% homologous.
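The percent homology arithmetic in this definition can be illustrated with a short sketch. The function name `percent_homology` and the example sequences are hypothetical, and the sketch assumes two already-aligned sequences of equal length compared position by position.

```python
def percent_homology(seq_a, seq_b):
    """Percentage of positions occupied by the same monomeric subunit.

    Assumes the two sequences are already aligned and of equal length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


if __name__ == "__main__":
    # Five of ten positions match: 50% homologous
    print(percent_homology("AAAAAGGGGG", "AAAAACCCCC"))
    # Nine of ten positions match: 90% homologous
    print(percent_homology("ACGTACGTAC", "ACGTACGTAA"))
```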
In the context of the present invention, the following abbreviations for the commonly occurring nucleic acid bases are used. “A” refers to adenosine, “C” refers to cytosine, “G” refers to guanosine, “T” refers to thymidine, and “U” refers to uridine.
Unless otherwise specified, a “nucleotide sequence encoding an amino acid sequence” includes all nucleotide sequences that are degenerate versions of each other and that encode the same amino acid sequence. The phrase nucleotide sequence that encodes a protein or an RNA may also include introns to the extent that the nucleotide sequence encoding the protein may in some version contain an intron(s).
The term “operably linked” refers to functional linkage between a regulatory sequence and a heterologous nucleic acid sequence resulting in expression of the latter. For example, a first nucleic acid sequence is operably linked with a second nucleic acid sequence when the first nucleic acid sequence is placed in a functional relationship with the second nucleic acid sequence. For instance, a promoter is operably linked to a coding sequence if the promoter affects the transcription or expression of the coding sequence. Generally, operably linked DNA sequences are contiguous and, where necessary to join two protein coding regions, in the same reading frame.
The term “polynucleotide” as used herein is defined as a chain of nucleotides. Furthermore, nucleic acids are polymers of nucleotides. Thus, nucleic acids and polynucleotides as used herein are interchangeable. One skilled in the art has the general knowledge that nucleic acids are polynucleotides, which can be hydrolyzed into the monomeric “nucleotides.” The monomeric nucleotides can be hydrolyzed into nucleosides. As used herein polynucleotides include, but are not limited to, all nucleic acid sequences which are obtained by any means available in the art, including, without limitation, recombinant means, i.e., the cloning of nucleic acid sequences from a recombinant library or a cell genome, using ordinary cloning technology and PCR™, and the like, and by synthetic means.
As used herein, the terms “peptide,” “polypeptide,” and “protein” are used interchangeably, and refer to a compound comprised of amino acid residues covalently linked by peptide bonds. A protein or peptide must contain at least two amino acids, and no limitation is placed on the maximum number of amino acids that can comprise a protein's or peptide's sequence. Polypeptides include any peptide or protein comprising two or more amino acids joined to each other by peptide bonds. As used herein, the term refers to both short chains, which also commonly are referred to in the art as peptides, oligopeptides and oligomers, for example, and to longer chains, which generally are referred to in the art as proteins, of which there are many types. “Polypeptides” include, for example, biologically active fragments, substantially homologous polypeptides, oligopeptides, homodimers, heterodimers, variants of polypeptides, modified polypeptides, derivatives, analogs, fusion proteins, among others. The polypeptides include natural peptides, recombinant peptides, synthetic peptides, or a combination thereof.
The term “promoter” as used herein is defined as a DNA sequence recognized by the synthetic machinery of the cell, or introduced synthetic machinery, required to initiate the specific transcription of a polynucleotide sequence.
As used herein, the term “promoter/regulatory sequence” means a nucleic acid sequence which is required for expression of a gene product operably linked to the promoter/regulatory sequence. In some instances, this sequence may be the core promoter sequence and in other instances, this sequence may also include an enhancer sequence and other regulatory elements which are required for expression of the gene product. The promoter/regulatory sequence may, for example, be one which expresses the gene product in a tissue specific manner.
A “constitutive” promoter is a nucleotide sequence which, when operably linked with a polynucleotide which encodes or specifies a gene product, causes the gene product to be produced in a cell under most or all physiological conditions of the cell.
An “inducible” promoter is a nucleotide sequence which, when operably linked with a polynucleotide which encodes or specifies a gene product, causes the gene product to be produced in a cell substantially only when an inducer which corresponds to the promoter is present in the cell.
The term “SAM file”, as used herein, refers to a type of text file format that contains the alignment information of one or more nucleotide or protein sequences that are mapped against one or more reference sequences. These files can also contain unmapped sequences.
The term “BAM file”, as used herein, refers to a file containing alignment information of various nucleotide or protein sequences that are mapped against one or more reference sequences in a binary file format. BAM files are smaller and more efficient for software to work with than SAM files, saving time and reducing costs of computation and storage.
The term “subject” is intended to include living organisms in which an immune response can be elicited (e.g., mammals). A “subject” or “patient,” as used herein, may be a human or non-human mammal. Non-human mammals include, for example, livestock and pets, such as ovine, bovine, porcine, canine, feline and murine mammals. Preferably, the subject is human.
The phrase “under transcriptional control” or “operatively linked” as used herein means that the promoter is in the correct location and orientation in relation to a polynucleotide to control the initiation of transcription by RNA polymerase and expression of the polynucleotide.
A “vector” is a composition of matter which comprises an isolated nucleic acid and which can be used to deliver the isolated nucleic acid to the interior of a cell. Numerous vectors are known in the art including, but not limited to, linear polynucleotides, polynucleotides associated with ionic or amphiphilic compounds, plasmids, and viruses. Thus, the term “vector” includes an autonomously replicating plasmid or a virus. The term should also be construed to include non-plasmid and non-viral compounds which facilitate transfer of nucleic acid into cells, such as, for example, polylysine compounds, liposomes, and the like. Examples of viral vectors include, but are not limited to, Sendai viral vectors, adenoviral vectors, adeno-associated virus vectors, retroviral vectors, lentiviral vectors, and the like.
Ranges: throughout this disclosure, various aspects of the invention can be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 2.7, 3, 4, 5, 5.3, and 6. This applies regardless of the breadth of the range.
This application incorporates by reference Applicant's prior U.S. provisional patent application entitled METHODS FOR IDENTIFICATION AND CHARACTERIZATION OF SARS-CoV-2 VARIANTS, application No. 63/167,702, filed Mar. 30, 2021, and prior U.S. non provisional patent application entitled COMPOSITIONS AND METHODS FOR STUDYING THE TAT GENE, application Ser. No. 16/318,965, filed Jul. 14, 2017, provided that if any of these prior applications or patents are in any way inconsistent with the present application (including without limitation any limiting aspects), the present application will prevail.
As described herein, the current invention relates to high-throughput methods of assaying cellular molecular function. In certain embodiments, the invention discloses the use of said high-throughput screening method by screening an HIV Tat variant library (driver), using an LTR-GFP (reporter), and thereby producing a detailed functional protein map.
The high-throughput cellular molecular function assay system and method has several advantages over existing methods. The method directly assays the function of the protein. Unlike deep mutational scanning (DMS) screens, all mutations or nearly all mutations are directly measured in the high-throughput cellular molecular function assay system and method and not inferred from differences between pre-screen and screened samples. The high-throughput cellular molecular function assay of >561,000 individually barcoded mutants in living cells is an enhancement of at least four orders of magnitude over routine low-throughput cell-based assays. For example, six Pro residues in PxxP motifs are used to measure the impact of these motifs on guanine nucleotide exchange factor activity. Because of the barcoding of individual molecules and the high throughput at the single-cell assay level, the high-throughput cellular molecular function assay system and method allows for signal averaging of large numbers of individual measurements, yielding robust reproducibility, high accuracy, and a statistic for the reliability of the activity of each mutant. Furthermore, all mutants are assayed under standardized conditions in the same cells with the same genetic background, producing a consistency that is not seen in the literature, where mutants are often studied by multiple labs with separate assay systems and conditions.
In certain embodiments of the present disclosure, the saturating mutagenesis map of Tat not only reveals shortcomings in routine interpretation of mutagenesis data, but the results of the high-throughput cellular molecular function assay system and method also create a context to improve interpretation with a new structure/function/tolerance approach. Herein, alanine scanning identifies key positions, but lacks sensitivity and misses the importance of some positions. New visualization methods of plotting saturation mutagenesis heatmap data with other structure-function data, as well as physicochemical MCC surface plots, enable new opportunities for visual interpretation, which will facilitate a holistic investigation of molecular functions. For example, the missense mutation analysis compared to truncation mutants shows differences that can be explained by a nuclear localization signal (NLS). This reveals previously unrecognized limitations of truncation mutant studies. The context also produces a framework to identify ambiguities and potential misinterpretation of mutant data regarding structure, PTMs, and PPIs, especially where sites overlap with each other, as is the case for many regions of Tat.
In certain embodiments, the high-throughput cellular molecular function assay system and method of the present disclosure identifies a pattern in Tat where PPIs and PTMs are selected for robustness, while key structural elements or substitutions (e.g. proline) that impact structure are not as well-tolerated. In certain embodiments, the high-throughput cellular molecular function assay system and method of the present disclosure can be applied to other genes and assays to address other important questions in biomedical sciences.
To address this challenge, a high-throughput cellular molecular function assay system and method is disclosed. In some embodiments, the high-throughput cellular molecular function assay system and method is a single cell/one-pot assay system adapted to study how variants impact HIV Tat-driven transactivation of a green fluorescent protein (GFP) reporter. In one usage, the high-throughput cellular molecular function assay system and method is employed to assay all 1,615 Tat single amino acid substitutions with no mutant dropout. Each mutant is assayed with replicate observations in LentiX293T and Jurkat cells with an average of hundreds of separately barcoded cDNA molecules and cell groups for each mutant. Each mutant had ~2,000× to 90,000× sequencing coverage to measure its transcriptional activity and had a median p value of 10^-20, ranging as low as 10^-124. Five independent assay performance assessments with benchmark data, individually tested clones, and replicate comparisons all indicate exceptional reproducibility, accuracy, and robustness. The shortcomings of alanine scanning mutagenesis and protein truncation studies are revealed by adding exhaustive substitution tolerance to the typical structure/function analysis (now structure/function/tolerance). The high-throughput cellular molecular function assay system and method is flexible and extensible and allows for a more comprehensive, holistic view of protein molecular function with a highly simplified single-pot assay.
In some embodiments, a high-throughput cellular molecular function assay system and method is a mammalian cell based one-pot single cell assay where one or more variant DNA molecules are individually barcoded, assayed in cells with a fluorescent readout, flow sorted into pools, deep sequenced, and then the impact of each variant on activity is bioinformatically deconvolved. The high-throughput cellular molecular function assay system and method is not a survival screen where the negatives are not directly measured. The high-throughput cellular molecular function assay system and method measures both positives and negatives for millions of individually barcoded DNA molecules producing a highly accurate and reproducible assay. The high-throughput cellular molecular function assay system and method has several other advantages over previously developed assays of cellular functions. It is flexible, readily adapted to many cell process and molecular function assays in the context of living mammalian cells. It is a high-throughput assay capable of measuring tens of thousands of reads for each of millions of individually barcoded variant DNAs, where different genotypes are pooled for each amino acid substitution, thus reliable statistical probabilities and metrics can be calculated to determine the reliability of each measurement. Because of, at least in part, the high throughput, high reproducibility among samples, and reproducibility in different cell lines, the results are highly accurate with reliable statistics. Each DNA variant molecule can be tracked through the plasmid, viral, cell libraries, and then through cell pool separation by flow cytometry.
In some embodiments, the HIV Tat transactivation of long terminal repeat-driven green fluorescent protein (GFP) expression in LentiX293T/LTR-GFP reporter cells serves as a model system for the high-throughput cellular molecular function assay system and method. LTR is the long terminal repeat in the HIV genome. This system has the advantages of an established robust reproducible assay and abundant benchmark data is published for performance assessment. Furthermore, Tat is a small gene that is suitable for assay development and of pathological significance for HIV infection and exit from latency.
The current high-throughput cellular molecular function assay system and method approach for Tat transactivation has multiple steps. In some instances, induction of the reporter by the Tat transgene is compared to empty vector and an inactivating mutation as controls for basal reporter expression (e.g., see
In the cellular molecular function assay system and method of the present disclosure, a high-throughput cellular molecular function assay cassette encodes constitutively expressed Tat translated from a barcoded mRNA (e.g., see
To quantitate the cells and set thresholds for sorting cells expressing mutant Tat cDNAs, the method can include evaluating the same samples by flow cytometry. Cells with empty vector or the C27S mutant have low GFP expression that is not different from a control cell lacking the reporter systems, whereas cells expressing the reporter system with wtTat show strong GFP expression (e.g., see
The Tat protein sequence and heatmap (e.g., see
Five independent validation tests demonstrate that the high-throughput cellular molecular function assay system and method has very high reproducibility and accuracy (e.g., see
A second validation can be based upon an independent source of true negatives measured in the high-throughput cellular molecular function assay system and method. As true-negative, loss-of-function (LOF) variants, Tat nonsense variants encoding stop codons (n=61), where each variant had more than 2.5 rpm, can be examined. Tat exon 1, encoding its first 58 amino acids, is, in some instances, the minimal element required for Tat transactivation activity. Nonsense variants with stop codons located less than 58 amino acids from the start Met that have reduced activity are considered true negatives, while those 58 amino acids or longer are considered true positives with wild type activity. Considering these mutant activities, the results can produce high-throughput cellular molecular function assay system and method performance statistics of: accuracy, 1.0; sensitivity, 1.0; specificity, 1.0; PPV, 1.0; and NPV, 1.0 (e.g., see
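The performance statistics named in this validation follow from a standard confusion matrix. The sketch below shows the arithmetic under stated assumptions: the function name `performance_statistics` is hypothetical, and the counts in the usage example are made up for illustration and are not data from the disclosure.

```python
def performance_statistics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, PPV and NPV from confusion counts.

    tp/tn/fp/fn are counts of true positives, true negatives,
    false positives and false negatives. A ratio with a zero
    denominator is reported as None rather than guessed.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "ppv": ratio(tp, tp + fp),           # positive predictive value
        "npv": ratio(tn, tn + fn),           # negative predictive value
    }


if __name__ == "__main__":
    # Illustrative counts only: a perfect classification gives 1.0 throughout.
    print(performance_statistics(tp=10, tn=10, fp=0, fn=0))
    # Illustrative counts only: an imperfect classification for contrast.
    print(performance_statistics(tp=8, tn=9, fp=1, fn=2))
```

With no false positives and no false negatives, every statistic evaluates to 1.0, which is the pattern reported for the nonsense variants above.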
A third validation can include independent testing of a set of separate clones. Prior to the experiment, 18 Tat mutants can be randomly selected, stable LentiX293T/LTR-GFP cell lines expressing these mutants can be made, and transcriptional activation of the LTR-GFP reporter can be measured by flow cytometry. These results can be blinded until the high-throughput cellular molecular function assay system and method is complete and then compared to the high-throughput cellular molecular function assay system and method results (e.g., see
A fourth validation approach can assess the reproducibility of high-throughput cellular molecular function assay system and method results among different samples. The LentiX293/LTR-GFP cells are transduced, selected, flow sorted, sequenced, and analyzed separately in duplicate. The global standard deviation for Tat mutant activities between technical duplicates can be very low (SD=0.02), with 1.0 being wild type activity. Mutant activities for sample replicates for the LentiX293 cells can have a very high correlation (R^2=0.99), indicating high reproducibility (e.g., see
A fifth validation can compare high-throughput cellular molecular function assay biological replicates in two different cell lines (LentiX293/LTR-GFP and Jurkat/LTR-GFP cells). The high-throughput cellular molecular function assay experiment can be repeated in Jurkat/LTR-GFP cells, again with duplicate samples. Similar results for the performance statistics, reproducibility, and mutant activities may be observed (e.g., see
The high reproducibility is achieved from the experimental design, where each individual variant cDNA has a separate random barcode which is tracked through the experiment. In some instances, during the high-throughput cellular molecular function assay system and method, each cDNA is individually barcoded and, after transduction of recombinant viruses, each cell is barcoded. During selection, these cells divide, forming clonal barcoded cell groups. In some instances, for the different samples and cell lines there can be ~561,000 barcoded cell groups after condensing for sequencing errors in barcodes. Each mutant in each replicate sample can have an average of 94 independent barcodes, with scatter plots showing a high correlation for each replicate sample in each cell line and a heatmap for the number of barcodes for each mutant shown in (e.g., see
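One way to picture "condensing for sequencing errors in barcodes" is to fold each low-count barcode into a more abundant barcode that differs from it at only a few positions. The disclosure does not state which condensing procedure is used, so the greedy Hamming-distance rule, the one-mismatch default, the function names, and the toy barcodes below are all assumptions made for illustration.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def condense_barcodes(counts, max_mismatches=1):
    """Greedily merge barcodes into more abundant near-identical barcodes.

    `counts` maps barcode sequence -> read count. Barcodes are visited
    from most to least abundant; each one is merged into the first kept
    barcode within `max_mismatches`, otherwise it is kept as a new group.
    This pairwise scan is quadratic, so it only suits small examples.
    """
    condensed = {}
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    for barcode, n in ordered:
        for parent in condensed:
            if len(parent) == len(barcode) and hamming(parent, barcode) <= max_mismatches:
                condensed[parent] += n
                break
        else:
            condensed[barcode] = n
    return condensed


if __name__ == "__main__":
    # Toy 8-nt barcodes for readability; not barcodes from the disclosure.
    toy = {"ACGTACGT": 100, "ACGTACGA": 3, "TTTTCCCC": 50, "TTTTCCCG": 1}
    print(condense_barcodes(toy))
```

In the toy input, the two single-mismatch barcodes are absorbed into their abundant neighbors, leaving two barcoded groups with counts 103 and 51.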
The transcriptional activities for each barcoded cell group with the same mutation are averaged and used to calculate statistics. The global standard deviation for the barcoded cell groups is 0.25. While this is considerably larger than the replicate sample standard deviations, there may be multiple factors contributing to this variance such as, for example, random chromosomal lentiviral insertion sites that impact expression, cells in different phases of the cell cycle, and other types of errors when such larger datasets are analyzed at the level of individual molecules.
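The averaging step described above can be sketched as a per-mutant summary over barcoded cell groups. This is an illustration under stated assumptions: the disclosure does not specify which statistics or tests are computed, so the choice of mean, sample standard deviation, and standard error is an assumption, as are the function name `summarize_by_mutant`, the mutant label "X1Y", and the activity values.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def summarize_by_mutant(observations):
    """Average barcoded cell group activities for each mutation.

    `observations` is an iterable of (mutant, activity) pairs, one pair
    per barcoded cell group, with 1.0 representing wild type activity.
    """
    groups = defaultdict(list)
    for mutant, activity in observations:
        groups[mutant].append(activity)

    summary = {}
    for mutant, values in groups.items():
        n = len(values)
        # Sample standard deviation needs at least two barcoded groups.
        sd = stdev(values) if n > 1 else float("nan")
        summary[mutant] = {
            "n": n,
            "mean": mean(values),
            "sd": sd,
            "sem": sd / sqrt(n),  # standard error of the mean
        }
    return summary


if __name__ == "__main__":
    # Made-up activities for illustration only.
    example = [
        ("C27S", 0.00), ("C27S", 0.10), ("C27S", 0.05),
        ("X1Y", 1.0), ("X1Y", 1.2), ("X1Y", 0.8),
    ]
    for mutant, stats in summarize_by_mutant(example).items():
        print(mutant, stats)
```

Because the standard error shrinks with the square root of the number of barcoded groups, a per-group standard deviation much larger than the replicate-level one is still compatible with tight per-mutant averages.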
Given the breadth of data produced in this high-throughput cellular molecular function assay system and method experiment, one can reliably report metrics of confidence for the activity of each mutant. The p value for each mutant in each cell line, along with the distributions of p value frequencies, is shown in
Each mutant's activity was then compared to that reported for sets of true positive mutants with established wild type activity and sets of true negative mutants with greatly reduced activity (e.g., see
The saturation mutagenesis profile allows for an improved interpretation of how mutation tolerance relates to the effects of secondary structure, post-translational modifications (PTMs), and protein-protein interactions (PPIs) on Tat activity. Based on the saturating mutagenesis experiment and its interpretations, structure/function analysis is expanded to structure/function/tolerance, with the latter term added to reflect which amino acids can be tolerated while preserving structure/function.
How the secondary structure at each Tat position relates to mutation tolerance can be examined. Tat is mostly random coil with some helix and turns. Mutations were well tolerated in the first turn, but not in the second and third turns (e.g., see
Tat has known binding sites for about 20 other proteins (e.g., see
The activities of 78 truncation mutants are measured from nonsense mutants across both cell lines. The mutants are introduced during oligonucleotide synthesis (e.g., see
However, almost any missense mutation is tolerated at S46-E86, which is not consistent with the truncation mutants tolerated up to R57. This region contains a nuclear localization motif and binding sites for Importin α, p53, and EGR. The most likely explanation is that truncation of the nuclear localization signal blocks localization of Tat to the nucleus and its transcriptional activity. In the presence of any single point mutation, Tat is still localized to the nucleus. This is supported by high-throughput cellular molecular function assay system and method results showing that mutation of all positively charged residues in the nuclear localization motif to negatively charged residues reduces, but does not block, the transcriptional activity of Tat. Indeed, in papers that have examined the NLS, only double mutants in this region block nuclear localization, and the peptide is sufficient to localize other proteins to the nucleus.
The impact of mutation on activity by mapping transcriptional activity levels onto the 3D structure of Tat can be examined, which can be compared to the residue spatial positions, regions, and secondary structure of Tat (e.g, see
In some instances, a different approach that better summarizes the additional information gained from saturation mutagenesis, is to codify by separate Mathews Correlation Coefficients (MCC)s for groups with similar side chain physiochemical properties (small aliphatic, large hydrophobic, polar noncharged or charged, negatively charged, positively charged). This approach better segregates substitution tolerance for each position, whereas alanine scanning does not have this level of granularity. The heatmap with accuracy for classes of substitutions (e.g, see
Surface plots of MCC heatmaps for different physiochemical properties show that there is little specificity for amino acid tolerance over most of the protein including residues S46-E86 which tolerate nearly all substitutions (e.g., see
Likewise, side chain volume gradient indicates tolerance of amino acid substitutions for a sidechain volume (e.g., see
The invention is now described with reference to the following Examples. These Examples are provided for the purpose of illustration only, and the invention is not limited to these Examples, but rather encompasses all variations that are evident as a result of the teachings provided herein.
The materials and methods employed in these experiments are now described.
The plasmid pLjm1_mcs is made by introducing compatible EcoRI, SalI, and AsiSI restriction enzyme sites into the pLjm1-Empty (Addgene) vector for cloning of the Tat variant library. Tat or mutant Tat encoding a C27S mutation is PCR amplified from pNL4-3 as a template with Q5® High-Fidelity DNA Polymerase (New England Biolabs) and cloned into EcoRI/SalI digested pLjm1_mcs. For generating a LentiX293T/LTR-GFP reporter cell line, a plasmid harboring LTR-GFP and blasticidin S resistance is constructed. The LTR-GFP cassette and blasticidin S resistance (bsr) gene are amplified by PCR with pNL4-3, pEGFP and LentiCRISPR-v2 Blast as templates. LTR, GFP, and bsr amplicons are fused by inverse PCR using Q5® High-Fidelity DNA Polymerase. The fused amplicons are cloned into pAAVS1-Puro-DNR (Origene) previously digested with SpeI and EcoRI.
A double stranded (ds) DNA library containing HIV-1 Tat cDNAs with sequences for all possible single amino acid mutants (n=1634 Tat mutants) is synthesized by Twist Bioscience. The dsDNA from each well of the 96-well plates is pooled, and a single round of overlap extension PCR appends random 32-mer oligonucleotides to the 3′ untranslated region. The synthesized dsDNA library has a 3′-overhang sequence after the stop codon that overlaps with the 5′ overhang sequence upstream of the random 32-mer oligonucleotide sequence. The pooled dsDNA library and the random oligomer are mixed in a 1:10 molar ratio, denatured, and annealed. Hybridized DNA is extended with Q5® High-Fidelity DNA Polymerase (New England Biolabs) for one cycle of PCR. The 50 µl PCR reaction mix is then treated with 2 µl of Exonuclease I (New England Biolabs) and incubated at 37° C. for 15 min, and the DNA is purified with a PCR cleanup kit (Macherey-Nagel).
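The stated library size can be sanity-checked by enumeration. The following is a minimal sketch only: the function name and the toy peptide are hypothetical, and the 86-residue length is an assumption inferred from the E86 position named in this description. Under that assumption, 86 positions with 19 alternative residues each give 1634 single amino acid mutants.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def single_substitutions(protein: str) -> list[str]:
    """Return labels such as 'C27S' for every single-residue substitution."""
    variants = []
    for position, wild_type in enumerate(protein, start=1):
        for replacement in AMINO_ACIDS:
            if replacement != wild_type:
                variants.append(f"{wild_type}{position}{replacement}")
    return variants


# Toy 4-residue peptide (hypothetical, not the Tat sequence).
toy_variants = single_substitutions("MEPV")
print(len(toy_variants))  # 4 positions x 19 substitutions each

# Assumed 86-residue protein: number of single amino acid mutants.
print(86 * 19)
```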
The purified DNA is digested with EcoRI-HF (New England Biolabs) and AsiSI (New England Biolabs) for 3 h at 37° C. and ligated into EcoRI-HF/AsiSI digested pLjm1_mcs plasmid (molar ratio vector:insert=1:3) with ElectroLigase (New England Biolabs). Ligation reactions (12) are pooled, purified with a PCR cleanup kit, and drop dialyzed on an MF-Millipore® Membrane Filter, 0.025 µm pore size (Millipore Sigma). The purified ligation reaction mixture is electroporated into E. cloni 10G ELITE electrocompetent cells (Lucigen), plated on prewarmed LB ampicillin plates, and incubated for 18 h at 37° C. Transformants are scraped, and the plasmid library is isolated from the pooled cell suspension using an EndoFree Plasmid Mega kit (Qiagen).
Lentiviral libraries are produced in LentiX293T cells (Takara). Approximately 3 million LentiX293T cells are seeded in a 100 mm petri dish and grown in 10 ml complete DMEM media [(DMEM+10% Fetal Calf serum), Gibco] for 24 h. Plasmids pLjm1 Twist Tat Library (8.5 µg), pMDLG/pRRE (7.6 µg), pRSV/pRev (4.0 µg), and pMD2.G (4.0 µg) are diluted to a final volume of 613 µl in a 15 ml conical tube. CaCl2 (87 µl of 2 M) is added to the plasmid mixture. 2×HBS (700 µl) is added dropwise to the above transfection mix with gentle stirring in a circular motion.
The transfection mix is incubated for 15 min and added dropwise to the cells in a 100 mm petri dish. The cells are incubated for 12 h in a CO2 incubator at 37° C. with a 5% CO2 atmosphere. Post-transfection (12 h), the calcium phosphate-containing medium is replaced with 7 ml complete media (DMEM+10% FBS) and the cells are incubated for 48 h in a CO2 incubator at 37° C. with a 5% CO2 atmosphere. Spent media from confluent transfected LentiX293T cells is filtered through a 0.45 µm Uniflow syringe filter (Cytiva Whatman). Aliquots of the filtered spent media with the lentivirus (100 µl to 5 ml) are stored at −80° C.
Lentiviral vectors for specific clones are produced in LentiX293T cells. Briefly, 0.6 million LentiX293T cells are seeded in a well of a 6-well plate. After 24 hours, cells are co-transfected with pLjm1-mcs, pLjm1-Tat, or pLjm1-TatC27S (1 µg); pMDLG/pRRE (Addgene, 1.0 µg); and pRsv-Rev and pMD2.G (Addgene, 0.5 µg) using Lipofectamine LTX (Invitrogen) at a 1:3 ratio [DNA (µg):transfection reagent (µl)]. After 6 h of incubation, media is replaced, and cells are cultured in complete media for an additional 48 h. Cell supernatants are collected, filtered through a 0.45 µm syringe filter (Millipore), and stored in a −80° C. freezer.
Lentiviruses are titered by seeding 10,000 cells/well in a 96-well plate and culturing in 200 µl of complete DMEM media (DMEM+10% FBS). After 24 h, the majority of the spent media is removed from the wells, and 100 µl of serial dilutions of lentivirus are added and incubated for 4 h. Complete DMEM media (100 µl) is added and incubated for 24 h. Spent media (100 µl) is removed, replaced with DMEM media containing puromycin (Invitrogen, 1.5 µg/ml final concentration), and incubated for 96-120 h. The cells are inspected for viability under the microscope and colonies are counted to calculate the infectious units/ml.
LentiX293T cells (0.6 million) are seeded in the well of a 6-well plate and grown in 3 ml of complete DMEM media. A GFP reporter plasmid (1.5 µg) carrying LTR-GFP and the blasticidin S-resistance (BSR) gene is transfected into LentiX293T cells and incubated for 48 h. Transfected cells are selected for blasticidin S resistance [(5 µg/ml), Invitrogen] for 14 days, exchanging the DMEM media containing blasticidin S every 3 days. Cells are trypsinized and 100,000 cells are serially diluted in 96-well plates. After 14 days of incubation, single colonies are screened after expansion.
For confirming lentiviral integration, gDNA is isolated, and Tat amplicons are subcloned and sequenced. Tat transcriptional activity is measured in a subculture of each clonal cell line. Cells cultured in a 96-well plate are transfected with 50 ng of wtTat expression vector and cultured for 48 h. Transactivation-induced GFP expression is evaluated by Nikon TE2000E epifluorescence microscopy. The clonal reporter cell lines are propagated and stored at −80° C.
LentiX293T/LTR-GFP cells (33 million) are transduced with the Tat variant lentiviral library at a multiplicity of infection (MOI) of 0.1. After 24 h of infection, cells are cultured and maintained in complete DMEM media supplemented with puromycin (1.5 µg/ml). After 5 days, confluent cells are harvested, counted, and washed once with 1X PBS before fixing and isolating gDNA for NGS of the Tat amplicon.
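The choice of a low MOI can be illustrated with a short calculation. The sketch below assumes that the number of integrations per cell follows a Poisson distribution with mean equal to the MOI, which is a common modeling assumption and is not stated in this description; the function name is hypothetical.

```python
import math


def poisson_integration_fractions(moi: float) -> dict[str, float]:
    """Fractions of cells with 0, 1, or more than 1 integration (Poisson model)."""
    none = math.exp(-moi)
    single = moi * math.exp(-moi)
    return {"none": none, "single": single, "multiple": 1.0 - none - single}


fractions = poisson_integration_fractions(0.1)
transduced = 1.0 - fractions["none"]
# Share of transduced cells that carry exactly one variant.
single_among_transduced = fractions["single"] / transduced
# Expected number of transduced cells when 33 million cells are exposed.
expected_transduced_cells = 33_000_000 * transduced
print(round(single_among_transduced, 3), round(expected_transduced_cells))
```

Under this model, roughly 95% of the transduced cells at an MOI of 0.1 would carry a single variant, which is what allows a barcode to be linked to one mutant per cell.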
Jurkat/LTR-GFP cells (90 million cells) are seeded and transduced with 0.1 MOI of lentiviral library for 4 hours. One day after transduction, the cells are selected for viral survival in RPMI 1640 media [(RPMI 1640+10% FBS), Gibco] supplemented with puromycin (1 µg/ml). After 5 days, the cells are counted, washed with 1X PBS, and fixed for flow sorting and subsequent isolation of gDNA.
For performance evaluation of the high-throughput cellular molecular function assay system and method, 18 random mutants of Tat, as well as empty vector and wtTat, are stably expressed in LentiX293T/LTR-GFP cells. Approximately 0.15 million cells are seeded in a well of a 24-well plate and incubated for 24 h. Cells are transduced with lentivirus and selected and maintained in complete DMEM media with puromycin (1.5 µg/ml) for 96 h. Cells are harvested and analyzed by flow cytometry to assess LTR-transactivated GFP expression. The same stable cell lines are created in Jurkat/LTR-GFP cells. Selected clones for empty vector, wtTat, and Tat C27S are stored at −80° C.
One fourth of the LentiX293T/LTR-GFP and one tenth of the Jurkat/LTR-GFP cells are harvested, gDNA is isolated using the Qiagen DNeasy Blood & Tissue Kit, and the gDNA is sequenced to evaluate library representation before flow sorting. The remaining cells are fixed in 2% paraformaldehyde/PBS for 10 minutes, washed twice with 1X PBS, and resuspended in 1X PBS for analysis by flow sorting (Sony 800S Cell sorter). Cells are sorted into three bins of GFP signal intensity (low-GFP, mid-GFP and high-GFP) gated with thresholds determined from cells stably expressing wt-Tat for maximal transactivation of LTR-GFP, and from cells stably expressing a Tat C27S mutant or empty vector for the low background of basal transactivation of LTR-GFP.
For deep sequencing, primers are designed to flank the Tat targeted region from gDNA and incorporate the NGS sequencing adaptors. gDNA is amplified by PCR with NEBNext Q5 Hot Start HiFi PCR Master Mix. The PCR protocol denatured strands at 98° C. for 30 s only in the first cycle, followed by denaturation at 98° C. for 10 s, annealing at 58° C. for 15 s, elongation at 72° C. for 30 s, and a final elongation for 2 min. NGS libraries for each sample category used NGS library forward primers and 1 NGS library reverse primer. The forward primers are common to all the sample categories, and the reverse primer is unique for each sample. The Tat amplicons are pooled and 20 µl of the sample is purified by gel extraction with AMPure XP beads (Beckman Coulter). All the samples are pooled and sequenced with a NovaSeq 6000 sequencing platform. The SP flow cell produces approximately 2×250 bp paired-end reads. Eighteen samples were sequenced: the synthetic dsDNA Tat variant library, the plasmid library, the selected cell libraries in LentiX293T and Jurkat cells (in duplicate), and the flow sorted low-GFP, mid-GFP, and high-GFP cells for each cell line (in duplicate).
Processing NGS Data with a Bioinformatics Pipeline
Paired-end reads were processed with a multistep bioinformatic pipeline BaseSpace and resulting reads in bcl files were converted into FASTQ files with BCL2FASTQ; read quality is assessed with FASTQC (e.g., see
Statistics are calculated for each mutation. In some instances, there are n cell lines (biological replicates) and each cell line has m technical replicates. For each barcode (group) in a sample, the percentage of the number of reads in the GFP+ group versus the total number of reads in both the GFP+ and GFP− groups is calculated, denoted as the h ratio (h ∈ [0,1]). In some instances, a high h percentage indicates wild type activity, while a low h percentage suggests a mutant. Then, for each mutant, the averaged h ratio for all the barcodes assigned to the same mutant is calculated, denoted as a mutant level summary score. In some instances, a one-sample t-test is used to evaluate 1) whether the mutant has a significantly different number of reads in the GFP+ group compared with the GFP− group within a technical replicate, and 2) whether the mutant has a significantly different number of reads in the GFP+ group compared with the GFP− group among different cell lines based on biological replicates (null hypothesis H0: h=0.5). In some instances, mutants with a high h percentage are classified as wild type and those with a low h percentage as LOF mutants. To estimate type I error for the classification, in some instances a list of true mutants with wild type transcriptional activity and true LOF mutants with low activity is compiled. Their h percentages are then fit with a beta distribution as the null distribution. Specifically, for wild type detection, in some instances the true mutant is used as the null and, vice versa, for mutant detection the wild type is used as the null. Moment estimators are used for estimating the model parameters. The p values for different cell lines are combined using Fisher's method into a global test p value.
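Two of the steps named above, the moment estimators for the beta null distribution and Fisher's method, can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline actually used: the function names and the example h values are hypothetical, the population variance is used for the moment estimate, and the p values are assumed to be independent and greater than zero.

```python
import math
from statistics import mean, pvariance


def beta_moment_estimates(values):
    """Method-of-moments estimates (alpha, beta) of a beta distribution."""
    m = mean(values)
    v = pvariance(values)
    if not 0 < v < m * (1 - m):
        raise ValueError("moments are not compatible with a beta distribution")
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common


def fisher_combined_p(p_values):
    """Combine independent p values with Fisher's method."""
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # X/2 where X = -2*sum(ln p)
    # Closed-form chi-square survival function for 2k degrees of freedom:
    # exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Hypothetical h ratios of a reference set used as the null.
alpha, beta = beta_moment_estimates([0.2, 0.4, 0.6, 0.8])
# Hypothetical per-cell-line p values for one mutant.
global_p = fisher_combined_p([0.05, 0.05])
print(alpha, beta, global_p)
```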
Performance metrics of accuracy, sensitivity, specificity, positive predictive value, and negative predictive value are based upon standard formulas.
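For reference, the standard formulas can be written out as a short sketch. The function name and the confusion-matrix counts in the example are hypothetical and are not results from this description.

```python
def performance_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Standard classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
    }


# Hypothetical counts for illustration only.
metrics = performance_metrics(tp=40, tn=45, fp=5, fn=10)
print(metrics)
```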
Figures can be prepared with PowerPoint, Excel, FlowJo, and PyMOL. Bin, bar, and pie plots, as well as saturating mutagenesis heatmaps, are generated with Excel. Values for saturating mutagenesis heatmaps and 3D surface plots can be generated with custom Python scripts. 3D surface plots for the amino acid tolerance at each position represent the accuracy of physicochemical properties as gradients from blue to white to magenta, with magenta being the highest accuracy. Accuracy is a standard formula and is calculated for groups of amino acids with similar physicochemical properties.14 Solvent accessible surface area (SASA) is calculated for the Tat structure (1TIV) with the Accessible Surface Area and Accessibility Tool.42 Residues are considered buried if less than 10% of surface area is exposed to solvent.
The MCC is calculated with the following data definitions, using large hydrophobic amino acids at a position in Tat as an example: If either Phe, Tyr, or Trp have >50% activity they are true positives and if the other amino acids have <50% activity they are true negatives. If either Phe, Tyr, or Trp have <50% activity they are false positives and if the other amino acids have >50% activity they are false negatives. The wild type amino acid is also considered to be a true positive when it is in the physicochemical group, and a true negative when it is not. The MCC captures the tolerance for types of amino acids at each position and, when mapped onto the surface of the 3D structure, is a new visual mining approach to reveal the spatial relationships of amino acid tolerances and their relevance to other Tat functions.
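The data definitions above can be sketched in code for a single position. This is a minimal illustration, not the custom scripts referred to in this description: the function name and the activity values are hypothetical, a residue at exactly 50% activity is treated as active (an assumption, since the definitions only cover values above or below 50%), and an MCC of 0 is returned when the denominator is zero.

```python
import math


def group_mcc(activity, group, wild_type, threshold=50.0):
    """MCC of one physicochemical group at one position.

    activity maps each residue (one-letter code) to its percent activity.
    """
    tp = tn = fp = fn = 0
    for residue, percent in activity.items():
        in_group = residue in group
        if residue == wild_type:
            # Wild type: true positive if in the group, true negative otherwise.
            if in_group:
                tp += 1
            else:
                tn += 1
        elif in_group:
            if percent >= threshold:
                tp += 1
            else:
                fp += 1  # labeled as in the definitions above
        elif percent >= threshold:
            fn += 1      # labeled as in the definitions above
        else:
            tn += 1
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Hypothetical position with wild type Phe that tolerates only Phe, Tyr, Trp.
large_hydrophobic = {"F", "Y", "W"}
activity = {aa: 10.0 for aa in "ACDEGHIKLMNPQRSTV"}
activity.update({"F": 100.0, "Y": 80.0, "W": 70.0})
print(group_mcc(activity, large_hydrophobic, wild_type="F"))
```

Because the MCC formula is symmetric in the false positive and false negative counts, the value does not depend on which of the two labels is attached to an inactive in-group residue.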
Sequencing was done through Illumina paired-end sequencing. A forward and a reverse read, each of 250 base pairs, were used to capture the expected amplicon of around 465 base pairs. Within each read is the Tat encoding region (with one mutation introduced by the library) as well as the 32 base pair unique molecular identifier used as a barcode. These parts of interest are extracted and used for statistics and interpretation based on the quantification of barcode to variant mappings found in each of the flow sorted pools.
Raw data (Files 6-39) were analyzed through a custom bioinformatic pipeline to extract the Tat encoding region and barcode. Each file was individually input to FastQC to estimate read quality. The default parameters were used, and each file was deemed suitable to proceed with minor quality filtration downstream. Forward and reverse reads were fused to allow downstream analysis. This was accomplished using FLASH. Additional parameters used for fusion were a minimum overlap length of 20, a maximum overlap length of 45, and a mismatch ratio of 0.15. After the forward and reverse reads were fused, Trimmomatic was used to perform quality filtration. Each file was filtered such that reads were required to have a minimum length of 365 base pairs (BPs). Additionally, reads were removed if they contained a consecutive series of 4 BPs where the average PHRED score across the four BPs dropped below 16. Next, Cutadapt was used to remove adapter sequences. Reads were trimmed to just the Tat encoding region, a small 3′ extension, and the 32 BP barcode. This was done by performing a linked adapter trim, specifying the parameter “-a AATTC...GCGATCGC” (SEQ ID NO: 1), which instructs Cutadapt to identify a series of BPs surrounded by the specified sequences and extract it.
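The quality filtration rule described above (minimum length of 365 BPs, and no window of 4 consecutive BPs with an average PHRED score below 16) can be sketched as a standalone check. The sketch follows the wording of this description rather than reproducing Trimmomatic's internal behavior; the function name is hypothetical, and the example reads use a shortened minimum length purely for illustration.

```python
def passes_quality_filter(phred, min_length=365, window=4, min_mean=16.0):
    """Return True if a read meets the length and sliding-window criteria.

    phred is a list of per-base PHRED quality scores for one fused read.
    """
    if len(phred) < max(min_length, window):
        return False
    window_sum = sum(phred[:window])
    if window_sum / window < min_mean:
        return False
    # Slide the window one base at a time, updating the running sum.
    for i in range(window, len(phred)):
        window_sum += phred[i] - phred[i - window]
        if window_sum / window < min_mean:
            return False
    return True


# Hypothetical short reads (min_length lowered to 8 for the example).
good_read = [30] * 10
bad_window = [30] * 5 + [2, 2, 2, 2] + [30]
too_short = [30] * 5
print(passes_quality_filter(good_read, min_length=8))
print(passes_quality_filter(bad_window, min_length=8))
print(passes_quality_filter(too_short, min_length=8))
```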
After adapter trimming, the 32 BP barcode must be isolated to allow barcode grouping. This is accomplished using Cutadapt. Specifying a 5′ adapter as “-g GATCTG” causes Cutadapt to trim away the rest of each read and yields the 32 BP barcodes. Once the barcodes are extracted, they must be grouped to account for any naturally occurring changes to the barcode that may arise during the cell cycle. All barcodes extracted from the 17 sort files were pooled and grouped by Levenshtein distance, with a distance of two, using Starcode. Using the ID retaining feature of Starcode, reads could be demultiplexed based on the barcode group they were assigned to, as well as the original sort file the read came from. During demultiplexing, a strict size of 32 was enforced; barcode groups that were larger or smaller than 32 BPs were removed. The end result is many directories, each generated and named after an identified barcode group. Each of these directories contains up to 17 files, which correspond to the reads associated with that barcode group and its flow sort results. For example, barcode group AGACGTACCAACAAAAGACAATGACAAAAAGG (SEQ ID NO: 2) was associated with 1,447 reads across the flow sorted files: 34 reads corresponded to the replicate 1, GFP high sort, cell line 293T; 25 reads corresponded to the replicate 1, GFP low sort, cell line 293T; etc.
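The idea of grouping barcodes by Levenshtein distance can be illustrated with a small sketch. It is a simplified greedy stand-in and does not reproduce Starcode's clustering algorithm; the function names and the short example barcodes are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (substitutions, insertions, deletions) between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def group_barcodes(barcodes, max_distance=2):
    """Greedy grouping: a barcode joins the first group whose representative
    is within max_distance, otherwise it starts a new group."""
    groups = {}
    for barcode in barcodes:
        for representative in groups:
            if levenshtein(barcode, representative) <= max_distance:
                groups[representative].append(barcode)
                break
        else:
            groups[barcode] = [barcode]
    return groups


# Hypothetical 8-mers standing in for the 32 BP barcodes.
example_groups = group_barcodes(["AGACGTAC", "AGACGTAG", "TTTTCCCC", "AGTCGTAG"])
print(example_groups)
```

In the usage above, the second and fourth barcodes differ from the first by one and two substitutions, so all three fall into one group while the unrelated barcode forms its own.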
Once demultiplexing has taken place, variants must be identified across all barcode groups. The variant calling pipeline starts with the Burrows-Wheeler Aligner (BWA). The wild type Tat sequence was BWA indexed and used as a reference for the rest of the variant calling. Each flow sorted file within each barcode group directory was then individually aligned to the indexed reference file using BWA mem. After alignment, the files were put through a series of processes in the Samtools package. The SAM files generated from BWA mem were converted into BAM files, sorted, indexed, and piled up using the Samtools view, sort, index, and pileup functions, respectively. Lastly, the BCFtools call function produced a VCF file for variant interpretation.
For each barcode group, all 17 VCFs were compared to ensure that valid variant calling was performed. For validation, all 17 VCFs should agree on a particular codon substitution, as designed in the experiment. Once validated, read counts supporting the codon substitution were extracted from the flow sorted files. Read counts were normalized to reads per million (RPM), and a comparison of the distribution of where these read counts land in the flow sorted samples across all barcodes for a particular variant yields the resulting activity measures as shown in data file 1. Placing these scores into a heatmap and showing known post-translational modification sites of Tat results in
For validation, a mutant was considered active if 50% or more of the normalized reads associated with the mutant were from the high bin of the flow sorted pools. Barcodes were only considered valid if there were at least 2.5 RPM. The distribution of the mutants' activity ranges is shown in
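The normalization and the activity call described above can be sketched as follows. The sketch makes two assumptions that are not stated in this description: the 2.5 RPM validity threshold is applied to the total across the three bins, and the 50% fraction is taken over the low, mid, and high bins together. The function names and example numbers are hypothetical.

```python
def reads_per_million(count: int, library_total: int) -> float:
    """Normalize a raw read count to reads per million (RPM)."""
    return count * 1_000_000 / library_total


def is_active(rpm_by_bin, min_rpm=2.5, active_fraction=0.5):
    """Return True/False for active/inactive, or None if below the RPM cutoff.

    rpm_by_bin maps 'low', 'mid', and 'high' to normalized read counts.
    """
    total = sum(rpm_by_bin.values())
    if total < min_rpm:
        return None  # not enough reads to make a call
    return rpm_by_bin["high"] / total >= active_fraction


# Hypothetical values for illustration only.
print(reads_per_million(50, 10_000_000))
print(is_active({"low": 1.0, "mid": 1.0, "high": 3.0}))
print(is_active({"low": 0.5, "mid": 0.5, "high": 1.0}))
```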
Statistical analysis was done through two models. The first aimed to capture mutants whose activity was wild type-like or loss of function (LOF). For each barcode in a sample, the percentage of reads in the high bin out of reads in the high and low bins was calculated, and is denoted as the h ratio (h ∈ [0,1]). A high h percentage resembles wild type, while a low h percentage suggests a mutant. For each mutant, the averaged h ratio for all the barcodes assigned to the same mutant was calculated, denoted as a mutant level summary score. A one-sample t-test was used to evaluate 1) whether the mutant has a significantly different number of reads in the high bin compared with the low bin within a technical replicate, and 2) whether the mutant has a significantly different number of reads in the high bin compared with the low bin among different cell lines based on biological replicates (null hypothesis H0: h=0.5).
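The h ratio and the one-sample t statistic against H0: h=0.5 can be sketched as follows. The read counts 34 and 25 are taken from the barcode group example given in this description; the function names and the list of per-barcode h ratios are hypothetical. Converting the statistic to a p value requires the t distribution, which is left to a statistics library here.

```python
import math
from statistics import mean, stdev


def h_ratio(high_reads: float, low_reads: float) -> float:
    """Fraction of reads in the high bin out of the high and low bins."""
    return high_reads / (high_reads + low_reads)


def one_sample_t(values, null_mean=0.5):
    """t statistic and degrees of freedom for H0: mean(values) == null_mean."""
    n = len(values)
    standard_error = stdev(values) / math.sqrt(n)
    return (mean(values) - null_mean) / standard_error, n - 1


# One barcode: 34 reads in the GFP high sort and 25 in the GFP low sort.
h = h_ratio(34, 25)
# Hypothetical h ratios of the barcodes assigned to one mutant.
summary_score = mean([0.8, 0.9, 0.7])
t_stat, dof = one_sample_t([0.8, 0.9, 0.7])
print(h, summary_score, t_stat, dof)
```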
In addition to the t-test comparing the high bin ratio among the mutants, a test was devised to assess the association between the genotype (LOF/wild type) and GFP expression (a binary variable: high bin or low bin). A mixed effect logistic regression was used, with random intercepts for barcodes and replicates to model the nested structure in the experimental design. For the wild type control populations, cells with no mutant calls (sequences identical to the reference) were used. Each mutant was compared against the common wild type control population. The model M1, with genotype included as a fixed effect, was compared to a null model M0 without genotype in a likelihood ratio test (LRT). Similar to Genome-Wide Association Studies (GWAS), a significant result indicates that the LOF/wild type genotype is associated with the percentage of high bin cells. For mutants where the model fit was singular, the model was simplified by dropping the random effects. P-values were false discovery rate (FDR)-adjusted using Storey's q-values.
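The comparison of M1 against M0 can be illustrated once the two models have been fit. The sketch below covers only the likelihood ratio step and assumes that genotype enters M1 as a single fixed-effect parameter, so that the test statistic is referred to a chi-square distribution with one degree of freedom; the function name and log-likelihood values are hypothetical, and fitting the mixed effect models themselves is outside the sketch.

```python
import math


def likelihood_ratio_test(loglik_null: float, loglik_full: float):
    """LRT for nested models that differ by one parameter (1 degree of freedom)."""
    statistic = max(0.0, 2.0 * (loglik_full - loglik_null))
    # Chi-square (1 df) survival function: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical maximized log-likelihoods of M0 (no genotype) and M1 (genotype).
statistic, p_value = likelihood_ratio_test(loglik_null=-100.0, loglik_full=-98.0792705)
print(statistic, p_value)
```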
The following enumerated embodiments are provided, the numbering of which is not to be construed as designating levels of importance.
Embodiment 1 provides a method of analyzing results of a high-throughput assay system for molecular functions and cell processes, the method comprising:
Embodiment 2 provides the method of embodiment 1, wherein the gene of interest is a transcription factor.
Embodiment 3 provides the method of embodiment 1, wherein the reporter element encodes a promoter element that drives expression of a cDNA encoding a fluorescent protein.
Embodiment 4 provides the method of embodiment 3, wherein the fluorescent protein is GFP.
Embodiment 5 provides the method of embodiment 1, wherein the variant calling step further comprises performing Burrows-Wheeler indexing and alignment to a wild type reference sequence for each group of related barcodes to generate a sequence alignment and map (SAM) file; converting the SAM file to a binary alignment and map (BAM) file; sorting and indexing the contents of the BAM file; aligning the contents of the BAM file to generate alignments of the variants to the wildtype; and generating a variant call file from the alignments.
Embodiment 6 provides the method of embodiment 1, wherein the barcode comprises 32 or more nucleotides.
Embodiment 7 provides the method of embodiment 1, wherein the cells are mammalian cells.
Embodiment 8 provides the method of embodiment 1, wherein the cells are an immortalized cell line.
Embodiment 9 provides the method of embodiment 1, wherein the cell sorting is accomplished by flow cytometry.
Embodiment 10 provides the method according to embodiment 1, wherein the population of cells comprises an average of hundreds of separately randomly barcoded cDNA molecules assigned to each cell group for each mutant.
Embodiment 11 provides the method according to embodiment 1, wherein each mutant has a read depth of about 2,000× to about 90,000× sequencing coverage.
Embodiment 12 provides the method according to embodiment 1, further comprising validating the method by comparing the activity of a subset of mutants analyzed by the method to previously determined results.
Embodiment 13 provides the method according to embodiment 1, further comprising validating the method by comparing true negatives as determined by the method to true negatives determined by an independent method.
Embodiment 14 provides the method according to embodiment 1, further comprising validating the method by comparing the activity of a subset of mutants analyzed by the method to independent testing of a set of separate clones.
Embodiment 15 provides the method according to embodiment 1, further comprising validating the method by comparing method results among different samples.
Embodiment 16 provides the method according to embodiment 1, further comprising validating the method by comparing method results in two different cell lines.
Embodiment 17 provides the method according to embodiment 1, wherein the statistical significance of mutants analyzed in the method has a median p value of p<9×10⁻²⁰.
The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the present systems and methods and their practical applications, to thereby enable others skilled in the art to best utilize the present systems and methods and various embodiments with various modifications as may be suited to the particular use contemplated.
This application is a continuation application of PCT/US2022/073427, filed Jul. 5, 2022, which claims the benefit of priority under 35 U.S.C. § 119(e) to U.S. Provisional Patent Application No. 63/218,429, filed Jul. 5, 2021, which is hereby incorporated by reference herein in its entirety.
Number | Date | Country | |
---|---|---|---|
63218429 | Jul 2021 | US |
Number | Date | Country | |
---|---|---|---|
Parent | PCT/US22/73427 | Jul 2022 | WO |
Child | 18405052 | US |