The present disclosure relates to methods of creating biological investigative protocols, and laboratory apparatus which implement such methods.
The following documents and all others cited herein are incorporated by reference in their entireties: U.S. Pat. Nos. 4,946,778; 10,068,053; 10,706,955; U.S. Pub. No. 2020/0320355 A1; U.S. Pub. No. 2020/0152289 A1; WO 2017/075292 A1; Bracewell et al., Int'l Conf. Eng. Des., 24-27 Aug. 2009, 6-223 to 6-243; Caicedo et al., Nat. Methods, September 2017 14(9), 849-63; Guan et al., PLoS One, 2013 Dec. 20; 8(12):e83291; Yuste et al., Nat. Neurosci., December 2020, 23 1456-68.
Beginning with Anton van Leeuwenhoek's microscopy observations in the mid-17th century, natural philosophers such as Robert Hooke noted that plant tissues seemed to “consist of a great many little Boxes” of which there were “in a Cubic Inch, above twelve hundred Millions . . . a thing almost incredible.” (R. Hooke, Micrographia (1665)). In the 19th century, Matthias Jakob Schleiden, Theodor Schwann, and later Rudolf Virchow, concluded that not only plants but also animals consist of cells having varying structure and function, and they developed a fundamental cell theory of biology.
Around the late 19th century, Camillo Golgi and Santiago Ramón y Cajal contributed to the systemization of cell biology by carefully staining, drawing, and cataloguing neurological cells, and naming the various cell types distinguished by their diverse visible morphologies.
Since these early findings, investigators have described countless different cell types. Modern cell biology has arrived at granular distinctions between cells, and numerous ways to describe and classify cell types, including, e.g., (a) anatomical morphology; (b) apparent physiological function; (c) protein markers; (d) genetic markers; (e) developmental origins and taxonomy; (f) epigenetic states and markers; (g) electrophysiology; (h) cross-species homology; and (i) combinations of any of these factors.
Lately, it has become well known that some cell types having identical morphology may sometimes be distinguished by molecular markers, physiological function, or both. (See, e.g., Hattar et al., S
While labeling a protein or a nucleic acid in a sample has become routine laboratory technique, it remains challenging to label large quantities of different protein and/or nucleic acid markers in complex tissues, such as brain neocortical tissue, such that all cell types may be readily resolved, identified, and distinguished. Investigators must carefully select individual or a limited set of molecular markers and labels, e.g., fluorophores, such that each cell type is clearly and distinguishably labeled. However, use of individual or limited sets of such markers or labels for mapping purposes cannot provide the power to distinguish among a plethora (e.g., hundreds to thousands) of cell types.
Spatial mapping of the three-dimensional organization of cells in complex tissues will be transformative for understanding the biological organization of function. Recent advances in resolution, sensitivity, and throughput have made “spatial transcriptomics” the main tool for spatial cell-type mapping in the study of the brain. However, the existing major spatial transcriptomics approaches are limited to thin two-dimensional sections.
Accordingly, there is a great need for reliable, efficient methods of optimizing cell labeling protocols; methods of analyzing and consolidating vast amounts of cell labeling information into topographically meaningful presentations; and apparatus capable of identifying and recording each cell's type and spatial position in the sample, all of which could enable three-dimensional volumetric mapping. Such methods and apparatus would help develop universal, optimal labeling protocols using the best and fewest fluorophore labels possible, which would improve the efficacy, quality, and reproducibility of investigative research, and greatly aid in research of complex tissues. These improvements in research capability would lead to new biological discoveries and medical treatments.
The present disclosure relates to methods of optimizing biological investigative protocols, and laboratory apparatus. More specifically, the present disclosure comprises a computational method for determining how to optimally label cells in a biological sample such that each cell type is distinguishable from all other cell types in the biological sample. The present method would ultimately permit investigators to reliably distinguish a large variety of differing cell types in a tissue sample using a minimal array of visual labels.
The present disclosure further comprises a scanner apparatus which can determine and record the position and labels on each cell in a biological sample, and determine or estimate the cell type at that location.
In an embodiment, the present disclosure comprises a method for identifying the specific locations of a plurality of specific cell types within a population of cells in a biological specimen, the method comprising the steps of:
In one embodiment, the molecular markers are nucleic acid polymers. In a preferred embodiment, the nucleic acid polymers are RNA. In an alternative embodiment, the molecular markers are peptides, whole proteins, and/or protein fragments, which may comprise any of a peptide, nuclear protein, cytosolic protein, mitochondrial protein, secreted protein, cell-surface protein, receptor, transcription factor, antibody, or any combination thereof. In other embodiments, the molecular markers are any biological components that can be tagged in accordance with the disclosure herein. Non-limiting examples include metabolites, lipids, carbohydrates including polysaccharides, glycolipids, vitamins, fatty acids, co-factors, pigments, metals, or any other biochemicals or compounds, organic or inorganic, found within a biological system. The methods described herein for identifying exemplary molecules are equally applicable to markers that are not nucleic acids or proteins. Moreover, the methods may be used to concurrently identify more than one type of molecular marker in a specimen.
In one embodiment, steps (a), (b) and (c) are performed for a particular type of biological specimen that comprises a plurality of known cell types (e.g., known from the literature). In step (a), the cell types to be located are selected from among the known cell types within the specimen, and in step (b), the known molecular markers of each selected cell type are obtained from the literature. Based upon the selection of molecular markers in step (b), step (c) may be carried out to identify the markers to be detected, and step (d) to design the reagents. Thus, steps (a)-(d) are carried out for each particular type of specimen and the cells of interest to be located therein. In one embodiment, the remaining steps comprise processing the specimen and analyzing the data collected from imaging the specimen.
In one embodiment, steps (a), (b) and (c) are carried out in silico.
In another embodiment, step (b) may further comprise organizing the specific cell types into a hierarchical taxonomy according to the plurality of known molecular markers. In some embodiments, the number of different detectable labels of step (d) may be equal to the number of hierarchical levels of the hierarchical taxonomy.
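The relationship between taxonomy depth and label count can be illustrated with a minimal sketch. The toy taxonomy, the cell-type names, and the idea that each label reports the branch taken at one level are illustrative assumptions only, not the encoding of the disclosure.

```python
# Hypothetical three-level taxonomy (assumed for illustration only).
# Inner dicts are branches; lists hold the leaf cell types.
TAXONOMY = {
    "neuron": {
        "excitatory": ["L2/3 IT", "L5 ET"],
        "inhibitory": ["Pvalb", "Sst"],
    },
    "non-neuron": {
        "glia": ["astrocyte", "oligodendrocyte"],
        "vascular": ["endothelial", "pericyte"],
    },
}


def encode(node, prefix=()):
    """Return {cell_type: tuple of branch indices, one index per level}."""
    codes = {}
    if isinstance(node, dict):
        # Sort branch names so the code assignment is deterministic.
        for i, key in enumerate(sorted(node)):
            codes.update(encode(node[key], prefix + (i,)))
    else:
        for i, leaf in enumerate(node):
            codes[leaf] = prefix + (i,)
    return codes


codes = encode(TAXONOMY)
# One detectable label per hierarchical level (assumed interpretation).
n_labels = len(next(iter(codes.values())))

for cell_type, code in codes.items():
    print(cell_type, code)
```

In this sketch every cell type receives a distinct tuple whose length equals the number of levels, so three labels suffice to separate the eight hypothetical types.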
In yet another embodiment, the dimensionality reduction process of step (c) is accomplished using principal component analysis. In still another embodiment, the dimensionality reduction process of step (c) is accomplished using recursive partitioning. In another embodiment, the dimensionality reduction process of step (c) is accomplished using an artificial neural network to design the encoding. In an alternative embodiment, the dimensionality reduction process of step (c) is accomplished using discriminant projection non-negative matrix factorization (dPNMF). In an aspect, dPNMF comprises the steps of (i) fitting a dPNMF model to training data; (ii) fitting a classifier to one class per cell type; and (iii) creating a staining profile for each cell type according to a weighting, whereby the number of cell labels per molecular marker approximates the weighting. In an aspect, the classifier of step (ii) is a Naïve Bayesian classifier. In another aspect, artificial neural network classifiers are used. In another aspect, the K-nearest neighbors algorithm (KNN) is used.
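As a minimal sketch of sub-step (iii) only, the following shows one way a non-negative weighting could be approximated by an integer number of labels per molecular marker. The function name `probe_counts`, the marker names, and the largest-remainder rounding rule are assumptions made for illustration; the sketch does not implement dPNMF itself.

```python
def probe_counts(weights, total_probes):
    """Allocate integer probe counts per marker, proportional to weight.

    weights: {marker_name: non-negative weight} (hypothetical input)
    total_probes: total number of probes available for one label channel
    Uses largest-remainder rounding so the counts sum to total_probes.
    """
    total_w = sum(weights.values())
    if total_w <= 0:
        raise ValueError("at least one weight must be positive")
    # Exact (fractional) share of probes for each marker.
    exact = {m: w * total_probes / total_w for m, w in weights.items()}
    counts = {m: int(x) for m, x in exact.items()}
    # Hand out the probes lost to truncation, largest remainder first.
    leftover = total_probes - sum(counts.values())
    by_remainder = sorted(exact, key=lambda m: exact[m] - counts[m], reverse=True)
    for m in by_remainder[:leftover]:
        counts[m] += 1
    return counts


# Hypothetical weights for three markers in one label channel.
example = probe_counts({"markerA": 5, "markerB": 3, "markerC": 2}, total_probes=10)
print(example)
```

The rounding rule is a design choice: it keeps the total probe budget fixed while keeping each count within one probe of its ideal fractional share.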
In one embodiment, step (d) is accomplished using direction from step (c) as to the preparation of the set of bivalent binding reagents and labeled binding reagents. In one embodiment, the bivalent binding reagent is an oligonucleotide comprising a molecular marker-binding region and a labeled-binding-reagent-binding region. In one embodiment, multiple bivalent binding reagents are provided that bind to the same molecular marker. In some embodiments, the multiple bivalent reagents that bind to the same molecular marker have the same labeled-binding-reagent-binding region. In one embodiment, the labeled binding reagent comprises a label and a region that binds to a bivalent binding reagent. In one embodiment, the labels of the labeled binding reagents are dyes that are individually detectable in a single location within the specimen using hyperspectral imaging. In some embodiments, a labeled binding reagent has a single dye molecule bound thereto. In some embodiments, a labeled binding reagent has a plurality of dye molecules bound thereto. In some embodiments, the plurality of dye molecules are the same dye or different dyes.
In some embodiments, oligonucleotide-based bivalent binding reagents comprise at least one molecular-marker binding region nucleic acid sequence, and at least one labeled-binding-reagent binding region nucleic acid sequence. In some embodiments, a bivalent binding reagent may comprise two labeled-binding-reagent binding sequences, binding the same or different labeled binding reagents. In some embodiments, a bivalent binding reagent may comprise three labeled-binding-reagent binding sequences. In some embodiments, a bivalent binding reagent may comprise more than three labeled-binding-reagent binding sequences. Such plurality of labeled-binding-reagent binding sequences on a bivalent binding reagent may bind one or more of the same, or two different, or two of the same and one different, or three all different labeled binding reagents, or any other combination thereof. The number of such labeled-binding-reagent binding sequences will be provided by the calculations from steps (a), (b), and/or (c) as described herein.
In some embodiments, for purposes of amplifying the bivalent binding sequences, amplification sequences may be included in the bivalent binding reagents, which may be retained or removed before use in the staining methods disclosed herein. In some embodiments, a forward and a reverse amplification nucleic acid sequence are included in the bivalent binding reagent. In some embodiments, the forward amplification sequence is provided at the 5′ end of the bivalent binding reagent, and the reverse amplification sequence at the 3′ end.
In some embodiments, the bivalent binding sequence may comprise, from 5′ to 3′, a forward amplification sequence, a molecular-marker binding sequence, a labeled-binding-reagent binding sequence, and a reverse amplification sequence. In some embodiments, the aforementioned sequence is cleaved to yield, from 5′ to 3′, a molecular-marker binding sequence and a labeled-binding-reagent binding sequence.
In some embodiments, the bivalent binding sequence may comprise, from 5′ to 3′, a forward amplification sequence, a labeled-binding-reagent binding sequence, a molecular-marker binding sequence, and a reverse amplification sequence. In some embodiments, the aforementioned sequence is cleaved to yield, from 5′ to 3′, a labeled-binding-reagent binding sequence and a molecular-marker binding sequence.
In some embodiments, the bivalent binding sequence may comprise, from 5′ to 3′, a forward amplification sequence, a labeled-binding-reagent binding sequence, a labeled-binding-reagent binding sequence, a molecular-marker binding sequence, and a reverse amplification sequence. In some embodiments, the aforementioned sequence is cleaved to yield, from 5′ to 3′, a labeled-binding-reagent binding sequence, a labeled-binding-reagent binding sequence, and a molecular-marker binding sequence.
In some embodiments, the bivalent binding sequence may comprise, from 5′ to 3′, a forward amplification sequence, a labeled-binding-reagent binding sequence, a labeled-binding-reagent binding sequence, a molecular-marker binding sequence, a labeled-binding-reagent binding region, and a reverse amplification sequence. In some embodiments, the aforementioned sequence is cleaved to yield, from 5′ to 3′, a labeled-binding-reagent binding sequence, a labeled-binding-reagent binding sequence, a molecular-marker binding sequence, and a labeled-binding-reagent binding sequence.
In some embodiments, the bivalent binding sequence may comprise, from 5′ to 3′, a forward amplification sequence, a labeled-binding-reagent binding sequence, a molecular-marker binding sequence, a labeled-binding-reagent binding region, a labeled-binding-reagent binding sequence, and a reverse amplification sequence. In some embodiments, the aforementioned sequence is cleaved to yield, from 5′ to 3′, a labeled-binding-reagent binding sequence, a molecular-marker binding sequence, a labeled-binding-reagent binding sequence, and a labeled-binding-reagent binding sequence.
In any of the bivalent binding reagents disclosed herein, additional nucleotides (e.g., A, C, G, and/or T) may be provided as spacers between the aforementioned regions, such as one or more A.
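The region layouts and spacers described above can be sketched as simple string assembly. All sequences below are arbitrary placeholders invented for illustration (they are not any SEQ ID NO of the disclosure), and the helper names `assemble` and `cleave` are hypothetical.

```python
# Placeholder sequences, assumed for illustration only.
FWD_AMP = "GGTCTCAAGC"               # forward amplification sequence (5' end)
REV_AMP = "CTTGAGACCA"               # reverse amplification sequence (3' end)
MARKER_BIND = "ACGTTGCAACGTTGCAACGT"  # molecular-marker binding sequence
LABEL_BIND = "TTCCGGAATTCCGGAA"       # labeled-binding-reagent binding sequence


def assemble(regions, spacer="A", fwd=FWD_AMP, rev=REV_AMP):
    """Join regions 5'->3' with a spacer between adjacent regions,
    flanked by the forward and reverse amplification sequences."""
    return fwd + spacer.join(regions) + rev


def cleave(oligo, fwd=FWD_AMP, rev=REV_AMP):
    """Remove the flanking amplification sequences, keeping the core."""
    if not (oligo.startswith(fwd) and oligo.endswith(rev)):
        raise ValueError("oligo is not flanked by the expected sequences")
    return oligo[len(fwd):len(oligo) - len(rev)]


# Layout: two label-binding sequences followed by one marker-binding sequence.
layout = [LABEL_BIND, LABEL_BIND, MARKER_BIND]
full_length = assemble(layout)
working = cleave(full_length)
print(full_length)
print(working)
```

Changing the order of the entries in `layout` reproduces the other arrangements listed above; the spacer argument models the optional nucleotides between regions.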
In some embodiments, bivalent binding reagents are provided that bind to different sequences on the same molecular marker; such multiple bivalent binding reagents for the same marker provide the weights that each molecular marker target contributes towards the total basis measurement.
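To illustrate how the number of bivalent binding reagents per marker can act as a weight, the sketch below computes an idealized signal for one label channel. It assumes, purely for illustration, that signal is linear in marker abundance and that every reagent binds with equal efficiency; the function and marker names are hypothetical.

```python
def channel_signal(expression, probes_per_marker):
    """Idealized signal in one label channel.

    expression: {marker: abundance in the cell} (arbitrary units, assumed)
    probes_per_marker: {marker: number of bivalent binding reagents that
                        target this marker and share this channel's label}
    Each marker contributes abundance x probe count to the channel total.
    """
    return sum(expression.get(m, 0.0) * n for m, n in probes_per_marker.items())


# Hypothetical cell: markerA is twice as abundant as markerB.
cell = {"markerA": 2.0, "markerB": 1.0}
# Three reagents tile markerA and one targets markerB in this channel.
signal = channel_signal(cell, {"markerA": 3, "markerB": 1})
print(signal)
```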
In some embodiments, oligonucleotide-based labeled binding reagents comprise an oligonucleotide sequence that binds to a labeled-binding-reagent binding sequence of a bivalent binding reagent, and a detectable label such as a fluorescent dye. In some embodiments, the dye is covalently bound to the oligonucleotide region of the labeled binding reagent. In some embodiments, the dye is reversibly linked to the oligonucleotide region of the labeled binding reagent, for example using a disulfide bond, such that it can be cleaved (reduced) and removed for successive imaging using labeled binding reagents incorporating the same dye. In one embodiment, the dyes for the labeled binding reagents are selected from among Cy5, BODIPY 630/650-X, LC Red 640, Alexa Fluor 633, BODIPY 650/665-X, Alexa Fluor 647, Alexa Fluor 660, Cyanine5.5, Alexa Fluor 680, Alexa Fluor 700 and Alexa Fluor 750.
In one embodiment, step (f) is accomplished by hyperspectral scanning of the specimen. In other embodiments, step (f) is accomplished by standard fluorescence imaging, light sheet imaging, or flow cytometry. In some embodiments, step (f) is accomplished by non-optical sensing methods such as mass spectrometry.
In one embodiment, the imaging at positions throughout the specimen to detect the labeled binding reagents and extent of labeling is performed sequentially or simultaneously. In some embodiments, the specimen is imaged for all dyes at each location in the specimen simultaneously. In some embodiments, the specimen is imaged for each dye sequentially at each location in the specimen. In some embodiments, the specimen is incubated with all of the bivalent binding reagents and the subsequent incubating with the labeled binding reagents may be simultaneous or sequential, with, in some embodiments, imaging after each sequential incubation with each labeled binding reagent or a subset of the labeled binding reagents. Thus, in some embodiments, the imaging is performed batchwise to detect one or more dyes in each scan. In some embodiments, after such partial imaging of the sample, the one or more labeled binding reagents are washed out of the specimen before the next one or more labeled binding reagents are incubated and then imaged. In some embodiments, the washing out comprises removing the labeled binding reagents. In other embodiments, the dyes of the labeled binding reagents are quenched or otherwise made to not interfere with subsequent imaging of the same or different dyes.
In one embodiment, the data obtained from step (f) is converted to locations of particular cell types within the specimen using the correlating of step (g), which is based upon the relationships established in step (c). In one embodiment, step (c) is an encoding step, and step (g) is a decoding step. In one embodiment, the method used for coding in step (c) is used for decoding in step (g).
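A decoding step of this kind can be sketched as matching the label intensities measured at one location against a per-cell-type codebook. The codebook values, the type names, and the use of a nearest-centroid rule on normalized intensities are assumptions chosen for illustration; the disclosure states only that the coding method of step (c) may also be used for decoding.

```python
import math

# Hypothetical codebook: expected relative intensity of three labels
# for each cell type, as would be produced by an encoding step.
CODEBOOK = {
    "type_1": (1.0, 0.0, 0.5),
    "type_2": (0.0, 1.0, 0.5),
    "type_3": (0.5, 0.5, 0.0),
}


def normalize(vector):
    """Scale intensities to sum to 1 so overall brightness is ignored."""
    total = sum(vector)
    if total <= 0:
        return tuple(0.0 for _ in vector)
    return tuple(x / total for x in vector)


def decode(measured, codebook=CODEBOOK):
    """Return the cell type whose normalized profile is closest
    (Euclidean distance) to the normalized measurement."""
    m = normalize(measured)
    return min(codebook, key=lambda name: math.dist(m, normalize(codebook[name])))


# Two hypothetical measurements (arbitrary intensity units).
first = decode((9.0, 1.0, 5.0))
second = decode((0.1, 2.0, 1.1))
print(first, second)
```

Normalizing before comparison is a design choice that makes the assignment depend on the ratio of labels rather than on absolute signal, which may vary with depth or staining efficiency.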
In one embodiment, the locations of the cell types within the specimen in step (h) are used diagnostically to identify, for example, a normal state, a disease state or the potential for a disease state to develop, based upon the locations of particular cell types within the specimen.
In some embodiments, the incubating of the specimen with the bivalent binding reagents is performed before the incubating with the labeled binding reagents. In some embodiments, the incubating with the bivalent binding reagents is of longer duration than incubating with the labeled binding reagents. In some embodiments, the incubating with the bivalent binding reagents is performed at the same time as with the labeled binding reagents. In some embodiments, the specimen is washed after incubating with the bivalent binding reagents and before the incubating with the labeled binding reagents. In some embodiments, the labeled binding reagents are added after the bivalent binding reagents. In some embodiments, the specimen is washed after incubation with the labeled binding reagents.
In some embodiments, the specimen after the incubating steps is imaged in a single scan. In some embodiments, the specimen is imaged using multiple scans. In some embodiments, all dyes used in the labeled binding reagents are imaged at each specimen location at the same time. In some embodiments, a subset of dyes is imaged at each location at the same time. In some embodiments, one dye is imaged at each location. In some embodiments, imaging and incubation steps are repeated in a sequence where, in each step, a different set of labeled binding reagents is incubated, excess is washed, and bound reagents are imaged. In some embodiments, the sequential or simultaneous incubation of the specimen with the bivalent binding reagents and the labeled binding reagents is independent of the sequential or simultaneous imaging. In some embodiments, the specimen is incubated with all of the bivalent binding reagents, but the incubating with the labeled binding reagents may be simultaneous or sequential, with, in some embodiments, imaging after each incubation with each labeled binding reagent.
In some embodiments, after each imaging of the sample, the one or more labeled binding reagents are washed out of the specimen before the next one or more labeled binding reagents are incubated then imaged. In some embodiments, the washing out comprises removing the labeled binding reagents. In some embodiments, the washing out comprises reducing a disulfide that is binding the dye to the labeled binding reagent, and washing the specimen.
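The batchwise incubate, image, and wash-out cycle can be sketched as a simple scheduling problem in which a dye may appear only once per imaging round. The reagent names, the per-round dye limit, and the greedy assignment rule are illustrative assumptions; only the dye names are taken from the list given above.

```python
def plan_rounds(reagents, max_dyes_per_round):
    """Group labeled binding reagents into imaging rounds.

    reagents: list of (reagent_name, dye_name) pairs (names hypothetical)
    max_dyes_per_round: how many distinct dyes one scan can resolve
    A dye is used at most once per round, so two reagents that carry the
    same dye are imaged in different rounds, with a wash-out in between.
    """
    if max_dyes_per_round < 1:
        raise ValueError("max_dyes_per_round must be at least 1")
    rounds = []  # each round maps dye -> reagent
    for name, dye in reagents:
        for rnd in rounds:
            if dye not in rnd and len(rnd) < max_dyes_per_round:
                rnd[dye] = name
                break
        else:
            # No existing round can take this reagent; start a new one.
            rounds.append({dye: name})
    return rounds


reagents = [
    ("LBR1", "Cy5"),
    ("LBR2", "Alexa Fluor 647"),
    ("LBR3", "Cy5"),
    ("LBR4", "Alexa Fluor 750"),
]
schedule = plan_rounds(reagents, max_dyes_per_round=2)
for i, rnd in enumerate(schedule, start=1):
    print(f"round {i}: incubate and image {rnd}, then wash out")
```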
In some embodiments, the imaging is low magnification imaging. In some embodiments, the specimen is cleared before incubation or imaging. In some embodiments, the specimen is embedded in a hydrogel before incubating or imaging.
In some embodiments, the specimen is brain. In some embodiments, the set of bivalent binding reagents for scanning brain cell types comprises one or more of SEQ ID NOs:25-48. In some embodiments, the set of labeled binding reagents comprises one or more of SEQ ID NOs:75-98. In some embodiments, the set of bivalent binding reagents for scanning brain cell types comprises SEQ ID NOs:25-48 and the set of labeled binding reagents comprises SEQ ID NOs:75-98.
In some embodiments, the specimen is a whole organ. In some embodiments, the specimen is fresh, frozen, formalin preserved, alcohol preserved, a thin section, a thick section, a biopsy specimen or a previously formalin-fixed, paraffin embedded specimen. In some embodiments, the specimen is obtained from a patient, a healthy subject, a pathology specimen, a fossilized specimen, a frozen or cryogenically preserved specimen, an exhumed specimen or a mummified specimen.
In some embodiments, the bivalent binding reagent comprises an antibody or antigen-binding fragment.
In some embodiments, a bivalent binding reagent is provided selected from among SEQ ID NOs:25-48.
In some embodiments, a labeled binding reagent is provided selected from among SEQ ID NOs:75-98.
These and other aspects of the embodiments described herein will be provided in the ensuing descriptions of the drawings and detailed description of the embodiments.
Exemplary embodiments of the disclosure will be better understood and readily apparent to one of ordinary skill in the art from the following written description, by way of example only, and in conjunction with the drawings, in which:
Unless otherwise defined herein, scientific and technical terms used in connection with the present application shall have the meanings that are commonly understood by those of ordinary skill in the art. Further, unless otherwise required by context, singular terms shall include pluralities and plural terms shall include the singular.
As employed above and throughout the disclosure, the following terms and abbreviations, unless otherwise indicated, shall be understood to have the following meanings:
In the present disclosure, the singular forms “a,” “an,” and “the” include the plural reference, and reference to a particular numerical value includes at least that particular value, unless the context clearly indicates otherwise. Thus, for example, a reference to “a compound” is a reference to one or more of such compounds and equivalents thereof known to those skilled in the art, and so forth. The term “plurality”, as used herein, means more than one. When a range of values is expressed, another embodiment includes from the one particular and/or to the other particular value.
Similarly, when values are expressed as approximations, by use of the antecedent “about,” it is understood that the particular value forms another embodiment. All ranges are inclusive and combinable. In the context of the present disclosure, by “about” a certain amount it is meant that the amount is within ±20% of the stated amount, or preferably within ±10% of the stated amount, or more preferably within ±5% of the stated amount.
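The tolerance bands in this definition can be expressed as a one-line check. The function name `is_about` and the choice to treat the band as inclusive of its endpoints are assumptions made for illustration.

```python
def is_about(value, stated, tolerance=0.20):
    """True if value lies within +/- tolerance (as a fraction) of stated.

    tolerance=0.20 corresponds to the broadest band in the definition;
    0.10 and 0.05 correspond to the preferred and more preferred bands.
    The band is treated as inclusive, which is an assumption.
    """
    return abs(value - stated) <= tolerance * abs(stated)


print(is_about(11, 10, tolerance=0.10))
print(is_about(12.5, 10))
```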
“Cell type” refers to a classification of biological cells using a classification system wherein at least two cells have at least one difference between them, and are thereby classified as different cell types, but typically refers to a taxonomy wherein the genotypic and/or phenotypic properties of a cell define its type, such properties including but not limited to (a) anatomical morphology; (b) apparent physiological function; (c) protein markers; (d) genetic markers; (e) developmental origins and taxonomy; (f) the organ and/or tissue and/or structure where the cell is typically found in an organism; (g) epigenetic states and markers; (h) electrophysiology; (i) cross-species homology; or (j) combinations of any of these factors. As used herein, cell type refers to a cell having properties different from any other cell as differentiated by the methods described herein. It will be readily understood to persons skilled in the art that any particular cell may have many possible valid “cell type” classifications according to various different heuristics and/or levels of specificity. For example, hepatic stellate cells could be classified as having “liver cell” cell type, or could be classified as having “neurotrophin-expressing cell” cell type. (See C. Schachtrup et al., Hepatic stellate cells and astrocytes, C
As used herein, the term “genome” refers to the genetic material (e.g., chromosomes) of an organism or a host cell.
As used herein, the term “proteome” refers to the entire set of proteins expressed by a genome, cell, tissue or organism. A “partial proteome” refers to a subset of the entire set of proteins expressed by a genome, cell, tissue or organism. Examples of “partial proteomes” include, but are not limited to, transmembrane proteins, secreted proteins, and proteins with a membrane motif.
As used herein, the terms “protein,” “polypeptide,” and “peptide” refer to a molecule comprising amino acids joined via peptide bonds. In general, “peptide” is used to refer to a sequence of 20 or fewer amino acids and “polypeptide” is used to refer to a sequence of greater than 20 amino acids.
As used herein, the terms “synthetic polypeptide,” “synthetic peptide” and “synthetic protein” refer to peptides, polypeptides, and proteins that are produced by a recombinant process (i.e., expression of exogenous nucleic acid encoding the peptide, polypeptide or protein in an organism, host cell, or cell-free system) or by chemical synthesis.
As used herein, the term “protein of interest” refers to a protein encoded by a nucleic acid of interest.
As used herein, the term “native” (or wild type) when used in reference to a protein refers to proteins encoded by the genome of a cell, tissue, or organism, other than one manipulated to produce synthetic proteins.
Abbreviations used herein for nucleotides shall adhere to industry standards as defined in WIPO Standard ST.25, Annex C, Appendix 2, Tables 1 & 2. Abbreviations used herein for the canonical proteinogenic amino acids adhere to industry standards as defined in WIPO Standard ST.25, Annex C, Appendix 2, Tables 3 & 4, and should be readily understood by persons having ordinary skill in the art. Amino acids abbreviated with the prefix D- refer to the D-enantiomer, but without any prefix shall be understood as referring to the L-enantiomer. As used herein, modified, uncommon, and non-proteinogenic amino acids shall be abbreviated as follows: Aad=2-aminoadipic acid; bAad=3-aminoadipic acid; Acpc=1-aminocyclopropanecarboxylic acid; bAla=β-alanine (i.e., β-aminopropionic acid); Abu=2-aminobutyric acid; 4Abu=4-aminobutyric acid (i.e., piperidinic acid); Acp=6-aminocaproic acid; Ahe=2-aminoheptanoic acid; Aib=2-aminoisobutyric acid; bAib=3-aminoisobutyric acid; Apm=2-aminopimelic acid; Dbu=2,4-diaminobutyric acid; Des=desmosine; Dpm=2,2′-diaminopimelic acid; Dpr=2,3-diaminopropionic acid; EtGly=N-ethylglycine; EtAsn=N-ethylasparagine; Hse=homoserine (i.e., isothreonine); Hyl=hydroxylysine; aHyl=allo-hydroxylysine; 3Hyp=3-hydroxyproline; 4Hyp=4-hydroxyproline; Ide=isodesmosine; aIle=allo-isoleucine; MeGly=N-methylglycine (i.e., sarcosine); MeIle=N-methylisoleucine; MeLys=6-N-methyllysine; MeVal=N-methylvaline; Nva=norvaline; Nle=norleucine; and Orn=ornithine. Additionally, some alternative abbreviations for uncommon amino acids may be known to those having ordinary skill in the art, and may be readily understood from their context. For example, sometimes norleucine may be abbreviated as “norLeu” and homoserine may be abbreviated as “homoSer.”
A “sequence read” or “read” refers to data representing a sequence of monomer units (e.g., bases) that comprise a nucleic acid molecule (e.g., DNA, cDNA, RNAs including mRNAs, rRNAs, siRNAs, miRNAs and the like). The sequence read can be measured from a given molecule via a variety of techniques.
As used herein, a “fragment” refers to a nucleic acid molecule that is in a biological sample. Fragments can be referred to as long or short, e.g., fragments longer than 10 Kb (e.g. between 50 Kb and 100 Kb) can be referred to as long, and fragments shorter than 1,000 bases can be referred to as short. A long fragment can be broken up into short fragments, upon which sequencing is performed.
A “mate pair” or “mated reads” or “paired-end” can refer to any two reads from a same molecule (also referred to as two arms of a same read, i.e., arm reads) that are not fully overlapped (i.e., cover different parts of the molecule). Each of the two reads would be from different parts of the same molecule, e.g., from the two ends of the molecule. As another example, one read could be for one end of the molecule and the other read for a middle part of the molecule. As a genetic sequence can be ordered from beginning to end, a first read of a molecule can be identified as existing earlier in a genome than the second read of the molecule when the first read starts and/or ends before the start and/or end of the second read. More than two reads can be obtained for each molecule, where each read would be for a different part of the molecule. Usually there is a gap (mate gap) of about 100-10,000 bases of unread sequence between two reads. Examples of mate gaps include 500+/−200 bases and 1000+/−300 bases.
“Mapping” or “aligning” refers to a process which relates a read (or a pair of reads, e.g., of a mate pair) to zero, one, or more locations in a reference sequence to which the read is similar, e.g., by matching the instantiated arm read to one or more keys within an index corresponding to a location within a reference.
As used herein, an “allele” corresponds to one or more nucleotides (which may occur as a substitution or an insertion) or a deletion of one or more nucleotides. A “locus” corresponds to a location in a genome. For example, a locus can be a single base or a sequential series of bases. The term “genomic position” can refer to a particular nucleotide position in a genome or a contiguous block of nucleotide positions. A “heterozygous locus” (also called a “het”) is a location in a reference genome or a specific genome of the organism being mapped, where the copies of a chromosome do not have a same allele (e.g. a single nucleotide or a collection of nucleotides). A “het” can be a single-nucleotide polymorphism (SNP) when the locus is one nucleotide that has different alleles. A “het” can also be a location where there is an insertion or a deletion (collectively referred to as an “indel”) of one or more nucleotides or one or more tandem repeats. A single nucleotide variation (SNV) corresponds to a genomic position having a nucleotide that differs from a reference genome for a particular person. An SNV can be homozygous for a person if there is only one nucleotide at the position, and heterozygous if there are two alleles at the position. A heterozygous SNV is a het. SNP and SNV are used interchangeably herein.
Sequencing refers to the determination of intensity values corresponding to positions of one or more nucleic acids. The “intensity values” can be any signal, e.g., electrical or electromagnetic radiation, such as visible light. There can be one intensity value per base, multiple intensity values per base, or fewer intensity values than there are bases. Also, an intensity value can be for a particular position, or an intensity value can be for multiple positions of a nucleic acid. Intensity values can be restricted to predetermined values (e.g., binary or integers in a decimal numeral system), or can have continuous values.
A “sequencing process” or “sequencing run” refers to the determination of intensity values corresponding to positions of one or more nucleic acids as a batch. For example, when the sequencing involves imaging biochemical reactions of nucleic acids on a substrate, the resulting intensity values are obtained during the same sequencing run. Intensity values of nucleic acids for a different substrate would appear in different sequencing runs. A nucleic acid of a first sequencing run would not be involved in a second sequencing run (e.g., not included in a same image).
An “assumed sequence” corresponds to the sequence that is believed to be accurate. The determination may be inaccurate, but the training assumes it is accurate. The assumed sequence can be determined in a variety of ways, e.g., as described herein. An assumed sequence can include no calls, and thus an assumed sequence can have open positions between called positions.
As used herein, the term “transmembrane protein” refers to proteins that span a biological membrane. There are two basic types of transmembrane proteins. Alpha-helical proteins are present in the inner membranes of bacterial cells or the plasma membrane of eukaryotes, and sometimes in the outer membranes. Beta-barrel proteins are found only in outer membranes of Gram-negative bacteria, cell wall of Gram-positive bacteria, and outer membranes of mitochondria and chloroplasts.
As used herein, the term “external loop portion” refers to the portion of a transmembrane protein that is positioned between two membrane-spanning portions of the transmembrane protein and projects outside of the membrane of a cell.
As used herein, the term “tail portion” refers to an N-terminal or C-terminal portion of a transmembrane protein that terminates on the inside (“internal tail portion”) or outside (“external tail portion”) of the cell membrane.
As used herein, the term “secreted protein” refers to a protein that is secreted from a cell.
As used herein, the term “membrane motif” refers to an amino acid sequence that encodes a motif that is not a canonical transmembrane domain but that would be expected, from its function deduced in relation to other similar proteins, to be located in a cell membrane, such as those listed in the publicly available PSORTb database.
As used herein, the term “consensus protease cleavage site” refers to an amino acid sequence that is recognized by a protease such as trypsin or pepsin.
As used herein, the term “affinity” refers to a measure of the strength of binding between two members of a binding pair, for example, an antibody and an epitope, or an epitope and an MHC-I or II haplotype.
As used herein, the term “antigen binding protein” refers to proteins that bind to a specific antigen. “Antigen binding proteins” include, but are not limited to, immunoglobulins, including polyclonal, monoclonal, chimeric, single chain, and humanized antibodies, Fab fragments, F(ab′)2 fragments, and Fab expression libraries. Various procedures known in the art are used for the production of polyclonal antibodies. For the production of antibody, various host animals can be immunized by injection with the peptide corresponding to the desired epitope including but not limited to rabbits, mice, rats, sheep, goats, etc. Various adjuvants are used to increase the immunological response, depending on the host species, including but not limited to Freund's (complete and incomplete), mineral gels such as aluminum hydroxide, surface active substances such as lysolecithin, pluronic polyols, polyanions, peptides, oil emulsions, keyhole limpet hemocyanins, dinitrophenol, and potentially useful human adjuvants such as BCG (Bacille Calmette-Guerin) and Corynebacterium parvum.
For preparation of monoclonal antibodies, any technique that provides for the production of antibody molecules by continuous cell lines in culture may be used (See e.g., Harlow and Lane, Antibodies: A Laboratory Manual, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y.). These include, but are not limited to, the hybridoma technique originally developed by Köhler and Milstein (Köhler and Milstein, Nature, 256:495-497 [1975]), as well as the trioma technique, the human B-cell hybridoma technique (See e.g., Kozbor et al., I
According to the disclosure, techniques described for the production of single chain antibodies (U.S. Pat. No. 4,946,778) can be adapted to produce specific single chain antibodies as desired. An additional embodiment of the disclosure utilizes the techniques known in the art for the construction of Fab expression libraries (Huse et al., S
Antibody fragments that contain the idiotype (antigen binding region) of the antibody molecule can be generated by known techniques. For example, such fragments include but are not limited to: the F(ab′)2 fragment that can be produced by pepsin digestion of an antibody molecule; the Fab′ fragments that can be generated by reducing the disulfide bridges of an F(ab′)2 fragment, and the Fab fragments that can be generated by treating an antibody molecule with papain and a reducing agent.
Genes encoding antigen-binding proteins can be isolated by methods known in the art. In the production of antibodies, screening for the desired antibody can be accomplished by techniques known in the art (e.g., radioimmunoassay, ELISA (enzyme-linked immunosorbent assay), “sandwich” immunoassays, immunoradiometric assays, gel diffusion precipitin reactions, immunodiffusion assays, in situ immunoassays (using colloidal gold, enzyme or radioisotope labels, for example), Western Blots, precipitation reactions, agglutination assays (e.g., gel agglutination assays, hemagglutination assays, etc.), complement fixation assays, immunofluorescence assays, protein A assays, immunoelectrophoresis assays, and the like).
As used herein, the terms “computer memory” and “computer memory device” refer to any storage media readable by a computer processor. Examples of computer memory include, but are not limited to, RAM, ROM, computer chips, digital video discs (DVDs), compact discs (CDs), hard disk drives (HDD), solid-state drives (SSD), and magnetic tape.
As used herein, the term “computer readable medium” refers to any device or system for storing and providing information (e.g., data and instructions) to a computer processor. Examples of computer readable media include, but are not limited to, DVDs, CDs, hard disk drives, magnetic tape and servers for streaming media over networks.
As used herein, the terms “processor” and “central processing unit” or “CPU” are used interchangeably and refer to a device that is able to read a program from a computer memory (e.g., ROM or other computer memory) and perform a set of steps according to the program.
A “machine-learning model” (also referred to as a model) refers to techniques that predict output base calls based on known results (training data). The known results can be an assumed sequence, which is assumed to be correct. As the model attempts to predict the results of the training data, the machine learning can be supervised learning, where the supervision comes from the training data.
A “base call” is a determination of a base at a position in a nucleic acid. A base call can be a no-call or a specified base. A base call can be made independently or as part of a combination of specified bases (e.g., A/T), which can be for a same genomic position (e.g., if respective scores are close to each other) or for multiple positions. A “score” output from a machine-learning model can be used to determine a base call at a position. For example, a score can be provided for each of the bases. The determination of the base call based on the scores can be considered part of the model. Some models can provide a score, where the scores are used by a later process. Examples of a score can be a probability or a possibility. The probability scores for each of the bases would sum to a fixed number, i.e., one. The possibility scores are not required to sum to the fixed number. Each possibility score can be constrained to be between 0 and 1. The possibility scores could sum to 1, particularly if a model is trained well.
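By way of illustration only, the distinction between probability scores and possibility scores, and the derivation of a base call from either, can be sketched as follows. The function names, the use of a softmax and of independent sigmoids, and the thresholds `min_score` and `tie_margin` are assumptions made for this sketch and are not prescribed by the disclosure.

```python
import numpy as np

BASES = ("A", "C", "G", "T")

def probability_scores(raw):
    """Softmax over the four bases: scores are non-negative and sum to one."""
    z = np.asarray(raw, dtype=float)
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

def possibility_scores(raw):
    """Independent sigmoids: each score lies in (0, 1); the sum is unconstrained."""
    z = np.asarray(raw, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

def call_base(scores, min_score=0.3, tie_margin=0.1):
    """Return a specified base, a combination such as 'A/T', or 'N' (no-call).

    min_score and tie_margin are hypothetical thresholds chosen for illustration.
    """
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)[::-1]
    best, second = order[0], order[1]
    if scores[best] < min_score:
        return "N"                    # no base is supported well enough
    if scores[best] - scores[second] < tie_margin:
        # respective scores are close to each other: report a combination
        return "/".join(sorted((BASES[best], BASES[second])))
    return BASES[best]

p = probability_scores([2.0, 0.1, 0.1, 0.1])   # sums to one
q = possibility_scores([3.0, -3.0, -3.0, 2.9]) # need not sum to one
print(call_base(p), call_base(q), call_base([0.25, 0.25, 0.25, 0.25]))
```

In this sketch the first call returns a single base, the second returns a combination because two possibility scores are nearly equal, and the third returns a no-call.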
As used herein, the term “neural network” refers to various configurations of classifiers used in machine learning, including multilayered perceptrons with one or more hidden layers, support vector machines and dynamic Bayesian networks. These methods share in common the ability to be trained, the ability to have the quality of their training evaluated, and the ability to make either categorical classifications or predictions of continuous numbers in a regression mode.
As used herein, the term “principal component analysis” refers to a mathematical process which reduces the dimensionality of a set of data (Wold, S., Sjöström, M., & Eriksson, L., Chemometrics and Intelligent Laboratory Systems 2001. 58: 109-130; Multivariate and Megavariate Data Analysis Basic Principles and Applications (Parts I&II) by L. Eriksson, E. Johansson, N. Kettaneh-Wold, & J. Trygg, 2006 2nd Ed. Umetrics Academy). Derivation of principal components is a linear transformation that locates directions of maximum variance in the original input data, and rotates the data along these axes. For n original variables, n principal components are formed as follows: The first principal component is the linear combination of the standardized original variables that has the greatest possible variance. Each subsequent principal component is the linear combination of the standardized original variables that has the greatest possible variance and is uncorrelated with all previously defined components. Further, the principal components are scale-independent in that they can be developed from different types of measurements.
The terms “dimensionality reduction” or “dimension reduction” (sometimes abbreviated “dred”) refer to the process of reducing the number of variables or features under consideration, via obtaining a set of “uncorrelated” principal variables.
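The derivation of principal components described above can be illustrated with a minimal sketch that standardizes the original variables and obtains the components from a singular value decomposition. The function name `pca` and the synthetic data are assumptions made for illustration; any equivalent numerical routine could be substituted.

```python
import numpy as np

def pca(X, n_components):
    """Principal components of the standardized columns of X (samples x variables)."""
    X = np.asarray(X, dtype=float)
    # Standardize each original variable (zero mean, unit sample variance).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Rows of Vt are the directions of maximum variance, in decreasing order.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    components = Vt[:n_components]
    scores = Z @ components.T                      # data rotated onto those axes
    explained_variance = S[:n_components] ** 2 / (Z.shape[0] - 1)
    return scores, components, explained_variance

# Synthetic example: six observed variables driven by two latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 6))

scores, components, var = pca(X, n_components=3)
print(var)                                  # variances in decreasing order
print(np.cov(scores, rowvar=False).round(6))  # off-diagonal entries are ~0
```

The covariance matrix of the scores is diagonal, which reflects the requirement that each component be uncorrelated with all previously defined components.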
As used herein, the term “vector” when used in relation to a computer algorithm or the present disclosure, refers to a numerical-array representation of an object or feature, such as, e.g., a nucleic acid or a protein, generated such that an algorithm may perform processing and statistical analysis. (For instance, a “
As used herein, the term “vector,” when used in relation to recombinant DNA technology, refers to any genetic element, such as a plasmid, phage, transposon, cosmid, chromosome, retrovirus, virion, etc., which is capable of replication when associated with the proper control elements and which can transfer gene sequences between cells. Thus, the term includes cloning and expression vehicles, as well as viral vectors.
As used herein, the term “cell culture” refers to any in vitro culture of cells. Included within this term are continuous cell lines (e.g., with an immortal phenotype), primary cell cultures, finite cell lines (e.g., non-transformed cells), and any other cell population maintained in vitro, including oocytes and embryos.
The term “isolated” when used in relation to a nucleic acid, as in “an isolated oligonucleotide” refers to a nucleic acid sequence that is identified and separated from at least one contaminant nucleic acid with which it is ordinarily associated in its natural source. Isolated nucleic acids are nucleic acids present in a form or setting that is different from that in which they are found in nature. In contrast, non-isolated nucleic acids are nucleic acids such as DNA and RNA that are found in the state in which they exist in nature.
The terms “in operable combination,” “in operable order,” and “operably linked” as used herein refer to the linkage of nucleic acid sequences in such a manner that a nucleic acid molecule capable of directing the transcription of a given gene and/or the synthesis of a desired protein molecule is produced. The term also refers to the linkage of amino acid sequences in such a manner so that a functional protein is produced.
As used herein, the term “purified” or “to purify” refers to the removal of undesired components from a sample. As used herein, the term “substantially purified” refers to molecules, either nucleic or amino acid sequences, that are removed from their natural environment, isolated or separated, and are at least 60% free, preferably 75% free, and most preferably 90% free from other components with which they are naturally associated. An “isolated polynucleotide” is therefore a substantially purified polynucleotide.
The terms “bacteria” and “bacterium” refer to prokaryotic organisms, including those within all of the phyla in the Kingdom Procaryotae. It is intended that the term encompass all microorganisms considered to be bacteria including Mycoplasma, Chlamydia, Actinomyces, Streptomyces, and Rickettsia. All forms of bacteria are included within this definition including cocci, bacilli, spirochetes, spheroplasts, protoplasts, etc. Also included within this term are prokaryotic organisms that are gram negative or gram positive. “Gram negative” and “gram positive” refer to staining patterns with the Gram-staining process that is well known in the art. (See e.g., Finegold and Martin, Diagnostic Microbiology, 6th Ed., C V Mosby St. Louis, pp. 13-15 [1982]). “Gram positive bacteria” are bacteria that retain the primary dye used in the Gram stain, causing the stained cells to appear dark blue to purple under the microscope. “Gram negative bacteria” do not retain the primary dye used in the Gram stain, but are stained by the counterstain. Thus, gram negative bacteria appear red. In some embodiments, the bacteria are those capable of causing disease (pathogens) and those that cause product degradation or spoilage.
The terms “fluorescent label”, “fluorescent tag”, and “fluorescent probe” describe a molecule or molecules that attach chemically to assist in the detection of a biomolecule such as, e.g., a protein, antibody, nucleic acid polymer, amino acid, and/or lipid. Fluorescent labels may comprise fluorescent proteins, such as, e.g., blue fluorescent proteins, cyan fluorescent proteins, green fluorescent proteins, red fluorescent proteins, and yellow fluorescent proteins. Exemplary fluorescent labels include, but are in no way limited to, Sirius, Azurite, EBFP, EBFP2, FCFP, Cerulean, CyPet, SCFP, eGFP, Emerald, Superfolder avGFP, T-Sapphire, RFP, mCherry, mOrange, mRaspberry, mRuby, FusionRed, EYFP, Topaz, Venus, Citrine, YPet, SYFP, and mAmetrine. Fluorescent labels may comprise dyes, such as the blue-fluorescent DNA stain 4′,6-diamidino-2-phenylindole (DAPI). Fluorescent labels may also comprise other fluorescent biomolecule stains such as, e.g., BODIPY lipid conjugates.
As used herein, the terms “treat”, “treatment”, or “therapy” (as well as different forms thereof) refer to therapeutic treatment, including prophylactic or preventative measures, wherein the object is to prevent or slow down (lessen) an undesired physiological change associated with a disease or condition. Beneficial or desired clinical results include, but are not limited to, alleviation of symptoms, diminishment of the extent of a disease or condition, stabilization of a disease or condition (i.e., where the disease or condition does not worsen), delay or slowing of the progression of a disease or condition, amelioration or palliation of the disease or condition, and remission (whether partial or total) of the disease or condition, whether detectable or undetectable. Those in need of treatment include those already with the disease or condition as well as those prone to having the disease or condition or those in which the disease or condition is to be prevented.
The terms “subject,” “individual,” and “patient” are used interchangeably herein, and refer to an animal, for example a human, to whom treatment with a composition or formulation in accordance with the present disclosure, is provided. The term “subject” as used herein refers to human and non-human animals. The human can be any human of any age. In an embodiment, the human is an adult. In another embodiment, the human is a child. The human can be male, female, pregnant, middle-aged, adolescent, or elderly.
Conditions and disorders in a subject for which a particular drug, compound, composition, formulation (or combination thereof) is said herein to be “indicated” are not restricted to conditions and disorders for which that drug or compound or composition or formulation has been expressly approved by a regulatory authority, but also include other conditions and disorders known or reasonably believed by a physician or other health or nutritional practitioner to be amenable to treatment with that drug or compound or composition or formulation or combination thereof.
The term “subject” includes mammals, e.g., humans, companion animals (e.g., dogs, cats, birds, and the like), farm animals (e.g., cows, sheep, pigs, horses, fowl, and the like) and laboratory animals (e.g., rats, mice, guinea pigs, birds, and the like). In some embodiments, the subject is a male human or a female human.
The present subject matter may be understood more readily by reference to the following detailed description which forms a part of this disclosure. It is to be understood that this disclosure is not limited to the specific products, methods, conditions or parameters described and/or shown herein, and that the terminology used herein is for the purpose of describing particular embodiments by way of example only and is not intended to be limiting of the claimed disclosure.
The following descriptions of each step in the process are intended as guidance on means for carrying out the step in relationship to other steps in the process, and are not intended to be limiting. One of skill in the art will be able to modify the steps without deviating from the spirit of the disclosure.
Significant advances in the development of highly multiplexed spatial measurements of tissues have occurred over recent years, including, e.g., combinatorial FISH using MERFISH and seqFISH+, highly multiplexed (≥30) immunofluorescence measurements using metal-conjugated antibodies, repeated cycles of immunofluorescence imaging and signal removal, in situ sequencing, and spatial DNA barcoding. Such highly multiplexed spatial biology is rapidly becoming a “standard” approach, analogous to the rapid adoption of single-cell sequencing technologies. Nevertheless, existing spatial transcriptomics approaches are ill-suited for volumetric three-dimensional measurements of RNA in whole organs and complex two-dimensional specimens. The limitations stem from inherent technical requirements that restrict their applicability to very thin 2D sections. In the case of spatial DNA barcoding, capture of RNA necessitates thin sections with spatially defined capture bins. Analysis of the whole mouse brain would require approximately 10³ sections and 10⁹ spatial bins, and thus an unrealistic number of 10¹⁴ sequencing reads. Combinatorial FISH can only be done in thin sections due to the need for a high spatial resolution to allow subdiffraction single-molecule detection. Targeted approaches that focus on a smaller number of genes can speed acquisition. However, (i) the requirement for thin sections is independent of the number of measured genes, and (ii) a targeted approach is unlikely ever to produce the “final” map, because current transcriptional definitions of cells are very likely to change in the future, especially when one considers more developmental stages, disease models, etc. Therefore, technologies that leverage prior cell type definitions need to be fast in order to allow recreating the atlas with any update to cell type taxonomy. Existing spatial transcriptomics approaches lack the scale to make whole organ cell type mapping a routine experiment.
The present disclosure implements a novel and unexpected approach to cell-type mapping that overcomes significant problems with existing methods and systems. The present disclosure is based on a new fluorescence in situ hybridization (FISH) variant, dubbed dimensionality-reduced FISH (“dredFISH”). DredFISH combines direct measurement(s) of low-dimensional representation(s) of single-cell transcriptomics with a supervised machine learning algorithm to spatially map cell types, bypassing any need to measure expression of single genes.
In one embodiment, dredFISH leverages existing single-cell RNA sequencing (scRNAseq) data from over 370 brain cell types and a supervised basis identification algorithm known as discriminant Projective Non-Negative Matrix Factorization (dPNMF) to design aggregate measurements based on the non-negative weighted sums of the expression(s) of thousands of genes, optimized to preserve distinguishable cell-type information. DredFISH experimentally implements these weights through an oligonucleotide (oligo) design to allow for direct measurement of the low-dimensional approximation of cells' transcriptional state, thus leapfrogging the need for direct gene expression measurements. Using a supervised algorithm trained on labeled scRNAseq data, one may classify cells into their types based on their experimentally measured reduced-dimensionality representation(s). Such methods may be applied to other specimens, cell types, expression of other markers, and/or other reagents, in order to perform similar classifications.
Further, light-sheet imaging of cleared brain tissue labeled with dredFISH probes will be able to provide the throughput needed for whole brain mapping at single-cell resolutions. Current sequential hybridization techniques are problematic in large volumes due, among other issues, to probe penetration. In some embodiments, dredFISH may utilize a hyperspectral light-sheet microscope leveraging newer organic polymer dyes, allowing simultaneous imaging and measurement of numerous fluorophores through all positions in a specimen.
As described herein, the present disclosure provides for the identification of the locations within a specimen of specific cell types. However, such identification is not based on a “1:1” correlation between the cell type and its location as would be determined by conventional cell staining or even more advanced methods using immunocytochemistry or in-situ hybridization, where the specific position of a cell in a specimen is based on a detectable property (e.g., antibody binding, nucleic acid hybridization) at a location; such methods for identifying locations of numerous cell types in a large specimen are tedious, time consuming and often unnecessary in order to yield the desired information. In contrast, the methods described herein provide a higher level cell type classification within the specimen based on a plurality of properties of each cell type (e.g., receptor expression, nucleic acid expression), a plurality of specific reagents that bind to the certain receptors or nucleic acids expressed by each cell type (referred to herein as bivalent binding reagents or encoder probes), a plurality of detectably-labeled specific reagents that bind to the bivalent binding reagents (herein referred to as labeled binding reagents), a means (e.g., imaging) to readily identify the locations of the labeled binding reagents in all locations within the specimen, and an in silico designed selection of properties, reagents and labels that differentiates among the cell types and readily provides cell type location. As described herein, it has been found that a limited number of detectable dyes and cell detection reagents, together with the intensity of detectability of each reagent, can provide the cell type location information without locating every differentially expressed cell-specific receptor or nucleic acid amongst a plurality of cell types.
The methods disclosed herein are useful beyond any specific examples of nucleic acid or protein detection (using, e.g., complementary nucleic acids or antibodies, respectively), for any cellular component for which a tag or tags may be employed following the guidance here to detect the component(s) of interest to identify and differentiate cell types in the specimen. Furthermore, different types of tags (e.g., both antibodies and nucleic acids) may be used together in the practice of the methods disclosed herein for the purposes described herein. Thus, while nucleic acid-based tags are employed in certain examples herein, for tagging nucleic acids in cells, the disclosure is not so limited to any particular type of tag or any particular use of a single type of tag in a particular method. Non-limiting examples of other molecular markers include metabolites, lipids, carbohydrates including polysaccharides, glycolipids, vitamins, fatty acids, co-factors, pigments, metals, or any other biochemicals or compounds, organic or inorganic, found within a biological system. Tags useful for their detection in accordance with the teaching herein include but are not limited to antibodies and antigen-binding fragments thereof, ligands, lectins, receptors, chelators, etc.
The following descriptions of the steps of the method are exemplary and non-limiting. Variations that achieve the same or similar outcomes are fully embraced herein.
Selecting Cell Types within the Specimen (Step a)
The selection of the cell types to be located within a specimen is guided by the information desired to be obtained by locating the positions, numbers, distribution, topography, contact zones, organization, purity, shape, and/or other characteristics of such cell types within the specimen and/or relationships to other cellular, tissue or organ structures. By way of example, the distribution of cancer cells in stroma from a solid tumor biopsy, or the distribution of astrocytes and neuronal cells in the hippocampus, may be diagnostic for cancer invasiveness or neurodegeneration, respectively. Moreover, mapping of cell types using the methods disclosed herein using a normal cellular sample, specimen, tissue or organ may provide information such as what comprises a normal (e.g., healthy) cell type distribution against which to compare pathological or suspected pathological specimens. Changes in cell type distributions over time may provide methods for determining chronological or biological age from a specimen. For example,
It should be noted that specimens useful for the purposes herein may be fresh, frozen, formalin preserved, alcohol preserved, thin sections, thick sections, biopsy specimens, formalin-fixed, paraffin embedded, by way of non-limiting examples. Such specimens may come from patients, healthy subjects, pathology specimens, fossilized specimens, cryogenically preserved specimens, exhumed specimens, mummified specimens, etc. Moreover, such specimens may be embedded in a hydrogel matrix (e.g., polyacrylamide), may be cleared, may be expanded, or any combination of the above.
The disclosure herein is based on identifying locations of cells relying on detectable expression of molecular markers on each of those cell types. Such markers may be unique to a particular cell type, or the same markers can be expressed in different amounts, absolutely or relative to one or more other markers, among a number of different cell types. As noted herein, the subsequent steps, in which reagents are designed to optimally distinguish among cell types based on expression of such markers, may inform the selection of the markers to be used for the identification, such that steps (b) and (c) are interrelated, and the order in which they are carried out may be reversed or iterative.
The molecular markers of the selected cell types from step (a) may be identified from the literature. For example, the identity and levels of expression of cell surface markers among the numerous types of brain cells are known from the literature. Identities of markers expressed on or by numerous cell types in tissues and organs of numerous animal species are an expanding part of the scientific literature.
In one embodiment, the molecular markers are nucleic acid polymers. In a preferred embodiment, the nucleic acid polymers are RNA. In an alternative embodiment, the molecular markers are protein, which may be any cell-surface protein, receptor, transcription factor, antibody, or a combination thereof. In other embodiments, the molecular markers are metabolites, lipids, carbohydrates including polysaccharides, glycolipids, vitamins, fatty acids, co-factors, pigments, metals, or any other biochemicals or compounds, organic or inorganic, or any combination thereof. In some embodiments, as described further below, a single type of tag (e.g., antibody) is used to detect one or more markers. In some embodiments, two or more types of tags are used (e.g., nucleic acids and antibodies) to detect one or more types of markers. In some embodiments, two or more tags are used to detect two or more types of markers. The disclosure is not limited by the number of different types of tags or the number of types of markers identified using the methods herein.
The present disclosure comprises a computational method for determining how to optimally label cells in a biological sample such that each cell type of interest is distinguishable from each other cell type in the biological sample using a limited number of labels and a number of probes for molecular markers, based on relative levels of marker expression among the cell types of interest to be differentiated from each other. In one embodiment of the present method, up to hundreds of cell types may be reliably distinguished in a tissue sample using fewer than thirty detectable labels. Such discrimination among cell types is provided by the computationally aided selection (encoding) of binding reagents (e.g., bivalent binding reagents and labeled binding reagents) that detect those markers, the extent of detectability of each labeled binding reagent, the ability and efficiency of an imaging system to identify the labels at each location in the specimen, and the computational methods to decode the relative marker expression information back into specific cell types and locations. In some embodiments, step (c) is a machine learning based process. In some embodiments, step (c) is a dimensionality reduction process. In some embodiments, step (c) is a global optimization heuristic such as simulated annealing or a genetic algorithm.
In silico methods for designing the reagent set based on the above information are available in any number of formats, such as but not limited to algorithms including machine learning algorithms. Non-limiting examples of such algorithms include recursive partitioning and discriminant projective non-negative matrix factorization (dPNMF), among others.
Typically, this step identifies and implements a lower-dimensional representation of gene expression followed by additional statistical learning steps that assign labels to cells in the lower dimensional space. One popular dimensionality reduction scheme is principal components analysis (PCA), which is often used to create a representation of gene expression data using the first 20-50 components. Therefore, in practice, cell type classification does not occur in the original gene expression space but rather in a dimensionality-reduced space that captures enough information to accurately classify cells into types. In other words, the read-out used for cell type classification is not individual genes but, in the case of PCA, a small number of linear weighted sums of gene expression levels. Using statistical methods such as PCA or dPNMF, it is possible to avoid the need for individual gene expression measurements, circumventing the inherent challenges faced by spatial transcriptomics.
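The idea that a small number of non-negative weighted sums of gene expression can suffice for classification may be sketched on synthetic data as follows. The block-structured weight matrix `W` is a hand-built stand-in assumed for illustration (it is not the output of dPNMF), the nearest-centroid rule is likewise an assumed placeholder classifier, and all expression values are simulated.

```python
import numpy as np

rng = np.random.default_rng(1)
n_types, genes_per_block, cells_per_type = 3, 10, 50
n_genes = n_types * genes_per_block

# Simulated mean expression: each cell type over-expresses its own block of genes.
means = np.full((n_types, n_genes), 2.0)
for t in range(n_types):
    means[t, t * genes_per_block:(t + 1) * genes_per_block] = 20.0

labels = np.repeat(np.arange(n_types), cells_per_type)
X = rng.poisson(means[labels])            # cells x genes count matrix

# Non-negative weights (genes x readouts); a hypothetical stand-in for a learned basis.
W = np.zeros((n_genes, n_types))
for t in range(n_types):
    W[t * genes_per_block:(t + 1) * genes_per_block, t] = 1.0

# Each readout is a weighted sum over genes; no single gene is read individually.
Z = X @ W                                 # cells x readouts

# Classify in the reduced space with a nearest-centroid rule.
centroids = np.stack([Z[labels == t].mean(axis=0) for t in range(n_types)])
dist = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
predicted = dist.argmin(axis=1)
accuracy = (predicted == labels).mean()
print(Z.shape, accuracy)
```

Here thirty simulated genes are summarized by three aggregate readouts, and the simulated cell types remain separable in that reduced space.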
There are several dimensionality reduction methods that are typically applied to single-cell RNA expression data, the most popular of which are PCA and dPNMF. Other suitable methods may include those described below:
Recursive partitioning. In one embodiment, all cell types of interest in being located within the specimen are organized into a binary classification tree. The tree can be learned directly from observations determining among those specific cell types the known extent of expression or lack thereof of a plurality of known molecular markers of each specific cell type, using hierarchical clustering or using prior knowledge.
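One way to picture such a binary classification tree is the following sketch, which splits the set of cell types recursively on whichever marker shows the widest expression gap. This greedy largest-gap rule is an assumption standing in for hierarchical clustering or prior knowledge, and the type names, marker indices and expression values are hypothetical.

```python
import numpy as np

def build_tree(profiles, names):
    """Build a binary tree from an (n_types x n_markers) array of mean expression."""
    if len(names) == 1:
        return {"leaf": names[0]}
    best = None
    for m in range(profiles.shape[1]):
        vals = np.sort(profiles[:, m])
        gaps = np.diff(vals)
        i = int(np.argmax(gaps))
        if best is None or gaps[i] > best[0]:
            best = (gaps[i], m, (vals[i] + vals[i + 1]) / 2.0)
    gap, marker, threshold = best
    if gap <= 0:
        raise ValueError("cell types are indistinguishable on these markers")
    low = profiles[:, marker] <= threshold
    return {
        "marker": marker,
        "threshold": threshold,
        "low": build_tree(profiles[low], [n for n, k in zip(names, low) if k]),
        "high": build_tree(profiles[~low], [n for n, k in zip(names, low) if not k]),
    }

def classify(tree, expression):
    """Walk the tree using one cell's marker expression vector."""
    while "leaf" not in tree:
        branch = "high" if expression[tree["marker"]] > tree["threshold"] else "low"
        tree = tree[branch]
    return tree["leaf"]

names = ["type_A", "type_B", "type_C", "type_D"]       # hypothetical cell types
profiles = np.array([[9.0, 1.0, 1.0],                   # hypothetical marker levels
                     [1.0, 8.0, 1.0],
                     [1.0, 1.0, 7.0],
                     [1.0, 8.0, 7.0]])
tree = build_tree(profiles, names)
print(classify(tree, [8.2, 1.5, 0.7]))
```

Each internal node tests a single marker against a threshold, so every cell type is reached by a short sequence of binary decisions.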
Discernment projection non-negative matrix factorization (dPNMF). Another suitable method is a linear algebra method for basis identification called dPNMF, that is supervised and balances supervised label encoding and overall signal reconstruction. The dPNMF methodology has not yet been used previously in biological applications and has a key advantage over other basis identification methods as it identifies sparse representations that can be reapplied on unseen data. The method described in Guan et al., PLoS
Artificial Neural Network. Another suitable method is the use of an artificial neural network with multiple layers. Each layer of the network implements two mathematical operations: a linear matrix multiplication and a non-linear operation. The network is trained in a supervised manner to classify cells into their known types. To use supervised learning as a design step, we restrict the weights of the first operation (linear matrix multiplication) to be non-negative and add other restrictions related to the overall number of probes used, in order to enforce sparsity. The learned weights of the first layer are converted to bivalent reagents that map between markers and readout probes by multiplying the weight matrix by a constant and rounding to the closest integer. The resulting values determine the number of bivalent probes that map each molecular marker to a specific readout probe.
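The conversion from learned first-layer weights to integer probe counts can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the disclosure: the function name `weights_to_probe_counts`, the example weight values, and the scaling constant of 10 are all hypothetical.

```python
def weights_to_probe_counts(weights, scale):
    """Scale non-negative first-layer weights and round to integer probe counts.

    weights[g][b] is the learned weight linking molecular marker g to readout
    probe b; the returned value at [g][b] is the number of bivalent probes
    that map marker g to readout probe b.
    """
    counts = []
    for row in weights:
        if any(w < 0 for w in row):
            raise ValueError("first-layer weights are constrained to be non-negative")
        # Note: Python's round() sends exact halves to the nearest even integer.
        counts.append([int(round(w * scale)) for w in row])
    return counts


# Hypothetical learned weights: three markers (rows), two readout probes (columns).
example_weights = [
    [0.31, 0.00],
    [0.09, 0.22],
    [0.00, 0.18],
]
probe_counts = weights_to_probe_counts(example_weights, scale=10)
print(probe_counts)
```

A larger scaling constant preserves more of the weight resolution at the cost of a larger total number of probes, so in practice the constant would be chosen together with the sparsity restrictions.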
The encoding step establishing the relationship between molecular markers and cell types is achieved by discriminant projection non-negative matrix factorization (dPNMF) as follows:
In some embodiments, the classifier of step (ii) is a Naïve Bayesian classifier. In some embodiments, the classifier is a machine learning classifier. In some embodiments, the machine learning classifier is the K-nearest neighbors algorithm (KNN).
Thus, step (c) will provide the basis for the design of the bivalent binding reagents for use with a particular type of specimen. Based upon the weights assigned to each marker, bivalent binding reagents are prepared that may recognize multiple different sequences on a marker (a complementary nucleic acid for RNA markers; in the case of antibody-based reagents, they may recognize different epitopes on the same protein marker). Bivalent binding reagents that recognize different regions on the same marker and bind to the same labeled binding reagent thus provide the corresponding weight of that marker in the projection matrix, read onto the low-dimensional representation.
The foregoing methods for establishing the relationship, also referred to herein as encoding, are exemplary; the relationship may be established by other methods, and the disclosure is not limited to any particular method. Furthermore, the relationship established for a particular tissue or any other biological sample type may then be used for any other specimen of the same tissue or biological sample type, such that these steps need to be performed only once per sample type. In a non-limiting example, information for carrying out such steps may be stored and subsequently retrieved and used for processing additional specimens, including having the reagents described in step (d) already prepared and ready for use, such that rapid processing and analysis of cell type locations in incoming tissues from a biopsy specimen or tumor resection can be performed for guiding drug therapy, further surgery, or both. Specialized reagents for detecting rare, abnormal, diseased or aberrant molecular marker expression may also be provided for diagnostic purposes.
Provided with these encoded machine-learned data, which provide the weights to guide the design of the binding reagents, a staining protocol is created wherein the number of unique fluorophores equals the depth of the taxonomic tree, i.e., a vector of values, and the number of fluorophore molecules per binding reagent molecule equals the coefficient learned through the logistic regression. Bivalent binding reagents are prepared that bind to each molecular marker to be identified in the method; as noted, in some cases different bivalent binding reagents bind to multiple sites on a marker. Each molecular-marker-specific bivalent binding reagent comprises two parts: one that binds to the molecular marker, and another that is bound by a labeled binding reagent. This probe set design is shown in
Thus, in experimentally determining the design of the reagents for a particular specimen and cell types therein, a simplified design and guidance for preparing reagents is shown in
By way of further description of the example in
The resulting composite readout in the sample is the number of probes that map to B1 and B2 readout probes. The number of readout arms of each basis type (B1 and B2) in each cell is a sum of the products of the number of weights per gene times the number of RNA per gene. Thus, in total, cell-1 will have 7 B1 (7=3*2+1*1+0*1) and 4 B2 (4=0*2+2*1+2*1) whereas cell-2 will have 4 B1 (4=3*1+1*1+0*3) and 8 B2 (8=0*1+2*1+2*3). These values are compared between experimental measurement and reference to allow classification of cells into types and reconstruction of gene expression.
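The arithmetic in this example is a weighted sum per readout basis, which can be sketched in a few lines. The weights and RNA counts below are read directly off the sums in the preceding paragraph; the function name `composite_readout` and the variable names are illustrative assumptions.

```python
def composite_readout(probe_weights, rna_counts):
    """Per-basis readout: sum over genes of (probes per gene) * (RNA copies per gene).

    probe_weights[g][b] is the number of probes mapping gene g to basis b;
    rna_counts[g] is the number of RNA molecules of gene g in the cell.
    """
    n_bases = len(probe_weights[0])
    return [
        sum(probe_weights[g][b] * rna_counts[g] for g in range(len(rna_counts)))
        for b in range(n_bases)
    ]


# Rows are the three genes of the example; columns are the bases B1 and B2.
probe_weights = [[3, 0], [1, 2], [0, 2]]
cell_1_rna = [2, 1, 1]
cell_2_rna = [1, 1, 3]

readout_1 = composite_readout(probe_weights, cell_1_rna)  # [7, 4]
readout_2 = composite_readout(probe_weights, cell_2_rna)  # [4, 8]
print(readout_1, readout_2)
```

Each cell is thus summarized by one number per basis rather than one number per gene, and it is these per-basis values that are compared against the reference.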
The bivalent binding reagents disclosed herein comprise a molecular-marker binding region and at least one labeled-binding-reagent binding region. Such reagents may be prepared by any methods known in the art. In an example herein, oligonucleotide reagents are prepared and used to carry out dredFISH.
Oligonucleotide-based bivalent binding reagents comprise at least one molecular-marker binding region nucleic acid sequence, and at least one labeled-binding-reagent binding region nucleic acid sequence. In some embodiments a bivalent binding reagent may comprise two labeled-binding-reagent binding sequences, binding the same or different labeled binding reagents. In some embodiments a bivalent binding reagent may comprise three labeled-binding-reagent binding sequences. In some embodiments a bivalent binding reagent may comprise more than three labeled-binding-reagent binding sequences. Such plurality of labeled-binding-reagent binding sequences on a bivalent binding reagent may bind one or more of the same, or two different, or two of the same and one different, or three all different labeled binding reagents, or any other combination thereof. Such number of labeled-binding-reagent binding sequences will be provided in the calculations from steps a, b and c as described herein.
For purposes of amplifying the bivalent binding sequences, amplification sequences may be included in the bivalent binding reagents, which may be retained or removed before use in the staining methods disclosed herein. In some embodiments, a forward and a reverse amplification nucleic acid sequence are included in the bivalent binding reagent. In some embodiments, the forward amplification sequence is provided at the 5′ end of the bivalent binding reagent, and the reverse amplification sequence at the 3′ end.
In some embodiments, the bivalent binding sequence may comprise, from 5′ to 3′, a forward amplification sequence, a molecular-marker binding sequence, a labeled-binding-reagent binding sequence, and a reverse amplification sequence. In some embodiments the aforementioned sequence is cleaved to yield, from 5′ to 3′, a molecular-marker binding sequence and a labeled-binding-reagent binding sequence.
In some embodiments, the bivalent binding sequence may comprise, from 5′ to 3′, a forward amplification sequence, a labeled-binding-reagent binding sequence, a molecular-marker binding sequence, and a reverse amplification sequence. In some embodiments the aforementioned sequence is cleaved to yield, from 5′ to 3′, a labeled-binding-reagent binding sequence and a molecular-marker binding sequence.
In some embodiments, the bivalent binding sequence may comprise, from 5′ to 3′, a forward amplification sequence, a labeled-binding-reagent binding sequence, a labeled-binding-reagent binding sequence, a molecular-marker binding sequence, and a reverse amplification sequence. In some embodiments the aforementioned sequence is cleaved to yield, from 5′ to 3′, a labeled-binding-reagent binding sequence, a labeled-binding-reagent binding sequence, and a molecular-marker binding sequence.
In some embodiments, the bivalent binding sequence may comprise, from 5′ to 3′, a forward amplification sequence, a labeled-binding-reagent binding sequence, a labeled-binding-reagent binding sequence, a molecular-marker binding sequence, a labeled-binding-reagent binding sequence, and a reverse amplification sequence. In some embodiments the aforementioned sequence is cleaved to yield, from 5′ to 3′, a labeled-binding-reagent binding sequence, a labeled-binding-reagent binding sequence, a molecular-marker binding sequence, and a labeled-binding-reagent binding sequence.
In some embodiments, the bivalent binding sequence may comprise, from 5′ to 3′, a forward amplification sequence, a labeled-binding-reagent binding sequence, a molecular-marker binding sequence, a labeled-binding-reagent binding sequence, a labeled-binding-reagent binding sequence, and a reverse amplification sequence. In some embodiments the aforementioned sequence is cleaved to yield, from 5′ to 3′, a labeled-binding-reagent binding sequence, a molecular-marker binding sequence, a labeled-binding-reagent binding sequence, and a labeled-binding-reagent binding sequence.
In any of the bivalent binding reagents disclosed herein, additional nucleotides (e.g., A, C, G, T) may be provided as spacers between the aforementioned regions, such as one or more A.
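The region arrangements described in the preceding paragraphs amount to ordered concatenation, optionally with spacer nucleotides between regions. The sketch below illustrates one such arrangement. The sequences are short placeholders invented for illustration and are not sequences from Table 1; the function name `assemble_bivalent_oligo` and the region keys are likewise hypothetical, and cleavage is modeled simply as omitting the terminal amplification regions.

```python
def assemble_bivalent_oligo(layout, regions, spacer=""):
    """Concatenate the named regions 5' to 3', with an optional spacer between them."""
    return spacer.join(regions[name] for name in layout)


# Placeholder sequences for illustration only (NOT taken from Table 1).
regions = {
    "fwd_amp": "ACGTACGT",   # forward amplification sequence
    "readout": "TTTTCCCC",   # labeled-binding-reagent binding sequence
    "marker": "GGGGAAAA",    # molecular-marker binding sequence
    "rev_amp": "TGCATGCA",   # reverse amplification sequence
}

# One of the arrangements described above: two readout regions, the marker
# binding region, then a third readout region, flanked by amplification sequences.
layout = ["fwd_amp", "readout", "readout", "marker", "readout", "rev_amp"]

full_length = assemble_bivalent_oligo(layout, regions, spacer="A")
# Simplified model of the cleaved form: the same layout without the flanking regions.
cleaved = assemble_bivalent_oligo(layout[1:-1], regions, spacer="A")
print(full_length)
print(cleaved)
```

Changing the `layout` list is enough to express any of the other arrangements, including those with one or two labeled-binding-reagent binding sequences.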
The foregoing arrangements of the components of a bivalent binding reagent are merely exemplary, and alternate designs of the reagents are embraced herein without deviating from the intent of the disclosure. Table 1 sets forth a subset of the bivalent binding sequences used in the brain cell type scanning example herein; said sequences comprise, from 5′ to 3′, a forward amplification sequence, a labeled-binding-reagent binding sequence, a labeled-binding-reagent binding sequence, a molecular-marker binding sequence, a labeled-binding-reagent binding sequence, and a reverse amplification sequence. It is noted that some bivalent binding reagents are provided that bind to different sequences on the same molecular marker; as described herein above, such multiple bivalent binding reagents binding the same marker provide the weights that each molecular marker target contributes towards the basis measurement.
The foregoing bivalent binding sequences may be prepared by any method for preparing oligonucleotide sequences, and amplified using known methods. In one embodiment, PCR was used to add a T7 promoter sequence, converting the ssDNA to dsDNA, followed by in vitro transcription to convert the dsDNA to ssRNA. Reverse transcription is then used to convert the ssRNA to ssDNA. Each of the foregoing steps amplifies the total number of molecules. In another embodiment, asymmetric PCR is used, which produces an excess of ssDNA directly. In another embodiment, a rolling circle approach is used, in which the initial template is circularized and then amplified into a long ssDNA strand consisting of many repeats of the template before being cleaved back to template-sized ssDNA. These are merely non-limiting examples of methods by which the oligonucleotide reagents disclosed here can be amplified to high quantities for the uses herein. In other embodiments, oligonucleotide reagents may be purified by any of many methods known in the art, such as but not limited to phenol-chloroform extraction to remove proteins, followed by a dialysis column to concentrate and buffer-exchange the oligonucleotides into small volumes of water. Other examples include alcohol precipitation protocols for concentrating oligonucleotide reagents, as well as Speedvac concentration, in which the solvent is evaporated to concentrate the reagents. Cleavage of the amplification regions may be achieved by restriction digestion, though this is not required for the oligonucleotides to carry out their intended purposes.
Oligonucleotide-based labeled binding reagents comprise an oligonucleotide sequence that binds to a labeled-binding-reagent binding sequence of a bivalent binding reagent, and a detectable label such as a fluorescent dye. In some embodiments the dye is covalently bound to the oligonucleotide region of the labeled binding reagent. In some embodiments, the dye is reversibly linked to the oligonucleotide region of the labeled binding reagent, for example using a disulfide bond, such that it can be cleaved (reduced) and removed for successive imaging using labeled binding reagents incorporating the same dye.
Non-limiting examples of labeled binding reagents are provided in Table 2 below. The table shows the sequence of the labeled binding reagent, and the sequence of the labeled-binding-reagent binding sequence(s) on the bivalent binding reagent to which it binds. Such labeled binding sequences are merely exemplary of those useful for the purposes disclosed herein; a skilled artisan will easily modify the design to accommodate other means for carrying out the teaching herein, including using non-oligonucleotide based reagents.
This disclosure encompasses the bivalent binding reagents and labeled binding reagents disclosed herein in Tables 1 and 2, including the bivalent binding reagents with and without either one or both of the amplification sequences, and the labeled binding reagents with and without a bound dye. As noted herein, the Cy5 dye used in the example (Cyanine5, Cy5 acid, CAS Registry No. 1032678-07-1, e.g., from BroadPharm) is representative of any of numerous dyes useful for the purposes herein, and the sample preparation and imaging protocol will guide the use of a single dye with successive imaging, the use of multiple different dyes and simultaneous imaging to scan those multiple dyes, or a different dye for each labeled binding reagent and simultaneous scanning for all dyes. Resources for other dyes useful for the purposes herein are found in: Beliveau et al., 2014, Visualizing genomes with Oligopaint FISH probes, Curr Protoc Mol Biol 2014 Jan. 6; 105:14.23.1-14.23.20.
Non-limiting examples of other dyes useful for these purposes include Alexa.Fluor.350, Alexa.Fluor.405, Alexa.Fluor.488.H2O, Alexa.Fluor.532, Alexa.Fluor.610, Alexa.Fluor.633, ATTO.430LS, ATTO.490LS, ATTO.565, BD.Horizon.V450, BUV395 (BD.Horizon.Brilliant.Ultraviolet.395), BUV496 (BD.Horizon.Brilliant.Ultraviolet.496), BUV563 (BD.Horizon.Brilliant.Ultraviolet.563), BUV661 (BD.Horizon.Brilliant.Ultraviolet.661), BV421 (BD.Horizon.Brilliant.Violet.421), BV480 (BD.Horizon.Brilliant.Violet.480), BV510 (BD.Horizon.Brilliant.Violet.510), BV570 (BD.Horizon.Brilliant.Violet.570), BV605 (BD.Horizon.Brilliant.Violet.605), BV650 (BD.Horizon.Brilliant.Violet.650), CF405L, CF405M, CF405S, CF430, Cy3, Cy3.5, Cy5, DY.350XL, DY.360XL, DY.375XL, DY.380XL, DY.395XL, DY.480XL, DY.485XL, DY.510XL, DY.521XL, NovaBlue.530, NovaBlue.610, NovaBlue.660, NovaYellow.570, NovaYellow.660, POPO.1, PromoFluor.350LSS, PromoFluor.370LSS, PromoFluor.375LSS, PromoFluor.488LSS, PromoFluor.500LSS, PromoFluor.510LSS, PromoFluor.520LSS, PromoFluor.532, Super.Bright.436, Super.Bright.600, Super.Bright.645, Super.Bright.702, and SYTO.40. Other dyes include BODIPY 630/650-X, LC Red 640, BODIPY 650/665-X, Alexa Fluor 647, Alexa Fluor 660, Cyanine5.5, Alexa Fluor 680, Alexa Fluor 700 and Alexa Fluor 750. The disclosure is not limited by such selections; the choice may depend on the availability of instruments to perform the requisite scanning to achieve the purposes herein, the time necessary for the scan and need for the resultant data, and other factors which do not deviate from the intended purposes disclosed herein. For hyperspectral scanning, a selection of dyes includes DY.360XL, CF405S, NovaBlue.530, NovaYellow.570, Alexa.Fluor.633, DY.375XL, ATTO.490LS, BUV395 (BD.Horizon.Brilliant.Ultraviolet.395), BUV496 (BD.Horizon.Brilliant.Ultraviolet.496), BUV563 (BD.Horizon.Brilliant.Ultraviolet.563), BUV661 (BD.Horizon.Brilliant.Ultraviolet.661), POPO.1, DY.380XL, BV570 (BD.Horizon.Brilliant.Violet.570), BV605 (BD.Horizon.Brilliant.Violet.605), Super.Bright.645, CF430, PromoFluor.350LSS, DY.395XL, PromoFluor.520LSS, DY.485XL, NovaYellow.660, PromoFluor.500LSS, PromoFluor.510LSS, and Super.Bright.702.
Other designs of the labeled reagent set will be dictated by the encoding method used. Labeled binding reagents are prepared using dye-oligonucleotide or dye-protein (antibody or antigen-binding fragment) chemistries well known in the art. In other embodiments, other types of tags are labeled using appropriate chemistries, such as but not limited to bispecific antibodies (a bivalent reagent recognizing for example a protein target and a detectably labeled antigen).
In other embodiments, such further layers of oligonucleotide reagents are provided that relate encoding to a taxonomic tree of cell types. Such modification of the procedure as described would be correspondingly applied to the other steps in the method. In other embodiments, such further one or more layers may be achieved with antibodies or antigen-binding fragments thereof, and corresponding antigens comprising such bi-functional reagents. In other embodiments, similar methods are applied to labels on different types of tags.
In one embodiment, the dyes used in the preparation of the binding reagents are imaged using hyperspectral imaging, wherein all dyes at a particular location within the specimen are imaged simultaneously and the quantitative information on each dye present at that location is recorded. In an alternative embodiment, the dyes are imaged using sequential wavelength-limited imaging, in which the sample is imaged, washed, and reimaged in a stepwise manner. In other words, in one non-limiting example, a technician may stain the sample using Cy3-labeled binding reagents and then image in the Cy3 emission wavelength; wash the sample; then re-stain using Texas Red-labeled binding reagents and then image in the Texas Red emission wavelength; wash again; and so on. The various captured wavelengths may then be layered into a composite image. Alternate embodiments and methods for imaging are described in the paragraph on step (f), below.
Incubating the Specimen with the Bivalent Binding Reagents and the Labeled Binding Reagents (Step e)
Staining with the two types of binding reagents as specified above may be performed on specimens that are prepared for the subsequent imaging step. For thin sections wherein two-dimensional imaging is sufficient, standard specimen staining protocols may be utilized. For thicker specimens where three-dimensional information is needed, tissue preparation, such as tissue clearing, is typically required. In one non-limiting embodiment, the specimen is embedded in a hydrogel matrix. In one non-limiting embodiment, the labeled binding reagents comprise a moiety that will allow coupling to the hydrogel, such as acryloyl-containing moieties that can be polymerized into an acrylamide gel (see, e.g., Moffitt et al., PNAS Dec. 13, 2016, 113 (50) 14456-14461). In some embodiments, the specimen after polymerization into the gel may be isotropically expanded to facilitate hyperspectral or any other type of imaging. In some embodiments, after the bivalent binding reagents and labeled binding reagents have localized to their binding sites, the specimen has been permeated with polymerizable monomers, and the labeled binding reagents have been cross-linked into the hydrogel matrix, the hydrogel may be cleared of specimen material, leaving the labeled binding reagents in the positions of the original cells they were designed to locate.
Because the bivalent binding reagents are typically larger than the labeled binding reagents, in particular for nucleic acid-based reagents (e.g., about 150 nucleotides vs. about 20 nucleotides), incubation with the former requires a longer period than incubation with the latter. In some embodiments, the molecular-marker binding region of the bivalent binding reagent needs to bind RNA targets that are fixed and may retain some secondary structure. Therefore, when oligonucleotide-based reagents are used, the hybridization of bivalent binding reagents to RNA takes a long time (e.g., 12 hours) but has to occur only once prior to imaging. The second type of hybridization uses labeled binding reagents that hybridize to the bivalent binding reagents. In some embodiments, this step is very fast (less than about 15 minutes) due to the simplicity of binding to bivalent binding reagents and the short length of the labeled binding reagents (about 20 nucleotides plus fluorophore). In some embodiments, multiple rounds of incubation of the sample with labeled binding reagents are needed, e.g., 8 rounds, assuming 3-color imaging of 24 readouts. Optimization of the incubation periods to improve the results and/or reduce the incubation times of the various reagents is embraced herein.
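The round count in this example follows from dividing the number of readouts by the number of colors imaged per round. The sketch below reproduces that arithmetic and gives a rough upper bound on total hybridization time using the example durations from the text (12 hours, 15 minutes); the function names are hypothetical, and the estimate deliberately excludes washing and imaging time.

```python
import math


def labeling_rounds(n_readouts, n_colors):
    """Number of labeled-reagent incubation rounds when n_colors dyes are imaged per round."""
    return math.ceil(n_readouts / n_colors)


def hybridization_hours(n_readouts, n_colors, bivalent_hours=12.0, labeled_minutes=15.0):
    """Rough upper bound: one long bivalent hybridization plus one short step per round."""
    rounds = labeling_rounds(n_readouts, n_colors)
    return bivalent_hours + rounds * labeled_minutes / 60.0


rounds = labeling_rounds(24, 3)      # 24 readouts imaged 3 colors at a time
hours = hybridization_hours(24, 3)   # 12 h + 8 rounds * 15 min
print(rounds, hours)
```

Under these assumptions the single bivalent hybridization dominates the total, which is why it matters that it occurs only once.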
In some embodiments, the incubating of the specimen with the bivalent binding reagents is performed before the incubating with the labeled binding reagents. In some embodiments, the incubating with the bivalent binding reagents is longer than the incubating with the labeled binding reagents. In some embodiments, the incubating with the bivalent binding reagents is performed at the same time as the incubating with the labeled binding reagents. In some embodiments the specimen is washed after incubating with the bivalent binding reagents and before incubating with the labeled binding reagents. In some embodiments the labeled binding reagents are added after the bivalent binding reagents. In some embodiments the specimen is washed after incubation with the labeled binding reagents.
In some embodiments, the specimen is prepared to maximize penetration of bivalent binding reagents. In some embodiments, electrophoretic fields are used to uniformly stain the specimen. In some embodiments, stochastic electrotransport is used, wherein the directionality of the electric field is randomly changed over time to actively disperse molecules to uniformly stain thick gels.
Imaging Positions within the Specimen (Step f)
In one embodiment, for hyperspectral imaging in two dimensions, a hyperspectral epi-fluorescence/confocal microscope can be used. In some embodiments, the hyperspectral light-sheet microscope may use high-transmittance tunable filters that change the position of the bandwidth as a function of the angle of the filter. In some embodiments, the hyperspectral light-sheet microscope comprises a moving stage and laser strobing.
For three-dimensional samples, in one embodiment, a hyperspectral light-sheet microscope may be used. Non-limiting examples include that described by Jahr et al., N
In other embodiments, imaging may be achieved using standard fluorescence imaging, light sheet imaging, or flow cytometry. In some embodiments, step (f) is accomplished by non-optical sensing methods such as mass spectrometry. The method disclosed herein is not limited in any way to the particular method of detecting the labeled tags in the specimen, and the skilled artisan will be easily guided to the appropriate method by considering the number of labels to be measured, the time available for the assessment (e.g., acutely deciding a patient's course of therapy from a biopsy assessed using these methods), the thickness of the specimen, the available equipment where the method is carried out, and other considerations that are fully embraced herein.
In one embodiment, the imaging at positions throughout the specimen to detect the labeled binding reagents and extent of labeling is obtained sequentially or simultaneously. In some embodiments, the specimen is imaged for all dyes at each location in the specimen simultaneously. In some embodiments, the specimen is imaged for each dye sequentially at each location in the specimen. In some embodiments, the specimen is incubated with all of the bivalent binding reagents and the subsequent incubating with the labeled binding reagents may be simultaneous or sequential, with, in some embodiments, imaging after each sequential incubation with each labeled binding reagent or a subset of the labeled binding reagents. Thus, in some embodiments, the imaging is performed batchwise to detect one or more dyes each scan. In some embodiments, after such partial imaging of the sample, the one or more labeled binding reagents are washed out of the specimen before the next one or more labeled binding reagents are incubated then imaged. In some embodiments, the washing out comprises removing the labeled binding reagents. In other embodiments, the dyes of the labeled binding reagents are quenched or otherwise made to not interfere with subsequent imaging of the same or different dyes.
Correlating Staining with Cell Type (Step g)
To subsequently decode the captured image data as described above and assign cell types, the following steps are carried out:
If the recursive partitioning method in step c was used:
If the dPNMF method in step c was used:
Empirically, a test of this method, as shown below in Example 4, achieved a performance of dPNMF with 24 dimensions. In other words, the abundance of ˜9,000 markers (e.g., RNA types) was mapped into 24 aggregate measurements such that the information on the label in each of these measurements is preserved.
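Decoding in the reduced space can be illustrated with a K-nearest neighbors classifier, one of the classifiers named in this disclosure. The sketch below is a minimal, assumption-laden illustration rather than the decoding procedure actually used: the reference profiles, the cell type labels, and the function name `knn_classify` are hypothetical, and two dimensions stand in for the 24 used in practice.

```python
import math
from collections import Counter


def knn_classify(query, reference, labels, k=3):
    """Assign the majority label among the k reference cells nearest to the query.

    query and each reference entry are vectors of aggregate (per-basis) measurements.
    """
    ranked = sorted(
        (math.dist(query, ref), label) for ref, label in zip(reference, labels)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical reference cells in a two-dimensional readout space.
reference = [(7, 4), (6, 5), (8, 4), (4, 8), (3, 9), (5, 8)]
labels = ["type_1", "type_1", "type_1", "type_2", "type_2", "type_2"]

call_a = knn_classify((6, 4), reference, labels)
call_b = knn_classify((4, 7), reference, labels)
print(call_a, call_b)
```

The same function works unchanged for 24-dimensional vectors, since the distance is computed over however many aggregate measurements each cell has.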
In other embodiments, if a different method for establishing the relationship (encoding) was used to create the set of bivalent binding reagents and labeled binding reagents for the particular type of specimen, the data from the imaging of the specimen may similarly be decoded following that encoding process.
The data on specific cell types and their locations obtained in step (g) are provided as a map or other data format to identify cell type locations within the specimen.
In one embodiment, the molecular markers are nucleic acid polymers. In a preferred embodiment, the nucleic acid polymers are RNA. In an alternative embodiment, the molecular markers are protein, which may be any of a secreted protein, cell-surface protein, receptor, transcription factor, antibody, or a combination thereof. In other embodiments, the molecular markers are metabolites, lipids, carbohydrates including polysaccharides, glycolipids, vitamins, fatty acids, co-factors, pigments, metals, or any other biochemicals or compounds, organic or inorganic, found within a biological system, or any combination thereof. In some embodiments, the methods disclosed herein are applied to two or more types of markers (e.g., proteins and nucleic acids) using the appropriate reagents for each type of marker, which may be performed concurrently (i.e., incubation with bivalent binding reagents for the proteins and nucleic acids expressed by cells in the specimen; incubation with labeled binding reagents that bind to the respective protein or nucleic acid binding bivalent binding reagents).
Tags (i.e., the molecular-marker binding portion of the bivalent binding reagents) useful for their detection in accordance with the teaching herein include but are not limited to antibodies and antigen-binding fragments thereof, ligands, lectins, receptors, chelators, etc.
In one embodiment, steps (a), (b) and (c) are performed for a particular type of biological specimen, wherein the specimen comprises a plurality of known cell types (e.g., known from the literature). In step (a), the plurality of cell types to be located is selected from among the known cell types within the specimen, and for step (b), the known molecular markers of each cell type are obtained from the literature. Based upon the selection of molecular markers in step (b), step (c) may be carried out to identify the markers to be detected, and step (d) to design the reagents. Thus, steps (a)-(d) are carried out for each particular type of specimen and the cells therein that are of interest to locate. In one embodiment, the remainder of the steps are carried out on the specimen.
In one embodiment, steps (a), (b) and (c) are carried out in silico.
In another embodiment, step (b) may further comprise organizing the specific cell types into a hierarchical taxonomy according to the plurality of known molecular markers. In some embodiments, the number of different detectable labels of step (d) may equal the number of hierarchical levels of the hierarchical taxonomy.
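As an illustration of the relationship between taxonomy levels and label count, the sketch below computes the number of hierarchical levels of a small taxonomy represented as nested dictionaries. The taxonomy shown and the function name `taxonomy_depth` are hypothetical and chosen only to make the counting concrete.

```python
def taxonomy_depth(node):
    """Number of hierarchical levels beneath this node; an empty dict is a leaf."""
    if not node:
        return 0
    return 1 + max(taxonomy_depth(child) for child in node.values())


# Hypothetical two-level taxonomy: broad classes, then subtypes.
taxonomy = {
    "neuron": {"excitatory": {}, "inhibitory": {}},
    "glia": {"astrocyte": {}, "oligodendrocyte": {}},
}

n_labels = taxonomy_depth(taxonomy)  # one detectable label per hierarchical level
print(n_labels)
```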
In yet another embodiment, step (c) is accomplished using a dimensionality reduction process. In still another embodiment, step (c) is accomplished using recursive partitioning. In another embodiment, step (c) is accomplished using machine learning to design the encoding. In an alternative embodiment, step (c) is accomplished using discriminant projection non-negative matrix factorization (dPNMF). In an aspect, dPNMF comprises the steps of (i) fitting a dPNMF model to training data; (ii) fitting a classifier to one class per cell type; and (iii) creating a staining profile for each cell type according to a weighting, whereby the number of cell labels per molecular marker approximates weighting. In an aspect, the classifier of step (ii) is a Naïve Bayesian classifier. In another aspect, the classifier of step (ii) is KNN.
In one embodiment, step (d) is accomplished using direction from step (c) as to the preparation of the set of labeled binding reagents. In one embodiment, the labels are dyes that are individually detectable in a single location within the specimen using hyperspectral imaging.
In one embodiment, step (f) is accomplished by hyperspectral scanning of the specimen. In one embodiment, hyperspectral epifluorescence/confocal microscopy is used. In one embodiment, hyperspectral light-sheet microscopy is used. In other embodiments, step (f) is accomplished by standard fluorescence imaging, light sheet imaging, or flow cytometry. In some embodiments, step (f) is accomplished by non-optical methods such as mass spectrometry.
In some embodiments, the specimen is incubated with all of the bivalent binding reagents, but the incubating with the labeled binding reagents may be simultaneous or sequential, with, in some embodiments, imaging after each sequential incubation with each labeled binding reagent.
In one embodiment, the data obtained from step (f) is converted to locations of particular cell types within the specimen using the correlating of step (g), which is based upon the relationships established in step (c). In one embodiment, step (c) is an encoding step, and step (g) is a decoding step.
In one embodiment, the locations of the cell types within the specimen in step (h) are used diagnostically to identify, for example, a disease state or the potential for a disease state to develop, based upon the locations of particular cell types within the specimen.
These and other aspects of the embodiments described herein will be provided in the ensuing descriptions of the drawings and detailed description of the embodiments.
The following numbered embodiments are embraced by the disclosure herein.
The following examples are put forth so as to provide persons having ordinary skill in the art with a complete disclosure and description of how to make and use the subject disclosure, and are not intended to limit the scope of the disclosure. Efforts have been made to ensure accuracy with respect to the numbers used (e.g., quantities, amounts, temperature, concentrations, etc.) but some experimental errors and deviations should be allowed for. Unless otherwise indicated, parts are parts by weight, molecular weight is average molecular weight, temperature is in degrees Celsius, and pressure is at or near atmospheric pressure. Overall, encoding and decoding (e.g., creating the bivalent binding reagents and labeled binding reagents, and analyzing the images created therefrom) is a statistical learning task, and like many statistical learning tasks there are several variants one could implement. This disclosure is not limited to any particular encoding and decoding method or algorithm. In each example a reference dataset is used with many exemplary cells having the following properties: (1) abundance of markers of interest (e.g., protein or RNA); and (2) cell type label. This is conceptualized as a matrix wherein the first n columns are abundance values and the (n+1)th column is a category, i.e., the cell type of the specific cell.
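The reference matrix layout described above, with n abundance columns followed by one cell type column, can be sketched as follows. The two rows and the labels are hypothetical placeholders, and the helper name `split_reference` is an assumption introduced for illustration.

```python
def split_reference(matrix):
    """Split rows of [abundance_1, ..., abundance_n, cell_type] into features and labels."""
    abundances = [row[:-1] for row in matrix]
    cell_types = [row[-1] for row in matrix]
    return abundances, cell_types


# Hypothetical reference dataset: n = 3 markers, one row per exemplary cell.
reference = [
    [2, 1, 1, "type_1"],
    [1, 1, 3, "type_2"],
]

abundances, cell_types = split_reference(reference)
print(abundances, cell_types)
```

The abundance columns are the input to the encoding step, and the final column supplies the supervised labels.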
Preliminary data from mouse whole coronal thin sections show that dredFISH measurements contain detailed information on cells' transcriptional states. These data, combined with the lower magnification optics and the very bright signal generated from summing expression of thousands of genes, make dredFISH an ideal approach for cell-type mapping of large volumes. This example shows the feasibility of directly measuring multiple weighted sums of gene expression, which form an informative, approximate low-dimensional representation of gene expression, without measuring individual gene expression levels.
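The notion of a weighted sum of gene expression can be illustrated with a short sketch. It assumes a non-negative weight matrix with one column per aggregate measurement. The function name `weighted_sums`, the four-gene cell, and the weights are hypothetical toy values chosen only to show the arithmetic:

```python
def weighted_sums(expression, weights):
    """Project one cell's expression vector onto each weight column.

    expression: per-gene abundances, length n_genes
    weights:    n_genes x n_dims nested list of non-negative weights
    Returns n_dims aggregate values, one per measurement.
    """
    if len(expression) != len(weights):
        raise ValueError("expression and weights disagree on the number of genes")
    n_dims = len(weights[0])
    return [
        sum(expression[g] * weights[g][d] for g in range(len(expression)))
        for d in range(n_dims)
    ]


# Toy cell with four genes, projected onto two aggregate measurements.
cell = [5.0, 0.0, 2.0, 1.0]
W = [
    [1.0, 0.0],
    [0.0, 2.0],
    [0.5, 0.0],
    [0.0, 3.0],
]
signal = weighted_sums(cell, W)
```

Here the first measurement is 5.0 × 1.0 + 2.0 × 0.5 = 6.0 and the second is 1.0 × 3.0 = 3.0, so two numbers summarize four gene abundances without any single gene being read out on its own.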
To validate these premises, we first designed the projection matrix of gene expression to lower dimensional space (DPNMF basis,
We next stained and imaged a mouse coronal section (
We further validated that the observed spatial patterns provide enough information for integrating dredFISH and scRNAseq (
Thus, the identified clusters spatially match the anatomy of the mouse brain including the six layers in the cortex, different components of the hippocampus, hypothalamic nuclei, and thalamic nuclei. To further dissect the information content in dredFISH dimensions we chose a cluster that represented neurons in the hippocampus and repeated the unsupervised clustering only for these cells. Spatial mapping of the identified subclusters within hippocampus neurons matches the known spatial position of CA1, CA3, and DG neurons. The ability of a standard unsupervised clustering algorithm to identify clusters that match known neuronal types validates the methods disclosed herein for directly measuring an abstract low-dimensional representation of gene expression without measuring individual gene expression levels.
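The two-stage analysis described above (cluster all cells, then re-cluster only the cells of one cluster) can be sketched as follows. The clustering algorithm used in the example is not specified here, so plain k-means with deterministic farthest-point seeding is swapped in as an assumption. The two-dimensional points are toy values, not dredFISH measurements:

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=20):
    """Lloyd's k-means with deterministic farthest-point seeding."""
    centers = [points[0]]
    while len(centers) < k:
        # Next seed: the point farthest from its nearest existing seed.
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    assign = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        assign = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign


# Toy low-dimensional measurements: one compact group and one broader
# group that itself contains two subgroups.
points = [
    [0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
    [10.0, 10.0], [10.2, 10.1], [10.1, 9.9],
    [12.0, 10.0], [12.1, 10.2], [11.9, 9.8],
]

top_level = kmeans(points, 2)

# Repeat the clustering only for the cells assigned to cluster 1.
subset = [p for p, a in zip(points, top_level) if a == 1]
sub_level = kmeans(subset, 2)
```

In this toy case the first pass separates the compact group from the broader one, and the second pass splits the broader group into its two subgroups, mirroring the cluster-then-subcluster procedure.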
We further analyzed the information contained in dredFISH measurements for data analysis tasks common in spatial transcriptomics analysis: unsupervised classification into types, cell-cell interactions, regions identification, and gene expression reconstruction (
Collectively, the data presented here demonstrate how the dredFISH principle works in practice and show overall high cell type classification accuracy.
All cell types are organized into a binary classification tree. The tree can be learned directly from observations of the extent of expression, or lack thereof, of a plurality of known molecular markers in each specific cell type, using hierarchical clustering or using prior knowledge.
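One way such a binary tree might be learned by hierarchical clustering is sketched below. It assumes that each cell type is summarized by a mean marker-abundance vector and that the closest pair of centroids is merged at each step. The function names, the nested-tuple tree representation, and the toy profiles are all hypothetical:

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_binary_tree(profiles):
    """Agglomerative merging of cell types into a nested-tuple binary tree.

    profiles: dict mapping cell-type name -> mean marker-abundance vector.
    Leaves are names; internal nodes are (left, right) tuples.
    """
    # Each cluster is (subtree, centroid, number of leaves).
    clusters = [(name, list(vec), 1) for name, vec in sorted(profiles.items())]
    while len(clusters) > 1:
        # Find the pair of clusters whose centroids are closest.
        i, j = min(
            (
                (a, b)
                for a in range(len(clusters))
                for b in range(a + 1, len(clusters))
            ),
            key=lambda ab: sq_dist(clusters[ab[0]][1], clusters[ab[1]][1]),
        )
        (t1, c1, n1), (t2, c2, n2) = clusters[i], clusters[j]
        # Merged centroid is the leaf-weighted mean of the two centroids.
        centroid = [(x * n1 + y * n2) / (n1 + n2) for x, y in zip(c1, c2)]
        merged = ((t1, t2), centroid, n1 + n2)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


def depth(tree):
    """Number of binary decisions on the longest root-to-leaf path."""
    if isinstance(tree, str):
        return 0
    return 1 + max(depth(tree[0]), depth(tree[1]))


# Toy mean profiles for four hypothetical cell types over two markers.
profiles = {
    "type_a": [10.0, 0.0],
    "type_b": [9.0, 1.0],
    "type_c": [0.0, 10.0],
    "type_d": [1.0, 9.0],
}
tree = build_binary_tree(profiles)
```

A tree built from prior knowledge could be written directly in the same nested-tuple form, bypassing the clustering step.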
Provided with this cell-type tree (i.e., cell taxonomy), the following operations are performed:
Provided with these encoded machine-learned data, we will create a staining protocol wherein the number of unique fluorophores equals the depth of the taxonomic tree, so that each cell yields a vector of values, one per fluorophore. To later decode that vector and assign cell types, we do the following:
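The relationship between tree depth and the length of the measured vector can be illustrated with a hedged sketch. It assumes a nested-tuple tree, one readout value per tree level, and a fixed threshold of 0.5 for choosing the right-hand branch. The tree, the names `TREE`, `encode_paths`, and `decode`, and the threshold are hypothetical illustrations and do not represent the specific operations of this example:

```python
# Hypothetical taxonomy: leaves are cell-type names, internal nodes are
# (left, right) tuples. Its depth is 3, so three readout values suffice.
TREE = (("type_a", "type_b"), ("type_c", ("type_d", "type_e")))


def tree_depth(tree):
    """Number of binary decisions on the longest root-to-leaf path."""
    if isinstance(tree, str):
        return 0
    return 1 + max(tree_depth(tree[0]), tree_depth(tree[1]))


def encode_paths(tree, prefix=()):
    """Map every leaf to its tuple of left (0) / right (1) choices."""
    if isinstance(tree, str):
        return {tree: prefix}
    codes = {}
    codes.update(encode_paths(tree[0], prefix + (0,)))
    codes.update(encode_paths(tree[1], prefix + (1,)))
    return codes


def decode(tree, readout, threshold=0.5):
    """Walk the tree; readout[level] above the threshold means 'go right'."""
    node, level = tree, 0
    while not isinstance(node, str):
        node = node[1] if readout[level] > threshold else node[0]
        level += 1
    return node


codes = encode_paths(TREE)
cell_type_1 = decode(TREE, [0.9, 0.8, 0.1])
cell_type_2 = decode(TREE, [0.1, 0.9, 0.7])
```

Leaves that sit above the deepest level, such as `type_b` here, are reached before the vector is exhausted, and the remaining values are simply ignored in this sketch.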
Discernment projection non-negative matrix factorization (dPNMF), such as the method described in Guan et al., PLoS O
The encoding step, which establishes the relationship between molecular markers and cell types, is achieved by dPNMF as follows:
The number of fluorophores, i.e., the dimensionality of the learned weight matrix in the dPNMF, is user-defined.
Empirically, a test of this method achieved a performance of dPNMF with 24 dimensions. In other words, we mapped the abundance of ~9,000 markers (e.g., RNA types) into 24 aggregate measurements such that the information on the label in each of these measurements is preserved.
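Whether label information survives a projection of this kind can be checked in a simple way, sketched below. Nearest-centroid classification is used purely as an assumed stand-in for whatever classifier is applied in practice. The four-marker cells, the two-column weight matrix, and all function names are toy assumptions on a much smaller scale than the ~9,000-to-24 mapping described above:

```python
def project(x, W):
    """Weighted sums of one abundance vector over each column of W."""
    return [sum(x[g] * W[g][d] for g in range(len(x))) for d in range(len(W[0]))]


def centroids(Z, labels):
    """Mean projected vector for each label."""
    groups = {}
    for z, lab in zip(Z, labels):
        groups.setdefault(lab, []).append(z)
    return {
        lab: [sum(col) / len(rows) for col in zip(*rows)]
        for lab, rows in groups.items()
    }


def nearest_centroid_accuracy(Z, labels):
    """Fraction of cells whose nearest centroid carries their own label."""
    cents = centroids(Z, labels)

    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    hits = 0
    for z, lab in zip(Z, labels):
        predicted = min(cents, key=lambda name: sq_dist(z, cents[name]))
        hits += predicted == lab
    return hits / len(labels)


# Toy data: four markers per cell, two aggregate measurements.
cells = [[5.0, 4.0, 0.0, 1.0], [6.0, 5.0, 1.0, 0.0],
         [0.0, 1.0, 5.0, 6.0], [1.0, 0.0, 4.0, 5.0]]
labels = ["type_1", "type_1", "type_2", "type_2"]
W = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

Z = [project(x, W) for x in cells]
accuracy = nearest_centroid_accuracy(Z, labels)
```

The accuracy computed here is measured on the same toy cells used to form the centroids, so it only illustrates the bookkeeping. A meaningful evaluation would score held-out cells.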
As shown in the preceding examples, DPNMF was found to provide reliable encoding. DPNMF provides an excellent projection, as can be gleaned from the data shown above. However, other neural network-based optimization schemes provide alternate methods for achieving the objectives of the methods described herein. In one embodiment, the lack of uniformity across rounds using DPNMF may reduce the overall information contained in the encoding scheme. For example, there are more than 1000-fold differences between the dimmest and the brightest basis (
The DPNMF projection is replaced with a neural network-based optimization (model shown in
Among the examples described herein above, examples from among the more than 92,000 oligos targeting more than 9,000 genes are provided below, used in the methods disclosed herein for preparing the images shown in
Table 1 lists 24 representative bivalent binding reagents (and Table 2, labeled binding reagents that bind them) used in the examples herein. The following oligonucleotides were prepared using PCR to add a T7 promoter sequence, converting the ssDNA to dsDNA. In vitro transcription was then performed to convert the dsDNA to ssRNA, and reverse transcription was then used to convert the ssRNA to ssDNA, each step amplifying the total number of molecules.
The following oligonucleotides were prepared following the same methods described above, and conjugated to Cy5 dye. The 24 labeled binding reagents were used to detect the more than 92,000 bivalent binding reagents used in the examples described herein. Some examples of the bivalent binding reagents, their target molecular markers, molecular marker binding region sequence and labeled-binding-reagent binding region sequences are set forth in Table 1.
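The way a library of bivalent binding reagents can realize a set of weights may be sketched as follows. The sketch assumes that each bound bivalent reagent contributes one unit of signal to the readout whose labeled binding reagent it recruits, so that the number of reagents per (marker, readout) pair acts as an integer weight. The reagent list, marker names, copy numbers, and function names are hypothetical and are not drawn from Table 1 or Table 2:

```python
from collections import Counter

# Hypothetical library: each entry is (target marker, index of the
# labeled binding reagent that the bivalent reagent recruits).
reagents = [
    ("gene_x", 0), ("gene_x", 0), ("gene_x", 1),
    ("gene_y", 1), ("gene_y", 1), ("gene_y", 1),
    ("gene_z", 0),
]


def count_weights(reagent_list):
    """Integer weight per (marker, readout): how many reagents share that pair."""
    return Counter(reagent_list)


def expected_signal(copies, weights, n_readouts):
    """Idealized signal per readout: marker copies times reagents per copy."""
    signal = [0] * n_readouts
    for (marker, readout), n_probes in weights.items():
        signal[readout] += copies.get(marker, 0) * n_probes
    return signal


weights = count_weights(reagents)

# Toy marker copy numbers in one cell.
copies = {"gene_x": 10, "gene_y": 2, "gene_z": 5}
signal = expected_signal(copies, weights, n_readouts=2)
```

In this toy case readout 0 sums 2 × 10 + 1 × 5 = 25 units and readout 1 sums 1 × 10 + 3 × 2 = 16 units. Binding efficiency, background, and imaging noise are ignored.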
This application claims priority to U.S. provisional patent application Ser. No. 63/226,660, filed Jul. 28, 2021, the entire contents of which are incorporated herein by reference.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US22/74201 | 7/27/2022 | WO |
Number | Date | Country | |
---|---|---|---|
63226660 | Jul 2021 | US |