In the public domain, there are numerous references and well-curated databases that associate various biomolecules (e.g., proteins) as biomarkers with many different diseases, disorders, and biological states. Advances in bioinformatics analyses offer new insights into the complex biomolecule changes that take place across the spectrum of health and disease. Yet the sensitivity and specificity of the identified biomarkers have not been adequate for, for example, early detection of cancers. This may be attributed to the following: 1) a single biomarker often does not generate sufficient signal to associate with a specific disease state, and 2) even when multiple biomarkers have been identified, classification using these biomarkers is often impaired by noise from fluctuations in the levels of these biomarkers and from highly abundant serum/plasma proteins such as albumin. Several attempts have been made to enhance detection of biomarkers present at low levels, such as depletion of highly abundant proteins, isobaric labeling at the peptide level for multiplexed relative quantification, post-depletion plasma fractionation strategies, biomarker harvesting techniques, mathematical approaches for analyzing high-quality data sets, and multiplexed workflows. Despite all these efforts to improve the sensitivity and specificity of biomarkers, complex biomolecule sampling has not been robustly successful in early detection of a disease state, e.g., cancer.
All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference.
Provided herein is a computer-implemented method for detecting one or more biomarkers in a multi-omic data set, comprising: (a) providing multi-omic data generated from one or more complex biological samples obtained from one or more individuals, each individual having one or more specified biological states; (b) applying a trained model to the multi-omic data to generate one or more classification model weights, wi . . . wn, for one or more features, fi . . . fn, yielding (wi, fi), . . . , (wn, fn) and storing (wi, fi), . . . , (wn, fn); (c) querying a reference data set for the one or more features, fi . . . fn, to generate a set of scores, si . . . sn, yielding (si, fi), . . . , (sn, fn) and storing (si, fi), . . . , (sn, fn); and (d) combining at least (wi, fi), . . . , (wn, fn) and (si, fi), . . . , (sn, fn) to generate (wi, si), . . . , (wn, sn) and selecting a subset of (wi, si), . . . , (wn, sn) to detect one or more biomarkers. In some embodiments, selecting the subset in (d) comprises filtering (wi, si), . . . , (wn, sn) such that w at least meets a first threshold and s at least meets a second threshold, such that the one or more biomarkers comprise a subset (wk, sk), . . . , (wm, sm) of (wi, si), . . . , (wn, sn). In some embodiments, k≥i. In some embodiments, m≤n. In some embodiments, the trained model is trained using a set of labeled multi-omic data of a plurality of complex biological samples, wherein the labeled multi-omic data set comprises the one or more features fi . . . fn corresponding to one or more specified biological states, bi . . . bn, wherein the one or more features are proteins.
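Steps (b) through (d) can be sketched as a join on the shared feature followed by a two-threshold filter. The following is a minimal illustration only, not the disclosed implementation: the function name `select_biomarkers`, the placeholder feature names, and all numeric values are hypothetical, and the sketch assumes that features missing from the reference data set are simply skipped.

```python
from typing import Dict, List, Tuple


def select_biomarkers(
    weights: Dict[str, float],
    scores: Dict[str, float],
    w_threshold: float,
    s_threshold: float,
) -> List[Tuple[str, float, float]]:
    """Join (w, f) and (s, f) on the shared feature f, then threshold both values."""
    selected = []
    for feature, w in weights.items():
        if feature not in scores:
            # Assumption: features absent from the reference data set are skipped.
            continue
        s = scores[feature]
        if w >= w_threshold and s >= s_threshold:
            selected.append((feature, w, s))
    # Report the highest-weighted features first.
    return sorted(selected, key=lambda item: item[1], reverse=True)


# Hypothetical model weights and association scores, for illustration only.
weights = {"protein_1": 0.82, "protein_2": 0.10, "protein_3": 0.65, "protein_4": 0.90}
scores = {"protein_1": 0.9, "protein_2": 0.8, "protein_3": 0.7, "protein_5": 0.5}

biomarkers = select_biomarkers(weights, scores, w_threshold=0.5, s_threshold=0.6)
print(biomarkers)
```

Here `protein_2` is dropped because its weight misses the first threshold, and `protein_4` is dropped because no reference score exists for it.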
In some embodiments, the computer-implemented method for detecting one or more biomarkers in a multi-omic data set provided herein further comprises obtaining the one or more complex biological samples from the one or more individuals. In some embodiments, the computer-implemented method for detecting one or more biomarkers in a multi-omic data set provided herein further comprises generating an output. In some embodiments, the output corresponds to a specified biological state of the one or more specified biological states. In some embodiments, the reference data set is a database comprising features related to specified biological states by an association score. In some embodiments, the set of scores, si . . . sn, are association scores between the one or more features and the one or more specified biological states. In some embodiments, the one or more complex biological samples are selected from the group consisting of plasma, serum, whole blood, amniotic fluid, cerebrospinal fluid, urine, saliva, tears, and feces.
In some embodiments, the multi-omic data comprises one or more selected from the group consisting of: proteomic data, genomic data, lipidomic data, glycomic data, transcriptomic data, and metabolomic data. In some embodiments, the multi-omic data comprises proteomic data. In some embodiments, the proteomic data comprises (i) protein identifiers and (ii) specified biological states for the one or more individuals. In some embodiments, the multi-omic data is generated by assaying a complex biological sample of an individual of the one or more individuals. In some embodiments, the one or more features represent different proteins. In some embodiments, the one or more complex biological samples are not subjected to protein depletion. In some embodiments, the one or more complex biological samples are subjected to prior protein depletion. In some embodiments, the one or more specified biological states are bi . . . bn.
Provided herein is a method of proteome sampling, the method comprising: generating data from a first plasma proteome from a first complex biological sample and a second plasma proteome from a second complex biological sample, wherein the first complex biological sample is from a test subject with a specified biological state and the second complex biological sample is from a reference subject without the specified biological state; and building a trained classification model by extracting a plurality of features comprising a first feature of the first plasma proteome and a second feature of the second plasma proteome, wherein the trained classification model of the first feature and the second feature identifies one or more biomarkers linked to the specified biological state.
In some embodiments, the first plasma proteome differs from the second plasma proteome. In some embodiments, the first complex biological sample and/or the second complex biological sample are not subjected to prior protein depletion. In some embodiments, the first complex biological sample and/or the second complex biological sample are subjected to prior protein depletion. In some embodiments, the method of proteome sampling as provided herein further comprises subjecting the first complex biological sample and/or second complex biological sample to protein depletion prior to generating data. In some embodiments, the first plasma proteome and the second plasma proteome are generated after albumin depletion.
Provided herein is a method of complex biomolecule sampling, the method comprising: generating data from a first biomolecule corona from a first complex biological sample and a second biomolecule corona from a second complex biological sample, wherein the first complex biological sample is from a test subject with a specified biological state and the second complex biological sample is from a reference subject without the specified biological state; and building a trained classification model by extracting a plurality of features comprising a first feature of the first biomolecule corona and a second feature of the second biomolecule corona, wherein the trained classification model of the first feature and the second feature identifies one or more biomarkers linked to the specified biological state.
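As one way to picture "building a trained classification model" from features of a test-subject corona and a reference-subject corona, the sketch below fits a logistic regression by batch gradient descent in plain Python. The choice of logistic regression, the function names, and the feature values (assumed here to be centered abundance measurements) are illustrative assumptions and are not taken from the disclosure.

```python
import math
from typing import List, Sequence, Tuple


def train_logistic_regression(
    samples: Sequence[Sequence[float]],
    labels: Sequence[int],
    learning_rate: float = 0.5,
    epochs: int = 500,
) -> Tuple[List[float], float]:
    """Fit a logistic regression by batch gradient descent; returns (weights, bias)."""
    n_samples, n_features = len(samples), len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            z = bias + sum(w * v for w, v in zip(weights, x))
            error = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> int:
    """Return 1 (specified biological state) when the decision value is positive."""
    z = bias + sum(w * v for w, v in zip(weights, x))
    return 1 if z > 0 else 0


# Hypothetical centered abundances for two features per corona.
# Label 1 = test subject with the specified state, 0 = reference subject without it.
samples = [[1.0, 0.1], [0.8, -0.1], [1.2, 0.0], [-1.0, 0.0], [-0.9, 0.1], [-1.1, -0.1]]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic_regression(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
print(weights, bias, predictions)
```

In this toy data only the first feature differs between the two groups, so it receives the larger model weight; a weight of that kind is what a downstream step could treat as a candidate biomarker signal.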
In some embodiments, associations between particles and captured biomarkers can be organized in a relational database. The relational database can be used to design particles to capture specific biomarkers.
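A relational layout for particle-to-biomarker associations could look like the following sketch, which uses an in-memory SQLite database. The table names, column names, and the rows inserted are hypothetical placeholders chosen for illustration; the disclosure does not specify a schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE particle (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE,
        surface TEXT NOT NULL
    );
    CREATE TABLE biomarker (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    -- One row per observed (particle, captured biomarker) association.
    CREATE TABLE capture (
        particle_id INTEGER NOT NULL REFERENCES particle(id),
        biomarker_id INTEGER NOT NULL REFERENCES biomarker(id),
        PRIMARY KEY (particle_id, biomarker_id)
    );
    """
)

# Hypothetical rows, for illustration only.
conn.executemany(
    "INSERT INTO particle (id, name, surface) VALUES (?, ?, ?)",
    [(1, "particle_A", "silica"), (2, "particle_B", "polystyrene")],
)
conn.executemany(
    "INSERT INTO biomarker (id, name) VALUES (?, ?)",
    [(1, "marker_X"), (2, "marker_Y")],
)
conn.executemany(
    "INSERT INTO capture (particle_id, biomarker_id) VALUES (?, ?)",
    [(1, 1), (2, 1), (2, 2)],
)


def particles_capturing(connection: sqlite3.Connection, biomarker_name: str) -> list:
    """List the particles recorded as capturing the named biomarker."""
    rows = connection.execute(
        """
        SELECT p.name
        FROM particle AS p
        JOIN capture AS c ON c.particle_id = p.id
        JOIN biomarker AS b ON b.id = c.biomarker_id
        WHERE b.name = ?
        ORDER BY p.name
        """,
        (biomarker_name,),
    ).fetchall()
    return [row[0] for row in rows]


print(particles_capturing(conn, "marker_X"))
```

A query of this shape is how such a database could suggest which particle types to use, or to vary, when a specific biomarker is the capture target.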
In some embodiments, the biomolecule is selected from the group consisting of proteins, polypeptides, amino acids, sugars, carbohydrates, lipids, fatty acids, steroids, hormones, antibodies, metabolites, and polynucleotides. In some embodiments, the one or more biomarkers are present in a low or previously non-recorded concentration in the first complex biological sample. In some embodiments, the low or previously non-recorded concentration is less than 0.001 μg/ml or non-reported or non-detected in public databases. In some embodiments, the one or more biomarkers are detected with a sensitivity of 70% or more. In some embodiments, the one or more biomarkers are detected with a sensitivity of 90% or more. In some embodiments, the one or more biomarkers are detected with a sensitivity of at least 95%. In some embodiments, the one or more biomarkers are detected with a specificity of 70% or more. In some embodiments, the one or more biomarkers are detected with a specificity of 90% or more. In some embodiments, the one or more biomarkers are detected with a specificity of at least 95%. In some embodiments, the one or more biomarkers are selected from the group consisting of CPN1, FCN3, SAA4, IGHG1, IGHG3, CFHR5, C4B, IGLL5, APOD, SERPINA10, CPN2, FGL1, AHSG, ITIH2, HIST1H4D, C4A, CP, CD5L, CNN2, HRNR, GPLD1, IGKC, MASP2, ITIH1, CFHR1, COLEC10, BIN2, SAA2, ANGPTL6, CFB, TPI1, IGHA2, APOC2, EMILIN1, SBSN, PRG4, PPIF, CFHR2, ORM1, AMY1C, NEXN, CALML5, SERPINA7, IGHM, TUFM, APCS, SLC2A3, TMSB4X, CPQ, and SNCA. In some embodiments, the one or more biomarkers are selected from any one of the proteins in Table 1.
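For reference, sensitivity is the fraction of subjects with the specified biological state that are correctly called positive, and specificity is the fraction of subjects without it that are correctly called negative. The sketch below computes both from true and predicted labels; the function name and the example labels are hypothetical.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels where 1 = state present."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative subject")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels: 4 subjects with the state, 6 without.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

sensitivity, specificity = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```

In this example three of four positives are detected (sensitivity 75%) and five of six negatives are correctly excluded (specificity about 83%).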
In some embodiments, the test subject is a plurality of test subjects, and wherein each of the plurality of test subjects has the specified biological state. In some embodiments, the specified biological state is a disease state, a poor clinical outcome, a good clinical outcome, a high risk of disease, a low risk of disease, a complete response to a treatment, a partial response to a treatment, a stable disease state, or a non-response to a treatment. In some embodiments, the test subject is asymptomatic for the disease state. In some embodiments, the disease state is cancer, cardiovascular disease, endocrine disease, inflammatory disease, or a neurological disease. In some embodiments, the disease state is cancer and the cancer is selected from the group consisting of lung cancer, pancreas cancer, blood cancer, breast cancer, bladder cancer, ovarian cancer, thyroid cancer, brain cancer, prostate cancer, gynecological cancer, adenocarcinoma, sarcoma, neuroendocrine cancer, and gastric cancer.
In some embodiments, the proteome sampling or the complex biomolecule sampling as provided herein comprises a corona on a plurality of particles, wherein at least one particle of the plurality is a nanoparticle. In some embodiments, at least one of the plurality of particles is selected from the group consisting of a polymeric particle, a metal oxide particle, a plasmonic particle, a biomolecule particle, a magnetite particle, a maghemite particle, a micelle, a liposome, an iron oxide particle, a graphene, a silica, a protein-based particle, a DNA-based particle, a DNA-aptamer based particle, an RNA-based particle, an RNA-aptamer based particle, a polystyrene particle, a silver particle, a gold particle, a quantum dot, a palladium particle, a platinum particle, a titanium particle, a superparamagnetic particle, and any combination thereof. In some embodiments, the plurality of particles are iron oxide nanoparticles with RNA on the surface. In some embodiments, the plurality of particles are iron oxide/polystyrene nanoparticles with RNA on the surface. In some embodiments, the plurality of particles are polystyrene nanoparticles with RNA on the surface. In some embodiments, the plurality of particles are gold nanoparticles with RNA on the surface. In some embodiments, particles can be made of any combination of any material disclosed herein. In some embodiments, the at least one of the plurality of particles is a liposome, and the liposome comprises at least one of a cationic lipid, an anionic lipid, a neutral lipid, or any combination thereof.
In some embodiments, the liposome comprises the cationic lipid, and the cationic lipid is selected from the group consisting of: N,N-dioleyl-N,N-dimethylammonium chloride (DODAC); N-(2,3-dioleyloxy)propyl)-N,N,N-trimethylammonium chloride (DOTMA); N,N-distearyl-N,N-dimethylammonium bromide (DDAB); N-(2,3-dioleoyloxy)propyl)-N,N,N-trimethylammonium chloride (DOTAP); 3-(N-(N′,N′-dimethylaminoethane)-carbamoyl)cholesterol (DC-Chol); N-(1-(2,3-dioleoyloxy)propyl)-N-2-(sperminecarboxamido)ethyl)-N,N-dimethylammonium trifluoroacetate (DOSPA); dioctadecylamidoglycyl carboxyspermine (DOGS); 1,2-dioleoyl-3-dimethylammonium propane (DODAP); N,N-dimethyl-2,3-dioleoyloxy)propylamine (DODMA); N-(1,2-dimyristyloxyprop-3-yl)-N,N-dimethyl-N-hydroxyethyl ammonium bromide (DMRIE); 1,2-dioleoyl-sn-3-phosphoethanolamine (DOPE); N-(1-(2,3-dioleyloxy)propyl)-N-(2-(sperminecarboxamido)ethyl)-N,N-dimethylammonium trifluoroacetate (DOSPA); dioctadecylamidoglycyl carboxyspermine (DOGS); 1,2-ditetradecanoyl-sn-glycero-3-phosphocholine (DMPC); 1,2-dilinoleyloxy-N,N-dimethylaminopropane (DLinDMA); 1,2-dilinolenyloxy-N,N-dimethylaminopropane (DLenDMA); and any combination thereof.
In some embodiments, the liposome comprises the neutral lipid, and the neutral lipid is selected from the group consisting of diacylphosphatidylcholines, diacylphosphatidylethanolamines, ceramides, sphingomyelins, dihydrosphingomyelins, cephalins, and cerebrosides. In some embodiments, the liposome comprises the neutral lipid, and the neutral lipid is selected from the group consisting of: distearoylphosphatidylcholine (DSPC); dioleoylphosphatidylcholine (DOPC); dipalmitoylphosphatidylcholine (DPPC); dioleoylphosphatidylglycerol (DOPG); dipalmitoylphosphatidylglycerol (DPPG); dioleoyl-phosphatidylethanolamine (DOPE); palmitoyloleoylphosphatidylcholine (POPC); palmitoyloleoyl-phosphatidylethanolamine (POPE); dioleoyl-phosphatidylethanolamine 4-(N-maleimidomethyl)-cyclohexane-1-carboxylate (DOPE-mal); dipalmitoyl phosphatidyl ethanolamine (DPPE); dimyristoylphosphoethanolamine (DMPE); distearoyl-phosphatidylethanolamine (DSPE); 1-stearoyl-2-oleoyl-phosphatidylethanolamine (SOPE); 1,2-dielaidoyl-sn-glycero-3-phosphoethanolamine (transDOPE); and 1,2-distearoyl-sn-glycero-3-phosphocholine (DSPC).
In some embodiments, the liposome comprises the anionic lipid, and the anionic lipid is selected from the group consisting of phosphatidylglycerol, cardiolipin, diacylphosphatidylserine, diacylphosphatidic acid, N-dodecanoylphosphatidylethanolamines, N-succinylphosphatidylethanolamines, N-glutarylphosphatidylethanolamines, lysylphosphatidylglycerols, palmitoyloleoylphosphatidylglycerol (POPG), and other anionic modifying groups joined to neutral lipids. In some embodiments, the liposome is selected from the group consisting of DOPG (1,2-dioleoyl-sn-glycero-3-phospho-(1′-rac-glycerol)), DOTAP (1,2-dioleoyl-3-trimethylammonium propane), DOPE (dioleoylphosphatidylethanolamine), CHOL (DOPC-cholesterol), and any combination thereof. In some aspects, the at least one particle of the plurality of particles is a nanoparticle. In some aspects, the plurality of particles comprises one or more nanoparticles. In some aspects, the plurality of particles is a plurality of nanoparticles.
Provided herein is a computer-implemented system for complex biomolecule sampling, the computer-implemented system comprising: a first memory unit for receiving a plurality of biomolecule sampling data, wherein the plurality of biomolecule sampling data comprises first biomolecule sampling data from a first complex biological sample and second biomolecule sampling data from a second complex biological sample, wherein the first complex biological sample is from one or more subjects with a specified biological state and the second complex biological sample is from one or more subjects without the specified biological state; a second memory unit for querying a known biomolecule data aggregator, wherein the known biomolecule data aggregator comprises data pertaining to known biomolecules associated with the specified biological state; a first computer executable instruction for building a trained classification model by extracting a first feature of the first biomolecule sampling data and a second feature of the second biomolecule sampling data, wherein the trained classification model of the first feature and the second feature identifies one or more biomarkers linked to the specified biological state; a second computer executable instruction for processing the trained classification model against the known biomolecule data aggregator and assigning a classification weight to all biomolecules, wherein said processing and assigning identifies one or more biomarkers linked to the specified biological state, wherein the one or more biomarkers confirm the specified biological state, wherein the one or more biomarkers are present in a low or previously non-recorded concentration in the first biomolecule sampling data; a plurality of nodes connected to each other, each node comprising a computer server, including one or more processors for executing the first computer executable instruction and the second computer executable instruction; network connections to the plurality of nodes; and a communication bus between the computer server, the first memory unit, and the second memory unit.
In some embodiments, the computer-implemented system as described herein further comprises a third computer executable instruction for generating a report of the presence or absence of the specified biological state in a subject. In some embodiments, said report comprises a recommended treatment for disease management. In some embodiments, the computer-implemented system as described herein further comprises a user interface configured to communicate or display said report to a user. In some embodiments, the computer-implemented system as described herein further catalogs the surface-activity relationship between biomarkers and the particles that captured them to output a Corona Knowledge Map (CKM).
Provided herein is a system comprising a non-transitory computer readable storage medium encoded with a computer program, including instructions executable by a processor, to create an application applying machine learning to a plurality of sample data to rationally design a plurality of features of a particle comprising a corona. The system can comprise a software module applying a machine learning detection structure to the plurality of sample data, the detection structure employing machine learning to screen surface-activity relationships in the plurality of sample data to identify a feature and classify the feature. The feature can be a particle binding region of the biomarker. The system can comprise a software module automatically generating a report comprising the surface-activity relationships of a sample from which said sample data was derived. The report can identify a disease state. The system can comprise a software module automatically generating a report comprising the plurality of features of the particle comprising a corona. The machine learning detection structure can comprise any of the following separately, in series, or in combination: Partial Least Squares, Logistic Regression, Support Vector Classifier, Nearest Neighbor, Random Forest, Naïve Bayes, Ensemble Classifiers, a neural network, a deep network, convolutional neural network, a deep convolutional neural network, a cascaded deep convolutional neural network.
Provided herein is a method of determining an efficacy of a therapeutic treatment of a subject comprising (1) obtaining a first plasma sample from a subject before administration of a therapeutic treatment to treat a disease, wherein the first plasma sample comprises a first plurality of plasma particles, (2) obtaining a second plasma sample from the subject after the administration of the therapeutic treatment, wherein the second plasma sample comprises a second plurality of plasma particles, (3) isolating the plasma particles from both plasma samples, (4) enriching the biomarkers present in both samples, (5) assaying the enriched biomarkers to generate first biomarker data and second biomarker data, and (6) processing the data. Processing can comprise applying a trained classifier, wherein said trained classifier assigns a first set of model weights, wi . . . wn, for one or more features, fi . . . fn, yielding (wi, fi), . . . , (wn, fn) and storing (wi, fi), . . . , (wn, fn) to said first biomarker data to generate weighted first biomarker data, and said trained classifier assigns a second set of model weights, wi . . . wn, for one or more features, fi . . . fn, yielding (wi, fi), . . . , (wn, fn) and storing (wi, fi), . . . , (wn, fn) to said second biomarker data to generate weighted second biomarker data. The trained classifier can be trained using a set of labeled multi-omic data of a plurality of complex biological samples, wherein the labeled multi-omic data set comprises the one or more features fi . . . fn corresponding to one or more specified biological states, bi . . . bn, wherein the one or more features are proteins. The method can further comprise querying a reference data set for the one or more features of the weighted first biomarker data and the weighted second biomarker data, fi . . . fn, to generate a set of scores, si . . . sn, yielding (si, fi), . . . , (sn, fn) and storing (si, fi), . . . , (sn, fn). The reference data set can be a public database. The method can also comprise combining at least (wi, fi), . . . , (wn, fn) and (si, fi), . . . , (sn, fn) to generate (wi, si), . . . , (wn, sn) and selecting a subset of (wi, si), . . . , (wn, sn) to generate a first phenotype classification and a second phenotype classification. The method can also comprise determining the efficacy of said therapeutic treatment by comparing said first phenotype classification with said second phenotype classification. The first phenotype classification can be a disease state prior to treatment and the second phenotype classification can be a partial response to treatment. In some aspects, at least one particle of the first plurality of plasma particles is a nanoparticle. In some aspects, at least one particle of the second plurality of plasma particles is a nanoparticle. In some aspects, at least one particle of the first isolated plurality of plasma particles is a nanoparticle. In some aspects, at least one particle of the second isolated plurality of plasma particles is a nanoparticle.
Provided herein is a method of determining a concentration of a biomarker in a plasma sample comprising obtaining a reference data set comprising plasma samples with a known biomarker concentration, dispersing particles in each of the reference samples, isolating the particles from the samples, enriching the biomarkers captured by the particles, assaying the biomarkers to generate biomarker data, and incorporating the biomarker data within a trained classifier. The trained classifier can assign a concentration to the biomarker data based on the reference data set. The method can be applied to a plasma sample obtained from a subject such that the trained classifier can then be used to query the reference data set to output a biomarker concentration present in the plasma sample from the subject. The reference data set can comprise biomarker concentrations from 1 pg/mL to 100 pg/mL. The reference data set can comprise biomarker concentrations from 1 pg/mL to 100 μg/mL.
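The idea of reading a concentration off a reference data set can be illustrated with a simple calibration curve. Note that this sketch deliberately swaps the trained classifier described above for a plainer stand-in, log-log linear interpolation between reference points, and that the function name, the notion of a single scalar "signal" per sample, and every reference value are assumptions made only for illustration.

```python
import math
from bisect import bisect_left
from typing import List, Sequence, Tuple


def estimate_concentration(
    signal: float, reference: Sequence[Tuple[float, float]]
) -> float:
    """Interpolate a concentration (pg/mL) from (signal, concentration) reference pairs.

    Interpolation is linear in log10(signal) versus log10(concentration).
    All signals and concentrations must be positive.
    """
    ref: List[Tuple[float, float]] = sorted(reference)
    signals = [s for s, _ in ref]
    if not signals[0] <= signal <= signals[-1]:
        raise ValueError("signal lies outside the calibrated range")
    i = bisect_left(signals, signal)
    if signals[i] == signal:
        return ref[i][1]  # exact match with a reference point
    (s0, c0), (s1, c1) = ref[i - 1], ref[i]
    fraction = (math.log10(signal) - math.log10(s0)) / (math.log10(s1) - math.log10(s0))
    return 10 ** (math.log10(c0) + fraction * (math.log10(c1) - math.log10(c0)))


# Hypothetical reference samples: (assay signal, known concentration in pg/mL).
reference = [(10.0, 1.0), (100.0, 10.0), (1000.0, 100.0)]

print(estimate_concentration(100.0, reference))
print(estimate_concentration(10 ** 1.5, reference))
```

Signals outside the reference range raise an error rather than extrapolate, reflecting that a reference data set spanning, say, 1 pg/mL to 100 pg/mL supports estimates only within that span.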
Provided herein is a method of analyzing a broad range sampling of a plurality of biomolecules comprising assigning an existing knowledge association score to the plurality of biomolecules in a test data set, generating a classification model weight for the plurality of biomolecules based on the existing knowledge association score, and classifying each biomarker into a category indicative of a likelihood of the biomarker playing a role in the specified biological state. The classification categories can be chosen from one of the following: having a significant classification model weight but with little existing knowledge association for the specified biological state, having a significant classification model weight with well-known existing knowledge association for the specified biological state, having a weak classification model weight with little existing knowledge association for the specified biological state, or having a weak classification model weight with well-known existing knowledge association for the specified biological state. Biomarkers classified as having a significant classification model weight but with little existing knowledge association for the specified biological state can be further classified as novel biomarkers associated with the specified biological state. In some aspects, at least one particle of the plurality is a nanoparticle. In some aspects, the plurality of particles comprises nanoparticles. In some aspects, the plurality of particles is a plurality of nanoparticles.
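The four categories described above amount to a two-by-two grid over model weight and existing knowledge association. A minimal sketch, assuming simple cutoffs decide what counts as a "significant" weight and a "well-known" association (the cutoff values, enum names, and example numbers are hypothetical):

```python
from enum import Enum


class Category(Enum):
    SIGNIFICANT_LITTLE_KNOWN = "significant weight, little existing knowledge (candidate novel biomarker)"
    SIGNIFICANT_WELL_KNOWN = "significant weight, well-known association"
    WEAK_LITTLE_KNOWN = "weak weight, little existing knowledge"
    WEAK_WELL_KNOWN = "weak weight, well-known association"


def categorize(
    model_weight: float,
    association_score: float,
    weight_cutoff: float,
    score_cutoff: float,
) -> Category:
    """Place a biomolecule in one of the four weight/knowledge categories."""
    significant = model_weight >= weight_cutoff
    well_known = association_score >= score_cutoff
    if significant and not well_known:
        return Category.SIGNIFICANT_LITTLE_KNOWN
    if significant and well_known:
        return Category.SIGNIFICANT_WELL_KNOWN
    if well_known:
        return Category.WEAK_WELL_KNOWN
    return Category.WEAK_LITTLE_KNOWN


# Hypothetical values: a strong model weight with almost no prior association.
result = categorize(model_weight=0.8, association_score=0.05,
                    weight_cutoff=0.5, score_cutoff=0.3)
print(result.value)
```

The first category is the one the text singles out: biomolecules that the model weights heavily but that existing knowledge barely links to the specified biological state.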
The features of the present disclosure are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present disclosure will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the disclosure are utilized, and the accompanying drawings of which:
The following description and examples illustrate embodiments of the present disclosure in detail.
It is to be understood that the present disclosure is not limited to the particular embodiments described herein and as such can vary. Those of skill in the art will recognize that there are variations and modifications of the present disclosure, which are encompassed within its scope.
All terms are intended to be understood as they would be understood by a person skilled in the art. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the disclosure pertains.
The section headings used herein are for organizational purposes only and are not to be construed as limiting the subject matter described.
Although various features of the disclosure can be described in the context of a single embodiment, the features can also be provided separately or in any suitable combination. Conversely, although the present disclosure can be described herein in the context of separate embodiments for clarity, the present disclosure can also be implemented in a single embodiment.
The following definitions supplement those in the art and are directed to the current application and are not to be imputed to any related or unrelated case, e.g., to any commonly owned patent or application. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present disclosure, the preferred materials and methods are described herein. Accordingly, the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting.
In this application, the use of the singular includes the plural unless specifically stated otherwise. It must be noted that, as used in the specification, the singular forms “a,” “an” and “the” include plural referents unless the context clearly dictates otherwise.
In this application, the use of “or” means “and/or” unless stated otherwise. The terms “and/or” and “any combination thereof” and their grammatical equivalents as used herein, can be used interchangeably. These terms can convey that any combination is specifically contemplated. Solely for illustrative purposes, the following phrases “A, B, and/or C” or “A, B, C, or any combination thereof” can mean “A individually; B individually; C individually; A and B; B and C; A and C; and A, B, and C.” The term “or” can be used conjunctively or disjunctively, unless the context specifically refers to a disjunctive use.
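The reading of "A, B, and/or C" given above corresponds to every non-empty combination of the listed items. A short sketch enumerating them (the function name is hypothetical):

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def and_or_combinations(items: Sequence[str]) -> List[Tuple[str, ...]]:
    """Return every non-empty combination of the items, smallest groups first."""
    return [
        combo
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


combos = and_or_combinations(["A", "B", "C"])
for combo in combos:
    print(" and ".join(combo))
```

For three items this yields seven combinations, matching the seven alternatives listed in the paragraph above.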
Furthermore, use of the term “including” as well as other forms, such as “include”, “includes,” and “included,” is not limiting.
Reference in the specification to “some embodiments,” “an embodiment,” “one embodiment” or “other embodiments” means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least some embodiments, but not necessarily all embodiments, of the present disclosure.
As used in this specification and claim(s), the words “comprising” (and any form of comprising, such as “comprise” and “comprises”), “having” (and any form of having, such as “have” and “has”), “including” (and any form of including, such as “includes” and “include”) or “containing” (and any form of containing, such as “contains” and “contain”) are inclusive or open-ended and do not exclude additional, unrecited elements or method steps. It is contemplated that any embodiment discussed in this specification can be implemented with respect to any method or composition of the disclosure, and vice versa. Furthermore, compositions of the present disclosure can be used to achieve methods of the present disclosure.
The term “about” in relation to a reference numerical value and its grammatical equivalents as used herein can include the numerical value itself and a range of values plus or minus 10% from that numerical value.
The term “about” or “approximately” means within an acceptable error range for the particular value as determined by one of ordinary skill in the art, which will depend in part on how the value is measured or determined, i.e., the limitations of the measurement system. For example, “about” can mean within 1 or more than 1 standard deviation, per the practice in the art. Alternatively, “about” can mean a range of up to 20%, up to 10%, up to 5%, or up to 1% of a given value. In another example, the amount “about 10” includes 10 and any amounts from 9 to 11. In yet another example, the term “about” in relation to a reference numerical value can also include a range of values plus or minus 10%, 9%, 8%, 7%, 6%, 5%, 4%, 3%, 2%, or 1% from that value. Alternatively, particularly with respect to biological systems or processes, the term “about” can mean within an order of magnitude, preferably within 5-fold, and more preferably within 2-fold, of a value. Where particular values are described in the application and claims, unless otherwise stated the term “about” meaning within an acceptable error range for the particular value should be assumed.
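The percentage-based reading of "about" can be made concrete as an interval around the reference value. A minimal sketch, assuming the plus or minus 10% reading (the function name and default fraction are illustrative; the other readings above, such as standard deviations or fold ranges, are not modeled):

```python
from typing import Tuple


def about_range(value: float, fraction: float = 0.10) -> Tuple[float, float]:
    """Return the (low, high) interval covered by 'about value' at the given fraction."""
    delta = abs(value) * fraction
    return value - delta, value + delta


low, high = about_range(10)
print(low, high)
```

With the default fraction, "about 10" covers 9 to 11, in line with the example given in the definition.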
The term “biomolecule” refers to any molecule or biological component that can be produced by, or is present in, a biological organism. Non-limiting examples of biomolecules include proteins, polypeptides, polysaccharides, a sugar, a lipid, a lipoprotein, a metabolite, an oligonucleotide, a nucleic acid (DNA, RNA, micro RNA, plasmid, single stranded nucleic acid, double stranded nucleic acid), metabolome, as well as small molecules such as primary metabolites, secondary metabolites, and other natural products, or any combination thereof. In some embodiments, the biomolecule is selected from the group of proteins, nucleic acids, lipids, and metabolomes.
As used herein, the term “sensor element” refers to elements that are able to bind to a plurality of biomolecules when in contact with a sample. The term “plurality of sensor elements” refers to more than one, for example, at least two sensor elements. In some embodiments, the plurality of sensor elements includes at least two sensor elements to at least 1000 sensor elements, preferably about two sensor elements to about 100 sensor elements. In suitable embodiments, the array comprises at least two to at least 100 sensor elements, alternatively at least two to at least 50 sensor elements, alternatively at least 2 to 30 sensor elements, alternatively at least 2 to 20 sensor elements, alternatively at least 2 to 10 sensor elements, alternatively at least 3 to at least 50 sensor elements, alternatively at least 3 to at least 30 sensor elements, alternatively at least 3 to at least 20 sensor elements, alternatively at least 3 to at least 10 sensor elements, alternatively at least 4 to at least 50 sensor elements, alternatively at least 4 to at least 30 sensor elements, alternatively at least 4 to at least 20 sensor elements, alternatively at least 4 to at least 10 sensor elements, and including any number of sensor elements contemplated in between (e.g., at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 225, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, etc.). In some embodiments, the sensor array comprises at least 6 sensor elements to at least 20 sensor elements, alternatively at least 6 sensor elements to at least 10 sensor elements.
As used herein, the term “biomolecule corona” refers to the plurality of different biomolecules that are able to bind to a sensor element. The term “biomolecule corona” encompasses “protein corona” which is a term used in the art to refer to the proteins, lipids and other plasma components that bind nanoparticles when they come into contact with biological samples or biological systems. For use herein, the term “biomolecule corona” also encompasses both the soft and hard protein corona as referred to in the art, see, e.g., Milani et al. “Reversible versus Irreversible Binding of Transferrin to Polystyrene Nanoparticles: Soft and Hard Corona” ACS NANO, 2012, 6(3), pp. 2532-2541; Mirshafiee et al. “Impact of protein pre-coating on the protein corona composition and nanoparticle cellular uptake” Biomaterials vol. 75, January 2016 pp. 295-304; Mahmoudi et al. “Emerging understanding of the protein corona at the nanobio interfaces” Nanotoday 11(6) December 2016, pp. 817-832; and Mahmoudi et al. “Protein-Nanoparticle Interactions: Opportunities and Challenges” Chem. Rev., 2011, 111(9), pp. 5610-5637, the contents of which are incorporated by reference in their entireties. As described in the art, the adsorption curve shows the build-up of a strongly bound monolayer up to the point of monolayer saturation (at a geometrically defined protein-to-NP ratio), beyond which a secondary, weakly bound layer is formed. While the first layer is irreversibly bound (hard corona), the secondary layer (soft corona) exhibits dynamic exchange. Proteins that adsorb with high affinity form what is known as the “hard” corona, consisting of tightly bound proteins that do not readily desorb, and proteins that adsorb with low affinity form the “soft” corona, consisting of loosely bound proteins. Soft and hard corona can also be defined based on their exchange times. Hard corona may show much larger exchange times in the order of several hours. See, e.g., M. Rahman et al. Protein-Nanoparticle Interactions, Springer Series in Biophysics 15, 2013, incorporated by reference in its entirety.
The term “biomolecule corona signature” refers to the composition, signature or pattern of different biomolecules that are bound to each separate sensor element. The signature not only refers to the different biomolecules but also the differences in the amount, level or quantity of the biomolecule bound to the sensor element, or differences in the conformational state of the biomolecule that is bound to the sensor element. It is contemplated that the biomolecule corona signature of each sensor element may contain some of the same biomolecules, may contain distinct biomolecules with regard to the other sensor elements, and/or may differ in level or quantity, type or conformation of the biomolecule. The biomolecule corona signature may depend on not only the physicochemical properties of the sensor element, but also the nature of the sample and the duration of exposure. In some cases, the biomolecule corona signature is a protein corona signature. In another case, the biomolecule corona signature is a polysaccharide corona signature. In yet another case, the biomolecule corona signature is a metabolite corona signature. In some cases, the biomolecule corona signature is a lipidomic corona signature. In some embodiments, the biomolecule corona signature comprises the biomolecules found in a soft corona and a hard corona. In some embodiments, the soft corona is a soft protein corona. In some embodiments, the hard corona is a hard protein corona.
“Polypeptide(s)”, “peptide(s)” and their grammatical equivalents as used herein refer to a polymer of amino acid residues. Polypeptides can comprise D-amino acids, L-amino acids, and non-natural amino acids, or any combination thereof. A “mature protein” is a protein which is full-length and which, optionally, includes glycosylation or other modifications (e.g., post-translational modification) typical for the protein in a given cellular environment. A protein can be a monomer, a homodimer or a heteromultimer of polypeptides. Non-limiting examples of post-translational modifications include phosphorylation, acylation including acetylation and formylation, glycosylation (including N-linked and O-linked), amidation, hydroxylation, alkylation including methylation and ethylation, ubiquitylation, addition of pyrrolidone carboxylic acid, formation of disulfide bridges, sulfation, myristoylation, palmitoylation, isoprenylation, farnesylation, geranylation, glypiation, lipoylation and iodination.
The term “lipid(s)” includes a variety of insoluble biomolecules, such as neutral fats, oils, and steroids. Lipids include simple lipids, triglycerides, eicosanoids, complex lipids, phospholipids, steroids, cholesterol, and lipid-related molecules. Simple lipids contain only two types of components, i.e., fatty acids and alcohols. Non-limiting examples of simple lipids include triglycerides, triacylglycerol, diglycerides, and monoglycerides. Fatty acids are long chains of carbon and hydrogen, each with a carboxylic acid functional group (-COOH) at one end. The chain length varies, but most fatty acids possess an even number of carbon atoms, with sixteen- or eighteen-carbon fatty acids being the most common. Fatty acids can be saturated, unsaturated, monounsaturated, or polyunsaturated. Triacylglycerols or triglycerides form when glycerol links to three fatty acids, which can be of different chain lengths. Eicosanoids are modified fatty acids or lipid-related molecules produced by slight alterations in the fatty acid chain of arachidonic acid. Non-limiting examples of eicosanoids can include prostaglandins, thromboxanes, leukotrienes, and lipoxins. Complex lipids can comprise fatty acids, glycerol, and an alcohol besides glycerol, a carbohydrate and a phosphate group. Non-limiting examples of complex lipids include phospholipids and steroids. Phospholipids are phosphate containing lipid molecules. Steroids are a class of lipid-related molecules derived from cholesterol. Non-limiting examples of steroids include cholesterol, testosterone, estrogen and cortisol.
“Polynucleotide” or “oligonucleotide” as used herein refers to a polymeric form of nucleotides or nucleic acids of any length, either ribonucleotides or deoxyribonucleotides. This term refers only to the primary structure of the molecule. Thus, this term includes double and single stranded DNA, triplex DNA, as well as double and single stranded RNA. It also includes modified (for example, by methylation and/or by capping) and unmodified forms of the polynucleotide. When a polynucleotide such as an oligonucleotide is represented by a sequence of letters, it is understood that the nucleotides are in 5′→3′ order from left to right and that “A” denotes deoxyadenosine, “C” denotes deoxycytidine, “G” denotes deoxyguanosine, and “T” denotes thymidine, unless otherwise noted. The letters A, C, G, and T can be used to refer to the bases themselves, to nucleosides, or to nucleotides comprising the bases, as is standard in the art. DNA (deoxyribonucleic acid) is a chain of nucleotides comprising 4 types of nucleotides: A (adenine), T (thymine), C (cytosine), and G (guanine), and RNA (ribonucleic acid) is a chain of nucleotides comprising 4 types of nucleotides: A, U (uracil), G, and C. DNA and RNA can also comprise synthetic nucleotides or chemically modified nucleotides. Certain pairs of nucleotides specifically bind to one another in a complementary fashion (called complementary base pairing). That is, adenine (A) can pair with thymine (T) (in the case of RNA, however, adenine (A) can pair with uracil (U)), and cytosine (C) can pair with guanine (G). When a first nucleic acid strand binds to a second nucleic acid strand made up of nucleotides that are complementary to those in the first strand, the two strands bind to form a double strand.
As used herein, the terms “marker”, “biomarker” (or fragment thereof) and their synonyms, which are used interchangeably, refer to molecules that can be evaluated in a sample and are associated with a specified biological state. For example, markers include genes, expressed genes or their products (e.g., proteins) or autoantibodies to those proteins that can be detected from human samples, such as blood, serum, solid tissue, and the like, and that are associated with a specified biological state. Such biomarkers include, but are not limited to, biomolecules comprising polynucleotides, amino acids, sugars, fatty acids, steroids, metabolites, polypeptides, proteins (such as, but not limited to, antigens and antibodies), carbohydrates, lipids, hormones, antibodies, regions of interest which serve as surrogates for biological molecules, combinations thereof (e.g., glycoproteins, ribonucleoproteins, lipoproteins) and any complexes involving any such biomolecules, such as, but not limited to, a complex formed between an antigen and an autoantibody that binds to an available epitope on said antigen. The biomarker can also refer to a portion of a polypeptide (parent) sequence that comprises at least 3 consecutive amino acid residues, at least 10 consecutive amino acid residues, at least 15 consecutive amino acid residues or more, and retains a biological activity and/or some functional characteristics of the parent polypeptide, e.g., antigenicity or structural domain characteristics. The biomarkers refer to both disease biomolecules present on or in diseased cells and those that have been shed from the diseased cells into bodily fluids such as blood or serum. The biomarkers can also refer to biomolecules, including autoantibodies, produced by the body in response to those disease biomolecules.
Biomarkers can include any biological substance indicative of the presence of disease, including but not limited to, genetic, epigenetic, proteomic, glycomic or imaging biomarkers, in diseases such as cancer, immunological disorders, neurological disorders, endocrine disorders, metabolic disorders, and cardiac diseases. Biomarkers can include molecules secreted by diseased cells and/or other cells, including gene, gene expression, and protein-based products (tumor markers or antigens, cell free DNA, mRNA, etc.).
By “fragment” is meant a portion of a polypeptide or nucleic acid molecule. This portion contains, preferably, at least 1%, 5%, 10%, 20%, 30%, 40%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 99.9% of the entire length of the reference nucleic acid molecule or polypeptide. A fragment can contain at least 3, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, or greater number of nucleotides or amino acids.
The term “isolated” and its grammatical equivalents as used herein refer to the removal of a biomolecule from its natural environment. The term “purified” and its grammatical equivalents as used herein refer to a molecule or composition, whether removed from nature (including genomic DNA and mRNA) or synthesized (including cDNA) and/or amplified under laboratory conditions, that has been increased in purity, wherein “purity” is a relative term, not “absolute purity.” The term “substantially purified” and its grammatical equivalents as used herein refer to a nucleic acid sequence, polypeptide, protein or other compound which is essentially free, i.e., is more than about 50% free of, more than about 60% free of, more than about 70% free of, more than about 80% free of, more than about 90% free of, the polynucleotides, proteins, polypeptides and other molecules that the nucleic acid, polypeptide, protein or other compound is naturally associated with.
By “reference” is meant a standard of comparison.
A “specified biological state” can mean, but is not limited to, a concentration of a biomolecule, a disease state, a poor clinical outcome, a good clinical outcome, a phenotype, a high risk of disease, a low risk of disease, a complete response to a treatment, a partial response to a treatment, a stable disease state, or a non-response to a treatment.
Disease, condition, and disorder are used interchangeably herein.
The term “proliferative disease” as referred to herein refers to a unifying concept in which excessive proliferation of cells and/or turnover of cellular matrix contributes significantly to the pathogenesis of the disease, including cancer.
By “neoplasia” is meant any disease that is caused by or results in inappropriately high levels of cell division, inappropriately low levels of apoptosis, or both. In some embodiments, neoplasia is cancer or tumor.
The terms “cancer” and “tumor” are used herein interchangeably and are meant to encompass any cancer, neoplastic and preneoplastic disease that is characterized by abnormal growth of cells. A tumor can be cancerous or benign. A benign tumor means the tumor can grow but does not spread. A cancerous tumor is malignant, meaning it can grow and spread (metastasize) to other parts of the body. Non-limiting examples of cancer include lung cancer, pancreas cancer, myeloma, myeloid leukemia, meningioma, glioblastoma, breast cancer, esophageal squamous cell carcinoma, gastric adenocarcinoma, prostate cancer, bladder cancer, ovarian cancer, thyroid cancer, neuroendocrine cancer, colon carcinoma, ovarian cancer, head and neck cancer, Hodgkin's Disease, non-Hodgkin's lymphomas, rectum cancer, urinary cancers, uterine cancers, oral cancers, skin cancers, stomach cancer, brain tumors, liver cancer, laryngeal cancer, esophageal cancer, mammary tumors, fibrosarcoma, myxosarcoma, liposarcoma, chondrosarcoma, osteogenic sarcoma, chordoma, angiosarcoma, endotheliosarcoma, Ewing's sarcoma, squamous cell carcinoma, basal cell carcinoma, adenocarcinoma, sweat gland carcinoma, sebaceous gland carcinoma, papillary carcinoma, papillary adenocarcinomas, cystadenocarcinoma, medullary carcinoma, bronchogenic carcinoma, renal cell carcinoma, hepatoma, bile duct carcinoma, choriocarcinoma, seminoma, embryonal carcinoma, Wilms' tumor, cervical cancer, testicular tumor, endometrial cancer, lung carcinoma, small cell lung carcinoma, bladder carcinoma, epithelial carcinoma, glioblastomas, neuronomas, craniopharyngiomas, schwannomas, glioma, astrocytoma, meningioma, melanoma, neuroblastoma, retinoblastoma, leukemias and lymphomas, acute lymphocytic leukemia and acute myelocytic polycythemia vera, multiple myeloma, Waldenstrom's macroglobulinemia, and heavy chain disease, acute nonlymphocytic leukemias, chronic lymphocytic leukemia, chronic myelogenous leukemia, childhood-null acute lymphoid leukemia (ALL), thymic ALL, B-cell ALL, acute megakaryocytic leukemia, Burkitt's lymphoma, and T cell leukemia, small and large non-small cell lung carcinoma, acute granulocytic leukemia, germ cell tumors, endometrial cancer, gastric cancer, hairy cell leukemia, thyroid cancer and other cancers known in the art. In a preferred embodiment, the cancer is selected from the group consisting of lung cancer, pancreas cancer, myeloma, myeloid leukemia, meningioma, glioblastoma, breast cancer, esophageal squamous cell carcinoma, gastric adenocarcinoma, prostate cancer, bladder cancer, ovarian cancer, thyroid cancer, and neuroendocrine cancer.
As used herein, the term “sample” refers to a biological sample obtained or derived from a subject. In some embodiments, the sample is a complex biological sample without any prior depletion of a biomolecule (e.g., protein). In some embodiments, a sample comprises nucleic acids representing all or substantially all of the nucleic acid sequences found in a subject. In some embodiments, a sample comprises polypeptides or a set of amino acid sequences representing all or substantially all of the polypeptide sequences found in a subject. In some embodiments, a sample comprises biological tissue or fluid. Suitable biological samples include, but are not limited to, blood, blood cells, serums, ascites, tissue or fine needle biopsy samples, cell-containing body fluids, lung lavage, cell lysates, bone marrow, sputum, saliva, urine, amniotic fluid, cerebrospinal fluids, tears, semen, tissue biopsy specimens, surgical specimens, other body fluids, secretions, and/or excretions, and/or cells therefrom. Such examples are not however to be construed as limiting the sample types applicable to the present disclosure. The term “tissue” is intended to include intact cells, blood, blood preparations such as plasma and serum, bones, joints, cartilage, neuronal tissue (brain, spinal cord and neurons), muscles, smooth muscles, and organs. In some embodiments, the biological fluids are prepared by methods and kits known in the art. For example, some biological samples (e.g., menstrual blood, blood samples, semen, etc.) can first be centrifuged at low speed to remove cell debris, blood clots and other cellular components that can interfere with the array. In other embodiments, for example, tissue specimens can be processed, e.g., tissue samples can be minced or homogenized, treated with enzymes to break up the tissue and/or centrifuged to remove cellular debris, allowing for the assaying and extraction of the biomolecules within the tissue samples. Suitable methods of isolating and/or properly preparing and storing samples are known in the art, and can include, but are not limited to, the addition of an anti-coagulant agent.
The terms “complex sample” or “complex biological sample” refer to any biological sample comprising a plurality of any suitable organic molecules (e.g., proteins, polynucleotides, lipids, metabolites), inorganic molecules, or submicroscopic agents (e.g., phage). Non-limiting exemplary components of a complex sample include nucleic acid molecules (e.g., nucleotides, oligonucleotides, polynucleotides, DNA, RNA, DNA aptamers, RNA aptamers), amino acids, peptides, proteins (native or recombinant), peptide aptamers, antibodies, plasmids, phages, microorganisms, cells, organelles, cofactors, and metal ions. The complex sample can be from, but is not limited to, body fluids, whole blood, plasma, serum, cerebral spinal fluid (CSF), urine, sweat, saliva, tears, pulmonary secretions, breast aspirate, prostate fluid, seminal fluid, stool, cervical scraping, cysts, amniotic fluid, intraocular fluid, mucous, moisture in breath, animal tissue, cell lysate, tumor tissue, hair, skin, buccal scraping, nail, bone marrow, cartilage, prion, bone powder, ear wax, tumor samples (e.g., fresh, frozen, or paraffin-embedded samples), or any combination thereof.
The term “multi-omic(s)” or “multiomic(s)” refers to an analytical approach for analyzing biomolecules at a large scale, wherein the data sets are multiple omes, such as proteome, genome, transcriptome, lipidome, and metabolome. Non-limiting exemplary multi-omic data includes proteomic data, genomic data, lipidomic data, glycomic data, transcriptomic data, or metabolomics data.
The terms “individual,” “subject,” and “patient” are used interchangeably herein irrespective of whether the subject has or is currently undergoing any form of treatment. In some embodiments, the subject is diagnosed with or suspected of having or developing a disease or disorder. In some embodiments, the disease is cancer. In some embodiments, the subject is in cancer remission. As used herein, the term “subject” generally refers to any vertebrate, including, but not limited to a mammal. Examples of mammals include primates, including simians and humans, equines (e.g., horses), canines (e.g., dogs), felines, various domesticated livestock (e.g., ungulates, such as swine, pigs, goats, sheep, and the like), as well as domesticated pets (e.g., cats, hamsters, mice, and guinea pigs). In some embodiments, the subject is a human. Exemplary human patients can be male and/or female. “Patient in need thereof” or “subject in need thereof” is referred to herein as a patient diagnosed with or suspected of having a disease or disorder, for instance, but not restricted to cancer.
“Administering” is referred to herein as providing one or more compositions described herein to a patient or a subject. By way of example and not limitation, composition administration, e.g., injection, can be performed by intravenous (i.v.) injection, sub-cutaneous (s.c.) injection, intradermal (i.d.) injection, intraperitoneal (i.p.) injection, or intramuscular (i.m.) injection. One or more such routes can be employed. Parenteral administration can be, for example, by bolus injection or by gradual perfusion over time. Alternatively, or concurrently, administration can be by the oral route. Additionally, administration can also be, but not limited to, oral administration, mucosal administration, inhalational administration, ocular administration, transdermal administration, rectal administration, intracystic administration, enteral administration, parenteral administration, surgical deposition of a bolus or pellet of cells, or positioning of a medical device.
The terms “treat,” “treated,” “treating,” “treatment,” and their grammatical equivalents are meant to refer to reducing or ameliorating a disorder and/or symptoms associated therewith (e.g., a cancer). “Treating” can refer to administration of the therapy to a subject after the onset, or suspected onset, of a cancer. “Treating” includes the concepts of “alleviating”, which refers to lessening the frequency of occurrence or recurrence, or the severity, of any symptoms or other ill effects related to e.g., a cancer and/or the side effects associated with e.g., cancer therapy. The term “treating” can also encompass the concept of “managing” which refers to reducing the severity of a particular disease or disorder in a patient or delaying its recurrence, e.g., lengthening the period of remission in a patient who had suffered from the disease. It is appreciated that, although not precluded, treating a disorder or condition does not require that the disorder, condition, or symptoms associated therewith be completely eliminated. In embodiments, the effect is therapeutic, i.e., the effect partially or completely cures a disease and/or adverse symptom attributable to the disease.
The term “therapeutic effect” refers to some extent of relief of one or more of the symptoms of a disorder (e.g., a neoplasia or tumor) or its associated pathology.
The term “therapeutically effective amount”, “therapeutic amount” or its grammatical equivalents as used herein refers to an amount of an agent which is effective, upon single or multiple dose administration to the cell or subject, in prolonging the survivability of the patient with such a disorder, reducing one or more signs or symptoms of the disorder, preventing or delaying, and the like beyond that expected in the absence of such treatment. “Therapeutically effective amount” is intended to qualify the amount required to achieve a therapeutic effect. The therapeutically effective amount can vary according to factors such as the disease state, age, sex, and weight of the individual, and can be determined by a physician with consideration of individual differences in e.g., age, weight, tumor size, extent of infection or metastasis, and condition of the patient (subject).
In training classifiers against complex biomolecule sampling, the classification model weights are dependent on the dataset and are relative within the dataset. For example, if a feature (e.g., protein) is removed from a data set, the weights of the other features are recalculated during retraining. The methods disclosed herein provide consistency of important features across many independent datasets by focusing on the subset of the important features consistent across different datasets. The methods provided herein identify the subset currently not related to known knowledge as a novel biomarker.
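The idea of keeping only the features that remain important across independent datasets can be illustrated with a minimal sketch. It assumes, purely for illustration, that importance is judged by rank of absolute weight within each dataset (which avoids comparing raw weights across datasets, since weights are only relative within a dataset) and that "known knowledge" is a simple set of already-annotated features. The function names, the protein labels P1 to P4, and all numeric values are hypothetical and are not taken from the disclosure.

```python
from typing import Dict, List, Set, Tuple


def top_features(weights: Dict[str, float], k: int) -> Set[str]:
    """Return the k features with the largest absolute weight in one dataset."""
    ranked = sorted(weights, key=lambda f: abs(weights[f]), reverse=True)
    return set(ranked[:k])


def consistent_features(per_dataset: List[Dict[str, float]], k: int) -> Set[str]:
    """Features that rank in the top k of every dataset (rank-based, scale-free)."""
    if not per_dataset:
        return set()
    return set.intersection(*(top_features(w, k) for w in per_dataset))


def split_known_novel(features: Set[str], known: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Separate consistent features into already-annotated and candidate-novel."""
    return features & known, features - known


# Hypothetical weights from three independently trained classifiers.
dataset_a = {"P1": 0.9, "P2": -0.7, "P3": 0.1, "P4": 0.05}
dataset_b = {"P1": 12.0, "P2": -9.0, "P3": 3.0, "P4": 0.5}
dataset_c = {"P1": 0.4, "P2": 0.35, "P3": 0.3, "P4": 0.01}

consistent = consistent_features([dataset_a, dataset_b, dataset_c], k=2)
known_hits, novel = split_known_novel(consistent, known={"P1"})
```

In this toy input, P1 and P2 are in the top two of every dataset even though the raw weight scales differ by more than an order of magnitude; P2 is absent from the assumed known set and is therefore flagged as a candidate novel biomarker.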
The present disclosure relates to a method of broad dynamic range sampling of biomolecules in a complex biological sample. In an embodiment, the method relates to a proteome sampling without prior protein depletion. In an embodiment, the method relates to a proteome sampling before prior protein depletion. In an embodiment, the method relates to a proteome sampling after prior protein depletion. In another embodiment, the method of proteome sampling is independent of direct plasma protein concentration, enabling accurate clustering algorithms to generate protein biomarkers. In an embodiment, the method relates to analyzing multiple samplings such that a relational database, a Corona Knowledge Map (CKM), is produced which can be used to rationally design particles, such as a corona, based on the surface-activity relationship of biomarkers to the particle, such as a corona, that captured them.
There are several public searchable metabolome databases and proteome databases known in the art. The present methods of the disclosure can be used with such public databases. Examples of public databases include but are not limited to Open Targets (opentargets.org), Gene Ontology Consortium (geneontology.org), Plasma Proteome Database (plasmaproteomedatabase.org), METLIN (metlin.scripps.edu), Human Metabolome Database (hmdb.ca), Kyoto Encyclopedia of Genes and Genomes (genome.jp/kegg/), Biological Magnetic Resonance Bank (bmrb.wisc.edu/metabolomics/), Proteomics Identifications (PRIDE) (ebi.ac.uk/pride), and ProteomicsDB (proteomicsdb.org). This existing public knowledge base can be leveraged individually or by using aggregators such as Open Targets, Gene Ontology Consortium, and similar commercial offerings. In addition to public knowledge bases, the methods herein can also be used with proprietary databases. For example, the Corona Knowledge Map (CKM) is a relational database to rationally design particles based on the analysis of surface-activity relationships (e.g., referencing data from a protein structure database and/or a protein-ligand co-crystal structure database) between the biomarkers captured and the particle which captured them.
Provided herein is a new approach to broadly sample the proteome without, before and after protein depletion and/or independent of direct plasma protein concentration. The present disclosure can identify proteins that are linked to a specified biological state (e.g., a disease) but which are present in very low or non-recorded concentrations in the plasma, serum, whole blood, amniotic fluid, cerebral spinal fluid, urine, saliva, tears, or feces according to the Plasma Proteome Database and other databases. Non-limiting examples of a specified biological state are a disease state, a poor clinical outcome, a good clinical outcome, a high risk of disease, a low risk of disease, a complete response to a treatment, a partial response to a treatment, a stable disease state, or a non-response to a treatment. Non-limiting exemplary proteins labeled with disease are provided in Table 1.
“Biological state” encompasses any biological characteristic of a subject which can be manifested in a biological sample as defined herein. A biological state can be detected using the methods disclosed herein where two subjects who differ in the biological state manifest those differences in the composition of a sample. For example, a biological state includes a disease state of a subject. A disease state can be detected when the disease state gives rise to changes in the molecular composition (e.g., level of one or more proteins) of a sample of a subject expressing the disease state relative to a sample of a subject not having the disease state (e.g., where the biological state is a healthy state or non-disease state).
Another example of a biological state is a level of responsiveness of a subject to a particular therapeutic treatment (e.g. administration of one or a combination of drugs or pharmaceuticals). In some embodiments, a biological state is responsiveness (e.g., with respect to a particular threshold of analysis) of a subject to a particular drug. In another embodiment, a biological state is non-responsiveness (e.g., with respect to a particular threshold of analysis) of a subject to a particular drug. In some embodiments, the level of responsiveness of a subject to a drug (i.e., biological state of responsiveness to the drug or biological state of non-responsiveness to the drug) is associated with factors such as variability in metabolism or pharmacokinetics of the drug between subjects.
Another example of a biological state is the level of immune response exhibited by a subject. In some embodiments, the biological state can be increased immune response. In other embodiments, the biological state can be decreased immune response. Immune response can differ between subjects as a result of a number of variables. For example, immune response can differ between subjects as a result of differing exposure to an exogenously introduced antigen (e.g. associated with a virus or bacteria), as a result of differences in their susceptibility to an autoimmune disease or disorder, or secondarily as a result of a response to other biological states in a subject (e.g., disease states such as cancer).
The term “disease state” for a subject as used herein refers to the ability of the sensor array of the present technology to differentiate between the different states of a disease within a subject. This term encompasses a predisease state or precursor state of a disease or disorder (a state in which the subject may not have any outward signs or symptoms of the disease or disorder but will develop the disease or disorder in the future) and a disease state in which the subject has a stage of the disease or disorder (e.g., an early, intermediate or late stage of the disease or disorder). In other words, the disease state is a spectrum that encompasses a continuum regarding the health of a subject with respect to a disease or disorder.
The disease state also includes a precursor state of a disease or disorder. This precursor state is a state in which the subject does not have any outward signs or symptoms of the disease or disorder (although there may be submacro changes within the biomolecules of the subject found in their blood or other biological fluids) but will develop the disease or disorder in the future.
In some embodiments, the methods of the present technology include comparing the protein fingerprint of the sample to a panel of protein fingerprints associated with a plurality of diseases and/or a plurality of disease states to determine if the sample indicates a disease and/or disease state. For example, samples can be collected from a population of subjects over time. Once the subjects develop a disease or disorder, the present invention allows for the ability to characterize and detect the changes in biomolecule fingerprints over time in the subject by comparing the biomolecule fingerprint of the sample from the same subject before they have developed a disease to the biomolecule fingerprint of the subject after they have developed the disease. In some embodiments, samples can be taken from cohorts of patients who all develop the same disease, allowing for analysis and characterization of the biomolecule fingerprints that are associated with the different stages of the disease for these patients (e.g. from pre-disease to disease states).
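Comparing a sample fingerprint to a panel of reference fingerprints can be sketched as a nearest-match lookup. The sketch below assumes, for illustration only, that a fingerprint is a mapping from biomolecule identifier to a measured level and that similarity is scored with cosine similarity; the disclosure does not specify a similarity measure, and the labels, identifiers, and values are hypothetical.

```python
import math
from typing import Dict

Fingerprint = Dict[str, float]  # biomolecule identifier -> measured level (assumed form)


def cosine_similarity(a: Fingerprint, b: Fingerprint) -> float:
    """Cosine similarity between two fingerprints; missing entries count as 0."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(sample: Fingerprint, panel: Dict[str, Fingerprint]) -> str:
    """Return the panel label whose fingerprint is most similar to the sample."""
    return max(panel, key=lambda label: cosine_similarity(sample, panel[label]))


# Hypothetical panel of reference fingerprints and one sample.
panel = {
    "healthy": {"PROT_A": 1.0, "PROT_B": 1.0},
    "disease_early": {"PROT_A": 1.0, "PROT_B": 4.0},
}
sample = {"PROT_A": 0.9, "PROT_B": 3.5}
match = best_match(sample, panel)
```

Cosine similarity is used here because it compares the pattern of relative levels rather than their absolute magnitude; other measures could equally be substituted.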
Methods of determining a biomolecule fingerprint associated with at least one disease or disorder and/or a disease state are contemplated. The methods comprise the steps of obtaining a sample from at least two subjects diagnosed with the at least one disease or disorder or having the same disease state; contacting each sample with a sensor array described herein to determine a biomolecule fingerprint for each sensor array; and analyzing the fingerprint of the at least two samples to determine a biomolecule fingerprint associated with the at least one disease or disorder and/or disease state.
Provided herein is a computer-implemented method for detecting one or more biomarkers in a multi-omic data set (e.g., proteomic data set). In some embodiments, the computer-implemented method comprises (1) providing a multi-omic data (e.g., proteomic data) generated from one or more complex biological samples obtained from one or more individuals, each individual having one or more specified biological states; (2) applying a trained model to the multi-omic data (e.g., proteome data) to generate one or more classification model weights, wi . . . wn, for one or more features, fi . . . fn, yielding (wi,fi), . . . , (wn,fn) and storing (wi,fi), . . . , (wn,fn); (3) querying a reference data set for the one or more features, fi . . . fn, to generate a set of scores, si . . . sn, yielding (si,fi), . . . , (sn,fn) and storing (si,fi), . . . , (sn,fn); and (4) combining at least (wi,fi), . . . , (wn,fn) and (si,fi), . . . , (sn,fn) to generate (wi,si), . . . , (wn, sn) and selecting a subset of (wi,si), . . . , (wn, sn) to detect one or more biomarkers.
In some embodiments, the method further comprises obtaining one or more complex biological samples from one or more individuals. In some embodiments, the multi-omic data is proteomic data, genomic data, lipidomic data, glycomic data, transcriptomic data, or metabolomic data. In some embodiments, the one or more complex biological samples are plasma proteome samples. In some cases, the proteome data comprises (i) protein identifiers and (ii) specified biological states for the one or more individuals. The proteome data can be generated by assaying a complex biological sample of an individual of the one or more individuals as described herein. The trained model can be trained using a set of labeled data of a plurality of complex biological samples, wherein the labeled data set comprises one or more features fi . . . fn corresponding to one or more specified biological states bi . . . bn, wherein the one or more features are proteins. In some cases, the one or more features represent different proteins. The reference data set can be a database comprising features related to specified biological states by an association score. The set of scores, si . . . sn, can be association scores between the one or more features and the one or more specified biological states. In some cases, selecting the subset in step (4) comprises filtering (wi,si), . . . , (wn,sn) such that w at least meets a first threshold and s at least meets a second threshold such that the one or more biomarkers comprise a subset (wk,sk) . . . (wm,sm) of (wi,si), . . . , (wn,sn). In some cases, k≥i. In some cases, m≤n. In some cases, the method can further comprise generating an output. The output can correspond to a specified biological state of the one or more specified biological states. In some cases, the one or more complex biological samples are not subjected to protein depletion.
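The combining and filtering of steps (2) through (4) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the disclosed method: the function name `select_biomarkers`, the dictionary representation of the stored (w, f) and (s, f) pairs, the placeholder protein names with their values, and both threshold values are hypothetical. Skipping features that are absent from the reference data set is likewise an assumption.

```python
from typing import Dict, Tuple


def select_biomarkers(
    weights: Dict[str, float],
    scores: Dict[str, float],
    w_threshold: float,
    s_threshold: float,
) -> Dict[str, Tuple[float, float]]:
    """Join (w, f) and (s, f) on the feature f, then keep (w, s) pairs
    in which w and s each at least meet their threshold."""
    selected = {}
    for feature, w in weights.items():
        if feature not in scores:
            # Assumption: features without a reference score are skipped.
            continue
        s = scores[feature]
        if w >= w_threshold and s >= s_threshold:
            selected[feature] = (w, s)
    return selected


# Hypothetical classification model weights (w, f) and association scores (s, f).
weights = {"PROT_A": 0.30, "PROT_B": 0.05, "PROT_C": 0.22}
scores = {"PROT_A": 0.80, "PROT_B": 0.90, "PROT_C": 0.10}

biomarkers = select_biomarkers(weights, scores, w_threshold=0.20, s_threshold=0.50)
print(biomarkers)
```

With these placeholder values only PROT_A meets both thresholds; PROT_B fails the weight threshold and PROT_C fails the score threshold.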
In one embodiment, the surface-activity relationships between biological sample biomarker and capture particle, such as a corona, are catalogued into a relational database known as a Corona Knowledge Map (CKM). The CKM can be used to rationally design particles to target potential biomarkers.
The present disclosure relates to generating potential biomarkers present in a low or non-recorded concentration in samples by applying machine learning clustering algorithms to a large disease-labeled training set of complex biomolecule data or proteome data (e.g., plasma protein levels) and comparing the classification weights to existing knowledge bases. Techniques to produce broad range sampling of complex biomolecules (e.g., plasma proteome) without prior depletion and independent of, e.g., plasma protein concentration, and to generate a large training and test data set of disease-labeled biomolecule (e.g., protein) levels across many patient samples, are described in International Patent Application PCT/US2017/067013, which is herein incorporated by reference in its entirety.
In some embodiments, the definition of the classification model weight and importance for each protein in each disease label depends on the algorithm used. The classification model weight of Table 1 is generated using Random Forest. Random Forest allows estimation of the classification model weight and importance for each protein by a number of methods, one being removing or perturbing the values of that protein. The average of the resulting changes in classification error provides an indication of weight. The classification model weights are dependent on the dataset and are relative within the dataset. For example, if a feature (protein) is removed from a data set, the weights of the other features are recalculated when the model is retrained. Likewise, if the same algorithm is used on a similar but new set of data, the weights can be different. The methods disclosed herein provide consistency of important features across many independent datasets by focusing on the subset of the important features consistent across different datasets. The methods provided herein identify the subset currently not related to known knowledge as a novel biomarker.
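The perturbation approach described above, in which a feature's values are perturbed and the resulting changes in classification error are averaged, can be sketched as a permutation importance calculation. This is a minimal sketch under stated assumptions: a fixed toy threshold classifier (`toy_model`) stands in for a trained Random Forest, and the function name `permutation_importance`, the data, the number of repeats, and the seed are all hypothetical.

```python
import random
from typing import Callable, List, Sequence


def permutation_importance(
    predict: Callable[[List[float]], int],
    X: Sequence[List[float]],
    y: Sequence[int],
    n_repeats: int = 20,
    seed: int = 0,
) -> List[float]:
    """Average increase in classification error when one feature is shuffled."""
    rng = random.Random(seed)

    def error(rows: Sequence[List[float]]) -> float:
        wrong = sum(1 for row, label in zip(rows, y) if predict(row) != label)
        return wrong / len(y)

    baseline = error(X)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle column j across samples, leaving the other columns intact.
            column = [row[j] for row in X]
            rng.shuffle(column)
            permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(error(permuted) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row: List[float]) -> int:
    # Hypothetical stand-in for a trained classifier: it uses feature 0 only.
    return 1 if row[0] > 0.5 else 0


# Hypothetical data: feature 0 determines the label, feature 1 is noise.
X = [[0.10, 0.7], [0.20, 0.1], [0.30, 0.9], [0.40, 0.3], [0.45, 0.5],
     [0.60, 0.2], [0.70, 0.8], [0.80, 0.4], [0.90, 0.6], [0.95, 0.0]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

importances = permutation_importance(toy_model, X, y)
print(importances)
```

Because the toy model ignores feature 1, shuffling that column never changes a prediction and its importance is exactly zero, while shuffling feature 0 raises the error. The values are relative to this data set only, consistent with the dataset dependence noted above.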
Machine learning classification algorithms such as Partial Least Squares, Logistic Regression, Support Vector Classifier, Nearest Neighbor, and Random Forest can be applied to the disease-labeled biomolecules to build a trained classification model with both high sensitivity and high specificity. The features of individual biomolecules and their associated classification weights are extracted by inspecting the trained classification model and are stored as a set of data. Another data set is created by querying data aggregators such as Open Targets, Gene Ontology Consortium and commercial options for all known biomolecules (e.g., proteins or expressing genes) connected with the labeled diseases and their respective association scores. For the set of potential biomarkers, the classification strength of each biomolecule (e.g., protein) is compared to the strength of its associations with the labeled diseases in public and private databases. Exemplary analysis of the trained classification model of protein proteome is shown in
The upper right and lower left quadrants help support the validity of the classification model.
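The quadrant comparison can be sketched as follows. The figure is not reproduced here, so this is only an illustration: the cut-off values, the placeholder protein names, and the function names `quadrant` and `group_by_quadrant` are hypothetical, and the quadrants are labeled by meaning rather than by plot position because the axis assignment is an assumption. With axes increasing in the usual direction, "high weight, high association" corresponds to the upper right and "low weight, low association" to the lower left.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

BOTH_HIGH = "high weight, high association"
BOTH_LOW = "low weight, low association"
WEIGHT_ONLY = "high weight, low association"
SCORE_ONLY = "low weight, high association"


def quadrant(w: float, s: float, w_cut: float, s_cut: float) -> str:
    """Assign a (weight, association score) pair to one of four quadrants."""
    high_w = w >= w_cut
    high_s = s >= s_cut
    if high_w and high_s:
        return BOTH_HIGH
    if not high_w and not high_s:
        return BOTH_LOW
    return WEIGHT_ONLY if high_w else SCORE_ONLY


def group_by_quadrant(
    pairs: Dict[str, Tuple[float, float]], w_cut: float, s_cut: float
) -> Dict[str, List[str]]:
    """Group feature names by the quadrant of their (w, s) pair."""
    groups = defaultdict(list)
    for feature, (w, s) in pairs.items():
        groups[quadrant(w, s, w_cut, s_cut)].append(feature)
    return dict(groups)


# Hypothetical (w, s) pairs keyed by feature name.
pairs = {
    "PROT_A": (0.30, 0.80),
    "PROT_B": (0.05, 0.90),
    "PROT_C": (0.22, 0.10),
    "PROT_D": (0.01, 0.05),
}
groups = group_by_quadrant(pairs, w_cut=0.20, s_cut=0.50)
print(groups)
```

Features in the two agreeing quadrants are those where the model and the reference data concur. Features with a high weight but a low association score are the ones a reader might inspect as candidates not yet related to known knowledge.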
The methods of the present disclosure can be used with a complex biological sample. Biomolecules can refer to any molecule or biological component that can be produced by, or be present in, a biological organism. Non-limiting examples of biomolecules include proteins, polypeptides, polysaccharides, a sugar, a lipid, a lipoprotein, a metabolite, an oligonucleotide, a nucleic acid (DNA, RNA, microRNA, plasmid, single stranded nucleic acid, and double stranded nucleic acid), metabolome, as well as small molecules such as primary metabolites, secondary metabolites, and other natural products, or any combination thereof. In some embodiments, the biomolecule is selected from the group of proteins (e.g., polypeptides or peptides), nucleic acids, lipids, and metabolomes.
In some embodiments, the protein is a mature protein. In some embodiments, the mature protein can include glycosylation or other modification (e.g., post-translational modification). Non-limiting examples of post-translational modifications include phosphorylation, acylation including acetylation and formylation, glycosylation (including N-linked and O-linked), amidation, hydroxylation, alkylation including methylation and ethylation, ubiquitylation, addition of pyrrolidone carboxylic acid, formation of disulfide bridges, sulfation, myristoylation, palmitoylation, isoprenylation, farnesylation, geranylation, glypiation, lipoylation and iodination. Polypeptides can comprise D-amino acids, L-amino acids, and non-natural amino acids, or any combination thereof. In some embodiments, the protein can be a monomer, a homodimer, or a heteromultimer of polypeptides.
In some embodiments, the lipids include a variety of insoluble biomolecules, such as neutral fats, oils, and steroids. Lipids include simple lipids, triglycerides, eicosanoids, complex lipids, phospholipids, steroids, cholesterol, and lipid-related molecules. Simple lipids contain only two types of components, i.e., fatty acids and alcohols. Non-limiting examples of simple lipids include triglycerides, triacylglycerol, diglycerides, and monoglycerides. Fatty acids are long chains of carbon and hydrogen, each with a carboxylic acid functional group (-COOH) at one end. The chain length varies, but most fatty acids possess an even number of carbon atoms, with sixteen- or eighteen-carbon fatty acids being the most common. Fatty acids can be saturated, unsaturated, monounsaturated, or polyunsaturated. Triacylglycerols or triglycerides form when glycerol links to three fatty acids, which can be of different chain lengths. Eicosanoids are modified fatty acids or lipid-related molecules produced by slight alterations in the fatty acid chain of arachidonic acid. Non-limiting examples of eicosanoids can include prostaglandins, thromboxanes, leukotrienes, and lipoxins. Complex lipids can comprise fatty acids, glycerol, and an alcohol besides glycerol, a carbohydrate and a phosphate group. Non-limiting examples of complex lipids include phospholipids and steroids. Phospholipids are phosphate-containing lipid molecules. Steroids are a class of lipid-related molecules derived from cholesterol. Non-limiting examples of steroids include cholesterol, testosterone, estrogen and cortisol.
In some embodiments, nucleic acids refer to a polymeric form of nucleotides of any length, either ribonucleotides or deoxyribonucleotides. This term refers only to the primary structure of the molecule. Thus, this term includes double and single stranded DNA, triplex DNA, as well as double and single stranded RNA. It also includes modified (for example, by methylation and/or by capping) and unmodified forms of the polynucleotide. When a polynucleotide such as an oligonucleotide is represented by a sequence of letters, it is understood that the nucleotides are in 5′→3′ order from left to right and that “A” denotes deoxyadenosine, “C” denotes deoxycytidine, “G” denotes deoxyguanosine, and “T” denotes thymidine, unless otherwise noted. The letters A, C, G, and T can be used to refer to the bases themselves, to nucleosides, or to nucleotides comprising the bases, as is standard in the art. DNA (deoxyribonucleic acid) is a chain of nucleotides comprising 4 types of nucleotides: A (adenine), T (thymine), C (cytosine), and G (guanine). RNA (ribonucleic acid) comprises 4 types of nucleotides: A, U (uracil), G, and C. DNA and RNA can also comprise synthetic nucleotides or chemically modified nucleotides. Certain pairs of nucleotides specifically bind to one another in a complementary fashion (called complementary base pairing). That is, adenine (A) can pair with thymine (T) (in the case of RNA, however, adenine (A) can pair with uracil (U)), and cytosine (C) can pair with guanine (G). When a first nucleic acid strand binds to a second nucleic acid strand made up of nucleotides that are complementary to those in the first strand, the two strands bind to form a double strand.
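The complementary base pairing rules above can be illustrated with a short sketch. The function names `complement` and `reverse_complement` and the example sequences are hypothetical, and only the four standard bases of each nucleic acid type are handled; modified or synthetic nucleotides are outside the scope of this example.

```python
# Pairing rules as stated above: A pairs with T (U in RNA), and C pairs with G.
DNA_PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}
RNA_PAIRS = {"A": "U", "U": "A", "C": "G", "G": "C"}


def complement(seq: str, rna: bool = False) -> str:
    """Return the base-by-base complement of a sequence."""
    pairs = RNA_PAIRS if rna else DNA_PAIRS
    return "".join(pairs[base] for base in seq.upper())


def reverse_complement(seq: str, rna: bool = False) -> str:
    """Return the complementary strand written in 5'->3' order."""
    return complement(seq, rna)[::-1]


print(complement("ATGC"))            # TACG
print(reverse_complement("ATGC"))    # GCAT
print(complement("AUGC", rna=True))  # UACG
```

Reversing the complement reflects the convention stated above that sequences are written 5′→3′ from left to right, since the two strands of a double strand run in opposite directions.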
Biomarkers identified with the methods provided herein can be biomolecules that can be evaluated in a sample and are associated with a specified biological state. The specified biological state can mean, but is not limited to, a disease state, a poor clinical outcome, a good clinical outcome, a high risk of disease, a low risk of disease, a complete response to a treatment, a partial response to a treatment, a stable disease state, or a non-response to a treatment. In some embodiments, a disease state refers to cancer prognosis, cancer diagnosis, cancer treatment response, cancer treatment option, post-cancer management, or any combination thereof.
In some embodiments, biomarkers are the molecules that can be evaluated in a sample and are associated with a specified biological state. For example, markers include genes, expressed genes or their products (e.g., proteins) or autoantibodies to those proteins that can be detected from human samples, such as blood, serum, solid tissue, and the like, and that are associated with a specified biological state. Such biomarkers include, but are not limited to, biomolecules comprising polynucleotides, amino acids, sugars, fatty acids, steroids, metabolites, polypeptides, proteins (such as, but not limited to, antigens and antibodies), carbohydrates, lipids, hormones, antibodies, regions of interest which serve as surrogates for biological molecules, combinations thereof (e.g., glycoproteins, ribonucleoproteins, lipoproteins) and any complexes involving any such biomolecules, such as, but not limited to, a complex formed between an antigen and an autoantibody that binds to an available epitope on said antigen. The biomarker can also refer to a portion of a polypeptide (parent) sequence that comprises at least 3 consecutive amino acid residues, at least 10 consecutive amino acid residues, at least 15 consecutive amino acid residues or more, and retains a biological activity and/or some functional characteristics of the parent polypeptide, e.g., antigenicity or structural domain characteristics. The biomarkers refer to both disease biomolecules present on or in diseased cells and those that have been shed from the diseased cells into bodily fluids such as blood or serum. The biomarkers can also refer to biomolecules, including autoantibodies produced by the body to those disease biomolecules.
Biomarkers can include any biological substance indicative of the presence of disease, including but not limited to, genetic, epigenetic, proteomic, glycomic or imaging biomarkers, in diseases such as cancer, immunological disorders, neurological disorders, endocrine disorders, metabolic disorders, or cardiac diseases. Biomarkers can include molecules secreted by diseased cells and/or other cells, including gene, gene expression, and protein-based products (tumor markers or antigens, cell free DNA, mRNA, etc.).
Additionally, the specified biological state herein includes, but is not limited to, a diagnosis of cancer at an early stage using one or more biomarkers identified by the methods described herein. In other embodiments, the specified biological state herein includes refractory or recurrent malignancies whose growth can be inhibited by targeting one or more biomarkers identified by the methods described herein.
Non-limiting examples of cancers include melanoma (e.g., metastatic malignant melanoma), renal cancer (e.g., clear cell carcinoma), prostate cancer (e.g., hormone refractory prostate adenocarcinoma), pancreatic adenocarcinoma, breast cancer, colon cancer, lung cancer (e.g., non-small cell lung cancer), esophageal cancer, squamous cell carcinoma of the head and neck, liver cancer, ovarian cancer, cervical cancer, thyroid cancer, glioblastoma, glioma, leukemia, lymphoma, and other neoplastic malignancies. In other embodiments, cancer can be selected from the group consisting of carcinoma, squamous carcinoma, adenocarcinoma, sarcomata, endometrial cancer, breast cancer, ovarian cancer, cervical cancer, fallopian tube cancer, primary peritoneal cancer, colon cancer, colorectal cancer, squamous cell carcinoma of the anogenital region, melanoma, renal cell carcinoma, lung cancer, non-small cell lung cancer, squamous cell carcinoma of the lung, stomach cancer, bladder cancer, gall bladder cancer, liver cancer, thyroid cancer, laryngeal cancer, salivary gland cancer, esophageal cancer, head and neck cancer, glioblastoma, glioma, squamous cell carcinoma of the head and neck, prostate cancer, pancreatic cancer, mesothelioma, sarcoma, hematological cancer, leukemia, lymphoma, neuroma, and combinations thereof. In some embodiments, a cancer to be treated by the methods of the present disclosure includes, for example, carcinoma, squamous carcinoma (for example, cervical canal, eyelid, tunica conjunctiva, vagina, lung, oral cavity, skin, urinary bladder, tongue, larynx, and gullet), and adenocarcinoma (for example, prostate, small intestine, endometrium, cervical canal, large intestine, lung, pancreas, gullet, rectum, uterus, stomach, mammary gland, and ovary). In some embodiments, a cancer that can be treated by targeting one or more biomarkers identified by the methods described herein includes sarcomata (for example, myogenic sarcoma), leukosis, neuroma, melanoma, and lymphoma.
In some embodiments, cancer is a solid tumor. In some embodiments, a solid tumor is a melanoma, renal cell carcinoma, lung cancer, bladder cancer, breast cancer, cervical cancer, colon cancer, gall bladder cancer, laryngeal cancer, liver cancer, thyroid cancer, stomach cancer, salivary gland cancer, prostate cancer, pancreatic cancer, or Merkel cell carcinoma. In some embodiments, cancer is a hematological cancer. In some embodiments, a hematological cancer is Diffuse large B cell lymphoma (“DLBCL”), Hodgkin's lymphoma (“HL”), Non-Hodgkin's lymphoma (“NHL”), Follicular lymphoma (“FL”), acute myeloid leukemia (“AML”), or Multiple myeloma (“MM”).
Non-limiting examples of cancers that can be diagnosed early with the present disclosure include, but are not limited to, the following: renal cancer, kidney cancer, glioblastoma multiforme, metastatic breast cancer; breast carcinoma; breast sarcoma; neurofibroma; neurofibromatosis; pediatric tumors; neuroblastoma; malignant melanoma; carcinomas of the epidermis; leukemias such as but not limited to, acute leukemia, acute lymphocytic leukemia, acute myelocytic leukemias such as myeloblastic, promyelocytic, myelomonocytic, monocytic, erythroleukemia leukemias and myelodysplastic syndrome, chronic leukemias such as but not limited to, chronic myelocytic (granulocytic) leukemia, chronic lymphocytic leukemia, hairy cell leukemia; polycythemia vera; lymphomas such as but not limited to Hodgkin's disease, non-Hodgkin's disease; multiple myelomas such as but not limited to smoldering multiple myeloma, nonsecretory myeloma, osteosclerotic myeloma, plasma cell leukemia, solitary plasmacytoma and extramedullary plasmacytoma; Waldenstrom's macroglobulinemia; monoclonal gammopathy of undetermined significance; benign monoclonal gammopathy; heavy chain disease; bone cancer and connective tissue sarcomas such as but not limited to bone sarcoma, myeloma bone disease, multiple myeloma, cholesteatoma-induced bone osteosarcoma, Paget's disease of bone, osteosarcoma, chondrosarcoma, Ewing's sarcoma, malignant giant cell tumor, fibrosarcoma of bone, chordoma, periosteal sarcoma, soft-tissue sarcomas, angiosarcoma (hemangiosarcoma), fibrosarcoma, Kaposi's sarcoma, leiomyosarcoma, liposarcoma, lymphangiosarcoma, neurilemmoma, rhabdomyosarcoma, and synovial sarcoma; brain tumors such as but not limited to, glioma, astrocytoma, brain stem glioma, ependymoma, oligodendroglioma, nonglial tumor, acoustic neurinoma, craniopharyngioma, medulloblastoma, meningioma, pineocytoma, pineoblastoma, and primary brain lymphoma; breast cancer including but not limited to adenocarcinoma, lobular (small cell) carcinoma, intraductal carcinoma, medullary breast cancer, mucinous breast cancer, tubular breast cancer, papillary breast cancer, Paget's disease (including juvenile Paget's disease) and inflammatory breast cancer; adrenal cancer such as but not limited to pheochromocytoma and adrenocortical carcinoma; thyroid cancer such as but not limited to papillary or follicular thyroid cancer, medullary thyroid cancer and anaplastic thyroid cancer; pancreatic cancer such as but not limited to, insulinoma, gastrinoma, glucagonoma, vipoma, somatostatin-secreting tumor, and carcinoid or islet cell tumor; pituitary cancers such as but not limited to Cushing's disease, prolactin-secreting tumor, acromegaly, and diabetes insipidus; eye cancers such as but not limited to ocular melanoma such as iris melanoma, choroidal melanoma, and ciliary body melanoma, and retinoblastoma; vaginal cancers such as squamous cell carcinoma, adenocarcinoma, and melanoma; vulvar cancer such as squamous cell carcinoma, melanoma, adenocarcinoma, basal cell carcinoma, sarcoma, and Paget's disease; cervical cancers such as but not limited to, squamous cell carcinoma, and adenocarcinoma; uterine cancers such as but not limited to endometrial carcinoma and uterine sarcoma; ovarian cancers such as but not limited to, ovarian epithelial carcinoma, borderline tumor, germ cell tumor, and stromal tumor; cervical carcinoma; esophageal cancers such as but not limited to, squamous cancer, adenocarcinoma, adenoid cystic carcinoma, mucoepidermoid carcinoma, adenosquamous carcinoma, sarcoma, melanoma, plasmacytoma, verrucous carcinoma, and oat cell (small cell) carcinoma; stomach cancers such as but not limited to, adenocarcinoma, fungating (polypoid), ulcerating, superficial spreading, diffusely spreading, malignant lymphoma, liposarcoma, fibrosarcoma, and carcinosarcoma; colon cancers; colorectal cancer, KRAS mutated colorectal cancer; colon carcinoma; rectal cancers; liver cancers such as but not limited to hepatocellular carcinoma and hepatoblastoma; gallbladder cancers such as adenocarcinoma; cholangiocarcinomas such as but not limited to papillary, nodular, and diffuse; lung cancers such as KRAS-mutated non-small cell lung cancer, non-small cell lung cancer, squamous cell carcinoma (epidermoid carcinoma), adenocarcinoma, large-cell carcinoma and small-cell lung cancer; lung carcinoma; testicular cancers such as but not limited to germinal tumor, seminoma, anaplastic, classic (typical), spermatocytic, nonseminoma, embryonal carcinoma, teratoma carcinoma, choriocarcinoma (yolk-sac tumor); prostate cancers such as but not limited to, androgen-independent prostate cancer, androgen-dependent prostate cancer, adenocarcinoma, leiomyosarcoma, and rhabdomyosarcoma; penile cancers; oral cancers such as but not limited to squamous cell carcinoma; basal cancers; salivary gland cancers such as but not limited to adenocarcinoma, mucoepidermoid carcinoma, and adenoid cystic carcinoma; pharynx cancers such as but not limited to squamous cell cancer, and verrucous; skin cancers such as but not limited to, basal cell carcinoma, squamous cell carcinoma and melanoma, superficial spreading melanoma, nodular melanoma, lentigo malignant melanoma, acral lentiginous melanoma; kidney cancers such as but not limited to renal cell cancer, adenocarcinoma, hypernephroma, fibrosarcoma, transitional cell cancer (renal pelvis and/or ureter); renal carcinoma; Wilms' tumor; bladder cancers such as but not limited to transitional cell carcinoma, squamous cell cancer, adenocarcinoma, carcinosarcoma. In addition, cancers include myxosarcoma, osteogenic sarcoma, endotheliosarcoma, lymphangioendotheliosarcoma, mesothelioma, synovioma, hemangioblastoma, epithelial carcinoma, cystadenocarcinoma, bronchogenic carcinoma, sweat gland carcinoma, sebaceous gland carcinoma, papillary carcinoma and papillary adenocarcinomas.
As used herein, the terms “cardiovascular disease” (CVD) or “cardiovascular disorder” are used to classify numerous conditions affecting the heart, heart valves, and vasculature (e.g., veins and arteries) of the body and encompasses diseases and conditions including, but not limited to atherosclerosis, myocardial infarction, acute coronary syndrome, angina, congestive heart failure, aortic aneurysm, aortic dissection, iliac or femoral aneurysm, pulmonary embolism, atrial fibrillation, stroke, transient ischemic attack, systolic dysfunction, diastolic dysfunction, myocarditis, atrial tachycardia, ventricular fibrillation, endocarditis, peripheral vascular disease, and coronary artery disease (CAD). Further, the term cardiovascular disease refers to subjects that ultimately have a cardiovascular event or cardiovascular complication, referring to the manifestation of an adverse condition in a subject brought on by cardiovascular disease, such as sudden cardiac death or acute coronary syndrome, including, but not limited to, myocardial infarction, unstable angina, aneurysm, stroke, heart failure, non-fatal myocardial infarction, stroke, angina pectoris, transient ischemic attacks, aortic aneurysm, aortic dissection, cardiomyopathy, abnormal cardiac catheterization, abnormal cardiac imaging, stent or graft revascularization, risk of experiencing an abnormal stress test, risk of experiencing abnormal myocardial perfusion, and death.
As used herein, the ability to detect, diagnose or prognose cardiovascular disease, for example, atherosclerosis, can include determining if the patient is in a pre-stage of cardiovascular disease, has developed early, moderate or severe forms of cardiovascular disease, or has suffered one or more cardiovascular events or complications associated with cardiovascular disease.
Atherosclerosis (also known as arteriosclerotic vascular disease or ASVD) is a cardiovascular disease in which an artery wall thickens as a result of the invasion, accumulation, and deposition of arterial plaques containing white blood cells on the innermost layer of the walls of arteries, resulting in the narrowing and hardening of the arteries. The arterial plaque is an accumulation of macrophage cells or debris, and contains lipids (cholesterol and fatty acids), calcium and a variable amount of fibrous connective tissue. Diseases associated with atherosclerosis include, but are not limited to, atherothrombosis, coronary heart disease, deep venous thrombosis, carotid artery disease, angina pectoris, peripheral arterial disease, chronic kidney disease, acute coronary syndrome, vascular stenosis, myocardial infarction, aneurysm or stroke.
The term “endocrine disease” is used to refer to a disorder associated with dysregulation of the endocrine system of a subject. Endocrine diseases may result from a gland producing too much or too little of an endocrine hormone causing a hormonal imbalance, or due to the development of lesions (such as nodules or tumors) in the endocrine system, which may or may not affect hormone levels. Suitable endocrine diseases able to be treated include, but are not limited to, e.g., Acromegaly, Addison's Disease, Adrenal Cancer, Adrenal Disorders, Anaplastic Thyroid Cancer, Cushing's Syndrome, De Quervain's Thyroiditis, Diabetes, Follicular Thyroid Cancer, Gestational Diabetes, Goiters, Graves' Disease, Growth Disorders, Growth Hormone Deficiency, Hashimoto's Thyroiditis, Hurthle Cell Thyroid Cancer, Hyperglycemia, Hyperparathyroidism, Hyperthyroidism, Hypoglycemia, Hypoparathyroidism, Hypothyroidism, Low Testosterone, Medullary Thyroid Cancer, MEN 1, MEN 2A, MEN 2B, Menopause, Metabolic Syndrome, Obesity, Osteoporosis, Papillary Thyroid Cancer, Parathyroid Diseases, Pheochromocytoma, Pituitary Disorders, Pituitary Tumors, Polycystic Ovary Syndrome, Prediabetes, Silent Thyroiditis, Thyroid Cancer, Thyroid Diseases, Thyroid Nodules, Thyroiditis, Turner Syndrome, Type 1 Diabetes, Type 2 Diabetes, and the like.
As referred to herein, inflammatory disease refers to a disease caused by uncontrolled inflammation in the body of a subject. Inflammation is a biological response of the subject to a harmful stimulus, which may be external or internal, such as pathogens, necrosed cells and tissues, irritants, etc. However, when the inflammatory response becomes abnormal, it results in self-tissue injury and may lead to various diseases and disorders. Inflammatory diseases can include, but are not limited to, asthma, glomerulonephritis, inflammatory bowel disease, rheumatoid arthritis, hypersensitivities, pelvic inflammatory disease (PID), autoimmune diseases, arthritis, necrotizing enterocolitis (NEC), gastroenteritis, emphysema, pleurisy, pyelitis, pharyngitis, angina, acne vulgaris, urinary tract infection, appendicitis, bursitis, colitis, cystitis, dermatitis, phlebitis, rhinitis, tendonitis, tonsillitis, vasculitis, celiac disease, chronic prostatitis, reperfusion injury, sarcoidosis, transplant rejection, interstitial cystitis, hay fever, periodontitis, atherosclerosis, psoriasis, ankylosing spondylitis, juvenile idiopathic arthritis, Behcet's disease, spondyloarthritis, uveitis, systemic lupus erythematosus, and cancer. For example, arthritis includes rheumatoid arthritis, psoriatic arthritis, osteoarthritis or juvenile idiopathic arthritis, and the like.
Neurological disorders or neurological diseases are used interchangeably and refer to diseases of the brain, spine and the nerves that connect them. Neurological diseases include, but are not limited to, brain tumors, epilepsy, Parkinson's disease, Alzheimer's disease, ALS, arteriovenous malformation, cerebrovascular disease, brain aneurysms, multiple sclerosis, Peripheral Neuropathy, Post-Herpetic Neuralgia, stroke, frontotemporal dementia, demyelinating disease (including, but not limited to, multiple sclerosis, Devic's disease (i.e., neuromyelitis optica), central pontine myelinolysis, progressive multifocal leukoencephalopathy, leukodystrophies, Guillain-Barre syndrome, progressing inflammatory neuropathy, Charcot-Marie-Tooth disease, chronic inflammatory demyelinating polyneuropathy, and anti-MAG peripheral neuropathy) and the like. Neurological disorders also include immune-mediated neurological disorders (IMNDs), which include diseases in which at least one component of the immune system reacts against host proteins present in the central or peripheral nervous system and contributes to disease pathology. IMNDs may include, but are not limited to, demyelinating disease, paraneoplastic neurological syndromes, immune-mediated encephalomyelitis, immune-mediated autonomic neuropathy, myasthenia gravis, autoantibody-associated encephalopathy, and acute disseminated encephalomyelitis.
In some embodiments, biomarkers can include expressed genes or their products (e.g., proteins) or autoantibodies to those proteins that can be detected from human samples, such as body fluids, whole blood, plasma, serum, cerebral spinal fluid (CSF), urine, sweat, saliva, tears, pulmonary secretions, breast aspirate, prostate fluid, seminal fluid, stool, cervical scraping, cysts, amniotic fluid, intraocular fluid, mucous, moisture in breath, animal tissue, cell lysate, tumor tissue, hair, skin, buccal scraping, nail, bone marrow, cartilage, prion, bone powder, ear wax, tumor samples (e.g., fresh, frozen, or paraffin-embedded samples), or any combination thereof, and that are associated with a specified biological state. Such biomarkers include, but are not limited to, biomolecules comprising polynucleotides, amino acids, sugars, fatty acids, steroids, metabolites, polypeptides, proteins (such as, but not limited to, antigens and antibodies), carbohydrates, lipids, hormones, antibodies, regions of interest which serve as surrogates for biological molecules, combinations thereof (e.g., glycoproteins, ribonucleoproteins, lipoproteins) and any complexes involving any such biomolecules, such as, but not limited to, a complex formed between an antigen and an autoantibody that binds to an available epitope on said antigen. The biomarker can also refer to a portion of a polypeptide (parent) sequence that comprises at least 3 consecutive amino acid residues, at least 10 consecutive amino acid residues, at least 15 consecutive amino acid residues or more, and retains a biological activity and/or some functional characteristics of the parent polypeptide, e.g., antigenicity or structural domain characteristics. The biomarkers refer to both disease biomolecules present on or in diseased cells and those that have been shed from the diseased cells into bodily fluids such as blood or serum.
The biomarkers can also refer to biomolecules, including autoantibodies produced by the body to those disease biomolecules. Biomarkers can include any biological substance indicative of the presence of disease, including but not limited to, genetic, epigenetic, proteomic, glycomic or imaging biomarkers, in diseases such as cancer, immunological disorders, neurological disorders, endocrine disorders, metabolic disorders, or cardiac diseases. Biomarkers can include molecules secreted by diseased cells, including gene, gene expression, and protein-based products (tumor markers or antigens, cell free DNA, mRNA, etc.).
In some embodiments, the biomarker can comprise one or more proteins, or fragments thereof, encoded by genes selected from CPN1, FCN3, SAA4, IGHG1, IGHG3, CFHR5, C4B, IGLL5, APOD, SERPINA10, CPN2, FGL1, AHSG, ITIH2, HIST1H4D, C4A, CP, CD5L, CNN2, HRNR, GPLD1, IGKC, MASP2, ITIH1, CFHR1, COLEC10, BIN2, SAA2, ANGPTL6, CFB, TPI1, IGHA2, APOC2, EMILIN1, SBSN, PRG4, PPIF, CFHR2, ORM1, AMY1C, NEXN, CALML5, SERPINA7, IGHM, TUFM, APCS, SLC2A3, TMSB4X, CPQ, and SNCA.
In some embodiments, the biomarker is present in a complex biological sample at about 1 pg/ml, about 2 pg/ml, about 3 pg/ml, about 4 pg/ml, about 5 pg/ml, about 6 pg/ml, about 7 pg/ml, about 8 pg/ml, about 9 pg/ml, about 10 pg/ml, about 20 pg/ml, about 30 pg/ml, about 40 pg/ml, about 50 pg/ml, about 60 pg/ml, about 70 pg/ml, about 80 pg/ml, about 90 pg/ml, or about 100 pg/ml.
In some embodiments, the biomarker is present in a complex biological sample at about 1 pg/ml or more, about 2 pg/ml or more, about 3 pg/ml or more, about 4 pg/ml or more, about 5 pg/ml or more, about 6 pg/ml or more, about 7 pg/ml or more, about 8 pg/ml or more, about 9 pg/ml or more, about 10 pg/ml or more, about 20 pg/ml or more, about 30 pg/ml or more, about 40 pg/ml or more, about 50 pg/ml or more, about 60 pg/ml or more, about 70 pg/ml or more, about 80 pg/ml or more, about 90 pg/ml or more, or about 100 pg/ml or more.
In some embodiments, the biomarker is present in a complex biological sample at about 1 pg/ml or less, about 2 pg/ml or less, about 3 pg/ml or less, about 4 pg/ml or less, about 5 pg/ml or less, about 6 pg/ml or less, about 7 pg/ml or less, about 8 pg/ml or less, about 9 pg/ml or less, about 10 pg/ml or less, about 20 pg/ml or less, about 30 pg/ml or less, about 40 pg/ml or less, about 50 pg/ml or less, about 60 pg/ml or less, about 70 pg/ml or less, about 80 pg/ml or less, about 90 pg/ml or less, or about 100 pg/ml or less.
In some embodiments, the biomarker is present in a complex biological sample at about 0.1 ng/ml, about 0.2 ng/ml, about 0.3 ng/ml, about 0.4 ng/ml, about 0.5 ng/ml, about 0.6 ng/ml, about 0.7 ng/ml, about 0.8 ng/ml, about 0.9 ng/ml, about 1 ng/ml, about 2 ng/ml, about 3 ng/ml, about 4 ng/ml, about 5 ng/ml, about 6 ng/ml, about 7 ng/ml, about 8 ng/ml, about 9 ng/ml, about 10 ng/ml, about 20 ng/ml, about 30 ng/ml, about 40 ng/ml, about 50 ng/ml, about 60 ng/ml, about 70 ng/ml, about 80 ng/ml, about 90 ng/ml, about 100 ng/ml, about 200 ng/ml, about 300 ng/ml, about 400 ng/ml, about 500 ng/ml, about 600 ng/ml, about 700 ng/ml, about 800 ng/ml, about 900 ng/ml, or about 1000 ng/ml.
In some embodiments, the biomarker is present in a complex biological sample at about 0.1 ng/ml or more, about 0.2 ng/ml or more, about 0.3 ng/ml or more, about 0.4 ng/ml or more, about 0.5 ng/ml or more, about 0.6 ng/ml or more, about 0.7 ng/ml or more, about 0.8 ng/ml or more, about 0.9 ng/ml or more, about 1 ng/ml or more, about 2 ng/ml or more, about 3 ng/ml or more, about 4 ng/ml or more, about 5 ng/ml or more, about 6 ng/ml or more, about 7 ng/ml or more, about 8 ng/ml or more, about 9 ng/ml or more, about 10 ng/ml or more, about 20 ng/ml or more, about 30 ng/ml or more, about 40 ng/ml or more, about 50 ng/ml or more, about 60 ng/ml or more, about 70 ng/ml or more, about 80 ng/ml or more, about 90 ng/ml or more, about 100 ng/ml or more, about 200 ng/ml or more, about 300 ng/ml or more, about 400 ng/ml or more, about 500 ng/ml or more, about 600 ng/ml or more, about 700 ng/ml or more, about 800 ng/ml or more, about 900 ng/ml or more, or about 1000 ng/ml or more.
In some embodiments, the biomarker is present in a complex biological sample at about 0.1 ng/ml or less, about 0.2 ng/ml or less, about 0.3 ng/ml or less, about 0.4 ng/ml or less, about 0.5 ng/ml or less, about 0.6 ng/ml or less, about 0.7 ng/ml or less, about 0.8 ng/ml or less, about 0.9 ng/ml or less, about 1 ng/ml or less, about 2 ng/ml or less, about 3 ng/ml or less, about 4 ng/ml or less, about 5 ng/ml or less, about 6 ng/ml or less, about 7 ng/ml or less, about 8 ng/ml or less, about 9 ng/ml or less, about 10 ng/ml or less, about 20 ng/ml or less, about 30 ng/ml or less, about 40 ng/ml or less, about 50 ng/ml or less, about 60 ng/ml or less, about 70 ng/ml or less, about 80 ng/ml or less, about 90 ng/ml or less, about 100 ng/ml or less, about 200 ng/ml or less, about 300 ng/ml or less, about 400 ng/ml or less, about 500 ng/ml or less, about 600 ng/ml or less, about 700 ng/ml or less, about 800 ng/ml or less, about 900 ng/ml or less, or about 1000 ng/ml or less.
In some embodiments, the biomarker is present in a complex biological sample at about 1 μg/ml, about 2 μg/ml, about 3 μg/ml, about 4 μg/ml, about 5 μg/ml, about 6 μg/ml, about 7 μg/ml, about 8 μg/ml, about 9 μg/ml, about 10 μg/ml, about 20 μg/ml, about 30 μg/ml, about 40 μg/ml, about 50 μg/ml, about 60 μg/ml, about 70 μg/ml, about 80 μg/ml, about 90 μg/ml, about 100 μg/ml, about 200 μg/ml, about 300 μg/ml, about 400 μg/ml, about 500 μg/ml, about 600 μg/ml, about 700 μg/ml, about 800 μg/ml, about 900 μg/ml, or about 1000 μg/ml.
In some embodiments, the biomarker is present in a complex biological sample at about 1 μg/ml or more, about 2 μg/ml or more, about 3 μg/ml or more, about 4 μg/ml or more, about 5 μg/ml or more, about 6 μg/ml or more, about 7 μg/ml or more, about 8 μg/ml or more, about 9 μg/ml or more, about 10 μg/ml or more, about 20 μg/ml or more, about 30 μg/ml or more, about 40 μg/ml or more, about 50 μg/ml or more, about 60 μg/ml or more, about 70 μg/ml or more, about 80 μg/ml or more, about 90 μg/ml or more, about 100 μg/ml or more, about 200 μg/ml or more, about 300 μg/ml or more, about 400 μg/ml or more, about 500 μg/ml or more, about 600 μg/ml or more, about 700 μg/ml or more, about 800 μg/ml or more, about 900 μg/ml or more, or about 1000 μg/ml or more.
In some embodiments, the biomarker is present in a complex biological sample at about 1 μg/ml or less, about 2 μg/ml or less, about 3 μg/ml or less, about 4 μg/ml or less, about 5 μg/ml or less, about 6 μg/ml or less, about 7 μg/ml or less, about 8 μg/ml or less, about 9 μg/ml or less, about 10 μg/ml or less, about 20 μg/ml or less, about 30 μg/ml or less, about 40 μg/ml or less, about 50 μg/ml or less, about 60 μg/ml or less, about 70 μg/ml or less, about 80 μg/ml or less, about 90 μg/ml or less, about 100 μg/ml or less, about 200 μg/ml or less, about 300 μg/ml or less, about 400 μg/ml or less, about 500 μg/ml or less, about 600 μg/ml or less, about 700 μg/ml or less, about 800 μg/ml or less, about 900 μg/ml or less, or about 1000 μg/ml or less.
In some embodiments, the biomarker present in a complex biological sample is at about 1 mg/ml, about 2 mg/ml, about 3 mg/ml, about 4 mg/ml, about 5 mg/ml, about 6 mg/ml, about 7 mg/ml, about 8 mg/ml, about 9 mg/ml, about 10 mg/ml, about 15 mg/ml, about 20 mg/ml, about 25 mg/ml, about 30 mg/ml, about 35 mg/ml, about 40 mg/ml, about 45 mg/ml, about 50 mg/ml, about 60 mg/ml, about 70 mg/ml, about 80 mg/ml, about 90 mg/ml, about 100 mg/ml, about 200 mg/ml, about 300 mg/ml, about 400 mg/ml, about 500 mg/ml, about 600 mg/ml, about 700 mg/ml, about 800 mg/ml, about 900 mg/ml, or about 1000 mg/ml.
In some embodiments, the biomarker present in a complex biological sample is at about 1 mg/ml or more, about 2 mg/ml or more, about 3 mg/ml or more, about 4 mg/ml or more, about 5 mg/ml or more, about 6 mg/ml or more, about 7 mg/ml or more, about 8 mg/ml or more, about 9 mg/ml or more, about 10 mg/ml or more, about 15 mg/ml or more, about 20 mg/ml or more, about 25 mg/ml or more, about 30 mg/ml or more, about 35 mg/ml or more, about 40 mg/ml or more, about 45 mg/ml or more, about 50 mg/ml or more, about 60 mg/ml or more, about 70 mg/ml or more, about 80 mg/ml or more, about 90 mg/ml or more, about 100 mg/ml or more, about 200 mg/ml or more, about 300 mg/ml or more, about 400 mg/ml or more, about 500 mg/ml or more, about 600 mg/ml or more, about 700 mg/ml or more, about 800 mg/ml or more, about 900 mg/ml or more, or about 1000 mg/ml or more.
In some embodiments, the biomarker present in a complex biological sample is at about 1 mg/ml or less, about 2 mg/ml or less, about 3 mg/ml or less, about 4 mg/ml or less, about 5 mg/ml or less, about 6 mg/ml or less, about 7 mg/ml or less, about 8 mg/ml or less, about 9 mg/ml or less, about 10 mg/ml or less, about 15 mg/ml or less, about 20 mg/ml or less, about 25 mg/ml or less, about 30 mg/ml or less, about 35 mg/ml or less, about 40 mg/ml or less, about 45 mg/ml or less, about 50 mg/ml or less, about 60 mg/ml or less, about 70 mg/ml or less, about 80 mg/ml or less, about 90 mg/ml or less, about 100 mg/ml or less, about 200 mg/ml or less, about 300 mg/ml or less, about 400 mg/ml or less, about 500 mg/ml or less, about 600 mg/ml or less, about 700 mg/ml or less, about 800 mg/ml or less, about 900 mg/ml or less, or about 1000 mg/ml or less.
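By way of non-limiting illustration, the concentration ranges recited above span several units (pg/ml, ng/ml, μg/ml, and mg/ml). The following minimal sketch converts them to a common unit so that values can be compared. The function names and the ASCII spelling "ug/ml" for μg/ml are assumptions made for this example only.

```python
import math

# Conversion factors to pg/ml for the units recited above.
# "ug/ml" is used here as an ASCII stand-in for μg/ml (an assumption of this sketch).
PG_PER_UNIT = {"pg/ml": 1.0, "ng/ml": 1e3, "ug/ml": 1e6, "mg/ml": 1e9}


def to_pg_per_ml(value, unit):
    """Express a concentration in pg/ml."""
    return value * PG_PER_UNIT[unit]


def orders_of_magnitude(low, low_unit, high, high_unit):
    """Number of decades separating two concentrations."""
    return math.log10(to_pg_per_ml(high, high_unit) / to_pg_per_ml(low, low_unit))


if __name__ == "__main__":
    # From about 1 pg/ml to about 1000 mg/ml, the widest span recited above.
    print(orders_of_magnitude(1, "pg/ml", 1000, "mg/ml"))
```

Expressed this way, the recited values cover roughly twelve orders of magnitude from the lowest to the highest concentration.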
Proteins are essential cellular machinery, performing and enabling tasks within biological systems. The variety of proteins is extensive, and the role they occupy in biology is deep and complex; life depends on proteins. Each step of cellular generation, from replication of genetic material to cell senescence and death, relies on the correct function of several distinct proteins. The precision of cellular machinery can be disrupted, however, resulting in disease. Because much of the machinery essential to cell health and survival remains unknown, studying proteins is of great interest and importance. The field of proteomics is the large-scale study of proteins and the proteome and encompasses many techniques, such as, but not limited to, immunoassays, two-dimensional differential gel electrophoresis (2-D DIGE), and mass spectrometry. In some embodiments, mass spectrometry-based proteomics comprises top-down analysis. In some embodiments, mass spectrometry-based proteomics comprises bottom-up analysis. Top-down methods analyze whole proteins; bottom-up approaches investigate the peptides from digested proteins. Each approach has its own methods of analysis, but the two share a common mode of measurement. In mass spectrometric analysis, the mass-to-charge ratios (m/z) of molecular species are determined. By collecting this data, compounds in the sample can be identified by comparing against standard databases of compounds and molecules with known masses. From whole protein analysis in top-down proteomics to peptide analysis in bottom-up proteomics, each species measured has an m/z signature detectable by the mass spectrometer. By pairing mass analyzers and detectors, adding equipment in different configurations, and coupling separations and mass spectrometers together, there are virtually limitless possibilities, functionalities, and speeds of data acquisition for mass spectrometry-based proteomic analysis.
Mass spectrometry-based proteomics has advanced rapidly since the advent of “soft” ionization techniques, such as electrospray ionization (ESI) and matrix-assisted laser desorption ionization (MALDI). Methods of detection previously used for organic chemicals and other sample identifications were adapted for proteomics. Several combinations of ionization sources and mass analyzers are available commercially, each with merits for specific applications. Time-of-flight (TOF) instruments are often coupled with MALDI instruments. ESI-TOF instruments can provide high-speed, continuous measurements without compromising resolution. High-resolution Fourier Transform Ion Cyclotron Resonance (FT-ICR) mass spectrometers are costly, but provide unsurpassed data collection power using both electrospray and MALDI ion sources. In some cases, the Orbitrap mass analyzer provides a high mass resolution instrument at a lower cost than FT-ICR. Mass analyzers such as the Orbitrap, combined with the selectivity of ion traps and quadrupoles, allow complex samples to be analyzed with high mass accuracy and specificity, even in small quantities, with increasing ease and confidence.
A mass spectrometer is an instrument that measures the masses of individual molecules that have been converted into ions, i.e., molecules that have been electrically charged. A mass spectrometer measures the mass-to-charge (“m/z”) ratio of the ions formed from the molecules. The sample, which can be a solid, liquid, or vapor, enters a vacuum chamber through an inlet. Depending on the type of inlet and ionization techniques used, the sample can already exist as ions in solution, or it can be ionized in conjunction with its volatilization or by other methods in the ion source. The gas phase ions are sorted in the mass analyzer according to their m/z ratios and then collected by a detector. In the detector, the ion flux is converted to a proportional electrical current. A data system records the magnitude of these electrical signals as a function of m/z and converts this information into a mass spectrum. Further information on mass spectrometers and their use is described in U.S. Pat. Nos. 6,504,150 and 6,960,760, each of which is incorporated herein by reference in its entirety.
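By way of non-limiting illustration, the relationship between a neutral molecular mass and the measured m/z can be sketched as follows. The sketch assumes positive-mode ionization by protonation, in which an ion of charge z carries z additional protons; other ionization modes and adducts are not covered. The function names are hypothetical.

```python
# Mass of a proton in daltons (Da), rounded to six decimal places.
PROTON_MASS = 1.007276


def mz_from_mass(neutral_mass, charge):
    """m/z of an [M + zH]z+ ion, assuming ionization by protonation."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


def mass_from_mz(mz, charge):
    """Invert the relation above to recover the neutral mass."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return charge * (mz - PROTON_MASS)


if __name__ == "__main__":
    # The same hypothetical 1000 Da species appears at different m/z
    # values depending on its charge state.
    for z in (1, 2, 3):
        print(z, round(mz_from_mass(1000.0, z), 4))
```

Because one species can appear at several m/z values, the charge state must be determined before the measured value is compared against a database of known masses.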
The complex sample or complex biological sample as described herein can contain at least one type of biomolecule. In some embodiments, the at least one type of biomolecule comprises a biomarker of interest. In some embodiments, the biomarker of interest is not present in the sample. In some embodiments, the biomarker of interest is highly prevalent and has saturated the dynamic range of the test. In some embodiments, the biomarker of interest is present at an intermediate concentration.
In some embodiments, the fraction of a sample comprising a particular biomolecule at which binding occurs is mapped to the concentration of the binding biomarker in the sample. Typically, a correlation is made between the number of bound biomarkers of a given biomolecule type and the quantity of that biomarker in the sample. By determining the number of bound biomolecules, a determination and/or estimate of the number and/or quantity of the specific biomarker (e.g., protein) present in the original sample can be made. However, this is not strictly necessary for a chemistry signature of the sample to be useful. Rather, it is sufficient that the process be repeatable and dispersive. That is, a range of species concentrations in the sample can produce a range of fractions of bound biomolecules, rather than collapsing onto a single fraction of bound biomolecules.
The ability of the biomarkers of a particular biomolecule type to saturate is an important aspect of the present disclosure, as it limits the level of influence that any one species, particularly a high abundance species, can have in the determined signature. The mass spectrometry signal can be “clipped”, greatly improving the signal-to-noise ratio problem described earlier. By limiting the influence (i.e., clipping the signal) of one or more species that would otherwise be abundant in the sample, the mass spectrometer can more effectively detect variations of low abundance species, as their signals are not overshadowed by high abundance species. Examples of high abundance proteins include, but are not limited to, albumin, immunoglobulin, transferrin, factor H, C9 complement, C8 complement, C5 complement, C6 complement, C7 complement, and fibrinogen; examples of low abundance proteins include, but are not limited to, cytokines, signaling molecules, intracellular proteins, interleukin-4, TNF alpha, interferon gamma, interleukin-1 beta, interleukin-12, interleukin-5, interleukin-10, and interleukin-6 (N. Leigh Anderson and Norman G. Anderson, The Human Plasma Proteome: History, Character, and Diagnostic Prospects, MOLECULAR & CELLULAR PROTEOMICS 1:845-867, 2002). It should be understood that this is not an exhaustive list of high abundance proteins and low abundance proteins. One of skill in the art can recognize additional high abundance proteins and additional low abundance proteins that are in the biological sample. This is especially the case for low abundance proteins, as there is a vast variety of low abundance proteins. The proteins (high abundance and low abundance) can be selected out of the sample, and the signals of the high abundance proteins and low abundance proteins can be clipped or enhanced, respectively.
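By way of non-limiting illustration, the effect of clipping on the relative contribution of a low abundance species can be sketched numerically. In the disclosure the clipping arises from saturation of binding; the sketch below merely imitates that effect with a numeric ceiling. The signal values, the ceiling, and the function names are hypothetical.

```python
def clip_signals(signals, ceiling):
    """Cap every signal at `ceiling`, imitating saturation of binding."""
    return {name: min(value, ceiling) for name, value in signals.items()}


def relative_contribution(signals):
    """Fraction of the total signal contributed by each species."""
    total = sum(signals.values())
    return {name: value / total for name, value in signals.items()}


if __name__ == "__main__":
    # Hypothetical raw signals in arbitrary units: one high abundance
    # protein and one low abundance protein.
    raw = {"albumin": 1_000_000.0, "interleukin-6": 10.0}
    clipped = clip_signals(raw, ceiling=100.0)

    print(relative_contribution(raw)["interleukin-6"])      # about 1e-5
    print(relative_contribution(clipped)["interleukin-6"])  # about 0.09
```

With the high abundance signal capped, a change in the low abundance species shifts the overall signature by a far larger fraction than it would in the unclipped data.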
The proteomic sampling methods provided herein contemplate uses of various mass spectrometry methods to analyze proteins. Some methods are discovery-based, where samples are analyzed to determine what proteins are present in the sample. Often a high-resolution mass spectrometer is used for this purpose, as the false-discovery rate of protein identifications from peptides relies on highly accurate mass-to-charge measurements. Some methods are targeted, focusing on single proteins of interest and quantifying them in different samples or sample fractions. Highly selective methods using ion traps and quadrupoles are ideal for targeted analysis. Proteins and peptides can be fragmented in the mass spectrometer for tandem mass spectrometry in a variety of ways, and those fragments are analyzed for de novo peptide sequencing or peptide mass fingerprinting. The vast majority of peptide identifications are accomplished with fragmentation followed by protein database searches of the resulting fragments. Electron capture dissociation (ECD), electron transfer dissociation (ETD), higher energy collisional dissociation (HCD), collision-induced dissociation (CID), and a host of other fragmentation methods are available, each with recommended applications. Furthermore, mass spectrometry methods are often customized within software.
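By way of non-limiting illustration, the fragment masses that a database search compares against a tandem mass spectrum can be computed as follows. The sketch assumes singly charged b- and y-type ions, which are the series commonly observed with collisional fragmentation, together with standard monoisotopic residue masses and unmodified residues. The function name is hypothetical.

```python
# Monoisotopic residue masses in daltons for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # monoisotopic mass of H2O
PROTON = 1.007276   # mass of a proton


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        # b_i holds the first i residues; y_i holds the last i residues
        # plus the water retained at the C-terminus.
        b_ions.append(sum(masses[:i]) + PROTON)
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)
    return b_ions, y_ions


if __name__ == "__main__":
    b, y = fragment_ions("GASK")
    print([round(m, 4) for m in b])
    print([round(m, 4) for m in y])
```

Isoleucine and leucine have identical residue masses in this table, so they cannot be distinguished from these fragment masses alone.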
The mass spectrometer is a critical aspect in proteomic experiments; however, the results obtained from the mass spectrometer are limited by the sample. Regardless of the analysis approach used, a high quality sample is critical for a successful experiment. Proteomic analyses depend on the sample containing proteins to analyze. Sample preparation approaches that are time-consuming, or worse, incur massive sample losses, are intolerable. Techniques to produce proteomic sampling of, e.g., the plasma proteome without prior depletion and independent of, e.g., plasma protein concentration, and to generate a large training and test data set of disease-labeled biomarker protein levels across many patient samples, are described in International Patent Application PCT/US2017/067013, which is herein incorporated by reference in its entirety. Also contemplated herein are various other proteomic sample preparation methods, such as affinity purification, size exclusion, hydrophobic exclusion, and charge exclusion. Also contemplated herein is the creation of a relational database in which the surface-activity relationships between biomarkers and capture particles such as coronas are analyzed in order to rationally design capture particles. In order to obtain protein from a complex biological sample, the sample must first be harvested from the organism, culture, or patient. Samples can be obtained by several methods. Traditional dissection from animal species, biopsies, blood draws, and additional methods can deliver adequate protein for analysis. The proteins in the sample must be made readily accessible via lysis and extraction from the cells in the sample.
Numerous methods for lysis and extraction are well known in the art, e.g., lysis buffers, mechanical disruption strategies, and a chemical lysis and extraction agent used along with some mechanical stimulus that physically breaks apart the cell, allowing the chemical agent to solubilize the available protein. Non-limiting exemplary common detergents for mass spectrometry and cell lysis include Triton X-100, NP-40, Tween 20, Tween 80, octyl glucoside, octyl thioglucoside, Big CHAP, deoxycholate, sodium dodecyl sulfate, CHAPS, and CHAPSO. Lysis buffers can differ in critical micelle concentration (CMC). The CMC is the concentration at which the detergent forms micelles spontaneously, which can affect its efficacy and removal in different environments. Above this point, the detergent forms micelles, and detergent added can move directly into micelles. Higher CMC values are associated with weaker hydrophobic binding to monomers. Thus, higher CMC detergents tend to be more easily removed by buffer exchange and dialysis. Solutions with lower CMC values form micelles more easily, and generally require less detergent to effectively solubilize protein. Another factor that can affect a lysis buffer is the micelle molecular weight (MMW). Lower-weight micelles are more easily removed than higher-weight micelles. Making use of CMC and MMW to determine the best course of the experiment is well known in the art. Choosing a lysis buffer depends greatly on these detergent factors. Most of the detergents listed are incompatible with downstream mass spectrometry analysis, and must be removed. Lysis, extraction, and denaturation of protein can occur in the same step with certain procedures, such as with sodium dodecyl sulfate (SDS) while boiling and agitating the sample. See Bodzon-Kulakowska, A. et al. Methods for samples preparation in proteomic research. J. Chromatogr. 2007, 849, 1-31; Visser, N. F. C. et al. Sample preparation for peptides and proteins in biological matrices prior to liquid chromatography and capillary zone electrophoresis. Anal. Bioanal. Chem. 2005, 382, 535-558; Hilbrig, F. and Freitag, R. Protein purification by affinity precipitation. J. Chromatogr. 2003, 790, 79-90; Zhou, J.-Y. et al. Simple sodium dodecyl sulfate-assisted sample preparation method for LC-MS-based proteomics applications. Anal. Chem. 2012, 84, 2862-2867.
Once protein is extracted, removal of contaminants and detergents is necessary. Some detergents can interfere with enzymatic digestion. Some detergents can interfere with reverse-phase separations and mass spectrometry. Removal of unwanted cellular material, such as lipids and genomic DNA, prevents signal suppression and chromatographic interference, and presents a much cleaner, clearer spectrum from which to obtain protein identification data. Various methods of contaminant removal are well known in the art. Non-limiting exemplary methods of contaminant removal include precipitation, salting out, ultrafiltration, polyethyleneimine (PEI) precipitation, isoelectric point (pI) precipitation, thermal precipitation, and precipitation with the nonionic polymer polyethylene glycol (PEG). See Englard, S. and Seifter, S. Precipitation techniques. Methods Enzymol. 1990, 182, 285-300; Burgess, R. R. Protein precipitation techniques. Methods Enzymol. 2009, 463, 331-342; Jiang, L. et al. Comparison of protein precipitation methods for sample preparation prior to proteomic analysis. J. Chromatogr. 2004, 1023, 317-320; Wisniewski, J. R. et al. Comparison of ultrafiltration units for proteomic and N-glycoproteomic analysis by the filter-aided sample preparation method. Anal. Biochem. 2011, 410, 307-309; Gupta, M. N. et al. Affinity precipitation of proteins. J. Mol. Recognit. 1996, 9, 356-359; Holler, C. et al. Polyethyleneimine precipitation versus anion exchange chromatography in fractionating recombinant β-glucuronidase from transgenic tobacco extract. J. Chromatogr. 2007, 1142, 98-105; Hegg, P. O. Precipitation of egg white proteins below their isoelectric points by sodium dodecyl sulphate and temperature. Biochim. Biophys. Acta 1979, 579, 73-87; Jaffe, W. G. A simple method for the approximate estimation of the isoelectric point of soluble proteins. J. Biol. Chem. 1943, 148, 185-186; Fan, J. et al. Thermal precipitation fluorescence assay for protein stability screening. J. Struct. Biol.
2011, 175, 465-468; Hill, A. R. and Irvine, D. M. Effects of pH on the thermal precipitation of proteins in acid and sweet cheese wheys. Can. Inst. Food Sci. Technol. J. 1988, 21, 386-389; Ingham, K. C. Precipitation of proteins with polyethylene glycol. Methods Enzymol. 1990, 182, 301-306; Ingham, K. C. Protein precipitation with polyethylene glycol. Methods Enzymol. 1984, 104, 351-356; Sim, S.-L. et al. Protein precipitation by polyethylene glycol: A generalized model based on hydrodynamic radius. J. Biotechnol. 2012, 157, 315-319; Crowell, A. M. J. et al. Maximizing recovery of water-soluble proteins through acetone precipitation. Anal. Chim. Acta 2013, 796, 48-54; Barritault, D. et al. The use of acetone precipitation in the isolation of ribosomal proteins. Eur. J. Biochem. 1976, 63, 131-135; Puchades, M. et al. Analysis of intact proteins from cerebrospinal fluid by matrix-assisted laser desorption/ionization mass spectrometry after two-dimensional liquid-phase electrophoresis. Rapid Commun. Mass Spectrom. 1999, 13, 2450-2455; Thongboonkerd, V. et al. Proteomic analysis of normal human urinary proteins isolated by acetone precipitation or ultracentrifugation. Kidney Int. 2002, 62, 1461-1469; Srivastava, O. P. et al. Purification of gamma-crystallin from human lenses by acetone precipitation method. Curr. Eye Res. 1998, 17, 1074-1081; Von Hagen, J. Proteomics Sample Preparation; John Wiley & Sons Inc.: Hoboken, N.J., USA, 2011; Wu, X. et al. Universal sample preparation method integrating trichloroacetic acid/acetone precipitation with phenol extraction for crop proteomic analysis. Nat. Protoc. 2014, 9, 362-374; Chevallet, M. et al. Toward a better analysis of secreted proteins: The example of the myeloid cells secretome. Proteomics 2007, 7, 1757-1770; Robinson, P. J. et al. Activation of protein kinase C in vitro and in intact cells or synaptosomes determined by acetic acid extraction of MARCKS. Anal. Biochem. 1993, 210, 172-178; Isaacson, T. et al. Sample extraction techniques for enhanced proteomic analysis of plant tissues. Nat. Protoc. 2006, 1, 769-774; Duan, X. et al. A straightforward and highly efficient precipitation/on-pellet digestion procedure coupled with a long gradient nano-LC separation and orbitrap mass spectrometry for label-free expression profiling of the swine heart mitochondrial proteome. J. Proteome Res. 2009, 8, 2838-2850.
Non-limiting exemplary approaches for bottom-up proteomic analysis include in-solution and in-gel digestion. In-solution digestion involves denaturing, reducing, alkylating, and digesting the protein sample in the liquid phase, as opposed to in a gel or on a filter. In-solution digestion is well known in the art. See De Godoy, L. M. F. et al. Comprehensive mass-spectrometry-based proteome quantification of haploid versus diploid yeast. Nature 2008, 455, 1251-1254; Li, N. et al. Lipid raft proteomics: Analysis of in-solution digest of sodium dodecyl sulfate-solubilized lipid raft proteins by liquid chromatography-matrix-assisted laser desorption/ionization tandem mass spectrometry. Proteomics 2004, 4, 3156-3166; De Souza, G. A. et al. Identification of 491 proteins in the tear fluid proteome reveals a large number of proteases and protease inhibitors. Genome Biol. 2006, 7, R72; Go, E. P. et al. In-solution digestion of glycoproteins for glycopeptide-based mass analysis. Methods Mol. Biol. (Clifton N.J.) 2013, 951, 103-111. In-solution digestion can be performed using single-tube approaches, eliminating much of the sample loss that occurs during solution transfer between different vessels. Generally, in-solution digests are fractionated after digestion, but fractionation can be performed prior to digestion using different forms of chromatography, including, but not limited to, strong and weak ion exchange, reverse-phase, and size exclusion chromatography.
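By way of non-limiting illustration, the digestion step of a bottom-up workflow can be simulated in silico to predict which peptides a protein sequence would yield. The disclosure does not name a particular protease; the sketch below assumes trypsin, a commonly used protease, under the usual simplified rule of cleavage after lysine (K) or arginine (R) except when the next residue is proline (P). The function name and the example sequence are hypothetical.

```python
def tryptic_digest(sequence, missed_cleavages=0):
    """Peptides from a simplified trypsin rule: cleave after K or R,
    but not when the following residue is P."""
    if not sequence:
        return []

    # Positions at which the chain is cut, including both termini.
    sites = [0]
    for i in range(len(sequence) - 1):
        if sequence[i] in "KR" and sequence[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(sequence))

    # Join up to `missed_cleavages` adjacent fragments to model
    # incomplete digestion.
    peptides = []
    for start in range(len(sites) - 1):
        last = min(start + 1 + missed_cleavages, len(sites) - 1)
        for end in range(start + 1, last + 1):
            peptides.append(sequence[sites[start]:sites[end]])
    return peptides


if __name__ == "__main__":
    print(tryptic_digest("AKRPGKT"))
    print(tryptic_digest("AKRPGKT", missed_cleavages=1))
```

Allowing missed cleavages enlarges the list of candidate peptides, which matters when the predicted peptides are later matched against measured masses.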
Gel-based mass spectrometry analysis is well known in the art and widely used as a first method of separation prior to LC-MS/MS analysis. See Lasonder, E. et al. Analysis of the Plasmodium falciparum proteome by high-accuracy mass spectrometry. Nature 2002, 419, 537-542; Pomastowski, P. and Buszewski, B. Two-dimensional gel electrophoresis in the light of new developments. TrAC Trends Anal. Chem. 2014, 53, 167-177; Shevchenko, A. et al. Mass spectrometric sequencing of proteins from silver-stained polyacrylamide gels. Anal. Chem. 1996, 68, 850-858. Different methods of gel electrophoresis are well known in the art. Before the digestion, separation of the protein is performed using a gel. In one of the most common proteomic sample preparation strategies, a denaturing gel (sodium dodecyl sulfate in a polyacrylamide gel, SDS-PAGE) is used for bottom-up proteomics, as the protein can be cleaved into peptides in later steps. The gel is used to separate whole protein in one or two dimensions. The stained gel pieces are then excised from the gel and destained, and the protein within each gel piece is subjected to enzymatic proteolysis. The resulting peptides can then be analyzed via mass spectrometry. Strategies for in-gel digestions are well known in the art.
Various mass spectrometry and fractionation combinations for delving deeply into the proteome of organisms and model systems are well known in the art. For example, multi-dimensional protein identification technology has been used for the yeast proteome. Accurate mass tags (AMT) were developed in order to decrease the need for tandem mass spectrometry while providing more sensitive measurements and greater dynamic range. High-performance liquid chromatography (HPLC) is a very common separation technique with a wide variety of stationary phases. With the advent of ultra-performance liquid chromatography (UPLC), chromatographic separations have increased both in resolving power and speed of separation. HPLC and UPLC function on the same principles. UPLC columns generally offer smaller particle sizes, resulting in decreased analyte path length and higher column pressures (a maximum of 10,000 pounds per square inch (PSI) or greater). Liquid chromatography (LC) is often coupled with electrospray ionization and tandem mass spectrometers for both top-down and bottom-up proteomic analysis. LC has been used for a staggering number of analyses; LC-MS is commonly used for proteomic analysis. Liquid chromatography is robust and customizable based on the functionality of the stationary particles in the separation column. For bottom-up proteomic analysis, the most common HPLC/UPLC stationary phase is the C18 reverse-phase column. The reverse-phase column uses the hydrophobicity of peptides for separation, utilizing a gradient from low to high organic-phase solvent. Acidified methanol and acetonitrile are commonly used as the organic phase, also known as “B” or “strong” solvents, because of their miscibility with aqueous solutions. Acidified water is most often the “weak” solvent, also known as “A”. Both buffers are acidified with the same acid, generally with formic acid or trifluoroacetic acid (TFA) at 0.1% or 0.01%, respectively.
Formic acid is preferred over TFA, as TFA tends to form adducts and suppress signal. While reverse-phase columns are commonly used, many stationary phases are in use for proteomic work in both one- and two-dimensional separations, online and off-line. A separation strategy known as electrostatic repulsion hydrophilic interaction chromatography (ERLIC) has gained popularity for phosphoproteomic work, using adjustments in pH and volatile salts for gradient separations. As the name suggests, ERLIC uses the charge and hydrophilicity of peptides as a basis for separation. Typically, ERLIC begins with a low-organic, high-pH gradient, moving to high organic and low-pH as the separation moves on. In this way, ERLIC elutes peptides in order of increasing hydrophobicity and acidity. ERLIC has proven effective at separating and identifying modified and unmodified proteins. Smaller-diameter columns with lower loading capacities and smaller stationary phase particles offer an advantage in microproteomics. By increasing the local concentration of peptide and decreasing eddy diffusion, sample loading amounts can be minimized and still provide adequate peptide signal; the chromatographic resolution necessary for complex sample separation is not compromised. See Washburn, M. P. et al. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat. Biotechnol. 2001, 19, 242-247; Lipton, M. S. et al. Global analysis of the Deinococcus radiodurans proteome by using accurate mass tags. Proc. Natl. Acad. Sci. USA 2002, 99, 11049-11054; Xie, F. et al. Liquid chromatography-mass spectrometry-based quantitative proteomics. J. Biol. Chem. 2011, 286, 25443-25449; Peng, J. et al. Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: The yeast proteome. J. Proteome Res. 2003, 2, 43-50; Jungblut, P. R. Protein and Peptide Mass Spectrometry in Drug Discovery; Gross, M. L., Chen, G., Pramanik, B., Eds.; ChemMedChem: Weinheim, Germany, 2012; Volume 7, pp. 2241-2242.
The present disclosure relates to machine learning algorithms for a complex biomolecule sampling (e.g., proteome sampling) from a subject. The algorithms provided herein can aid in selection of previously unknown biomarkers and provide a report comprising a score or probability relating to, for example, disease risk, disease likelihood, presence or absence of disease, treatment response, and/or classification of disease status.
Methods of diagnosing or prognosing a disease or disorder using the biomarkers identified by the present methods are also contemplated. The methods comprise obtaining a sample from a subject; contacting the sample with, e.g., a plurality of nanoparticles to produce complex biomolecule sampling data; comparing the complex biomolecule sampling data to the known cohort of biomolecule data identified in the disease or disorder; and diagnosing or prognosing the disease or disorder based on the presence of one or more of the identified biomarkers.
In some embodiments, methods of identifying patterns of biomarkers or specific biomarkers associated with a disease or disorder are contemplated. Suitable methods include, for example, performing the methods described above (e.g., obtaining samples from at least two subjects diagnosed with the disease or disorder and at least two control subjects; contacting each sample with the sensor array to produce a biomolecule fingerprint; and comparing the biomolecule fingerprint of the subjects with the disease or disorder to the biomolecule fingerprint of the control subjects to determine at least one pattern and/or biomarker associated with the disease or disorder). Suitably, the method can comprise at least 2 disease subjects and at least 2 control subjects, alternatively at least 5 disease subjects and at least 5 control subjects, alternatively at least 10 disease subjects and at least 10 control subjects, alternatively at least 15 disease subjects and at least 15 control subjects, alternatively at least 20 disease subjects and at least 20 control subjects, and includes any variations in between (e.g., disease subjects from at least 2-100, and control subjects from at least 2-100).
In some embodiments, the arrays and methods allow for the determination of a pattern of biomarkers associated with the disease state or disease or disorder or, in some embodiments, specific biomarkers that are associated with the disease or disorder. Not only can biomarkers already associated with a disease state be identified, for example, biomarkers listed herein, but new biomarkers or patterns of biomarkers that can be associated with a disease state or a disease or disorder can also be determined. As discussed above, some biomarkers or patterns of biomarkers for a specific disease or disorder can be a change in a biomolecule associated with the sensor array of the present disclosure and differ from what is usually referred to as biomarkers in the art, e.g., an increased expression of a specific biomolecule associated with a disease. As discussed above, it can be the interaction of a biomolecule, e.g., biomolecule X, with other biomolecules, e.g., biomolecules Y and Z, that results in the ability to associate with a specific disease state, and this interaction may not correlate with any change in the absolute concentration of biomolecule X in the sample over time or disease state. Thus, a molecule that would not in the conventional sense be considered a biomarker, since it does not change in absolute concentration in a sample from the pre-disease to disease state, can in view of the present disclosure be considered a biomarker, as its relative changes that are measured by the array of the present disclosure are associated with a disease state. In other words, it can be an increase or decrease in the interaction of biomolecule X with the array (due to the interactions of X with the sensor elements and other biomolecules in the sample) that provides a signal that a biomarker is associated with a disease state.
Any of the methods, kits, and systems described herein can utilize a diagnostic assay for predicting a disease status of a subject or likelihood of a subject's response to a therapeutic. The diagnostic assay can use the presence of one or more biomarkers identified using the methods described herein to calculate a quantitative score that can be used to predict disease status or likelihood of response to a therapeutic in a subject. The diagnostic assay can use the presence of one or more biomarkers and one or more characteristics, such as, e.g., age, weight, gender, medical history, risk factors, family history to calculate a quantitative score that can be used to predict disease status or likelihood of response to a therapeutic in a subject.
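As a minimal sketch of how such a quantitative score could be computed, the following assumes a logistic form in which each present biomarker and each subject characteristic contributes additively. The logistic form, the function name `quantitative_score`, the marker names, and every coefficient are hypothetical and chosen only for illustration; the disclosure does not prescribe a particular scoring function.

```python
import math


def quantitative_score(biomarker_present, characteristics,
                       biomarker_coef, characteristic_coef, intercept):
    """Return a score in (0, 1) from biomarker presence and subject characteristics.

    All coefficients are hypothetical placeholders, not fitted values.
    """
    z = intercept
    # Each biomarker that is present adds its (assumed) coefficient.
    z += sum(biomarker_coef[name]
             for name, present in biomarker_present.items() if present)
    # Each numeric characteristic (e.g., age) adds coefficient * value.
    z += sum(characteristic_coef[name] * value
             for name, value in characteristics.items())
    # The logistic function maps the linear sum onto a 0-1 scale.
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical inputs for illustration only.
biomarker_coef = {"marker_1": 1.5, "marker_2": 0.5}
characteristic_coef = {"age_decades": 0.1}
subject = {"age_decades": 6.0}

score_with = quantitative_score({"marker_1": True, "marker_2": False},
                                subject, biomarker_coef, characteristic_coef, -2.0)
score_without = quantitative_score({"marker_1": False, "marker_2": False},
                                   subject, biomarker_coef, characteristic_coef, -2.0)
print(round(score_with, 3), round(score_without, 3))
```

With a positive coefficient, the presence of a biomarker raises the score; whether a higher score corresponds to a better or worse outcome depends on how the assay is defined.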
In some applications, an increase in the quantitative score in the diagnostic assay indicates an increased likelihood of one or more of: a poor clinical outcome, good clinical outcome, high risk of disease, low risk of disease, complete response, partial response, stable disease, non-response, and recommended treatments for disease management.
In some applications, a decrease in the quantitative score in the diagnostic assay indicates an increased likelihood of one or more of: a poor clinical outcome, good clinical outcome, high risk of disease, low risk of disease, complete response, partial response, stable disease, non-response, and recommended treatments for disease management.
Also provided herein are methods for generalized treatment recommendations for a subject based on their complex biomolecule samplings and methods for subject-specific treatment recommendations. Methods for treatment can comprise the following steps: detecting a presence or absence of one or more biomarkers specific for a disease state, such as cancer; recommending to the subject at least one generalized or subject-specific treatment to ameliorate disease symptoms; and monitoring the disease progression, treatment responses, and recurrence of the disease by detecting the one or more biomarkers.
Methods provided herein can be applied to, for example, tumor (cancer) analysis. Cancer is one of the leading causes of death in the United States, and the growth and developmental mechanisms of tumors are poorly understood. Tumors have substantial cellular heterogeneity, making tumor biology a critical area of study. Methods disclosed herein provide valuable tools to explore the complex proteomic differences within e.g., a single tumor. Tumor biology and biomarker discovery are major driving forces for proteomic analysis, based on the number of publications on proteomic cancer analysis in recent years. Proteomics is useful for tumor biology due to the breadth of protein information obtainable. Often, diagnosis and evaluation of tumors are done with histology and immunohistochemical analysis. Biopsies are sliced, stained, and analyzed with microscopy. While accurate, precise, and mature, the throughput of this technique is relatively low. Immunostaining and histochemical staining are commonly used for cancer diagnosis, but they are limited techniques. Only one or two proteins can be visualized with traditional microscopy techniques. Tumors can also be biopsied using needle core or aspiration biopsies. The typical diameter of a needle core biopsy is about 1 mm across and about the length of a grain of rice, limiting sample amount and thus the breadth of analyses that can be possible. The methods provided herein can bridge this gap. Diagnosis, proteome analysis, and network analysis can be consolidated, and the methods provided herein can help ensure maximum data from minimal material. Thousands of protein groups can be identified in a single mass spectrometry run. In some embodiments, no prior protein depletion is required. That information is then uploaded into network analysis databases, providing an in-depth look at a tumor's molecular equilibria. The methods provided herein can identify targets for cancer therapy. 
Analysis of pathways in cancer can not only give insight into the workings of cancer; it can also give more immediate treatment options, possibly ruling out ineffective therapies or encouraging more productive, less deleterious chemotherapies.
Suitable cancer biomarkers include, but are not limited to, for example, AHSG (α2-HS-Glycoprotein), AKR7A2 (Aflatoxin B1 aldehyde reductase), AKT3 (PKB γ), ASGR (ASGPR1), BDNF, BMP1 (BMP-1), BMPER, C9, CA6 (Carbonic anhydrase VI), CAPG (CapG), CDH1 (Cadherin-1), CHRDL1 (Chordin-Like 1), CKB-CKM (CK-MB), CLIC1 (chloride intracellular channel 1), CMA1 (Chymase), CNTN1 (Contactin-1), COL18A1 (Endostatin), CRP, CTSL2 (Cathepsin V), DDC (dopa decarboxylase), EGFR (ERBB1), FGA-FGB-FGG (D-dimer), FN1 (Fibronectin FN1.4), GHR (Growth hormone receptor), GPI (glucose phosphate isomerase), HMGB1 (HMG-1), HNRNPAB (hnRNP A/B), HP (Haptoglobin, Mixed Type), HSP90AA1 (HSP 90α), HSPA1A (HSP 70), IGFBP2 (IGFBP-2), IGFBP4 (IGFBP-4), IL12B-IL23A (IL-23), ITIH4 (Inter-α-trypsin inhibitor heavy chain H4), KIT (SCF sR), KLK3-SERPINA3 (PSA-ACT), L1CAM (NCAM-L1), LRIG3, MMP12 (MMP-12), MMP7 (MMP-7), NME2 (NDP kinase B), PA2G4 (ErbB3 binding protein Ebp1), PLA2G7 (LpPLA2/PAFAH), PLAUR (suPAR), PRKACA (PRKA C-α), PRKCB (PKC-β-II), PROK1 (EG-VEGF), PRSS2 (Trypsin-2), PTN (Pleiotrophin), SERPINA1 (α1-Antitrypsin), STC1 (Stanniocalcin-1), STX1A (Syntaxin 1A), TACSTD2 (GA733-1 protein), TFF3 (Trefoil factor 3), TGFBI (βIGH3), TPI1 (Triosephosphate isomerase), TPT1 (Fortilin), YWHAG (14-3-3 protein γ), YWHAH (14-3-3 protein eta), prostate cancer biomarkers, for example, PSA, Pro-PSA, PHI, PCA3, TMPRSS2:ERG, PCMT, MTEN, breast cancer markers, for example, epidermal growth factor receptor 2 (HER2) oncogene, melanoma biomarker BRAF, lung cancer biomarker EML4-ALK, A2ML1, BAX, C10orf47, C1orf162, CSDA, EIFC3, ETFB, GABARAPL2, GUK1, GZMH, HIST1H3B, HLA-A, HSP90AA1, NRGN, PRDX5, PTMA, RABAC1, RABAGAP1L, RPL22, SAP18, SEPW1, SOX1, EGFR, EGFRvIII, apolipoprotein A, apolipoprotein CIII, myoglobin, tenascin C, MSH6, claudin-3, claudin-4, caveolin-1, coagulation factor III, CD9, CD36, CD37, CD53, CD63, CD81, CD136, CD147, Hsp70, Hsp90, Rab13, Desmocollin-1, EMP-2, CK7, CK20, GCDF15, CD82, Rab-5b, Annexin V, 
MFG-E8, HLA-DR, a miR200 microRNA, MDC, NME-2, KGF, PIGF, Flt-3L, HGF, MCP1, SAT-1, MIP-1-b, GCLM, OPG, TNF RII, VEGF-D, ITAC, MMP-10, GPI, PPP2R4, AKR1B1, Amy1A, MIP-1b, P-Cadherin, EPO and the like. For example, biomarkers for breast cancer include, but are not limited to, ER/PR, HER-2/neu, and the like. Biomarkers for colorectal cancer include, but are not limited to, for example, EGFR, KRAS, UGT1A1, and the like. Biomarkers associated with leukemia/lymphoma include, but are not limited to, e.g., CD20 antigen, CD30, FIP1L1-PDGFRalpha, PDGFR, Philadelphia Chromosome (BCR/ABL), PML/RAR alpha, TPMT, UGT1A1, and the like. Biomarkers associated with lung cancer include, but are not limited to, e.g., ALK, EGFR, KRAS, and the like. Biomarkers are known in the art, and can be found in, for example, Bigbee W, Herberman R B. Tumor markers and immunodiagnosis. In: Bast R C Jr., Kufe D W, Pollock R E, et al., editors. Cancer Medicine. 6th ed. Hamilton, Ontario, Canada: BC Decker Inc., 2003; Andriole G, Crawford E, Grubb R, et al. Mortality results from a randomized prostate-cancer screening trial. New England Journal of Medicine 2009; 360(13):1310-1319; Schroder F H, Hugosson J, Roobol M J, et al. Screening and prostate-cancer mortality in a randomized European study. New England Journal of Medicine 2009; 360(13):1320-1328; Buys S S, Partridge E, Black A, et al. Effect of screening on ovarian cancer mortality: the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Randomized Controlled Trial. JAMA 2011; 305(22):2295-2303; Cramer D W, Bast R C Jr, Berg C D, et al. Ovarian cancer biomarker performance in prostate, lung, colorectal, and ovarian cancer screening trial specimens. Cancer Prevention Research 2011; 4(3):365-374; Sparano J A, Gray R J, Makower D F, et al. Prospective validation of a 21-gene expression assay in breast cancer. New England Journal of Medicine 2015; First published online Sep. 28, 2015. doi: 10.1056/NEJMoa1510764, incorporated by reference in their entireties.
Biomarkers may also be associated with cardiovascular disease. Suitable biomarkers are known in the art and include, but are not limited to, lipid profile, glucose, and hormone level and physiological biomarkers based on measurement of levels of important biomolecules such as serum ferritin, triglyceride to HDLp (high density lipoproteins) ratio, lipophorin-cholesterol ratio, lipid-lipophorin ratio, LDL cholesterol level, HDLp and apolipoprotein levels, lipophorins and LTPs ratio, sphingolipids, Omega-3 Index, and ST2 level, among others. Suitable biomarkers for cardiovascular disease can be found in the art, for example, but not limited to, in van Holten et al., “Circulating Biomarkers for Predicting Cardiovascular Disease Risk; a Systematic Review and Comprehensive Overview of Meta-Analyses,” PLoS One, 2013 8(4): e62080, incorporated by reference in its entirety. Biomarkers may also be associated with a neurological disease. Suitable biomarkers are known in the art and include, but are not limited to, e.g., Aβ1-42, t-tau and p-tau 181, α-synuclein, among others. See, e.g., Chintamaneni and Bhaskar, “Biomarkers in Alzheimer's Disease: A Review,” ISRN Pharmacol. 2012; 2012: 984786. Published online 2012 Jun. 28, incorporated by reference in its entirety. 
Biomarkers for inflammatory diseases are known in the art and include, but are not limited to, e.g., cytokines/chemokines, immune-related effectors, acute-phase proteins [C-reactive protein (CRP) and serum amyloid A (SAA)], reactive oxygen species (ROS) and reactive nitrogen species (RNS), prostaglandins and cyclooxygenase (COX)-related factors, and mediators such as transcription factors and growth factors, which can include, for example, C-reactive protein (CRP), S100, LIF, CXCL1, CXCL2, CXCL4, CXCL5, CXCL8, CXCL9, CXCL10, CCL2, CCL23, IL-1β, IL-1Ra, TNF, IL-6, IL-10, IL-17A, IL-17F, IL-21, IL-22, IFNγ, CXCR1, CXCR4, CXCR5, GM-CSF, GM-CSFR, G-CSF, G-CSFR, EGF, VEGFA, LEP, SAA1, VCAM1, CRP, MMP1, MMP3, TNFRSF1A, RETN, CHI3L1, antinuclear antibodies (ANA), rheumatoid factor (RF), antibodies against cyclic citrullinated peptide (anti-CCP), and, for chronic IBD, fecal calprotectin, among others. Suitable biomarkers for inflammatory bowel disease, for example, include CRP, ESR, pANCA, ASCA, and fecal calprotectin. See, e.g., Yi Fengming and Wu Jianbing, “Biomarkers of Inflammatory Bowel Disease,” Disease Markers, vol. 2014, Article ID 710915, 11 pages, 2014. doi:10.1155/2014/710915, incorporated by reference in its entirety.
Table 1 provides non-limiting exemplary genes encoding disease labeled proteins. Classification model (Random Forest) weights of each protein and their scores against Open Targets at multiple points in the EFO Ontology are provided.
In some embodiments, the definition of the classification model weight and importance for each protein in each disease label depends on the algorithm used. The classification model weights of Table 1 are generated using Random Forest. Random Forest allows estimation of the classification model weight and importance for each protein by a number of methods, one being removing or perturbing the values of that protein. The average of the resulting changes in classification error provides an indication of weight. The classification model weights are dependent on the dataset and are relative within the dataset. For example, if a feature (protein) is removed from a data set, the weights of the other features are recalculated when the model is retrained. Likewise, if the same algorithm is used on a similar but new set of data, the weights can be different. The methods disclosed herein provide stability of important features across many independent datasets by focusing on the subset of the important features that is consistent across different datasets. The methods provided herein identify the subset not currently related to existing knowledge as novel biomarkers.
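The perturbation approach described above can be sketched as follows. This is an illustrative sketch only: to stay self-contained it uses a simple nearest-centroid classifier as a stand-in for a Random Forest, and the data are synthetic (feature 0 is informative, feature 1 is noise). All function names are hypothetical.

```python
import random


def fit_centroids(X, y):
    """Stand-in model (not a Random Forest): per-class mean of each feature."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}


def predict(centroids, row):
    """Assign the class whose centroid is closest in squared distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(row, centroids[c]))
    return min(centroids, key=sq_dist)


def error_rate(centroids, X, y):
    wrong = sum(predict(centroids, row) != label for row, label in zip(X, y))
    return wrong / len(y)


def permutation_importance(centroids, X, y, n_repeats=20, seed=0):
    """Average increase in classification error after shuffling each feature."""
    rng = random.Random(seed)
    baseline = error_rate(centroids, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # perturb feature j only
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(error_rate(centroids, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Synthetic data: feature 0 separates the classes, feature 1 is pure noise.
data_rng = random.Random(1)
X, y = [], []
for i in range(200):
    label = i % 2
    X.append([label * 4.0 + data_rng.gauss(0, 1), data_rng.gauss(0, 1)])
    y.append(label)

model = fit_centroids(X, y)
importances = permutation_importance(model, X, y)
print(importances)  # feature 0 should receive a much larger weight than feature 1
```

Because the weights are error differences on one dataset, they are relative to that dataset, which is consistent with the point above that weights change when features are removed or new data are used.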
The present disclosure relates to methods of identifying biomarkers (e.g., proteins) that are linked to a specified biological state (e.g., a disease state) and are present in a very low or non-recorded concentration in presently known databases. The present disclosure also relates to computer-implemented machine learning classification algorithms, apparatuses, systems, and computer-readable media for assessing a likelihood that a patient has the specified biological state, such as cancer, relative to a patient (test) population or a control (reference) population.
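A minimal sketch of the selection step follows, pairing each feature's classification model weight with its reference-database score and keeping features with a high weight but a low existing-knowledge score. The threshold directions, the default score of 0.0 for features absent from the reference data, the function name `select_candidates`, and the placeholder feature names are all assumptions made for illustration.

```python
def select_candidates(weights, scores, min_weight, max_score):
    """Combine (w, f) and (s, f) pairs into (w, s) per feature and filter them.

    weights: {feature: classification model weight}
    scores:  {feature: existing-knowledge association score}
    Features missing from `scores` are given 0.0 (assumption: no record found).
    """
    combined = {f: (w, scores.get(f, 0.0)) for f, w in weights.items()}
    selected = {f: ws for f, ws in combined.items()
                if ws[0] >= min_weight and ws[1] <= max_score}
    # Order candidates by descending model weight.
    return dict(sorted(selected.items(), key=lambda kv: kv[1][0], reverse=True))


# Placeholder features and values, not real measurements.
weights = {"PROT_A": 0.30, "PROT_B": 0.25, "PROT_C": 0.02, "PROT_D": 0.20}
scores = {"PROT_A": 0.90, "PROT_B": 0.05, "PROT_C": 0.01}

candidates = select_candidates(weights, scores, min_weight=0.1, max_score=0.1)
print(candidates)
```

Here PROT_A is dropped because it is already well known (high score), PROT_C because the model assigns it little weight, leaving PROT_B and PROT_D as candidate novel biomarkers.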
The collection of data from patient samples presents a very complex network of information about the patient. This complex network of information can be effectively untangled by modern machine learning algorithms. Modern machine learning and artificial intelligence algorithms are well suited to managing large quantities of heterogeneous data.
The term “machine learning” refers to algorithms that give a computer the ability to learn without being explicitly programmed, including algorithms that can learn from and make predictions about data. Non-limiting exemplary machine learning algorithms include decision tree learning, artificial neural networks, deep learning neural networks, support vector machines, rule-based machine learning, random forest, nearest neighbor, support vector classifier, partial least squares, and logistic regression. Non-limiting examples of neural networks include convolutional neural networks, deep convolutional neural networks, cascaded deep convolutional neural networks, graph convolutional neural networks (GCNN), etc. In some embodiments, algorithms such as linear regression or logistic regression can be used as part of a machine learning process. However, it is understood that using linear regression or another algorithm as part of a machine learning process is distinct from performing a statistical analysis such as regression with a spreadsheet program such as Excel. The machine learning process has the ability to continually learn and adjust the classifier as new data becomes available, and does not rely on explicit or rules-based programming. By contrast, statistical modeling relies on finding relationships between variables (e.g., mathematical equations) to predict an outcome.
Machine learning has two main phases: a training phase and an application or testing phase. During training, models are learned from labelled diseases and their respective association scores. During the testing or application phase, the models are then applied to unseen data for classifying against the biological states used in training.
Generating a prediction algorithm by training a machine is a well-known technique. The most important factor in the training of the machine is the quality of the database used for the training. Typically, the machine combines one or more linear models, support vector machines, decision trees, and/or neural networks.
Machine learning can be generalized as the ability of a learning machine to perform accurately on new, unseen examples/tasks after having experienced a learning data set. Machine learning can include the following concepts and methods. Supervised learning concepts can include AODE; Artificial neural network, such as Backpropagation, Autoencoders, Hopfield networks, Boltzmann machines, Restricted Boltzmann Machines, and Spiking neural networks; Bayesian statistics, such as Bayesian network and Bayesian knowledge base; Case-based reasoning; Gaussian process regression; Gene expression programming; Group method of data handling (GMDH); Inductive logic programming; Instance-based learning; Lazy learning; Learning Automata; Learning Vector Quantization; Logistic Model Tree; Minimum message length (decision trees, decision graphs, etc.), such as Nearest Neighbor Algorithm and Analogical modeling; Probably approximately correct learning (PAC) learning; Ripple down rules, a knowledge acquisition methodology; Symbolic machine learning algorithms; Support vector machines; Random Forests; Ensembles of classifiers, such as Bootstrap aggregating (bagging) and Boosting (meta-algorithm); Ordinal classification; Information fuzzy networks (IFN); Conditional Random Field; ANOVA; Linear classifiers, such as Fisher's linear discriminant, Linear regression, Logistic regression, Multinomial logistic regression, Naive Bayes classifier, Perceptron, Support vector machines; Quadratic classifiers; k-nearest neighbor; Boosting; Decision trees, such as C4.5, Random forests, ID3, CART, SLIQ, SPRINT; Bayesian networks, such as Naive Bayes; and Hidden Markov models. 
Unsupervised learning concepts can include: Expectation-maximization algorithm; Vector Quantization; Generative topographic map; Information bottleneck method; Artificial neural network, such as Self-organizing map; Association rule learning, such as Apriori algorithm, Eclat algorithm, and FP-growth algorithm; Hierarchical clustering, such as Single-linkage clustering and Conceptual clustering; Cluster analysis, such as K-means algorithm, Fuzzy clustering, DBSCAN, and OPTICS algorithm; and Outlier Detection, such as Local Outlier Factor. Semi-supervised learning concepts can include: Generative models; Low-density separation; Graph-based methods; and Co-training. Reinforcement learning concepts can include: Temporal difference learning; Q-learning; Learning Automata; and SARSA. Deep learning concepts can include: Deep belief networks; Deep Boltzmann machines; Deep Convolutional neural networks; Deep Recurrent neural networks; and Hierarchical temporal memory.
In some embodiments, the present systems and methods relate to generating a trained classification model by assigning classification model weight and existing knowledge association score. In some embodiments, the variables used to train the machine comprise the expression level of biomolecules (e.g., protein) from a sample. In the present disclosure, the machine learning algorithms are “trained” by building a trained classification model from inputs. The inputs can be retrospective data with a known diagnosis of cancer (including matched controls) and data from measured biomarkers and clinical factors of those patients. In some embodiments, the inputs can be complex biomolecule sampling data from a test subject who has or is currently undergoing any form of treatment (e.g., cancer treatment). In some embodiments, the subject is diagnosed with or suspected of having or developing a disease or disorder. In some embodiments, the disease is cancer. In some embodiments, the subject is in cancer remission.
In some embodiments, the surface-activity relationships (i.e., intermolecular interaction modeling) of the biomarkers identified and the corona which captured them are stored in a relational database and applied to a graph convolutional neural network (GCNN) to rationally design coronas with features configured to target specific biomarkers which may be relevant to a disease state.
In some embodiments, broad range sampling of complex biomolecules without any prior depletion generates a large training data set and test data set of disease labeled biomolecules (e.g., protein) across many patient samples. The trained machine learning algorithm can assign a classification model weight to each biomolecule (e.g, protein) in the test data set and all known biomolecules for a specified biological state (e.g., cancer). The classification strength of the biomolecules can be determined by assigning an existing knowledge association score (
It is understood that the use of the biomarkers identified by the presently described methods for, e.g., early detection of cancer and/or monitoring of a treatment response to a therapy is based, at least in part, on (1) an identification of a specified biological state (e.g., a type of cancer), (2) validated biomarkers that are associated with the specified biological state, (3) clinical parameter data, and, in some cases, (4) publicly available data including risk factors for having the cancer. Validation of the biomarkers identified in the present methods can be provided by analyzing retrospective samples (e.g., cancer samples) along with age-matched normal (e.g., non-cancer) samples and/or other controls.
Sensitivity and specificity are statistical measures of the performance of a binary classification test, also known in statistics as a classification function. Sensitivity measures the proportion of positives that are correctly identified as such (e.g., the percentage of subjects with a disease or disorder who are correctly identified as having the disease or disorder). Specificity measures the proportion of negatives that are correctly identified as such (e.g., the percentage of subjects without the disease or disorder who are correctly identified as not having the disease or disorder). Sensitivity quantifies the avoidance of false negatives, and specificity does the same for false positives.
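The two definitions above can be illustrated with a short sketch that counts true and false positives and negatives from paired labels. The function name and the example labels (1 = diseased, 0 = not diseased) are hypothetical.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded as 1 and 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative subject")
    sensitivity = tp / (tp + fn)  # positives correctly identified
    specificity = tn / (tn + fp)  # negatives correctly identified
    return sensitivity, specificity


# Illustrative labels: 4 diseased subjects, 6 non-diseased subjects.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)  # 3 of 4 positives found; 4 of 6 negatives found
```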
Sensitivity and specificity are prevalence-independent test characteristics, as their values are intrinsic to the test and do not depend on the disease prevalence in the population of interest. Positive and negative predictive values, but not sensitivity or specificity, are values influenced by the prevalence of disease in the population that is being tested. A Bayesian clinical diagnostic model can demonstrate the positive and negative predictive values as a function of the prevalence, the sensitivity, and the specificity. Bayesian inference is a method of statistical inference in which Bayes' rule is used to update the probability that a hypothesis is correct as evidence is added. In clinical medicine, Bayesian methods are used to establish the probability that a patient has a particular condition given the results of the test used and the prevalence of the condition in the population tested. The probability that the subject has the condition is largely dependent on the frequency of the condition in the population tested (prevalence). This applies even if the test has a high probability of being correct (sensitivity) and a high probability of identifying the patients that do not have the condition (specificity).
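The dependence of the predictive values on prevalence follows directly from Bayes' rule and can be sketched as below. The function name and the example values (90% sensitivity and specificity at 1% and 50% prevalence) are chosen only for illustration.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from Bayes' rule.

    PPV = P(disease | positive test), NPV = P(no disease | negative test).
    """
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Same test, two populations: rare condition versus common condition.
ppv_rare, npv_rare = predictive_values(0.90, 0.90, prevalence=0.01)
ppv_common, npv_common = predictive_values(0.90, 0.90, prevalence=0.50)
print(round(ppv_rare, 3), round(npv_rare, 3))      # about 0.083 and 0.999
print(round(ppv_common, 3), round(npv_common, 3))  # 0.9 and 0.9
```

At 1% prevalence, most positive results are false positives even though the test is 90% sensitive and 90% specific, which is why prevalence matters when interpreting a screening result.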
The tradeoff between specificity and sensitivity can be represented graphically using a receiver operating characteristic (ROC) curve. A ROC curve is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. ROC curve analysis is often applied to measure the diagnostic accuracy of a biomarker. The analysis yields two results: the diagnostic accuracy of the biomarker and the optimal cut-point value. The ROC curve is a mapping of the sensitivity versus 1 − specificity for all possible values of the cut-point between cases and controls.
To measure the diagnostic ability of a biomarker, it is common to use summary measures such as the area under the ROC curve (AUC) and/or the partial area under the ROC curve (pAUC). A biomarker with AUC=1 discriminates individuals perfectly as diseased or healthy. Meanwhile, an AUC=0.5 means that there is no apparent distributional difference between the biomarker values of the two groups. ROC analysis provides two main outcomes: the diagnostic accuracy of the test and the optimal cut-point value for the test. Cut-points dichotomize the test values, so this provides the diagnosis (diseased or not). The identification of the cut-point value requires a simultaneous assessment of sensitivity and specificity. A cut-point is referred to as optimal when the point classifies most of the individuals correctly. AUC, sensitivity, and specificity values are useful for the evaluation of a marker; however, they do not specify “optimal” cut-points directly. The ROC curve is created by plotting the true positive rate (sensitivity) against the false positive rate (1 − specificity) at various threshold settings. The true-positive rate is also known as sensitivity, recall, or probability of detection in machine learning. The false-positive rate is also known as the fall-out or probability of false alarm and can be calculated as (1 − specificity). The ROC curve is thus the sensitivity as a function of fall-out. In general, if the probability distributions for both detection and false alarm are known, the ROC curve can be generated by plotting the cumulative distribution function (area under the probability distribution from −∞ to the discrimination threshold) of the detection probability on the y-axis versus the cumulative distribution function of the false-alarm probability on the x-axis. ROC analysis provides tools to select possibly optimal models and to discard suboptimal ones independently from (and prior to specifying) the cost context or the class distribution. 
ROC analysis is related in a direct and natural way to cost/benefit analysis of diagnostic decision making.
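The construction of a ROC curve, its AUC, and a cut-point can be sketched as follows. The sketch assumes a sample is called positive when its biomarker value is at or above the threshold, computes the AUC with the trapezoidal rule, and picks the cut-point that classifies the most individuals correctly. The function names and the four-sample example are hypothetical.

```python
def roc_points(values, labels):
    """Return (false positive rate, true positive rate, threshold) triples.

    A sample is called positive when its value >= threshold (assumption).
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0, float("inf"))]  # threshold above every value
    for thr in sorted(set(values), reverse=True):
        tp = sum(1 for v, l in zip(values, labels) if v >= thr and l == 1)
        fp = sum(1 for v, l in zip(values, labels) if v >= thr and l == 0)
        points.append((fp / neg, tp / pos, thr))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0, _), (x1, y1, _) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def best_cut_point(values, labels):
    """Threshold that classifies the largest fraction of samples correctly."""
    best_thr, best_acc = None, -1.0
    for thr in sorted(set(values)):
        correct = sum(1 for v, l in zip(values, labels) if (v >= thr) == (l == 1))
        acc = correct / len(labels)
        if acc > best_acc:
            best_thr, best_acc = thr, acc
    return best_thr, best_acc


# Illustrative biomarker values for two controls (0) and two cases (1).
values = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]

points = roc_points(values, labels)
area = auc(points)
cut, accuracy = best_cut_point(values, labels)
print(area, cut, accuracy)
```

Maximizing accuracy is only one possible criterion; other criteria weigh sensitivity and specificity differently depending on the cost of false positives and false negatives.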
Area under the ROC curve (AUC) means a statistic to measure classifier performance, commonly used in machine learning applications, that encapsulates both the sensitivity and the specificity of the classifier. In a ROC curve, the true positive rate (sensitivity) is plotted as a function of the false positive rate (1 − specificity) for different cut-off points of a parameter. Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold. The area under the ROC curve (AUC) is a measure of how well a parameter can distinguish between two diagnostic groups (diseased/normal).
Positive predictive value (PPV) and negative predictive value (NPV) are the proportions of positive and negative results in statistics and diagnostic tests that are true positive and true negative results, respectively. The PPV and NPV describe the performance of a diagnostic test or other statistical measure. A high result can be interpreted as indicating the accuracy of such a statistic. Although sometimes used synonymously, a negative predictive value generally refers to what is established by control groups, while a negative post-test probability rather refers to a probability for an individual. Still, if the individual's pre-test probability of the target condition is the same as the prevalence in the control group used to establish the negative predictive value, then the two are numerically equal.
The methods provided herein can identify one or more biomarkers with a significant classification model weight with little existing knowledge association for a specified biological state (e.g., a disease state) with a specificity of at least 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.1%, 99.2%, 99.3%, 99.4%, 99.5%, 99.6%, 99.7%, 99.8%, or 99.9%. The methods provided herein can identify one or more biomarkers with a significant classification model weight with little existing knowledge association for a specified biological state (e.g., a disease state) with a specificity about 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.1%, 99.2%, 99.3%, 99.4%, 99.5%, 99.6%, 99.7%, 99.8%, or 99.9%.
The methods provided herein can identify one or more biomarkers with a significant classification model weight with little existing knowledge association for a specified biological state (e.g., a disease state) with a sensitivity of at least 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.1%, 99.2%, 99.3%, 99.4%, 99.5%, 99.6%, 99.7%, 99.8%, or 99.9%. The methods provided herein can identify one or more biomarkers with a significant classification model weight with little existing knowledge association for a specified biological state (e.g., a disease state) with a sensitivity about 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.1%, 99.2%, 99.3%, 99.4%, 99.5%, 99.6%, 99.7%, 99.8%, or 99.9%.
In some applications, the one or more biomarkers with a significant classification model weight with little existing knowledge association for a specified biological state (e.g., a disease state) can indicate an increased likelihood of one or more of: a poor clinical outcome, good clinical outcome, high risk of disease, low risk of disease, complete response, partial response, stable disease, non-response, and recommended treatments for disease management.
The methods provided herein can identify one or more biomarkers with a significant classification model weight with little existing knowledge association for a specified biological state (e.g., a disease state) with a specificity greater than 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.1%, 99.2%, 99.3%, 99.4%, 99.5%, 99.6%, 99.7%, 99.8%, or 99.9% ROC. The methods provided herein can identify one or more biomarkers with a significant classification model weight with little existing knowledge association for a specified biological state (e.g., a disease state) with a sensitivity greater than 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.1%, 99.2%, 99.3%, 99.4%, 99.5%, 99.6%, 99.7%, 99.8%, or 99.9% ROC.
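For illustration only, sensitivity and specificity at a decision threshold, and the receiver operating characteristic (ROC) curve obtained by sweeping that threshold, can be sketched as follows. The labels, classifier scores, and function names are hypothetical, and the trapezoidal area calculation is one common convention rather than a procedure taken from the present disclosure.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(labels, scores):
    """(false positive rate, true positive rate) pairs, one per distinct score."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        sens, spec = sensitivity_specificity(labels, scores, threshold)
        points.append((1.0 - spec, sens))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as x-ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical data: 1 = specified biological state present, 0 = absent.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]

sens, spec = sensitivity_specificity(labels, scores, threshold=0.5)
auc = area_under_curve(roc_points(labels, scores))
```

Each threshold yields one sensitivity/specificity pair, so a stated sensitivity or specificity is meaningful only together with the operating point at which it was measured.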
The current disclosure provides computer systems for implementing any of the methods described herein. A computer system can be used to implement one or more steps including sample collection, sample processing, detecting, quantifying one or more biomolecules, generating profile data, comparing said data to a reference, generating a subject-specific biomolecule profile, comparing the subject-specific profile to a reference profile, receiving medical history, receiving medical records, receiving and storing data obtained by one or more methods described herein, analyzing said data, generating a report, and reporting results to a receiver.
Computer systems described herein can comprise computer-executable code for performing any of the algorithms described herein. Computer systems described herein can comprise computer-executable code for performing any of the algorithms and using the database as described herein.
The storage unit 115 can store files, such as subject reports, and/or communications with the caregiver, sequencing data, data about individuals, or any aspect of data associated with the present disclosure.
The server can communicate with one or more remote computer systems through the network 130. The one or more remote computer systems can be, for example, personal computers, laptops, tablets, telephones, smartphones, or personal digital assistants.
In some applications the computer system 100 includes a single server 101. In other situations, the system includes multiple servers in communication with one another through an intranet, extranet and/or the internet.
The server 101 can be adapted to store measurement data or a database as provided herein, patient information from the subject, such as, for example, polymorphisms, mutations, medical history, family history, demographic data and/or other clinical or personal information of potential relevance to a particular application. Such information can be stored on the storage unit 115 or the server 101 and such data can be transmitted through a network.
Methods as described herein can be implemented by way of machine (or computer processor) executable code (or software) stored on an electronic storage location of the server 101, such as, for example, on the memory 110, or electronic storage unit 115. During use, the code can be executed by the processor 105. In some cases, the code can be retrieved from the storage unit 115 and stored on the memory 110 for ready access by the processor 105. In some situations, the electronic storage unit 115 can be precluded, and machine-executable instructions are stored on memory 110. Alternatively, the code can be executed on a second computer system 140.
Aspects of the systems and methods provided herein, such as the server 101, can be embodied in programming. Various aspects of the technology can be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk. “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which can provide non-transitory storage at any time for the software programming. All or portions of the software can at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, can enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server. Thus, another type of media that can bear the software elements includes optical, electrical, and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links, or the like, also can be considered as media bearing the software. As used herein, unless restricted to non-transitory, tangible “storage” media, terms such as computer or machine “readable medium” can refer to any medium that participates in providing instructions to a processor for execution.
Computer systems described herein can comprise computer-executable code or instructions for performing any of the algorithms or algorithm-based methods described herein. In some applications, the algorithms described herein will make use of a memory unit that comprises at least one database.
Data relating to the present disclosure can be transmitted over a network or connections for reception and/or review by a receiver. The receiver can be but is not limited to the subject to whom the report pertains; or to a caregiver thereof, e.g., a health care provider, manager, other health care professional, or other caretaker; a person or entity that performed and/or ordered the analysis. The receiver can also be a local or remote system for storing such reports (e.g. servers or other systems of a “cloud computing” architecture). In one embodiment, a computer-readable medium includes a medium suitable for transmission of a result of an analysis of a biological sample using the methods described herein.
Computer systems disclosed herein can comprise a memory unit. The memory unit can be configured to receive data comprising data extracted from a public database and data from detecting, quantifying and profiling one or more biomolecules.
There are several public searchable metabolome databases and proteome databases known in the art. The present methods of the disclosure can be used with such public databases as well as proprietary databases. Examples of public databases include but are not limited to Open Targets (opentargets.org), Gene Ontology Consortium (geneontology.org), Plasma Proteome Database (plasmaproteomedatabase.org), METLIN (metlin.scripps.edu), Human Metabolome Database (hmdb.ca), Kyoto Encyclopedia of Genes and Genomes (genome.jp/kegg/), Biological Magnetic Resonance Bank (bmrb.wisc.edu/metabolomics/), Proteomics Identifications (PRIDE) (ebi.ac.uk/pride), and ProteomicsDB (proteomicsdb.org).
The Open Targets database calculates a disease association score for each protein based on evidence from various databases, scoring the available evidence on a scale from 0 (lowest level of disease association) to 1.0 (highest level of disease association). Non-limiting exemplary data sources used by the Open Targets database include GWAS Catalog (D. Welter et al., The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic acids research 42, D1001-D1006 (2013)); UniProt (U. Consortium, UniProt: a hub for protein information. Nucleic acids research, gku989 (2014)); Gene2Phenotype (C. F. Wright et al., Genetic diagnosis of developmental disorders in the DDD study: a scalable analysis of genome-wide research data. The Lancet 385, 1305-1314 (2015)); Cancer Gene Census (S. A. Forbes et al., COSMIC: exploring the world's knowledge of somatic mutations in human cancer. Nucleic acids research 43, D805-D811 (2014)); IntOGen (C. Rubio-Perez et al., In silico prescription of anticancer drugs to cohorts of 28 tumor types reveals targeting opportunities. Cancer cell 27, 382-396 (2015)); Europe PMC (E. P. Consortium, Europe PMC: a full-text literature database for the life sciences and platform for innovation. Nucleic acids research, gku1061 (2014)); and Reactome (D. Croft et al., The Reactome pathway knowledgebase. Nucleic acids research 42, D472-D477 (2013)).
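One way to picture how per-feature classification model weights and reference association scores (each on the 0 to 1.0 scale described above) can be combined and filtered is sketched below in Python. This is an illustrative assumption rather than the disclosed implementation: the feature names, numeric values, thresholds, and the choice to keep features with a weight magnitude at or above `w_min` and an association score at or below `s_max` are all hypothetical, as is treating a feature absent from the reference data as having a score of 0. No database API is called; the scores are supplied as a plain dictionary.

```python
def select_candidates(weights, scores, w_min, s_max, default_score=0.0):
    """Pair each feature's model weight with its reference score, then filter.

    weights: {feature: classification model weight}
    scores:  {feature: reference association score in [0, 1]}
    Features missing from `scores` receive `default_score` (an assumption).
    """
    combined = {
        feature: (weight, scores.get(feature, default_score))
        for feature, weight in weights.items()
    }
    # Keep features the model relies on but the reference data barely supports.
    return {
        feature: (w, s)
        for feature, (w, s) in combined.items()
        if abs(w) >= w_min and s <= s_max
    }


# Hypothetical model weights and reference association scores.
weights = {"protein_A": 0.82, "protein_B": 0.10, "protein_C": 0.65, "protein_D": 0.71}
scores = {"protein_A": 0.05, "protein_B": 0.02, "protein_C": 0.90}

candidates = select_candidates(weights, scores, w_min=0.5, s_max=0.2)
```

With these made-up inputs, `protein_A` and `protein_D` are retained, `protein_B` is dropped for its low weight, and `protein_C` is dropped because it already has a strong reference association. Reversing the second comparison would instead retain well-documented features.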
The Human Metabolome Database (HMDB) is a freely available electronic database containing detailed information about small molecule metabolites found in the human body. It is intended to be used for applications in metabolomics, clinical chemistry, biomarker discovery and general education. The database is designed to contain or link three kinds of data: 1) chemical data, 2) clinical data, and 3) molecular biology/biochemistry data. The database (version 3.5) contains 40,446 metabolite entries including both water-soluble and lipid-soluble metabolites as well as metabolites that would be regarded as either abundant (>1 µM) or relatively rare (<1 nM). Additionally, 5,235 protein (and DNA) sequences are linked to these metabolite entries. See Wishart, D. S., Tzur, D., Knox, C., et al., HMDB: the Human Metabolome Database, Nucleic Acids Res. 2007 January; 35(Database issue):D521-6; Wishart, D. S., Knox, C., Guo, A. C., et al., HMDB: a knowledgebase for the human metabolome, Nucleic Acids Res. 2009 37(Database issue):D603-610; Wishart, D. S., Jewison, T., Guo, A. C., Wilson, M., Knox, C., et al., HMDB 3.0—The Human Metabolome Database in 2013, Nucleic Acids Res. 2013. Jan. 1; 41(D1):D801-7. The database can be located on a central server containing the computer-executable code that allows access to a user. The user can connect to the central server through a physical connection or cloud-based connection depending on the application. In some applications a portion of the database and necessary executable code will be supplied to a user on appropriate storage media.
Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings. Volatile storage media include dynamic memory, such as main memory of such a computer platform. Tangible transmission media include coaxial cables, copper wire and fiber optics, including the wires that comprise a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer readable media therefore include for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.
The computer system can comprise computer-executable instructions for providing a report communicating the detecting, measuring, or determining of a sampling of complex biomolecules from a subject. Measuring or determining a sampling of complex biomolecules can include the use of a database as provided herein.
The computer system for a complex biomolecule sampling as provided herein can comprise a first memory unit for receiving a plurality of biomolecule sampling data, wherein the plurality of biomolecule sampling data comprises a first biomolecule sampling data from a first complex biological sample and a second biomolecule sampling data from a second complex biological sample, wherein the first complex biological sample is from one or more subjects with a specified biological state and the second complex biological sample is from one or more subjects without the specified biological state; a second memory unit for querying a known biomolecule data aggregator, wherein the known biomolecule data aggregator comprises all known biomolecules associated with the specified biological state; a first computer executable instruction for building a trained classification model by extracting a first feature of the first biomolecule sampling data and a second feature of the second biomolecule sampling data, wherein the trained classification model of the first feature and the second feature identifies one or more biomarkers linked to the specified biological state; and a second computer executable instruction for processing the trained classification model against the known biomolecule data aggregator and assigning a classification weight to all biomolecules, wherein said processing and assigning identifies one or more biomarkers linked to the specified biological state, wherein the one or more biomarkers confirm the specified biological state with a high sensitivity and a high specificity, wherein the one or more biomarkers are present in a low or non-recorded concentration in the first complex biological sampling data.
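To make the notion of a per-biomolecule classification weight concrete, the sketch below trains a small logistic regression by batch gradient descent and reads off one weight per feature. Logistic regression is swapped in here purely as an assumption for illustration; the disclosure does not state which classifier is used. The feature names, measurements, learning rate, and epoch count are likewise hypothetical.

```python
import math


def train_logistic_regression(X, y, lr=0.1, epochs=500):
    """Fit weights w and bias b by batch gradient descent on the log loss."""
    n_samples, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of state
            err = p - yi
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n_samples for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n_samples
    return w, b


def predict(w, b, x):
    """Return 1 if the linear score is positive (probability above 0.5), else 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0


# Hypothetical normalized levels of two proteins per sample.
features = ["protein_A", "protein_B"]
X = [[2.0, 0.1], [1.8, -0.2], [2.2, 0.0], [1.9, 0.2],   # with the state
     [0.1, 0.0], [0.2, 0.1], [-0.1, -0.1], [0.0, 0.2]]  # without the state
y = [1, 1, 1, 1, 0, 0, 0, 0]

w, b = train_logistic_regression(X, y)
weights = dict(zip(features, w))  # one classification weight per feature
predictions = [predict(w, b, x) for x in X]
```

In this toy data only `protein_A` separates the two groups, so it receives the larger weight; in practice weights would be estimated on held-out data and features would be scaled comparably before their weights are compared.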
The computer system herein can further comprise a third computer executable instruction for generating a report of the presence or absence of the specified biological state in a subject. The report can comprise a recommended treatment for disease management. The computer system can further comprise a user interface configured to communicate or display said report to a user.
Techniques and systems to produce broad range sampling of complex biomolecules (e.g., plasma proteome) without prior depletion and independent of, e.g., plasma protein concentration, and to generate a large training and test data set of disease-labeled biomolecule (e.g., protein) levels across many patient samples, are described in International Patent Application PCT/US2017/067013, which is herein incorporated by reference in its entirety.
In some embodiments, the system can comprise multi-particle enabled complex biomolecule sampling. In some embodiments, the particles are nanoparticles. In some embodiments, the particles are liposomes. The liposomes can comprise any lipid capable of forming a particle. In one embodiment, the liposome comprises one or more cationic lipids or anionic lipids, and one or more stabilizing lipids. Suitable liposomes are known in the art and include, but are not limited to, for example, DOPG (1,2-dioleoyl-sn-glycero-3-phospho-(1′-rac-glycerol)), DOTAP (1,2-dioleoyl-3-trimethylammonium-propane)-DOPE (dioleoylphosphatidylethanolamine), CHOL (DOPC-Cholesterol), and combinations thereof.
The lipid-based surface of a liposome can contact a subset of biomolecules (e.g., proteins) of a complex biological sample (e.g., plasma, or any sample having a complex mix of biomolecules such as proteins and nucleic acid and at least one of a polysaccharide and lipid) at a lipid-biomolecule (e.g. protein) interface, thereby binding the subset of proteins to produce a pattern of biomolecule (e.g. protein) binding.
In one embodiment, the liposome comprises a cationic lipid. As used herein, the term “cationic lipid” refers to a lipid that is cationic or becomes cationic (protonated) as the pH is lowered below the pK of the ionizable group of the lipid, but is progressively more neutral at higher pH values. At pH values below the pK, the lipid is then able to associate with negatively charged nucleic acids. In certain embodiments, the cationic lipid comprises a zwitterionic lipid that assumes a positive charge on pH decrease. In certain embodiments, the liposomes comprise cationic lipid. In some embodiments, cationic lipid comprises any of a number of lipid species which carry a net positive charge at a selected pH, such as physiological pH. Such lipids include, but are not limited to, N,N-dioleyl-N,N-dimethylammonium chloride (DODAC); N-(2,3-dioleyloxy)propyl)-N,N,N-trimethylammonium chloride (DOTMA); N,N-distearyl-N,N-dimethylammonium bromide (DDAB); N-(2,3-dioleoyloxy)propyl)-N,N,N-trimethylammonium chloride (DOTAP); 3-(N—(N′,N′-dimethylaminoethane)-carbamoyl)cholesterol (DC-Chol), 1,2-dioleoyl-3-dimethylammonium propane (DODAP), N,N-dimethyl-2,3-dioleoyloxy)propylamine (DODMA), N-(1,2-dimyristyloxyprop-3-yl)-N,N-dimethyl-N-hydroxyethyl ammonium bromide (DMRIE), 1,2-dioleoyl-sn-3-phosphoethanolamine (DOPE), N-(1-(2,3-dioleyloxy)propyl)-N-(2-(sperminecarboxamido)ethyl)-N,N-dimethylammonium trifluoroacetate (DOSPA), dioctadecylamidoglycyl carboxyspermine (DOGS), and 1,2-ditetradecanoyl-sn-glycero-3-phosphocholine (DMPC). The following lipids are cationic and have a positive charge below physiological pH: DODAP, DODMA, DMDMA, 1,2-dilinoleyloxy-N,N-dimethylaminopropane (DLinDMA), and 1,2-dilinolenyloxy-N,N-dimethylaminopropane (DLenDMA). In some embodiments, the lipid is an amino lipid.
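The pH dependence described for an ionizable cationic lipid can be approximated with the Henderson-Hasselbalch relation, under which the protonated (cationic) fraction of a single ionizable group is 1/(1 + 10^(pH − pK)). The Python sketch below assumes an idealized, isolated ionizable group; the pK value of 6.5 is a hypothetical example, and the apparent pK of a lipid in an assembled particle can differ from this idealization.

```python
def protonated_fraction(ph, pk):
    """Fraction of an ionizable amine group that is protonated (cationic).

    Henderson-Hasselbalch for a base: [BH+]/([B] + [BH+]) = 1 / (1 + 10**(pH - pK)).
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pk))


# Hypothetical ionizable lipid with pK = 6.5.
pk = 6.5
at_acidic_ph = protonated_fraction(5.5, pk)         # one unit below the pK
at_pk = protonated_fraction(6.5, pk)                # exactly at the pK
at_physiological_ph = protonated_fraction(7.4, pk)  # above the pK
```

In this idealized model the lipid is mostly cationic one pH unit below its pK, half protonated at the pK, and predominantly neutral at physiological pH, matching the qualitative behavior stated above.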
In certain embodiments, the liposome comprises one or more additional lipids which stabilize the particles during their formation. Suitable stabilizing lipids include neutral lipids and anionic lipids. The term “neutral lipid” refers to any one of a number of lipid species that exist in either an uncharged or neutral zwitterionic form at physiological pH. Representative neutral lipids include diacylphosphatidylcholines, diacylphosphatidylethanolamines, ceramides, sphingomyelins, dihydrosphingomyelins, cephalins, and cerebrosides. Exemplary neutral lipids include, for example, distearoylphosphatidylcholine (DSPC), dioleoylphosphatidylcholine (DOPC), dipalmitoylphosphatidylcholine (DPPC), dioleoylphosphatidylglycerol (DOPG), dipalmitoylphosphatidylglycerol (DPPG), dioleoyl-phosphatidylethanolamine (DOPE), palmitoyloleoylphosphatidylcholine (POPC), palmitoyloleoyl-phosphatidylethanolamine (POPE) and dioleoyl-phosphatidylethanolamine 4-(N-maleimidomethyl)-cyclohexane-1-carboxylate (DOPE-mal), dipalmitoyl phosphatidyl ethanolamine (DPPE), dimyristoylphosphoethanolamine (DMPE), distearoyl-phosphatidylethanolamine (DSPE), 16-O-monomethyl PE, 16-O-dimethyl PE, 18-1-trans PE, 1-stearoyl-2-oleoyl-phosphatidylethanolamine (SOPE), and 1,2-dielaidoyl-sn-glycero-3-phosphoethanolamine (transDOPE). In one embodiment, the neutral lipid is 1,2-distearoyl-sn-glycero-3-phosphocholine (DSPC).
The term “anionic lipid” refers to any lipid that is negatively charged at physiological pH. These lipids include phosphatidylglycerol, cardiolipin, diacylphosphatidylserine, diacylphosphatidic acid, N-dodecanoylphosphatidylethanolamines, N-succinylphosphatidylethanolamines, N-glutarylphosphatidylethanolamines, lysylphosphatidylglycerols, palmitoyloleoylphosphatidylglycerol (POPG), and other anionic modifying groups joined to neutral lipids. In certain embodiments, the liposome comprises glycolipids (e.g., monosialoganglioside GM1). In certain embodiments, the liposome comprises a sterol, such as cholesterol. In certain embodiments, the liposome comprises an additional stabilizing lipid which is a polyethylene glycol-lipid. Suitable polyethylene glycol-lipids include PEG-modified phosphatidylethanolamine, PEG-modified phosphatidic acid, PEG-modified ceramides (e.g., PEG-CerC14 or PEG-CerC20), PEG-modified dialkylamines, PEG-modified diacylglycerols, and PEG-modified dialkylglycerols. Representative polyethylene glycol-lipids include PEG-c-DOMG, PEG-c-DMA, and PEG-s-DMG. In one embodiment, the polyethylene glycol-lipid is N-[(methoxy poly(ethylene glycol)2000)carbamyl]-1,2-dimyristyloxypropyl-3-amine (PEG-c-DMA). In one embodiment, the polyethylene glycol-lipid is PEG-c-DOMG.
Suitable liposomes can be solid lipid nanoparticles (SLN) which can be made of solid lipid, emulsifier and/or water/solvent. SLN can include, but are not limited to, a combination of the following ingredients: triglycerides (tri-stearin), partial glycerides (Imwitor), fatty acids (stearic acid, palmitic acid), steroids (cholesterol), and waxes (cetyl palmitate). Various emulsifiers and their combinations (Pluronic F 68, F 127) have been used to stabilize the lipid dispersion. Suitable ingredients for use in preparing SLN sensor elements include, but are not limited to, e.g., phospholipids, glycerol, poloxamer 188, soy phosphatidyl choline, compritol, cetyl palmitate, PEG 2000, PEG 4500, Tween 85, ethyl oleate, Na alginate, ethanol/butanol, tristearin glyceride, PEG 400, isopropyl myristate, Pluronic F68, Tween 80, trimyristin, tristearin, trilaurin, stearic acid, glyceryl caprate as Capmul®MCM C10, theobroma oil, triglyceride coconut oil, 1-octadecanol, glycerol behenate as Compritol® 888 ATO, glycerol palmitostearate as Precirol® ATO 5, and cetyl palmitate wax and the like.
Suitable nanoparticles are known in the art and include, but are not limited to, for example, natural or synthetic polymers, copolymers, terpolymers (with the cores being composed of metals or inorganic oxides, including magnetic cores). Suitable polymeric nanoparticles include, but are not limited to, e.g., polystyrene, poly(lysine), chitosan, dextran, poly(acrylamide) and its derivatives such as N-isopropylacrylamide, N-tertbutylacrylamide, N,N-dimethylacrylamide, polyethylene glycol, poly(vinyl alcohol), gelatin, starch, degradable (bio)polymers, silica and the like.
In various embodiments, the core of the nanoparticles can include an organic particle, an inorganic particle, or a particle including both organic and inorganic materials. For example, the particles can have a core structure that is or includes a metal particle, a quantum dot particle, a metal oxide particle, or a core-shell particle. For example, the core structure can be or include a polymeric particle or a lipid-based particle, and the linkers can include a lipid, a surfactant, a polymer, a hydrocarbon chain, or an amphiphilic polymer. For example, the linkers can include polyethylene glycol or polyalkylene glycol, e.g., the first ends of the linkers can include a lipid bound to polyethylene glycol (PEG) and the second ends can include functional groups bound to the PEG. In these methods, the first or second functional groups can include an amine group, a maleimide group, a hydroxyl group, a carboxyl group, a pyridylthiol group, or an azide group.
In certain embodiments, the nanoparticles can comprise polymers that include, for example, a sodium polystyrene sulfonate (PSS), polyethylene oxide (PEO), polyoxyethylene glycol, polyethylene glycol (PEG), polyethylene imine (PEI), polylactic acid, polycaprolactone, polyglycolic acid, poly(lactide-co-glycolide) polymer (PLGA), cellulose ether polymer, polyvinylpyrrolidone, vinyl acetate, polyvinylpyrrolidone-vinyl acetate copolymer, polyvinyl alcohol (PVA), acrylate, polyacrylic acid (PAA), vinyl acetate, crotonic acid copolymers, polyacrylamide, polyethylene phosphonate, polybutene phosphonate, polystyrene, polyvinylphosphonate, polyalkylene, carboxy vinyl polymer, sodium alginate, carrageenan, xanthan gum, gum acacia, Arabic gum, guar gum, pullulan, agar, chitin, chitosan, pectin, karaya gum, locust bean gum, maltodextrin, amylose, corn starch, potato starch, rice starch, tapioca starch, pea starch, sweet potato starch, barley starch, wheat starch, hydroxypropylated high amylose starch, dextrin, levan, elsinan, gluten, collagen, whey protein isolate, casein, milk protein, soy protein, keratin, or a gelatin, or a copolymer, derivative, or mixture thereof.
In other embodiments, the polymer can be or include a polyethylene, polycarbonate, polyanhydride, polyhydroxyacid, polypropylfumarate, polycaprolactone, polyamide, polyacetal, polyether, polyester, poly(orthoester), polycyanoacrylate, polyvinyl alcohol, polyurethane, polyphosphazene, polyacrylate, polymethacrylate, polyurea, polystyrene, or a polyamine, or a copolymer, derivative, or mixture thereof.
In some embodiments, the present disclosure provides nanoparticles comprising biodegradable polymers. Non-limiting exemplary biodegradable polymers can be poly-β-amino-esters (PBAEs), poly(amido amines), polyesters including poly lactic-co-glycolic acid (PLGA), polyanhydrides, bioreducible polymers, and other biodegradable polymers. In some embodiments, the biodegradable polymer comprises 2-(3-aminopropylamino)ethanol end-modified poly(1,4-butanediol diacrylate-co-4-amino-1-butanol), 1-(3-aminopropyl)-4-methylpiperazine end-modified poly(1,4-butanediol diacrylate-co-4-amino-1-butanol), 2-(3-aminopropylamino)ethanol end-modified poly(1,4-butanediol diacrylate-co-5-amino-1-pentanol), 1-(3-aminopropyl)-4-methylpiperazine end-modified poly(1,4-butanediol diacrylate-co-5-amino-1-pentanol), 2-(3-aminopropylamino)ethanol end-modified poly(1,5 pentanediol diacrylate-co-3-amino-1-propanol), and 1-(3-aminopropyl)-4-methylpiperazine-end-modified poly(1,5 pentanediol diacrylate-co-3-amino-1-propanol).
The plurality of nanoparticles can be any suitable combination of two or more nanoparticles in which each nanoparticle provides a unique biomolecule corona signature. For example, the plurality of nanoparticles can include one or more liposomes and one or more nanoparticles described herein. In one embodiment, the plurality of nanoparticles can be a plurality of liposomes with varying lipid content and/or varying charges (cationic/anionic/neutral). In another embodiment, the plurality of nanoparticles can contain one or more nanoparticles made of the same material but of varying sizes and physicochemical properties.
The physicochemical properties include, but are not limited to the composition, size, surface charge, hydrophobicity, hydrophilicity, surface functionality (surface functional groups), surface topography, surface curvature and shape. The term composition encompasses the use of different types of materials and differences in the chemical and/or physical properties of materials, for example, conductivity of the material chosen between the sensor elements.
Surface curvature is generally determined by the nanoparticle size. Thus, at the nanometer scale, as the size of the particle changes, the surface curvature changes and affects the binding selectivity of the surface. For example, at a certain curvature, the surface of the particle may have a binding affinity for a specific type of biomolecule, whereas a different curvature will have a different binding affinity and/or a binding affinity for a different biomolecule. The curvature can be adjusted to create a plurality of sensor elements with altered affinity for different biomolecules. A sensor array can be created including a plurality of sensor elements having different curvatures (e.g. different sizes), which results in a plurality of sensor elements each with a different biomolecule corona signature.
In another embodiment, the plurality of nanoparticles can contain one or more nanoparticles made of differing materials (e.g. silica and polystyrene) with similar or varying sizes and/or physicochemical properties (e.g. modifications, for example, —NH2, —COOH functionalization). These combinations are provided purely as examples and are non-limiting to the scope of the present disclosure.
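The link between particle size and surface curvature can be illustrated for an idealized sphere, whose curvature is the reciprocal of its radius. The short Python sketch below assumes perfectly spherical particles; the diameters are hypothetical examples and the function name is chosen only for illustration.

```python
def sphere_curvature_per_nm(diameter_nm):
    """Curvature (1/radius, in nm^-1) of an idealized spherical particle."""
    radius_nm = diameter_nm / 2.0
    return 1.0 / radius_nm


# Hypothetical panel of particles differing only in size.
diameters_nm = [50.0, 100.0, 200.0]
curvatures = {d: sphere_curvature_per_nm(d) for d in diameters_nm}
```

Halving the diameter doubles the curvature, so a panel of otherwise identical particles of different sizes presents a graded series of curvatures to the biomolecules in a sample.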
Surface morphology may also be modified by methods such as patterning the surface to provide different affinities, engineering surface curvatures on multiple length scales and the like. Patterning the surface is provided by, for example, forming the sensor elements by block polymerization in which the at least two blocks have different chemistries, forming the nanoparticles using mixtures of at least two different polymers and phase separating the polymers during polymerization, and/or cross-linking the separate polymers following phase separation. Engineered surface curvature on multiple length scales is provided, for example, by employing Pickering emulsions (Sacanna et al. 2007) stabilized by finely divided particles for the synthesis of nanoparticles. In some embodiments, finely divided particles are selected from, for example, silicates, aluminates, titanates, metal oxides such as aluminum, silicon, titanium, nickel, cobalt, iron, manganese, chromium, or vanadium oxides, carbon blacks, and nitrides or carbides, e.g., boron nitride, boron carbide, silicon nitride, or silicon carbide, among others. In some embodiments, finely divided particles can comprise an inorganic material. In some embodiments, finely divided particles can comprise an organic material. In some embodiments, finely divided particles can comprise biomolecules such as protein-based particles and oligonucleotide-based particles (RNA and/or DNA). In some embodiments, finely divided particles are selected from, for example, superparamagnetic materials such as magnetite, maghemite, etc. In some embodiments, finely divided particles are selected from a polymer, a metal oxide, a plasmonic material, a biomolecule, a superparamagnetic material, magnetite, maghemite, a micelle, a liposome, iron oxide, graphene, silica, polystyrene, silver, gold, a quantum dot, palladium, platinum, titanium, and any combination thereof.
Nanoparticle elements may be functionalized to have different physicochemical properties. Suitable methods of functionalizing the sensor elements are known in the art and depend on the composition of the sensor element (e.g. gold, iron oxide, silica, silver, etc.), and include, but are not limited to, for example aminopropyl functionalized, amine functionalized, boronic acid functionalized, carboxylic acid functionalized, methyl functionalized, N-succinimidyl ester functionalized, PEG functionalized, streptavidin functionalized, methyl ether functionalized, triethoxylpropylaminosilane functionalized, thiol functionalized, PCP functionalized, citrate functionalized, lipoic acid functionalized, BPEI functionalized, carboxyl functionalized, hydroxyl functionalized, and the like. In one embodiment, the nanoparticles may be functionalized with an amine group (—NH2) or a carboxyl group (COOH). In some embodiments, the nanoscale sensor elements are functionalized with a polar functional group. Non-limiting examples of the polar functional group comprise a carboxyl group, a hydroxyl group, a thiol group, a cyano group, a nitro group, an ammonium group, an imidazolium group, a sulfonium group, a pyridinium group, a pyrrolidinium group, a phosphonium group or any combination thereof. In some embodiments, the functional group is an acidic functional group (e.g., sulfonic acid group, carboxyl group, and the like), a basic functional group (e.g., amino group, cyclic secondary amino group (such as pyrrolidyl group and piperidyl group), pyridyl group, imidazole group, guanidine group, etc.), a carbamoyl group, a hydroxyl group, an aldehyde group and the like. In some embodiments, the polar functional group is an ionic functional group.
Non-limiting examples of the ionic functional group comprise an ammonium group, an imidazolium group, a sulfonium group, a pyridinium group, a pyrrolidinium group, and a phosphonium group. In some embodiments, the sensor elements are functionalized with a polymerizable functional group. Non-limiting examples of the polymerizable functional group include a vinyl group and a (meth)acrylic group. In some embodiments, the functional group is pyrrolidyl acrylate, acrylic acid, methacrylic acid, acrylamide, 2-(dimethylamino)ethyl methacrylate, hydroxyethyl methacrylate and the like.
In other embodiments, the physicochemical properties of the nanoparticle may be modified by modification of the surface charge. For example, the surface can be modified to provide a net neutral charge, a net positive surface charge, a net negative surface charge, or a zwitterionic charge. The charge of the surface can be controlled either during synthesis of the element or by post-synthesis modification of the charge through surface functionalization. For polymeric nanoparticles, differences in charge can be obtained during synthesis by using different synthesis procedures, different charged comonomers, and in inorganic substances by having mixed oxidation states.
In some embodiments, the multi-particle enabled complex biomolecule sampling can comprise an array or a chip. In some embodiments, the multi-particle enabled complex biomolecule sampling can comprise, consist essentially of, or consist of a plurality of half particles of different geometric shapes which can be made by molding technology, 3D printing or 4D printing. Suitable half particles are known in the art, and include, but are not limited to half and partial particles in any geometric shape, for example, spheres, rods, triangles, cubes and combinations thereof. Suitably, in one embodiment, the plurality of half particles has different physicochemical properties made by 3-D printing.
In some embodiments, the particles of the multi-particle enabled complex biomolecule sampling are made by 3D or 4D printing. Suitable methods of 3D and 4D printing, including of nanoscale sensor elements, are known in the art. Suitable materials for 3D and 4D printing include, but are not limited to, e.g., plastics and synthetic polymers (e.g., poly-ethylene glycol-diacrylate (PEG-DA), poly(ε-caprolactone) (PCL), poly(propylene oxide) (PPO), poly(ethylene oxide) (PEO), etc.), metals, powders, glass, ceramics, and hydrogels. Suitable shapes made by 3D or 4D printing include, but are not limited to, for example, full or partial spheres (e.g., ¾ or half spheres), rods, cubes, triangles, or other geometrical or non-geometrical shapes.
3D printing techniques include, but are not limited to, microextrusion printing, inkjet bioprinting, laser-assisted bioprinting, stereolithography, omnidirectional printing, and stamp printing.
In some embodiments, the array comprises a substrate and a plurality of nanoparticles. Regardless of the identity of the plurality of nanoparticles, this disclosure can be embodied by a matrix of plurality of nanoparticles immobilized on, connected with and/or coupled to a solid substrate. The substrate can comprise, consist essentially of or consist of polydimethylsiloxane (PDMS), silica, gold or gold coated substrate, silver or silver coated substrate, platinum or platinum coated substrate, zinc or zinc coated substrate, carbon coated substrate and the like. One skilled in the art would be able to select an appropriate substrate for the plurality of nanoparticles. In some embodiments, the plurality of nanoparticles and the substrate are made of the same element, for example, gold. In some embodiments, the substrate and nanoparticles form a chip.
In some embodiments, the array comprises a single surface, plate or chip containing two or more discrete sensor elements (regions) with topological differences that allow for discrete biomolecule corona formation at each discrete element (region). The surface, plate or chip can be fabricated to include the two or more discrete elements (regions) by the methods described herein. The discrete regions can be raised surfaces of differing geometric shapes, differing sizes or differing charges, or other topological differences that result in discrete sensor elements with the ability to form discrete biomolecule coronas.
In some embodiments, the plurality of nanoparticles are non-covalently attached to the substrate. Suitable methods of non-covalent attachment are known in the art and include, but are not limited to, for example, metal coordination, charge interaction, hydrophobic-hydrophobic interaction, chelation and the like. In other embodiments, the plurality of nanoparticles are covalently attached to the substrate. Suitable methods of covalently linking the plurality of nanoparticles and the substrates include, but are not limited to, for example, click chemistry, irradiation, and the like.
For example, the plurality of nanoparticles can be conjugated to a substrate (e.g. silica substrate) via the amidation reaction between the amino groups on silica substrate surface and carboxylic acid groups on nanoparticle surface, via the ring-opening reaction between the epoxy groups on silica substrate surface and amino groups on nanoparticle surface, via the Michael Addition reaction between the maleimide groups on silica substrate surface and thiol or amino groups on nanoparticle surface, via the urethane reaction between the isocyanate groups on silica substrate surface and hydroxyl or amino groups on nanoparticle surface, via the oxidation reaction between the thiol groups on silica substrate surface and the ones on nanoparticle surface, via the “Click” chemistry between azide groups on silica substrate surface and alkyne groups on nanoparticle surface, via the thiol exchange reaction between 2-pyridyldithiol groups on silica substrate surface and thiol groups on nanoparticle surface, via the coordination reaction between boronic acid groups on silica substrate surface and diol groups on nanoparticle surface, via the UV light-irradiated addition reaction between C═C bonds on substrate surface and C═C bonds on nanoparticle surface and the like. 
Suitable methods of conjugating sensor elements to a gold substrate are known in the art and include, for example, conjugation via Au-thiol bonds, via the amidation reaction between the carboxylic acid groups on gold substrate surface and the amino groups on nanoparticle surface, via “Click” chemistry between the azide groups on gold substrate surface and the alkyne groups on nanoparticle surface, via urethane reaction between the NHS groups on gold substrate surface and the amino groups on nanoparticle surface, via the ring-opening reaction between the epoxy groups on gold substrate surface and amino groups on nanoparticle surface, via the coordination reaction between boronic acid groups on gold substrate surface and diol groups on nanoparticle surface, via the UV light-irradiated addition reaction between C═C bonds on gold substrate surface and C═C bonds on nanoparticle surface, via the “Ligand-Receptor” interaction between biotin on gold substrate surface and avidin on nanoparticle surface, via the “Host-Guest” interaction between α-cyclodextrin (α-CD) on gold substrate surface and adamantane (Ad) on nanoparticle surface, and the like.
The plurality of nanoparticles can be attached to the substrate randomly or in a distinct pattern. The plurality of nanoparticles can be substantially uniformly positioned. The pattern of the arranged plurality of nanoparticles can vary according to the pattern in which the plurality of nanoparticles are attached to the substrate. Each nanoparticle is separated from its neighbors by a distance. The distance between the nanoparticles arranged on the substrate can vary depending on the length of the linker used for attachment or on other fabrication conditions. According to various embodiments, the plurality of nanoparticles on the array can be fabricated having a desired inter-element distance and pattern. Suitable distinct patterns are known in the art, including, but not limited to, parallel lines, squares, circles, triangles and the like. Further, the nanoparticles can be arranged in rows or columns. In some embodiments, the substrate is a flat substrate; in other embodiments, the substrate is in the form of microchannels or nanochannels. The plurality of nanoparticles can be contained within microchannels or nanochannels that restrict or control the flow of the sample through the sensor array. Suitable microchannels can range from about 10 μm to about 100 μm in size.
In some embodiments, a channel is formed by lithography, etching, embossing, or molding of a polymeric surface. In general, the fabrication process can involve one or more of any of these processes, and different parts of the array can be fabricated using different methods and assembled or bonded together.
Lithography involves use of light or other form of energy such as electron beam to change a material. Typically, a polymeric material or precursor (e.g. photoresist, a light-resistant material) is coated on a substrate and is selectively exposed to light or other form of energy. Depending on the photoresist, exposed regions of the photoresist either remain or are dissolved in subsequent processing steps known generally as “developing.” This process results in a pattern of the photoresist on the substrate. In some embodiments, the photoresist is used as a master in a molding process. In some embodiments, a polymeric precursor is poured on the substrate with photoresist, polymerized (i.e. cured) and peeled off.
In some embodiments, the photoresist is used as a mask for an etching process. For example, after patterning photoresist on a silicon substrate, channels can be etched into the substrate using a deep reactive ion etch (DRIE) process or other chemical etching process known in the art (e.g. plasma etch, KOH etch, HF etch, etc.). The photoresist is removed, and the substrate is bonded to another substrate using one of any bonding procedures known in the art (e.g. anodic bonding, adhesive bonding, direct bonding, eutectic bonding, etc.). Multiple lithographic and etching steps and machining steps such as drilling can be included as required.
In some embodiments, a polymeric substrate can be heated and pressed against a master mold for an embossing process. The master mold can be formed by a variety of processes, including lithography and machining. The polymeric substrate is then bonded with another substrate to form channels and/or a mixing apparatus. Machining processes can be included if necessary.
In some embodiments, a molten polymer or metal or alloy is injected into a suitable mold and allowed to cool and solidify for an injection molding process. The mold typically consists of two parts that allow the molded component to be removed. Parts thus manufactured can be bonded to result in the substrate.
In some embodiments, a sacrificial etch can be used to form channels. Lithographic techniques can be used to pattern a first material on a substrate. This first material is covered by a second material of different chemical nature. The second material can undergo lithography and etch processes, or other machining processes. The substrate is then exposed to a chemical agent that selectively removes the first material. Channels are formed in the second material, leaving voids where the first material was present before the etch process.
In some embodiments, microchannels are directly machined into a substrate by laser machining or CNC machining. Several layers thus machined can be bonded together to obtain the final substrate. In some embodiments, the width or height of each channel ranges from approximately 1 μm to approximately 1000 μm. In some embodiments, the width or height of each channel ranges from approximately 5 μm to approximately 500 μm. In some embodiments, the width or height of each channel ranges from approximately 10 μm to approximately 100 μm. In some embodiments, the width or height of each channel ranges from approximately 25 μm to approximately 100 μm. In some embodiments, the width or height of each channel ranges from approximately 50 μm to approximately 100 μm. In some embodiments, the width or height of each channel ranges from approximately 75 μm to approximately 100 μm. In some embodiments, the width or height of each channel ranges from approximately 10 μm to approximately 75 μm. In some embodiments, the width or height of each channel ranges from approximately 10 μm to approximately 50 μm. In some embodiments, the width or height of each channel ranges from approximately 10 μm to approximately 25 μm.
In some embodiments, the maximum width or height of a channel is approximately 1 μm, approximately 5 μm, approximately 10 μm, approximately 20 μm, approximately 30 μm, approximately 40 μm, approximately 50 μm, approximately 60 μm, approximately 70 μm, approximately 80 μm, approximately 90 μm, approximately 100 μm, approximately 250 μm, approximately 500 μm, or approximately 1000 μm.
In some embodiments, the width of each channel ranges from approximately 5 μm to approximately 100 μm. In some embodiments, the width of a channel is approximately 5 μm, approximately 10 μm, approximately 15 μm, approximately 20 μm, approximately 25 μm, approximately 30 μm, approximately 35 μm, approximately 40 μm, approximately 45 μm, approximately 50 μm, approximately 60 μm, approximately 70 μm, approximately 80 μm, approximately 90 μm, or approximately 100 μm.
In some embodiments, the height of each channel ranges from approximately 10 μm to approximately 1000 μm. In some embodiments, the height of a channel is approximately 10 μm, approximately 100 μm, approximately 250 μm, approximately 400 μm, approximately 500 μm, approximately 600 μm, approximately 750 μm, or approximately 1000 μm. In specific embodiments, the height of the channel(s) through which the sample flows is approximately 500 μm. In some embodiments, the height of the channel(s) through which the sample flows is approximately 500 μm.
In some embodiments, the length of each channel ranges from approximately 100 μm to approximately 10 cm. In some embodiments, the length of a channel is approximately 100 μm, approximately 1.0 mm, approximately 10 mm, approximately 100 mm, approximately 500 mm, approximately 600 mm, approximately 700 mm, approximately 800 mm, approximately 900 mm, approximately 1.0 cm, approximately 1.1 cm, approximately 1.2 cm, approximately 1.3 cm, approximately 1.4 cm, approximately 1.5 cm, approximately 5 cm, or approximately 10 cm. In some embodiments, the length of the channel(s) through which the sample flows is approximately 1.0 cm.
Suitable times for incubating the array include at least a few seconds, e.g., at least 10 seconds to about 24 hours, for example at least about 10 seconds, at least about 15 seconds, at least about 20 seconds, at least about 25 seconds, at least about 30 seconds, at least about 40 seconds, at least about 50 seconds, at least about 60 seconds, at least about 90 seconds, at least about 2 minutes, at least about 3 minutes, at least about 4 minutes, at least about 5 minutes, at least about 6 minutes, at least about 7 minutes, at least about 8 minutes, at least about 9 minutes, at least about 10 minutes, at least about 15 minutes, at least about 20 minutes, at least about 25 minutes, at least about 30 minutes, at least about 45 minutes, at least about 50 minutes, at least about 60 minutes, at least about 90 minutes, at least about 2 hours, at least about 3 hours, at least about 4 hours, at least about 5 hours, at least about 6 hours, at least about 7 hours, at least about 8 hours, at least about 9 hours, at least about 10 hours, at least about 12 hours, at least about 14 hours, at least about 15 hours, at least about 16 hours, at least about 17 hours, at least about 18 hours, at least about 19 hours, at least about 20 hours, and include any time and increment in between (e.g., 10 seconds, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60 seconds, etc.; 1 minute, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60 minutes, etc.; 1 hour, 2 hours, 3 hours, 4 hours, 5 hours, 6 hours, 7 hours, 8 hours, etc.).
Further, the temperature at which the assay is performed can be determined by one skilled in the art, and includes temperatures from about 4° C. to about 40° C., alternatively from about 4° C. to about 20° C., alternatively from about 10° C. to about 15° C., alternatively from about 10° C. to about 40° C., for example, at about 4° C., about 5° C., about 6° C., about 7° C., about 8° C., about 9° C., about 10° C., about 11° C., about 12° C., about 13° C., about 14° C., about 15° C., about 16° C., about 17° C., about 18° C., about 19° C., about 20° C., about 21° C., about 22° C., about 25° C., about 30° C., about 35° C., about 37° C., etc. Suitably, the assay may be performed at room temperature (e.g. around about 37° C., for example from about 35° C. to about 40° C.).
Aspects of the present disclosure that are described with respect to methods can be utilized in the context of the sensor array or kits discussed in this disclosure. Similarly, aspects of the present disclosure that are described with respect to the sensor array and methods can be utilized in the context of the kits, and aspects of the present disclosure that are described with respect to kits can be utilized in the context of the methods and sensor array.
This disclosure provides kits. The kits can be suitable for use in the methods described herein. Suitable kits include a kit for determining a biomolecule fingerprint for a sample comprising a sensor array as described herein. In one aspect, the kit provides a sensor array comprising at least two sensor elements which have differing physicochemical properties from each other. In some aspects, the kit provides a comparative panel of biomolecule fingerprints in order to use the biomolecule fingerprint to determine a disease state for the subject. In some aspects, instructions on how to determine the biomolecule fingerprint are included. In some suitable embodiments, the sensor arrays are provided as chip arrays in the kit.
In other aspects, kits for determining a disease state of a subject or diagnosing or prognosing a disease in a subject are provided. Suitable kits include a sensor array comprising at least two sensor elements which have differing physicochemical properties from each other to determine a biomolecule fingerprint. The kit may further include a comparative panel of biomolecule fingerprints of different disease states or different diseases or disorders. Instructions on determining and analyzing the biomolecule fingerprint are provided.
It should be apparent to those skilled in the art that many additional modifications besides those already described are possible without departing from the inventive concepts. In interpreting this disclosure, all terms should be interpreted in the broadest possible manner consistent with the context. Variations of the term “comprising” should be interpreted as referring to elements, components, or steps in a non-exclusive manner, so the referenced elements, components, or steps may be combined with other elements, components, or steps that are not expressly referenced. Embodiments referenced as “comprising” certain elements are also contemplated as “consisting essentially of” and “consisting of” those elements. The terms “consisting essentially of” and “consisting of” should be interpreted in line with the MPEP and relevant Federal Circuit interpretation. The transitional phrase “consisting essentially of” limits the scope of a claim to the specified materials or steps “and those that do not materially affect the basic and novel characteristic(s)” of the claimed invention. “Consisting of” is a closed term that excludes any element, step or ingredient not specified in the claim.
These examples are provided for illustrative purposes only and not to limit the scope of the claims provided herein.
Multi-nanoparticle proteomic sampling was performed with three different cross-reactive liposomes with various surface charges (i.e., anionic DOPG (1,2-dioleoyl-sn-glycero-3-phospho-(1′-rac-glycerol)), cationic DOTAP (1,2-dioleoyl-3-trimethylammonium-propane)-DOPE (dioleoylphosphatidylethanolamine), and neutral CHOL (DOPC-cholesterol)), as described in International Patent Application PCT/US2017/067013, which is herein incorporated by reference in its entirety. The protein composition at the surface of the three liposomes was evaluated by liquid chromatography-mass spectrometry (LC-MS/MS), in which the abundance of 850+ known proteins was defined. The bottom plot of
Machine learning classification algorithms (e.g., Partial Least Squares, Logistic Regression, Support Vector Classifier, Nearest Neighbor, or Random Forest) were applied to these proteins to build a trained classification model. The features of individual proteins were extracted by the trained classification model, and the associated classification weights were generated and stored as a set of data. Another data set was created by querying data aggregators such as Open Targets, the Gene Ontology Consortium, and commercial options for all known biomolecules (e.g., proteins or expressing genes) connected with cancer and their respective association scores. For the set of potential biomarkers, the classification strength of each protein was compared with the strength of its association with the labeled diseases in public and private databases. The lower right quadrant of
The top plot of
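By way of non-limiting illustration, the comparison of classification weights with database association scores described in this example can be sketched in Python as follows. This is a minimal sketch under stated assumptions and not the implementation used to generate the data above: the function name `select_biomarkers`, the `Candidate` record, the placeholder protein identifiers, the numeric values, and the threshold values are all hypothetical, and the use of the magnitude of the weight (rather than its signed value) as the classification strength is an assumption.

```python
from typing import Dict, List, NamedTuple


class Candidate(NamedTuple):
    feature: str   # protein identifier (f)
    weight: float  # classification weight from the trained model (w)
    score: float   # association score from the reference data set (s)


def select_biomarkers(
    weights: Dict[str, float],
    scores: Dict[str, float],
    weight_threshold: float,
    score_threshold: float,
) -> List[Candidate]:
    """Pair each feature's weight with its association score and keep the
    pairs that meet both thresholds.

    Assumption: classification strength is taken as abs(weight), so that
    strongly negative weights also count as strong classifiers.
    """
    selected = []
    for feature, w in weights.items():
        if feature not in scores:
            # No association score was found in the reference data set.
            continue
        s = scores[feature]
        if abs(w) >= weight_threshold and s >= score_threshold:
            selected.append(Candidate(feature, w, s))
    # Strongest classifiers first.
    return sorted(selected, key=lambda c: abs(c.weight), reverse=True)


# Hypothetical, made-up values for illustration only.
example_weights = {"PROT_A": 0.82, "PROT_B": -0.65, "PROT_C": 0.10, "PROT_D": 0.71}
example_scores = {"PROT_A": 0.90, "PROT_B": 0.20, "PROT_C": 0.95}

example_selection = select_biomarkers(
    example_weights, example_scores, weight_threshold=0.5, score_threshold=0.5
)
```

In this sketch, a feature is retained only when both its classification strength and its association score meet their respective thresholds. Features with a high weight but a low or absent association score are excluded here, although in practice such features could be reviewed separately as candidates not yet represented in the reference databases.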
This application is a Continuation of International Application No. PCT/US2019/028809, filed Apr. 23, 2019, which claims the benefit of U.S. Provisional Application No. 62/661,388, filed Apr. 23, 2018, and U.S. Provisional Application No. 62/824,281, filed Mar. 26, 2019, each of which is incorporated herein by reference in its entirety for all purposes.
Number | Date | Country
---|---|---
62661388 | Apr 2018 | US
62824281 | Mar 2019 | US

Relation | Number | Date | Country
---|---|---|---
Parent | PCT/US2019/028809 | Apr 2019 | US
Child | 17068135 | | US