This application claims benefit to and priority of LU Patent Application No. LU102007 “DEVICE AND METHOD FOR DETECTING PEPTIDES AND PROTEINS IN A FLUID SAMPLE”, filed on 20 Aug. 2020.
The invention relates to a device for detecting and quantifying molecules with amino acid residues, such as peptides and proteins, in a liquid dispersion sample.
Disease biomarkers are indicators that can be used for diagnostic, prognostic, and even therapeutic purposes. Several molecules with amino acid residues, i.e., proteins and peptides, have been identified as biomarkers that are abnormally expressed during the development of diseases. Some of these proteins and peptides have been in use in clinically relevant environments for a long time. For example, in Alzheimer's disease (AD), the most validated pathological biomarkers include the Amyloid-beta (Aβ) peptides, total Tau (T-tau) protein and the hyperphosphorylated form of Tau (P-tau) protein. The detection and quantification of these biomarkers in the biofluids of patients, particularly in cerebrospinal fluid (CSF), is a routine clinical procedure to detect AD since the biomarkers allow a non-invasive diagnosis of the disease.
Regarding Parkinson's disease (PD), the most common disease biomarker is related to abnormal aggregates of alpha-synuclein (α-syn) protein, which lead to the development of Lewy bodies and contribute to the disease progression. Other neurodegenerative disorders that can be detected based on the levels of the pathological prion protein (PrPSc) are prion diseases, for example, Creutzfeldt-Jakob disease (CJD) [1].
Over the past decades, some clinically relevant biomarkers have also been identified and used in cancer research. One of the most used biomarkers for cancer screening is the prostate-specific antigen (PSA) which is a protein produced in the prostate. The PSA test is often made before more invasive tests are carried out to determine the extent of the cancer in the patient.
Further testing of the patient is indicated only when high levels of PSA are found in the blood. Other well-known tumour biomarkers are cancer-associated antigens, such as CA 125, which is a serum-based marker for ovarian cancer, CA 15-3, a cancer antigen biomarker for breast cancer or CA 19-9, a marker for pancreatic cancer. The carcinoembryonic antigen or CEA is also tested in a variety of cancers, for instance in colorectal cancer. These biomarkers are mainly used to understand the disease progression and evaluate how a patient is responding to the treatment [2].
There are numerous biomarkers already well established within the context of cardiovascular diseases (CVDs), as well as associated with its pathophysiological processes. As an example, blood levels of natriuretic peptides (NP), in particular the B-type natriuretic peptide (BNP) and N-terminal pro-B-type natriuretic peptide (NT-proBNP), are promptly measured in patients with heart failure (HF). These peptides have been contributing to the rapid diagnosis and evaluation of the HF treatment and the disease progression, particularly in emergency scenarios [3]. On the other hand, the gold standard biomarkers used in case of suspicion of acute coronary syndrome (ACS)—resulting for instance from a myocardial infarction (MI)—are regulatory proteins, namely cardiac troponin I (cTnI) and T (cTnT). C-reactive protein (CRP) is an ideal biomarker for inflammation and can be typically associated with a variety of diseases but also works as a good indicator for CVDs development [4].
Clinicians and researchers face a huge challenge when analysing the protein levels of the biomarkers in different samples of biofluids. Currently, the measurements are made using different types of instruments, techniques and protocols, which causes variation in the results obtained by different research institutes, hospitals, and associated laboratories. Thus, there is currently no consensus regarding the physiological concentrations of the tested molecules with amino acid residues (peptides or proteins) in the biofluid samples of different patients.
Despite this difference, the criteria for diagnosing a patient suffering, for example, from AD are well established. The criteria are based on the levels of the biomarkers (Amyloid-beta (Aβ) peptides, total Tau (T-tau) protein and the hyperphosphorylated form of Tau (P-tau) protein) found in samples of the CSF of patients. Increased levels of both the T-tau and P-tau forms and a decrease of the Aβ1-42 form and of the Aβ1-42/Aβ1-40 ratio are a common phenotype among AD patients. The decreased levels of Aβ1-42 in the CSF of AD patients, when compared with healthy controls, can be explained by the increase of senile plaque aggregates in the brain, which reduces the amount of peptide that diffuses to the CSF. Indeed, some studies indicate that CSF Aβ1-42 levels correlate with amyloid deposition confirmed by PET imaging and that the CSF Aβ1-42/Aβ1-40 ratio is an even more suitable marker for correlation with amyloid PET status [5-6].
Aβ and Tau concentrations are around 10 to 100 times lower in plasma and/or in serum than in the CSF, which means that these biomarkers must be measured in lower ranges in the samples taken from the bloodstream. Nevertheless, T-tau and P-tau protein levels in the samples from the bloodstream are also found to be elevated in AD patients and may therefore also reflect an association with AD pathology. However, further studies are required to provide more evidence that the Tau protein found in the plasma of the bloodstream is an AD-specific marker and not a general indicator of neurodegeneration [7-8]. Several studies have also reported a decrease in Aβ1-42 as well as in the Aβ1-42/Aβ1-40 ratio in the plasma of AD patients. An association between a lower plasma Aβ1-42/Aβ1-40 ratio and PET-imaging positivity for amyloid plaque deposition has also been reported, which may justify a future use of this ratio as a predictor of AD. Even though the Tau protein and the Aβ1-42/Aβ1-40 ratio are promising candidates, there is still, to the best of our knowledge, no blood-based biomarker established for AD diagnosis [9-10].
Current methods to directly identify and quantify the presence of molecules with amino acid residues (such as peptides and proteins) include immunological approaches, such as the enzyme-linked immunosorbent assay (ELISA), and more advanced proteomic methods involving mass spectrometry (MS). A wide range of methods combining these two categories are also available [17].
Immunoassays are widely used and very sensitive approaches. They are based on antigen-antibody interactions requiring the use of quality antibodies to target the protein or the peptide of interest present in the sample. ELISA is a conventional immunoassay procedure and, although it is relatively simple, ELISA can be very time-consuming and also produce false-positive findings due to high levels of non-specific binding. Thus, the ELISA immunoassay procedure lacks specificity. Another disadvantage is the high price of the associated quality antibodies, which makes ELISA expensive. Nevertheless, ELISA has been considered one of the main methods for biomarkers' quantification in biofluids. For instance, a manual method based on standard ELISA (Innotest® ELISA) could detect both Aβ1-42 and Aβ1-40 in plasma samples with a limit of detection of 7.8 pg/mL [18].
The quantification limits for Aβ1-42, Aβ1-40 and Tau are, respectively, 5.8 pg/mL, 9 pg/mL, and 1 pg/mL for an automated ELISA technique developed by Roche (Elecsys®).
On the other hand, MS-based methods are powerful tools involving the identification of several proteins and peptides in a biological sample, based on the analysis of the peaks collected from the components' mass spectra (the collected pattern of the mass/charge ratio of ionized molecules). MS techniques provide high specificity, sensitivity, and fast results. However, spectrometers are very expensive instruments. MS-based methods, such as selected reaction monitoring (SRM), have been applied for quantification of Aβ1-42, Aβ1-40, and Tau biomarkers in CSF samples of AD patients. Analysis by SRM showed a lower limit of quantification (LOQ) for Aβ1-38, Aβ1-40, and Aβ1-42 of 250 pg/mL, 62.5 pg/mL, and 62.5 pg/mL, respectively [20].
Other ultrasensitive approaches have recently been developed for improved quantification of biomarkers that are found at very low concentrations in the blood. These methods also use antibodies but achieve results with higher sensitivity and accuracy. Three examples of these methods are the single-molecule array (SIMOA technology), the ELISA-based sandwich immunoassay (ABtest®) and the immunomagnetic reduction assay (IMR). One study using the SIMOA approach was able to measure plasma concentrations of Aβ1-42, Aβ1-40, and Tau with a limit of quantification of 0.34 pg/mL, 0.16 pg/mL, and 0.42 pg/mL, respectively [21]. Another study reported a lower LOD using the same SIMOA approach, of 0.019 pg/mL for Aβ1-42 and 0.16 pg/mL for Aβ1-40 [22]. The ELISA-based sandwich immunoassay (ABtest®) achieved a LOD for Aβ1-42 in plasma of 3.60 pg/mL and for Aβ1-40 a value of 7 pg/mL [23]. A higher sensitivity of detection is reached when using the immunomagnetic reduction assay (IMR). The IMR assays can reach low detection limits for Aβ1-42, Aβ1-40, T-tau and P-tau of 0.770 pg/mL, 0.170 pg/mL, 0.026 pg/mL, and 0.0196 pg/mL, respectively, using a superconducting quantum interference device (SQUID) [24]. However, these methods are both time-consuming and expensive. In addition, due to the lack of sensitivity of these methods, it is possible that the lowest concentrations may not be detected. Furthermore, techniques such as ELISAs are not able to distinguish between monomers and oligomers in a single process.
The publication “Optical fibre-based sensing method for nanoparticle detection through supervised back-scattering analysis: a potential contributor for biomedicine” (Paiva et al., in OPTICAL FIBERS AND SENSORS FOR MEDICAL DIAGNOSTICS AND TREATMENT APPLICATIONS XIX, vol. 10872, 27 Feb. 2019 (2019-02-27)) teaches the detection of nanoparticles by a back-scattered laser light signal collected by a polymeric lensed optical fibre tip dipped into a solution of synthetic polystyrene nanoparticles. The authors were able to correctly detect the presence of 100 nm synthetic nanoparticles in distilled water at different concentration values. The authors noted in the paper the difficulties that scientists have had in developing a “simple and fast” method to accurately detect and characterise extracellular vesicles. Indeed, the authors of this paper also failed to apply their method to natural, biological materials.
The method and device disclosed in this document enable the detection of molecules made up of amino acid residues, such as peptides and proteins, in samples of biofluids taken from patients at very low concentrations and the discrimination between three peptides even with a similar molecular mass.
A method for identification of amino acid residues, such as but not limited to peptides or proteins, in a fluid sample using machine learning techniques is disclosed. The method comprises producing a light signal from a laser, illuminating the fluid sample with the light signal through a lens in a sensing probe, acquiring a light signal from the fluid sample, extracting a plurality of features from the light signal, and comparing the extracted plurality of features with a model in a database to determine the amino acid residues in the fluid sample.
In one aspect, the method enables the detection of the presence or absence of a specific peptide, the identification of the detected peptide among other peptides, and the quantification of the detected peptide. Either supervised learning methods (e.g., support vector machines, random forests, neural networks, etc.) or clustering algorithms/unsupervised methods (e.g., K-means, UMAP) are used for identifying the peptide. Regression models (e.g., random forest regressor, linear regressor, polynomial regressor, etc.) can be used for quantifying the peptide. The method can also be used for detection of proteins.
The method further comprises in one aspect the filtering of the acquired light signal to remove noisy low-frequency components and/or normalizing the light signal.
The light signal from the laser is modulated and the extraction of the plurality of features in the light signal is carried out over periods of time. The plurality of features are time domain and frequency derived features.
The model is created by one of a support vector machine or a clustering algorithm. It is also possible to use different models for different purposes.
A device for identification of amino acid residues in a fluid sample is also disclosed in this document. The device comprises a laser which is connected through an optical fibre with a sensing probe (8) with a lens, such as a microlens, for illuminating the sample. A detector acquires a light signal from the sample and a computer is adapted to analyse the light signal, extract features from the light signal, compare the extracted features with stored features in a database and produce a result.
The method and the device can be used for the detection of neurodegenerative diseases, such as Alzheimer's disease, cardiovascular diseases, and cancer.
Concentrations above and below the biomarkers' human plasmatic concentrations regarding AD were tested. Two Aβ-derived peptides (with 42 and 28 amino acids) were tested in a concentration range of 1 pM-10 nM (including the Aβ1-42 plasmatic concentrations ranging from 5-300 pg/mL, which corresponds to a range between 5-60 pM [11]). T-tau was tested in a concentration range of 0.1 pM-10 nM, considering its reference physiological levels of 4-55 pg/mL (which corresponds to 0.1-10 pM) [7,12]. P-tau was tested in a concentration range of 0.01 pM-10 nM, considering its reference physiological levels of 0.1-1.2 pM [7,12]. The reference physiological levels are those levels at which the biomarkers are expected to be found in physiological samples, such as blood, plasma, and serum. Experiments started from the lowest concentrations (in the pM range) and proceeded to the highest concentrations (up to the nM range) to achieve a saturated peptide/protein concentration.
Plasma concentrations of α-synuclein of PD patients also vary between 1.6 and 320 pg/mL, depending on the method of quantification [13-16].
A detailed schematic of the acquisition apparatus is depicted in
1.45 + 0.045*sin(2*π*1000*t), where t is the time in seconds
Considering the laser driver's gain, the laser characteristic curve, and the optical loss along the fibre components, the lens' output optical power was 40 mW (but this is not limiting of the invention). This value was determined in accordance with the values used in the literature for optical delivery, collection, and manipulation effects through optical fibres considering the selected wavelength value range, and to cause as little damage as possible to the biological human-derived samples [28].
The modulation signal was externally injected into a laser driver 2 (MWTechnologies Lda, Portugal, Model #cLDD) through one of the output digital-to-analog ports of a data acquisition board 3 (NI, Austin, TX, Model #USB-6212 BNC). The resulting optical signal, mirroring the modulation equation, is inserted into an optical fibre and passes through a 1/99 optical coupler 4 (Laser Components GmbH, Germany, Model #3044214). While most of the radiation follows to the rest of the optical circuit, 1% of the radiation is monitored using a silicon photodetector 5 (Thorlabs Inc, Newton, NJ, Model #PDA-32A2) connected to one DAQ analog-input port.
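As an illustration only, the modulation expression 1.45+0.045*sin(2*π*1000*t) can be sampled numerically before being written to a digital-to-analog port. The sketch below is a minimal example and not the script used in the experiments: the function name `modulation_signal` and the 100 kHz output rate are assumptions, and the physical unit of the signal is not stated in the text, so the values are left unit-free.

```python
import numpy as np


def modulation_signal(duration_s, fs_hz, offset=1.45, amplitude=0.045, freq_hz=1000.0):
    """Sample offset + amplitude*sin(2*pi*freq_hz*t) at the rate fs_hz.

    The default offset, amplitude and frequency come from the expression in
    the text; fs_hz and duration_s are chosen by the caller (assumptions).
    """
    n_samples = int(round(duration_s * fs_hz))
    t = np.arange(n_samples) / fs_hz
    return t, offset + amplitude * np.sin(2.0 * np.pi * freq_hz * t)


# Ten periods of the 1 kHz sinusoid at a hypothetical 100 kHz output rate.
t, v = modulation_signal(duration_s=0.01, fs_hz=100_000.0)
print(len(v), v.min(), v.max())
```

The sinusoid swings between 1.405 and 1.495 around the 1.45 offset, which is the waveform that the laser driver would be expected to mirror in the optical signal.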
A 50/50, 1×2, optical coupler 6 (AFW Technologies Pty Ltd, Australia, Model # FOSC-1-98-50-L-1-H64F-2) establishes a bidirectional connection between the incoming light from the laser module, a sensing photodetector 7 (Thorlabs Inc, Newton, NJ, Model #PDA-32A2) and a sensing probe 8. The sensing probe 8 is a microlensed optical fibre with its end just outside a metal capillary and is described below. The metal capillary gives stability to the optical fibre and protects the optical fibre to make sure that the optical fibre does not break. This arrangement allows the sensing probe 8 to simultaneously focus the light coming from the laser 1 and collect the back-scattered radiation arising from a liquid dispersion sample 9 to be analysed. To provide further information about the samples' conditions/properties, temperature readings are obtained using an Infrared Thermometer 10 (Axiomet, Poland, Model #AX-7600).
The arrangement set out above is merely exemplary and is not limiting of the invention. Other optical components could be used.
The sensing probe 8 is manipulated using a 4 axis (x, y, z, and tilt) right-hand micromanipulator 11 (Siskiyou Corporation, Grants Pass, OR, Model #:MX7600) with a probe holder in which the capillary with the sensing probe 8 is fixed. This micromanipulator is connected to a closed-loop dial controller (Siskiyou Corporation, Grants Pass, OR, Model #:MC1000e-R1/4T) that allows a more precise displacement of the sensing probe 8 into and inside the sample 9.
A visualization and imaging module is composed of a self-made inverted microscope setup using a standard white LED light source 12, an objective 13 (currently at 20×, but higher amplification can be used to observe smaller particles), a mirror 14 and a zoom lens 15 (Edmund Optics, Barrington, NJ, Model #VZM 450). This microscope drives the desired imaging plane to a digital camera 16 (Edmund Optics, Barrington, NJ, USA Model EO-1312C #Model 83-770). The image from the digital camera is observed in real-time in a computer 17 using IDS's software uEye Cockpit. The sensing region of the digital camera 16 allows for the visualization of the focused infrared beam from the fluid sample 9 and the reaction of the focused infrared beam with the constituents of the fluid sample 9.
The fabrication of the polymeric microlens used in the sensing probe 8 will now be described. The polymeric microstructures used are fabricated through a guided wave photopolymerization process on top of cleaved optical fibres [25-27], a process in which the cross-linking of monomers is triggered by light at a specific wavelength. Two components must be present in the solution for the photopolymerization process to take place, a monomer and a photo-initiator:
Once the correct proportion between monomer and photo-initiator is achieved, an optical setup consisting of a couple of mirrors and a CW laser is used to excite the photo-initiator. In this example, laser light at a wavelength of 405 nm (Omicron, Rodgau-Dudenhofen, Germany, #Model LuxX cw, 60 mW) is incident at 45° on two consecutive mirrors, resulting in a square-shaped optical path. After the second reflection, the laser light is coupled into an optical fibre by an objective.
The optical fibre (Thorlabs, Newton, New Jersey, USA #Model SM 980-5.8-125) has a multi-mode behaviour for this wavelength, so a multitude of optical modes can be excited, resulting in a different optical output pattern and a consequent difference in the geometry imprinted in the tip.
The shape of the structure of the polymeric optical tip should be a substantially spherical, lens-like termination so that the structure of the polymeric optical tip efficiently focuses the incident light. This requires the excitation of a mode with a Gaussian or Gaussian-like profile. Such profiles can be attained with the LP01 and LP02 optical fibre modes as are shown in
Once the setup is aligned, i.e., one of the LP01 or LP02 modes is observable at the output of a cleaved fibre (as seen in
During the fabrication procedure, some geometrical parameters, such as diameter and length, as well as the curvature radius of the polymeric optical tip are controlled. This can be done through the manipulation of some fabrication parameters, such as the optical fibre mode excited during polymerization, as previously mentioned, but also the percentage of the photo-initiator present in the solution, the exposure time, and the laser power used during the polymerization, etc. To assure a high reproducibility of these polymeric optical tips, these parameters should be kept constant throughout the whole fabrication process of a batch of polymeric optical tips. The requirements that must be kept constant as well as the parameters to control are summarized in Table 1.
For the purposes of the work presented in this text, the fabrication parameters used in the photopolymerization process were the following:
These parameters resulted in polymeric optical tips with lengths ranging from 30 μm to 50 μm and base diameters ranging from 4 μm to 7 μm, depending on the mode at the fibre's output. Depending on that mode, the curvature radius of the lens structures also varied between 1.5 μm and 3 μm. The numerical aperture (NA) values range between 0.25 and 0.5 (values evaluated in a water medium), and a focused spot with dimensions of about one third to one quarter of the base diameter of the lens was obtained. The protective structure does not significantly affect the light propagation in the simple tip underneath the protective structure. The protective structure increases the contact area between fibre and polymer to the totality of the optical fibre cross-section, improving the mechanical resistance of the polymeric optical tip to the successive media crossings to which the polymeric optical tip will be exposed (e.g. air to plasma, air to serum, etc.). This structure has the aspect of a cupola placed around the initial polymeric optical tip, always having a height lower than the polymeric optical tip itself.
It will be appreciated that the above description is only one method in which the sensing probes 8 used in this disclosure can be fabricated. The method for detecting the molecules with the amino acid residues is not limited to the sensing probes 8 with the polymeric optical tips fabricated using the above fabrication method. Other structures capable of focusing light to a small spot and thus generating an electric field gradient can be used for the method here described. Such structures can be built on the apex or on the side of an optical fibre or on a planar substrate. It will be appreciated that these structures include optical fibre tapers, phase Fresnel plates (fibre or planar), a single nanometric hole, or an array of nanometric holes on a metallic surface, for plasmonic effects. The latter can either be deposited on an optical fibre or on a transparent planar substrate. To summarize, any type of metalens, be it metallic or dielectric, built on an optical fibre or on a planar substrate is suitable for this application.
Back-scattered signal and liquid sample temperature acquisition setup. The setup used for acquiring the back-scattered signal from the liquid dispersion samples 9 using the polymeric optical tip as the sensing probe 8 comprised the following modules shown in
Signal acquisition and processing. After the optical setup for the acquisition apparatus was correctly mounted and turned on, a simple assay was carried out for the prepared water/peptide solutions. This is done by placing a volume of 150 μL of the water/peptide solution as the fluid sample 9 over a 35 mm Ibidi® micro rounded dish. Then, the polymeric optical tip of the sensing probe 8 was immersed in this fluid sample 9 with the help of the visualization and imaging system. Different acquisition sequences of the peptide samples were considered depending on the conducted experiment. The procedure used for calibrating the system regarding the peptide detection functionality is based on the following steps and is shown in some detail in
In a next step, a set of descriptive features is extracted from the signal. The descriptive features are given below. The resulting dataset is used to train a binary classification model. Once the model converges and its generalization capability is assured, the system is ready to make predictions. There are several classification models that can be used for training, and these are explained in more detail below.
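The training of a binary (peptide present / peptide absent) model on a table of extracted features can be sketched as follows. This is a hedged illustration using scikit-learn, which the text does not name as the library used. The feature matrix is filled with synthetic random numbers as placeholders, not measured data; the class separation, the RBF kernel, the 75/25 split and the 60 samples per class are all assumptions. Only the support vector machine model type and the count of 98 features per epoch are taken from the document.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_per_class, n_features = 60, 98  # 98 features per epoch, as in the text

# Synthetic placeholder features (NOT measured data): class 0 stands for
# "blank" epochs and class 1 for "spiked" epochs.
X_blank = rng.normal(0.0, 1.0, size=(n_per_class, n_features))
X_spiked = rng.normal(1.0, 1.0, size=(n_per_class, n_features))
X = np.vstack([X_blank, X_spiked])
y = np.array([0] * n_per_class + [1] * n_per_class)

# Hold out part of the data to check the generalization capability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Standardize each feature, then fit a support vector machine.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

In practice, epochs taken from the same acquisition would probably need to be kept together in either the training or the held-out set to avoid an optimistic estimate; that grouping is omitted here for brevity.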
The calibration pipeline applied for creating multiclass artificial intelligence models able to identify the type of the peptides present in the solution is also schematized in
Temperature sensing based on back-scattered frequency features.
Sample temperature acquisition. The influence of the sample's temperature on the back-scattered signal was evaluated through a simple experiment where distilled water was used in place of a biofluid (e.g., serum sample) as the fluid sample 9.
A distilled water sample (used as the fluid sample 9) of 1 mL at room temperature was placed in an Ibidi® dish and the back-scattered signal was collected for 30 seconds, 10 times in a row, in different locations of the sample. Time and frequency features were then calculated based on the collected back-scattered signal using the algorithm of this disclosure. The water temperature was measured at the beginning of the acquisitions and once again at the end, to monitor variations, using the infrared thermometer 10. It will be appreciated that this temperature recording could also be done using other automatic means, in particular using a type “T” thermocouple with an automatic logger for the detection of the temperature variation over time within a single sample.
After temperature analysis, the sample 9 was discarded and a new one was pipetted for analysis. This was repeated 10 times for 10 samples of 1 mL of distilled water. All of these samples 9 were collected from the same tube.
Output laser and back-scattered signals were acquired at first in non-spiked human serum samples (“blank” samples) and then in the human serum samples spiked with the peptide/protein in the pre-selected concentrations used as the fluid sample 9. The data was collected from the fluid samples 9 from the lowest to the highest peptide/protein concentration and for two human serum dilutions. This sequence was considered for Amyloid-beta 1-42, Amyloid-beta 1-28, Tau-441 and Phosphorylated Tau 441. A cleaning protocol of the sensing probe 8 was applied between the data collections for different human serum dilutions.
The peptides' concentration test was conducted for four different peptides, namely the Aβ1-42, the Aβ1-28, the Tau-441 and the Phosphorylated Tau 441. The tested concentrations for each of the peptides were determined considering the typical physiological concentration in humans. Given that these are different for the Amyloid-beta and the Tau peptides, a different selection of test concentrations present in the human serum sample was made. These are depicted in Table 1.
Table 1—Peptides' concentrations analysed during the detection experiments, presented in picomolar (pM). The order of the analysis followed the increase in concentration values.
All the concentrations in Table 1 were tested twice, from the lowest concentration (0 pM) to the highest (10000 pM), using a single probe for each of the peptides. The first sequence had the serum samples diluted in PBS at a 1:2 ratio, and the second made use of non-diluted serum, here defined as a 1:1 ratio. Between these two dilutions, a cleaning protocol was applied (see below) to prevent cross-contamination from one sample to the other.
Additionally, higher concentrations of Amyloid-beta 1-42 peptide were also tested, namely 1 nM, 5 nM, 25 nM, 50 nM, 100 nM, 1000 nM, and 10000 nM. As described above, both dilutions were tested (1:2 and 1:1), from the lowest to the highest concentration.
To perform the distinction analysis, the same sensing probe 8 was exposed to serum solutions containing the same concentration of different peptides. The peptides used were the same as in the previous tests, the Amyloid-beta 1-42, Amyloid-beta 1-28 and the Tau-441, only this time the tested concentrations were the same for all the peptides, namely 0 pM, 1 pM, 10 pM, 100 pM, and 1 nM. Each of the three peptides was tested for each concentration value (from the lowest to the highest), beginning with Amyloid-beta 1-42, followed by Amyloid-beta 1-28 and, finally, by Tau-441. Once again, the analysis sequence considered included at first the 1:2 serum dilution in PBS and then the non-diluted serum. As the sensing probe 8 was consecutively exposed to different peptides, the cleaning protocol (see below) was applied after each acquisition, to prevent cross-contamination from affecting the results.
The laser output and back-scattered signals were acquired simultaneously by a custom-built MATLAB script (as noted above) which, after a starting order, records and saves the input from both photodetectors for 30 seconds, at a 10 kHz sampling rate. The script also plots the acquired signals (
To avoid sample misrepresentation and ensure statistical variability, for every sample, 10 acquisitions were performed at different locations, following the above-mentioned script.
To prevent cross-contamination between samples, a standard cleaning protocol was followed. The sensing probe 8 was inserted into a solvent (e.g., diluted bleach) between any two samples 9 to remove any biological traces. Then, the sensing probe 8 was dipped in distilled water to remove any trace of bleach. While in the distilled water, one to two signal acquisitions (as above) were performed to ascertain any degradation issues and ensure that the probe remained in prime condition.
The choice of this cleaning protocol was based on a spectral analysis performed on the polymeric tips in the sensing probe 8 after being exposed to different media. The apparatus used for this study is schematized in
When a solvent such as ethanol (70% diluted in water) was used after the polymeric optical tip had been in contact with a serum sample 9, a deterioration of the polymeric optical tip's reflection spectrum was observed; see
Note that the cleaning of the polymeric optical tips can be done either by a chemical or a physical process. Although the present procedure is based on the use of a chemical solvent, the application of a surface treatment capable of preventing protein adsorption by the surface is also a viable option, as is the application of an ultrasound-based cleaning protocol.
For all the experiments conducted, the back-scattered signals were processed using the same pipeline, schematized in
Each acquisition was first filtered using a second-order 500 Hz Butterworth high-pass filter to remove noisy low-frequency components of the acquired signal (e.g., the 50 Hz electrical grid component). Then, the signal of each acquisition was normalized using the z-score. The z-score can be calculated using the following equation:
z = (x - mean(x)) / SD(x)
where x is the filtered signal and mean(x) and SD(x) represent, respectively, the signal average and standard deviation. After this transformation, each whole acquisition was split into epochs of 10 seconds. Features were calculated for each one of these epochs. An additional pre-processing step was tested, which consisted in the subtraction of the laser output from the raw signal.
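The pre-processing chain described above (high-pass filtering, z-score normalization, splitting into 10-second epochs) can be sketched with SciPy as follows. This is a minimal sketch under stated assumptions, not the script used for the experiments: the function name `preprocess`, the single-pass causal filtering with `sosfilt`, and the population standard deviation (NumPy's default) are assumptions, and the input is a synthetic test signal rather than a measured acquisition.

```python
import numpy as np
from scipy.signal import butter, sosfilt

FS_HZ = 10_000    # sampling rate stated in the text
CUTOFF_HZ = 500   # high-pass cut-off stated in the text
EPOCH_S = 10      # epoch length stated in the text


def preprocess(x, fs_hz=FS_HZ, cutoff_hz=CUTOFF_HZ, epoch_s=EPOCH_S):
    """High-pass filter, z-score and split one acquisition into epochs."""
    # Second-order Butterworth high-pass filter as second-order sections.
    sos = butter(2, cutoff_hz, btype="highpass", fs=fs_hz, output="sos")
    filtered = sosfilt(sos, x)  # single causal pass (assumption)
    # z-score over the whole acquisition: (x - mean(x)) / SD(x).
    z = (filtered - filtered.mean()) / filtered.std()
    # Keep only complete epochs and reshape to (n_epochs, samples_per_epoch).
    n = int(epoch_s * fs_hz)
    n_epochs = len(z) // n
    return z[: n_epochs * n].reshape(n_epochs, n)


# Synthetic 30 s test signal: 1 kHz tone plus 50 Hz interference and an offset.
t = np.arange(30 * FS_HZ) / FS_HZ
raw = 1.45 + 0.045 * np.sin(2 * np.pi * 1000 * t) + 0.02 * np.sin(2 * np.pi * 50 * t)
epochs = preprocess(raw)
print(epochs.shape)  # (3, 100000)
```

A zero-phase variant (`scipy.signal.sosfiltfilt`) would avoid phase distortion but applies the filter twice, which raises the effective filter order; the text does not say which variant was used.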
Features. After processing the signal of each acquisition, a set of 98 features was calculated for each 10-second epoch (table 3). These features can be divided into two types: time-derived and frequency-derived. The time-domain features can be grouped into time-domain metrics and non-linear features. The frequency-related features can be subdivided into wavelet packet decomposition, Discrete Cosine Transform (DCT)-derived and spectral features. The feature extraction step was implemented with a custom-built Python 3 script, using the scipy, pandas, PyWavelets, librosa, and numba Python libraries.
Time-domain metrics such as mean, standard deviation, root mean square, signal power, root sum of squares level (RSSQ), skewness, kurtosis, interquartile range, and entropy were used, given their adequacy in differentiating types of periodic signals. The skewness reflects the degree of symmetry of the distribution, while kurtosis quantifies whether the shape of the data distribution matches the Gaussian distribution. The interquartile range is a variability measure. Additionally, the area under the curve of the histogram distribution of the voltage values was considered.
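Several of the time-domain metrics listed above can be computed for one epoch as in the sketch below. The exact definitions used in the experiments are not given in the text, so the choices here are assumptions: signal power as the mean square, Pearson kurtosis (a Gaussian gives 3), and Shannon entropy estimated from a 64-bin histogram. The function name `time_domain_metrics` is hypothetical.

```python
import numpy as np
from scipy import stats


def time_domain_metrics(x, n_bins=64):
    """Return a dictionary of basic time-domain metrics for one epoch."""
    x = np.asarray(x, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    # Histogram-based Shannon entropy in bits (assumed definition).
    counts, _ = np.histogram(x, bins=n_bins)
    p = counts[counts > 0] / counts.sum()
    return {
        "mean": x.mean(),
        "std": x.std(),
        "rms": np.sqrt(np.mean(x ** 2)),
        "power": np.mean(x ** 2),
        "rssq": np.sqrt(np.sum(x ** 2)),
        "skewness": stats.skew(x),
        "kurtosis": stats.kurtosis(x, fisher=False),  # Pearson definition
        "iqr": q75 - q25,
        "entropy": -np.sum(p * np.log2(p)),
    }


# Example on five full periods of a unit-amplitude sine (synthetic input).
demo = time_domain_metrics(np.sin(2 * np.pi * 5 * np.arange(1000) / 1000))
print({name: round(float(value), 4) for name, value in demo.items()})
```

For a pure sine over whole periods the RMS is 1/√2 and the Pearson kurtosis is 1.5, well below the Gaussian value of 3, which illustrates how kurtosis separates a periodic waveform from noise-like data.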
Non-linear features are useful to describe the complexity and regularity of a signal and are often used to describe the phase behaviour of predominantly stochastic signals, such as EEG. A total of eight non-linear features were considered: approximate entropy, singular value decomposition (SVD) entropy, Petrosian fractal dimension, Hurst exponent, Detrended fluctuation analysis (DFA), Higuchi fractal dimension, Hjorth complexity and mobility. The approximate entropy is used to quantify the amount of regularity and the unpredictability of fluctuations over time-series data, whereas the SVD entropy is an indicator of the number of eigenvectors that are necessary for an adequate explanation of the data set, in other words, it measures the dimensionality of the data.
The term fractal relates to fluctuations in time that possess a form of self-similarity whose dimension cannot be described by an integer value. Therefore, a fractal dimension (FD) is a ratio that provides a statistical index of complexity and the degree of irregularity of a waveform. It is a highly sensitive measure for the detection of hidden information contained in physiological time series. Petrosian's algorithm provides a fast computation of the FD of a signal by translating the series into a binary sequence, while Higuchi is iterative in nature and is especially useful to handle waveforms as objects. Finally, DFA is a method for quantifying fractal scaling and correlation properties in the time-series.
The Hurst exponent is a measure of the “long-term memory” of a time series. It can be used to determine whether the time series is more, less, or equally likely to increase if it has increased in previous steps. Hjorth parameters are indicators of the statistical properties of a signal in the time domain. The mobility parameter is defined as the square root of the ratio of the variance of the first derivative of the signal to the variance of the signal itself. It represents the mean frequency or the proportion of standard deviation of the power spectrum. The complexity parameter, on the other hand, indicates how similar the shape of a signal is to a pure sine wave; its value converges to 1 as the shape of the signal approaches a pure sine wave.
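The Hjorth parameters and the Petrosian fractal dimension are simple enough to sketch directly. The following is a hedged illustration using discrete differences as the derivative; the function names and the synthetic signals are assumptions, and the Petrosian expression is the commonly used closed form based on the number of sign changes of the first difference.

```python
import numpy as np


def hjorth(x):
    """Return (mobility, complexity) of a signal."""
    x = np.asarray(x, dtype=float)
    dx = np.diff(x)    # first derivative (discrete difference)
    ddx = np.diff(dx)  # second derivative
    mobility = np.sqrt(np.var(dx) / np.var(x))
    # Complexity: mobility of the derivative divided by mobility of the signal.
    complexity = np.sqrt(np.var(ddx) / np.var(dx)) / mobility
    return mobility, complexity


def petrosian_fd(x):
    """Petrosian fractal dimension from sign changes of the first difference."""
    x = np.asarray(x, dtype=float)
    dx = np.diff(x)
    n_delta = np.sum(dx[1:] * dx[:-1] < 0)
    n = len(x)
    return np.log10(n) / (np.log10(n) + np.log10(n / (n + 0.4 * n_delta)))


t = np.arange(1000) / 1000.0
sine = np.sin(2 * np.pi * 5 * t)
noise = np.random.default_rng(0).normal(size=1000)
print(hjorth(sine), hjorth(noise))
print(petrosian_fd(sine), petrosian_fd(noise))
```

For the pure sine the complexity is close to 1, whereas white noise gives a larger complexity and a larger fractal dimension.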
Regarding the frequency-domain analysis of the back-scattered signal, three sets of features were extracted: Discrete Cosine Transform (DCT) parameters, Wavelet-derived coefficients and spectral features. The DCT was applied to each epoch. The DCT can capture minimal periodicities of the signal without injecting high-frequency artifacts into the transformed data. Besides being well suited to short signals, it is attractive for problems of this type, which require differentiating target classes, because DCT coefficients are uncorrelated. Thus, they can be used as suitable features for characterizing each peptide class. Additionally, the DCT can embed most of the signal energy into a small number of coefficients. The first n coefficients of the DCT of the scattering echo signal are defined by the following equation:
where εi is the signal envelope estimated using the Hilbert transform. The following features were extracted from the DCT analysis: the number of coefficients needed to represent about 98% of the total energy of the original signal, the first 30 DCT coefficients, the Area Under the Curve (AUC) of the DCT spectrum for all the frequencies below the modulation frequency (1 kHz), and the entropy of the DCT spectrum. A similar analysis was conducted using the Hilbert transform. The Hilbert transform, when applied to the signal, produces an analytical real-valued representation of it. The 10 highest amplitude peaks of the Hilbert transformed signal were used as features, as well as the number of coefficients needed to represent about 98% of the total energy of the original signal.
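A minimal sketch of some of the DCT-derived features is given below. It is an illustration under stated assumptions rather than the original script: the orthonormal DCT-II, the counting of coefficients in natural (low-to-high frequency) order, the Shannon entropy of the normalized DCT energy, the function name `dct_features` and the synthetic amplitude-modulated signal are all choices made here.

```python
import numpy as np
from scipy.fft import dct
from scipy.signal import hilbert


def dct_features(epoch, energy_fraction=0.98, n_first=30):
    """DCT-derived features of the Hilbert envelope of one epoch."""
    envelope = np.abs(hilbert(epoch))              # envelope from the analytic signal
    coeffs = dct(envelope, type=2, norm="ortho")   # orthonormal DCT-II (assumed)
    energy = coeffs ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    # Number of leading coefficients holding the requested energy fraction.
    n_coeffs = int(np.searchsorted(cumulative, energy_fraction) + 1)
    p = energy[energy > 0] / energy.sum()
    return {
        "n_coeffs_98": n_coeffs,
        "first_coeffs": coeffs[:n_first],
        "entropy": float(-np.sum(p * np.log2(p))),
    }


# Synthetic example: 1 kHz carrier with a slow 5 Hz amplitude modulation.
fs = 8000
t = np.arange(fs) / fs
x = (1 + 0.5 * np.sin(2 * np.pi * 5 * t)) * np.sin(2 * np.pi * 1000 * t)
feats = dct_features(x)
print(feats["n_coeffs_98"], feats["first_coeffs"].shape)
```

Because the envelope of this synthetic signal varies slowly, only a few low-order coefficients are needed to reach 98% of the energy, which illustrates the energy-compaction property mentioned above.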
Some parameters based on the information extracted from the Wavelet analysis of each original signal portion were also considered as features. Using Wavelet packet decomposition, it is possible to extract, in each frequency band, certain tonal information of the original signal depending on the frequency range and content of the back-scattered signal. For this process, it is necessary to choose a suitable mother Wavelet, which is used as a prototype to be compared with the original signal in order to extract frequency sub-band information. Four mother Wavelets (Haar, Daubechies Db10 and Db4, and Symlet) were selected to characterize the back-scattered signal portions. For each mother Wavelet, six features were considered, based on the relative power of the Wavelet packet-derived reconstructed signal (one to six levels).
Spectral features characterize the signal's power spectrum, which is the distribution of power across the frequency components composing that signal. It is obtained using the Fourier Transform. Four measures were derived from the spectrum: spectral flatness, spectral centroid, spectral contrast and spectral roll-off. A total of twelve features were calculated from these measures. The spectral contrast is defined as the difference between valleys and peaks in a spectrum. For each sub-band, the energy contrast is estimated by comparing the mean energy in the top quantile (peak energy) to that of the bottom quantile (valley energy). The spectral flatness (or tonality coefficient) quantifies how noise-like a signal is. A high spectral flatness (closer to 1.0) indicates that the spectrum is similar to that of white noise. The spectral roll-off frequency is defined as the centre frequency of the spectrogram bin such that at least 85% of the energy of the spectrum is contained in this bin and in the bins below. Finally, the spectral centroid indicates where the centre of mass of the spectrum is located in each frame of the spectrogram. For each one of these measures, three features were calculated: the mean, the maximum and the standard deviation.
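The following sketch shows how three of these spectral measures (flatness, centroid and roll-off) could be computed per frame and then summarized by their mean, maximum and standard deviation. It uses plain NumPy rather than librosa so that each definition is explicit; the frame length, sampling rate, function names and test signals are assumptions, and the spectral contrast is omitted for brevity.

```python
import numpy as np


def spectral_measures(frame, fs, rolloff=0.85):
    """Return (flatness, centroid in Hz, roll-off in Hz) of one frame."""
    power = np.abs(np.fft.rfft(frame)) ** 2 + 1e-12  # small floor avoids log(0)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    # Flatness: geometric mean divided by arithmetic mean of the power spectrum.
    flatness = np.exp(np.mean(np.log(power))) / np.mean(power)
    # Centroid: power-weighted mean frequency.
    centroid = np.sum(freqs * power) / np.sum(power)
    # Roll-off: lowest bin below which at least `rolloff` of the energy lies.
    cumulative = np.cumsum(power)
    rolloff_hz = freqs[np.searchsorted(cumulative, rolloff * cumulative[-1])]
    return flatness, centroid, rolloff_hz


def summarise(epoch, fs, frame_len=2048):
    """Mean, maximum and standard deviation of each measure over the frames."""
    n = len(epoch) // frame_len
    frames = np.reshape(np.asarray(epoch)[: n * frame_len], (n, frame_len))
    values = np.array([spectral_measures(f, fs) for f in frames])
    out = {}
    for j, name in enumerate(("flatness", "centroid", "rolloff")):
        out[name + "_mean"] = values[:, j].mean()
        out[name + "_max"] = values[:, j].max()
        out[name + "_std"] = values[:, j].std()
    return out


fs = 8192
t = np.arange(4 * 2048) / fs
tone = np.sin(2 * np.pi * 1000 * t)
noise = np.random.default_rng(0).normal(size=t.size)
tone_summary = summarise(tone, fs)
noise_summary = summarise(noise, fs)
print(tone_summary["centroid_mean"], noise_summary["flatness_mean"])
```

The pure tone yields a flatness close to zero and a centroid at the tone frequency, while white noise yields a much higher flatness.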
The relationship between the temperature and the frequency features was studied by calculating the correlation between the temporal evolution of the features and the temperature variation throughout the experiment. Correlation values were calculated considering the average temperature between the sample's initial and final temperatures along each acquisition. Similarly, the mean value of each feature was calculated for each acquisition, so that the two time-series to be compared (temperature and each light scattered-derived feature) had the same number of points. The correlation was calculated using the following formula:
where xi represents the temperature time-series values and yi the feature values. Each time-series was normalized so that the correlation value lies between 0 and 1.
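The correlation formula itself is not reproduced in this text. As a hedged illustration only, the sketch below assumes a Pearson-type correlation between the per-acquisition mean temperature and the per-acquisition mean of one feature, computed after z-normalizing both series; the function name and the example values are hypothetical.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation of two equally long series (assumed formula)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xz = (x - x.mean()) / x.std()  # z-normalized temperature series
    yz = (y - y.mean()) / y.std()  # z-normalized feature series
    return float(np.mean(xz * yz))


# Hypothetical per-acquisition values: mean temperature and one feature.
temperature = [22.0, 22.5, 23.1, 23.8, 24.6]
feature = [0.41, 0.44, 0.46, 0.52, 0.55]
r = pearson_r(temperature, feature)
print(r)
```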
Two different artificial intelligence pipelines were developed to detect the presence of peptides. The first makes use of a supervised machine learning model, the Support Vector Machine (SVM), whereas the second uses an unsupervised approach that combines UMAP dimensionality reduction with clustering.
Supervised Learning Pipeline. The model was trained to distinguish between the presence and absence of the different peptides in the solutions (binary problem). A distinct model was built to detect each one of the peptides. The “absence class” was composed of acquisition samples of serum without the spiked peptide, whereas the “presence class” was composed of acquisition samples of serum with the added peptide at different concentrations, depending on which peptide was to be detected. Since the “absence class” had a significantly smaller number of samples, the “presence class” was randomly under-sampled to build a balanced training set. The samples discarded during the under-sampling process were integrated into the test set. The model used to perform the classification was the Support Vector Machine (SVM), since it can deal with both linear and non-linear input data and is very suitable for high-dimensionality problems. The SVM can distinguish between two different groups by finding a separating hyperplane with the maximal margin between the classes. Three general attributes define the SVM classifier: C, a hyper-parameter which controls the trade-off between margin maximization and error minimization; the kernel, a function that maps the training data into a high-dimensional feature space; and sigma, which controls the width of the kernel. Several combinations of these parameters were tested to find the optimal model. Each model was trained using a cross-validation strategy. The optimal model was chosen based on the accuracy across all the validation folds.
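The supervised pipeline can be sketched with scikit-learn as follows. This is an illustrative outline on synthetic data, not the trained models reported here: the class sizes, the parameter grid, the feature standardization step and the five-fold cross-validation are assumptions. In scikit-learn the kernel width is controlled by `gamma`, which plays the role of sigma in the description above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-ins for the 98-feature epochs (assumed sizes and values).
X_absent = rng.normal(0.0, 1.0, size=(40, 98))    # serum without peptide
X_present = rng.normal(0.8, 1.0, size=(120, 98))  # serum with spiked peptide

# Randomly under-sample the larger "presence" class to balance the training set.
keep = rng.choice(len(X_present), size=len(X_absent), replace=False)
rest = np.setdiff1d(np.arange(len(X_present)), keep)
X_train = np.vstack([X_absent, X_present[keep]])
y_train = np.r_[np.zeros(len(X_absent), dtype=int), np.ones(len(keep), dtype=int)]
X_test_extra = X_present[rest]  # discarded samples are moved to the test set

# Grid search over C, kernel and kernel width, scored by validation accuracy.
search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(probability=True)),
    param_grid={
        "svc__C": [0.1, 1.0, 10.0],
        "svc__kernel": ["linear", "rbf"],
        "svc__gamma": ["scale", 0.01],
    },
    scoring="accuracy",
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)

# Per-epoch probability of peptide presence for the extra test samples.
prob_present = search.predict_proba(X_test_extra)[:, 1]
```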
Since each acquisition was divided into epochs and the features calculated from these epochs were fed into the AI model, a prediction was made for each one of the epochs. However, the goal was to evaluate the performance of the model in detecting the presence of the peptide at different concentrations. Thus, three different methods can be considered to calculate this performance.
First method: accuracy of the binary classification considering each epoch for each concentration.
Second method: median probability of detecting the peptide across all the samples corresponding to the same concentration.
Third method: obtained through the plot of the histogram of the predicted detection probabilities across all samples. The performance for each concentration is the bin with the most counts, that corresponds to the most frequently predicted probability range.
Unsupervised Learning/Clustering pipeline
An unsupervised machine learning pipeline was developed to investigate whether it is possible to detect the presence of peptides without any previous knowledge about the data or any previous training stage. The algorithm comprises a dimensionality reduction using UMAP followed by an HDBSCAN clustering. UMAP is an algorithm for dimension reduction based on manifold learning techniques and concepts from topological data analysis. The first phase of UMAP consists of building a fuzzy topological representation. The second phase optimizes the low-dimensional representation so that its fuzzy topological representation is as close as possible to the original one, as measured by cross-entropy. The output of UMAP is a two-dimensional representation of the feature map. HDBSCAN clustering is then applied to this reduced feature space. HDBSCAN is a hierarchical clustering algorithm that extracts a flat clustering based on the stability of the clusters. At the end, two clusters representing the presence and absence of peptide are provided as the output of the model.
The peptide distinction/classification algorithm was based on a supervised learning approach. A random forest classifier was trained to identify the three different peptides (Amyloid beta 1-42; Amyloid beta 1-28 and Tau 441). A random forest consists of many individual decision trees that operate as an ensemble. A decision tree is a flow-chart-like structure, where each internal node denotes a test on a feature, each branch represents the outcome of a test, and each leaf node holds a class label. A tree is built by splitting the source set, which constitutes the root node of the tree, into subsets. The splitting follows a set of rules based on the classification features. This process is repeated on each derived subset in a recursive manner. The recursion is completed when the subset at a node has all the same values of the target variable, or when splitting no longer adds value to the predictions. Each individual tree in the random forest outputs a class prediction, and the class with the most votes becomes the model's prediction. Five general parameters that define the random forest were optimized: the maximum depth of the trees, the two parameters controlling the number of samples in the leaf and split nodes, the number of features to consider when looking for the best split, and the number of decision trees in the forest. Several models with different combinations of these parameters were trained using a cross-validation strategy. The optimal set of parameters was the one that produced the model with the highest accuracy across all validation folds.
The dataset was composed of samples of three different peptides (Amyloid beta 1-42; Amyloid beta 1-28 and Tau 441) at four different concentrations (1 pM, 10 pM, 100 pM and 1000 pM). The samples were divided randomly into training and test sets with a 7:3 proportion.
The performance in the test set was evaluated using the accuracy score and the F1-score. The accuracy score measures the proportion of correct predictions made by the model. The F1-score is the harmonic mean of the precision and recall. The precision gives the proportion of positive predictions that are actually true, whereas the recall measures the proportion of positive samples that are actually predicted as positive. The F1-score is commonly used to evaluate the performance in multiclass problems.
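As a worked illustration of these scores, the sketch below computes the accuracy and a macro-averaged F1-score for a small three-class example. The macro averaging (unweighted mean of the per-class F1-scores), the function name and the example labels are assumptions; the text does not state which averaging was used.

```python
import numpy as np


def classification_scores(y_true, y_pred):
    """Return (accuracy, macro-averaged F1-score)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    accuracy = float(np.mean(y_true == y_pred))
    f1_per_class = []
    for label in np.unique(y_true):
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F1 is the harmonic mean of precision and recall.
        denom = precision + recall
        f1_per_class.append(2 * precision * recall / denom if denom else 0.0)
    return accuracy, float(np.mean(f1_per_class))


# Hypothetical labels: 0, 1 and 2 stand for the three peptides.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
accuracy, macro_f1 = classification_scores(y_true, y_pred)
print(accuracy, macro_f1)
```

In this example four of six predictions are correct, and the per-class F1-scores are 0.5, 0.8 and 2/3.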
The concentration of the peptide was determined using a supervised learning model: the random forest. A random forest regressor works similarly to a classification one: it constructs a multitude of decision trees and outputs the mean prediction of the individual trees. For this reason, the same parameters were optimized to choose the best model. A cross validation strategy was used to train the model. The model performance was evaluated using the r2 coefficient.
The dataset consisted of samples of Tau 441 at different concentrations (0 pM, 1 pM, 10 pM, 100 pM), which matched the human plasmatic levels and above. The concentration values were converted to a logarithmic scale, so that the increase in concentration assumed a linear trend. The training and test samples were divided randomly. The training set encompassed 70% of the samples, while the test set represented 30%.
The error in the regressor prediction was measured using the root mean squared error of the logarithmic concentration values.
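The evaluation on logarithmic concentration values can be sketched as follows. The text does not say how the 0 pM samples were mapped to the logarithmic scale, so the offset used here, log10(c + 1), is an assumption, as are the function names and the illustrative predictions.

```python
import numpy as np


def to_log_scale(conc_pm):
    """Map concentrations in pM to a log scale; the +1 offset is assumed."""
    return np.log10(np.asarray(conc_pm, dtype=float) + 1.0)


def rmse(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2_score(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


true_pm = [0, 1, 10, 100]
pred_pm = [0.2, 1.5, 8.0, 140.0]  # illustrative predictions, not measured data
log_true = to_log_scale(true_pm)
log_pred = to_log_scale(pred_pm)
log_rmse = rmse(log_true, log_pred)
log_r2 = r2_score(log_true, log_pred)
print(log_rmse, log_r2)
```

On the logarithmic scale, a prediction of 140 pM for a 100 pM sample contributes an error comparable to a prediction of 1.5 pM for a 1 pM sample, which is why errors at high concentrations appear small in this scale.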
Temperature sensing based on back-scattered frequency features
Table 4 lists the features most correlated with the temperature evolution (r>70%). The correlation with the features derived from the difference signal (laser output subtracted from the back-scattered signal) was significantly smaller, which may be attributed to the fact that the laser and the acquired signal are not completely synchronous. The most correlated feature is the maximum spectral flatness, which suggests that the variation in temperature may influence the spectral content of the signal.
Table 4—Correlation values of the most correlated features with the temperature variation (r>70%). These features were calculated using the back-scattered signal and the signal resulting from the difference between the back-scattered signal and the laser signal output.
The results of the peptide detection differ depending on the algorithm applied. Thus, the results discussion was divided into two sections: the results regarding the supervised learning approach and the ones obtained using the clustering pipeline.
Amyloid-beta 1-42 (Serum dilution 1:2)
Amyloid-Beta 1-28 (Serum dilution 1:2)
FIGS. 20 and 21 represent, respectively, the ‘Median Probability of Peptide Presence’ and the ‘Detection Accuracy’ as a function of peptide concentration (in pM) for Tau-441 in serum diluted in PBS (1:2 ratio). The known physiological Tau-441 concentration range falls within the shaded area (0.1-10 pM). The median probability of peptide presence is higher than 90% in the considered range. Detection accuracy is above 80% for all concentration values. It shows a slight oscillation, with maxima at 0.1 pM and 100 pM, and decays slightly above 100 pM.
Detection accuracy presents a very similar behaviour to the one observed in the Median Probability. From 10 pM onwards, the accuracy reaches a value of 1. Although this reflects a poorer performance for the physiological range, the method is still capable of identifying the peptide presence at those concentrations above chance level, since both the median probability and the accuracy values are above 50%.
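The per-concentration quantities discussed here (detection accuracy, median probability of presence and most frequently predicted probability range) are all derived from epoch-level predictions. A minimal sketch of how they could be computed is given below; the function name, the 0.5 decision threshold, the ten histogram bins and the example probabilities are assumptions.

```python
import numpy as np


def per_concentration_performance(conc, prob, threshold=0.5, n_bins=10):
    """Summarize epoch-level presence probabilities for each concentration."""
    conc = np.asarray(conc)
    prob = np.asarray(prob, dtype=float)
    out = {}
    for c in np.unique(conc):
        p = prob[conc == c]
        # The true label is "present" whenever the spiked concentration is > 0.
        accuracy = float(np.mean((p >= threshold) == (c > 0)))
        counts, edges = np.histogram(p, bins=n_bins, range=(0.0, 1.0))
        k = int(np.argmax(counts))  # bin with the most counts
        out[float(c)] = {
            "epoch_accuracy": accuracy,
            "median_probability": float(np.median(p)),
            "most_frequent_range": (float(edges[k]), float(edges[k + 1])),
        }
    return out


# Hypothetical epoch-level outputs for 0 pM and 10 pM samples.
conc = [0, 0, 0, 10, 10, 10, 10]
prob = [0.12, 0.25, 0.61, 0.92, 0.95, 0.85, 0.41]
performance = per_concentration_performance(conc, prob)
print(performance[10.0])
```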
Table 5 shows the results of the clustering algorithm for the detection of the Tau 441 peptide. The algorithm could identify two clusters in both datasets. For the total serum samples, the first cluster contains most of the information from the 100 pM, 1000 pM, and 10000 pM samples, while the second encompasses most of the 1 pM samples. The 0 pM, 0.1 pM, and 10 pM samples were randomly distributed between the two clusters. Despite correctly grouping the higher concentration samples, the algorithm was not capable of isolating the samples without peptide.
However, for the 1:2 dilution dataset, the clustering output was different: cluster one gathered 87.5% of all absence samples, while cluster two encompassed most of the samples corresponding to the presence of the peptide. The misclassification rate for this dataset was about 12%, which means that there is a clear distinction between the two types of samples (absence/presence of peptide).
Table 6 shows the results of the peptide identification task. There was no significant drop in performance in the test set compared with the accuracy in the training set, which means that the models did not overfit. The accuracy was lowest for the 1 pM samples and increased with the concentration.
The f1-score assumed a similar value to the accuracy indicating that the model is also capable of distinguishing each one of the peptides with the reported performance.
Table 7 presents the results of the regression analysis to quantify the peptide amount. The algorithm could model the increase in concentration with an r2 of 0.98 and an RMSE of 6.03. The discrepancy between the value predicted for the highest concentration and the real value may be explained by the fact that the model was trained with the logarithmic concentration values: in this scale, the difference between the predicted and the actual value is minimal. A higher precision could be achieved by training the model with a larger variety of concentrations.
Lyophilized 50 μg of the recombinant human Tau-441 (AnaSpec, Fremont, CA, USA, Model #AS-55556-50), liquid 20 μg of the Phosphorylated recombinant human Tau-441 protein (Abcam, Cambridge, UK, Catalog #ab269024), lyophilized 0.5 mg of synthetic Amyloid-beta 1-42 (AnaSpec, Fremont, CA, USA, Model #AS-24224) and Amyloid-beta 1-28 (AnaSpec, Fremont, CA, USA, Model #AS-24231) peptides were prepared following the manufacturer's recommendations. The peptides were thawed at room temperature (RT) before being reconstituted. An aqueous solution of 10 mM NaOH was freshly prepared and filtered (using a 0.02 μm syringe filter) to use as the solvent for the Amyloid-beta 1-42 and Amyloid-beta 1-28 peptides preventing the formation of pre-aggregates. A solution of phosphate-buffered saline (1× PBS) was used to dissolve the Tau-441 peptide.
The Amyloid-beta peptides were initially dissolved by adding 40 μL of 10 mM NaOH, and the Tau-441 by adding 40 μL of 1× PBS to the powder peptide. The phosphorylated form of the Tau-441 was already dissolved. This step was followed by immediate dilution with 1× PBS solution to a concentration of approximately 1 mg/mL or less. The solutions were gently vortexed to mix. The serial peptide concentrations were prepared by diluting the peptides in pooled human serum or in a solution with the same pooled human serum diluted in a ratio of 1:2 in a 1× PBS solution. Each concentration prepared was resuspended several times before use. The remaining stock solution was aliquoted and stored at −80° C.
Human serum pooled gender (BioIVT, Model #HMN320377A, samples #HMN350432 to #HMN350436) processed from whole blood collections was used in the experiments. The samples were stored at −80° C. and, prior to use, the pooled human serum aliquots were thawed on ice to prepare serial dilutions of the peptides. Peptide dilutions were prepared both in the pooled human serum medium and in a solution of pooled human serum diluted in a ratio of 1:2 in 1× PBS.
Samples were diluted following the appropriate dilution factor to meet the concentrations of table 1 and according to the scheme of
Two types of experiments were conducted to demonstrate the method. The first experiment involved peptide detection, differentiation, and quantification. It was designed to show the capability of the method and apparatus for detecting peptides in a complex liquid dispersion sample, such as human serum or plasma. The limit of detection in terms of peptide concentration was tested, as was the ability of the method to identify the spiking of different peptides/proteins in complex media (human serum) at the same concentration. The first experiment also shows the performance in identifying different peptides when present at the same concentration in a complex fluid, and the capability of quantifying the peptide concentration present in the analysed dispersion.
Metabolite detection and quantification. This second experiment was designed to demonstrate the method's capability of detecting metabolites in a complex liquid dispersion sample, such as human serum or plasma, the corresponding limit of detection in terms of metabolite concentration, and, finally, its capability of quantifying the metabolite concentration present in the analysed dispersion.
The peptide detection, differentiation, and quantification tests were conducted for five peptides/proteins: C-Reactive Protein (CRP), Interleukin-6 (IL-6), Amyloid-beta 1-40 (Aβ1-40), Galectin-1, and Transthyretin (TTR). CRP and IL-6 are key inflammatory molecules widely associated with acute inflammation as well as with the severity and progression of chronic conditions, like cancer and COVID-19. Besides its association with cancer, Galectin-1 has several emerging roles in cardiovascular diseases including acute myocardial infarction, heart failure, Chagas cardiomyopathy, pulmonary hypertension, and ischemic stroke. The ratio of Aβ1-40/Aβ1-42 in blood-derived samples has been shown to predict individual brain amyloid-β-positive or -negative status determined by amyloid-β-PET imaging and is used for the diagnosis of Alzheimer's disease.
Previously, it has been reported that the technology detects and quantifies Aβ1-42. Here, we explored the detection and quantification of Aβ1-40. Lastly, TTR transports the thyroid hormone thyroxine (T4) and retinol-binding protein (RBP) in serum and cerebrospinal fluid. Pathogenic mutations in TTR decrease the stability of its tetramers, enhancing their dissociation into monomers. These monomers can self-aggregate into oligomers and protofibrils that assemble to generate insoluble amyloid fibrils. TTR mutations are therefore involved in several amyloidogenic diseases, such as transthyretin amyloidosis and familial polyneuropathy.
The peptide detection, differentiation, and quantification tests included spike-in experiments in which the peptides were diluted at predetermined concentrations in relevant biological suspensions. The tested concentrations for the peptides are presented in Table 8 and were determined considering the physiological concentration in human blood. In the particular case of TTR, only differentiation experiments were performed, to distinguish between wild-type TTR (wtTTR) and an amyloidogenic mutated form of TTR (TTR78). For each test, samples with distinct concentrations were analysed from the lowest to the highest concentration, using the same single probe. Aβ1-40 and TTR spike-in samples were prepared in phosphate-buffered saline (PBS); CRP samples in a solution of 4% bovine serum albumin (BSA) diluted in PBS, and in foetal bovine serum (FBS); IL-6 and Galectin-1 samples in human serum. CRP detection and quantification were further validated in human serum samples previously analysed using gold-standard laboratory methods. In total, 72 human serum samples were analysed, with a CRP concentration range of 0.3-628 mg/L and an average of 111.7±151.3 mg/L. The average age of the participants was 68±15 years, and 47% were male. A cleaning procedure (5% bleach followed by water) was applied between sample acquisitions to prevent cross-contamination from one sample to the next.
Metabolite detection and quantification tests were performed for glucose and insulin, in human samples previously quantified using gold-standard methods. Additionally, a surrogate method was developed to detect urinary creatinine from the analysis of human serum samples (indirect measurement of urinary creatinine). Samples were collected from 56 patients at two independent timepoints (4 months apart), totalling 112 samples for each detection and quantification test. The average age of the participants was 55±8 years, and 43% were male. Glucose concentration levels in serum samples ranged from 80 mg/dL to 139 mg/dL with an average of 108±12 mg/dL, while insulin concentration varied from 3 μU/mL to 123 μU/mL with an average of 17±16 μU/mL. Creatinine concentration values in urine samples ranged from 352 mg/L to 2924 mg/L, with an average of 1458±554 mg/L.
Lyophilized 1 mg of the native C-reactive protein (Cloud-Clone Corp, Wuhan, China, Catalog #NPA821Hu02), 0.5 mg of the Amyloid-beta 1-40 (AnaSpec, Fremont, CA, USA, Catalog #AS-24235), 5 μg of the Interleukin-6 (PeproTech, Rocky Hill, NJ, USA, Catalog #200-06) and 10 μg of Galectin-1 (PeproTech, Rocky Hill, NJ, USA, Catalog #450-39) were prepared following the manufacturer's recommendations. The peptides were thawed or maintained for 15 minutes at room temperature (RT) before being reconstituted. An aqueous solution of 10 mM NaOH was freshly prepared and filtered (using a 0.02 μm syringe filter) for use as the solvent for the Amyloid-beta 1-40, preventing the formation of pre-aggregates. After being initially dissolved, the Aβ1-40 was immediately diluted with a solution of phosphate-buffered saline (1× PBS) to a concentration of approximately 1 mg/mL or less. CRP, IL-6, and Galectin-1 were reconstituted in a solution of 1× PBS.
The serial peptide concentrations were prepared by diluting the peptides in the biologically relevant solutions previously mentioned which were further diluted in a ratio of 1:2 in 1× PBS solution for analysis. Each concentration prepared was resuspended several times before use. The remaining stock solution was aliquoted and stored at −80° C.
Human serum pooled gender (BioIVT, Catalog #HMN320377A, samples #HMN350432 to #HMN350436) processed from whole blood collections was used to do the experiments. The samples were stored at −80° C. and, prior to use, the pooled human serum aliquots were thawed on ice to prepare serial dilutions of the peptides. Peptide dilutions were prepared in a solution of pooled human serum diluted in a ratio of 1:2 in 1× PBS.
Human samples used to directly detect and quantify peptides and metabolites were thawed on ice, diluted in a ratio of 1:2 in 1× PBS and analysed. For spike-in experiments, human serum pooled gender samples (BioIVT, Catalog #HMN320377A, samples #HMN350432 to #HMN350436) were thawed on ice prior to the preparation of the serial dilutions with peptides. In all conditions, the pooled serum was kept at a ratio of 1:2 in 1× PBS.
Artificial Intelligence methods for detection and quantification of peptides and metabolites.
The model was trained to distinguish between the presence and absence of the different peptides in the solutions (binary problem). A distinct model was built to detect each one of the peptides. The “absence class” was composed of acquisition samples of serum without the spiked peptide, whereas the “presence class” was composed of acquisition samples of serum with the added peptide at different concentrations, depending on which of the peptides was to be detected. In experiments where the “absence class” had a smaller number of samples, the “presence class” was randomly under-sampled to build a balanced training set. The model used to perform the classification was the Support Vector Machine (SVM), since the SVM can deal with both linear and non-linear input data and is very suitable for high-dimensionality problems. The SVM can distinguish between two different groups by finding a separating hyperplane with a maximal margin between the classes. Three general attributes define the SVM classifier: C, a hyper-parameter which controls the trade-off between margin maximization and error minimization; the kernel, a function that maps the training data into a high-dimensional feature space; and sigma, which controls the width of the kernel. Several combinations of these parameters were tested to find the optimal model. Each model was trained using a cross-validation strategy. The optimal model was chosen based on the accuracy across all the validation folds.
Since each acquisition was divided into epochs and the features calculated from these epochs were fed into the AI model, a prediction was made for each one of the epochs. However, the goal was to evaluate the performance of the model in detecting the presence of the peptide at different concentrations. Thus, three different methods can be considered to calculate this performance.
Epoch accuracy: accuracy of the binary classification considering each epoch for each concentration.
Probability of presence: Median probability of detecting the peptide across all the samples corresponding to the same concentration.
Most frequent performance: Obtained through the plot of the histogram of the predicted detection probabilities across all samples. The performance for each concentration is the bin with the most counts, that corresponds to the most frequently predicted probability range.
Peptide Differentiation
A supervised learning pipeline was developed to distinguish between types of peptides. A different model was created to differentiate between each pair of the peptides and the metabolites. The supervised learning algorithms used were support vector machines (SVM) and random forests (RF). The models were trained using cross-validation. Every model was optimized to find its best parameters, according to the accuracy across the validation folds.
Performance evaluation
Each optimized model was tested in the held-out test set (30% of the whole dataset), and its performance was evaluated by computing a complete metrics report. Due to the small number of samples present in the test set, metrics were calculated without epoch grouping, meaning that epochs were considered independent from each other. The report included the area under the receiver operating characteristic curve (AUROC), Accuracy, Precision, and Recall.
Regression Analysis
One of the methods used to determine the concentration of the peptides/metabolites was based on the application of supervised learning regressors: Random Forest Regressor and Support Vector Machine. A cross validation strategy was used to train each model. The model performance was evaluated using the r2 coefficient, and the best model was chosen according to the evaluation. The training and test samples were divided randomly. The training set encompassed 70% of the samples, while the test set represented 30% of the samples.
The error in the regressor predictions was measured using the Root Mean Squared Error of the logarithmic concentration values (RMSE), the Mean Absolute Error (MAE) and the r2 coefficient.
One of the alternative methods applied to obtain information about the concentration of the metabolites was based on the application of a supervised learning classifier, the Support Vector Machine. For CRP, Glucose and Creatinine, the data was split into different classes that represent different concentration ranges. For example, the CRP data was split in two different ways: <100 mg/L vs >=100 mg/L and <=25 mg/L vs >=100 mg/L. In other words, the data was split once with a single shared threshold (100 mg/L) and once with a concentration gap between the two classes. Other concentration thresholds were also applied to define new classes for the other evaluated analytes (Glucose, Creatinine, and Insulin), based on the concentration ranges available.
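The two ways of splitting the CRP data into concentration classes can be illustrated with the short sketch below. The function name `label_ranges`, the use of -1 to mark samples that fall in the gap between classes, and the example concentrations (apart from the thresholds quoted above) are assumptions.

```python
import numpy as np


def label_ranges(conc, low_max, high_min, low_inclusive=True):
    """Label samples as low (0) or high (1); -1 marks samples in the gap."""
    conc = np.asarray(conc, dtype=float)
    low = conc <= low_max if low_inclusive else conc < low_max
    high = conc >= high_min
    labels = np.full(conc.shape, -1, dtype=int)
    labels[low] = 0
    labels[high] = 1
    return labels


# Illustrative CRP concentrations in mg/L.
crp = [0.3, 25.0, 60.0, 99.9, 100.0, 628.0]

# Single threshold: <100 mg/L vs >=100 mg/L.
close_split = label_ranges(crp, low_max=100.0, high_min=100.0, low_inclusive=False)

# With a gap: <=25 mg/L vs >=100 mg/L; samples in between are excluded.
gap_split = label_ranges(crp, low_max=25.0, high_min=100.0)
gap_mask = gap_split >= 0  # keep only samples that received a class label
print(close_split, gap_split, gap_mask.sum())
```

With the gap split, samples between the two ranges are left out of both training and testing, which makes the two classes easier to separate than with a single threshold.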
A cross validation strategy was used to train the model. The training and test samples were divided randomly. The training set encompassed 70% of the samples, while the test set represented 30%. A binary classification approach based on the distinction of ‘low’ versus ‘high’ concentration levels was run for all peptides. An additional multiclass (‘low’ vs ‘medium’ vs ‘high’) classification was applied for Glucose.
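The random 70%/30% split followed by cross-validation on the training portion can be sketched as index bookkeeping. The number of folds is not given in the text, so the value of five used here is an assumption, as are the fixed seed and the function names.

```python
import random
from typing import Iterator, List, Tuple


def train_test_split_indices(n_samples: int, test_fraction: float = 0.3,
                             seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle the sample indices and hold out test_fraction of them."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(train_indices: List[int],
           k: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (fit, validation) index lists.

    Every training index appears in exactly one validation fold.
    Assumption: k = 5; the source does not state the number of folds.
    """
    folds = [train_indices[i::k] for i in range(k)]
    for i in range(k):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, folds[i]


train, test = train_test_split_indices(100)
for fit, validation in k_fold(train):
    # A model would be fitted on `fit` and scored on `validation` here;
    # the held-out `test` indices are only used once, after optimization.
    pass
```

Keeping the test indices out of the cross-validation loop is what allows the final report to be computed on samples the optimized model has never seen.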
The optimized model was tested in the held-out test set (30% of the whole dataset), and its performance was evaluated by computing a complete metrics report. The performance report included the Accuracy, Precision, Recall and Specificity scores. Particularly for the binary classification, the area under the receiver operating characteristic curve (AUROC) was also calculated.
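Specificity, and the way precision, recall and specificity extend to the three-class Glucose task, can be illustrated by treating one class at a time as the positive class. The text does not say how the multiclass scores were aggregated, so this one-vs-rest reading is an assumption; the function name and the example labels are hypothetical.

```python
from typing import Dict, Sequence


def one_vs_rest_report(y_true: Sequence[str], y_pred: Sequence[str],
                       positive: str) -> Dict[str, float]:
    """Precision, recall and specificity with `positive` against all other classes."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # Specificity: fraction of non-positive samples correctly rejected.
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


# Hypothetical true and predicted concentration classes, for illustration only.
truth = ["low", "low", "medium", "medium", "high", "high"]
preds = ["low", "medium", "medium", "medium", "high", "low"]
for label in ("low", "medium", "high"):
    print(label, one_vs_rest_report(truth, preds, label))
```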
Results are divided into the following sections: peptide detection, peptide differentiation, and peptide/metabolite detection and quantification.
A unique model was developed for the detection of each peptide. The results of each model will be presented separately.
The probability of detecting Galectin in solutions that contain Galectin is higher than 80% independently of the concentration, meaning that the classifier is confident in its predictions (see
A distinct model was built for each peptide differentiation task. The method can be used to differentiate peptides with an accuracy above 90%. The results for the tested classification tasks are presented and discussed below.
The results of the differentiation between the wild-type TTR (wtTTR) and the mutated TTR78 are presented in Table 9. As observed, the SVM model achieved values above 90% regarding all the performance metrics.
The results of the differentiation between Galectin and IL-6 in the held-out test set are presented in Table 10. All metrics are close to 100%, showing that the model can confidently distinguish between the two peptides.
The results of the different quantification tasks are presented below. A different model was developed for each different metabolite/peptide. The methodologies used for the quantification varied: a regression model was used for amyloid-beta 1-40, IL-6 and Galectin, whereas quantification based on concentration ranges was used for the CRP, Glucose, Creatinine, and Insulin.
A unique model was developed for each one of the different peptides: C Reactive Protein (CRP), Amyloid-beta 1-40, IL-6, and Galectin.
C Reactive Protein (CRP). As observed in Table 11, the model obtained good performance overall. The performance is higher when there is a concentration gap between the two classes, and the task proved to be easier in FBS than in plasma.
Amyloid-beta 1-40. The performance of the regression model for the quantification of amyloid-beta 1-40 in the held-out test set is shown in Table 12. The r2 coefficient is 0.65, meaning that the model captures the overall trend of the real data points but leaves a substantial share of the variance unexplained. Although the model captures a relationship between the concentrations of the solutions, it does not do so very accurately, since the MAE is high. Table 12 depicts the predictions made by the model, the trendline that fits them (dashed line), and the desired relationship (dotted line, light blue).
IL-6. Table 13 shows the metrics report for the performance of the IL-6 quantification model in the held-out test set. The r2 coefficient is 0.93, indicating that the model explains most of the variance in the data and can therefore effectively model the relationship between the optical fingerprint and the peptide concentration. The low values of the RMSE and MAE corroborate this. Table 19 shows the model predictions and the corresponding error bars, the trendline that fits them, and the ideal line constructed with the perfect predictions. The fitted line is close to the ideal one, since the errors are small. However, the error bars are larger for the small concentrations, showing that quantification is harder for those values.
Galectin. Table 14 presents the complete metrics report for the results of the galectin quantification in the test set. The prediction errors are small: both the RMSE and MAE are below one. The r2 coefficient is 0.97, showing that the model can successfully quantify the peptide. Based on
Metabolites. The results for the quantification of metabolites using concentration ranges are presented hereafter. A different model was developed for each task: quantification of Glucose, Urinary Creatinine, and Insulin. The second was an indirect measurement.
Glucose. As observed in Table 15, the model achieved very good performance for the gap threshold and even for the close threshold, with the area under the ROC curve always above 80%.
For the specific case of glucose, a multiclass classification model achieved a very satisfactory performance, which suggests that a regression algorithm may be achievable in the future (Table 16).
Urinary creatinine. The area under the ROC curve of the classifier for urinary creatinine is always above 80%. As expected, the performance increases when the gap between the two classes increases.
Insulin. As observed in Table 18, the AUROC of the classifier for insulin was above 75% for the close threshold (first row) and above 80% for the gap threshold classification problem.
Number | Date | Country | Kind |
---|---|---|---|
LU102007 | Aug 2020 | LU | national |

Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2021/073187 | 8/20/2021 | WO |