The present invention relates to an analyser for determining the relative importance of fractions of biological mixtures, a method of determining the relative importance of fractions of biological mixtures, a computer program comprising instructions which, when executed, cause an analyser to perform the method, a computer-readable medium comprising the computer program and a signal carrying the computer program.
It is well known to separate biological mixtures such as mixtures of proteins in tissue extracts into fractions in order to determine the amount of particular fractions with a certain quality for practical uses including scientific research into the constituents of the mixture or biomedical testing, for example to determine the nature of a tumour. In particular it is known to compare a plurality of different biological mixtures in order to determine the physiochemical properties which cause or indicate the different physiological conditions between the different biological mixtures.
Methods of separation can be mass spectrometric or chromatographic and include but are not limited to: capillary electrophoresis, gel electrophoresis, paper electrophoresis, ion-exchange chromatography, affinity chromatography, gel filtration, partition chromatography, adsorption chromatography and mass spectrometry.
Biological mixtures include but are not limited to: cell culture or tissue extracts of proteins, lipids, saccharides and nucleic acids (RNA and DNA), which may undergo prior purification to enrich the mixture with a single component e.g. all, or a representative of phosphoproteins, glycoproteins, nucleic acids containing certain sequences or nucleotide modifications or bound to certain proteins or prior digestion of mixture components e.g. treatment with proteolytic enzymes or restriction nucleases.
Such separation methods produce a plurality of fractions of the original mixture, each containing biomolecules characterised by a level of a certain physicochemical property. For instance, gel electrophoresis of a DNA fragment mixture separates the fragments by length, where parts of the gel can be considered fractions, and affinity chromatography of proteins produces fractions containing proteins of different binding affinity towards the carrier matrix. The quantity of a certain class of biomolecule in a fraction can be determined by spectrometric measurement of absorbed, reflected or emitted (as in fluorescence) light of one or more wavelengths, measurement of other optical properties including refractivity and polarization of light, and measurement of electric properties, including conductivity. The measurements may be preceded by specific or non-specific staining or radioactive labelling; for instance, a radioactively labelled oligonucleotide probe can be used to specifically detect a DNA fragment of interest in an agarose electrophoresis gel, while an intercalating dye would stain all nucleic acids non-specifically.
However, it is difficult to easily determine from the measurements of two or more different biological mixtures which particular fractions relate to the physiological differences between the different mixtures. This can be due to noise or systematic errors in carrying out the measurements induced by the instruments or the experimental protocol.
Various techniques have been used to reduce noise or otherwise clarify the results of chromatographic or mass spectrometric methods. Chromatograms and complex chromatographic patterns have been processed using different methods: principal component regression analysis (Jellum et al., J Pharm Biomed Analysis 9, (1991), 663-669), and Fourier transform combined with principal component regression to rapidly determine individual species in the sample (Cholli et al., U.S. Pat. No. 5,985,120). Improving the signal-to-noise ratio in electropherograms by binning measured data points into variable-size bins and subsequent Fourier filtering is described in Anderson, U.S. Pat. No. 5,098,536. T. G. Stockham and J. T. Ives in U.S. Pat. No. 5,273,632 disclose complex signal processing based on blind deconvolution and homomorphic filtering of electrophoretic signals. Szymanska et al., Journal of Pharmaceutical and Biomedical Analysis 43 (2007) 413-420 teach applying baseline correction, denoising, selection of a target sample, optimisation of electropherogram alignment, normalisation of obtained results by known creatinine concentrations and, finally, PCA to electrophoretic data. Shin and Markey, Journal of Biomedical Informatics 39 (2006) 227-248 is a review of machine learning approaches for use in mass spectrometry data and discusses the components of preprocessing, feature extraction, feature selection, classifier training and evaluation.
However, none of these known techniques can consistently remove all of the noise or systematic errors in the data. Thus there is a technical problem that current techniques result in a lack of clarity of filtered data, which makes determination of the relative importance of fractions of biological mixtures separated by a chromatographic or mass spectrometric method originating from cells or tissues with different physiological conditions difficult or impossible.
The inventive solution to this problem according to the invention comprises an analyser for determining relative importance of fractions in biological mixtures separated by a chromatographic or mass spectrometric method originating from cells or tissues with different physiological conditions, the analyser arranged to:
It has been found that by using an analyser carrying out steps a-f where a feature selection method, such as ReliefF, is carried out in the second attribute space, the removal of components relating to noise and systematic errors is facilitated and the identification of physiochemical attributes that correspond to differences in physiological conditions is improved.
Also provided is a method of determining relative importance of fractions in biological mixtures separated by a chromatographic or mass spectrometric method originating from cells or tissues with different physiological conditions, comprising:
As with the analyser according to the invention, this method of carrying out steps a-f, where a feature selection method such as ReliefF is carried out in the second attribute space, facilitates the removal of components relating to noise and systematic errors and improves the identification of physiochemical attributes that correspond to differences in physiological conditions.
Also provided are a computer program comprising instructions which, when executed, cause an analyser to perform the method; a computer-readable medium comprising the computer program; and a signal carrying the computer program. All of these share the same advantages as the method and apparatus mentioned above.
By way of a non-limiting example, an embodiment of the invention will now be described with reference to the accompanying drawings in which:
The embodiment herein described illustrates principles of the invention carried out on a typical biological problem, here a problem from plant developmental physiology—a comparison of proteins isolated from three types of in vitro grown tissues of horseradish (Armoracia lapathifolia Gillib.) that differ in physiological conditions—leaves, tumour and teratoma.
All analysed tissues related to this biological problem (leaf, tumour and teratoma) are to be compared with regard to their protein expression patterns. All tissues were of the same genetic origin; tumours were induced on leaf fragments with Agrobacterium tumefaciens B6S3; teratoma, in the form of shoots with malformed leaves represented an unsuccessful way of tissue reorganization. A transition from one tissue pattern to another depends on modifications of gene expression; consequently changes in the proteome, a protein complement of the genome, should be visible in electrophoretic protein patterns.
In this embodiment of the invention in vitro grown horseradish (Armoracia lapathifolia Gillib.) leaves (L), tumour (T) and teratoma (Tr) tissue cultures were maintained on the solid MS nutrient medium without any growth regulator. Culture conditions were: 24° C., 16-h photoperiod and irradiation of 33 μmol m−2 s−1. Primary tumours had been induced on leaf fragments with a wild octopine strain B6S3 of Agrobacterium tumefaciens, according to Horsch et al. (Transgenic plants. Cold Spring Harb Symp Quant Biol 1985, 50, 433-437.) During sub-culturing two morphologically different tissue lines were established: one, unorganized tumour line (T) and the other, shoot-producing teratoma line (Tr).
Soluble proteins were extracted from tissues in the exponential phase of growth (12 days after subculturing). Tissue samples were homogenised in ice-cold 0.1 M Tris/HCl buffer (pH 8.0) containing 17.1% sucrose, 0.1% ascorbic acid and 0.1% cysteine/HCl. The tissue mass (g) to buffer volume (ml) ratio was 1:5 for leaves, 1:1.2 for teratoma and 1:0.9 for tumour tissue. Insoluble polyvinylpyrrolidone (ca. 50 mg) was added to tissue samples before grinding. The homogenates were centrifuged for 15 min at 20 000×g and 4° C. The supernatants were ultracentrifuged for 90 min at 120 000×g and 4° C.
The protein content of the supernatants was determined according to the Bradford method using bovine serum albumin as a standard. Samples were denatured by heating for 3 min at 100° C. in 0.125 M Tris/HCl buffer (pH 6.8), containing 5% (v/v) β-mercaptoethanol and 2% (w/v) SDS (sodium dodecyl sulphate). For SDS-PAG-electrophoresis 12 μg of proteins per sample were loaded onto the gel.
As shown in
The SDS electrophoresis in 12% T (2.67% C) polyacrylamide gels, with the buffer system of Laemmli (1970), was run in a Biorad Protean II xi cell at 100 V for 45 minutes and at 220 V for a further four hours.
It is believed that a number of repeated measurements (3 as a minimum) is needed for each tissue type, and/or for each measurement condition (gel batch, position on a gel) that is suspected to cause systematic errors. Therefore in the example measurements were carried out on six samples from each of the tissue cultures (L, T and Tr) resulting in 12 gels in total. Protein bands were visualised by silver staining (Blum et al. 1987).
Each gel produces 4 columns (or “lanes”) for each of the three tissues (outer left, inner left, inner right and outer right). The gels were scanned on an Umax Astra 2200 scanner with the resolution set to 300 dpi. An extract from one of the scanned gels is shown in the centre of
To obtain the measurements of physiochemical attributes of the plurality of tissues with first, second and third physiological conditions in a computer readable format, i.e. in the form of a data set in a first attribute space, three line profiles of each lane (a part of the gel with separated proteins of one sample) were created using the UTHSCSA Image Tool 3.00 software and exported to text files at step 102 (
At this stage the data set comprises a large matrix with data representing the coloration intensity of each pixel along each of the three line profiles for each of the four gel positions of the six gels for each of the three tissue types, i.e. a matrix with 216 rows representing the protein profiles and numerous columns representing the pixel number, with each element of the matrix representing the coloration intensity of the respective pixel in the respective protein profile.
In order to reduce the number of columns in the matrix, the profiles were split into windows of the optimal size in step 103 (
Optimal window size is determined by forcing simultaneously high log-likelihood for the unsupervised test and high ratio of accuracy to number of overlapping windows in a supervised test as depicted in
The unsupervised test was performed using the expectation maximization algorithm, 100 times for each z with different random seeds. The highest average log likelihood ratio over the 100 runs would indicate the optimal z.
The supervised test was performed using the k nearest neighbour algorithm (kNN classifier), which was used to classify data by tissue using datasets with different z values; the optimal z being the one with the highest kappa statistic in 10 runs of tenfold cross-validation. These results were compared with the results obtained using SVM algorithm in the same fashion, as shown in
Once optimal window size is determined, the individual measurements are binned into windows according to the optimal windowing scheme.
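The kappa statistic that ranks candidate z values in the supervised test can be computed from true and predicted tissue labels. The following is a minimal sketch of Cohen's kappa under that reading; the function name `cohen_kappa` and the label strings are illustrative assumptions, and the classifier and cross-validation that would produce the predictions are not shown.

```python
from collections import Counter


def cohen_kappa(actual, predicted):
    """Cohen's kappa: agreement between two labelings, corrected for chance."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(actual)
    # Fraction of samples on which the classifier agrees with the true label.
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    # Agreement expected by chance from the marginal label frequencies.
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    expected = sum(actual_counts[c] * predicted_counts[c]
                   for c in actual_counts) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance, which is why it is a more cautious score than raw accuracy when comparing window sizes.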
In this case the line profiles were split into overlapping windows, each spanning 1/z of the profile length, where the overlap between adjacent windows was half of the window size. The total number of windows per line profile was therefore 2z−1; for each window the arithmetic mean of pixel coloration intensities was computed. This procedure was necessary because of inevitable inconsistencies in the gel structure that cause areas in the profiles to seem slightly ‘compressed’ or ‘expanded’ in comparison with other samples. There are also slight variations in the total lane length, making a pixel-by-pixel comparison infeasible. Smaller windows (larger z) preserve more information but make the method more sensitive to shifts as described above; larger windows (smaller z) are more robust but less informative. The parameter z was systematically varied from 16 to 256 in steps of 8 to find an optimal window size. Overlapping windows were used instead of simply consecutive ones because a relevant protein band can be positioned exactly over a window border. Because of the slight local shifts, the same band could then be read as a part of one window in one sample and as a part of the following window in another. In these cases, the overlapping window spanning the border would contain the band of interest.
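The windowing described above can be sketched as follows. This is an illustrative sketch, not the embodiment's actual code: the function name `window_means` is hypothetical, and the use of integer floor boundaries for windows whose length is not a whole number of pixels is an assumption, since the text does not say how fractional boundaries were handled.

```python
from statistics import mean


def window_means(profile, z):
    """Average a line profile over 2*z - 1 half-overlapping windows.

    Each window spans 1/z of the profile and starts half a window after
    the previous one, so a band on one window's border lies inside the
    neighbouring window.
    """
    n = len(profile)
    if z < 1 or n < z:
        raise ValueError("need at least one pixel per window")
    means = []
    for i in range(2 * z - 1):
        # Boundaries in integer arithmetic: half a window is n / (2*z) pixels.
        start = i * n // (2 * z)
        end = (i + 2) * n // (2 * z)
        means.append(mean(profile[start:end]))
    return means
```

For example, `window_means(list(range(8)), 2)` yields three windows covering pixels 0-3, 2-5 and 4-7.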
After computation of mean window intensities, a median of corresponding windows in the three profiles for each lane was determined to lessen the influence of gel irregularities on the intensity scores, resulting in one floating-window profile with 2z−1 attributes per sample. The datasets were then standardized, so that the windows of a single sample had a mean of 0 and standard deviation of 1; this was done to decrease the influence of staining variation. The data sets, in this embodiment 72 protein profiles (24 replicas of each tissue), were labelled by (i) the tissue type (leaf, teratoma or tumour), (ii) the gel batch number (1-6) or (iii) by lane position on the gel (outer left, inner left, inner right or outer right).
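The median and standardisation steps can be sketched in a few lines. The function name `lane_profile` is hypothetical, and the use of the population standard deviation is an assumption; the text only states that each sample ends up with a mean of 0 and a standard deviation of 1.

```python
from statistics import mean, median, pstdev


def lane_profile(window_profiles):
    """Combine the windowed line profiles of one lane into one standardised profile.

    window_profiles: equally long lists of window means, one per line profile.
    """
    # Window-by-window median across the line profiles of the lane.
    medians = [median(values) for values in zip(*window_profiles)]
    mu = mean(medians)
    sigma = pstdev(medians)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("a constant profile cannot be standardised")
    # Rescale so the windows of this sample have mean 0 and deviation 1.
    return [(m - mu) / sigma for m in medians]
```

Taking the median rather than the mean means that a single profile crossing a gel irregularity does not shift the window score.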
A diagrammatic illustration of windowing is shown in
Having carried out windowing and computed the median of the three profiles per lane, the dataset is reduced to a more manageable size with 72 rows and the same number of columns as windows i.e. 111.
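The dimensions quoted in the description can be checked against the experimental design. The arithmetic below is only a consistency check; the value z = 56 is inferred from the 111 windows and is not stated explicitly in the text.

```latex
\begin{align*}
\text{lanes} &= 3\ \text{tissues} \times 6\ \text{gels} \times 4\ \text{positions} = 72,\\
\text{line profiles} &= 72 \times 3 = 216,\\
\text{windows per profile} &= 2z - 1 = 111 \;\Rightarrow\; z = 56.
\end{align*}
```

The inferred z = 56 lies on the search grid of 16 to 256 in steps of 8.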
The fixed representation of the reduced dataset can be used to build a classification model at step 105 (
The reduced data set is then projected into a second attribute space using a projection technique such that the projected data is described as a plurality of components mathematically constructed from the original data set. In this example the projection technique used at step 104 (
However, it is believed that other projection techniques that create new attributes by combining, in a linear or non-linear fashion, the original attributes would work equally well. For example correspondence analysis, independent component analysis (ICA), linear discriminant analysis (LDA), kernel PCA, autoencoders and similar encoding/decoding methods based on the neural network paradigm, as well as filtering techniques such as discrete cosine transform, discrete Fourier transform and wavelet transform could be used instead.
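As an illustration of projecting the reduced data set onto mathematically constructed components, the sketch below extracts the leading principal component by power iteration. It assumes the projection is a principal component analysis, as the later discussion of principal components suggests; the function name and the iteration count are hypothetical, and a real analysis would obtain all components from a numerical library rather than one at a time.

```python
import math


def first_principal_component(rows, iterations=500):
    """Return (direction, scores) for the leading principal component.

    rows: list of samples, each a list of attribute values.
    The sign of the direction is arbitrary.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Start from a uniform unit vector; this fails only if it happens to be
    # exactly orthogonal to the leading component.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        # One multiplication by X^T X, done without forming the matrix.
        scores = [sum(x * w for x, w in zip(row, v)) for row in centred]
        w = [sum(row[j] * s for row, s in zip(centred, scores))
             for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    # Coordinates of every sample along the component (the new attribute).
    scores = [sum(x * w for x, w in zip(row, v)) for row in centred]
    return v, scores
```

Each returned score is one sample's value for the new attribute, so the column of scores is what a feature selection method would evaluate in the second attribute space.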
An optional step (106a,
The first three columns in
Next, in step 106b (
ReliefF operates on subsets of data chosen by a locality criterion; the neighbourhood size parameter was set to k=3. This heuristic approach quantifies an attribute's merit in the context of possible non-linear interactions between attributes. This is in contrast to scoring each attribute without consideration of other attributes, as is the case with ‘myopic’ measures like the Student's t-statistic. A single run of tenfold cross-validation in the Weka Explorer module was employed to assess reliability, where in each iteration ReliefF was run on 9/10 of the dataset (class distribution was preserved), and the average scores and ranks, as well as the maximum deviations from the average, were recorded.
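The locality-based scoring can be illustrated with a simplified ReliefF. This is a sketch under stated assumptions and not the Weka implementation used in the embodiment: it visits every instance once, uses a range-normalised Manhattan distance, and the function name `relieff` is hypothetical.

```python
def relieff(rows, labels, k=3):
    """Simplified ReliefF merit score for each attribute.

    An attribute gains merit when it differs between an instance and its
    nearest neighbours of other classes, and loses merit when it differs
    between an instance and its nearest neighbours of the same class.
    """
    n, d = len(rows), len(rows[0])
    # Attribute ranges scale every difference into [0, 1].
    spans = []
    for j in range(d):
        column = [r[j] for r in rows]
        spans.append((max(column) - min(column)) or 1.0)
    priors = {c: labels.count(c) / n for c in set(labels)}

    def diff(j, a, b):
        return abs(a[j] - b[j]) / spans[j]

    def distance(a, b):
        return sum(diff(j, a, b) for j in range(d))

    weights = [0.0] * d
    for i, (row, label) in enumerate(zip(rows, labels)):
        for cls, prior in priors.items():
            # Up to k nearest neighbours of class cls, excluding the instance.
            members = [rows[m] for m in range(n)
                       if m != i and labels[m] == cls]
            nearest = sorted(members, key=lambda o: distance(row, o))[:k]
            if not nearest:
                continue
            for j in range(d):
                avg = sum(diff(j, row, o) for o in nearest) / len(nearest)
                if cls == label:
                    weights[j] -= avg / n  # near hits should look alike
                else:
                    # Near misses, weighted by the prior of the other class.
                    weights[j] += prior / (1 - priors[label]) * avg / n
    return weights
```

Because neighbours are chosen using all attributes together, an attribute's score reflects its usefulness in the local context, unlike a per-attribute t-statistic.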
Although in this embodiment ReliefF was the chosen feature selection method, other feature selection methods that evaluate relative importance of attributes could be applied in this invention. These include, but are not limited to: techniques based on conditional entropy measures (information gain, Chi-squared score, Gini index, and similar), techniques involving a program routine (wrapper) that performs a number of classification or regression experiments involving a supervised machine learning method where one or a set of attributes are left out in each experiment, or other feature selection methods operating on local class boundaries, as exemplified in the Relief method family adapted to noisy, incomplete data sets and/or data sets with mutually dependent features.
The fourth to sixth columns headed “merit” show the ReliefF scores of each of the 13 principal components based on each of the labels, where each full 0.05 in the score equals one dot, and each full 0.025 equals half a dot. The most important scores from the point of view of the invention are the scores in the “tis” (tissue type) column as these show which of the principal components correlate most strongly with the different tissue types (i.e. have value distributions that show the biggest difference based on the different “tissue” labels). Thus it can be seen that the three principal components with the most relevant data for distinguishing between tissue types are principal components 1, 6 and 7 (which have the highest number of dots in the “tis” column).
On the other hand, although principal component 2 contains the second largest amount of data (12.8% var) the data it contains is not useful for distinguishing the tissue type and principal components 3, 4 and 5 appear to include data which is more related to systematic errors induced by the differences between gels used rather than the type of tissue.
Accordingly in this embodiment at step 106b (
The next step 107 (
Also of academic interest may be the back-projected data sets under the heading “PCs 1-13 not in set”. These show the back projection of the principal components filtered out of the sets to their left, i.e. in the row labelled tissue where the set comprises PCs 1, 6 and 7, PCs 2-5 and 8-13 are shown. Classification accuracy in relation to all of the data in
Although there is a greater contrast between the three lanes in the back-projected artificial gels shown in
However, in step 109 (
It can be seen that for determining the most important fractions to distinguish leaf from the transformed tissues (teratoma and tumour) (left-hand side chart), the white bars are not a great deal taller than the black bars. This indicates that for distinguishing between these samples (which are relatively different physiologically and physiochemically) the method has not been exceptionally useful, although it has revealed that the fractions in the region of window 60 are important which could warrant further scientific investigation.
On the other hand, it can be seen that in order to determine the most important fractions that distinguish the teratoma from the tumour (a more complex problem in view of the greater similarity between these physiological conditions and one where visual inspection of the gels reveals no characteristic patterns) the method of the invention has strongly improved the results. The average ReliefF score of the top 20 windows in the filtered back-projected data is 0.339 compared to 0.115 in the raw data and the height of the white bars is clearly much greater than that of the black bars.
The three plots at the right hand side of
Having identified that these windows are most important, the proteins in these windows could be isolated from the gel and further tests carried out.
Alternatively, if for example the biological mixtures that had been studied were two different types of cancer with different physiological conditions, one of which reacted to a drug, the other of which did not, but which were indistinguishable otherwise, having identified the most important fractions to distinguish between them, it would be possible to build a reliable model to discriminate between the classes (step 111,
Referring to
The analyser 10 includes a controller 11, an input 12, a computation engine 13, storage 14 and an output 15. The controller 11 controls overall operation of the analyser 10.
The input 12 obtains measurements of physiochemical attributes for cells or tissues. In the abovementioned description, the measurements of data relating to biological mixtures 23 are obtained from a measurement device 16 and scanner 17; the measurement device 16 consists of a Biorad Protean II xi cell. It could alternatively be another chromatographic instrument or a mass spectrometer, displaying measurements as an image which can be scanned by scanner 17. However, the measurement device 16 could equally output the measurements directly to the analyser, or could form part of the analyser 10.
In this case, if the measurement device is chromatographic it would include: a mobile phase supply system; a sampling system arranged to receive the biological mixtures 23 comprising first cells or tissues with first physiological conditions and second cells or tissues with second, different, physiological conditions; a stationary phase system; and
a detector arranged to detect the quantity of different fractions; whereby, measurements of physiochemical attributes of first cells or tissues with first physiological conditions and second cells or tissues with second physiological conditions, in the form of a first data set in a first attribute space are obtained from the detector, either by way of an output into the input 12 or by a direct feed to the controller 11.
Alternatively, if the measurement device comprises a mass spectrometer connected to the analyser 10, the results of the spectrometric detection would be outputted via an output in the mass spectrometer to the input 12. If the mass spectrometer forms part of the analyser 10, the results of the mass spectrometric detection could simply be fed directly to the controller 11.
As an alternative to inputting the measurements of physiochemical attributes to the analyser straight from the measurement device, the measurements could be stored and then obtained from a network 18, for example as an e-mail attachment or download, or from a data transfer device 19 such as a CD or USB mass storage device.
The computation engine 13 performs mathematical operations such as the feature selection method and projection techniques on the data sets in the first and second attribute spaces.
The storage 14 typically comprises a non-volatile memory such as an internal or external hard disk drive. The measurement information obtained by the input 12 can be written to the storage 14 for archiving if desired. A computer program 20 is stored in the storage 14 which, when executed, causes the analyser 10 to operate under the control of the controller 11. The computer program 20 may be received via the input 12, for example in a signal from the network 18 or as an executable file from a data transfer device 19.
The output 15 enables information processed by the analyser to be used by other entities and/or to be provided to an operator. For example, the analyser 10 can be connected to a printer 21 and/or a display 22.
Number | Date | Country | Kind |
---|---|---|---|
PCT/HR2007/000016 | May 2007 | HR | national |

Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/HR2008/000019 | 5/28/2008 | WO | 00 | 11/24/2009 |