This translation process consists in analyzing the raw data (fluorescence data) of an individual and interpreting it so as to obtain allelic marker information for each individual.
The principle is generally as follows: for a given marker, an allele (such as AA or BB for homozygous individuals, or AB for heterozygous individuals) is assigned to an individual based on the raw signal values. The signals associated with each individual can be plotted on a two-dimensional graph in which the individuals having alleles AA (resp. AB or BB) are localized in the same zone (generally an elliptic zone), which is named cluster AA (resp. AB or BB).
Such representation and clustering can be performed using any software known in the art, such as the Axiom Analysis Suite Version 1.1. (
Tools (based on mathematical modeling tools) for assigning an allele to each “individual/marker” pair are known in the art and provided by the manufacturers of the genotyping solutions (such as the manufacturers of the microarrays). In particular, Affymetrix provides algorithms based on models that are described in the literature (BRLMM: an Improved Genotype Calling Method for the GeneChip® Human Mapping 500K Array Set Revision Date: 2006 Apr. 14 Revision Version: 1.0; BRLMM-P: a Genotype Calling Method for the SNP 5.0 Array Revision Date: 2007 Feb. 13 Revision Version: 1.0; Birdseed). These tools may be parameterized by the user.
In other technologies, such as the technology developed by Illumina, the translation of raw data into allelic data is facilitated by the use of “masks”. The raw data is imported into the software, which performs automatic reading (automatic assignment of an allele for each individual/marker pair). From these readings, the end-user can save the equations of the ellipses corresponding to the position of the allele clusters for each event, in a so-called “cluster file”. This information (“mask”) can then be used when new individuals are to be genotyped.
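As an illustration of how such a mask can be applied, the sketch below assigns a genotype to a new individual by testing whether its signal coordinates fall inside one of the stored ellipses. The ellipse parameterization (center, semi-axes, rotation angle), the function names and the numeric values are assumptions made for this example only; actual cluster files use vendor-specific formats.

```python
import math

def in_ellipse(x, y, cx, cy, a, b, theta=0.0):
    """Return True if (x, y) lies inside the ellipse centered at (cx, cy)
    with semi-axes a and b, rotated by theta radians."""
    dx, dy = x - cx, y - cy
    # Rotate the point into the ellipse's own axis system.
    u = dx * math.cos(theta) + dy * math.sin(theta)
    v = -dx * math.sin(theta) + dy * math.cos(theta)
    return (u / a) ** 2 + (v / b) ** 2 <= 1.0

def call_from_mask(x, y, mask):
    """Assign the genotype whose ellipse contains the point, or None
    (no call) when the point falls in zero or in several ellipses."""
    hits = [g for g, e in mask.items() if in_ellipse(x, y, *e)]
    return hits[0] if len(hits) == 1 else None

# Hypothetical mask for one marker: genotype -> (cx, cy, a, b, theta)
mask = {
    "AA": (-1.0, 10.0, 0.4, 1.0, 0.0),
    "AB": (0.0, 10.5, 0.4, 1.0, 0.0),
    "BB": (1.0, 10.0, 0.4, 1.0, 0.0),
}
```

Because the ellipses are fixed, a point that drifts outside all of them simply receives no call, which is the rigidity discussed in the following paragraphs.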
It is to be noted, however, that the ellipse equations are fixed and that it is difficult to make them evolve or to customize them during analysis of the samples. Consequently, since the position of the clusters (equations of the ellipses) depends on the type and origin of the material that is genotyped (grain, leaf, grain mixture thereof, etc.), this technique may not accommodate sample variability in all cases.
A mask should thus be created for each type of material to be genotyped, which is time- and resource-consuming. Furthermore, even though these masks help with the assignment of a given allele to a given individual for a given marker, verification and manual correction are still needed for some markers (markers with a low CallRate or an aberrant level of heterozygosity).
Translating Raw Data into Allelic Data for a Sub-Set of Markers
This step is generally first performed on a subset of markers in order to optimize the assignment for all individuals.
Sample quality may also affect genotyping. In particular, if, in a batch of individuals, one or more samples are of poor quality, their presence can impact the quality of genotyping for all individuals of the batch. Indeed, presence of these individuals would move the localization of the clusters and hence lead to wrong assignment of alleles for some individuals.
The quality can be checked using various indicators that are known in the art.
The dQC (dish quality control) is an indicator used in the Affymetrix solution to detect contamination problems, taking into account both interchannel and intrachannel signal separation.
A commonly used indicator is the CallRate. At the individual level, it measures the percentage of markers for which allelic data could be obtained (i.e. an allele has been assigned to the individual for the marker). Consequently, the overall call rate of a sample is equal to the number of markers for which an AA, AB, or BB genotype is assigned, divided by the total number of markers on the chip.
A low CallRate for an individual may be due to various reasons, such as contamination of the sample or poor DNA concentration or poor quality DNA. It is therefore recommended not to keep the information associated with the individuals with a low CallRate.
More generally, it is recommended to remove the raw data for the low CallRate individuals before performing the genotyping analysis of the other individuals. Indeed, as indicated above, the presence of low quality individuals may add to the variability at the time of modeling (determination of the clusters) and thus impact the quality of allele assignment as a whole.
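The CallRate computation and the removal of low CallRate individuals can be sketched as follows. This is a minimal illustration: the data layout (one list of calls per individual, with None for a missing call) and the default threshold of 0.97 are assumptions, not values taken from any vendor documentation.

```python
def sample_call_rate(calls):
    """Fraction of markers for which a genotype (AA, AB or BB) was assigned."""
    assigned = sum(1 for c in calls if c in ("AA", "AB", "BB"))
    return assigned / len(calls)

def filter_samples(genotypes, min_call_rate=0.97):
    """Keep only the individuals whose call rate reaches the threshold.

    genotypes: dict mapping an individual to its list of calls.
    min_call_rate: hypothetical threshold, to be tuned per project.
    """
    return {
        ind: calls
        for ind, calls in genotypes.items()
        if sample_call_rate(calls) >= min_call_rate
    }
```

In practice the filtering would be applied to the raw data files before clustering is rerun, so that the poor-quality individuals no longer influence the position of the clusters.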
Translating Raw Data into Allelic Data for all Markers
After the previous step, translating raw data into allelic data is done for all markers. This step is performed as indicated above for the sub-set of markers, the differences lying in the fact that it is performed for all markers on individuals that have passed the quality control thresholds (such as sufficiently high dQC and CallRate).
Once allelic data has been obtained for all selected individuals, one can use scripts to classify markers so as to keep only the allelic data corresponding to the most reliable markers.
As an illustration, Affymetrix proposes the software SNPolisher_1.3.6.6, GTC 4.1.4 (described in Best Practices Workflow and SNPolisher for Custom Axiom Array Analysis, March 2013). Later versions of the SNPolisher software can be found at http://www.affymetrix.com/estore/partners_programs/programs/developer/tools/devnettools.affx
Using this software, the markers can be classified into six groups:
The “PolyHighResolution” markers correspond to the most reliable markers (for which the allele assignment is the most reliable).
The markers and their quality can be ranked using various indicators.
In particular, Affymetrix uses the indicators SNP CallRate (percentage of individuals for which an allele was assigned for the given marker), FLD (Fisher's Linear Discriminant), HetSO (Het Strength Offset) and HomRO (Homozygous Ratio Offset), as described in the Axiom Genotyping Solution Data Analysis Guide, available at http://media.affymetrix.com/support/downloads/manuals/axiom_genotyping_solution_analysis_guide.pdf
It is possible to find other indicators in the literature, such as the Silhouette index (Rousseeuw (1987) “Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis.” Journal of Computational and Applied Mathematics 20: 53-65), or the Davies-Bouldin Index (Davies and Bouldin (1979) “A Cluster Separation Measure” IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-1 (2): 224-227).
The purpose of all these tests and controls is to identify markers for which the clusters that correspond to the different alleles are clear (i.e. AA, AB and BB alleles groups are clearly separated from each other) and thus reliable.
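As an example of such a cluster-separation indicator, the mean Silhouette width of a marker can be sketched as below, using the standard definition s = (b - a) / max(a, b), where a is the mean distance from a point to the other members of its cluster and b the smallest mean distance to another cluster. The function name and the data layout (a list of signal coordinates and a parallel list of genotype labels) are assumptions for illustration; at least two clusters must be present.

```python
import math

def silhouette(points, labels):
    """Mean silhouette width over all points (values close to 1 indicate
    dense, well-separated clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    def mean_dist(p, members):
        return sum(math.dist(p, q) for q in members) / len(members)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(mean_dist(p, m) for k, m in clusters.items() if k != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)
```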
New Translation when Justified
As indicated above (and illustrated in
For these markers, one would find four clusters corresponding to alleles AA, BB, AB and Absence of Signal.
Affymetrix provides a tool adapted to these specific cases, namely an R function called OTV_Caller, which is known and available in the R SNPolisher package provided by Affymetrix.
The publication “Statistical methods for off-target variant genotyping on Affymetrix' Axiom® Arrays,” Teresa A. Webster, Alia Pirani, Mei-mei Shen, Laurent Bellon and Hong Gao describes this process.
Actually, analysis of these specific markers requires an extra step, which is to identify an allele class (cluster) that is graphically below the other alleles classes, and is sufficiently homogeneous and dense to be considered a true cluster and real signal rather than background noise.
Once this new cluster has been identified and qualified for a specific marker, the allele “absence of signal” can be assigned to the individuals in this cluster and the marker will thus be classified as a PAV or OTV marker.
Quality control is performed, for example through the use of commercially available scripts such as the ones provided by Affymetrix via the Ps_Metrics function of the SNPolisher package.
In brief, this will generate reports containing the CallRate and heterozygosity rate information, class of markers and the values associated with each indicator that were used to classify the markers.
Depending on predetermined thresholds, alerts can be provided to the user on reliability of the data provided for a given marker.
Issues Associated with the Current Methods
A major challenge of the high throughput genotyping techniques is to maximize the quality of allelic information obtained in the end, in order to optimize the quantity, reduce costs and time associated with these processes.
Currently the problems comprise:
It is thus necessary to improve the process of allocating alleles to individuals, to improve the quality control, in particular to improve the reliability of the markers that will be used, to obtain a method that would be less experiment-dependent.
Obtaining high quality marker data is critical for all marker-assisted selection applications. Low quality data will decrease the efficiency of plant breeding.
The above process will thus be modified at multiple steps, which can be used independently or in any combination:
Such process is illustrated on
It is however preferred when all these steps are combined together as this will largely improve the quality of the data obtained, as well as the quantity of information obtained (increase in the number of markers for which reliable information is obtained).
This process can be implemented within software comprising a combination of optimized libraries, which can accept as input any raw data, whatever the genotyping method used to obtain it, and which provides as output a visualization of the clusters and any other information, such as the genotyping information for the markers of the genotyping run and the individuals tested in the run, allele calling, quality status and flags, and tools for decision making.
An “allele” is one of a number of alternative forms of the DNA sequence at a given genetic locus.
An “array” or a “chip” is a DNA microarray which is used to detect polymorphisms within a population. It operates with
A “cluster” is the visual representation of the fluorescence signal on a two-dimensional graph for a group of individuals having the same genotype for a given marker. It is usually represented as an ellipse when the individual measures out of the chip are plotted on a two-dimensional graph.
An “individual” is a plant that is included in a genotyping program, and from which DNA is isolated and genotyped according to methods herein disclosed.
A “locus” is the specific location of a DNA sequence on a chromosome.
A “molecular marker” or “marker” is a gene or DNA sequence with preferably a known location on a chromosome for which different alleles can be revealed through the use of molecular protocols. It can be a short DNA sequence, such as a sequence surrounding a single base-pair change (single nucleotide polymorphism, SNP), or a long one, like microsatellites.
In summary, a marker makes it possible to discriminate between different alleles at a specific locus. Use of markers to do so in an individual consists in genotyping said individual (see below). An individual for which the alleles at different loci have been determined is said to be a “genotyped individual” for the markers associated with these loci.
A “run” is the part of genotyping experiment that consists of preparing the samples, applying them to the hybridization device, and acquiring the hybridization signal.
“Raw data” is what is obtained after performance of a run. It consists of the hybridization signals obtained for each marker and each individual.
A “program” is a “genotyping experiment”, including the run and the allocation of alleles to the individuals.
“Genotyping” means determining the combination of alleles for one marker or a set of markers. In the context of the present invention, genotyping may be performed on the whole genome of an individual, on a part of the genome, such as on one or more chromosomes, on a part of one or more chromosomes, or on one or more specific regions of the genome, for example a gene. Many genotyping techniques exist, such as the GeneChip Human Mapping Array from Affymetrix, the Illumina platform, the Sequenom platform, the Taqman platform and the Invader assay. A synonym of “genotyping” in the context of the invention can be “allele calling”.
In order to genotype an individual (or part of the genome of an individual), one uses a set of markers. By extension, sequences that may be polymorphic between different individuals can be named “markers” (it is clear that polymorphism for a given marker can be observed between a specific individual and a second individual, whereas it would not be observed with a third individual). These markers can thus indicate the nature of an allele at the specific locus that they target.
For markers that differentiate between two alleles, and assuming that the frequency of both alleles is balanced, it is statistically assumed that about 50% of the markers will reveal the same allele, between two unrelated individuals, just by chance.
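The 50% figure follows from a simple probability calculation, sketched below under the assumption that both individuals are homozygous (e.g. inbred lines) and carry alleles drawn independently with frequencies p and 1 - p. The function name is illustrative.

```python
def prob_same_allele(p):
    """Probability that two unrelated homozygous individuals carry the same
    allele at a biallelic marker with allele frequencies p and 1 - p."""
    q = 1.0 - p
    # Both carry the first allele (p * p) or both carry the second (q * q).
    return p * p + q * q
```

With balanced frequencies (p = 0.5) the result is 0.25 + 0.25 = 0.5; the proportion of matching markers rises when the frequencies are unbalanced.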
A “genotyping experiment” comprises the steps of preparing samples of the individuals, obtaining raw data, analyzing the raw data, and assigning alleles to the individuals.
A “genotyping software” is software that is used to allocate alleles to individuals after a genotyping run, by analyzing the raw data (hybridization signals) obtained for the individuals for each marker, which are used as input for this genotyping software. Such software is known in the art and is usually provided by the companies that developed the genotyping technology.
A “statistical software” is software that makes it possible to perform statistical analysis after input of data (such as the R software, available at https://www.r-project.org/).
A “parent” for an individual is, in the context of the invention, the, or one of the, donor(s) of marker alleles of this individual. Several generations of crossing, re-crossing, selfing (self-pollination) or other steps (like the ones involved in the doubled haploid production process) could have occurred between the parental stage and the individual stage, but in all cases the alleles of one individual belong to the set of alleles of its parents.
In this step, the “raw data from a reference panel” that contains individuals, the genotype of which is known, is used.
In the context of the invention, the raw data of the reference panel is analyzed together with the raw data of the alleles of the genotyping experiment.
This reference panel consists of a set of individuals selected to maximize the presence of several individuals (5 minimum) in each one of the three clusters for each of the selected markers.
The selection of these individuals is described below, but it is preferred that these individuals are chosen, for example, in different genetic groups for maize. In addition, the individuals constituting the panel are selected so that the clusters associated with each of the alleles exist for all of the markers and are clearly identified for each marker for these individuals. The method does not check whether some individuals correspond to the absence of signal. The method only considers the presence of at least 5 individuals in each one of the three main clusters.
This panel may for example consist of 384 individuals (as experiments are generally made using 384-well plates), so that it can be considered as an independent genotyping experiment. One can use fewer or more individuals, depending on the number of markers studied, or on the species studied.
Individuals that are used in the reference panel have previously been “robustly” genotyped, i.e. they have been genotyped more than once, always with the same result, or they are progeny of parents for which the genotype is known. Consequently, the alleles for the individuals of the reference panel are known.
It is to be noted that the step according to the present method does not require performing actual experiments on the genetic material of the individuals of the reference panel. The present method can use the raw data that has been previously generated for these individuals (reference panel).
In the present method, the raw data file (such as the file with the extension .CEL, when generated through an Affymetrix genotyping process) associated with this reference panel is introduced into the software of the invention that assigns an allele to each “individual/marker” pair.
One of the purposes of using the reference panel is to optimize the settings/parameters of the software to later reliably analyze the raw data obtained from individuals that one wants to genotype. Since the alleles of the individuals of the reference panel are known, one will be able to finely design the locations where the clusters are expected for all the considered markers.
In order to do so, one will modify various settings/parameters available in the software and compare the genotyping data (allele allocation for each marker) obtained using these different configurations and options.
In particular, one can modify the parameters “Option Inbred Penalty” and “genotype option” that are used and disclosed in the Affymetrix MANUAL: apt-probeset-genotype (1.19).
This manual is, in particular, available at http://www.affymetrix.com/estore/support/developer/powertools/changelog/apt-probeset-genotype.html.affx
The “Option Inbred Penalty” takes into account the information relating to the heterozygosity rate expected for each individual.
The “genotype option” provides information on the expected alleles for some pair (individual/marker) to the algorithm/software. The algorithm takes into account this information, although the algorithm is able to provide output data different from the information already known for the reference panel, if this information seems not consistent with the actual positions of the observed clusters.
Once all the configurations have been tested, it is possible to determine which configuration provides the best output data, as compared to the data known for the individuals of the reference panels.
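The comparison of configurations against the known genotypes of the reference panel can be sketched as a simple concordance calculation. The data layout (dictionaries keyed by (individual, marker) pairs), the function names and the configuration names used in any call are assumptions for illustration; the actual software settings being varied are those of the genotyping software itself.

```python
def concordance(calls, truth):
    """Fraction of (individual, marker) pairs whose call matches the known
    genotype. A missing call counts as a mismatch."""
    agree = sum(1 for key, expected in truth.items() if calls.get(key) == expected)
    return agree / len(truth)

def best_configuration(results, truth):
    """Return the name of the configuration with the highest concordance.

    results: dict mapping a configuration name to its dict of calls.
    truth:   dict mapping (individual, marker) to the known genotype.
    """
    return max(results, key=lambda name: concordance(results[name], truth))
```

Counting missing calls as mismatches means that a configuration is rewarded both for a high CallRate and for correct assignments, which matches the aim of optimizing the global quality of the output.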
The best settings/parameters for the software can thus be saved. The allelic determination for the reference panel using these settings/parameters is recorded in a file named “genotype file”, this file containing the real allele data for each individual and each marker from the reference panel.
The two files associated with the reference panel individuals (the raw data file and the corrected allelic data file (genotype file) obtained as described above) are then routinely incorporated in each new genotyping experiment on the same species that contains common markers or individuals.
It is to be noted that one can also obtain a “Posterior file” (when using the Affymetrix APT-PROBESET-GENOTYPE algorithm) which contains the equations of the ellipses (clusters) corresponding to each allele class for each marker. This file corresponds to the “mask” described above for the Illumina system. This “Posterior file” can also be used in future genotyping runs.
In summary:
This step thus corresponds to a method for obtaining parameter data that will be used in a computer software for analyzing raw data and assigning allelic information to said raw data, wherein said computer software preferably comprises a visualization interface of said genotyping data,
comprising the steps of
(a) introducing raw data from a reference panel as an input in said computer software, and wherein the allele information is known for the individuals of said reference panel
(b) varying the settings/parameters of said computer software so as to determine the settings/parameters that allow the best assignment of alleles for each individual of said reference panel
(c) extracting said settings/parameter data, which comprise computer files containing the equations of the ellipses as determined in (b) and the genotyping data associated, for each marker, with each individual of said reference panel.
The raw data of step (a) is the data obtained from the chip/array, containing the markers, and on which the physical samples have been applied. This raw data is in the form of a computer file that will contain, for instance and depending on the technology used, the fluorescence intensity for each individual and each marker, which depends on the level of interaction of the allele of the individuals with the probe on the chip/array. These files are generated by the devices sold by the chip/array/device providers and are in the adequate form to be inserted within the analysis software.
As indicated above, the user has some latitude to make the analysis parameters vary, for each marker. Step (b) uses this latitude in order to obtain the parameters that will optimize the quality (in particular the global CallRate).
Step (c) corresponds to the recording of the optimized settings/parameters that have been confirmed to provide good output data for the reference panel (which, as indicated, is composed of individuals for which allelic information is known), and that can thus be relied on to provide good information for other individuals of the same species that will be genotyped within the same program or by the same technology as the reference panel.
This method is a computer implemented method. The visualization makes it possible to see, for each marker, the plotting of each individual genotyping result [i.e. plotting, for each marker, the signal as determined for each individual during the run on the chip, on X/Y axes]. This visualization interface can further comprise the drawing of ellipses that represent the allele clusters associated with said marker. This plotting is known in the art. This software can be associated with other genotyping analysis tools, and the plotting of each individual genotyping result can be sent directly to these tools for further processing, if the quality controls are all positive for the experiment.
It is thus possible to perform, in this context, a method for genotyping a population of test individuals with a computer software for generating genotyping data, usually from raw data or after high throughput generation of data, comprising the steps of
The genotyping data thus obtained is the allelic information for each marker and each individual.
The reference panel also has a second purpose. As the raw data for this reference panel is to be introduced in each run, the assignment of the alleles for this reference panel can be checked and used as a control: indeed, if a clear assignment of alleles is obtained only for the data from the reference panel, this would strongly suggest that there was some kind of dysfunction in the production of the data for the test individuals, such as chip degradation, reagent deficiency or degradation of the material. Such a method will prevent generation of poor quality data.
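This control can be sketched as a comparison between the call rate obtained for the reference panel and the call rate obtained for the test individuals in the same analysis. The function name and both thresholds are hypothetical and would need to be set per project.

```python
def run_looks_faulty(ref_call_rate, test_call_rate, ref_min=0.95, test_min=0.90):
    """Flag a run in which the reference panel is called cleanly but the
    test individuals are not, which points to a problem in the production
    of the new data rather than in the analysis settings."""
    return ref_call_rate >= ref_min and test_call_rate < test_min
```

If both groups are poorly called, the function returns False: in that case the settings or the marker set, rather than the new run, would be the first thing to examine.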
Choice of the Reference Panel
In a preferred embodiment, said individuals in said reference panel have been chosen according to a method comprising the steps of:
(a) Determining the minimal number of individuals (n) that need to be present in each cluster. Usually this minimum number is 5.
(b) Selecting n random individuals from a starting panel for which the allelic information is known for each individual
(c) Adding one new individual from said starting panel, wherein said added individual is chosen so as to most increase the number of clusters that respond to the desired condition (at least n individuals per cluster)
(d) Repeating step (c) until the desired condition (at least n individuals per cluster) is met.
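Steps (a) to (d) can be sketched as the greedy procedure below. This is an illustration under stated assumptions: the data layout (a dictionary of known genotypes per individual and per marker), the function name, the fixed random seed, and the tie-break used when no single individual completes a cluster (preferring the individual that feeds the most clusters still below n) are not specified in the text. The sketch also stops when no remaining individual can help, which the text does not address.

```python
import random
from collections import Counter

GENOTYPES = ("AA", "AB", "BB")

def select_panel(genotypes, n=5, seed=0):
    """Greedy selection of a reference panel.

    genotypes: dict individual -> dict marker -> genotype ("AA", "AB", "BB").
    Returns a list of individuals chosen so that, where possible, every
    (marker, genotype) cluster holds at least n individuals.
    """
    rng = random.Random(seed)
    pool = sorted(genotypes)
    markers = sorted({m for g in genotypes.values() for m in g})
    targets = [(m, g) for m in markers for g in GENOTYPES]

    panel = rng.sample(pool, min(n, len(pool)))      # step (b)
    counts = Counter(k for i in panel for k in genotypes[i].items())

    def gain(ind):
        # Clusters still below n to which this individual contributes.
        feeds = [k for k in genotypes[ind].items()
                 if k[1] in GENOTYPES and counts[k] < n]
        # Primary criterion: clusters that would newly reach n.
        completes = sum(1 for k in feeds if counts[k] == n - 1)
        return (completes, len(feeds))               # assumed tie-break

    while any(counts[t] < n for t in targets):       # step (d)
        candidates = [i for i in pool if i not in panel]
        if not candidates:
            break
        best = max(candidates, key=gain)             # step (c)
        if gain(best) == (0, 0):
            break  # no remaining individual can help
        panel.append(best)
        counts.update(genotypes[best].items())
    return panel
```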
The above-mentioned process for determining parameters and settings also allows the optimization of the number of markers that are correctly and systematically read for each analysis, referred to as “reliable markers”. These markers are of special importance when one wishes to work on results of genotyping obtained in independent genotyping experiments possibly obtained by different genotyping techniques.
Herein is also presented a solution that allows improvement of both the quality of the interpretation of raw data, and also, the identification of markers of good quality (i.e. markers that can reliably be used in various experiments, leading to reliable (correct) results) through the use of appropriate indicators.
It is proposed to act in two steps.
The first step is to make sure that a marker that is considered as being of good quality for an analysis is systematically used in further analysis. It is possible to set up the parameters of the software to achieve this goal.
The second step is to identify, as finely as possible, which markers are to be systematically kept and used (reliable markers). To achieve this end, new indicators have been designed, which can be used with indicators already proposed in the literature (Affymetrix, Axiom Genotyping Solution Data Analysis Guide 702961 Rev. 1).
These indicators measure the reliability of the reading, marker by marker. They are mainly used during the first analyses when using a new batch of markers.
It is during these first analyses that the list of usable and reliable markers is determined and is permanently fixed, as a calibration phase.
The indicators described below make it possible to identify reliable markers. They are based on calculations of distances and densities, and make it possible to analyze the position of the clusters associated with each of the alleles. The denser the allelic groups and the more remote they are from one another, the higher the level of confidence in the marker and the more reliable the marker is. The number of genotyped individuals for which no allele can be assigned is negatively correlated with the quality of the marker.
The elements below shall be taken into consideration when deciding whether a marker is of quality and reliable across multiple genotype runs:
The following indicators can be calculated:
The above indicators are not described in the art, apart from the dunn and the wb.ratio.
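For the two indicators already known in the art, a minimal sketch is given below. It assumes one common definition of each: the Dunn index as the smallest distance between points of different clusters divided by the largest distance between points of the same cluster, and the wb.ratio as the mean within-cluster distance divided by the mean between-cluster distance. Statistical packages may weight these averages differently, so the values are illustrative only. At least two clusters, one of them with two or more points, are required.

```python
import math
from itertools import combinations

def dunn_and_wb_ratio(points, labels):
    """Return (dunn, wb_ratio) for labelled two-dimensional signal points."""
    within, between = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        # Sort each pairwise distance by whether the pair shares a cluster.
        (within if lp == lq else between).append(math.dist(p, q))
    dunn = min(between) / max(within)
    wb_ratio = (sum(within) / len(within)) / (sum(between) / len(between))
    return dunn, wb_ratio
```

A high Dunn index and a low wb.ratio both indicate dense clusters that are remote from one another, which is the property sought for reliable markers.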
Indicators linked to the density of missing data between the clusters (DensityAA and DensityBB) are powerful quality indicators for a marker.
Similarly, indicators of reproducibility and repeatability (Repro Intra and Repro Inter) are powerful indicators to quickly identify problems related to specific markers and to permanently exclude these non-reliable markers.
Appropriate thresholds for Densities, Repro and dunn indicators make it possible to be very specific when sorting the markers to keep and the ones to discard.
It is preferred to calculate the above indicators with a sufficient number of samples (i.e. preferably more than 30 samples) for each marker.
If Repro.Intra.Diff > 5 or Repro.Inter.Diff > 5 (Repro.Intra and Repro.Inter indicators), the markers are considered not reliable and shall thus be removed from future consideration.
Markers are considered as reliable and will thus be retained for future analysis when these markers can be sorted in one of the following classes:
(a) Call rate below threshold (SNP call rate is below threshold, but other cluster properties are good)
OR
(b) PAV (Presence/Absence Variant)
OR
(c) Other
OR
(d) PolyHighResolution
The invention thus also relates to a method for selecting markers that can be used for genotyping individuals, comprising the steps of
(a) inputting raw data (such as microarray data) obtained from a reference panel within a computer software for analyzing microarray genotyping data, wherein said computer software calculates, for each marker, clusters, wherein each individual is assigned to a cluster for each marker
(b) for each marker, calculating indicators that represent the adequacy of the data proposed by said computer software and the expected data
(c) Selecting a marker for future use for genotyping individuals if the indicators are above a predetermined threshold
In particular, the method uses at least one, and preferably all of the following indicators:
Markers are discarded (i.e. the markers are not taken into consideration or used in analysis of future runs) when the value of HomDiff_loc is above 5.
The invention also relates to a non-transitory computer readable storage medium having stored thereon processor-executable software instructions configured to cause a processor of a computing device to perform operations comprising: for each marker used in a genotyping program, calculating indicators that represent the adequacy of the data proposed by the genotyping software and the expected data, and optionally providing a signal when the result of the indicators is below a predetermined threshold. Such processor-executable software instructions make it possible to perform the methods herein disclosed.
It is to be noted that it is possible to repeat the steps of generating the reference panel, improving the settings of the computer software and optimizing the marker set. Indeed, when the marker set has been optimized with a given reference panel (and given settings), it is possible to set up a new reference panel, using this marker set, and implement again the steps of determining proper settings for the computer program and assessing the marker quality as described above. This can be repeated a few times, in order to obtain a robust reference panel, settings and marker set.
Correcting the reading of markers of PAV (Presence/Absence of variant) type is already present in commercial software (such as the software commercialized by Affymetrix).
However many markers of PAV type are not detected and an individual for which there is absence of the allele in a PAV marker is frequently interpreted incorrectly as an individual bearing a heterozygous allele.
Optimization of the settings, as mentioned above, favorably contributes to the detection of markers of PAV type.
The optimization of the choice of indicators makes it possible to significantly improve the detection of these markers and thus to allocate a 4th allele (PAV/absence of signal/no allele) to as many relevant markers as possible.
This is performed by an analysis of the heterozygous cluster. Briefly, it is checked whether it makes sense to split the heterozygous cluster into two clusters (one that would contain individuals that are actually heterozygous, and the other that would contain individuals that do not have a signal (absence of signal)).
In particular, the distance between these two sub-clusters/subgroups (Distance ABAB) is calculated, and the marker can be classified as PAV if this distance is significant (higher than a given threshold).
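The split of the heterozygous cluster and the calculation of the distance between the two sub-clusters can be sketched as follows. The splitting technique is an assumption: the text does not specify it, and a simple two-medoid partition (alternating assignment and medoid update, initialized with the lowest-signal and highest-signal points) is used here for illustration. Points are (x, y) tuples with the signal strength on the y axis; the function names are hypothetical.

```python
import math

def medoid(points):
    """Point of the group that minimizes the summed distance to all others."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

def split_in_two(points, iterations=20):
    """Partition points into a low and a high sub-cluster (two medoids)."""
    m_low = min(points, key=lambda p: p[1])    # lowest signal (assumed init)
    m_high = max(points, key=lambda p: p[1])   # highest signal (assumed init)
    low, high = list(points), []
    for _ in range(iterations):
        low = [p for p in points if math.dist(p, m_low) <= math.dist(p, m_high)]
        high = [p for p in points if math.dist(p, m_low) > math.dist(p, m_high)]
        if not high:
            break  # all points coincide with one medoid: nothing to split
        new_low, new_high = medoid(low), medoid(high)
        if (new_low, new_high) == (m_low, m_high):
            break  # converged
        m_low, m_high = new_low, new_high
    return low, high, m_low, m_high

def distance_abab(points):
    """Distance between the medoids of the two heterozygous sub-clusters."""
    _, _, m_low, m_high = split_in_two(points)
    return math.dist(m_low, m_high)
```

The marker would then be a PAV candidate when distance_abab exceeds the chosen threshold, subject to the further checks on the missing data described in the following paragraph.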
It is also checked whether the missing data (i.e. the individuals for which no allele has been assigned for this marker) can also be considered as included within a fourth cluster corresponding to “absence of the allele”. It is to be noted that one shall not automatically consider that any missing data represents an “absence of the allele”; it is also necessary to verify whether the missing data can be regrouped in a cluster which makes sense from a genetics point of view. These verifications are made by looking at the position of the missing data points, relative to each other and with respect to the AA, AB and BB clusters.
For the markers as depicted in
Indeed, in
In
The consequence of the misclassification of these types of markers is that a large amount of data, which should have been considered as the absence of the allele, will not be analyzed, thus reducing the quantity and quality of information obtained from the experiment.
Furthermore, this would lead to erroneous conclusions at the time where this marker is used in analysis routines.
In order to properly classify a marker as a PAV, or in other terms, to determine whether one should assign a cluster “Absence of signal” to a marker, the heterozygous cluster will artificially be split into two new clusters of individuals.
The principle is to test whether the heterozygous cluster that is indicated to include heterozygous individuals (i.e. individuals that have not been assigned to a homozygous cluster) actually regroups two clusters: one that contains the heterozygous individuals, i.e. individuals containing the two versions of the allele, and another that contains individuals that do not bear any version of the allele detected by the marker (absence of the signal).
Various tests will be made on the two clusters (using, in particular, the coordinates of the medoid of each cluster as representative of the cluster) in order to check which hypothesis makes the more sense:
The invention thus relates to a method for determining whether a cluster “Absence of signal” is to be assigned to a marker used in a genotyping experiment comprising the steps of
(i) inputting raw data (such as the data obtained directly from the microarray) obtained from multiple individuals within a computer software for analyzing microarray genotyping data, wherein said computer software calculates, for each marker, clusters, wherein each individual is assigned to a cluster for each marker or may be assigned to a cluster for each marker.
(ii) for each marker,
Verifying conditions (1) to (4) below
Condition (1)
Wherein a cluster “Absence of signal” is assigned to the marker if the value calculated in (c1) is higher than a first predetermined threshold (preferably 0.2), and the value calculated in (d1) is higher than a first predetermined threshold (preferably 0.4) and the value calculated in (e1) is higher than a first predetermined threshold (preferably 5),
And
Condition (2)
Wherein a cluster “Absence of signal” is assigned to the marker if the value calculated in (a1) is higher than a first predetermined threshold (preferably 8)
And
Condition (3)
wherein a cluster “Absence of signal” is assigned to the marker if all conditions a3i to a3iii are fulfilled
And
Condition (4)
Wherein a cluster “Absence of signal” is assigned to the marker if the value calculated in (a4) is higher than a first predetermined threshold (preferably 0.2) and the number of individuals in the lowest heterozygous cluster is higher than a first predetermined threshold (preferably 5)
(iii) Assigning a cluster “Absence of signal” to the marker if at least one of the conditions (1) to (4) is fulfilled.
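As an illustration of step (iii), the decision rule can be sketched as follows. This is a minimal sketch only: the field names (`c1`, `d1`, `e1`, `a1`, `a3`, `a4`, `n_lowest_het`) are hypothetical labels for the values calculated in the correspondingly named sub-steps, which are assumed to have been computed beforehand, and the default thresholds are the preferred values given above.

```python
from dataclasses import dataclass


@dataclass
class MarkerIndicators:
    """Values assumed to have been computed beforehand for one marker (hypothetical names)."""
    c1: float            # value calculated in (c1)
    d1: float            # value calculated in (d1)
    e1: float            # value calculated in (e1)
    a1: float            # value calculated in (a1)
    a3: tuple            # outcomes of conditions a3i, a3ii, a3iii as three booleans
    a4: float            # value calculated in (a4)
    n_lowest_het: int    # number of individuals in the lowest heterozygous cluster


def assign_absence_of_signal(m, t_c1=0.2, t_d1=0.4, t_e1=5, t_a1=8, t_a4=0.2, t_n=5):
    """Return True if at least one of conditions (1) to (4) is fulfilled."""
    cond1 = m.c1 > t_c1 and m.d1 > t_d1 and m.e1 > t_e1
    cond2 = m.a1 > t_a1
    cond3 = len(m.a3) == 3 and all(m.a3)
    cond4 = m.a4 > t_a4 and m.n_lowest_het > t_n
    return any((cond1, cond2, cond3, cond4))


example = MarkerIndicators(c1=0.1, d1=0.1, e1=1, a1=9,
                           a3=(True, False, True), a4=0.0, n_lowest_het=0)
print(assign_absence_of_signal(example))  # only condition (2) holds here
```

Passing the thresholds as keyword arguments reflects the fact that the values above are only preferred values and may be adjusted by the user.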
Some individuals cannot be assigned to a cluster AA, AB or BB and would thus be assigned to a “virtual” cluster called MISSING. If there are at least 5 individuals in this MISSING cluster, and the MISSING cluster is under the AB cluster, this MISSING cluster is potentially a PAV cluster and the method above should merge this MISSING cluster with the newly created cluster “Absence of signal”.
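The MISSING-cluster rule can be sketched as below. The sketch assumes that each individual is represented by an (x, y) point on the cluster plot and that “under the AB cluster” means a lower position on the vertical axis, compared here through the medians of the two clusters; both the function names and this interpretation are assumptions made for illustration.

```python
from statistics import median


def missing_is_potential_pav(missing_points, ab_points, min_individuals=5):
    """True if the MISSING cluster is large enough and lies below the AB cluster.

    Points are (x, y) tuples; "below" is assumed to mean a lower median y value.
    """
    if len(missing_points) < min_individuals or not ab_points:
        return False
    missing_y = median(y for _, y in missing_points)
    ab_y = median(y for _, y in ab_points)
    return missing_y < ab_y


def merge_missing_into_absence(calls, missing_label="MISSING",
                               absence_label="Absence of signal"):
    """Relabel every MISSING call so that it joins the "Absence of signal" cluster."""
    return [absence_label if c == missing_label else c for c in calls]


ab = [(0.0, 11.0), (0.1, 11.2), (-0.1, 10.9)]
missing = [(0.0, 8.0), (0.1, 8.2), (-0.2, 7.9), (0.0, 8.1), (0.1, 7.8)]
if missing_is_potential_pav(missing, ab):
    print(merge_missing_into_absence(["AA", "MISSING", "AB", "MISSING"]))
```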
As already indicated above, computer software used to perform step (i) of this method is known in the art and commercialized by various manufacturers, such as Affymetrix or Illumina.
In step (ii) (b1), the separation of the original heterozygous cluster into two new clusters is performed by using any software or statistical function known in the art, such as the pam function available in the R software, or by partitioning the data into 2 clusters around medoids, with the method as described in Reynolds, A., Richards, G., de la Iglesia, B. and Rayward-Smith, V. (2006) Clustering rules: A comparison of partitioning and hierarchical clustering algorithms; Journal of Mathematical Modelling and Algorithms 5, 475-504. This partition is performed with statistical software known in the art, which determines, for each individual of the heterozygous cluster, whether it should be assigned to one or the other cluster, from a statistical point of view. Eventually, all individuals present in the original heterozygous cluster are assigned to one of the newly created clusters.
Each new cluster possesses a medoid, the mathematically representative object in the cluster of individuals, which has the smallest average dissimilarity to all other individuals in the cluster. The coordinates of the medoid for the clusters can be calculated by the software that created the new clusters.
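The split around two medoids and the distance between the two resulting medoids can be sketched as follows. This is not the R pam function itself: it is a small exhaustive search over all pairs of candidate medoids, which minimizes the same criterion (total distance of each individual to its nearest medoid) and is only practical for small clusters. The function names and the sample coordinates are illustrative assumptions.

```python
from itertools import combinations
from math import dist


def split_around_two_medoids(points):
    """Try every pair of points as medoids; keep the pair with the lowest total
    distance of all points to their nearest medoid (exhaustive 2-medoid search)."""
    best = None
    for i, j in combinations(range(len(points)), 2):
        cost = sum(min(dist(p, points[i]), dist(p, points[j])) for p in points)
        if best is None or cost < best[0]:
            best = (cost, i, j)
    _, i, j = best
    # Label 0: closer to the first medoid, label 1: closer to the second one.
    labels = [0 if dist(p, points[i]) <= dist(p, points[j]) else 1 for p in points]
    return points[i], points[j], labels


def medoid_distance(points):
    """Distance between the two medoids, to be compared with a chosen threshold."""
    m1, m2, labels = split_around_two_medoids(points)
    return dist(m1, m2), labels


# Hypothetical heterozygous cluster that is in fact made of two sub-groups.
het_cluster = [(0.0, 11.0), (0.1, 11.2), (-0.1, 10.9),
               (0.0, 9.0), (0.1, 9.1), (-0.1, 8.9)]
distance, labels = medoid_distance(het_cluster)
print(distance, labels)
```

Using the medoid rather than the mean keeps the representative of each sub-cluster on an actual individual, which makes it less sensitive to outlying signals.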
The last step of the method (which could be optional) is to assign the qualification “absence of a signal” to the marker if at least one of the conditions (1) to (4) is fulfilled. Consequently, for this marker, an individual shall be assigned to one of four clusters (Homozygous AA, Homozygous BB, Heterozygous AB or Absence of signal) rather than to one of the three clusters AA, BB and AB as for other markers.
The invention also relates to a non-transitory computer readable storage medium having stored thereon processor-executable software instructions configured to cause a processor of a computing device to perform operations comprising:
The methods described above make it possible to improve the quality of the marker set, that it
These steps, either individually or in any combination will improve the quality of the reading and hence of the genotyping run.
It is also interesting to improve the quality of the eventual allelic result that can be represented as a matrix of the form depicted in Table 1.
From this allelic information, it is possible to make a diagnosis on the analysis process as a whole, and to answer a few questions that are important for determining the reliability of the end result.
The applicant used new indicators as disclosed below to perform both a diagnosis of “errors in material” (genotype material does not match the expected one) and of “technical errors” (existence of a problem during the technical process, which translates to one or more errors on several individuals).
Indicators will be calculated, marker by marker, in order to determine the reliability of each marker (CallRate, marker class, HetRate, reproducibility).
Then indicators are calculated for the individual on a set of reliable markers (subset of the marker set chosen among the PolyHighResolution markers and the markers which have reproducibility indicators (Repro Intra or Repro Inter) below 1). It is also preferred that these markers be evenly distributed over the zone that is mapped for the individuals.
Using these indicators makes it possible to have a good diagnosis of the quality of the run.
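Two of the per-marker indicators can be sketched as below. The definitions used are assumptions, since they are not spelled out here: CallRate is taken as the fraction of individuals that received a call for the marker, and HetRate as the fraction of heterozygous calls among the called individuals. The function name and the use of `None` for a missing call are likewise illustrative.

```python
def marker_indicators(calls):
    """Compute CallRate and HetRate for one marker.

    calls: list of genotype calls ("AA", "AB", "BB"); None marks a missing call.
    """
    called = [c for c in calls if c is not None]
    call_rate = len(called) / len(calls) if calls else 0.0
    het_rate = sum(c == "AB" for c in called) / len(called) if called else 0.0
    return {"CallRate": call_rate, "HetRate": het_rate}


indicators = marker_indicators(["AA", "AB", None, "BB", "AB"])
print(indicators)
```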
High-throughput genotyping as envisaged is performed on laboratory plates, generally 384-well lab-plates, that are known and widely used in the art.
Each well of a plate contains the DNA to analyze and the components required to perform this analysis (primers and labels, so that the markers can be read on the genotyping chip/array).
A genotyping run is generally performed for a large number of individuals (a few dozen or a few hundred), which involves the use of a proportional number of plates. Using the Affymetrix technology, plates containing 384 wells are generally used, making it possible to test 384 DNA extracts from 384 starting samples. Generally, the DNA is extracted from the individuals using 96-well plates, and the 384-well plate that will be used in the genotyping run corresponds to four 96-well plates put together (with a risk of error by plate inversion).
In the steps preceding the actual genotyping step, there may be different mistakes (such as inversion of plates or contamination of the material) that could potentially lead to inconsistent results of interpretation of the genotyping data (such as allocating allelic result to wrong individuals in case of inversion of a plate). It is thus essential to be able to detect this type of errors that correspond to technical errors.
The detection of these errors can be made by using two types of controls:
Negative controls, such as empty wells where no signal should be read. A signal read in these wells may result from a contamination or a plate inversion. These empty wells should have very low dQC and CallRate (dQC lower than 0.7 and CallRate lower than 92%). The position of these empty wells is an identifying element of the plates: the position of this negative control differs from one plate to another. Using this information makes it possible to determine which kind of inversion took place.
Positive controls, such as wells where the DNA comes from individuals for which the expected alleles are known. Unlike the “panel of reference” disclosed above, which was used to parametrize the genotyping software and for which only digital data is entered in the genotyping software, these positive controls correspond to physical samples that follow almost the same steps as the other samples to genotype (they are added to the plates just after DNA extraction). In each 96-well plate, one can use two positive controls, such as a hybrid (control for heterozygosity) and a line (control of homozygosity). This couple is preferably unique for each plate of a program and can be used to identify the plate. If the allelic data obtained for these specific wells does not correspond to the individuals placed in these wells on the plate but is found on another plate, then one can conclude that there was a substitution of these two plates.
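The negative-control check described above can be sketched as follows, using the dQC and CallRate limits given for empty wells. The function names and the data layout (a mapping from well name to a (dQC, CallRate in percent) pair) are assumptions made for illustration.

```python
def negative_control_ok(dqc, call_rate_pct, max_dqc=0.7, max_call_rate_pct=92.0):
    """An empty well behaves as expected when both dQC and CallRate are low."""
    return dqc < max_dqc and call_rate_pct < max_call_rate_pct


def suspicious_empty_wells(empty_wells):
    """Return the empty wells whose signal suggests contamination or plate inversion.

    empty_wells: dict mapping a well name to a (dQC, CallRate in percent) pair.
    """
    return sorted(well for well, (dqc, cr) in empty_wells.items()
                  if not negative_control_ok(dqc, cr))


wells = {"A1": (0.12, 35.0), "H12": (0.95, 98.5), "C7": (0.40, 93.0)}
flagged = suspicious_empty_wells(wells)
print(flagged)
```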
It may be interesting to use indicators that are also linked to information of reference, using a reference Dataset that can be linked to the genotyping software. This reference Dataset shall contain information about individuals, and in particular genotyping information about these individuals. This genotyping information may be complete (for all the markers) or only for a limited number of markers. This genotyping information is preferably robust, in the meaning that it has been verified (in particular with indicators as herein described or by reproducibility).
This reference Dataset makes it possible to check and test the reproducibility intra and inter programs (see table 1), the consistency of results obtained for an individual present in multiple plates within a genotyping experiment, or the consistency with results obtained for an individual in the dataset that has already been genotyped in previous experiments.
It is also possible and very interesting to use the information from the dataset about pedigree of the individual. The pedigree corresponds to the genealogy information: parents (line, hybrids, population . . . ) and breeding steps (cross, self-pollination, double haploids . . . ) made to obtain the individual. Using pedigree information makes it possible to check for consistency between the declared pedigree (and thus the expected alleles) and the genotyping data (the observed alleles).
A new indicator Pedigree.ErrorRate was thus developed and is calculated for each individual.
Pedigree.ErrorRate=(number of loci with impossible alleles)/(Number of loci where both parents are homozygous).
If Pedigree.ErrorRate is higher than a defined threshold (5 is a preferred threshold), the individual is considered as not correct and a warning signal is emitted.
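The calculation of Pedigree.ErrorRate can be sketched as below for the simple case of a regular cross. Several points are assumptions made for the sketch: impossible loci are counted only among loci where both parents are homozygous, loci with a missing call for the individual are skipped, and the preferred threshold of 5 is read as a percentage. The function names are illustrative.

```python
HOMOZYGOUS = {"AA", "BB"}


def expected_regular_cross(parent1, parent2):
    """Expected genotype of the progeny of two homozygous parents (regular cross)."""
    return parent1 if parent1 == parent2 else "AB"


def pedigree_error_rate(observed, parent1, parent2):
    """Ratio of impossible loci to loci where both parents are homozygous.

    The three arguments are lists of calls, locus by locus; None is a missing call.
    Returns None when no locus is informative.
    """
    informative = impossible = 0
    for obs, p1, p2 in zip(observed, parent1, parent2):
        if obs is None or p1 not in HOMOZYGOUS or p2 not in HOMOZYGOUS:
            continue
        informative += 1
        if obs != expected_regular_cross(p1, p2):
            impossible += 1
    return impossible / informative if informative else None


def pedigree_warning(rate, threshold_pct=5.0):
    """Assumption: the threshold of 5 is expressed as a percentage."""
    return rate is not None and 100 * rate > threshold_pct


rate = pedigree_error_rate(["AA", "AB", "BB", "AA"],
                           ["AA", "AA", "BB", "BB"],
                           ["AA", "BB", "BB", "BB"])
print(rate, pedigree_warning(rate))
```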
The invention thus relates to a method for genotyping individuals, wherein said genotyping has been performed on multiple individuals in a single run, wherein the parent genotype of some individuals in the run is known for at least some markers used for genotyping, comprising the steps of
a) allocating alleles to each individual of the genotyping run
b) for each of the individuals for which parents genotype is known for at least some markers, calculating the value Pedigree.ErrorRate, wherein said value is Pedigree.ErrorRate=(number of loci with impossible alleles)/(Number of loci where parents are homozygous)
c) emitting a signal (such as an alert) indicating that the results associated with the run are susceptible to be aberrant, if the value Pedigree.ErrorRate above is higher than a predetermined threshold.
The results are susceptible to be aberrant when the value Pedigree.ErrorRate above is higher than a predetermined threshold, as this means that some controls are not met, thereby indicating a potential problem with the plate, and that further investigation is needed to decide whether the results can be processed and used or whether the results associated with the plate must be discarded (see
In a preferred embodiment, other indicators are calculated.
In particular, the method may further comprise, in step b), the step of calculating parameters HomDiff_ind, HetDiff_ind, for each of said individuals for which parent genotype is known for at least some markers, wherein
i) HomDiff_ind=for each individual, the percentage of loci where the allele determined for said individual is homozygous and the expected allele is homozygous but inversed
ii) HetDiff_ind=for each individual, the percentage of loci where the allele determined for said individual is homozygous and the expected allele is heterozygous or the allele determined for said individual is heterozygous and the expected allele is homozygous, and
displaying a warning where an indicator HomDiff_ind or HetDiff_ind is above a pre-determined threshold. This threshold may be greater or lower than 5 but is preferably 5.
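The two indicators HomDiff_ind and HetDiff_ind can be sketched as follows. The denominator is an assumption: percentages are computed over the loci for which both an observed and an expected allele are available. The function name is illustrative.

```python
def hom_het_diff(observed, expected):
    """Return (HomDiff_ind, HetDiff_ind) as percentages for one individual.

    observed / expected: lists of calls ("AA", "AB", "BB"), locus by locus;
    None marks a locus without a call or without an expectation.
    """
    hom = {"AA", "BB"}
    pairs = [(o, e) for o, e in zip(observed, expected)
             if o is not None and e is not None]
    if not pairs:
        return None, None
    # Observed homozygous, expected homozygous for the other allele.
    hom_diff = sum(1 for o, e in pairs if o in hom and e in hom and o != e)
    # Observed homozygous vs expected heterozygous, or the reverse.
    het_diff = sum(1 for o, e in pairs
                   if (o in hom and e == "AB") or (o == "AB" and e in hom))
    return 100 * hom_diff / len(pairs), 100 * het_diff / len(pairs)


hom_pct, het_pct = hom_het_diff(["AA", "BB", "AB", "AA", "BB"],
                                ["AA", "AA", "AA", "AB", "BB"])
warning = hom_pct > 5 or het_pct > 5  # preferred threshold of 5
print(hom_pct, het_pct, warning)
```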
As an illustration, if the individual is identified as AA while its 2 parents are BB, this locus is considered a locus with impossible alleles. If one of the parents is AB and the other BB, the allele AA for an F1 progeny is possible only if the cross between parents is followed by a haplo-diploidization step. An impossible locus for a sample is thus a locus where the observed alleles are not consistent with the alleles that are expected for the sample.
It thus appears that the expected genotype of a sample in the run depends both on the genotype of the parents of the plant from which the sample has been obtained, and on the way the plant has been generated (regular crossing, haplo-diploidization).
As a further illustration, when the tested plant is identified as AA and is the progeny of an AB plant and a BB plant, the locus is considered as impossible if the tested plant has been obtained by regular crossing of the parents only, but is considered as possible if the cross between parents has been followed by a haplo-diploidization step.
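The two illustrations above can be sketched as a small function listing the genotypes that are possible for a progeny, depending on the way it was generated. The sketch assumes a biallelic marker with calls written "AA", "AB" or "BB", and models haplo-diploidization as the doubling of any single allele carried by the F1, so that the resulting plant is homozygous; the function names are illustrative.

```python
def possible_genotypes(parent1, parent2, haplo_diploidized=False):
    """Genotypes a progeny can carry at one locus, given the parental genotypes."""
    # Regular cross: one allele inherited from each parent.
    f1 = {"".join(sorted(a + b)) for a in parent1 for b in parent2}
    if not haplo_diploidized:
        return f1
    # Haplo-diploidization: any allele present in the F1 is doubled.
    return {allele * 2 for genotype in f1 for allele in genotype}


def is_impossible(observed, parent1, parent2, haplo_diploidized=False):
    """True when the observed genotype is not consistent with the declared pedigree."""
    return observed not in possible_genotypes(parent1, parent2, haplo_diploidized)


print(is_impossible("AA", "BB", "BB"))                          # both parents BB
print(is_impossible("AA", "AB", "BB"))                          # regular cross
print(is_impossible("AA", "AB", "BB", haplo_diploidized=True))  # after doubling
```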
The Pedigree.ErrorRate indicator is thus very useful for detecting the “error in material” (as indicated above) that can occur during a genotyping run with multiple individuals (1000 or more, up to several thousand individuals).
It is recalled that the identification of the parents of an F1 seed can currently be done by pericarp genotyping, and that the methods as disclosed herein, alone or in any combination, can also be applied to F1 seeds of an A and B inbred line, possibly preexisting, and the parent profile inferred with a combination of statistical and genotyping methods, leveraging in particular other available F1 using A, B and other inbred lines.
Imputation process can also be a part of the genotyping process: using high density data to impute from medium to low density data.
The imputation process can leverage such pipeline using some common indicators and validation process.
The methods as disclosed herein, alone or in any combination, can also be combined with methods of automatic allele calling. Based on statistical models and genetic knowledge, these methods make it possible to get the right alleles for any kind of material (seeds, leaves, bulks of seeds . . . ) and for any kind of set of markers (low plex, midplex, multiplex) and any class of marker (high, medium and low quality).
The methods as disclosed herein, alone or in any combination, can also be combined with methods of statistical process control (SPC) to further improve the quality of the generated data. This type of tools enables detection of possible drifts over time in the genotyping process: from the reception of seeds to data import in databases. The concept is based on charts and indicators flagging irregular tendencies and enabling to detect where irregularities come from (equipment, raw material, temperatures and so on).
All the methods described above can be implemented with a single tool (a computer software).
In this case, the user would input a setting/parameter data file (the one that was obtained with the reference panel), and launch the executable software.
When the analysis is complete, a report can be published and can be checked for any action to be made (such as discarding plates or the like if one of the various controls on the plates shows that there is a problem) before using the generated allelic data.
Thus, one will use a system that will use the following to deliver genotyping information:
To run the analysis software (analysis of the raw data obtained from the chip in order to allocate alleles to individuals), one will launch the file with the .bat extension.
Once the analysis is completed (all checks have been performed, and allele information has been allocated for as many markers and individuals as possible), the user shall receive a notification informing of the end of the analysis, and a log file can be created. This log file makes it possible to check that the various audits were properly conducted and that all steps have taken place. Finally, a report would summarize all of the conclusions and output data.
Output files from the report would allow a more detailed analysis of the analyzed material. They would include all of the indicators for the controls, markers, plates and samples. They include also all the distance calculations, reproducibility information and verification of pedigree.
The impact of using the panel of reference to determine the right settings that will be used and help classify the markers can be evaluated by observing the difference in the number of polyhighResolution markers (i.e. markers of good quality) with or without using the panel. The relative amount of such markers is a representative indicator of the quality of the genotyping data obtained with such software.
For a genotyping experiment on 68 individuals, it was observed that the number of polyhighResolution markers increased from 40% to 80% when the “reference panel” of 384 individuals was used for setting-up the software.
This is also confirmed in a genotyping program including 1046 individuals. The reference panel of 384 individuals used is the same as above. The polyhighResolution markers increased from 43% to 76%.
Another way to check the impact of the use of this reference panel on the genotyping data quality is to look at reproducibility and repeatability of the results.
One can check the level of consistency between the output obtained with or without using the reference panel and the expected reference data.
In the case of the program based on 1046 individuals above, it was noted that the fact of using the reference panel decreases the number of samples with problems of reproducibility.
The tools implemented to optimize the number of reliable markers and determine the set of operational markers that can be used in a genotyping program made it possible to increase the number of markers used in routine
1 These estimates are the ones communicated by the chip provider at the very first stage; they are highly correlated to the set of individuals
The use of additional indicators for the detection of the PAV made it possible to correct on average 5% of the markers and thus correct the alleles attributed to these markers for about 2% of individuals.
Introduction and use of the new pedigree indicator, as well as the various other controls, made it possible to detect 2 plate inversions in genotyping programs with more than 1000 individuals.
A specific plate inversion protocol may be included in a software with the following steps, as disclosed in
If some positive or negative controls are not consistent for plate A, this plate is provisionally classified as failed and other verifications need to be done (
Step 1: if it is possible to find a plate B for which the control results are those expected for plate A (and vice-versa), the two plates are considered as having been inverted and rectification is made.
If this is not possible, a second step of verification shall be done. Plate A is read as if it had been inverted and if the new reading is consistent with the expected reading, the plate A is kept and interpreted accordingly.
If controls for plate A are still inconsistent after this second step, the run is considered as failed and the plate is discarded.
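The two verification steps can be sketched as follows. The data layout is hypothetical: for each plate, `expected` and `observed` map control wells to the identity of the control found there. “Read as if it had been inverted” is modeled here as a 180 degree rotation of a 96-well plate (8 rows, 12 columns), which is an assumption about the kind of inversion; the function names are illustrative.

```python
def rotate_180(well, n_rows=8, n_cols=12):
    """Position of a well such as 'A1' after a 180 degree rotation of the plate."""
    row = ord(well[0]) - ord("A")
    col = int(well[1:]) - 1
    return f"{chr(ord('A') + n_rows - 1 - row)}{n_cols - col}"


def resolve_plate(plate, expected, observed):
    """Classify a plate from its controls: ok, swapped with another plate,
    inverted, or failed."""
    if observed[plate] == expected[plate]:
        return ("ok", None)
    # Step 1: look for a plate B whose controls match those expected for A, and vice-versa.
    for other in expected:
        if (other != plate
                and observed[plate] == expected[other]
                and observed[other] == expected[plate]):
            return ("swapped", other)
    # Step 2: read the plate as if it had been inverted.
    rotated = {rotate_180(w): label for w, label in observed[plate].items()}
    if rotated == expected[plate]:
        return ("inverted", None)
    return ("failed", None)


expected = {"P1": {"A1": "empty", "B2": "hybrid_1"},
            "P2": {"C3": "empty", "D4": "hybrid_2"},
            "P3": {"E5": "empty", "F6": "hybrid_3"},
            "P4": {"G7": "empty", "H8": "hybrid_4"}}
observed = {"P1": {"C3": "empty", "D4": "hybrid_2"},   # controls of P2
            "P2": {"A1": "empty", "B2": "hybrid_1"},   # controls of P1
            "P3": {"D8": "empty", "C7": "hybrid_3"},   # P3 rotated by 180 degrees
            "P4": {"A1": "hybrid_4"}}                  # inconsistent controls
results = {p: resolve_plate(p, expected, observed) for p in expected}
print(results)
```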
This control makes it possible to detect these inversions and to avoid discarding or misreading the plates. With this control, these plates can be reprocessed and the data used with the implementation of the method.
This process also made it possible to identify other types of errors such as material error, or error in declaration of the nature of the material genotyped . . . , giving the user the opportunity to correct.
Number | Date | Country | Kind |
---|---|---|---|
16306825.7 | Dec 2016 | EP | regional |
This application claims priority to and is a continuation of U.S. patent application Ser. No. 15/852,002, filed on Dec. 22, 2017, which claims benefit of European Application No. 16306825.7, filed Dec. 27, 2016, which are incorporated herein by reference in their entireties.
The invention relates to a computer implemented method to improve the reliability of the information generated during high-throughput genotyping, in order to be able to assign reliable allele information to genotyped individuals. Such methods are critical in breeding activities. Using genotyping data in breeding schemes requires the ability to handle high-quality big data within a short time. These methods make this task possible.
High throughput genotyping processes require high throughput analytical methods.
As indicated in Lin et al (Bioinformatics (2008) 24 (23): 2665-2671), “Single nucleotide polymorphisms (SNPs) are DNA sequence variations that occur when a single nucleotide (A, T, C or G) in the genome sequence is altered. The vast majority of SNPs are biallelic. Consider a SNP marker with alleles A and B. There are three possible genotypes for a disomic individual, AA, AB and BB. Many low- and high-throughput technologies have been developed to genotype the SNPs efficiently, including the GeneChip Human Mapping Array from Affymetrix, the Illumina platform, the Sequenom platform, the Taqman platform and the Invader assay. Each platform uses a different technology, and they give somewhat different forms of data. In general, they all give certain quantitative measures of allelic abundance for the two alleles, yA and yB. The abundance measures can either be scalars or vectors. Individuals with genotype AA are expected to have high yA value and low yB value. The opposite is expected for individuals with genotype BB. Those with genotype AB are expected to have similar yA and yB values.”
The results of the measures on the array can be plotted on two-dimensional graphs for each individual and each marker. “Each dot on the plot represents one individual. In SNP genotyping, we seek to identify genotype clusters based on these measurements and ‘call’ each person's genotype by assigning them to a cluster. Normally, we expect to find three clusters, but if one allele is rare in the population, a particular dataset might only have two clusters (genotypes)” Lin et al (Bioinformatics (2008) 24 (23): 2665-2671).
The development of high-throughput genotyping methods, such as those developed by Affymetrix and Illumina, makes it possible to simultaneously obtain genotyping information for a large number of individuals and multiple markers.
To obtain this information, manipulation steps are directly performed in the laboratory: sample preparation, DNA extraction, and run on arrays or chips where markers are present. Raw data files (such as CEL files in the Affymetrix process) are obtained after the run on the chip. These raw data contain the raw result (such as fluorescence intensity and data) for each individual and each marker.
The raw data files are then analyzed, using computer devices and software that assign (or try to assign) specific allele information to each individual for each marker. As indicated above, for each marker, such software will generally assign an allele to an individual and thus create clusters of individuals (regrouped by allelic information). The quality of the assignment is generally checked and confirmed through the use of various indicators. The final output would be a data matrix containing alleles for all individuals and all markers.
Number | Date | Country | |
---|---|---|---|
Parent | 15852002 | Dec 2017 | US |
Child | 17672838 | US |