The present application claims priority from Australian Provisional Application No. 2010901540 filed on 12 Apr. 2010 and Provisional Application No. 2010903060 filed on 9 Jul. 2010, the content of both is incorporated herein by reference.
This disclosure relates generally to bioinformatics, and more particularly a computer-implemented method for detecting regulatory single nucleotide polymorphisms (rSNPs). This disclosure also relates to a computer system and a computer program for detecting rSNPs.
Single nucleotide polymorphisms are changes in one base at a specific position in a DNA sequence. One aim of the Genome-Wide Association Studies (GWAS) is to find single nucleotide polymorphisms (SNPs) that are linked to a particular disease or trait phenotype. Disease associated SNPs residing in coding regions, specifically non-synonymous mutations, are generally easy to interpret in terms of their functional impact.
However, unraveling the functional impact of SNPs residing outside of coding regions is more challenging. One possible functional role for non-coding disease associated SNPs is that they are rSNPs. That is, they alter the binding affinity of a transcription factor to the DNA, which in turn alters the expression of certain genes, consequently contributing to the disease phenotype. Finding these rSNPs is difficult, since they can be obscured by the potentially large number of SNPs present in a linkage-disequilibrium block, and the experimental procedures involved can be expensive and labour intensive.
According to a first aspect, there is provided a computer-implemented method for detecting a regulatory single nucleotide polymorphism (rSNP), comprising:
Using the computer-implemented method (i.e. in silico) above, candidate rSNPs can be determined more efficiently compared to experimental procedures. By using the statistical significance value of the change in transcription factor binding affinity to detect rSNPs, the method provide significant sets of predicted disrupted TF binding sites and significance scores of results across a large number of SNPs. Advantageously, the method yields prediction with statistical significance and is therefore suitable for high throughput screening of SNPs for potential regulatory function. This is a useful and important tool in the interpretation of GWAS.
The statistical significance value of the change in transcription factor binding affinity in step (b) may represent a statistical significance of observing a ratio between a statistical significance value of the first score and a statistical significance value of the second score.
The statistical significance of observing the ratio may be determined, using a distribution of ratios of statistical significance values of all scores. In this case, the method further comprises calculating or retrieving the distribution of ratios of statistical significance values of all scores prior to step (b). The first score may be greater than the second score in steps (a) and (b).
Alternatively; the statistical significance value of the change in transcription factor binding affinity in step (b) may represent a statistical significance of observing the first score and the second score on a two-dimensional, joint distribution of first and second scores. In this case, the method may further comprise calculating the two-dimensional, joint distribution of first and second scores prior to step (b).
The two-dimensional joint distribution may be calculated using mutation probabilities representative of biases of mutation preferences from the first allele to the second allele. In this case, the mutation probabilities may follow a probability distribution of observing the second allele. Alternatively, the mutation probabilities may follow a probability distribution with equal probabilities of observing different nucleotides.
The two-dimensional, joint distribution may also capture statistical similarity of the first score and the second score.
The distribution of scores or the joint distribution of scores may be calculated using direct computation or convolution. The distribution may be calculated recurrently using multiple iterations where a partial sum of the distribution is calculated at each iteration.
Step (b) may be performed only if the higher value between a first statistical significance value of the first score and a second statistical significance value of the second score satisfies a predetermined threshold of statistical significance.
The first and second scores may each be determined using a position weight matrix (PWM) associated with the first and second alleles. In this case, the first score and the second score may be determined using a sum of weights, each weight is associated with a base at a position in the first or second allele. The weights may be absolute frequencies of the base at the position, relative frequencies or a transformation of the absolute or relative frequencies.
The method may further comprise repeating steps (a), (b) and (c) for one or more other pairs of first and second alleles, and ranking each pair of first and second allele according to respective statistical significance value of the change in transcription factor binding affinity.
According to a second aspect, there is provided computer program to implement the method according to the first aspect. The computer program may be embodied in a computer-readable medium such that when code of the computer program is executed, causes a computer system to implement the method.
According to a third aspect, there is provided a computer system for detecting a regulatory single nucleotide polymorphism (rSNP), comprising a processing unit operable to:
Non-limiting example(s) of the computer-implemented method and computer system for detecting regulatory SNP will now be described with reference to the accompanying drawings, in which:
a) is a logo for a position weight matrix (PWM) of a transcription factor of OCT1.
b) is a PWM of OCT1.
c) shows two binding sites for OCT1 containing a SNP: Allele 1 (A1) with position 8=T and Allele 2 (A2) with position 8=C.
a) is a flowchart of a ratio method for performing step 330 in
b) is a flowchart of a joint distribution method for performing step 330 in
a) is a graph of a distribution of scores generated by OCT1 PWM and the scores for first (A1) and second (A2) alleles.
b) is a graph of a distribution of ratios of scores and log scale y axis of frequency.
a) is a graph of a distribution of scores generated by OCT1 PWM and the scores for first (A1) and second (A2) alleles.
b) is a graph of a joint distribution (2-dimensional) of observed calibrated scores for the first allele and second allele on a negative log scale. The dot represents the observed scores for the first allele and second allele for a worked example. The shaded area includes the scores used for calculation of the p-value of the change in scores using the joint distribution method in
c) is a graph of a joint distribution (2-dimensional) of observed calibrated scores for the first allele and second allele on a negative log scale. The dot represents the observed scores for the first allele and second allele for the worked example. The shaded area includes the scores used for calculation of the p-value of the change in scores using the ratio method in
a) is a table of the results of applying the ratio method to four known regulatory SNPs.
b) is a figure of box plots comparing the results of applying the ratio method in
Referring first to
The processing unit 110 may be operated using a local computing device 140 such as one located in a laboratory. The local computing device 140 is operated by a user (not shown for simplicity) and operable to receive input data from a data entry device 144, and to display output data using a display screen 142. In another example, the method for detecting rSNPs can be offered as a web-based tool accessible by remote computing devices 150, 160 each having a display screen 152, 162 and data entry device 154, 164. In this case, the remote computing devices 150, 160 are capable of sending and receiving data from the processing unit 110 via a wide area communications network 130 such as the Internet and where applicable, a wireless communications network comprising a wireless base station 132.
The processing unit 110 is also operable to retrieve relevant data from a local data store 120, or from a remote data store 170 accessible via the communications network 130. The relevant data include DNA sequences and associated position weight matrices (PWMs), distribution of scores, distribution of ratios of scores, and joint distribution of scores, which will be discussed further below.
As an example, the method will be explained using a rSNP that is known to violate an OCT-binding sequence (Demars et. al., 2010). The TF binding site reported to be violated by the rSNP is bound by OCT4. There is, however, no PWM for OCT4 in TRANSFAC. In this case, PWMs for any of the OCT proteins will be considered to be a positive match, as they generally have similar DNA recognition sites. A positive PWM hit is considered to match any protein with the family (i.e. a positive hit for a STAT1 binding protein could be a PWM for STAT3). The PWM for OCT1 will be used.
a) shows the graphical representation 200 of a PWM for OCT1. This is known as a logo (Crooks et al., 2004) and represents the set of DNA binding sites recognised by OCT1. The height at each position represents the conservation of that position in terms of bits and the height of each letter represents the frequency of observing that base in that position in the binding site.
Referring also to
w=[w(k,i)]4×n
Each column i=1, . . . , n of the PWM corresponds to a position in the matrix, in order of the first position to the last position, i.e. n=19. Each row k=1, . . . , 4 of the PWM corresponds to one of the four nucleotides or bases found in DNA, in the order of Adenine (A), Cytosine (C), Guanine (G) and Thymine (T).
The weight w(k, i) corresponds to the number of times that a given nucleotide has been observed at a given position i. For example, position i=8 always has a T whereas position i=11 may have an A or T. The corresponding weight w(4, 8) of observing a T in this position is 56 whereas w(1, 8), w(2, 8), and w(3, 8) are zero respectively. As for position 11, the corresponding scores w(1, 11) and w(4, 11) are 43 and 13 respectively. Position 8 is said to be absolutely conserved. It will be appreciated that although absolute frequencies (i.e. number of times) are used in
c) shows the sequences for a regulatory SNP associated with fetal growth disorder. Regulatory SNPs that change a base at a conserved position are much more likely to disrupt TF binding than those that change a base at a less conserved position. As the conservation at each base is encapsulated in a PWM, scoring each allele with a PWM allows observation of the change in score and hence the change in binding affinity of the TF.
In
For an input SNP and corresponding PWM, the processing unit 110 performs the steps in
First, in step 305, the processing unit 110 determines a first score representative of the transcription factor binding affinity for first allele A1. Similarly in step 310, a second score representative of the transcription factor binding affinity for second allele A2 is calculated. It should be understood that the term “determining” here, and throughout the description, includes the processing unit 110 performing a calculation of a value, or where appropriate, extracting or retrieving a pre-computed value from the data store 120.
The respective score Sn({right arrow over (b)}) for alleles A1 and A2 is calculated as follows:
for the n-tuple {circumflex over (b)}=(b1, . . . , bn)εn, where :={1, 2, 3, 4}≡={A, C, G, T} represents the space of four DNA bases and n=19. Assuming a prior probability distribution p(b), b=1, . . . , 4 on and each nucleotide is sampled independently from , the score Sn({right arrow over (b)}) can be viewed as a random variable (RV) on with the product probability distribution.
For the SNP shown in
S(A1)=w(T,1)+w(T,2)++w(T,8)++w(G,19)=550
S(A2)=w(T,1)+w(T,2)++w(C,8)++w(G,19)=495.
While these scores do not mean much on their own, the statistical significance of the score can be determined by looking at the entire distribution of scores. In step 315 in
More specifically, in step 320, the processing unit 110 then determines a first p-value (P1) representing the statistical significance of the first score Sn(A1) and a second p-value (P2) representing the statistical significance of the second score Sn(A2) using the distribution to obtain:
P1=P[Sn≧Sn(A1)]; and
P2=P[Sn≧Sn(A2)],
where Sn is a random variable on n. The p-value represents the probability of obtaining a random value that is at least as extreme as the score observed, and therefore the rank of the score within the distribution of all scores.
The processing unit 110 then performs an optional filtering process. If the higher score out of the p-values P1 and P2 is statistically significant, i.e. less than a threshold (P<0.001), then the TF represented by the PWM is considered likely to be bound; see step 325. Otherwise, the processing unit 110 proceeds to step 355, in which a new PWM is considered and steps 305 to 325 are repeated.
In step 330, the processing unit 110 then determines the statistical significance of the change in transcription factor binding affinity to detect whether the SNP is an rSNP. The statistical significance of the change in transcription factor binding affinity can be determined using two methods as follows:
In the particular case when the first score is greater than or equals to the second score Sn(A1)≧Sn(A2), the “ratio method” is equivalent to computation of the following probability (p-value):
Here alleles (B1,B2) are sampled form the space of all possible n-tuples of nucleotides and their single base mutations. The “joint distribution method” above corresponds to computation of the following joint probability (see also Eq. (14) below):
p
joint(A1,A2)=PB1,B2[[Sn(B1)≧Sn(A1)] & [Sn(B2)≦Sn(A2)]].
In step 345, the processing unit 110 then compares the determined statistical significance with a threshold to detect whether the SNP is an rSNP. The processing unit 110 then checks whether there are more scores to be analysed in the data store 120; see step 355 in
Step 315 and step 330 using the ratio method in
In Step 315 in
With a rounding and appropriate affine transformation of the weights, the computation of the distribution of Sn is reducible to the special case of non-negative integer weights w(k, i) of the limited magnitude. In general, the computation of the distribution of (1) is known to be a NP-hard problem, subsuming the standard NP-hard benchmark of “sum of the set” (Touzet and Varre, 2007). However, in our special case of integer values of finite magnitude it has linear complexity (Pisinger, 1999) and this is what is being leveraged here.
The efficient computation of the distribution of the scores in Eq. (1) is feasible if the weights w have regularly spaced discrete values:
w(k,i)=εz
where z is an integer and ε>0. This is achieved by rounding w(k, i) with granularity c, see (Touzet and Varre, 2007) for a discussion of some issues which could be caused by rounding.
After rounding further simplifications are possible. By replacing w(k, i) with positive integer w(2)(k, i):=(w(k, i)−C)/ε, where C:=mink,iw(k, i), the original score Sn computed by Eq. (1) is replaced by the integer value score:
where U:=maxk,iw(2)(k, i) The above simple affine relation means that the distributions of Sn and Sn(2) are trivially related and consequently the general problem of computing distribution of scores can be reduced to a special case with the following integer weights without loss of generality:
w(k,i)ε{0,1,2, . . . ,U},
Note that in such a case Sn attains values between 0 and
Now the distribution of RV Sn can be determined using:
xε{0, 1, 2, . . . , U*}, the sum is over all {right arrow over (b)}=(bi)εn such that
w(b1,1)+ . . . +w(bn,n)=x,
and p*,n(x):=0, if no such {right arrow over (b)} exists. It is recognised that Eq. (3) describes the convolution of n distribution p on .
The sum in Eq. (2) involves 4n terms, which in the case of larger motifs, say n=30, results in the number 4n≈1018 of terms being too many for a naive, direct computation. However, the whole evaluation simplifies if performed recurrently using multiple iterations. Namely, at each iteration, the distribution of the partial sum of the first j terms, for 2≦j≦n, is calculated as follows:
This allows for efficient evaluation of p*,n by finding first all intermediate distributions p*,n, j=1, . . . , n−1. The explicit algorithm is presented as follows:
The whole computation can be performed with ≦2n(U+1)(n+2) multiplications and twice as many additions and with minimal memory requirement of n(U+1) registers for the storing the values of the distribution.
(1) Ratio Method in
In the ratio method, the statistical significance of the change in transcription factor binding affinity is determined using the entire distribution of scores generated by the PWM and the distribution of ratios of first and second scores.
a) shows the respective p-value of A1 and A2 on the distribution of observed scores calculated or retrieved in step 315. Allele A1 with base T at position 8 resides in the tail of the distribution of observed scores and is therefore likely to be bound by OCT1. The probability of observing a higher score is P<0.0001. The probability of observing a higher score than allele A2 is an order of magnitude less at P<0.001. The statistical significance of the score difference can be tested statistically as follows.
Referring now to
where Δp is a random observation of the ratio between two p-values. P3 represents the p-value to the relationship between the two p-values in the form of ratio P1/P2 and can be expressed as a “log-rank” function.
As the p-value for each score is representative of the rank of the score within the distribution of all scores, the ratio of the p-values can be taken as a measurement of the difference or change in binding affinity between two alleles. Furthermore, the distribution of all ratios of p-values for all possible SNPs can be used to determine the statistical significance of the observed ratio (probability of observing a higher ratio).
If the p-value P3 of the ratio between the two p-values is significant compared to a random case, i.e. less than a threshold (e.g. P<0.01), then the current SNP is marked as a candidate rSNP; see steps 345 and 350 in
Referring to
Otherwise, the processing unit 110 proceeds to rank the analysed PWM(s) according to the calculated p-value of the ratio between the p-value of the first score and the p-value of the second score; see step 365. An exemplary list of ranked results generated according to step 365 is shown in Table 1 in
It will be appreciated that given a PWM for a TF, if it is assumed that the current SNP under consideration is an rSNP, the most likely position in which the TF is bound across the SNP is determined. Therefore, the method uses a sliding window approach to step across the SNP at each position equal to the width n of the PWM, and considering the maximum score output by the PWM to be the likely binding site. For each window encompassing the SNP, the change in the affinity of binding according to the preselected PWM is calculated, and the statistical significance of this change is estimated:
In general terms, given a fixed PWM, steps 305 to 320 serve to “calibrate” the scores of alleles A1 and A2. The calibration consists in allocation of a “rank” to each score in the sequence of all possible PWM scores (for every possible setting for the DNA sequence), sorted from the highest to the lowest value. Next in steps 330 to 340, given two scores for different values of the SNP base, the ratio of these ranks is calculated. This ratio is again rank-calibrated in the space of all possible such ratios for the PWM in question and for all possible SNP variations. This final rank is converted to the value between 0 and 1, by dividing it by the total number of such ratios.
It will be appreciated that the above procedure may be adjusted to take into account the biases in the distribution of bases. The simple counting procedures need to be replaced by the respective cumulative distributions, equivalent to the counting in the uniform case. Also, due to the large number of possible base combinations and some of the small values of extreme probabilities, it is convenient to use logarithms, so the ratio to rank calibrated ranked-scores becomes a difference of logs of cumulated probabilities. These are some minor issues, the major challenge is design a computational procedure which allow for robust quantification of the tails of distributions in question, which cannot by computed directly by naive implementation or even estimated by Monte-Carlo simulation. In the above example, it has been shown that they are computable in a rigorous, though indirect form.
Step 335 in
Knowledge of the distribution of S allows the scores to be calibrated using their p-values P[Sn≧x]. This is a form of ranking of all the possible score values, from the highest to the lowest, accounting for the varying probability of different scores being attained with different frequency. Introducing the following “log-rank” function for any xε{0, 1, . . . , U*}:
For clarity, ‘′’ and ‘″’ mark first and second alleles respectively. In the case of SNP changing the i-th nucleotide bi of {right arrow over (b)} to b′, we may observe a significant change in the score (1) from Sn({right arrow over (b)}) to
reflected in the change of log-rank
The change may signify a regulatory SNP, but the change needs to be calibrated in order to assess its significance. Specifically, the distribution of the RV of all possible single base changes is evaluated:
Δρ({right arrow over (b)},i,b′,b″)=ρ(Sn(i)({right arrow over (b)},b′))−ρ(Sn(i)({right arrow over (b)},b″)),
where nucleotides bj, b′ and b″ are drawn independently from according to our prior distribution and position i is drawn uniformly from {1, . . . , n}.
Note that for every bε, we have
where the third sum is over values xε{0, 1, . . . , min((n−1)U,U*)} such that
ρ(x+w(b′,i))−ρ(x+w(b″,i))=v,
and p*,n\i is the distribution of scores of Σj≠iw(bj,j) of the PWM with ith column neglected, which can be easily computed by a straightforward adaptation of the recurrence (4). Note that the probability is set to zero if no such x exists.
The algorithm below computes the distribution of log-rank changes Δρ with granularity δ>0. The RV Δρ has potentially 16n(n−1)(U+1) continuous values of magnitude ≦A:=−n log10 minbεp(b). It is practical to aggregate them by rounding with granularity δ>0 into the maximum of 2V+1 values, where:
Denoting [x]δ the rounding of x/δ to nearest integer xε, the final formula for distribution of the discrete RV ({right arrow over (b)},i,b′,b″)[Δρ]δ:
for vε[−V:V]:={−V, −V+1, . . . , V−1, V}, where the third sum is over all x such that [ρ(x+w(b′,i))−ρ(x+w(b″,i))]δ=vδ. The following algorithm describes the computation of this distribution.
For δ=0.01 and the whole TRANSFAC v2009.4 database of approximately 1300 PWMs, this computation can be completed in under four hours on a single CPU machine and only needs to be computed once. Note that sparse representation of vectors in the above algorithms can be used to save the required storage space.
(2) Joint Distribution Method in
In the joint distribution method, the distribution of all scores calculated or retrieved in step 315 by the processing unit 110 in
The joint distribution allows for a more principled and also more “natural” and intuitive way of introduction and justification of the impact of single nucleotide change on a PWM score. In this case, the statistical significance of the change in transcription factor binding affinity according to step 330 in
Referring to
a) shows the observed scores for A1 and A2 for a worked example of matrix OCT1.
c) uses the joint distribution, to demonstrate the scores used to calculate the p-value in the ratio method (shaded area 640). In this case, the area 640 is much larger than area 620 in
The joint distribution also allows for the introduction of biases in the mutation preferences, i.e, to compute the background distribution under constraints that some substitutions occur more frequently that the others. The mutation matrix is a 4×4 matrix [pm(b′,b″)], b′, b″ε that is conceptually capturing the mutation probabilities:
By definition of probability, Σb″pm(b′,b″)=1 for every b″ε.
(i) In a first example, the mutation probabilities pm may follow a uniform distribution allowing all three substitution with equal probability
for every b′,b″ε.
(ii) In a second example, mutation probabilities pm may have the same distribution as for and every b″ε.
p
m(b′,b″):=p(b″) (11)
(iii) In a third example, the mutation probabilities may allow all 4 bases with equal probability for b′, b″ε.
p
m(b′,b″):=¼ (12)
The algorithm for calculating the joint distribution for all scores for substitution mutation is as follows:
Define the wild type score random variable as:
X′:
n
→
,X′({right arrow over (b)}′):=Sn({right arrow over (b)}′)
with respect to the assumption that nucleotides of {right arrow over (b)}′ were iid drawn from . Similarly, introducing the mutation random variable:
X″({right arrow over (b)}′,j,b″):=Sn(j)({right arrow over (b)}′,b″)=Sn((b1,b2, . . . ,bj−1,b″,bj+1, . . . ,bn)) (13)
for any ({right arrow over (b)}′,j,b″)=((b1, . . . , bn),j,b″)εn×{1:n}× here, where Sn*j is defined in Eq. (6).
Assuming that the nucleotides b″ and position of SNP j are drawn independently from and [1:n] with the uniform distribution, respectively. The distribution of mutations is governed by the mutation matrix in Eq. (9). More specifically, this means that:
P[B″=b″|{right arrow over (b)}′,j]=p
m(bj,b″),
for any {right arrow over (b)}εn, b″ε. and 1≦j≦n.
The following results state that the joint distribution of (X′, X″) is given by p* which can be computed by Algorithm 4.
Theorem 1: Given any x′, x″ε{0, 1, . . . , U*},
P[X′=x′,X″=x
″]=p*(x′,x″)
The p-value for significant scores that are more extreme than observed can be defined as:
As before let us define the 2-dimensional matrix [Tscore(x′,x″)] quantifying the required p-values for x′, x″ε{0, 1, . . . , U*}:
if x′≧x″; and otherwise set:
Theorem 2: Given a wild type n-mer allele {right arrow over (b)}′εn and its single nucleotide mutation {right arrow over (b)}″εn. Then
p
val({right arrow over (b)}′,{right arrow over (b)}″)=Tscore(Sn({right arrow over (b)}′),Sn({right arrow over (b)}″)).
The following algorithm allows for the efficient evaluation of matrix Tscore, as this could be in practice far more demanding than it looks on a first glance. We state it in a slightly more generic task, of computation of matrix T:=TT(M) given a square matrix, M(i,j) for n0≦i,j≦n0+n, where n0 and n>0 are two integers, using the following definition:
The following algorithm computes matrix T in using 2 (n+1)2 additions, which is twice the number of elements in M and an n times better than n3/3+O(n2) additions required by the direct evaluation of the above definition.
The distribution calculated according to Algorithm 4 can be used for selection of rSNPs. Consider the log-rank function ρ(x) defined by Eq. (7) for the first allele. The function is monotonically non-decreasing, hence allows for an introduction of its “inverse” defined as follows:
for any y≧minxε{0, 1, . . . , U*}ρ(x) and i:=minxε{0, 1, . . . , U*}ρ(x), otherwise.
Let {right arrow over (b)}′, {right arrow over (b)}″εn be a wild type and its single substitution mutation, respectively, and
x′:=S
n({right arrow over (b)}′) & x″:=Sn({right arrow over (b)}″)
denote the corresponding scores (1) of a PWM. Define the p-value of central interest using the calibrating function ρ as defined by Eq. (7):
The interpretation of this definition is straightforward. For instance, the first definition in (20) expresses the probability of a randomly selected pair (wild type, mutant) being at least as extreme as ({right arrow over (b)}′,{right arrow over (b)}″) in terms that: (i) the wild type allele has score not lower than that of {right arrow over (b)}′ and (ii) the drop in the log-rank ρ when changed to mutation allele is at least as large as for ({right arrow over (b)}′,{right arrow over (b)}″):
where we assume that {right arrow over (c)}εn is iid drawn from .
In these terms pval({right arrow over (b)}′,{right arrow over (b)}″) is the probability of a mutation event ({right arrow over (b)}′,j,{right arrow over (b)}″) drawn from the above described distribution of mutations on n×[1:n]× such that the drop in its log-rank
is at least as significant as for ({right arrow over (b)}′,{right arrow over (b)}″). Additionally, in the first definition of (20) the score for the wild type allele, X′({right arrow over (b)}′)=Sn({right arrow over (b)}′), is not smaller than x′=Sn({right arrow over (b)}′).
The following theorem states that the pval is a actually computable. Define the 2-dimensional matrix T(x1,x2) for x1,x2ε{0, 1, . . . , U*}:
if x1≧x2; otherwise, set:
Theorem 3: For any wild type n-mer allele {right arrow over (b)}′εn and its single nucleotide mutation {right arrow over (b)}″εn
p
val({right arrow over (b)}′,{right arrow over (b)}″)=T(Sn({right arrow over (b)}′),Sn({right arrow over (b)}″)) (25)
The performance of the is-rSNP method exemplified in
Secondly, 146 disease associated SNPs that had been classified as ‘intergenic’ are extracted from the Published Catalog of Genome-Wide Association Studies (Hindorff et al., 2010), and screened for potential rSNPs. The goal here was to see if the TF predicted by is-rSNP as being disrupted had prior evidence of being associated with the disease. If this association was present, it would suggest that the predicted rSNPs are likely to have a functional impact on the disease via a disease associated TF.
(a) Evaluating is-rSNP Using Known rSNPs
Table 1 shows the results of applying the is-rSNP method to 4 different rSNPs. Each of these rSNPs has previous empirical evidence to show that the nucleotide variation disrupts the binding of a particular transcription factor. is-rSNP was used to screen each SNP with all human PWMs in the TRANSFAC (Matys et. al., 2006) database. This resulted in a ranked list of PWMs that have a significant change in PWM score between the alleles. For each rSNP, the predictions were thresholded at Benjamini-Hochberg (BH)-corrected and are reported in Table 1 in
In each of four cases there is a match between the predicted disrupted TF and the TF reported to be disrupted in the original studies (highlighted in bold). In the case of the fetal growth disorder rSNP (Demars et. al., 2010), and the blood pressure regulation rSNP (Funke-Kaiser et al., 2003), the most significant hit matches the reported TF. An interesting point to note is that there are multiple positive hits for OCT1, AP-2α and E2F2. This is due to the fact that many matrices are similar for the same TF family. These multiple matrices provide additional evidence for a bona fide binding site.
In addition to these 4 rSNPs, 37 known rSNPs extracted from OregAnno (Montgomery et. al., 2006) are also analysed. Out of those 41 (=4+37), 30 SNPs had the matching TF bound (PWM score BH-corrected) to one of the alleles, and in each of these cases a significant change in binding score was predicted by is-rSNP. Therefore is-rSNP correctly identified 30 out of the 41 known rSNPs.
Using the 41 known rSNPs, the utility of the output of is-rSNP is compared with that of sTRAP. To do this we compared the rank of the first correct prediction in the lists output by each algorithm for each SNP. The idea here is that the closer a correct prediction is to the top of the output list, the easier and more efficient it is to interpret and test predictions experimentally. The output of both algorithms is compared on only the 30 SNPs having scores with correct prediction.
Referring now to
In addition, the output a modified version of is-rSNP and is-rSNP2D without filtering steps 320 and 325 in
Moreover, a p-value rank according to at least one embodiment of is-rSNP has two main advantages over the log-ratio rank used by sTRAP: firstly, a log-ratio ranked list does not provide a sensible point of cut-off with respect to expected numbers of false positives, whereas p-value allows a cut-off with a known expected number of false positives; secondly, a p-value associated with each log-ratio facilitates interpretable output for multiple rSNP scanning. Using a p-value means that predictions across multiple SNPs can be combined into a single ranked list. This makes is-rSNP suitable for large-scale SNP screening.
(b) Searching for Candidate rSNPs in a Database of Disease Associated SNPs
The Published Catalog of Genome-Wide Association Studies (Hindorff et al., 2010) contains a summary of disease and trait associated SNPs reported in the literature. Many of these, while shown to be strongly associated with the disease or trait, do not have any known functional relationship and are annotated as ‘intergenic’ (not being associated with a gene). It is possible that many of these SNPs are rSNPs. Therefore we used is-rSNP to predict TF binding sites that are likely to be disrupted using 146 ‘intergenic’ of these disease associated SNPs. Out of the 146, 11 SNPs were predicted to have a significant (P<0.01) impact on a TF binding site. These predictions are reported in Table 2, the last column; see
Table 2 provides a summary of the results output when is-rSNP was used to identify candidate rSNPs in a set of intergenic disease associated SNPs. Only results that show a significant (BH-corrected P<0.01) change in TF binding affinity between the alleles are included. Columns 1-2 provide information about the disease and associated SNP. Columns 3-4 state the base change between normal and disease state and whether this results in a loss or gain in binding site. Columns 5-7 outline the TF predicted to be disrupted by the allele and the p-value associated with the change. Column 8, if present, contains a reference to prior evidence that the predicted TF is known to be associated with the disease. The final column provides the p-value of the enrichment of TF binding sites around genes shown to be differentially expressed in the disease.
To test if these predictions were likely to be biologically relevant, we looked for evidence in the literature that the TF predicted had prior evidence of being associated with the disease. 8 out of the 11 predicted rSNPs showed prior evidence that the disrupted TF plays a role in the disease. In addition, we also used expression profiling of each of the diseases, to see if genes differentially expressed in the disease were enriched for binding sites of the predicted TF. The enrichment, if shown, would provide additional evidence that the predicted TF plays a critical role in the disease and the rSNP is therefore likely to be the causal SNP. In 7 cases genes differentially expressed in the disease had binding site enrichment for the disrupted TF.
Personalized medicine aspires to develop a panel of targeted drugs that can correct imbalances in biochemical and cell biological processes that lead to disease states. For example, cancers that arise from BRCA1 mutations are specifically prone to PARP-inhibitors. To exploit disease-associated SNPs to derive novel drug targets, insight to the mechanistic link between the allelic variation and the disease is required. This link is particularly difficult to derive when a SNP is positioned between genes. Also, attenuating progress in the field is the immense difficulty to functionally validate the biological impact of single base substitution in a regulatory element in the genome, as knock-in experiments are very expensive and labour intensive.
Thus, identifying in silico means to assess the likelihood of allelic variation to impact the function of a given transcription factor on a specific target gene could accelerate the progress from GWAS to personalized medicine. In at least one embodiment, the is-rSNP method offers to capitalize on the central dogma of transcription regulation, and identify novel links between targetable transcription factors and disease states, if their cognate binding to critical targets in the genome is affected by the DNA sequence variation. In this case, it is shown here that in at least one embodiment, it is possible with high degree of confidence to identify novel transcription factor-disease links through the comparison of the allelic variation with the sequence constraints of all known transcription factors.
Beyond reproducing 30 validated cases, 11 such links are identified (see Table 2 in
Only 5 of the 11 SNPs in Table 2 showed both forms evidence of the predicted disrupted TF being associated with the disease. This is not unexpected as most of the GWAS studies use bulk genotyping arrays, therefore it is likely that the reported disease associated SNP is not in fact the causative SNP, but rather belongs to the same linkage-disequilibrium block as the causative SNP. In this case, it may be sensible to process not only the disease associated SNP with is-rSNP, but neighbouring SNPs as well. As mentioned previously, many of the matrices in TRANSFAC are similar or duplicated amongst TFs and TF families. This results in severe multiple testing correction to p-values when scanning multiple SNPs with a database of PWMs. If a non-redundant database can be used, this may result in many more significant rSNP predictions.
Also, the method could be employed on more empirically derived TF-binding datasets than TRANSFAC (Matys et. al., 2006), such as ChIP-Seq data, or a panel of 500 protein bound DNA elements, derived from in vivo by digital genomic footprinting (Hessel berth et al., 2009). Focusing on regions of chromatin histone modifications, potentially indicative of regulatory modules (Hon et al., 2009; Won et al., 2009; Visel and et. al., 2009), may also improve predictions.
For example, it is found that of the 11 disease-associated SNPs reported in Table 2, 9 are overlapping with tri-methylated lysine 27 of histone H3, a histone mark that represents chromatin silencing events, related to progenitor-differentiation axis, and regulated by polycomb group proteins. This unexpected result may represent some so far unappreciated gradient of penetrance of alleles, based on their chromatin accessibility, such as the case with imprinting, where only one of the copies of DNA is available to transcription factor binding (and the impact of a rare allele on phenotype would be pseudo haplotype). Whatever the mechanistic basis for the H3K27Me3 link with disease associated intergenic SNPs, this information may become useful for future versions of is-rSNP and for genotyping projects that use massively parallel sequencing. The data presented here offers numerous novel links between known, and in most cases targetable, transcription factors and specific cohorts of patients for defined disease, each of which becomes a hypothesis basis for novel clinical trials of personalized medicine.
In addition, a major contingency in the interpretation of intergenic SNPs, which is the identification of the target gene responsible for the control of the disease, may be assisted by the knowledge of which TF binding is affected by the rSNP. Within any region in the genome among neighbouring genes there is likely a known target of a TF, implicating that target gene in the control of the disease, even when there is a large distance between the rSNP and the target (Fullwood et. al., 2009). Of course further validation of the methodology is required before patients are addressed with novel drugs, but the method clearly deserves more investigation and development. In its current form, it is suitable for screening large sets of potential rSNPs, and provides output which can be interpreted with an idea of the statistical significance of each predicted rSNP.
In at least one embodiment, the is-rSNP method is distinguishable over previous attempts at finding rSNPs that have relied largely on prediction of transcription factor (TF) binding sites that overlap SNPs (Ponomarenko and et. al., 2001). This approach typically produces large numbers of false positives, as many SNPs will not significantly alter the binding affinity of a TF given the degenerate nature of TF binding sites.
Also, in at least one embodiment, the is-rSNP method is also distinguishable over an approach that studies the difference in binding site score between two alleles using a PWM (Andersen et. al., 2008). The reasoning here is that SNPs generating ‘larger’ PWM score differences are considered more likely to be rSNPs than those generating ‘smaller’ differences. However, as PWM scores can differ based on not only the length of the binding site, but the degeneracy at each position, considering the magnitude of score difference between alleles will favour short PWMs with low degeneracy, thus penalising long, degenerate PWMs.
Further, the is-rSNP method is distinguishable over an approach that looks at normalising the score distributions of PWMs so that observed changes were comparable between PWMs (Manke et al., 2010). In this case, the PWM scores are not used directly, but a modified affinity score is used to represent the binding affinity. A Fourier transform is used to calculate the complete distribution of affinity scores, consequently allowing the computation of p-value for observed scores. The ratio of p-values of affinity scores between two alleles can then be used to determine if the TF binding site is likely to be disrupted. This approach is shown to be successful through comparison against a set of known rSNPs and a large list of PWMs ranked by the log ratio of the score p-values is shown. Unfortunately in this case, there is no clear way of an appropriate cut-off point to distinguish true and false prediction. This makes interpretation of results difficult, especially when screening large numbers of SNPs.
Rather than using affinity scores, at least one embodiment of the method uses scores generated from the PWM directly. This facilitates a simple and efficient approach for calculating the distribution PWM scores via direct computation of convolution. This method also allows the computation of distributions of p-value ratios. By obtaining this distribution, the significance of a single base change on the binding affinity of a TF can be determined, rather than simply report the ratio of p-values as in Manke et al. (2010). By associating a p-value with the ratio, the method can determine smaller, significant sets of predicted disrupted TF binding sites and provide significance scores of results across a large number of SNPs.
In yet another embodiment, the distribution of all scores calculated in step 315 in
The key observation is that pval as defined in Eqs. 19-20 is actually determined by the values of log-ranks ρ(x′) and ρ(x″), where ρ is a monotonically non-decreasing function with non-negative values. The first step is to compute a distribution of quantised log-ranks:
({right arrow over (b)}′,j,{right arrow over (b)}″)({right arrow over (b)}′,{right arrow over (b)}″)([ρ∘Sn({right arrow over (b)}′)]δ,[ρ∘Sn({right arrow over (b)}″)]δ)=([ρ∘X′({right arrow over (b)}′)]δ,[ρ∘X″({right arrow over (b)}′,j,{right arrow over (b)}″)]δ).
The following theorem states that the pval can be approximated given the distribution of quantized log-ranks p*δ.
P[p
δ(X′)=v′,ρδ(X″)=v″]p*δ(v′,v″),
for any v′,v″ε[0:V].
Define Tδ:=TT(p*δ), where function TT is defined in Eq. (17). The 2 dimensional matrix for v′,v″ε{0, 1, . . . , V} satisfies the following form of Eq. (17):
Theorem 5: Given a wild type n-mer allele {right arrow over (b)}′εn and its single nucleotide mutation {right arrow over (b)}″εn. Let v′:=(Sn({right arrow over (b)}′)) and v″:=(Sn({right arrow over (b)}″)). If v′≧v″, then
T
δ(v′+1,v″−1))≦pval({right arrow over (b)}′,{right arrow over (b)}″)≦Tδ(v′−1,v″+1)); (27)
otherwise, v′<v″ and
T
δ(v′−1,v″+1))≦pval({right arrow over (b)}′,{right arrow over (b)}″)≦Tδ(v′+1,v″−1)). (28)
These results show that knowledge of the matrix Tδ=TT(p*δ) is sufficient for practical computation of pval.
In another example, the statistical similarity of PWMs may be defined using the joint distribution of scores of two PWMs in
The idea of statistical similarity between two PWMs is that they bound to the identical or at least similar DNA n-mers. This can be interpreted as that the same set of sites both PWMs allocate the similarity high scores, or that the differences between calibrated scores are minimal in the area of top scoring sites. The example of a function measuring such a similarity and defined as an expectation of a conditional probability distribution follows:
where S1 and S2 denote scores for our two PWMs, respectively, and v0ε. Evaluation of this definition depends on the knowledge of joint distribution of scores (v1,v2)=(ρ1S1,ρ2S2).
It should also be understood that, unless specifically stated otherwise as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as “receiving”, “processing”, “retrieving”, “selecting”, “calculating”, “determining”, “displaying”, “detecting”, “generating” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that processes and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices. Unless the context clearly requires otherwise, words using singular or plural number also include the plural or singular number respectively.
It should also be understood that the techniques described might be implemented using a variety of technologies. For example, the “processing unit” described may be implemented by software, hardware or a combination of both. Further, the method(s) described herein may be implemented by a series of computer executable instructions residing on a suitable computer readable medium. Suitable computer readable media may include volatile (e.g. RAM) and/or non-volatile (e.g. ROM, disk) memory, carrier waves and transmission media (e.g. copper wire, coaxial cable, fibre optic media). Exemplary carrier waves may take the form of electrical, electromagnetic or optical signals conveying digital data streams along a local network or a publicly accessible network such as the Internet.
It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the invention as shown in the specific embodiments without departing from the scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.
Number | Date | Country | Kind |
---|---|---|---|
2010901540 | Apr 2010 | AU | national |
2010903060 | Jul 2010 | AU | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/AU11/00420 | 4/12/2011 | WO | 00 | 1/31/2013 |