METHOD FOR ENCODING DNA/RNA SEQUENCES BASED ON BIDIRECTIONAL TRINUCLEOTIDE POSITION-SPECIFIC PROPENSITIES AND POINTWISE JOINT MUTUAL INFORMATION

Information

  • Patent Application
  • 20220275401
  • Publication Number
    20220275401
  • Date Filed
    November 09, 2021
  • Date Published
    September 01, 2022
Abstract
Disclosed is a method for encoding DNA/RNA sequences based on bidirectional trinucleotide position-specific propensities and pointwise joint mutual information, which consists of the steps: constructing the nucleotide position-specific propensity matrix of DNA/RNA sequences; constructing the bidirectional dinucleotide position-specific propensity matrix of DNA/RNA sequences; constructing the bidirectional trinucleotide position-specific propensity matrix of DNA/RNA sequences; determining the value of pointwise joint mutual information of the nucleotides of DNA/RNA sequences; and concatenating features and encoding DNA/RNA sequences. In order to extract more position information of trinucleotides from DNA/RNA sequences, a parameter β is introduced to represent the distance between the current nucleotide and its forward or backward adjacent dinucleotide, and the numerical feature vectors obtained from different values of β are concatenated into a high-dimensional numerical feature vector.
Description

This patent application claims the benefit and priority of Chinese Patent Application No. 202011236108.2 entitled “Method for encoding DNA/RNA sequences based on bidirectional trinucleotide position-specific propensities and pointwise joint mutual information” filed on Nov. 9, 2020, the disclosure of which is incorporated by reference herein in its entirety as part of the present application.


TECHNICAL FIELD

The present disclosure belongs to the technical field of sequence data analysis and particularly relates to a method for encoding DNA/RNA sequences.


BACKGROUND ART

A DNA/RNA sequence encoding method is a data processing method that converts DNA/RNA sequences into numerical data. It plays an important role in identifying and predicting biological epigenetic sites, such as DNA methylation sites and RNA methylation sites, by using machine learning technology. Whether the DNA/RNA sequence encoding method can effectively extract numerical features containing strong categorical information from DNA/RNA sequences determines the performance of the classification model subsequently constructed from those features.


The existing DNA/RNA sequence encoding methods cannot extract the key feature information for effectively identifying the epigenetic sites from the DNA/RNA sequences; therefore, the performance of subsequent classification models based on the existing DNA/RNA sequence encoding methods is poor. Combining the numerical features obtained by multiple DNA/RNA sequence encoding methods into a high-dimensional numerical feature vector containing rich identification information can overcome the shortcomings of constructing a classification model from a single DNA/RNA sequence encoding method, but it leads to high redundancy among the combined high-dimensional numerical features and wastes computing resources, and the resulting improvement in model performance is limited. Therefore, how to encode DNA/RNA sequences into numerical features that contain the key information for effectively identifying epigenetic sites, while keeping the redundancy between features low, is the key issue in identifying and predicting biological epigenetic sites, and it is also a current research hotspot in the art.


SUMMARY OF THE INVENTION

The technical problem to be solved by the present disclosure is to overcome the aforementioned defects of the prior art, and to provide a method for encoding DNA/RNA sequences based on bidirectional trinucleotide position-specific propensities and pointwise joint mutual information, which extracts features with strong categorical information and low redundancy between features, and which yields high accuracy in the subsequently constructed model.


The technical scheme used for solving the technical problems comprises the following steps:


(1) constructing a nucleotide position-specific propensity matrix of DNA/RNA sequences;


giving a dataset D of DNA/RNA sequences, wherein the dataset consists of a positive dataset $D^{+}$ and a negative dataset $D^{-}$, that is, $D = D^{+} \cup D^{-}$;


determining a nucleotide position-specific propensity matrix $M_s^{+}$ for the positive dataset $D^{+}$ according to the following formula:

$$
M_s^{+} = \begin{bmatrix}
f_{A,1}^{+} & f_{A,2}^{+} & \cdots & f_{A,i}^{+} \\
f_{C,1}^{+} & f_{C,2}^{+} & \cdots & f_{C,i}^{+} \\
f_{G,1}^{+} & f_{G,2}^{+} & \cdots & f_{G,i}^{+} \\
f_{X,1}^{+} & f_{X,2}^{+} & \cdots & f_{X,i}^{+}
\end{bmatrix}
$$

wherein A, C, G and X are the 4 types of nucleotides of DNA/RNA, X represents nucleotide T in DNA and U in RNA, i represents the position of a nucleotide, 1≤i≤l, i is a finite positive integer, and l is the length of a DNA/RNA sequence and is an odd number; $f_{A,i}^{+}$, $f_{C,i}^{+}$, $f_{G,i}^{+}$ and $f_{X,i}^{+}$ are the occurrence frequencies of nucleotides A, C, G and X at position i in the positive dataset $D^{+}$, respectively.


Determining a nucleotide position-specific propensity matrix $M_s^{-}$ for the negative dataset $D^{-}$ according to the following formula:

$$
M_s^{-} = \begin{bmatrix}
f_{A,1}^{-} & f_{A,2}^{-} & \cdots & f_{A,i}^{-} \\
f_{C,1}^{-} & f_{C,2}^{-} & \cdots & f_{C,i}^{-} \\
f_{G,1}^{-} & f_{G,2}^{-} & \cdots & f_{G,i}^{-} \\
f_{X,1}^{-} & f_{X,2}^{-} & \cdots & f_{X,i}^{-}
\end{bmatrix}
$$

wherein $f_{A,i}^{-}$, $f_{C,i}^{-}$, $f_{G,i}^{-}$ and $f_{X,i}^{-}$ are the occurrence frequencies of nucleotides A, C, G and X at position i in the negative dataset $D^{-}$, respectively.
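Step (1) can be illustrated with a minimal Python sketch. It assumes that "occurrence frequency" means the fraction of sequences in a dataset carrying a given nucleotide at a given position; the function name `position_propensity` and the toy sequences are hypothetical and serve only as illustration.

```python
from collections import Counter

def position_propensity(seqs, alphabet="ACGT"):
    """Return {nucleotide: [relative frequency at position 1, 2, ..., l]}.

    Assumption: "occurrence frequency" is count / number of sequences.
    Use alphabet="ACGU" for RNA.
    """
    if not seqs:
        raise ValueError("dataset is empty")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    matrix = {nt: [] for nt in alphabet}
    for i in range(length):
        counts = Counter(s[i] for s in seqs)  # nucleotides seen in column i
        for nt in alphabet:
            matrix[nt].append(counts[nt] / len(seqs))
    return matrix

# Hypothetical toy datasets (not taken from the disclosure).
positive = ["ACGTA", "ACGTT", "TCGTA"]
negative = ["GGATC", "GCATC", "TGATC"]
m_pos = position_propensity(positive)  # one row per nucleotide, one column per position
m_neg = position_propensity(negative)
```

Each row of the returned dictionary corresponds to one row of the propensity matrix, and each column sums to 1 when all characters belong to the alphabet.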


(2) Constructing a bidirectional dinucleotide position-specific propensity matrix of DNA/RNA sequences;


determining a forward dinucleotide position-specific propensity matrix $\overrightarrow{M}_d^{+}$ for the positive dataset $D^{+}$ according to the following formula:

$$
\overrightarrow{M}_d^{+} = \begin{bmatrix}
\overrightarrow{f}_{AA,1}^{+} & \overrightarrow{f}_{AA,2}^{+} & \cdots & \overrightarrow{f}_{AA,j}^{+} \\
\overrightarrow{f}_{AC,1}^{+} & \overrightarrow{f}_{AC,2}^{+} & \cdots & \overrightarrow{f}_{AC,j}^{+} \\
\vdots & \vdots & \ddots & \vdots \\
\overrightarrow{f}_{XX,1}^{+} & \overrightarrow{f}_{XX,2}^{+} & \cdots & \overrightarrow{f}_{XX,j}^{+}
\end{bmatrix}
$$

wherein AA, AC, . . . , and XX are the 16 types of dinucleotides formed by the 4 types of nucleotides A, C, G, and X of DNA/RNA, j represents the position of a dinucleotide, 2≤j≤l−1, j is a finite positive integer, l is the length of a DNA/RNA sequence, and $\overrightarrow{f}_{AA,j}^{+}$, $\overrightarrow{f}_{AC,j}^{+}$, . . . , and $\overrightarrow{f}_{XX,j}^{+}$ are the occurrence frequencies of dinucleotides AA, AC, . . . , and XX in the positive dataset $D^{+}$, respectively, wherein the first nucleotide of a dinucleotide is at position j and the second nucleotide is at position j+1.


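Determining a backward dinucleotide position-specific propensity matrix $\overleftarrow{M}_d^{+}$ for the positive dataset $D^{+}$ according to the following formula:

$$
\overleftarrow{M}_d^{+} = \begin{bmatrix}
\overleftarrow{f}_{AA,2}^{+} & \overleftarrow{f}_{AA,3}^{+} & \cdots & \overleftarrow{f}_{AA,j}^{+} \\
\overleftarrow{f}_{AC,2}^{+} & \overleftarrow{f}_{AC,3}^{+} & \cdots & \overleftarrow{f}_{AC,j}^{+} \\
\vdots & \vdots & \ddots & \vdots \\
\overleftarrow{f}_{XX,2}^{+} & \overleftarrow{f}_{XX,3}^{+} & \cdots & \overleftarrow{f}_{XX,j}^{+}
\end{bmatrix}
$$

wherein $\overleftarrow{f}_{AA,j}^{+}$, $\overleftarrow{f}_{AC,j}^{+}$, . . . , and $\overleftarrow{f}_{XX,j}^{+}$ are the occurrence frequencies of dinucleotides AA, AC, . . . , and XX in the positive dataset $D^{+}$, respectively, wherein the first nucleotide of a dinucleotide is at position j and the second nucleotide is at position j−1.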


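Determining a forward dinucleotide position-specific propensity matrix $\overrightarrow{M}_d^{-}$ for the negative dataset $D^{-}$ according to the following formula:

$$
\overrightarrow{M}_d^{-} = \begin{bmatrix}
\overrightarrow{f}_{AA,2}^{-} & \overrightarrow{f}_{AA,3}^{-} & \cdots & \overrightarrow{f}_{AA,j}^{-} \\
\overrightarrow{f}_{AC,2}^{-} & \overrightarrow{f}_{AC,3}^{-} & \cdots & \overrightarrow{f}_{AC,j}^{-} \\
\vdots & \vdots & \ddots & \vdots \\
\overrightarrow{f}_{XX,2}^{-} & \overrightarrow{f}_{XX,3}^{-} & \cdots & \overrightarrow{f}_{XX,j}^{-}
\end{bmatrix}
$$

wherein $\overrightarrow{f}_{AA,j}^{-}$, $\overrightarrow{f}_{AC,j}^{-}$, . . . , and $\overrightarrow{f}_{XX,j}^{-}$ are the occurrence frequencies of dinucleotides AA, AC, . . . , and XX in the negative dataset $D^{-}$, respectively, wherein the first nucleotide of a dinucleotide is at position j and the second nucleotide is at position j+1.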


Determining a backward dinucleotide position-specific propensity matrix $\overleftarrow{M}_d^{-}$ for the negative dataset $D^{-}$ according to the following formula:

$$
\overleftarrow{M}_d^{-} = \begin{bmatrix}
\overleftarrow{f}_{AA,2}^{-} & \overleftarrow{f}_{AA,3}^{-} & \cdots & \overleftarrow{f}_{AA,j}^{-} \\
\overleftarrow{f}_{AC,2}^{-} & \overleftarrow{f}_{AC,3}^{-} & \cdots & \overleftarrow{f}_{AC,j}^{-} \\
\vdots & \vdots & \ddots & \vdots \\
\overleftarrow{f}_{XX,2}^{-} & \overleftarrow{f}_{XX,3}^{-} & \cdots & \overleftarrow{f}_{XX,j}^{-}
\end{bmatrix}
$$

wherein $\overleftarrow{f}_{AA,j}^{-}$, $\overleftarrow{f}_{AC,j}^{-}$, . . . , and $\overleftarrow{f}_{XX,j}^{-}$ are the occurrence frequencies of dinucleotides AA, AC, . . . , and XX in the negative dataset $D^{-}$, respectively, wherein the first nucleotide of a dinucleotide is at position j and the second nucleotide is at position j−1.
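The forward and backward dinucleotide frequencies of step (2) can be sketched in Python as follows. This is a minimal illustration under two assumptions: a frequency is a count divided by the number of sequences, and every position whose partner nucleotide exists is kept (a superset of the stated range 2≤j≤l−1). The name `dinucleotide_propensity` and the toy data are hypothetical.

```python
from collections import Counter

def dinucleotide_propensity(seqs, direction="forward"):
    """Map (dinucleotide, j) -> relative frequency, with j counted from 1.

    forward:  first nucleotide at j, second at j + 1
    backward: first nucleotide at j, second at j - 1
    """
    if direction not in ("forward", "backward"):
        raise ValueError("direction must be 'forward' or 'backward'")
    step = 1 if direction == "forward" else -1
    length = len(seqs[0])
    counts = Counter()
    for s in seqs:
        for j in range(1, length + 1):
            partner = j + step
            if 1 <= partner <= length:  # skip positions without a partner
                counts[(s[j - 1] + s[partner - 1], j)] += 1
    return {key: c / len(seqs) for key, c in counts.items()}

# Hypothetical toy dataset (not taken from the disclosure).
positive = ["ACGTA", "ACGTT", "TCGTA"]
fwd = dinucleotide_propensity(positive, "forward")
bwd = dinucleotide_propensity(positive, "backward")
```

A sparse dictionary is used instead of a dense 16-row matrix; dinucleotides that never occur at a position are simply absent and can be read as frequency 0.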


(3) Constructing a bidirectional trinucleotide position-specific propensity matrix of DNA/RNA sequences


determining a forward trinucleotide position-specific propensity matrix $\overrightarrow{M}_t^{+}$ for the positive dataset $D^{+}$ according to the following formula:

$$
\overrightarrow{M}_t^{+} = \begin{bmatrix}
\overrightarrow{f}_{AAA,\beta+3}^{+} & \overrightarrow{f}_{AAA,\beta+4}^{+} & \cdots & \overrightarrow{f}_{AAA,k}^{+} \\
\overrightarrow{f}_{AAC,\beta+3}^{+} & \overrightarrow{f}_{AAC,\beta+4}^{+} & \cdots & \overrightarrow{f}_{AAC,k}^{+} \\
\vdots & \vdots & \ddots & \vdots \\
\overrightarrow{f}_{XXX,\beta+3}^{+} & \overrightarrow{f}_{XXX,\beta+4}^{+} & \cdots & \overrightarrow{f}_{XXX,k}^{+}
\end{bmatrix}
$$

wherein AAA, AAC, . . . , XXX are the 64 types of trinucleotides formed by the 4 types of nucleotides A, C, G, and X of DNA/RNA, β represents the distance between the nucleotide at position k and its forward adjacent dinucleotide, 0≤β≤(l−5)/2, β is a non-negative integer, l is the length of a DNA/RNA sequence, k is a finite positive integer that represents the position of the first nucleotide of the forward trinucleotide, β+3≤k≤l−β−2, the second nucleotide is at position k+β+1 and the third at position k+β+2, and $\overrightarrow{f}_{AAA,k}^{+}$, $\overrightarrow{f}_{AAC,k}^{+}$, . . . , and $\overrightarrow{f}_{XXX,k}^{+}$ are the occurrence frequencies of trinucleotides AAA, AAC, . . . , and XXX in the positive dataset $D^{+}$, respectively.


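Determining a backward trinucleotide position-specific propensity matrix $\overleftarrow{M}_t^{+}$ for the positive dataset $D^{+}$ according to the following formula:

$$
\overleftarrow{M}_t^{+} = \begin{bmatrix}
\overleftarrow{f}_{AAA,\beta+3}^{+} & \overleftarrow{f}_{AAA,\beta+4}^{+} & \cdots & \overleftarrow{f}_{AAA,k}^{+} \\
\overleftarrow{f}_{AAC,\beta+3}^{+} & \overleftarrow{f}_{AAC,\beta+4}^{+} & \cdots & \overleftarrow{f}_{AAC,k}^{+} \\
\vdots & \vdots & \ddots & \vdots \\
\overleftarrow{f}_{XXX,\beta+3}^{+} & \overleftarrow{f}_{XXX,\beta+4}^{+} & \cdots & \overleftarrow{f}_{XXX,k}^{+}
\end{bmatrix}
$$

wherein $\overleftarrow{f}_{AAA,k}^{+}$, $\overleftarrow{f}_{AAC,k}^{+}$, . . . , and $\overleftarrow{f}_{XXX,k}^{+}$ are the occurrence frequencies of trinucleotides AAA, AAC, . . . , and XXX in the positive dataset $D^{+}$, respectively, wherein the first, second, and third nucleotides of the backward trinucleotide are at positions k, k−β−1, and k−β−2 of the sequences, respectively.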


Determining a forward trinucleotide position-specific propensity matrix $\overrightarrow{M}_t^{-}$ for the negative dataset $D^{-}$ according to the following formula:

$$
\overrightarrow{M}_t^{-} = \begin{bmatrix}
\overrightarrow{f}_{AAA,\beta+3}^{-} & \overrightarrow{f}_{AAA,\beta+4}^{-} & \cdots & \overrightarrow{f}_{AAA,k}^{-} \\
\overrightarrow{f}_{AAC,\beta+3}^{-} & \overrightarrow{f}_{AAC,\beta+4}^{-} & \cdots & \overrightarrow{f}_{AAC,k}^{-} \\
\vdots & \vdots & \ddots & \vdots \\
\overrightarrow{f}_{XXX,\beta+3}^{-} & \overrightarrow{f}_{XXX,\beta+4}^{-} & \cdots & \overrightarrow{f}_{XXX,k}^{-}
\end{bmatrix}
$$

wherein $\overrightarrow{f}_{AAA,k}^{-}$, $\overrightarrow{f}_{AAC,k}^{-}$, . . . , and $\overrightarrow{f}_{XXX,k}^{-}$ are the occurrence frequencies of trinucleotides AAA, AAC, . . . , and XXX in the negative dataset $D^{-}$, respectively, wherein the first, second, and third nucleotides of the forward trinucleotide are at positions k, k+β+1, and k+β+2 of the sequences, respectively.


Determining a backward trinucleotide position-specific propensity matrix $\overleftarrow{M}_t^{-}$ for the negative dataset $D^{-}$ according to the following formula:

$$
\overleftarrow{M}_t^{-} = \begin{bmatrix}
\overleftarrow{f}_{AAA,\beta+3}^{-} & \overleftarrow{f}_{AAA,\beta+4}^{-} & \cdots & \overleftarrow{f}_{AAA,k}^{-} \\
\overleftarrow{f}_{AAC,\beta+3}^{-} & \overleftarrow{f}_{AAC,\beta+4}^{-} & \cdots & \overleftarrow{f}_{AAC,k}^{-} \\
\vdots & \vdots & \ddots & \vdots \\
\overleftarrow{f}_{XXX,\beta+3}^{-} & \overleftarrow{f}_{XXX,\beta+4}^{-} & \cdots & \overleftarrow{f}_{XXX,k}^{-}
\end{bmatrix}
$$

wherein $\overleftarrow{f}_{AAA,k}^{-}$, $\overleftarrow{f}_{AAC,k}^{-}$, . . . , and $\overleftarrow{f}_{XXX,k}^{-}$ are the occurrence frequencies of trinucleotides AAA, AAC, . . . , and XXX in the negative dataset $D^{-}$, respectively, wherein the first, second, and third nucleotides of the backward trinucleotide are at positions k, k−β−1, and k−β−2 of the sequences, respectively.
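The gapped trinucleotides of step (3) can be sketched in Python as follows. This is an illustrative sketch only: the function name `trinucleotide_propensity` and the toy sequences are hypothetical, and "occurrence frequency" is assumed to be a count divided by the number of sequences.

```python
from collections import Counter

def trinucleotide_propensity(seqs, beta, direction="forward"):
    """Map (trinucleotide, k) -> relative frequency, with k counted from 1.

    forward:  nucleotides at k, k + beta + 1, k + beta + 2
    backward: nucleotides at k, k - beta - 1, k - beta - 2
    """
    if direction not in ("forward", "backward"):
        raise ValueError("direction must be 'forward' or 'backward'")
    length = len(seqs[0])
    if not 0 <= beta <= (length - 5) // 2:
        raise ValueError("beta out of range for this sequence length")
    sign = 1 if direction == "forward" else -1
    counts = Counter()
    for s in seqs:
        for k in range(beta + 3, length - beta - 1):  # beta+3 <= k <= l-beta-2
            second = k + sign * (beta + 1)
            third = k + sign * (beta + 2)
            counts[(s[k - 1] + s[second - 1] + s[third - 1], k)] += 1
    return {key: c / len(seqs) for key, c in counts.items()}

# Hypothetical toy dataset with l = 7, so beta can be 0 or 1.
seqs = ["ACGTACG", "ACGTTCG"]
fwd0 = trinucleotide_propensity(seqs, beta=0, direction="forward")
fwd1 = trinucleotide_propensity(seqs, beta=1, direction="forward")
bwd1 = trinucleotide_propensity(seqs, beta=1, direction="backward")
```

The bounds β+3≤k≤l−β−2 guarantee that both the forward and the backward trinucleotide stay inside the sequence, so the same range of k serves both directions.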


(4) Determining a value of pointwise joint mutual information of the nucleotides of DNA/RNA sequences


(4.1) Determining a value $\overrightarrow{v}_k^{+}$ of forward pointwise joint mutual information of nucleotides of DNA/RNA sequences to be encoded in the positive dataset $D^{+}$ according to the following formula:

$$
\overrightarrow{v}_k^{+} = \log \frac{\overrightarrow{f}_{x\overrightarrow{y}\overrightarrow{z},k}^{+}}{f_{x,k}^{+}\,\overrightarrow{f}_{\overrightarrow{y}\overrightarrow{z},k+\beta+1}^{+}}
$$

wherein x is the nucleotide at position k, x∈{A, C, G, X}; $\overrightarrow{y}$ is the nucleotide at position k+β+1, $\overrightarrow{y}$∈{A, C, G, X}; $\overrightarrow{z}$ is the nucleotide at position k+β+2, $\overrightarrow{z}$∈{A, C, G, X}; $\overrightarrow{f}_{x\overrightarrow{y}\overrightarrow{z},k}^{+}$ is the occurrence frequency of trinucleotide $x\overrightarrow{y}\overrightarrow{z}$ in the positive dataset $D^{+}$; $\overrightarrow{f}_{\overrightarrow{y}\overrightarrow{z},k+\beta+1}^{+}$ is the occurrence frequency of dinucleotide $\overrightarrow{y}\overrightarrow{z}$ of all sequence samples of the positive dataset $D^{+}$; and $f_{x,k}^{+}$ is the occurrence frequency of nucleotide x at position k of all sequence samples of the positive dataset $D^{+}$.


Determining a value $\overleftarrow{v}_k^{+}$ of backward pointwise joint mutual information of nucleotides of DNA/RNA sequences to be encoded in the positive dataset $D^{+}$ according to the following formula:

$$
\overleftarrow{v}_k^{+} = \log \frac{\overleftarrow{f}_{x\overleftarrow{y}\overleftarrow{z},k}^{+}}{f_{x,k}^{+}\,\overleftarrow{f}_{\overleftarrow{y}\overleftarrow{z},k-\beta-1}^{+}}
$$

wherein x is the nucleotide at position k, x∈{A, C, G, X}; $\overleftarrow{y}$ is the nucleotide at position k−β−1, $\overleftarrow{y}$∈{A, C, G, X}; $\overleftarrow{z}$ is the nucleotide at position k−β−2, $\overleftarrow{z}$∈{A, C, G, X}; $\overleftarrow{f}_{x\overleftarrow{y}\overleftarrow{z},k}^{+}$ represents the occurrence frequency of trinucleotide $x\overleftarrow{y}\overleftarrow{z}$ of all sequences in the positive dataset $D^{+}$; and $\overleftarrow{f}_{\overleftarrow{y}\overleftarrow{z},k-\beta-1}^{+}$ represents the occurrence frequency of dinucleotide $\overleftarrow{y}\overleftarrow{z}$ of all sequences in the positive dataset $D^{+}$.


The encoding value $v_k^{+}$ of pointwise joint mutual information in the positive dataset $D^{+}$ of a nucleotide at position k of DNA/RNA sequences to be encoded is defined as the average of the value $\overrightarrow{v}_k^{+}$ of forward pointwise joint mutual information and the value $\overleftarrow{v}_k^{+}$ of backward pointwise joint mutual information. The DNA/RNA sequence with length l is encoded into a pointwise joint mutual information feature vector $V^{+}$ with a length of l−2β−4:

$$
V^{+} = \left[ v_{\beta+3}^{+},\; v_{\beta+4}^{+},\; \ldots,\; v_{k}^{+} \right]
$$

$$
v_k^{+} = \frac{\overrightarrow{v}_k^{+} + \overleftarrow{v}_k^{+}}{2}
$$
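The forward, backward and averaged values of step (4.1) can be sketched in Python for one dataset and one value of β. This is a minimal sketch under stated assumptions: frequencies are counts divided by the number of sequences, the logarithm is the natural logarithm (the base is not stated in the text), and a small constant `eps` is added to avoid taking the logarithm of zero (the text does not say how zero frequencies are handled). The name `pjmi_vector` is hypothetical.

```python
import math

def pjmi_vector(seq, dataset, beta, eps=1e-9):
    """Averaged forward/backward values for positions beta+3 .. l-beta-2 of `seq`.

    `dataset` supplies the frequencies (e.g. the positive set).
    `eps` is an assumed smoothing constant, not part of the disclosure.
    """
    length = len(seq)

    def freq(positions, letters):
        # Fraction of dataset sequences carrying `letters` at 1-based `positions`.
        hits = sum(all(s[p - 1] == c for p, c in zip(positions, letters))
                   for s in dataset)
        return hits / len(dataset)

    values = []
    for k in range(beta + 3, length - beta - 1):
        directional = []
        for sign in (1, -1):  # +1: forward, -1: backward
            p2 = k + sign * (beta + 1)
            p3 = k + sign * (beta + 2)
            x, y, z = seq[k - 1], seq[p2 - 1], seq[p3 - 1]
            joint = freq((k, p2, p3), (x, y, z))
            marginal = freq((k,), (x,)) * freq((p2, p3), (y, z))
            directional.append(math.log((joint + eps) / (marginal + eps)))
        values.append(sum(directional) / 2)
    return values

# Hypothetical toy data: half of the sequences are all-A, half all-C.
toy = ["AAAAA", "CCCCC"]
v_plus = pjmi_vector("AAAAA", toy, beta=0)  # a single value close to log(2)
```

In the toy data the trinucleotide, the single nucleotide and the dinucleotide each occur in half of the sequences, so the ratio is 0.5 / (0.5 × 0.5) = 2 in both directions.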





(4.2) Determining a value $\overrightarrow{v}_k^{-}$ of forward pointwise joint mutual information of nucleotides of DNA/RNA sequences to be encoded in the negative dataset $D^{-}$ according to the following formula:

$$
\overrightarrow{v}_k^{-} = \log \frac{\overrightarrow{f}_{x\overrightarrow{y}\overrightarrow{z},k}^{-}}{f_{x,k}^{-}\,\overrightarrow{f}_{\overrightarrow{y}\overrightarrow{z},k+\beta+1}^{-}}
$$

wherein $\overrightarrow{f}_{x\overrightarrow{y}\overrightarrow{z},k}^{-}$ represents the occurrence frequency of trinucleotide $x\overrightarrow{y}\overrightarrow{z}$ in the negative dataset $D^{-}$, and x, $\overrightarrow{y}$, and $\overrightarrow{z}$ are the nucleotides at positions k, k+β+1 and k+β+2, respectively; $\overrightarrow{f}_{\overrightarrow{y}\overrightarrow{z},k+\beta+1}^{-}$ is the occurrence frequency of dinucleotide $\overrightarrow{y}\overrightarrow{z}$ in the negative dataset $D^{-}$; and $f_{x,k}^{-}$ is the occurrence frequency of nucleotide x at position k in the negative dataset $D^{-}$.


Determining a value $\overleftarrow{v}_k^{-}$ of backward pointwise joint mutual information of nucleotides of DNA/RNA sequences to be encoded in the negative dataset $D^{-}$ according to the following formula:

$$
\overleftarrow{v}_k^{-} = \log \frac{\overleftarrow{f}_{x\overleftarrow{y}\overleftarrow{z},k}^{-}}{f_{x,k}^{-}\,\overleftarrow{f}_{\overleftarrow{y}\overleftarrow{z},k-\beta-1}^{-}}
$$

wherein $\overleftarrow{f}_{x\overleftarrow{y}\overleftarrow{z},k}^{-}$ is the occurrence frequency of trinucleotide $x\overleftarrow{y}\overleftarrow{z}$ of all sequences of the negative dataset $D^{-}$, and x, $\overleftarrow{y}$, and $\overleftarrow{z}$ are the nucleotides at positions k, k−β−1 and k−β−2, respectively; and $\overleftarrow{f}_{\overleftarrow{y}\overleftarrow{z},k-\beta-1}^{-}$ is the occurrence frequency of dinucleotide $\overleftarrow{y}\overleftarrow{z}$ of all sequences of the negative dataset $D^{-}$.


The encoding value $v_k^{-}$ of pointwise joint mutual information of a nucleotide at position k of DNA/RNA sequences to be encoded in the negative dataset $D^{-}$ is defined as the average of the value $\overrightarrow{v}_k^{-}$ of forward pointwise joint mutual information and the value $\overleftarrow{v}_k^{-}$ of backward pointwise joint mutual information, and a DNA/RNA sequence with length l is encoded into a pointwise joint mutual information feature vector $V^{-}$ with a length of l−2β−4:

$$
V^{-} = \left[ v_{\beta+3}^{-},\; v_{\beta+4}^{-},\; \ldots,\; v_{k}^{-} \right]
$$

$$
v_k^{-} = \frac{\overrightarrow{v}_k^{-} + \overleftarrow{v}_k^{-}}{2}
$$






(4.3) Determining a feature vector V of a DNA/RNA sequence to be encoded with a given length l by subtracting each element of vector $V^{-}$ from the corresponding element of vector $V^{+}$:

$$
V = \left[ V_{\beta+3},\; V_{\beta+4},\; \ldots,\; V_{k} \right]
$$

$$
V_k = v_k^{+} - v_k^{-}
$$



(5) Concatenating Features


When the value of parameter β is 0, the feature vector $V(0)$ is $[V_3, V_4, V_5, \ldots, V_{l-3}, V_{l-2}]$, and the number of elements is l−4. When the value of β is 1, the feature vector $V(1)$ is $[V_4, V_5, V_6, \ldots, V_{l-4}, V_{l-3}]$, and the number of elements is l−6, and so on. When the value of β is (l−7)/2, the feature vector $V((l-7)/2)$ is $[V_{(l-1)/2}, V_{(l+1)/2}, V_{(l+3)/2}]$, and the number of elements is 3. When the value of β is (l−5)/2, the feature vector $V((l-5)/2)$ is $[V_{(l+1)/2}]$, and the number of elements is 1. The feature vectors determined by the different values of parameter β are concatenated into a high-dimensional feature vector $[V(0), V(1), \ldots, V((l-7)/2), V((l-5)/2)]$ with $(l-3)^2/4$ elements.
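The total element count follows from summing the per-β lengths l−2β−4, which form an arithmetic series. In the short derivation below, n is shorthand introduced here for the number of admissible values of β:

$$
\begin{aligned}
n &= \frac{l-5}{2} + 1 = \frac{l-3}{2}, \\
\sum_{\beta=0}^{(l-5)/2} (l - 2\beta - 4) &= n(l-4) - 2\cdot\frac{(n-1)\,n}{2} = n\,(l - 3 - n) \\
&= \frac{l-3}{2}\cdot\frac{l-3}{2} = \frac{(l-3)^2}{4}.
\end{aligned}
$$

For example, a sequence of length l = 41 gives n = 19 values of β and 38²/4 = 361 features.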


(6) Encoding DNA/RNA Sequences


Encoding the DNA/RNA sequence dataset D into a numerical dataset D′ by performing the above steps (1) to (5), $D' \in \mathbb{R}^{s \times \frac{(l-3)^2}{4}}$, where s is the number of samples in the numerical dataset D′, that is, the number of DNA/RNA sequences in dataset D, and $(l-3)^2/4$ is the number of features of the numerical dataset D′.
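Steps (1) to (6) can be combined into a compact, self-contained Python sketch. It is an illustration under stated assumptions rather than the disclosed implementation: frequencies are counts divided by dataset size, the logarithm is natural, `eps` is an assumed smoothing constant for zero frequencies, and all names (`encode_sequence`, `encode_dataset`, the toy sequences) are hypothetical.

```python
import math

def _freq(dataset, positions, letters):
    # Fraction of sequences carrying `letters` at the 1-based `positions`.
    hits = sum(all(s[p - 1] == c for p, c in zip(positions, letters))
               for s in dataset)
    return hits / len(dataset)

def _pjmi(seq, dataset, k, beta, sign, eps):
    # sign = +1 gives the forward value, sign = -1 the backward value.
    p2, p3 = k + sign * (beta + 1), k + sign * (beta + 2)
    x, y, z = seq[k - 1], seq[p2 - 1], seq[p3 - 1]
    joint = _freq(dataset, (k, p2, p3), (x, y, z))
    marginal = _freq(dataset, (k,), (x,)) * _freq(dataset, (p2, p3), (y, z))
    return math.log((joint + eps) / (marginal + eps))

def encode_sequence(seq, positive, negative, eps=1e-9):
    """Concatenate V(0), V(1), ..., V((l-5)/2) for one sequence."""
    length = len(seq)
    if length % 2 == 0 or length < 5:
        raise ValueError("sequence length must be odd and at least 5")
    features = []
    for beta in range((length - 5) // 2 + 1):
        for k in range(beta + 3, length - beta - 1):
            v_pos = (_pjmi(seq, positive, k, beta, 1, eps)
                     + _pjmi(seq, positive, k, beta, -1, eps)) / 2
            v_neg = (_pjmi(seq, negative, k, beta, 1, eps)
                     + _pjmi(seq, negative, k, beta, -1, eps)) / 2
            features.append(v_pos - v_neg)  # V_k = v_k^+ - v_k^-
    return features

def encode_dataset(positive, negative):
    """Return one feature row per sequence of D = D+ U D-."""
    return [encode_sequence(s, positive, negative) for s in positive + negative]

# Hypothetical toy data with l = 7, so each row has (7 - 3)**2 // 4 = 4 features.
positive = ["ACGTACG", "ACGTTCG"]
negative = ["TTGACCA", "GTGACAA"]
encoded = encode_dataset(positive, negative)
```

Frequencies are recomputed on demand to keep the sketch short; a practical implementation would presumably precompute the propensity matrices once per dataset and value of β.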


In the present disclosure, a bidirectional dinucleotide position-specific propensity and a bidirectional trinucleotide position-specific propensity are proposed based on nucleotide position-specific propensities, and a pointwise joint mutual information is proposed based on the nucleotide position-specific propensity matrix, the bidirectional dinucleotide position-specific propensity matrix and the bidirectional trinucleotide position-specific propensity matrix. An encoding method is then proposed that represents DNA/RNA sequences by using the pointwise joint mutual information together with these three matrices computed on the positive and negative datasets of DNA/RNA sequences, so that DNA/RNA sequences are encoded into numerical feature samples. In order to extract more trinucleotide position information from DNA/RNA sequences, the parameter β is introduced into the process of constructing the bidirectional trinucleotide position-specific propensity matrix to represent the distance between the current nucleotide and its forward or backward adjacent dinucleotide, and the numerical feature vectors obtained from different values of β are concatenated, so as to obtain a high-dimensional numerical feature vector with global and local categorical information and low redundancy between features.
Simulation comparative experiments were carried out using the encoding method provided by the present disclosure and seven existing encoding methods. The experimental results show that the accuracy, sensitivity, specificity, MCC (Matthews correlation coefficient), AUROC (area under the receiver operating characteristic curve) and AUPRC (area under the precision-recall curve) of the support vector machine model constructed based on the encoding method provided by the present disclosure for identifying the DNA N4-methylcytosine (4mC) sites in the Caenorhabditis elegans DNA sequences are 0.987, 0.991, 0.983, 0.974, 0.999 and 0.999, respectively, which are much higher than those of the other seven compared encoding methods. For identifying the RNA N6-methyladenosine (m6A) sites in the Saccharomyces cerevisiae RNA sequences, the accuracy, sensitivity, specificity, MCC, AUROC and AUPRC of the support vector machine model constructed based on the encoding method provided by the present disclosure are 0.995, 0.996, 0.994, 0.990, 1 and 1, respectively, which are also much higher than those of the other seven compared encoding methods.
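For reference, the threshold-based metrics named above have standard definitions in terms of confusion-matrix counts. The following Python sketch uses those standard definitions; the counts are hypothetical, are not taken from the experiments of the disclosure, and the function name `binary_metrics` is illustrative.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity and MCC from confusion-matrix counts.

    Assumes both classes are present, so the denominators are non-zero.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "mcc": mcc}

# Hypothetical counts for illustration only.
example = binary_metrics(tp=90, fp=20, tn=80, fn=10)
```

AUROC and AUPRC are not covered here because they depend on the ranking of prediction scores rather than on a single confusion matrix.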





BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 is the flowchart of the method of the present disclosure.



FIG. 2 shows the AUROC curves of the support vector machine models for identifying the DNA N4-methylcytosine sites in the DNA sequence of Caenorhabditis elegans based on the encoding method provided by the present disclosure and seven encoding methods, respectively.



FIG. 3 shows the AUPRC curves of the support vector machine models for identifying the DNA N4-methylcytosine sites in the DNA sequence of Caenorhabditis elegans based on the encoding method provided by the present disclosure and seven encoding methods, respectively.



FIG. 4 shows the AUROC curves of the support vector machine models for identifying the RNA N6-methyladenosine sites in the Saccharomyces cerevisiae RNA sequences based on the encoding method provided by the present disclosure and seven encoding methods, respectively.



FIG. 5 shows the AUPRC curves of the support vector machine models for identifying the RNA N6-methyladenosine sites in the Saccharomyces cerevisiae RNA sequences based on the encoding method provided by the present disclosure and seven encoding methods, respectively.





DETAILED DESCRIPTION OF THE EMBODIMENTS

The technical schemes provided by the present disclosure will be described in detail below with reference to the figures and examples, but they should not be understood as any limitation to the scope of the present disclosure.


Example 1

The DNA N4-methylcytosine (4mC) dataset of the Caenorhabditis elegans DNA sequences recorded in the literature “iDNA4mC: identifying DNA N4-methylcytosine sites based on nucleotide chemical properties” was taken as an example. The dataset consisted of 3108 DNA sequences, of which the number of sequences in the positive dataset, i.e., the number of actual N4-methylcytosine samples, was 1554, the number of sequences in the negative dataset, i.e., the number of non-N4-methylcytosine samples, was 1554, and the length l of each sequence was 41. The method for encoding the DNA sequences based on the bidirectional trinucleotide position-specific propensities and pointwise joint mutual information of the present example comprises the following steps (see FIG. 1):


(1) a nucleotide position-specific propensity matrix of DNA sequences was constructed;


A dataset D of DNA sequences was given, and it consisted of a positive dataset $D^{+}$ and a negative dataset $D^{-}$, i.e., $D = D^{+} \cup D^{-}$;


the nucleotide position-specific propensity matrix $M_s^{+}$ for the positive dataset $D^{+}$ was determined according to the following formula:

$$
M_s^{+} = \begin{bmatrix}
f_{A,1}^{+} & f_{A,2}^{+} & \cdots & f_{A,i}^{+} \\
f_{C,1}^{+} & f_{C,2}^{+} & \cdots & f_{C,i}^{+} \\
f_{G,1}^{+} & f_{G,2}^{+} & \cdots & f_{G,i}^{+} \\
f_{T,1}^{+} & f_{T,2}^{+} & \cdots & f_{T,i}^{+}
\end{bmatrix}
$$

wherein A, C, G and T were the 4 types of nucleotides of DNA sequences, i represented the position of a nucleotide, 1≤i≤l, i was a positive integer, l was the length of a DNA sequence and was an odd number, the value of l in this example was 41, and $f_{A,i}^{+}$, $f_{C,i}^{+}$, $f_{G,i}^{+}$ and $f_{T,i}^{+}$ were the occurrence frequencies of nucleotides A, C, G and T at position i of all sequences of the positive dataset $D^{+}$, respectively;


The nucleotide position-specific propensity matrix $M_s^{-}$ for the negative dataset $D^{-}$ was determined according to the following formula:

$$
M_s^{-} = \begin{bmatrix}
f_{A,1}^{-} & f_{A,2}^{-} & \cdots & f_{A,i}^{-} \\
f_{C,1}^{-} & f_{C,2}^{-} & \cdots & f_{C,i}^{-} \\
f_{G,1}^{-} & f_{G,2}^{-} & \cdots & f_{G,i}^{-} \\
f_{T,1}^{-} & f_{T,2}^{-} & \cdots & f_{T,i}^{-}
\end{bmatrix}
$$

wherein $f_{A,i}^{-}$, $f_{C,i}^{-}$, $f_{G,i}^{-}$ and $f_{T,i}^{-}$ were the occurrence frequencies of nucleotides A, C, G and T at position i of all sequences of the negative dataset $D^{-}$, respectively.


(2) A bidirectional dinucleotide position-specific propensity matrix of DNA sequences was constructed;


The forward dinucleotide position-specific propensity matrix $\overrightarrow{M}_d^{+}$ for the positive dataset $D^{+}$ was determined according to the following formula:

$$
\overrightarrow{M}_d^{+} = \begin{bmatrix}
\overrightarrow{f}_{AA,1}^{+} & \overrightarrow{f}_{AA,2}^{+} & \cdots & \overrightarrow{f}_{AA,j}^{+} \\
\overrightarrow{f}_{AC,1}^{+} & \overrightarrow{f}_{AC,2}^{+} & \cdots & \overrightarrow{f}_{AC,j}^{+} \\
\vdots & \vdots & \ddots & \vdots \\
\overrightarrow{f}_{TT,1}^{+} & \overrightarrow{f}_{TT,2}^{+} & \cdots & \overrightarrow{f}_{TT,j}^{+}
\end{bmatrix}
$$

wherein AA, AC, . . . , and TT were the 16 types of dinucleotides formed by the 4 types of nucleotides A, C, G, and T of DNA sequences, j represented the position of the dinucleotide, that is, the position of the first nucleotide of the dinucleotide, the second nucleotide of the dinucleotide was at position j+1, 2≤j≤l−1, j was a finite positive integer, 2≤j≤40 in this example, and $\overrightarrow{f}_{AA,j}^{+}$, $\overrightarrow{f}_{AC,j}^{+}$, . . . , and $\overrightarrow{f}_{TT,j}^{+}$ were the occurrence frequencies of dinucleotides AA, AC, . . . , and TT of all sequences of the positive dataset $D^{+}$, respectively;


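The backward dinucleotide position-specific propensity matrix $\overleftarrow{M}_d^{+}$ for the positive dataset $D^{+}$ was determined according to the following formula:

$$
\overleftarrow{M}_d^{+} = \begin{bmatrix}
\overleftarrow{f}_{AA,2}^{+} & \overleftarrow{f}_{AA,3}^{+} & \cdots & \overleftarrow{f}_{AA,j}^{+} \\
\overleftarrow{f}_{AC,2}^{+} & \overleftarrow{f}_{AC,3}^{+} & \cdots & \overleftarrow{f}_{AC,j}^{+} \\
\vdots & \vdots & \ddots & \vdots \\
\overleftarrow{f}_{TT,2}^{+} & \overleftarrow{f}_{TT,3}^{+} & \cdots & \overleftarrow{f}_{TT,j}^{+}
\end{bmatrix}
$$

wherein $\overleftarrow{f}_{AA,j}^{+}$, $\overleftarrow{f}_{AC,j}^{+}$, . . . , and $\overleftarrow{f}_{TT,j}^{+}$ were the occurrence frequencies of dinucleotides AA, AC, . . . , and TT in the positive dataset $D^{+}$, respectively, and the first and second nucleotides of these dinucleotides were at positions j and j−1, respectively;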


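The forward dinucleotide position-specific propensity matrix $\overrightarrow{M}_d^{-}$ for the negative dataset $D^{-}$ was determined according to the following formula:

$$
\overrightarrow{M}_d^{-} = \begin{bmatrix}
\overrightarrow{f}_{AA,2}^{-} & \overrightarrow{f}_{AA,3}^{-} & \cdots & \overrightarrow{f}_{AA,j}^{-} \\
\overrightarrow{f}_{AC,2}^{-} & \overrightarrow{f}_{AC,3}^{-} & \cdots & \overrightarrow{f}_{AC,j}^{-} \\
\vdots & \vdots & \ddots & \vdots \\
\overrightarrow{f}_{TT,2}^{-} & \overrightarrow{f}_{TT,3}^{-} & \cdots & \overrightarrow{f}_{TT,j}^{-}
\end{bmatrix}
$$

wherein $\overrightarrow{f}_{AA,j}^{-}$, $\overrightarrow{f}_{AC,j}^{-}$, . . . , and $\overrightarrow{f}_{TT,j}^{-}$ were the occurrence frequencies of dinucleotides AA, AC, . . . , and TT of all sequences in the negative dataset $D^{-}$, respectively. The first and second nucleotides of these dinucleotides were at positions j and j+1, respectively;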


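The backward dinucleotide position-specific propensity matrix $\overleftarrow{M}_d^{-}$ for the negative dataset $D^{-}$ was determined according to the following formula:

$$
\overleftarrow{M}_d^{-} = \begin{bmatrix}
\overleftarrow{f}_{AA,2}^{-} & \overleftarrow{f}_{AA,3}^{-} & \cdots & \overleftarrow{f}_{AA,j}^{-} \\
\overleftarrow{f}_{AC,2}^{-} & \overleftarrow{f}_{AC,3}^{-} & \cdots & \overleftarrow{f}_{AC,j}^{-} \\
\vdots & \vdots & \ddots & \vdots \\
\overleftarrow{f}_{TT,2}^{-} & \overleftarrow{f}_{TT,3}^{-} & \cdots & \overleftarrow{f}_{TT,j}^{-}
\end{bmatrix}
$$

wherein $\overleftarrow{f}_{AA,j}^{-}$, $\overleftarrow{f}_{AC,j}^{-}$, . . . , and $\overleftarrow{f}_{TT,j}^{-}$ were the occurrence frequencies of dinucleotides AA, AC, . . . , and TT of all sequences of the negative dataset $D^{-}$, respectively. The first and second nucleotides of these dinucleotides were at positions j and j−1, respectively;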


(3) A bidirectional trinucleotide position-specific propensity matrix of DNA sequences was constructed


The forward trinucleotide position-specific propensity matrix $\overrightarrow{M}_t^{+}$ for the positive dataset $D^{+}$ was determined according to the following formula:

$$
\overrightarrow{M}_t^{+} = \begin{bmatrix}
\overrightarrow{f}_{AAA,\beta+3}^{+} & \overrightarrow{f}_{AAA,\beta+4}^{+} & \cdots & \overrightarrow{f}_{AAA,k}^{+} \\
\overrightarrow{f}_{AAC,\beta+3}^{+} & \overrightarrow{f}_{AAC,\beta+4}^{+} & \cdots & \overrightarrow{f}_{AAC,k}^{+} \\
\vdots & \vdots & \ddots & \vdots \\
\overrightarrow{f}_{TTT,\beta+3}^{+} & \overrightarrow{f}_{TTT,\beta+4}^{+} & \cdots & \overrightarrow{f}_{TTT,k}^{+}
\end{bmatrix}
$$

wherein AAA, AAC, . . . , TTT were the 64 types of trinucleotides formed by the 4 types of nucleotides A, C, G, and T of DNA sequences, β represented the distance between the nucleotide at position k and its forward adjacent dinucleotide, 0≤β≤(l−5)/2, β was a non-negative integer, 0≤β≤18 in this example, k represented the position of a trinucleotide, that is, the position of the first nucleotide of the trinucleotide, β+3≤k≤l−β−2, β+3≤k≤39−β in this example, k was a positive integer, and $\overrightarrow{f}_{AAA,k}^{+}$, $\overrightarrow{f}_{AAC,k}^{+}$, . . . , and $\overrightarrow{f}_{TTT,k}^{+}$ were the occurrence frequencies of trinucleotides AAA, AAC, . . . , and TTT of all sequences in the positive dataset $D^{+}$, respectively. The first, second and third nucleotides of these trinucleotides were at positions k, k+β+1, and k+β+2 of the DNA sequences, respectively;


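The backward trinucleotide position-specific propensity matrix $\overleftarrow{M}_{t}^{+}$ for the positive dataset D+ was determined according to the following formula:

$$
\overleftarrow{M}_{t}^{+} =
\begin{bmatrix}
\overleftarrow{f}_{AAA,\beta+3}^{+} & \overleftarrow{f}_{AAA,\beta+4}^{+} & \cdots & \overleftarrow{f}_{AAA,k}^{+} \\
\overleftarrow{f}_{AAC,\beta+3}^{+} & \overleftarrow{f}_{AAC,\beta+4}^{+} & \cdots & \overleftarrow{f}_{AAC,k}^{+} \\
\vdots & \vdots & \ddots & \vdots \\
\overleftarrow{f}_{TTT,\beta+3}^{+} & \overleftarrow{f}_{TTT,\beta+4}^{+} & \cdots & \overleftarrow{f}_{TTT,k}^{+}
\end{bmatrix}
$$

wherein $\overleftarrow{f}_{AAA,k}^{+}$, $\overleftarrow{f}_{AAC,k}^{+}$, . . . , and $\overleftarrow{f}_{TTT,k}^{+}$ were the occurrence frequencies of trinucleotides AAA, AAC, . . . , and TTT of all sequences of positive dataset D+, respectively. The first, second and third nucleotides of these trinucleotides were at positions k, k−β−1, and k−β−2, respectively;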


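The forward trinucleotide position-specific propensity matrix $\overrightarrow{M}_{t}^{-}$ for the negative dataset D− was determined according to the following formula:

$$
\overrightarrow{M}_{t}^{-} =
\begin{bmatrix}
\overrightarrow{f}_{AAA,\beta+3}^{-} & \overrightarrow{f}_{AAA,\beta+4}^{-} & \cdots & \overrightarrow{f}_{AAA,k}^{-} \\
\overrightarrow{f}_{AAC,\beta+3}^{-} & \overrightarrow{f}_{AAC,\beta+4}^{-} & \cdots & \overrightarrow{f}_{AAC,k}^{-} \\
\vdots & \vdots & \ddots & \vdots \\
\overrightarrow{f}_{TTT,\beta+3}^{-} & \overrightarrow{f}_{TTT,\beta+4}^{-} & \cdots & \overrightarrow{f}_{TTT,k}^{-}
\end{bmatrix}
$$

wherein $\overrightarrow{f}_{AAA,k}^{-}$, $\overrightarrow{f}_{AAC,k}^{-}$, . . . , and $\overrightarrow{f}_{TTT,k}^{-}$ were the occurrence frequencies of trinucleotides AAA, AAC, . . . , and TTT of all sequences of negative dataset D−, respectively. The first, second and third nucleotides of a trinucleotide were at positions k, k+β+1, and k+β+2, respectively;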


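The backward trinucleotide position-specific propensity matrix $\overleftarrow{M}_{t}^{-}$ for the negative dataset D− was determined according to the following formula:

$$
\overleftarrow{M}_{t}^{-} =
\begin{bmatrix}
\overleftarrow{f}_{AAA,\beta+3}^{-} & \overleftarrow{f}_{AAA,\beta+4}^{-} & \cdots & \overleftarrow{f}_{AAA,k}^{-} \\
\overleftarrow{f}_{AAC,\beta+3}^{-} & \overleftarrow{f}_{AAC,\beta+4}^{-} & \cdots & \overleftarrow{f}_{AAC,k}^{-} \\
\vdots & \vdots & \ddots & \vdots \\
\overleftarrow{f}_{TTT,\beta+3}^{-} & \overleftarrow{f}_{TTT,\beta+4}^{-} & \cdots & \overleftarrow{f}_{TTT,k}^{-}
\end{bmatrix}
$$

wherein $\overleftarrow{f}_{AAA,k}^{-}$, $\overleftarrow{f}_{AAC,k}^{-}$, . . . , and $\overleftarrow{f}_{TTT,k}^{-}$ were the occurrence frequencies of trinucleotides AAA, AAC, . . . , and TTT of all sequences of negative dataset D−, respectively. The first, second and third nucleotides of a trinucleotide were at positions k, k−β−1, and k−β−2, respectively;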
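The role of the parameter β may be easier to see in code. The sketch below computes one column of a trinucleotide position-specific propensity matrix: the frequencies of the gapped trinucleotides whose nucleotides sit at positions k, k+β+1 and k+β+2 (forward) or k, k−β−1 and k−β−2 (backward). The function name `trinucleotide_column`, the sparse dictionary return value, and the use of relative frequencies are assumptions of this illustration rather than details given in the source.

```python
from collections import Counter


def trinucleotide_column(seqs, k, beta, backward=False):
    """Frequencies of gapped trinucleotides anchored at 1-based position k.

    Only trinucleotides that actually occur are returned; all others
    implicitly have frequency 0.
    """
    l = len(seqs[0])
    if not (beta + 3 <= k <= l - beta - 2):
        raise ValueError("k must satisfy beta+3 <= k <= l-beta-2")
    sign = -1 if backward else 1
    idx = (k, k + sign * (beta + 1), k + sign * (beta + 2))
    counts = Counter("".join(s[p - 1] for p in idx) for s in seqs)
    return {tri: c / len(seqs) for tri, c in counts.items()}
```

With β=0 the three positions are contiguous apart from the anchor gap of one step; larger β moves the dinucleotide further from the nucleotide at position k, which is why the admissible range of k shrinks as β grows.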


(4) A value of the pointwise joint mutual information of the nucleotides of DNA sequences was determined


(4.1) The value $\overrightarrow{v}_{k}^{+}$ of the forward pointwise joint mutual information of nucleotides of DNA sequences to be encoded in the positive dataset D+ was determined according to the following formula:

$$
\overrightarrow{v}_{k}^{+} = \log \frac{\overrightarrow{f}_{x\overrightarrow{yz},k}^{+}}{f_{x,k}^{+}\,\overrightarrow{f}_{\overrightarrow{yz},k+\beta+1}^{+}}
$$

wherein x was the nucleotide at position k, x∈{A, C, G, T}; $\overrightarrow{y}$ was the nucleotide at position k+β+1, $\overrightarrow{y}$∈{A, C, G, T}; $\overrightarrow{z}$ was the nucleotide at position k+β+2, $\overrightarrow{z}$∈{A, C, G, T}; $\overrightarrow{f}_{x\overrightarrow{yz},k}^{+}$ was the occurrence frequency of trinucleotide $x\overrightarrow{yz}$ of all sequences of positive dataset D+; $\overrightarrow{f}_{\overrightarrow{yz},k+\beta+1}^{+}$ was the occurrence frequency of dinucleotide $\overrightarrow{yz}$ of all sequences of positive dataset D+; and $f_{x,k}^{+}$ was the occurrence frequency of nucleotide x of all sequences of positive dataset D+;


The value $\overleftarrow{v}_{k}^{+}$ of the backward pointwise joint mutual information of nucleotides of DNA sequences to be encoded in the positive dataset D+ was determined according to the following formula:

$$
\overleftarrow{v}_{k}^{+} = \log \frac{\overleftarrow{f}_{x\overleftarrow{yz},k}^{+}}{f_{x,k}^{+}\,\overleftarrow{f}_{\overleftarrow{yz},k-\beta-1}^{+}}
$$

wherein $\overleftarrow{y}$ was the nucleotide at position k−β−1, $\overleftarrow{y}$∈{A, C, G, T}; $\overleftarrow{z}$ was the nucleotide at position k−β−2, $\overleftarrow{z}$∈{A, C, G, T}; $\overleftarrow{f}_{x\overleftarrow{yz},k}^{+}$ was the occurrence frequency of trinucleotide $x\overleftarrow{yz}$ of all sequences of positive dataset D+; and $\overleftarrow{f}_{\overleftarrow{yz},k-\beta-1}^{+}$ was the occurrence frequency of dinucleotide $\overleftarrow{yz}$ of all sequences of positive dataset D+.


The encoding value $v_{k}^{+}$ of pointwise joint mutual information of the nucleotide at position k of a DNA sequence to be encoded in the positive dataset D+ was defined as the average of the value $\overrightarrow{v}_{k}^{+}$ of forward pointwise joint mutual information and the value $\overleftarrow{v}_{k}^{+}$ of backward pointwise joint mutual information, and a DNA sequence with length l was encoded into a pointwise mutual information feature vector $V^{+}$ with l−2β−4 elements:

$$
V^{+} = \left[ v_{\beta+3}^{+},\ v_{\beta+4}^{+},\ \ldots,\ v_{k}^{+} \right]
$$

$$
v_{k}^{+} = \frac{\overrightarrow{v}_{k}^{+} + \overleftarrow{v}_{k}^{+}}{2}
$$

The value of l was 41 in this example.
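The forward and backward values and their average can be sketched in a few lines of Python. This is an illustrative reading of the formulas, not the applicants' implementation: the names `kmer_freq`, `pjmi` and `encode_position` are hypothetical, the natural logarithm is assumed because the source does not state the base, and the small constant `EPS` is an added assumption to avoid taking the logarithm of zero when a trinucleotide never occurs in the reference dataset.

```python
import math
from collections import Counter

EPS = 1e-6  # assumed smoothing constant; zero-count handling is not specified in the source


def kmer_freq(seqs, positions):
    """Relative frequency of the k-mer read at the given 1-based positions."""
    counts = Counter("".join(s[p - 1] for p in positions) for s in seqs)
    return {kmer: c / len(seqs) for kmer, c in counts.items()}


def pjmi(seq, dataset, k, beta, direction):
    """Forward (direction=+1) or backward (direction=-1) value at position k."""
    p_y = k + direction * (beta + 1)
    p_z = k + direction * (beta + 2)
    x, y, z = seq[k - 1], seq[p_y - 1], seq[p_z - 1]
    f_xyz = kmer_freq(dataset, (k, p_y, p_z)).get(x + y + z, 0.0)
    f_x = kmer_freq(dataset, (k,)).get(x, 0.0)
    f_yz = kmer_freq(dataset, (p_y, p_z)).get(y + z, 0.0)
    return math.log((f_xyz + EPS) / (f_x * f_yz + EPS))


def encode_position(seq, dataset, k, beta):
    """Average of the forward and backward values at position k."""
    return (pjmi(seq, dataset, k, beta, +1) + pjmi(seq, dataset, k, beta, -1)) / 2
```

When the nucleotide at position k and the neighbouring dinucleotide occur independently in the dataset, the ratio is close to 1 and the value is close to 0; co-occurrence above chance gives a positive value.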


(4.2) The value $\overrightarrow{v}_{k}^{-}$ of forward pointwise joint mutual information of nucleotides of a DNA sequence to be encoded in the negative dataset D− was determined according to the following formula:

$$
\overrightarrow{v}_{k}^{-} = \log \frac{\overrightarrow{f}_{x\overrightarrow{yz},k}^{-}}{f_{x,k}^{-}\,\overrightarrow{f}_{\overrightarrow{yz},k+\beta+1}^{-}}
$$

wherein the nucleotides x, $\overrightarrow{y}$, and $\overrightarrow{z}$ were at positions k, k+β+1 and k+β+2, respectively; $\overrightarrow{f}_{x\overrightarrow{yz},k}^{-}$ was the occurrence frequency of trinucleotide $x\overrightarrow{yz}$ of all sequences of negative dataset D−; $\overrightarrow{f}_{\overrightarrow{yz},k+\beta+1}^{-}$ was the occurrence frequency of dinucleotide $\overrightarrow{yz}$ of all sequences of negative dataset D−; and $f_{x,k}^{-}$ was the occurrence frequency of the nucleotide x of all sequences of negative dataset D−.


The value $\overleftarrow{v}_{k}^{-}$ of backward pointwise joint mutual information of nucleotides of a DNA sequence to be encoded in the negative dataset D− was determined according to the following formula:

$$
\overleftarrow{v}_{k}^{-} = \log \frac{\overleftarrow{f}_{x\overleftarrow{yz},k}^{-}}{f_{x,k}^{-}\,\overleftarrow{f}_{\overleftarrow{yz},k-\beta-1}^{-}}
$$

wherein the nucleotides x, $\overleftarrow{y}$, and $\overleftarrow{z}$ were at positions k, k−β−1 and k−β−2, respectively. $\overleftarrow{f}_{x\overleftarrow{yz},k}^{-}$ was the occurrence frequency of trinucleotide $x\overleftarrow{yz}$ of all sequences of negative dataset D−. $\overleftarrow{f}_{\overleftarrow{yz},k-\beta-1}^{-}$ was the occurrence frequency of dinucleotide $\overleftarrow{yz}$ of all sequences of negative dataset D−.


The encoding value $v_{k}^{-}$ of pointwise joint mutual information of the nucleotide at position k of a DNA sequence to be encoded in the negative dataset D− was defined as the average of the value $\overrightarrow{v}_{k}^{-}$ of forward pointwise joint mutual information and the value $\overleftarrow{v}_{k}^{-}$ of backward pointwise joint mutual information, and a DNA sequence with a length of l was encoded into a pointwise mutual information feature vector $V^{-}$ with a length of l−2β−4:

$$
V^{-} = \left[ v_{\beta+3}^{-},\ v_{\beta+4}^{-},\ \ldots,\ v_{k}^{-} \right]
$$

$$
v_{k}^{-} = \frac{\overrightarrow{v}_{k}^{-} + \overleftarrow{v}_{k}^{-}}{2}
$$

The value of l was 41 in this example.


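(4.3) The feature vector V of a DNA sequence to be encoded with length l was determined by subtracting each element of vector $V^{-}$ from the corresponding element of vector $V^{+}$:

$$
V = \left[ V_{\beta+3},\ V_{\beta+4},\ \ldots,\ V_{k} \right]
$$

$$
V_{k} = v_{k}^{+} - v_{k}^{-};
$$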


(5) Concatenating features


When the value of parameter β was 0, the feature vector V(0) was $[V_3, V_4, V_5, \ldots, V_{l-3}, V_{l-2}]$, and the number of elements was l−4; when the value of β was 1, the feature vector V(1) was $[V_4, V_5, V_6, \ldots, V_{l-4}, V_{l-3}]$, and the number of elements was l−6; . . . ; when the value of β was (l−7)/2, the feature vector V((l−7)/2) was $[V_{(l-1)/2}, V_{(l+1)/2}, V_{(l+3)/2}]$, and the number of elements was 3; when the value of β was (l−5)/2, the feature vector V((l−5)/2) was $[V_{(l+1)/2}]$, and the number of elements was 1. The feature vectors determined by the different values of the parameter β were concatenated into a high-dimensional feature vector [V(0), V(1), . . . , V((l−7)/2), V((l−5)/2)] with $(l-3)^2/4$ elements. The value of l was 41 in this example, giving 361 elements.
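The layout of the concatenated vector can be enumerated explicitly. The short sketch below lists, for each β, the positions k that contribute an element, and so makes the total element count easy to check. The helper names `positions` and `concatenated_layout` are hypothetical and introduced only for this illustration.

```python
def positions(l, beta):
    """1-based positions k with beta+3 <= k <= l-beta-2."""
    return list(range(beta + 3, l - beta - 1))


def concatenated_layout(l):
    """(beta, k) label of every element of [V(0), V(1), ..., V((l-5)/2)]."""
    if l % 2 == 0 or l < 5:
        raise ValueError("l is assumed to be an odd number >= 5")
    return [(b, k) for b in range((l - 5) // 2 + 1) for k in positions(l, b)]
```

For l=41 the layout has 37 entries for β=0, 35 for β=1, and so on down to a single entry (k=21) for β=18.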


(6) Encoding the DNA sequences


The DNA sequence dataset D was encoded into a numerical dataset D′ by performing the above step (1)-step (5),

$$
D' \in \mathbb{R}^{\,s \times \frac{(l-3)^{2}}{4}},
$$

where s was the number of samples of the numerical dataset D′, i.e. the number of DNA sequences in the DNA sequence dataset D, s was a finite positive integer, the value of s was 3108 in this example, and $(l-3)^2/4$ was the number of features of the numerical dataset D′. The encoding of DNA sequences was completed.


The DNA sequence encoding method of Example 1 was compared with seven other encoding methods, namely PSNP (position-specific nucleotide propensities), PSDP (position-specific dinucleotide propensities), KNF (K-nucleotide frequencies), KSNPF (K-spaced nucleotide pair frequencies), NPPS (nucleotide pair position specificity), PBE (positional binary encoding) and NCPNC (nucleotide chemical property and nucleotide composition), for identifying the DNA N4-methylcytosine sites in Caenorhabditis elegans DNA sequences. The comparison was based on the performance of the support vector machine models constructed using each encoding method. The average classification accuracy, sensitivity, specificity, MCC (Matthews correlation coefficient), AUROC (area under the receiver operating characteristic curve) and AUPRC (area under the precision-recall curve) over 10-fold cross-validation were used to evaluate the experimental results. The experimental method was as follows:


1. The DNA sequences of N4-methylcytosine of Caenorhabditis elegans were encoded according to the method of Example 1;


2. Normalizing the dataset


The numerical dataset D′ was normalized by the maximum-minimum method according to the following formula:

$$
g'_{m,n} = \frac{g_{m,n} - \min(g_{n})}{\max(g_{n}) - \min(g_{n})}
$$

where $g_{m,n}$ was the n-th feature value of the m-th sample of the numerical dataset D′, $g'_{m,n}$ was the normalized value of $g_{m,n}$, max($g_n$) and min($g_n$) were the maximum and minimum feature values of the n-th column of the numerical dataset D′, 1≤m≤s, 1≤n≤$(l-3)^2/4$, m and n were finite positive integers, the value of l in this example was 41, and the value of s was 3108.
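A column-wise maximum-minimum normalization of this kind can be sketched as follows. The function name `min_max_normalize` is hypothetical, and mapping a constant column (where the maximum equals the minimum) to 0.0 is an assumption added here to avoid division by zero; the source does not say how such columns were treated.

```python
def min_max_normalize(rows):
    """Scale every column of a row-major table to the range [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        # Constant columns are mapped to 0.0 (assumption of this sketch).
        [(v - a) / (b - a) if b > a else 0.0 for v, a, b in zip(row, lo, hi)]
        for row in rows
    ]
```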


3. Partitioning dataset


The normalized numerical dataset D′ was partitioned into 10 folds by the K-fold cross-validation method (K=10). In each run, one fold was taken as the test dataset D′Te and the remaining nine folds were taken as the training dataset D′Tr; this was repeated until each fold had served once as the test dataset, giving 10 runs in total. The ratio of the training dataset D′Tr to the test dataset D′Te in each run was 9:1.
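The partitioning can be illustrated with a small generator that yields the training and test indices of each run. The name `k_fold_indices`, the random shuffle and its fixed seed are assumptions of this sketch; the source does not state whether the samples were shuffled or stratified by class before being split.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k runs."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)  # shuffling is an assumption
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

With 3108 samples the folds cannot all be equal in size, so each run holds out 310 or 311 samples and the 9:1 ratio is approximate.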


4. Training and testing the model


The support vector machine model was trained using the training dataset D′Tr, and the performance of the support vector machine model was tested using the test dataset D′Te.


The DNA N4-methylcytosine sites in Caenorhabditis elegans DNA sequences were identified by performing the same operation on the seven compared encoding methods according to steps 2-4 of the experimental methods. The experimental results of classification accuracy, sensitivity, specificity and MCC were shown in Table 1, the experimental results of AUROC were shown in FIG. 2, and the experimental results of AUPRC were shown in FIG. 3.









TABLE 1. Comparison of experimental results between the method of Example 1 and other seven methods (evaluation criteria: accuracy, sensitivity, specificity, MCC)

| Encoding method | Accuracy | Sensitivity | Specificity | MCC |
|---|---|---|---|---|
| The present invention | 0.987 | 0.991 | 0.983 | 0.974 |
| PSNP | 0.739 | 0.732 | 0.746 | 0.479 |
| PSDP | 0.827 | 0.820 | 0.833 | 0.653 |
| KNF | 0.653 | 0.656 | 0.651 | 0.307 |
| KSNPF | 0.662 | 0.642 | 0.681 | 0.324 |
| NPPS | 0.877 | 0.880 | 0.873 | 0.754 |
| PBE | 0.763 | 0.775 | 0.750 | 0.526 |
| NCPNC | 0.762 | 0.772 | 0.752 | 0.524 |









As shown in Table 1, the accuracy, sensitivity, specificity and MCC for identifying the DNA N4-methylcytosine sites in Caenorhabditis elegans DNA sequences through the support vector machine model constructed based on the DNA sequence encoding method of the present disclosure were 0.987, 0.991, 0.983 and 0.974, respectively, which were much higher than those of the other seven compared encoding methods.
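For readers unfamiliar with the four criteria reported in Table 1, the sketch below shows how they are conventionally computed from the counts of a binary confusion matrix. The function name `binary_metrics` is hypothetical, and the counts used in the usage note are invented purely for illustration; they are not results from the source.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity, MCC) from confusion counts."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, sensitivity, specificity, mcc
```

For example, hypothetical counts of 90 true positives, 10 false negatives, 80 true negatives and 20 false positives give an accuracy of 0.85, a sensitivity of 0.9, a specificity of 0.8 and an MCC of about 0.70.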


As shown in FIG. 2, the value of AUROC for identifying the DNA N4-methylcytosine sites in Caenorhabditis elegans DNA sequences through the support vector machine model constructed based on the DNA sequence encoding method of the present disclosure was 0.999, which was much higher than that of the other seven compared encoding methods.


As shown in FIG. 3, the value of AUPRC for identifying the DNA N4-methylcytosine sites in Caenorhabditis elegans DNA sequences through the support vector machine model constructed based on the DNA sequence encoding method of the present disclosure was 0.999, which was much higher than that of the other seven compared encoding methods.


Example 2

The RNA N6-methyladenosine (m6A) dataset of Saccharomyces cerevisiae RNA sequences in the literature “Benchmark data for identifying N6-methyladenosine sites in the Saccharomyces cerevisiae genome” was taken as an example. The dataset consisted of 2614 RNA sequences, of which the number of samples in the positive dataset, i.e., the number of actual N6-methyladenosine samples, was 1307, the number of samples in the negative dataset, i.e., the number of non-N6-methyladenosine samples, was 1307, and the length l of each sequence was 51. The method for encoding RNA sequences based on bidirectional trinucleotide position-specific propensities and pointwise joint mutual information of the present example comprises the following steps (see FIG. 1):


(1) A nucleotide position-specific propensity matrix of RNA sequences was constructed;


A dataset D of RNA sequences was given, and the dataset consisted of a positive dataset D+ and a negative dataset D−, i.e. D=D+∪D−;


The nucleotide position-specific propensity matrix $M_{s}^{+}$ for the positive dataset D+ was determined according to the following formula:

$$
M_{s}^{+} =
\begin{bmatrix}
f_{A,1}^{+} & f_{A,2}^{+} & \cdots & f_{A,i}^{+} \\
f_{C,1}^{+} & f_{C,2}^{+} & \cdots & f_{C,i}^{+} \\
f_{G,1}^{+} & f_{G,2}^{+} & \cdots & f_{G,i}^{+} \\
f_{U,1}^{+} & f_{U,2}^{+} & \cdots & f_{U,i}^{+}
\end{bmatrix}
$$

wherein A, C, G and U were the 4 types of nucleotides of RNA sequences; i represented the position of a nucleotide, 1≤i≤l, and i was a finite positive integer; l was the length of an RNA sequence, its value was an odd number, and the value of l in this example was 51; $f_{A,i}^{+}$, $f_{C,i}^{+}$, $f_{G,i}^{+}$ and $f_{U,i}^{+}$ were the occurrence frequencies of nucleotides A, C, G and U at position i of all sequences of positive dataset D+, respectively;


The nucleotide position-specific propensity matrix $M_{s}^{-}$ of the negative dataset D− was determined according to the following formula:

$$
M_{s}^{-} =
\begin{bmatrix}
f_{A,1}^{-} & f_{A,2}^{-} & \cdots & f_{A,i}^{-} \\
f_{C,1}^{-} & f_{C,2}^{-} & \cdots & f_{C,i}^{-} \\
f_{G,1}^{-} & f_{G,2}^{-} & \cdots & f_{G,i}^{-} \\
f_{U,1}^{-} & f_{U,2}^{-} & \cdots & f_{U,i}^{-}
\end{bmatrix}
$$

wherein $f_{A,i}^{-}$, $f_{C,i}^{-}$, $f_{G,i}^{-}$ and $f_{U,i}^{-}$ were the occurrence frequencies of nucleotides A, C, G and U at position i of all sequences of negative dataset D−, respectively.
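A nucleotide position-specific propensity matrix of this form is a per-position frequency table. The sketch below builds one for a list of equal-length RNA sequences; the function name `nucleotide_matrix`, the dictionary-of-rows layout, and the use of relative frequencies (counts divided by the number of sequences) are assumptions of this illustration.

```python
def nucleotide_matrix(seqs, alphabet="ACGU"):
    """Rows follow A, C, G, U; columns follow positions 1 .. l."""
    n = len(seqs)
    l = len(seqs[0])
    return {
        a: [sum(s[i] == a for s in seqs) / n for i in range(l)]
        for a in alphabet
    }
```

Applying the function separately to the positive and the negative dataset gives the two matrices described above.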


(2) A bidirectional dinucleotide position-specific propensity matrix of RNA sequences was constructed;


The forward dinucleotide position-specific propensity matrix $\overrightarrow{M}_{d}^{+}$ for the positive dataset D+ was determined according to the following formula:

$$
\overrightarrow{M}_{d}^{+} =
\begin{bmatrix}
\overrightarrow{f}_{AA,2}^{+} & \overrightarrow{f}_{AA,3}^{+} & \cdots & \overrightarrow{f}_{AA,j}^{+} \\
\overrightarrow{f}_{AC,2}^{+} & \overrightarrow{f}_{AC,3}^{+} & \cdots & \overrightarrow{f}_{AC,j}^{+} \\
\vdots & \vdots & \ddots & \vdots \\
\overrightarrow{f}_{UU,2}^{+} & \overrightarrow{f}_{UU,3}^{+} & \cdots & \overrightarrow{f}_{UU,j}^{+}
\end{bmatrix}
$$

wherein AA, AC, . . . , and UU were the 16 types of dinucleotides formed by the 4 types of nucleotides A, C, G, and U of RNA sequences; j represented the position of a dinucleotide, i.e., the position of the first nucleotide of the dinucleotide, 2≤j≤l−1, j was a finite positive integer, and 2≤j≤50 in this example; $\overrightarrow{f}_{AA,j}^{+}$, $\overrightarrow{f}_{AC,j}^{+}$, . . . , and $\overrightarrow{f}_{UU,j}^{+}$ were the occurrence frequencies of dinucleotides AA, AC, . . . , and UU of all sequences of positive dataset D+, respectively, and the first and second nucleotides of the dinucleotides were at positions j and j+1, respectively;







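The backward dinucleotide position-specific propensity matrix $\overleftarrow{M}_{d}^{+}$ for the positive dataset D+ was determined according to the following formula:

$$
\overleftarrow{M}_{d}^{+} =
\begin{bmatrix}
\overleftarrow{f}_{AA,2}^{+} & \overleftarrow{f}_{AA,3}^{+} & \cdots & \overleftarrow{f}_{AA,j}^{+} \\
\overleftarrow{f}_{AC,2}^{+} & \overleftarrow{f}_{AC,3}^{+} & \cdots & \overleftarrow{f}_{AC,j}^{+} \\
\vdots & \vdots & \ddots & \vdots \\
\overleftarrow{f}_{UU,2}^{+} & \overleftarrow{f}_{UU,3}^{+} & \cdots & \overleftarrow{f}_{UU,j}^{+}
\end{bmatrix}
$$

wherein $\overleftarrow{f}_{AA,j}^{+}$, $\overleftarrow{f}_{AC,j}^{+}$, . . . , and $\overleftarrow{f}_{UU,j}^{+}$ were the occurrence frequencies of dinucleotides AA, AC, . . . , and UU of all sequences of positive dataset D+, respectively. The first and second nucleotides of these dinucleotides were at positions j and j−1, respectively;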


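The forward dinucleotide position-specific propensity matrix $\overrightarrow{M}_{d}^{-}$ for the negative dataset D− was determined according to the following formula:

$$
\overrightarrow{M}_{d}^{-} =
\begin{bmatrix}
\overrightarrow{f}_{AA,2}^{-} & \overrightarrow{f}_{AA,3}^{-} & \cdots & \overrightarrow{f}_{AA,j}^{-} \\
\overrightarrow{f}_{AC,2}^{-} & \overrightarrow{f}_{AC,3}^{-} & \cdots & \overrightarrow{f}_{AC,j}^{-} \\
\vdots & \vdots & \ddots & \vdots \\
\overrightarrow{f}_{UU,2}^{-} & \overrightarrow{f}_{UU,3}^{-} & \cdots & \overrightarrow{f}_{UU,j}^{-}
\end{bmatrix}
$$

wherein $\overrightarrow{f}_{AA,j}^{-}$, $\overrightarrow{f}_{AC,j}^{-}$, . . . , and $\overrightarrow{f}_{UU,j}^{-}$ were the occurrence frequencies of dinucleotides AA, AC, . . . , and UU of all sequences of negative dataset D−, respectively, and the first and second nucleotides of these dinucleotides were at positions j and j+1, respectively;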


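The backward dinucleotide position-specific propensity matrix $\overleftarrow{M}_{d}^{-}$ for the negative dataset D− was determined according to the following formula:

$$
\overleftarrow{M}_{d}^{-} =
\begin{bmatrix}
\overleftarrow{f}_{AA,2}^{-} & \overleftarrow{f}_{AA,3}^{-} & \cdots & \overleftarrow{f}_{AA,j}^{-} \\
\overleftarrow{f}_{AC,2}^{-} & \overleftarrow{f}_{AC,3}^{-} & \cdots & \overleftarrow{f}_{AC,j}^{-} \\
\vdots & \vdots & \ddots & \vdots \\
\overleftarrow{f}_{UU,2}^{-} & \overleftarrow{f}_{UU,3}^{-} & \cdots & \overleftarrow{f}_{UU,j}^{-}
\end{bmatrix}
$$

wherein $\overleftarrow{f}_{AA,j}^{-}$, $\overleftarrow{f}_{AC,j}^{-}$, . . . , and $\overleftarrow{f}_{UU,j}^{-}$ were the occurrence frequencies of dinucleotides AA, AC, . . . , and UU of all sequences of negative dataset D−, respectively, and the first and second nucleotides of these dinucleotides were at positions j and j−1, respectively;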


(3) A bidirectional trinucleotide position-specific propensity matrix of RNA sequences was constructed


The forward trinucleotide position-specific propensity matrix $\overrightarrow{M}_{t}^{+}$ for the positive dataset D+ was determined according to the following formula:

$$
\overrightarrow{M}_{t}^{+} =
\begin{bmatrix}
\overrightarrow{f}_{AAA,\beta+3}^{+} & \overrightarrow{f}_{AAA,\beta+4}^{+} & \cdots & \overrightarrow{f}_{AAA,k}^{+} \\
\overrightarrow{f}_{AAC,\beta+3}^{+} & \overrightarrow{f}_{AAC,\beta+4}^{+} & \cdots & \overrightarrow{f}_{AAC,k}^{+} \\
\vdots & \vdots & \ddots & \vdots \\
\overrightarrow{f}_{UUU,\beta+3}^{+} & \overrightarrow{f}_{UUU,\beta+4}^{+} & \cdots & \overrightarrow{f}_{UUU,k}^{+}
\end{bmatrix}
$$

wherein AAA, AAC, . . . , UUU were the 64 types of trinucleotides formed by the 4 types of nucleotides A, C, G, and U of RNA sequences; β represented the distance between the nucleotide at position k and its forward adjacent dinucleotide, 0≤β≤(l−5)/2, β was a finite non-negative integer, and 0≤β≤23 in this example; k represented the position of a trinucleotide, i.e. the position of the first nucleotide of the trinucleotide, β+3≤k≤l−β−2, k was a finite positive integer, and β+3≤k≤49−β in this example; $\overrightarrow{f}_{AAA,k}^{+}$, $\overrightarrow{f}_{AAC,k}^{+}$, . . . , and $\overrightarrow{f}_{UUU,k}^{+}$ were the occurrence frequencies of trinucleotides AAA, AAC, . . . , and UUU, whose nucleotides were at positions k, k+β+1, and k+β+2, of all RNA sequences of positive dataset D+, respectively;


The backward trinucleotide position-specific propensity matrix $\overleftarrow{M}_{t}^{+}$ for the positive dataset D+ was determined according to the following formula:

$$
\overleftarrow{M}_{t}^{+} =
\begin{bmatrix}
\overleftarrow{f}_{AAA,\beta+3}^{+} & \overleftarrow{f}_{AAA,\beta+4}^{+} & \cdots & \overleftarrow{f}_{AAA,k}^{+} \\
\overleftarrow{f}_{AAC,\beta+3}^{+} & \overleftarrow{f}_{AAC,\beta+4}^{+} & \cdots & \overleftarrow{f}_{AAC,k}^{+} \\
\vdots & \vdots & \ddots & \vdots \\
\overleftarrow{f}_{UUU,\beta+3}^{+} & \overleftarrow{f}_{UUU,\beta+4}^{+} & \cdots & \overleftarrow{f}_{UUU,k}^{+}
\end{bmatrix}
$$

wherein $\overleftarrow{f}_{AAA,k}^{+}$, $\overleftarrow{f}_{AAC,k}^{+}$, . . . , and $\overleftarrow{f}_{UUU,k}^{+}$ were the occurrence frequencies of trinucleotides AAA, AAC, . . . , and UUU, whose nucleotides were at positions k, k−β−1, and k−β−2, of all RNA sequences of positive dataset D+, respectively;


The forward trinucleotide position-specific propensity matrix $\overrightarrow{M}_{t}^{-}$ for the negative dataset D− was determined according to the following formula:

$$
\overrightarrow{M}_{t}^{-} =
\begin{bmatrix}
\overrightarrow{f}_{AAA,\beta+3}^{-} & \overrightarrow{f}_{AAA,\beta+4}^{-} & \cdots & \overrightarrow{f}_{AAA,k}^{-} \\
\overrightarrow{f}_{AAC,\beta+3}^{-} & \overrightarrow{f}_{AAC,\beta+4}^{-} & \cdots & \overrightarrow{f}_{AAC,k}^{-} \\
\vdots & \vdots & \ddots & \vdots \\
\overrightarrow{f}_{UUU,\beta+3}^{-} & \overrightarrow{f}_{UUU,\beta+4}^{-} & \cdots & \overrightarrow{f}_{UUU,k}^{-}
\end{bmatrix}
$$

wherein $\overrightarrow{f}_{AAA,k}^{-}$, $\overrightarrow{f}_{AAC,k}^{-}$, . . . , and $\overrightarrow{f}_{UUU,k}^{-}$ were the occurrence frequencies of trinucleotides AAA, AAC, . . . , and UUU, whose nucleotides were at positions k, k+β+1, and k+β+2, of all RNA sequences of negative dataset D−, respectively;


The backward trinucleotide position-specific propensity matrix $\overleftarrow{M}_{t}^{-}$ for the negative dataset D− was determined according to the following formula:

$$
\overleftarrow{M}_{t}^{-} =
\begin{bmatrix}
\overleftarrow{f}_{AAA,\beta+3}^{-} & \overleftarrow{f}_{AAA,\beta+4}^{-} & \cdots & \overleftarrow{f}_{AAA,k}^{-} \\
\overleftarrow{f}_{AAC,\beta+3}^{-} & \overleftarrow{f}_{AAC,\beta+4}^{-} & \cdots & \overleftarrow{f}_{AAC,k}^{-} \\
\vdots & \vdots & \ddots & \vdots \\
\overleftarrow{f}_{UUU,\beta+3}^{-} & \overleftarrow{f}_{UUU,\beta+4}^{-} & \cdots & \overleftarrow{f}_{UUU,k}^{-}
\end{bmatrix}
$$

wherein $\overleftarrow{f}_{AAA,k}^{-}$, $\overleftarrow{f}_{AAC,k}^{-}$, . . . , and $\overleftarrow{f}_{UUU,k}^{-}$ were the occurrence frequencies of trinucleotides AAA, AAC, . . . , and UUU, whose nucleotides were at positions k, k−β−1, and k−β−2, of all RNA sequences of negative dataset D−, respectively;


(4) A value of pointwise joint mutual information of the nucleotides of RNA sequences was determined


(4.1) The value $\overrightarrow{v}_{k}^{+}$ of forward pointwise joint mutual information of the nucleotides of RNA sequences to be encoded in the positive dataset D+ was determined according to the following formula:

$$
\overrightarrow{v}_{k}^{+} = \log \frac{\overrightarrow{f}_{x\overrightarrow{yz},k}^{+}}{f_{x,k}^{+}\,\overrightarrow{f}_{\overrightarrow{yz},k+\beta+1}^{+}}
$$

wherein x was the nucleotide at position k, x∈{A, C, G, U}; $\overrightarrow{y}$ was the nucleotide at position k+β+1, $\overrightarrow{y}$∈{A, C, G, U}; $\overrightarrow{z}$ was the nucleotide at position k+β+2, $\overrightarrow{z}$∈{A, C, G, U}; $\overrightarrow{f}_{x\overrightarrow{yz},k}^{+}$ was the occurrence frequency of trinucleotide $x\overrightarrow{yz}$ of all sequences of positive dataset D+; $\overrightarrow{f}_{\overrightarrow{yz},k+\beta+1}^{+}$ was the occurrence frequency of dinucleotide $\overrightarrow{yz}$ of all RNA sequences of positive dataset D+; and $f_{x,k}^{+}$ was the occurrence frequency of nucleotide x of all sequences of positive dataset D+.


The value $\overleftarrow{v}_{k}^{+}$ of backward pointwise joint mutual information of nucleotides of RNA sequences to be encoded in the positive dataset D+ was determined according to the following formula:

$$
\overleftarrow{v}_{k}^{+} = \log \frac{\overleftarrow{f}_{x\overleftarrow{yz},k}^{+}}{f_{x,k}^{+}\,\overleftarrow{f}_{\overleftarrow{yz},k-\beta-1}^{+}}
$$

wherein $\overleftarrow{y}$ was the nucleotide at position k−β−1, $\overleftarrow{y}$∈{A, C, G, U}; $\overleftarrow{z}$ was the nucleotide at position k−β−2, $\overleftarrow{z}$∈{A, C, G, U}; $\overleftarrow{f}_{x\overleftarrow{yz},k}^{+}$ was the occurrence frequency of trinucleotide $x\overleftarrow{yz}$ of all RNA sequences of positive dataset D+; and $\overleftarrow{f}_{\overleftarrow{yz},k-\beta-1}^{+}$ was the occurrence frequency of dinucleotide $\overleftarrow{yz}$ of all RNA sequences of positive dataset D+.


The encoding value $v_{k}^{+}$ of pointwise joint mutual information of the nucleotide at position k of an RNA sequence to be encoded in the positive dataset D+ was defined as the average of the value $\overrightarrow{v}_{k}^{+}$ of forward pointwise joint mutual information and the value $\overleftarrow{v}_{k}^{+}$ of backward pointwise joint mutual information. An RNA sequence with a length of l was encoded into a pointwise mutual information feature vector $V^{+}$ with a length of l−2β−4:

$$
V^{+} = \left[ v_{\beta+3}^{+},\ v_{\beta+4}^{+},\ \ldots,\ v_{k}^{+} \right]
$$

$$
v_{k}^{+} = \frac{\overrightarrow{v}_{k}^{+} + \overleftarrow{v}_{k}^{+}}{2}
$$

The value of l was 51 in this example.


(4.2) The value $\overrightarrow{v}_{k}^{-}$ of forward pointwise joint mutual information of nucleotides of RNA sequences to be encoded in the negative dataset D− was determined according to the following formula:

$$
\overrightarrow{v}_{k}^{-} = \log \frac{\overrightarrow{f}_{x\overrightarrow{yz},k}^{-}}{f_{x,k}^{-}\,\overrightarrow{f}_{\overrightarrow{yz},k+\beta+1}^{-}}
$$

wherein x was the nucleotide at position k, x∈{A, C, G, U}; $\overrightarrow{y}$ was the nucleotide at position k+β+1, $\overrightarrow{y}$∈{A, C, G, U}; $\overrightarrow{z}$ was the nucleotide at position k+β+2, $\overrightarrow{z}$∈{A, C, G, U}; $\overrightarrow{f}_{x\overrightarrow{yz},k}^{-}$ was the occurrence frequency of trinucleotide $x\overrightarrow{yz}$ of all sequences of negative dataset D−; $\overrightarrow{f}_{\overrightarrow{yz},k+\beta+1}^{-}$ was the occurrence frequency of dinucleotide $\overrightarrow{yz}$ of all sequences of negative dataset D−; and $f_{x,k}^{-}$ was the occurrence frequency of nucleotide x of all sequences of negative dataset D−.


The value $\overleftarrow{v}_{k}^{-}$ of backward pointwise joint mutual information of nucleotides of RNA sequences to be encoded in the negative dataset D− was determined according to the following formula:

$$
\overleftarrow{v}_{k}^{-} = \log \frac{\overleftarrow{f}_{x\overleftarrow{yz},k}^{-}}{f_{x,k}^{-}\,\overleftarrow{f}_{\overleftarrow{yz},k-\beta-1}^{-}}
$$

wherein nucleotide x was at position k, nucleotide $\overleftarrow{y}$ was at position k−β−1, and nucleotide $\overleftarrow{z}$ was at position k−β−2; $\overleftarrow{f}_{x\overleftarrow{yz},k}^{-}$ was the occurrence frequency of trinucleotide $x\overleftarrow{yz}$ of all RNA sequences of negative dataset D−; and $\overleftarrow{f}_{\overleftarrow{yz},k-\beta-1}^{-}$ was the occurrence frequency of dinucleotide $\overleftarrow{yz}$ of all sequences of negative dataset D−.


The encoding value $v_{k}^{-}$ of pointwise joint mutual information of the nucleotide at position k of an RNA sequence to be encoded in the negative dataset D− was defined as the average of the value $\overrightarrow{v}_{k}^{-}$ of forward pointwise joint mutual information and the value $\overleftarrow{v}_{k}^{-}$ of backward pointwise joint mutual information, and an RNA sequence with a length of l was encoded into a pointwise mutual information feature vector $V^{-}$ with a length of l−2β−4:

$$
V^{-} = \left[ v_{\beta+3}^{-},\ v_{\beta+4}^{-},\ \ldots,\ v_{k}^{-} \right]
$$

$$
v_{k}^{-} = \frac{\overrightarrow{v}_{k}^{-} + \overleftarrow{v}_{k}^{-}}{2}
$$

The value of l was 51 in this example.


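(4.3) The feature vector V of an RNA sequence to be encoded with a given length l was determined by subtracting each element of vector $V^{-}$ from the corresponding element of vector $V^{+}$:

$$
V = \left[ V_{\beta+3},\ V_{\beta+4},\ \ldots,\ V_{k} \right]
$$

$$
V_{k} = v_{k}^{+} - v_{k}^{-};
$$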


(5) Concatenating features


When the value of parameter β was 0, the feature vector V(0) was $[V_3, V_4, V_5, \ldots, V_{l-3}, V_{l-2}]$, and the number of elements was l−4; when the value of β was 1, the feature vector V(1) was $[V_4, V_5, V_6, \ldots, V_{l-4}, V_{l-3}]$, and the number of elements was l−6; . . . ; when the value of β was (l−7)/2, the feature vector V((l−7)/2) was $[V_{(l-1)/2}, V_{(l+1)/2}, V_{(l+3)/2}]$, and the number of elements was 3; when the value of β was (l−5)/2, the feature vector V((l−5)/2) was $[V_{(l+1)/2}]$, and the number of elements was 1. The feature vectors determined by the different values of the parameter β were concatenated into a high-dimensional feature vector [V(0), V(1), . . . , V((l−7)/2), V((l−5)/2)] with $(l-3)^2/4$ elements. The value of l was 51 in this example, giving 576 elements.
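The element count follows from summing the lengths of the per-β vectors. As a short worked check (added here for illustration, using only the lengths l−2β−4 stated in the text):

$$
\sum_{\beta=0}^{(l-5)/2} (l-2\beta-4)
= \frac{l-3}{2}\,(l-4) - \frac{l-5}{2}\cdot\frac{l-3}{2}
= \frac{(l-3)^{2}}{4}
$$

For l=51 the sum is 47+45+ . . . +3+1, i.e. the sum of the first 24 odd numbers, which equals $24^2 = 576$.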


(6) Encoding the RNA sequences


The RNA sequence dataset D was encoded into a numerical dataset D′ by applying steps (1) to (5) above,

$$D' \in \mathbb{R}^{\,s \times \frac{(l-3)^2}{4}},$$

where s was the number of samples of the numerical dataset D′ and was a finite positive integer (2614 in this example), and $(l-3)^2/4$ was the number of features of the numerical dataset D′. The encoding of the RNA sequences was thus completed.


The RNA sequence encoding method of Example 2 was compared with seven other encoding methods: PSNP (position-specific nucleotide propensities), PSDP (position-specific dinucleotide propensities), KNF (K-nucleotide frequencies), KSNPF (K-spaced nucleotide pair frequencies), NPPS (nucleotide pair position specificity), PBE (positional binary encoding) and NCPNC (nucleotide chemical property and nucleotide composition). Each method was used to identify the RNA N6-methyladenosine sites in Saccharomyces cerevisiae RNA sequences, and the methods were compared by the performance of the support vector machine models constructed with each encoding. Each method was evaluated by the average classification accuracy, sensitivity, specificity, MCC (Matthews correlation coefficient), AUROC (area under the receiver operating characteristic curve) and AUPRC (area under the precision-recall curve) obtained with 10-fold cross-validation. The experimental method was as follows:


1. The RNA sequences of N6-methyladenosine of Saccharomyces cerevisiae were encoded according to the method of Example 2;


2. Normalizing the dataset


The numerical dataset D′ was normalized by the maximum-minimum method according to the following formula:

$$g'_{m,n} = \frac{g_{m,n} - \min(g_n)}{\max(g_n) - \min(g_n)}$$

wherein $g_{m,n}$ was the n-th feature value of the m-th sample of the numerical dataset D′, $g'_{m,n}$ was the normalized value of $g_{m,n}$, $\max(g_n)$ and $\min(g_n)$ were the maximum and minimum feature values of the n-th column of the numerical dataset D′, $1 \le m \le s$, $1 \le n \le (l-3)^2/4$, and m and n were finite positive integers. In this example, the value of l was 51 and the value of s was 2614.
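The maximum-minimum normalization can be illustrated with a short, dependency-free Python sketch. The function name `min_max_normalize` is hypothetical, and the handling of a constant column (where max equals min and the formula would divide by zero) is an assumption made here, since the text does not specify it.

```python
from typing import List, Sequence


def min_max_normalize(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale each column (feature) of a samples-by-features table to [0, 1]."""
    if not rows:
        return []
    columns = list(zip(*rows))            # transpose: one tuple per feature
    mins = [min(col) for col in columns]  # min(g_n) for every column n
    maxs = [max(col) for col in columns]  # max(g_n) for every column n
    normalized: List[List[float]] = []
    for row in rows:
        new_row = []
        for value, lo, hi in zip(row, mins, maxs):
            span = hi - lo
            # Assumption: a constant column is mapped to 0.0 to avoid 0/0.
            new_row.append((value - lo) / span if span != 0 else 0.0)
        normalized.append(new_row)
    return normalized


# Toy table with three samples and two features (the second is constant).
scaled = min_max_normalize([[1.0, 5.0], [3.0, 5.0], [2.0, 5.0]])
```

In the toy table the first column is mapped to 0.0, 1.0 and 0.5, and the constant second column to 0.0 throughout.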


3. Partitioning dataset


The normalized numerical dataset D′ was partitioned into 10 folds by the K-fold cross-validation method (K=10). In each run, one fold was taken as the test dataset $D'_{Te}$ and the remaining nine folds were taken as the training dataset $D'_{Tr}$, until each fold had served as the test dataset once, giving 10 runs in total. The ratio of the training dataset $D'_{Tr}$ to the test dataset $D'_{Te}$ in each run was 9:1.
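A 10-fold partition of this kind might be generated as in the following sketch. The function name `k_fold_indices` is hypothetical, and both the shuffling and the fixed seed are assumptions added for illustration; the text does not state whether the samples were shuffled or stratified by class.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of the k runs."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # assumption: shuffled split
    folds = [indices[i::k] for i in range(k)]     # k folds of near-equal size
    for i, test_fold in enumerate(folds):
        # The training set is every fold except the current test fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold


# s = 2614 samples, as in this example.
runs = list(k_fold_indices(2614, k=10))
```

Because 2614 is not divisible by 10, the folds hold 261 or 262 samples each, so the 9:1 ratio is approximate in this sketch.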


4. Training and testing the model


The support vector machine model was trained on the training dataset $D'_{Tr}$, and its performance was tested on the test dataset $D'_{Te}$.


For the seven compared RNA sequence encoding methods, the RNA N6-methyladenosine sites in the Saccharomyces cerevisiae RNA sequences were identified by applying steps 2-4 of the experimental method in the same way. The classification accuracy, sensitivity, specificity and MCC are shown in Table 2, the AUROC results are shown in FIG. 4, and the AUPRC results are shown in FIG. 5.









TABLE 2

Comparison of experimental results between the method of Example 2 and the other seven methods

| Encoding method | Accuracy | Sensitivity | Specificity | MCC |
|---|---|---|---|---|
| The present invention | 0.995 | 0.996 | 0.994 | 0.990 |
| PSNP | 0.747 | 0.751 | 0.743 | 0.495 |
| PSDP | 0.766 | 0.764 | 0.769 | 0.534 |
| KNF | 0.692 | 0.741 | 0.643 | 0.387 |
| KSNPF | 0.651 | 0.712 | 0.591 | 0.307 |
| NPPS | 0.874 | 0.884 | 0.864 | 0.749 |
| PBE | 0.727 | 0.727 | 0.728 | 0.456 |
| NCPNC | 0.731 | 0.735 | 0.726 | 0.463 |









As shown in Table 2, the accuracy, sensitivity, specificity and MCC for identifying the RNA N6-methyladenosine sites in Saccharomyces cerevisiae RNA sequences through the support vector machine model constructed based on the RNA sequence encoding method of the present disclosure were 0.995, 0.996, 0.994 and 0.990, respectively, which were much higher than those of the other seven compared encoding methods.
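For reference, the four criteria reported in Table 2 can be computed from a confusion matrix as in the sketch below. The function name `classification_metrics` and the counts in the usage line are hypothetical and are not taken from the experiments; returning an MCC of 0.0 when its denominator is zero is a common convention assumed here, not something stated in the text.

```python
import math
from typing import Dict


def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Accuracy, sensitivity, specificity and MCC from confusion-matrix counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must contain at least one sample")
    total = tp + fn + tn + fp
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        # Assumption: MCC is reported as 0.0 when the denominator vanishes.
        "mcc": (tp * tn - fp * fn) / denominator if denominator else 0.0,
    }


# Hypothetical counts for a single test fold.
metrics = classification_metrics(tp=90, fn=10, tn=80, fp=20)
```

In 10-fold cross-validation these values would be computed once per run and then averaged over the 10 runs.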


As shown in FIG. 4, the value of AUROC for identifying the RNA N6-methyladenosine sites in Saccharomyces cerevisiae RNA sequences through the support vector machine model constructed based on the RNA sequence encoding method of the present disclosure was the maximum value of 1, which was much higher than that of the other seven compared encoding methods.


As shown in FIG. 5, the value of AUPRC for identifying the RNA N6-methyladenosine sites in Saccharomyces cerevisiae RNA sequences through the support vector machine model constructed based on the RNA sequence encoding method of the present disclosure was the maximum value of 1, which was much higher than that of the other seven compared encoding methods.

Claims
  • 1. A method for encoding DNA/RNA sequences based on bidirectional trinucleotide position-specific propensities and pointwise joint mutual information, comprising the following steps: (1) constructing a nucleotide position-specific propensity matrix of DNA/RNA sequences: giving a dataset D of DNA/RNA sequences, the dataset consists of a positive dataset D+ and a negative dataset D−; determining a nucleotide position-specific propensity matrix MS+ for the positive dataset according to the following formula:
Priority Claims (1)
Number Date Country Kind
202011236108.2 Nov 2020 CN national