METHOD FOR ENCODING DNA/RNA SEQUENCES BASED ON BIDIRECTIONAL TRINUCLEOTIDE POSITION-SPECIFIC PROPENSITIES AND POINTWISE JOINT MUTUAL INFORMATION

This patent application claims the benefit and priority of Chinese Patent Application No. 202011236108.2 entitled “Method for encoding DNA/RNA sequences based on bidirectional trinucleotide position-specific propensities and pointwise joint mutual information” filed on Nov. 9, 2020, the disclosure of which is incorporated by reference herein in its entirety as part of the present application.

TECHNICAL FIELD

The present disclosure belongs to the technical field of sequence data analysis and particularly relates to a method for encoding DNA/RNA sequences.

BACKGROUND ART

A DNA/RNA sequence encoding method is a data processing method that converts DNA/RNA sequences into numerical data. Such methods play an important role in identifying and predicting biological epigenetic sites, such as DNA methylation sites and RNA methylation sites, by means of machine learning. Whether an encoding method can extract numerical features carrying strong categorical information from DNA/RNA sequences determines the performance of the classification model subsequently constructed from those features.

Existing DNA/RNA sequence encoding methods cannot extract the key feature information needed to identify epigenetic sites effectively, so classification models built on them perform poorly. Combining the numerical features obtained by multiple encoding methods into a high-dimensional feature vector rich in identification information can compensate for the shortcomings of a model built on a single encoding method, but the combined features are highly redundant, computing resources are wasted, and the improvement in model performance is limited. How to encode DNA/RNA sequences into numerical features that carry the key information for identifying epigenetic sites while keeping the redundancy between features low is therefore the key issue in identifying and predicting biological epigenetic sites, and it is a current research hotspot in the art.

SUMMARY OF THE INVENTION

The technical problem to be solved by the present disclosure is to overcome the aforementioned defects of the prior art and to provide a method for encoding DNA/RNA sequences based on bidirectional trinucleotide position-specific propensities and pointwise joint mutual information. The method extracts features with strong categorical information and low redundancy between features, so that the subsequently constructed model achieves high accuracy.

The technical scheme used for solving the technical problems comprises the following steps:

(1) constructing a nucleotide position-specific propensity matrix of DNA/RNA sequences;

giving a dataset D of DNA/RNA sequences, wherein the dataset consists of a positive dataset D⁺ and a negative dataset D⁻, that is, D=D⁺∪D⁻;

determining a nucleotide position-specific propensity matrix M_S⁺ for the positive dataset D⁺ according to the following formula:

$M_{s}^{+} = \begin{bmatrix} f_{A,1}^{+} & f_{A,2}^{+} & \cdots & f_{A,i}^{+} \\ f_{C,1}^{+} & f_{C,2}^{+} & \cdots & f_{C,i}^{+} \\ f_{G,1}^{+} & f_{G,2}^{+} & \cdots & f_{G,i}^{+} \\ f_{X,1}^{+} & f_{X,2}^{+} & \cdots & f_{X,i}^{+} \end{bmatrix}$

wherein A, C, G and X are the 4 types of nucleotides of DNA/RNA, X represents nucleotide T in DNA and U in RNA, i represents the position of a nucleotide, 1≤i≤l, i is a finite positive integer, and l is the length of a DNA/RNA sequence and is an odd number; f_A,i⁺, f_C,i⁺, f_G,i⁺ and f_X,i⁺ are the occurrence frequencies of nucleotides A, C, G and X, respectively, at position i in the positive dataset D⁺.

Determining a nucleotide position-specific propensity matrix M_S⁻ of the negative dataset D⁻ according to the following formula:

$M_{s}^{-} = \begin{bmatrix} f_{A,1}^{-} & f_{A,2}^{-} & \cdots & f_{A,i}^{-} \\ f_{C,1}^{-} & f_{C,2}^{-} & \cdots & f_{C,i}^{-} \\ f_{G,1}^{-} & f_{G,2}^{-} & \cdots & f_{G,i}^{-} \\ f_{X,1}^{-} & f_{X,2}^{-} & \cdots & f_{X,i}^{-} \end{bmatrix}$

wherein f_A,i⁻, f_C,i⁻, f_G,i⁻ and f_X,i⁻ are occurrence frequencies of nucleotides A, C, G and X at position i in negative dataset D⁻, respectively.
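As a minimal sketch of step (1), assuming the sequences are equal-length Python strings and that an occurrence frequency is the fraction of sequences carrying a given nucleotide at a given position, the matrix could be built as follows. The function name `nucleotide_propensity` and the toy sequences are illustrative assumptions and are not part of the disclosure.

```python
from collections import Counter


def nucleotide_propensity(seqs, alphabet="ACGT"):
    """Return {nucleotide: [frequency at position 1, ..., position l]}.

    Assumption: frequency = (number of sequences with that nucleotide at
    the position) / (number of sequences). Use alphabet="ACGU" for RNA.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    matrix = {n: [0.0] * length for n in alphabet}
    for i in range(length):  # i is 0-based here; the text uses 1-based i
        counts = Counter(s[i] for s in seqs)
        for n in alphabet:
            matrix[n][i] = counts[n] / len(seqs)
    return matrix


# Hypothetical toy positive dataset (three sequences, l = 5).
positive = ["ACGTA", "ACGTT", "CCGTA"]
m_pos = nucleotide_propensity(positive)
print(m_pos["A"])  # row of the positive matrix for nucleotide A
```

Applying the same function to the negative dataset would give the corresponding negative matrix; each column sums to 1 when every sequence contains only letters of the chosen alphabet.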

(2) Constructing a bidirectional dinucleotide position-specific propensity matrix of DNA/RNA sequences;

determining a forward dinucleotide position-specific propensity matrix $\overrightarrow{M}_{d}^{+}$ for the positive dataset D⁺ according to the following formula:

$\overrightarrow{M}_{d}^{+} = \begin{bmatrix} \overrightarrow{f}_{AA,1}^{+} & \overrightarrow{f}_{AA,2}^{+} & \cdots & \overrightarrow{f}_{AA,j}^{+} \\ \overrightarrow{f}_{AC,1}^{+} & \overrightarrow{f}_{AC,2}^{+} & \cdots & \overrightarrow{f}_{AC,j}^{+} \\ \vdots & \vdots & \ddots & \vdots \\ \overrightarrow{f}_{XX,1}^{+} & \overrightarrow{f}_{XX,2}^{+} & \cdots & \overrightarrow{f}_{XX,j}^{+} \end{bmatrix}$

wherein AA, AC, . . . , and XX are the 16 types of dinucleotides formed by the 4 types of nucleotides A, C, G, and X of DNA/RNA, j represents the position of a dinucleotide, 2≤j≤l−1, j is a finite positive integer, l is the length of a DNA/RNA sequence, and $\overrightarrow{f}_{AA,j}^{+}, \overrightarrow{f}_{AC,j}^{+}, \dots, \overrightarrow{f}_{XX,j}^{+}$ are the occurrence frequencies of dinucleotides AA, AC, . . . , and XX, respectively, in the positive dataset D⁺, wherein the first nucleotide of a dinucleotide is at position j and the second nucleotide is at position j+1.

Determining a backward dinucleotide position-specific propensity matrix $\overleftarrow{M}_{d}^{+}$ for the positive dataset D⁺ according to the following formula:

$\overleftarrow{M}_{d}^{+} = \begin{bmatrix} \overleftarrow{f}_{AA,2}^{+} & \overleftarrow{f}_{AA,3}^{+} & \cdots & \overleftarrow{f}_{AA,j}^{+} \\ \overleftarrow{f}_{AC,2}^{+} & \overleftarrow{f}_{AC,3}^{+} & \cdots & \overleftarrow{f}_{AC,j}^{+} \\ \vdots & \vdots & \ddots & \vdots \\ \overleftarrow{f}_{XX,2}^{+} & \overleftarrow{f}_{XX,3}^{+} & \cdots & \overleftarrow{f}_{XX,j}^{+} \end{bmatrix}$

wherein $\overleftarrow{f}_{AA,j}^{+}, \overleftarrow{f}_{AC,j}^{+}, \dots, \overleftarrow{f}_{XX,j}^{+}$ are the occurrence frequencies of dinucleotides AA, AC, . . . , and XX, respectively, in the positive dataset D⁺, wherein the first nucleotide of a dinucleotide is at position j and the second nucleotide is at position j−1.

Determining a forward dinucleotide position-specific propensity matrix $\overrightarrow{M}_{d}^{-}$ for the negative dataset D⁻ according to the following formula:

$\overrightarrow{M}_{d}^{-} = \begin{bmatrix} \overrightarrow{f}_{AA,2}^{-} & \overrightarrow{f}_{AA,3}^{-} & \cdots & \overrightarrow{f}_{AA,j}^{-} \\ \overrightarrow{f}_{AC,2}^{-} & \overrightarrow{f}_{AC,3}^{-} & \cdots & \overrightarrow{f}_{AC,j}^{-} \\ \vdots & \vdots & \ddots & \vdots \\ \overrightarrow{f}_{XX,2}^{-} & \overrightarrow{f}_{XX,3}^{-} & \cdots & \overrightarrow{f}_{XX,j}^{-} \end{bmatrix}$

wherein $\overrightarrow{f}_{AA,j}^{-}, \overrightarrow{f}_{AC,j}^{-}, \dots, \overrightarrow{f}_{XX,j}^{-}$ are the occurrence frequencies of dinucleotides AA, AC, . . . , and XX, respectively, in the negative dataset D⁻, wherein the first nucleotide of a dinucleotide is at position j and the second nucleotide is at position j+1.

Determining a backward dinucleotide position-specific propensity matrix $\overleftarrow{M}_{d}^{-}$ for the negative dataset D⁻ according to the following formula:

$\overleftarrow{M}_{d}^{-} = \begin{bmatrix} \overleftarrow{f}_{AA,2}^{-} & \overleftarrow{f}_{AA,3}^{-} & \cdots & \overleftarrow{f}_{AA,j}^{-} \\ \overleftarrow{f}_{AC,2}^{-} & \overleftarrow{f}_{AC,3}^{-} & \cdots & \overleftarrow{f}_{AC,j}^{-} \\ \vdots & \vdots & \ddots & \vdots \\ \overleftarrow{f}_{XX,2}^{-} & \overleftarrow{f}_{XX,3}^{-} & \cdots & \overleftarrow{f}_{XX,j}^{-} \end{bmatrix}$

wherein $\overleftarrow{f}_{AA,j}^{-}, \overleftarrow{f}_{AC,j}^{-}, \dots, \overleftarrow{f}_{XX,j}^{-}$ are the occurrence frequencies of dinucleotides AA, AC, . . . , and XX, respectively, in the negative dataset D⁻, wherein the first nucleotide of a dinucleotide is at position j and the second nucleotide is at position j−1.
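The forward and backward dinucleotide frequencies of step (2) can be sketched in the same style. This is only an illustration under stated assumptions: the function name `dinucleotide_propensity` and the toy data are hypothetical, dinucleotides that never occur at a position are simply omitted (frequency 0), and every position j whose partner position exists is tabulated, which is slightly wider than the range 2≤j≤l−1 stated in the text.

```python
from collections import Counter


def dinucleotide_propensity(seqs, direction):
    """Return {j: {dinucleotide: frequency}} with 1-based position j.

    forward : first nucleotide at j, second at j + 1
    backward: first nucleotide at j, second at j - 1
    """
    if direction not in ("forward", "backward"):
        raise ValueError("direction must be 'forward' or 'backward'")
    step = 1 if direction == "forward" else -1
    length = len(seqs[0])
    table = {}
    for j in range(1, length + 1):
        partner = j + step
        if not 1 <= partner <= length:
            continue  # the partner position would fall outside the sequence
        counts = Counter(s[j - 1] + s[partner - 1] for s in seqs)
        table[j] = {pair: c / len(seqs) for pair, c in counts.items()}
    return table


positive = ["ACGTA", "ACGTT", "CCGTA"]  # hypothetical toy data
forward = dinucleotide_propensity(positive, "forward")
backward = dinucleotide_propensity(positive, "backward")
print(forward[2])   # dinucleotides read from position 2 to position 3
print(backward[2])  # dinucleotides read from position 2 to position 1
```

Note that the backward dinucleotide is not merely the forward one shifted by a position: its letters are read in the opposite order, so the two tables hold different keys at the same j.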

(3) Constructing a bidirectional trinucleotide position-specific propensity matrix of DNA/RNA sequences

determining a forward trinucleotide position-specific propensity matrix $\overrightarrow{M}_{t}^{+}$ for the positive dataset D⁺ according to the following formula:

$\overrightarrow{M}_{t}^{+} = \begin{bmatrix} \overrightarrow{f}_{AAA,\beta+3}^{+} & \overrightarrow{f}_{AAA,\beta+4}^{+} & \cdots & \overrightarrow{f}_{AAA,k}^{+} \\ \overrightarrow{f}_{AAC,\beta+3}^{+} & \overrightarrow{f}_{AAC,\beta+4}^{+} & \cdots & \overrightarrow{f}_{AAC,k}^{+} \\ \vdots & \vdots & \ddots & \vdots \\ \overrightarrow{f}_{XXX,\beta+3}^{+} & \overrightarrow{f}_{XXX,\beta+4}^{+} & \cdots & \overrightarrow{f}_{XXX,k}^{+} \end{bmatrix}$

wherein AAA, AAC, . . . , and XXX are the 64 types of trinucleotides formed by the 4 types of nucleotides A, C, G, and X of DNA/RNA, β represents the distance between the nucleotide at position k and its forward adjacent dinucleotide, 0≤β≤(l−5)/2, β is a non-negative integer, l is the length of a DNA/RNA sequence, k is a finite positive integer representing the position of the first nucleotide of the forward trinucleotide, β+3≤k≤l−β−2, the second nucleotide is at position k+β+1 and the third nucleotide is at position k+β+2, and $\overrightarrow{f}_{AAA,k}^{+}, \overrightarrow{f}_{AAC,k}^{+}, \dots, \overrightarrow{f}_{XXX,k}^{+}$ are the occurrence frequencies of trinucleotides AAA, AAC, . . . , and XXX, respectively, in the positive dataset D⁺.

Determining a backward trinucleotide position-specific propensity matrix $\overleftarrow{M}_{t}^{+}$ for the positive dataset D⁺ according to the following formula:

$\overleftarrow{M}_{t}^{+} = \begin{bmatrix} \overleftarrow{f}_{AAA,\beta+3}^{+} & \overleftarrow{f}_{AAA,\beta+4}^{+} & \cdots & \overleftarrow{f}_{AAA,k}^{+} \\ \overleftarrow{f}_{AAC,\beta+3}^{+} & \overleftarrow{f}_{AAC,\beta+4}^{+} & \cdots & \overleftarrow{f}_{AAC,k}^{+} \\ \vdots & \vdots & \ddots & \vdots \\ \overleftarrow{f}_{XXX,\beta+3}^{+} & \overleftarrow{f}_{XXX,\beta+4}^{+} & \cdots & \overleftarrow{f}_{XXX,k}^{+} \end{bmatrix}$

wherein $\overleftarrow{f}_{AAA,k}^{+}, \overleftarrow{f}_{AAC,k}^{+}, \dots, \overleftarrow{f}_{XXX,k}^{+}$ are the occurrence frequencies of trinucleotides AAA, AAC, . . . , and XXX, respectively, in the positive dataset D⁺, wherein the first, second, and third nucleotides of the backward trinucleotide are at positions k, k−β−1, and k−β−2, respectively, of the sequences.

Determining a forward trinucleotide position-specific propensity matrix $\overrightarrow{M}_{t}^{-}$ for the negative dataset D⁻ according to the following formula:

$\overrightarrow{M}_{t}^{-} = \begin{bmatrix} \overrightarrow{f}_{AAA,\beta+3}^{-} & \overrightarrow{f}_{AAA,\beta+4}^{-} & \cdots & \overrightarrow{f}_{AAA,k}^{-} \\ \overrightarrow{f}_{AAC,\beta+3}^{-} & \overrightarrow{f}_{AAC,\beta+4}^{-} & \cdots & \overrightarrow{f}_{AAC,k}^{-} \\ \vdots & \vdots & \ddots & \vdots \\ \overrightarrow{f}_{XXX,\beta+3}^{-} & \overrightarrow{f}_{XXX,\beta+4}^{-} & \cdots & \overrightarrow{f}_{XXX,k}^{-} \end{bmatrix}$

wherein $\overrightarrow{f}_{AAA,k}^{-}, \overrightarrow{f}_{AAC,k}^{-}, \dots, \overrightarrow{f}_{XXX,k}^{-}$ are the occurrence frequencies of trinucleotides AAA, AAC, . . . , and XXX, respectively, in the negative dataset D⁻, wherein the first, second, and third nucleotides of the forward trinucleotide are at positions k, k+β+1, and k+β+2, respectively, of the sequences.

Determining a backward trinucleotide position-specific propensity matrix $\overleftarrow{M}_{t}^{-}$ for the negative dataset D⁻ according to the following formula:

$\overleftarrow{M}_{t}^{-} = \begin{bmatrix} \overleftarrow{f}_{AAA,\beta+3}^{-} & \overleftarrow{f}_{AAA,\beta+4}^{-} & \cdots & \overleftarrow{f}_{AAA,k}^{-} \\ \overleftarrow{f}_{AAC,\beta+3}^{-} & \overleftarrow{f}_{AAC,\beta+4}^{-} & \cdots & \overleftarrow{f}_{AAC,k}^{-} \\ \vdots & \vdots & \ddots & \vdots \\ \overleftarrow{f}_{XXX,\beta+3}^{-} & \overleftarrow{f}_{XXX,\beta+4}^{-} & \cdots & \overleftarrow{f}_{XXX,k}^{-} \end{bmatrix}$

wherein $\overleftarrow{f}_{AAA,k}^{-}, \overleftarrow{f}_{AAC,k}^{-}, \dots, \overleftarrow{f}_{XXX,k}^{-}$ are the occurrence frequencies of trinucleotides AAA, AAC, . . . , and XXX, respectively, in the negative dataset D⁻, wherein the first, second, and third nucleotides of the backward trinucleotide are at positions k, k−β−1, and k−β−2, respectively, of all sequences.
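The gapped trinucleotide frequencies of step (3) might be tabulated as in the following sketch, which follows the position rules stated above (first nucleotide at k, the other two at k±(β+1) and k±(β+2), with β+3≤k≤l−β−2). The function name `trinucleotide_propensity` and the toy data are hypothetical, and trinucleotides that never occur at a position are omitted rather than stored as 0.

```python
from collections import Counter


def trinucleotide_propensity(seqs, beta, direction):
    """Return {k: {trinucleotide: frequency}} for beta+3 <= k <= l-beta-2.

    forward : nucleotides at k, k + beta + 1, k + beta + 2
    backward: nucleotides at k, k - beta - 1, k - beta - 2
    """
    if direction not in ("forward", "backward"):
        raise ValueError("direction must be 'forward' or 'backward'")
    sign = 1 if direction == "forward" else -1
    length = len(seqs[0])
    table = {}
    # range() excludes its end, so the last k visited is l - beta - 2.
    for k in range(beta + 3, length - beta - 1):
        second = k + sign * (beta + 1)
        third = k + sign * (beta + 2)
        counts = Counter(s[k - 1] + s[second - 1] + s[third - 1] for s in seqs)
        table[k] = {tri: c / len(seqs) for tri, c in counts.items()}
    return table


positive = ["ACGTACG", "ACGTTCG", "CCGTACA"]  # hypothetical toy data, l = 7
fwd = trinucleotide_propensity(positive, beta=1, direction="forward")
bwd = trinucleotide_propensity(positive, beta=1, direction="backward")
print(fwd)  # only k = 4 is admissible when l = 7 and beta = 1
print(bwd)
```

With l = 7 and β = 0 the same call would return three columns (k = 3, 4, 5), matching the l−2β−4 positions available for each β.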

(4) Determining a value of pointwise joint mutual information of the nucleotides of DNA/RNA sequences

(4.1) Determining a value $\overrightarrow{v}_{k}^{+}$ of forward pointwise joint mutual information of nucleotides of DNA/RNA sequences to be encoded in the positive dataset D⁺ according to the following formula:

$\overrightarrow{v}_{k}^{+} = \log \frac{\overrightarrow{f}_{x\overrightarrow{y}\overrightarrow{z},k}^{+}}{f_{x,k}^{+}\,\overrightarrow{f}_{\overrightarrow{y}\overrightarrow{z},k+\beta+1}^{+}}$

wherein x is the nucleotide at position k, x∈{A, C, G, X}, $\overrightarrow{y}$ is the nucleotide at position k+β+1, $\overrightarrow{y} \in \{A, C, G, X\}$, $\overrightarrow{z}$ is the nucleotide at position k+β+2, $\overrightarrow{z} \in \{A, C, G, X\}$, $\overrightarrow{f}_{x\overrightarrow{y}\overrightarrow{z},k}^{+}$ is the occurrence frequency of trinucleotide $x\overrightarrow{y}\overrightarrow{z}$ in the positive dataset D⁺, $\overrightarrow{f}_{\overrightarrow{y}\overrightarrow{z},k+\beta+1}^{+}$ is the occurrence frequency of dinucleotide $\overrightarrow{y}\overrightarrow{z}$ over all sequence samples of the positive dataset D⁺, and f_x,k⁺ is the occurrence frequency of nucleotide x at position k over all sequence samples of the positive dataset D⁺.

Determining a value $\overleftarrow{v}_{k}^{+}$ of backward pointwise joint mutual information of nucleotides of DNA/RNA sequences to be encoded in the positive dataset D⁺ according to the following formula:

$\overleftarrow{v}_{k}^{+} = \log \frac{\overleftarrow{f}_{x\overleftarrow{y}\overleftarrow{z},k}^{+}}{f_{x,k}^{+}\,\overleftarrow{f}_{\overleftarrow{y}\overleftarrow{z},k-\beta-1}^{+}}$

wherein x is the nucleotide at position k, x∈{A, C, G, X}, $\overleftarrow{y}$ is the nucleotide at position k−β−1, $\overleftarrow{y} \in \{A, C, G, X\}$, $\overleftarrow{z}$ is the nucleotide at position k−β−2, $\overleftarrow{z} \in \{A, C, G, X\}$, $\overleftarrow{f}_{x\overleftarrow{y}\overleftarrow{z},k}^{+}$ represents the occurrence frequency of trinucleotide $x\overleftarrow{y}\overleftarrow{z}$ over all sequences in the positive dataset D⁺, and $\overleftarrow{f}_{\overleftarrow{y}\overleftarrow{z},k-\beta-1}^{+}$ represents the occurrence frequency of dinucleotide $\overleftarrow{y}\overleftarrow{z}$ over all sequences in the positive dataset D⁺.

The encoding value v_k⁺ of pointwise joint mutual information in the positive dataset D⁺ of a nucleotide at position k of DNA/RNA sequences to be encoded is defined as the average of the value $\overrightarrow{v}_{k}^{+}$ of forward pointwise joint mutual information and the value $\overleftarrow{v}_{k}^{+}$ of backward pointwise joint mutual information. The DNA/RNA sequence with length l is thereby encoded into a pointwise joint mutual information feature vector V⁺ with a length of l−2β−4:

$V^{+} = [v_{\beta+3}^{+}, v_{\beta+4}^{+}, \cdots, v_{k}^{+}]$

$v_{k}^{+} = \frac{\overrightarrow{v}_{k}^{+} + \overleftarrow{v}_{k}^{+}}{2}$
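The per-class computation of step (4.1) can be sketched as below. Several choices here are assumptions rather than statements of the disclosure: the logarithm is taken as the natural logarithm, a small pseudocount `eps` is added so that a trinucleotide absent from the dataset does not produce log 0, and the helper names `freq`, `directional_value` and `class_vector` as well as the toy data are hypothetical.

```python
import math


def freq(seqs, positions, pattern):
    """Fraction of sequences whose nucleotides at the given 1-based
    positions spell out `pattern`."""
    hits = sum(
        all(s[p - 1] == c for p, c in zip(positions, pattern)) for s in seqs
    )
    return hits / len(seqs)


def directional_value(seq, seqs, k, beta, sign, eps=1e-6):
    """Forward (sign=+1) or backward (sign=-1) value at 1-based position k.

    eps is an assumed pseudocount that keeps the logarithm finite.
    """
    p2, p3 = k + sign * (beta + 1), k + sign * (beta + 2)
    x, yz = seq[k - 1], seq[p2 - 1] + seq[p3 - 1]
    f_xyz = freq(seqs, (k, p2, p3), x + yz)
    f_x = freq(seqs, (k,), x)
    f_yz = freq(seqs, (p2, p3), yz)
    return math.log((f_xyz + eps) / (f_x * f_yz + eps))


def class_vector(seq, seqs, beta):
    """Average of forward and backward values for beta+3 <= k <= l-beta-2."""
    l = len(seq)
    return [
        (directional_value(seq, seqs, k, beta, +1)
         + directional_value(seq, seqs, k, beta, -1)) / 2
        for k in range(beta + 3, l - beta - 1)
    ]


positive = ["AAACG", "AAACG", "AACTT", "AACTT"]  # hypothetical toy data
v_plus = class_vector("AAACG", positive, beta=0)
print(v_plus)  # one value (k = 3), close to log(2) / 2
```

In this toy case the forward ratio is 0.5 / (0.5 × 0.5) = 2 and the backward ratio is 0.5 / (0.5 × 1) = 1, so the average is about log(2)/2. Passing the negative dataset instead of the positive one would give the corresponding value for D⁻.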

(4.2) Determining a value $\overrightarrow{v}_{k}^{-}$ of forward pointwise joint mutual information of nucleotides of DNA/RNA sequences to be encoded in the negative dataset D⁻ according to the following formula:

$\overrightarrow{v}_{k}^{-} = \log \frac{\overrightarrow{f}_{x\overrightarrow{y}\overrightarrow{z},k}^{-}}{f_{x,k}^{-}\,\overrightarrow{f}_{\overrightarrow{y}\overrightarrow{z},k+\beta+1}^{-}}$

wherein $\overrightarrow{f}_{x\overrightarrow{y}\overrightarrow{z},k}^{-}$ represents the occurrence frequency of trinucleotide $x\overrightarrow{y}\overrightarrow{z}$ in the negative dataset D⁻, x, $\overrightarrow{y}$ and $\overrightarrow{z}$ are the nucleotides at positions k, k+β+1 and k+β+2, respectively, $\overrightarrow{f}_{\overrightarrow{y}\overrightarrow{z},k+\beta+1}^{-}$ is the occurrence frequency of dinucleotide $\overrightarrow{y}\overrightarrow{z}$ in the negative dataset D⁻, and f_x,k⁻ is the occurrence frequency of nucleotide x at position k in the negative dataset D⁻.

Determining a value $\overleftarrow{v}_{k}^{-}$ of backward pointwise joint mutual information of nucleotides of DNA/RNA sequences to be encoded in the negative dataset D⁻ according to the following formula:

$\overleftarrow{v}_{k}^{-} = \log \frac{\overleftarrow{f}_{x\overleftarrow{y}\overleftarrow{z},k}^{-}}{f_{x,k}^{-}\,\overleftarrow{f}_{\overleftarrow{y}\overleftarrow{z},k-\beta-1}^{-}}$

wherein $\overleftarrow{f}_{x\overleftarrow{y}\overleftarrow{z},k}^{-}$ is the occurrence frequency of trinucleotide $x\overleftarrow{y}\overleftarrow{z}$ over all sequences of the negative dataset D⁻, x, $\overleftarrow{y}$ and $\overleftarrow{z}$ are the nucleotides at positions k, k−β−1 and k−β−2, respectively, and $\overleftarrow{f}_{\overleftarrow{y}\overleftarrow{z},k-\beta-1}^{-}$ is the occurrence frequency of dinucleotide $\overleftarrow{y}\overleftarrow{z}$ over all sequences of the negative dataset D⁻.

The encoding value v_k⁻ of pointwise joint mutual information of a nucleotide at position k of DNA/RNA sequences to be encoded in the negative dataset D⁻ is defined as the average of the value $\overrightarrow{v}_{k}^{-}$ of forward pointwise joint mutual information and the value $\overleftarrow{v}_{k}^{-}$ of backward pointwise joint mutual information, and a DNA/RNA sequence with length l is encoded into a pointwise joint mutual information feature vector V⁻ with a length of l−2β−4:

$V^{-} = [v_{\beta+3}^{-}, v_{\beta+4}^{-}, \cdots, v_{k}^{-}]$

$v_{k}^{-} = \frac{\overrightarrow{v}_{k}^{-} + \overleftarrow{v}_{k}^{-}}{2}$

(4.3) Determining a feature vector V of a DNA/RNA sequence to be encoded with a given length l by subtracting each element of vector V⁻ from the corresponding element of vector V⁺:

V=[V_β+3, V_β+4, . . . , V_k]

V_k=v_k⁺−v_k⁻
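For reference, substituting the definitions of steps (4.1) and (4.2) into V_k gives the expanded form below. This is only a restatement, under the assumption that right and left arrows mark the forward and backward frequencies; the symbols y, z (nucleotides at positions k+β+1 and k+β+2) and y′, z′ (nucleotides at positions k−β−1 and k−β−2) are shorthand introduced here purely for illustration.

$V_{k} = \frac{1}{2}\left[\log\frac{\overrightarrow{f}_{xyz,k}^{+}}{f_{x,k}^{+}\,\overrightarrow{f}_{yz,k+\beta+1}^{+}} + \log\frac{\overleftarrow{f}_{xy'z',k}^{+}}{f_{x,k}^{+}\,\overleftarrow{f}_{y'z',k-\beta-1}^{+}}\right] - \frac{1}{2}\left[\log\frac{\overrightarrow{f}_{xyz,k}^{-}}{f_{x,k}^{-}\,\overrightarrow{f}_{yz,k+\beta+1}^{-}} + \log\frac{\overleftarrow{f}_{xy'z',k}^{-}}{f_{x,k}^{-}\,\overleftarrow{f}_{y'z',k-\beta-1}^{-}}\right]$

Read this way, V_k is positive when the nucleotide at position k is, on average over the two directions, more strongly associated with its flanking dinucleotides in D⁺ than in D⁻, and negative in the opposite case.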

(5) Concatenating Features

When the value of parameter β is 0, the feature vector V(0) is [V₃, V₄, V₅, . . . , V_l−3, V_l−2], and the number of elements is l−4. When the value of β is 1, the feature vector V(1) is [V₄, V₅, V₆, . . . , V_l−4, V_l−3], and the number of elements is l−6. Continuing in this way, when the value of β is (l−7)/2, the feature vector V((l−7)/2) is [V_(l−1)/2, V_(l+1)/2, V_(l+3)/2], and the number of elements is 3. When the value of β is (l−5)/2, the feature vector V((l−5)/2) is [V_(l+1)/2], and the number of elements is 1. The feature vectors determined by the different values of parameter β are concatenated into a high-dimensional feature vector [V(0), V(1), . . . , V((l−7)/2), V((l−5)/2)] with (l−3)²/4 elements, which is the sum of the (l−3)/2 odd numbers 1, 3, . . . , l−4.
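The element counts above can be checked with a short sketch that lists, for each admissible β, the first and last position k and the number of elements of V(β). The function name `feature_layout` is a hypothetical helper; l = 41 is the sequence length used in Example 1.

```python
def feature_layout(l):
    """Return [(beta, first_k, last_k, count), ...] for an odd length l >= 5."""
    if l % 2 == 0 or l < 5:
        raise ValueError("l must be an odd integer of at least 5")
    layout = []
    for beta in range((l - 5) // 2 + 1):  # beta = 0, 1, ..., (l - 5) / 2
        first_k, last_k = beta + 3, l - beta - 2
        layout.append((beta, first_k, last_k, last_k - first_k + 1))
    return layout


layout = feature_layout(41)
total = sum(count for _, _, _, count in layout)
print(layout[0])   # beta = 0 covers k = 3 .. l - 2
print(layout[-1])  # beta = (l - 5) / 2 covers the single centre position
print(total, (41 - 3) ** 2 // 4)
```

For l = 41 the per-β counts run 37, 35, . . . , 3, 1 and add up to 361, in agreement with (l−3)²/4.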

(6) Encoding DNA/RNA Sequences

Encoding the DNA/RNA sequence dataset D into a numerical dataset D′ by performing steps (1) to (5) above,

$D' \in \mathbb{R}^{s \times \frac{(l-3)^{2}}{4}},$

where s is the number of samples in the numerical dataset D′, that is, the number of DNA/RNA sequences in dataset D, and (l−3)²/4 is the number of features of the numerical dataset D′.
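Steps (1) to (5) can be put together in a compact end-to-end sketch. It is written under explicit assumptions and is not the reference implementation of the disclosure: frequencies are counted directly from the datasets instead of being read from precomputed matrices, the logarithm is natural, `eps` is an assumed pseudocount that avoids log 0, and all function names and toy sequences are hypothetical.

```python
import math


def freq(seqs, positions, pattern):
    """Fraction of sequences matching `pattern` at the 1-based positions."""
    hits = sum(
        all(s[p - 1] == c for p, c in zip(positions, pattern)) for s in seqs
    )
    return hits / len(seqs)


def class_value(seq, seqs, k, beta, eps):
    """Mean of the forward and backward log-ratios at position k."""
    total = 0.0
    for sign in (+1, -1):  # +1 = forward, -1 = backward
        p2, p3 = k + sign * (beta + 1), k + sign * (beta + 2)
        x, yz = seq[k - 1], seq[p2 - 1] + seq[p3 - 1]
        num = freq(seqs, (k, p2, p3), x + yz) + eps
        den = freq(seqs, (k,), x) * freq(seqs, (p2, p3), yz) + eps
        total += math.log(num / den)
    return total / 2


def encode(seq, positive, negative, eps=1e-6):
    """Concatenate V(0), V(1), ..., V((l-5)/2) for one sequence (odd l)."""
    l = len(seq)
    features = []
    for beta in range((l - 5) // 2 + 1):
        for k in range(beta + 3, l - beta - 1):
            features.append(
                class_value(seq, positive, k, beta, eps)
                - class_value(seq, negative, k, beta, eps)
            )
    return features


positive = ["ACGTACG", "ACGTTCG", "CCGTACA"]  # hypothetical toy data, l = 7
negative = ["TTGACCA", "TAGACGA", "GTGCCCA"]
encoded = [encode(s, positive, negative) for s in positive + negative]
print(len(encoded), len(encoded[0]))  # s rows, (l - 3)**2 / 4 columns
```

In practice the frequencies would presumably be estimated from training sequences only, so that sequences held out for testing do not influence their own encoding; the text above does not specify this, and it is noted here as a design consideration.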

In the present disclosure, a bidirectional dinucleotide position-specific propensity and a bidirectional trinucleotide position-specific propensity are proposed on the basis of nucleotide position-specific propensities, and a pointwise joint mutual information is proposed on the basis of the nucleotide position-specific propensity matrix, the bidirectional dinucleotide position-specific propensity matrix and the bidirectional trinucleotide position-specific propensity matrix. An encoding method is then proposed that represents DNA/RNA sequences by the pointwise joint mutual information computed from these matrices for the positive and negative datasets of DNA/RNA sequences, so that DNA/RNA sequences are encoded into numerical feature samples. In order to extract more trinucleotide position information from DNA/RNA sequences, the parameter β is introduced into the construction of the bidirectional trinucleotide position-specific propensity matrix to represent the distance between the current nucleotide and its forward or backward adjacent dinucleotide, and the numerical feature vectors obtained for different values of β are concatenated, so as to obtain a high-dimensional numerical feature vector with global and local categorical information and low redundancy between features.
Simulation comparative experiments were carried out using the encoding method provided by the present disclosure and seven existing encoding methods. The experimental results show that the accuracy, sensitivity, specificity, MCC (Matthews correlation coefficient), AUROC (area under the receiver operating characteristic curve) and AUPRC (area under the precision-recall curve) of the support vector machine model constructed based on the encoding method provided by the present disclosure for identifying the DNA N⁴-methylcytosine (4mC) sites in the Caenorhabditis elegans DNA sequences are 0.987, 0.991, 0.983, 0.974, 0.999 and 0.999, respectively, which are much higher than those of the other seven compared encoding methods. For identifying the RNA N⁶-methyladenosine (m⁶A) sites in the Saccharomyces cerevisiae RNA sequences, the corresponding values are 0.995, 0.996, 0.994, 0.990, 1 and 1, respectively, which are likewise much higher than those of the other seven compared encoding methods.
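For readers less familiar with the threshold-based metrics named above, the following sketch shows their standard definitions in terms of confusion-matrix counts. The function name `binary_metrics` is hypothetical and the counts are made up for illustration; they are not the experimental results of the present disclosure. AUROC and AUPRC require ranked prediction scores and are not covered here.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity and MCC from confusion counts."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, sensitivity, specificity, mcc


# Made-up counts for illustration only; they are not the reported results.
acc, sn, sp, mcc = binary_metrics(tp=90, fn=10, tn=80, fp=20)
print(acc, sn, sp, round(mcc, 3))
```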

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart of the method of the present disclosure.

FIG. 2 shows the AUROC curves of the support vector machine models for identifying the DNA N⁴-methylcytosine sites in the DNA sequences of Caenorhabditis elegans based on the encoding method provided by the present disclosure and on seven existing encoding methods, respectively.

FIG. 3 shows the AUPRC curves of the support vector machine models for identifying the DNA N⁴-methylcytosine sites in the DNA sequences of Caenorhabditis elegans based on the encoding method provided by the present disclosure and on seven existing encoding methods, respectively.

FIG. 4 shows the AUROC curves of the support vector machine models for identifying the RNA N⁶-methyladenosine sites in the Saccharomyces cerevisiae RNA sequences based on the encoding method provided by the present disclosure and on seven existing encoding methods, respectively.

FIG. 5 shows the AUPRC curves of the support vector machine models for identifying the RNA N⁶-methyladenosine sites in the Saccharomyces cerevisiae RNA sequences based on the encoding method provided by the present disclosure and on seven existing encoding methods, respectively.

DETAILED DESCRIPTION OF THE EMBODIMENTS

The technical schemes provided by the present disclosure will be described in detail below with reference to the figures and examples, but they should not be understood as any limitation to the scope of the present disclosure.

Example 1

The DNA N⁴-methylcytosine (4mC) dataset of the Caenorhabditis elegans DNA sequences recorded in the literature “iDNA4mC: identifying DNA N⁴-methylcytosine sites based on nucleotide chemical properties” was taken as an example. The dataset consisted of 3108 DNA sequences, of which the number of sequences in the positive dataset, i.e., the number of actual N⁴-methylcytosine samples, was 1554, the number of sequences in the negative dataset, i.e., the number of non-N⁴-methylcytosine samples, was 1554, and the length l of each sequence was 41. The method for encoding the DNA sequences based on the bidirectional trinucleotide position-specific propensities and pointwise joint mutual information of the present example comprises the following steps (see FIG. 1):

(1) a nucleotide position-specific propensity matrix of DNA sequences was constructed;

A dataset D of DNA sequences was given, and it consisted of a positive dataset D⁺ and a negative dataset D⁻, i.e., D=D⁺∪D⁻;

the nucleotide position-specific propensity matrix M_S⁺ for the positive dataset D⁺ was determined according to the following formula:

$M_{s}^{+} = \begin{bmatrix} f_{A,1}^{+} & f_{A,2}^{+} & \cdots & f_{A,i}^{+} \\ f_{C,1}^{+} & f_{C,2}^{+} & \cdots & f_{C,i}^{+} \\ f_{G,1}^{+} & f_{G,2}^{+} & \cdots & f_{G,i}^{+} \\ f_{T,1}^{+} & f_{T,2}^{+} & \cdots & f_{T,i}^{+} \end{bmatrix}$

wherein A, C, G and T were the 4 types of nucleotides of DNA sequences, i represented the position of a nucleotide, 1≤i≤l, i was a positive integer, l was the length of a DNA sequence and was an odd number, the value of l in this example was 41, and f_A,i⁺, f_C,i⁺, f_G,i⁺ and f_T,i⁺ were the occurrence frequencies of nucleotides A, C, G and T, respectively, at position i over all sequences of the positive dataset D⁺;

The nucleotide position-specific propensity matrix M_S⁻ of the negative dataset D⁻ was determined according to the following formula:

$M_{s}^{-} = \begin{bmatrix} f_{A,1}^{-} & f_{A,2}^{-} & \cdots & f_{A,i}^{-} \\ f_{C,1}^{-} & f_{C,2}^{-} & \cdots & f_{C,i}^{-} \\ f_{G,1}^{-} & f_{G,2}^{-} & \cdots & f_{G,i}^{-} \\ f_{T,1}^{-} & f_{T,2}^{-} & \cdots & f_{T,i}^{-} \end{bmatrix}$

wherein f_A,i⁻, f_C,i⁻, f_G,i⁻ and f_T,i⁻ were the occurrence frequencies of nucleotides A, C, G and T at position i of all sequences of negative dataset D⁻, respectively.

(2) A bidirectional dinucleotide position-specific propensity matrix of DNA sequences was constructed;

The forward dinucleotide position-specific propensity matrix $\overrightarrow{M}_{d}^{+}$ for the positive dataset D⁺ was determined according to the following formula:

$\overrightarrow{M}_{d}^{+} = \begin{bmatrix} \overrightarrow{f}_{AA,1}^{+} & \overrightarrow{f}_{AA,2}^{+} & \cdots & \overrightarrow{f}_{AA,j}^{+} \\ \overrightarrow{f}_{AC,1}^{+} & \overrightarrow{f}_{AC,2}^{+} & \cdots & \overrightarrow{f}_{AC,j}^{+} \\ \vdots & \vdots & \ddots & \vdots \\ \overrightarrow{f}_{TT,1}^{+} & \overrightarrow{f}_{TT,2}^{+} & \cdots & \overrightarrow{f}_{TT,j}^{+} \end{bmatrix}$

wherein AA, AC, . . . , and TT were the 16 types of dinucleotides formed by the 4 types of nucleotides A, C, G, and T of DNA sequences, j represented the position of the dinucleotide, that is, the position of the first nucleotide of the dinucleotide, the second nucleotide of the dinucleotide was at position j+1, 2≤j≤l−1, j was a finite positive integer, 2≤j≤40 in this example, and $\overrightarrow{f}_{AA,j}^{+}, \overrightarrow{f}_{AC,j}^{+}, \dots, \overrightarrow{f}_{TT,j}^{+}$ were the occurrence frequencies of dinucleotides AA, AC, . . . , and TT, respectively, over all sequences of the positive dataset D⁺;

The backward dinucleotide position-specific propensity matrix $\overleftarrow{M}_{d}^{+}$ for the positive dataset D⁺ was determined according to the following formula:

$\overleftarrow{M}_{d}^{+} = \begin{bmatrix} \overleftarrow{f}_{AA,2}^{+} & \overleftarrow{f}_{AA,3}^{+} & \cdots & \overleftarrow{f}_{AA,j}^{+} \\ \overleftarrow{f}_{AC,2}^{+} & \overleftarrow{f}_{AC,3}^{+} & \cdots & \overleftarrow{f}_{AC,j}^{+} \\ \vdots & \vdots & \ddots & \vdots \\ \overleftarrow{f}_{TT,2}^{+} & \overleftarrow{f}_{TT,3}^{+} & \cdots & \overleftarrow{f}_{TT,j}^{+} \end{bmatrix}$

wherein $\overleftarrow{f}_{AA,j}^{+}, \overleftarrow{f}_{AC,j}^{+}, \dots, \overleftarrow{f}_{TT,j}^{+}$ were the occurrence frequencies of dinucleotides AA, AC, . . . , and TT, respectively, in the positive dataset D⁺, and the first and second nucleotides of these dinucleotides were at positions j and j−1, respectively;

The forward dinucleotide position-specific propensity matrix $\overrightarrow{M}_{d}^{-}$ for the negative dataset D⁻ was determined according to the following formula:

$\overrightarrow{M}_{d}^{-} = \begin{bmatrix} \overrightarrow{f}_{AA,2}^{-} & \overrightarrow{f}_{AA,3}^{-} & \cdots & \overrightarrow{f}_{AA,j}^{-} \\ \overrightarrow{f}_{AC,2}^{-} & \overrightarrow{f}_{AC,3}^{-} & \cdots & \overrightarrow{f}_{AC,j}^{-} \\ \vdots & \vdots & \ddots & \vdots \\ \overrightarrow{f}_{TT,2}^{-} & \overrightarrow{f}_{TT,3}^{-} & \cdots & \overrightarrow{f}_{TT,j}^{-} \end{bmatrix}$

wherein $\overrightarrow{f}_{AA,j}^{-}, \overrightarrow{f}_{AC,j}^{-}, \dots, \overrightarrow{f}_{TT,j}^{-}$ were the occurrence frequencies of dinucleotides AA, AC, . . . , and TT, respectively, over all sequences in the negative dataset D⁻, and the first and second nucleotides of these dinucleotides were at positions j and j+1, respectively;

The backward dinucleotide position-specific propensity matrix $\overleftarrow{M}_{d}^{-}$ for the negative dataset D⁻ was determined according to the following formula:

$\overleftarrow{M}_{d}^{-} = \begin{bmatrix} \overleftarrow{f}_{AA,2}^{-} & \overleftarrow{f}_{AA,3}^{-} & \cdots & \overleftarrow{f}_{AA,j}^{-} \\ \overleftarrow{f}_{AC,2}^{-} & \overleftarrow{f}_{AC,3}^{-} & \cdots & \overleftarrow{f}_{AC,j}^{-} \\ \vdots & \vdots & \ddots & \vdots \\ \overleftarrow{f}_{TT,2}^{-} & \overleftarrow{f}_{TT,3}^{-} & \cdots & \overleftarrow{f}_{TT,j}^{-} \end{bmatrix}$

wherein,

$\begin{matrix} {su}_{-} \\ f_{AA, j} \end{matrix}, \begin{matrix} {su}_{-} \\ f_{AC, j} \end{matrix}, \dots, and \begin{matrix} {su}_{-} \\ f_{TT, j} \end{matrix}$

were occurrence frequencies of dinucleotides AA, AC, . . . , and TT of all sequences of negative dataset D⁻, respectively. The first and second nucleotide of these dinucleotides were at positions j and j−1, respectively;
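By way of illustration only, the forward and backward dinucleotide position-specific frequencies defined above may be sketched in Python as follows. The function name `dinucleotide_psp`, the dictionary layout and the position range j = 2, . . . , l−1 are assumptions made for this sketch; sequences are assumed to have equal length, and positions are 1-based as in the description.

```python
from collections import Counter

def dinucleotide_psp(seqs, direction="forward"):
    """Return {j: {dinucleotide: frequency}} for 1-based positions j = 2 .. l-1.

    forward : first nucleotide at position j, second at position j+1
    backward: first nucleotide at position j, second at position j-1
    """
    l = len(seqs[0])
    step = 1 if direction == "forward" else -1
    table = {}
    for j in range(2, l):  # j = 2 .. l-1 (assumed range)
        # s[j - 1] is the nucleotide at 1-based position j
        counts = Counter(s[j - 1] + s[j - 1 + step] for s in seqs)
        table[j] = {d: c / len(seqs) for d, c in counts.items()}
    return table

# Toy dataset (hypothetical, not taken from the disclosure)
toy = ["ACGT", "ACGA", "TCGT"]
forward = dinucleotide_psp(toy, "forward")
backward = dinucleotide_psp(toy, "backward")
```

Dinucleotides that never occur at a position are simply absent from the inner dictionary, which corresponds to a frequency of zero in the matrix.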

(3) A bidirectional trinucleotide position-specific propensity matrix of DNA sequences was constructed

The forward trinucleotide position-specific propensity matrix $\overrightarrow{M}_{t}^{+}$ for the positive dataset D⁺ was determined according to the following formula:

$\overrightarrow{M}_{t}^{+} = \begin{bmatrix} \overrightarrow{f}_{AAA,\beta+3}^{+} & \overrightarrow{f}_{AAA,\beta+4}^{+} & \cdots & \overrightarrow{f}_{AAA,k}^{+} \\ \overrightarrow{f}_{AAC,\beta+3}^{+} & \overrightarrow{f}_{AAC,\beta+4}^{+} & \cdots & \overrightarrow{f}_{AAC,k}^{+} \\ \vdots & \vdots & \ddots & \vdots \\ \overrightarrow{f}_{TTT,\beta+3}^{+} & \overrightarrow{f}_{TTT,\beta+4}^{+} & \cdots & \overrightarrow{f}_{TTT,k}^{+} \end{bmatrix}$

wherein AAA, AAC, . . . , TTT were the 64 types of trinucleotides formed by the 4 types of nucleotides A, C, G, and T of DNA sequences; β represented the distance between the nucleotide at position k and its forward adjacent dinucleotide, 0≤β≤(l−5)/2, β was a non-negative integer, and 0≤β≤18 in this example; k represented the position of a trinucleotide, that is, the position of the first nucleotide of the trinucleotide, β+3≤k≤l−β−2, k was a positive integer, and β+3≤k≤39−β in this example; and $\overrightarrow{f}_{AAA,k}^{+}, \overrightarrow{f}_{AAC,k}^{+}, \dots, \text{and } \overrightarrow{f}_{TTT,k}^{+}$ were the occurrence frequencies of trinucleotides AAA, AAC, . . . , and TTT of all sequences in positive dataset D⁺, respectively. The first, second and third nucleotide of these trinucleotides were at positions k, k+β+1, and k+β+2 of the DNA sequences, respectively;

The backward trinucleotide position-specific propensity matrix $\overleftarrow{M}_{t}^{+}$ for the positive dataset D⁺ was determined according to the following formula:

$\overleftarrow{M}_{t}^{+} = \begin{bmatrix} \overleftarrow{f}_{AAA,\beta+3}^{+} & \overleftarrow{f}_{AAA,\beta+4}^{+} & \cdots & \overleftarrow{f}_{AAA,k}^{+} \\ \overleftarrow{f}_{AAC,\beta+3}^{+} & \overleftarrow{f}_{AAC,\beta+4}^{+} & \cdots & \overleftarrow{f}_{AAC,k}^{+} \\ \vdots & \vdots & \ddots & \vdots \\ \overleftarrow{f}_{TTT,\beta+3}^{+} & \overleftarrow{f}_{TTT,\beta+4}^{+} & \cdots & \overleftarrow{f}_{TTT,k}^{+} \end{bmatrix}$

wherein $\overleftarrow{f}_{AAA,k}^{+}, \overleftarrow{f}_{AAC,k}^{+}, \dots, \text{and } \overleftarrow{f}_{TTT,k}^{+}$ were the occurrence frequencies of trinucleotides AAA, AAC, . . . , and TTT of all sequences of positive dataset D⁺, respectively. The first, second and third nucleotide of these trinucleotides were at positions k, k−β−1, and k−β−2, respectively;

The forward trinucleotide position-specific propensity matrix $\overrightarrow{M}_{t}^{-}$ for the negative dataset D⁻ was determined according to the following formula:

$\overrightarrow{M}_{t}^{-} = \begin{bmatrix} \overrightarrow{f}_{AAA,\beta+3}^{-} & \overrightarrow{f}_{AAA,\beta+4}^{-} & \cdots & \overrightarrow{f}_{AAA,k}^{-} \\ \overrightarrow{f}_{AAC,\beta+3}^{-} & \overrightarrow{f}_{AAC,\beta+4}^{-} & \cdots & \overrightarrow{f}_{AAC,k}^{-} \\ \vdots & \vdots & \ddots & \vdots \\ \overrightarrow{f}_{TTT,\beta+3}^{-} & \overrightarrow{f}_{TTT,\beta+4}^{-} & \cdots & \overrightarrow{f}_{TTT,k}^{-} \end{bmatrix}$

wherein $\overrightarrow{f}_{AAA,k}^{-}, \overrightarrow{f}_{AAC,k}^{-}, \dots, \text{and } \overrightarrow{f}_{TTT,k}^{-}$ were the occurrence frequencies of trinucleotides AAA, AAC, . . . , and TTT of all sequences of negative dataset D⁻, respectively. The first, second and third nucleotide of these trinucleotides were at positions k, k+β+1, and k+β+2, respectively;

The backward trinucleotide position-specific propensity matrix $\overleftarrow{M}_{t}^{-}$ for the negative dataset D⁻ was determined according to the following formula:

$\overleftarrow{M}_{t}^{-} = \begin{bmatrix} \overleftarrow{f}_{AAA,\beta+3}^{-} & \overleftarrow{f}_{AAA,\beta+4}^{-} & \cdots & \overleftarrow{f}_{AAA,k}^{-} \\ \overleftarrow{f}_{AAC,\beta+3}^{-} & \overleftarrow{f}_{AAC,\beta+4}^{-} & \cdots & \overleftarrow{f}_{AAC,k}^{-} \\ \vdots & \vdots & \ddots & \vdots \\ \overleftarrow{f}_{TTT,\beta+3}^{-} & \overleftarrow{f}_{TTT,\beta+4}^{-} & \cdots & \overleftarrow{f}_{TTT,k}^{-} \end{bmatrix}$

wherein $\overleftarrow{f}_{AAA,k}^{-}, \overleftarrow{f}_{AAC,k}^{-}, \dots, \text{and } \overleftarrow{f}_{TTT,k}^{-}$ were the occurrence frequencies of trinucleotides AAA, AAC, . . . , and TTT of all sequences of negative dataset D⁻, respectively. The first, second and third nucleotide of these trinucleotides were at positions k, k−β−1, and k−β−2, respectively;
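As a non-limiting sketch, the gapped trinucleotide frequencies of step (3) may be computed as shown below. The function name `trinucleotide_psp` and the returned dictionary layout are assumptions of this sketch; the index arithmetic follows the positions k, k±(β+1) and k±(β+2) and the range β+3≤k≤l−β−2 given above.

```python
from collections import Counter

def trinucleotide_psp(seqs, beta, direction="forward"):
    """Return {k: {trinucleotide: frequency}} for 1-based k = beta+3 .. l-beta-2.

    forward : nucleotides at positions k, k+beta+1, k+beta+2
    backward: nucleotides at positions k, k-beta-1, k-beta-2
    """
    l = len(seqs[0])
    sign = 1 if direction == "forward" else -1
    table = {}
    for k in range(beta + 3, l - beta - 1):  # upper bound l-beta-2 inclusive
        i = k - 1  # 0-based index of position k
        counts = Counter(
            s[i] + s[i + sign * (beta + 1)] + s[i + sign * (beta + 2)]
            for s in seqs
        )
        table[k] = {t: c / len(seqs) for t, c in counts.items()}
    return table

# Toy usage (hypothetical sequences of length 7, beta = 1, so k = 4 only)
forward = trinucleotide_psp(["ACGTACG"], beta=1, direction="forward")
backward = trinucleotide_psp(["ACGTACG"], beta=1, direction="backward")
```

With the stated range of k, the largest forward index is k+β+2 = l and the smallest backward index is k−β−2 = 1, so no position falls outside the sequence.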

(4) A value of the pointwise joint mutual information of the nucleotides of DNA sequences was determined

(4.1) The value $\overrightarrow{v}_{k}^{+}$ of the forward pointwise joint mutual information of nucleotides of DNA sequences to be encoded in the positive dataset D⁺ was determined according to the following formula:

$\overrightarrow{v}_{k}^{+} = \log \frac{\overrightarrow{f}_{x\overrightarrow{yz},k}^{+}}{f_{x,k}^{+}\,\overrightarrow{f}_{\overrightarrow{yz},k+\beta+1}^{+}}$

wherein x was the nucleotide at position k, x∈{A, C, G, T}; $\overrightarrow{y}$ was the nucleotide at position k+β+1, $\overrightarrow{y} \in \{A, C, G, T\}$; $\overrightarrow{z}$ was the nucleotide at position k+β+2, $\overrightarrow{z} \in \{A, C, G, T\}$; $\overrightarrow{f}_{x\overrightarrow{yz},k}^{+}$ was the occurrence frequency of trinucleotide $x\overrightarrow{yz}$ of all sequences of positive dataset D⁺; $\overrightarrow{f}_{\overrightarrow{yz},k+\beta+1}^{+}$ was the occurrence frequency of dinucleotide $\overrightarrow{yz}$ of all sequences of positive dataset D⁺; and f_x,k⁺ was the occurrence frequency of nucleotide x of all sequences of positive dataset D⁺;

The value $\overleftarrow{v}_{k}^{+}$ of the backward pointwise joint mutual information of nucleotides of DNA sequences to be encoded in the positive dataset D⁺ was determined according to the following formula:

$\overleftarrow{v}_{k}^{+} = \log \frac{\overleftarrow{f}_{x\overleftarrow{yz},k}^{+}}{f_{x,k}^{+}\,\overleftarrow{f}_{\overleftarrow{yz},k-\beta-1}^{+}}$

wherein $\overleftarrow{y}$ was the nucleotide at position k−β−1, $\overleftarrow{y} \in \{A, C, G, T\}$; $\overleftarrow{z}$ was the nucleotide at position k−β−2, $\overleftarrow{z} \in \{A, C, G, T\}$; $\overleftarrow{f}_{x\overleftarrow{yz},k}^{+}$ was the occurrence frequency of trinucleotide $x\overleftarrow{yz}$ of all sequences of positive dataset D⁺; and $\overleftarrow{f}_{\overleftarrow{yz},k-\beta-1}^{+}$ was the occurrence frequency of dinucleotide $\overleftarrow{yz}$ of all sequences of positive dataset D⁺.

The encoding value v_k⁺ of pointwise joint mutual information of the nucleotide at position k of a DNA sequence to be encoded in the positive dataset D⁺ was defined as the average of the value $\overrightarrow{v}_{k}^{+}$ of forward pointwise joint mutual information and the value $\overleftarrow{v}_{k}^{+}$ of backward pointwise joint mutual information, and a DNA sequence with length l was encoded into a pointwise mutual information feature vector V⁺ with l−2β−4 elements:

$V^{+} = [v_{\beta+3}^{+}, v_{\beta+4}^{+}, \ldots, v_{k}^{+}]$

$v_{k}^{+} = \frac{\overrightarrow{v}_{k}^{+} + \overleftarrow{v}_{k}^{+}}{2}$

The value of l was 41 in this example.

(4.2) The value $\overrightarrow{v}_{k}^{-}$ of forward pointwise joint mutual information of nucleotides of a DNA sequence to be encoded in the negative dataset D⁻ was determined according to the following formula:

$\overrightarrow{v}_{k}^{-} = \log \frac{\overrightarrow{f}_{x\overrightarrow{yz},k}^{-}}{f_{x,k}^{-}\,\overrightarrow{f}_{\overrightarrow{yz},k+\beta+1}^{-}}$

wherein the nucleotides x, $\overrightarrow{y}$, and $\overrightarrow{z}$ were at positions k, k+β+1 and k+β+2, respectively; $\overrightarrow{f}_{x\overrightarrow{yz},k}^{-}$ was the occurrence frequency of trinucleotide $x\overrightarrow{yz}$ of all sequences of negative dataset D⁻; $\overrightarrow{f}_{\overrightarrow{yz},k+\beta+1}^{-}$ was the occurrence frequency of dinucleotide $\overrightarrow{yz}$ of all sequences of negative dataset D⁻; and f_x,k⁻ was the occurrence frequency of the nucleotide x of all sequences of negative dataset D⁻.

The value $\overleftarrow{v}_{k}^{-}$ of backward pointwise joint mutual information of nucleotides of a DNA sequence to be encoded in the negative dataset D⁻ was determined according to the following formula:

$\overleftarrow{v}_{k}^{-} = \log \frac{\overleftarrow{f}_{x\overleftarrow{yz},k}^{-}}{f_{x,k}^{-}\,\overleftarrow{f}_{\overleftarrow{yz},k-\beta-1}^{-}}$

wherein the nucleotides x, $\overleftarrow{y}$, and $\overleftarrow{z}$ were at positions k, k−β−1 and k−β−2, respectively; $\overleftarrow{f}_{x\overleftarrow{yz},k}^{-}$ was the occurrence frequency of trinucleotide $x\overleftarrow{yz}$ of all sequences of negative dataset D⁻; and $\overleftarrow{f}_{\overleftarrow{yz},k-\beta-1}^{-}$ was the occurrence frequency of dinucleotide $\overleftarrow{yz}$ of all sequences of negative dataset D⁻.

The encoding value v_k⁻ of pointwise joint mutual information of the nucleotide at position k of a DNA sequence to be encoded in the negative dataset D⁻ was defined as the average of the value $\overrightarrow{v}_{k}^{-}$ of forward pointwise joint mutual information and the value $\overleftarrow{v}_{k}^{-}$ of backward pointwise joint mutual information, and a DNA sequence with a length of l was encoded into a pointwise mutual information feature vector V⁻ with a length of l−2β−4:

$V^{-} = [v_{\beta+3}^{-}, v_{\beta+4}^{-}, \ldots, v_{k}^{-}]$

$v_{k}^{-} = \frac{\overrightarrow{v}_{k}^{-} + \overleftarrow{v}_{k}^{-}}{2}$

The value of l was 41 in this example.

(4.3) The feature vector V of a DNA sequence to be encoded with length l was determined by subtracting each element of vector V⁻ from the corresponding element of vector V⁺:

$V = [V_{\beta+3}, V_{\beta+4}, \ldots, V_{k}]$

$V_{k} = v_{k}^{+} - v_{k}^{-};$
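Steps (4.1) to (4.3) may be illustrated by the following minimal Python sketch, which computes V_k = v_k⁺ − v_k⁻ for one sequence and one value of β. The function names `freq`, `pjmi` and `encode` are hypothetical. The sketch assumes a natural logarithm (the base is not restated here), computes the frequencies directly from the datasets instead of looking them up in precomputed matrices, and assumes that a trinucleotide with zero frequency contributes 0.0 rather than an undefined logarithm.

```python
import math

def freq(seqs, positions, pattern):
    """Fraction of sequences whose nucleotides at 1-based `positions` spell `pattern`."""
    hits = sum(all(s[p - 1] == c for p, c in zip(positions, pattern)) for s in seqs)
    return hits / len(seqs)

def pjmi(seq, k, beta, dataset):
    """Average of forward and backward pointwise joint mutual information at position k."""
    values = []
    for sign in (+1, -1):  # +1: forward, -1: backward
        p1, p2 = k + sign * (beta + 1), k + sign * (beta + 2)
        x, y, z = seq[k - 1], seq[p1 - 1], seq[p2 - 1]
        f_xyz = freq(dataset, (k, p1, p2), x + y + z)
        f_x = freq(dataset, (k,), x)
        f_yz = freq(dataset, (p1, p2), y + z)
        # Assumption of this sketch: zero frequency contributes 0.0 instead of log(0).
        values.append(math.log(f_xyz / (f_x * f_yz)) if f_xyz > 0 else 0.0)
    return sum(values) / 2

def encode(seq, beta, pos, neg):
    """Feature vector [V_k] for k = beta+3 .. l-beta-2, with V_k = v_k(+) - v_k(-)."""
    l = len(seq)
    return [pjmi(seq, k, beta, pos) - pjmi(seq, k, beta, neg)
            for k in range(beta + 3, l - beta - 1)]

# Toy usage (hypothetical data): one feature for l = 5 and beta = 0
v = encode("ACGTA", 0, pos=["ACGTA", "ACGTA"], neg=["ACGTA", "TTTTT"])
```

For a sequence of length l the returned list has l−2β−4 elements, which matches the length stated for V⁺ and V⁻.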

(5) Concatenating features

When the value of parameter β was 0, the feature vector V(0) was [V₃, V₄, V₅, . . . , V_l−3, V_l−2], and the number of elements was l−4; when the value of β was 1, the feature vector V(1) was [V₄, V₅, V₆, . . . , V_l−4, V_l−3], and the number of elements was l−6; . . . ; when the value of β was (l−7)/2, the feature vector V((l−7)/2) was [V_(l−1)/2, V_(l+1)/2, V_(l+3)/2], and the number of elements was 3; and when the value of β was (l−5)/2, the feature vector V((l−5)/2) was [V_(l+1)/2], and the number of elements was 1. The feature vectors determined by the different values of the parameter β were concatenated into a high-dimensional feature vector [V(0), V(1), . . . , V((l−7)/2), V((l−5)/2)] with (l−3)²/4 elements. The value of l was 41 in this example, giving 361 elements.
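The element count of the concatenated vector can be checked with a short Python sketch. The helper names `beta_values`, `block_length` and `total_features` are hypothetical; the sketch only sums the block lengths l−2β−4 over 0≤β≤(l−5)/2 for an odd length l and compares the result with (l−3)²/4.

```python
def beta_values(l):
    """All admissible values of beta for an odd sequence length l: 0 .. (l-5)/2."""
    return range(0, (l - 5) // 2 + 1)

def block_length(l, beta):
    """Number of elements of V(beta), i.e. l - 2*beta - 4."""
    return l - 2 * beta - 4

def total_features(l):
    """Length of the concatenated vector [V(0), V(1), ..., V((l-5)/2)]."""
    return sum(block_length(l, b) for b in beta_values(l))

# The block lengths are the odd numbers l-4, l-6, ..., 3, 1,
# whose sum equals ((l-3)/2)**2 = (l-3)**2 / 4.
dims = {l: total_features(l) for l in (41, 51)}
```

For l = 41 the sum is 361, and for l = 51 (the sequence length of Example 2) it is 576, both equal to (l−3)²/4.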

(6) Encoding the DNA sequences

The DNA sequence dataset D was encoded into a numerical dataset D′ by performing the above steps (1) to (5),

$D^{'} \in \mathbb{R}^{s \times \frac{(l-3)^{2}}{4}},$

where s was the number of samples of the numerical dataset D′, i.e., the number of DNA sequences in the DNA sequence dataset D, s was a finite positive integer with a value of 3108 in this example, and (l−3)²/4 was the number of features of the numerical dataset D′. The encoding of DNA sequences was thus completed.

The DNA sequence encoding method of Example 1 was compared with seven existing encoding methods for identifying the DNA N⁴-methylcytosine sites in Caenorhabditis elegans DNA sequences: PSNP (position-specific nucleotide propensities), PSDP (position-specific dinucleotide propensities), KNF (K-nucleotide frequencies), KSNPF (K spaced nucleotide pair frequencies), NPPS (nucleotide pair position specificity), PBE (positional binary encoding) and NCPNC (nucleotide chemical property and nucleotide composition). The methods were compared by the performance of the support vector machine models constructed using each encoding method. The average classification accuracy, sensitivity, specificity, MCC (Matthews correlation coefficient), AUROC (area under the receiver operating characteristic curve) and AUPRC (area under the precision-recall curve) of the 10-fold cross-validation method were used to evaluate the experimental results. The experimental method was as follows:

1. The DNA sequences of N⁴-methylcytosine of Caenorhabditis elegans were encoded according to the method of Example 1;

2. Normalizing the dataset

The numerical dataset D′ was normalized by the maximum-minimum method according to the following formula:

$g_{m,n}^{'} = \frac{g_{m,n} - \min(g_{n})}{\max(g_{n}) - \min(g_{n})}$

where g_m,n was the n-th feature value of the m-th sample of the numerical dataset D′, g′_m,n was the normalized value of g_m,n, max(g_n) and min(g_n) represented the maximum and minimum feature values of the n-th column of the numerical dataset D′, 1≤m≤s, 1≤n≤(l−3)²/4, m and n were finite positive integers, the value of l in this example was 41, and the value of s was 3108.
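The column-wise maximum-minimum normalization above may be sketched as follows. The function name `min_max_normalize` is hypothetical, and mapping a constant column to 0.0 is an assumption of this sketch, since the formula is undefined when max(g_n) equals min(g_n).

```python
def min_max_normalize(rows):
    """Scale every column of a list-of-lists dataset to the interval [0, 1]."""
    cols = list(zip(*rows))          # transpose: one tuple per feature column
    lo = [min(c) for c in cols]      # min(g_n) for each column n
    hi = [max(c) for c in cols]      # max(g_n) for each column n
    return [
        # Assumption: a constant column (hi == lo) is mapped to 0.0.
        [(v - a) / (b - a) if b > a else 0.0 for v, a, b in zip(row, lo, hi)]
        for row in rows
    ]

# Toy dataset (hypothetical): 3 samples, 3 features
normalized = min_max_normalize([[1, 10, 5], [3, 20, 5], [2, 40, 5]])
```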

3. Partitioning dataset

The normalized numerical dataset D′ was partitioned into 10 folds by using the K-fold cross-validation method (K=10). In each run, one fold was taken as the test dataset D′_Te and the remaining nine folds were taken as the training dataset D′_Tr; this was repeated until each fold had served as the test dataset once, giving 10 runs in total. The ratio of the training dataset D′_Tr to the test dataset D′_Te in each run was 9:1.
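One possible way to generate such a partition is sketched below. The function name `k_fold_indices`, the shuffling of the sample indices and the fixed random seed are assumptions of this sketch; the description does not state how samples were assigned to folds.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k runs."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)           # assumed: random assignment to folds
    folds = [idx[i::k] for i in range(k)]      # k disjoint folds of near-equal size
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# With s = 3108 samples, each test fold holds 310 or 311 samples,
# so the train/test ratio is approximately 9:1 in every run.
splits = list(k_fold_indices(3108, k=10))
```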

4. Training and testing the model

The support vector machine model was trained using the training dataset D′_Tr, and the performance of the support vector machine model was tested using the test dataset D′_Te.

The DNA N⁴-methylcytosine sites in Caenorhabditis elegans DNA sequences were identified by performing the same operation on the seven compared encoding methods according to steps 2-4 of the experimental methods. The experimental results of classification accuracy, sensitivity, specificity and MCC were shown in Table 1, the experimental results of AUROC were shown in FIG. 2, and the experimental results of AUPRC were shown in FIG. 3.

TABLE 1

Comparison of experimental results between the method of Example 1 and the other seven methods

| Encoding method | Accuracy | Sensitivity | Specificity | MCC |
|---|---|---|---|---|
| The present invention | 0.987 | 0.991 | 0.983 | 0.974 |
| PSNP | 0.739 | 0.732 | 0.746 | 0.479 |
| PSDP | 0.827 | 0.820 | 0.833 | 0.653 |
| KNF | 0.653 | 0.656 | 0.651 | 0.307 |
| KSNPF | 0.662 | 0.642 | 0.681 | 0.324 |
| NPPS | 0.877 | 0.880 | 0.873 | 0.754 |
| PBE | 0.763 | 0.775 | 0.750 | 0.526 |
| NCPNC | 0.762 | 0.772 | 0.752 | 0.524 |

As shown in Table 1, the accuracy, sensitivity, specificity and MCC for identifying the DNA N⁴-methylcytosine sites in Caenorhabditis elegans DNA sequences through the support vector machine model constructed based on the DNA sequence encoding method of the present disclosure were 0.987, 0.991, 0.983 and 0.974, respectively, which were much higher than those of the other seven compared encoding methods.

As shown in FIG. 2, the value of AUROC for identifying the DNA N⁴-methylcytosine sites in Caenorhabditis elegans DNA sequences through the support vector machine model constructed based on the DNA sequence encoding method of the present disclosure was 0.999, which was much higher than that of the other seven compared encoding methods.

As shown in FIG. 3, the value of AUPRC for identifying the DNA N⁴-methylcytosine sites in Caenorhabditis elegans DNA sequences through the support vector machine model constructed based on the DNA sequence encoding method of the present disclosure was 0.999, which was much higher than that of the other seven compared encoding methods.

Example 2

The RNA N⁶-methyladenosine (m⁶A) dataset of the Saccharomyces cerevisiae RNA sequences in the literature “Benchmark data for identifying N⁶-methyladenosine sites in the Saccharomyces cerevisiae genome” was taken as an example. The dataset consisted of 2614 RNA sequences, of which the number of samples in the positive dataset, i.e., the actual number of N⁶-methyladenosine samples, was 1307, the number of samples in the negative dataset, i.e., the number of non-N⁶-methyladenosine samples, was 1307, and the length l of each sequence was 51. The method for encoding RNA sequences based on bidirectional trinucleotide position-specific propensities and pointwise joint mutual information of the present example comprised the following steps (see FIG. 1):

(1) A nucleotide position-specific propensity matrix of RNA sequences was constructed;

A dataset D of RNA sequences was given, and the dataset consisted of a positive dataset D⁺ and a negative dataset D⁻, i.e. D=D⁺∪D⁻;

The nucleotide position-specific propensity matrix M_s⁺ for the positive dataset D⁺ was determined according to the following formula:

$M_{s}^{+} = \begin{bmatrix} f_{A,1}^{+} & f_{A,2}^{+} & \cdots & f_{A,i}^{+} \\ f_{C,1}^{+} & f_{C,2}^{+} & \cdots & f_{C,i}^{+} \\ f_{G,1}^{+} & f_{G,2}^{+} & \cdots & f_{G,i}^{+} \\ f_{U,1}^{+} & f_{U,2}^{+} & \cdots & f_{U,i}^{+} \end{bmatrix}$

wherein A, C, G and U were the 4 types of nucleotides of RNA sequences; i represented the position of a nucleotide, 1≤i≤l, and i was a finite positive integer; l was the length of an RNA sequence, its value was an odd number, and the value of l in this example was 51; and f_A,i⁺, f_C,i⁺, f_G,i⁺ and f_U,i⁺ were the occurrence frequencies of nucleotides A, C, G and U at position i of all sequences of positive dataset D⁺, respectively;

The nucleotide position-specific propensity matrix M_s⁻ for the negative dataset D⁻ was determined according to the following formula:

$M_{s}^{-} = \begin{bmatrix} f_{A,1}^{-} & f_{A,2}^{-} & \cdots & f_{A,i}^{-} \\ f_{C,1}^{-} & f_{C,2}^{-} & \cdots & f_{C,i}^{-} \\ f_{G,1}^{-} & f_{G,2}^{-} & \cdots & f_{G,i}^{-} \\ f_{U,1}^{-} & f_{U,2}^{-} & \cdots & f_{U,i}^{-} \end{bmatrix}$

wherein f_A,i⁻, f_C,i⁻, f_G,i⁻ and f_U,i⁻ were the occurrence frequencies of nucleotides A, C, G and U at position i of all sequences of negative dataset D⁻, respectively.
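A nucleotide position-specific propensity matrix of this kind may be sketched in Python as below. The function name `nucleotide_psp` and the row order given by the `alphabet` argument are assumptions of this sketch; each row corresponds to one nucleotide and each column to one position i = 1, . . . , l.

```python
def nucleotide_psp(seqs, alphabet="ACGU"):
    """Return a len(alphabet) x l matrix of per-position nucleotide frequencies."""
    l = len(seqs[0])
    n = len(seqs)
    return [
        [sum(s[i] == a for s in seqs) / n for i in range(l)]  # row for nucleotide a
        for a in alphabet
    ]

# Toy usage (hypothetical RNA sequences); the same function would be applied
# separately to the positive and the negative dataset.
m_pos = nucleotide_psp(["ACGU", "AGGU"])
```

Because every sequence carries exactly one nucleotide at each position, every column of the matrix sums to 1 when all sequences use only the given alphabet.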

(2) A bidirectional dinucleotide position-specific propensity matrix of RNA sequences was constructed;

The forward dinucleotide position-specific propensity matrix $\overrightarrow{M}_{d}^{+}$ for the positive dataset D⁺ was determined according to the following formula:

$\overrightarrow{M}_{d}^{+} = \begin{bmatrix} \overrightarrow{f}_{AA,2}^{+} & \overrightarrow{f}_{AA,3}^{+} & \cdots & \overrightarrow{f}_{AA,j}^{+} \\ \overrightarrow{f}_{AC,2}^{+} & \overrightarrow{f}_{AC,3}^{+} & \cdots & \overrightarrow{f}_{AC,j}^{+} \\ \vdots & \vdots & \ddots & \vdots \\ \overrightarrow{f}_{UU,2}^{+} & \overrightarrow{f}_{UU,3}^{+} & \cdots & \overrightarrow{f}_{UU,j}^{+} \end{bmatrix}$

wherein AA, AC, . . . , and UU were the 16 types of dinucleotides formed by the 4 types of nucleotides A, C, G, and U of RNA sequences; j represented the position of a dinucleotide, i.e., the position of the first nucleotide of the dinucleotide, 2≤j≤l−1, j was a finite positive integer, and 2≤j≤50 in this example; and $\overrightarrow{f}_{AA,j}^{+}, \overrightarrow{f}_{AC,j}^{+}, \dots, \text{and } \overrightarrow{f}_{UU,j}^{+}$ were the occurrence frequencies of dinucleotides AA, AC, . . . , and UU of all sequences of positive dataset D⁺, respectively, and the first and second nucleotide of the dinucleotides were at positions j and j+1, respectively;

The backward dinucleotide position-specific propensity matrix $\overleftarrow{M}_{d}^{+}$ for the positive dataset D⁺ was determined according to the following formula:

$\overleftarrow{M}_{d}^{+} = \begin{bmatrix} \overleftarrow{f}_{AA,2}^{+} & \overleftarrow{f}_{AA,3}^{+} & \cdots & \overleftarrow{f}_{AA,j}^{+} \\ \overleftarrow{f}_{AC,2}^{+} & \overleftarrow{f}_{AC,3}^{+} & \cdots & \overleftarrow{f}_{AC,j}^{+} \\ \vdots & \vdots & \ddots & \vdots \\ \overleftarrow{f}_{UU,2}^{+} & \overleftarrow{f}_{UU,3}^{+} & \cdots & \overleftarrow{f}_{UU,j}^{+} \end{bmatrix}$

wherein $\overleftarrow{f}_{AA,j}^{+}, \overleftarrow{f}_{AC,j}^{+}, \dots, \text{and } \overleftarrow{f}_{UU,j}^{+}$ were the occurrence frequencies of dinucleotides AA, AC, . . . , and UU of all sequences of positive dataset D⁺, respectively. The first and second nucleotide of these dinucleotides were at positions j and j−1, respectively;

The forward dinucleotide position-specific propensity matrix $\overrightarrow{M}_{d}^{-}$ for the negative dataset D⁻ was determined according to the following formula:

$\overrightarrow{M}_{d}^{-} = \begin{bmatrix} \overrightarrow{f}_{AA,2}^{-} & \overrightarrow{f}_{AA,3}^{-} & \cdots & \overrightarrow{f}_{AA,j}^{-} \\ \overrightarrow{f}_{AC,2}^{-} & \overrightarrow{f}_{AC,3}^{-} & \cdots & \overrightarrow{f}_{AC,j}^{-} \\ \vdots & \vdots & \ddots & \vdots \\ \overrightarrow{f}_{UU,2}^{-} & \overrightarrow{f}_{UU,3}^{-} & \cdots & \overrightarrow{f}_{UU,j}^{-} \end{bmatrix}$

wherein $\overrightarrow{f}_{AA,j}^{-}, \overrightarrow{f}_{AC,j}^{-}, \dots, \text{and } \overrightarrow{f}_{UU,j}^{-}$ were the occurrence frequencies of dinucleotides AA, AC, . . . , and UU of all sequences of negative dataset D⁻, respectively. The first and second nucleotide of these dinucleotides were at positions j and j+1, respectively;

The backward dinucleotide position-specific propensity matrix $\overleftarrow{M}_{d}^{-}$ for the negative dataset D⁻ was determined according to the following formula:

$\overleftarrow{M}_{d}^{-} = \begin{bmatrix} \overleftarrow{f}_{AA,2}^{-} & \overleftarrow{f}_{AA,3}^{-} & \cdots & \overleftarrow{f}_{AA,j}^{-} \\ \overleftarrow{f}_{AC,2}^{-} & \overleftarrow{f}_{AC,3}^{-} & \cdots & \overleftarrow{f}_{AC,j}^{-} \\ \vdots & \vdots & \ddots & \vdots \\ \overleftarrow{f}_{UU,2}^{-} & \overleftarrow{f}_{UU,3}^{-} & \cdots & \overleftarrow{f}_{UU,j}^{-} \end{bmatrix}$

wherein $\overleftarrow{f}_{AA,j}^{-}, \overleftarrow{f}_{AC,j}^{-}, \dots, \text{and } \overleftarrow{f}_{UU,j}^{-}$ were the occurrence frequencies of dinucleotides AA, AC, . . . , and UU of all sequences of negative dataset D⁻, respectively. The first and second nucleotide of these dinucleotides were at positions j and j−1, respectively;

(3) A bidirectional trinucleotide position-specific propensity matrix of RNA sequences was constructed

The forward trinucleotide position-specific propensity matrix $\overrightarrow{M}_{t}^{+}$ for the positive dataset D⁺ was determined according to the following formula:

$\overrightarrow{M}_{t}^{+} = \begin{bmatrix} \overrightarrow{f}_{AAA,\beta+3}^{+} & \overrightarrow{f}_{AAA,\beta+4}^{+} & \cdots & \overrightarrow{f}_{AAA,k}^{+} \\ \overrightarrow{f}_{AAC,\beta+3}^{+} & \overrightarrow{f}_{AAC,\beta+4}^{+} & \cdots & \overrightarrow{f}_{AAC,k}^{+} \\ \vdots & \vdots & \ddots & \vdots \\ \overrightarrow{f}_{UUU,\beta+3}^{+} & \overrightarrow{f}_{UUU,\beta+4}^{+} & \cdots & \overrightarrow{f}_{UUU,k}^{+} \end{bmatrix}$

wherein AAA, AAC, . . . , UUU were the 64 types of trinucleotides formed by the 4 types of nucleotides A, C, G, and U of RNA sequences; β represented the distance between the nucleotide at position k and its forward adjacent dinucleotide, 0≤β≤(l−5)/2, β was a finite non-negative integer, and 0≤β≤23 in this example; k represented the position of a trinucleotide, i.e., the position of the first nucleotide of the trinucleotide, β+3≤k≤l−β−2, k was a finite positive integer, and β+3≤k≤49−β in this example; and $\overrightarrow{f}_{AAA,k}^{+}, \overrightarrow{f}_{AAC,k}^{+}, \dots, \text{and } \overrightarrow{f}_{UUU,k}^{+}$ were the occurrence frequencies of trinucleotides AAA, AAC, . . . , and UUU of all RNA sequences of positive dataset D⁺, respectively. The first, second and third nucleotide of these trinucleotides were at positions k, k+β+1, and k+β+2, respectively;

The backward trinucleotide position-specific propensity matrix $\overleftarrow{M}_{t}^{+}$ for the positive dataset D⁺ was determined according to the following formula:

$\overleftarrow{M}_{t}^{+} = \begin{bmatrix} \overleftarrow{f}_{AAA,\beta+3}^{+} & \overleftarrow{f}_{AAA,\beta+4}^{+} & \cdots & \overleftarrow{f}_{AAA,k}^{+} \\ \overleftarrow{f}_{AAC,\beta+3}^{+} & \overleftarrow{f}_{AAC,\beta+4}^{+} & \cdots & \overleftarrow{f}_{AAC,k}^{+} \\ \vdots & \vdots & \ddots & \vdots \\ \overleftarrow{f}_{UUU,\beta+3}^{+} & \overleftarrow{f}_{UUU,\beta+4}^{+} & \cdots & \overleftarrow{f}_{UUU,k}^{+} \end{bmatrix}$

wherein $\overleftarrow{f}_{AAA,k}^{+}, \overleftarrow{f}_{AAC,k}^{+}, \dots, \text{and } \overleftarrow{f}_{UUU,k}^{+}$ were the occurrence frequencies of trinucleotides AAA, AAC, . . . , and UUU of all RNA sequences of positive dataset D⁺, respectively. The first, second and third nucleotide of these trinucleotides were at positions k, k−β−1, and k−β−2, respectively;

The forward trinucleotide position-specific propensity matrix $\overrightarrow{M}_{t}^{-}$ for the negative dataset D⁻ was determined according to the following formula:

$\overrightarrow{M}_{t}^{-} = \begin{bmatrix} \overrightarrow{f}_{AAA,\beta+3}^{-} & \overrightarrow{f}_{AAA,\beta+4}^{-} & \cdots & \overrightarrow{f}_{AAA,k}^{-} \\ \overrightarrow{f}_{AAC,\beta+3}^{-} & \overrightarrow{f}_{AAC,\beta+4}^{-} & \cdots & \overrightarrow{f}_{AAC,k}^{-} \\ \vdots & \vdots & \ddots & \vdots \\ \overrightarrow{f}_{UUU,\beta+3}^{-} & \overrightarrow{f}_{UUU,\beta+4}^{-} & \cdots & \overrightarrow{f}_{UUU,k}^{-} \end{bmatrix}$

wherein $\overrightarrow{f}_{AAA,k}^{-}, \overrightarrow{f}_{AAC,k}^{-}, \dots, \text{and } \overrightarrow{f}_{UUU,k}^{-}$ were the occurrence frequencies of trinucleotides AAA, AAC, . . . , and UUU of all RNA sequences of negative dataset D⁻, respectively. The first, second and third nucleotide of these trinucleotides were at positions k, k+β+1, and k+β+2, respectively;

The backward trinucleotide position-specific propensity matrix $\overleftarrow{M}_{t}^{-}$ for the negative dataset D⁻ was determined according to the following formula:

$\overleftarrow{M}_{t}^{-} = \begin{bmatrix} \overleftarrow{f}_{AAA,\beta+3}^{-} & \overleftarrow{f}_{AAA,\beta+4}^{-} & \cdots & \overleftarrow{f}_{AAA,k}^{-} \\ \overleftarrow{f}_{AAC,\beta+3}^{-} & \overleftarrow{f}_{AAC,\beta+4}^{-} & \cdots & \overleftarrow{f}_{AAC,k}^{-} \\ \vdots & \vdots & \ddots & \vdots \\ \overleftarrow{f}_{UUU,\beta+3}^{-} & \overleftarrow{f}_{UUU,\beta+4}^{-} & \cdots & \overleftarrow{f}_{UUU,k}^{-} \end{bmatrix}$

wherein $\overleftarrow{f}_{AAA,k}^{-}, \overleftarrow{f}_{AAC,k}^{-}, \dots, \text{and } \overleftarrow{f}_{UUU,k}^{-}$ were the occurrence frequencies of trinucleotides AAA, AAC, . . . , and UUU of all RNA sequences of negative dataset D⁻, respectively. The first, second and third nucleotide of these trinucleotides were at positions k, k−β−1, and k−β−2, respectively;

(4) A value of pointwise joint mutual information of the nucleotides of RNA sequences was determined

(4.1) The value $\overrightarrow{v}_{k}^{+}$ of forward pointwise joint mutual information of the nucleotides of RNA sequences to be encoded in the positive dataset D⁺ was determined according to the following formula:

$\overrightarrow{v}_{k}^{+} = \log \frac{\overrightarrow{f}_{x\overrightarrow{yz},k}^{+}}{f_{x,k}^{+}\,\overrightarrow{f}_{\overrightarrow{yz},k+\beta+1}^{+}}$

wherein x was the nucleotide at position k, x∈{A, C, G, U}; $\overrightarrow{y}$ was the nucleotide at position k+β+1, $\overrightarrow{y} \in \{A, C, G, U\}$; $\overrightarrow{z}$ was the nucleotide at position k+β+2, $\overrightarrow{z} \in \{A, C, G, U\}$; $\overrightarrow{f}_{x\overrightarrow{yz},k}^{+}$ was the occurrence frequency of trinucleotide $x\overrightarrow{yz}$ of all sequences of positive dataset D⁺; $\overrightarrow{f}_{\overrightarrow{yz},k+\beta+1}^{+}$ was the occurrence frequency of dinucleotide $\overrightarrow{yz}$ of all RNA sequences of positive dataset D⁺; and f_x,k⁺ was the occurrence frequency of nucleotide x of all sequences of positive dataset D⁺.

The value $\overleftarrow{v}_{k}^{+}$ of backward pointwise joint mutual information of nucleotides of RNA sequences to be encoded in the positive dataset D⁺ was determined according to the following formula:

$\overleftarrow{v}_{k}^{+} = \log \frac{\overleftarrow{f}_{x\overleftarrow{yz},k}^{+}}{f_{x,k}^{+}\,\overleftarrow{f}_{\overleftarrow{yz},k-\beta-1}^{+}}$

wherein $\overleftarrow{y}$ was the nucleotide at position k−β−1, $\overleftarrow{y} \in \{A, C, G, U\}$; $\overleftarrow{z}$ was the nucleotide at position k−β−2, $\overleftarrow{z} \in \{A, C, G, U\}$; $\overleftarrow{f}_{x\overleftarrow{yz},k}^{+}$ was the occurrence frequency of trinucleotide $x\overleftarrow{yz}$ of all RNA sequences of positive dataset D⁺; and $\overleftarrow{f}_{\overleftarrow{yz},k-\beta-1}^{+}$ was the occurrence frequency of dinucleotide $\overleftarrow{yz}$ of all RNA sequences of positive dataset D⁺.

The encoding value v_k⁺ of pointwise joint mutual information of the nucleotide at position k of an RNA sequence to be encoded in the positive dataset D⁺ was defined as the average of the value $\overrightarrow{v}_{k}^{+}$ of forward pointwise joint mutual information and the value $\overleftarrow{v}_{k}^{+}$ of backward pointwise joint mutual information. An RNA sequence with a length of l was encoded into a pointwise mutual information feature vector V⁺ with a length of l−2β−4:

$V^{+} = [v_{\beta+3}^{+}, v_{\beta+4}^{+}, \ldots, v_{k}^{+}]$

$v_{k}^{+} = \frac{\overrightarrow{v}_{k}^{+} + \overleftarrow{v}_{k}^{+}}{2}$

The value of l was 51 in this example.

(4.2) The value $\overrightarrow{v}_{k}^{-}$ of forward pointwise joint mutual information of nucleotides of RNA sequences to be encoded in the negative dataset D⁻ was determined according to the following formula:

$\overrightarrow{v}_{k}^{-} = \log \frac{\overrightarrow{f}_{x\overrightarrow{yz},k}^{-}}{f_{x,k}^{-}\,\overrightarrow{f}_{\overrightarrow{yz},k+\beta+1}^{-}}$

wherein x was the nucleotide at position k, x∈{A, C, G, U}; $\overrightarrow{y}$ was the nucleotide at position k+β+1, $\overrightarrow{y} \in \{A, C, G, U\}$; $\overrightarrow{z}$ was the nucleotide at position k+β+2, $\overrightarrow{z} \in \{A, C, G, U\}$; $\overrightarrow{f}_{x\overrightarrow{yz},k}^{-}$ was the occurrence frequency of trinucleotide $x\overrightarrow{yz}$ of all sequences of negative dataset D⁻; $\overrightarrow{f}_{\overrightarrow{yz},k+\beta+1}^{-}$ was the occurrence frequency of dinucleotide $\overrightarrow{yz}$ of all sequences of negative dataset D⁻; and f_x,k⁻ was the occurrence frequency of nucleotide x of all sequences of negative dataset D⁻.

The value $\overleftarrow{v}_{k}^{-}$ of backward pointwise joint mutual information of nucleotides of RNA sequences to be encoded in the negative dataset D⁻ was determined according to the following formula:

$\overleftarrow{v}_{k}^{-} = \log \frac{\overleftarrow{f}_{x\overleftarrow{y}\overleftarrow{z},\,k}^{-}}{f_{x,k}^{-}\,\overleftarrow{f}_{\overleftarrow{y}\overleftarrow{z},\,k-\beta-1}^{-}}$

wherein nucleotide x was at position k, nucleotide $\overleftarrow{y}$ was at position k−β−1, nucleotide $\overleftarrow{z}$ was at position k−β−2, $\overleftarrow{f}_{x\overleftarrow{y}\overleftarrow{z},\,k}^{-}$ was the occurrence frequency of trinucleotide $x\overleftarrow{y}\overleftarrow{z}$ of all RNA sequences of negative dataset D⁻, and $\overleftarrow{f}_{\overleftarrow{y}\overleftarrow{z},\,k-\beta-1}^{-}$ was the occurrence frequency of dinucleotide $\overleftarrow{y}\overleftarrow{z}$ of all sequences of negative dataset D⁻.

The encoding value v_k⁻ of pointwise joint mutual information of the nucleotide at position k of an RNA sequence to be encoded in the negative dataset D⁻ was defined as the average of the value $\overrightarrow{v}_{k}^{-}$ of forward pointwise joint mutual information and the value $\overleftarrow{v}_{k}^{-}$ of backward pointwise joint mutual information, and an RNA sequence with a length of l was encoded into a pointwise mutual information feature vector V⁻ with a length of l−2β−4:

$V^{-} = [v_{\beta + 3}^{-}, v_{\beta + 4}^{-}, \ldots, v_{k}^{-}]$

$v_{k}^{-} = \frac{\overrightarrow{v}_{k}^{-} + \overleftarrow{v}_{k}^{-}}{2}$

The value of l was 51 in this example.

(4.3) The feature vector V of an RNA sequence to be encoded with a given length l was determined by corresponding element of vector V⁺ minus that of V⁻:

V=[V_β+3, V_β+4, . . . , V_k]

V_k=v_k⁺−v_k⁻;
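Steps (4.1) to (4.3) can be illustrated with a minimal Python sketch. It follows one reading of the text, in which each sequence is scored against the frequencies of both D⁺ and D⁻ and the two scores are subtracted. The function names, the toy sequences, the natural logarithm and the pseudocount `eps` (used only to avoid taking the logarithm of zero) are assumptions for illustration and are not specified in the description.

```python
import math

def _freq(dataset, positions, letters, eps):
    """Fraction of sequences in `dataset` that carry `letters` at the given
    0-based `positions`. `eps` is a hypothetical pseudocount (assumption)."""
    hits = sum(all(s[p] == c for p, c in zip(positions, letters)) for s in dataset)
    return (hits + eps) / (len(dataset) + eps)

def pjmi(seq, dataset, k, beta, direction, eps=1e-6):
    """Pointwise joint mutual information at 1-based position k.
    direction=+1 uses positions k+beta+1 and k+beta+2 (forward),
    direction=-1 uses positions k-beta-1 and k-beta-2 (backward)."""
    i = k - 1                          # 0-based index of x
    j = i + direction * (beta + 1)     # 0-based index of y
    m = i + direction * (beta + 2)     # 0-based index of z
    x, y, z = seq[i], seq[j], seq[m]
    f_xyz = _freq(dataset, (i, j, m), (x, y, z), eps)
    f_x = _freq(dataset, (i,), (x,), eps)
    f_yz = _freq(dataset, (j, m), (y, z), eps)
    return math.log(f_xyz / (f_x * f_yz))   # natural log is an assumption

def encode(seq, pos, neg, beta):
    """Feature vector V with V_k = v_k+ - v_k- for k = beta+3 .. l-beta-2."""
    l = len(seq)
    features = []
    for k in range(beta + 3, l - beta - 1):
        v_pos = (pjmi(seq, pos, k, beta, +1) + pjmi(seq, pos, k, beta, -1)) / 2
        v_neg = (pjmi(seq, neg, k, beta, +1) + pjmi(seq, neg, k, beta, -1)) / 2
        features.append(v_pos - v_neg)
    return features

# Toy data (hypothetical, l = 7); the example in the text uses l = 51.
pos = ["ACGUACG", "ACGUACG", "UCGAACG"]
neg = ["GGAUCCA", "ACGUACG", "CAGUUGC"]
V0 = encode(pos[0], pos, neg, beta=0)   # l - 2*beta - 4 = 3 elements
```

With β=0 and l=7 the loop visits k=3, 4, 5, which matches the stated vector length l−2β−4.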

(5) Concatenating features

When the value of parameter β was 0, the feature vector V(0) was [V₃, V₄, V₅, . . . , V_l−3, V_l−2], and the number of elements was l−4; when the value of β was 1, the feature vector V(1) was [V₄, V₅, V₆, . . . , V_l−4, V_l−3], and the number of elements was l−6; . . . ; when the value of β was (l−7)/2, the feature vector V((l−7)/2) was [V_(l−1)/2, V_(l+1)/2, V_(l+3)/2], and the number of elements was 3; when the value of β was (l−5)/2, the feature vector V((l−5)/2) was [V_(l+1)/2], and the number of elements was 1. The feature vectors determined by the different values of the parameter β were concatenated into a high-dimensional feature vector [V(0), V(1), . . . , V((l−7)/2), V((l−5)/2)] with (l−3)²/4 elements. The value of l was 51 in this example.
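The element count of the concatenated vector can be checked with a short Python sketch. It enumerates, for every β from 0 to (l−5)/2, the first and last position index of V(β) and sums the lengths. The helper names are hypothetical, and an odd sequence length l is assumed, as in this example.

```python
def beta_index_ranges(l):
    """Yield (beta, first_k, last_k) for beta = 0 .. (l-5)/2, assuming odd l.
    V(beta) covers the 1-based positions beta+3 .. l-beta-2."""
    for beta in range((l - 5) // 2 + 1):
        yield beta, beta + 3, l - beta - 2

def concatenate(vectors_by_beta):
    """Flatten [V(0), V(1), ...] into one high-dimensional feature vector."""
    return [value for vector in vectors_by_beta for value in vector]

l = 51
ranges = list(beta_index_ranges(l))
sizes = [last - first + 1 for _, first, last in ranges]
total = sum(sizes)   # equals (l - 3)**2 // 4
```

For l=51 there are 24 values of β, the per-β lengths run 47, 45, . . . , 3, 1, and their sum is 576, in agreement with (l−3)²/4.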

(6) Encoding the RNA sequences

The RNA sequence dataset D was encoded into a numerical dataset D′ by adopting the above steps (1) to (5),

$D^{'} \in \mathbb{R}^{s \times \frac{(l - 3)^{2}}{4}},$

where s was the number of samples of the numerical dataset D′ and was a finite positive integer (2614 in this example), and (l−3)²/4 was the number of features of the numerical dataset D′. The encoding of the RNA sequences was thus completed.

The RNA sequence encoding method of Example 2 was compared with seven encoding methods, namely PSNP (position-specific nucleotide propensities), PSDP (position-specific dinucleotide propensities), KNF (K-nucleotide frequencies), KSNPF (K-spaced nucleotide pair frequencies), NPPS (nucleotide pair position specificity), PBE (positional binary encoding) and NCPNC (nucleotide chemical property and nucleotide composition), for identifying the RNA N⁶-methyladenosine sites in Saccharomyces cerevisiae RNA sequences. The comparison was based on the performance of support vector machine models constructed using each encoding method. The average classification accuracy, sensitivity, specificity, MCC (Matthews correlation coefficient), AUROC (area under the receiver operating characteristic curve) and AUPRC (area under the precision-recall curve) over 10-fold cross-validation were used to evaluate each method. The experimental method was as follows:

1. The RNA sequences of N⁶-methyladenosine of Saccharomyces cerevisiae were encoded according to the method of Example 2;

2. Normalizing the dataset

The numerical dataset D′ was normalized by the maximum-minimum method according to the following formula:

$g_{m, n}^{'} = \frac{g_{m, n} - \min (g_{n})}{\max (g_{n}) - \min (g_{n})}$

wherein g_m,n was the n-th feature value of the m-th sample of the numerical dataset D′, g′_m,n was the normalized value of g_m,n, max(g_n) and min(g_n) were the maximum and minimum feature values of the n-th column of the numerical dataset D′, 1≤m≤s, 1≤n≤(l−3)²/4, and m and n were finite positive integers. In this example, the value of l was 51 and the value of s was 2614.
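The maximum-minimum normalization above can be sketched in plain Python as follows. The function name is hypothetical, and mapping a constant column (max equal to min) to 0 is an assumption made only to avoid division by zero; the description does not address that case.

```python
def min_max_normalize(rows):
    """Column-wise min-max scaling of a list of equal-length feature rows."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (g - lo) / (hi - lo) if hi > lo else 0.0  # constant column: assumption
            for g, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]

# Toy 3-sample, 2-feature dataset (hypothetical values).
scaled = min_max_normalize([[1.0, 5.0], [3.0, 5.0], [2.0, 5.0]])
```

Each column of the result lies in the interval [0, 1], with the column minimum mapped to 0 and the column maximum mapped to 1.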

3. Partitioning dataset

The normalized numerical dataset D′ was partitioned into 10 folds by using the K-fold cross-validation method (K=10). In each run, one fold was taken as the test dataset D′_Te and the remaining nine folds were taken as the training dataset D′_Tr, until each fold had been taken as the test dataset once, so there were 10 runs in total. The ratio of the training dataset D′_Tr to the test dataset D′_Te in each run was 9:1.
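The partitioning can be sketched in plain Python as below. The description does not state whether the samples were shuffled or stratified by class, so the seeded shuffle and the function name are assumptions for illustration.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k runs.
    The seeded shuffle is an assumption; the text only fixes K = 10."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k nearly equal folds
    for run in range(k):
        test = folds[run]
        train = [j for f, fold in enumerate(folds) if f != run for j in fold]
        yield train, test

# s = 2614 samples as in this example.
splits = list(k_fold_indices(2614, k=10, seed=0))
```

With s=2614 each test fold holds 261 or 262 samples, so the 9:1 ratio holds up to rounding.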

4. Training and testing the model

The support vector machine model was trained using the training dataset D′_Tr, and the performance of the support vector machine model was tested on the test dataset D′_Te.
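The threshold-based evaluation criteria used in the comparison (accuracy, sensitivity, specificity and MCC) can be computed from the predictions of one run with the standard confusion-matrix definitions, sketched below. The function name, the label coding (1 for a methylation site, 0 otherwise) and the zero fallbacks for empty denominators are assumptions; AUROC and AUPRC are not covered by this sketch.

```python
import math

def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and MCC for 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,   # fallback: assumption
        "specificity": tn / (tn + fp) if tn + fp else 0.0,   # fallback: assumption
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

# Toy predictions (hypothetical): TP=2, FN=1, TN=2, FP=1.
metrics = binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
```

Averaging these values over the 10 runs gives per-method figures of the kind reported for the comparison.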

The RNA N⁶-methyladenosine sites in the Saccharomyces cerevisiae RNA sequences were identified by performing the same operation on the seven compared RNA sequence encoding methods according to steps 2-4 of the experimental methods. The experimental results of classification accuracy, sensitivity, specificity and MCC were shown in Table 2, the experimental results of AUROC were shown in FIG. 4, and the experimental results of AUPRC were shown in FIG. 5.

TABLE 2

Comparison of experimental results between the method of Example 2 and the other seven methods (evaluation criteria: accuracy, sensitivity, specificity and MCC)

| Encoding method | Accuracy | Sensitivity | Specificity | MCC |
|---|---|---|---|---|
| The present invention | 0.995 | 0.996 | 0.994 | 0.990 |
| PSNP | 0.747 | 0.751 | 0.743 | 0.495 |
| PSDP | 0.766 | 0.764 | 0.769 | 0.534 |
| KNF | 0.692 | 0.741 | 0.643 | 0.387 |
| KSNPF | 0.651 | 0.712 | 0.591 | 0.307 |
| NPPS | 0.874 | 0.884 | 0.864 | 0.749 |
| PBE | 0.727 | 0.727 | 0.728 | 0.456 |
| NCPNC | 0.731 | 0.735 | 0.726 | 0.463 |

As shown in Table 2, the accuracy, sensitivity, specificity and MCC for identifying the RNA N⁶-methyladenosine sites in Saccharomyces cerevisiae RNA sequences through the support vector machine model constructed based on the RNA sequence encoding method of the present disclosure were 0.995, 0.996, 0.994 and 0.990, respectively, which were much higher than those of the other seven compared encoding methods.

As shown in FIG. 4, the value of AUROC for identifying the RNA N⁶-methyladenosine sites in Saccharomyces cerevisiae RNA sequences through the support vector machine model constructed based on the RNA sequence encoding method of the present disclosure was the maximum value of 1, which was much higher than that of the other seven compared encoding methods.

As shown in FIG. 5, the value of AUPRC for identifying the RNA N⁶-methyladenosine sites in Saccharomyces cerevisiae RNA sequences through the support vector machine model constructed based on the RNA sequence encoding method of the present disclosure was the maximum value of 1, which was much higher than that of the other seven compared encoding methods.
