The instant application contains a Sequence Listing which has been submitted electronically in XML file format and is hereby incorporated by reference in its entirety. Said XML copy, created on Jun. 8, 2023, is named 133582_sequencing-listing and is 5,669 bytes in size.
The present disclosure belongs to the field of site prediction, and mainly relates to a predicting method of transcription factor binding sites, in particular to a predicting method of transcription factor binding sites based on weighted multi-granularity scanning.
In eukaryotes, gene expression is regulated by many regulatory factors. The regulation and control of genes in organisms is referred to as gene expression regulation. The regulation of gene expression has a far-reaching influence on the adaptation of organisms to environmental changes and on their self-regulation. In eukaryotes, both the timing of transcription and the rate of the transcription process can control gene expression, so that transcription regulation is closely related to gene expression regulation. Transcription factors, a special class of DNA-binding proteins, can bind to a DNA template strand and thereby regulate the transcription process. Transcription factors participate in different biological processes at all stages of life activities. Processes such as the proliferation, growth, differentiation and apoptosis of cells are inseparable from the regulatory effect of transcription factors. Abnormal function of transcription factors leads to abnormal life activities, and in turn to a variety of diseases. For example, common nervous system diseases, coronary heart disease, diabetes, hypertension and even cancer are closely related to changes in transcription factors.
Transcription factor binding sites are sites on DNA sequences that bind to transcription factors, and most of the sites are located on promoters upstream of DNA sequences. The study of transcription factor binding sites is helpful for studying a series of diseases caused by site mutation. In some cancer treatments, transcription factor binding sites are also commonly used as effective drug targets, which is of great significance for the research, development and innovation of drugs. At present, methods for predicting transcription factor binding sites generally have the following defects: the prediction accuracy is unsatisfactory; or the accuracy is high but the prediction experiment takes a long time; and the accuracy is unsatisfactory for small data sets. As a result, the current demand for site prediction cannot be satisfied. Therefore, the existing methods need to be improved.
Aiming at the defects of the existing predicting methods of transcription factor binding sites, the present disclosure provides a predicting method of transcription factor binding sites based on weighted multi-granularity scanning, referred to as TF_DF. TF_DF uses a combined feature representation method to better characterize the potential features of DNA sequences, and combines the weighted multi-granularity scanning method with the cascade forest technology so that the model pays more attention to the important features during training, which improves the accuracy of the prediction results. The purpose is to solve the problems that the prediction accuracy is not high and the model training time is long in the current predicting methods of transcription factor binding sites.
The method comprises the following steps:
to obtain a feature vector F1, and then combining with multi-base feature coding for feature representation to obtain a feature vector F2, splicing the feature vectors F1 and F2 to obtain a combined feature representation F, and encoding the result category with the formula
where $d$ is the total number of features, $\mathrm{Score}_i$ is the importance score of the i-th column of features in the weight vector $W$, and the specific calculation formula is as follows:
$$\mathrm{Score}_i = \sum_{t=1}^{T} \mathrm{Score}_{node}(t)$$
where $\mathrm{Score}_{node}(t)$ is the importance score of the t-th decision tree node, and the specific calculation formula is as follows:
$$\mathrm{Score}_{node} = G_{node} - G_{node,0} - G_{node,1}$$
where $G_{node,0}$ and $G_{node,1}$ represent the Gini index of the nodes belonging to category 0 under the node branch and the Gini index of the nodes belonging to category 1 under the node branch, respectively;
$G_{node}$ is the Gini index of each node, and the specific formula is as follows:
where $N$ is the number of samples in the training set $D_{train}$, $N_{node,0}$ is the number of nodes belonging to category 0, and $N_{node,1}$ is the number of nodes belonging to category 1;
Preferably, in the multi-base feature coding method, the number L of feature columns is obtained according to the formula $L = 4^m$, where m is the number of bases in each multi-base unit, m has a value of 3, bases A, T, C and G form a sequence set C whose elements have a length of 3 bp: {‘AAA’, ‘AAT’, ‘AAG’, ‘AAC’, ‘ATA’, ‘ATT’, ‘ATG’, ‘ATC’, ‘AGA’, ‘AGT’, ‘AGG’, ‘AGC’, ‘ACA’, ‘ACT’, ‘ACG’, ‘ACC’, ‘TAA’, ‘TAT’, ‘TAG’, ‘TAC’, ‘TTA’, ‘TTT’, ‘TTG’, ‘TTC’, ‘TGA’, ‘TGT’, ‘TGG’, ‘TGC’, ‘TCA’, ‘TCT’, ‘TCG’, ‘TCC’, ‘GAA’, ‘GAT’, ‘GAG’, ‘GAC’, ‘GTA’, ‘GTT’, ‘GTG’, ‘GTC’, ‘GGA’, ‘GGT’, ‘GGG’, ‘GGC’, ‘GCA’, ‘GCT’, ‘GCG’, ‘GCC’, ‘CAA’, ‘CAT’, ‘CAG’, ‘CAC’, ‘CTA’, ‘CTT’, ‘CTG’, ‘CTC’, ‘CGA’, ‘CGT’, ‘CGG’, ‘CGC’, ‘CCA’, ‘CCT’, ‘CCG’, ‘CCC’}, each element in set C is set as a feature column, there are 64 feature columns in total, and each element is the feature name of its feature column;
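As an illustration of the multi-base feature columns described above, the following is a minimal Python sketch that enumerates the 64 feature names for m = 3 in the same A, T, G, C order as set C. The function names `kmer_columns` and `kmer_counts` are hypothetical, and filling each column with the count of overlapping 3-base windows is an assumption made for illustration; the text above only specifies the feature names, not how each column value is computed.

```python
from itertools import product

BASES = "ATGC"  # base order that reproduces the listing of set C above


def kmer_columns(m: int = 3) -> list[str]:
    """Return the 4**m feature names, 'AAA', 'AAT', ..., 'CCC' for m = 3."""
    return ["".join(p) for p in product(BASES, repeat=m)]


def kmer_counts(seq: str, m: int = 3) -> list[int]:
    """Count overlapping m-base windows of seq, one value per feature column.

    Assumption: the column value is an occurrence count; the source does not
    state the exact encoding of each column.
    """
    columns = kmer_columns(m)
    index = {name: i for i, name in enumerate(columns)}
    counts = [0] * len(columns)
    for start in range(len(seq) - m + 1):
        counts[index[seq[start:start + m]]] += 1
    return counts


columns = kmer_columns(3)
features = kmer_counts("GGGGCGGGGCCGGC")  # 14-base example sequence
print(len(columns))   # 64 feature columns, i.e. 4**3
print(sum(features))  # 12 overlapping windows in a 14-base sequence
```

A 14-base sequence contributes 14 - 3 + 1 = 12 windows, so under this assumed encoding each row of the 64-column block sums to 12.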
Preferably, in step (3), Q has a value of 4, and R has a value of 1.
Preferably, in step (4), T has a value of 462, and the maximum depth of the tree is 11.
Preferably, in step (5), μ has a value of 50, and L has a value of 1.
In order to clearly illustrate the technical scheme of the present disclosure, the present disclosure will be described in conjunction with
It should be pointed out that the following detailed description is exemplary and is intended to provide further explanation of the present disclosure. Unless otherwise specified, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the art to which the present disclosure belongs.
The input of the TF_DF method is a CSV-type file. The file Raw data.csv contains 1200 positive samples and 1200 negative samples of the transcription factor SP1 binding sites of human chromosome 1, which form the original data set D. Each piece of data contains a DNA sequence with a length of 14 bases and its corresponding category (binding site or non-binding site), and the initial data is preprocessed on the basis of this data set. The output of the TF_DF method consists of a CSV-type file and an output-type file. The file sequence feature.csv is the data set D* obtained by data preprocessing. The file TF_classification.output contains the prediction category of each site in the test set output by the TF_DF method. The output of the TF_DF method thus indicates whether each DNA sequence is predicted by the method to be a transcription factor binding site.
The TF_DF predicting method can be specifically divided into the following steps.
In this embodiment, the data set D={D1, D2, . . . , Dn} of the transcription factor SP1 binding sites of human chromosome 1 is preprocessed. Taking the small amount of data into account, it is necessary to carry out data augmentation on the data set first. According to the sequence features of DNA binding sites, the inverse sequence, the complementary sequence and the complementary inverse sequence of each DNA sequence are found, and the number of positive samples and the number of negative samples both expand to 4800 (
to obtain a feature vector F1 (
that is, the result category is the transcription factor binding sites.
In this embodiment, the data set D* after data preprocessing contains 4800 positive samples and 4800 negative samples, and each piece of sample data contains 120 feature items and 1 result feature category. The positive and negative samples are shuffled and mixed.
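The data augmentation used in this embodiment (inverse sequence, complementary sequence and complementary inverse sequence) can be sketched as follows. This is a minimal illustration, not the implementation of the disclosure: the function name `augment` is hypothetical, "inverse" is taken to mean the reversed string, and the complement uses the standard A/T and G/C pairing.

```python
# Standard Watson-Crick pairing: A<->T, G<->C
COMPLEMENT = str.maketrans("ATGC", "TACG")


def augment(seq: str) -> list[str]:
    """Return the original sequence plus its three derived sequences."""
    inverse = seq[::-1]                          # inverse (reversed) sequence
    complementary = seq.translate(COMPLEMENT)    # complementary sequence
    complementary_inverse = complementary[::-1]  # complementary inverse sequence
    return [seq, inverse, complementary, complementary_inverse]


variants = augment("GGGGCGGGGCCGGC")
for v in variants:
    print(v)

# Each of the 1200 samples per class yields 4 sequences.
print(1200 * len(variants))  # 4800
```

Because every sequence yields four sequences, 1200 samples per class become 4800 per class, which matches the counts stated for D*. No de-duplication is applied in this sketch.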
The data set D* after feature representation in step (2) is divided according to the ratio 4:1 of the number of samples in the training set to the number of samples in the test set to obtain a training set Dtrain and a test set Dtest; the number of samples in the training set Dtrain and the number of samples in the test set Dtest are 7680 and 1920 after the data set is divided in this example, respectively.
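The 4:1 division can be checked with a short Python sketch. The helper name `split_4_to_1` and the fixed random seed are assumptions for illustration; the source does not state how the samples are assigned to the two sets beyond the ratio.

```python
import random


def split_4_to_1(samples, seed=0):
    """Shuffle the samples and split them 4:1 into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = len(shuffled) * 4 // 5           # 4/5 of the data goes to training
    return shuffled[:cut], shuffled[cut:]


dataset = list(range(9600))  # stand-in for the 4800 + 4800 samples of D*
train, test = split_4_to_1(dataset)
print(len(train), len(test))  # 7680 1920
```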
462 decision trees are used to calculate the weight vector $W$ of the training set $D_{train}$. According to the Gini index formula, the Gini index $G_{node}$ of each node is calculated, where $N$ is the number of samples in the training set $D_{train}$, $N_{node,0}$ is the number of nodes belonging to category 0, and $N_{node,1}$ is the number of nodes belonging to category 1. According to the formula $\mathrm{Score}_{node} = G_{node} - G_{node,0} - G_{node,1}$, the importance score $\mathrm{Score}_{node}$ of each node is calculated, where $G_{node,0}$ and $G_{node,1}$ represent the Gini index of the nodes belonging to category 0 under the node branch and the Gini index of the nodes belonging to category 1 under the node branch, respectively. According to the formula $\mathrm{Score}_i = \sum_{t=1}^{T} \mathrm{Score}_{node}(t)$, the importance score $\mathrm{Score}_i$ of the i-th column of features is calculated, where $T$ is the number of decision trees. According to the weight formula, the weight $W_i$ of each feature is calculated, where $\mathrm{Score}_i$ is the importance score of the i-th column of features, and $d$ is the total number of features.
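The node-score and weight calculation can be illustrated with a small Python sketch. Several parts are assumptions, since the corresponding formulas are not reproduced in the text: the Gini index is taken to be the binary Gini impurity scaled by the fraction of training samples reaching the node, the two subtracted terms are taken to be the Gini indexes of the two child branches, and the weights are taken to be the importance scores normalized to sum to 1. All function names are hypothetical.

```python
def weighted_gini(n0: int, n1: int, n_total: int) -> float:
    """Assumed Gini index of a node holding n0 class-0 and n1 class-1 samples.

    Binary Gini impurity, scaled by the share of all n_total training
    samples that reach the node (an assumption, not taken from the source).
    """
    n = n0 + n1
    if n == 0:
        return 0.0
    impurity = 1.0 - (n0 / n) ** 2 - (n1 / n) ** 2
    return (n / n_total) * impurity


def node_score(parent, branch0, branch1, n_total: int) -> float:
    """Score_node = G_node - G_node,0 - G_node,1 for one split node."""
    return (weighted_gini(*parent, n_total)
            - weighted_gini(*branch0, n_total)
            - weighted_gini(*branch1, n_total))


def normalize(scores):
    """Assumed weight formula: each score divided by the sum of all scores."""
    total = sum(scores)
    return [s / total for s in scores]


# Toy node: 100 samples (50 per class) split into (40, 10) and (10, 40).
score = node_score((50, 50), (40, 10), (10, 40), n_total=100)

# Toy importance scores for three feature columns; the last two are made up.
weights = normalize([score, 0.06, 0.06])
print(round(score, 4), [round(w, 4) for w in weights])
```

In the embodiment the per-node scores would be accumulated per feature column over the 462 trees before normalization; the toy values here only show the arithmetic of a single node.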
In this example, the top 10 features by weight and their corresponding weights are as follows:
As shown in
F* is input into the cascade forest, and the model training is carried out to obtain a classification prediction model of transcription factor binding sites. The test set Dtest is input into the classification prediction model to verify the performance of the model.
Take the DNA sequence “(SEQ ID NO:5) GGGGCGGGGCCGGC” as an example. The final classification prediction result of this DNA sequence is ‘1’, that is, the sequence is predicted to be a transcription factor binding site.
The performance of the method is evaluated using five-fold cross-validation and three evaluation indexes. The accuracy of the method and the F1 value are calculated with the formula $\mathrm{Accuracy} = \frac{a}{b}$ and the formula $F1 = \frac{2pr}{p + r}$, respectively, where a is the number of samples in which the predicted classification results are the same as the actual classification results, and b is the number of samples in the test set $D_{test}$. The p value and the r value are calculated with the formula $p = \frac{TP}{TP + FP}$ and the formula $r = \frac{TP}{TP + FN}$, respectively, where TP is the number of data points in which the predicted classification result is the transcription factor binding site and the actual classification result is the transcription factor binding site, FP is the number of data points in which the predicted classification result is the transcription factor binding site but the actual classification result is the non-transcription factor binding site, and FN is the number of data points in which the predicted classification result is the non-transcription factor binding site but the actual classification result is the transcription factor binding site. Accuracy can be regarded as the proportion of correct algorithm outputs, with a value in the range [0,1]. The closer the accuracy is to 1, the larger the number of correctly predicted samples, and the closer the accuracy is to 0, the smaller the number of correctly predicted samples. A higher F1 value indicates that the algorithm is closer to the ideal state. The AUC value is the area under the ROC curve, which reflects the ability of the model more objectively. Generally speaking, the higher the AUC value, the stronger the performance of the algorithm. According to the above calculation formulas, the accuracy, the F1 value and the AUC on the test set $D_{test}$ are 0.8943, 0.8920 and 0.9219, respectively.
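The evaluation indexes defined above can be computed as in the following minimal Python sketch. It assumes the standard definitions of precision (p) and recall (r) in terms of TP, FP and FN, and F1 as their harmonic mean; the function name `evaluate` and the toy labels are hypothetical and are not results of the disclosure. AUC is omitted because it requires predicted scores rather than hard labels.

```python
def evaluate(y_true, y_pred):
    """Return accuracy, p (precision), r (recall) and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted site, is site
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted site, is not
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed site
    a = sum(1 for t, p in pairs if t == p)              # correct predictions
    b = len(pairs)                                      # test set size
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": a / b, "p": precision, "r": recall, "f1": f1}


# Toy labels: 1 = binding site, 0 = non-binding site.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = evaluate(y_true, y_pred)
print(metrics)
```

In this toy case TP = 3, FP = 1 and FN = 1, with 6 of 8 predictions correct, so all four values equal 0.75.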
The features of a single base are important for identifying TFBSs in a DNA sequence, and the bases adjacent to each base can also be important. In order to verify this idea, the single-base feature representation is compared on several models with the feature representation that additionally combines the multi-base feature coding method.
The experimental results (
The data set D* is input into the TF_DF method for model training after being divided, so as to realize the high-accuracy prediction of each point in the prediction set. 15 experiments have been carried out on all the proposed classification algorithms. In order to ensure a fair comparison, the same training data and test data are used in each experiment, and the parameter settings of each model are also the same. The following table shows the average results of 15 experiments of KNN, Adaboost, Random Forest, LightGBM, Deep Forest and TF_DF method.
In contrast, the accuracy, the F1 value and the AUC of the TF_DF method are 89.43%, 89.20% and 92.19%, respectively, which are higher than those of the other classification algorithms to varying degrees. This shows that the TF_DF method has a higher prediction ability. From the comparison of the experimental results, it can be concluded that the TF_DF method designed by the present disclosure improves the accuracy and overall performance of the classifier. That is to say, the TF_DF method performs better than the compared classification algorithms in the classification and prediction of transcription factor binding sites.
Compared with the method in the prior art, the method has the following beneficial effects.
The TF_DF method realizes highly accurate prediction of transcription factor binding sites, especially the site prediction for small data sets. The method abandons the idea of a single-base feature and combines multi-base feature coding to extract the features of each base context, which improves the accuracy rate of classification and prediction results. At the same time, based on the idea of different importance of features, the multi-granularity scanning is optimized to obtain better performance, and the cascade forest is used to train and predict the model. Compared with the existing predicting method of transcription factor binding sites, the present disclosure has higher efficiency and accuracy, and has better robustness and portability.
Finally, it should be explained that the above is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure. Although the present disclosure has been described in detail with reference to the aforementioned embodiments, those skilled in the art can still modify the technical scheme described in the aforementioned embodiments or replace some technical features equivalently. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
Number | Date | Country | Kind |
---|---|---|---|
202210535743.3 | May 2022 | CN | national |
This application claims the priority benefit of China application serial no. 202210535743.3, filed on May 18, 2022. The entirety of the above-mentioned patent application is hereby incorporated by reference herein and made a part of this specification.