1. Field of the Invention
The present invention relates to a method for identifying protein binding sites on genomic DNA, using a DNA chip.
2. Background Art
Gene expression is regulated by the sequence-specific binding of a plurality of proteins to genomic DNA. Examining gene expression regulation is equivalent to exploring the mechanisms of development, differentiation, growth, and aging inherent in living organisms. For example, it is known that the binding of a transcription factor called an “activator” (a protein) to a genomic sequence performs an important function for the maintenance of life. Generally, transcription factors are said to bind to a promoter region located upstream of a gene to activate the expression of the gene downstream. However, transcription factors do not always bind to an upstream region of a gene, and it has been reported that they also bind to regions called enhancers or silencers several hundred or a thousand bases away and perform regulation.
Thanks to the development of DNA chips, methods have been established in recent years that allow gene expression of various life forms to be examined in a comprehensive manner. In a hybridization technique, the cDNA sequence or oligonucleotides as part of a gene (coding region) are prepared and spotted on a slide glass as a probe in a dense manner. mRNA is collected from a specific living species and added to the probe as a target, whereby hybridization occurs. Such hybridization techniques now enable the measurement of transcript levels in a comprehensive manner.
Patent Document 1 discloses a method for analyzing the interaction of protein and DNA using the aforementioned technique. In this method, DNA reproducing a noncoding region (intergenic region) is spotted on a slide glass, thereby preparing a DNA chip having the noncoding region as a probe. Meanwhile, protein and DNA are cross-linked in vivo using formaldehyde, and after the cells have been destroyed, chromatin immunoprecipitation is conducted using an antibody that recognizes a specific transcription factor (protein). As a result, a DNA fragment including a noncoding region to which the specific transcription factor can bind can be obtained as a target. The DNA fragment is labeled with a fluorescent dye and caused to hybridize to a DNA chip in which the noncoding region is disposed as a probe. In this way, the genes controlled by the transcription factor can be examined comprehensively.
Patent Document 2 discloses a method for identifying a protein-binding region using a DNA chip in which a coding region (gene) is spotted as a probe. In this method, DNA including a noncoding region and some of the coding regions on either side of the noncoding region is used as a target. The method enables the determination of the presence of a protein-binding region between the two adjacent coding regions. The method enables the detection of a DNA-protein binding region with little experimental error without requiring the renewed preparation of a DNA chip in which a noncoding region is used as a probe.
Various other methods for identifying protein-binding regions on DNA using computers are also being studied. Algorithms have also been developed that predict binding regions by sequence analysis based on statistical approaches.
Patent Document 3 discloses a program for identifying a protein-binding region based on a database in which information about known transcription factors is stored and a sequence database, using an algorithm with high computational efficiency. Binding sequences of transcription factors are short and recognized to be from 6 to 15 bases long. When such short sequences are detected from the entire noncoding regions of a genome, numerous false positives are found. Reducing these false positives is therefore important in the prediction of protein-binding regions. The publication discloses that statistically significant binding sequences are predicted using known information in a transcription factor database called TRANSFAC.
In the example of Patent Document 3, statistical prediction is made using only computers. Although the method reflects existing data, it cannot confirm the actual protein-binding regions in a comprehensive manner.
Promoter regions where RNA polymerases bind, or gene expression regions where transcriptional control factors (proteins) bind, are referred to as cis-elements. In Patent Documents 1 and 2, cis-elements cannot be accurately identified; only binding regions are estimated.
Application of DNA chip technology allows for the narrowing of gene regulating regions in a comprehensive manner. However, no system has so far been devised that is equipped to find cis-elements from experimental data and analyze them accurately.
It is therefore an object of the invention to solve the aforementioned problems and provide a method for detecting DNA-protein binding regions using a DNA chip in which a noncoding or coding region is spotted.
The invention provides a method for determining a protein binding site on a genomic DNA, comprising:
In accordance with the invention, the position on a genomic DNA to which a specific protein binds, namely, a noncoding region, can be identified. Detecting genes controlled by a certain transcription factor (protein) in a comprehensive manner can lead not only to the clarification of changes in the expression of known control genes but also to the discovery of unknown gene regulation mechanisms. Further, by detecting specific DNA sequences recognized by specific proteins, the possibility of sequence-specific binding increases, so that light can be shed on transcription regulation of a variety of unknown genes.
Initially, a method for determining hybridization intensity, namely, fluorescent intensity, by a hybridization experiment using a DNA chip or DNA microarray is described. Hereafter, a DNA fragment affixed to the DNA chip (spotted on a slide glass) will be referred to as a probe, and a DNA fragment to be hybridized to the probe will be referred to as a target.
In the present example, fluorescent intensity data is acquired using two conventional methods. One method is the method disclosed in Patent Document 1 and is shown in
As a probe, the noncoding regions A1, A2, and A3 of the genomic DNA are used. A DNA chip 20 is prepared by spotting the noncoding regions A1, A2, and A3 on a slide glass. Then, a DNA binding protein X12 from a specific organism 11 is bound to the genomic DNA by cross-linking. The genomic DNA is then disintegrated with an ultrasonic disintegrator. The resultant DNA fragments are extracted using an antibody 13 that specifically recognizes the protein X. Thereafter the cross-linking is removed, and the DNA binding protein X is separated from the DNA fragments. The thus separated DNA fragments are labeled with a fluorescent dye, thereby obtaining a target. The DNA fragments of the target include the noncoding regions A1, A2, and A3 that can bind to the protein X12. The target is then hybridized to the probe of the DNA chip, whereby fluorescent intensity data can be obtained. Based on this fluorescent intensity data, the binding site of the protein X is determined, as will be described below.
The other is the method disclosed in Patent Document 2 and is schematically shown in
In the present example, a control test was also conducted, as shown in
With reference to
The chip database 201 stores fluorescent intensity (hybridization intensity) data obtained by the two methods described with reference to
The program memory 205 includes: a program 206 for displaying the protein binding site on the genomic DNA; a program 207 for displaying a list of the sequences of protein binding sites on the genomic DNA; a program 208 for retrieving cis elements and displaying them in a list; a program 209 for detecting and displaying the frequency of appearance of specific cis elements in a designated sequence; and a program 210 for determining whether a specific cis element is a false positive and displaying the result of determination.
The cis-element search program 208 utilizes MEME (Multiple EM for Motif Elicitation) based on a conventional EM algorithm. The cis-element false-positive determination program 210 determines a particular cis element to be a false positive if it appears in sequences with lower fluorescent intensity when cis elements appear in a plurality of sequences. A false positive herein means that the possibility of a specific protein specifically recognizing the cis element in the genome sequence is small. If a cis element appears in a single sequence a number of times, the program determines it to be a positive, thereby determining that the possibility of a specific protein controlling the sequence is high.
“Expression data about each probe” 303, which shows the fluorescent intensity (hybridization intensity) of each probe mounted on the DNA chip, is experimental data entered by the user. “Sequence of each probe” 304 is the base sequence of each probe, and it is experimental data entered by the user. “Detailed information about each probe” 305 is an annotation to a particular gene when the probe is a coding region. It is entered by the user as required.
“Displayed result of protein binding site” 313 shows the results of executing the program 206 for displaying protein-binding sites. “Displayed result of protein binding site sequence” 314 shows the result of executing the program 207 for displaying protein-binding site sequences. “Result of cis element search” 315 shows the result of executing the cis-element search program 208. “Displayed result of detecting the frequency of appearance of cis elements” 316 shows the result of executing the program 209 for detecting and displaying cis-element frequencies and that of the program 210 for determining cis-element false positives. The results of executing these programs are entered by the user as required.
With reference to
After the system of the example is started up, the user, at step 500, clicks the button 401 for data entry and enters fluorescent intensity data. The user further enters various data shown in
When the portions in the genome sequence where a given protein X has bound are to be visually displayed, the user clicks the binding site display button 403 at step 502, whereby the protein binding site display program 206 is executed. When a list of sequences in the genome sequence to which a given protein X has bound is to be displayed, the user, at step 503, clicks the binding site sequence display button 404, whereby the protein binding site sequence display program 207 is executed. Thus, the sites in the genome sequence where the protein X has bound can be specified.
When examining which cis element in the binding sites has been recognized by the protein X when it bound to the binding sites, the user clicks the cis element search button 405 at step 504. As a result, the cis element search program 208 is executed, and short sequences (cis elements) that appear commonly in the binding sites are retrieved.
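The idea of retrieving short sequences that appear commonly in the binding sites can be illustrated with a much simpler stand-in than MEME. The following is a minimal sketch that assumes exact k-mer matching rather than the EM-based motif model the program 208 relies on; the function name `shared_kmers`, its parameters, and the example sequences are hypothetical and are not taken from the disclosure.

```python
from collections import Counter


def shared_kmers(sequences, k, min_fraction=1.0):
    """Return the k-mers found in at least `min_fraction` of the sequences.

    Illustrative simplification only: exact matching of fixed-length
    words, not the EM-based motif search performed by MEME.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # A set ensures each sequence contributes at most once per k-mer.
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmers)
    needed = min_fraction * len(sequences)
    return sorted(kmer for kmer, n in counts.items() if n >= needed)


# Hypothetical binding-site sequences sharing one 6-base word.
binding_sites = ["TTGACGTCAA", "GGTGACGTAC", "ATGACGTTTT"]
common = shared_kmers(binding_sites, k=6)
print(common)
```

Lowering `min_fraction` in this sketch loosely corresponds to allowing a motif to be absent from some of the searched sequences.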
Further, when the frequency of appearance of a specific cis element in the genome sequence is to be examined from the cis element search result, the user, at step 505, clicks the button 406 for detecting and displaying the frequency of appearance of cis element, whereby the cis element appearance frequency detecting and displaying program 209 is executed. When determining whether a specific cis element is a false positive, the user clicks the cis element false-positive determination button 407 at step 506. This causes the cis element false-positive determination program 210 to be executed, and a false-positive determination is made on the cis element.
At the bottom of the displayed data 601 and the pulldown menu 602, there is provided a displayed color setting 603 with a bar 604 for representing fluorescent intensity by the gradation of certain colors. At the bottom of the bar 604, there are shown a minimum value 605, an average value 606, and a maximum value 607, indicating the gradation in numerals. These values 605 to 607 are initially set to default values determined from the entered data. The user can change the hue by directly changing the values 605 to 607.
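One way such a gradation could be computed is sketched below. It assumes a piecewise-linear scale anchored at the minimum, average, and maximum values, and a white-to-red color ramp; neither the interpolation rule nor the colors are specified in the text, and the function names are hypothetical.

```python
def gradient_position(value, vmin, vavg, vmax):
    """Map an intensity to a position in [0, 1] on the color bar.

    Assumed rule: vmin -> 0.0, vavg -> 0.5, vmax -> 1.0, with linear
    interpolation in between and clamping outside the range.
    """
    if value <= vmin:
        return 0.0
    if value >= vmax:
        return 1.0
    if value <= vavg:
        return 0.5 * (value - vmin) / (vavg - vmin)
    return 0.5 + 0.5 * (value - vavg) / (vmax - vavg)


def intensity_to_rgb(value, vmin, vavg, vmax):
    """Return an (R, G, B) tuple on an assumed white-to-red ramp."""
    t = gradient_position(value, vmin, vavg, vmax)
    fade = int(round(255 * (1.0 - t)))
    return (255, fade, fade)


# Example with hypothetical values for items 605 to 607.
print(intensity_to_rgb(75, vmin=0, vavg=50, vmax=100))
```

Changing the three anchor values shifts where the midpoint color falls, which is one plausible reading of how editing the values 605 to 607 changes the displayed hue.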
A threshold value 608 indicates the lower limit of fluorescent intensity, below which it is determined that hybridization has not been fully conducted. If the fluorescent intensity is below the threshold, the particular probe ID is eliminated from the candidates for protein binding sites. Probe IDs with data that is below the threshold are excluded from color display during the processing of programs shown in
The user can select either “Display protein binding site” 609 or “Display state of hybridization to probe” 610.
In the experiment shown in
In the experiment shown in
The location of the coding region 707 is identified from gene number 307 of the target, and start site 308 and end site 309 of gene in Table 306 in
To the left of the screen 800, there are displayed a genome 802 of the target in Table 310 of
In the experiment shown in
If the value of fluorescent intensity 411 in
At step 1003, fluorescent intensities are allocated on the genome for each probe ID when they are displayed. For this purpose, a variable called Probe ID is set, the locations of 1st to N-th data are determined, threshold determination is made based on the fluorescent intensities, and the fluorescent intensities are displayed with the designated colors. The ID of the probe determined to be a protein binding site is stored in the data region 313 of
At step 1004, it is determined whether the number in the variable Probe ID is smaller than the total number N of items of data for Probe ID. If it is smaller than N, the routine proceeds to the next step 1005.
At step 1005, chromosome number 311 corresponding to the position 412 on the genome sequence of the probe with Probe ID 409 is acquired from Table 310, and a corresponding genome sequence 312 and the sequence 410 of the corresponding probe ID are prepared. At step 1006, a multiple alignment program is run on the target genome sequence 312 and the sequence 410 of the probe ID. The locations in the sequence 410 of the probe ID where the genome sequence 312 of the target starts and ends with the highest score are determined as Indata_start and Indata_end, respectively.
At step 1007, it is determined whether the fluorescent intensity 411 for the Probe ID is greater than the threshold value 608. If the fluorescent intensity 411 exceeds the threshold value 608, the routine proceeds to step 1008.
At step 1008, the values of the fluorescent intensity 411 for the locations of Indata_start and Indata_end are drawn with designated colors. The Probe ID is also displayed. The values of Indata_start and Indata_end for the Probe ID, and the downstream gene name are stored in the data 313 of
At step 1100, it is determined whether the number in Probe ID of the probe to be processed is smaller than the total number N of Probe IDs. If it is smaller than N, the routine proceeds to next step 1101.
At step 1101, for the chromosome number 311 at the location 412 on the genome sequence of the probe corresponding to the Probe ID 409, a genome sequence 312 and sequence information 410 for the probe ID are prepared. At step 1102, a multiple alignment program is run on the sequence information 410 of the probe ID for the target genome sequence 312. The start site and end site of the sequence information 410 of the probe ID on the genome sequence 312 that have the highest scores are determined as Indata_start and Indata_end, respectively.
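The determination of Indata_start and Indata_end can be pictured as finding the best-scoring placement of the probe sequence on the genome sequence. The sketch below is a simplification that assumes an ungapped comparison scored by the number of matching bases, with 0-based coordinates and an exclusive end; the text refers to a multiple alignment program, which this does not reproduce. The function name `locate_probe` is hypothetical.

```python
def locate_probe(genome, probe):
    """Return (indata_start, indata_end, score) for the best ungapped match.

    Assumptions: score = number of identical bases, 0-based start,
    exclusive end, first best position wins on ties.
    """
    best_score, best_start = -1, 0
    for start in range(len(genome) - len(probe) + 1):
        window = genome[start:start + len(probe)]
        score = sum(1 for a, b in zip(window, probe) if a == b)
        if score > best_score:
            best_score, best_start = score, start
    return best_start, best_start + len(probe), best_score


# Hypothetical genome fragment and probe sequence.
genome = "AAACCGTTGACGTCAGGG"
indata_start, indata_end, score = locate_probe(genome, "TTGACGTCA")
print(indata_start, indata_end, score)
```

A production system would more likely use a gapped local alignment so that insertions and deletions in the probe are tolerated.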
At step 1103, the user determines whether “Display protein binding site” 609 or “Display state of hybridization to probe” 610 in
At step 1105, it is determined if the fluorescent intensity 411 is greater than the threshold value 608. If the fluorescent intensity 411 is greater than the threshold value 608, the routine proceeds to step 1106. If the fluorescent intensity 411 is not greater than the threshold value 608, the routine proceeds to step 1112.
At step 1106, the values of the fluorescent intensity 411 for the locations of Indata_start and Indata_end are drawn with designated colors. Probe ID is also displayed.
At step 1104, it is determined whether or not the fluorescent intensity 411 is greater than the threshold value 608. If the fluorescent intensity 411 is greater than the threshold value 608, the routine proceeds to step 1107. If not, the routine proceeds to step 1103 where Flag is set to zero.
At step 1107, 1 is added to the variable Flag. If Probe ID is 1, 1 is entered in Flag. At step 1108, it is determined whether Flag is 2. If not, at step 1111, the value of Indata_start is set in a variable Pre_start, the value of Indata_end is set in a variable Pre_end, and the value of the fluorescent intensity 411 is set in Pre_data.
If Flag is 2, which means that the fluorescent intensity of hybridization of a previous Probe ID has exceeded the threshold value and the fluorescent intensity of the next Probe ID has also exceeded the threshold value, the routine proceeds to step 1109. At step 1109, Pre_end is considered to be the start and Indata_start is considered to be the end, and the average values of the fluorescent intensity 411 and Pre_data are drawn at these positions, respectively, with designated colors. Probe ID is also displayed. Pre_end and Indata_start for Probe ID, and downstream gene names are stored.
At step 1110, Flag is set to be 1 and the routine proceeds to step 1111 where Indata_start and Indata_end for the current Probe ID are set in Pre_start and Pre_end, respectively. The value of the fluorescent intensity 411 is also set in Pre_data. At step 1112, the next Probe ID is processed by repeating the same sequence from the initial step 1101.
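The Flag logic of steps 1104 to 1111 can be condensed into a short sketch. It assumes the probes are supplied in genomic order as tuples of (probe ID, Indata_start, Indata_end, fluorescent intensity); the tuple layout and the function name `binding_regions` are hypothetical, and drawing is replaced by collecting the regions in a list.

```python
def binding_regions(probes, threshold):
    """Report the region between two consecutive above-threshold probes.

    Each result is (probe_id, region_start, region_end, mean_intensity),
    where region_start is Pre_end of the previous probe and region_end
    is Indata_start of the current probe.
    """
    regions = []
    previous = None  # (Pre_start, Pre_end, Pre_data); None means Flag == 0
    for probe_id, start, end, intensity in probes:
        if intensity > threshold:
            if previous is not None:  # Flag would now be 2
                _, pre_end, pre_data = previous
                mean = (pre_data + intensity) / 2
                regions.append((probe_id, pre_end, start, mean))
            previous = (start, end, intensity)  # Flag back to 1
        else:
            previous = None  # Flag reset to zero
    return regions


# Hypothetical probe data ordered along one chromosome.
probes = [
    ("P1", 100, 200, 900),
    ("P2", 300, 400, 800),
    ("P3", 500, 600, 100),
    ("P4", 700, 800, 950),
]
regions = binding_regions(probes, threshold=500)
print(regions)
```

In this example only the gap between P1 and P2 is reported, because P3 falls below the threshold and breaks the run before P4.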
At step 1201, it is determined if the number in Probe ID of the probe to be processed is smaller than the total number N of Probe IDs. If smaller than N, the routine proceeds to the next step 1203.
At step 1203, the probe ID 904 on the screen 900, and the start site 905 and end site 906 of binding sites for Probe ID 904 are displayed. At step 1204, gene name 907 is displayed. At step 1205, based on the location 412 on the genome sequence of Probe ID, the sequence information for the genome sequence for the chromosome 311 is acquired, and the base sequence 908 on the chromosome is displayed. At step 1206, the same steps are repeated for the next Probe ID.
In “Number of commonly detected items” 1302, three alternatives are shown. One is when one or more cis elements (motif sequences) are detected from each searched sequence. Another is when zero or one or more cis elements (motif sequences) are detected. The other is when any repetition is allowed. In “Maximum number of motif detection” 1303, the maximum number of cis elements (motif sequences) to be detected commonly from all of the searched sequences is designated. In “Number of detected sites” 1304, the number of cis elements (motif sequences) to be detected from a single searched sequence is designated. “Length of motif sequence” 1305 is the item for designating the length of retrieved cis elements. The values that can be set in “Number of detected sites” 1304 and “Length of motif sequence” 1305 are from 2 to 100. If any other values are entered, an error message is displayed.
A screen 1500 shows a bar graph 1505 of the frequency of appearance of the base sequence of a motif sequence. The horizontal axis 1504 shows base sequence, and the vertical axis 1503 shows E-values (expected values).
A screen 1501 shows a table of the locations of motif sequences in the searched sequence. The table includes probe ID 1506 uniquely indicating a searched sequence, strand 1507 indicating the direction of motif sequence, start site 1508 of motif sequence in the searched sequence, P-value (significance probability) 1509 of the motif, and the motif sequence 1510 with 10 previous and subsequent bases. The + sign in strand 1507 indicates the direction from 5′ to 3′, and the − sign indicates the opposite direction.
A screen 1502, which visually represents the location of a motif sequence in the searched sequence, is basically the same as the screen 1501 in terms of contents. The screen 1502 includes probe ID 1511, P-value of motif 1512, searched sequence 1513, and motif sequence 1514.
At step 1602, the cis elements obtained as search results are displayed. Specifically, motif number 1402, length of motif sequence 1403, and the sequence of cis element as the motif are displayed, as shown in
At step 1604, the screen 1500 of
“Retrieved sequences” 1701 shows the motif sequences (cis elements) of which the frequency of appearance is to be detected. When the screen 1700 is displayed, the list of motif sequences 1401 on the screen 1400 of
At the bottom of the screen 1700, there is displayed a dialog for entering a false-positive determination standard. When a false-positive determination is to be made, one of three standards is designated. A first standard 1708 is where a motif sequence that appeared more often than a designated proportion with respect to the entire retrieved sequences is considered a false positive. A second standard 1709 is where a motif sequence that appeared in sequences with data that is below a designated value more often than a designated proportion value is considered to be a false positive. A third standard 1710 is such that a motif sequence that appeared in a single sequence more often than a designated proportion value is considered a positive.
In accordance with the first standard, a motif sequence that appears upstream of more than 80% of all the searched genes has too high a frequency. Because such a sequence is unlikely to be one that the protein X specifically recognizes, it is considered to be a false positive. In accordance with the second standard, a motif sequence contained in data with lower fluorescent intensities is likewise unlikely to be a sequence that the protein X specifically recognizes. Therefore, such a sequence is considered to be a false positive. In accordance with the third standard, if a motif sequence exists in one upstream sequence with a high frequency, the possibility of the motif sequence being controlled by the protein X is high, and therefore the sequence is considered to be a positive.
a shows an example of the table 1800 of frequency of appearance of cis elements obtained by running the program for detecting and displaying the frequency of appearance of cis elements. The table 1800 of cis element appearance frequencies includes gene names 1803 and consensus sequences 1804 to 1806. At the head 1802 of each of the gene names 1803, a mark is indicated if the gene corresponds to an upstream location of a gene shown in the image 900 shown in
The entire lines of the table 1800 of cis element appearance frequencies can be sorted in either ascending or descending order. For instance, when an ascending sort is carried out with respect to the marks 1802, a false-positive determination table 1801 can be obtained, as shown in
The first column 1807 is shown darker towards the top and lighter towards the bottom, indicating that the binding sites for the protein X are concentrated at the top of the false-positive determination chart 1801. When the second standard 1710 of
At step 1903, the sequence of Gene 307 between a location that is 500 bases prior to Start 308 and Start is acquired from the genome sequence 312 in the chip database 201 of
At step 1906, all of the gene names 1803 that contain the motif sequence are displayed. At step 1907, the column 907 of gene names downstream of binding sites in the screen 900 of
The processes from steps 1903 to 1909 are carried out on all of the genes in the retrieved data 1704. Further, these processes are carried out for all of the motif sequences in the retrieved sequences 1701.
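The nested loop over genes and motif sequences might look like the following sketch. It assumes forward-strand genes, 0-based coordinates, exact string matching, and overlapping occurrences being counted; the function names and the dictionary-based table are hypothetical stand-ins for the table 1800.

```python
def upstream_sequence(genome, gene_start, length=500):
    """Return up to `length` bases immediately before the gene start."""
    begin = max(0, gene_start - length)
    return genome[begin:gene_start]


def count_motif(sequence, motif):
    """Count occurrences of `motif`, including overlapping ones."""
    count, pos = 0, sequence.find(motif)
    while pos != -1:
        count += 1
        pos = sequence.find(motif, pos + 1)
    return count


def motif_frequency_table(genome, genes, motifs, length=500):
    """Build {gene name: {motif: count in the upstream region}}."""
    table = {}
    for name, start in genes.items():
        upstream = upstream_sequence(genome, start, length)
        table[name] = {m: count_motif(upstream, m) for m in motifs}
    return table


# Hypothetical genome and gene start positions; a short upstream
# window (20 bases) keeps the example small.
genome = "A" * 10 + "TGACGT" + "C" * 4 + "TGACGT" + "G" * 30
genes = {"geneA": 26, "geneB": 12}
table = motif_frequency_table(genome, genes, ["TGACGT"], length=20)
print(table)
```

Genes on the reverse strand would need the region after the gene end, reverse-complemented, which this sketch does not handle.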
At step 2001, if one or more retrieved sequences are found in the searched sequences, this counts as one, and when the ratio of the number of retrieved sequences found to the number of searched sequences exceeds a designated value, the retrieved cis element is determined to be a false positive.
At step 2002, if one or more retrieved sequences are found in searched sequences that are lower than the designated expression data, this counts as one, and when the ratio of the number of such retrieved sequences to the number of searched sequences exceeds a specified value, the retrieved sequences are determined to be false positives. At step 2004, the sequences determined to be false positives are shown in red.
At step 2003, if more than a specified number of retrieved sequences are contained in a single searched sequence, the retrieved sequences are determined to be positives. At step 2005, the sequences determined to be positives are shown in blue.
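The three determinations of steps 2001 to 2003 can be sketched as three small predicates. The sketch assumes that the per-sequence motif counts and the fluorescent intensities are held in dictionaries keyed by a sequence identifier; the function names, parameter names, and example values are hypothetical.

```python
def too_frequent(counts, max_fraction):
    """Step 2001: false positive if the motif occurs in too many sequences."""
    hits = sum(1 for c in counts.values() if c >= 1)
    return hits / len(counts) > max_fraction


def in_low_intensity(counts, intensities, cutoff, max_fraction):
    """Step 2002: false positive if the motif occurs too often in
    sequences whose intensity is below `cutoff`."""
    hits = sum(
        1 for sid, c in counts.items()
        if c >= 1 and intensities[sid] < cutoff
    )
    return hits / len(counts) > max_fraction


def repeated_in_one(counts, min_repeats):
    """Step 2003: positive if one sequence holds more than `min_repeats`."""
    return any(c > min_repeats for c in counts.values())


# Hypothetical counts of one motif in five searched sequences.
counts = {"s1": 3, "s2": 1, "s3": 0, "s4": 0, "s5": 1}
intensities = {"s1": 900, "s2": 100, "s3": 800, "s4": 50, "s5": 120}

print(too_frequent(counts, 0.8))
print(in_low_intensity(counts, intensities, cutoff=200, max_fraction=0.3))
print(repeated_in_one(counts, min_repeats=2))
```

Keeping the standards as separate functions mirrors the dialog described earlier, in which the user designates one standard at a time.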
While preferred embodiments of the invention have been described, it is to be understood that modifications will be apparent to those skilled in the art without departing from the spirit and scope of the following claims.
| Number | Date | Country | Kind |
|---|---|---|---|
| 262045/2004 | Sep 2004 | JP | national |