The present invention relates to a system and method for sequence matching and alignment in a database management system, such as a relational database management system
Genetic databases store vast quantities of data including nucleotide (gene) and amino acid (protein) sequences of different organisms. They assist molecular biologists in understanding the biochemical function, chemical structure and evolutionary history of organisms. An important aspect of managing today's exponential growth in genetic databases is the availability of efficient, accurate and selective techniques for detecting similarities between new and stored sequences.
The discovery of sequence homology to a known protein or family of proteins often provides the first clues about the function of a newly sequenced gene. As the DNA and amino acid sequence databases continue to grow in size they become increasingly useful in the analysis of newly sequenced genes and proteins because of the greater chance of finding such homologies.
There are a number of algorithms and software tools for searching sequence databases. All of them use some measure of similarity between sequences to distinguish biologically significant relationships from random similarities that occur by chance. The most studied measures are those used in conjunction with variations of the dynamic programming algorithm. These methods assign scores to insertions, deletions and replacements, and compute an alignment of two sequences that corresponds to the least costly set of such mutations. Such an alignment may be thought of as minimizing the evolutionary distance or maximizing the similarity between the two sequences compared. In either case, the cost of this alignment is a measure of similarity. Because of their computational requirements, dynamic programming algorithms are impractical for searching large databases without the use of a supercomputer or other special purpose hardware.
In order to allow searching large databases on commonly available computers, fast algorithms based on heuristics that attempt to approximate the above methods have been developed. In many heuristic methods the measure of similarity is not explicitly defined as a minimal cost set of mutations, but instead is implicit in the algorithm itself. For example, the FASTP program of Lipman and Pearson first finds locally similar regions between two sequences based on identities but not gaps, and then re-scores these regions using a measure of similarity between residues (a character in a sequence string is called a residue). Despite their rather indirect approximation of minimal evolution measures, heuristic tools such as FASTP have been quite popular and have identified many distant but biologically significant relationships.
Sequence similarity measures can generally be classified as either global or local. Global similarity algorithms optimize the overall alignment of two sequences, which may include large stretches of low similarity. Local similarity algorithms seek only relatively conserved subsequences, and a single comparison may yield several distinct subsequence alignments; unconserved regions do not contribute to the measure of similarity. Local similarity measures are generally preferred for database searches, where DNA sequences may be compared with partially sequenced genes, and where distantly related proteins may share only isolated regions of similarity.
Many similarity measures begin with a scoring matrix of similarity scores for all possible pairs of residues. Identities and conservative replacements have positive scores, while unlikely replacements have negative scores. A sequence segment is a contiguous stretch of residues of any length, and the similarity score for two aligned segments of the same length is the sum of the similarity values for each pair of aligned residues.
Basic Local Alignment Search Tool (BLAST) is another heuristic-based algorithm for finding local alignments between sequences. In addition to being a fast algorithm compared to other similar algorithms, an important advantage of BLAST is that it provides a measure of statistical significance of the alignment scores with respect to an appropriate random sequence model. This allows the biologists to discard statistically insignificant alignments while detecting the significant ones fast. Hence BLAST has become a popular and widely used sequence alignment method.
Conventionally, many large genomic databases are implemented in conjunction with Database Management Systems (DBMSs). However, these genomic databases use the DBMS only as a storage repository. All the analysis and sequence alignments are done using external tools after exporting the data from the DBMS and transforming it into the appropriate formats accepted by the tools.
There are several problems that arise with the use of a conventional external BLAST server, as shown in
A need arises for an integrated solution in which the BLAST functionality is integrated into a DBMS. This integrated solution would provide improved performance and scalability over the conventional approach, in addition to reducing the required hardware resources and reducing the cost of the system.
The present invention is an integrated solution in which the BLAST functionality is integrated into a DBMS. This integrated solution would provide improved performance and scalability over the conventional approach, in addition to reducing the required hardware resources and reducing the cost of the system. A modern DBMS offers a wide range of data management and analytic functionality that may be advantageously used for bioinformatics applications.
Such a DBMS offers a scalable and efficient platform for storage and retrieval of genetic data. In one embodiment of the present invention, in a database management system, a system for sequence matching and alignment comprises a database table storing sequence information comprising target sequences, a query sequence, a table function operable to accept the query sequence and match the query sequence with at least one target sequence stored in the database table, and a structured query language query referencing a database table storing sequence information comprising target sequences, a query sequence, and a table function, the structured query language query evaluatable by the database management system. The table function may be either a match function operable to provide a sequence identification, score, and expect value of the match of a query sequence with a target sequence stored in the database table, or an alignment function operable to provide a full alignment of the query sequence with a target sequence stored in the database. The match function may be a separate function from the alignment function. The table function may be included in a FROM clause of the structured query language query.
The table function may be operable to accept the query sequence and match the query sequence with at least one target sequence stored in the database table by processing input arguments to the table function, the input arguments including a reference to the database table and a reference to the query sequence, divide the query sequence into a plurality of query subsequences, and search the database table to find for each query subsequence target sequences that match the query subsequence. The sequences may be nucleotide sequences of genetic material, amino acid sequences of proteins, or both. The table function may be further operable to translate the nucleotide sequences to amino acid sequences as per a specified genetic code. The plurality of query subsequences may comprise a set of overlapping fixed length query subsequences. The table function may be further operable to score each query subsequence using a scoring matrix.
The details of the present invention, both as to its structure and operation, can best be understood by referring to the accompanying drawings, in which like reference numbers and designations refer to like elements.
BLAST, developed by Altschul et al. in 1990, is a heuristic method to find the high scoring locally optimal alignments between a query sequence and a database [1]. BLAST focuses on no-gap alignments of a certain fixed length. The BLAST algorithm and family of programs rely on work on the statistics of un-gapped sequence alignments by Karlin and Altschul. The statistics allow the probability of obtaining an un-gapped alignment (also called MSP—Maximal Segment Pair) with a particular score to be estimated. The BLAST algorithm permits nearly all MSPs above a cutoff to be located efficiently in a database.
The algorithm operates in three steps:
A low value for T reduces the possibility of missing MSPs with the required S score, however lower T values also increase the size of the hit list generated in step 2 and hence the execution time and memory required. In practice, the values of T and S are chosen so as to balance the processor requirements and sensitivity.
BLAST is unlikely to be as sensitive for all protein searches as a full dynamic programming algorithm. However, the underlying statistics provide a direct estimate of the significance of any match found. The NCBI version of BLAST provides filters to exclude automatically regions of the query sequence that have low compositional complexity, or short periodicity internal repeats. The presence of such sequences can yield extremely large numbers of statistically significant but biologically uninteresting MSPs. For example, searching with a sequence that contains a long section of hydrophobic residues will find many proteins with transmembrane helices.
Like many other similarity measures, the MSP score for two sequences may be computed in time proportional to the product of their lengths using a simple dynamic programming algorithm. An important advantage of the MSP measure is that recent mathematical results allow the statistical significance of MSP scores to be estimated under an appropriate random sequence model. Furthermore, for any particular scoring matrix, one can estimate the frequencies of paired residues in maximal segments. This tractability to mathematical analysis is a crucial feature of the BLAST algorithm.
In searching a database of thousands of sequences, generally only a handful, if any, will be homologous to the query sequence. The scientist is therefore interested in identifying only those sequence entries with MSP scores over some cutoff score S. These sequences include those sharing highly significant similarity with the query as well as some sequences with borderline scores. This latter set of sequences may include high scoring random matches as well as sequences distantly related to the query. The biological significance of the high scoring sequences may be inferred solely on the basis of the similarity score, while the biological context of the borderline sequences may be helpful in distinguishing biologically interesting relationships.
The BLAST algorithm can be used to search nucleotide and amino acid query sequences against databases of nucleotide and amino acid sequences. Based on the nature of the query and the database sequences, the NCBI BLAST provides the following variants:
Although this implementation of the BLAST algorithm is preferred, there are other implementations and variants of the BLAST algorithm that may be used advantageously by the present invention. Therefore, the present invention contemplates any and all implementations and variants of the BLAST algorithm.
BLAST Interface in a Relational Database Management System
In a preferred embodiment of the present invention, BLAST functionality may be implemented in a Relational Database Management System (RDBMS), such as the ORACLE® RDBMS. The features of this preferred embodiment may have wide application and are not limited to any particular RDBMS, or to relational database systems. Thus, it is clear that the present invention contemplates implementation on any database system, whether relational or non-relational.
A preferred embodiment of the present invention includes an API to the sequence similarity search functionality, which is a table function that can be used in the FROM clause of a SQL query. Table functions return virtual tables that can be manipulated just like regular tables [6]. Preferably, two families of functions are provided—the MATCH( ) family and the ALIGN( ) family. They accept the same set of input parameters. The MATCH( ) functions return only the sequence id, score and expect value of the target sequences in the database that have a high similarity with the query sequence. The ALIGN( ) functions return the full alignment of the query sequence with the target sequences. There are use cases in which BLAST is used as an initial screener for more complex alignment searches. In those cases, the result of the MATCH( ) function would be sufficient.
Example functions provided in a preferred embodiment include three MATCH( ) functions and three ALIGN( ) functions, as follows:
The purpose of this table function is to perform a BLASTN search of the given nucleotide sequence against the selected portion of the nucleotide database. The input query nucleotide sequence is specified as a character large object (CLOB). The database can be selected using a standard SQL select and passed into the function as a reference cursor. The reference cursor must have the schema (sequence_id VARCHAR2, sequence_data CLOB). The standard BLAST parameters that are described below are also accepted. The match returns the identifier of the matched (target) sequence (t_seq_id) (for example, the NCBI accession number), the score of the match, and the expect value.
1.2. BLASTP_MATCH( )
The purpose of this table function is to perform a BLASTP search of the given set of protein sequences against the portion of the protein database selected. The database can be selected using a standard SQL select and passed into the function as a cursor. The standard BLAST parameters that are described below are also accepted. The match returns the identifier of the query sequence (q_seq_id), the identifier of the matched (target) sequence (t_seq_id) (for example, the NCBI accession number), the score of the match, and the expect value.
1.3. TBLAST_MATCH( )
The purpose of this table function is to perform BLAST searches involving translations of either the query sequence or the database of sequences. The available options are:
3. TBLASTX: The query sequence and the database sequence are both translated. The database can be selected using a standard SQL select and passed into the function as a cursor. The standard BLAST parameters that are described below are also accepted. The match returns the identifier of the query sequence (q_seq_id), the identifier of the matched (target) sequence (t_seq_id) (for example, the NCBI accession number), the score of the match, and the expect value.
1.4. BLASTN_ALIGN( )
The purpose of this table function is to perform a BLASTN alignment of the given nucleotide sequences against the portion of the nucleotide database selected. The database can be selected using a standard SQL select and passed into the function as a cursor. The standard BLAST parameters that are described below are also accepted. The BLASTN_MATCH( ) function returns only the score and expect value of the match. It does not return information about the alignment. The BLASTN_MATCH function will typically be used where the user wants to follow up a BLAST search with a full FASTA or Smith-Waterman alignment. The BLASTN_ALIGN( ) function does the BLAST alignment and returns the information about the alignment. The following attributes are returned:
score: score corresponding to the alignment
1.5. BLASTP_ALIGN( )
The purpose of this table function is to perform a BLASTP alignment of the given protein sequences against the portion of the protein database selected. The database can be selected using a standard SQL select and passed into the function as a cursor. The standard BLAST parameters that are described below are also accepted. The BLASTP_MATCH( ) function returns only the score and expect value of the match. It does not return information about the alignment. The BLASTP_MATCH function will typically be used where the user wants to follow up a BLAST search with a full FASTA or Smith-Waterman alignment. The BLASTP_ALIGN( ) function does the BLAST alignment and returns the information about the alignment. The schema of the returned alignment is the same as that of BLASTN_ALIGN( ).
1.6. TBLAST_ALIGN( )
The purpose of this table function is to perform BLAST alignments involving translations of either the query sequence or the database of sequences. The available translation options are BLASTX, TBLASTN and TBLASTX. The schema of the returned alignment is the same as that of BLASTN_ALIGN( ) and BLASTP_ALIGN( ).
1.7. BLAST Parameters
Table 1 lists the input parameters to the BLAST functions with a short description. A detailed description of these parameters can be found in [3]. The MATCH( ) and ALIGN( ) functions accept the same set of input parameters.
The ALIGN( ) family of BLAST functions return the full alignment of the query sequence with the target sequence. The attributes of the ALIGN output and their descriptions are shown in Table 3. The output format is the same for all ALIGN( ) functions.
A process 200 for finding matching sequences in a genetic information database is shown in
As an example of the processing performed, assume that the query sequence is “ATGCAGTACGTACGATCAGTACGT” and the database consists of two sequences; (1, “ATTCACTACTTACGATTGCAACGT”) and (2, “ATTCGGTATGCACGATCAGTACGT”). The major part of the processing involved in all six BLAST match and align functions is similar. Some functions have a few additional steps. For example, in TBLAST_MATCH and TBLAST_ALIGN, where there is translation involved, the sequences undergo the appropriate translations before the subsequent steps are performed. However, the steps shown in
Process 200 begins with step 201, in which the input arguments are processed and placed into a parameter object. Use of a parameter object is preferred as it is more compact this way to pass the arguments around to different functions. However, use of the parameter object is not necessary. Further, in typical use cases only a few arguments may be specified. For the arguments that are not specified, default values are substituted. An exemplary parameter object may include the following attributes.
The fully filled parameter object is the output of this step 201.
In step 202, the appropriate sequence translations are performed. The TBLAST_MATCH and TBLAST_ALIGN functions involve translation of nucleotide sequences into amino acid sequences. This translation is performed according to a genetic code. There are several different genetic codes that can be used for this translation. In a preferred embodiment, the “universal” genetic code is used. This code is also the default used by NCBI BLAST. There are 13 genetic codes supported in the present system. However, the present invention does contemplate using additional genetic codes.
DNA is a two-stranded molecule. Each strand is a polynucleotide composed of A (adenosine), T (thymidine), C (cytidine), and G (guanosine) residues. One strand of DNA holds the information that codes for various genes; this strand is often called the template strand or antisense strand (containing anticodons). The other, and complementary, strand is called the coding strand or sense strand (containing codons). Amino acid residues of proteins are specified as triplet codons. That is, a combination of 3 characters in a nucleotide sequence corresponds to an amino acid residue. Since DNA has a 4-letter alphabet, there are 64 possible combinations (4{circumflex over ( )}3=64). The mapping of these DNA residue combinations to the amino acid combinations is called a “genetic code”.
In the universal genetic code, 61 out of the 64 combinations correspond to an amino acid residue. The remaining 3 codons are used for “punctuation”; that is, they signal the termination (the end) of the growing polypeptide chain. The universal genetic code is shown below.
The top line corresponds to the amino acid residue and the other three lines correspond to the nucleotide bases. For example, TTT corresponds to F, TTA corresponds to L and GGG corresponds to G. The “*” in the top line corresponds to punctuation.
The input DNA sequence translated into an amino acid sequence according to the specified genetic code is output from this step 202.
In step 203, the query sequence is divided into a set of overlapping fixed length subsequences. For a given word length w (usually 3 for proteins) and scoring matrix, a list of all w-length subsequences (w-mers) that can score greater than a specified threshold T (a value of T=17 is used in NCBI BLAST), when compared to w-mers from the query, are created. For example, with w=3 the query sequence “ATGCAGTACGTACGATCAGTACGT” will first be split into subsequences, “ATG”, “TGC”, “GCA”, . . . etc. After the split, the subsequences that score less than T, when compared to the other w-mers from the query are dropped. The scoring is done according to a specified scoring matrix.
The wordlist with scores more than the specified threshold is output from this step 203.
In step 204, the database is searched using the list of high scoring w-mers found in the previous step 203, to find the corresponding w-mers in the database. The objective in this step is to identify for each query subsequence, the list of (sequence_id, offset) pairs in the database, where the query subsequence appears. In one embodiment, the entire database may be scanned in order to find the corresponding w-mers. In other embodiments, various forms of indexes may be used to speed up searching of the database.
The list of high scoring pairs is output from this step 204.
In step 205, each hit identified in step 204 is extended to determine if a Maximal Segment Pair (MSP) that includes the w-mer scores greater than S, the preset threshold score for an MSP. Since pair score matrices typically include negative values, extension of the initial w-mer hit may increase or decrease the score. Accordingly, a parameter defines how large an extension will be tried in an attempt to raise the score above S.
This step produces the score and expectation value for the high scoring hits, which is the output of process 200.
Usage examples of the BLAST family of table functions in which BLAST searches are combined with other database functionality are described below.
Functional annotation is the process of annotating newly discovered genes with descriptions about their potential functions. An example of functional annotation is shown in
Assume that the table SwissProt_DB 302 consists of all the protein sequences in the SwissProt database and the table Query_DB 304 consists of the newly discovered fragments of the sequence to be searched for. The following query returns the top three matches in each organism. The BLASTP_MATCH table function 306 returns the sequence id, score and expect value 308 of the match. It is joined back with the SwissProt_DB table 302 on the sequence id 310 to get the organism attribute 312. The RANK function 314 partitions the result on the organism, sorts it in the descending order of score and computes a rank for each row 316 and outputs the results. An exemplary SQL query is shown below:
Another exemplary use case of the present invention is drug discovery. In drug discovery, if the identified marker genes are newly found sequence fragments, similarity search is quite useful to identify potential leads. In this example, assume that the Inhibits (gene_id, inhibitor) table stores the relationship between genes and their inhibiting compounds and the compounds (compound_id, toxicity, . . . ) table stores information about the various compounds including their toxicity. The table Marker_Genes stores the sequence fragments that are used to query against the sequences stored in GENE_DB table. The following query selects three known sequences that are most similar to the query sequence and a list of non-toxic compounds that inhibit them.
Another exemplary use case of the present invention involves using the BLASTN_MATCH function. In this example, the table GENE_DB stores DNA sequences. GENE_DB has attributes (seq_id, publication date, modification date, organism, sequence) among other attributes. The following query does a BLAST search of the given query sequence against all human DNA sequences and returns the seq_id, score and expect value of matches that score >25. The schema of the table that stores the sequences is not required to be fixed. It is only required that it contains an identifier and the sequence and any number of other optional attributes.
The following query does the BLAST search against all sequences published after Jan. 01, 2000.
Other attributes of the matching sequence can be obtained by joining the BLAST result with the original sequence table as follows:
In this approach, the portion of the database to be used for the search can be specified using SQL which is much more powerful than other search mechanisms like ENTREZ from NCBI. The full power of SQL can be used to perform more sophisticated functions.
Another exemplary use case of the present invention involves using the BLASTP_MATCH function. In this example, the table PROT_DB stores protein sequences. GENE_DB has attributes (identifier, name, publication date, modification date, organism, sequence) among other attributes. The following query does a BLASTP search of the given query sequence against all protein sequences and returns the identifier, score, name and expect value of matches that score >25.
Another exemplary use case of the present invention involves using the BLASTN_ALIGN function. In this example, the table GENE_DB stores DNA sequences. GENE_DB has attributes (seq_id, publication date, modification date, organism, sequence) among other attributes. The following query does a BLAST search and alignment of the given query sequence against all human DNA sequences and returns the publication_date, organism and the alignment attributes of matching sequences that score >25 and where more than 50% of the sequence is conserved in the match.
An exemplary block diagram of a database management system 400, in which the present invention may be implemented, is shown in
Input/output circuitry 404 provides the capability to input data to, or output data from, database/System 400. For example, input/output circuitry may include input devices, such as keyboards, mice, touchpads, trackballs, scanners, etc., output devices, such as video adapters, monitors, printers, etc., and input/output devices, such as, modems, etc. Network adapter 406 interfaces database/System 400 with Internet/intranet 410. Internet/intranet 410 may include one or more standard local area network (LAN) or wide area network (WAN), such as Ethernet, Token Ring, the Internet, or a private or proprietary LAN/WAN.
Memory 408 stores program instructions that are executed by, and data that are used and processed by, CPU 402 to perform the functions of system 400. Memory 408 may include electronic memory devices, such as random-access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), flash memory, etc., and electromechanical memory, such as magnetic disk drives, tape drives, optical disk drives, etc., which may use an integrated drive electronics (IDE) interface, or a variation or enhancement thereof, such as enhanced IDE (EIDE) or ultra direct memory access (UDMA), or a small computer system interface (SCSI) based interface, or a variation or enhancement thereof, such as fast-SCSI, wide-SCSI, fast and wide-SCSI, etc, or a fiber channel-arbitrated loop (FC-AL) interface.
The contents of memory 408 varies depending upon the function that system 400 is programmed to perform. In the example shown in
In the example shown in
As shown in
It is important to note that while the present invention has been described in the context of a fully functioning data processing system, those of ordinary skill in the art will appreciate that the processes of the present invention are capable of being distributed in the form of a computer readable medium of instructions and a variety of forms and that the present invention applies equally regardless of the particular type of signal bearing media actually used to carry out the distribution. Examples of computer readable media include recordable-type media such as floppy disc, a hard disk drive, RAM, and CD-ROM's, as well as transmission-type media, such as digital and analog communications links.
Although specific embodiments of the present invention have been described, it will be understood by those of skill in the art that there are other embodiments that are equivalent to the described embodiments. Accordingly, it is to be understood that the invention is not to be limited by the specific illustrated embodiments, but only by the scope of the appended claims.
The benefit under 35 U.S.C. § 119(e) of provisional application 60/498,698, filed Aug. 29, 2003, is hereby claimed.
Number | Date | Country | |
---|---|---|---|
60498698 | Aug 2003 | US |