The present teachings relate to the field of gene, alternative splice variant, and transcript identification and display.
Gene prediction programs are generally based on ab initio or homology principles. In general, the advantage of these prediction programs is their ability to provide in silico genes on either incomplete or complete genomes with limited experimental evidence, so long as the predicted genes follow predefined models. Unfortunately, these programs often produce gene locations and structures of low confidence and/or high rates of over- or under-prediction.
Accurate identification of genes typically requires manual curation. This involves merging diverse sources of information, such as the output of different gene prediction programs and homology searches of protein and sequence databases. The process is often time-consuming and error-prone. Differences in the manner in which individual curators approach the problem can also produce inconsistent results; for example, one curator may identify gene boundaries differently than another. The present teachings discuss computational methods that can relieve a human operator of this burdensome task. They can identify and annotate genes and their variants, and identify transcripts, in an accurate and consistent manner. This information can be useful for basic, biomedical, and pharmaceutical research.
The skilled artisan will understand that the drawings, described below, are for illustration purposes only. The drawings are not intended to limit the scope of the present teachings in any way.
Figure one gives the number of sequences in various sequence databases.
Figure two illustrates an embodiment of the present teachings that can be used for gene, splice variant and transcript identification.
Figure three illustrates an embodiment of the present teachings that graphically displays cluster information.
Figure four illustrates a comparison of splice junction differences between curated Celera transcripts and (a) non-curated Celera transcripts, (b) Mammalian Gene Collection transcripts, and (c) GenBank mRNAs.
Figure five represents a (a) histogram of unique exon lengths for Celera transcripts (b) histogram of unique exon lengths for Celera transcripts, and (c) histogram of the shortest exons per Celera transcript.
Figure six illustrates the comparison of intron structure between sequences.
Figure seven illustrates an embodiment of the present teachings that (a) graphically displays cluster information, and (b) allows a user to input sequence identifiers to be located in clusters.
Figure eight shows the mapping rate of various evidence database sequences to the human genome.
Figure nine shows results for collapsing overlapping ESTs into clusters.
Figure ten illustrates clone-end alignment information for individual ESTs observed during EST clustering.
Figure eleven shows the number of clusters identified by an embodiment of the present teachings via different combinations of evidence.
Figure twelve compares the number of transcribed genes determined by an embodiment of the present teachings to the number of genes identified by the Ensembl and RefSeq organizations.
Figure thirteen is a Venn diagram illustrating the overlap in gene identification between the transcribed genes determined by an embodiment of the present teachings and those genes identified by the Ensembl and RefSeq organizations.
Figure fourteen illustrates the types of clusters encountered when clustering curated Celera transcripts.
Figure fifteen shows the number of clusters identified by an embodiment of the present teachings via different combinations of evidence.
Figure sixteen illustrates an embodiment of a computer system upon which various embodiments of the present teachings can be implemented.
The section headings used herein are for organizational purposes only and are not to be construed as limiting the subject matter described in any way.
While the present teachings are described in conjunction with various embodiments, it is not intended that the present teachings be limited to such embodiments. On the contrary, the present teachings encompass various alternatives, modifications, and equivalents, as will be appreciated by those of skill in the art.
Figure one shows the number of genes estimated by various research organizations. This figure shows that the number of genes varies widely in the literature and emphasizes that current gene prediction tools are imperfect in identifying correct genes, transcripts, and splice variants. Experimental sequences expressed from human genes can be extremely useful in identifying and annotating human genes. The present teachings discuss a genome-based evidence-clustering approach that focuses on genetic information such as genes, splice variants, and transcripts on a genome using evidence. Such evidence can include information from the Mammalian Gene Collection (MGC), GenBank mRNAs, ESTs, and RefSeq NMs.
Figure two illustrates an embodiment of the present teachings that identifies genetic information using evidence from various sources to build clusters. At 202, observed information is collected from various data sources. A variety of data sources can be used as input. For example, two high-quality sources that have been manually reviewed by biologists, and are therefore thought to be complete, are the Celera Transcripts (CTs) and the RefSeqs (NMs). Another class of sequences is the Mammalian Gene Collection sequences (MGCs). These sequences arise from the end-to-end sequencing of cDNAs from many human tissues. An entire protein coding region must be found in order for a fully sequenced cDNA to be promoted to an MGC. Another class of sequences is GenBank mRNAs, which have no quality standards associated with them other than encouraging people to deposit only full-length sequences. A somewhat less reliable data source is the set of ESTs (expressed sequence tags). These are transcript fragments. EST sequences are generated from single-pass reads and can have a potentially high rate of sequencing errors and contamination from vectors, retroviruses, or genomic amplification. Additionally, protein information can be used. This information can be mapped to the genome using programs such as GeneWise. One skilled in the art will appreciate that many other data sources can be used and that the teachings herein are not limited to the sources mentioned above.
At 204, the data sources are mapped to the genome. This mapping can be performed via a variety of techniques. Some embodiments use programs such as SIM4 (Florea et al., A computer program for aligning a cDNA sequence with a genomic DNA sequence, Genome Research 8(9):967-74), which is incorporated by reference herein for all purposes, or BLAT (Kent, Genome Research 12(6):656-664). At 206, the evidence can be ranked in terms of its quality. For example, the mapped sequences can be ranked according to a variety of metrics, including the percentage of the sequence aligned, the amount of coverage, the number of exons spanned, and the number of cDNA gaps. Some embodiments retain only evidence that surpasses a user-defined quality measure. For example, only evidence with at least 98% identity and at least 50% coverage may be retained. Ranking and filtering data in this way can provide a high-quality result.
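By way of illustration only, the ranking and filtering at 206 can be sketched as follows. The `Alignment` record, its field names, and the sort order are assumptions made for this example and are not taken from the present teachings; only the 98% identity and 50% coverage thresholds come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """Hypothetical record for one evidence sequence mapped to the genome."""
    seq_id: str
    identity: float  # percent identity of the aligned region, 0-100
    coverage: float  # percent of the evidence sequence aligned, 0-100
    exons: int       # number of exons spanned
    gaps: int        # number of cDNA gaps


def filter_and_rank(alignments, min_identity=98.0, min_coverage=50.0):
    """Keep evidence meeting both thresholds, then rank best-first.

    The ranking key (identity, then coverage, then fewest gaps) is an
    assumed ordering chosen for illustration.
    """
    kept = [
        a for a in alignments
        if a.identity >= min_identity and a.coverage >= min_coverage
    ]
    return sorted(kept, key=lambda a: (-a.identity, -a.coverage, a.gaps))


if __name__ == "__main__":
    evidence = [
        Alignment("a", 99.5, 80.0, 5, 0),
        Alignment("b", 97.0, 90.0, 4, 1),  # fails the identity threshold
        Alignment("c", 99.5, 95.0, 6, 0),
        Alignment("d", 98.0, 40.0, 2, 0),  # fails the coverage threshold
    ]
    for aln in filter_and_rank(evidence):
        print(aln.seq_id, aln.identity, aln.coverage)
```

Both thresholds are exposed as parameters because the text describes the quality measure as user-defined.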
At 208, the evidence is collapsed into clusters. Some embodiments create clusters that define regions of transcriptional activity and can define gene boundaries. Some embodiments use intron spacing information to identify splice variants.
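As a minimal sketch of how intron spacing can be compared between two mapped sequences, the example below derives introns from exon coordinates and tests whether two sequences share the same intron structure. It assumes 1-based inclusive genomic coordinates and an exact-match comparison; the function names are hypothetical, and the actual comparison rules and tolerances used by a given embodiment may differ.

```python
def introns(exons):
    """Return the introns implied by a list of (start, end) exons.

    Assumes 1-based inclusive coordinates, so the intron between two
    consecutive exons runs from end + 1 to next_start - 1.
    """
    ordered = sorted(exons)
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


def same_intron_structure(exons_a, exons_b):
    """True if both sequences imply exactly the same introns.

    Only intron coordinates are compared, so differences at the outer
    ends of the first and last exons do not affect the result.
    """
    return introns(exons_a) == introns(exons_b)


if __name__ == "__main__":
    full = [(100, 200), (300, 400), (500, 600)]
    shorter_ends = [(150, 200), (300, 400), (500, 550)]
    skipped_exon = [(100, 200), (500, 600)]
    print(same_intron_structure(full, shorter_ends))  # same introns
    print(same_intron_structure(full, skipped_exon))  # middle exon skipped
```

Under these assumptions, two sequences with different intron structures at the same locus would be candidates for distinct splice variants.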
At 210, the clusters are output. This output can take a variety of forms such as a report or a visual representation.
Identification of Gene Boundaries
Various embodiments identify gene boundaries via clustering at 208. This can be accomplished by determining the relationships among all sequences selected earlier. Some embodiments establish these relationships by performing pair-wise, all-against-all sequence comparison and then examining sequence characteristics such as a sequence's genomic location, matching identity and coverage on the reference genome, numbers of exons and introns, overlap type (exon or intron overlap), overlap length, percentage of overlapping length, the number of overlapped exons, or the number of exactly matched splice sites for any two selected sequences. Given this information, various embodiments define a gene model. This can be accomplished by establishing a criterion which, when met, establishes that two sequences belong to the same gene (called a gene-link). In some cases the criterion is that at least one overlapping exon on the same strand at the same genome location is shared between two sequences. One skilled in the art will appreciate that other criteria can be used and/or parameters, such as the amount of overlap, adjusted depending on the level of stringency desired. Some embodiments further cluster by identifying all sequences that are linked to another sequence by a gene-link. These sequences can then be assigned to a single cluster. In some instances, if no gene-link exists for a given sequence, the sequence itself can form a singleton cluster.
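The gene-link clustering described above can be sketched, for illustration only, as single-linkage clustering over mapped sequences. The input layout (a dictionary from sequence identifier to chromosome, strand, and exon list), the closed-interval overlap test, and the union-find bookkeeping are assumptions of this example, not details stated in the present teachings.

```python
from collections import defaultdict


def exons_overlap(exons_a, exons_b):
    """True if any exon in exons_a overlaps any exon in exons_b.

    Exons are (start, end) pairs treated as closed intervals.
    """
    return any(
        s1 <= e2 and s2 <= e1
        for s1, e1 in exons_a
        for s2, e2 in exons_b
    )


def cluster_by_gene_link(seqs):
    """Group sequences connected, directly or transitively, by a gene-link.

    seqs maps a sequence id to (chromosome, strand, exon list). Two
    sequences are gene-linked when they lie on the same chromosome and
    strand and share at least one overlapping exon.
    """
    parent = {sid: sid for sid in seqs}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ids = list(seqs)
    for i, a in enumerate(ids):
        chrom_a, strand_a, exons_a = seqs[a]
        for b in ids[i + 1:]:
            chrom_b, strand_b, exons_b = seqs[b]
            if (chrom_a == chrom_b and strand_a == strand_b
                    and exons_overlap(exons_a, exons_b)):
                parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for sid in ids:
        clusters[find(sid)].append(sid)
    # Sequences with no gene-link remain as singleton clusters.
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    example = {
        "m1": ("chr1", "+", [(100, 200), (300, 400)]),
        "e1": ("chr1", "+", [(350, 450), (600, 700)]),
        "e2": ("chr1", "+", [(650, 800)]),
        "e3": ("chr1", "-", [(100, 200)]),
        "e4": ("chr2", "+", [(100, 200)]),
    }
    for cluster in cluster_by_gene_link(example):
        print(cluster)
```

The all-against-all loop is quadratic in the number of sequences; a production pipeline would likely sort by genomic position or use an interval index first, but that choice is outside what the text specifies.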
Various embodiments split clusters if two or more well-defined RefSeq loci are linked through non-RefSeq evidence sequences. In this case, overlapped sequences can be assigned to appropriate clusters according to their mapping qualities.
The boundary of a given cluster can be defined as the maximum span of all evidence sequences that form the cluster. In some embodiments, if a cluster contains a known gene such as a RefSeq locus, then the boundary of the cluster can be referred to as a gene boundary.
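The boundary definition above amounts to taking the smallest start and largest end over all evidence in a cluster. A minimal sketch follows, assuming each evidence sequence is given as a list of (start, end) exon coordinates on the same chromosome and strand; the function name is hypothetical.

```python
def cluster_boundary(members):
    """Return (start, end) of the maximum span of a cluster.

    members is a non-empty list of evidence sequences, each a non-empty
    list of (start, end) exon coordinates.
    """
    starts = [start for exons in members for start, _ in exons]
    ends = [end for exons in members for _, end in exons]
    return min(starts), max(ends)


if __name__ == "__main__":
    cluster = [
        [(100, 200), (300, 400)],
        [(350, 450), (600, 700)],
        [(650, 800)],
    ]
    print(cluster_boundary(cluster))
```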
Identification of Splice Variants
Various embodiments identify splice variants at 208 in
The impact of the user-defined threshold on short sequence exclusion can be evaluated by determining how frequently introns and exons are below the cutoff. This can be accomplished via the histograms contained in
In some embodiments the ends of a sequence are not used in the clustering process. Because of this, the end of a partial-length sequence does not need to extend all the way to the next intron-exon boundary of the longer sequence it is matched to. For example, in
Some embodiments provide a method of viewing the clusters.
For comparison purposes, the numbers of transcribed genes for individual chromosomes, together with the numbers of Ensembl genes, RefSeq Loci, annotated by various groups since 2000 are listed in
In
The rules for splice variant analysis were encoded in an automated computer pipeline. Prior to clustering the data, the rules were verified by passing the human Celera Transcripts (hCTs) through the pipeline. Because Celera Transcripts have undergone human curation, they are of high quality and the number of times Celera Transcripts cluster together should be low. The 62,472 hCTs that successfully mapped to the genome clustered into 59,977 transcript clusters. Of these clusters, 2599 contained more than one hCT, giving a total of 2977 hCT pairs. Each of these pairs was grouped according to the structural differences they possessed. 264 pairs had significant structural differences, such as a missing exon or intron of less than twenty base pairs or a splice site that varied by more than twenty base pairs (1402 and 1404 on
Transcript clustering was attempted under 4 different sets of assumptions about sequence length. These are:
Results for these runs are contained in
Computer System Implementation
Computer system 500 may be coupled via bus 502 to a display 512, such as a cathode ray tube (CRT) or liquid crystal display (LCD), for displaying information to a computer user. An input device 514, including alphanumeric and other keys, is coupled to bus 502 for communicating information and command selections to processor 504. Another type of user input device is cursor control 516, such as a mouse, a trackball or cursor direction keys for communicating direction information and command selections to processor 504 and for controlling cursor movement on display 512. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane.
Consistent with certain embodiments of the present teachings, functions including evidence input, filtering, mapping, clustering, and output can be performed and results displayed by computer system 500 in response to processor 504 executing one or more sequences of one or more instructions contained in memory 506. Such instructions may be read into memory 506 from another computer-readable medium, such as storage device 510. Execution of the sequences of instructions contained in memory 506 causes processor 504 to perform the process states described herein. Alternatively, hard-wired circuitry may be used in place of or in combination with software instructions to implement the invention. Thus, implementations of the present teachings are not limited to any specific combination of hardware circuitry and software.
The term “computer-readable medium” as used herein refers to any media that participates in providing instructions to processor 504 for execution. Such a medium may take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Non-volatile media includes, for example, optical or magnetic disks, such as storage device 510. Volatile media includes dynamic memory, such as memory 506. Transmission media includes coaxial cables, copper wire, and fiber optics, including the wires that comprise bus 502. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.
Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read.
Various forms of computer-readable media may be involved in carrying one or more sequences of one or more instructions to processor 504 for execution. For example, the instructions may initially be carried on a magnetic disk of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 500 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red detector coupled to bus 502 can receive the data carried in the infra-red signal and place the data on bus 502. Bus 502 carries the data to memory 506, from which processor 504 retrieves and executes the instructions. The instructions received by memory 506 may optionally be stored on storage device 510 either before or after execution by processor 504.
The foregoing description has been presented for purposes of illustration and description. It is not exhaustive and does not limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings or may be acquired from practice. Additionally, the described implementation includes software but the present teachings may be implemented as a combination of hardware and software or in hardware alone. The present teachings may be implemented with both object-oriented and non-object-oriented programming systems.
This application claims priority to U.S. Provisional Patent Application No. 60/504309, filed on Sep. 18, 2003, which is incorporated herein in its entirety by reference.
Number | Date | Country
---|---|---
60504309 | Sep 2003 | US