DNA microarrays have a large and growing role in life sciences research.
They are used extensively in experiments to determine the levels of mRNA expression in a sample being tested, and as such allow the DNA or genes of the sample to be sequenced.
In general, a microarray used in such experiments includes a plurality of oligo spots which contain DNA. The spots are disposed on a substrate (e.g., a slide or tray), and are preferably ordered in an array or grid formation. Preferably, each spot of the microarray is provided with a unique DNA sequence. In other words, each spot has a different DNA sequence than the other spots in the microarray. Because each spot will hybridize only to its complementary DNA strand, when each spot is probed by the sample being tested, the interaction therebetween can be used to determine the level at which complementary DNA is present in the sample being tested (and thus is indicative of the mRNA or gene sequence for the sample).
To prepare the sample probe, the mRNA material of the sample, or its equivalent, is isolated and then labeled with a dye, e.g., a fluorescent reporter group. The labeled mRNA sample is then combined with the spots of the microarray to form hybridization complexes between the spot materials and the mRNA materials that have complementary or identical sequences. Any formed complexes are detected by using a scanner or like device to measure fluorescent signals emitted from specific spots in the microarray. Since the position of each spot and the sequence of its material are known, those spots emitting fluorescent signals are indicative of the mRNA of the sample being tested. Due to the number of spots that can be provided in the microarray and experimented with in parallel, observations can be made on a whole-genome scale.
Since microarrays are used to address biological problems on a whole-genome scale, microarray applications are generally voluminous and computationally intensive. Also, working with data relating to microarray applications requires various database utilities, such as those relating to microarray production, sample production, hybridization, process monitoring, data storage and retrieval, image analysis, statistical analysis, data mining, and graphical data presentation. Therefore, there is a need for an efficient, comprehensive and integrated computer-based microarray database system with diverse applications spanning the multiple steps of the microarray process. It is to such a system, and methods for implementing and using the same, that the present invention is directed.
Referring now to the drawings, and in particular to
As shown in
The server 18 is in communication with a plurality of user systems 22. In general, each of the user systems 22 is a computer system associated with one of the users 14. Each user 14 utilizes its user system 22 to transmit information to and receive information from the server 18 of the MDB system 10. The term “transmit,” and derivations thereof, as used herein generally means to pass or to send.
Preferably, each user system 22 includes at least one output device 26 (e.g., a monitor, display, screen, speaker, or printer) and at least one input device 30 (e.g., a keyboard, mouse, keypad, joystick, microphone, or touch-screen). Further, each user system 22 preferably operates a web-browser or other program which allows for the retrieving and viewing of electronic information via the Internet and/or the World Wide Web.
The server 18 of the MDB system 10 is also in communication with a database 34. In general, the database 34 is used for storing microarray research information or data which is accessible by the server 18. In one embodiment, the database 34 is a relational database, such as an Oracle 9i database. Generally, the database 34 is stored on a storage medium in a storage device (not shown), such as a hard disk drive. Also, the database 34 can be local to or remote from the server 18.
In one embodiment, the server 18 is established on the Internet so as to be publicly available. For example, the server 18 can include a web server computer system, such as an Apache web server computer system, which hosts a website on the World Wide Web so that the server 18 can be accessed using the HTTP protocol. The users 14 can then utilize their corresponding user systems 22 to initiate a web browser and connect to the server 18 via the website. The establishing of computer systems and websites on the Internet and/or World Wide Web is known to those skilled in the art and therefore no further discussion is deemed necessary to teach one skilled in the art how to establish the server 18 of the MDB system 10 on the Internet and/or the World Wide Web.
The server 18 communicates with the user systems 22 and the database 34 via communication links 38. The communication links 38 can be any suitable communication link which permits electronic communications, such as extra-computer communication systems, intra-computer communication systems, internal buses, local area networks, wide area networks, Internet networks, point-to-point shared and dedicated communications, infrared links, microwave links, telephone links, cable TV links, satellite links, radio links, fiber optic links, cable links and/or any other suitable communication device, system or network, or combinations thereof. Preferably, the communication links 38 between the server 18 and the user systems 22 are Internet-based links which allow electronic information to be transferred between the server 18 and the user systems 22 via the Internet.
In operation, the server 18 of the MDB system 10 allows a user to input, store, manage, retrieve, process, analyze, display, output or otherwise utilize data relating to microarray experiments. To facilitate such operations, the server 18 provides a user interface to each user 14 via its user system 22. In one embodiment, the user interface is a web interface implemented with PHP, JavaScript, Java Applet, and HTML based programs. Further, the server 18 in one embodiment uses the PHP server-side scripting language to create R scripts based on information received from the user systems 22 (e.g., user requests) and to transform information to be transmitted to the user systems 22 (e.g., analysis results and image files) into a web-compatible format.
In general, the user interface is utilized to prompt and guide the user 14 through a step-by-step process for creating data entries and keeping records in the MDB system 10 for one or more microarray experiments. Preferably, the logical progression of data entry flow via the user interface is designed from the user's perspective. Also, the data entry design of the user interface is preferably constructed so as to generally prevent a user 14 from making invalid entries or entering data in the wrong order. This prevents data from being entered haphazardly from any point during the process, and thus users 14 cannot hastily or inadvertently gloss over data details. Rather, the data is complete and consistent in format from experiment to experiment. Maintaining such order and uniformity in the record keeping process prevents future confusion for other users 14 (e.g., other scientists or managers) when reviewing experiment details. Various embodiments of the user interface are described in further detail below with reference to the various applications of the MDB system 10.
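By way of illustration only, the ordered data entry behavior described above might be sketched as follows. The class name, the step names, and the validation rule (non-empty text) are assumptions made for this example and are not taken from the described user interface.

```python
# Hypothetical step names; the actual entry sequence is defined by the user interface.
ENTRY_STEPS = ["project", "experiment_set", "experiment"]


class OrderedEntryForm:
    """Accepts entries only in the prescribed order and rejects empty values."""

    def __init__(self, steps):
        self.steps = list(steps)
        self.entries = {}

    def next_step(self):
        """Return the name of the next step awaiting input, or None when complete."""
        if len(self.entries) < len(self.steps):
            return self.steps[len(self.entries)]
        return None

    def submit(self, step, value):
        """Record a value for the expected step; raise ValueError otherwise."""
        expected = self.next_step()
        if expected is None:
            raise ValueError("all steps are already complete")
        if step != expected:
            raise ValueError(f"expected step {expected!r}, got {step!r}")
        if not value or not value.strip():
            raise ValueError(f"step {step!r} requires a non-empty value")
        self.entries[step] = value.strip()
```

In such a sketch, an attempt to submit an experiment before a project has been entered is rejected, so records cannot be created out of order.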
Because the operation of the MDB system 10 is similar for each user 14, for purposes of brevity and clarity of understanding, the MDB system 10 is generally discussed herein with reference to one of the users 14 and its corresponding user system 22. Also, because operation of the MDB system 10 is similar for each microarray for which an experiment is performed and information is included in the MDB system 10, the MDB system 10 is generally discussed herein with reference to one microarray. From the discussions with reference to the one user 14 and the one microarray, it should be apparent to those of ordinary skill in the art how to construct and apply the MDB system 10 for a plurality of users 14 and/or a plurality of microarrays.
In one embodiment, the server 18 of the MDB system 10 has an “Array Production Management” application 40, a “Laboratory Information Management System” (LIMS) application 42, a “Data Import” application 44, a “Data Analysis” application 48, and a “Data Export” application 50, as shown in
As mentioned above, the Array Production Management application 40 of the present invention is, in general, used to track information relating to microarrays before any experimentation has been conducted. More particularly, the Array Production Management application 40 of the present invention is used to receive, store and manage array information associated with the production or printing of experimental microarrays, which is also referred to herein as spotting run information. Spotting run information can include aspects of array design, probe formation, plate source information, microarray spot locations, printing chemistry, print-run information, array printer configuration, and gene annotation for the microarray, for example.
Generally, spotting run information is unique because the DNA oligo spots of a microarray and the date and time of creation, taken together, are not duplicated. As such, the spotting run information can provide a unique access point or identifier which can be used to link to other aspects of information associated with the microarray within the MDB system 10. Further, each individual DNA oligo spot can be used to link to specific information associated with that particular spot in the microarray. As such, in one embodiment of the present invention, the DNA oligo spots are used as a “starting point” to create entries wherein data can be uploaded and stored in the MDB system 10 (even before such data has been created and made available).
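As a minimal sketch of how the combination of oligo spots and creation time could serve as an identifier, the following example derives a key from both. The function name, the use of a SHA-256 digest, and the 16-character truncation are assumptions chosen for illustration; the manner in which the MDB system 10 actually forms its identifiers is not specified here.

```python
import hashlib
from datetime import datetime


def spotting_run_key(oligo_ids, created_at):
    """Derive a hypothetical identifier from the ordered spot contents and the creation time.

    oligo_ids  -- sequence of oligo identifiers, in spot order
    created_at -- datetime at which the spotting run was created
    """
    payload = "|".join(oligo_ids) + "@" + created_at.isoformat()
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

Two runs with the same spots but different creation times yield different keys in this sketch, which mirrors the observation that the spots and the timestamp are only unique when taken together.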
In one embodiment, the Array Production Management application 40 includes an “Array Production List” module 52, an “Array Layout” module 54, a “Plates List” module 56, an “Individual Plate” module 58, and a “Gene Annotation” module 60, as shown in
Within the Array Production List module 52, array production information is received and stored in a “Print Run List” table 62 having a plurality of data entry fields which can be displayed via the user interface to the user 14, as shown for example in
Within the Array Layout module 54, array layout information relating to the correspondence between plates and array location is received and stored in an “Array Design Info” table 64 having a plurality of data entry fields, which can be displayed to the user 14 via the user interface as shown for example in
Further, as shown in
Within the Plates List module 56, plate list information relating to a list of plates in the microarray is received and stored in a “Plate List” table 66 having a plurality of data entry fields, which can be displayed to the user 14 via the user interface, as shown for example in
Within the Individual Plate module 58, plate information for each individual plate in the plate list is received and stored in a “Plate Info” table 68 for that plate (only one being shown in
The Plate Info table 68 includes a “Plate ID” field which provides the plate number identifier for the plate. A “Plate Row” field provides the plate row number, and a “Plate Column” field provides the plate column number. An “Oligo_ID” field provides an oligo identifier for the oligo in a well of the plate. A “b_number” field provides a b number for the oligo in a well of the plate. A “z_number” field provides a z number for the oligo in a well of the plate. An “Ecs_number” field provides an ecs number for the oligo in a well of the plate. A “Gene_Symbol” field provides the gene symbol for the oligo in a well of the plate. An “Oligo_Length” field provides a length of the oligo in a well of the plate. A “TM” field provides the melting temperature for the oligo in a well of the plate. And a “Description” field contains a general description for the plate.
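The fields of the Plate Info table 68 listed above can be pictured as one record per well. The following is a sketch only: the field names follow the table, but the class name, the Python types, and the parsing helper are assumptions, since the data types of the fields are not stated.

```python
from dataclasses import dataclass


@dataclass
class PlateWellRecord:
    """One row of a Plate Info table (types are assumed, not specified)."""
    plate_id: str
    plate_row: int
    plate_column: int
    oligo_id: str
    b_number: str
    z_number: str
    ecs_number: str
    gene_symbol: str
    oligo_length: int
    tm: float              # melting temperature of the oligo
    description: str = ""


def record_from_row(row):
    """Build a record from a mapping keyed by the table's field names,
    for example one line of an uploaded spreadsheet file."""
    return PlateWellRecord(
        plate_id=row["Plate ID"],
        plate_row=int(row["Plate Row"]),
        plate_column=int(row["Plate Column"]),
        oligo_id=row["Oligo_ID"],
        b_number=row["b_number"],
        z_number=row["z_number"],
        ecs_number=row["Ecs_number"],
        gene_symbol=row["Gene_Symbol"],
        oligo_length=int(row["Oligo_Length"]),
        tm=float(row["TM"]),
        description=row.get("Description", ""),
    )
```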
Within the Gene Annotation module 60, annotation information for the annotation of genes associated with the microarray is received and stored in a “Gene Annotation” table 70 having a plurality of data entry fields, which can be displayed to the user 14 via the user interface as shown for example in
In one embodiment, at least a portion of the array production information of the Array Production List module 52, the array layout information of the Array Layout module 54, the plate list information of the Plates List module 56, the plate information of the Individual Plate module 58, and/or the annotation information of the Gene Annotation module 60 of the Array Production Management application 40 is inputted by uploading one or more data files (e.g., spreadsheet files) containing such information. For example, for membrane and GeneChip arrays, data files are provided by the manufacturer and can be uploaded to the server 18, for example by the user 14 via the user system 22, or be otherwise made accessible to the server 18. However, it should be understood that the array information can be inputted in the Array Production Management application 40.
As mentioned above, the LIMS application 42 of the present invention is, in general, used to track information relating to samples and hybridization processes used to hybridize microarrays. More particularly, the LIMS application 42 is used to receive, store and manage experimental information relating to the microarray experimentation process, including information relating to the protocols that were followed, how particular samples were produced and treated (e.g., what organism and labeling materials were used), and the hybridization process, for example. As such, the LIMS application 42 of the MDB system 10 functions as a virtual or digital research notebook where records can be made in a uniform and consistent manner. Further, the LIMS application 42 can be used to manage and track probe source plates and printed microarrays (e.g., using barcode identifiers), and as such allows the MDB system 10 to function also as a material management system.
In one embodiment, the LIMS application 42 allows the user 14 to input experimental information for one or more experiments via the user interface in an automated process in order to save the user time in entering or importing data into the server 18. This process works for single or replicate experiments (where a plurality of replicates are generally preprocessed and averaged to make up a “single” experiment, i.e., a data set for a time point or treatment). However, the automated process is especially useful for importing information for replicate experiments.
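To illustrate how a plurality of replicates might be combined into a “single” experiment, the following sketch averages per-spot values across replicates. It assumes, purely for illustration, that each replicate is a mapping from a spot identifier to an already preprocessed intensity and that a plain arithmetic mean is used; the actual preprocessing and averaging steps are not specified in this description.

```python
from statistics import mean


def average_replicates(replicates):
    """Average per-spot values over the spots present in every replicate.

    replicates -- non-empty list of dicts mapping spot id -> preprocessed intensity
    """
    if not replicates:
        raise ValueError("at least one replicate is required")
    common = set(replicates[0])
    for rep in replicates[1:]:
        common &= set(rep)
    return {spot: mean(rep[spot] for rep in replicates) for spot in sorted(common)}
```

Spots missing from any replicate are dropped in this sketch rather than averaged over fewer values; that choice is an assumption and other treatments are equally possible.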
In general, the user 14 is allowed to input information regarding a project and an experiment set with which an experiment is associated before proceeding with the details of the experimental process for the experiment. Such an organizational format corresponds to the intuitive levels of microarray experimentation. Also, because the LIMS application 42 organizes the information for microarrays in a logical, hierarchical format, it makes it easier for the users 14 to find projects, experiment sets, and experiment information. Further, if an experiment is a replicate, and thus can be associated with a pre-existing project or experiment set, the LIMS application 42 can automatically link repetitive information between experiment replicates. This eliminates the need for entering repetitive information by the user 14.
The automated process of the LIMS application 42 is referred to herein as a “Create” function. One embodiment of the flow of the Create function for a single or first experiment is shown in
For purposes of clarity of understanding, the flow and operation of the Create function will first be described below with reference to a first experiment wherein all new information is inputted. Then the flow and operation of the Create function will be described with reference to replicate experiments.
As shown in
In a next step of the Create function, an experiment set is defined. In one embodiment, the LIMS application 42 includes an ExpSet module 102, which functions as a second level data container. Within the ExpSet module 102, experiment set information is received and stored relating to at least one experiment set of the project that experiments can be associated with or assigned to. The term “experiment set” generally refers to a set of experiments that collectively consists of several measurements corresponding to a series of time points or a series of similar treatments. The experiment set information can include an identifying experiment set name, number or other descriptive identifier, for example. In one embodiment, the experiment set information is inputted into the ExpSet module 102 via an input field in the user interface (not shown). However, it should be understood that any input means can be used to input the experiment set information.
After the experiment set is defined for the Create function, the user 14 is prompted to input experiment information relating to an experiment associated with the experiment set. The term “experiment” generally refers to a process performed at a single time point or with a single treatment (for example, a time point or treatment that the user 14 may replicate for validation of a result). In general, the experiment information includes details relating to the formation of a labeled extract sample and the hybridization process for the experiment, as discussed further below. Such information is encapsulated in an Experiment module 104, which functions as a third level data container in the LIMS application 42. In other words, within the Experiment module 104, experiment information is received and stored relating to at least one experiment.
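The three levels of data containers described above (a project, its experiment sets, and their experiments) can be sketched as nested records. The class names, attributes, and the helper function are hypothetical and serve only to show the containment; treating the project as the first level is inferred from the second and third levels named above.

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:          # third level container
    name: str


@dataclass
class ExperimentSet:       # second level container
    name: str
    experiments: list = field(default_factory=list)


@dataclass
class Project:             # first level container (inferred)
    name: str
    experiment_sets: list = field(default_factory=list)


def add_experiment(project, set_name, experiment_name):
    """Attach an experiment to the named set, creating the set if it does not exist."""
    for exp_set in project.experiment_sets:
        if exp_set.name == set_name:
            break
    else:
        exp_set = ExperimentSet(set_name)
        project.experiment_sets.append(exp_set)
    experiment = Experiment(experiment_name)
    exp_set.experiments.append(experiment)
    return experiment
```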
In one embodiment, the information encapsulated in the Experiment module 104 of the LIMS application 42 is received and stored in separate modules, which include a “BioSample Protocol” module 110, an “Extract Protocol” module 112, a “Label Protocol” module 114, a “Hybridization Protocol” module 116, a “BioSample Source” module 118, a “BioSample” module 120, an “Extract Sample” module 122, a “Label Sample” module 124, and a “Hybridization” module 126, as shown in
Within the Biosample Protocol module 110, biosample protocol information is received and stored relating to the protocol used to produce a biosample from a biosample source. In general, the biosample protocol information is indicative of how the biosample associated with the experiment was treated during production, such as for example the culture conditions for the biosample. The biosample protocol information can include a name, number or other descriptive identifier for the biosample protocol, a description of the biosample protocol, an identity of a submitter of the biosample protocol, a date the biosample protocol was submitted, an identity of a last modifier, and a date of the last modification, for example. In one embodiment, the biosample protocol information is inputted into the Biosample Protocol Module 110 via a plurality of input fields in the user interface (not shown). Also, the biosample protocol information inputted for the biosample protocol can be displayed to the user 14 via the user interface, as shown for example in
Within the Extract Protocol module 112, extract protocol information is received and stored relating to the protocol used to extract a sample. In general, the extract protocol information is indicative of the process by which RNA samples were extracted from the biosample. The extract protocol information can include a name, number or other descriptive identifier for the extract protocol, a description of the extract protocol, an identity of a submitter of the extract protocol, a date the extract protocol was submitted, an identity of a last modifier, and a date of the last modification, for example. In one embodiment, the extract protocol information is inputted into the Extract Protocol module 112 via a plurality of input fields in the user interface (not shown). Also, the extract protocol information inputted for the extract protocol can be displayed to the user 14 via the user interface, as shown for example in
Within the Label Protocol module 114, label protocol information is received and stored relating to the protocol used to label the biosample associated with the replicate. In general, the label protocol information is indicative of the process by which RNA samples are labeled with dyes, radioisotopes, biotin, etc., so as to provide a labeled extracted biosample. The label protocol information can include a name, number or other descriptive identifier for the label protocol, a description of the label protocol, an identity of a submitter of the label protocol, a date the label protocol was submitted, an identity of a last modifier, and a date of the last modification, for example. In one embodiment, the label protocol information is inputted into the Label Protocol module 114 via a plurality of input fields in the user interface (not shown). Also, the label protocol information inputted for the label protocol can be displayed to the user 14 via the user interface, as shown for example in
Within the Hybridization Protocol module 116, hybridization protocol information is received and stored relating to the protocol used for hybridization. In general, the hybridization protocol information is indicative of the process by which labeled samples and microarray spots are hybridized. The hybridization protocol information can include a name, number or other descriptive identifier for the hybridization protocol, a description of the hybridization protocol, an identity of a submitter of the hybridization protocol, a date the hybridization protocol was submitted, an identity of a last modifier, and a date of the last modification, for example. In one embodiment, the hybridization protocol information is inputted into the Hybridization Protocol module 116 via a plurality of input fields in the user interface (not shown). Also, the hybridization protocol information inputted for the hybridization protocol can be displayed to the user 14 via the user interface, as shown for example in
Within the Biosample Source module 118, biosample source information is received and stored relating to a biosample source. In general, the biosample source information is indicative of an organism, a strain, a genotype, etc. The biosample source information can include a name, number or other descriptive identifier for the biosample source, a description of the biosample source, an identification of an organism associated with the biosample source, an identification of a parent associated with the biosample source, an identification of a strain associated with the biosample source, and an identification of a genotype associated with the biosample source, for example. In one embodiment, the biosample source information is inputted into the Biosample Source module 118 via a plurality of input fields in the user interface (not shown). Also, the biosample source information inputted for the biosample source can be displayed to the user 14 via the user interface, as shown for example in
Within the Biosample module 120, biosample information is received and stored relating to a biosample. In general, the biosample information is indicative of a biological laboratory experiment from which an RNA sample is extracted for a biosample source, such as a bacterial culture or tissue culture. The biosample information can include a name, number or other descriptive identifier for the biosample, a description of the biosample, a date the biosample was produced, an identification of the biosample protocol used to produce the biosample (such as the biosample protocol name), and an identification of the biosample source used to produce the biosample (such as the biosample source name), for example. By including the identification of the biosample protocol and the biosample source, it can be seen that the biosample information in the Biosample module 120 is linked to the biosample protocol information in the Biosample Protocol module 110 for the corresponding biosample protocol and to the biosample source information in the Biosample Source module 118 for the corresponding biosample source, respectively.
In one embodiment, the biosample information is inputted into the Biosample module 120 via a plurality of input fields and/or lists in the user interface (not shown). Also, the biosample information inputted for the biosample can be displayed to the user 14 via the user interface, as shown for example in
Within the Extract Sample module 122, extract sample information is received and stored relating to an extracted sample. In general, the extract sample information is indicative of an extracted RNA representing gene expression in a biological experiment. The extract sample information can include a name, number or other descriptive identifier for the extracted sample, a date the extracted sample was extracted, a description of the extracted sample, an identification of the extract protocol used to produce the extracted sample (such as the extract protocol name), and an identification of the biosample used to produce the extracted sample (such as the biosample name), for example. By including the identification of the extract protocol and the biosample, it can be seen that the extract sample information in the Extract Sample module 122 is linked to the extract protocol information in the Extract Protocol module 112 for the corresponding extract protocol and to the biosample information in the Biosample module 120 for the corresponding biosample, respectively.
In one embodiment, the extract sample information is inputted into the Extract Sample module 122 via a plurality of input fields and/or lists in the user interface (not shown). Also, the extract sample information inputted for the extracted sample can be displayed to the user 14 via the user interface, as shown for example in
Within the Label Sample module 124, label sample information is received and stored relating to a labeled extracted sample. In general, the label sample information is indicative of a labeled RNA sample used to hybridize to probes. The label sample information can include a name, number or other descriptive identifier for the labeled extracted sample, a date the extracted sample was labeled, a description of the label used, a label dye type, a label dye amount, an amount of cDNA synthesized, an identification of the label protocol used to produce the labeled extracted sample (such as the label protocol name), an identification of the extracted sample used to produce the labeled extracted sample (such as the extracted sample name), and an identification of the biosample used to produce the extracted sample (such as the biosample name), for example. By including the identification of the label protocol, the extracted sample and the biosample, it can be seen that the label sample information in the Label Sample module 124 is linked to the label protocol information in the Label Protocol module 114 for the corresponding label protocol, to the extract sample information in the Extract Sample module 122 for the corresponding extracted sample, and to the biosample information in the Biosample module 120 for the corresponding biosample, respectively.
In one embodiment, the label sample information is inputted into the Label Sample module 124 via a plurality of input fields and lists in a “Create Label Sample” input means 150 in the user interface, as shown for example in
Within the Hybridization module 126, hybridization information is received and stored relating to a hybridization process (i.e., the process by which base pairs are formed between complementary regions of two strands of DNA). The hybridization information can include a name, number or other descriptive identifier for the hybridization, a date the hybridization was performed, a description of the hybridization, an identification of the hybridization protocol used for the hybridization (such as the hybridization protocol name), an identification of the labeled extracted sample used for the hybridization (such as the labeled extracted sample name), an amount of the labeled extracted sample used for the hybridization, and an identification of at least a portion of a microarray used for the hybridization (such as the print run number identifier and the slide number), for example. By including the identification of the hybridization protocol and the labeled extracted sample, it can be seen that the hybridization information in the Hybridization module 126 is linked to the hybridization protocol information in the Hybridization Protocol module 116 for the corresponding hybridization protocol and to the label sample information in the Label Sample module 124 for the corresponding labeled extracted sample, respectively. Additionally, it can be seen that by including the identification of the portion of the microarray used for the hybridization, the hybridization information in the Hybridization module 126 for the hybridization is also linked to the array information in the Array Production Management application 40 (discussed above) for the corresponding microarray, and in particular to the array layout information in the Array Layout module 54 (which is also linked to the Array Production List module 52, the Plates List module 56, the Individual Plate module 58 and the Gene Annotation module 60).
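The linking of hybridization information to its protocol and labeled extracted sample by name can be sketched with relational tables. The example below uses Python's built-in sqlite3 module as a stand-in for the relational database 34; the table names, column names, and helper functions are assumptions for illustration and do not reproduce the actual schema.

```python
import sqlite3

# Hypothetical, heavily reduced schema: each hybridization row names its
# protocol and labeled extracted sample, and identifies the array portion used.
SCHEMA = """
CREATE TABLE hybridization_protocol (name TEXT PRIMARY KEY);
CREATE TABLE label_sample (name TEXT PRIMARY KEY);
CREATE TABLE hybridization (
    name              TEXT PRIMARY KEY,
    protocol_name     TEXT NOT NULL REFERENCES hybridization_protocol(name),
    label_sample_name TEXT NOT NULL REFERENCES label_sample(name),
    print_run_id      TEXT,
    slide_number      INTEGER
);
"""


def open_demo_db():
    """Create an in-memory database with link enforcement switched on."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks references only when enabled
    conn.executescript(SCHEMA)
    return conn


def add_hybridization(conn, name, protocol_name, label_sample_name,
                      print_run_id, slide_number):
    """Insert a hybridization; fails if it names an unknown protocol or sample."""
    conn.execute(
        "INSERT INTO hybridization VALUES (?, ?, ?, ?, ?)",
        (name, protocol_name, label_sample_name, print_run_id, slide_number),
    )
```

With references enforced in this way, a hybridization entry that names a protocol or labeled sample not yet recorded raises `sqlite3.IntegrityError`, which is one way the links between modules could be kept consistent.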
In one embodiment, the hybridization information is inputted into the Hybridization module 126 via a plurality of input fields and/or lists in the user interface (not shown). Also, the hybridization information inputted for the hybridization process can be displayed to the user 14 via the user interface, as shown for example in
It is important to note that the Label Sample module 124 and the Hybridization module 126 of the LIMS application 42 of the present invention allow the user 14 to track information relating to the total amount of a labeled extracted sample produced and use of a certain quantity (e.g., a number of picomoles) of the labeled extracted sample for one or more experiments. Thus, the LIMS application 42 can be utilized as a record keeping system which aids the user 14 (or other users 14) in planning experiments and reviewing an experiment after it is completed. In other words, the LIMS application 42 can function as a type of inventory or quality control means that subtracts the amount used from each labeled extracted sample so as to keep an updated record of an amount of remaining labeled extracted sample. Such a quality control feature can be used to indicate or warn the user 14 (or other users 14) if an impossible or illegitimate experiment has been claimed as being performed when the experiment is indicated as having been performed with materials that were expended in previous experiments. Further, such a feature can be used to provide users 14 with an interactive look-up listing of labeled extracted samples from which information related to various labeled extracted samples can be readily accessed and viewed. This listing can be used for example to plan future work with one or more particular labeled extracted samples.
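The inventory behavior described above, in which the amount used is subtracted from each labeled extracted sample and an impossible experiment is flagged, can be sketched as follows. The class name, attribute names, and the use of an exception as the warning mechanism are assumptions made for this example.

```python
class LabeledSampleStock:
    """Tracks the remaining amount (in picomoles) of one labeled extracted sample."""

    def __init__(self, name, total_pmol):
        if total_pmol < 0:
            raise ValueError("total amount cannot be negative")
        self.name = name
        self.remaining_pmol = total_pmol
        self.usage_log = []  # (experiment name, amount used) pairs

    def use(self, experiment_name, amount_pmol):
        """Subtract the amount used; reject use of material already expended."""
        if amount_pmol <= 0:
            raise ValueError("amount used must be positive")
        if amount_pmol > self.remaining_pmol:
            raise ValueError(
                f"{experiment_name!r} claims {amount_pmol} pmol of {self.name!r}, "
                f"but only {self.remaining_pmol} pmol remain"
            )
        self.remaining_pmol -= amount_pmol
        self.usage_log.append((experiment_name, amount_pmol))
        return self.remaining_pmol
```

The usage log in this sketch doubles as the look-up listing mentioned above, since it records which experiments drew on a given labeled extracted sample and in what quantity.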
Once all the information has been inputted into the Create function for the experiment, the LIMS application 42 creates corresponding data locations so that hybridization experiment results (e.g., a data set and image) can be individually uploaded for the experiment, as discussed further below with respect to the Data Import application 44.
As seen from the discussion above, the Create function of the LIMS application 42 allows a user to associate an experiment with an experiment set and project of choice. Further, each experiment is linked to information relating to its biosample source, its biosample and biosample protocol, its extracted sample and extract protocol, its labeled extracted biosample and label protocol, and its hybridization and hybridization protocol. When a new experiment is the first to be associated with a new project, the option is offered to create all entities of the modules of the LIMS application 42. However, if a new experiment is a replicate associated with an existing project that includes at least one other existing experiment, information that is expected to be common between the new experiment and the at least one existing experiment can be automatically entered for the new experiment by the server 18 of the MDB system 10. For example, information within the Project module 100, ExpSet module 102 and Experiment module 104 can be automatically retrieved for a new replicate experiment from at least one corresponding existing experiment once a relationship is indicated between the replicate experiment and the existing experiment. Further, information within the Biosample module 120, Extract Sample module 122, Label Sample module 124 and the Hybridization module 126 can be automatically retrieved for the new replicate experiment from the at least one corresponding existing experiment based on information provided in the Biosample Source module 118, Biosample Protocol module 110, Extract Protocol module 112, Label Protocol module 114 and the Hybridization Protocol module 116 for the replicate experiment, since the information within these modules is linked.
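The automatic entry of common information for a replicate can be sketched as copying shared fields from an existing experiment record. Which fields are treated as shared, the dictionary representation, and the function name are assumptions for illustration only.

```python
import copy

# Hypothetical list of fields expected to be common between replicates.
SHARED_FIELDS = (
    "project",
    "experiment_set",
    "biosample_source",
    "biosample_protocol",
    "extract_protocol",
    "label_protocol",
    "hybridization_protocol",
)


def create_replicate(existing, name, **overrides):
    """Return a new experiment record pre-filled from an existing experiment.

    existing  -- dict describing an existing experiment
    name      -- name of the new replicate
    overrides -- replicate-specific values (for example, a slide number)
    """
    replicate = {
        key: copy.deepcopy(existing[key]) for key in SHARED_FIELDS if key in existing
    }
    replicate["name"] = name
    replicate.update(overrides)
    return replicate
```

In this sketch the user supplies only the replicate's name and whatever differs from the existing experiment, and the remaining entries are filled in from the linked record.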
As such, to input information for replicate experiments in the LIMS application 42, the user 14 preferably uses the truncated Create function for replicate experiments shown in
In a first window of the Create Replicates input means 180, the user 14 is prompted to identify a project and an experiment set with which the replicate experiments are to be associated. Since the replicate experiments are associated with a preexisting project, the project information for the replicate experiments can be automatically determined and provided by the server 18 from the identification of the associated preexisting project. For example, as shown in
Similarly, since the replicate experiments are associated with a preexisting experiment set, the experiment set information for the replicate experiments can be automatically determined and provided by the server 18 from the identification of the associated preexisting experiment set. For example, as shown in
In a second window of the Create Replicates input means 180, the user 14 is also prompted to input the number of replicate experiments by selecting from a list of numbers, e.g., in a list in a pull down menu, as shown for example in
Preferably, the Create Replicates input means 180 is adapted to allow the user 14 to input information for single channel labeling, two channel labeling, or both, e.g., when a red (cy5) label and/or a green (cy3) label is used. As such, the user 14 can indicate to the LIMS application 42 which label or labels were used so that the LIMS application 42 can format the Create Replicates input means 180 accordingly. In one embodiment, the user 14 is prompted to indicate the labels used in a first window of the Create Replicates input means 180. For example, as shown in
For purposes of brevity and illustration, the Create Replicates input means 180 is discussed further below and shown herein in one embodiment with regards to two channels with reference to both green (cy3) and red (cy5) labels. From the discussion of the two channels and two labels, it will be apparent to one of ordinary skill in the art how to format the Create Replicates input means 180 for only one of the channels, either for a green (cy3) label or red (cy5) label, or for two channels but with reference to only the green (cy3) label, as unnecessary entries for inputting, storing, and/or displaying information can be omitted, ignored or otherwise not utilized.
Once the Create Replicates input means 180 is formatted accordingly for the channels and number of replicate experiments, the user 14 is prompted in the second window of the Create Replicates input means 180 to identify the biosample source, the biosample protocol, the extract sample protocol, the label protocol, and the hybridization protocol for the red (cy5) label and the green (cy3) label for the experiment replicates. From such information, the server 18 can automatically determine and provide at least a portion of the experiment information for the replicate experiments since the replicate experiments will have at least some experiment information in common with at least one other preexisting experiment.
For example, as shown in
Also, the user 14 is prompted in the second window of the Create Replicates input means 180 to input at least a portion of the label information for the red (cy5) label and the green (cy3) label, as shown for example in
Further, the user 14 is prompted in the second window of the Create Replicates input means 180 to input at least a portion of the hybridization information for the set of replicate experiments, as shown for example in
Once all the information has been inputted into the Create function for the replicate experiments, the LIMS application 42 creates corresponding data locations so that experiment results (e.g., data sets and images) can be individually uploaded for each replicate experiment since each will have its own hybridization results to upload when the entire experiment is complete.
From the above discussion, it can be seen that the Create function for replicate experiments of the LIMS application 42 saves time because repetitive tasks are done only once. Further, the flow of the Create function reflects the way a user 14, such as a biologist or scientist, would intuitively think about the experiment. Although a collective experiment is performed as a set of replicate experiments, the results for each replicate experiment are unique and should be individually maintained. Also, the experiment results are generally handled as unique during data analysis, although they may be averaged at some later point in the analysis.
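The replicate workflow described above can be illustrated with a minimal Python sketch. It is not the LIMS application 42 itself: the names `Experiment` and `create_replicates`, the field names, and the result-location layout are hypothetical, chosen only to show shared information being entered once while each replicate keeps its own location for uploaded results.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Experiment:
    # Hypothetical record; the field names are illustrative assumptions.
    name: str
    project: str
    exp_set: str
    biosample_source: str
    label_protocol: str
    hybridization_protocol: str
    result_location: str


def create_replicates(existing: Experiment, count: int) -> List[Experiment]:
    """Copy the shared fields of an existing experiment into `count` replicates.

    Only the name and the result location differ, so each replicate can
    receive its own uploaded data set and image.
    """
    replicates = []
    for i in range(1, count + 1):
        rep_name = f"{existing.name}_rep{i}"
        replicates.append(
            replace(
                existing,
                name=rep_name,
                result_location=f"{existing.project}/{existing.exp_set}/{rep_name}",
            )
        )
    return replicates
```

In this sketch the user supplies the common information once, in the existing experiment, and only the replicate count afterwards, which mirrors the time saving attributed to the truncated Create function.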
As mentioned above, the Data Import application 44 of the present invention is, in general, used to import and process experiment results for hybridized microarrays. More particularly, the Data Import application 44 is used for recording, storing and processing experiment results, including information relating to the raw data and digitized images collected for experiments. In one embodiment, the Data Import application 44 includes an “Upload Data” module 200, a “Raw Data” module 202, an “Array Image” module 204, a “Preprocessed Data” module 206 and a “Production Data” module 208, as shown in
Within the Upload Data module 200, raw data information and array image information are retrieved or uploaded for the experiment. Generally, each technology platform currently in use (such as, for example, membranes, microarrays, and Affymetrix GeneChips) has a specific raw data format and image format associated with it. For example, raw data and an image can be generated using scanner platform-specific image processing software such as GenePix. The Upload Data module 200 brings the raw data information from a specific platform into the Raw Data module 202, where it is staged for preprocessing, and brings the array image information into the Array Image module 204.
In one embodiment, a raw data file containing the raw data information and an associated array image file containing the array image information are uploaded via the Upload Data module 200 to the Raw Data module 202 and the Array Image module 204, respectively. In general, the raw data file includes intensity data, generally in a spreadsheet format. For example, the raw data file can be an “Excel” file. The array image file generally includes graphical data for the hybridized spots of a microarray from which an image 220 of the microarray (as shown for example in
The raw data information and array image information can be uploaded, for example, from a device outputting such data (such as a scanner), from a file stored on a storage medium (such as a hard disk drive, a compact disk, floppy disk, etc.), or from a computer database (such as a database of the user system 22 or other remote computer). Further, image processing information such as scan power and laser power can also be uploaded.
In one embodiment, so that raw data files and array image files can be easily uploaded by user 14 via the user system 22, the user 14 is provided with an upload means 250 in the user interface, as shown for example in
For example, as shown in
In the upload means 250, the user 14 also indicates the location from which the raw data file and the array image file can be uploaded. For example, as shown in
In one embodiment, the server 18 of the present invention links or associates the raw data information and the array image information in the Data Import application 44 with the experimental information in the LIMS application 42 such that the graphical information for the microarray, and further for each spot in the microarray, is linked to its intensity data and LIMS information. Further, the uploaded array image is preferably linked to the spotting run from which it was created, i.e., the array information in the Array Production Management application 40. As such, the MDB system 10 allows for the linking of array production information, LIMS information, and experimental information. Such linking gives users 14 the ability to relate information describing array production (e.g., how a microarray was spotted and with what materials) to each experiment done with a microarray, including all the experiment parameters, hybridization results, and analysis results.
For example, if a spot does not seem to perform as desired in an experiment, a user (e.g., a scientist or lab manager) can visualize graphics of the hybridized spot and intensity numbers for as many experiments as desired. Also, each experiment can be linked to the day the array was produced and the materials used. This is a separate issue from the day experiments were conducted using the microarrays. However, during problem solving, a user generally must consider both microarray production and microarray experimental procedures using completed microarrays. The MDB system 10 of the present invention facilitates this problem solving process by linking details of array production and details of array experiments. As such, the MDB system 10 provides the user 14 with information to decide whether a spot's problems are caused by spotting methods and materials, or by experimental methods and materials.
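The linking of array production information, LIMS information, and intensity data can be pictured with the following Python sketch. The three dictionaries are hypothetical in-memory stand-ins for the linked stores, and every identifier and value in them is made up for illustration; the actual schema of the MDB system 10 is not reproduced here.

```python
# Hypothetical in-memory stand-ins for the three linked stores.
array_production = {
    "array-7": {"spotted_on": "production-day-1", "materials": "oligo lot A"},
}
lims_experiments = {
    "exp-1": {"array_id": "array-7", "hybridization_protocol": "hyb-A"},
}
raw_intensities = {
    ("exp-1", "spot-12"): 1530.0,
    ("exp-1", "spot-13"): 88.0,
}


def spot_report(experiment_id: str, spot_id: str) -> dict:
    """Gather production, LIMS, and intensity information for one spot."""
    experiment = lims_experiments[experiment_id]
    production = array_production[experiment["array_id"]]
    return {
        "experiment": experiment_id,
        "spot": spot_id,
        "intensity": raw_intensities[(experiment_id, spot_id)],
        "hybridization_protocol": experiment["hybridization_protocol"],
        "spotted_on": production["spotted_on"],
        "materials": production["materials"],
    }
```

A report of this kind places the production details and the experimental details for a questionable spot side by side, which is the comparison the problem solving process above relies on.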
Further, each portion of the array image which is indicative of one of the spots in the spotting run is linked to the raw data and LIMS information associated with the corresponding spot, and thus provides qualitative information for each spot. For example, shown in
As discussed above, uploading the raw data information and the array image information using the Upload Data module 200 places the data of the raw data file into the Raw Data module 202, which serves as a staging area for raw data storage for integration into the Preprocessed Data module 206 (and subsequently the Production Data module 208). Within the Preprocessed Data module 206, the raw data from the Raw Data module 202 for one or more experiments is transformed according to a platform-specific protocol by normalization, filtering, scaling, etc., in a pre-processing step, and is converted into a common format employed for the Production Data module 208 by averaging experiments and calculating statistical metrics so as to generate preprocessed data. For example, for membrane arrays, phosphor imaging is generally used to produce a TIFF image file that is further processed in ArrayVision™ software, ver. 5.1 (Imaging Research, Inc.). The resulting raw data is preferably normalized by a global normalization strategy, and replicate experiments are chosen for calculation of expression averages, ratios, and statistical confidence. As another example, Affymetrix GeneChip arrays are generally scanned on a platform-specific system and the image is processed by using a platform-specific software package. The resulting raw data is preferably normalized and scaled, then replicate experiments are chosen for calculation of on/off threshold, expression averages, ratios, and statistical confidence.
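As a rough illustration of the pre-processing step, the Python sketch below applies one simple form of global normalization (scaling each replicate to a common median intensity) and then averages replicates per spot. The choice of median scaling, the target value of 1000.0, and the function names are assumptions made for this sketch; the platform-specific protocols referred to above may differ.

```python
from statistics import mean, median, stdev
from typing import Dict, List


def global_normalize(intensities: Dict[str, float],
                     target_median: float = 1000.0) -> Dict[str, float]:
    """Scale one replicate so that its median intensity equals target_median."""
    factor = target_median / median(intensities.values())
    return {spot: value * factor for spot, value in intensities.items()}


def average_replicates(replicates: List[Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Normalize each replicate, then report the per-spot mean and standard deviation."""
    normalized = [global_normalize(rep) for rep in replicates]
    summary = {}
    for spot in normalized[0]:
        values = [rep[spot] for rep in normalized]
        summary[spot] = {
            "mean": mean(values),
            # A sample standard deviation needs at least two replicates.
            "sd": stdev(values) if len(values) > 1 else 0.0,
        }
    return summary
```

The per-spot mean and standard deviation stand in for the "expression averages" and "statistical confidence" mentioned above; a ratio between two conditions could be formed from two such summaries.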
In one embodiment, the Preprocessed Data module 206 includes a plurality of processing tools which the user 14 can use via the user interface. In general, such processing tools allow the user to arrange, mathematically manipulate, or otherwise process data. More particularly, the processing tools relate to pre-filtering, normalization, statistical analysis, experimental replicate comparison and/or replicate averaging, for example. By integrating processing tools into the user interface, the user 14 can be prompted in how to apply procedures, such as for example state-of-the-art procedures for pre-filtering spot information, whole microarray and microarray-microarray normalization, statistical significance of results, and comparison and averaging of replicate experiments.
Further, the processing tools allow the Preprocessed Data module 206 to be adapted for quality control of raw data by the user 14 so that the raw data can be evaluated before being sent to the Production Data module 208 as preprocessed data. In one embodiment, the user 14 can review one or more individual spot images and the information related to the spot images to ascertain whether the raw data is reliable or of a suitable quality and decide whether or not to enter it into the Production Data module 208. For example, the user can evaluate whether the spots behaved properly in the biological experiment and whether at any time the results for a particular probe may be called into question. Further, the Preprocessed Data module 206 is preferably adapted to allow the user 14 to go back and forth between the image, the raw data file, and probe-specific information for the experiment. This feature of the Preprocessed Data module 206 can also be made accessible from any stage of the operation for the user's reference.
The processing tools of the Preprocessed Data Module 206 in one embodiment are powered by “R” and “Bioconductor” techniques. R is a widely used open source, high-level language and environment for statistical computing and graphics. Bioconductor provides tools for the analysis and comprehension of microarray data (bioinformatics).
In one embodiment, one processing tool is a “Pre-filtering data” analysis tool, which allows the user 14 to flag or remove data points from consideration on the basis of several parameters, including null spots in the array, spot quality (bad spots flagged during scanning or by spot-data analysis), signal to noise ratio, and background subtraction for gene on/off calls. Pre-filtering can also be used for multi-species microarrays to consider only those microarray probes that are specific for genes represented in the biosample source and biosample. In one embodiment, the user 14 is provided with a “Filter Feature” input means 300 in the user interface, as shown for example in
Another processing tool is a “Normalization” analysis tool. In one embodiment, the Normalization analysis tool allows the user 14 to perform at least one of 1) within-print-tip-group intensity dependent location normalization (Lowess) followed by within-print-tip-group scale normalization using the median absolute deviation (scale print-tip), with or without background subtraction, 2) global median location normalization, 3) global intensity or A-dependent location normalization using loess (global loess), 4) 2D spatial location normalization using loess (2D), 5) within-print-tip-group intensity dependent location normalization using loess (print-tip), or 6) total intensity normalization.
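Of the options listed above, global median location normalization is the simplest to show in code. The Python sketch below computes the usual M (log ratio) and A (mean log intensity) values for a two channel array and then shifts the M values so that their median is zero. The function names are hypothetical, and the loess-based options are not covered by this sketch.

```python
import math
from statistics import median
from typing import List, Tuple


def ma_values(red: List[float], green: List[float]) -> Tuple[List[float], List[float]]:
    """Return per-spot M = log2(R/G) and A = 0.5 * log2(R*G) values."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a


def global_median_normalize(m: List[float]) -> List[float]:
    """Shift all M values so that their median becomes zero."""
    center = median(m)
    return [value - center for value in m]
```

The underlying assumption of this method is that most genes are not differentially expressed, so the typical log ratio on an array should be zero after correction.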
In one embodiment, the user 14 is provided with a “Normalization Analysis Options” input means 320 in the user interface, as also shown for example in
In another embodiment, the user 14 is provided with a pre-filtering data analysis tool and “Normalization Method” input means 340 in the user interface, as shown for example in
Further, another one of the processing tools can offer the user 14 statistical methods for evaluating experimental noise and for determining whether or not genes are “responders” in experiments, that is, genes with significant differential expression between experimental conditions in pairwise comparisons. An example statistical method is a simple noise determination, wherein the standard deviation of replicate spots for multiple replicate arrays is determined so as to provide the user 14 with a measure of noise for particular spots (genes) in their replicate experiments.
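The simple noise determination described above amounts to a per-gene standard deviation across replicate measurements. A minimal Python sketch follows; the function name `spot_noise` and the use of the sample (n-1) standard deviation are assumptions of this sketch rather than details stated in the description.

```python
from statistics import stdev
from typing import Dict, List


def spot_noise(replicate_values: Dict[str, List[float]]) -> Dict[str, float]:
    """Return the sample standard deviation of each gene's replicate values."""
    return {
        gene: stdev(values)
        for gene, values in replicate_values.items()
        if len(values) >= 2  # a standard deviation needs at least two values
    }
```

Genes with a single measurement are omitted from the result instead of being reported with a noise of zero, which would otherwise suggest more confidence than the data support.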
Also, one of the processing tools can allow the user 14 to perform a replicates correlation analysis to evaluate the quality of replicate experiments by determining the simple correlation between replicate experiment results. For example, based on the input provided in the Normalization Method input means 340 (see
Further, one of the processing tools can allow the user 14 to perform differential gene expression analysis for two or more experiments. In such an analysis, the user 14 can also be offered choices of methods for evaluating whether individual genes are statistically significant responders in their experiments. In one embodiment, the user 14 is provided with a “Two Samples Comparison Option” input means 370 in the user interface, as shown for example in
If the t-statistic analysis is indicated in the Two Samples Comparison Option input means 370, then the user 14 is provided with a “t-Statistic Analysis Results” table 390 in the user interface, as shown for example in
As yet another example, one of the processing tools can allow for data mining, wherein conventional and advanced data mining algorithms (e.g., SVM, MATOM, SOM) can be implemented by the user 14, such as for example for time series analysis and standard cluster analysis.
Within the Production Data module 208, the preprocessed data from the Preprocessed Data module 206 is received and stored. In one embodiment, the Production Data module 208 includes a production data table designed to accommodate the most commonly accessed data fields, such as, for example, microarray intensity, ratio values, and confidence intervals (regardless of the technology platform used). As such, the Production Data module 208 can be adapted to handle, for example, proteome and metabolome data (provided that the Preprocessed Data module 206, Raw Data module 202 and Upload Data module 200 are programmed to interface with the specific application).
The purpose for the Production Data module 208 is mainly two-fold. First, the preprocessed data are stored in a format that is standardized for commonly used analytical tools. Second, accessing the preprocessed data from the scaled-down Production Data module 208 with its reduced data fields makes the application of analytical tools operate much faster. For example, such features offer significant advantages (such as for example for displaying transcriptome data) since to Applicants' knowledge, current microarray database systems generally do not truly integrate data generated on different platforms and are notoriously slow in display time because they access data from very large raw data tables.
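A scaled-down production data table of the kind described above might look like the following Python sketch, which uses the standard-library `sqlite3` module with an in-memory database. The table name, column names, and the helper functions `build_production_table` and `ratios_for` are hypothetical; only the three kinds of fields (intensity, ratio values, and confidence intervals) come from the description.

```python
import sqlite3
from typing import List, Tuple


def build_production_table(rows: List[tuple]) -> sqlite3.Connection:
    """Create an in-memory production data table and load preprocessed rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE production_data (
            experiment_id TEXT NOT NULL,
            gene_id       TEXT NOT NULL,  -- e.g., a b-number
            intensity     REAL,
            ratio         REAL,
            ci_low        REAL,
            ci_high       REAL,
            PRIMARY KEY (experiment_id, gene_id)
        )
        """
    )
    conn.executemany(
        "INSERT INTO production_data VALUES (?, ?, ?, ?, ?, ?)", rows
    )
    return conn


def ratios_for(conn: sqlite3.Connection, experiment_id: str) -> List[Tuple[str, float]]:
    """Fetch gene/ratio pairs for one experiment, ordered by gene."""
    cursor = conn.execute(
        "SELECT gene_id, ratio FROM production_data "
        "WHERE experiment_id = ? ORDER BY gene_id",
        (experiment_id,),
    )
    return cursor.fetchall()
```

Because display tools query this narrow table and never the large raw data tables, lookups touch only the handful of columns they need, which is the speed argument made above.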
As such, it can be seen that the Upload Data module 200, the Raw Data module 202, the Preprocessed Data module 206 and the Production Data module 208 cooperate to form a “pipeline” for experiment results. The pipeline not only allows novice users to preprocess the experiment results for entry into the Production Data module 208 using standardized filtering and normalization protocols, but also allows sophisticated users to quickly enter experiment results for a large number of experiments.
As mentioned above, the Data Analysis application 48 of the present invention is, in general, used to generate graphical presentations of experiment and analysis results. More particularly, the Data Analysis application 48 provides visual tools, such as for analysis and graphical presentations, which the user 14 can utilize to visualize and evaluate experiment results and analysis results via the user interface. In one embodiment, the visual tools are powered by Netpbm, which transforms PostScript files to image files.
The Data Analysis application 48 includes a “Presentation” module 430, as shown in
The Presentation module 430 can also allow the user 14 to display data using an M-A plot. M-A plots are a standard for presenting microarray data from individual or replicate experiments. The M-A plot is preferably offered by the Presentation module 430 for raw data and for data sets which have been filtered, normalized and/or replicated. Further, an M-A plot in the Presentation module 430 can be adapted so as to allow for annotation and metalink information for individual genes.
In one embodiment, the user 14 is provided with a “Plot Options” input means 440 in the user interface, as shown for example in
Further, the Presentation module 430 can allow the user 14 to display a plot of the results of a t-statistic analysis and/or a linear model and empirical Bayes method analysis. In one embodiment, if the t-statistic analysis is indicated in the Two Samples Comparison Option input means 370 (as discussed above), then the user 14 is provided with a “t-statistic” plot means 450 in the user interface, as shown for example in
If the linear model and empirical Bayes method analysis is indicated in the Two Samples Comparison Option input means 370, then the user 14 is provided with a Linear Model and Empirical Bayes plot means 460 in the user interface, as shown for example in
The Presentation module 430 can further include other visual tools. For example, the Presentation module 430 can include a visual tool which allows the user 14 to display and sort data for at least a portion of the experiment information, the experiment results and/or the analysis results, such as, for example, in a spreadsheet in a Java applet. The Presentation module 430 can allow the user 14 to sort by any criterion, such as, for example, by the b number. For example, shown in
In one embodiment, at least a portion of the processing tools and visual tools of the present invention are made readily available to the user 14 in a “Data Analysis Option” input means 450 in the user interface, as shown for example in
The Data Analysis Option input means 450 allows the user 14 to indicate one or more processing tools and/or visual tools the user 14 wants to use. For example, as shown in
As mentioned above, the Data Export application 50 of the present invention is, in general, used to export information, such as, for example, experiment results or analysis results. More particularly, the Data Export application 50 is used for exporting or downloading raw data, analysis data, images, charts, plots, graphs, etc., generated using any other application of the MDB system 10, preferably in a MIAME or GEO compliant form. For example, such information can be exported in a “soft copy” form, such as in a digital file, or in a “hard copy” form, such as in a paper print out. In one embodiment, analysis results are downloaded in a tab delimited text file format so that they can be easily imported into other data analysis software such as Spotfire. For example, the analysis results can be exported to Spotfire, wherein an M-A plot with spots colored by the gene function groups contained in the downloaded file is visualized. Also, data interpretation by users (e.g., scientists and researchers) often requires connections (e.g., meta-links) to external databases that handle other pertinent data types. Therefore, unification links to external databases can further be provided for the user 14.
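Export to a tab delimited text file, as mentioned above, can be sketched in Python with the standard-library `csv` module. The function name `export_tab_delimited` and the column names used in the usage note are hypothetical; this sketch does not attempt MIAME or GEO compliance.

```python
import csv
import io
from typing import Dict, List


def export_tab_delimited(rows: List[Dict[str, object]], columns: List[str]) -> str:
    """Serialize analysis results as tab delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

For instance, calling `export_tab_delimited([{"gene": "b0001", "M": 1.5, "A": 9.2}], ["gene", "M", "A"])` yields a header line followed by one tab-separated data line, which can be written to a file and opened in other analysis software.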
It can be seen that the Data Import application 44, the Data Analysis application 48 and Data Export application 50 of the present invention provide the user 14 with flexibility and speed in the number of combinations and selections that the user 14 can use to analyze data, create tables and plots, and export information. For example, there are ten selections in the Data Analysis Option input means 450 shown in
To further manage and provide access to information for the experiment, the server 18 can also display an “Experiment Results” access means 470 to the user 14 in the user interface, as shown for example in
Further, the server 18 can display at least a portion of the Experiment Results access means 470 for a plurality of experiments in an “Experiment List” table 490 to the user 14 in the user interface, as shown for example in
In one embodiment, the server 18 of the MDB system 10 further includes a “User Management” application 500. In general, the User Management application 500 of the present invention is used to store and manage information associated with one or more users 14 of the MDB system 10, such as those associated within a particular laboratory or research group. For example, information identifying one of the users, such as a user name and/or password, can be defined in the User Management application 500. Further, the User Management application 500 preferably allows for users 14 to be assigned to different categories or levels, wherein each level has associated with it different user rights or privileges within the applications of server 18. In one embodiment, the user levels in the User Management application include an “Administrator” level and a “User” level.
In one embodiment, the User Management application 500 provides an input means (not shown) in the user interface, which allows a user 14 at the Administrator level to create or define new user accounts and assign privileges to users 14 at the User level. Preferably, only one user 14 is assigned the role of administrator, and the User level is associated with general users 14 of the MDB system 10. Within the User level, different users 14 can be given different privileges. Further, the User Management application 500 can provide an input means (not shown) which allows the users 14 at the User level to individually change at least a portion of their own information, such as their password.
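The two user levels and per-user privileges described above can be modeled with the short Python sketch below. The class names, the privilege strings, and the rule that the administrator holds every privilege are assumptions made for illustration; password handling is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class User:
    name: str
    level: str  # "Administrator" or "User"
    privileges: Set[str] = field(default_factory=set)


class UserManagement:
    """Hypothetical stand-in for the User Management application 500."""

    def __init__(self, administrator: str) -> None:
        # A single administrator account is created up front.
        self.users: Dict[str, User] = {
            administrator: User(administrator, "Administrator")
        }

    def create_user(self, actor: str, name: str, privileges: Set[str]) -> User:
        """Create a User-level account; only the administrator may do so."""
        if self.users[actor].level != "Administrator":
            raise PermissionError("only an administrator can create user accounts")
        user = User(name, "User", set(privileges))
        self.users[name] = user
        return user

    def can(self, name: str, privilege: str) -> bool:
        """Check a privilege; the administrator is assumed to hold all of them."""
        user = self.users[name]
        return user.level == "Administrator" or privilege in user.privileges
```

Keeping privileges as a set per user allows different users at the same User level to hold different rights, as the description calls for.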
The following example of the construction and operation of the MDB system 10 is set forth hereinafter. It is to be understood that the example is for illustrative purposes only and is not to be construed as limiting the scope of the invention as described and claimed herein.
The MDB system 10 has the server 18 that includes a web server computer system which hosts a website on the Internet. Once the user 14 utilizes its user system 22 to initiate a web browser and connect to the server 18 via the website, the server 18 provides the user 14 with the user interface (implemented with PHP, JavaScript, Java Applet and HTML). The user interface is designed to facilitate and smooth the entire process of loading, storing, managing, linking, retrieving, analyzing, displaying and/or otherwise utilizing microarray research information to and from the MDB system 10, specifically for E. coli gene expression. The user interface is adapted to facilitate such functions by including a means for user management, project management, array production management, laboratory information management, data importation, data analysis, data visualization and data exportation.
Shown in
The Front End section to the MDB system 10 provides for data display and public access to E. coli data, and contains the Presentation module 430 and Production Data module 408. The Presentation module 430 contains the presentation, display, and analysis tools (e.g., cluster analysis, graphic displays, heat maps, etc.), and provides the user interface. As shown in
For example, the user 14 can plot the data according to statistical confidence intervals determined from the data, as shown for example in
The user 14 can also display the data in Gene Expression Genome View (also a heat map format) which accommodates all genes in very large experiment sets to fit on a single page, as shown for example in
Alternatively, the user 14 can enter a gene of their choice (e.g., using a gene name or b-number) in a gene query box, as also shown for example in
The user 14 further has the choice of displaying the entire experiment data as a ratio graph, as shown for example in
Also, the user interface is adapted to allow the user 14 to download the data for analysis in their favorite software package, as shown for example in
The Production Data module 408 contains the minimum information required for the presentation tools of the Presentation module 430, and is configured to facilitate integration of DNA array data generated on the most commonly used technology platforms (e.g., membranes, microarrays, and Affymetrix GeneChips). Preferably, the Production Data module 408 includes a database table designed to accommodate the most commonly accessed data fields, i.e., microarray intensity, ratio values, and confidence intervals, regardless of the technology platform used. Thus, the MDB system 10 has the potential to handle proteome and metabolome data, provided that the Preprocessed Data, Raw Data, and Upload Data modules are programmed to interface with the specific application.
The purpose for the Production Data module 408 is two-fold. First, the data are stored in a common format that is standardized for the tools that can be accessed in the Presentation module 430. Second, accessing the data from the scaled-down Production Data module 408 with its minimum of data fields makes the tools operate much faster. These are huge advantages for displaying transcriptome data, as most databases do not truly integrate data generated on different platforms and are notoriously slow in display time because they access data from very large raw data tables. A schema for the Production Data module 408 is shown in more detail in
The Project Management section contains four modules that correspond to intuitive levels of microarray experimentation. Replicate slides in an Experiment module 104 are preprocessed (i.e., normalized) and averaged to create an experiment. Experiments are usually associated in experiment sets (i.e., time points in a biological experiment or series of similar treatments) in the ExpSet module 102, and the experiment sets are associated with a project in the Project module 100, which is specific to the hypothesis being tested (i.e., related experiments published together in a single paper).
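The nesting of replicate slides within experiments, experiments within experiment sets, and experiment sets within a project can be expressed as a small Python data model. The class and attribute names below are hypothetical and only mirror the levels named above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Experiment:
    name: str
    replicate_slides: List[str] = field(default_factory=list)


@dataclass
class ExpSet:
    name: str
    experiments: List[Experiment] = field(default_factory=list)


@dataclass
class Project:
    name: str
    exp_sets: List[ExpSet] = field(default_factory=list)

    def replicate_paths(self) -> List[str]:
        """List every replicate slide as Project/ExpSet/Experiment/slide."""
        return [
            f"{self.name}/{exp_set.name}/{experiment.name}/{slide}"
            for exp_set in self.exp_sets
            for experiment in exp_set.experiments
            for slide in experiment.replicate_slides
        ]
```

With such a model, project-level and set-level information is held once at its own level, and each replicate slide is reached by walking down the hierarchy.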
In the Raw Data Section, the Raw Data module 202 is the staging area for raw data storage and preprocessing of the data for integration into the Production Data module 208. The preprocessing is specific to the technology platform and involves data filtering, normalization, scaling, etc., and conversion of the data to the common format employed for production data. Uploading data from the Upload Data module 200 into the Raw Data module 202 requires the user 14 to select from the Project Management section a preexisting Project/ExpSet/Experiment or create new ones as desired; in this way the higher level information is only entered once and does not need to be added every time replicate microarray information is uploaded. Also, the LIMS and genome annotation are associated with the replicate experiments.
The user interface includes a “pipeline” structure that allows the user to create experiments from replicates and to group experiments into experiment sets for upload into the Production Data module 208, as shown for example in
Within the user interface, the user 14 can select an option, i.e., Create Experiment, and choose replicates to be preprocessed to create an experiment, which is associated with an experimenter (as shown in
The Microarray Platform section provides for collection, storage, and management of the information associated with the microarray production process, including print process management and array design that are specific to the microarray technology platform involved.
From the above description, it is clear that the present invention is well adapted to carry out the objects and to attain the advantages mentioned herein, as well as those inherent in the invention. For example, it can also be seen that the various applications of the MDB system 10 of the present invention guide the user 14 from the first step of a microarray experiment through to the last step of analysis to provide reliability and reproducibility of results, identification of relationships to previous experiment results, and displays of functional aspects of the results. By eliminating information bottlenecks, robust microarray data management is envisioned to enhance disease diagnosis and prediction, genealogy, animal registration organizations, pharmaceutical development, and the detection and management of bioterrorism threats.
Although the foregoing invention has been described in some detail by way of illustration and example for purposes of clarity of understanding, it will be apparent to those skilled in the art that certain changes and modifications may be practiced without departing from the spirit and scope of the present invention, as described herein. As such, those skilled in the art will appreciate that the conception upon which this disclosure is based may readily be utilized as a basis for the designing of other structures, methods, and systems for carrying out the several purposes of the present invention. As such, it should be understood that the invention is not limited to the specific and preferred embodiments described herein, including the details of construction and the arrangements of the components as set forth in the above description or illustrated in the drawings. Further, it should be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting.
The present application claims priority to the provisional application identified by the U.S. Ser. No. 60/598,914, filed on Aug. 4, 2004, the entire content of which is hereby expressly incorporated herein by reference.
The present invention was made with partial support from the National Science Foundation Grant No. EPS-0132534 and the National Institutes of Health Grant No. RR-01-005.
Number | Date | Country
---|---|---
60598914 | Aug 2004 | US