Method and apparatus for information visualization and analysis

Information

  • Patent Application
  • 20080082521
  • Publication Number
    20080082521
  • Date Filed
    September 28, 2006
  • Date Published
    April 03, 2008
Abstract
A method and apparatus for analyzing, organizing and manipulating data for use by computer-executable programs by performing the steps of providing a set of documents wherein each document is provided from a document source, mapping the documents to a location in multi-dimensional space wherein each document's position in multidimensional space is determined as a function of the document's relationship to other documents and is expressed as a signature for that document, identifying a unique identifier for each document, providing a graphical representation of the documents, and associating the graphical representation of each document with the document source using the unique identifier so that any manipulation of the graphical representation of the document will result in a corresponding manipulation of the document in at least one computer executable program.
Description
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

For the purposes of promoting an understanding of the principles of the invention, a preferred embodiment of the present invention was programmed and reduced to practice. This embodiment of the present invention coupled a signature generator, such as that described in U.S. Pat. No. 6,484,168 entitled “System for Information Discovery” issued Nov. 19, 2002 (hereafter the SID generator), with a computer-executable program, preferably a commercially available software package or “application.”


The underlying architecture for any implementation of the present invention is the same regardless of the application (such as Microsoft Excel or Outlook) and regardless of the signature generator (such as the SID generator). The present invention is preferably integrated within an application as a “plug-in” extension. The hosting application preferably supports extension using a published application programming interface. As used herein, the word ‘document’ does not necessarily mean text or written form. It could consist of a series of numbers or categories that taken in total “document” a state or situation. Further, a “document” could include a portion of a written text.


Preferred applications interfaced with the present invention are characterized by the following traits: they manage data and metadata (data about the data), where the data consists of structured and/or categorical data and unstructured data such as freeform text or numbers representing a state, such as time, temperature, speed, etc.; they display data and metadata to the user; they allow the user to manipulate the data or sets of data; and they support an application programming interface which provides some or all of the following: the ability to access source documents, the ability to add metadata to documents, the ability to manipulate the location of the document, the ability to view the document in its original form, the ability to detect a user selecting a set of documents, the ability to programmatically select a set or subset of these documents, and the ability to identify each document using a unique identifier.
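
By way of illustration only, and not meant to be limiting, the application programming interface traits listed above might be sketched in Python as follows. The names HostApplication, InMemoryHost, and every method shown are hypothetical and are assumptions made for this sketch; they do not correspond to the interface of any particular host application.

```python
from typing import Dict, Iterable, List, Protocol


class HostApplication(Protocol):
    """Hypothetical view of the capabilities a host application exposes."""

    def get_document(self, doc_id: str) -> str: ...
    def add_metadata(self, doc_id: str, key: str, value: str) -> None: ...
    def selected_ids(self) -> List[str]: ...
    def select(self, doc_ids: Iterable[str]) -> None: ...


class InMemoryHost:
    """Toy host (an assumption for this sketch) that satisfies the protocol."""

    def __init__(self, documents: Dict[str, str]) -> None:
        self._documents = dict(documents)
        self._metadata: Dict[str, Dict[str, str]] = {k: {} for k in documents}
        self._selection: List[str] = []

    def get_document(self, doc_id: str) -> str:
        # Access a source document by its unique identifier.
        return self._documents[doc_id]

    def add_metadata(self, doc_id: str, key: str, value: str) -> None:
        self._metadata[doc_id][key] = value

    def metadata_of(self, doc_id: str) -> Dict[str, str]:
        return dict(self._metadata[doc_id])

    def selected_ids(self) -> List[str]:
        # Report the documents the user currently has selected.
        return list(self._selection)

    def select(self, doc_ids: Iterable[str]) -> None:
        # Programmatic selection; identifiers unknown to the host are ignored.
        self._selection = [d for d in doc_ids if d in self._documents]
```

In this sketch the unique identifier is a plain string supplied by the host, so the plug-in never needs to know whether a document is a file, a row, or an email.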


The general process for a user to use the present invention in a host application is as follows. When the host application is started, it detects that a plug-in is available and makes that plug-in option available to the user by adding one or more windows buttons or menu items. The present invention can also be invoked upon an action the user takes without requiring that the buttons or menu items be accessed. These actions would have been requested by the user. The present invention preferably does not take actions without the user first invoking those actions.


When the user presses the button, the present invention is notified of the request. The action is usually a request to “process” a set of “documents,” which, as previously explained, can be files, rows, emails, etc. The present invention qualifies the user's request by ensuring that a minimum set of documents has been selected, and provides the user with processing choices. These processing choices include, but are not limited to, the type of processing required. The present invention preferably has a feature to perform pre-analysis on the data, allowing the present invention to recommend a process and, if necessary, the metadata to include in processing and the parameters required for the types of processes requested.


Once the user has provided some feedback, the present invention processes the information. For example, and not meant to be limiting, the user may request that the present invention process a set of Excel rows containing unstructured text in some columns and a set of associated metadata in other columns, where the process requested is one that measures the proximity of one document to another using corpus-level differentiating terms (SID).
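
By way of illustration only, the notion of measuring the proximity of one document to another can be sketched with a simple cosine similarity over word counts. This is not the SID process, which is described in U.S. Pat. No. 6,484,168; it is a substituted, well-known technique used here only to make the idea of document proximity concrete. The function name cosine_proximity is hypothetical.

```python
import math
from collections import Counter


def cosine_proximity(a: str, b: str) -> float:
    """Cosine similarity of word-count vectors (a stand-in, not the SID process)."""
    counts_a = Counter(a.lower().split())
    counts_b = Counter(b.lower().split())
    # Only words shared by both documents contribute to the dot product.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as unrelated to everything
    return dot / (norm_a * norm_b)
```

Documents that share no words score 0.0, and identical documents score approximately 1.0.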


Regardless of the process requested, the present invention prepares the information in the following manner. The present invention contains a framework in which the data can be pre-analyzed so as to suggest to the user the best course of analysis. Each process model the present invention supports may contain a description for evaluation of applicability to the dataset being provided. The present invention uses this description to make suggestions based on this information. The present invention samples the corpus looking for applicability to the known and supported processes for that version of the present invention. This can consist of, but is not limited to, the nature of the data such as numeric, text, size, and structure. It can also include results of a preliminary analysis of the data such as information distribution, correlation, and covariance.
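
By way of illustration only, and not meant to be limiting, the pre-analysis framework described above might be sketched as follows. The sketch assumes that each process model carries its applicability description as a predicate over a few corpus statistics; the names ProcessModel, profile_corpus, and suggest_processes, the chosen statistics, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class ProcessModel:
    name: str
    # Hypothetical applicability description: a predicate over corpus statistics.
    is_applicable: Callable[[Dict[str, float]], bool]


def profile_corpus(sample: Sequence[str]) -> Dict[str, float]:
    """Compute simple statistics describing the nature of the sampled data."""
    tokens = [tok for doc in sample for tok in doc.split()]
    numeric = sum(1 for tok in tokens if tok.replace(".", "", 1).isdigit())
    return {
        "documents": float(len(sample)),
        "mean_length": len(tokens) / len(sample) if sample else 0.0,
        "numeric_fraction": numeric / len(tokens) if tokens else 0.0,
    }


def suggest_processes(sample: Sequence[str], models: Sequence[ProcessModel]) -> List[str]:
    """Return the names of the process models whose description fits the sample."""
    stats = profile_corpus(sample)
    return [model.name for model in models if model.is_applicable(stats)]


# Two illustrative models; the thresholds are arbitrary assumptions.
MODELS = [
    ProcessModel(
        "text_proximity",
        lambda s: s["numeric_fraction"] < 0.5 and s["mean_length"] >= 3,
    ),
    ProcessModel("numeric_correlation", lambda s: s["numeric_fraction"] >= 0.5),
]
```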


Each document is then tracked by a unique identifier. This is supplied by the hosting application and has meaning to the application. The unstructured text of the document is transformed into an integer array with each array member's value representing a unit of text in the order of the original document. A unit of text could be a word, a phrase, or a phonetic signature. Each array member value has an associated entry in a glossary which can be used to look up the original text. The length of the array (document vector) is representative of the length of the document and some level of compression of the original text is established.
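
By way of illustration only, and not meant to be limiting, the transformation of unstructured text into a document vector with an associated glossary might be sketched as follows. The sketch assumes that a unit of text is a lowercased, whitespace-delimited word; the names Glossary and build_document_vector are hypothetical.

```python
from typing import Dict, List


class Glossary:
    """Two-way mapping between units of text and integer codes."""

    def __init__(self) -> None:
        self._code_of: Dict[str, int] = {}
        self._text_of: List[str] = []

    def encode(self, unit: str) -> int:
        # Assign the next free code the first time a unit of text is seen.
        if unit not in self._code_of:
            self._code_of[unit] = len(self._text_of)
            self._text_of.append(unit)
        return self._code_of[unit]

    def decode(self, code: int) -> str:
        # Look up the original text for an array member value.
        return self._text_of[code]


def build_document_vector(text: str, glossary: Glossary) -> List[int]:
    """Encode a document as integer codes in the order of the original text."""
    # Assumption: a unit of text is a lowercased, whitespace-delimited word.
    return [glossary.encode(word) for word in text.lower().split()]
```

Because repeated words share one glossary entry, the integer array preserves word order and document length while storing each distinct unit of text only once.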


The requested metadata (or attributes) for the document is processed in a similar manner. The width of the integer array (attribute vector) needed to support the attributes for a document is fixed in accordance with the number of requested attributes the user wishes to track. Although a fixed set of attributes is usually tracked for each document, there is nothing to prevent the present invention from tracking a set of attributes where each document could have a different number of attributes. The document vector and attribute vector are then combined, along with the unique document identifier, to produce the default processing vector.
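
By way of illustration only, and not meant to be limiting, the combination of the document vector, the attribute vector, and the unique identifier into a default processing vector might be sketched as follows. The names ProcessingVector, encode_attributes, and make_processing_vector are hypothetical, and the use of code 0 for a missing attribute is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class ProcessingVector:
    doc_id: str                       # unique identifier supplied by the host
    document_vector: Tuple[int, ...]  # integer-coded units of text
    attribute_vector: Tuple[int, ...] # integer-coded requested attributes


def encode_attributes(
    attributes: Dict[str, str], requested: Sequence[str], codes: Dict[str, int]
) -> List[int]:
    """Encode the requested attributes in a fixed order (0 marks a missing value)."""
    vector = []
    for name in requested:
        value = attributes.get(name)
        if value is None:
            vector.append(0)
        else:
            # Key on name=value so equal strings in different attributes differ.
            vector.append(codes.setdefault(f"{name}={value}", len(codes) + 1))
    return vector


def make_processing_vector(
    doc_id: str,
    document_vector: Sequence[int],
    attributes: Dict[str, str],
    requested: Sequence[str],
    codes: Dict[str, int],
) -> ProcessingVector:
    """Combine identifier, document vector, and attribute vector."""
    attribute_vector = encode_attributes(attributes, requested, codes)
    return ProcessingVector(doc_id, tuple(document_vector), tuple(attribute_vector))
```

The width of the attribute vector equals the number of requested attributes, regardless of how many of them a given document actually carries.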


In one embodiment of the present invention, the present invention can perform additional preparation on the default processing vector. For example, and not meant to be limiting, the present invention may be configured to aggregate vectors. This combines the document vectors of one or more documents based on attribute vector values so as to create a new aggregated document. Processing can then be done on the aggregated document. For example, and not meant to be limiting, if each document had an author and year attribute, aggregation could be used to synthesize documents which represent the combined documents of an author, of a year, or of an author in a year. In this case the synthesized document only contains the attributes used for aggregation, and the unique identifier for this document becomes the set of unique identifiers for the documents aggregated into the synthesized document.
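
By way of illustration only, and not meant to be limiting, aggregation by attribute values might be sketched as follows. The sketch assumes that document vectors are combined by simple concatenation, which the description above does not specify; the names Record and aggregate are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# (unique identifier, document vector, attributes)
Record = Tuple[str, List[int], Dict[str, str]]
# (set of source identifiers, combined document vector, aggregation attributes)
Aggregated = Tuple[Tuple[str, ...], List[int], Dict[str, str]]


def aggregate(records: Sequence[Record], keys: Sequence[str]) -> List[Aggregated]:
    """Synthesize one document per distinct combination of the key attributes."""
    groups = defaultdict(lambda: ([], []))
    for doc_id, document_vector, attributes in records:
        group_key = tuple(attributes[k] for k in keys)
        ids, combined = groups[group_key]
        ids.append(doc_id)
        # Assumption: combining means concatenating the document vectors.
        combined.extend(document_vector)
    return [
        # The synthesized document keeps only the attributes used for aggregation.
        (tuple(ids), combined, dict(zip(keys, group_key)))
        for group_key, (ids, combined) in groups.items()
    ]
```

Passing keys of ["author"], ["year"], or ["author", "year"] yields the three kinds of synthesized documents mentioned in the example above.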


Another example of additional preparation that can be performed on the default processing vector is bifurcation. This takes a default processing vector and splits the document vector portion based on the preparation requested. This could include splitting a document by paragraph or page or the original document could be split by a change in topic. This results in the default processing vector being expanded to one or more processing vectors where the document vector portion is a segment from the original document and the attribute vector is identical to the attributes for the original document.
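
By way of illustration only, and not meant to be limiting, bifurcation by paragraph might be sketched as follows. The sketch assumes that paragraph boundaries are marked in the document vector by a reserved code, here PARAGRAPH_BREAK, and that each resulting segment is identified by pairing the original unique identifier with a segment index; both are assumptions, and the name bifurcate is hypothetical.

```python
from typing import List, Sequence, Tuple

PARAGRAPH_BREAK = -1  # hypothetical reserved code marking a paragraph boundary

# ((original identifier, segment index), segment vector, attribute vector)
Segment = Tuple[Tuple[str, int], List[int], List[int]]


def bifurcate(
    doc_id: str, document_vector: Sequence[int], attribute_vector: Sequence[int]
) -> List[Segment]:
    """Split one processing vector into one processing vector per paragraph."""
    segments: List[List[int]] = [[]]
    for code in document_vector:
        if code == PARAGRAPH_BREAK:
            segments.append([])
        else:
            segments[-1].append(code)
    non_empty = [segment for segment in segments if segment]
    # Every segment carries a copy of the original document's attributes.
    return [
        ((doc_id, index), segment, list(attribute_vector))
        for index, segment in enumerate(non_empty)
    ]
```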


The unique identifier may also be expanded to include a segment identifier. For example, and not meant to be limiting, the present invention can do phrase detection where the process for corpus-level preparation is the same. Phrase detection may be at the corpus level or the document level. At the document level, the phrase detection is preferably done when building the document vector. If the user requests, once the document vectors are built, the present invention can go through the corpus and treat the corpus as one document. Using the same phrase detection step, phrases are detected at the corpus level versus the document level. Once a phrase has been detected, it is added to the glossary and its code is used to replace two or more discrete words in each document vector. The present invention can preferably save and reload all of the information from the preparation phase. At this point, and for any point forward, the present invention can record the current state or version of the prepared data and save it. This allows the present invention to make adjustments to data structures and always know that any version of the prepared data can be recalled.
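
By way of illustration only, and not meant to be limiting, corpus-level phrase detection might be sketched as follows. The description above does not state how a phrase is recognized, so the sketch assumes a simple rule: a pair of adjacent codes is a phrase if it occurs at least min_count times across the corpus. The function names are hypothetical, and the glossary is modeled as a plain list indexed by code.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Pair = Tuple[int, int]


def detect_phrases(vectors: Sequence[Sequence[int]], min_count: int = 2) -> List[Pair]:
    """Find adjacent code pairs that recur across the corpus (assumed rule)."""
    counts: Counter = Counter()
    for vector in vectors:
        counts.update(zip(vector, vector[1:]))
    return [pair for pair, n in counts.items() if n >= min_count]


def replace_phrase(vector: Sequence[int], pair: Pair, phrase_code: int) -> List[int]:
    """Replace each occurrence of the two discrete codes with the phrase code."""
    out: List[int] = []
    i = 0
    while i < len(vector):
        if i + 1 < len(vector) and (vector[i], vector[i + 1]) == pair:
            out.append(phrase_code)
            i += 2
        else:
            out.append(vector[i])
            i += 1
    return out


def add_corpus_phrases(
    vectors: Sequence[Sequence[int]], glossary: List[str], min_count: int = 2
) -> List[List[int]]:
    """Add detected phrases to the glossary and rewrite every document vector."""
    result = [list(vector) for vector in vectors]
    for pair in detect_phrases(result, min_count):
        glossary.append(" ".join(glossary[code] for code in pair))
        phrase_code = len(glossary) - 1
        result = [replace_phrase(vector, pair, phrase_code) for vector in result]
    return result
```

Running the same detection over a single document vector instead of the whole corpus gives the document-level variant.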


The present invention is preferably configured to support inclusion of processes that deal with documents containing unstructured data, numeric data and metadata. Examples of such processes are presented in U.S. patent application Ser. No. ______ (Attorney Docket No. 15060-E), and the present invention is preferably integrated with the Deep Center analytical foundation described therein. In addition, processes requiring structured information associated with the document, such as the SID process and the concept-based clustering process, are preferably supported.


The present invention is preferably configured to invoke multiple processes on a single data set. There is nothing in the framework that requires the processes to run concurrently or sequentially. The present invention preferably is configured to support either. In the case of sequential processes, the functions available at the end of the preparation phase above may apply. For example, and not meant to be limiting, the thread of data continuity may become the unique identifier.


During processing the present invention can preferably monitor the progress of processing, display the processing errors, and allow the user to cancel processing. These steps are readily accomplished if the process has the capabilities of generating events, setting flags or other well known methods for monitoring.
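
By way of illustration only, and not meant to be limiting, monitoring and cancellation through events and flags might be sketched as follows. The sketch uses a progress callback as the event and a threading.Event from the Python standard library as the cancellation flag; the name run_process and its parameters are hypothetical.

```python
import threading
from typing import Callable, List, Sequence


def run_process(
    documents: Sequence[str],
    step: Callable[[str], None],
    on_progress: Callable[[int, int], None],
    cancel: threading.Event,
) -> List[str]:
    """Process documents one at a time, reporting progress and collecting errors."""
    errors: List[str] = []
    for index, document in enumerate(documents, start=1):
        if cancel.is_set():
            break  # the user asked to cancel processing
        try:
            step(document)
        except Exception as exc:
            # Record the error for display and continue with the next document.
            errors.append(f"{document!r}: {exc}")
        on_progress(index, len(documents))
    return errors
```

The cancellation flag is checked before each document, so a request to cancel takes effect at the next document boundary rather than in the middle of a step.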


The present invention preferably allows the state of the process engine to be saved at any point for any reason. Just as the present invention is preferably configured with the capability to save the state in the preparation phase, the present invention is preferably configured to save the state of the processing engine so that adjustments can be made to the process and the engine can later be recalled to a known state.
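
By way of illustration only, and not meant to be limiting, saving and recalling versions of state might be sketched as follows. The sketch keeps deep copies in memory and numbers the versions sequentially; a practical implementation would likely persist them, and the name StateStore is hypothetical.

```python
import copy
from typing import Any, List


class StateStore:
    """Keeps numbered snapshots so that any saved version can be recalled."""

    def __init__(self) -> None:
        self._versions: List[Any] = []

    def save(self, state: Any) -> int:
        # Deep copy so later adjustments do not alter the saved version.
        self._versions.append(copy.deepcopy(state))
        return len(self._versions) - 1

    def recall(self, version: int) -> Any:
        # Return a copy so the stored snapshot itself stays untouched.
        return copy.deepcopy(self._versions[version])
```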


Although some engines provide an interactive API in which the process engine state can be interrogated after processing, the present invention can also preferably support those processing engines that provide a static set of processing results. The present invention is thereby configured to operate with any process that will return, at a minimum, a signature for each document. It is also preferred that the document signatures will be accompanied by corpus-level data structures. The present invention preferably captures the results, and those results can be used by the present invention to provide process interrogation functionality.
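
By way of illustration only, and not meant to be limiting, capturing a static set of results and offering interrogation on top of it might be sketched as follows. The sketch assumes that each signature is a sequence of numeric coordinates and that a nearest-neighbor query is one useful form of interrogation; the name CapturedResults and its methods are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


class CapturedResults:
    """Holds one signature per document and answers simple queries about them."""

    def __init__(self, signatures: Dict[str, Sequence[float]]) -> None:
        self._signatures: Dict[str, Tuple[float, ...]] = {
            doc_id: tuple(signature) for doc_id, signature in signatures.items()
        }

    def signature(self, doc_id: str) -> Tuple[float, ...]:
        return self._signatures[doc_id]

    def nearest(self, doc_id: str, k: int = 1) -> List[str]:
        """Return the k documents whose signatures lie closest to the given one."""
        origin = self._signatures[doc_id]
        distances = [
            (math.dist(origin, signature), other)
            for other, signature in self._signatures.items()
            if other != doc_id
        ]
        return [other for _, other in sorted(distances)[:k]]
```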


In general, the results of processing will be a set of signatures that can be used to create an alternative representation of the data. Preferably, the processing would identify features and then provide a classification or categorization of those features as a set of signatures. A typical representation of these signatures would be in a visualization, but a visualization is not required. After processing, the present invention maintains a relationship between actions taken by the host application (if supported) and actions taken by the present invention on the corresponding representation of that data (actionable representation). The present invention preferably wraps each piece of actionable information in an event envelope which creates signals upon any action on that item. These events include, but are not limited to: selection (the user selects the information); click (the user clicks or double clicks on a visual representation of the data); hovering or mouse over (the user runs the mouse over a visual representation of the information); and information actions including, but not limited to, delete, change in location, and change in metadata. The present invention preferably allows custom actions to take place in the hosting application with each event. The actual action is dependent on the process engine, visualization, data and hosting application.
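
By way of illustration only, and not meant to be limiting, an event envelope that signals the hosting application might be sketched as follows. The names EventBus and EventEnvelope, and the event names used as strings, are hypothetical; the custom action is modeled as any callable that receives the unique identifier of the affected document.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[str], None]


class EventBus:
    """Routes named events to the custom actions registered for them."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event: str, handler: Handler) -> None:
        self._handlers[event].append(handler)

    def emit(self, event: str, doc_id: str) -> None:
        for handler in self._handlers[event]:
            handler(doc_id)


class EventEnvelope:
    """Wraps one actionable item and signals every action taken on it."""

    def __init__(self, doc_id: str, bus: EventBus) -> None:
        self.doc_id = doc_id
        self._bus = bus

    def signal(self, event: str) -> None:
        # The unique identifier ties the representation back to its source.
        self._bus.emit(event, self.doc_id)
```

For example, a host could subscribe a handler to "selection" that highlights the matching row, so that selecting a point in a visualization selects the corresponding document in the host.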


To assist the user in analyzing complex data, in many cases the present invention can use signatures from the process along with metadata from the original context to provide some level of automated computer aided analysis by looking for patterns between the signatures from the processing and metadata or derivatives of the metadata. Any pattern matching algorithm would be included in the term “computer aided analysis” as used herein. A typical example would be to look for correlation between a piece of metadata such as location or date and a signature signifying a category assigned to unstructured text.
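
By way of illustration only, and not meant to be limiting, looking for correlation between a piece of metadata and a category signature might be sketched as follows. The sketch assumes the metadata is numeric (for example, a year) and that the signature is reduced to an indicator of membership in one category; the Pearson coefficient is used as one possible pattern matching measure, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return covariance / (spread_x * spread_y)


def category_indicator(categories: Sequence[int], target: int) -> List[float]:
    """1.0 where a document's category signature equals the target, else 0.0."""
    return [1.0 if category == target else 0.0 for category in categories]
```

A coefficient near 1 or -1 between, say, document year and membership in a category suggests that the category is concentrated in later or earlier documents and may merit the user's attention.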


The present invention preferably provides a set of calls that can be used by the application developers from within applications that support exposure of the present invention's APIs. Using the present invention's APIs, an Excel user, for example, and not meant to be limiting, can call all of the functionality provided by the present invention using, for example, “Visual Basic for Applications” (VBA) or an equivalent. All of the main functionality of the present invention can be customized by the VBA programmer so that VZIN can be made to do things in accordance with the way the VBA programmer wants them done.


While the invention has been described in connection with specific embodiments utilized for the project undertaken to demonstrate a preferred embodiment of the present invention, those having ordinary skill in the art will readily appreciate that many changes and modifications may be made without departing from the invention in its broader aspects. The appended claims are therefore intended to cover all such changes and modifications as fall within the true spirit and scope of the invention.

Claims
  • 1. A method for analyzing, organizing and manipulating data for use by computer-executable programs comprising the steps of: providing a set of documents, wherein each document is provided from a document source, mapping the documents to a location in multi-dimensional space wherein each document's position in multidimensional space is determined as a function of the document's relationship to other documents and is expressed as a signature for that document, identifying a unique identifier for each document, providing a graphical representation of the documents, associating the graphical representation of each document with the document source using the unique identifier so that any manipulation of the graphical representation of the document will result in a corresponding manipulation of the document in at least one computer executable program.
  • 2. The method of claim 1 wherein the step of mapping the documents to a location in multi-dimensional space wherein each document's position in multidimensional space is determined as a function of the document's relationship to other documents is accomplished by the steps of creating high dimensional vectors for each of the documents, such that each high dimensional vector represents the relative relationship of the individual documents to a term or topic attribute; and arranging the high dimensional vectors into clusters, with each of the clusters representing a plurality of documents grouped by relative significance of their relationship to a topic attribute.
  • 3. The method of claim 2 wherein said unique signatures are optimized to provide an optimum number of clusters.
  • 4. The method of claim 1 wherein each document comprises data in a tabular form having a plurality of rows, each row having a plurality of columns.
  • 5. The method of claim 4 wherein each document comprises at least a portion of a row.
  • 6. An apparatus for analyzing, organizing and manipulating data for use by computer-executable programs comprising a computer system configured to perform the steps of: inputting a set of documents, wherein each document is provided from a document source, mapping the documents to a location in multi-dimensional space wherein each document's position in multidimensional space is determined as a function of the document's relationship to other documents and is expressed as a signature for that document, identifying a unique identifier for each document, providing a graphical representation of the documents, associating the graphical representation of each document with the document source using the unique identifier so that any manipulation of the graphical representation of the document will result in a corresponding manipulation of the document in at least one computer executable program running on said computer system.
  • 7. The apparatus of claim 6 wherein the step of mapping the documents to a location in multi-dimensional space wherein each document's position in multidimensional space is determined as a function of the document's relationship to other documents is accomplished by the steps of creating high dimensional vectors for each of the documents, such that each high dimensional vector represents the relative relationship of the individual documents to a term or topic attribute; and arranging the high dimensional vectors into clusters, with each of the clusters representing a plurality of documents grouped by relative significance of their relationship to a topic attribute.
  • 8. The apparatus of claim 7 wherein said unique signatures are optimized to provide an optimum number of clusters.
  • 9. The apparatus of claim 6 wherein each document comprises data in a tabular form having a plurality of rows, each row having a plurality of columns.
  • 10. The apparatus of claim 9 wherein each document comprises at least a portion of a row.
  • 11. A computer readable medium having computer-executable instructions for performing a method for analyzing, organizing and manipulating data for use by other computer-executable programs comprising the steps of: providing a set of documents, wherein each document is provided from a document source, mapping the documents to a location in multi-dimensional space wherein each document's position in multidimensional space is determined as a function of the document's relationship to other documents and is expressed as a signature for that document, identifying a unique identifier for each document, providing a graphical representation of the documents, associating the graphical representation of each document with the document source using the unique identifier so that any manipulation of the graphical representation of the document will result in a corresponding manipulation of the document in at least one computer executable program.
  • 12. The computer readable medium having computer-executable instructions of claim 11 wherein the step of mapping the documents to a location in multi-dimensional space wherein each document's position in multidimensional space is determined as a function of the document's relationship to other documents is accomplished by the steps of creating high dimensional vectors for each of the documents, such that each high dimensional vector represents the relative relationship of the individual documents to a term or topic attribute; and arranging the high dimensional vectors into clusters, with each of the clusters representing a plurality of documents grouped by relative significance of their relationship to a topic attribute.
  • 13. The computer readable medium having computer-executable instructions of claim 12 wherein the unique signatures are optimized to provide an optimum number of clusters.
  • 14. The computer readable medium having computer-executable instructions of claim 11 wherein each document comprises data in a tabular form having a plurality of rows, each row having a plurality of columns.
  • 15. The computer readable medium having computer-executable instructions of claim 14 wherein each document comprises at least a portion of a row.
  • 16. A system for analyzing, organizing and manipulating data for use by computer-executable programs comprising: an input device configured to receive a set of documents, wherein each document is provided from a document source, a processor configured to: map the documents to a location in multi-dimensional space wherein each document's position in multidimensional space is determined as a function of the document's relationship to other documents and is expressed as a signature for that document, identify a unique identifier for each document, provide a graphical representation of the documents, associate the graphical representation of each document with the document source using the unique identifier so that any manipulation of the graphical representation of the document will result in a corresponding manipulation of the document in at least one computer executable program.
  • 17. The system of claim 16 wherein the processor is configured to map the documents to a location in multi-dimensional space wherein each document's position in multidimensional space is determined as a function of the document's relationship to other documents by creating high dimensional vectors for each of the documents, such that each high dimensional vector represents the relative relationship of the individual documents to a term or topic attribute; and arranging the high dimensional vectors into clusters, with each of the clusters representing a plurality of documents grouped by relative significance of their relationship to a topic attribute.
  • 18. The system of claim 17 wherein the processor is configured so that the unique signatures are optimized to provide an optimum number of clusters.
  • 19. The system of claim 16 wherein the processor is configured so that each document comprises data in a tabular form having a plurality of rows, each row having a plurality of columns.
  • 20. The system of claim 19 wherein each document comprises at least a portion of a row.
Government Interests

The invention was made with Government support under Contract DE-AC0676RLO 1830, awarded by the U.S. Department of Energy. The Government has certain rights in the invention.