This disclosure relates to a method and system that enables the user of a computational or communications device such as a computer, a smart phone, a personal digital assistant or any similar device, to visualize and analyze structured or unstructured text data from one or multiple sources.
Users of computational devices such as computers, smart phones, personal digital assistants and the like may have access to very large quantities of data. A small fraction of the data may reside on the device itself, while the vast majority may be stored in databases and/or be accessible via communications networks such as the Internet or other wired or wireless networks capable of transmitting data.
Search engines have made the discovery and retrieval of useful and relevant data somewhat easier, but often cannot help users make sense of the data. Search results may include items unrelated to the user's query, making the search experience sometimes overwhelming. There is therefore a need for a method and system to analyze data quickly and display its essence in a succinct but holistic manner that highlights important themes, topics or concepts in the data and how they might relate. Such a method may aid the user in making sense of data.
Text mining and analysis is a rich field of study [M. W. Berry, Survey of Text Mining: Clustering, Classification, and Retrieval, Springer, 2003]. Proposed solutions to make sense of text data may rely on combinations of statistical and rule-based methods. Many methods may be computation intensive, particularly methods that attempt to include semantics, taxonomies or ontologies into the analysis, as well as methods that require the performance of global computations such as the calculation of eigenvalues and eigenvectors for extremely large matrices. As a result such methods may be used mostly by specialists.
Another characteristic of methods that rely on semantics is their context and language dependence. There is therefore a need for a method and system that is generic enough that it can work with virtually any context and with a broad range of languages without any modification of the method.
Several methods use a network approach but may fail in their genericity, speed and ease-of-use [Carley, Kathleen (1997). Network Text Analysis: The Network Position of Concepts. In Text Analysis for the Social Sciences: Methods for Drawing Statistical Inferences from Texts and Transcripts, 79-100. Mahwah, N.J.: Lawrence Erlbaum Associates; Corman, S. R., Kuhn, T., McPhee, R. D., Dooley, K. J. Studying complex discursive systems: Centering resonance analysis of communication. Human Communication Research 2002; 28: 157-206].
This disclosure relates to a method and system that enables the user of a computational or communications device such as a computer, a smart phone, a personal digital assistant or any similar device, to (1) specify one or several sources of structured or unstructured text data (for example, a file in a format that supports text, such as Word, PDF, Postscript, HTML, TXT, RTF, XML, email or any other; URL; RSS feeds; URLs returned by a search query submitted to a search engine), (2) analyze the ensemble of text formed by the specified text data source or sources, and (3) display a network of words that represents certain salient features of the specified text data source or sources.
The number of words displayed can be defined by the user.
The method and system are fast and generic. That is, they work with any text in any written language that has words, or whose text can be tokenized into word-like elements, and they work in any context, e.g., for a medical report as well as for the news of the day.
While the performance of the method and system may vary as a function of the specific set of text data sources, the method and system produce a useful view of the key concepts in the text and how they relate to one another. They can be used by anyone wishing to analyze a document (for example, a Senate report, a book or a set of articles), a website, news, email or other textual material, to obtain a fast global overview of the data.
The method and system work in such a way that the data sources specified may reside on the computational device itself or may be accessible via communications protocols, or a combination of both, thereby seamlessly integrating the user's offline and online universes. By right-clicking on any word in the displayed word network (or taking an equivalent defined action), the user has access to all occurrences of the selected word in the data sources, viewed in the various contexts in which it has appeared. By left-clicking on any word (or taking an equivalent defined action), the user can hide, delete or focus on the word; focusing on the word means that a new network of words is displayed that centers on the selected word, with all words in the new word network connecting to the selected word. The user can also decide to view clusters in the word network, each cluster corresponding to a theme or topic in the data.
The disclosed method and system for text data analysis and visualization enables a user of a computational device to specify a set of text data sources in various formats, residing on the device itself or accessible online, and visualize the content of the text data sources in a single overview of salient features in the form of a network of words. The method for text analysis uses a statistical correlation model for text that assigns weights to words and to links between words. The system for displaying a representation of the text data uses a force-based network layout algorithm. The method for extracting clusters of relevant concepts for display consists of identifying “communities of words” as if the network of words were a social network. Each cluster so identified usually represents a specific topic or sub-topic, or a set of tightly connected concepts in the text data.
One object of the invention is to enable the user to analyze one or more documents in any format that supports text data. The user specifies the location(s) of the document(s), and the system parses and analyzes the document(s) as described for the news reader example, and displays the salient features as a network of words using the same method.
Another object of the invention is to enable the user to perform a search and visualize the contents of the search results pages as a network of words. The user enters a search query, whereupon the system hands the query over to one or several search engines such as Alexa, Google, Yahoo, MSN, Technorati, or others, via open Application Programming Interfaces, collects the results, opens the URLs or documents returned by the search engine(s), parses the text data, analyzes it, and displays the salient features as a network of words using the same method.
These and other features and advantages of the invention(s) disclosed herein will be more fully understood by reference to the following detailed description, in conjunction with the attached drawings.
To provide an overall understanding, certain illustrative embodiments will now be described; however, it will be understood by one of ordinary skill in the art that the systems and methods described herein can be adapted and modified to provide systems and methods for other suitable applications and that other additions and modifications can be made without departing from the scope of the systems and methods described herein.
Unless otherwise specified, the illustrated embodiments can be understood as providing exemplary features of varying detail of certain embodiments, and therefore, unless otherwise specified, features, components, modules, and/or aspects of the illustrations can be otherwise combined, separated, interchanged, and/or rearranged without departing from the disclosed systems or methods. Additionally, the shapes and sizes of components are also exemplary and unless otherwise specified, can be altered without affecting the disclosed systems or methods.
Making sense of large amounts of data across a wide range of offline or online sources has become a daily routine for large numbers of users who have a lot of data on their computer or other computational or communications device and also have access to an ever-increasing online world. A significant fraction of this data is textual. The disclosure herein makes it easy for anyone with a computer or other computational or communications device and access to a database or network connection to specify one or more sources of textual data (for example, a file in a format that supports text, such as Word, PDF, Postscript, HTML, TXT, RTF, XML, email or any other; URL; RSS feeds; URLs returned by a search query submitted to a search engine) and see a visual representation of the text data as a network of words that extracts salient features from the text data. The user can then view different clusters of words that represent salient topics in the text data. The user may also focus on one word, for example by mouse-clicking on it (or by a similar or analogous operation), to see which other words or clusters the selected word is connected to.
Users may also access textual data and information from a range of other computational devices, such as PDAs, smartphones or other devices yet to appear in the marketplace. The method and system described here will work with all such devices, and this disclosure covers such applications.
For illustrative purposes and without limitation, the methods and systems are described herein in relation to RSS feeds as text data sources.
To display the text data from the sources the user has chosen (in the illustrative example the RSS feeds) as a network of words, the system executes the following tasks (the method):
(1) It goes through each source of text data, here each RSS feed, parses each source into words, and builds a statistical model of the words and relationships between words. To do so, it first cleans up the parsing by removing non-word characters and all words that belong to a pre-specified list of stop-words. It then applies a bi/tri-gram extraction algorithm [Christopher D. Manning, Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999] to identify multi-word expressions, that is, two or three words that frequently appear together. It then applies a stemming algorithm [Julie Beth Lovins (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics 11:22-31; Jaap Kamps, Christof Monz, Maarten de Rijke and Börkur Sigurbjörnsson (2004). Language-dependent and Language-independent Approaches to Cross-Lingual Text Retrieval. In: C. Peters, J. Gonzalo, M. Braschler and M. Kluck, eds. Comparative Evaluation of Multilingual Information Access Systems. Springer Verlag, pp. 152-165; Eija Airio (2006). Word normalization and decompounding in mono- and bilingual IR. Information Retrieval 9:249-271] to reduce variant occurrences of a word to a single word, e.g., car and cars are both reduced to the single word car. It then counts the number of times each word appears in the sources, and how many times any two words appear together in the same sentence, where a sentence is defined as the smallest ensemble of words between two full stops (periods). Let Nk be the number of times word k appears in the sources, and Njk=Nkj the number of times words j and k appear together in a sentence. Conditional frequencies are then defined by F(j/k)=Njk/Nk and F(k/j)=Njk/Nj.
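By way of illustration only, the counting stage described above can be sketched in Python. This is a minimal sketch, not the disclosed implementation: the stop-word list is abbreviated, the bi/tri-gram and stemming steps are omitted, each word pair is counted at most once per sentence, and all function and variable names are illustrative assumptions.

```python
import re
from collections import Counter
from itertools import combinations

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}  # abbreviated stop-word list

def cooccurrence_stats(text):
    """Count word occurrences (Nk) and sentence-level co-occurrences (Njk)."""
    n_k = Counter()   # Nk: number of times word k appears
    n_jk = Counter()  # Njk: number of sentences containing both j and k
    for sentence in re.split(r"[.!?]+", text):
        words = [w for w in re.findall(r"[a-z]+", sentence.lower())
                 if w not in STOP_WORDS]
        n_k.update(words)
        # each unordered pair of distinct words is counted once per sentence
        for j, k in combinations(sorted(set(words)), 2):
            n_jk[(j, k)] += 1
    return n_k, n_jk

def cond_freq(j, k, n_k, n_jk):
    """Conditional frequency F(j|k) = Njk / Nk."""
    njk = n_jk.get(tuple(sorted((j, k))), 0)
    return njk / n_k[k] if n_k[k] else 0.0
```

Because only the two counters are retained, the model can be built in a single streaming pass over all sources, which is consistent with the speed claims made above.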
The window in which two words j and k appear together does not have to be a sentence. It can be a paragraph or a window of fixed size W within a sentence or within a paragraph, a fast but semantic-less approach. We have found, however, that using the sentence as the window maintains some semantic information in the statistics without the great variability of paragraph length and its associated speed issues.
(2) Once the Nk and Njk values have been computed for all words and pairs of words, the weights of words and links between words are computed. The weight, Wk, of word k is defined by Wk=Nk/(ΣjNj). The weight of the link between words j and k, Wjk, is defined by Wjk=0.5*(F(j/k)+F(k/j)). Other formulas can be used to compute Wjk. For example, if Hjk=F(j/k)*F(k/j): Wjk=SQRT(Hjk); Wjk=Log(1+Hjk); Wjk=Hjk*Log(1+Hjk); Wjk=(Hjk+(Hjk/max_i Hki))/2; or others. A normalizing scheme is applied after all Wk's and Wjk's have been computed, so that all values are recomputed to lie within ranges that can be handled by the graph, or word network, drawing algorithm.
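A minimal sketch of this weighting and normalization stage, under the same illustrative assumptions (function and variable names are not from the disclosure; `n_k` and `n_jk` hold the counts Nk and Njk):

```python
import math

def word_weights(n_k):
    """Wk = Nk / sum over j of Nj."""
    total = sum(n_k.values())
    return {k: n / total for k, n in n_k.items()}

def link_weights(n_k, n_jk, formula="mean"):
    """Wjk from co-occurrence counts; variants follow the formulas in the text."""
    w = {}
    for (j, k), njk in n_jk.items():
        f_jk = njk / n_k[k]   # F(j|k)
        f_kj = njk / n_k[j]   # F(k|j)
        h = f_jk * f_kj       # Hjk
        if formula == "mean":
            w[(j, k)] = 0.5 * (f_jk + f_kj)
        elif formula == "sqrt":
            w[(j, k)] = math.sqrt(h)
        elif formula == "log":
            w[(j, k)] = math.log(1 + h)
    return w

def normalize(weights, lo=0.0, hi=1.0):
    """Rescale all values into [lo, hi] for the drawing algorithm."""
    if not weights:
        return {}
    vmin, vmax = min(weights.values()), max(weights.values())
    span = (vmax - vmin) or 1.0
    return {key: lo + (hi - lo) * (v - vmin) / span for key, v in weights.items()}
```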
(3) The word network is then displayed (see the corresponding drawing figure) using a force-based network layout algorithm, with the words and the links between them rendered according to their computed weights.
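The force-based layout is not specified in detail in this disclosure; the following is a minimal Fruchterman-Reingold-style sketch in pure Python, in which linked words are pulled together in proportion to their link weight while all words repel one another. All constants and names here are illustrative assumptions, not the disclosed implementation.

```python
import math
import random

def force_layout(nodes, edges, iterations=200):
    """Minimal force-directed layout: weighted springs on links, repulsion between all pairs."""
    random.seed(0)
    pos = {n: [random.uniform(-1, 1), random.uniform(-1, 1)] for n in nodes}
    k = 1.0 / max(len(nodes), 1) ** 0.5  # ideal spacing constant
    for step in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}
        # repulsive force between every pair of nodes
        for a in nodes:
            for b in nodes:
                if a == b:
                    continue
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                d = math.hypot(dx, dy) or 1e-9
                f = k * k / d
                disp[a][0] += f * dx / d
                disp[a][1] += f * dy / d
        # attractive force along weighted links
        for (a, b), w in edges.items():
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            d = math.hypot(dx, dy) or 1e-9
            f = w * d * d / k
            disp[a][0] -= f * dx / d
            disp[a][1] -= f * dy / d
            disp[b][0] += f * dx / d
            disp[b][1] += f * dy / d
        # move each node, with a displacement cap that cools over time
        t = 0.1 * (1 - step / iterations)
        for n in nodes:
            dx, dy = disp[n]
            d = math.hypot(dx, dy) or 1e-9
            pos[n][0] += min(d, t) * dx / d
            pos[n][1] += min(d, t) * dy / d
    return pos  # {word: [x, y]}
```

After convergence, strongly linked words sit close together while unrelated words drift apart, which is what makes the clusters described below visually apparent.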
Once the word network has been displayed, the user can tweak it by deleting words (the display is then recalculated), moving words with the mouse (or other device or method), and/or focusing on a word. When the user focuses on a word, the most frequent N−1 words that are connected to the focal word are displayed together with their links to the focal word and among themselves (see the corresponding drawing figure).
The user can also select a second focal word (see the corresponding drawing figure).
The user can also add a word of his/her choosing by writing a word directly into the display. If the user chooses to add a new word, the new word is treated as if it were a focal word and a word network is created as if the new word were a focal word: the most frequent N−1 words that are connected to the new word are displayed together with their links to the new word and among themselves. If there are fewer than N−1 words connected to the new word, only those words connected to the new word are displayed.
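The focusing behavior described above, keeping the N−1 most frequent words connected to a focal word (or fewer, if fewer exist), can be sketched as follows. This is an illustrative sketch; `word_w` and `link_w` stand for the word and link weights computed earlier, and the function name is an assumption.

```python
def focus_subnetwork(focal, word_w, link_w, n=5):
    """Build the focused view: the n-1 most frequent words linked to `focal`,
    plus the links among them and to the focal word."""
    neighbors = [w for pair in link_w for w in pair
                 if focal in pair and w != focal]
    # keep the most frequent n-1 connected words (fewer if not enough exist)
    neighbors = sorted(set(neighbors),
                       key=lambda w: word_w.get(w, 0), reverse=True)[: n - 1]
    keep = set(neighbors) | {focal}
    links = {pair: w for pair, w in link_w.items()
             if pair[0] in keep and pair[1] in keep}
    return keep, links
```

The same routine serves for a user-added word: the new word is simply passed in as `focal`, matching the behavior described above.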
The user can also decide to examine whether news topics emerge from the text data. To do so, strongly connected clusters of words must be identified. The system employs an algorithm inspired by the detection of communities in social networks [M. E. J. Newman, Detecting community structure in networks, The European Physical Journal B 38 (2004), 321-330; M. E. J. Newman, Fast algorithm for detecting community structure in networks, Physical Review E 69 (2004)].
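As a lightweight illustrative stand-in for the cited Newman modularity method (not the cited algorithm itself), a simple weighted label-propagation sketch also groups tightly linked words into clusters; all names here are assumptions.

```python
def label_propagation(nodes, edges, iterations=20):
    """Group words into clusters: each word repeatedly adopts the label
    carrying the largest total link weight among its neighbors."""
    # build a weighted adjacency map from the link dictionary
    adj = {n: {} for n in nodes}
    for (a, b), w in edges.items():
        adj[a][b] = w
        adj[b][a] = w
    label = {n: n for n in nodes}  # start with one community per word
    for _ in range(iterations):
        changed = False
        for n in sorted(nodes):  # fixed order keeps the sketch deterministic
            if not adj[n]:
                continue
            score = {}
            for m, w in adj[n].items():
                score[label[m]] = score.get(label[m], 0.0) + w
            best = max(sorted(score), key=score.get)
            if best != label[n]:
                label[n] = best
                changed = True
        if not changed:
            break
    # group words by final label
    clusters = {}
    for n, lab in label.items():
        clusters.setdefault(lab, set()).add(n)
    return list(clusters.values())
```

Each resulting cluster then corresponds to a candidate theme or topic in the text data, as described above.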
The user can also mouse-click on (or otherwise choose) a word and see the contexts in which the word appears. For the example of RSS feeds from news sites, the user sees a window that summarizes the locations (here, URLs) where the word appears, with the title of the corresponding news item (see the corresponding drawing figure).
The methods and systems described herein are not limited to a particular hardware or software configuration, and may find applicability in many computing or processing environments. For example, the algorithms described herein can be implemented in hardware or software, or a combination of hardware and software. The methods and systems can be implemented in one or more computer programs, where a computer program can be understood to include one or more processor executable instructions. The computer program(s) can execute on one or more programmable processors, and can be stored on one or more storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), one or more input devices, and/or one or more output devices. The processor thus can access one or more input devices to obtain input data, and can access one or more output devices to communicate output data. The input and/or output devices can include one or more of the following: Random Access Memory (RAM), Redundant Array of Independent Disks (RAID), floppy drive, CD, DVD, magnetic disk, internal hard drive, external hard drive, memory stick, or other storage device capable of being accessed by a processor as provided herein, where such aforementioned examples are not exhaustive, and are for illustration and not limitation.
The computer program(s) is preferably implemented using one or more high level procedural or object-oriented programming languages to communicate with a computer system; however, the program(s) can be implemented in assembly or machine language, if desired. The language can be compiled or interpreted.
As provided herein, the processor(s) can thus be embedded in one or more devices that can be operated independently or together in a networked environment, where the network can include, for example, a Local Area Network (LAN), wide area network (WAN), and/or can include an intranet and/or the internet and/or another network. The network(s) can be wired or wireless or a combination thereof and can use one or more communications protocols to facilitate communications between the different processors. The processors can be configured for distributed processing and can utilize, in some embodiments, a client-server model as needed. Accordingly, the methods and systems can utilize multiple processors and/or processor devices, and the processor instructions can be divided amongst such single or multiple processor/devices.
The device(s) or computer systems that integrate with the processor(s) can include, for example, a personal computer(s), workstation (e.g., Sun, HP), personal digital assistant (PDA), handheld device such as cellular telephone, laptop, handheld, or another device capable of being integrated with a processor(s) that can operate as provided herein. Accordingly, the devices provided herein are not exhaustive and are provided for illustration and not limitation.
References to “a processor” or “the processor” can be understood to include one or more processors that can communicate in a stand-alone and/or a distributed environment(s), and can thus be configured to communicate via wired or wireless communications with other processors, where such one or more processor can be configured to operate on one or more processor-controlled devices that can be similar or different devices. Furthermore, references to memory, unless otherwise specified, can include one or more processor-readable and accessible memory elements and/or components that can be internal to the processor-controlled device, external to the processor-controlled device, and can be accessed via a wired or wireless network using a variety of communications protocols, and unless otherwise specified, can be arranged to include a combination of external and internal memory devices, where such memory can be contiguous and/or partitioned based on the application. Accordingly, references to a database can be understood to include one or more memory associations, where such references can include commercially available database products (e.g., SQL, Informix, Oracle) and also proprietary databases, and may also include other structures for associating memory such as links, queues, graphs, trees, with such structures provided for illustration and not limitation.
References to a network, unless provided otherwise, can include one or more intranets and/or the internet.
Although the methods and systems have been described relative to specific embodiments thereof, they are not so limited. Obviously many modifications and variations may become apparent in light of the above teachings and many additional changes in the details, materials, and arrangement of parts, herein described and illustrated, may be made by those skilled in the art.
Related Publication: US 2009/0144617 A1, Jun 2009, US.
Provisional Application: No. 60887710, Feb 2007, US.