The present invention relates to data mining. In particular, the present invention relates to performing data transformations for data mining purposes.
Data mining relates to processing data to identify patterns within the data. These patterns provide an effective analysis tool to aid in decision making. Text mining relates to the extension of data mining to articles and other text documents that generally include unstructured text. Text mining can aid in classifying documents for research, detecting situations within reports, predicting effectiveness for various procedures and gauging success for different operations.
Different forms of text mining utilizing a computer include keyword searches and various relevance ranking algorithms. While these methods can be effective, a significant amount of an individual's time can still be needed in order to discover and identify relevant documents. Due to the vast number of articles, e-mail messages, reports and other unstructured data, excessive amounts of individual classification can be time consuming and expensive. As a result, an efficient way to perform data mining on unstructured data would provide a valuable tool.
A method for performing data mining is provided. The method includes selecting at least one data source of unstructured text. Additionally, a transformation is selected to identify a list of terms in the unstructured text. A run-time path is established to connect the data source to the transformation and to load the list of terms identified into a destination database.
The present invention relates to utilizing extraction, transformation and loading processes to provide an efficient tool for text mining. Using the present invention, transformation modules can be utilized in order to establish a pipeline for text mining. In particular, a term extraction transformation and a term look-up transformation can be utilized to provide effective text mining. Before addressing the present invention in further detail, a suitable environment for use with the present invention will be described.
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, telephony systems, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices. Tasks performed by the programs and modules are described below and with the aid of figures. Those skilled in the art can implement the description and figures as processor executable instructions, which can be written on any form of a computer readable medium.
With reference to
Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation,
The computer 110 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media discussed above and illustrated in FIG. 1, provide storage of computer readable instructions, data structures, program modules and other data for the computer 110. In
A user may enter commands and information into the computer 110 through input devices such as a keyboard 162, a microphone 163, and a pointing device 161, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. For natural user interface applications, a user may further communicate with the computer using speech, handwriting, gaze (eye movement), and other gestures. To facilitate a natural user interface, a computer may include microphones, writing pads, cameras, motion sensors, and other devices for capturing user gestures. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 190.
The computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110. The logical connections depicted in
When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
At step 226, a flow of tasks and transformations to create a destination database is defined. This flow creates a pipeline for data that can easily be viewed and modified such that text mining can be easily performed. Method 220 can be used for different text mining tasks such as analyzing a training corpus for patterns and/or identifying relevance of new documents. As discussed below, two such transformations used for these processes are term extraction and term look-up, which can form part of the pipeline for text mining processes. These transformations identify a list of terms in the one or more data sources. Data resulting from these transformations can be loaded into other databases and/or be used with other data mining processes in a run-time environment.
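By way of illustration only, the flow described above can be sketched as a chain of transformations applied between a data source and a destination. The following Python sketch is not the claimed implementation; the names `run_pipeline`, `lowercase_text` and `split_into_terms`, and the use of in-memory lists in place of databases, are assumptions made for this example.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]
Transformation = Callable[[Iterable[Row]], Iterable[Row]]


def run_pipeline(source: Iterable[Row],
                 transformations: List[Transformation],
                 destination: List[Row]) -> List[Row]:
    """Pass rows from the source through each transformation in order,
    then load the resulting rows into the destination."""
    rows = source
    for transform in transformations:
        rows = transform(rows)
    for row in rows:
        destination.append(row)
    return destination


def lowercase_text(rows: Iterable[Row]) -> Iterable[Row]:
    # Hypothetical transformation: normalize the text column.
    for row in rows:
        yield {**row, "text": str(row["text"]).lower()}


def split_into_terms(rows: Iterable[Row]) -> Iterable[Row]:
    # Hypothetical transformation: emit one output row per whitespace-separated term.
    for row in rows:
        for term in str(row["text"]).split():
            yield {"doc_id": row["doc_id"], "term": term}


# In-memory stand-ins for a data source and a destination database (assumption).
source = [{"doc_id": 1, "text": "Data Mining"}, {"doc_id": 2, "text": "Text mining"}]
destination = run_pipeline(source, [lowercase_text, split_into_terms], [])
```

Because each transformation only consumes and produces rows, transformations can be reordered or replaced without changing the source or the destination.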
If desired, an inclusion terms list 244 and an exclusion terms list 246 can be utilized by term extraction transformation module 242. The inclusion terms list 244 can include words and/or phrases that are particularly relevant to the desired text mining procedure. In contrast, the exclusion terms list 246 can include words and/or phrases that are either too common or too trivial (i.e., non-discriminative) for the desired text mining procedure.
These lists can be generated using a statistical measure such as tf-idf (term frequency-inverse document frequency) ranking. A tf-idf ranking measure is a way of weighting the relevance of a term to a document. The tf-idf ranking takes into account the term frequency (tf) in a given document and the inverse document frequency (idf) of the term in a collection of documents. Term frequency is a measure of how important a term is in the given document, and the document frequency of the term (i.e., the percentage of documents that contain the term) is a measure of how important the term is for a text mining procedure.
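As a non-limiting illustration, a tf-idf score can be computed as sketched below. The description does not fix a particular formula; the sketch assumes one common variant, in which tf is the term count divided by the document length and idf is the natural logarithm of the number of documents divided by the number of documents containing the term. The function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per tokenized document.

    Assumed variant: tf = count / document length,
    idf = ln(number of documents / number of documents containing the term).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical, already-tokenized documents.
docs = [["sql", "server", "data"], ["data", "service"], ["data", "mining", "data"]]
scores = tf_idf(docs)
```

Under this variant, a term such as "data" that appears in every document receives a score of zero, which is consistent with placing such a term on an exclusion list, while a term confined to few documents scores higher.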
Terms extracted from data source 240 are loaded into a glossary 248 based on the term extraction transformation module 242. If lists 244 and 246 are used, terms from the inclusion terms list 244 are loaded into the glossary 248 while terms from exclusion list 246 are excluded from glossary 248. The glossary 248 can be used during a term look-up transformation as discussed below or for other data mining purposes.
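The handling of the inclusion and exclusion lists can be sketched as follows. This is an illustrative sketch only: the function name `build_glossary` is hypothetical, and it assumes that inclusion terms are added to the extracted terms (rather than acting as a whitelist) and that a term appearing on both lists is excluded.

```python
def build_glossary(extracted_terms, inclusion_terms=None, exclusion_terms=None):
    """Combine extracted terms with optional inclusion and exclusion lists."""
    glossary = set(extracted_terms)
    if inclusion_terms:
        glossary |= set(inclusion_terms)   # assumption: inclusion terms are added
    if exclusion_terms:
        glossary -= set(exclusion_terms)   # assumption: exclusion takes precedence
    return sorted(glossary)


glossary = build_glossary(
    ["data service", "the", "SQL server"],
    inclusion_terms=["text mining"],
    exclusion_terms=["the"],
)
```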
Next, applicable noun phrase patterns are selected for extraction at step 308 based on the identified parts of speech. For example, a phrase pattern of “noun”+“noun” (e.g., data service or SQL server) will be accepted but a pattern of “verb”+“adverb” (e.g., work hard) will be rejected. At step 310, filtering criteria can be applied to the noun phrase patterns selected in step 308. For example, noun phrase patterns that are too short may be filtered. The number of words in a noun phrase can be specified by a user. At step 312, the terms and/or phrases that are found are saved and counted.
At step 314, it is determined whether there are additional sentences within the row to be processed. If there are additional sentences, method 300 returns to step 304. If no additional sentences are found in the row, method 300 proceeds to step 316 where it is determined whether there are additional rows in the document. If additional rows are found, method 300 returns to step 302. If no additional rows are found, method 300 proceeds to step 318, wherein additional filtering can be applied. For example, terms from an exclusion term list can be filtered from a final output of the term extraction transformation. Additionally, tf-idf ranking can be used to apply filtering as discussed above. At step 320, the term list is loaded to an output database. As mentioned earlier, the output includes a glossary of terms that are indicative of a pattern in a collection of documents.
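The loop over rows and sentences in method 300 can be illustrated with the following sketch. It is a simplified example rather than the described implementation: the part-of-speech lexicon `TOY_POS` is a hypothetical stand-in for a real tagger, the sentence splitter is a naive regular expression, and the names `extract_terms` and `ACCEPTED_PATTERNS` are assumptions.

```python
import re
from collections import Counter

# Hypothetical toy lexicon standing in for a real part-of-speech tagger.
TOY_POS = {
    "data": "noun", "service": "noun", "sql": "noun", "server": "noun",
    "work": "verb", "runs": "verb", "hard": "adverb", "fast": "adverb",
    "the": "det", "we": "pron",
}

# Accepted phrase patterns, as in step 308 ("noun"+"noun" is accepted).
ACCEPTED_PATTERNS = {("noun", "noun")}


def split_sentences(row_text):
    # Naive splitter (assumption): break on ., ! or ? and drop empty pieces.
    return [s for s in re.split(r"[.!?]+\s*", row_text) if s]


def extract_terms(rows, min_words=2, exclusion_terms=()):
    counts = Counter()
    for row_text in rows:                              # loop over rows
        for sentence in split_sentences(row_text):     # loop over sentences
            words = re.findall(r"[A-Za-z]+", sentence.lower())
            tags = [TOY_POS.get(w, "other") for w in words]
            for i in range(len(words)):
                for pattern in ACCEPTED_PATTERNS:      # step 308: pattern selection
                    n = len(pattern)
                    # step 310: drop patterns that are too short
                    if n >= min_words and tuple(tags[i:i + n]) == pattern:
                        counts[" ".join(words[i:i + n])] += 1  # step 312: save and count
    for term in exclusion_terms:                       # step 318: additional filtering
        counts.pop(term.lower(), None)
    return counts


rows = ["The SQL server runs the data service. We work hard.",
        "The data service runs fast."]
term_counts = extract_terms(rows)
```

In this sketch "work hard" is rejected because its tags form a “verb”+“adverb” pattern, and passing an exclusion list removes the listed terms from the final output before it would be loaded to an output database.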
At step 360, each word is analyzed to see if each word is in a reference look-up table. The reference look-up table, for example, can be a glossary as developed using a term extraction transformation discussed above with regard to
After stemming or if the word is found in the reference table, a longest common prefix test is performed at step 364. The longest common prefix test combines the words determined in step 354 and matches the longest common prefix that is in the reference table. For example, if a given sentence includes “Windows XP Professional Edition is very powerful” and the reference table includes the terms “Windows”, “Windows XP”, and “Windows XP Professional Edition”, the longest common prefix test will only count “Windows XP Professional Edition”, and not “Windows” or “Windows XP”.
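The longest common prefix test can be illustrated with the sketch below, which scans a sentence and, at each position, keeps only the longest run of words found in the reference table. The function name `longest_prefix_matches` is hypothetical, and the case-insensitive comparison and the choice to resume scanning after the end of a match are assumptions not stated in the description.

```python
def longest_prefix_matches(words, reference_terms):
    """Return reference terms found in `words`, preferring the longest match
    starting at each position."""
    reference = {tuple(term.lower().split()) for term in reference_terms}
    max_len = max((len(term) for term in reference), default=0)
    lowered = [w.lower() for w in words]

    matches = []
    i = 0
    while i < len(lowered):
        # Try the longest candidate first, then progressively shorter ones.
        for size in range(min(max_len, len(lowered) - i), 0, -1):
            if tuple(lowered[i:i + size]) in reference:
                matches.append(" ".join(words[i:i + size]))
                i += size          # assumption: matched words are not reused
                break
        else:
            i += 1                 # no reference term starts at this word
    return matches


reference_table = ["Windows", "Windows XP", "Windows XP Professional Edition"]
sentence = "Windows XP Professional Edition is very powerful".split()
found = longest_prefix_matches(sentence, reference_table)
```

For the example sentence, `found` contains only "Windows XP Professional Edition"; the shorter entries are matched only when the longer phrase is absent from the sentence.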
At step 366, the frequency of the terms and phrases found in the reference table is counted. This count is used to populate at least a portion of an output database. At step 368, it is determined whether additional sentences are found in the row. If there are additional sentences, method 350 returns to step 354. Otherwise, method 350 proceeds to step 370 where it is determined if there are additional rows in the document. If additional rows are found, method 350 returns to step 352. Otherwise, a list of the terms is loaded into a database at step 372.
As mentioned above, the term extraction and term look-up transformations can be implemented in an extraction, transformation and loading environment such as data transformation services (DTS). DTS provides a set of graphical tools to centralize data for improved decision making. The DTS tools can create custom data movement solutions that are tailored towards a particular need.
A DTS package is an organized collection of connections, DTS tasks, DTS transformations and workflow constraints that is assembled either with a DTS tool or programmatically and saved to a file. For example, the file can be a structured storage file. Each package contains one or more steps that are executed sequentially or in parallel when the package is executed. The package contains parameters to connect to data sources, copy data in database objects, transform data and notify other users or processes of events. Packages can be edited, password protected, scheduled for execution and retrieved.
A DTS task is a discrete set of functionality that is executed as a single step in a package. Each task defines a work item to be performed as part of the data movement and data transformation process. Alternatively, the task can be executed at run-time. A DTS transformation includes one or more functions or operations applied to a piece of data before the data arrives at a destination.
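The relationship between a package and its tasks can be pictured with the following sketch. It does not use or reproduce the DTS object model; the classes `Task` and `Package`, the shared `context` dictionary and the sample tasks are hypothetical, and only sequential execution is shown.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Task:
    """A discrete work item executed as a single step in a package."""
    name: str
    action: Callable[[Dict], None]


@dataclass
class Package:
    """An ordered collection of tasks (sequential execution only in this sketch)."""
    name: str
    tasks: List[Task] = field(default_factory=list)

    def execute(self) -> Dict:
        context: Dict = {}            # shared state passed from step to step
        for task in self.tasks:
            task.action(context)
        return context


def load_rows(context):
    # Hypothetical task: stand-in for reading rows from a data source.
    context["rows"] = ["SQL Server", "Data Service"]


def to_lowercase(context):
    # Hypothetical transformation applied before data reaches a destination.
    context["rows"] = [row.lower() for row in context["rows"]]


package = Package("demo", [Task("load", load_rows), Task("transform", to_lowercase)])
result = package.execute()
```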
Data flow window 406 includes graphical representations of a term extraction transformation 412 and a term look-up transformation 414. An arrow connects the graphical representations 412 and 414 to create a visual representation of the data flow, which in this case is the look-up transformation referencing the term extraction transformation.
The graphical representations in the screen shots above can have various associated configurable parameters in order to customize the data flow. A connection can be defined for a database source as well as a database destination. A term extraction transformation 412 includes configurable parameters for establishing a connection to a database, inclusion terms and exclusion terms. The inclusion terms and the exclusion terms can be lists as described above. Furthermore, other options for term extraction relate to selecting whether terms can be words, phrases or words and phrases. Other parameters relate to frequency thresholds and a maximum length of terms allowed.
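The configurable parameters of the term extraction transformation can be pictured as a small settings object that is applied to the counted terms. The sketch below is illustrative only: the names `TermType`, `TermExtractionConfig` and `apply_config`, the default values, and the interpretation of maximum term length as a number of words are assumptions.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class TermType(Enum):
    WORDS = "words"
    PHRASES = "phrases"
    BOTH = "both"


@dataclass(frozen=True)
class TermExtractionConfig:
    term_type: TermType = TermType.BOTH
    frequency_threshold: int = 2       # minimum count for a term to be kept
    max_term_length: int = 4           # assumption: measured in words
    exclusion_terms: FrozenSet[str] = frozenset()


def apply_config(counts, config):
    """Keep only the counted terms that satisfy the configured parameters."""
    kept = {}
    for term, count in counts.items():
        n_words = len(term.split())
        if config.term_type is TermType.WORDS and n_words != 1:
            continue
        if config.term_type is TermType.PHRASES and n_words < 2:
            continue
        if n_words > config.max_term_length:
            continue
        if count < config.frequency_threshold:
            continue
        if term in config.exclusion_terms:
            continue
        kept[term] = count
    return kept


counts = Counter({"data service": 3, "server": 5, "sql server": 1})
phrases_only = apply_config(counts, TermExtractionConfig(term_type=TermType.PHRASES))
```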
Other transformations, such as a term look-up transformation, can use other associated parameters to customize operation of a text mining process. In the term look-up transformation, a connection and a reference table can be specified in order to perform the look-up. Furthermore, source columns and destination columns can also be specified in the term look-up transformation.
By creating and defining a data flow pattern using term extraction and/or term look-up transformations, a reliable, efficient text mining process can be implemented. The process helps with identifying documents that are similar by establishing a glossary of common terms. Subsequent documents can further be classified by referencing the glossary.
Although the present invention has been described with reference to particular embodiments, workers skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention.
The present application is based on and claims the benefit of U.S. provisional patent application Ser. No. 60/581,956 filed Jun. 22, 2004, the content of which is hereby incorporated by reference in its entirety.
Number | Date | Country
---|---|---
60581956 | Jun 2004 | US