A semantic model may include information about various items, and relationships between those items, and may be used to represent and understand an artifact, such as a real world entity or device. In many cases, one or more documents about an artifact (e.g., instruction manuals, user guides, repair documents, etc.) may capture knowledge or requirements related to the artifact and may be authored by a subject matter expert who has detailed knowledge of the structure and behavior of the artifact. This knowledge may comprise a mental model for the author, and is often shared to a significant degree with other subject matter experts. Unfortunately, in many cases an explicit and formal model of the structure of the artifact may not exist.
Extracting knowledge about an artifact from unstructured or semi-structured text may be attempted by statistical or other means that do not include an explicit and formal model of the artifact. For example, it may be determined that a certain section of unstructured text includes a certain term or phrase relatively frequently, and as a result, it may be inferred that the section is associated with a particular feature or portion of an artifact. This approach, however, may significantly limit the usefulness of the extracted knowledge as well as the ability of a knowledge management system to correctly capture the scope of applicability of the knowledge. Moreover, manually building a semantic model, such that extracted knowledge may then be aligned as appropriate, can be a labor-intensive, expensive, and error-prone process.
It would therefore be desirable to provide systems and methods to create a structured semantic model in an automatic and accurate manner.
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of embodiments. However, it will be understood by those of ordinary skill in the art that the embodiments may be practiced without these specific details. In other instances, well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the embodiments.
As used herein, the phrase “semantic model” may refer to, for example, a structured model that includes information about various items, and relationships between those items, and may be used to represent and understand an artifact. By way of example, the model might include: systems, subsystems, classes and subclasses, sets and subsets, and/or components and subcomponents. Note that any of these models may include further relationships between items (e.g., a sub-subsystem, relationships between sibling items, rules associated with items, etc.). As used herein, the phrase “artifact” may refer to, for example, any real world entity or device. By way of examples only, the artifact might be a physical apparatus (e.g., an airplane or heart monitor), an organization (e.g., a hospital), a business, a financial arrangement (e.g., a swap agreement or tax code), a government, a regulatory system, etc.
In many cases, one or more “documents” about an artifact may capture knowledge or requirements related to the artifact and may be authored by a subject matter expert who has detailed knowledge of the structure and behavior of the artifact. As used herein, the term “document” may refer to, for example, a web page, a text file, an image of a document, streaming document information, etc. As used herein, a “structured document” associated with an artifact contains explicit, defined information about the artifact's items and relationships between those items. Moreover, the phrase “partially unstructured document” may refer to either a completely unstructured document or a semi-structured document.
As used herein, devices, including those associated with the system 100 and any other device described herein, may exchange information via any communication network which may be one or more of a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), a proprietary network, a Public Switched Telephone Network (PSTN), a Wireless Application Protocol (WAP) network, a Bluetooth network, a wireless LAN network, and/or an Internet Protocol (IP) network such as the Internet, an intranet, or an extranet. Note that any devices described herein may communicate via one or more such communication networks.
The extraction platform 150 may store information into and/or retrieve information from the document database 160. The document database 160 may be locally stored or reside remote from the extraction platform 150. Although a single extraction platform 150 is shown in
The system 100 may extract the semantic model 170 from the documents 110 in accordance with any of the embodiments described herein. For example,
At S210, a document associated with an artifact may be received, and the document may be at least partially unstructured (e.g., the document may be completely unstructured or partially structured). The artifact might be associated with, for example, any physical apparatus, organization, business, financial arrangement, government, and/or regulatory system.
At S220, an extraction platform may automatically detect a first characteristic in an unstructured portion of the document. Similarly, at S230, the extraction platform may automatically detect a second characteristic in the unstructured portion of the document. As used herein, the term “characteristic” may comprise, for example, a feature of the unstructured portion of the document that was not authored with an intention to explicitly define an item or relationship between items for the artifact. According to some embodiments, the characteristic may be associated with a table, such as a table heading or a table column. As other examples, the characteristic might be associated with a table of contents, a chapter, a section, and/or a page number. Still other examples of characteristics that might be detected include a font size, a font attribute, a font type, an indentation, and a margin (a left and/or right margin). According to some embodiments, the document includes text and images and the characteristic is associated with a location of images within the document.
At S240, the first and second characteristics may be used to automatically create a structured semantic model representing the artifact. The structured semantic model may include, for example: systems and subsystems; classes and subclasses; sets and subsets; and/or components and subcomponents.
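By way of illustration only, steps S210 through S240 might be sketched as follows. This is a minimal sketch under stated assumptions and not a definitive implementation: it assumes the first characteristic is the depth of a dotted section number (e.g., "1.1") and the second characteristic is the indentation of the lines beneath a heading. The names Item, heading_depth, indent_depth, and build_model, as well as the sample document, are hypothetical and do not appear in the embodiments described herein.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Item:
    """One node of the structured semantic model (hypothetical type)."""
    name: str
    children: list = field(default_factory=list)


# Assumed heading form: a dotted section number followed by a title.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(.+)$")


def heading_depth(line):
    """First characteristic (assumed): depth of a dotted section number."""
    match = HEADING.match(line)
    if match is None:
        return None
    return match.group(1).count(".") + 1, match.group(2).strip()


def indent_depth(line, width=2):
    """Second characteristic (assumed): indentation level in leading spaces."""
    return (len(line) - len(line.lstrip(" "))) // width


def build_model(artifact_name, lines):
    """Combine both characteristics into a tree of items (cf. S240)."""
    root = Item(artifact_name)
    stack = [(0, root)]          # (depth, item) path from the root
    section_depth = 0
    for line in lines:
        if not line.strip():
            continue             # blank lines carry no structure
        heading = heading_depth(line)
        if heading is not None:
            depth, name = heading
            section_depth = depth
        else:
            # Non-heading lines nest below the most recent heading.
            depth = section_depth + 1 + indent_depth(line)
            name = line.strip()
        while stack[-1][0] >= depth:
            stack.pop()          # climb back up to the parent level
        item = Item(name)
        stack[-1][1].children.append(item)
        stack.append((depth, item))
    return root


# Hypothetical document received at S210.
document = [
    "1 Engine",
    "1.1 Fuel System",
    "  Fuel Pump",
    "  Fuel Filter",
    "1.2 Ignition System",
    "2 Landing Gear",
]
model = build_model("Aircraft", document)
```

In this sketch, neither characteristic was authored to define the artifact's items explicitly, yet together they may yield a system, subsystem, and component hierarchy.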
By way of example,
Thus, some embodiments may recognize and exploit patterns, outside of the explicit meaning of sentences and phrases, which may exist within a document that is normally thought of as unstructured or semi-structured text. When these patterns parallel the structure of an artifact that is the topic of the document, they may be used to create an appropriately structured semantic model of the artifact and/or to align other knowledge extracted from the document with the various components of the artifact.
Note that a semantic model capturing the structure of an artifact (such as a complex piece of equipment) is not usually explicit in documents that describe the operation or other knowledge about the artifact. The structural model may, however, partially manifest itself in various ways. One way is in the structure of the document itself. For example, even documents that we normally refer to as unstructured text often have a hierarchical section heading structure. Such a sectioning hierarchy may parallel the structure of the artifact. In other cases, semi-structured text may use indentation levels or a table structure to make the document easier for humans to understand or use as a reference. When that indexing aligns with the hierarchical structure of the artifact, that artifact structure may be implicitly captured from the document.
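As one hedged illustration of how a table structure might parallel the structure of an artifact, the following sketch assumes that table columns run from the most general item to the most specific item, and that a blank leading cell repeats the value from the row above (a common convention in reference tables). Both assumptions, the function name hierarchy_from_table, and the sample rows are hypothetical.

```python
def hierarchy_from_table(rows):
    """Build a nested dict of items from table rows (assumed layout)."""
    model = {}
    previous = []                      # path taken by the row above
    for row in rows:
        cells = [cell.strip() for cell in row]
        path = []
        inheriting = True              # still in the blank leading cells
        for index, cell in enumerate(cells):
            if cell:
                inheriting = False
                path.append(cell)
            elif inheriting and index < len(previous):
                path.append(previous[index])   # blank cell repeats the row above
            else:
                break                  # blank after a value: no deeper item
        node = model
        for name in path:
            node = node.setdefault(name, {})
        previous = path
    return model


# Hypothetical table rows: (system, subsystem, component).
table_rows = [
    ("Engine", "Fuel System", "Fuel Pump"),
    ("", "", "Fuel Filter"),
    ("", "Ignition System", "Spark Plug"),
    ("Landing Gear", "Brakes", ""),
]
table_model = hierarchy_from_table(table_rows)
```

Here the table was laid out for human reference, but its column order may still be used to recover the artifact's hierarchy.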
Some embodiments described herein may recognize and exploit any such parallelism between recognizable patterns in the document and the structure of the artifact, and use these patterns to guide the construction of a semantic model for the artifact. In some cases, such a pattern may be regular and will reflect a fixed number of levels of artifact structure (e.g., system, sub-system, and sub-sub-system). The number of levels in the document pattern may be the optimal number needed for a supporting semantic model of artifact structure to provide a foundation for capturing the knowledge of the document. That is, the number of levels may reflect the way that the subject matter expert has encoded the knowledge in his mental model.
The embodiments described herein may be implemented using any number of different hardware configurations. For example,
The processor 410 also communicates with a storage device 430. The storage device 430 may comprise any appropriate information storage device, including combinations of magnetic storage devices (e.g., a hard disk drive), optical storage devices, mobile telephones, and/or semiconductor memory devices. The storage device 430 stores a program 412 and/or an extraction engine 414 for controlling the processor 410. The processor 410 performs instructions of the programs 412, 414, and thereby operates in accordance with any of the embodiments described herein. For example, the processor 410 may receive a document associated with an artifact, the document being at least partially unstructured. In an unstructured portion of the document, processor 410 may automatically detect a first characteristic. The processor 410 may also automatically detect a second characteristic in the unstructured portion of the document. Using the first and second characteristics, a structured semantic model representing the artifact may automatically be created by processor 410.
The programs 412, 414 may be stored in a compressed, uncompiled and/or encrypted format. The programs 412, 414 may furthermore include other program elements, such as an operating system, a clipboard application, a database management system, and/or device drivers used by the processor 410 to interface with peripheral devices.
As used herein, information may be “received” by or “transmitted” to, for example: (i) the extraction platform 400 from another device; or (ii) a software application or module within the extraction platform 400 from another software application, module, or any other source.
In some embodiments (such as shown in
Referring to
The semantic model identifier 502 may be, for example, a unique alphanumeric code identifying an artifact's structured semantic model that has been automatically created from a document associated with the artifact. The document identifier 504 may indicate or point to the document that was used to create the model. The component identifier 506 may describe the component, the parent component(s) 508 may indicate parents of the component, and the child component(s) 510 may indicate any children of the component. In this way, the components may form a hierarchical structure associated with the real world artifact.
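The fields described above might be represented, for example, as a simple record type. The following is a minimal sketch only: the class name ComponentRecord, the helper functions, and the sample identifiers ("SM_101", "DOC_101") are hypothetical, and the storage format of an actual embodiment may differ.

```python
from dataclasses import dataclass, field


@dataclass
class ComponentRecord:
    semantic_model_id: str                                  # cf. identifier 502
    document_id: str                                        # cf. identifier 504
    component_id: str                                       # cf. identifier 506
    parent_components: list = field(default_factory=list)   # cf. 508
    child_components: list = field(default_factory=list)    # cf. 510


def records_from_edges(model_id, document_id, edges):
    """Create one record per component from (parent, child) pairs."""
    records = {}

    def get(name):
        if name not in records:
            records[name] = ComponentRecord(model_id, document_id, name)
        return records[name]

    for parent, child in edges:
        get(parent).child_components.append(child)
        get(child).parent_components.append(parent)
    return records


def root_components(records):
    """Components with no parent sit at the top of the hierarchy."""
    return [r.component_id for r in records.values() if not r.parent_components]


# Hypothetical identifiers and relationships.
records = records_from_edges(
    "SM_101",
    "DOC_101",
    [
        ("Engine", "Fuel System"),
        ("Engine", "Ignition System"),
        ("Fuel System", "Fuel Pump"),
    ],
)
```

Storing both parent and child references in each record allows the hierarchy to be traversed in either direction without a separate index.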
Note that other types of document characteristics may be analyzed and used to create a structured semantic model. For example,
As still another example,
As yet another example,
Thus, some embodiments described herein may provide systems and methods to create a structured semantic model in an automatic and accurate manner. Moreover, the knowledge of a subject matter expert who authored a document (e.g., representing the layout of a complex apparatus) may be captured and used to create the model even when that knowledge is not explicitly defined within a document.
The following illustrates various additional embodiments of the invention. These do not constitute a definition of all possible embodiments, and those skilled in the art will understand that the present invention is applicable to many other embodiments. Further, although the following embodiments are briefly described for clarity, those skilled in the art will understand how to make any changes, if necessary, to the above-described apparatus and methods to accommodate these and other embodiments and applications.
Although specific hardware and data configurations have been described herein, note that any number of other configurations may be provided in accordance with embodiments of the present invention (e.g., some of the information associated with the databases described herein may be combined or stored in external systems). Moreover, although some document characteristics have been provided herein as examples, any other type of document characteristic might be detected and used to create a structured semantic model for an artifact.
The present invention has been described in terms of several embodiments solely for the purpose of illustration. Persons skilled in the art will recognize from this description that the invention is not limited to the embodiments described, but may be practiced with modifications and alterations limited only by the spirit and scope of the appended claims.