The present invention generally relates to management of data associated with software application files. More particularly, the present invention relates to methods and systems for managing personally identifiable information and sensitive information in an application-independent manner.
With the advent of the computer age, computer and software users have grown accustomed to user-friendly software applications that help them write, calculate, organize, prepare presentations, send and receive electronic mail, make music, and the like. For example, modern electronic word processing applications allow users to prepare a variety of useful documents. Modern spreadsheet applications allow users to enter, manipulate, and organize data. Modern electronic slide presentation applications allow users to create a variety of slide presentations containing text, pictures, data or other useful objects.
When documents are created and edited by such applications, various forms of data are often attached to, embedded in, or otherwise associated with the documents in the form of metadata or even normal content that should be protected from access by subsequent users or recipients of the documents. For example, personally identifiable information may be exposed in macros, VBA code, comments, author tables, user edit blocks, paths and the like, so that even if a document author/editor deletes certain personally identifiable information from simple document properties, that information may still be exposed. Personally identifiable information associated with a document may provide information about the author or editor of the document, including the author/editor's full name, the author/editor's manager's name, the author/editor's company name, and the like. Other types of data associated with a document that should be protected from exposure to third parties include revisions and comments. That is, revisions and comments made in a document may be exposed to a subsequent user of the document, allowing that user to learn the content of earlier drafts that should not be exposed.
Similarly, paths may show up in a variety of unexpected places in various documents. For example, simple URLs/hyperlinks, link content, VBA code and template properties can expose path information. Such information can be used to determine the identity of others involved in authoring and editing a given document in a collaborative authoring session. Additionally, such information provides potential means for attack by hackers who may use the paths to learn of the topology of an organization's computing network.
In addition to such personally identifiable information, certain sensitive information may be included in documents that should be controlled from exposure to third party users. For example, a government agency may wish to send a document to certain users but may wish that certain information in the document should not be exposed to certain users.
The management of such personally identifiable and sensitive information has become particularly critical in an increasingly collaborative and electronic world. While the management of such information in a manner to prevent unauthorized access is often primarily focused on security, an equally important effort must be made to help prevent a user from accidentally disclosing such information through the simple exchange of document files.
It is with respect to these and other considerations that the present invention has been made.
Embodiments of the present invention solve the above and other problems by providing methods and systems for managing personally identifiable and/or sensitive information (hereinafter PII/SI) in a manner that is independent of a software application that is used for creating or editing a document containing the PII/SI.
According to an embodiment of the invention, PII/SI in a document is marked or flagged in an application-independent manner so that a consuming application programmed to discover and handle marked PII/SI may readily discover the marked information for redacting the information, editing the information, or otherwise disposing of the information as desired. According to this embodiment, a single solution application may be built for scanning documents created and/or edited by a variety of different software applications for PII/SI. Such a single solution may be applied at the individual client application level (creation/editing application), or such a solution may be applied at a server level for handling PII/SI in all documents stored at or passed through the server.
According to another embodiment of the invention, PII/SI in documents is annotated according to the Extensible Markup Language (XML). A separate XML namespace is then used to distinguish the annotated PII/SI from other content in the document. An application-independent solution may then be built for scanning a given document for all annotated information belonging to the namespace associated with the PII/SI. Once the annotated information is located in a given document, it may be redacted, edited, or otherwise processed or disposed of as desired.
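By way of illustration only, the namespace-based discovery described above may be sketched in Python using the standard xml.etree.ElementTree module. The namespace URI urn:example:pii-si, the element name pii, the sample document, and the function find_marked are hypothetical and chosen for this sketch; they are not defined by the embodiments described herein.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI reserved for PII/SI annotations (an assumption).
PII_NS = "urn:example:pii-si"

# Hypothetical document: one ordinary paragraph and one annotated paragraph.
DOC = """<doc xmlns:p="urn:example:pii-si">
  <para>Here is a sample text</para>
  <para><p:pii>My name is Joe Smith</p:pii></para>
</doc>"""


def find_marked(xml_text, ns=PII_NS):
    """Return every element whose tag belongs to the given namespace."""
    root = ET.fromstring(xml_text)
    # ElementTree expands a qualified name to "{namespace-uri}localname".
    prefix = "{" + ns + "}"
    return [el for el in root.iter() if el.tag.startswith(prefix)]


marked = find_marked(DOC)
marked_texts = [el.text for el in marked]
```

Because the scan tests only the namespace of each element, it does not need to know which application produced the document or what the surrounding elements mean.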
These and other features and advantages, which characterize the present invention, will be apparent from a reading of the following detailed description and a review of the associated drawings. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
As briefly described above, embodiments of the present invention are directed to methods and systems for managing personally identifiable information and/or sensitive information (PII/SI) in a manner that is independent of a software application that is used for creating or editing a document containing the information. These embodiments may be combined, other embodiments may be utilized, and structural changes may be made without departing from the spirit or scope of the present invention. The following detailed description is therefore not to be taken in a limiting sense and the scope of the present invention is defined by the appended claims and their equivalents.
Referring now to the drawings, in which like numerals represent like elements through the several figures, aspects of the present invention and the exemplary operating environment will be described.
Generally, program modules include routines, programs, components, data structures, and other types of structures that perform particular tasks or implement particular abstract data types. Moreover, those skilled in the art will appreciate that the invention may be practiced with other computer system configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.
Turning now to
The mass storage device 14 is connected to the CPU 4 through a mass storage controller (not shown) connected to the bus 12. The mass storage device 14 and its associated computer-readable media provide non-volatile storage for the personal computer 2. Although the description of computer-readable media contained herein refers to a mass storage device, such as a hard disk or CD-ROM drive, it should be appreciated by those skilled in the art that computer-readable media can be any available media that can be accessed by the personal computer 2.
By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, DVD, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer.
According to various embodiments of the invention, the personal computer 2 may operate in a networked environment using logical connections to remote computers through a TCP/IP network 18, such as the Internet. The personal computer 2 may connect to the TCP/IP network 18 through a network interface unit 20 connected to the bus 12. It should be appreciated that the network interface unit 20 may also be utilized to connect to other types of networks and remote computer systems. The personal computer 2 may also include an input/output controller 22 for receiving and processing input from a number of devices, including a keyboard or mouse (not shown). Similarly, an input/output controller 22 may provide output to a display screen, a printer, or other type of output device.
As mentioned briefly above, a number of program modules and data files may be stored in the mass storage device 14 and RAM 8 of the personal computer 2, including an operating system 16 suitable for controlling the operation of a networked personal computer, such as the WINDOWS operating systems from Microsoft Corporation of Redmond, Wash. The mass storage device 14 and RAM 8 may also store one or more application programs. In particular, the mass storage device 14 and RAM 8 may store an application program 105 for providing a variety of functionalities to a user. For instance, the application program 105 may comprise many types of programs such as a word processing application, a spreadsheet application, a desktop publishing application, and the like. According to an embodiment of the present invention, the application program 105 comprises a multiple functionality software application suite for providing functionality from a number of different software applications. Some of the individual program modules that may comprise the multiple functionality application suite 105 include a word processing application 125, a slide presentation application 135, a spreadsheet application 140 and a database application 145. An example of such a multiple functionality application suite 105 is OFFICE manufactured by Microsoft Corporation. Other software applications illustrated in
According to embodiments of the present invention, personally identifiable information and/or sensitive information is marked in a document in a manner that is independent of the application that creates or edits the document. A given document may be created and/or edited by a word processing application, a spreadsheet application, a slide presentation application, and the like. As described above, various forms of personally identifiable information, for example, an author's name, editing dates, author's manager's name, author's office location, and the like may be attached to or associated with the document and may be accessible by others receiving and/or reviewing the document. Similarly, various types of content may be contained in a given document that may be sensitive in nature, for example, confidential business information or secret government information.
According to embodiments of the present invention, such personally identifiable information and/or sensitive information (PII/SI) is marked in the document so that the information may be readily discovered and processed as desired. According to one embodiment of the present invention, the PII/SI is marked in a manner that is independent of the particular programming of the application responsible for creating or editing the document. Accordingly, a solution application may be built for locating PII/SI in a document independent of the application responsible for creating or editing the document. Once the marked information is located in a document, the solution application may process the marked information as desired. For example, the marked information may be redacted from the document. Thus, if it is desired that the author's name and identification information be redacted from all documents to be sent to a given location, the solution application may parse such documents to locate the marked PII/SI and redact it before allowing the documents to be forwarded to the intended recipients.
Similarly, the solution application may be utilized for editing PII/SI. For example, if it is acceptable to allow a receiving user to see an author's name, but it is not acceptable to allow a receiving user to view changes or edits made to a document, the solution application may be programmed to edit the PII/SI discovered in the document to leave the identification of the author, but to redact the changes or editing information associated with the document. In the case of sensitive information or content, the solution application may similarly redact or otherwise edit the sensitive information. For example, if a document contains sensitive government information that has been marked as PII/SI, the solution application, upon locating the marked sensitive information, may replace the sensitive information in the document with a phrase such as “redacted sensitive information.” Or, the solution application may redact the marked sensitive information altogether.
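The selective handling described above (keeping the author identification, redacting revision information, and replacing sensitive content with a phrase) may be sketched as follows. This is a minimal illustration under stated assumptions: the namespace URI, the element names author, revision, and sensitive, the policy dictionary, and the function process are all hypothetical, and the per-element policy is one possible design rather than the programming of any particular solution application.

```python
import xml.etree.ElementTree as ET

PII_NS = "urn:example:pii-si"  # hypothetical PII/SI namespace URI
PLACEHOLDER = "redacted sensitive information"

# Hypothetical document containing three kinds of marked information.
DOC = (
    '<doc xmlns:p="urn:example:pii-si">'
    "<p:author>Joe Smith</p:author>"
    "<p:revision>deleted draft text</p:revision>"
    "<p:sensitive>secret figure</p:sensitive>"
    "<body>Here is a sample text</body>"
    "</doc>"
)


def process(xml_text, policy):
    """Apply 'keep', 'remove', or 'replace' to each marked element.

    `policy` maps a local element name to an action; names that are
    not listed default to 'remove'.
    """
    root = ET.fromstring(xml_text)
    prefix = "{" + PII_NS + "}"
    # ElementTree has no parent pointers, so build a child-to-parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for el in [e for e in root.iter() if e.tag.startswith(prefix)]:
        action = policy.get(el.tag[len(prefix):], "remove")
        if action == "keep":
            continue
        if action == "replace":
            # Drop any nested content and substitute the placeholder phrase.
            for child in list(el):
                el.remove(child)
            el.text = PLACEHOLDER
            continue
        parent = parent_of.get(el)
        if parent is None:
            continue  # the root element itself cannot be removed
        # Preserve unmarked text that follows the removed element.
        index = list(parent).index(el)
        if el.tail:
            if index == 0:
                parent.text = (parent.text or "") + el.tail
            else:
                previous = parent[index - 1]
                previous.tail = (previous.tail or "") + el.tail
        parent.remove(el)
    return ET.tostring(root, encoding="unicode")


result = process(DOC, {"author": "keep", "revision": "remove", "sensitive": "replace"})
```

Defaulting unlisted names to removal is a conservative choice made for this sketch, so that marked information is never passed through merely because the policy omitted it.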
According to embodiments of the present invention, the solution application that is responsible for parsing the document to locate and process the PII/SI may be part of a multiple application suite that may be called upon to process PII/SI after the creation of a document prepared by one of the applications of the multiple application suite before the document is passed to a third party user. Alternatively, the solution application may be located at a server in a distributed computing environment and may be utilized for processing PII/SI for all documents stored at the server that are accessible by third party users. Alternatively, the solution application may be located on an electronic mail server for managing PII/SI of all documents passed through the server to third party users.
Referring now to
According to embodiments of the invention, the document 200 is associated with a schema file 210 for defining the XML applied to the document, including the XML markup tags applied to identified PII/SI and including a definition of an associated namespace utilized for the particular XML markup tags used for annotating identified PII/SI. Accordingly, a solution application 230 in association with the XML parser 220 may parse any document prepared by any application to locate PII/SI annotated with the XML markup tags. That is, so long as the solution application 230, in association with the XML parser 220, may read the schema file 210, the solution application 230 may locate identified and marked PII/SI based on the namespace associated with the markup tags applied to the PII/SI. Once the PII/SI is located, the solution application 230 may then manage and/or process the identified PII/SI to include redacting the information, editing the information, or otherwise disposing of the information as desired.
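As a hedged sketch of how a solution application might derive the PII/SI namespace from a schema file, the following Python example reads the targetNamespace attribute and the top-level element declarations of a small XML Schema and then locates matching elements in a document. The schema content, the namespace URI, the sample spreadsheet-like document, and the functions load_marker_tags and locate are assumptions made for illustration. The standard library module used here does not validate documents against a schema; the sketch only reads names from it.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema declaring the PII/SI markup and its namespace.
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="urn:example:pii-si" elementFormDefault="qualified">
  <xs:element name="pii" type="xs:string"/>
  <xs:element name="sensitive" type="xs:string"/>
</xs:schema>"""

# Hypothetical document from some other application (spreadsheet-like).
DOC = (
    '<sheet xmlns:m="urn:example:pii-si">'
    "<cell>42</cell>"
    "<cell><m:pii>Joe Smith</m:pii></cell>"
    "</sheet>"
)


def load_marker_tags(schema_text):
    """Return the qualified tags of all top-level elements in the schema."""
    schema = ET.fromstring(schema_text)
    ns = schema.get("targetNamespace")
    return {
        "{%s}%s" % (ns, decl.get("name"))
        for decl in schema.findall(XS + "element")
    }


def locate(doc_text, marker_tags):
    """Return (tag, text) for each element whose tag the schema declares."""
    root = ET.fromstring(doc_text)
    return [(el.tag, el.text) for el in root.iter() if el.tag in marker_tags]


marker_tags = load_marker_tags(SCHEMA)
found = locate(DOC, marker_tags)
```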
As described above, the solution application 230 and associated XML parser 220 may be a part of a multiple application suite containing different applications such as word processing applications, spreadsheet applications, slide presentation applications, and the like. Alternatively, the solution application 230 may be a stand-alone application that may be called by a user for processing PII/SI in a given document. Alternatively, as described above, the solution application 230 and the associated XML parser 220 may be located at a server for managing PII/SI contained in documents stored at or passing through the server to third party users.
By way of example, the following is an XML representation of a word processing document. In the example XML representation, a sample text content entry of “Here is a sample text” is included. Additionally, a portion of personally identifiable information is also included in the document, including the phrase “My name is Joe Smith” identifying the author of the document. As can be seen, the personally identifiable information in this document has not been annotated nor marked in any way to distinguish the PII/SI from other content of the document. Consequently, locating the PII/SI is difficult.
According to embodiments of the present invention, the following is an XML representation of the same word processing document, described above, where the PII/SI has been annotated with XML markup associated with an XML namespace highlighted in boldface text.
Now that the PII in the XML representation of the example word processing document has been marked with XML annotation associated with the PII/SI namespace, a solution application 230, in association with an XML parser 220, may readily parse the XML represented document to locate the PII/SI annotated according to the PII/SI namespace. As set out below, the XML represented document is illustrated after a solution application 230 has located and redacted the undesirable PII/SI. In effect, each PII/SI namespace used to identify and manage the PII/SI becomes a simple transform that can be run against any document using a file format wherein PII/SI is marked for identification according to embodiments of the present invention.
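The notion of a PII/SI namespace acting as a transform that can be run against any document may be illustrated with the following sketch, in which one function is applied unchanged to two documents that use different, unrelated vocabularies. The namespace URIs, the word processing and spreadsheet vocabularies, and the function redact_namespace are hypothetical and are assumed only for the purpose of the example.

```python
import xml.etree.ElementTree as ET

PII_NS = "urn:example:pii-si"  # hypothetical PII/SI namespace URI

# Two hypothetical documents produced by different applications.
WORD_DOC = (
    '<w:doc xmlns:w="urn:example:word" xmlns:p="urn:example:pii-si">'
    "<w:p>Here is a sample text</w:p>"
    "<w:p><p:pii>My name is Joe Smith</p:pii></w:p>"
    "</w:doc>"
)
SHEET_DOC = (
    '<s:sheet xmlns:s="urn:example:sheet" xmlns:p="urn:example:pii-si">'
    "<s:cell>42</s:cell>"
    "<s:cell><p:pii>Joe Smith</p:pii></s:cell>"
    "</s:sheet>"
)


def redact_namespace(xml_text, ns):
    """Remove every element in namespace `ns` and return the serialized result."""
    root = ET.fromstring(xml_text)
    prefix = "{" + ns + "}"
    for parent in list(root.iter()):
        for child in list(parent):
            if not child.tag.startswith(prefix):
                continue
            # Keep unmarked text that followed the removed element.
            index = list(parent).index(child)
            if child.tail:
                if index == 0:
                    parent.text = (parent.text or "") + child.tail
                else:
                    previous = parent[index - 1]
                    previous.tail = (previous.tail or "") + child.tail
            parent.remove(child)
    return ET.tostring(root, encoding="unicode")


word_out = redact_namespace(WORD_DOC, PII_NS)
sheet_out = redact_namespace(SHEET_DOC, PII_NS)
```

The function never inspects the word processing or spreadsheet element names, which is what allows the same transform to serve both file formats in this sketch.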
Having described embodiments of the present invention with respect to
At block 330, the document having marked and annotated PII/SI as described herein is passed to a solution application 230 for discovering and managing or otherwise processing any identified PII/SI. As described above, the solution application 230 and associated XML parser 220 may be a part of the application 105 used by the author/editor of the document 200. Alternatively, the solution application 230 may be a stand-alone application that may be called by an author, editor, or administrator of the document 200 for locating and managing PII/SI. Alternatively, the solution application 230 may be located at a server at which the document 200 may be stored or through which the document may be passed for receipt by a third party user.
At block 330, the document is parsed by the XML parser 220 for locating PII/SI marked up with XML tags identified as part of the PII/SI namespace as defined by the associated schema file 210. At block 335, the annotated PII/SI is identified as PII/SI. At block 340, the solution application 230 is applied to the identified PII/SI as desired. For example, the identified PII/SI may be redacted or edited, or other information not defined as PII/SI may be inserted into the document as replacement information or content for the identified PII/SI. The method ends at block 395.
As described herein, methods and systems are provided for managing and/or processing personally identifiable information and/or sensitive information in a manner that is independent of a software application used for creating or editing a document containing the information. It will be apparent to those skilled in the art that various modifications and variations may be made in the present invention without departing from the scope or spirit of the invention. Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein.