Productivity applications such as those available in the Microsoft® Office suite of applications allow users to create a number of different types of documents incorporating various types of data objects. Such objects include text, images and multimedia components. Often, only portions of these objects are seen in the display version of the document, with some of the data the object contains being hidden for various reasons.
Individuals and organizations have implicit or explicit policies for releasing a document to others. For example, a consultant or a lawyer does not want to release a Microsoft® Word document to a client that includes hidden edits in the document, and a government agency would not want to release a spreadsheet that has classified information in a hidden column. This document release problem also applies to any content within an organization that needs to be shared with external entities.
Currently, there are only limited mechanisms for removing “hidden data” from such applications. As used herein, “hidden data” includes three types of information: metadata (name, value pairs), state (control) information, and content. The content category can be further subdivided into two categories: internal and external. Internal content is recognized and directly manipulated via the application being used. Storage of internal content is clearly defined within a file format. Internal hidden content can be inserted by users, such as hidden spreadsheet columns, off-page content, and overlapping or embedded objects. External content is treated as a separate entity associated via Object Linking and Embedding (OLE) with another application responsible for presentation and activation. External content can be added to a document via copy-paste operations or explicit object insertions (or links).
Previous efforts to address hidden data have included a variety of techniques to manage these types of hidden data. For example, Microsoft® produced two versions of a Remove Hidden Data (RHD) tool, RHD 1.0 and RHD 1.1. The first tool operated on a stored file to remove a number of different types of hidden data. This required significant processing time and the tool had a limited user interface. The second version of the tool removed fewer types of hidden data, and therefore took less time to process, but was less comprehensive. Both tools operated on stored Office files. Recently the Navy Special Security Office developed an RHD tool that worked by first converting Microsoft® file formats to Open XML and then post-processing the XML data to detect a variety of hidden data. This produced a report that described a fixed and limited set of hidden data that required the user to go back into the Office document, find the hidden content based on the report, examine it for sensitive data and then keep it, edit it, or remove it as appropriate. In each case, the tools simply removed the hidden data found.
Technology is disclosed which allows users to identify hidden data contained in documents generated by productivity applications. The technology makes use of a user configurable document release policy file, and a document inspector which parses a document based on the configuration policy. Options may then be presented to the user to make changes, changes implemented automatically, or both, depending on the policy definition. The policy allows one to define the inspector interaction with the document object model to remove hidden data where appropriate, and/or insert unique comments and/or highlights into the document that a user will use to find hidden content when the type of hidden content requires human review.
In one aspect, a method implemented at least in part by a computing device is disclosed. The method includes loading a user defined document policy configuration including data types identified as hidden data. A document is then parsed for the defined hidden data and a policy defined action is executed on the hidden data in the document in accordance with the document policy configuration.
In another aspect, a method implemented at least in part by a document generation application program in a computing device is disclosed. The method includes loading a user defined document policy configuration and parsing a document for the hidden data. A list of the hidden data is provided in an interface to the user, the interface including a link redirecting the application program to display the location of the hidden data in the document.
In another aspect, a computer-readable medium in a computer having computer-executable components including an application program suitable for generating a document is provided. The computer readable medium includes a hidden document data policy definition file; and a policy execution component. The policy execution component includes a hidden data mark-up component responsive to the document policy definition and a hidden document data defined action execution component instructing the application program.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
The technology disclosed herein allows users to identify potentially sensitive information contained in documents generated by the user in productivity applications, based on a configurable document release policy. In one embodiment, the policy is provided in XML format which is executed by a document inspector. The document inspector parses a document (or document data file) based on the configuration policy and either presents options to the user to make changes, implements changes automatically, or both, based on the policy definition. The policy allows one to define the inspector interaction with the document to mark and/or remove hidden data where appropriate. Marking may include inserting unique comments and/or highlights into the document that a user can use to find hidden content when the type of hidden content requires human review.
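As a minimal sketch of how an XML policy of this kind might be loaded, the following example parses a small policy document into a mapping from hidden-data type to action. The element name `HiddenData`, the attribute names `type` and `action`, and the listed data types are hypothetical and chosen only for illustration; the actual policy schema is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical policy document. Element names, attribute names, and data
# types are assumptions for illustration, not the schema of the disclosure.
POLICY_XML = """
<DocumentReleasePolicy>
  <HiddenData type="Comments" action="Delete"/>
  <HiddenData type="TrackedChanges" action="Delete"/>
  <HiddenData type="HeadersFooters" action="Ignore"/>
  <HiddenData type="OffPageContent" action="Edit"/>
</DocumentReleasePolicy>
"""

# Actions named in the description: delete, ignore, and edit (human review).
ALLOWED_ACTIONS = {"Delete", "Ignore", "Edit"}


def load_policy(xml_text):
    """Return a dict mapping each hidden-data type to its policy action."""
    root = ET.fromstring(xml_text)
    policy = {}
    for node in root.findall("HiddenData"):
        data_type = node.get("type")
        action = node.get("action")
        if action not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action {action!r} for {data_type!r}")
        policy[data_type] = action
    return policy


policy = load_policy(POLICY_XML)
```

Keeping the policy as plain data, separate from the inspector logic, is what would let a user or organization change which data types are detected without modifying the inspector itself.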
A document may be any file in any format for storing data for use by an application on a storage media. In particular, documents refer to any of the files used by the productivity applications referred to herein to store objects which may be rendered.
In one implementation, the technology is implemented as an add-in which can interact with other components in the productivity application. As discussed below, when the productivity applications comprise the Microsoft® Office suite of applications, the Office Task Pane can be used to produce a summary report of the actions taken by the add-in and provide additional textual and graphical information that can assist the user in finding hidden content. Once the user has reviewed the document and edited or removed all sensitive content, they can click a Finish button that causes the add-in to remove the comments and/or highlights and save the sanitized file for subsequent release. The user experience is streamlined since the user remains in the application and uses the native application tools to reveal the hidden content, inspect it, and edit or delete it where appropriate. This overcomes shortcomings in previous attempts to address this issue that dealt with automatic deletion of hidden data and did not provide users with a means of inspecting, editing, and/or removing hidden content types that require human review.
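The marking and Finish behavior can be sketched as follows. This is an illustrative model only: comments are represented as plain dictionaries rather than application objects, and the marker prefix and function names are hypothetical. The assumption being illustrated is that each inserted comment carries a unique, recognizable tag, so that the Finish step can strip the inspector's markings while leaving the author's own comments in place.

```python
import uuid

# Hypothetical tag prefix used to recognize comments inserted by the inspector.
MARKER_PREFIX = "[inspector:"


def make_marker():
    """Build a unique tag so each inserted comment can be identified later."""
    return f"{MARKER_PREFIX}{uuid.uuid4().hex}]"


def mark(comments, location, note):
    """Append an inspector comment pointing the user at hidden content."""
    comments.append({"location": location, "text": f"{make_marker()} {note}"})


def finish(comments):
    """Return the comments with every inspector-inserted comment removed."""
    return [c for c in comments if not c["text"].startswith(MARKER_PREFIX)]


comments = [{"location": "page 1", "text": "Author note: check figures"}]
mark(comments, "page 3", "Hidden text found; review before release")
sanitized = finish(comments)
```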
An additional feature of the invention is that it is policy driven through an XML file that can be customized. This capability permits a user or an organization to dictate the types of data that it wants detected (as well as actions, such as always delete) as part of its document release policy.
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
With reference to
Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation,
The computer 110 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media discussed above and illustrated in
The computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110, although only a memory storage device 181 has been illustrated in
When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
In accordance with the techniques discussed herein, the document release policy is a definition of a set of data which a user or other configuring entity has determined to be of concern prior to release of the document beyond the user or entity. The policy includes definitions on how to deal with different types of data which may be overlooked before release, outside the viewable scope of a user in the document. The application programs may use a common document object model or other programmatic access to document content, which in one embodiment may be an XML document model, which may be parsed by an inspector application 270. Alternatively, the inspector application may parse the actual document file in order to work around any potential limitations which may appear in the document object model. Inspector application 270 may be a separate application developed for the specific purpose of parsing the document, a built-in component of a suite of productivity applications, or an add-in to one or more of the application programs 210, 220, 230, 240.
It should be understood that the application programs 235 shown in
The policy actions disclosed in
Many of the data types illustrated in
The “user name” data is generally defined on a global level by the application programs. Normally, users can override this information and overriding the data in any one of the application programs 210, 220, 230, 240 will override it in other programs. Headers and footers are not normally viewable in one of a number of view modes in the application programs. In some entities, headers and footers are used to identify document classification. In the policy shown in
Three types of hidden data which may not be readily apparent to a user include overlapping graphics, non-standard text headings, and off-page content. Overlapping graphics can occur when users place two different graphic files such as image files in a document and a portion of one of the images is obscured by the other, or when an image overlays text in the document. While the image may display correctly on the screen, the hidden content “below” the obscuring content can result in potentially sensitive information being disclosed. Non-standard text headings include text which may have been minimized to a level that it becomes invisible to the viewer of the document during a normal print or screen view. Text may be reduced to a font size which is imperceptible to the user, or may be colored the color of the background, but may contain sensitive information. The non-standard text headings can include a definition requiring all text smaller than a certain size to be addressed by the document inspector tool. Off-page content occurs when an image, chart or other embedded object is dragged off a page. An object can be totally dragged outside the boundary of a page and disappears without being able to be retrieved. Nonetheless, the data remains in the file and may contain sensitive information not visible to the user. The document inspector is capable of finding each of these particular types of information within the common object model utilized by the application programs.
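The three checks described above reduce to simple geometric and formatting tests, sketched below. The `Rect` type, the function names, the color representation, and the 4-point default threshold are hypothetical choices for illustration; a real inspector would read positions, font sizes, and colors from the application's object model, and the size threshold would come from the policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned bounding box in page coordinates (hypothetical model)."""
    left: float
    top: float
    right: float
    bottom: float


def overlaps(a, b):
    """True when the two boxes share any area (touching edges do not count)."""
    return (a.left < b.right and b.left < a.right
            and a.top < b.bottom and b.top < a.bottom)


def is_off_page(obj, page):
    """True when the object lies entirely outside the page boundary."""
    return not overlaps(obj, page)


def is_hard_to_see(font_size_pt, text_color, background_color, min_size_pt=4.0):
    """Flag text below a size threshold or colored the same as its background."""
    return font_size_pt < min_size_pt or text_color == background_color


page = Rect(0, 0, 612, 792)           # assumed US Letter page in points
dragged_chart = Rect(700, 100, 900, 300)
photo = Rect(100, 100, 300, 300)
logo = Rect(250, 250, 400, 400)
```

Here `is_off_page(dragged_chart, page)` flags the chart that was dragged past the right margin, and `overlaps(photo, logo)` flags the pair of images where one obscures part of the other.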
At step 404, the policy file configuration is loaded. As discussed above, the policy will contain definitions which may require automatic actions on the part of the method, or allow user interaction with certain types of potentially hidden data. At step 406, the method determines whether any data meeting the policy definition is included in the document. In one embodiment, where the user is operating the program in the context of the application program, the determination step will occur on a document presently in use by the application in the system memory. Alternatively, the tool and the determination can be launched on a stored file, which is brought into memory and loaded into the application for use by the inspector tool.
At step 410, a determination is made as to whether or not data choices are to be presented to the user. If the XML policy defined in FIG. 3 includes only delete and ignore commands, this determination will be negative and the method will continue to step 420 where it will automatically execute any delete commands on the data as defined in the policy. If data choices are to be presented to the user, then at step 412, a user interface such as that disclosed in
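The decision at step 410 and the automatic deletion at step 420 can be sketched as follows, assuming the policy has already been loaded into a mapping from data type to action. The data type names, the item dictionaries, and the function names are hypothetical and serve only to illustrate the branching.

```python
# Actions that never require a choice from the user.
AUTOMATIC_ACTIONS = {"Delete", "Ignore"}


def needs_user_choices(policy):
    """Step 410: choices are presented only if some action is not automatic."""
    return any(action not in AUTOMATIC_ACTIONS for action in policy.values())


def execute_deletes(found_items, policy):
    """Step 420: drop every found item whose type the policy marks Delete."""
    return [item for item in found_items
            if policy.get(item["type"]) != "Delete"]


# Hypothetical policies and inspection results for illustration.
auto_policy = {"Comments": "Delete", "HeadersFooters": "Ignore"}
mixed_policy = {"Comments": "Delete", "OffPageContent": "Edit"}
found = [
    {"type": "Comments", "text": "draft note"},
    {"type": "HeadersFooters", "text": "CONFIDENTIAL"},
]
remaining = execute_deletes(found, auto_policy)
```

With `auto_policy` the determination is negative and the method can proceed directly to deletion; with `mixed_policy` the presence of an Edit action means the user interface must be shown first.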
Following parsing of the document by the inspector, at step 510, a pop-up window 700 may be presented to the user illustrating certain types of data which the inspector determines is problematic based on the policy. The user is prompted to select whether or not to remove such data based on the type of data which is found. For example, in
At step 514, a determination is made as to whether more data exists which needs to be presented to the user. In the interface of
At step 610, in accordance with the document policy definition, any hidden data defined for the user to “Edit” in the policy of
At step 612, a list of all marked data is generated by type, and at step 614 the user is presented with a list of editable hidden data items with links to the particular information within the document being generated.
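Generating the list by type can be sketched as a simple grouping step. In this illustrative model each marked item is a dictionary with a `type` and a `location`; the `location` strings stand in for whatever link target the application uses to navigate to the item, and all names here are hypothetical.

```python
from collections import defaultdict


def build_review_list(marked_items):
    """Group marked items by hidden-data type, keeping a link target for each."""
    grouped = defaultdict(list)
    for item in marked_items:
        grouped[item["type"]].append(item["location"])
    # Sort by type so the list is presented in a stable order.
    return dict(sorted(grouped.items()))


# Hypothetical marked items; locations stand in for in-document links.
marked = [
    {"type": "OffPageContent", "location": "page 2, chart 1"},
    {"type": "HiddenText", "location": "page 1, paragraph 4"},
    {"type": "OffPageContent", "location": "page 5, image 3"},
]
review_list = build_review_list(marked)
```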
Referring to
If the user does correct the data at step 618, the list can be updated at step 620, and the updated list is regenerated and presented to the user at step 614. This loop continues until the user terminates the review process at step 616 and the method continues at step 414 as discussed above.
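The review loop can be sketched as below. This is a simplified model under stated assumptions: the user's behavior is supplied as a hypothetical callback, `next_correction`, which returns the item the user has just corrected or `None` when the user ends the review, and each presentation of the list is recorded rather than displayed.

```python
def review_loop(marked_items, next_correction):
    """Present the list, apply one correction at a time, repeat until done.

    next_correction is a hypothetical stand-in for user interaction: it
    receives the current list and returns the corrected item, or None
    when the user terminates the review.
    """
    remaining = list(marked_items)
    presented = []
    while True:
        presented.append(list(remaining))        # step 614: present the list
        corrected = next_correction(remaining)   # steps 616/618: user acts
        if corrected is None:                    # step 616: review terminated
            break
        # Step 620: update the list by dropping the corrected item.
        remaining = [item for item in remaining if item != corrected]
    return remaining, presented


# Simulated session: the user corrects item "b", then ends the review.
corrections = iter(["b", None])
remaining, presented = review_loop(["a", "b", "c"], lambda items: next(corrections))
```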
It should be recognized that any number of user interfaces presenting a review method to the user may be utilized. In a unique aspect, the technology uses the editing capabilities and presentation capabilities of the application program itself to present the hidden data to the user by marking the data in a fashion which can be easily discernible by the user. Standard linking techniques to the data objects within the documents are utilized to present links to such information to the user in the user interface. In this manner, editing of the hidden data can be performed within the application program itself.
In addition, the types of objects and data reviewed by the application may be extended. For example, the inspector may include the ability to search for digital media which is the subject of copyright protection. In such case, the document release policy may include a warning action generating a flag to a user to warn the user to ensure that appropriate licenses for the subject matter are within the control of the user or controlling entity.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.