1. Technical Field
The present invention relates to virtual environments and more particularly to systems and methods for optimizing a natural language description of a specific object or scene within a virtual environment.
2. Description of the Related Art
Virtual environments comprise computer-generated three-dimensional (3D) renderings of a 3D world model that may be based on a real environment, or represent an artificial environment. The environment is typically composed of a set of virtual objects, each of which has a particular location and orientation within the environment. A user of a virtual environment also has a specific location and orientation within the environment, which may be represented by an avatar placed at the appropriate location within the environment. This location and orientation provides a viewpoint from which the user views a scene in the environment.
As described by Lipkin in U.S. Pat. No. 6,348,927, a view of a virtual environment can be presented to a user by consulting a database of objects, finding a relevant subset of objects, and rendering those objects visually to provide a graphical view of the environment. Users who cannot see well do not have access to this view. This includes both users who have a visual impairment, and users who are accessing the virtual environment through devices with limited graphics capabilities.
Sofer, in U.S. Patent Application Publication No. 2006/0098089 A1, teaches the use of a 3D model in combination with image processing to identify objects in a real environment, and to audibly describe those objects to a person. Objects to be described are ordered according to their distance from the user, with nearer objects described first. Objects are described in natural language using names that are provided by a human in advance. In a complex environment, there may be many such objects in view.
There are two primary limitations in this approach. First, the name used to describe an object is static, but the appearance of the object within the environment changes depending on the user's viewpoint. Different features of the object may be hidden or visible, and the object may be partially occluded by other objects. Furthermore, the object may change its appearance based on the values of properties of the object. For example, a lamp may be on or off.
One known method for providing a text description that is accurate with respect to the state of a control object in a two-dimensional user interface is to provide a set of descriptions in advance, and select the appropriate description based on the state of the object at the time the description is requested. However, this method does not provide for descriptions that are sensitive to other factors such as a viewer's location with respect to the object, or to the state of the surrounding environment. This results in non-optimal, and even potentially misleading, descriptions.
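By way of a non-limiting illustration only, the known state-based selection described above can be sketched in a few lines of Python; the lamp states and phrases here are hypothetical and are not taken from any cited reference.

```python
# Hypothetical sketch of the known approach: pick a canned description
# keyed on the state of the control object at request time.
descriptions = {"on": "a lamp, switched on", "off": "a lamp, switched off"}

def describe(state: str) -> str:
    # Fall back to a state-independent name if the state is unknown.
    return descriptions.get(state, "a lamp")

print(describe("on"))  # -> "a lamp, switched on"
```

As the preceding paragraph notes, no entry in such a table can reflect the viewer's location or the surrounding environment.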
A second factor that causes natural language descriptions of a scene to be sub-optimal is the complexity of the environment. If there are many objects to be described, the scene description becomes too long. In other applications involving large numbers of objects, the objects are structured in a hierarchy, as taught by U.S. Patent Application Publication No. 2006/0198438 A1 to Negishi et al.
U.S. Pat. No. 6,329,986 to Cheng teaches a method of prioritizing a set of objects within a virtual environment to determine which objects to present to the user, and the quality of the presentation. Priority is determined using base and modifying parameters, where the modifying parameters represent circumstances, views, characteristics, and opinions of the participant. Base parameters can represent a characteristic of the environment, distance to the object, angle between the center of the viewpoint and the object, and the circumstance of a user interaction with an object. Cheng does not teach the use of the prioritization to order objects for presentation.
U.S. Pat. No. 6,118,456 teaches a method of prioritizing objects within a virtual environment, according to their importance within a scene from a particular viewpoint. This prioritization is used to determine the order in which object data is fetched from a remote server, so that important objects are rendered more quickly than less important objects. Object importance is calculated by considering the distance to the object, the visual area taken up by the object in the scene, the user's inferred area of visual focus, object movement, and an application-specific assigned ‘message’ value that is used to elevate the importance of specific objects.
These grouping and prioritization techniques have not been applied to the problem of optimizing a natural language description of a scene within a virtual environment. Furthermore, these techniques do not include consideration of several factors that contribute to an efficient scene description.
One factor for efficient scene description is the set of recent descriptions given to the user. In human communication, long descriptions are generally condensed when repeated. For example, the phrase “a red chair with four green legs” may be used the first time such a chair is described, whereas subsequent descriptions would take the form “a red chair” or, eventually, “another chair” or “five more chairs”. If five identical chairs have already been described to the user, it is preferable to group other similar chairs and describe them with a single phrase. Furthermore, conventional algorithms do not take into account the other objects present in the scene, except to calculate whether an object is visible to the user. In a scene with hundreds of chairs, a single chair should not be given a high priority, whereas in a meeting room, it should.
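By way of a non-limiting illustration, the repetition-aware condensing just described can be sketched in Python; the threshold of five and the particular phrasings are assumptions of the sketch, not values taught herein.

```python
# Sketch: shorten a description based on how often similar objects
# have already been described to the user.
from collections import Counter

def condense(full_desc: str, short_desc: str, history: Counter) -> str:
    seen = history[short_desc]       # how many times this phrase was used
    history[short_desc] += 1
    if seen == 0:
        return full_desc             # first mention: full detail
    if seen < 5:                     # assumed threshold
        return short_desc            # e.g. "a red chair"
    return "another " + short_desc.split()[-1]   # e.g. "another chair"

history = Counter()
print(condense("a red chair with four green legs", "a red chair", history))
print(condense("a red chair with four green legs", "a red chair", history))
# -> full phrase first, then "a red chair"
```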
In accordance with the present principles, a system and method of generating a natural language description of an object within a 3D model that is accurate with respect to the object's location, the viewer's viewpoint, recent activity, and the state of the object and surrounding environment are provided.
In accordance with another aspect, such descriptions are composed into a scene description and presented to a user. The scene description is constructed so as to limit the total number of objects described, and to describe more important objects before less important objects. Such a description is useful to help introduce and orient users who, for whatever reason, cannot see the visual representation of the virtual environment. In the context of virtual environments, a method that overcomes the shortcomings of existing techniques for describing a virtual scene in words is thus provided.
A system and method for constructing a natural language description of one or more objects in a virtual environment include determining a plurality of properties of an object and an environment given a current viewpoint in a virtual environment; creating an object description using the plurality of properties where the object description reflects multiple display characteristics of the object in the virtual environment; and combining object descriptions by classifying objects in the virtual environment to condense a natural language description.
Another system and method for constructing a natural language description of an object in a three-dimensional (3D) virtual model includes determining a plurality of properties of an object and an environment given a current viewpoint in a virtual environment; creating an object description using the plurality of properties where the object description reflects multiple display characteristics of the object in the virtual environment including at least one of an angle from which the object is viewed; a distance of the object from the viewpoint; a portion of the object that is visible from the viewpoint; and values of properties of other objects also present in the environment; classifying objects in the virtual environment to optimize a natural language description of a virtual scene including at least one of replacing sets of similar objects with a single group object and a corresponding natural language description, prioritizing a set of objects and filtering the set of objects; and outputting the natural language description of the virtual scene as synthesized speech.
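By way of a non-limiting illustration only, and not as a definition of the disclosed embodiments, the pipeline summarized above may be sketched end-to-end in Python; all names, the distance-based prioritization, and the name-based grouping heuristic are assumptions of the sketch.

```python
# Sketch: filter by visibility, prioritize by distance, group identical
# objects, and emit condensed text for each group.
from dataclasses import dataclass

@dataclass
class Obj:
    name: str
    distance: float
    visible: bool

def describe_scene(objects: list[Obj]) -> list[str]:
    visible = [o for o in objects if o.visible]        # filter
    visible.sort(key=lambda o: o.distance)             # prioritize: nearer first
    counts: dict[str, int] = {}
    for o in visible:                                  # group identical names
        counts[o.name] = counts.get(o.name, 0) + 1
    return [name if n == 1 else f"{n} {name}s"         # condensed descriptions
            for name, n in counts.items()]

print(describe_scene([Obj("chair", 2.0, True), Obj("chair", 3.0, True),
                      Obj("lamp", 1.0, True), Obj("tree", 50.0, False)]))
# -> ['lamp', '2 chairs']
```

Each stage of this sketch corresponds to one of the methods detailed below.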
A system in accordance with the present principles constructs a natural language description of one or more objects in a virtual environment. A processing system is configured to generate an object and an environment in a virtual rendering, to determine a plurality of properties of the object and the environment given a current viewpoint in the virtual environment, and to create an object description using the plurality of properties where the object description reflects multiple display characteristics of the object in the virtual environment. The one or more memory storage devices or memory in the processing unit is/are configured to provide constructions, templates or other formats for combining object descriptions by classifying objects in the virtual environment in accordance with stored criteria. This is employed to condense a natural language description of the virtual environment. An output device is configured to output the natural language description.
These and other features and advantages will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
The disclosure will provide details in the following description of preferred embodiments with reference to the accompanying figures.
In accordance with illustrative embodiments, a system and method for generating a natural language description of an object within a 3D model that is accurate with respect to the object's location, a viewer's viewpoint, recent activity, and a state of the object and surrounding environment are provided. Such descriptions are composed into a scene description and presented to a user. This scene description is constructed so as to limit the total number of objects described, and to describe more important objects before less important objects. Such a description is useful to help introduce and orient users who, for whatever reason, cannot see the visual representation of the virtual environment, or for other reasons.
According to one aspect, a system and method are provided for obtaining a natural language description of an individual object within a virtual environment. The description is constructed by considering the distance between the viewpoint and the object, the angle between the center of view and the object, the portion of the object that is visible from the viewpoint, the values of properties of other objects present in the environment, the value of properties of the object itself, and the object descriptions that have already been provided to the user.
According to a further aspect, a set of objects to be described is identified, the number of objects is reduced or condensed by grouping subsets of similar objects, and the objects are prioritized. Natural language descriptions of the resulting set of objects and groups are generated. This may be performed in any order, and may be repeated.
Filtering of the set of objects to be considered may be performed at any stage. For example, an initial set may include only objects that are visible from the user's current viewpoint, or only objects with certain properties. Finally, the set of natural language descriptions is presented to the user. This presentation may take the form of a single natural language description composed from the individual descriptions, or it may include a set of individual descriptions. This presentation may also involve the use of 3D audio properties to reflect the location of each object, the use of synthesized speech to speak the descriptions to the user, or any other method of communicating natural language text to a user.
The user may interrupt the presentation of the description, and the object being described at the point of interruption will be stored for use in further processing. For example, the user may set this object as a target they can then navigate to. The generation of the natural language description of a scene may be triggered by a user command, by the action of moving to a specific viewpoint, a particular state of the environment, or a particular state of the user within that environment.
Various features may be used to guide the prioritization of objects within the environment. Specifically, prioritization may be affected by: the degree of fit between a user's query and objects, the location relative to the viewpoint, the size of the object, the proportion of the view occupied by the object, visual properties of the object in the scene, the type of the object, metadata associated with the object, a text name and description associated with the object, the object's velocity, acceleration and orientation, the object's animation, the user's velocity, acceleration and direction of movement, previous descriptions provided to the user, and/or other objects present in the scene.
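By way of a non-limiting illustration, a small subset of these features may be combined into a weighted score, as in the following Python sketch; the weights and the chosen features are assumptions of the sketch, not values taught herein.

```python
# Sketch: weighted priority score over a few illustrative features.
import math

def priority(distance: float, screen_fraction: float,
             speed: float, times_described: int) -> float:
    score = 0.0
    score += 1.0 / (1.0 + distance)     # nearer objects matter more
    score += 2.0 * screen_fraction      # proportion of the view occupied
    score += 0.5 * math.tanh(speed)     # moving objects attract attention
    score -= 0.3 * times_described      # demote recently described objects
    return score

print(priority(distance=2.0, screen_fraction=0.1, speed=0.0, times_described=1))
```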
Embodiments in accordance with the present principles may be employed in video games, virtual tours, navigation systems, cellular telephone applications or other computer or virtual environments where a displayed or displayable scene needs to be described audibly. The present embodiments may be employed to provide explanation in a virtual environment which matches an actual environment, such as in a navigation application or virtual tour. In other embodiments, people with visual impairments will be able to hear a verbal (natural speech) description of a virtual environment or a virtual model of a real environment.
Embodiments of the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment including both hardware and software elements. In a preferred embodiment, the present invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
Furthermore, the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that may include, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
A data processing system suitable for storing and/or executing program code may include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code to reduce the number of times code is retrieved from bulk storage during execution. Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) may be coupled to the system either directly or through intervening I/O controllers.
Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.
Objects as referred to herein include virtual objects that are rendered or are renderable on a screen, display device or other virtual environment. Such objects may also include software representations. The objects may be rendered visually or acoustically to be perceived by a user.
Referring now to the drawings, in which like numerals represent the same or similar elements, and initially to the first figure, a method for generating and condensing descriptions of objects in a virtual environment is illustratively shown. A description is first composed for a current object.
In block 30, the composed object description is compared to other object descriptions. These may be descriptions already calculated for objects in a current set of objects, or descriptions that have already been generated and provided to the user. In block 40, a decision is made as to whether the current description should be condensed. In one embodiment, a description is condensed if it is identical to one of the other object descriptions. In another embodiment, the description is condensed if it is greater than a certain length threshold, and judged as similar to a threshold number of other descriptions, and judged as being of lower priority compared to those descriptions, where priority may be equated with distance from the viewpoint or some other property or combination of properties. If the description is judged to be one that should be condensed, then in block 50, the property of ‘condensed’ is applied to it.
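By way of a non-limiting illustration, the condensing decision of blocks 30-50 may be realized as in the following Python sketch; the length threshold, the token-overlap similarity measure, and the count of similar descriptions required are assumptions of the sketch.

```python
# Sketch: decide whether a description should be condensed, per the
# two embodiments above (identical match, or long + similar + lower priority).
def should_condense(desc: str, others: list[str], priority: float,
                    other_priorities: list[float],
                    max_len: int = 40, min_similar: int = 2) -> bool:
    if desc in others:                    # identical description exists
        return True
    if len(desc) <= max_len:              # short enough to keep as-is
        return False
    tokens = set(desc.split())
    similar = [i for i, d in enumerate(others)
               if len(tokens & set(d.split())) >= len(tokens) // 2]
    # Condense only if enough similar descriptions exist, all of which
    # belong to objects of equal or higher priority.
    return (len(similar) >= min_similar and
            all(other_priorities[i] >= priority for i in similar))

print(should_condense("a red chair with four green legs",
                      ["a red chair with four green legs"], 0.5, [0.9]))  # True
```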
Referring to the next figure, a method for selecting description components for an object is illustratively shown. A description component is fetched in block 102, and its properties are compared with the properties of the current object.
If the properties match, the description component is added to a list of description components that match the current object's properties in block 114. The next description component is then fetched in block 102 and examined in the same way, until all available description components have been examined.
When no more description components are available, a description template is selected in block 106, into which the components will be placed. Selection of this template may be based on the type of the object, the number and type of description components selected, or other criteria. When the template has been selected, the components needed to instantiate the template are compared with the set of selected components. If needed components are missing as determined in block 110, then generic object description components are fetched to fill these roles in block 112. For example, if no description components are available for an object, the generic component “unknown object” could be selected. This selection may include, for example, the same property matching approach described above.
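By way of a non-limiting illustration, template instantiation with generic fallbacks (block 112) may be sketched in Python; the template syntax, role names and generic phrases are assumptions of the sketch.

```python
# Sketch: fill a description template from selected components, using
# generic fallbacks for any missing roles.
def instantiate(template: str, components: dict[str, str]) -> str:
    GENERIC = {"noun": "unknown object", "state": "", "detail": ""}
    filled = {role: components.get(role, GENERIC[role])  # fill missing roles
              for role in GENERIC}
    # Collapse extra spaces left by empty roles.
    return " ".join(template.format(**filled).split())

print(instantiate("a {state} {noun} {detail}",
                  {"noun": "lamp", "state": "lit"}))     # -> "a lit lamp"
```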
Referring to the next figure, properties of description components are illustratively shown. One property indicates the portion of the object that the component describes, for example, the door of a lighthouse.
If this portion of the object is not visible, this description component will not be selected. Another property is the direction from which the feature being described is visible. For example, the door of the lighthouse is only visible from the South West, South or South East; if the viewpoint is not in one of these directions, this component will not be selected. Another property is the distance range from which the component is visible. This enables some features to be described only when the viewpoint is near. Another property is a priority value for the component, so that more important components take priority over less important ones. A relevance function or expression may also be provided that calculates whether the description component is relevant to the current situation. This calculation would be based on the object properties calculated as described above.
An example of such a calculation is a function that evaluates a component to be relevant when only the base of the tower is visible, or when the object in question is an avatar moving away from the viewpoint. Description components also include a property that indicates the type of component, which is used during template selection and instantiation as described above.
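By way of a non-limiting illustration, a description component carrying the properties just listed may be sketched as a small Python data structure; the field names and the sample lighthouse data are assumptions of the sketch.

```python
# Sketch: a description component and a viewpoint-sensitive selection test.
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class DescriptionComponent:
    text: str
    portion: str                                        # part of the object described
    directions: set[str] = field(default_factory=set)   # empty = any direction
    max_distance: float = float("inf")                  # visible distance range
    priority: int = 0
    relevant: Optional[Callable[[dict], bool]] = None   # relevance function

def select(comp: DescriptionComponent, view: dict) -> bool:
    if comp.portion in view["hidden_portions"]:         # portion not visible
        return False
    if comp.directions and view["direction"] not in comp.directions:
        return False
    if view["distance"] > comp.max_distance:            # too far away
        return False
    return comp.relevant is None or comp.relevant(view)

door = DescriptionComponent("with a red door", "door",
                            directions={"SW", "S", "SE"}, max_distance=50.0)
print(select(door, {"direction": "S", "distance": 20.0,
                    "hidden_portions": set()}))         # -> True
```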
Referring to the next figure, a method for composing a scene description is illustratively shown. The method begins in block 300, where control variables filter, prioritize, group and getText indicate which operations are to be performed. If the filter or prioritize variable indicates that filtering or prioritization is needed, the corresponding operation is carried out.
If filtering and prioritization are not needed, then the group variable is examined in block 306. If the value of the group variable indicates that grouping is needed, control passes to block 314 where grouping is carried out as described below.
If grouping is not needed, then the getText variable is examined in block 308. If the value of the getText variable indicates that text should be generated, then control passes to block 316 in which natural language descriptions of all of the objects in the current set are generated as described below.
After filtering, prioritization, grouping or text generation has been performed, control then passes to block 318 where the control variables filter, prioritize, group and getText are updated to reflect the desired next steps in the procedure. Control passes back to the start in block 300. The values of the filter, prioritize, group and getText variables may be manipulated to produce any order or number of repetitions of the steps.
One procedure may filter objects to remove those that are not visible from the current viewpoint, then try to add groups, prioritize the resulting set, filter again to remove low priority items, and then generate descriptions for the remaining items. An alternative is to generate text for the items before grouping, to use the generated text as the basis for grouping decisions. Other procedures are also contemplated.
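By way of a non-limiting illustration, such a configurable ordering of steps may be sketched in Python, here driven by a list of operation names rather than four control variables; the operation implementations are trivial stand-ins assumed for the sketch.

```python
# Sketch: apply filter / prioritize / group / getText operations in any
# configured order and number of repetitions.
def compose_scene(objects, steps, ops):
    for step in steps:
        objects = ops[step](objects)
    return objects

ops = {
    "filter": lambda objs: [o for o in objs if o["visible"]],
    "prioritize": lambda objs: sorted(objs, key=lambda o: o["distance"]),
    "group": lambda objs: objs,                  # grouping stubbed out here
    "getText": lambda objs: [o["name"] for o in objs],
}
scene = [{"name": "chair", "visible": True, "distance": 3.0},
         {"name": "tree", "visible": False, "distance": 9.0}]
print(compose_scene(scene, ["filter", "prioritize", "getText"], ops))
# -> ['chair']
```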
Referring to the next figure, filtering of the set of objects is illustratively shown. Each object in the set is fetched in turn in blocks 408 and 410, and a sequence of filter functions is applied to the object. An object that fails a first filter function is removed from the set in block 420; objects that pass are tested by a second filter function.
This filter function checks whether the object is visible from a current viewpoint. If not, the object is again removed from the set in block 420. If the object passes this test, a third filter function is applied in block 416, which tests whether the object is within a certain range of the viewpoint. Objects outside this range are again removed from the set in block 420. Additional (or fewer) filter functions may be applied as illustrated in block 418. In some embodiments, only one filter function is applied. Any object that fails to pass a filter is removed from the set (block 420). Control then passes back to blocks 408 and 410 to get the next object. When all objects have been examined, the set of objects remaining in the set is returned as the filtered set.
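By way of a non-limiting illustration, the chain of filter functions of blocks 408-420 may be sketched in Python; the first, type-based predicate and the range default are assumptions of the sketch, while the visibility and range tests mirror the description above.

```python
# Sketch: apply a sequence of filter predicates; an object that fails
# any predicate is dropped from the set.
def apply_filters(objects, max_range=100.0):
    filters = [
        lambda o: o["type"] != "terrain",      # an assumed first, type-based test
        lambda o: o["visible"],                # visible from the current viewpoint
        lambda o: o["distance"] <= max_range,  # within range of the viewpoint
    ]
    return [o for o in objects if all(f(o) for f in filters)]

print(apply_filters([
    {"type": "chair", "visible": True, "distance": 5.0},
    {"type": "chair", "visible": True, "distance": 500.0},  # out of range
]))  # -> keeps only the first chair
```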
Referring to the next figure, prioritization of the set of objects is illustratively shown. A priority value is calculated for each object, based on features such as those described above, and the set is ordered by priority.
Referring to the next figure, grouping of similar objects is illustratively shown. A description vector reflecting the properties of each object is first computed for every object in the set.
When all objects have a description vector, standard cluster analysis algorithms known in the art may be applied to the set of objects to identify groups of similar objects in block 608. Control then passes to block 610, where a decision is made as to whether more groups are needed. This decision may be based on comparing the total number of objects in the set with a predefined ideal maximum number of objects. Coherence and size of the tightest cluster in the cluster analysis may also be taken into account. If the total number of objects is below the target level, or no suitable cluster has been identified, more groups are not needed. The system/method terminates, and the current set of objects (including any groups) is returned.
If more groups are desired, control passes to block 612, in which the tightest cluster in the analysis is identified. This is the set of objects with minimum distance between them according to the output of the cluster analysis. This group of objects is then removed from the cluster analysis in block 614, and the items in the cluster are removed from the set of objects in block 616.
A new object is created to reflect the group in block 618. This new object includes information about the objects in the group. For example, if the objects already have text descriptions, a group text description is generated at this stage, according to the group description method described below.
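By way of a non-limiting illustration, the loop of blocks 608-618 may be sketched in Python. As a dependency-free stand-in for standard cluster analysis, the sketch treats the closest pair of description vectors as the tightest cluster and replaces it with a single group object; all names and the averaging rule are assumptions of the sketch.

```python
# Sketch: find the tightest cluster (here, the closest pair of vectors),
# remove its members from the set, and add one group object.
from itertools import combinations

def vector_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def group_once(objects):           # objects: list of (name, description_vector)
    if len(objects) < 2:
        return objects
    (i, a), (j, b) = min(combinations(enumerate(objects), 2),
                         key=lambda p: vector_distance(p[0][1][1], p[1][1][1]))
    rest = [o for k, o in enumerate(objects) if k not in (i, j)]
    group = (f"group of {a[0]} and {b[0]}",            # new group object
             [(x + y) / 2 for x, y in zip(a[1], b[1])])
    return rest + [group]

objs = [("chair", [1.0, 0.0]), ("chair", [1.1, 0.0]), ("lamp", [5.0, 3.0])]
print(group_once(objs))   # the two chairs merge into a single group object
```

Repeating such a step until the set is small enough corresponds to the decision in block 610.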
Referring to the next figure, generation of natural language descriptions for a set of objects is illustratively shown. The set of objects is tested in block 702 to determine whether any objects still need descriptions. For each such object, an individual object description is generated, or, if the object represents a group, a group description is generated.
After a description or group description has been generated, the set of objects is again tested (block 702) to see if further objects need descriptions. When all objects have descriptions, the system/method terminates.
Referring to the next figure, generation of a group description is illustratively shown. Descriptions are first obtained for the objects in the group and compared. If all of the object descriptions are identical, the group description is formed by prefacing the shared description with the number of objects in the group.
If the object descriptions are not all identical, the degree of similarity is assessed in block 808. If the descriptions are sufficiently similar (for example, the primary noun is the same in all cases), a group description is generated by, e.g., combining the primary noun phrases and prefacing with the number of objects, for example, “Seven chairs and tables”. It should be understood that more sophisticated ways to provide a group description may be employed, e.g., using an ontology to identify a common superclass and getting a description for that superclass, or finding the “correct” collective noun phrase (e.g., “a crowd of people” or “several items of furniture”). Condensing descriptions extends to other parts of speech, idiomatic phrases, and the like.
The description is then returned from block 812. If the descriptions are not sufficiently similar for this merging, a more generic group description is produced in block 814. This may be a generic object description that matches all of the objects in the group. Again, this may be prefaced with the number of objects represented in the group, for example, “Fifty large metal objects.” Other embodiments may provide alternative schemes for generating the group descriptions, including the use of group templates, or group phrases explicitly provided as object description components. In another embodiment, description components for groups are explicitly defined in the same way as object description components.
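By way of a non-limiting illustration, the three-way logic of blocks 808-814 may be sketched in Python; the last-word "primary noun" guess, the naive pluralization, and the similarity cutoff are assumptions of the sketch.

```python
# Sketch: identical descriptions are counted, similar ones are merged on
# shared primary nouns, and anything else falls back to a generic phrase.
def group_description(descs: list[str]) -> str:
    n = len(descs)
    if len(set(descs)) == 1:                   # all identical
        return f"{n} {descs[0]}s"
    nouns = {d.split()[-1] for d in descs}     # crude primary-noun guess
    if len(nouns) <= 2:                        # sufficiently similar
        return f"{n} " + " and ".join(sorted(f"{x}s" for x in nouns))
    return f"{n} objects"                      # generic fallback

print(group_description(["chair"] * 7))                         # -> "7 chairs"
print(group_description(["red chair", "oak table", "blue chair"]))
# -> "3 chairs and tables"
```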
Referring to the next figure, presentation of the scene description to a user is illustratively shown. A prioritized list of object descriptions is first assembled.
In block 908, a summary statement for the object list is generated to provide an overview of the scene. This may be generated from a template that calls out specific object types. For example “5 avatars, 6 villains nearby, 2 buildings and 35 other objects”. In another embodiment, the summary provides orienting information such as “North side of Wind Island, facing East. Standing on grass. Cloudy sky above.” In block 910, the summary statement is combined with the prioritized list of object descriptions to produce a final natural language description. This step of combining the descriptions may include removing objects from the list that have been described in the summary statement (e.g. sky, grass). It may also include the addition of information about the relative positions of objects in the list. For example, “a golden haired woman sitting in a green armchair”.
In block 912, the resulting statement may be presented to a user as audio, in which case some embodiments will attach 3D audio information to the items in the list such that their audio representation reflects the distance and location of the object relative to the viewpoint in block 914. For example, a distant object on the left could be described using a low volume in the left ear. The description may be presented to the user as synthesized speech, in block 916. If audio presentation is not desired, the description can be provided as electronic text to be rendered in a manner of the user's choosing in block 918. For example, it may be converted into Braille and provided via a refreshable Braille display.
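By way of a non-limiting illustration, simple 3D audio cues of the kind described for blocks 912-914 may be sketched in Python; the particular pan and volume mappings are assumptions of the sketch, not taught mappings.

```python
# Sketch: pan follows the object's bearing relative to the viewpoint,
# and volume falls off with distance.
import math

def audio_params(bearing_deg: float, distance: float) -> dict:
    pan = math.sin(math.radians(bearing_deg))   # -1 = left, +1 = right
    volume = 1.0 / (1.0 + distance / 10.0)      # quieter when farther away
    return {"pan": round(pan, 2), "volume": round(volume, 2)}

print(audio_params(bearing_deg=-90.0, distance=40.0))
# -> {'pan': -1.0, 'volume': 0.2}  (a distant object on the left)
```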
Referring to the next figure, an illustrative system 1000 for constructing and outputting natural language descriptions of objects in a virtual environment is shown. System 1000 includes a processing unit 1002, memory storage devices 1004 and 1012, a display 1006, a network 1020, an operating system 1024 and an output device 1026.
The system 1000 is configured to perform the actions and steps of the methods described above.
In one illustrative example, system 1000 constructs a natural language description of one or more objects in a virtual environment. The processing system 1002 is configured to generate an object and an environment in a virtual rendering, to determine a plurality of properties of the object and the environment given a current viewpoint in the virtual environment, and to create an object description using the plurality of properties where the object description reflects multiple display characteristics of the object in the virtual environment. The one or more memory storage devices 1004, 1012 or memory in the processing unit 1002 is/are configured to provide constructions, templates or other formats for combining object descriptions by classifying objects in the virtual environment in accordance with stored criteria. This is employed to condense a natural language description of the virtual environment. An output device 1026 is configured to output the natural language description. Output device 1026 may include a speaker which outputs synthesized speech or may include a text output displayed on display 1006.
System 1000 may be part of a network data processing system which may comprise the Internet, for example, with a network 1020 representing a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. The Internet includes a backbone of high-speed data communication lines between major nodes or host computers including a multitude of commercial, governmental, educational and other computer systems that route data and messages.
Network data processing system 1020 may be implemented as any suitable type of network, such as, for example, an intranet, a local area network (LAN) and/or a wide area network (WAN). The network data processing elements described are intended as examples, and not as an architectural limitation for the present principles.
Processing unit 1002 may include one or more processors, a main memory, and a graphics processor. An operating system 1024 may run on processing unit 1002 and coordinate and provide control of various components within the system 1000. For example, the operating system 1024 may be a commercially available operating system.
Having described preferred embodiments of a system and method for optimizing natural language descriptions of objects in a virtual environment (which are intended to be illustrative and not limiting), it is noted that modifications and variations can be made by persons skilled in the art in light of the above teachings. It is therefore to be understood that changes may be made in the particular embodiments disclosed which are within the scope and spirit of the invention as outlined by the appended claims. Having thus described aspects of the invention, with the details and particularity required by the patent laws, what is claimed and desired protected by Letters Patent is set forth in the appended claims.