The present disclosure relates to camera relocalization for real-time AR-supported network service visualization.
Examples of embodiments relate to apparatuses, methods and computer program products relating to camera relocalization for real-time AR-supported network service visualization.
Camera pose estimation can be classified into two types of problems, depending on the availability of data. In case a user enters an unknown environment for the first time and no prior data is available, the problem is called camera localization. Such a problem can be solved by the well-known visual-based simultaneous localization and mapping (SLAM) techniques, which estimate the camera pose while updating a map [AT+17]. A common method in SLAM is to find correspondences between local features extracted from the 2D image and the 3D point cloud of the scene obtained from structure from motion (SfM), and to recover the camera pose from such 2D-3D matches. However, such feature matching-based approaches do not work robustly and accurately in all scenarios, e.g., under changing lighting conditions, in textureless scenes, or with repetitive structures. Moreover, in case a user enters an environment for which a prior map (or part of the map) has been previously learned, SLAM still needs to create the point cloud and estimate the camera pose from scratch. This is because visual-based SLAM usually builds a map based on a reference coordinate system, e.g., based on the camera pose of the initial frame, and every following frame is expressed relative to this initial reference coordinate system. Each time a device enters an environment and executes the SLAM algorithm, the algorithm may build a map with respect to a different reference coordinate system.
Thus, there is a need to provide a solution to the second type of pose estimation problem, which is camera relocalization, i.e. estimating a camera pose in real-time in the same or a similar environment.
Efficiently solving this problem is essential for enabling real-time AR features in the network service of performance visualization.
[AD+19] Android Developer, Motion sensors, https://developer.android.com/guide/topics/sensors/sensors_motion, visited on Dec. 7, 2019
[AT+17] R. Mur-Artal and J. D. Tardós, “ORB-SLAM2: an open-source SLAM system for monocular, stereo, and RGB-D cameras,” IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255-1262, 2017
The following meanings for the abbreviations used in this specification apply:
Various exemplary embodiments of the present disclosure aim at addressing at least part of the above issues and/or problems and drawbacks.
Various aspects of exemplary embodiments of the present disclosure are set out in the appended claims.
According to an example of an embodiment, there is provided, for example, an apparatus comprising at least one processing circuitry, and at least one memory for storing instructions to be executed by the processing circuitry. The at least one memory and the instructions are configured to, with the at least one processing circuitry, cause the apparatus at least to input display data obtained from a first terminal endpoint device located in a first three-dimensional environment into a deep neural network model for terminal endpoint device pose estimation. The display data comprises at least image data and sensory data: image data of a captured image of at least part of the first three-dimensional environment acquired by the first terminal endpoint device at a first point of time, and sensory data indicative of at least a motion vector of a movement of the first terminal endpoint device in the three-dimensional environment acquired by the first terminal endpoint device at a second point of time. The deep neural network model is trained with, as model input, training image data and training sensory data: training image data of a captured training image of at least part of a three-dimensional training environment acquired by a training terminal endpoint device located in the three-dimensional training environment, and training sensory data indicative of at least a motion vector of a movement of the training terminal endpoint device in the three-dimensional training environment. The deep neural network model is trained with, as model output, training poses of the training terminal endpoint device in the three-dimensional training environment. Additionally, the apparatus is further caused to obtain from the deep neural network model for terminal endpoint device pose estimation, based on the input display data, a first estimated pose of the first terminal endpoint device in the first three-dimensional environment.
In addition, according to an example of an embodiment, there is provided, for example, a method comprising the steps of inputting display data obtained from a first terminal endpoint device located in a first three-dimensional environment into a deep neural network model for terminal endpoint device pose estimation. The display data comprises at least image data and sensory data: image data of a captured image of at least part of the first three-dimensional environment acquired by the first terminal endpoint device at a first point of time, and sensory data indicative of at least a motion vector of a movement of the first terminal endpoint device in the three-dimensional environment acquired by the first terminal endpoint device at a second point of time. The deep neural network model is trained with, as model input, training image data and training sensory data: training image data of a captured training image of at least part of a three-dimensional training environment acquired by a training terminal endpoint device located in the three-dimensional training environment, and training sensory data indicative of at least a motion vector of a movement of the training terminal endpoint device in the three-dimensional training environment. The deep neural network model is trained with, as model output, training poses of the training terminal endpoint device in the three-dimensional training environment. Additionally, the method further comprises the steps of obtaining from the deep neural network model for terminal endpoint device pose estimation, based on the input display data, a first estimated pose of the first terminal endpoint device in the first three-dimensional environment.
According to further refinements, these examples may include one or more of the following features:
Furthermore, according to an example of an embodiment, there is provided, for example, an apparatus configured to be connected to at least one camera unit and to at least one sensor unit. The apparatus comprises at least one processing circuitry, and at least one memory for storing instructions to be executed by the processing circuitry, wherein the at least one memory and the instructions are configured to, with the at least one processing circuitry, cause the apparatus at least to provide display data. The display data comprises at least image data and sensory data: image data of a captured image of at least part of a three-dimensional environment surrounding the apparatus captured by the at least one camera unit at a first point of time, and sensory data indicative of at least a motion vector of a movement of the apparatus in the three-dimensional environment acquired by the at least one sensor unit at a second point of time. The apparatus is further caused to display, based on the provided display data, network information associated with a first estimated pose of the apparatus in the three-dimensional environment overlaid with the captured image.
In addition, according to an example of an embodiment, there is provided, for example, a method comprising the steps of providing display data. The display data comprises at least image data and sensory data: image data of a captured image of at least part of a three-dimensional environment surrounding an apparatus captured by a camera unit configured to be connected to the apparatus at a first point of time, and sensory data indicative of at least a motion vector of a movement of the apparatus in the three-dimensional environment acquired by a sensor unit configured to be connected to the apparatus at a second point of time. The method further comprises the steps of displaying, based on the provided display data, network information associated with a first estimated pose of the apparatus in the three-dimensional environment overlaid with the captured image.
According to further refinements, these examples may include one or more of the following features:
In addition, according to embodiments, there is provided, for example, a computer program product for a computer, including software code portions for performing the steps of the above defined methods, when said product is run on the computer. The computer program product may include a computer-readable medium on which said software code portions are stored. Furthermore, the computer program product may be directly loadable into the internal memory of the computer and/or transmittable via a network by means of at least one of upload, download and push procedures.
Any one of the above aspects enables camera relocalization for real-time AR-supported network service visualization thereby solving at least part of the problems and drawbacks identified in relation to the prior art.
Thus, improvement is achieved by apparatuses, methods, and computer program products enabling camera relocalization for real-time AR-supported network service visualization.
Some embodiments of the present disclosure are described below, by way of example only, with reference to the accompanying drawings, in which:
In recent years, communication networks have been increasingly extended all over the world, including wire-based communication networks, such as the Integrated Services Digital Network (ISDN) or Digital Subscriber Line (DSL), and wireless communication networks, such as the cdma2000 (code division multiple access) system, cellular 3rd generation (3G) networks like the Universal Mobile Telecommunications System (UMTS), fourth generation (4G) communication networks or enhanced communication networks based e.g. on Long Term Evolution (LTE) or Long Term Evolution-Advanced (LTE-A), fifth generation (5G) communication networks, cellular 2nd generation (2G) communication networks like the Global System for Mobile communications (GSM), the General Packet Radio System (GPRS) and the Enhanced Data Rates for Global Evolution (EDGE), or other wireless communication systems, such as the Wireless Local Area Network (WLAN), Bluetooth or Worldwide Interoperability for Microwave Access (WiMAX). Various organizations, such as the European Telecommunications Standards Institute (ETSI), the 3rd Generation Partnership Project (3GPP), Telecoms & Internet converged Services & Protocols for Advanced Networks (TISPAN), the International Telecommunication Union (ITU), 3rd Generation Partnership Project 2 (3GPP2), the Internet Engineering Task Force (IETF), the IEEE (Institute of Electrical and Electronics Engineers), the WiMAX Forum and the like are working on standards or specifications for telecommunication networks and access environments.
Basically, for properly establishing and handling a communication between two or more end points (e.g. communication stations or elements or functions, such as terminal devices, user equipments (UEs), or other communication network elements, a database, a server, host etc.), one or more network elements or functions (e.g. virtualized network functions), such as communication network control elements or functions, for example access network elements like access points, radio base stations, relay stations, eNBs, gNBs etc., and core network elements or functions, for example control nodes, support nodes, service nodes, gateways, user plane functions, access and mobility functions etc., may be involved, which may belong to one communication network system or different communication network systems.
In view of different types of communication, conventional network services offer offline, unidirectional communication between the customer and the service provider. For example, many network planning tools require users to manually upload building plans or geographical maps, and, based on the uploaded data, they provide simple visualization features, such as a two-dimensional (2D) view of the radio map. To enable a zero-touch network service with better user experience, an online interactive service interface is disclosed herein that is automatically environment-aware and can visualize network performance with augmented reality (AR) in real-time [LS+18].
As outlined above, camera pose estimation can be classified into two types of problems, wherein it is an object of the present specification to provide a solution to the second type of pose estimation problem, camera relocalization, i.e., estimating the camera pose in real-time by exploiting fused sensory data and previously learned mapping and localization data in the same or a similar environment.
Conventional camera relocalization methods estimate the 6 degree-of-freedom (DoF) camera pose by using visual odometry techniques. For example, in [JDV+13] the 2D-to-3D point correspondences are obtained from the inherent relationship between the real camera's 2D features and their matches on a virtual image (created by projecting the map points of the prior map onto a plane using the previously localized pose of the real camera). Then, the well-known perspective-n-point (PnP) problem is solved to find the relative pose between the real and the virtual cameras. The projection error is minimized by using random sample consensus (RANSAC). However, because such a visual odometry-based method iteratively minimizes the estimation error over image frames, the performance may converge slowly (if it converges at all), and it is very sensitive to fast scene changes. To realize real-time camera relocalization for enabling AR features, a new method called “PoseNet” based on deep learning is introduced in [KC+17]. A direct mapping relationship between a single image and its corresponding camera pose is represented by a deep neural network (DNN) (as shown in
The recent work [BGK+18] proposes “MapNet”, which adds sensory data such as inertial measurement unit (IMU) measurements to enhance the model, wherein the IMU data is only used to modify the loss function, forming extra constraints on the camera movement implied by the IMU measurements. The input (image) and the output (camera pose) of the DNN remain unchanged. Thus, the above-mentioned methods do not fully exploit the information on device motion provided by the IMU and other sensors.
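For illustration, the following is a minimal sketch of the conventional 2D-to-3D relocalization step discussed above in relation to [JDV+13], assuming OpenCV, an ORB-based prior map with per-point descriptors, and known camera intrinsics; all names and parameter values are illustrative and not taken from the cited work.

```python
# Minimal sketch of conventional 2D-3D relocalization via PnP + RANSAC
# (illustrative only; assumes a prior point cloud with ORB descriptors).
import cv2
import numpy as np

def relocalize(image_gray, map_points_3d, map_descriptors, K, dist_coeffs=None):
    """Estimate the camera pose from 2D-3D matches against a prior map."""
    orb = cv2.ORB_create(nfeatures=2000)
    keypoints, descriptors = orb.detectAndCompute(image_gray, None)
    if descriptors is None:
        return None

    # Match 2D image descriptors against descriptors of the 3D map points.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(descriptors, map_descriptors)
    if len(matches) < 6:
        return None  # not enough correspondences for PnP

    pts_2d = np.float32([keypoints[m.queryIdx].pt for m in matches])
    pts_3d = np.float32([map_points_3d[m.trainIdx] for m in matches])

    # Solve the perspective-n-point problem with RANSAC to reject outliers.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_3d, pts_2d, K, dist_coeffs, reprojectionError=4.0)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)        # rotation matrix (world -> camera)
    position = (-R.T @ tvec).ravel()  # camera centre in world coordinates
    return position, R
```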
In the following, different exemplifying embodiments will be described using, as an example of a communication network to which examples of embodiments may be applied, a communication network architecture based on 3GPP standards for a communication network, such as a 5G/NR, without restricting the embodiments to such an architecture, however. It is obvious for a person skilled in the art that the embodiments may also be applied to other kinds of communication networks where mobile communication principles are integrated, e.g. Wi-Fi, worldwide interoperability for microwave access (WiMAX), Bluetooth®, personal communications services (PCS), ZigBee®, wideband code division multiple access (WCDMA), systems using ultra-wideband (UWB) technology, mobile ad-hoc networks (MANETs), wired access, etc. Furthermore, without loss of generality, the description of some examples of embodiments is related to a mobile communication network, but principles of the disclosure can be extended and applied to any other type of communication network, such as a wired communication network.
The following examples and embodiments are to be understood only as illustrative examples. Although the specification may refer to “an”, “one”, or “some” example(s) or embodiment(s) in several locations, this does not necessarily mean that each such reference is related to the same example(s) or embodiment(s), or that the feature only applies to a single example or embodiment. Single features of different embodiments may also be combined to provide other embodiments. Furthermore, terms like “comprising” and “including” should be understood as not limiting the described embodiments to consist of only those features that have been mentioned; such examples and embodiments may also contain features, structures, units, modules etc. that have not been specifically mentioned.
A basic system architecture of a (tele)communication network including a mobile communication system where some examples of embodiments are applicable may include an architecture of one or more communication networks including wireless access network subsystem(s) and core network(s). Such an architecture may include one or more communication network control elements or functions, access network elements, radio access network elements, access service network gateways or base transceiver stations, such as a base station (BS), an access point (AP), a NodeB (NB), an eNB or a gNB, a distributed or a centralized unit, which controls a respective coverage area or cell(s) and with which one or more communication stations such as communication elements or functions, like user devices or terminal devices, like a UE, or another device having a similar function, such as a modem chipset, a chip, a module etc., which can also be part of a station, an element, a function or an application capable of conducting a communication, such as a UE, an element or function usable in a machine-to-machine communication architecture, or attached as a separate element to such an element, function or application capable of conducting a communication, or the like, are capable to communicate via one or more channels via one or more communication beams for transmitting several types of data in a plurality of access domains. Furthermore, core network elements or network functions, such as gateway network elements/functions, mobility management entities, a mobile switching center, servers, databases and the like may be included.
The general functions and interconnections of the described elements and functions, which also depend on the actual network type, are known to those skilled in the art and described in corresponding specifications, so that a detailed description thereof is omitted herein. However, it is to be noted that several additional network elements and signaling links may be employed for a communication to or from an element, function or application, like a communication endpoint, a communication network control element, such as a server, a gateway, a radio network controller, and other elements of the same or other communication networks besides those described in detail herein below.
A communication network architecture as being considered in examples of embodiments may also be able to communicate with other networks, such as a public switched telephone network or the Internet. The communication network may also be able to support the usage of cloud services for virtual network elements or functions thereof, wherein it is to be noted that the virtual network part of the telecommunication network can also be provided by non-cloud resources, e.g. an internal network or the like. It should be appreciated that network elements of an access system, of a core network etc., and/or respective functionalities may be implemented by using any node, host, server, access node or entity etc. being suitable for such a usage. Generally, a network function can be implemented either as a network element on a dedicated hardware, as a software instance running on a dedicated hardware, or as a virtualized function instantiated on an appropriate platform, e.g., a cloud infrastructure.
Furthermore, a network element, such as communication elements, like a UE, a terminal device, control elements or functions, such as access network elements, like a base station (BS), a gNB, a radio network controller, a core network control element or function, such as a gateway element, or other network elements or functions, as described herein, and any other elements, functions or applications may be implemented by software, e.g. by a computer program product for a computer, and/or by hardware. For executing their respective processing, correspondingly used devices, nodes, functions or network elements may include several means, modules, units, components, etc. (not shown) which are required for control, processing and/or communication/signaling functionality. Such means, modules, units and components may include, for example, one or more processors or processor units including one or more processing portions for executing instructions and/or programs and/or for processing data, storage or memory units or means for storing instructions, programs and/or data, for serving as a work area of the processor or processing portion and the like (e.g. ROM, RAM, EEPROM, and the like), input or interface means for inputting data and instructions by software (e.g. floppy disc, CD-ROM, EEPROM, and the like), a user interface for providing monitor and manipulation possibilities to a user (e.g. a screen, a keyboard and the like), other interface or means for establishing links and/or connections under the control of the processor unit or portion (e.g. wired and wireless interface means, radio interface means including e.g. an antenna unit or the like, means for forming a radio communication part etc.) and the like, wherein respective means forming an interface, such as a radio communication part, can also be located on a remote site (e.g. a radio head or a radio station etc.). It is to be noted that in the present specification processing portions should not only be considered to represent physical portions of one or more processors, but may also be considered as a logical division of the referred processing tasks performed by one or more processors.
It should be appreciated that according to some examples, a so-called “liquid” or flexible network concept may be employed where the operations and functionalities of a network element, a network function, or of another entity of the network, may be performed in different entities or functions, such as in a node, host or server, in a flexible manner. In other words, a “division of labor” between involved network elements, functions or entities may vary case by case.
Referring now to
In particular, according to
According to various examples of embodiments, the first point of time may be equal to the second point of time.
Furthermore, according to at least some examples of embodiments, the method may further comprise the steps of adding to the display data a previous estimated pose of the first terminal endpoint device in the first three-dimensional environment obtained from the deep neural network model prior to the first estimated pose. In addition, the deep neural network model is further trained with previous output training poses of the training terminal endpoint device as model input.
Moreover, according to various examples of embodiments, the method may further comprise the steps of adding to the display data previous image data and previous sensory data. The previous image data is image data of a previously captured image of at least part of the first three-dimensional environment acquired by the first terminal endpoint device at a third point of time prior to the first point of time. The previous sensory data is sensory data indicative of at least a motion vector of a previous movement of the first terminal endpoint device in the three-dimensional environment acquired by the first terminal endpoint device at a fourth point of time prior to the second point of time. The deep neural network model is further trained with previous image data and previous sensory data as model input.
Optionally, according to various examples of embodiments, the third point of time is equal to the fourth point of time.
Further, according to at least some examples of embodiments, the sensory data may comprise at least data acquired from at least one of an accelerometer, a gyroscope, a magnetometer, and a fusion sensor.
Moreover, according to various examples of embodiments, the training terminal endpoint device may be a second terminal endpoint device.
Alternatively, according to various examples of embodiments, the training terminal endpoint device is a computer simulated terminal endpoint device and the three-dimensional training environment is a computer simulated three-dimensional training environment.
Optionally, according to at least some examples of embodiments, in case of the three-dimensional training environment being different from the first three-dimensional environment, the deep neural network model is used for terminal endpoint device pose estimation in the first three-dimensional environment through transfer learning of the first three-dimensional environment from the three-dimensional training environment.
Furthermore, according to at least some examples of embodiments, the method may further comprise the steps of projecting, based on the first estimated pose of the first terminal endpoint device in the first three-dimensional environment, three-dimensional virtual network information onto the captured image. In addition, the method comprises the steps of generating an augmented reality output image by overlaying the three-dimensional virtual network information with the captured image.
Furthermore, according to various examples of embodiments, the method may further comprise the steps of projecting the three-dimensional virtual network information onto the captured image further based on a three-dimensional virtual network information model for the first three-dimensional environment comprising the three-dimensional virtual network information, wherein a field of view generated for the three-dimensional virtual network information is configured to be the same as the field of view captured by the captured image.
Additionally, according to at least some examples of embodiments, the three-dimensional virtual network information model may be provided to an apparatus applying the method.
Alternatively, according to at least some examples of embodiments, the three-dimensional virtual network information model is learned by an apparatus applying the method from at least part of the display data using 3D environment reconstruction techniques.
Further alternatively, according to various examples of embodiments, the three-dimensional virtual network information model is learned by an apparatus applying the method through transfer learning from a pre-learned three-dimensional virtual network information model for a second three-dimensional environment different from the first three-dimensional environment.
Optionally, according to at least some examples of embodiments, the deep neural network model comprises the three-dimensional virtual network information model.
Moreover, according to various examples of embodiments, the three-dimensional virtual network information for the first three-dimensional environment may be obtained from measurements of network performance indicators of a radio network in the first three-dimensional environment.
Furthermore, according to at least some examples of embodiments, the three-dimensional virtual network information for the first three-dimensional environment may be computer simulated network performance indicators of a computer simulated radio network in the first three-dimensional environment.
Optionally, according to various examples of embodiments, the three-dimensional virtual network information is three-dimensional radio map information indicative of radio network performance.
Further, according to various examples of embodiments, the method may be configured to be applied by an apparatus configured to be integrated in the first terminal endpoint device, wherein the deep neural network model is maintained at the first terminal endpoint device, or the method may be configured to be applied by an apparatus configured to be integrated in a network communication element, wherein the deep neural network model is maintained at the network communication element.
Moreover, according to at least some examples of embodiments, the captured image is a two-dimensional image captured by a monocular camera, or a stereo image comprising depth information captured by a stereoscopic camera unit, or a thermal image captured by a thermographic camera.
The above-mentioned features, either alone or in combination, allow for camera relocalization for real-time AR-supported network service visualization. In this context, the above-mentioned features, either alone or in combination, specifically allow, due to the use of display data comprising at least image data and sensory data, more accurate and direct information about a camera's orientation and moving direction to be obtained as compared with prior art methods.
Referring now to
In particular, according to
According to various examples of embodiments, the first point of time may be equal to the second point of time.
Furthermore, according to at least some examples of embodiments, the displayed network information may comprise an augmented reality image generated by overlaying three-dimensional virtual network information with the captured image.
Moreover, according to various examples of embodiments, the three-dimensional virtual network information may be three-dimensional radio map information being configured for AR-supported network service.
Optionally, according to at least some examples of embodiments, the sensor unit is at least one of an accelerometer, a gyroscope, a magnetometer, and a fusion sensor.
Further, according to various examples of embodiments, the camera unit may comprise at least one of a monocular camera, a stereoscopic camera unit, and a thermographic camera. Further, the principles outlined in relation to at least some examples of embodiments are also applicable to ultrasonic sound images captured by a corresponding sound emitter/detector arrangement.
The above-mentioned features, either alone or in combination, allow for camera relocalization for real-time AR-supported network service visualization. In this context, the above-mentioned features, either alone or in combination, specifically allow, due to the provision of display data comprising at least image data and sensory data, more accurate and direct information about a camera's orientation and moving direction to be obtained as compared with prior art methods.
Referring now to
The apparatus 500 shown in
The processor or processing function 510 is configured to execute processing related to the above described method. In particular, the processor or processing circuitry or function 510 includes one or more of the following sub-portions. Sub-portion 511 is a processing portion which is usable as a portion for inputting display data. The portion 511 may be configured to perform processing according to S320 of
Referring now to
The apparatus 600 shown in
The processor or processing function 610 is configured to execute processing related to the above described method. In particular, the processor or processing circuitry or function 610 includes one or more of the following sub-portions. Sub-portion 611 is a processing portion which is usable as a portion for providing display data. The portion 611 may be configured to perform processing according to S420 of
In the following, further details and implementation examples according to examples of embodiments are described with reference to
One idea of the present specification regarding camera relocalization is to estimate the camera pose using a DNN with fused image data and other sensory data (such as IMU measurements) as the inputs of the DNN. Compared to [BGK+18], where IMU measurements are used in the design of the loss function while the single-image input remains unchanged as in [KC+17], the method according to the present specification directly adds sensory data to the DNN inputs for better utilization of the motion sensor data. Fusing image and other sensory data as combined inputs to the DNN also leads to a change of the DNN architecture, i.e., adding extra architectural features to the intermediate layers to read the sensory information.
The difference between the two approaches can be easily recognized by comparing the examples given in
Because the camera pose is highly temporally dependent, adding the previous state(s) of the camera pose as extra input(s) to capture the temporal dependency is further outlined in the specification herein below.
A first approach for adding such extra input(s) is shown in
Moreover, in case AR-supported network service is requested for a new environment, it is detailed below in the present specification how transfer learning can be used to exploit the knowledge learned from a selected pre-trained environment and to accelerate DNN model training for the new environment with limited data.
The estimated camera pose can then be used to project the virtual network information, such as a 3D radio map of network performance, onto the 2D image from the user device's (e.g. a terminal endpoint device's) perspective on the device's display in real-time, to realize the AR features. It is to be noted that, unlike the conventional definition of a 3D radio map indicating radio signal strength only, the 3D radio map is given a more general definition in the present specification: it can be a position-based map of any performance metric in radio networks, e.g., received signal strength, data throughput, latency, etc.
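For illustration, projecting the 3D radio map onto the 2D image with the estimated pose can follow the standard pinhole camera model, as in the following sketch; the intrinsic matrix K and the world-to-camera rotation R are assumed to be available, and all names are illustrative rather than prescribed by the disclosure.

```python
# Minimal sketch of projecting 3D radio-map points onto the captured 2D image
# using the estimated camera pose (pinhole model; names are illustrative).
import numpy as np

def project_radio_map(points_3d, values, position, R, K, image_shape):
    """points_3d: (N, 3) world coordinates of radio-map samples,
    values: (N,) performance metric (e.g. throughput) at each point,
    position: camera centre in world coordinates, R: world->camera rotation,
    K: 3x3 intrinsic matrix, image_shape: (height, width)."""
    h, w = image_shape
    cam = (R @ (points_3d - position).T).T   # world -> camera coordinates
    in_front = cam[:, 2] > 0.0               # keep points in front of the camera
    cam, vals = cam[in_front], values[in_front]

    pix = (K @ cam.T).T
    pix = pix[:, :2] / pix[:, 2:3]           # perspective division
    u, v = pix[:, 0], pix[:, 1]
    visible = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    # Pixel coordinates and metric values to be rendered as an AR overlay.
    return pix[visible].astype(int), vals[visible]
```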
To give an overall picture of how camera relocalization enables real-time AR-supported network service visualization, a process of AR-supported network service visualization is first described with reference to
The process consists of the following three phases: (1) Model training S910, (2) Real-time camera relocalization S920, and (3) Real-time augmentation of 3D radio map S930.
In the model training step S910, the server collects S911 image data and sensory data from the user device 990 and performs the following two tasks, which are illustrated in detail in
Returning to
The problem of camera relocalization can now be stated. The objective is to effectively estimate the camera pose by exploiting the collected image and motion-related sensory data. More specifically, according to examples of embodiments, given an image I(t) captured at time t in a given environment, with its corresponding selected sensory data s(t) collected at the same time t by the same device, it is an object to estimate the camera pose p(t) (the camera's position and orientation defining the view perspective of I(t)) with a pre-trained model f_w(I, s), where the index w denotes the parameters characterizing the function f. The model is derived from a training dataset D = {(I^(i), s^(i), p^(i))}_{i=1}^{K}, where a tuple (I, s) is the training input and p is the training output, and K denotes the number of training samples. In the following, first a solution to the above stated basic problem is described. Subsequently, the first solution is extended by adding the previous state of the camera pose p′ to the inputs of the estimation model f_w(I, s, p′) to predict the current state of the camera pose p.
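For concreteness, a typical choice of training objective for such a pose-regression model is a PoseNet-style weighted sum of position and orientation errors over the training dataset; the weighting factor β below is an assumption of this sketch, not something prescribed by the disclosure.

```latex
% PoseNet-style pose-regression objective over the training set
% (the weighting factor \beta is an assumption of this sketch):
\begin{equation}
  w^{*} = \arg\min_{w} \frac{1}{K} \sum_{i=1}^{K}
    \left\lVert \hat{u}^{(i)} - u^{(i)} \right\rVert_{2}
    + \beta \left\lVert \hat{q}^{(i)} - \frac{q^{(i)}}{\lVert q^{(i)} \rVert} \right\rVert_{2},
  \qquad
  \left[\hat{u}^{(i)}, \hat{q}^{(i)}\right] = f_{w}\!\left(I^{(i)}, s^{(i)}\right).
\end{equation}
```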
Before introducing each of the solutions according to examples of embodiments in detail, relevant definitions are introduced first for a better understanding. An image I(t) (e.g. an input image of input image data) can be an RGB image represented by I(t) ∈ ℝ^(w×h×3), a grey-scaled image represented by I(t) ∈ ℝ^(w×h), or an RGB-D image represented by I(t) ∈ ℝ^(w×h×4) (where the last dimension includes three colour channels and a depth channel). A camera pose p(t) = [u(t), o(t)] (also referring to a user equipment pose or a terminal endpoint device pose, in case the user equipment/terminal endpoint device is equipped with a camera (e.g. a camera unit) and/or is connected to a camera (e.g. a camera unit)) consists of the camera's position in 3D space u(t) = [x(t), y(t), z(t)] and its orientation o(t). The orientation can be represented by a quaternion o(t) = q(t) ∈ ℝ^4, or by a 3D vector indicating the camera direction o(t) = d(t) ∈ ℝ^3 in the world coordinate system. Sensory data s(t) can be selected from raw or post-processed data collected from the motion sensors embedded in the user device, such as accelerometers, gyroscopes, or magnetometers.
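Purely for illustration, a training sample (I, s, p) as defined above could be represented by a simple container such as the following; the class and field names are hypothetical.

```python
# Hypothetical container for one training sample (I, s, p) as defined above.
from dataclasses import dataclass
import numpy as np

@dataclass
class PoseSample:
    image: np.ndarray        # I(t): RGB (w, h, 3), grey (w, h) or RGB-D (w, h, 4)
    sensors: np.ndarray      # s(t): concatenated motion-sensor measurements
    position: np.ndarray     # u(t) = [x, y, z] in world coordinates
    orientation: np.ndarray  # o(t): unit quaternion (4,) or direction vector (3,)

    @property
    def pose(self) -> np.ndarray:
        """p(t) = [u(t), o(t)] used as the regression target."""
        return np.concatenate([self.position, self.orientation])
```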
The general idea of the present specification is to use a DNN to model the camera pose p as a function f_w(I, s) of the fused image data I and sensor measurements s, characterized by parameters w. Unlike the state-of-the-art deep learning approach where only the image is used as input [KC+17], the method disclosed herein adds sensory data to the DNN inputs, which leads to a major modification of the DNN architecture and information flow. The motivation for using sensory data as additional inputs is that, compared to images, the motion sensors provide more accurate and direct information about the camera's orientation and moving direction. The recent work MapNet [BGK+18] also proposes to introduce sensory data in a proposed architecture MapNet+. However, different from the solution disclosed herein, it utilizes the sensor measurements (e.g., IMU or GPS) to define additional terms of the loss function, while still using the single image as model input, as shown in
As the basis of the DNN proposed herein according to examples of embodiments, any of the convolutional neural network (CNN) architectures usually used for the task of image classification can be used, including LeNet, AlexNet, VGG, GoogLeNet (which includes inception modules), and ResNet (which enables residual learning). However, unlike the conventional CNN architectures, whose inputs are solely images, non-image features (a vector derived from the sensory data) are added as additional inputs.
The modification includes the following steps, as shown in
Construct the basic CNN architecture 1211 (fed with image data 1210) and stack the layers up to the flatten layer. Construct a fully connected network 1221 (fed with sensory data 1220), such as a multi-layer perceptron (MLP). Concatenate 1230 the outputs of the flatten layer of the CNN 1211 and of the MLP 1221. Add dense layers 1240 and connect them to the last layer (activation for regression 1250), which represents the predicted vector of the camera pose 1260.
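As an illustration of the construction steps above, the following is a minimal Keras sketch of the fused image + sensor DNN; the ResNet50 backbone, the layer sizes, the 9-dimensional sensor vector and the 7-dimensional pose output (position plus quaternion) are assumptions of the sketch, not requirements of the disclosed method.

```python
# Minimal sketch of the fused image + sensor DNN described above
# (Keras; the ResNet50 backbone and layer sizes are illustrative choices).
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_pose_net(image_shape=(224, 224, 3), sensor_dim=9, pose_dim=7):
    # 1) Basic CNN architecture fed with image data, stacked up to a flat vector.
    image_in = layers.Input(shape=image_shape, name="image")
    backbone = tf.keras.applications.ResNet50(
        include_top=False, weights=None, input_tensor=image_in, pooling="avg")
    image_features = backbone.output                     # flattened CNN features

    # 2) Fully connected network (MLP) fed with the sensory data.
    sensor_in = layers.Input(shape=(sensor_dim,), name="sensors")
    x = layers.Dense(64, activation="relu")(sensor_in)
    sensor_features = layers.Dense(64, activation="relu")(x)

    # 3) Concatenate the CNN features and the MLP output.
    fused = layers.Concatenate()([image_features, sensor_features])

    # 4) Dense layers and a linear regression head for the camera pose
    #    p = [x, y, z, q1, q2, q3, q4].
    x = layers.Dense(1024, activation="relu")(fused)
    pose_out = layers.Dense(pose_dim, activation="linear", name="pose")(x)

    return Model(inputs=[image_in, sensor_in], outputs=pose_out)

model = build_pose_net()
model.compile(optimizer="adam", loss="mse")
```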
As examples, according to examples of embodiments, of DNN models for camera pose estimation with fused image and sensor data, the modification of ResNet and GoogLeNet is shown in
Specifically,
Specifically,
Another question is which sensory data to use as the additional input of the DNN according to examples of embodiments. Optional features are measurements extracted from an accelerometer (a 3D vector which measures changes in acceleration along three axes), a gyroscope (a 3D vector which measures angular velocity relative to itself, i.e., it measures the rate of its own rotation around three axes), and a magnetometer (a 3D vector pointing to the strongest magnetic field, i.e., more or less pointing in the direction of North) [AD+19]. Fusion sensors can also be considered, e.g., the relative orientation sensor, which applies a Kalman filter or complementary filter to the measurements from the accelerometer and gyroscope [W3+19]. More variants of the features can be extracted or post-processed from the above-mentioned measurements, e.g., the quaternion can be derived from the fusion sensors. A subset of the features from the above-mentioned sensory data can also be selected.
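By way of illustration, a simple way to assemble such a sensory feature vector s(t) from the listed measurements could look as follows; the function and argument names are placeholders, not part of any particular sensor API.

```python
# Illustrative assembly of the sensory feature vector s(t) from the
# measurements discussed above (all names are placeholders).
import numpy as np

def build_sensor_vector(accel, gyro, magnet, fusion_quaternion=None):
    """accel, gyro, magnet: 3D vectors from the motion sensors;
    fusion_quaternion: optional 4D orientation from a fusion sensor."""
    features = [np.asarray(accel), np.asarray(gyro), np.asarray(magnet)]
    if fusion_quaternion is not None:
        features.append(np.asarray(fusion_quaternion))
    return np.concatenate(features)   # e.g. a 9- or 13-dimensional s(t)
```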
The remaining challenge is to collect a valid dataset for training the model according to examples of embodiments. To construct a valid dataset, three types of measurements are needed: images, their corresponding sensor measurements (corresponding in the sense that the measurements are taken at the same time), and the camera pose as the ground truth. The image data and sensor measurements as training input are easy to obtain, for example, through existing Android APIs. However, the ground truth of the corresponding camera pose as training output is not easy to derive directly. One option is to use existing mapping and tracking algorithms such as SLAM, or 3D reconstruction tools such as KinectFusion, to return the estimated camera pose corresponding to the captured image. The motion sensor data (e.g., the camera orientation derived from the gyroscope and accelerometers, and the velocity estimated by the accelerometers) can also be used to improve the camera pose derived by SLAM algorithms or KinectFusion.
In the following, detection of information flow is further described.
Considering real-time communication between the cloud server and the user device, an example, according to examples of embodiments, of the information flow for real-time camera pose estimation and 3D radio map augmentation is shown in
The information flow is detectable. The state-of-the-art methods [KC+17] [BGK+18] only request a single image sent by the user device for pose estimation, while the method disclosed herein according to examples of embodiments requests both the image and the corresponding sensory data. Note that although MapNet+ proposed in [BGK+18] also requires sensory data in the model training phase, because it only uses the sensory data for improving the loss function (see
Considering local computation in the user device, an alternative to the real-time communication between cloud server and user device is to allow the user device to download the DNN model and/or the 3D radio map model from the cloud server and to run the camera pose estimation and radio map augmentation locally. In this case, the models downloaded to the device are easy to detect.
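As a purely illustrative example of the cloud-based variant of this information flow, the device could upload the image together with the corresponding sensory data and receive the estimated pose in return; the endpoint, field names and encoding below are hypothetical and not defined by the present disclosure.

```python
# Hypothetical client-side request illustrating the information flow:
# the device uploads both the image and the corresponding sensory data
# and receives the estimated pose (endpoint and field names are examples).
import base64
import requests

def request_pose(server_url, jpeg_bytes, sensor_vector, capture_time):
    payload = {
        "image": base64.b64encode(jpeg_bytes).decode("ascii"),
        "sensors": list(map(float, sensor_vector)),   # s(t) as plain numbers
        "timestamp": capture_time,                    # capture time t
    }
    response = requests.post(f"{server_url}/relocalize", json=payload, timeout=1.0)
    response.raise_for_status()
    result = response.json()
    # Expected reply: estimated pose p(t) = position + orientation.
    return result["position"], result["orientation"]
```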
In the following, camera relocalization using DNN capturing temporal dependency according to examples of embodiments is further described.
As already mentioned above, for enhancing the performance of the DNN proposed above according to examples of embodiments, temporal dependency is to be incorporated in the DNN model. The motivation is rooted in the strong correlation between the previous state(s) and the current state of the camera pose. Moreover, since the raw data of the motion sensors usually measures the relative motion from the previous state (e.g., the relative rotation of the sensor frame, angular acceleration, and linear acceleration in world/inertial coordinates), incorporating information of previous state(s) can capture the correlation over time and space.
One option is to add the previously estimated camera pose to the sensory data input, i.e., to use [I(t), s(t), p̂(t−1)] as the input vector of the DNN, where p̂(t−1) is the estimated pose from the previous time slot. More previous states can also be added to the DNN input, [I(t), s(t), p̂(t−N), p̂(t−N+1), . . . , p̂(t−1)], to capture the correlation between the current state and multiple previous states. The DNN architecture remains similar to the DNN architecture illustrated in
A more complex model is inspired by the concept of recurrent neural networks (RNNs), which take both the output of the network from the previous time step as input and use the internal state from the previous time step as a starting point for the current time step. Such networks work well on sequence data. Taking advantage of the temporal nature of the image sequence and the motion sensor sequence, frame-wise camera pose estimation using a variant of an RNN can be provided. In
Specifically, for the RNN 1690 according to
Specifically, in
Other variants of RNN cells can also be considered, e.g., the long short-term memory (LSTM) cell, which allows long time lags to be bridged and is not limited to a fixed finite number of states.
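A minimal sketch of such a temporal variant is given below, assuming that the fused per-frame features of the earlier single-frame sketch (CNN image features concatenated with the sensor MLP output) are fed, together with the previously estimated poses, through an LSTM layer; the sequence length, feature dimension and layer sizes are illustrative values only.

```python
# Minimal sketch of the temporal variant: the fused per-frame features
# (image CNN + sensor MLP), concatenated with the previously estimated
# poses, are passed through an LSTM that keeps state across frames.
from tensorflow.keras import layers, Model

def build_temporal_pose_net(seq_len=8, feature_dim=2112, prev_pose_dim=7, pose_dim=7):
    # Per-frame fused features, e.g. taken from the concatenation layer
    # of the single-frame model sketched earlier (2048 + 64 = 2112 here).
    fused_seq = layers.Input(shape=(seq_len, feature_dim), name="fused_features")
    prev_pose_seq = layers.Input(shape=(seq_len, prev_pose_dim), name="previous_poses")

    x = layers.Concatenate()([fused_seq, prev_pose_seq])
    x = layers.LSTM(256, return_sequences=True)(x)       # temporal dependency
    pose_seq = layers.TimeDistributed(
        layers.Dense(pose_dim, activation="linear"), name="pose_sequence")(x)

    return Model(inputs=[fused_seq, prev_pose_seq], outputs=pose_seq)
```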
The information flow is similar to the one outlined above, except that some buffer memory may be needed in the server or the device (depending on whether the computation of the camera pose estimation is executed in the cloud or in the local user device) to store the temporary data of the image sequence, the sensor measurements, and the estimated camera poses of previous states as model inputs.
In the following, fast adaptation to new environment using transfer learning according to examples of embodiments is further described.
This idea applies to the scenario where a user enters a new environment, and the new environment is similar to a pre-learned environment with a pre-trained model. Instead of training a new model from scratch, transfer learning can be used to exploit the pre-learned knowledge and accelerate model training for the new environment.
To exploit the knowledge obtained from a previously trained environment and to transfer it to a new environment, part of the pre-trained parameters and hyperparameters can be transferred, e.g., those characterizing the lower layers of the DNN, as shown in
The operations of fine-tuning and updating can be achieved with a standard transfer learning approach [PQ+10].
Another useful scenario for transfer learning is that, in case of a lack of real data, synthetic (e.g. computer simulated) data generated from an emulated environment and emulated radio networks can be collected and used to pre-train a model for camera pose estimation first. Then, the pre-trained model can be fine-tuned using the measurements in the real environment.
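The following sketch illustrates one common way to realize such fine-tuning in Keras: the lower layers of a pre-trained pose model are frozen and only the upper layers are retrained on the limited data from the new (or real) environment. The number of trainable layers, learning rate, epochs and batch size are arbitrary example values, not parameters prescribed by the disclosure.

```python
# Minimal sketch of adapting a pre-trained pose model to a new environment:
# lower (generic feature) layers are frozen and only the upper layers are
# fine-tuned on the limited data from the new environment.
import tensorflow as tf

def fine_tune_for_new_environment(pretrained_model, new_data, num_trainable_layers=4):
    # Freeze everything except the last few layers.
    for layer in pretrained_model.layers[:-num_trainable_layers]:
        layer.trainable = False

    pretrained_model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # small LR for fine-tuning
        loss="mse")
    (images, sensors), poses = new_data
    pretrained_model.fit([images, sensors], poses, epochs=10, batch_size=32)
    return pretrained_model
```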
The steps of adapting the DNN model to a new environment are described according to examples of embodiments as follows, wherein
The cloud server 1900 stores a collection of data and pre-trained models for various environments 1901-1 to 1901-k. In particular, for each environment, at least a pre-trained model for camera pose estimation, a pre-learned 3D radio map model, and a set of images describing the environment are stored in the database.
When the user device 1990 enters a new environment and requests S1911 AR-supported network service, the server 1900 sends S1912 an acknowledgement and asks for some images to compare the new environment with the existing environments in the database.
The user device 1990 sends S1913 a set of images of the current environment to the server 1900. The server 1900 compares S1920 them with the images 1921 describing the existing environments in the database and selects the one which is most similar to the new environment.
Based on the similarity between the new environment and the selected matching environment, the server 1900 requests S1922 a different amount of training data from the user device 1990.
The user device 1990 sends S1930 the required amount of data (including both image data and sensory data) to the server 1900. Using transfer learning, the server 1900 retrains/fine-tunes S1940 the pre-trained model of the selected environment 1941 with the data collected from the new environment.
The obtained models for the new environment and the corresponding set of images to describe this environment are then stored S1950 in the database.
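One possible (hypothetical) realization of the environment-matching step S1920 is to compare embeddings of the uploaded images with stored reference embeddings of the known environments, e.g., using a pre-trained CNN and cosine similarity, as sketched below; the disclosure does not prescribe a particular similarity measure, and all names are illustrative.

```python
# One possible realization of the environment-matching step S1920:
# compare embeddings of the uploaded images with stored reference embeddings
# and pick the most similar pre-learned environment (purely illustrative).
import numpy as np
import tensorflow as tf

_encoder = tf.keras.applications.ResNet50(include_top=False, weights="imagenet", pooling="avg")

def embed(images):
    """images: float array (N, 224, 224, 3), preprocessed for ResNet50."""
    x = tf.keras.applications.resnet50.preprocess_input(images.copy())
    return _encoder.predict(x, verbose=0).mean(axis=0)   # mean embedding of the image set

def select_matching_environment(query_images, stored_env_embeddings):
    """stored_env_embeddings: dict mapping environment id -> mean embedding."""
    q = embed(query_images)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    scores = {env_id: cosine(q, e) for env_id, e in stored_env_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]   # the similarity score can drive how much new data to request
```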
The real-time camera relocalization and augmentation of 3D radio map follow the same process illustrated in
It should be appreciated that
Although the present disclosure has been described herein before with reference to particular embodiments thereof, the present disclosure is not limited thereto and various modifications can be made thereto.