The present disclosure generally relates to human-robot interactive systems, and in particular to a system and associated method for an environment-aware predictive modeling framework for a prosthetic or orthotic joint.
Robotic prostheses and orthotics have the potential to change the lives of millions of lower-limb amputees or non-amputees with mobility-related problems by providing critical support during legged locomotion. Powered prostheses and orthotics enable complex capabilities such as level-ground walking, running, and stair climbing, while also enabling reductions in metabolic cost and improvements in ergonomic comfort. However, most existing devices are tuned toward and heavily focus on unobstructed level-ground walking, to the detriment of other gait modes, especially those required in dynamic environments. Limitations to the range and adaptivity of gaits have negatively impacted the ability of amputees to navigate dynamic landscapes. Yet, the primary cause of falls is inadequate foot clearance during obstacle traversal. In many cases, only millimeters decide whether a gait will be safe or whether it will lead to dangerous contact with the environment. In light of this observation, control solutions are needed to facilitate safe and healthy locomotion over common and frequent barriers such as curbs or stairs. A notable challenge for intelligent prosthetics to overcome is therefore the ability to sense and act upon important features in the environment.
Prior work in the field has centered on identifying discrete terrain classes, such as slopes, stairs, and uneven terrain, based on kinematics. Vision systems in the form of depth sensors have recently been utilized in several vision-assisted exoskeleton robots. However, depth sensors with sufficient accuracy at close range, e.g., LiDAR, are not portable and are often prohibitively expensive. There is a current lack of solutions that provide both high-fidelity depth sensing and portability for use in environment-aware prosthetics.
It is with these observations in mind, among others, that various aspects of the present disclosure were conceived and developed.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
Corresponding reference characters indicate corresponding elements among the views of the drawings. The headings used in the figures do not limit the scope of the claims.
Various embodiments of an environment-aware prediction and control system and associated framework for human-robot symbiotic walking are disclosed herein. The system takes a single, monocular RGB image from a leg-mounted camera to generate important visual features of the current surroundings, including the depth of objects and the location of the foot. In turn, the system includes a data-driven controller that uses these features to generate adaptive and responsive actuation signals. The system employs a data-driven technique to extract critical perceptual information from low-cost sensors, including a simple RGB camera and inertial measurement units (IMUs). To this end, a new, multimodal data set was collected for walking with the system on variable ground across 57 varied scenarios, e.g., roadways, curbs, and gravel. In turn, the data set can be used to train modules for environmental awareness and robot control. Once trained, the system can process incoming images and generate depth estimates and segmentations of the foot. Together with kinematic sensor modalities from the prosthesis, these visual features are then used to generate predictive control actions. To this end, the system builds upon ensemble Bayesian interaction primitives (enBIP), which have previously been used for accurate prediction in human biomechanics and locomotion. However, going beyond prior work on Interaction Primitives, the present system incorporates the perceptual features directly into a probabilistic model formulation to learn a state of the environment and generate predictive control signals. As a result of this data-driven training scheme, the prosthesis automatically adapts to variations in the ground for mobility-related actions such as lifting a leg to step up a small curb.
Referring to
The computing device 120 receives image and/or video data from the camera 110, which captures information about various environmental surroundings and enables the computing device 120 to make informed decisions about the control signals applied to the prosthetic or orthotic joint 130. The computing device 120 includes a processor in communication with a memory, the memory including instructions that enable the processor to implement a framework 200 that receives the image and/or video data from the camera 110 as the wearer uses the prosthetic or orthotic joint 130, extracts a set of depth features from the image and/or video data that indicate perceived spatial depth information of a surrounding environment, and determines a control signal to be applied to the prosthetic or orthotic joint 130 based on the perceived depth features. Following determination of the control signal by the framework 200 implemented at the processor, the computing device 120 applies the control signal to the prosthetic or orthotic joint 130. The present disclosure investigates the efficacy of the system 100 by evaluating how well the prosthetic or orthotic joint 130 performs on tasks such as stepping onto stairs or curbs aided by the framework 200 of the computing device 120. In some embodiments, the camera 110 can be leg-mounted or mounted in a suitable location that enables the camera 110 to capture images of an environment that is in front of the prosthetic or orthotic joint 130. To achieve environmental awareness in human-robot symbiotic walking of the system 100, the following steps are performed: (a) visual and kinematic data are collected from an able-bodied subject, (b) the data set is augmented with segmented depth features from a trained depth estimation deep neural network, and (c) a probabilistic model is trained to synthesize control signals to be applied to the prosthetic or orthotic joint 130 given perceived depth features.
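As a non-limiting illustration of the flow described above, the following Python sketch shows the high-level control loop of the system 100; every function used here (capture_image, read_imu, extract_depth_features, predict_control, apply_control) is a hypothetical placeholder for the corresponding component rather than the disclosed implementation.

```python
# Hypothetical sketch of the control loop described above: capture an image,
# extract depth features, fuse them with kinematic measurements, and apply the
# resulting control signal to the joint. All functions below are placeholders.
import numpy as np

def capture_image():
    return np.random.rand(90, 160, 3)   # stand-in for an RGB frame from the camera 110

def read_imu():
    return np.random.rand(6)            # stand-in for accelerometer/gyroscope readings

def extract_depth_features(image):
    return np.random.rand(90, 160)      # stand-in for the depth and segmentation module 210

def predict_control(depth_features, kinematics):
    # Stand-in for the control output module 220 (enBIP-based prediction).
    return float(depth_features.mean() + kinematics.mean())

def apply_control(signal):
    print(f"commanded ankle angle: {signal:.3f}")

for _ in range(3):                      # control loop executed by the computing device 120
    image = capture_image()
    depth = extract_depth_features(image)
    apply_control(predict_control(depth, read_imu()))
```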
The framework 200 implemented at the computing device 120 of system 100 is depicted in
Referring to
Referring to
Network Architecture: The depth estimation network 212, shown in
The depth feature extraction process starts with an input image $I_t$ captured by the camera 110, where $I_t \in \mathbb{R}^{H \times W \times 3}$ and where the image $I_t$ includes RGB data for each pixel therein. The input image $I_t$ is provided to the depth estimation network 212. The encoder network 214 of the depth estimation network 212 receives the input image $I_t$ first. As shown, in one embodiment, the encoder network 214 includes five stages. Following a typical AE network architecture, each stage of the encoder network 214 narrows the size of the representation from 2048 neurons down to 64 neurons at a final convolutional bottleneck layer.
The decoder network 216 increases the network size at each layer after the final convolutional bottleneck layer of the encoder network 214 in a pattern symmetrical with that of the encoder network 214. While the first two stages of the decoder network 216 are transpose residual blocks with 3×3 kernels, the third and fourth stages of the decoder network 216 are convolutional projection layers with two 1×1 kernels each. Although ReLU activation functions connect each stage of the decoder network 216, additional sigmoid activation function outputs facilitate disparity estimation in the decoder network 216 by computing the loss for each output of varied resolution. The output of the depth estimation network 212 is a combination of a final depth feature estimate $\hat{f} \in \mathbb{R}^{H \times W}$ for the input image $I_t$, as well as intermediate feature map outputs $\hat{f}_n \in \mathbb{R}^{H_n \times W_n}$ produced at the intermediate resolutions of the decoder network 216.
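For illustration only, a minimal encoder-decoder sketch with multi-resolution sigmoid outputs, in the spirit of the architecture described above, is provided below; the channel widths, block choices, and the 96×160 input size are assumptions made so the example runs cleanly and do not reproduce the exact depth estimation network 212.

```python
# Illustrative sketch only: a five-stage convolutional encoder paired with a
# symmetric decoder that emits a sigmoid depth map at several resolutions.
# Layer widths and block choices are assumptions, not the disclosed network.
import torch
import torch.nn as nn

class DepthEstimationSketch(nn.Module):
    def __init__(self):
        super().__init__()
        widths = [3, 32, 64, 128, 256, 64]  # five encoder stages down to a 64-channel bottleneck
        self.encoder = nn.ModuleList([
            nn.Sequential(nn.Conv2d(widths[i], widths[i + 1], 3, stride=2, padding=1), nn.ReLU())
            for i in range(5)
        ])
        # First two decoder stages: transpose convolutions with 3x3 kernels.
        self.up1 = nn.Sequential(nn.ConvTranspose2d(64, 256, 3, stride=2, padding=1, output_padding=1), nn.ReLU())
        self.up2 = nn.Sequential(nn.ConvTranspose2d(256, 128, 3, stride=2, padding=1, output_padding=1), nn.ReLU())
        # Later decoder stages: upsampling followed by 1x1 convolutional projections.
        self.up3 = nn.Sequential(nn.Upsample(scale_factor=2), nn.Conv2d(128, 64, 1), nn.ReLU(), nn.Conv2d(64, 64, 1), nn.ReLU())
        self.up4 = nn.Sequential(nn.Upsample(scale_factor=2), nn.Conv2d(64, 32, 1), nn.ReLU(), nn.Conv2d(32, 32, 1), nn.ReLU())
        self.up5 = nn.Sequential(nn.Upsample(scale_factor=2), nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        # One sigmoid "disparity" head per decoder resolution plus the final head.
        self.heads = nn.ModuleList([nn.Conv2d(c, 1, 3, padding=1) for c in [256, 128, 64, 32, 16]])

    def forward(self, image):
        x = image
        for stage in self.encoder:
            x = stage(x)
        outputs = []
        for stage, head in zip([self.up1, self.up2, self.up3, self.up4, self.up5], self.heads):
            x = stage(x)
            outputs.append(torch.sigmoid(head(x)))  # intermediate and final depth estimates
        return outputs[-1], outputs[:-1]            # final map f_hat and intermediate maps f_hat_n

final_map, intermediate_maps = DepthEstimationSketch()(torch.rand(1, 3, 96, 160))
```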
Loss Function: In order to use the full combination of final and intermediate outputs of the decoder network 216 in a loss function of the depth estimation network 212 during training, it is necessary to first define a loss $\mathcal{L}$ as a summation of a depth image reconstruction loss $\mathcal{L}_R$ over each of the predictions at the various resolutions (e.g., the final feature map output from a final output layer of the decoder network 216 in addition to the intermediate feature map outputs). The loss $\mathcal{L}$ can be described as:

$$\mathcal{L} = \sum_{n} w'_n \, \mathcal{L}_R\big(\hat{D}_n, D_n\big)$$
where estimated depth values $\hat{D}_n$ at each stage of the decoder network 216 are compared to a ground truth vector $D_n$, downsampled for each corresponding feature map resolution using average pooling operations. The depth estimation network 212 uses a loss weight vector to adjust for each feature size with associated elements of $w' = [1/64, 1/32, 1/16, 1/8, 1/4, 1/2]$. The depth estimation network 212 uses a reconstruction loss $\mathcal{L}_R$ for depth image reconstruction that includes four loss elements, including: (1) a mean squared error measure $\mathcal{L}_m$, (2) a structural similarity measure $\mathcal{L}_s$, (3) an inter-image gradient measure $\mathcal{L}_g$, and (4) a total variation measure $\mathcal{L}_{tv}$:

$$\mathcal{L}_R = \alpha_1 \mathcal{L}_m + \alpha_2 \mathcal{L}_s + \alpha_3 \mathcal{L}_g + \alpha_4 \mathcal{L}_{tv}$$
where $\hat{d}$ is an iterated depth feature output $\hat{d} \in \hat{D}$ and $d$ is a ground truth feature $d \in D$, and hyperparameters $\alpha_1 = 10^2$, $\alpha_2 = 1$, $\alpha_3 = 1$, and $\alpha_4 = 10^{-7}$ influence the importance of each respective loss element on $\mathcal{L}_R$. The mean squared error measure can be represented as:

$$\mathcal{L}_m = \frac{1}{n} \sum_{i=1}^{n} \big(\hat{d}_i - d_i\big)^2$$
A structural similarity index measure (SSIM) is adopted since it can be used to avoid distortions by capturing a covariance alongside an average of a ground truth feature map and a predicted depth feature map. An SSIM loss $\mathcal{L}_s$ can be represented as:

$$\mathcal{L}_s = 1 - \mathrm{SSIM}\big(\hat{d}, d\big) = 1 - \frac{\big(2\mu_{\hat{d}}\,\mu_{d} + c_1\big)\big(2\sigma_{\hat{d}d} + c_2\big)}{\big(\mu_{\hat{d}}^2 + \mu_{d}^2 + c_1\big)\big(\sigma_{\hat{d}}^2 + \sigma_{d}^2 + c_2\big)}$$

where $\mu$ and $\sigma^2$ denote the mean and variance of the respective feature maps, $\sigma_{\hat{d}d}$ is their covariance, and $c_1$ and $c_2$ are small stabilizing constants.
Losses based on inter-image gradients primarily ensure illumination invariance such that bright lights or shadows cast across the image do not affect the end depth prediction. The inter-image gradient measure $\mathcal{L}_g$ can be implemented as:

$$\mathcal{L}_g = \frac{1}{n} \sum_{i=1}^{n} \Big( \big\lVert \nabla_x \hat{d}_i - \nabla_x d_i \big\rVert + \big\lVert \nabla_y \hat{d}_i - \nabla_y d_i \big\rVert \Big)$$
where $\nabla$ denotes a gradient calculation, and $\lVert \cdot \rVert$ is the absolute value. Since $\mathcal{L}_g$ is computed pixelwise both horizontally and vertically, the number of pixels of the output image is denoted as $n$. In order to add an additional loss to facilitate image deblurring and denoising, the total variation measure $\mathcal{L}_{tv}$ passes the gradient strictly over the output depth feature maps to minimize noise and round terrain features:

$$\mathcal{L}_{tv} = \frac{1}{n} \sum_{i=1}^{n} \Big( \big\lVert \nabla_x \hat{d}_i \big\rVert + \big\lVert \nabla_y \hat{d}_i \big\rVert \Big)$$
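A minimal sketch of a four-part reconstruction loss of the kind described above is given below; the hyperparameter values follow the text, while the windowless SSIM term and finite-difference gradients are simplifying assumptions.

```python
# Illustrative sketch of a four-part depth reconstruction loss (MSE, SSIM-style,
# inter-image gradient, and total variation terms). Weightings follow the
# hyperparameters given in the text; implementation details are assumptions.
import torch
import torch.nn.functional as F

def reconstruction_loss(d_hat, d, a1=1e2, a2=1.0, a3=1.0, a4=1e-7):
    # Mean squared error between predicted and ground-truth depth maps.
    l_m = F.mse_loss(d_hat, d)

    # SSIM-style term from global means, variances, and covariance (no sliding window here).
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    mu_p, mu_g = d_hat.mean(), d.mean()
    var_p, var_g = d_hat.var(), d.var()
    cov = ((d_hat - mu_p) * (d - mu_g)).mean()
    ssim = ((2 * mu_p * mu_g + c1) * (2 * cov + c2)) / ((mu_p**2 + mu_g**2 + c1) * (var_p + var_g + c2))
    l_s = 1.0 - ssim

    # Inter-image gradient term: match horizontal and vertical finite differences.
    def grads(x):
        return x[..., :, 1:] - x[..., :, :-1], x[..., 1:, :] - x[..., :-1, :]
    gx_p, gy_p = grads(d_hat)
    gx_g, gy_g = grads(d)
    l_g = (gx_p - gx_g).abs().mean() + (gy_p - gy_g).abs().mean()

    # Total variation term over the prediction only, to suppress noise.
    l_tv = gx_p.abs().mean() + gy_p.abs().mean()

    return a1 * l_m + a2 * l_s + a3 * l_g + a4 * l_tv

loss = reconstruction_loss(torch.rand(1, 1, 90, 160), torch.rand(1, 1, 90, 160))
```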
Temporal Consistency: Prediction consistency over time is a critical necessity for stable and accurate control of a robotic prosthesis. Temporal consistency of the depth predictions provided by framework 200 is achieved during training via the four loss functions, $\mathcal{L}_m$, $\mathcal{L}_s$, $\mathcal{L}_g$, and $\mathcal{L}_{tv}$, by fine-tuning the depth estimation network 212. In particular, the framework 200 fine-tunes the depth estimation network 212 through application of a temporal consistency training methodology to the resultant depth feature output $\hat{d} \in \hat{D}$, which includes employing binary masks to outline one or more regions within an image which require higher accuracy, and further includes applying a disparity loss $\mathcal{L}_{dis}$ between two consecutive frames including a first frame taken at time $t-1$ and a second frame taken at time $t$ (e.g., images within video data captured by camera 110). An overlapping mask of the two frames can be defined as $M$ and can be set equal to the size of a ground truth feature $d$. As such, the disparity loss $\mathcal{L}_{dis}$ is formulated as:

$$\mathcal{L}_{dis} = \frac{1}{n} \sum_{i=1}^{n} \big\lVert \hat{d}_{t,i}^{M} - \hat{d}_{t-1,i}^{M} \big\rVert$$
where $\hat{d}_t^M$ indicates a predicted frame at time $t$ with a binary mask applied: $\hat{d}_t^M = \hat{d}_t \cdot M$. Additionally, to sustain the precision of prediction over time, the framework 200 applies $\mathcal{L}_R$ on each frame individually. Each loss element of the fine-tuning process is weighted by corresponding hyperparameters, $\beta_1 = 0.7$, $\beta_2 = 0.3$, and $\gamma = 10$. The fine-tuning step, therefore, makes the predicted frames more similar to one another in static regions while maintaining the reconstruction accuracy from prior network training.
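The masked disparity term and its combination with the per-frame reconstruction loss may be sketched as follows; the way the β and γ weights combine the terms is an assumption for illustration, not the disclosed fine-tuning objective.

```python
# Illustrative sketch of the masked disparity (temporal-consistency) term used
# during fine-tuning: consecutive predictions are compared only inside a binary
# mask M. The mixing of terms via beta1, beta2, and gamma is an assumption.
import torch
import torch.nn.functional as F

def disparity_loss(d_hat_t, d_hat_prev, mask):
    # Mean absolute difference between consecutive masked depth predictions.
    return ((d_hat_t * mask) - (d_hat_prev * mask)).abs().mean()

def finetune_loss(d_hat_t, d_hat_prev, d_t, d_prev, mask,
                  recon, beta1=0.7, beta2=0.3, gamma=10.0):
    # Hypothetical objective: per-frame reconstruction plus the temporal term.
    return (beta1 * recon(d_hat_t, d_t)
            + beta2 * recon(d_hat_prev, d_prev)
            + gamma * disparity_loss(d_hat_t, d_hat_prev, mask))

d1, d0 = torch.rand(1, 1, 90, 160), torch.rand(1, 1, 90, 160)
value = finetune_loss(d1, d0, torch.rand_like(d1), torch.rand_like(d0),
                      (torch.rand_like(d1) > 0.5).float(), F.mse_loss)
```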
Given the extracted environmental information provided by the depth and segmentation module 210, including depth features $\hat{D} = [\hat{f}_5, \hat{f}_4, \hat{f}_3, \hat{f}_2, \hat{f}_1, \hat{f}]$ for each pixel within images captured by the camera 110, the control output module 220 uses enBIP to generate appropriate responses for the prosthetic or orthotic joint 130. As a data-driven method, enBIP uses example demonstrations of interactions between multiple agents to generate a behavior model that represents an observed system of human kinematics with respect to the prosthetic or orthotic joint 130. enBIP was selected as a modeling formulation for this purpose because enBIP enables inference of future observable human-robot states as well as non-observable human-robot states. Additionally, enBIP supplies uncertainty estimates which can allow a controller such as computing device 120 to validate predicted control actions, possibly adding modifications if the model is highly unsure. Lastly, enBIP provides robustness against sensor noise as well as real-time inference capabilities in complex human-robot interactive control tasks.
In one aspect, assisted locomotion with a prosthetic or orthotic is cast as a close interaction between the human kinematics, environmental features, and robotic prosthetic. The control output module 220 incorporates environmental information in the form of predicted depth features along with sensed kinematic information from an inertial measurement unit (IMU) 160 (D
Latent Space Formulation: Generating an accurate model from an example demonstration matrix would be difficult due to high internal dimensionality, especially with no guarantee of temporal consistency between demonstrations. One main goal of a latent space formulation determined by the control output module 220 is therefore to reduce modeling dimensionality by projecting training demonstrations (e.g., recorded demonstrations of human-prosthesis behavior) into a latent space that encompasses both spatial and temporal features. Notably, this process must be done in a way that allows for estimation of future state distributions for both observed and unobserved variables $Y_{t+1:T}$ with only a partial observation of the state space and the example demonstration matrices $Y^1_{1:T_1}, \ldots, Y^N_{1:T_N}$:

$$p\big(Y_{t+1:T} \mid Y_{1:t}, Y^1_{1:T_1}, \ldots, Y^N_{1:T_N}\big) \qquad (8)$$
Basis function decomposition sidesteps the significant modeling challenges of requiring a generative model over all variables and a nonlinear transition function. Basis function decomposition enables the control output module 220 to approximate each trajectory as a linear combination of $B^d$ basis functions in the form of: $Y_t^d = \Phi_{\phi(t)}^{\mathsf{T}} w^d + \epsilon_y$. Each basis function vector $\Phi_{\phi(t)} \in \mathbb{R}^{B^d}$ depends on a relative phase $\phi(t)$, where $0 \leq \phi(t) \leq 1$.
Although the time-invariant latent space formulation facilitates estimation of entire trajectories, filtering over both spatial and temporal features proved more robust and accurate. As such, the control output module 220 incorporates phase, phase velocity, and weight vectors, $w = [\phi, \dot{\phi}, w] \in \mathbb{R}^{2+B}$ where $B = \sum_{d}^{D} B^d$, into the state representation. By assuming that a training demonstration advances linearly in time, the phase velocity is estimated with $\dot{\phi} = 1/T_n$. Substituting the weight vector into Equation (8) and applying Bayes' rule yields:

$$p\big(w_t \mid Y_{1:t}, w_0\big) \propto p\big(y_t \mid w_t\big)\, p\big(w_t \mid Y_{1:t-1}, w_0\big) \qquad (9)$$

since the time-invariant weight vector $p(w_t \mid Y_{1:t}, w_0)$ models the entire trajectory $Y$.
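As an illustration of the basis function decomposition, the following sketch fits Gaussian basis functions of the phase to a demonstrated trajectory by least squares; the basis count, widths, and the synthetic trajectory are assumptions.

```python
# Illustrative sketch: approximate a demonstrated trajectory as a linear
# combination of Gaussian basis functions of the phase (0 <= phi <= 1),
# fitted by least squares. Basis count and widths are assumptions.
import numpy as np

def basis_matrix(phase, num_basis=8, width=0.05):
    centers = np.linspace(0.0, 1.0, num_basis)
    # Each row holds the basis activations Phi_phi(t) for one phase value.
    return np.exp(-(phase[:, None] - centers[None, :]) ** 2 / (2.0 * width))

def fit_weights(trajectory):
    # Linear phase assumption: the demonstration advances uniformly in time.
    phase = np.linspace(0.0, 1.0, len(trajectory))
    Phi = basis_matrix(phase)
    w, *_ = np.linalg.lstsq(Phi, trajectory, rcond=None)
    return w  # latent weight vector w^d for this trajectory dimension

# Example: encode a noisy ankle-angle-like trajectory and reconstruct it.
t = np.linspace(0.0, 1.0, 200)
demo = np.sin(2 * np.pi * t) + 0.01 * np.random.randn(200)
w = fit_weights(demo)
reconstruction = basis_matrix(np.linspace(0.0, 1.0, 200)) @ w
```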
Inference: In order to accommodate a variety of control modifications based on observed or predicted environmental features, the control output module 220 leverages ensemble Bayesian estimation from enBIP to produce approximate inferences of the posterior distribution according to Equation (9), which includes human kinematics and environmental features, assuming that higher-order statistical moments between states are negligible and that the Markov property holds. Algorithmically, enBIP first generates an ensemble of latent observation models, taken randomly from the demonstration set. As the subject walks with the prosthetic or orthotic joint 130, the control output module 220 propagates the ensemble forward one step with a state transition function. Then, as new sensor and depth observations periodically become available from the camera 110 and the depth and segmentation module 210, the control output module 220 performs a measurement update step across the entire ensemble. From the updated ensemble, the control output module 220 calculates the mean and variance of each latent component, and subsequently projects the mean and variance into a trajectory space by applying the linear combination of basis functions to the weight vectors.
The control output module 220 uniformly samples the initial ensemble of $E$ members $X = [x_1, \ldots, x_E]$ from the observed demonstrations, $x_0^j = [0, \dot{\phi}_i, w_i]$, $1 \leq j \leq E$ with $i \sim \mathcal{U}(1, N)$ and $E \leq N$. Inference through Bayesian estimation begins as the control output module 220 iteratively propagates each ensemble member forward one step to approximate $p(w_t \mid y_{1:t-1}, w_0)$ with:

$$x_{t|t-1}^j = g\big(x_{t-1}^j\big) + \epsilon_x$$
which utilizes a constant-velocity state transition function $g(\cdot)$ and stochastic error $\epsilon_x \sim \mathcal{N}(0, Q_t)$, estimated with a normal distribution from the sample demonstrations. Next, the control output module 220 updates each ensemble member with the observation through the nonlinear observation operator $h(\cdot)$,
followed by computing a deviation of the ensemble from the sample mean:

$$H_t A_t = \big[h(x_t^1) - \bar{h}_t, \ldots, h(x_t^E) - \bar{h}_t\big], \qquad \bar{h}_t = \frac{1}{E} \sum_{j=1}^{E} h\big(x_t^j\big)$$
The control output module 220 uses the deviation $H_t A_t$ and observation noise $R$ to compute an innovation covariance:

$$S_t = \frac{1}{E-1} \big(H_t A_t\big)\big(H_t A_t\big)^{\mathsf{T}} + R$$
The control output module 220 uses the innovation covariance as well as the deviation of the ensemble to calculate the Kalman gain from the ensemble members, without an explicit covariance matrix, through:

$$K_t = \frac{1}{E-1}\, A_t \big(H_t A_t\big)^{\mathsf{T}} S_t^{-1}$$
Finally, the control output module 220 realizes a measurement update by applying a difference between a new observation at time $t$ and an expected observation given $t-1$ to the ensemble through the Kalman gain,

$$x_t^j = x_{t|t-1}^j + K_t \big(y_t - h\big(x_{t|t-1}^j\big)\big)$$
The control output module 220 accommodates partial observations by artificially inflating the observation noise for non-observable variables, such as the control signals, so that the Kalman filter does not condition on these unknown input values.
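An illustrative sketch of the ensemble measurement update described above is provided below; the state dimensionality, the identity observation operator, and the noise levels are assumptions, and a full enBIP implementation would additionally handle phase propagation and partial observations.

```python
# Illustrative sketch of the ensemble measurement update described above
# (deviations, innovation covariance, Kalman gain, and state correction).
# This deterministic variant omits per-member observation perturbations.
import numpy as np

def ensemble_update(X, y, h, R):
    """X: (E, n) ensemble of latent states; y: (m,) observation;
    h: observation operator mapping a state to an (m,) vector; R: (m, m) noise."""
    E = X.shape[0]
    HX = np.stack([h(x) for x in X])           # predicted observations, (E, m)
    A = X - X.mean(axis=0)                     # state deviations from the sample mean
    HA = HX - HX.mean(axis=0)                  # observation-space deviations
    S = HA.T @ HA / (E - 1) + R                # innovation covariance
    K = A.T @ HA / (E - 1) @ np.linalg.inv(S)  # Kalman gain without an explicit state covariance
    return X + (y - HX) @ K.T                  # measurement update applied to every member

# Example with a hypothetical 2-D latent state observed directly (identity h).
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 2))                   # ensemble of E = 32 members
X = ensemble_update(X, np.array([0.5, -0.2]), lambda x: x, 0.1 * np.eye(2))
```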
To validate the system 100, a number of experiments were conducted with a focus on real-world human-subject data. The following describes in detail how the data was collected and processed, and how the models were trained, including the specific hardware and software utilized. To better discuss specific model results, the following experimental section is further broken up into experiments and results for (1) the network architecture and (2) the enBIP model with environmental awareness. Experiment 1 examines the efficacy of the network architecture in predicting accurate depth values from RGB images collected from a body-mounted camera, while Experiment 2 applies the network architecture on embedded hardware to infer environmentally conditioned control trajectories for a lower-limb prosthesis in a testbed environment.
Multimodal data sets were collected from participants who were outfitted with advanced inertial measurement units (IMUs) and a camera/depth sensor module. The IMUs are BNO080 system-in-package devices that include a triaxial accelerometer, a triaxial gyroscope, and a magnetometer with a 32-bit ARM Cortex microcontroller running Hillcrest Labs proprietary SH-2 firmware for sensor filtering and fusion. The IMU devices are combined with an ESP32 microprocessor, in ergonomic cases that can easily be fitted to subjects' bodies over clothing, to send Bluetooth data packets at 100 Hz. These inertial sensor modules were mounted to the subjects' lower limb and foot during the data collection process to collect kinematic data. However, during the testing phase, the foot sensor is removed. Additionally, an Intel RealSense D435 depth camera module was mounted to the subjects' lower limb.
A custom vision data set of 57 varied scenes was collected with over 30,000 RGB-depth image pairs from the lower limb during locomotion tasks in a dynamic urban environment. Data collection involved a subject walking over various obstacles and surfaces, including, but not limited to, sidewalks, roadways, curbs, gravel, carpeting, and up/down stairs, in differing lighting conditions at a fixed depth range (0.0-1.0 meters). Example images from the custom data set are visible in the upper row of
In order for the network architecture of the depth estimation network 212 to operate under real-world conditions, it must have both a low depth-prediction error and real-time computation capabilities. The following section details the learning process and accuracy of the network architecture on the custom human-subject data set.
The last row shows results from the present depth estimation model. ↑ indicates the higher the better; ↓ indicates the lower the better.
Training: 80% of the custom data set (46 scenes) was utilized for training and 20% (11 scenes) was utilized for testing. Adam was selected as the optimizer with learning rate $\eta_1 = 10^{-5}$. The input has the shape 90×160×3, whereas the ground truth and the output have the shape 90×160×1. The ground truth is down-sampled to 3×5, 6×10, 12×20, 23×40, and 45×80 for the loss weight schedule of DispNet. Training was also performed on three other AE architectures for empirical comparison. Residual learning, DispNet, and the combination of using convolutional layers are investigated for the decoder network 216 (see Table I). All models were trained for 100 epochs and fine-tuned by applying a disparity loss with learning rate $\eta_2 = 10^{-7}$ for 30 epochs.
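For illustration, the multi-resolution ground-truth targets may be produced with average pooling as sketched below, using the resolutions stated above; the function name is a hypothetical placeholder.

```python
# Illustrative sketch: down-sample the ground-truth depth map to the listed
# intermediate resolutions with average pooling so each decoder output has a
# matching target. Resolutions follow the schedule stated in the text.
import torch
import torch.nn.functional as F

def ground_truth_pyramid(depth, sizes=((3, 5), (6, 10), (12, 20), (23, 40), (45, 80), (90, 160))):
    # depth: (N, 1, 90, 160) ground-truth tensor; returns one target per resolution.
    return [F.adaptive_avg_pool2d(depth, size) for size in sizes]

targets = ground_truth_pyramid(torch.rand(1, 1, 90, 160))
```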
A pre-trained Mask R-CNN was used as the masking network for object detection. The Mask R-CNN was fine-tuned on masks provided from the custom data set using a binary cross-entropy loss.
Results: The depth prediction results shown in
Evaluation: The evaluation and the ablation study of the depth prediction results are shown in Table I. The evaluation process takes the RGB data from the testing set and compares the model predictions with the ground truth in terms of commonly accepted evaluation metrics: absolute REL, squared REL, RMSE, RMSE log, $\delta_1$, $\delta_2$, and $\delta_3$, where $\delta_N$ is the percentage of ground truth pixels under the constraint $\max(\hat{d}_i / d_i, d_i / \hat{d}_i) < 1.25^N$. For temporal consistency, $C$ is proposed as the metric for consistency evaluation:
where $T$ is the number of frames in a video sequence. Since the terrain from the camera view is dynamic, a lower $C$ is better under the constraint $C > 0$. Another visual way of evaluating depth estimation models is to review 3D point clouds generated with predicted depth maps.
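The commonly accepted metrics referenced above, together with a simple frame-to-frame consistency measure, may be computed as sketched below; the form of the consistency measure used here is an assumption and does not reproduce the disclosed metric $C$.

```python
# Illustrative sketch of standard depth-evaluation metrics (absolute REL,
# squared REL, RMSE, RMSE log, and delta thresholds) plus a simple
# frame-to-frame consistency measure (assumed form, not the disclosed C).
import numpy as np

def depth_metrics(pred, gt, eps=1e-6):
    pred, gt = np.clip(pred, eps, None), np.clip(gt, eps, None)
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    deltas = [np.mean(ratio < 1.25 ** n) for n in (1, 2, 3)]
    return abs_rel, sq_rel, rmse, rmse_log, deltas

def temporal_consistency(pred_sequence):
    # Mean absolute change between consecutive predicted frames.
    diffs = [np.mean(np.abs(a - b)) for a, b in zip(pred_sequence[1:], pred_sequence[:-1])]
    return float(np.mean(diffs))

metrics = depth_metrics(np.random.rand(90, 160) + 0.1, np.random.rand(90, 160) + 0.1)
```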
The final version of the framework 200 must integrate with the prosthetic or orthotic joint 130 and must be capable of fast inference over a range of environmental variables. Therefore, the framework 200 was deployed on embedded hardware serving as the computing device 120, which in some embodiments is a Jetson Xavier NX, a system-on-module (SOM) device capable of up to 21 TOPS of accelerated computing and tailored toward streaming data from multiple sensors into modern deep neural networks. The framework 200 performed inference in an average of 0.0860 sec (11.57 FPS) with a standard deviation of 0.0013 sec.
One critical use of environmental and terrain awareness is in stepping over curbs or onto stairs. If a terrain prediction algorithm were even 99% effective in stair prediction, it would still pose a grave safety concern due to the chance of causing the subject to fall down a set of stairs. Since environmental features are directly incorporated with very high accuracy, the evaluation focuses experimentally on two criteria in the prosthesis experiments: (1) "Can enBIP model stair stepping over a range of step distances?" and (2) "For a given step, does incorporating environmental features produce a more accurate model?"
Training: Collected data for stair stepping was used to train an enBIP model with modalities from tibia-mounted inertial sensors, predicted depth features, and the ankle angle control trajectory. To produce depth features from the predicted depth map, the system 100 took the average over two masks which bisect the image horizontally and subtracted the area behind the shoe from the area in front. A one-dimensional depth feature was produced which showed the changes in terrain due to slopes or steps. While the depth features for this experiment were simplified, other and more complex features were possible, such as calculating curvature, detecting stair edges, or incorporating the entire predicted depth map. Subjects were asked to perform 50 steps of stair-stepping up onto a custom-built curb, during which the subject was instructed to start with their toe at varying positions away from the curb in a range from 0 inches to 16 inches. Applying the framework 200, the system 100 ends up with a generic model to predict ankle angle control actions given IMU observations and depth features. The compiled point cloud in
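The one-dimensional depth feature described above may be sketched as follows; the fixed horizontal split is an assumption standing in for the foot segmentation mask.

```python
# Illustrative sketch of the one-dimensional depth feature: split the predicted
# depth map into a region in front of the shoe and a region behind it, then take
# the difference of the two regional averages. The fixed horizontal split is an
# assumption standing in for the foot-segmentation mask.
import numpy as np

def scalar_depth_feature(depth_map):
    half = depth_map.shape[0] // 2
    front = depth_map[:half, :].mean()   # region ahead of the foot
    behind = depth_map[half:, :].mean()  # region behind/under the foot
    return front - behind                # positive values suggest rising terrain (e.g., a curb)

feature = scalar_depth_feature(np.random.rand(90, 160))
```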
Results: The system 100 produced an average control error of 1.07° over 10 withheld example demonstrations when using depth features for the stair stepping task, compared to an average control error of 6.60° without depth features. The system 100 performed even better when examined at 35% phase, the approximate temporal location where the foot traverses the leading edge of the stair, with an average control error of 2.32° compared to 9.25° for inference with kinematics only.
At block 302 of method 300 shown in
With reference to
With reference to
Continuing with
Continuing with
Device 400 comprises one or more network interfaces 410 (e.g., wired, wireless, PLC, etc.), at least one processor 420, and a memory 440 interconnected by a system bus 450, as well as a power supply 460 (e.g., battery, plug-in, etc.).
Network interface(s) 410 include the mechanical, electrical, and signaling circuitry for communicating data over the communication links coupled to a communication network. Network interfaces 410 are configured to transmit and/or receive data using a variety of different communication protocols. As illustrated, the box representing network interfaces 410 is shown for simplicity, and it is appreciated that such interfaces may represent different types of network connections such as wireless and wired (physical) connections. Network interfaces 410 are shown separately from power supply 460; however, it is appreciated that the interfaces that support PLC protocols may communicate through power supply 460 and/or may be an integral component coupled to power supply 460.
Memory 440 includes a plurality of storage locations that are addressable by processor 420 and network interfaces 410 for storing software programs and data structures associated with the embodiments described herein. In some embodiments, device 400 may have limited memory or no memory (e.g., no memory for storage other than for programs/processes operating on the device and associated caches).
Processor 420 comprises hardware elements or logic adapted to execute the software programs (e.g., instructions) and manipulate data structures 445. An operating system 442, portions of which are typically resident in memory 440 and executed by the processor, functionally organizes device 400 by, inter alia, invoking operations in support of software processes and/or services executing on the device. These software processes and/or services may include prosthetic or orthotic joint processes/services 490 described herein. Note that while prosthetic or orthotic joint processes/services 490 is illustrated in centralized memory 440, alternative embodiments provide for the process to be operated within the network interfaces 410, such as a component of a MAC layer, and/or as part of a distributed computing network environment.
It will be apparent to those skilled in the art that other processor and memory types, including various computer-readable media, may be used to store and execute program instructions pertaining to the techniques described herein. Also, while the description illustrates various processes, it is expressly contemplated that various processes may be embodied as modules or engines configured to operate in accordance with the techniques herein (e.g., according to the functionality of a similar process). In this context, the terms module and engine may be interchangeable. In general, the term module or engine refers to a model or an organization of interrelated software components/functions. Further, while the prosthetic or orthotic joint processes/services 490 is shown as a standalone process, those skilled in the art will appreciate that this process may be executed as a routine or module within other processes.
It should be understood from the foregoing that, while particular embodiments have been illustrated and described, various modifications can be made thereto without departing from the spirit and scope of the invention as will be apparent to those skilled in the art. Such changes and modifications are within the scope and teachings of this invention as defined in the claims appended hereto.
This is a PCT application that claims benefit to U.S. Provisional Patent Application Ser. No. 63/210,187 filed 14 Jun. 2021, which is herein incorporated by reference in its entirety.
This invention was made with government support under 1749783 awarded by the National Science Foundation. The government has certain rights in the invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US22/33464 | 6/14/2022 | WO |
Number | Date | Country | |
---|---|---|---|
63210187 | Jun 2021 | US |