The disclosure relates to tactile and visual sensors, and more particularly to a semitransparent tactile sensor capable of tactile and visual sensing, and a method of sensing an interaction with an object using the semitransparent sensor.
Related robotic grasping methods struggle to combine visual and tactile feedback to reliably retrieve objects. Robots may be useful in homes and factories by grasping objects. Current grasping solutions use visual-only feedback and do not integrate tactile sensing effectively. Despite recent advances in tactile sensor technologies and complex robotic hands, robotic manipulation systems largely operate with open-loop visual sensing and do not effectively process tactile sensory information. This results in an inability of robotic systems to react to errors and grasp unknown objects.
While some tactile sensing technologies are available, the feedback from such sensors has been difficult to integrate into robotic systems. Recently, optical tactile sensors have emerged that return image-based measurements; these measurements contain useful information, but meaningful signals may be difficult to extract from them.
Related optical tactile sensing technologies do not allow a sensor to visualize the world beyond the tactile membrane. A recent family of tactile sensors called See-Through-your-Skin (STS) sensors allows the transparency of a reflective membrane to be controlled so that information beyond the sensor can be visualized. This offers a sensing capability that allows robots to detect the position of an object relative to the robot hand while remaining occlusion free. One difficulty may be that an output of the STS sensor is an RGB image, which may be difficult to interpret.
Further, object slippage detection methods in the related art do not estimate a magnitude and direction of slip, but return a binary slip value.
According to an aspect of the present disclosure, visual and tactile signals may be extracted that are accurate and useful for robotic grasp control, thus enabling robots to sense a location of objects prior to contact and detect object slippage.
According to an aspect of the present disclosure, a method for identifying and manipulating objects, the method includes: obtaining, from an image sensor, image sensor data; identifying, using the image sensor data, a location of an object; controlling a robotic element, which includes the image sensor, to move towards the location of the object; determining a slippage based on contact between the image sensor and the object; and controlling a movement of the robotic element based on the determined slippage.
The identifying the location of the object may include identifying a bounding box for the object in an image that is represented by the image sensor data.
The identifying the bounding box may include predicting coordinates of the bounding box in the image.
The identifying the bounding box may include identifying a centroid of the bounding box and determining a distance between the centroid and a center of the image sensor.
The determining the slippage may include measuring a deformation of a surface of the image sensor when the image sensor is in contact with the object.
The determining the slippage may include determining a marker flow and determining an object flow, and determining a slip field as a difference between the object flow and the marker flow.
The determining the marker flow may include identifying movement of at least one marker, and wherein determining the object flow may include determining a motion of the object in relation to the image sensor.
The method may further include combining the marker flow and the object flow using a convolutional neural network architecture.
According to another aspect of the present disclosure, an electronic device for identifying and manipulating objects includes: at least one memory storing instructions; and at least one processor configured to execute the instructions to: obtain, from an image sensor, image sensor data; identify, using the image sensor data, a location of an object; control a robotic element, which includes the image sensor, to move towards the location of the object; determine a slippage based on contact between the image sensor and the object; and control a movement of the robotic element based on the determined slippage.
The at least one processor may be further configured to identify a bounding box for the object in an image that is represented by the image sensor data.
The at least one processor is further configured to predict coordinates of the bounding box in the image.
The at least one processor is further configured to identify a centroid of the bounding box and determine a distance between the centroid and a center of the image sensor.
The at least one processor is further configured to measure a deformation of a surface of the image sensor when the image sensor is in contact with the object.
The at least one processor is further configured to determine a marker flow and determine an object flow, and determine a slip field as a difference between the object flow and the marker flow.
The at least one processor is further configured to determine the marker flow by identifying movement of at least one marker, and to determine the object flow by determining a motion of the object in relation to the image sensor.
The at least one processor is further configured to combine the marker flow and the object flow using a convolutional neural network architecture.
According to another aspect of the present disclosure, a non-transitory computer readable storage medium that stores instructions to be executed by at least one processor to perform a method for identifying and manipulating objects includes: obtaining, from an image sensor, image sensor data; identifying, using the image sensor data, a location of an object; controlling a robotic element, which includes the image sensor, to move towards the location of the object; determining a slippage based on contact between the image sensor and the object; and controlling a movement of the robotic element based on the determined slippage.
The identifying the location of the object may include identifying a bounding box for the object in an image that is represented by the image sensor data.
The identifying the bounding box may include predicting coordinates of the bounding box in the image.
The identifying the bounding box may include identifying a centroid of the bounding box and determining a distance between the centroid and a center of the image sensor.
Additional aspects will be set forth in part in the description that follows and, in part, will be apparent from the description, or may be learned by practice of the presented embodiments of the disclosure.
The above and other aspects, features, and advantages of embodiments of the disclosure will be more apparent from the following description taken in conjunction with the accompanying drawings, in which:
The following detailed description of example embodiments refers to the accompanying drawings. The same reference numbers in different drawings may identify the same or similar elements.
The disclosure provides illustration and description, but is not intended to be exhaustive or to limit the implementations to the precise form disclosed. Modifications and variations are possible in light of the above disclosure or may be acquired from practice of the implementations.
As used herein, the term “component” is intended to be broadly construed as hardware, firmware, or a combination of hardware and software.
It will be apparent that systems and/or methods, described herein, may be implemented in different forms of hardware, firmware, or a combination of hardware and software. The actual specialized control hardware or software code used to implement these systems and/or methods is not limiting of the implementations. Thus, the operation and behavior of the systems and/or methods were described herein without reference to specific software code—it being understood that software and hardware may be designed to implement the systems and/or methods based on the description herein.
Even though particular combinations of features are recited in the claims and/or disclosed in the specification, these combinations are not intended to limit the disclosure of possible implementations. In fact, many of these features may be combined in ways not specifically recited in the claims and/or disclosed in the specification. Although each dependent claim listed below may directly depend on only one claim, the disclosure of possible implementations includes each dependent claim in combination with every other claim in the claim set.
No element, act, or instruction used herein should be construed as critical or essential unless explicitly described as such. Also, as used herein, the articles “a” and “an” are intended to include one or more items, and may be used interchangeably with “one or more.” Furthermore, as used herein, the term “set” is intended to include one or more items (e.g., related items, unrelated items, a combination of related and unrelated items, etc.), and may be used interchangeably with “one or more.” Where only one item is intended, the term “one” or similar language is used. Also, as used herein, the terms “has,” “have,” “having,” or the like are intended to be open-ended terms. Further, the phrase “based on” is intended to mean “based, at least in part, on” unless explicitly stated otherwise.
According to one or more embodiments, computer vision algorithms may enable robots using optical tactile sensors to better sense and react to interactions with an object in their environment. The algorithms may provide a unified solution that uses a single sensor to detect the location and identity of an object as a robot hand approaches the object, and to characterize the contact state (stick/slip) once the robot makes contact with the object. Embodiments are not limited to this.
According to one or more embodiments, a computer vision-based algorithm estimates the distance and position of the object relative to the sensor, allowing a robot to reliably approach objects prior to contact.
According to one or more embodiments, a slip detection algorithm may estimate dense pixel-wise values to identify a direction and magnitude of object slippage. These algorithms may be applied to optical tactile sensor technologies resulting in new robot grasping capabilities and more reliable robotic behavior. These may allow for robots to be deployed and to be effective in real-world applications such as home and service robotics.
Manipulating an object effectively may require both proximity and contact sensing. The disclosure presents visuotactile operators for proximity sensing and contact control, and algorithms that use the measurements from high-resolution optical tactile sensors. This enables a robot to sense an object prior to, during, and after contact has occurred. The disclosure describes algorithmic solutions that use a visuotactile sensor to estimate several object properties essential to robotic manipulation, including distance to contact and object slip. These algorithms enable a robot to approach an object with precision up to the moment of contact. During contact, the algorithm senses object slippage by detecting how tactile cues move relative to the sensor's contact surface. This capability provides robots with feedback to accomplish improved robotic dexterity, such as grasping of household objects, assembling of complex parts (e.g., electronics), and natural interaction with humans.
Image 104 is an enlarged view of the image returned by sensor 101. A bounding box 103 may be identified to determine the properties of object 102 (e.g., a soda can). According to an embodiment, the bounding box 103 of an object may allow for a determination of the object's centroid and its position relative to the center of the sensor (e.g., $d_y$, $d_z$). An area of the bounding box may be used to infer the proximity of the object relative to the sensor (e.g., $d_x$). Given an area of the object in pixels obtained using object detection, a model may be used to estimate the distance to contact $d_x$ using the linear relation $A = k\,(d_x)^2$, where $A$ is the current measured area of the object in pixels. A model may be trained by collecting a dataset of $(d_{x,i}, A_i)$ pairs, obtained by running robot trajectories in which the robot gripper approaches the target object. Thus, according to an embodiment, a method may predict the object location $(x, y, z)$ in three dimensions (3D) relative to the sensor using optical tactile sensors. As illustrated in
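As a non-limiting illustration, the distance model and the centroid offset described above may be sketched as follows. The sketch takes the relation between area and distance exactly as written above and fits the constant k by least squares; the function names, the bounding box layout (x_min, y_min, x_max, y_max), and the mapping of image axes to the sensor offsets are assumptions made for illustration and are not specified by the disclosure.

```python
import math


def fit_area_constant(distances, areas):
    """Least-squares estimate of k in A = k * dx**2 from (dx_i, A_i) pairs."""
    numerator = sum(a * d ** 2 for d, a in zip(distances, areas))
    denominator = sum(d ** 4 for d in distances)
    if denominator == 0:
        raise ValueError("at least one nonzero distance is required")
    return numerator / denominator


def distance_from_area(area, k):
    """Invert A = k * dx**2 to recover the distance to contact dx."""
    return math.sqrt(area / k)


def centroid_offset(box, image_size):
    """Offset (in pixels) of the bounding-box centroid from the image center.

    box is (x_min, y_min, x_max, y_max); image_size is (width, height).
    Both layouts are assumptions made for this sketch.
    """
    x_min, y_min, x_max, y_max = box
    width, height = image_size
    centroid_x = (x_min + x_max) / 2.0
    centroid_y = (y_min + y_max) / 2.0
    return centroid_x - width / 2.0, centroid_y - height / 2.0
```

In such a sketch, the pairs passed to fit_area_constant would come from recorded approach trajectories, and the fitted constant would then be reused at run time to convert each measured area into a distance estimate.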
As shown in
The convolutional neural network architecture 300 may represent a type of deep artificial neural network, which is often applied to analyze images. In this example, the convolutional neural network architecture 300 is formed using an encoder network 203 and a corresponding decoder network 205. The encoder network 203 is formed using multiple encoder layers, which include multiple convolutional layers 310a-310d and multiple pooling layers 312a-312d. Each of the convolutional layers 310a-310d represents a layer of convolutional neurons, which apply a convolution operation that emulates the response of individual neurons to visual stimuli. Each neuron typically applies some function to its input values (often by weighting different input values differently) to generate output values. Each of the pooling layers 312a-312d represents a layer that combines the output values of neuron clusters from one convolutional layer into input values for the next layer. The encoder network 203 here is shown as including four encoder layers having four convolutional layers 310a-310d and four pooling layers 312a-312d, although the encoder network 203 could include different numbers of encoder layers, convolutional layers, and pooling layers.
In some embodiments, each of the convolutional layers 310a-310d can perform convolution with a filter bank (containing filters or kernels) to produce a set of feature maps. These feature maps can be batch normalized, and an element-wise rectified linear unit (ReLU) function can be applied to the normalized feature map values. The ReLU function typically operates to ensure that none of its output values is negative, such as by selecting (for each normalized feature map value) the greater of that value or zero. Following that, each of the pooling layers 312a-312d can perform max-pooling with a window and a stride of two (non-overlapping window), and the resulting output is sub-sampled by a factor of two. Max-pooling can be used to achieve translation invariance over small spatial shifts in the input image patch. Sub-sampling results in a large input image context (spatial window) for each pixel in the feature maps.
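As a non-limiting illustration, the ReLU and max-pooling operations described above may be sketched on a single feature map stored as a list of rows. The function names are hypothetical, and the 2x2 window is an assumption consistent with a non-overlapping window and a stride of two; convolution and batch normalization are omitted.

```python
def relu(feature_map):
    """Element-wise ReLU: keep the greater of each value or zero."""
    return [[max(value, 0.0) for value in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Max-pooling with an assumed 2x2 window and a stride of two.

    Any trailing odd row or column is dropped, so the output is
    sub-sampled by a factor of two in each spatial dimension.
    """
    rows = len(feature_map) - len(feature_map) % 2
    cols = len(feature_map[0]) - len(feature_map[0]) % 2
    return [
        [
            max(
                feature_map[r][c],
                feature_map[r][c + 1],
                feature_map[r + 1][c],
                feature_map[r + 1][c + 1],
            )
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]
```

Because each output value is the maximum over its window, shifting a strong response by one pixel inside the same window leaves the pooled output unchanged, which is the small-shift invariance referred to above.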
The decoder network 205 is formed using multiple decoder layers, which include multiple upsampling layers 314a-314d and multiple convolutional layers 316a-316d. Each of the upsampling layers 314a-314d represents a layer that upsamples input feature maps. Each of the convolutional layers 316a-316d represents a trainable convolutional layer that produces dense feature maps, which can be batch normalized. The decoder network 205 here is shown as including four decoder layers having four upsampling layers 314a-314d and four convolutional layers 316a-316d, although the decoder network 205 could include different numbers of decoder layers, upsampling layers, and convolutional layers. Each encoder layer in the encoder network 203 could have a corresponding decoder layer in the decoder network 205, so there could be an equal number of layers in the encoder network 203 and in the decoder network 205.
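As a non-limiting illustration, the symmetry between the encoder layers and the decoder layers may be sketched by tracing the spatial size of a feature map through the network. The function names are hypothetical, and nearest-neighbor upsampling by a factor of two is an assumption; the disclosure does not specify the upsampling method.

```python
def upsample_2x(feature_map):
    """Nearest-neighbor upsampling by a factor of two (assumed method)."""
    upsampled = []
    for row in feature_map:
        wide_row = [value for value in row for _ in range(2)]
        upsampled.append(wide_row)
        upsampled.append(list(wide_row))
    return upsampled


def encoder_decoder_sizes(input_size, num_layers=4):
    """Spatial size after each pooling layer and then each upsampling layer."""
    sizes = [input_size]
    size = input_size
    for _ in range(num_layers):  # each pooling layer halves the size
        size //= 2
        sizes.append(size)
    for _ in range(num_layers):  # each upsampling layer doubles the size
        size *= 2
        sizes.append(size)
    return sizes
```

With an equal number of encoder and decoder layers and an input size divisible by two raised to the number of layers, the output of the decoder has the same spatial size as the input, which is what allows a dense per-pixel output.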
A convolutional layer 318 processes the feature maps that are output by the decoder network 205. For example, the convolutional layer 318 could perform convolution operations to produce pixel-level blending map patches for the input image patches 302 independently. This allows, for instance, the convolutional layer 318 to convert the feature maps into the blending map patches 304. The blending map patches 304 are dense per-pixel representations of pixel quality measurements involving information about motion degree and well-exposedness.
In some embodiments, the convolutional neural network architecture 300 operates as follows. The initial layers in the encoder network 203 are responsible for extracting scene contents and spatially down-sizing feature maps associated with the scene contents. This enables the effective aggregation of information over large areas of the input image. The later layers in the encoder network 203 learn to merge the feature maps. The layers of the decoder network 205 simulate coarse-to-fine reconstruction of the downsized representations by gradually upsampling the feature maps and translating the feature maps into blending maps. This allows for a more reliable recovery of the details lost by the encoder network 203.
It should be noted that the convolutional neural network architecture 300 shown in
In addition, it may be possible to compress and accelerate the operation of the convolutional neural network architecture 300 for real-time applications in various ways. For example, parameter pruning and parameter sharing can be used to remove redundancy in the parameters. As another example, low-rank factorization can be used to estimate informative parameters in learning-based models. As a third example, convolutional filters' utilization can be transferred or compacted by designing special structural convolutional filters to reduce storage and computation complexity.
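As a non-limiting illustration of parameter pruning, one simple approach is to zero out the weights with the smallest magnitudes. Magnitude-based pruning is an assumption chosen for this sketch; the disclosure does not specify a pruning criterion, and the function name and the keep_fraction parameter are hypothetical.

```python
def prune_by_magnitude(weights, keep_fraction):
    """Zero out all but the largest-magnitude weights.

    keep_fraction is the fraction of weights to keep (0 < keep_fraction <= 1).
    Weights tied with the threshold magnitude are all kept, so slightly more
    than the requested fraction may survive.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, int(round(len(weights) * keep_fraction)))
    magnitudes = sorted((abs(w) for w in weights), reverse=True)
    threshold = magnitudes[n_keep - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]
```

The zeroed weights may then be stored sparsely or skipped during computation, which is where the reduction in storage and computation would come from.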
Although
In tactile sensing mode, the sensor may detect a dense object slippage field when the object is in contact with the sensor. An algorithm may be used that provides dense pixel-wise slip measurements (e.g., pixel displacements) that describe how each section of the object moves relative to the contact surface of the visuotactile sensor. According to an embodiment, a feature of the slip detection algorithm may be to track the object motion relative to the sensor's gel membrane. However, this quantity may not be directly observable from the raw tactile measurements: the membrane's elastic deformation may not be known, so it may not be determinable whether the object motion in the image occurs together with the membrane (e.g., sticking) or independently from the membrane (e.g., slippage). To detect object slippage, the motion of the membrane (e.g., marker motion) may be compared with the motion of a moving object behind the sensor (e.g., object motion) to characterize the nature of contact. When the object sticks to the sensor, the marker and object motions are consistent, whereas in situations of slip the two motions are different. An additional advantage of this method is that the difference between the object motion and the marker motion renders a dense pixel-wise description of the slip field, which can be used as a rich feedback signal for tactile manipulation policies.
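As a non-limiting illustration, the slip field described above may be sketched as the per-pixel difference between the object flow and the marker flow. Each flow is assumed here to be a grid of (u, v) pixel displacements of the same size; the function names and this data layout are assumptions made for illustration, and the estimation of the two flows themselves is not shown.

```python
import math


def slip_field(object_flow, marker_flow):
    """Per-pixel slip vector: object flow minus marker flow."""
    return [
        [
            (obj_u - mrk_u, obj_v - mrk_v)
            for (obj_u, obj_v), (mrk_u, mrk_v) in zip(object_row, marker_row)
        ]
        for object_row, marker_row in zip(object_flow, marker_flow)
    ]


def slip_magnitude_and_direction(slip):
    """Per-pixel (magnitude in pixels, direction in radians) of the slip field."""
    return [
        [(math.hypot(u, v), math.atan2(v, u)) for (u, v) in row]
        for row in slip
    ]
```

Where the object sticks to the membrane, the two flows agree and the slip vector is zero; where the object slips, the difference is nonzero and its magnitude and direction describe the slip at that pixel.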
An example of object flow, according to an embodiment, is illustrated in
An example of marker flow, according to an embodiment, is illustrated in
An example of slip flow, according to an embodiment, is illustrated in
The time to contact, i.e., the estimated time at which the sensor will collide with an object, may be estimated based on a number of image-space properties. For example, the time to contact may be estimated under perspective projection and reasonable assumptions. An object may be visualized on a rectangular area of width $w$ and height $h$ with a constant depth $z_0$. Furthermore, the size of the object, including the area of the object at contact, is assumed to be known. Given the estimated area $A_t$ of the target at time $t$, along with the known area of the target at contact $A_{touch}$, an estimated time to contact may be equal to:
As illustrated in
The methods and features described above may be used to provide robots with sensing capabilities that will allow them to grasp objects with more dexterity and increased speed. The number of grasps per hour that a robot can perform may be useful for identifying an efficiency of a grasping system.
Faster Grasp Approach Phase: during an approach phase when a robot moves to grasp the object, the object detection and localization described above allows a robot to maintain a line of sight with the object (i.e., no occlusions) and estimate the distance to contact. This enables grasping systems to approach objects faster by reducing the uncertainty on the location of the object and the time to collision. This may have an impact on accelerating the deployment of robots in factories when robots must compete with the effectiveness of human pickers.
Slip aware trajectory optimization: once an object is grasped by the robot, the robot can exploit the slip detection described above to move the object to its target location as fast as possible while avoiding object slippage. The accuracy and resolution of the slip detection algorithm may allow the robot to reduce the acceleration in the direction of the slip vectors to prevent grasping failures.
A number of automated factories rely on robots to assemble consumer electronic devices, which includes manipulating cables and inserting them in their respective locations. The features presented in this disclosure present opportunities to fully exploit visuotactile measurements to accomplish dexterous robotic tasks that were previously impossible.
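As a non-limiting illustration, reducing the acceleration in the direction of slip may be sketched by scaling down the component of a commanded acceleration that lies along the mean slip direction. The sketch works in the two-dimensional image plane only; the function names, the gain parameter, and the use of the mean slip vector are assumptions made for illustration, and the mapping from the image plane to the robot frame is omitted.

```python
import math


def mean_slip(slip):
    """Average slip vector over a grid of (u, v) slip vectors."""
    vectors = [vector for row in slip for vector in row]
    count = len(vectors)
    return (
        sum(u for u, _ in vectors) / count,
        sum(v for _, v in vectors) / count,
    )


def limit_acceleration(accel, slip_vector, gain=0.5, eps=1e-9):
    """Remove a fraction `gain` of the acceleration along the slip direction.

    accel and slip_vector are (x, y) tuples in the same assumed 2D frame.
    The component perpendicular to the slip direction is left unchanged.
    """
    norm = math.hypot(slip_vector[0], slip_vector[1])
    if norm < eps:
        return accel  # no measurable slip, keep the commanded acceleration
    unit_x, unit_y = slip_vector[0] / norm, slip_vector[1] / norm
    along = accel[0] * unit_x + accel[1] * unit_y
    return (
        accel[0] - gain * along * unit_x,
        accel[1] - gain * along * unit_y,
    )
```

A trajectory optimizer could apply such a limit at each control step, for example by increasing the gain as the measured slip magnitude grows.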
Cables may be bundled together and must be disentangled during a grasping phase. As such, it may be useful to recognize an identity and location of a cable to be grasped dynamically, i.e. during the grasp phase. Object detection and localization, according to an embodiment, enables fine grasping capabilities and may open opportunities for robotics and automation in factories operated by human workers.
Once a desired cable is grasped, it should be inserted in the required slot. This insertion task may be challenging for robots to accomplish as they may not visualize a hole during an insertion process and the process is driven by tactile cues. A dense slip detection algorithm, according to an embodiment, provides a robot with valuable feedback on the geometry and location of the hole. This enables new sensing capabilities that may help robots accomplish tasks in a large factory.
Robots may be integrated into a number of environments, such as restaurants or customer service settings, where they may be expected to hand over objects to humans. For example, robot waiters in a restaurant may be tasked with setting and serving a table while interacting with humans.
Object detection and localization may provide for visually detecting possible robot grasps by integrating information from within the fingertips. A human who hands over an object to a robot may move the object in ways that are difficult to anticipate, which may complicate grasp planning. By using object detection according to embodiments of the present disclosure, a relative distance between an object and a robot may be inferred and used to effectively retrieve the object.
High resolution slip detection information may be used by a robot to determine a timing for when a robot should let go of the object as it is handing over objects to a human. For example, as a human secures a stable grasp on an object, the additional constraints on the object may cause object slippage relative to a robot gripper. By detecting a magnitude and direction of such slip vectors, a robot may determine when it is safe to let go of the object.
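As a non-limiting illustration, a release decision based on the slip field may be sketched as a threshold test on the slip magnitudes. The function name, the threshold values, and the rule itself are assumptions made for illustration; a check on the slip direction (for example, against the expected pull direction of the human) could be added but is omitted here.

```python
import math


def should_release(slip, magnitude_threshold=2.0, min_fraction=0.5):
    """Return True when enough pixels slip by more than the threshold.

    slip is a grid of (u, v) slip vectors in pixels. Both thresholds are
    hypothetical and would need to be tuned for a given sensor and gripper.
    """
    magnitudes = [math.hypot(u, v) for row in slip for (u, v) in row]
    if not magnitudes:
        return False
    slipping = sum(1 for magnitude in magnitudes if magnitude > magnitude_threshold)
    return slipping / len(magnitudes) >= min_fraction
```

In such a sketch, the gripper would keep holding the object while the function returns False and would open once the slip induced by the human's grasp is large and widespread enough.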
The above description provides illustration and description, but is not intended to be exhaustive or to limit the implementations to the precise form disclosed. Modifications and variations are possible in light of the above disclosure or may be acquired from practice of the implementations.
Expressions such as “at least one of,” when preceding a list of elements, modify the entire list of elements and do not modify the individual elements of the list. For example, the expression, “at least one of a, b, and c,” should be understood as including only a, only b, only c, both a and b, both a and c, both b and c, all of a, b, and c, or any variations of the aforementioned examples.
While such terms as “first,” “second,” etc., may be used to describe various elements, such elements must not be limited to the above terms. The above terms may be used only to distinguish one element from another.
This application is based on and claims priority under 35 U.S.C. § 119 to U.S. Provisional Patent Application No. 63/313,711 filed on Feb. 24, 2022, in the U.S. Patent & Trademark Office, the disclosure of which is incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
63313711 | Feb 2022 | US