6D Pose Identification
Object Recognition and 6-D Pose Estimation in Clutter (2016 - now)
Problem: Robotic manipulation systems frequently depend on a perception pipeline that can accurately perform object
recognition and six degrees-of-freedom (6-DOF) pose estimation. This is becoming increasingly important as
robots are deployed in environments less structured than traditional automation setups. An example application domain is warehouse automation and logistics, as highlighted by the Amazon Robotics Challenge (ARC).
In such tasks, robots have to deal with a large variety of objects potentially placed in complex arrangements, including
cluttered scenes where objects are resting on top of each other and are often only partially visible.
This work considers a similar setup, where a perception system has access to RGB-D images of objects in clutter, as
well as 3D CAD models of the objects, and must provide accurate pose estimation for the entire scene. In this domain,
solutions have been developed that use a Convolutional Neural Network (CNN) for object segmentation, followed by
a 3D model alignment step using point cloud registration techniques for pose estimation. The focus of this project is
improving this last step and increasing the accuracy of pose estimation by reasoning at the scene level about the physical interactions between objects.
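For concreteness, here is a minimal sketch of the model-alignment step that such a pipeline builds on: aligning a segmented object point cloud to its CAD model with point-to-point ICP. It uses only NumPy/SciPy; the segmentation mask, camera back-projection, and CAD model loading are assumed to be provided by the rest of the pipeline and are not shown.

```python
# Minimal point-to-point ICP sketch (NumPy/SciPy) for the model-alignment step.
# Segmentation, back-projection to 3D, and CAD model loading are assumed upstream.
import numpy as np
from scipy.spatial import cKDTree


def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t mapping src onto dst (Kabsch)."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, c_dst - R @ c_src


def icp(model_pts, scene_pts, iters=30):
    """Align CAD-model points to the segmented scene points; returns a 4x4 pose."""
    pose = np.eye(4)
    pts = model_pts.copy()
    tree = cKDTree(scene_pts)
    for _ in range(iters):
        _, idx = tree.query(pts)               # closest scene point for each model point
        R, t = best_rigid_transform(pts, scene_pts[idx])
        pts = pts @ R.T + t
        step = np.eye(4)
        step[:3, :3], step[:3, 3] = R, t
        pose = step @ pose
    return pose                                 # model-to-scene transform (object pose)
```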
Proposed solution: We proposed a process for efficiently
searching over combinations of individual object 6D pose
hypotheses in cluttered scenes, especially in cases involving
occlusions and objects resting on each other. The initial set
of candidate object poses is generated from state-of-the-art
object detection and global point cloud registration techniques.
The best-scored pose per object according to these techniques may not be accurate due to overlaps and occlusions. Nevertheless, our experiments indicate that pose hypotheses ranked lower by the registration techniques are often closer to the true poses than the top-ranked ones. This
motivates a global optimization process for improving these
poses by taking into account scene-level physical interactions
between objects. It also implies that the Cartesian product of candidate poses for interacting objects must be searched to identify the best scene-level hypothesis. To perform this search efficiently, the candidate poses for each object are first clustered to reduce their number while preserving sufficient diversity. The search over combinations of candidate object poses is then performed with a Monte Carlo Tree Search (MCTS) process, which uses the similarity between the observed depth image of the scene and a rendering of the scene under the hypothesized poses as the score that guides the search. By relying on the Upper Confidence Bound (UCB) technique, MCTS handles in a principled way the tradeoff between refining the most promising poses and exploring new ones.
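A minimal sketch of the UCB-guided search over scene hypotheses is given below. Each tree level fixes the pose of one object from its (already clustered) candidate set, and complete hypotheses are scored by comparing a rendering of the scene with the observed depth image. The functions render_depth and depth_similarity are hypothetical placeholders for the renderer and the depth-comparison score.

```python
# Sketch of UCB-guided Monte Carlo tree search over per-object pose candidates.
# `render_depth(assignment)` and `depth_similarity(a, b)` are hypothetical stand-ins.
import math, random


class Node:
    def __init__(self, assignment, remaining):
        self.assignment = assignment          # {object_id: chosen pose}
        self.remaining = remaining            # objects still without an assigned pose
        self.children = {}                    # candidate index -> child Node
        self.visits, self.value = 0, 0.0


def ucb(child, parent_visits, c=1.4):
    # Upper Confidence Bound: average score plus an exploration bonus.
    return (child.value / child.visits
            + c * math.sqrt(math.log(parent_visits) / child.visits))


def mcts(candidates, observed_depth, iterations=500):
    """candidates: {object_id: [pose, ...]}; returns the best scene-level hypothesis found."""
    root = Node({}, list(candidates))
    best_score, best_assignment = -float("inf"), None
    for _ in range(iterations):
        # Selection / expansion: each tree level assigns the pose of one object.
        node, path = root, [root]
        while node.remaining:
            obj = node.remaining[0]
            if len(node.children) < len(candidates[obj]):
                i = len(node.children)        # expand the next untried candidate
                node.children[i] = Node({**node.assignment, obj: candidates[obj][i]},
                                        node.remaining[1:])
                node = node.children[i]
                path.append(node)
                break
            node = max(node.children.values(), key=lambda ch: ucb(ch, node.visits))
            path.append(node)
        # Rollout: complete the partial hypothesis with random candidates and score it
        # by comparing a rendering of the hypothesized scene to the observed depth image.
        assignment = dict(node.assignment)
        for obj in node.remaining:
            assignment[obj] = random.choice(candidates[obj])
        score = depth_similarity(render_depth(assignment), observed_depth)
        if score > best_score:
            best_score, best_assignment = score, assignment
        for n in path:                        # back-propagate the score along the path
            n.visits += 1
            n.value += score
    return best_assignment
```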
We also proposed an autonomous process for training CNNs using artificial images. In particular, given access to 3D object models, several
aspects of the environment are physically simulated. The models are placed in physically realistic poses with respect to
their environment to generate a labeled synthetic dataset. To further improve object detection, the network self-trains over
real images that are labeled using a robust multi-view pose estimation process.
The key contributions are the incorporation of physical reasoning in the synthetic data generation process and the automation of the annotation of real images.
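The following sketch illustrates the physics-based data generation idea, using PyBullet as an example engine; the engine choice and the object URDF paths are assumptions, not the exact setup used in this work. Object models are dropped into the scene, simulated until they settle into physically realistic resting poses, and the resulting class labels and 6D poses are recorded as ground truth for rendering the synthetic training images.

```python
# Sketch of physics-based synthetic scene generation with PyBullet.
# Object URDF paths are hypothetical; rendering the labeled RGB-D image is omitted.
import random
import pybullet as p
import pybullet_data

object_urdfs = ["models/mustard.urdf", "models/soup_can.urdf"]   # hypothetical paths

p.connect(p.DIRECT)                                    # headless simulation
p.setAdditionalSearchPath(pybullet_data.getDataPath())


def generate_scene(num_objects=5, settle_steps=500):
    """Drop random objects and let them settle into physically realistic poses."""
    p.resetSimulation()
    p.setGravity(0, 0, -9.81)
    p.loadURDF("plane.urdf")                           # support surface
    bodies = []
    for _ in range(num_objects):
        urdf = random.choice(object_urdfs)
        pos = [random.uniform(-0.1, 0.1), random.uniform(-0.1, 0.1), 0.3 + 0.1 * len(bodies)]
        orn = p.getQuaternionFromEuler([random.uniform(0.0, 6.28) for _ in range(3)])
        bodies.append((urdf, p.loadURDF(urdf, basePosition=pos, baseOrientation=orn)))
    for _ in range(settle_steps):                      # simulate until objects come to rest
        p.stepSimulation()
    # Ground-truth labels: object class (from the URDF) and 6D pose of every object.
    return [(urdf, *p.getBasePositionAndOrientation(body)) for urdf, body in bodies]
```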
Results: Experimental results indicate that this process quickly identifies physically consistent object poses in cluttered scenes that are significantly closer to the ground truth than the poses found by point cloud registration methods. The proposed self-training process was also evaluated on several existing datasets and on a dataset that we collected with a Motoman robotic arm. The results show that the proposed approach outperforms popular training processes that rely on synthetic but not physically realistic data, or on manual annotation.
Future directions:
We are currently expanding this method by using more refined probabilistic segmentation methods, such as Fully Convolutional Networks (FCNs) for semantic segmentation.
We are developing a new pose estimation algorithm that takes as inputs a known model of an object and an RGB-D image of a scene in which the boundary of the object in question is uncertain and given by the probabilistic output of the FCN. The algorithm uses the known object model to refine the segmentation and estimate the pose of the object.
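One way such an algorithm could use the probabilistic segmentation is to weight each scene point by the FCN's per-pixel object probability during model alignment, so that uncertain boundary points influence the fit less, and then to refine the segmentation from the residuals. The sketch below only illustrates this idea; the point arrays, probabilities, and the 1 cm inlier threshold are illustrative assumptions, not the algorithm under development.

```python
# Sketch of probability-weighted model alignment: scene points are weighted by the
# FCN's per-pixel object probability, so uncertain boundary points count less.
# `scene_pts` and `probs` are assumed to come from back-projecting the RGB-D image
# and from the FCN's softmax output.
import numpy as np
from scipy.spatial import cKDTree


def weighted_alignment_step(model_pts, scene_pts, probs):
    """One weighted least-squares rigid alignment of the model to likely object points."""
    tree = cKDTree(model_pts)
    _, idx = tree.query(scene_pts)                   # model point nearest to each scene point
    w = probs / probs.sum()
    src, dst = model_pts[idx], scene_pts
    c_src, c_dst = (w[:, None] * src).sum(0), (w[:, None] * dst).sum(0)
    H = (src - c_src).T @ (w[:, None] * (dst - c_dst))
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_dst - R @ c_src
    # Refined segmentation: scene points that now lie close to the posed model.
    residuals = np.linalg.norm((R @ src.T).T + t - dst, axis=1)
    inliers = residuals < 0.01                       # 1 cm threshold (arbitrary)
    return R, t, inliers
```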
Learning Mechanical Models of Unknown Objects Online (2016 - now)
Problem: Identifying mechanical and geometric models of unknown objects on the fly, while manipulating them, is key to a successful deployment of robots in unstructured and unknown environments. The goal of this work is to design efficient algorithms for online model identification, and to validate them on the task of clearing clutter with a real robotic manipulator.
Simulation-Reality Gap
Proposed solution: We developed a practical approach for identifying unknown mechanical parameters, such as the mass and friction models of manipulated rigid objects or actuated robotic links, in a succinct manner that aims to improve the performance of policy search algorithms. Key features of this approach are the use of off-the-shelf physics engines and the adaptation of a black-box Bayesian optimization framework for this purpose. The physics engine is used to reproduce in simulation experiments performed on a real robot, and the mechanical parameters of the simulated system are automatically fine-tuned so that the simulated trajectories match the real ones. The optimized model is then used for learning a policy in simulation, before safely deploying it on the real robot. Given the well-known limitations of physics engines in modeling real-world objects, it is generally not possible to find a mechanical model that exactly reproduces the real trajectories in simulation. Moreover, there are many scenarios where a near-optimal policy can be found without perfect knowledge of the system. Therefore, searching for a perfect model may not be worth the computational effort in practice. The proposed approach instead aims to identify a model that is good enough to approximate the value of a locally optimal policy with a certain confidence, rather than spending all the computational resources on searching for the most accurate model. The pipeline of the online identification process is illustrated in the figure below.
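The identification loop can be summarized as follows: run the recorded actions through the physics engine under candidate parameters, measure the discrepancy with the real trajectories, and update the parameter search accordingly. In the sketch below, a simple cross-entropy-style random search stands in for the black-box Bayesian optimization used in the actual approach, and simulate_trajectory is a hypothetical wrapper around the physics engine.

```python
# Sketch of the model-identification loop: tune mechanical parameters so that
# simulated trajectories match the real ones. A cross-entropy-style search stands
# in for the black-box Bayesian optimization described in the text;
# `simulate_trajectory(params, actions)` is a hypothetical physics-engine wrapper.
import numpy as np


def trajectory_error(params, real_trajs, action_seqs):
    """Mean squared discrepancy between simulated and real trajectories."""
    err = 0.0
    for real, actions in zip(real_trajs, action_seqs):
        sim = simulate_trajectory(params, actions)           # hypothetical simulator call
        err += np.mean((np.asarray(sim) - np.asarray(real)) ** 2)
    return err / len(real_trajs)


def identify_model(real_trajs, action_seqs, dim=2, iters=30, pop=64, elite=8):
    """Search over parameters (e.g., [mass, friction]) minimizing trajectory error."""
    mean, std = np.ones(dim), np.ones(dim)                   # rough initial guess
    for _ in range(iters):
        samples = np.abs(np.random.randn(pop, dim) * std + mean)   # keep parameters positive
        scores = [trajectory_error(s, real_trajs, action_seqs) for s in samples]
        elites = samples[np.argsort(scores)[:elite]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-3
    return mean                                              # identified parameter estimate
```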
Results: Empirical evaluations, performed in simulation and on a real robotic manipulation task, show that model identification via physics engines can significantly boost the performance of policy search algorithms that are popular in robotics, such as
TRPO, PoWER and PILCO, with no additional real-world data. Based on this research, a paper was submitted to ICRA 2018.
Future directions: We are currently analyzing the properties of the proposed method, such as characterizing the conditions under which the model identification approach reduces the number of physical rollouts needed and quantifying the resulting speed-up in convergence. It would also be interesting to consider alternative physical tasks, such as locomotion challenges, that could benefit from the proposed framework.
Apprenticeship Learning of Sequential Manipulation Tasks (2017 - now)
Problem: A major source of failure in robotic manipulation is the large variety of objects, configurations, and poses encountered in unstructured environments. Consequently, a robot needs to be autonomous and to adapt to changes. However, developing the software necessary for performing every single new task autonomously is costly and can be accomplished only by experienced robotics engineers. This issue severely limits the widespread adoption of robots today.
Ideally, robots should be multi-purpose, versatile, and user-friendly machines. For example, factory workers should be able to train an assistant robot in the same way they train an apprentice to perform a new task. Humans can demonstrate how to perform a given task through vision alone. For instance, a worker records a video showing how to paint an object, and the robot should be able to learn from the video how to plan and execute the task, and how to adapt when the sizes and locations of the brush and paint bucket vary. This would eventually constitute the ultimate user-friendly interface for training robots.
Complex manipulation tasks, such as changing a tire,
packaging, stacking or painting on a canvas are
composed of a sequence of primitive actions such as picking
up the brush, dipping it in the paint, and pressing it against
the canvas while moving it. A robot can learn to reproduce
each of the primitive actions if they were demonstrated separately,
using any of the many imitation learning techniques
that are available. The main challenge, however, is learning
to create a high-level plan from low-level observations. This
problem is particularly difficult because the demonstrations
are provided as a raw video or a stream of unlabeled images.
Proposed solution: We presented an Inverse Reinforcement Learning (IRL) approach for learning to
perform sequential manipulation tasks from unlabeled visual
demonstrations. We use a deep convolutional neural network
to identify all the objects in an image. A pose estimation
algorithm is used to estimate the 6D poses of the identified
objects (3D positions and 3D orientations) from depth images.
During training, the poses of the teacher's fingers are
also tracked and recorded using the Leap Motion controller,
which is a depth sensor customized for tracking hands.
Our main algorithmic contribution is a task-state machine,
implemented as a shallow neural network, that takes as inputs
the 6D poses of the objects in the scene and returns a
reward function to use for planning a low-level trajectory for
the robotic arm. The task-state machine constantly receives
these inputs from the vision network. Whenever a low-level trajectory finishes a subtask, such as picking up the brush, the task-state machine detects the change in the scene and triggers the next subtask, for example dipping the brush, by switching to the corresponding reward function. This
process is repeated until the main task is finished. Both the
reward functions of the subtasks and the task-state machine
are learned from the demonstrations.
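The control flow of the task-state machine can be sketched as follows. In the actual system, both the subtask-completion detector and the per-subtask reward functions are learned from the demonstrations (with the machine implemented as a shallow neural network); in this sketch they are simply placeholders passed in as callables.

```python
# Sketch of the task-state machine's control flow. The subtask-completion
# classifier and the per-subtask reward functions are placeholders for the
# components that are learned from the demonstrations.
class TaskStateMachine:
    def __init__(self, reward_functions, completion_classifier):
        self.reward_functions = reward_functions      # one reward function per subtask
        self.is_complete = completion_classifier      # (subtask, object_poses) -> bool
        self.subtask = 0

    def step(self, object_poses):
        """Called whenever the vision network produces new 6D object poses."""
        if (self.subtask < len(self.reward_functions) - 1
                and self.is_complete(self.subtask, object_poses)):
            self.subtask += 1                          # e.g., pick-up done -> start dipping
        return self.reward_functions[self.subtask]     # reward used by the low-level planner
```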
Results: Empirical results on manipulating table-top objects
using an industrial robot show that the proposed algorithm can efficiently learn high-level planning strategies from unlabeled visual demonstrations.
The algorithm outperforms classical behavioral cloning, which learns to map states directly to actions without any reward representation.
However, this work still needs to be compared to some recent IRL methods that are becoming increasingly popular.
Future directions: We are investigating new methods for segmenting expert demonstrations into a sequence of subgoals. Currently, we use a change in the object nearest to the expert's hand as an indication of a change in the subgoal, and segment the demonstrations accordingly. The results are rather sensitive to the thresholds used for segmentation, especially for tasks with many objects that are close to each other. We are also looking into ways to generalize the demonstrated behavior to substantially different setups. Another interesting direction is to incorporate natural language into the framework, where a human could explain what she is doing while demonstrating the task.
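A minimal sketch of the segmentation heuristic mentioned above, under an assumed data layout (hand and object trajectories as arrays of 3D positions): a subgoal boundary is declared when the object nearest to the demonstrator's hand changes and the change persists for a few frames, which is exactly the kind of threshold the results are sensitive to.

```python
# Sketch of the current segmentation heuristic: a subgoal boundary is declared
# when the object nearest to the demonstrator's hand changes and the change
# persists for `min_frames` consecutive frames (an illustrative threshold).
import numpy as np


def segment_demonstration(hand_positions, object_positions, min_frames=5):
    """hand_positions: (T, 3); object_positions: {name: (T, 3)}. Returns boundary frames."""
    names = list(object_positions)
    nearest = []
    for t, hand in enumerate(hand_positions):
        dists = [np.linalg.norm(object_positions[n][t] - hand) for n in names]
        nearest.append(names[int(np.argmin(dists))])
    boundaries, run = [], 1
    for t in range(1, len(nearest)):
        run = run + 1 if nearest[t] == nearest[t - 1] else 1
        if (run == min_frames and t - min_frames >= 0
                and nearest[t] != nearest[t - min_frames]):
            boundaries.append(t - min_frames + 1)      # a new subgoal starts here
    return boundaries
```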
Other Projects
Learning geometric models of objects in real-time: In this ongoing work, we are interested in constructing 3D mesh and texture models of
objects that are seen by a robot for the first time. The objects are contained in a pile of clutter, and the robot's task is to search for a particular object, or
to clear all the clutter. To achieve that goal, the robot needs to learn models of the objects on the fly in order to manipulate them effectively. This is done by pushing objects in selected directions that maximize the information gain, while continuously tracking the moved objects and reconstructing full mesh models by combining the views observed from different angles.
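As an illustration of the action-selection idea, the sketch below greedily picks the push direction expected to reveal the most unseen surface; predict_newly_visible is a hypothetical predictor based on the partial model reconstructed so far.

```python
# Sketch of greedy push selection: choose the direction expected to expose the most
# currently unobserved surface. `predict_newly_visible(obj, direction)` is a
# hypothetical information-gain proxy computed from the partial object model.
import numpy as np


def choose_push(obj, num_directions=8):
    angles = np.linspace(0.0, 2.0 * np.pi, num_directions, endpoint=False)
    directions = [(np.cos(a), np.sin(a)) for a in angles]   # pushes in the table plane
    gains = [predict_newly_visible(obj, d) for d in directions]
    return directions[int(np.argmax(gains))]
```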
Learning to detect objects and predict their movements from raw videos: Based on optical flow in raw outdoor videos, we segment the frames into
moving objects. The obtained segments could belong to mobile objects such as humans, cars, and bicycles, or any stationary object that appears to be moving from
the perspective of a camera mounted on a robot. The second step consists of clustering the various segments into categories based on their features. Segments
that belong to the same category are given the same numerical label by the robot. Finally, a convolutional neural network is trained using the collected and automatically labeled
data. This process would allow robots to autonomously learn to detect and recognize objects. The same method could be extended to learn dynamical models of the objects.
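A rough sketch of this self-labeling pipeline, with simple stand-ins for the actual features and thresholds: dense optical flow segments the moving regions, each segment is described by a small feature vector, and k-means assigns the numerical pseudo-labels used to train the network.

```python
# Sketch of the self-labeling pipeline: segment moving regions with dense optical
# flow, describe each segment with a simple feature, and cluster the segments so
# that segments in the same cluster share a numerical label for CNN training.
import cv2
import numpy as np
from sklearn.cluster import KMeans


def moving_segments(prev_gray, gray, mag_thresh=2.0):
    """Connected regions whose optical-flow magnitude exceeds a threshold."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=2)
    mask = (mag > mag_thresh).astype(np.uint8)
    num, labels = cv2.connectedComponents(mask)
    return [(labels == i) for i in range(1, num)]          # one boolean mask per segment


def auto_label(frames, num_classes=5):
    """Cluster segment features (here: mean color + size) into pseudo-class labels."""
    feats, segs = [], []
    for prev, cur in zip(frames[:-1], frames[1:]):
        prev_g, cur_g = (cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in (prev, cur))
        for seg in moving_segments(prev_g, cur_g):
            feats.append(np.concatenate([cur[seg].mean(axis=0), [seg.sum()]]))
            segs.append(seg)
    labels = KMeans(n_clusters=num_classes).fit_predict(np.array(feats))
    return list(zip(segs, labels))                          # (mask, pseudo-label) pairs
```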
Optimal stopping problems in reinforcement learning: Most RL algorithms fall into one of two categories: (1) a fixed and finite planning horizon, or (2) an infinite horizon with a discount factor. Robotic tasks do not typically fit within these categories; they usually require a finite, but variable, amount of time. To throw an object, for example, a robot needs to build a certain momentum (velocity) at the end-effector by accelerating its arm, and then release the object at a given moment. Deciding when to switch from one mode of control to another (e.g., from accelerating the arm to releasing the object) can be better formalized by borrowing methods from the optimal stopping literature. We are currently investigating new RL techniques where the planning horizon itself is a decision variable that needs to be optimized.
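Formally, treating the switch decision as an optimal stopping problem means the value function has a stop branch in addition to the usual Bellman backup; the notation below is introduced only for illustration.

```latex
% V: value function, R_stop: reward collected when stopping/switching modes,
% r and P: reward and dynamics of the current control mode, gamma: discount factor.
\[
V(s) \;=\; \max\Big\{\, R_{\mathrm{stop}}(s),\;
\max_{a} \big[\, r(s,a) \;+\; \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\, V(s') \,\big] \Big\}
\]
```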