Monocular SLAM Supported Object Recognition

Monocular SLAM Supported Object Recognition Pillai & Leonard – 2015

Yesterday we looked at the SLAM problem. Once we’ve made a map and identified some landmarks, a next obvious challenge is to figure out what those landmarks actually are. This is the object recognition problem. I’ve chosen today’s paper because it’s recent (2015) and contains a good overview of related work which will help us to get oriented in the problem domain. It also brings together SLAM and object recognition.

Exploring the world of robotics and AI is a wonderful blend of wonder at all the amazing things that have been developed, the amount of progress and activity, the volume of open source libraries available, and so on, and at the same time a reality check on how difficult some ‘basic’ tasks, such as recognising everyday objects on a table, really are.

Say we were sitting down together looking at a table across the room. On that table are a few objects. I ask you whether the object on the right is a brightly painted coffee cup or a soda can because it’s hard to tell from here. If you were motivated enough to find the answer, what would you do? Probably get up and walk towards the object to get a better view, right? Once you’d observed it from a few angles, it would become obvious that it was in fact a coffee cup…

And that’s the big idea behind ‘Monocular SLAM supported Object Recognition’! It’s not as daunting as the title makes it sound: monocular means that we view the world through a single camera, SLAM we now know means making a map of the environment and identifying landmarks, and object recognition is simply figuring out what those things are.

Object recognition is a vital component in a robot’s repertoire of skills. Traditional object recognition methods have focused on improving recognition performance (Precision-Recall, or mean Average-Precision) on specific datasets. While these datasets provide sufficient variability in object categories and instances, the training data mostly consists of images of arbitrarily picked scenes and/or objects. Robots, on the other hand, perceive their environment as a continuous image stream, observing the same object several times, and from multiple viewpoints, as it constantly moves around in its immediate environment. As a result, object detection and recognition can be further bolstered if the robot were capable of simultaneously localizing itself and mapping (SLAM) its immediate environment – by integrating object detection evidences across multiple views.

If you want the 2 minute 40 second version of the paper, check out this YouTube video posted by the first author. Seeing how robots see and understand their world is a visual thing, so it adds a lot to watch it in action. Go ahead, I’ll wait for you here…

Let’s build back up to the SLAM-aware solution and start by reviewing some of the related work to get a feel for how object recognition works in general.

Sliding windows and templates

For traditional object recognition, HOG (Histogram of Oriented Gradients) detectors and DPMs (Deformable Part-based Models) are the benchmark. For each object that can be recognised they maintain a model of the shape of the object and its parts via ‘oriented-edge’ templates across several scales. Each template is then scanned across the entire image in a sliding-window fashion at several scales, and this is repeated for every object that needs to be identified. Obviously this approach doesn’t scale too well if you want to be able to recognise lots of different types of objects…
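
To make the cost of this concrete, here’s a minimal sketch (my own illustration, not the authors’ code) of scoring a single oriented-edge template over a grid of image features in sliding-window fashion; the feature map shape, template shape and stride are illustrative assumptions.

```python
import numpy as np

def sliding_window_scores(feature_map, template, stride=1):
    """Scan a single oriented-edge template over a feature map.

    feature_map: (H, W, D) array of per-cell features (e.g. HOG cells).
    template:    (h, w, D) array -- the learned object/part template.
    Returns an array of detection scores, one per window position.
    """
    H, W, _ = feature_map.shape
    h, w, _ = template.shape
    scores = np.full(((H - h) // stride + 1, (W - w) // stride + 1), -np.inf)
    for i in range(0, H - h + 1, stride):
        for j in range(0, W - w + 1, stride):
            window = feature_map[i:i + h, j:j + w, :]
            scores[i // stride, j // stride] = np.sum(window * template)
    return scores

# In a full detector this scan is repeated over an image pyramid (several
# scales) and over every object template -- which is why the cost grows
# quickly with the number of object categories.
```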

Feature encoding

An alternative approach is to sample features over the image – these are then described, encoded and aggregated over the image or a region to provide a rich description of the object contained in it.

(It sounds a bit like playing the guessing game: “I’m round, white, and have a handle – what am I?” 😉 ).

The aggregated feature encodings lie as feature vectors in high-dimensional space, on which linear or kernel-based classification methods perform remarkably well. The most popular encoding schemes include Bag-of-Visual-Words (BoVW) [12, 31], and more recently Super-Vectors [35], VLAD [22], and Fisher Vectors [28].

VLAD and Fisher Vectors outperform Bag-of-Visual-Words and can be used as drop-in replacements. Monocular SLAM-supported object recognition makes use of VLAD…
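
As a rough illustration of what VLAD computes (a sketch, not the paper’s implementation): each local descriptor is assigned to its nearest codebook centre, and the residuals (descriptor minus centre) are summed per centre and concatenated into one long, normalised vector. The normalisation choices below are common practice rather than details taken from the paper.

```python
import numpy as np

def vlad_encode(descriptors, centres):
    """VLAD: aggregate residuals of local descriptors w.r.t. a codebook.

    descriptors: (N, D) local features (e.g. SIFT) from one image region.
    centres:     (K, D) codebook, typically learned with k-means.
    Returns a single (K * D,) VLAD vector describing the region.
    """
    K, D = centres.shape
    # Assign each descriptor to its nearest codebook centre.
    dists = np.linalg.norm(descriptors[:, None, :] - centres[None, :, :], axis=2)
    assignments = np.argmin(dists, axis=1)

    vlad = np.zeros((K, D))
    for k in range(K):
        members = descriptors[assignments == k]
        if len(members) > 0:
            vlad[k] = np.sum(members - centres[k], axis=0)

    vlad = vlad.flatten()
    # Signed square-root ("power") and L2 normalisation are common in practice.
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad
```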

Recognising Objects (Object Proposals)

Instead of scanning an entire image looking for object template matches, we can layer responsibilities. First run an algorithm that can detect candidate regions in an image that may contain likely objects, and then do specific object matching just within those regions. In other words, we can recognise that something is an object (“a category independent object proposal method”) without attempting to recognise what sort of object it is.

The object candidates proposed are category-independent, and achieve detection rates (DR) of 95-99% at 0.7 intersection-over-union (IoU) threshold, by generating about 1000-5000 candidate proposal windows. This dramatically reduces the search space for existing sliding-window approaches that scan templates over the entire image, and across multiple scales.

Intersection-over-Union (IoU) is a common technique to evaluate the quality of candidate object proposals with respect to ground truth. The intersection area of the ground truth bounding box and that of the candidate is divided by the union of their areas.
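
As a quick worked example (illustrative code, not from the paper), IoU for two axis-aligned bounding boxes can be computed like this:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A proposal counts as a detection when its IoU against the ground-truth box
# exceeds a threshold (0.7 for the detection rates quoted above).
```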

Scalable feature encoding

We can use this object proposal technique with feature encoding as well as with sliding-window templates. Bag-of-Visual-Words-style encoding schemes lack the ability to localize objects within an image, but combining them with category-independent object proposal methods creates a fast and accurate technique in which each object proposal window is described using a feature encoding method.

Multiple-view Object Detection

Lai et al. take an RGB-D stream and use HOG-based sliding-window detectors, trained from object views in the RGB-D dataset, to assign class probabilities to pixels in each frame of the stream. These are used to reconstruct a 3D scene and give improved object recognition performance and robustness. It does have a run-time of 20 minutes per image pair though, and only works with a limited set of object categories. Thus it is not suitable for on-line robot operation…

Monocular SLAM supported Object Recognition

Most object proposal strategies use either superpixel-based or edge-based representations to identify candidate proposal windows in a single image that may contain objects. Contrary to classical per-frame object proposal methodologies, robots observe the same instances of objects in their environment several times and from disparate viewpoints. It is natural to think of object proposals from a spatio-temporal or reconstructed 3D context, and a key realization is the added robustness that the temporal component provides in rejecting spatially inconsistent edge observations or candidate proposal regions.

The proposed solution builds on top of an existing monocular SLAM system called ORB-SLAM. A density-based segmentation of the reconstructed point cloud produces clusters sufficient to seed object proposals. These ‘object seeds’ are projected onto each of the camera views and used as a basis for further occlusion handling, refinement, and candidate object proposal generation.

Given the object proposals, the next stage uses dense Bag-of-Visual-Words with VLAD to extract features. This is done using SIFT (Scale-invariant feature transform) and the RGB values. The end result of this part of the process is a feature vector that appropriately describes the specific object contained within the candidate object proposal / bounding-box.
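
Concretely, the local descriptors fed into VLAD can be gathered by sampling SIFT on a regular grid over each proposal. Here’s a small sketch using OpenCV (the grid step and patch size are assumptions, and the paper also uses RGB values, which this sketch omits):

```python
import cv2

def dense_sift(image_bgr, step=8, size=8):
    """Densely sample SIFT descriptors on a regular grid (one common way to
    approximate 'dense SIFT' with OpenCV; step and size are illustrative)."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    h, w = gray.shape
    keypoints = [cv2.KeyPoint(float(x), float(y), size)
                 for y in range(step // 2, h, step)
                 for x in range(step // 2, w, step)]
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.compute(gray, keypoints)
    return descriptors  # (num_keypoints, 128), ready for VLAD encoding
```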

While it may be practical to describe a few object proposals in the scene with these encoding methods, it becomes highly impractical to do so as the number of object proposals grows. To this end, van de Sande et al. introduced FLAIR – an encoding mechanism that utilizes summed-area tables of histograms to enable fast descriptions for arbitrarily many boxes in the image. By constructing integral histograms for each code in the codebook, the histograms or descriptions for an arbitrary number of boxes B can be computed independently of their area. As shown in van de Sande et al., these descriptions can also be extended to the VLAD encoding technique. Additionally, FLAIR affords spatial pyramid binning rather naturally, requiring only a few additional table look-ups, while remaining independent of the area of B.
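
The core trick is the integral (summed-area) histogram: build one cumulative table per codeword once, after which the histogram of any box costs four look-ups per codeword regardless of the box’s size. A minimal sketch (my own illustration, not FLAIR itself):

```python
import numpy as np

def build_integral_histograms(assignment_map, num_codes):
    """assignment_map: (H, W) array giving, for each densely-sampled location,
    the index of the codeword its descriptor was assigned to.
    Returns an (H+1, W+1, K) summed-area table, one channel per codeword."""
    H, W = assignment_map.shape
    one_hot = np.zeros((H, W, num_codes))
    one_hot[np.arange(H)[:, None], np.arange(W)[None, :], assignment_map] = 1.0
    # Cumulative sums along both spatial axes, padded with a zero border.
    integral = np.zeros((H + 1, W + 1, num_codes))
    integral[1:, 1:, :] = one_hot.cumsum(axis=0).cumsum(axis=1)
    return integral

def box_histogram(integral, x1, y1, x2, y2):
    """Codeword histogram of the box [x1, x2) x [y1, y2): four look-ups per
    codeword, independent of the box area."""
    return (integral[y2, x2] - integral[y1, x2]
            - integral[y2, x1] + integral[y1, x1])
```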

A linear classifier is trained using VLAD descriptions and Stochastic Gradient Descent.
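
Training such a classifier is routine with off-the-shelf tools; for example, with scikit-learn’s SGDClassifier (a stand-in sketch, not the paper’s own training pipeline, with placeholder data and dimensions):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# X: one VLAD vector per training proposal, y: its object class label.
# Shapes here are placeholders; a real VLAD vector is K * D dimensional.
X = np.random.rand(1000, 4096)
y = np.random.randint(0, 10, size=1000)

# A linear SVM (hinge loss) trained with stochastic gradient descent.
clf = SGDClassifier(loss="hinge", alpha=1e-4, max_iter=1000)
clf.fit(X, y)

# At recognition time, each candidate proposal's VLAD vector is scored
# against all classes; decision_function gives per-class margins.
scores = clf.decision_function(np.random.rand(5, 4096))
print(scores.shape)  # (5, n_classes)
```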

To compose the multi-view object recognition, the ORB-SLAM-based mapping solution processes a continuous image stream to recover a scale-ambiguous map M, keyframes K, and poses (robot positions) ε, one for each frame in the input stream. The object seeds discovered are then projected back into each individual frame using the known projection matrix derived from the corresponding viewpoint εi.
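
This back-projection step is ordinary pinhole-camera geometry. A sketch (illustrative, assuming known camera intrinsics and a world-to-camera pose recovered by SLAM):

```python
import numpy as np

def project_points(points_world, K_intrinsics, R, t):
    """Project 3D object-seed points into one camera view.

    points_world: (N, 3) points from the SLAM point cloud (scale-ambiguous).
    K_intrinsics: (3, 3) camera intrinsics matrix.
    R, t:         world-to-camera rotation (3, 3) and translation (3,),
                  i.e. the pose recovered for this frame.
    Returns (N, 2) pixel coordinates.
    """
    points_cam = points_world @ R.T + t          # into the camera frame
    pixels_h = points_cam @ K_intrinsics.T       # apply intrinsics
    return pixels_h[:, :2] / pixels_h[:, 2:3]    # perspective divide

# The 2D bounding box of a projected seed cluster (after occlusion handling
# and refinement) becomes a candidate object proposal in that frame.
```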

Using these as candidate object proposals, we evaluate our detector on each of the O object clusters, per image, providing probability estimates of belonging to one of the C object classes or categories. Thus, the maximum-likelihood estimate of the object o ∈ O can be formalized as maximizing the data-likelihood term for all observable viewpoints (assuming uniform prior across the C classes).
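
In other words (my reading of the formulation, with notation lightly adapted): if x_o^i denotes the features of object o as seen from viewpoint εi, the per-view class likelihoods are simply multiplied together:

```latex
% Maximum-likelihood class estimate for object o, fusing all K viewpoints in
% which it is visible. The equality assumes a uniform prior over the C classes
% and conditional independence of the per-view observations given the class.
\hat{c}_o = \arg\max_{c \in C} \; p(c \mid x_o^{1}, \dots, x_o^{K})
          = \arg\max_{c \in C} \; \prod_{i=1}^{K} p(x_o^{i} \mid c)
```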