What is R-CNN? – Summarizing Regions with CNN Features

Original Paper — Rich feature hierarchies for accurate object detection and semantic segmentation

Object detection and image segmentation are two of the most heavily researched topics in computer vision. In this article, we look at one of the famous yet ‘old’ object detection algorithms: R-CNN. It was first proposed in a paper by Girshick et al. in 2013 and quickly became influential.

Overview

The system consists of several major modules, and their combination outperformed the state of the art (SOTA) of the time.

Major Modules

The first module, the Region Proposal module, generates category-independent region proposals. Several region proposal methods are available; the authors chose Selective Search (fast mode) so that their results could be compared with prior work. R-CNN itself, however, is agnostic to the proposal method.

The second module, the Feature Extraction module, uses a CNN to extract features from each proposed region of interest. The image data in a proposed region is first converted into a form compatible with the CNN architecture: regardless of the region's size or aspect ratio, all pixels in a tight bounding box around it are warped to the required input size.
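
The warping step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `warp_region`, the nested-list image representation, and the nearest-neighbour sampling are all assumptions made to keep the example dependency-free.

```python
def warp_region(image, box, out_size=227):
    """Crop `box` from `image` and stretch it to out_size x out_size.

    image: list of rows, each row a list of pixels (hypothetical layout).
    box:   (x1, y1, x2, y2) with x2 and y2 exclusive.
    Uses nearest-neighbour sampling and deliberately ignores aspect ratio.
    """
    x1, y1, x2, y2 = box
    crop_w, crop_h = x2 - x1, y2 - y1
    if crop_w <= 0 or crop_h <= 0:
        raise ValueError("box must have positive width and height")
    warped = []
    for out_y in range(out_size):
        # Map each output row back to a source row inside the crop.
        src_y = y1 + out_y * crop_h // out_size
        row = []
        for out_x in range(out_size):
            # Map each output column back to a source column inside the crop.
            src_x = x1 + out_x * crop_w // out_size
            row.append(image[src_y][src_x])
        warped.append(row)
    return warped


# A 40x30 toy image whose pixels record their own coordinates.
image = [[(x, y, 0) for x in range(40)] for y in range(30)]
warped = warp_region(image, box=(5, 10, 25, 20))
print(len(warped), len(warped[0]))
```

A 20×10 region and a 10×20 region both come out as 227×227, which is the point: the CNN sees a fixed-size input whatever the shape of the proposal.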

The features are extracted by forward propagating a mean-subtracted 227×227 RGB image through five convolutional layers and two fully connected layers.
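
Mean subtraction itself is a simple per-pixel operation. The sketch below assumes a single mean value per colour channel; the article does not say whether the mean is per channel or per pixel position, and the function name `subtract_mean` and the mean values are hypothetical.

```python
def subtract_mean(image, channel_mean):
    """Subtract a per-channel mean from every (r, g, b) pixel.

    image:        list of rows, each row a list of (r, g, b) tuples.
    channel_mean: (mean_r, mean_g, mean_b); assumed per-channel mean.
    """
    return [
        [tuple(value - mean for value, mean in zip(pixel, channel_mean))
         for pixel in row]
        for row in image
    ]


# Placeholder mean values, chosen only for illustration.
centered = subtract_mean([[(120, 110, 100)]], channel_mean=(100, 100, 100))
print(centered)
```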


Prediction

At test time, Selective Search is run on each test image to extract around 2000 region proposals, and each proposal is forward propagated through the CNN to obtain a feature vector. Each feature vector is then scored by a set of SVMs, one trained per object class, and each SVM decides whether the region belongs to its class. What happens when several scored regions in an image overlap? In such cases, greedy non-maximum suppression is applied: a region is rejected if its intersection-over-union (IoU) overlap with a higher-scoring selected region is larger than a learned threshold.
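
Greedy non-maximum suppression is easy to sketch for a single class. The function names `iou` and `greedy_nms` are illustrative, and the default `iou_threshold` of 0.3 is a placeholder: the article only says that the threshold is learned.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.3):
    """Return indices of the boxes that survive, highest score first.

    iou_threshold is a placeholder value, not the learned one.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept,
        # higher-scoring box by more than the threshold.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores))
```

In this toy input the second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the third box does not overlap anything and is kept.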

Training

  1. Pretraining: The authors discriminatively pre-trained the CNN on a large auxiliary dataset using image-level annotations only.
  2. Domain-specific fine-tuning: Unlike the pretraining stage, the task here was object detection and the inputs were warped region proposal windows. To adapt the CNN to both, the authors continued SGD training of the CNN parameters using only warped region proposals. Apart from replacing the CNN's classification layer, everything was left unchanged.

Results

  1. On the PASCAL VOC 2010–12 datasets, R-CNN achieved an mAP of ~53%.
  2. On the 200-class ILSVRC2013 dataset, R-CNN achieved an mAP of 31.4%, about 7 percentage points above the second-best result of 24.3%.
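
The ILSVRC2013 gap can be read in two ways, and the short calculation below makes the difference explicit using only the two mAP figures quoted above. The variable names are illustrative.

```python
rcnn_map = 31.4        # R-CNN mAP on ILSVRC2013, in percent
runner_up_map = 24.3   # second-best mAP, in percent

# Absolute gap in percentage points.
absolute_gain = rcnn_map - runner_up_map
# Relative improvement over the runner-up, in percent.
relative_gain = absolute_gain / runner_up_map * 100

print(f"{absolute_gain:.1f} percentage points, {relative_gain:.0f}% relative")
```

The absolute gap is about 7.1 percentage points, which corresponds to a relative improvement of roughly 29% over the runner-up.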

