What is a Feature Pyramid Network (FPN)?
This post explains what a Feature Pyramid Network (FPN) is, from a beginner’s point of view. Having learnt about Convolutional Neural Networks (CNNs) and how they can be used as feature extractors, I was very confused when I first came across Feature Pyramid Networks. What is the “multi-scale” that is so often mentioned in conjunction with FPN? What is the relationship between FPN and multi-scale features? Does changing the scale in FPN mean preprocessing the input image to different scales?
So I delved in and found answers to these questions, which I am about to share here.
First, let’s understand what multi-scale is…
In the above image, we can see large-scale edges and small-scale textures. Details like these, which appear at different scales within an image, are called multi-scale features. Multi-scale is not the same as changing resolution or size, although the two concepts are often used interchangeably when talking about convolutional layers.
Searching across scales is also the idea behind David Lowe’s SIFT keypoint detection. An image pyramid is constructed by smoothing the image with Gaussian filters, and taking the Difference of Gaussians (DoG) between neighbouring blur levels gives a detail image. The Gaussian image is then downsampled and filtered again to form the next octave. Finally, keypoints are detected across different scales and positions.
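To make the pyramid idea concrete, here is a minimal sketch of a Gaussian and DoG pyramid in plain Python. It is a simplification and not Lowe’s actual procedure: it keeps only one DoG image per octave, and the function names, the values of sigma and k, and the toy 16×16 image are all my own assumptions for illustration.

```python
import math

def gaussian_kernel(sigma):
    """1-D Gaussian kernel, truncated at about 3 sigma and normalised to sum to 1."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    out = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xi = min(max(x + k - radius, 0), width - 1)
                acc += w * row[xi]
            new_row.append(acc)
        out.append(new_row)
    return out

def transpose(image):
    return [list(col) for col in zip(*image)]

def gaussian_blur(image, sigma):
    """Separable 2-D Gaussian blur: filter the rows, then the columns."""
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))

def downsample(image):
    """Keep every second pixel in each direction."""
    return [row[::2] for row in image[::2]]

def subtract(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def dog_pyramid(image, octaves=3, sigma=1.0, k=1.6):
    """Return (gaussian_levels, dog_levels) with one entry per octave.

    sigma and k are illustrative values, not Lowe's settings.
    """
    gaussians, dogs = [], []
    current = image
    for _ in range(octaves):
        g1 = gaussian_blur(current, sigma)
        g2 = gaussian_blur(current, sigma * k)
        gaussians.append(g1)
        dogs.append(subtract(g2, g1))   # the "detail" image for this octave
        current = downsample(g2)        # the next octave starts at half the size
    return gaussians, dogs

# Toy input: a bright 8x8 square on a dark 16x16 background.
image = [[1.0 if 4 <= y < 12 and 4 <= x < 12 else 0.0 for x in range(16)]
         for y in range(16)]
gaussians, dogs = dog_pyramid(image)
sizes = [(len(level), len(level[0])) for level in gaussians]
print(sizes)  # [(16, 16), (8, 8), (4, 4)]
```

Each octave halves the image, so the same filter covers a larger part of the scene at every step. That is the sense in which the pyramid searches across scales.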
So how does this concept translate to FPNs?
Well, in a deep CNN, after a few layers of convolutional filtering we apply max pooling. This is similar to filtering and downsampling in traditional computer vision, except that we do not know in advance what kind of filter each kernel will learn. The prediction is then made from the downsampled feature maps at the end of the network. One drawback of this design is that much of the fine detail from the early, high-resolution layers is lost by the time we reach those final maps. Hence large objects are detected well, whereas smaller objects are often missed by the network.
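A little arithmetic shows why small objects suffer. The sketch below assumes a 224-pixel-wide input and feature maps at strides 4, 8, 16 and 32. These numbers are hypothetical, chosen because they are typical of common backbones, and `cells_spanned` is just a helper I made up for the illustration.

```python
def cells_spanned(object_px, stride):
    """Approximate width of an object, measured in feature-map cells."""
    return object_px / stride

input_px = 224              # assumed input width in pixels
strides = [4, 8, 16, 32]    # assumed downsampling factors of successive stages
small_px, large_px = 16, 128

# One row per stage: (stride, feature-map width, small object, large object)
table = [
    (s, input_px // s, cells_spanned(small_px, s), cells_spanned(large_px, s))
    for s in strides
]
for stride, fmap, small, large in table:
    print(f"stride {stride:2d}: map {fmap:2d} wide, "
          f"small object = {small} cells, large object = {large} cells")
```

At stride 32 the 16-pixel object covers only half a cell of the final feature map, while the 128-pixel object still covers four cells.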
At this juncture, the paper Feature Pyramid Networks for Object Detection, by Facebook AI Research (FAIR), Cornell University and Cornell Tech, showed how feature pyramids can be built from CNN feature maps to detect objects at different scales.
Although one can rescale the image to different sizes, feed each copy to the network, and build a pyramid from the extracted features, this approach increases inference time. Training networks end-to-end on image pyramids is also very demanding on memory.
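As a rough back-of-the-envelope check, assume the cost of running a convolutional network grows about linearly with the number of input pixels. Under that assumption, and with a hypothetical set of three scales, the image-pyramid approach costs the following relative to a single pass on the original image:

```python
def relative_cost(scales):
    """Total pixel count over all rescaled copies, relative to the original image."""
    return sum(s * s for s in scales)

scales = [0.5, 1.0, 2.0]    # hypothetical image-pyramid scales
cost = relative_cost(scales)
print(cost)  # 5.25
```

So this three-level image pyramid is roughly five times as expensive as a single forward pass, most of it spent on the enlarged copy.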
Thus, utilising multi-scale features means using feature maps from different scales to build a “feature” pyramid and make the final predictions. To do that, the coarse, downsampled feature map at the top of the CNN’s bottom-up pathway is upsampled and merged with the bottom-up feature map of the same spatial size. In the FPN paper this merge is an element-wise addition through a lateral 1×1 convolution rather than a concatenation. This brings back the small-scale detail from the earlier layers that was lost during downsampling, and predictions are made at every level of the pyramid. Hence, the network can now detect small-scale objects effectively too.
To conclude, a Feature Pyramid Network (FPN) is a deep convolutional neural network that makes predictions from “feature pyramids”, built from feature maps rather than images, at multiple scales.
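The top-down merge can be sketched in plain Python as follows. This is a toy version under stated assumptions, not the paper’s implementation: the feature maps are tiny constant-valued lists, the lateral weights are hand-picked instead of learned, and the 3×3 convolution that the paper applies to each merged map is left out. All function names here are my own.

```python
def conv1x1(fmap, weights):
    """1x1 convolution: mix channels at every pixel.

    fmap is [C_in][H][W]; weights is [C_out][C_in].
    """
    height, width = len(fmap[0]), len(fmap[0][0])
    return [
        [[sum(w * fmap[c][y][x] for c, w in enumerate(row_w)) for x in range(width)]
         for y in range(height)]
        for row_w in weights
    ]

def upsample2x(fmap):
    """Nearest-neighbour upsampling by a factor of 2 in height and width."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]   # repeat each column
            rows.append(wide)
            rows.append(list(wide))                     # repeat each row
        out.append(rows)
    return out

def add(a, b):
    """Element-wise sum of two [C][H][W] maps of identical shape."""
    return [[[x + y for x, y in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]

def top_down(bottom_up, lateral_weights):
    """Build the pyramid from bottom-up maps listed finest first."""
    laterals = [conv1x1(f, w) for f, w in zip(bottom_up, lateral_weights)]
    pyramid = [laterals[-1]]                    # start from the coarsest map
    for lateral in reversed(laterals[:-1]):
        pyramid.insert(0, add(lateral, upsample2x(pyramid[0])))
    return pyramid

def constant(channels, size, value):
    return [[[value] * size for _ in range(size)] for _ in range(channels)]

# Toy bottom-up maps: finer maps have fewer channels, coarser maps have more.
bottom_up = [constant(1, 8, 1.0), constant(2, 4, 1.0), constant(4, 2, 1.0)]
# Hand-picked lateral weights that project every level to 2 channels.
lateral_weights = [
    [[1.0], [1.0]],
    [[0.5, 0.5], [0.5, 0.5]],
    [[0.25] * 4, [0.25] * 4],
]
pyramid = top_down(bottom_up, lateral_weights)
shapes = [(len(p), len(p[0]), len(p[0][0])) for p in pyramid]
print(shapes)                                   # [(2, 8, 8), (2, 4, 4), (2, 2, 2)]
print([p[0][0][0] for p in pyramid])            # [3.0, 2.0, 1.0]
```

Every level ends up with the same number of channels, and the finest level accumulates contributions from all the coarser ones. That is how high-level information reaches the high-resolution maps.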
References
T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature Pyramid Networks for Object Detection,” CVPR, 2017
D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, 60, 2, 2004, pp. 91–110