Single Stage Instance Segmentation — A Review

Instance segmentation is a challenging computer vision task that requires predicting object instances together with a per-pixel segmentation mask for each of them. This makes it a hybrid of semantic segmentation and object detection.

Ever since Mask R-CNN was introduced, the state-of-the-art methods for instance segmentation have largely been Mask R-CNN and its variants (PANet, Mask Scoring R-CNN, etc.). It adopts the detect-then-segment approach: it first performs object detection to extract a bounding box around each object instance, and then performs binary segmentation inside each bounding box to separate the foreground (object) from the background.
However, Mask R-CNN is quite slow, which precludes its use in many real-time applications. In addition, masks predicted by Mask R-CNN have a fixed resolution and are therefore not fine enough for large objects with complex shapes. There has been a wave of studies on single-stage instance segmentation, fueled by advances in anchor-free object detection methods (such as CenterNet and FCOS; see my slides for a quick intro to anchor-free object detection). Many of these methods are faster and more accurate than Mask R-CNN, as shown in the image below.

Inference time of recent one-stage methods tested on a Tesla V100 GPU (source)

This post reviews recent advances in single-stage instance segmentation, with a focus on mask representation, a key aspect of instance segmentation.
Local Mask and Global Mask
One core question in instance segmentation is how to represent or parameterize instance masks: 1) whether to use local masks or global masks, and 2) how to represent/parameterize the mask.

Mask representation: Local Masks and Global Masks

There are largely two ways to represent an instance mask: local masks and global masks. A global mask is what we ultimately want. It has the same spatial extent as the input image, although its resolution may be lower, such as 1/4 or 1/8 of the original image. It has the natural advantage of having the same resolution (and thus fixed-length features) for big and small objects. This does not sacrifice resolution for bigger objects, and the fixed resolution lends itself to batching during optimization. A local mask is usually more compact in the sense that it does not carry the excessive empty border area that a global mask does. It has to be combined with the mask location to recover the global mask, and the local mask size depends on the object size. But to perform effective batching, instance masks require a fixed-length parameterization. The simplest solution is by resizing instance masks to a fixed image resolution, as adopted...
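To make the local/global distinction concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the implementation of any particular paper: the helper names (resize_nearest, global_to_local, local_to_global), the 28x28 local size, the 64x64 image, and the nearest-neighbor resizing are all choices made for this example. It crops a global mask to its box, resizes it to a fixed-size local mask, and pastes it back into the image frame.

```python
import numpy as np

def resize_nearest(mask, out_h, out_w):
    """Nearest-neighbor resize of a 2-D mask (illustrative helper)."""
    in_h, in_w = mask.shape
    rows = np.arange(out_h) * in_h // out_h  # source row for each output row
    cols = np.arange(out_w) * in_w // out_w  # source col for each output col
    return mask[rows[:, None], cols[None, :]]

def global_to_local(global_mask, box, size=28):
    """Crop box = (x0, y0, x1, y1) and resize to a fixed size x size local mask."""
    x0, y0, x1, y1 = box
    return resize_nearest(global_mask[y0:y1, x0:x1], size, size)

def local_to_global(local_mask, box, image_hw):
    """Resize the local mask back to the box size and paste it into an empty image."""
    x0, y0, x1, y1 = box
    out = np.zeros(image_hw, dtype=local_mask.dtype)
    out[y0:y1, x0:x1] = resize_nearest(local_mask, y1 - y0, x1 - x0)
    return out

def iou(a, b):
    """Intersection over union of two binary masks."""
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

H, W = 64, 64

# Small object: an L-shape inside a 14x14 box (smaller than the local mask).
small = np.zeros((H, W), dtype=np.uint8)
small[10:24, 10:14] = 1
small[20:24, 10:24] = 1
small_box = (10, 10, 24, 24)

# Large object: a comb with 1-pixel-wide teeth inside a 56x56 box.
large = np.zeros((H, W), dtype=np.uint8)
large[4:60, 4:60:2] = 1
large_box = (4, 4, 60, 60)

local_small = global_to_local(small, small_box)  # fixed 28x28
local_large = global_to_local(large, large_box)  # fixed 28x28

rec_small = local_to_global(local_small, small_box, (H, W))
rec_large = local_to_global(local_large, large_box, (H, W))

print("small IoU:", iou(small, rec_small))
print("large IoU:", iou(large, rec_large))
```

Both local masks have the same 28x28 shape, so they batch easily, but the box location is needed to place them back in the image. In this toy setup the small object survives the round trip unchanged, while the large comb loses its thin teeth when squeezed into 28x28, which is the resolution loss for large objects described above.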