Dai et al. propose Multi-task Network Cascades (MNC) for instance-aware semantic segmentation (without stuff categories, however). Their approach decomposes the task into three subtasks: differentiating instances (i.e. detecting instances), estimating masks, and categorizing objects. While the proposed approach demonstrates impressive performance on PASCAL VOC and MS COCO, it is unclear whether this can be fully attributed to the decomposition. It may also be possible to combine object detection with object classification …
While detailed figures of the used network architectures are not provided, Figure 1 shows the general structure of their approach, defining input and output of the three subtasks. Stage 1 is trained to predict box proposals; network architecture and employed loss follow the work on Region Proposal Networks. The second stage, given the object proposals as input, employs a differentiable version of Region of Interest pooling to extract features used for mask prediction. The third stage takes the masks as input, uses Region of Interest pooling to extract features, masks these features with the predicted masks, and predicts object classes. The final loss can then be written as
$L(\theta) = L_1(B(\theta)) + L_2(M(\theta)|B(\theta)) + L_3(C(\theta)|B(\theta),M(\theta))$
where $L_i$ denotes the loss of stage $i$, $B$ are the proposals predicted in stage 1, $M$ the masks predicted in stage 2, and $C$ the class predictions of stage 3. End-to-end training is challenging as the later stages depend on the outputs of earlier stages. In particular, this requires a differentiable Region of Interest pooling layer, used to extract the features for stages 2 and 3, which needs to be differentiable with respect to the bounding box or mask input. Details can be found in the paper.
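The key idea behind differentiability with respect to the box coordinates is to replace max pooling over hard bins by bilinear sampling at box-dependent positions, so every output value is a smooth function of the box. A minimal numpy sketch of this crop-and-warp operation (function name, grid size, and single-channel input are my simplifications, not the paper's implementation):

```python
import numpy as np

def roi_warp(feature_map, box, out_size=3):
    """Bilinearly sample an out_size x out_size grid inside `box`.

    Because the sample positions depend smoothly on (x0, y0, x1, y1),
    gradients of the output with respect to the box coordinates exist,
    which is what allows stages 2 and 3 to be trained end-to-end.
    """
    x0, y0, x1, y1 = box
    H, W = feature_map.shape
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            # sample point in continuous feature-map coordinates (bin center)
            y = y0 + (y1 - y0) * (i + 0.5) / out_size
            x = x0 + (x1 - x0) * (j + 0.5) / out_size
            # corners of the surrounding pixel cell, clamped to the map
            y_lo = min(int(np.floor(y)), H - 1)
            x_lo = min(int(np.floor(x)), W - 1)
            y_hi = min(y_lo + 1, H - 1)
            x_hi = min(x_lo + 1, W - 1)
            wy, wx = y - y_lo, x - x_lo
            # bilinear interpolation of the four neighboring values
            out[i, j] = ((1 - wy) * (1 - wx) * feature_map[y_lo, x_lo]
                         + (1 - wy) * wx * feature_map[y_lo, x_hi]
                         + wy * (1 - wx) * feature_map[y_hi, x_lo]
                         + wy * wx * feature_map[y_hi, x_hi])
    return out
```

On a feature map that is linear in $x$, the warped output reproduces the sampled $x$ coordinates exactly, which makes the smooth dependence on the box easy to verify numerically.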
On top of the third stage, they additionally perform class-wise bounding box regression. The regressed bounding boxes are then again fed into stages 2 and 3. In total, this results in a 5-stage model as depicted in Figure 2.
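The unrolled 5-stage structure can be sketched as a simple pipeline in which stages 4 and 5 re-run the mask and classification stages on the regressed boxes. All stage functions below are dummy stand-ins with made-up shapes (e.g. 21 classes as in PASCAL VOC plus background), not the paper's networks; only the data flow between stages is meant to be faithful:

```python
import numpy as np

def propose_boxes(feats):
    # stage 1: box proposals B (dummy: one fixed box)
    return np.array([[0.0, 0.0, 4.0, 4.0]])

def predict_masks(feats, boxes):
    # stages 2 and 4: one mask per box (dummy: all-ones 4x4 masks)
    return np.ones((len(boxes), 4, 4))

def classify_and_regress(feats, boxes, masks):
    # stages 3 and 5: class scores plus class-wise box regression
    scores = np.ones((len(boxes), 21))   # 20 classes + background (assumed)
    refined = boxes + 0.5                # dummy regression offset
    return scores, refined

def run_cascade(feats):
    B = propose_boxes(feats)                      # stage 1
    M = predict_masks(feats, B)                   # stage 2
    _, B2 = classify_and_regress(feats, B, M)     # stage 3 (+ box regression)
    M2 = predict_masks(feats, B2)                 # stage 4: masks on refined boxes
    C2, _ = classify_and_regress(feats, B2, M2)   # stage 5: final classification
    return B2, M2, C2
```

The sketch makes explicit why shared features pay off: all five stages consume the same `feats`, so the backbone is evaluated once per image rather than once per stage.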
They provide quantitative and qualitative results on PASCAL VOC and MS COCO, as well as an ablation study showing that end-to-end training of all 5 stages with shared features indeed improves performance over multi-step/multi-network training without shared features. They also outperform related work [13,14,7] on PASCAL VOC. Qualitative results are shown in Figure 3.