Tulsiani et al. train convolutional neural networks (CNNs) without supervision to represent 3D shapes using shape primitives such as cuboids. Figure 1 illustrates the idea, showing how 3D shapes are approximated by a set of cuboids. The main contribution is learning to predict these cuboids with a CNN in an unsupervised fashion, i.e. without explicit ground truth for the cuboids. The key idea is to define a coverage loss and a consistency loss, which together express how well the set of cuboids approximates the real shape.
The coverage loss encourages the input shape to lie completely within the predicted set of cuboids. To this end, Tulsiani et al. define a distance transform on the predicted cuboids. For each point on the input shape, the distance to the closest point on the set of cuboids should be zero (i.e. the point lies within at least one cuboid).
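To make the coverage idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes each cuboid is given as a tuple of half-extents, a 3x3 rotation matrix and a translation vector, and that the input shape is available as a list of sampled points. The names `to_local`, `sq_dist_to_cuboid` and `coverage_loss` are made up for this illustration.

```python
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def to_local(point, rotation, translation):
    """Map a world-space point into a cuboid's local frame: R^T (p - t)."""
    d = [point[i] - translation[i] for i in range(3)]
    # Entry j of R^T d is the dot product of column j of R with d.
    return [sum(rotation[i][j] * d[i] for i in range(3)) for j in range(3)]


def sq_dist_to_cuboid(point, half_extents, rotation, translation):
    """Squared distance from a point to a solid cuboid (zero if inside)."""
    local = to_local(point, rotation, translation)
    # In the local frame the cuboid is axis-aligned, so only the part of
    # each coordinate that exceeds the half-extent contributes.
    return sum(max(abs(c) - h, 0.0) ** 2 for c, h in zip(local, half_extents))


def coverage_loss(shape_points, cuboids):
    """Mean squared distance from each shape point to its closest cuboid."""
    total = 0.0
    for p in shape_points:
        total += min(sq_dist_to_cuboid(p, *cuboid) for cuboid in cuboids)
    return total / len(shape_points)


# Assumed toy data: one unit cuboid at the origin, two shape points.
cuboids = [((1.0, 1.0, 1.0), IDENTITY, (0.0, 0.0, 0.0))]
points = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(coverage_loss(points, cuboids))  # 0.5
```

The first point lies inside the cuboid and contributes nothing; the second lies one unit outside and contributes a squared distance of 1, so the mean is 0.5. The loss only reaches zero once every shape point is covered by some cuboid.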
The consistency loss, in contrast, tries to ensure that the predicted cuboids lie completely within the input shape. As with the coverage loss, this is done by expecting the distance from randomly sampled points on the cuboids to the closest point on the input shape to be zero.
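The consistency loss can be sketched in the same way. The code below is only an illustration under several assumptions. The input shape is replaced by a solid unit sphere with an analytic distance function (`solid_sphere_distance`), cuboids are again given as half-extents, rotation matrix and translation, and surface points are drawn by picking a random face, which is a simplification and not necessarily the sampling scheme used in the paper. All function names are hypothetical.

```python
import math
import random

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def sample_cuboid_surface(half_extents, rotation, translation, n, rng):
    """Draw n points on the cuboid surface and return them in world space."""
    points = []
    for _ in range(n):
        # Start with a random point inside the cuboid (local frame) ...
        local = [rng.uniform(-h, h) for h in half_extents]
        # ... then push one coordinate onto a randomly chosen face.
        axis = rng.randrange(3)
        local[axis] = rng.choice((-1.0, 1.0)) * half_extents[axis]
        # World-space position: R * local + t.
        world = [
            sum(rotation[i][j] * local[j] for j in range(3)) + translation[i]
            for i in range(3)
        ]
        points.append(world)
    return points


def solid_sphere_distance(point, radius=1.0):
    """Distance to a solid sphere at the origin (zero for points inside)."""
    return max(math.sqrt(sum(c * c for c in point)) - radius, 0.0)


def consistency_loss(cuboids, shape_distance, n_samples=100, seed=0):
    """Mean squared distance from sampled cuboid surface points to the shape."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    for half_extents, rotation, translation in cuboids:
        samples = sample_cuboid_surface(
            half_extents, rotation, translation, n_samples, rng
        )
        for p in samples:
            total += shape_distance(p) ** 2
            count += 1
    return total / count


# A small cuboid fits inside the unit sphere, so it incurs no loss.
small = [((0.5, 0.5, 0.5), IDENTITY, (0.0, 0.0, 0.0))]
print(consistency_loss(small, solid_sphere_distance))  # 0.0
```

A cuboid that sticks out of the shape, for example one with half-extents of 2 in every direction, produces a positive loss because its surface samples lie outside the sphere. Taken together, the two losses pull in opposite directions: coverage grows the cuboids until they contain the shape, and consistency shrinks them until they stay inside it.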
The descriptions above omit the details (see the paper for the corresponding definitions). It is interesting, however, that Tulsiani et al. predict up to $M$ cuboids: for each cuboid, the network outputs its parameters, a rotation matrix and a translation vector. In addition, the network predicts a score for each cuboid indicating how probable it is that the cuboid is part of the abstraction.
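The per-cuboid outputs can be pictured as a small record, as in the sketch below. The field names, the use of half-extents as the cuboid parameters, and the fixed threshold of 0.5 for deciding which cuboids to keep are all assumptions made for illustration. The paper defines how the predicted probability is actually used during training and inference.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class CuboidPrediction:
    """One of the M cuboids predicted by the network (hypothetical layout)."""
    half_extents: Sequence[float]          # cuboid parameters (size)
    rotation: Sequence[Sequence[float]]    # 3x3 rotation matrix
    translation: Sequence[float]           # 3D translation vector
    probability: float                     # score: is the cuboid part of the abstraction?


def select_cuboids(
    predictions: List[CuboidPrediction], threshold: float = 0.5
) -> List[CuboidPrediction]:
    """Keep the cuboids whose predicted probability reaches the threshold."""
    return [c for c in predictions if c.probability >= threshold]


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Assumed example with M = 3 predicted cuboids.
predictions = [
    CuboidPrediction((1.0, 0.5, 0.5), IDENTITY, (0.0, 0.0, 0.0), 0.9),
    CuboidPrediction((0.2, 0.2, 0.2), IDENTITY, (1.0, 0.0, 0.0), 0.2),
    CuboidPrediction((0.5, 0.5, 1.0), IDENTITY, (0.0, 1.0, 0.0), 0.5),
]
print(len(select_cuboids(predictions)))  # 2
```

Because the network always outputs $M$ candidates, the probability is what lets simple shapes be abstracted with fewer cuboids than complex ones.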