Minerva Elements Records

Deep Learning Approaches for 3D Inference from Monocular Vision
Jack, D; Maire, F; Eriksson, A; Denman, S (2020)
Deep learning has driven significant advances in computer vision over the last decade. This thesis examines two problems involving 3D inference from 2D inputs: human pose estimation and single-view object reconstruction. Each of our methods uses a different 3D representation chosen to play to its strengths, including keypoints, occupancy grids, deformable meshes and point clouds. We additionally investigate methods for learning directly from unstructured 3D data, including point clouds and event streams.

In particular, we target applications on moderately sized mobile robotics platforms with modest on-board computational power. We prioritize methods that run in real time with a relatively low memory footprint and power usage, rather than those tuned purely for accuracy.

Our first contribution addresses 2D-to-3D human pose keypoint lifting, i.e. inferring a 3D human pose from 2D keypoints. We use a generative adversarial network to learn a latent space of feasible 3D poses, and at inference time we optimize over this latent space to find the code whose corresponding 3D pose is most consistent with the 2D observation under a known camera model. This yields competitive accuracy with a very small generator model.

Our second contribution addresses single-view object reconstruction using deformable mesh models. Based on an input image, we learn to simultaneously select a template mesh from a small number of candidates and infer a continuous deformation to apply to it.

We tackle both problems of human pose estimation and single-view object reconstruction in our third contribution.
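The inference-time latent optimization of the first contribution can be sketched generically as follows. Everything here is an illustrative stand-in: the "generator" is a fixed random linear map rather than a trained GAN, the camera is orthographic, and all shapes and names are hypothetical.

```python
import numpy as np

# Toy sketch of inference-time latent optimization (all shapes and the
# linear "generator" below are illustrative stand-ins, not the thesis model).
rng = np.random.default_rng(0)
J, D = 4, 6                      # joints and latent dimension (hypothetical)
W = rng.normal(size=(3 * J, D))  # stand-in for a trained GAN generator

def generate(z):
    """Decode a latent code into a (J, 3) pose."""
    return (W @ z).reshape(J, 3)

def project(pose3d):
    """Known camera model; orthographic projection for simplicity."""
    return pose3d[:, :2]

def reprojection_loss(z, kp2d):
    return float(np.sum((project(generate(z)) - kp2d) ** 2))

def lift(kp2d, steps=2000, lr=0.02):
    """Gradient descent on z so the projected decoded pose matches kp2d.
    The toy generator is linear, so the gradient is analytic; a real
    model would obtain it via automatic differentiation."""
    z = np.zeros(D)
    for _ in range(steps):
        residual = project(generate(z)) - kp2d  # (J, 2)
        full = np.zeros((J, 3))
        full[:, :2] = residual                  # depth has no direct 2D evidence
        z -= lr * (2.0 * W.T @ full.reshape(-1))
    return z

# Recover a latent code consistent with the 2D projection of a synthetic pose.
kp2d = project(generate(rng.normal(size=D)))
z_hat = lift(kp2d)
```

The point of the construction is that the search happens in the generator's latent space, so any recovered pose is feasible by construction; only the reprojection error drives the optimization.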
Through a reformulation of the model from our first contribution, we combine multiple separate optimization steps into a single multi-level optimization problem that accounts for both the feasibility of the 3D representation and its consistency with observed 2D features. We show that approximate solutions to the inner optimization can be expressed as a learnable layer, and propose problem-specific networks we call Inverse Graphics Energy Networks (IGE-Nets). For human pose estimation, we achieve results comparable to benchmark deep learning models with a fraction of the operations and memory footprint, while our voxel-based object reconstruction model achieves state-of-the-art results at high resolution on a standard desktop GPU.

Our final contribution was initially intended to extend the IGE-Net architecture to point clouds. However, a search of the literature found no simple network architectures that were both hierarchical in cloud density and continuous in coordinates, both necessary conditions for efficient IGE-Nets. We therefore present several approaches that improve the performance of existing point cloud methods, along with a modification that is not only hierarchical and continuous but also runs significantly faster and requires significantly less memory than existing methods. We further extend this work to event camera streams, producing networks that exploit the asynchronous nature of the input format and achieve state-of-the-art results on multiple classification benchmarks.
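The abstract describes IGE-Nets only at a high level, but the underlying pattern of expressing approximate inner optimization as a layer can be sketched generically: define an energy combining a feasibility prior with data consistency, then unroll a fixed number of gradient steps so the whole procedure is differentiable. The quadratic energy, matrix sizes, and names below are hypothetical, not the thesis formulation.

```python
import numpy as np

# Generic sketch of "inner optimization as a learnable layer" (hypothetical
# energy and shapes; not the actual IGE-Net formulation).
rng = np.random.default_rng(1)
A = rng.normal(size=(5, 8))  # stand-in for a differentiable 3D-to-2D feature map
lam = 0.1                    # weight of the (quadratic) feasibility prior

def energy(x, y):
    """Consistency with observed features y plus a feasibility prior on x."""
    return float(np.sum((A @ x - y) ** 2) + lam * np.sum(x ** 2))

def unrolled_layer(y, x0, K=50, step=0.02):
    """K unrolled gradient steps on the energy. Each step is differentiable,
    so in principle the step size and energy weights could be learned
    end-to-end alongside the rest of the network."""
    x = x0.copy()
    for _ in range(K):
        grad = 2.0 * A.T @ (A @ x - y) + 2.0 * lam * x
        x = x - step * grad
    return x

# The "layer" maps observed features to an approximately energy-minimizing x.
y = rng.normal(size=5)
x0 = np.zeros(8)
x_star = unrolled_layer(y, x0)
```

Because the layer's output is only an approximate minimizer, its quality depends on K and the step size, which is one way the trade-off between accuracy and computational cost described above can be tuned.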