AI@Unity is working on amazing research and products in robotics, computer vision, and machine learning. Our summer interns worked on AI projects with real product impact.
The Unity Computer Vision team developed the Perception Package to help our users build synthetic datasets using Unity’s real-time 3D engine. Synthetic data helps computer vision developers supplement their real-world datasets in machine-learning-based vision applications by reducing bias, producing edge cases, adding diversity, and labeling images perfectly. We also leverage our expertise and research with synthetic data to generate unique, custom datasets for our customers.
During the summer of 2021, our interns worked diligently to create valuable contributions to our work at Unity. Read about their projects and experiences in the following sections.
Instance segmentation is a computer vision task in which the goal is to build models that can say which pixels in an input image belong to different objects. Deep neural networks, such as Mask R-CNN (He et al., 2017), are the current state of the art for instance segmentation; however, these networks are incredibly data-hungry, requiring many tens of thousands of labeled examples to achieve high performance in real-world domains. For every object in each training image, the network uses two kinds of labels: a bounding box (i.e., a 2D box specifying the object’s location) and a segmentation mask (i.e., an image-sized binary mask specifying which pixels belong to the object). These labels are typically provided by human annotators, for whom segmentation masks require significantly more time and effort than bounding boxes. In this project, we investigated whether it is possible to train deep instance segmentation networks with comparatively few hard-to-obtain segmentation masks, and whether synthetic data can help boost performance when segmentation masks are scarce or even non-existent.
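To make the two label types concrete, here is a minimal sketch in plain Python. It is illustrative only and not taken from Mask R-CNN or any Unity tool; the names `Mask`, `Box`, and `mask_to_box` are hypothetical. It shows that a bounding box can always be derived from a segmentation mask, while the reverse is not possible, which is one way to see why masks carry more information and cost more to annotate.

```python
from typing import List, Optional, Tuple

# Assumed representations for this sketch only:
Mask = List[List[int]]           # image-sized grid of 0/1 values
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive pixel indices


def mask_to_box(mask: Mask) -> Optional[Box]:
    """Return the tightest box enclosing all nonzero pixels, or None for an empty mask."""
    xs = [x for row in mask for x, value in enumerate(row) if value]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    return min(xs), min(ys), max(xs), max(ys)


def mask_area(mask: Mask) -> int:
    """Count the pixels that belong to the object."""
    return sum(1 for row in mask for value in row if value)


example_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
example_box = mask_to_box(example_mask)
```

In this example the box spans six pixels but the object covers only five of them, so the box alone cannot say which pixel inside it is background.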
As a first step, we sought to answer the following question: on real data (e.g., the COCO dataset), given that you have many bounding box labels, what fraction of those need to be accompanied by segmentation mask labels to achieve high segmentation performance? We found that with only 1% of data labeled with segmentation masks (i.e., fully labeled), one can reach nearly 90% of the performance attained when all data is fully labeled (see plot).
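The experimental setup can be sketched as follows. This is an assumption-laden illustration, not the code used in the project: the source does not say how the fully labeled subset was chosen, so the sketch assumes uniform random sampling per image, and the annotation dictionaries with `image_id`, `box`, and `mask` keys are a hypothetical simplification rather than the COCO format.

```python
import random
from typing import Dict, List, Set


def choose_fully_labeled(image_ids: List[int], mask_fraction: float, seed: int = 0) -> Set[int]:
    """Pick the images that keep their segmentation masks (assumed: uniform random choice)."""
    if not 0.0 <= mask_fraction <= 1.0:
        raise ValueError("mask_fraction must be in [0, 1]")
    rng = random.Random(seed)
    count = round(len(image_ids) * mask_fraction)
    return set(rng.sample(image_ids, count))


def strip_masks(annotations: List[Dict], keep_ids: Set[int]) -> List[Dict]:
    """Copy the annotations, dropping the mask for every image outside keep_ids."""
    result = []
    for annotation in annotations:
        annotation = dict(annotation)  # shallow copy so the input is left untouched
        if annotation["image_id"] not in keep_ids:
            annotation["mask"] = None  # box-only example
        result.append(annotation)
    return result


# Hypothetical toy dataset: 1000 images, one annotated object each.
image_ids = list(range(1000))
annotations = [{"image_id": i, "box": (0, 0, 10, 10), "mask": [[1]]} for i in image_ids]

keep_ids = choose_fully_labeled(image_ids, mask_fraction=0.01)
partially_labeled = strip_masks(annotations, keep_ids)
```

With a fraction of 0.01, ten of the thousand toy images keep their masks and the rest become box-only examples.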
This result is exciting because it may significantly reduce the labeling burden for instance segmentation. It also opens exciting avenues for future work. For example, can we use synthetic data to close the remaining performance gap between 1% fully labeled and 100% fully labeled? Can we completely replace the ground-truth segmentation masks with synthetic data? How does the viability of using synthetic data change as the number of ground-truth bounding boxes is reduced? Initial results from training on synthetic instance segmentation data are promising.
The Dataset Visualizer is a Python tool that allows users to explore and visualize computer vision datasets that use the Unity Perception format, for example, datasets generated using the Unity Perception Package. These datasets contain images of synthetic environments and objects along with ground truth annotations, including 2D and 3D bounding boxes, semantic and instance segmentation, and keypoints. These datasets are used to train computer vision AI models for tasks such as object detection or classification.
The Dataset Visualizer makes it easy and efficient for users to browse through these datasets and inspect the images along with their ground truth annotations, visualized as multiple selectable overlays. This tool can help users in a variety of use cases:
While creating this tool, I had the opportunity to learn about many topics in computer science such as artificial intelligence, computer networks, computer graphics, and web development. With the use of the resources offered at Unity and the amazing people I had the opportunity to work with, I solved several unexpected and challenging problems such as rendering 3D boxes with different camera projection types, automatically solving port conflicts, creating an application that is compatible with a variety of operating systems, and much more.
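As an illustration of why camera projection type matters when drawing 3D boxes, here is a minimal sketch. It is not the Dataset Visualizer's code; the function names, the camera-space convention (z pointing away from the camera), and the `focal_length` and `scale` parameters are all assumptions made for the example.

```python
from itertools import product
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Vec2 = Tuple[float, float]


def box_corners(center: Vec3, size: Vec3) -> List[Vec3]:
    """Eight corners of an axis-aligned box expressed in camera space."""
    cx, cy, cz = center
    hx, hy, hz = (s / 2.0 for s in size)
    return [
        (cx + sx * hx, cy + sy * hy, cz + sz * hz)
        for sx, sy, sz in product((-1, 1), repeat=3)
    ]


def project_perspective(point: Vec3, focal_length: float) -> Vec2:
    """Pinhole projection: image coordinates shrink as depth z grows."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (focal_length * x / z, focal_length * y / z)


def project_orthographic(point: Vec3, scale: float) -> Vec2:
    """Orthographic projection: depth is ignored, so size does not change with distance."""
    x, y, _ = point
    return (scale * x, scale * y)


corners = box_corners(center=(0.0, 0.0, 4.0), size=(2.0, 2.0, 2.0))
perspective_points = [project_perspective(c, focal_length=1.0) for c in corners]
orthographic_points = [project_orthographic(c, scale=1.0) for c in corners]
```

Under the perspective camera the far face of the box projects smaller than the near face, while under the orthographic camera both faces land on the same four image points, so the overlay has to be drawn differently for each camera type.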
AI@Unity supports its computer vision customers by providing the ability to create large synthetic datasets. We have released a couple of example datasets, but previously users had to download and extract the images in order to view them. These datasets are not small, so users faced a delay, depending on their bandwidth, before they could inspect the images. Custom datasets also had to be downloaded before viewing, which could slow down iteration on the data.
The Dataset Preview feature improves the user experience by allowing users to inspect a sample of the dataset before downloading. If the dataset needs to be adjusted to meet the user’s needs, users can generate a new one and preview it again before downloading. Within the feature, users can modify the size and number of images displayed per page as well as magnify each image. To assist with image inspection, users can enable bounding boxes on the zoomed image and change the boxes’ colors if need be.
The most terrifying part of the project was knowing that it would be a user-facing feature. In fact, on top of being the tool almost all users are guaranteed to use, it was commonly used by other developers on the project to preview their work as well! It was challenging to integrate a feature into an existing product while learning and conforming to the organization’s code submission guidelines, but ultimately, the entire experience was an incredibly satisfying learning journey.
Finding large, adequately labeled datasets is a major challenge facing machine learning professionals. Datasets can be expensive to label, may contain unwanted bias, or be unrepresentative of the real world. Unity addresses this problem by exploiting the rendering pipeline to produce synthetic labeled datasets for computer vision tasks. My internship focuses on developing and studying synthetic depth images as replacements for depth training data collected from real sensors. Can machine learning models trained with synthetic depth images perform well when tested with real-world data? What degree of realism is required to bridge the gap?
Depth maps are single-channel images in which each pixel corresponds to the distance from the camera to the object constituting the pixel, in the direction of the camera’s forward axis. I developed a labeler that produces depth maps corresponding to a camera’s view in a Unity scene. To assess the utility of depth images, we are creating synthetic datasets to train models for single-object 6D pose estimation. We are studying two state-of-the-art models alongside a novel architecture of our design. Each example in a dataset includes a color image, depth image, object mask, and an object pose label composed of ground truth rotation and translation. We then test the models trained on synthetic data on real-world images from the LineMOD dataset. To explore the impact of realism, we also conduct experiments using synthetic depth images modified with a noise model. The added noise is meant to resemble the noise present in the LineMOD depth images. Since the project is ongoing at the time of writing, results are not yet available.
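The depth definition above can be written out in a few lines. The sketch below is illustrative and is not the labeler's implementation. It contrasts depth along the camera's forward axis with straight-line distance, and it includes a placeholder noise function. The source does not describe the noise model used in the experiments, so the zero-mean Gaussian here is purely an assumption, as are all function and parameter names.

```python
import math
import random
from typing import List, Sequence


def forward_axis_depth(point: Sequence[float], camera_position: Sequence[float],
                       camera_forward: Sequence[float]) -> float:
    """Depth as defined in the text: the offset from the camera projected onto its forward axis.

    camera_forward is assumed to be a unit vector.
    """
    offset = [p - c for p, c in zip(point, camera_position)]
    return sum(o * f for o, f in zip(offset, camera_forward))


def euclidean_distance(point: Sequence[float], camera_position: Sequence[float]) -> float:
    """Straight-line distance to the camera, shown only for contrast."""
    return math.dist(point, camera_position)


def add_depth_noise(depth_map: List[List[float]], sigma: float, seed: int = 0) -> List[List[float]]:
    """Placeholder noise model (assumption): zero-mean Gaussian per pixel, clamped at zero."""
    rng = random.Random(seed)
    return [[max(0.0, d + rng.gauss(0.0, sigma)) for d in row] for row in depth_map]


camera_position = (0.0, 0.0, 0.0)
camera_forward = (0.0, 0.0, 1.0)
point = (3.0, 0.0, 4.0)

depth = forward_axis_depth(point, camera_position, camera_forward)
distance = euclidean_distance(point, camera_position)
noisy = add_depth_noise([[depth, depth], [depth, depth]], sigma=0.01)
```

For the off-axis point in this example, the forward-axis depth is 4 while the straight-line distance is 5, which is why the two conventions should not be mixed when comparing synthetic and real depth images.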
If synthetic depth training sets yield satisfactory real-world performance, users could apply Unity’s synthetic data pipeline to a greater variety of computer vision problems. Quality depth images can be expensive and finicky to collect, making synthetic data an attractive possibility.
Computer vision training requires a large number of labeled images to be successful, but the process of labeling real-world data is long and tedious. To address this, we create custom synthetic datasets for customers, powered by the Unity Perception package. With this technology, we can create a large variety of environments populated with various objects and humans. Through Randomizer scripts, we can randomize several parameters, such as objects’ position, rotation, animation, texture, and lighting. The resulting images, referred to as frames, are generated almost instantly, thanks to the real-time rendering of lifelike 3D scenes.
New features are regularly added to the Perception package, and this project addresses rig automation and resizing. Its goal is to randomize digital humans’ blend shapes and automatically adapt their rigs, using a Blend Shape Randomizer script and other rigging and skinning tools currently in development.
We also worked closely with one of our customers over the span of four weeks to create the synthetic dataset they needed. We modified existing randomizers and set up interior and exterior scene environments with people and lighting randomization to meet the customer’s needs. After prioritizing this work, I was able to return my focus to the human rigging and skinning automation tools. I expect to have completed a Bones Placer tool by the end of my internship, working alongside a Skinning automation tool.
My experience on both projects was very exciting and rewarding, as I was able to learn more about synthetic data and work in a fast-paced environment on challenging problems with the support of managers, mentors, and colleagues. Working on a customer project was initially very daunting, but it turned out to be invaluable. Iterating on their feedback was instructive, as machine learning has different needs from gaming, where I have more experience. I also gained more knowledge of the HD Render Pipeline and Shader Graph by working with lighting and post-processing and by creating a shader graph that randomizes textures’ appearance.
I quickly familiarized myself with the Perception package, and more specifically with the logic of its Randomizers, so that I could modify them as needed. I then used this new-found knowledge to write the Blend Shape Randomizer from start to finish. It adds meshes as new blend shapes to a target mesh and randomizes their weights. This taught me more about blend shapes as well as the Unity and Perception APIs.
In addition, I delved further into Houdini Python scripting, as I worked on exporting vertex data from a mesh in Houdini to a .json file. This file will then be handed to a Bones Placer tool in Unity, which will use the vertex data to calculate the expected bone positions and generate the bones on a target mesh that has the same vertex IDs as the mesh the data was collected from. This is the tool I am currently developing, and it is due to be completed by the end of my internship. The generated bones will then be used to skin the mesh with a Skinning automation tool, currently being developed by my colleague.
Overall, I acquired a vast amount of technical knowledge, for which I am very grateful, and I am looking forward to learning even more to help Unity and its customers be successful in synthetic data generation!
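The idea behind randomizing blend shape weights can be sketched with the standard linear blend shape formula, in which each vertex is moved from its base position toward each target by that target's weight. The sketch below is plain Python for illustration; it is not the Blend Shape Randomizer itself and does not use the Unity or Perception APIs. The function names and the weight range of 0 to 1 are assumptions.

```python
import random
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mesh = List[Vec3]  # vertex positions only; target meshes must share the base's vertex order


def random_weights(count: int, rng: random.Random) -> List[float]:
    """Draw one weight per blend shape (assumed range: 0 to 1)."""
    return [rng.uniform(0.0, 1.0) for _ in range(count)]


def apply_blend_shapes(base: Mesh, targets: Sequence[Mesh], weights: Sequence[float]) -> Mesh:
    """Linear blend: vertex = base + sum_i weight_i * (target_i - base)."""
    if len(targets) != len(weights):
        raise ValueError("need exactly one weight per target")
    for target in targets:
        if len(target) != len(base):
            raise ValueError("targets must have the same vertex count as the base")
    result = []
    for index, vertex in enumerate(base):
        blended = list(vertex)
        for target, weight in zip(targets, weights):
            for axis in range(3):
                blended[axis] += weight * (target[index][axis] - vertex[axis])
        result.append((blended[0], blended[1], blended[2]))
    return result


base_mesh = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
taller_mesh = [(0.0, 2.0, 0.0), (1.0, 2.0, 0.0)]

halfway = apply_blend_shapes(base_mesh, [taller_mesh], [0.5])
randomized = apply_blend_shapes(base_mesh, [taller_mesh], random_weights(1, random.Random(0)))
```

A weight of 0 leaves the base mesh unchanged and a weight of 1 reproduces the target, so sampling weights in between yields a continuous family of body shapes from a small set of authored meshes.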
Real-time 3D is changing the world today, and 3D and volumetric capture is fundamental to producing lifelike content for immersive 3D experiences. Unity is a founding member of the Volumetric Format Association, and the Unity Simulation platform is paving a path into this growing market by means of volumetric and 3D capture simulations. With 3D capture still being relatively new technology, there is a need for large-scale simulation to improve 3D capture algorithms and create optimal capture scenarios. Exhaustive testing of camera layouts, scenarios, actors, and lighting is necessary to capture realistic simulated content, but it can be prohibitively expensive. This project lowers the barrier to entry for customers looking to make informed decisions about 3D capture and volumetric video simulations as they move forward with their projects.
This 3D and volumetric capture simulator helps customers better understand the entire process end to end for their use cases and visualize the final 3D content at the end of the process. It is specifically designed to narrow down and resolve problems customers face, such as finding the optimal layout of cameras for a volumetric capture scenario, or simulating the capture of a wide variety of objects. The simulator enables customers to simulate multiple setups and evaluate hardware deployments, so that they can make informed long-term decisions for hardware-intensive technology like 3D and volumetric capture on a large scale. A configurable simulator like this not only provides detailed insights into the process, it also provides the opportunity to run multiple randomized experiments to generate source 3D data in varying environments, which is a key requirement for high-quality 3D content generation.
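As a rough illustration of what one candidate camera layout might look like in such an experiment, the sketch below places cameras evenly on a ring around the capture subject. This is a hypothetical example and not the simulator's code or configuration format: the ring shape, the function name, and the parameters are all assumptions. It uses a y-up coordinate system, as Unity does.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def ring_layout(camera_count: int, radius: float, height: float) -> List[Tuple[Vec3, Vec3]]:
    """Return (position, forward) pairs for cameras evenly spaced on a horizontal circle.

    Each forward vector is a unit vector pointing horizontally toward the vertical
    axis through the origin, where the subject is assumed to stand.
    """
    if camera_count < 1:
        raise ValueError("camera_count must be at least 1")
    if radius <= 0:
        raise ValueError("radius must be positive")
    cameras = []
    for index in range(camera_count):
        angle = 2.0 * math.pi * index / camera_count
        position = (radius * math.cos(angle), height, radius * math.sin(angle))
        forward = (-math.cos(angle), 0.0, -math.sin(angle))
        cameras.append((position, forward))
    return cameras


# Hypothetical sweep: the same subject captured with 4, 8, and 16 cameras.
candidate_layouts = {count: ring_layout(count, radius=2.0, height=1.5) for count in (4, 8, 16)}
```

Sweeping over parameters such as camera count, radius, and height in simulation is far cheaper than rearranging physical hardware, which is the kind of comparison the simulator is meant to support.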
Though I have used Unity to build games in the past, it was incredibly exciting and rewarding to work with 3D content capture and simulation for the first time. Over the course of this internship, I quickly learned how to integrate 3D content into the engine with simulator-specific requirements like varying scenario options for more diverse data, storing metadata and simulated sensor datasets, and providing modifiable configurations to enable customers to improve the quality of their 3D capture solutions in a variety of situations. As the global volumetric video market continues to grow, I am excited to see how customers use this tool to help them be successful with their 3D capture needs!
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2961-2969).