Supervised machine learning (ML) has revolutionized artificial intelligence and has led to the creation of numerous innovative products. However, with supervised machine learning, there is always a need for larger and more complex datasets, and collecting these datasets is costly. How can you be sure of the label quality? How do you ensure that the data is representative of production data? An exciting new solution to this problem, particularly for object detection tasks, is to generate a massive synthetic dataset. Synthetic data alleviates the challenge of acquiring the large labeled datasets needed to train machine learning models.
This blog post is the third in our series on generating synthetic data with Unity. In the first blog post, we discussed the challenges of gathering a large volume of labeled images to train machine learning models for computer vision tasks. More recently, we showed you how to generate labeled data frames using Unity’s Computer Vision tools.
Now, we will show you how to:
The result is a model that performs well on a new real-world dataset we’re releasing with this post and performs better than the model trained using only real data. We will also provide you with pipelines and instructions to create an environment, generate data, and train a model with your customized assets and data in datasetinsights.
We have chosen to use a Faster R-CNN to detect 63 everyday grocery objects. Training this model requires creating 3D assets of the objects of interest, automatic scene creation, rendering image data, and generating the bounding box labels.
We created 3D asset scans for all 63 objects for this project and used the Unity Perception package to generate labeled data automatically. As described in a previous blog post, we controlled the placement and orientation of the target objects along with the arrangement, shape, and texture of the background objects for each render. Additionally, we randomly chose lighting, object hue, blur, and noise for every rendered image. Using the Perception package, we captured RGB images and JSON files with corresponding bounding boxes.
To create our environment, we had two types of assets: foreground assets and background assets. Foreground assets are scans of the objects we detected. In contrast, background objects make up our background or occlude our target objects (distractors).
Crafting these assets had a set of unique challenges. Firstly, for background and distractor objects, we needed to expose and vary their textures and hues. Secondly, the foreground assets had to be realistic. Therefore, scanning foreground objects required more attention and some touch-up post-scan.
To create the real-world dataset, we purchased the products, arranged them in several different locations in our Bellevue office, and took several pictures. To ensure that our real-world dataset was diverse, we placed objects with varied lighting and background conditions. We also ensured that the set of objects in each photo varied and that the object locations, orientations, and configurations varied in every shot.
To annotate these images, we used the VGG Image Annotator tool and spent over 200 person-hours of effort, including quality assurance and addressing data correctness issues. We had 1,267 usable images with bounding boxes and class labels at the end of the annotation process. We split this data into 760 images for training, 253 images for validation, and held out 254 images for testing. Any model that used real-world images for training used the training set. To choose the best performing model and prevent overfitting, we used the validation set. Finally, we used the held-out test set to report model performance. The models never saw the held-out set, and we did not use the held-out data implicitly or explicitly to choose models or set model hyperparameters.
For these experiments, we used the popular Faster R-CNN model with a ResNet50 backbone pre-trained on ImageNet. We used the publicly available implementation from the torchvision. The exact code, including kubeflow pipelines, is available in our open-source datasetinsights python package.
To measure the model’s performance, we used three standard metrics used in COCO, PASCAL VOC, and OpenImages challenges. These measures allow us to quantify the false-positive rate, the bounding box localization, and the false-negative rate of our model. In general, we use the thresholds on the intersection over union (IoU) between a predicted bounding box and a ground truth bounding box to determine if the prediction is true or false. We use the mean average precision (mAP) of the object detection at an IoU greater than or equal to 0.5 (mAPIoU=0.5) to measure the rate of false-positive detections. Additionally, we use the mAP averaged over the range of thresholds 0.5 to 0.95 with a step size of 0.05 to measure the quality of bounding box localization. Improvements to the mAP indicate bounding boxes’ localization improved, whereas mAPIoU=0.5 is a more general measure of improvements in detection precision, i.e., reducing the false-positive rate. Lastly, we measure the mean average recall of the model given a maximum of 100 detection candidates (mAR100). Improvements in the mAR100 indicate that the false-negative rate has decreased. For details on how we calculated these metrics, see PASCAL VOC developer kit and COCO detection evaluation.
We trained all models with an initial learning rate of 2x10-4 and a batch size of four. Before training, we split the train data into a train set and validation set. We chose the model with the highest mAPIoU=0.5 and the mAR100 on the validation set.
When training on real-world only data, we used 760 images for training. We selected the best performing model using the mAPIoU=0.5 and mAR100 on the real-world validation dataset. When training on synthetic data, we used 400,000 images for training. We found the best performing model, again as determined based on performance on a synthetic validation set. Lastly, for fine-tuning, we took our best performing synthetic model and fine-tuned using the real-world images. In this case, we chose the best performing model using the performance on the real-world validation set.
Mean Average Precision averaged across IoU thresholds of [0.5:0.95] (mAP), Mean Average Precision with a single IoU threshold of 0.5 (mAPIoU=0.5), and the Mean Average Recall with a maximum of 100 detections (mAR) measured on a held-out set of 254 real-world images.
|Training Data (number of training examples)
|1.1 Real World (760)
|1.2 Synthetic (400,000)
|1.3 Synthetic (400,000) + Real World (76)
|1.4 Synthetic (400,000) + Real World (380)
|1.5 Synthetic (400,000) + Real World (760)
Taking an off-the-shelf model like Faster R-CNN and training it on a small real-world dataset is a standard workflow for many machine learning practitioners. We found that for our GroceriesReal dataset, Faster R-CNN achieved decent results (mAP of 0.48, mAPIoU=0.5 of 0.73, and mAR100 of 0.59). We can see in the above image that in situations when there are no occlusions, and the product name is front-facing, that model performs well. However, when distractor objects and occlusions are present, there are many false-positive predictions. Conversely, when the lighting is complex, we see many missed detection (false-negatives). These results show that Faster R-CNN can perform this task. However, there is significant room for improvement in the bounding box localization (mAP), false-positive rates (mAPIoU=0.5), and false-negative rates (mAR100). Instead of collecting a large amount of real-world data to push performance, we synthesized a massive amount of randomized data.
One of the advantages of using synthetic data is that we can rapidly iterate on the dataset without going through a tedious and time-consuming data collection process. To generate a usable synthetic dataset, we iterated on the many aspects of the dataset generation process. Initially, we did not include ambient lighting effects or occlusion effects. Models trained on these datasets had a terrible performance on real-world data, with many false-positives and false-negatives. To achieve even moderately successful results, we found it critical to allow for ambient lighting effects and occlusions by background objects and other target objects. Thankfully, using Unity Simulation and the Unity Perception package we had created a well-parameterized environment where we could easily and rapidly customize our dataset, and it took just minutes to generate a new large dataset with different domain randomization parameters.
Including ambient lighting and occlusion, we trained a model on 400,000 synthetic examples. This model performed poorly on real images (see table above), failing particularly in situations with heavy occlusions and low lighting. However, when objects had complex orientations, the model performed reasonably well (Figure 3). These results suggested to us that the issue was a domain gap problem. To test this hypothesis, we tried fine-tuning this model with some real-world training data.
The first thing we did was fine-tune our synthetic-trained model on 76 real-world images. We found that with 10% of the real-world examples, we outperformed the model trained with only real-world data. Specifically, we found that the number of false-positives and false-negatives dropped significantly. Additionally, we found remarkable improvements in the bounding box localization. However, we still see that the model is struggling in situations with complex lighting (row three, column three in Figure 3).
However, fine-tuning this model with 380 real-world examples, we find remarkable improvements in all metrics. We see a nearly 22% improvement in the mAPIoU=0.5 and an approximately 12% improvement in the mAR100 compared to a real-world-only model. Furthermore, we find a 42% improvement in the mAP, which shows that fine-tuning a model trained on synthetic data with real data also improves the bounding box localization. In particular, in cluttered examples, the fine-tuned model performs admirably with only a few false-positives. Additionally, in environments with challenging lighting, the fine-tuned model detected almost all the objects. These results show that a large, randomized, synthetic dataset reduces the false-negative and false-positive rate of the model.
Training a model with synthetic data that performs in the real world presents several challenges related to how you generate a useful dataset. For object detection, the main difficulties are how to create a digital twin of the target objects, diversify your synthetic dataset, and train your model using this dataset. Using asset scanning solutions allows you to create a synthetic version of your target assets in Unity. Then using Unity Simulation, you can parametrically vary aspects of the assets and environment and generate synthetic data at scale.
In a previous blog post, we released a python package to parse synthetic datasets generated on Unity Simulation, compute statistics and generate charts to understand the dataset. We are now fully open sourcing it on github and extending the capabilities of the project to include model training code and kubeflow pipelines so users can reproduce our work by training a model with synthetic data and fine-tune a model on a small amount of real data.
The updated package includes a guide to running the end-to-end pipeline, from generating a dataset to training a Faster-RCNN model and then evaluating and visualizing the model’s predictions on a new jupyter notebook. The guide also includes optional paths to do transfer learning, which generally results in good performance, or to run predictions using one of our pre-trained models.
Using these pipelines, the real-world dataset, and the parameters provided to generate the synthetic data, you can train a model to perform object detection in the real world. Additionally, we have provided the source code and iOS AR application and ML model-hosting instructions that allow you to use your model (or one we provide) to perform object detection in the wild.
Is this article helpful for you?
Thank you for your feedback!