Every year, 1.25 million people are killed in traffic accidents. In 2016, road crashes resulted in 40,000 deaths and 4.6 million injuries in the United States alone. Unity has partnered with the City of Bellevue in Washington State to work towards reducing those numbers through technology.
By using machine learning and simulations, we believe we can identify unsafe intersections and mitigate the risks before actual accidents happen. In this blog post, we’ll talk about the initial steps of this collaboration: we'll introduce the project, cover the basic computer vision concepts behind it, and showcase how the Unity Engine is used to create simulated environments to train the machine learning models used in the solution.
This partnership between Unity and the City of Bellevue is part of a project called Video Analytics towards Vision Zero.
It's a long name, so let's break it down to understand its scope.
The overarching goal of the initiative is to create a computer vision system that can leverage cameras spread across the city’s many intersections to identify unsafe ones. By recognizing objects in the video streams - cars, people, bicycles, etc. - and their trajectories, the system is expected to provide input to traffic planners about the frequency and nature of incidents, allow for reviewing dangerous situations, and generate more accurate general reports about those intersections. All this data will allow cities to make sure safety measures are put in place - such as redesigning crossing lanes, adjusting stop light timing, and introducing clearer signage - getting us closer to the zero-fatalities goal that Vision Zero aims for.
This article focuses on the object recognition task, which is the initial piece of the safety system planned for the initiative. Unity Engine’s ability to create synthetic representations of the real world is a valuable resource for computer vision applications, as we’ll explain ahead.
Computer vision is a subfield of artificial intelligence that aims to extract information from images and videos. There has been significant progress in this area over the last five years, and it has been widely applied to automatically make sense of real images. In particular, a computer vision model can learn to understand the video feeds from multiple intersections and what goes on around them.
This understanding requires detecting, classifying and tracking different elements of the image. By detecting, we mean locating each object or piece of the scenery in the image. Classification provides information about the type of each detected part. Additionally, each object is tracked as an individual instance in the image - which is particularly important when dealing with sequences of images or videos. All this metadata allows the machine learning model to understand the image as a whole.
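To make the three kinds of metadata concrete, here is a minimal sketch in Python of what a single detection record could look like, and how tracking identifiers turn per-frame detections into trajectories. The `Detection` class, its field names and the sample values are all hypothetical and only meant for illustration; they are not the format used by the project.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One object found in one frame (hypothetical record layout)."""
    frame: int      # index of the video frame
    track_id: int   # tracking: same id means same object across frames
    label: str      # classification: what kind of object this is
    box: tuple      # detection: (x_min, y_min, x_max, y_max) in pixels


def box_center(box):
    """Return the center point of a bounding box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


def trajectories(detections):
    """Group detections by track id into frame-ordered lists of centers."""
    tracks = defaultdict(list)
    for det in sorted(detections, key=lambda d: d.frame):
        tracks[det.track_id].append(box_center(det.box))
    return dict(tracks)


# Made-up sample data: one pedestrian seen in two frames, one car in one.
detections = [
    Detection(frame=1, track_id=7, label="pedestrian", box=(12, 42, 22, 72)),
    Detection(frame=0, track_id=7, label="pedestrian", box=(10, 40, 20, 70)),
    Detection(frame=0, track_id=3, label="car", box=(100, 50, 180, 90)),
]
paths = trajectories(detections)
```

Without the `track_id`, the two pedestrian boxes would just be two unrelated detections; with it, they become a path that can be followed over time.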
Using real-world images is a common way to provide bootstrapping data for computer vision models. The initial identification and classification metadata is provided through manual work, using specialized tools in a process called annotation.
Here’s an example taken from the Vision Zero annotation tool:
Notice the bounding boxes created manually by the tool operators around each pedestrian in the street, as well as the “area of interest” defined by the dashed red line.
Once the images are annotated, a part of the set is used for training the machine learning model while a smaller hold-out of the dataset is used to evaluate the performance of the trained model on scenes it has never seen before.
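As a rough illustration of that split, here is a small Python sketch. The function name `split_dataset`, the 20% hold-out fraction and the file names are assumptions made for the example, not values taken from the project.

```python
import random


def split_dataset(items, holdout_fraction=0.2, seed=0):
    """Shuffle annotated items and split them into training and hold-out sets."""
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be between 0 and 1")
    shuffled = list(items)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    holdout_size = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[holdout_size:], shuffled[:holdout_size]


# Hypothetical file names standing in for annotated images.
frames = [f"frame_{i:04d}.png" for i in range(100)]
train, holdout = split_dataset(frames)
```

One caveat when working with video: consecutive frames look almost identical, so in practice it is usually safer to split by clip or by camera rather than by individual frame, otherwise the hold-out set is not really unseen.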
On an abstract level this is how the data is used for training supervised models:
A common challenge with computer vision applications is that finding enough meaningful training and evaluation data is hard. Manual annotation of real-world images and videos is the norm and was a big focus in the initial phase of the Video Analytics project. But it is a costly process, and the quality of the labels can be affected by operator fatigue, inconsistencies in the procedures and other human factors.
Even discounting the costs and time, the data obtained through capture is still limited by what the real world can provide. If you need to train your model with some bicycles and buses, wait for them to show up in the same frame. If you want to see what your model does when there’s snow, or rain, or fog, then you need to befriend a meteorologist… and be ready to fire off the cameras when they say so.
Simulation is a good way to overcome these limitations. By providing full control of the contents - including full understanding of the nature of each element of the scene - a simulation environment can produce virtually infinite sets of training and evaluation data with absolutely accurate annotations in a multitude of situations that can be either designed for specific cases or generated procedurally to cover as many scenarios as possible.
Let’s go over a couple of concepts at the core of simulations for computer vision: Scenes and Episodes.
Scenes are all the static and dynamic elements, as well as parameters that are modeled in a simulation environment. They include buildings, streets, vegetation (static elements); cars, pedestrians, bicycles (dynamic elements); weather conditions, time of the day (sun position), fog (parameters).
Episodes are determined configurations of the scene elements. For example, a certain placement for pedestrians, a certain route for cars, the presence of rain, road conditions, etc. One can imagine an episode as being an instance of a scene.
In this picture, the boxes on the top row represent the individual elements of the scene: static assets, dynamic assets, and parameters - in this case, weather. In the bottom box, all pieces from the scene are combined into a complete episode.
When creating simulated data, we typically refer to episode variation as the process of generating different episodes for a given scene. This process can be tailored to create a comprehensive set of situations expected to be found in the real-world application of the machine learning model being trained.
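The relationship between a scene and its episodes can be sketched in a few lines of Python. This is only a toy model under our own assumptions: the `Scene` and `Episode` classes, the asset names and the random sampling strategy are hypothetical, and a real simulation would drive the engine rather than return plain records.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Scene:
    """Everything that can appear in the simulation (hypothetical layout)."""
    static_assets: tuple    # e.g. buildings, streets, vegetation
    dynamic_assets: tuple   # e.g. cars, pedestrians, bicycles
    weather_options: tuple  # parameter: possible weather conditions
    hours: tuple            # parameter: possible times of day


@dataclass(frozen=True)
class Episode:
    """One concrete configuration of a scene."""
    static_assets: tuple
    actors: tuple           # (asset name, how many of them) pairs
    weather: str
    hour: int


def sample_episode(scene, rng, max_per_asset=5):
    """Pick one value for every variable element of the scene."""
    actors = tuple(
        (name, rng.randint(0, max_per_asset)) for name in scene.dynamic_assets
    )
    return Episode(
        static_assets=scene.static_assets,
        actors=actors,
        weather=rng.choice(scene.weather_options),
        hour=rng.choice(scene.hours),
    )


def episode_variation(scene, count, seed=0):
    """Generate `count` episodes for one scene, reproducibly."""
    rng = random.Random(seed)
    return [sample_episode(scene, rng) for _ in range(count)]


intersection = Scene(
    static_assets=("buildings", "streets", "vegetation"),
    dynamic_assets=("car", "pedestrian", "bicycle"),
    weather_options=("clear", "rain", "fog", "snow"),
    hours=tuple(range(6, 22)),
)
episodes = episode_variation(intersection, count=10)
```

Random sampling is just one strategy. The same structure works for hand-designed episodes that target a specific case, such as a bicycle and a bus in the same frame during snowfall.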
Given the costs and limitations in gathering real-world data, it is natural to consider replacing or augmenting it with synthetic data generated by a game engine such as Unity. Due to recent advances in graphics hardware and rendering, and the advent of virtual and augmented reality, the Unity Engine has evolved into a complete 3D modeling tool, able to generate highly photo-realistic simulations. This has been noticed by both industry and academia, and many projects have been developed to take advantage of Unity’s simulation capabilities. One of the most recognized projects is SYNTHIA.
Developed entirely on Unity by the Computer Vision Center (CVC) at Universitat Autònoma de Barcelona (UAB), the project focuses on creating a collection of synthetic images and videos depicting street scenes in a diverse range of episode variations.
CVC has been pioneering computer vision research for the past 20 years and its SYNTHIA dataset has become a seminal source for those working on autonomous vehicle perception systems.
For Vision Zero, Unity joined forces once again with CVC to provide the City of Bellevue with the best technology and expertise available.
Through imagery and 3D models provided by the City of Bellevue, and leveraging the integration of Otoy’s OctaneRender with Unity Engine, CVC took on creating a set of scenes that can be leveraged to improve both the training and the evaluation of the computer vision models built by Microsoft.
Vision Zero focuses on the interaction between vehicles and pedestrians at intersections. This means that a proper simulation needs to represent a few city areas with a high level of detail, with challenging camera angles, in a multitude of situations - i.e. variation of objects in the scenes - in order to generate the data coverage required by the computer vision models.
The video below shows an intersection (116th Ave NE and NE 12th Street) in the City of Bellevue. The real camera picture at the 16 seconds point is a good baseline to understand the amazing level of photorealism achieved in this simulation.
Not only are the images impressively realistic, but because they are Unity assets, we have all metadata needed for a 100% error-free segmentation - i.e. pixel-level classification of everything in the scene, fundamental data for Computer Vision model training. There’s also precise information about distances, depth and materials, eliminating the need for human annotation.
Here's an example of depth metadata, taken from the video above:
The shades of gray represent the distance of each object from the camera: the darker, the closer. The sky has no data and is represented as pure black.
Notice that because we have very fine-grained information about the image, it's possible to distinguish individual leaves in the trees and different elements of the building facades, for example.
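For readers who want to experiment with a similar visualization, here is one possible way to turn a depth value into a gray level in Python. The post does not describe how the image above was actually encoded, so everything here is an assumption: the function name `depth_to_gray`, the 100 meter cap, and the choice to reserve level 0 for pixels without data so they cannot be confused with very close objects.

```python
def depth_to_gray(depth_m, max_depth_m=100.0):
    """Map a depth in meters to an 8-bit gray level (darker means closer).

    None stands for "no data" (for example, the sky) and maps to pure black.
    Valid depths use levels 1 to 255 so they never collide with "no data".
    """
    if depth_m is None:
        return 0
    # Clamp to the supported range before scaling.
    clamped = min(max(depth_m, 0.0), max_depth_m)
    return 1 + round(254 * clamped / max_depth_m)


# A made-up row of pixels: a close object, a distant building, and the sky.
row = [2.5, 2.5, 60.0, 60.0, None]
gray_row = [depth_to_gray(d) for d in row]
```

Since the simulation knows the exact distance for every pixel, this kind of image comes for free with each rendered frame.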
Another snapshot from the video showing the Semantic Segmentation:
Compared to the manual image annotation tool shown earlier in this blog, the difference in quality and precision is clear. Here, it’s possible to have pixel-level labeling for full semantic segmentation instead of just a bounding box for coarse object detection. There are many classes of segments, represented in the picture by the different colors. Notice the ability to correctly differentiate between cars and buses, streets and sidewalks. This is powerful metadata that can be leveraged by the model to predict the overlap between different objects in the scene much more accurately than is possible with manually annotated data.
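A tiny Python example shows why a pixel-level mask carries more information than a bounding box. The mask, the class ids and the helper names below are made up for illustration. Note that a bounding box can always be derived from a mask, but not the other way around.

```python
ROAD, CAR = 0, 1  # hypothetical class ids

# A toy 4x5 segmentation mask: each cell holds the class id of one pixel.
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]


def bounding_box(mask, class_id):
    """Return (x_min, y_min, x_max, y_max) around a class, or None if absent."""
    rows = [r for r, row in enumerate(mask) if class_id in row]
    cols = [c for row in mask for c, value in enumerate(row) if value == class_id]
    if not rows:
        return None
    return (min(cols), min(rows), max(cols), max(rows))


def pixel_count(mask, class_id):
    """Count how many pixels belong to a class."""
    return sum(row.count(class_id) for row in mask)


box = bounding_box(mask, CAR)
x_min, y_min, x_max, y_max = box
box_area = (x_max - x_min + 1) * (y_max - y_min + 1)
car_pixels = pixel_count(mask, CAR)
```

Here the box covers 9 pixels while only 5 of them actually belong to the object. The remaining pixels are background that a box-only annotation would attribute to the car, which is exactly the kind of error that matters when estimating whether two objects overlap.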
The next steps in our efforts are to start experimenting with the episode variation strategies: different cars, more near-misses, weather variation, etc. and generate a comprehensive dataset to be fed into the training pipeline for the computer vision model. We are also looking at new street intersections to be generated, as we scale out the project.
Initially, the models will target high accuracy on semantic segmentation. Eventually, trajectories will be detected and a complete “near miss” model will be developed to provide the automatic analytics that are the project’s goal.
Unity will be supporting Microsoft in the process of evaluating improvements to the computer vision models’ performance, tweaking the simulation as needed.
Based on research from different teams - including CVC itself - we expect that the approach of mixing real and simulated data will ensure the best results for the models. We’ll be posting concrete results once available.
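One simple way to mix the two sources is to control the share of simulated samples in each training batch. The Python sketch below is only an illustration of that idea; the `mixed_batch` function, the 25% ratio and the file names are assumptions, and the right ratio for this project is something the experiments will have to determine.

```python
import random


def mixed_batch(real, synthetic, batch_size, synthetic_fraction, rng):
    """Draw one training batch with a fixed share of synthetic samples."""
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be between 0 and 1")
    n_synthetic = round(batch_size * synthetic_fraction)
    n_real = batch_size - n_synthetic
    # Sample without replacement from each source, then interleave.
    batch = rng.sample(real, n_real) + rng.sample(synthetic, n_synthetic)
    rng.shuffle(batch)
    return batch


# Hypothetical file names standing in for annotated images.
real_images = [f"real_{i:03d}.png" for i in range(50)]
synthetic_images = [f"sim_{i:03d}.png" for i in range(50)]

rng = random.Random(0)
batch = mixed_batch(real_images, synthetic_images, batch_size=8,
                    synthetic_fraction=0.25, rng=rng)
```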
While the problem this project tackles is technically complex in itself, its end goal - to improve safety in our cities and save lives - works as a natural catalyst to get all these great teams together. Unity, CVC, Microsoft and the City of Bellevue - Industry, Academia and Government working towards a common goal.
It’s only natural for Unity to be in the middle of all this, empowering and enabling its partners. After all, it aligns perfectly with Unity’s core values: democratize development, solve hard problems, and enable success.
We have been collecting invaluable knowledge as the project evolves, and those learnings will get incorporated back into the engine so that everybody can benefit.
You can expect Simulation to get even easier and more powerful on Unity as we go, and we are glad that we can do our part to have a safer world in the process.
The project is a collaboration with:
Franz Loewenherz, principal transportation planner for the City of Bellevue and head of the Video Analytics initiative
Prof. Antonio M. López, Principal Investigator at the Computer Vision Center (CVC) and Associate Professor of the Computer Science Department, both from the Universitat Autònoma de Barcelona (UAB); as well as Dr. Jose A. Iglesias, Research Scientist at CVC.