Real-time style transfer in Unity using deep neural networks

Deep learning now powers numerous AI technologies in daily life, and convolutional neural networks (CNNs) can apply complex treatments to images at high speed. At Unity, we aim to offer seamless integration of CNN inference into the 3D rendering pipeline. To that end, Unity Labs works on advancing state-of-the-art research and on developing an efficient neural inference engine called Barracuda. In this post, we experiment with a challenging use case: multi-style in-game style transfer.

Deep learning has long been confined to supercomputers and offline computation, but real-time use on consumer hardware is fast approaching thanks to ever-increasing compute capability. With Barracuda, Unity Labs hopes to accelerate its arrival in creators’ hands. While neural networks are already being used for game AI thanks to ML-Agents, many rendering applications have yet to be demonstrated in real-time game engines, for example deep-learned supersampling, ambient occlusion, global illumination, and style transfer. We chose style transfer to demonstrate the full pipeline, from training the network to integrating it into Unity’s rendering loop.

Neural style transfer

Style transfer is the process of transferring the style of one image onto the content of another. A well-known example is transferring the style of a famous painting onto a photograph. Since 2015, the quality of results has improved dramatically thanks to the use of convolutional neural networks (CNNs). More recently, the research community has put considerable effort into training CNNs to perform this task in a single pass: the network takes an image as input and outputs a stylized version of it in less than a second (on GPU). In this work, we use a small version of such a network, which we train for multi-style transfer. We then plug it into the Unity rendering pipeline so that it takes the framebuffer as input and transforms it into its stylized version in real time.

The result is real-time style transfer in your game. Here we see the visuals from the Book of the Dead environment, stylized by the neural network in real time at 30 FPS on current high-end PC hardware, with on-the-fly style switching.

Realtime 30 FPS Style Transfer in Full HD powered by Barracuda
Book of the Dead demo

Training a deep convolutional neural network

To start, we chose the state-of-the-art fast style transfer network from Ghiasi and colleagues. This network has two parts:

1) from a style image, it estimates a compact representation of style using a neural network, and 

2) it injects this compact representation into the actual style transfer network that transforms an input image into a stylized image. This way, one can change the style image at runtime, and the style transfer adapts.

The network is composed of two parts: the Style inference network deduces a compact representation of style from style images, while the Style Transfer Network uses this representation to transfer style onto its input image.
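One common way to inject a compact style vector into a transfer network, and the mechanism described in Ghiasi et al.'s paper, is conditional instance normalization: each feature channel is normalized per image, then scaled and shifted by two values predicted from the style. The sketch below illustrates that idea on a single channel. Whether this demo's network uses exactly this form is an assumption, and the function name and parameters are hypothetical.

```python
from statistics import fmean, pvariance


def conditional_instance_norm(channel, gamma, beta, eps=1e-5):
    """Normalize one feature channel of one image, then apply a style-dependent
    scale (gamma) and shift (beta).

    `channel` is a flat list of activations for a single channel.
    `gamma` and `beta` are assumed to come from the style representation.
    """
    mean = fmean(channel)
    var = pvariance(channel, mu=mean)
    inv_std = (var + eps) ** -0.5
    return [gamma * (value - mean) * inv_std + beta for value in channel]


# Two different "styles" applied to the same content activations.
content = [0.0, 1.0, 2.0, 3.0]
style_a = conditional_instance_norm(content, gamma=2.0, beta=0.5)
style_b = conditional_instance_norm(content, gamma=0.5, beta=-1.0)
```

Because only gamma and beta depend on the style, switching styles at runtime amounts to swapping a small set of numbers rather than reloading the transfer network.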

Our Style Transfer Network is composed of two downsampling layers and two symmetric upsampling layers, with five residual blocks in between.

Once the architecture is chosen, we first pre-train the full network offline (once trained, it will be used at runtime). To this end, we use a custom dataset of “content” images taken from videos and computer animation movies, and “style” images taken from a database of approximately 80k paintings. The neural network’s weights are optimized so that, given a style image and a content image, the output image is faithful to the style while keeping the content recognizable.

In addition, a network trained only on still images can stylize consecutive frames quite differently, causing heavy flickering artifacts. So we need to train the network to handle the time dimension. In practice, this is an extra training objective that forces two consecutive frames to be stylized similarly (after applying displacement vectors).
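The temporal objective can be sketched as follows: look up where each pixel of the current frame was in the previous frame, then penalize the difference between the two stylized values. This is a minimal illustration on single-channel images with integer displacements, not the training code used for the demo. The function name and the convention that `motion[y][x] = (dx, dy)` points from the previous position to the current one are assumptions.

```python
def temporal_loss(stylized_prev, stylized_curr, motion):
    """Mean squared difference between the current stylized frame and the
    previous stylized frame, compared along per-pixel displacement vectors.

    Images are H x W lists of floats. motion[y][x] = (dx, dy) means the pixel
    now at (x, y) was at (x - dx, y - dy) in the previous frame (assumed
    convention). Pixels whose source falls outside the image are skipped.
    """
    height, width = len(stylized_curr), len(stylized_curr[0])
    total, count = 0.0, 0
    for y in range(height):
        for x in range(width):
            dx, dy = motion[y][x]
            px, py = x - dx, y - dy
            if 0 <= px < width and 0 <= py < height:
                diff = stylized_curr[y][x] - stylized_prev[py][px]
                total += diff * diff
                count += 1
    return total / count if count else 0.0


prev = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]
# The same content shifted one pixel to the right (first column is new).
curr = [[0.0, 1.0, 2.0],
        [9.0, 4.0, 5.0]]
shift_right = [[(1, 0)] * 3 for _ in range(2)]
no_motion = [[(0, 0)] * 3 for _ in range(2)]

consistent = temporal_loss(prev, curr, shift_right)  # frames agree along the motion
flicker = temporal_loss(prev, curr, no_motion)       # ignoring motion shows a mismatch
```

During training, a term like this would be weighted against the style and content terms, which is where the balancing described next comes in.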

The balance between these different constraints is a delicate one, and this process requires quite some trial and error.

Training takes around 2-3 days using the TensorFlow library with the CUDA/cuDNN backend on a single NVIDIA RTX 2080 GPU. After training, the network architecture and its trained parameters are saved to disk, ready to be loaded into Unity for runtime use.

Unity integration using Barracuda

With Barracuda, Unity Labs has built a cross-platform neural network inference engine for Unity. Neural networks, pre-trained in the library of your choice and saved to disk, can be imported and run in Unity via Barracuda. Its documentation goes into detail, including how to prepare a network trained in PyTorch or TensorFlow. Barracuda is powered by Unity’s multi-platform design and runs on CPU or GPU. CPU inference is supported on all Unity platforms, while GPU inference requires Unity compute shader support and currently runs on almost every platform except WebGL.

Barracuda's recommended import path is via ONNX, an open format that most deep learning libraries can export to. For the user, importing is as simple as dragging and dropping the file into your Unity project. The asset inspector then gives you information such as the inputs, outputs, and layers of the network. Here is an example:

It then becomes a matter of supplying the inputs (content image and style image) to the network and displaying the stylized output. Within Unity, it’s as simple as creating a custom post-processing script that loads the neural network with Barracuda, then each frame gets the camera’s rendered image, runs the network on that input, and copies the output to the screen.

As a result, we now have a full rendering pipeline in which the usual rendering process writes into the framebuffer, which is subsequently transformed by the neural network, inferred in Barracuda:

Style Transferred Rendering is a two-stage process: the Rendering stage computes the usual game images, while the Post-process stage style transfers it into a stylized game depending on the provided style.

Visual results & performance

We showcase real-time style transfer on the beautiful and complex Book of the Dead scene. The 3D rendering stage and especially the neural network inference (i.e., post-process) stage are very computationally intensive, so our demo requires high-end hardware. Using an NVIDIA RTX 2080 GPU at 1080p resolution, the total time spent per frame is 23 ms (6-9 ms for the rendering stage and 14 ms for the neural network inference stage). With an AMD Radeon RX Vega 64, the total time spent per frame is 28 ms (7-10 ms to render the scene and 18 ms for inference). In both cases, the demo runs at a solid 30 fps. These numbers include optimizations made both to the network and to Barracuda; more on this below.

As seen in the recording above, the viewer can do everything they usually can in Book of the Dead: navigate freely and enjoy the complex and beautiful vegetation. But now the viewer can also apply a style of their choice, for example a Picasso painting. The game is then stylized according to the requested style, in real time.

Note that the part of the neural network that infers the compact style representation runs only once, when the style changes. The representation can even be loaded from disk, so there is no lag when changing styles.

The current version of the neural network handles a wide variety of styles. Still, improving the quality of style transfer and the variety of handled styles, while remaining in the scope of efficient networks usable in real-time, is an open research question.

Optimizing performance for PS4 Pro

Because Barracuda is multi-platform by design, we can switch to the PS4 Pro to showcase style transfer without any modifications to the code or network. However, this hardware target has far less computing power to dedicate to inference than our RTX 2080. We therefore start by switching to the classic Unity Vikings Village scene to reduce the time spent on the 3D rendering stage.

Stylized Vikings Village scene, inset shows the applied style

With this cheaper scene, the stylized render initially took around 166 ms (10 ms for 3D rendering at 1080p and 156 ms for neural network inference at 720p). Furthermore, raising the inference resolution to 1080p made the demo run out of memory. We thus needed a significant speed-up and memory reduction to run at 30 fps at full 1080p resolution. As a proof of concept, we optimized this demo in three ways to reach 28 ms per frame at 1080p: Barracuda GPU-level optimization, a smaller (and thus faster) neural network, and screen-space temporalization. These optimizations also apply on PC and helped reach the timings reported above; however, screen-space temporalization is not needed to run at 30 fps on recent GPUs.

Barracuda GPU level optimization

In terms of performance, the style transfer network in this experiment consists mainly of convolutions, instance normalizations, and ReLU activations. The runtime part of the network also has two notable characteristics: it runs at a high overall resolution (the residual blocks run at 480x270), and its input and output are 1920x1080 with a channel count of 3 (for RGB).
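The 480x270 figure follows from the architecture if each of the two downsampling layers halves the resolution on both axes. That stride-2 behavior is an assumption, although it is consistent with the numbers quoted here. A quick check:

```python
def downsampled_size(width, height, num_stride2_layers):
    """Spatial size after a number of stride-2 layers (each halves both axes)."""
    factor = 2 ** num_stride2_layers
    return width // factor, height // factor


full_w, full_h = 1920, 1080
res_w, res_h = downsampled_size(full_w, full_h, 2)

# How many times fewer pixels the residual blocks process than the input.
pixel_ratio = (full_w * full_h) // (res_w * res_h)

print(res_w, res_h, pixel_ratio)  # 480 270 16
```

Even with 16 times fewer pixels than the input, the residual blocks still process about 130,000 positions per channel, which is why their kernels dominate the inference cost.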

We will further discuss Barracuda level optimization in a future blog post. However, here is an overview:

  • Memory layout was changed from channel last to channel first, increasing memory coherency.
  • ReLUs were fused inside other operators where possible.
  • New Convolution kernels were written to cover both the up/down sampling case and the residual case.
  • Instance normalization kernel was rewritten.
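To make the memory layout point concrete, the sketch below computes where an activation lives in a flat buffer under each layout. It only illustrates the indexing, not Barracuda's kernels. The helper names are hypothetical and the tensor size is an assumed example (480x270 with 32 channels).

```python
def offset_channel_last(y, x, c, height, width, channels):
    """Flat index in a channel-last (H, W, C) buffer."""
    return (y * width + x) * channels + c


def offset_channel_first(y, x, c, height, width, channels):
    """Flat index in a channel-first (C, H, W) buffer."""
    return (c * height + y) * width + x


H, W, C = 270, 480, 32  # assumed example size

# Distance in memory between two horizontally adjacent pixels of one channel.
step_last = (offset_channel_last(0, 1, 0, H, W, C)
             - offset_channel_last(0, 0, 0, H, W, C))
step_first = (offset_channel_first(0, 1, 0, H, W, C)
              - offset_channel_first(0, 0, 0, H, W, C))

print(step_last, step_first)  # 32 1
```

With channel first, neighboring pixels of the same channel are contiguous, so a kernel that walks spatially through one channel, as instance normalization does when computing per-channel statistics, reads consecutive addresses.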

On PS4 Pro, these optimizations brought inference from 156 ms down to 70 ms on the reference network at 720p.

Reduce neural network size

The neural network’s architecture was designed to handle arbitrary styles on any scene. We profiled the time spent on each layer of the CNN (on PC one can simply use the Unity GPU profiler; on PS4 we used Sony's dedicated profiling tool) and conducted several experiments to assess quality versus speed. In the end, we optimized the network in two ways:

  • For up and downsampling, the number of convolutions has been reduced (from 3 to 2) and channel count is kept small when data is at higher resolutions.
  • Channel count of the network was reduced from 48 to 32 channels.
For speedup, we improved up and down sampling and reduced CNN filters from 48 to 32 channels.

The reduced neural network now runs in 56 ms at 1080p resolution (instead of 70 ms at 720p) on PS4 Pro.
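A rough cost model helps explain why these changes pay off. The multiply-accumulate count of a convolution grows with the product of its input and output channel counts, so reducing both from 48 to 32 cuts the work of such a layer by more than half. The sketch below assumes 3x3 kernels and a 480x270 feature map. It also assumes that inference time is proportional to pixel count and that 720p means 1280x720, in order to compare the two timings above.

```python
def conv_macs(height, width, c_in, c_out, kernel=3):
    """Multiply-accumulates for one convolution layer (assumed cost model)."""
    return height * width * c_in * c_out * kernel * kernel


before = conv_macs(270, 480, 48, 48)
after = conv_macs(270, 480, 32, 32)
mac_ratio = after / before  # (32 / 48) ** 2, about 0.44

# 1080p has 2.25x the pixels of 720p, so 70 ms at 720p would correspond to
# roughly 157.5 ms at 1080p if time scaled linearly with pixel count.
pixel_ratio = (1920 * 1080) / (1280 * 720)
per_pixel_speedup = (70.0 * pixel_ratio) / 56.0  # about 2.8x

print(round(mac_ratio, 3), pixel_ratio, round(per_pixel_speedup, 2))
```

Under these assumptions, the smaller network is about 2.8 times faster per pixel, which is more than the channel reduction alone would predict and is consistent with the up and downsampling layers also having been trimmed.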

Temporal Upsampling

An obvious way to further reduce time spent on inference is to reduce the rendering resolution, as the network’s complexity scales directly with it. However, this is a compromise we cannot make as stylized results tend to look blurry at low resolutions, losing a lot of the scene’s detail. Let’s instead look at applying another trick out of the game dev handbook. We can take advantage of the fact that our style transfer demo is fully integrated as a regular post-effect in Unity, much in the same way as e.g. ambient occlusion methods. This allows us to apply computer graphics techniques to a deep neural network, as game engines like Unity give us much more information each frame than just the final render.

Current games often use temporalization schemes, such as temporal anti-aliasing, to improve either the quality or the performance of an expensive screen-space effect. The idea is to reuse information from previously rendered frames to improve or complete the current one, taking advantage of the coherency between consecutive frames. Conveniently, network inference in Barracuda can be scheduled manually, layer by layer, so we can divide the full inference into equal time shares and stylize an image over several frames.
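The idea of spreading one inference over several frames can be sketched with a generator that runs a few layers per call. This is a language-neutral illustration, not Barracuda's API. The function name is hypothetical, and it splits by layer count, whereas equal time shares would need per-layer timings.

```python
def run_in_slices(layers, x, num_frames):
    """Run `layers` on `x` a slice at a time.

    Each step of the generator executes one slice and yields None while the
    inference is still in progress, then yields the final output.
    """
    if not layers:
        yield x
        return
    per_frame = -(-len(layers) // num_frames)  # ceiling division
    for start in range(0, len(layers), per_frame):
        for layer in layers[start:start + per_frame]:
            x = layer(x)
        finished = start + per_frame >= len(layers)
        yield x if finished else None


# Eight stand-in "layers" spread over four game frames.
layers = [lambda value: value + 1] * 8
per_frame_results = list(run_in_slices(layers, 0, num_frames=4))
print(per_frame_results)  # [None, None, None, 8]
```

In a game loop, each rendered frame would advance the generator by one step. Frames that receive None have no fresh stylized image and must display something else in the meantime.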

To display intermediate frames while the next stylized frame is being computed by Barracuda, we use reprojection, much like other temporal methods in computer graphics. Specifically, we apply Image-space Bidirectional Scene Reprojection (Yang et al., 2011) to generate high-quality intermediate frames between consecutive network output frames, with as few disocclusion errors as possible.

We apply this to compute stylization over four frames, which brings us into the 30 FPS frame budget on PS4 Pro: 14ms per frame for sliced inference + 4ms reprojection overhead + 10ms for scene rendering = 28ms total. And here is the final result captured running on the console!
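The budget arithmetic can be checked directly from the timings quoted in this post: 56 ms of inference, 10 ms of scene rendering, and 4 ms of reprojection overhead when slicing is enabled. The helper name is hypothetical.

```python
BUDGET_30FPS_MS = 1000.0 / 30.0  # about 33.3 ms per frame


def frame_time_ms(inference_ms, num_frames, render_ms, reprojection_ms):
    """Per-frame cost when one inference is spread evenly over `num_frames`."""
    return inference_ms / num_frames + reprojection_ms + render_ms


# Without slicing there is no reprojection, but inference runs every frame.
unsliced = frame_time_ms(56.0, 1, render_ms=10.0, reprojection_ms=0.0)
# Sliced over four frames, as in the demo.
sliced = frame_time_ms(56.0, 4, render_ms=10.0, reprojection_ms=4.0)

print(unsliced, sliced)  # 66.0 28.0
```

At 66 ms per frame the unsliced version would run at about 15 fps, while the sliced version fits within the 33.3 ms budget with roughly 5 ms to spare.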

Using this temporalization scheme on style transfer does however present issues. For example, style transfer alters the shape of objects at their boundaries and adds halos around them, invalidating the depth and motion vectors around edges. This creates ghosting in the reprojected intermediate frames. We fixed this in this demo by fetching the motion vector of the minimum depth in the neighborhood of each pixel. This makes the halos stick to the objects they’re created by, reducing the artifact, but not eliminating it completely.
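The fix can be sketched as a neighborhood search: for each pixel, use the motion vector of the closest surface nearby instead of its own. The version below is a minimal illustration on small 2D lists. The function name and the 3x3 neighborhood are assumptions, and it treats smaller depth values as closer to the camera, which would be inverted with a reversed depth buffer.

```python
def closest_depth_motion(depth, motion, y, x, radius=1):
    """Return the motion vector of the pixel with minimum depth in the
    (2 * radius + 1)-wide neighborhood of (y, x), clamped to the image."""
    height, width = len(depth), len(depth[0])
    best_depth, best_motion = depth[y][x], motion[y][x]
    for ny in range(max(0, y - radius), min(height, y + radius + 1)):
        for nx in range(max(0, x - radius), min(width, x + radius + 1)):
            if depth[ny][nx] < best_depth:
                best_depth, best_motion = depth[ny][nx], motion[ny][nx]
    return best_motion


# A single near, moving foreground pixel on a far, static background.
depth = [[9.0, 9.0, 9.0],
         [9.0, 9.0, 1.0],
         [9.0, 9.0, 9.0]]
motion = [[(0, 0), (0, 0), (0, 0)],
          [(0, 0), (0, 0), (3, 0)],
          [(0, 0), (0, 0), (0, 0)]]

near_edge = closest_depth_motion(depth, motion, 1, 1)  # adopts the foreground motion
far_away = closest_depth_motion(depth, motion, 1, 0)   # keeps the background motion
```

Background pixels next to an object therefore move with that object, which keeps a halo painted around it attached during reprojection.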

The future: CNNs in the rendering loop

In the previous sections, we took advantage of the integration in Unity. This allowed us to plug and play the network as a post-effect in a rendering pipeline, to the benefit of temporalization. We could go further: one can imagine applying a neural network that takes multiple G-Buffers as input in the deferred pipeline, for tasks like denoising, texture hallucination, antialiasing, or global lighting.

We also saw how mixing CNNs with computer graphics techniques can present challenges. In our case, style transfer alters shapes, making reprojection error-prone and expensive. A better solution could be woven into the network training itself, using an improved network designed with the game engine's constraints in mind. These kinds of issues are at the forefront of the new intersection between real-time graphics and deep learning, which you and Unity Labs can now fully invest in researching thanks to Barracuda.

We have used this demo to drive both our research on neural texture synthesis and style transfer and the development of Barracuda. Barracuda is available now, but features and optimizations are still actively being developed. For example, neural networks containing non-standard layers are unlikely to be supported in the current version. Let us know what you think on the forum.

Download the sample project

A Unity sample project showcasing the runtime style transfer model from the demo above is available on GitHub. It allows you to choose your style on the fly. This sample is meant for experimental purposes only. You may, for example, want to plug it into your own scene. We provide it as is, but feel free to explore it, play with it, or break it!
