Optimizing loading performance: Understanding the Async Upload Pipeline

JOSEPH SCHEINBERG / UNITY TECHNOLOGIES, Contributor

Oct 8, 2018|7 Min

Nobody likes loading screens. Did you know that you can quickly adjust Async Upload Pipeline (AUP) parameters to significantly improve your loading times? This article details how meshes and textures are loaded through the AUP. This understanding could help you speed up loading time significantly - some projects have seen over 2x performance improvements!

Read on to learn how the AUP works from a technical standpoint and what APIs you should be using to get the most out of it.

Try it Out

The latest and most optimized implementation of the Async Upload Pipeline is available in the 2018.3 beta.

First, let’s take a detailed look at when the AUP is used and how the loading process works.

When is the Async Upload Pipeline used?

Prior to 2018.3, the AUP only handled textures. Starting with 2018.3 beta, the AUP now loads textures and meshes, but there are some exceptions. Textures that are read/write enabled, or meshes that are read/write enabled or compressed, will not use the AUP. (Note that Texture Mipmap Streaming, which was introduced in 2018.2, also uses AUP.)
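The read/write rules above can be checked at runtime. The following is a minimal sketch, not an official Unity utility: the class name `AsyncUploadEligibility` and its methods are hypothetical, and it assumes a Unity version that exposes `isReadable` on both `Texture2D` and `Mesh`. Mesh compression is an import-time setting, so this sketch does not check it.

```csharp
using UnityEngine;

// Hypothetical helper (not part of Unity): flags assets that the rules above
// would exclude from the Async Upload Pipeline because they are read/write enabled.
public static class AsyncUploadEligibility
{
    // Read/write-enabled textures do not use the AUP.
    public static bool MayUseAsyncUpload(Texture2D texture)
    {
        return texture != null && !texture.isReadable;
    }

    // Read/write-enabled meshes do not use the AUP.
    // Mesh compression is an import setting and is NOT checked here.
    public static bool MayUseAsyncUpload(Mesh mesh)
    {
        return mesh != null && !mesh.isReadable;
    }
}
```

A `false` result means the asset is read/write enabled and will take a different loading path. A `true` result only means the read/write rule does not exclude it.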

How the loading process works

During the build process, the Texture or Mesh Object is written to a serialized file and the large binary data (texture or vertex data) is written to an accompanying .resS file. This layout applies to both player data and asset bundles. The separation of the object and binary data allows for faster loading of the serialized file (which will generally contain small objects), and it enables streamlined loading of the large binary data from the .resS file after. When the Texture or Mesh Object is deserialized, it submits a command to the AUP’s command queue. Once that command completes, the Texture or Mesh data has been uploaded to the GPU and the object can be integrated on the main thread.

During the upload process, the large binary data from the .resS file is read into a fixed-size ring buffer. Once in memory, the data is uploaded to the GPU in a time-sliced fashion on the render thread. The size of the ring buffer and the duration of the time slice are the two parameters that you can change to affect the behavior of the system.

The Async Upload Pipeline has the following process for each command:

1. Wait until the required memory is available in the ring buffer.

2. Read data from the source .resS file to the allocated memory.

3. Perform post-processing (texture decompression, mesh collision generation, per-platform fixup, etc.).

4. Upload in a time-sliced manner on the render thread.

5. Release Ring Buffer memory.

Multiple commands can be in progress simultaneously, but all must allocate their required memory out of the same shared ring buffer. When the ring buffer fills up, new commands will wait. This waiting does not block the main thread or affect frame rate; it simply slows the async loading process.
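To make the shared-buffer behavior concrete, here is a toy model in plain C#. It is a sketch under simplifying assumptions, not Unity's implementation: the class `UploadBufferModel` is hypothetical, and it only tracks how many megabytes are reserved rather than real read/write positions in a ring.

```csharp
using System;

// Toy model only: tracks how much of a shared, fixed-size buffer is reserved.
// It does not reproduce Unity's actual ring buffer implementation.
public sealed class UploadBufferModel
{
    private readonly int capacityMB;
    private int usedMB;

    public UploadBufferModel(int capacityMB)
    {
        if (capacityMB <= 0) throw new ArgumentOutOfRangeException(nameof(capacityMB));
        this.capacityMB = capacityMB;
    }

    public int FreeMB => capacityMB - usedMB;

    // Step 1: a command can start only if its memory fits.
    // A false result means the command has to wait.
    public bool TryReserve(int sizeMB)
    {
        if (sizeMB <= 0) throw new ArgumentOutOfRangeException(nameof(sizeMB));
        if (sizeMB > FreeMB) return false;
        usedMB += sizeMB;
        return true;
    }

    // Step 5: once its upload finishes, the command returns its memory.
    public void Release(int sizeMB)
    {
        if (sizeMB <= 0 || sizeMB > usedMB) throw new ArgumentOutOfRangeException(nameof(sizeMB));
        usedMB -= sizeMB;
    }
}
```

In this model, with a 4MB buffer, a 3MB command followed by a 2MB command leaves the second one waiting until the first releases its memory. That wait is the stall a larger buffer is meant to reduce.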

A summary of these impacts is as follows:

What public APIs are available to adjust loading parameters

To take full advantage of the AUP in 2018.3, there are three parameters that can be adjusted at runtime for this system:

QualitySettings.asyncUploadTimeSlice - The amount of time in milliseconds spent uploading textures and mesh data on the render thread for each frame. When an async load operation is in progress, the system will perform two time slices of this size. The default value is 2ms. If this value is too small, you could become bottlenecked on texture/mesh GPU uploading. A value too large, on the other hand, might result in framerate hitching.
QualitySettings.asyncUploadBufferSize - The size of the Ring Buffer in megabytes. When the upload time slice occurs each frame, we want to be sure that we have enough data in the ring buffer to utilize the entire time slice. If the ring buffer is too small, the upload time slice will be cut short. The default was 4MB in 2018.2 but has been increased to 16MB in 2018.3.
QualitySettings.asyncUploadPersistentBuffer - Introduced in 2018.3, this flag determines whether the upload ring buffer is deallocated when all pending reads are complete. Allocating and deallocating this buffer can often cause memory fragmentation, so it should generally be left at its default (true). If you really need to reclaim memory when you are not loading, you can set this value to false.

These settings can be adjusted through the scripting API or via the QualitySettings menu.
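As a minimal sketch of the scripting route, the component below applies the three settings once at startup. The component name `AsyncUploadSettings` and its serialized field names are hypothetical. The initial values simply mirror the defaults quoted above and are not tuning advice.

```csharp
using UnityEngine;

// Hypothetical component: applies the three AUP settings when it wakes up.
public class AsyncUploadSettings : MonoBehaviour
{
    [SerializeField] private int timeSliceMs = 2;           // default quoted in the article
    [SerializeField] private int bufferSizeMB = 16;         // 2018.3 default quoted in the article
    [SerializeField] private bool persistentBuffer = true;  // default quoted in the article

    private void Awake()
    {
        QualitySettings.asyncUploadTimeSlice = timeSliceMs;
        QualitySettings.asyncUploadBufferSize = bufferSizeMB;
        QualitySettings.asyncUploadPersistentBuffer = persistentBuffer;
    }
}
```

Exposing the values as serialized fields makes it easy to try different combinations from the Inspector while profiling.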

Image of Async Upload Persistent Buffer selected

Example workflow

Let’s examine a workload with lots of textures and meshes being uploaded through the Async Upload Pipeline using the default 2ms time slice and a 4MB ring buffer. Since we’re loading, we get 2 time-slices per render frame, so we should have 4 milliseconds of upload time. Looking at the profiler data, we only use about 1.5 milliseconds. We can also see that immediately after the upload, a new read operation is issued now that memory is available in the ring buffer. This is a sign that a larger ring buffer is needed.
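The arithmetic in this example can be written out as a small helper. This is an illustrative sketch in plain C#: the class `UploadSliceMath` is hypothetical, and the measured upload time still has to come from your own profiler capture.

```csharp
using System;

// Hypothetical helper for reasoning about time-slice utilization.
public static class UploadSliceMath
{
    // Upload time available per render frame: one slice normally,
    // two slices while an async load operation is in progress.
    public static double BudgetMs(double timeSliceMs, bool asyncLoadInProgress)
    {
        int slices = asyncLoadInProgress ? 2 : 1;
        return timeSliceMs * slices;
    }

    // Fraction of the per-frame upload budget that was actually used.
    public static double Utilization(double usedMs, double budgetMs)
    {
        if (budgetMs <= 0) throw new ArgumentOutOfRangeException(nameof(budgetMs));
        return usedMs / budgetMs;
    }
}
```

With the numbers above, `BudgetMs(2, true)` gives 4 milliseconds, and `Utilization(1.5, 4)` gives 0.375. In other words, only about 37.5% of the available upload time is being used.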

Let’s try increasing the Ring Buffer. Since we’re in a loading screen, it is also a good idea to increase the upload time slice. Here’s what a 16MB Ring Buffer and 4-millisecond time slice look like:

Image of what a 16MB Ring Buffer and 4-millisecond time slice look like:

Now we can see that we are spending almost all our render thread time uploading, and just a short time between uploads rendering the frame.

Below are the loading times of the sample workload with a variety of upload time slices and Ring Buffer sizes. Tests were run on a MacBook Pro with a 2.8GHz Intel Core i7 running OS X El Capitan. Upload and I/O speeds will vary across platforms and devices. The workload is a subset of the Viking Village sample project that we use internally for performance testing. Because other objects are being loaded as well, we can’t isolate the precise performance gain from each set of values. It’s safe to say in this case, however, that texture and mesh loading is at least twice as fast when switching from the 4MB/2ms settings to the 16MB/4ms settings.

Experimenting with these parameters produced the following results.

To optimize loading times for this particular sample project, we should, therefore, configure settings like this:

QualitySettings.asyncUploadTimeSlice = 4;
QualitySettings.asyncUploadBufferSize = 16;
QualitySettings.asyncUploadPersistentBuffer = true;

Takeaways and recommendations

General recommendations for optimizing loading speed of textures and meshes:

Choose the largest QualitySettings.asyncUploadTimeSlice that doesn’t result in dropping frames.
During loading screens, temporarily increase QualitySettings.asyncUploadTimeSlice.
Use the profiler to examine the time slice utilization. The time slice will show up as AsyncUploadManager.AsyncResourceUpload in the profiler. Increase QualitySettings.asyncUploadBufferSize if your time slice is not being fully utilized.
Things will generally load faster with a larger QualitySettings.asyncUploadBufferSize, so if you can afford the memory, increase it to 16MB or 32MB.
Leave QualitySettings.asyncUploadPersistentBuffer set to true unless you have a compelling reason to reduce your runtime memory usage while not loading.
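One way to follow the loading-screen recommendation is to raise the time slice only while a load is in flight and restore it afterwards. The sketch below makes several assumptions: the component name, the scene name "Level1", and the 4ms value are all placeholders. It loads additively so that the object running the coroutine survives long enough to restore the previous value.

```csharp
using System.Collections;
using UnityEngine;
using UnityEngine.SceneManagement;

// Hypothetical component: temporarily raises the upload time slice during a load.
public class LoadingScreenUploadBoost : MonoBehaviour
{
    [SerializeField] private string sceneName = "Level1";  // placeholder scene name
    [SerializeField] private int loadingTimeSliceMs = 4;   // placeholder value; tune with the profiler

    private void Start()
    {
        StartCoroutine(LoadWithBoost());
    }

    private IEnumerator LoadWithBoost()
    {
        // Remember the gameplay value so it can be restored after loading.
        int previousTimeSliceMs = QualitySettings.asyncUploadTimeSlice;
        QualitySettings.asyncUploadTimeSlice = loadingTimeSliceMs;

        // Additive load keeps this object alive, so the restore below still runs.
        AsyncOperation load = SceneManager.LoadSceneAsync(sceneName, LoadSceneMode.Additive);
        while (!load.isDone)
        {
            yield return null;
        }

        QualitySettings.asyncUploadTimeSlice = previousTimeSliceMs;
    }
}
```

If you load in single mode instead, the object hosting the coroutine needs to persist across the scene change, or the restore step will never run.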

FAQ

Q: How often will time-sliced uploading occur on the render thread?

Time-sliced uploading will occur once per render frame, or twice during an async load operation. VSync affects this pipeline: while the render thread is waiting for a VSync, you could be uploading. If you are running at 16ms frames and one frame goes long, say 17ms, you will end up waiting 15ms for the next VSync. In general, the higher the frame rate, the more frequently upload time slices will occur.
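The 15ms figure comes from rounding the frame time up to the next VSync boundary. Here is a short sketch of that arithmetic in plain C#. The helper `VSyncMath` is hypothetical, and it assumes frames are presented only on exact multiples of the VSync interval.

```csharp
using System;

// Hypothetical helper: how long the render thread idles before the next VSync,
// assuming presentation happens only on multiples of the VSync interval.
public static class VSyncMath
{
    public static double WaitForVSyncMs(double frameTimeMs, double vsyncIntervalMs)
    {
        if (frameTimeMs < 0 || vsyncIntervalMs <= 0) throw new ArgumentOutOfRangeException();

        // Number of whole VSync intervals needed to cover the frame.
        double intervals = Math.Ceiling(frameTimeMs / vsyncIntervalMs);
        return intervals * vsyncIntervalMs - frameTimeMs;
    }
}
```

For a 17ms frame with a 16ms interval, the frame misses the first boundary and waits for the one at 32ms, so `WaitForVSyncMs(17, 16)` returns 15.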

Q: What is loaded through the AUP?

Textures that are not read/write-enabled are uploaded through the AUP.
As of 2018.2, texture mipmaps are streamed through the AUP.
As of 2018.3, meshes are also uploaded through the AUP so long as they are uncompressed and not read/write enabled.

Q: What if the ring buffer is not large enough to hold the data being uploaded (for example, a really large texture)?

Upload commands that are larger than the ring buffer will wait until the ring buffer is fully consumed, then the ring buffer will be reallocated to fit the large allocation. Once the upload is complete, the ring buffer will be reallocated to its original size.

Q: How do synchronous load APIs work? For example, Resources.Load, AssetBundle.LoadAsset, etc.

Synchronous loading calls use the AUP and will essentially block the main thread until the async upload operation completes. The type of loading API used is not relevant.
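For comparison, here is a sketch of the same texture loaded both ways through the Resources API. The component name and the Resources path "Textures/Example" are placeholders. The blocking variant returns only once the asset is ready, while the coroutine variant lets frames keep rendering in the meantime.

```csharp
using System;
using System.Collections;
using UnityEngine;

// Hypothetical component contrasting blocking and non-blocking loads.
public class LoadStyles : MonoBehaviour
{
    // Placeholder path, relative to a Resources folder.
    private const string TexturePath = "Textures/Example";

    // Synchronous: the main thread is blocked until the load completes.
    public Texture2D LoadBlocking()
    {
        return Resources.Load<Texture2D>(TexturePath);
    }

    // Asynchronous: the coroutine resumes once the request has finished.
    public IEnumerator LoadNonBlocking(Action<Texture2D> onLoaded)
    {
        ResourceRequest request = Resources.LoadAsync<Texture2D>(TexturePath);
        yield return request;
        onLoaded(request.asset as Texture2D);
    }
}
```

A caller would start the second variant with `StartCoroutine(LoadNonBlocking(texture => { /* use texture */ }))`.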

Tell us what you think

We’re always looking for feedback. Let us know what you think in the comments or on the Unity 2018.3 beta forum!