
Improving job system performance scaling in 2022.2 – part 2: Overhead

March 14, 2023 in Engine & platform | 20 min. read


The 2022.2 and 2021.3.14f1 releases have improved the scheduling cost and performance scaling of the Unity job system. In part one of this two-part article on what’s new with job systems, I offered some background information on parallel programming and why you might use a job system. For part two, let’s dive deeper into what job system overhead is and Unity’s approach to mitigating it.

Job system overhead

Overhead means any time the CPU spends not running your job, from the moment you begin to schedule it until the moment it finishes, unblocking any waiting jobs. Broadly, there are two areas where time is spent:

  1. The C# Job API layer
  2. The native job scheduler (which manages and runs all scheduled C# and, internally, C++ jobs)

C# Job API overhead

The C# Job API’s purpose is to provide a safe means of accessing the native job system. While this is a binding layer for the C# to C++ transition, it’s also the layer that prevents you from accidentally scheduling C# jobs that would run into race conditions or deadlocks when accessing NativeContainers from within a job.

In addition, this separation provides a richer way of creating jobs themselves. At the C++ layer, jobs are just a pointer to some data and a function pointer. But with the C# API on top, you can customize the types of jobs you schedule, allowing for better control over how job data should be split up and parallelized to fit user-specific use cases.
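As a rough illustration of what this layer looks like to a user, here is a minimal sketch of a C# job (the struct, field names, and values are placeholders for this article, not taken from the original post):

using Unity.Collections;
using Unity.Jobs;
using UnityEngine;

struct AddJob : IJob
{
    public NativeArray<float> Values;
    public float Amount;

    // Runs on a worker thread once the scheduler picks the job up
    public void Execute()
    {
        for (int i = 0; i < Values.Length; i++)
            Values[i] += Amount;
    }
}

class AddBehaviour : MonoBehaviour
{
    void Update()
    {
        var values = new NativeArray<float>(1024, Allocator.TempJob);

        // Schedule() copies the job struct and returns a handle for tracking it
        JobHandle handle = new AddJob { Values = values, Amount = 1.0f }.Schedule();

        // ... schedule other work, then wait before reading the results
        handle.Complete();
        values.Dispose();
    }
}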

When scheduling a job, the C# job binding layer copies the job struct into an unmanaged memory allocation. This allows the lifetime of the C# job struct to be disconnected from the job’s lifetime in the job system, since the latter is affected by the job’s dependencies and the overall load on the platform. The job system then conditionally performs safety checks in Editor playmode builds to ensure a job is safe to run.

These steps are important, but they are not free and contribute to job system overhead. Since job size can vary, as can the number of NativeContainers and dependencies a job might have, the cost to copy jobs and validate their safety is not fixed. Because of this, it’s important that Unity keeps these costs small and constrained to linear computational complexity.

In the 2021.2 Tech Stream, the engineering team made significant improvements to the job safety system by caching the safety check result for individual job handles. This is particularly important, since the safety system needs to understand entire chains of job dependencies, along with every native memory reference each job contains, in order to work out which jobs may be missing dependency information and which job a dependency should be added to. This can result in a non-linear number of items to iterate over when scheduling (i.e., for each job and its dependencies, check the read/write access of each NativeContainer the job refers to, and of any other job referring to those NativeContainers).

However, Unity can take advantage of the fact that C# jobs are only scheduled one at a time, and check safety during this scheduling. Instead of rescanning all jobs on every schedule, we can quickly determine whether revalidating job dependency chains is necessary, allowing large amounts of work to be skipped. Even for small job dependency chains, this dramatically reduces the cost of job safety checks. Ideally, there should be no reason to turn job safety checks off while developing (they are already off in player/shipping builds).
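In practice, the way to keep the safety system satisfied (and to avoid hard sync points) is to pass dependencies explicitly when scheduling. A small sketch on the main thread, reusing the hypothetical AddJob from above:

var values = new NativeArray<float>(1024, Allocator.TempJob);

// Both jobs write to the same NativeArray. Without the explicit dependency,
// the safety checks in the Editor would throw when the second job is
// scheduled, since the two jobs could otherwise race on 'values'.
JobHandle first = new AddJob { Values = values, Amount = 1.0f }.Schedule();
JobHandle second = new AddJob { Values = values, Amount = 2.0f }.Schedule(first);

second.Complete();
values.Dispose();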

Job scheduler

Whenever a C# or C++ job is scheduled for execution, it goes through the job scheduler. The scheduler’s role is to:

  • Track jobs via job handles
  • Manage job dependencies, ensuring jobs only start executing once all dependencies have completed
  • Manage “worker threads,” which are the threads that will execute jobs
  • Ensure jobs are executed as quickly as possible – usually meaning they should run in parallel when dependencies allow

Additionally, while the C# Job API only allows jobs to be scheduled from the main thread, the job scheduler needs to support multiple threads scheduling jobs at the same time. This is because the underlying Unity engine uses many threads which schedule jobs and can even schedule jobs from within jobs. This functionality has pros and cons, but requires much more scrutiny for correctness and adds the requirement that the job scheduler must be thread safe.

In the 2017.3 release, the basic look of the job scheduler was:

  • Queue for jobs
  • Stack for jobs
  • Semaphore
  • Array of worker threads

The typical usage follows this pattern: As jobs are scheduled, they are enqueued into a global, lock-free, multiple-producer, multiple-consumer queue, which represents jobs that are ready for handling by a worker thread. The main thread then signals using a semaphore to wake up worker threads.

The number of workers told to wake up depends on the job type being scheduled – single jobs such as IJob only wake a single worker, since that job type doesn’t spread work across multiple worker threads. IJobParallelFor jobs, however, represent multiple pieces of work that can be run in parallel. While one job is scheduled, there might be many pieces for some or all workers to help with at the same time. As such, the scheduler figures out how many workers can potentially help and wakes that number up.
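For reference, a parallel job of this kind might look like the following sketch (again a hypothetical struct, assuming the same usings and a NativeArray<float> named values as in the earlier example):

struct ScaleJob : IJobParallelFor
{
    public NativeArray<float> Values;

    // Called once per index; the scheduler splits the index range into
    // batches that different workers can pick up in parallel.
    public void Execute(int index)
    {
        Values[index] *= 2.0f;
    }
}

// Schedule one job covering values.Length iterations, handed out to
// workers in batches of 64 indices at a time.
JobHandle handle = new ScaleJob { Values = values }.Schedule(values.Length, 64);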

Once awake, worker threads are where the actual job work happens. In 2017.3, each worker was responsible for dequeuing a job from the job queue and ensuring all of that job’s dependencies were complete. If they weren’t complete yet, the job and its incomplete dependencies were pushed onto a lock-free stack as a way to jump to the front of the queue and be tried again. Worker threads do this in a loop until either the engine signals that it wants to shut down, or there are no more jobs in the stack and queue, at which point the worker threads go to sleep by waiting on the semaphore for a signal from the main thread.

while(!scheduler.isQuitting)
{
    // Usually empty unless we need to prioritize a dependency
    // to unblock a job we got from the queue. Alternatively,
    // pieces of work from an IJobParallelFor job can end up here to let
    // many workers help finish IJobParallelFor work quickly
    Job* pJob = m_stack.pop();
    if(!pJob)
        pJob = m_queue.dequeue();

    if(pJob) {
        // ExecuteJob if all dependencies are complete, otherwise
        // push this job and the dependencies to the stack and try again
        if(EnsureDependenciesAreCompleteOtherwiseAddToStack(pJob))
            ExecuteJob(pJob);
    }
    else
    {
        // Put the thread to sleep until more jobs are scheduled
        m_semaphore.Wait(1);
    }
}

The job scheduler creates as many worker threads as there are virtual cores on the CPU, minus one, by default. The intention is for each worker thread to run on its own CPU core, while leaving one core free for the main thread to continue running. In practice, on platforms where a core isn’t reserved for non-game processes, it can be better to reduce the number of worker threads so that computation done by the operating system or driver threads doesn’t compete with the game’s main or job worker threads.
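If your content runs better with fewer workers, the worker count can be adjusted at runtime through the JobsUtility API; a sketch (how many threads to give back is, of course, content and platform specific):

using Unity.Jobs.LowLevel.Unsafe;

// By default, JobWorkerCount is roughly the number of virtual cores minus one.
// Lowering it leaves cores free for OS, driver, or other engine threads.
JobsUtility.JobWorkerCount = JobsUtility.JobWorkerMaximumCount - 2;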

Since the main thread is the primary place where jobs are scheduled from, it’s very important to not delay the main thread. Doing so directly affects how many jobs enter the job system and thus how much parallelism can occur within a frame.

With the main thread theoretically scheduling lots of jobs and the rest of the CPU cores executing those jobs, we should be able to maximize how much parallel work can be done on the CPU and allow performance to scale as the hardware changes. If we had more worker threads than cores, the operating system could context switch the main thread, and switch to a worker thread. Having an additional worker thread running might help empty your job queue faster, but it would certainly prevent new work from entering the queue, which ultimately has a larger negative effect on performance.

Thread-signaling overhead

There are a couple of potential problems with the above job scheduler approach that can lead to job system overhead. Let’s look at some examples.

Main thread schedules an IJob (non-parallel job) with no dependencies:

  • A job is added to the queue, and a worker thread is signaled to wake up
  • A worker thread wakes up
  • The worker executes the job
  • The worker checks for any more jobs to execute
  • The worker goes to sleep since there are no more jobs

Once the main thread signals using the job scheduler’s semaphore, one of the sleeping worker threads (not necessarily worker 0) will wake up. Waking up and context switching takes some time on the worker’s core. This is because, while the worker thread was asleep, the CPU core it ends up running on was likely doing something else – maybe running another thread spawned by the game, or some other process on the machine that was using the core.

To enable threads to be paused and resumed later, a thread’s register state needs to be saved, instruction pipelines need to be flushed, and the switched-to thread’s state needs to be restored. Even signaling the thread takes time on the main thread’s core, since notifying which thread to wake up is handled by the operating system. Ultimately, this all means that work is being done on the main thread core and the worker thread core that is not our job, and thus is overhead we want to reduce.

A job is scheduled on the main thread and eventually runs on thread Worker 0. The job execution is delayed by the overhead of signaling Worker 0 to wake up on the main thread, the context switch time on the Worker 0 thread, and the time the job system takes to find the job to run.

How quickly workers can be notified and how much time an individual job takes to run can also have an impact on the system. For instance, if you take the above use case but schedule two jobs instead of one:

  • A job is added to the queue, and a worker thread is signaled to wake up
  • The second job is added to the queue, and a worker thread is signaled to wake up
  • In some order, but twice:
    • A worker thread wakes up
    • A worker executes the job
    • The worker checks for any more jobs to execute
    • The worker goes to sleep since there are no more jobs

If the timing works out, you have two workers working in parallel on the job.

A parallel job is scheduled on the main thread and eventually runs on threads Worker 0 and Worker 1 simultaneously.

However, if one of the jobs is too small and/or it takes too long to signal and wake up both workers, one worker might steal all the work in the queue, and as a result we’ve signaled a worker for no reason.

Two jobs are scheduled on the main thread but both run on Worker 0 due to Worker 1 not waking up before Worker 0 consumes all jobs in the job queue. There may be too many non-worker threads in the system occupying CPU cores, or the jobs are too small to give worker threads enough time to wake up on average.

This type of job starvation and wake/sleep cycling can end up being quite expensive and limits the amount of parallelism the job system offers.

You might be thinking, “Isn’t overhead from signaling threads and context switching a cost of doing business when dealing with threads in the first place?” You certainly aren’t wrong. But, while you don’t have direct control over how expensive signaling or waking up threads is, you can control how often those operations occur.

One solution to avoid waking up workers for no reason is to only wake them when you suspect there are enough work items in the queue to justify the wake-up cost. This can be done by batching: Instead of signaling workers as soon as you schedule a job, add the job to a list and, at specific times, flush that batch of jobs into the job system, waking up an appropriate number of workers at the same time.

Two jobs are scheduled into a batch and then the whole batch is flushed, waking up two workers at nearly the same time. This batching approach improves the chances that both workers will find work when they wake up.

There is still a risk that the actual wake-up takes too long, the batched jobs are very small, or the number of jobs in a batch just isn’t very high. In general, the more jobs you include in a batch, the more likely you are to avoid the overhead of waking up threads for no reason. Unity maintains a global batch which is flushed whenever JobHandle.Complete() is called. So if you need to explicitly wait for a job to complete, try to do so as late and as infrequently as possible, and generally prefer scheduling jobs with job dependencies to best control safe access to data.
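If you want batched jobs to start executing without blocking the main thread the way JobHandle.Complete() does, you can flush the batch explicitly. A sketch, reusing the hypothetical jobs from the earlier examples:

JobHandle first = new AddJob { Values = values, Amount = 1.0f }.Schedule();
JobHandle second = new ScaleJob { Values = values }.Schedule(values.Length, 64, first);

// Flush the batched jobs to the worker threads now, without waiting on them
JobHandle.ScheduleBatchedJobs();

// ... keep doing main thread work, and only wait at the last responsible moment
second.Complete();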

You might also be asking yourself, “If signaling threads and waiting for them to wake up/go to sleep is purely overhead, why don’t we keep our threads awake all the time looking for work?” When there are plenty of jobs in the queue, this can actually occur naturally. Unless the operating system deems the worker thread to be lower priority than some other work (or is explicitly time sliced and should be swapped to give other threads their fair share of CPU time – it depends on your platform), worker threads will happily keep working.

However, as with the PartialUpdateA and PartialUpdateB functions we saw in part one, not all jobs are parallelizable and free of data dependencies. As such, you usually need to wait for some subset of jobs to complete before you can run others. As a result, a job graph’s parallelism bottlenecks whenever there are fewer runnable jobs (jobs with no outstanding dependencies) than there are worker threads, leaving some workers with nothing productive left to do.

If you don’t ever let worker threads sleep, you can run into a handful of issues. When worker threads constantly check for new jobs and fail to find any, this is “busy waiting” – work that’s wasteful and doesn’t progress the program. Keeping all cores running at maximum parallelism without progressing the game is a drain on battery life. Not only that, if a core never has idle time and cooling is insufficient, the CPU’s temperature will rise, leading to downclocking – running slower to avoid damage from overheating. In fact, on mobile platforms, it’s not uncommon for entire CPU cores to be temporarily disabled if they get too hot. For a job system, using cores efficiently is very important, so there is a balance between putting workers to sleep and having them constantly loop looking for new jobs, hoping they get lucky.

Compare-and-swap overhead

Another area that can generate overhead in the design above is the lock-free queue and stack. We won’t go into all the nuance of implementing these data structures, but one common trait of lock-free implementations is the use of a compare-and-swap (CAS) loop. Lock-free algorithms don’t use locking synchronization primitives to provide safe access to shared state, but instead use atomic instructions to carefully create higher-order atomic operations such as inserting an item into a queue in a thread-safe manner. However, perhaps unintuitively, lock-free algorithms can still prevent one thread from progressing until another is complete. They can also have secondary effects on the CPU instruction and memory pipelines, hurting performance scaling. (“wait-free” algorithms would allow all threads to always progress, but that doesn’t always provide the best overall performance in practice.)

Here is a contrived example of adding a number to a member variable, m_Sum, with a CAS loop:

int Add(int val)
{
    int oldSum;
    int newSum;
    do
    {
        // Load the current value we want to update
        oldSum = m_Sum;

        // Compute the new value we want to store
        newSum = oldSum + val;

        // Attempt to write the new value. CompareExchange returns
        // the value it saw inside m_Sum while trying to write newSum.
        // If that value doesn't match oldSum, another thread wrote to
        // the memory before us, so we retry the loop. If we wrote our
        // value without this check, we might write an incorrect value.
    } while (oldSum != Interlocked.CompareExchange(ref m_Sum, newSum, oldSum));

    return newSum;
}

CAS loops rely on the compare-and-swap instruction (here abstracted behind the C# Interlocked library, which hides platform specifics), which “compares two values for equality and, if they are equal, replaces the first value.” Since callers of Add() shouldn’t have to worry about the operation failing, a loop is used to retry whenever another thread beats us to updating m_Sum.

This retry loop is, in essence, a “busy-wait” loop. That has a nasty implication for performance scaling: If multiple threads enter the CAS loop at the same time, only one can leave at a time, serializing the operations each thread is performing. Fortunately, CAS loops are intentionally kept small, but they can still have a large negative impact on performance: As more cores execute the loop in parallel, each thread takes longer to complete the loop while the threads are in contention.

Further, because CAS loops rely on atomic reads and writes to shared memory, each thread generally has its cache lines invalidated on each iteration, causing additional overhead. This overhead can be very expensive compared to the cost of redoing the calculations inside the CAS loop (in the case above, redoing the work of adding two numbers together), so the true cost can be non-obvious at first glance.

Under the 2017.3 job scheduler, when worker threads were not running jobs, they were looking for work in either a shared, lock-free stack or queue. Both of these data structures used at least one CAS loop to remove work from the data structure. So, as more cores became available, the cost of taking work from the stack or queue increased when the data structures had contention. In particular, when jobs were small, worker threads proportionally spent more time looking for work in the queue or stack.
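To make that concrete, here is a rough sketch (not Unity’s actual implementation) of the kind of CAS loop a lock-free stack needs just to pop a single item:

using System.Threading;

class LockFreeStack<T> where T : class
{
    class Node { public T Value; public Node Next; }

    Node m_Head;

    public T Pop()
    {
        Node oldHead;
        do
        {
            oldHead = m_Head;
            if (oldHead == null)
                return null; // nothing to take

            // Retry if another thread pushed or popped between our read of
            // m_Head and the CompareExchange below. Under contention, many
            // workers can spin here while only one at a time makes progress.
        } while (Interlocked.CompareExchange(ref m_Head, oldHead.Next, oldHead) != oldHead);

        return oldHead.Value;
    }
}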

In a small project, I’ve generated deterministic job graphs like those a typical game might have for its frame update. Each generated graph is composed of single jobs and parallel jobs (each parallelizing into 1–100 parallel jobs), where each job may have 0–10 job dependencies, and the main thread has occasional explicit sync points where it must wait for certain jobs to finish before scheduling more. If I generate 500 jobs in the job graph and make each take a fixed amount of time to execute (each portion of a parallel job takes this time as well), you can see that, as more cores are used, overhead in the job system increases.

Windows 11 AMD Ryzen 9 3950X

For jobs that take 0.5μs, once there are 20 workers, the frame updates only as fast as not using the job system at all, and runs nearly twice as slowly when using all cores on my machine. By default, all cores are used in Unity, so with 1μs jobs there is almost no improvement in performance despite using 31 worker threads. This is a direct result of high contention on the lock-free queue and stack. Luckily, user jobs tend to be larger and can hide this overhead. However, the scaling issue is still there, and small jobs are common enough (especially as pieces of parallel jobs). Even when using larger jobs, your scheduling patterns and worker timing can cause large amounts of overhead due to contention on the global, lock-free stack and queue in the job scheduler.

2022.2 job scheduler

By now, you can see that there are a few areas our team needed to address to reduce overhead in the job system, both on Unity’s side and on the game creator’s side:

  • Avoiding stalls on the main thread:
    • Signaling to wake worker threads is expensive – keep this to a minimum.
    • Modifying state on the main thread shared with worker threads is likely to lead to cache invalidations and potential busy-waiting.
    • The main thread should schedule jobs frequently – avoid explicitly waiting on jobs to .Complete(). Prefer submitting jobs with dependencies instead.
  • Avoiding stalls on worker threads:
    • Worker thread efficiency directly impacts parallelism. Avoid contending on shared resources where possible.
    • Busy-waits on worker threads will drain battery life and can result in downclocking due to increases in temperature.

While Unity can’t change how many jobs users submit in their games, there are a decent number of issues that our engineers can tackle with a different job scheduler approach. In the 2022.2 release, the job scheduler, at a high level, breaks down into a few basic components:

  • Array of worker threads
  • Array of queues for jobs
  • Array of semaphores

This is very similar to the previous job scheduler. However, the main difference is the removal of the shared state between the main thread and worker threads. Instead, we make the queues and semaphores (or futex on platforms that support it) local to each worker thread. Now, when the main thread schedules a job, it’s enqueued into the main thread’s queue rather than a global queue.

Similarly, if a worker thread needs to schedule a job (e.g., a job schedules another job from its Execute method), that job is enqueued into the worker’s own queue rather than the main thread’s. This reduces memory traffic: Because workers don’t read and write all the different queues at the same frequency, they invalidate each other’s cache lines less often when writing to a queue.

The worker loop has also changed, now that there are more queues to work with:

while(!scheduler.isQuitting)
{
    // Take a job from our worker thread’s local queue
    Job* pJob = m_worker_queue[m_workerId].dequeue();
    // If our queue is empty, try to steal work from someone
    // else's queue to help them out.
    if(pJob == nullptr) {
        pJob = StealFromOtherQueues();
    }

    if(pJob) {
        // If we found work, there may be more, so conditionally
        // wake up other workers as necessary
        WakeWorkers();
        ExecuteJob(pJob);
    }
    // Conditionally go to sleep (perhaps we were told there is a 
    // parallel job we can help with)
    else if(ShouldSleep())
    {
        // Put the thread to sleep until more jobs are scheduled
        m_semaphores[m_workerId].Wait(1);
    }
}

Workers look in their own queue for work and only look at other worker queues when theirs is empty. Since workers prefer their own queues for dequeuing and enqueuing work, the amount of contention on any one queue is reduced.
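For illustration, the stealing step could look something like the following sketch in C# (hypothetical member names; the engine’s actual code differs):

// Each worker only touches other queues when its own is empty, and starts
// from a different victim so workers don't all contend on the same queue.
Job StealFromOtherQueues()
{
    for (int i = 1; i < m_WorkerQueues.Length; i++)
    {
        int victim = (m_WorkerId + i) % m_WorkerQueues.Length;
        Job stolen = m_WorkerQueues[victim].Dequeue();
        if (stolen != null)
            return stolen;
    }
    return null;
}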

Another difference is how threads are signaled to wake up. Worker threads are now responsible for waking up other worker threads, and the main thread is responsible for ensuring that at least one worker thread is awake when it schedules a job.

This change in responsibility removes excessive overhead from the main thread, since it no longer needs to be solely responsible for waking threads when parallel jobs are submitted. Instead, the job system tracks whether it needs to wake any workers at all. The main thread only ensures that at least one worker is awake to make progress on jobs; when a worker wakes and finds a job in its own queue or another’s, it can signal other workers to wake up and help empty the queues if needed.

Windows 11 AMD Ryzen 9 3950X

The queue separation for workers also provides some interesting leeway for configuration and optimizations, which our team is continuing to add to and improve on. In 2022.2, users should see a reduced cost on the main thread to wake up worker threads and improved throughput of jobs on worker threads, regardless of how many cores their platform has. Additionally, while Unity has not backported the queue separation to 2021.3 LTS, we have backported the design change that makes worker threads responsible for signaling each other, rather than leaving that solely to the main thread. High job system overhead on the main thread due to signaling the global semaphore should no longer be an issue as of 2021.3.14f1.

If you have questions or want to learn more, visit us in the C# Job System forum. You can also connect with me directly through the Unity Discord at username @Antifreeze#2763. Be sure to watch for new technical blogs from other Unity developers as part of the ongoing Tech from the Trenches series.


