In June, we hosted a webinar featuring experts from Arm, the Unity Accelerate Solutions team, and SYBO Games, the creator of Subway Surfers. The resulting roundtable focused on profiling tips and strategies for mobile games, the business implications of poor performance, and how SYBO shipped a hit mobile game with 3 billion downloads to date.
Let’s dive into some of the follow-up questions we didn’t have time to cover during the webinar. You can also watch the full recording.
There are no immediate plans to integrate the Profile Analyzer into the core Editor, but this might change as our profiling tools evolve.
That’s a great idea, and while we can’t say yes or no at the time of this blog post, it’s a request that’s been shared with our R&D teams for possible future consideration.
Although we don’t have specific plans for tracking ANR without stack trace at the moment, we will consider it for the future roadmap.
You can keep track of upcoming features and share feedback via our product board and forums. We are also conducting a survey to learn more about our customers’ experience with the profiling tools. If you’ve used profiling tools before (either daily or just once) or are working on a project that requires optimization, we would love to get your input. The survey is designed to take no more than 5–10 minutes to complete.
By participating, you’ll also have the chance to opt into a follow-up interview to share more feedback directly with the development team, including the opportunity to discuss potential prototypes of new features.
A rule of thumb we hear from many Unity game developers is to target devices that are five years old at the time of your game’s release, as this helps to ensure the largest user base. But we also see teams reducing their release-date scope to devices that are only three years old if they’re aiming for higher graphical quality. A visually complex 3D application, for example, will have higher device requirements than a simple 2D application. This approach allows for a higher “min spec,” but reduces the size of the initial install base. It’s essentially a business decision: Will it cost more to develop for and support old devices than what your game will earn running on them?
Sometimes the technical requirements of your game will dictate your minimum target specifications. So if your game uses up large amounts of texture memory even after optimization, but you absolutely cannot reduce quality or resolution, that probably rules out running on phones with insufficient memory. If your rendering solution requires compute shaders, that likely rules out devices with drivers that can’t support OpenGL ES 3.1, Metal, or Vulkan.
It’s a good idea to look at market data for your priority target audience, since mobile device specs can vary a lot between countries and regions. Remember to define target “budgets” up front, so that your benchmarking goals for acceptable performance are set before you choose low-end devices for testing.
For live service games that will run for years, you’ll need to monitor their compatibility continuously and adapt over time based on both your actual user base and current devices on the market.
It might be, if you have a uniform workload on all devices. However, you still need to consider variations across hardware from different vendors and/or driver versions.
It’s common for graphically rich games to have tiers of graphical fidelity – the higher the visual tier, the more resources required on capable devices. This tier selection might be automatic, but increasingly, users themselves can control the choice via a graphical settings menu. For this style of development, you’ll need to test at least one “min spec” target device per feature/workload tier that your game supports.
If your game detects the capabilities of the device it’s running on and adapts the graphics output as needed, it could perform differently on higher end devices. So be sure to test on a range of devices with the different quality levels you’ve programmed the title for.
Note: In this section, we’ve specified whether the expert answering is from Arm or Unity.
Arm: We typically see developers doing coarse capability binning based on CPU and GPU models, as well as the GPU shader core count. This is never perfect, but it’s “about right.” A lot of studios collect live analytics from deployed devices, so they can supplement the automated binning with device-specific opt-in/opt-out to work around point issues where the capability binning isn’t accurate enough.
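The coarse binning with device-specific overrides described above could be sketched roughly as follows. This is only an illustration: the tier names, the `gpu_generation` number, the shader core thresholds, and the device model strings are all hypothetical, and real values would come from your own testing and analytics.

```python
# Hypothetical thresholds; real values come from profiling your own content.
def coarse_tier(gpu_generation: int, shader_cores: int) -> str:
    """Bin a device by GPU generation and shader core count (assumed inputs)."""
    if gpu_generation >= 3 and shader_cores >= 10:
        return "high"
    if gpu_generation >= 2 and shader_cores >= 4:
        return "medium"
    return "low"


def select_tier(device_model: str, gpu_generation: int, shader_cores: int,
                overrides: dict[str, str]) -> str:
    """Prefer a device-specific override from analytics, else use the coarse bin."""
    if device_model in overrides:
        return overrides[device_model]
    return coarse_tier(gpu_generation, shader_cores)


# Hypothetical override: live analytics showed this model struggles at "high".
OVERRIDES = {"example-phone-x": "medium"}

print(select_tier("example-phone-x", 3, 12, OVERRIDES))  # override applies: medium
print(select_tier("example-phone-y", 3, 12, OVERRIDES))  # coarse bin applies: high
```

Keeping the overrides in a separate table means the automated binning stays simple, while point fixes for individual devices can be shipped as data.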
Related to the previous question: for graphically rich content, we see a trend on mobile toward settings menus where users can turn effects on or off, letting them make performance choices that suit their preferences.
Unity: Device memory and screen resolution are also important factors for choosing quality settings. Regarding textures, developers should be aware that Render Textures used by effects or post-processing can become a problem on devices with high resolution screens, but without a lot of memory to match.
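To get a feel for why full-screen Render Textures become a problem on high-resolution screens, here is a rough back-of-the-envelope estimate. It assumes an uncompressed 4-bytes-per-pixel format with no mipmaps or multisampling, and it ignores the layout and alignment padding that real drivers add, so treat the numbers as a lower bound rather than an exact figure.

```python
def render_texture_bytes(width: int, height: int, bytes_per_pixel: int = 4) -> int:
    """Approximate size of one uncompressed render target (no mips, no MSAA)."""
    return width * height * bytes_per_pixel


def to_mib(num_bytes: int) -> float:
    """Convert bytes to mebibytes."""
    return num_bytes / (1024 * 1024)


qhd = render_texture_bytes(2560, 1440)  # one full-screen target on a 1440p screen
hd = render_texture_bytes(1280, 720)    # the same target on a 720p screen

print(f"1440p: {to_mib(qhd):.2f} MiB")  # 1440p: 14.06 MiB
print(f"720p:  {to_mib(hd):.2f} MiB")   # 720p:  3.52 MiB
```

A post-processing chain that uses several such targets multiplies this cost, which is why the combination of a high-resolution screen and modest memory deserves attention.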
Arm: The number of tiers your team optimizes for is really a game design and business decision, and should be based on how important pushing visual quality is to the value proposition of the game. For some genres it might not matter at all, but for others, users will have high expectations for the visual fidelity.
Arm: To a first-order approximation, we would expect the total amount of texture memory to be similar across vendors and hardware generations. There will be minor differences caused by memory layout and alignment restrictions, so it won’t be exactly the same.
Arm: It’s entirely content dependent. The CPU, GPU, or the DRAM can individually overheat a high-end device if pushed hard enough, even if you ignore the other two completely. The exact balance will vary based on the workload you are running.
Arm: Optimizing for frame time can be misleading on Android because devices will constantly adjust frequency to optimize energy usage, making frame time an incomplete measure by itself. Preferably, monitor CPU and GPU cycles per frame, as well as GPU memory bandwidth per frame, to get metrics that are independent of clock frequency. The cycle target you need will depend on each device’s chip design, so you’ll need to experiment.
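The following sketch illustrates why per-frame cycle counts are more stable than frame times. The numbers are made up, and the functions assume a constant clock over the measured interval; in practice a profiler reports these counters directly, so this only shows the arithmetic behind the idea.

```python
def cycles_per_frame(active_time_ms: float, clock_mhz: float) -> float:
    """Cycles spent in a frame: 1 ms at 1 MHz equals 1,000 cycles."""
    return active_time_ms * clock_mhz * 1000


def bytes_per_frame(bandwidth_bytes_per_s: float, fps: float) -> float:
    """Average memory traffic per frame from a bandwidth reading."""
    return bandwidth_bytes_per_s / fps


# The same hypothetical workload measured at two different clock frequencies:
slow_clock = cycles_per_frame(active_time_ms=20, clock_mhz=500)
fast_clock = cycles_per_frame(active_time_ms=10, clock_mhz=1000)

print(slow_clock == fast_clock)  # True: frame time doubled, cycle count did not change
```

The frame time differs by a factor of two between the two runs, yet the cycle count identifies them as the same amount of work.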
Any optimization helps when it comes to managing power consumption, even if it doesn’t directly improve frame rate. For example, reducing CPU cycles will reduce thermal load even if the CPU isn’t the critical path for your game.
Beyond that, optimizing memory bandwidth is one of the biggest savings you can make. Accessing DRAM is orders of magnitude more expensive than accessing local data on-chip, so watch your triangle budget and keep data types in memory as small as possible.
Unity: To limit the impact of CPU clock frequency on the performance metrics, we recommend trying to run at a consistent temperature. There are a couple of approaches for doing this:
With some hardware, you can fix the clock frequency for more stable performance metrics. However, this is not representative of most devices your users will be using, and will not report accurate real-world performance. Basically, it’s a handy technique if you are using a continuous integration setup to check for performance changes in your codebase over time.
Arm: Recent drivers and engine builds have vastly improved the quality of the Vulkan implementations available; so for an equivalent workload, there shouldn’t be a performance gap between OpenGL ES and Vulkan (if there is, please let us know). The switch to Vulkan is picking up speed and we expect to see more people choosing Vulkan by default over the next year or two. If you have counterexamples of areas where Vulkan isn’t performing well, please get in touch with us. We’d love to hear from you.
Arm: The Streamline Profiler in Arm Mobile Studio can measure bandwidth between Mali GPUs and the external DRAM (or system cache).
Arm: You can get the best result by retuning assets, but it’s expensive to do. Start by reducing resolution and frame rate, or disabling some optional post-processing effects.
Arm: You can use the Performance Advisor tool in Arm Mobile Studio to automatically capture and export performance metrics from the Mali GPUs, although this comes with a caveat: The generation of JSON reports requires a Professional Edition license.
Unity: The Unity Profiler can be used to view common rendering metrics, such as vertex and triangle counts in the Rendering module. Plus you can include custom packages, such as System Metrics Mali, in your project to add low-level Mali GPU metrics to the Unity Profiler.
You need a GPU Profiler to do this. The one you choose depends on your target platform. For example, on iOS devices, Xcode’s GPU Profiler includes the Shader Profiler, which breaks down shader performance on a line-by-line basis.
Arm Mobile Studio supports Mali Offline Compiler, a static analysis tool for shader code and compute kernels. This tool provides some overall performance estimates and recommendations for the Arm Mali GPU family.
The proliferation of chipsets is primarily a concern on desktop platforms. There are a limited number of hardware architectures to test for console games. On mobile, there’s Apple’s A Series for iOS devices and a range of Arm and Qualcomm architectures for Android – but selecting a manageable list of representative mobile devices is pretty straightforward.
On desktop it’s trickier because there’s a wide range of available chipsets and architectures, and buying Macs and PCs for testing can be expensive. Our best advice is to do what you can. No studio has infinite time and money for testing. We generally wouldn’t expect any huge surprises when comparing performance between an Intel x86 CPU and a similarly specced AMD processor, for instance. As long as the game performs comfortably on your minimum spec machine, you should be reasonably confident about other machines. It’s also worth considering using analytics, such as Unity Analytics, to record frame rates, system specs, and the settings players choose, to identify hotspots or problematic configurations.
We’re seeing more studios move to using at least some level of automated testing for regular on-device profiling, with summary stats published where the whole team can keep an eye on performance across the range of target devices. With well-designed test scenes, this can usually be made into a mechanical process that’s suited for automation, so you don’t need an experienced technical artist or QA tester running builds through the process manually.
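As a rough sketch of what the automated part of such a pipeline might check, the snippet below summarizes frame times captured from a test scene and flags a regression against a stored baseline. The 10% tolerance, the function names, and the sample data are assumptions for illustration; how the frame times are captured on-device is outside the scope of this sketch.

```python
import statistics


def summarize(frame_times_ms: list[float]) -> dict[str, float]:
    """Reduce a capture to summary stats the whole team can track."""
    ordered = sorted(frame_times_ms)
    return {"median_ms": statistics.median(ordered), "worst_ms": ordered[-1]}


def is_regression(current_median_ms: float, baseline_median_ms: float,
                  tolerance: float = 0.10) -> bool:
    """Flag a run whose median exceeds the baseline by more than the tolerance."""
    return current_median_ms > baseline_median_ms * (1.0 + tolerance)


# Hypothetical capture from one test scene on one target device.
stats = summarize([16.0, 17.0, 15.0, 30.0, 16.0])
print(stats)
print(is_regression(stats["median_ms"], baseline_median_ms=16.0))
```

Using the median keeps a single slow frame from failing the build, while the worst-case value is still published so spikes remain visible.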
It’s uncommon, but we have seen it. Often the issue lies in how the project is configured, such as with the use of fancy shaders and high-res textures on high-end devices, which can put extra pressure on the GPU or memory. Sometimes a high-end mobile device or console will use a high-res phone screen or 4K TV output as a selling point but not necessarily have enough GPU power or memory to live up to that promise without further optimization.
If you make use of the current versions of the C# Job System, verify whether there’s a job scheduling overhead that scales with the number of worker threads, which, in turn, scales with the number of CPU cores. This can result in code that runs more slowly on a 64+ core Threadripper™ than on a modest 4-core or 8-core CPU. This issue will be addressed in future versions of Unity, but in the meantime, try limiting the number of job worker threads by setting JobsUtility.JobWorkerCount.
Most of the time when we talk about frame budgets, we’re talking about the overall time budget for the frame. Divide 1000 ms by your target frames per second (fps) to get your frame budget: 33.33 ms for 30 fps, 16.67 ms for 60 fps, 8.33 ms for 120 fps, etc. Reduce that number by around 35% if you’re on mobile to give the chips a chance to cool down between each frame. Dividing the budget up to get specific sub-budgets for different features and/or systems is probably overkill except for projects with very specific, predictable systems, or those that make heavy use of Time Slicing.
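The budget arithmetic above can be written as a small helper. The function name and the `thermal_headroom` parameter are hypothetical, and the 35% default is simply the approximate reduction suggested above, not a fixed rule.

```python
def frame_budget_ms(target_fps: float, mobile: bool = False,
                    thermal_headroom: float = 0.35) -> float:
    """Frame budget in milliseconds, optionally reduced for mobile thermals."""
    budget = 1000.0 / target_fps
    if mobile:
        # Leave idle time in each frame so the chips can cool down.
        budget *= 1.0 - thermal_headroom
    return budget


for fps in (30, 60, 120):
    print(fps, round(frame_budget_ms(fps), 2), round(frame_budget_ms(fps, mobile=True), 2))
```

For a 30 fps mobile title, this leaves roughly 21.67 ms of the 33.33 ms frame for actual work.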
Generally, profiling is the process of finding the biggest bottlenecks – and therefore, the biggest potential performance gains. So rather than saying, “Physics is taking 1.2 ms when the budget only allows for 1 ms,” you might look at a frame and say, “Rendering is taking 6 ms, making it the biggest main thread CPU cost in the frame. How can we reduce that?”
Building, releasing, promoting, and managing a game is difficult work on multiple fronts. So there will always be numerous priorities vying for a developer’s attention, and profiling can fall by the wayside. They know it’s something they should do, but perhaps they’re unfamiliar with the tools and don’t feel like they have time to learn. Or, they don’t know how to fit profiling into their workflows because they’re pushed toward completing features rather than performance optimization.
Just as with bugs and technical debt, performance issues are cheaper and less risky to address early on, rather than later in a project’s development cycle. Our focus is on helping to demystify profiling tools and techniques for those developers who are unfamiliar with them. That’s what the profiling e-book and its related blog post and webinar aim to support.
You can enable Allocation call stacks to see the full call stacks that lead to managed allocations (shown as magenta in the Unity CPU Profiler Timeline view). Additionally, you can – and should! – manually instrument long-running methods and processes by sprinkling ProfilerMarkers throughout your code. There’s currently no way to automatically enable Deep Profiling or disable profiling entirely in specific parts of your application. But manually adding ProfilerMarkers and enabling Allocation call stacks when required can help you dig down into problem areas without having to resort to Deep Profiling.
As of Unity 2022.2, you can also use our IgnoredByDeepProfilerAttribute to prevent the Unity Profiler from capturing method calls. Just add the IgnoredByDeepProfiler attribute to classes, structures, and methods.
Deep Profiling is covered in our Profiler documentation. Then there’s the most in-depth, single resource for profiling information, the Ultimate Guide to profiling Unity games e-book, which links to relevant documentation and other resources throughout.
Deep Profiling can be used to find the specific causes of managed allocations, although Allocation call stacks can do the same thing with less overhead, overall. At the same time, Deep Profiling can be helpful for quickly investigating why one specific ProfilerMarker seems to be taking so long, as it’s more convenient to enable than to add numerous ProfilerMarkers to your scripts and rebuild your game. But yes, it does skew performance quite heavily and so shouldn’t be enabled for general profiling.
Mobile devices force VSync to be enabled at a driver/hardware level, so disabling it in Unity’s Quality settings shouldn’t make any difference on those platforms. We haven’t heard of a case where disabling VSync negatively affects performance. Try taking a profile capture with VSync enabled, along with another capture of the same scene but with VSync disabled. Then compare the captures using Profile Analyzer to try to understand why the performance is so different.
This is covered in the Ultimate Guide to profiling Unity games. You can also get more information in the blog post, Detecting performance bottlenecks with Unity Frame Timing Manager.
Generally speaking, the telltale sign is that the main thread waits for the Render thread while the Render thread waits for the GPU. The specific marker names will differ depending on your target platform and graphics API, but you should look out for markers with names such as “PresentFrame” or “WaitForPresent.”
Use the Memory Profiler to compare memory snapshots and check for leaks. For example, you can take a snapshot in your main menu, enter your game and then quit, go back to the main menu, and take a second snapshot. Comparing these two will tell you whether any objects/allocations from the game are still hanging around in memory.
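Conceptually, the comparison boils down to finding what grew between the two snapshots. The sketch below is not the Memory Profiler's API, which performs this comparison in its own UI; it assumes a hypothetical representation where each snapshot is a mapping from type name to live object count.

```python
def growth_between_snapshots(before: dict[str, int],
                             after: dict[str, int]) -> dict[str, int]:
    """Return the types whose live object count increased, with the increase."""
    growth = {}
    for type_name, count in after.items():
        delta = count - before.get(type_name, 0)
        if delta > 0:
            growth[type_name] = delta
    return growth


# Hypothetical counts: first visit to the main menu vs. after playing and returning.
menu_first = {"Texture2D": 40, "Mesh": 12}
menu_second = {"Texture2D": 55, "Mesh": 12, "AudioClip": 3}

print(growth_between_snapshots(menu_first, menu_second))
# {'Texture2D': 15, 'AudioClip': 3}
```

Anything that shows up here is a candidate leak, since both snapshots were taken in the same place and should ideally look alike.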
A number of game projects now make use of parts of the Data-Oriented Technology Stack (DOTS). Native Containers, the C# Job System, Mathematics, and the Burst compiler are all fully supported packages that you can use right away to write optimal, parallelized, high-performance C# (HPC#) code to improve your project’s CPU performance.
A smaller number of projects are also using Entities and associated packages, such as the Hybrid Renderer, Unity Physics, and NetCode. However, at this time, the packages listed are experimental, and using them involves accepting a degree of technical risk. This risk derives from an API that is still evolving, missing or incomplete features, as well as the engineering learning curve required to understand Data-Oriented Design (DOD) to get the most out of Unity’s Entity Component System (ECS). Unity engineer Steve McGreal wrote a guide on DOTS best practices, which includes some DOD fundamentals and tips for improving ECS performance.
Rendering is a complex process and there is no practical way to set a hard limit on the maximum number of SetPass calls or a metric for shader complexity. Even on a fixed hardware platform, such as a single console, the limits will depend on what kind of scene you want to render, and what other work is happening on the CPU and GPU during a frame.
That’s why the rule on when to profile is “early and often.” Teams tend to create a “vertical slice” demo early on during production – usually a short burst of gameplay developed to the level of visual fidelity intended for the final game. This is your first opportunity to profile rendering and figure out what optimizations and limits might be needed. The profiling process should be repeated every time a new area or other major piece of visual content is added.
Here are additional resources for learning about performance optimization:
Even more advanced technical content is coming soon – but in the meantime, please feel free to suggest topics for us to cover on the forum and check out the full roundtable webinar recording.