My name is Igor and I am a Toolsmith at Unity, which means I am part of the team that builds tools to increase the productivity of Devs and QA at Unity, with the aim of improving the overall quality of the product and the experience of our Users.
Let's start with a little bit of history of handling bugs in Unity. There is a tool installed with the Editor called the Bug Reporter, which can be launched either manually or automatically in case of a crash (see more at Reporting-a-bug). After a user submits a bug report, QA has to try to reproduce the issue and turn this report (which is initially called an incident) into a bug, which is then passed to the development teams for triage and a fix or other solution. A good report should contain a descriptive title and steps to reproduce, and have a project attached (ideally a small one focused on the problem), so it is easy for us to verify and reproduce the issue. That’s what we at Unity always hope to get in the report (more at Attaching-your-project-to-a-bug-report).
But here comes the other side of the equation. With more than a million registered users, we receive A LOT of reports of the same bug, which is good until it becomes bad: someone has to look at every report that gets sent (around 6000 per month), verify it, reply to the user and so on. And once we have verified the bug (or even have a fix for it ready), additional reports no longer help to solve the problem; they only take up valuable time from the QA going through them. Automation to the rescue!
It turns out that crashes are the perfect example of a problem which in most cases is identified by one common characteristic: the callstack of the crash (i.e. the sequence of function calls in the program code of the Editor which eventually led to the crash, with the name of the crashed function on top). This means that, unlike many other bugs, these problems are much easier for a machine to group together without any user intervention (there are exceptions to that rule, but more about that later).
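To illustrate why callstacks make machine grouping straightforward, here is a minimal Python sketch. The report ids and function names are invented for the example, and grouping by exact match is an assumption made for simplicity; it is not a description of how our tool is implemented.

```python
from collections import defaultdict


def group_by_callstack(reports):
    """Group report ids by exact callstack.

    `reports` is an iterable of (report_id, frames) pairs, where `frames`
    lists function names with the crashed function first.
    """
    groups = defaultdict(list)
    for report_id, frames in reports:
        # A tuple is hashable, so the whole callstack can serve as the key.
        groups[tuple(frames)].append(report_id)
    return dict(groups)


# Hypothetical reports: the ids and function names are made up.
reports = [
    (101, ["Mesh::Upload", "Renderer::Draw", "main"]),
    (102, ["AssetDatabase::Import", "main"]),
    (103, ["Mesh::Upload", "Renderer::Draw", "main"]),
]
groups = group_by_callstack(reports)
```

Reports 101 and 103 share a callstack and land in the same group, while report 102 forms a group of its own.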
When we started this project a few years ago we had no idea how many additional insights it would give us: from historical data on all the crashes across different versions of the Editor (quantity, dynamics, etc.) to the ability to immediately identify whether a crash that happened on a user’s machine already has a fix in a Unity version.
We built a tool which analyzes all the reports sent by users, parses all the Editor logs attached to them to find the callstack of a crash, and then maps identical or similar crashes together (figure 1). That brought tremendous value and a productivity boost for both developers and testers. Developers fixing an issue now have all the similar reports at their fingertips and can quickly look for more information or other repro projects. Testers can immediately see whether a reported issue falls into a known category and whether there is already a solution for it, or at least a verified bug with a public Issue Tracker item where users can follow its progress, so they can help the user in a timely fashion. Release managers can also assess the stability and production readiness of builds well before they make their way into alpha or beta testing, let alone stable releases (Unity Roadmap), and look for possible regressions and the User Pain caused.
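The log-parsing step could look roughly like the Python sketch below. The log format is an assumption invented for the example (frame lines written as `#<index> <function>`); real Editor logs differ per platform, and the function name `extract_callstack` is hypothetical.

```python
import re

# Assumed frame format, e.g. "#0 Mesh::Upload". Not the real Editor log format.
FRAME_RE = re.compile(r"^#(\d+)\s+(\S+)")


def extract_callstack(log_text):
    """Return the function names of the first contiguous run of frame lines."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append(match.group(2))
        elif frames:
            # The first block of frames has ended; ignore the rest of the log.
            break
    return frames


# Hypothetical log excerpt.
sample_log = "\n".join([
    "Editor started",
    "Crash!!!",
    "#0 Mesh::Upload",
    "#1 Renderer::Draw",
    "#2 main",
    "Shutting down",
])
callstack = extract_callstack(sample_log)
```

The extracted list of frames is what later gets compared against the callstacks of other reports.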
If for some type of crash we were able to turn one of the reports into a bug (i.e. the user provided steps to reproduce or a project), we might want to close all other similar reports as duplicates (while giving users an Issue Tracker link to track progress on the fix). To do that, we mark that report as the repro for the crash and then resolve all the others as duplicates (figure 2).
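As a rough sketch of that workflow in Python: the `Report` class, its status values and the placeholder URL are all hypothetical and only serve to show the idea of one repro report plus duplicates that point to it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Report:
    report_id: int
    status: str = "open"  # hypothetical states: open, repro, duplicate
    duplicate_of: Optional[int] = None
    note: str = ""


def resolve_duplicates(group: List[Report], repro_id: int, tracker_url: str) -> None:
    """Mark one report as the repro and resolve the rest of the group against it."""
    if not any(report.report_id == repro_id for report in group):
        raise ValueError(f"report {repro_id} is not in this crash group")
    for report in group:
        if report.report_id == repro_id:
            report.status = "repro"
        else:
            report.status = "duplicate"
            report.duplicate_of = repro_id
            # Users of closed reports get a link to follow the fix.
            report.note = f"Track progress at {tracker_url}"


# Hypothetical crash group; the URL is a placeholder.
group = [Report(1), Report(2), Report(3)]
resolve_duplicates(group, repro_id=2, tracker_url="https://example.com/issue")
```

After the call, report 2 is the repro and reports 1 and 3 are duplicates of it.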
For everyone’s convenience, Slack integrations were also added, so if you want to receive notifications about newly reported crashes (along with info showing whether the crash is known or not, the Unity version it was reported against, and so on), all you need to do is subscribe to a few channels (figure 3).
We are not out of the woods yet! We keep working to improve our algorithms (some of the called functions on the stack are meaningless and should be filtered out, some platforms provide us with better callstack collecting mechanisms than others, etc.). Some crashes are fully identified by their callstack and, as a result, can be processed completely automatically. Sometimes callstacks must be 100% identical to count as the same bug. Sometimes it is enough for them to be ‘similar’ (for example, to have the same top frame, i.e. the crashed function name, but vary a little further down the stack). But in some cases even identical callstacks can have different root causes. This often happens with external calls into 3rd party libraries or drivers, where the exact place of the crash is not enough and different parameters of the call or a varying setup can result in different problems. For those crashes full automation is not possible yet, and it takes human investigation to tell the difference. The goal is to at least semi-automate cases like that, which means someone has to take a look and resolve possible issues manually before we can advance to automatic handling.
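One possible way to express "filter out meaningless frames, then require the same top frame and a mostly matching stack" is sketched below in Python. The noise list, the function names and the 0.6 threshold are assumptions chosen for the example, and `difflib.SequenceMatcher` is just a convenient stand-in from the standard library, not the algorithm we actually use.

```python
from difflib import SequenceMatcher

# Hypothetical frames that carry no information about the crash itself.
NOISE_FRAMES = {"RaiseException", "abort"}


def normalize(frames):
    """Drop frames that are considered meaningless."""
    return [frame for frame in frames if frame not in NOISE_FRAMES]


def is_similar(frames_a, frames_b, min_ratio=0.6):
    """Same top frame after filtering, plus enough overlap further down."""
    a, b = normalize(frames_a), normalize(frames_b)
    if not a or not b or a[0] != b[0]:
        return False
    # ratio() is 2 * matches / total length, a value between 0 and 1.
    return SequenceMatcher(None, a, b).ratio() >= min_ratio


# Made-up callstacks, crashed function first.
stack_a = ["abort", "Mesh::Upload", "Renderer::Draw", "PlayerLoop", "main"]
stack_b = ["Mesh::Upload", "Renderer::DrawBatch", "PlayerLoop", "main"]
stack_c = ["Texture::Load", "Renderer::Draw", "PlayerLoop", "main"]
```

Here `stack_a` and `stack_b` count as similar because they share the top frame once the noise is removed, while `stack_c` crashes in a different function. A check like this cannot separate identical callstacks with different root causes; those still need a human.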
The plan is to further integrate with existing tools like Issue Tracker and Bug Reporter. Instead of collecting a user’s crash report, storing it on our servers, analyzing it and then providing references to an existing bug and/or a solution for it, we'll be able to exchange data right away between Bug Reporter and the Crash Analyzer’s backend. That would prevent reports being sent for already fixed issues and give users immediate feedback or solutions, instead of having them submit reports and wait for a response from QA.
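Since this part is still a plan, the following Python sketch is purely illustrative. It assumes the backend can be queried by a hash of the callstack, represents that backend as an in-memory dictionary, and uses a made-up version number and placeholder URL; none of these names exist in the real tools.

```python
import hashlib


def signature(frames):
    """Stable fingerprint of a callstack (crashed function first)."""
    return hashlib.sha1("\n".join(frames).encode("utf-8")).hexdigest()


def check_before_submit(frames, known_crashes):
    """Return (should_send, message) for a crash about to be reported.

    `known_crashes` stands in for the backend: it maps a callstack
    signature to a dict with optional "fixed_in" and "tracker_url" keys.
    """
    entry = known_crashes.get(signature(frames))
    if entry is None:
        return True, "Unknown crash, please send the report."
    if entry.get("fixed_in"):
        return False, f"Already fixed in version {entry['fixed_in']}."
    # Assumption: known but unfixed crashes are still worth reporting.
    return True, f"Known issue, track it at {entry['tracker_url']}."


# Hypothetical data: the version number and URL are placeholders.
fixed_stack = ["Mesh::Upload", "Renderer::Draw", "main"]
known_crashes = {
    signature(fixed_stack): {
        "fixed_in": "9.9.9",
        "tracker_url": "https://example.com/issue",
    }
}
should_send, message = check_before_submit(fixed_stack, known_crashes)
```

For the already fixed crash the user would get an answer immediately and no report would need to be sent.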
Stay tuned for more...