If you are developing a simple library and you design it from scratch using a methodology like Test Driven Development, for example, you are likely to end up with something that works well and has a nice set of fast and stable unit tests to accompany it. If you are developing anything more complex than that, unit tests will not be enough and you will need to add at least an additional integration test suite. Unity is very complex due to the number of features that need to integrate with each other and the fact that we support building games and applications on more than 20 different platforms. Testing software of this complexity meant we ended up creating a lot of high level internal testing frameworks.
Let’s take a look at some numbers! Our code base is 12 years old which means there is lots of legacy code intermixed with lots of new code. Here are some detailed stats collected using cloc:
Engine (Runtimes, Modules, Shaders):
Editor (Editor, Extensions, some Tools):
Tests (does not include all tests; some are inside editor/runtime, especially c++ unit tests):
These numbers don’t include any of the external libraries we integrate into Unity. The tests are split into low-level C++ unit tests and high-level C# tests. The C# tests can be of many different kinds based on what they test: runtime, integration, asset import, graphics, performance, etc. In total we have about 60000 automated tests that are being executed tens of millions of times every month, both locally through manual runs and on our build farm.
The fact that we rely so much on high level automated tests means that we have to deal with test failures and instabilities on a constant basis. In order to keep these to a minimum we started doing a number of things:
We use a development method that relies heavily on having multiple code branches that are kept in sync with and eventually merged back to trunk when ready. We want to always be able to have a build ready for release from the head revision of trunk, which means we want to make absolutely sure that it is always green. Everyone branching from trunk also wants to start their work on code that passes all automation.
The current way we are keeping trunk always green is by using a staging branch. Every day, multiple people submit code that should be merged to trunk. Requests like these get bundled together, merged onto the staging branch and all test automation is executed. If anything fails, we have a tool that reruns the failed tests on the same revision again to verify if it is just an instability or an actual failure. If it is an instability, a notification is posted to an internal chat, where we always have one or more developers investigating any issue that gets posted there. If it is an actual failure, we run a bisection process to quickly figure out which one of the code merge requests introduced it. The person responsible for that gets notified and the code is removed from the staging branch. If everything passes as expected, the staging branch gets merged to trunk. We call this the Trunk Queue Verification process.
This process does help keeping the main development branch always green, but it is far from ideal. It is costly to maintain because running all our test suites takes hours and finding the source of some failures require human intervention in a lot of cases. The ideal scenario would be for us to run tests on all branches after every new set of changes is pushed achieving something close to continuous integration. Right now, we are running tests in the most naive way possible, which usually means that for most new batch of changes, we run all the tests. We are taking the first steps towards changing this and improving everyone’s iteration time on test automation runs here at Unity by introducing a smart test selection service.
We have previously blogged about our Unified Test Runner which also stores lots of information about every test run in a database. We now have tens of millions of test data points where we can see when a test was executed, by whom, on which machine, if it failed or passed, how long it took to execute, etc. We are starting to leverage all this data and build a rule based system for selecting which tests should run based on which code was changed on a specific branch. Here are a few examples:
Using this rule-based system will save everyone a lot of time, but it will not remove instabilities. Instabilities make working with tests unreliable and slow, which is why we need to fix the source of the instability as fast as possible. Instabilities can be caused by tests (in which case the test is disabled and a bug with the highest priority is opened for someone to fix it) or by infrastructure (mobile devices in the build farm get disconnected or crash/freeze and need to be restarted, etc). For infrastructure issues all we can do is have good management and monitoring tools.
We are not the only ones struggling to keep test automation green and stable. Google has written about this quite extensively in their Flaky Tests blogpost and they also offer great advice on what one can do to avoid this in their Hackable Projects blogpost. Facebook also uses a system of bots to make sure automation runs fast and stable. You can see more in one of their presentations from GTAC and another from the F8 2015 conference.