Hello,
As you might know, in preparation for the upcoming TF-A release, there was a motion to test how the system behaves under the extra load usually caused by release work, to anticipate what to expect and possibly make adjustments to improve the situation compared with previous releases (where overloads were all too common).
The testing was started a few weeks ago by Joanna, and the OpenCI maintenance team joined last week, especially once the "death spiral" system behavior, familiar from previous releases, was detected. This behavior was suspected to be caused, and has since been confirmed to be caused, by the following circumstances:
1. A patch is submitted and enabled for testing (AllowCI+2) which causes a large number of (LAVA) tests to fail due to timeout.
2. These tests keep FVP virtual devices in LAVA busy for a much longer time than usual (~10x).
3. As there are many such tests, they block devices and cause the LAVA queue to grow and become a bottleneck (400-500 tests waiting).
4. Jenkins jobs also retry failing LAVA tests a number of times, as a stopgap measure against tests that fail non-deterministically.
5. As Jenkins jobs also have timeouts, waiting for the results of queued/retried tests caused them to time out.
6. All these factors have a positive feedback effect on each other, causing the "death spiral" effect, in which both Jenkins and LAVA were severely overloaded while doing nothing useful (effectively, just waiting). The whole system was effectively deadlocked: any newly started job just kept waiting until it failed on its timeout, and manual intervention was required to clear this state.
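To illustrate the feedback loop described above, here is a minimal toy simulation. It is only a sketch, not a model of the real OpenCI setup: the function name, the tick-based timing and all the numbers are assumptions made up for illustration. Tests that time out hold a device for the full timeout and are then re-queued while they have retries left.

```python
from collections import deque

def simulate_backlog(devices, arrivals_per_tick, normal_ticks, timeout_ticks,
                     failing_every, retries, total_ticks):
    """Return the queue length after each tick for a toy device pool.

    All parameters are hypothetical; none are taken from the real system.
    failing_every=N means every N-th submitted test times out (0 = none).
    """
    queue = deque()   # waiting tests: (is_failing, retries_left)
    running = []      # busy devices: [ticks_left, is_failing, retries_left]
    backlog = []
    submitted = 0
    for _ in range(total_ticks):
        # New tests arrive; some of them are destined to time out.
        for _ in range(arrivals_per_tick):
            submitted += 1
            is_failing = failing_every > 0 and submitted % failing_every == 0
            queue.append((is_failing, retries if is_failing else 0))
        # Advance running tests; free finished devices, re-queue retries.
        still_running = []
        for ticks_left, is_failing, retries_left in running:
            ticks_left -= 1
            if ticks_left > 0:
                still_running.append([ticks_left, is_failing, retries_left])
            elif is_failing and retries_left > 0:
                queue.append((True, retries_left - 1))
        running = still_running
        # Start waiting tests on free devices, in FIFO order.
        while queue and len(running) < devices:
            is_failing, retries_left = queue.popleft()
            duration = timeout_ticks if is_failing else normal_ticks
            running.append([duration, is_failing, retries_left])
        backlog.append(len(queue))
    return backlog

# Hypothetical numbers: the same pool with no failing tests, and with
# half of the tests timing out at 10x the normal duration plus 2 retries.
healthy = simulate_backlog(10, 2, 4, 40, 0, 2, 200)
overloaded = simulate_backlog(10, 2, 4, 40, 2, 2, 200)
```

With these made-up numbers the queue stays empty in the healthy run, while in the overloaded run it keeps growing for as long as the failing tests keep arriving, since each of them occupies a device for many times the normal duration.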
The measures taken to address this situation were:
1. Decrease the default LAVA test timeout to a more reasonable value (a better average, while remaining ready to override it as needed for individual tests).
2. Decrease the number of test failure retries.
3. Increase the number of Jenkins and LAVA workers/devices/containers.
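As a rough back-of-the-envelope check of why these three measures work together, the arithmetic can be sketched as follows. The helper names and all the figures are hypothetical and are not the actual OpenCI settings; the only assumption is that a test which always times out occupies a device for the full timeout on every attempt.

```python
def worst_case_device_seconds(timeout_s, retries):
    """Device time one always-timing-out test can consume, counting retries."""
    return timeout_s * (1 + retries)

def failing_tests_per_hour(devices, timeout_s, retries):
    """How many always-failing tests the device pool can absorb per hour."""
    return devices * 3600 / worst_case_device_seconds(timeout_s, retries)

# Hypothetical "before" and "after" settings, for illustration only.
before = failing_tests_per_hour(devices=10, timeout_s=600, retries=2)
after = failing_tests_per_hour(devices=20, timeout_s=300, retries=1)
```

The three knobs multiply: halving the timeout, dropping one retry and doubling the device count each raise the number of failing tests the pool can absorb before a queue starts to build up.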
Joanna's latest test after these changes showed that the system no longer exhibits the "death spiral" behavior under heavy but realistic load. I also performed an additional "extreme" test, running AllowCI+2 for multiple timeout-failing patches at once. I was still able to reproduce a situation where a Jenkins job timed out on its side, but at least there was no obvious "domino" effect on other jobs.
All this work was tracked via https://linaro.atlassian.net/browse/TFC-498, which contains much more detail in its subtickets. This task is now closed, per the above. Given that OpenCI is a complex and busy system, it is hard to be 100% sure that there was a single underlying issue causing the problems. So, if you see unexpected or problematic behavior, please open a TFC ticket (which is still the standard workflow for reporting and tracking issues).
Thanks, Paul