Hello,
As you may know, in preparation for the upcoming TF-A release, there was an effort to test how the system behaves under the extra load usually caused by the release work, to anticipate what to expect and possibly make adjustments to improve the situation compared with previous releases (where overloads were not uncommon).
The testing was started a few weeks ago by Joanna, who was joined last week by the OpenCI maintenance team, especially after a "death spiral" system behavior, familiar from previous releases, was detected. This behavior was suspected, and has now been confirmed, to be caused by the following chain of circumstances:
1. A patch is submitted and enabled for testing (AllowCI+2) which causes a large number of (LAVA) tests to fail due to timeout.
2. These tests keep FVP virtual devices in LAVA busy for much longer than usual (~10x).
3. As there are many such tests, they block devices and cause the LAVA queue to grow into a bottleneck (400-500 tests waiting).
4. Jenkins jobs also retry failing LAVA tests a number of times, as a stopgap measure against tests which fail non-deterministically.
5. As Jenkins jobs have timeouts of their own, waiting for queued/retried test results caused them to time out as well.
6. All these factors have a positive feedback effect on each other, producing the "death spiral" effect where both Jenkins and LAVA were severely overloaded while doing nothing useful (effectively, just waiting). The whole system was effectively deadlocked: any newly started job just kept waiting until its own timeout and failed, requiring manual intervention to clear this state.
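To make the feedback loop above a bit more concrete, here is a rough back-of-envelope sketch in Python. All numbers (device count, test durations, retry count, arrival rate) are made up purely for illustration and are not the actual OpenCI/LAVA figures; the point is only that once each submission holds devices ~10x longer and is multiplied by retries, arrivals outrun the device pool and the queue grows without bound:

    # Back-of-envelope sketch of the queue feedback loop described above.
    # All numbers are hypothetical, chosen only to illustrate the effect;
    # they are not the real OpenCI/LAVA figures.

    DEVICES = 20            # FVP virtual devices available in LAVA (assumed)
    NORMAL_TEST_MIN = 10    # normal per-test run time, minutes (assumed)
    TIMEOUT_TEST_MIN = 100  # a timing-out test holds a device ~10x longer
    RETRIES = 3             # Jenkins-side retries of a failed LAVA test (assumed)
    ARRIVALS_PER_HOUR = 60  # new tests submitted per hour (assumed)

    def max_throughput_per_hour(test_minutes: float) -> float:
        """How many tests/hour the device pool can finish at a given test duration."""
        return DEVICES * 60.0 / test_minutes

    # Healthy case: plenty of headroom, the queue stays near zero.
    print("healthy capacity:", max_throughput_per_hour(NORMAL_TEST_MIN), "tests/hour")

    # Death-spiral case: every failing test is retried, so each submission
    # consumes (1 + RETRIES) device slots, each held for the full timeout.
    effective_arrivals = ARRIVALS_PER_HOUR * (1 + RETRIES)
    capacity = max_throughput_per_hour(TIMEOUT_TEST_MIN)
    print("overloaded capacity:", capacity, "tests/hour")
    print("effective arrivals:", effective_arrivals, "tests/hour")

    # Whenever effective arrivals exceed capacity, the backlog grows every
    # hour and never drains on its own -- that is the 400-500 deep LAVA queue.
    print("queue growth:", effective_arrivals - capacity, "tests/hour")

With these illustrative numbers the pool can finish only 12 timing-out tests per hour while 240 effective submissions arrive, so the queue deepens by a couple of hundred entries every hour until someone intervenes.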
The measures taken to address this situation were:
1. Decrease the default LAVA test timeout to a more reasonable value (a better average, with the option to override it as needed for individual tests).
2. Decrease the number of test failure retries.
3. Increase the number of Jenkins and LAVA workers/devices/containers.
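A quick sketch of why these three measures reinforce each other; again, the concrete minute/retry/device figures below are assumptions for the sake of the example, not the values actually configured in OpenCI:

    # Rough illustration of how the three measures interact.  The numbers
    # are assumptions for the example, not real OpenCI configuration.

    def worst_case_device_minutes(timeout_min: int, retries: int) -> int:
        """Device time one persistently timing-out test can consume."""
        return timeout_min * (1 + retries)

    before = worst_case_device_minutes(timeout_min=120, retries=3)  # assumed old settings
    after = worst_case_device_minutes(timeout_min=30, retries=1)    # assumed new settings
    print(f"worst case per failing test: {before} min before vs {after} min after")

    # On top of that, adding devices raises the total device-minutes
    # available per hour, so the same burst of failing patches drains faster.
    for devices in (20, 40):  # assumed before/after pool sizes
        print(f"{devices} devices -> {devices * 60} device-minutes/hour of capacity")

In other words, a shorter timeout and fewer retries shrink how much device time a single bad patch can burn, and a bigger pool absorbs whatever is left, so the feedback loop no longer closes.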
Joanna's latest test after these changes showed that the system no longer exhibits the "death spiral" behavior under heavy but realistic load. I also performed an additional "extreme" test of running AllowCI+2 for multiple timeout-failing patches at once. I was still able to reproduce a situation where a Jenkins job timed out on its side, but at least there was no obvious "domino" effect on other jobs.
All this work was tracked via https://linaro.atlassian.net/browse/TFC-498, which contains much more detail in its subtickets. This task is now closed, per the above. Given that OpenCI is a complex and busy system, it is hard to be 100% sure that a single underlying issue caused all the problems. So, if you see unexpected/problematic behavior, please open a TFC ticket (which remains the standard workflow to report and track issues).
Thanks, Paul
Linaro.org | Open source software for ARM SoCs
Follow Linaro: http://www.facebook.com/pages/Linaro - http://twitter.com/#!/linaroorg - http://www.linaro.org/linaro-blog