Hello All,
The Jenkins server will stop processing jobs on 2023-10-13 at around
21:00 UTC, as the server will be put into "Shutdown mode".
This downtime is needed in order to upgrade Jenkins to address some
minor security vulnerabilities and upgrade internal dependencies off of
obsolete versions. This upgrade will be to the same version that's been
running on staging for the past week and a half.
Start: 2023-10-13 21:00 UTC
Finish: 2023-10-13 23:00 UTC
Regards,
--
Kelley Spoon <kelley.spoon(a)linaro.org>
Hello,
As you might know, in preparation for the upcoming TF-A release, there was
a motion to test how the system behaves under the extra load usually caused
by the release work, to anticipate what to expect and possibly to make
adjustments to improve the situation compared with the previous releases
(where overloads were all too common).
The testing was started a few weeks ago by Joanna, who was joined last week
by the OpenCI maintenance team, especially once the "death spiral" system
behavior, familiar from the previous releases, was detected. This behavior
was suspected, and has now been confirmed, to be caused by the following
circumstances:
1. A patch is submitted and enabled for testing (AllowCI+2) which causes
a large number of (LAVA) tests to fail due to timeout.
2. These tests keep FVP virtual devices in LAVA busy for much longer
than usual (~10x).
3. As there are many such tests, they block devices and cause the LAVA
queue to grow and become a bottleneck (400-500 tests waiting).
4. Jenkins jobs also retry failing LAVA tests a number of times as a
stopgap measure against non-deterministic, randomly failing tests.
5. As Jenkins jobs also have timeouts, waiting for queued/retried test
results caused them to time out.
6. All these factors have a positive feedback effect on each other, causing
that "death spiral" effect, where both Jenkins and LAVA were severely
overloaded while doing nothing useful (effectively, waiting). The
whole system was effectively deadlocked: any newly started job just
kept waiting until its timeout and then failed, requiring manual
intervention to clear this state.
The measures to address this situation were:
1. Decrease the default LAVA test timeout to a more reasonable value (a
better average value, while being ready to override it as needed for
individual tests).
2. Decrease the number of test failure retries.
3. Increase the number of Jenkins and LAVA workers/devices/containers.
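To make the effect of these three knobs concrete, here is a rough back-of-envelope sketch. It is only an illustration: the function names and every number in it (timeouts, retry counts, device counts) are hypothetical and are not taken from the actual OpenCI configuration. It assumes every affected test runs until its timeout and is retried the maximum number of times.

```python
import math


def worst_case_device_minutes(timeout_min, retries):
    """Upper bound on device time consumed by one test that always times out.

    The test runs once plus `retries` more times, each until the timeout.
    """
    return timeout_min * (1 + retries)


def queue_drain_minutes(failing_tests, devices, timeout_min, retries):
    """Rough time for `devices` parallel devices to clear all failing tests."""
    rounds = math.ceil(failing_tests / devices)
    return rounds * worst_case_device_minutes(timeout_min, retries)


# Hypothetical "before" and "after" settings, for illustration only.
before = queue_drain_minutes(failing_tests=400, devices=50, timeout_min=60, retries=3)
after = queue_drain_minutes(failing_tests=400, devices=100, timeout_min=20, retries=1)
print(f"before: {before} min, after: {after} min")
```

Under these made-up numbers the three measures multiply together: a shorter timeout, fewer retries and more devices each reduce the time a bad patch can keep the queue blocked.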
Joanna's latest test after these changes showed that the system no
longer exhibits "death spiral" behavior under heavy, but realistic load.
I also performed an additional "extreme" test of running AllowCI+2
for multiple timeout-failing patches at once. I was still able to
reproduce a situation where a Jenkins job timed out on its side, but at
least there was no obvious "domino" effect on other jobs.
All this work was tracked via
https://linaro.atlassian.net/browse/TFC-498, which contains much more
detail in its subtickets. This task is closed now, per the above. Given
that OpenCI is a complex and busy system, it is hard to be 100% sure that
a single underlying issue caused the problems. So, if you see
unexpected/problematic behavior, please open a TFC ticket (which is
still the standard workflow to report and track issues).
Thanks,
Paul
Linaro.org | Open source software for ARM SoCs
Follow Linaro: http://www.facebook.com/pages/Linaro - http://twitter.com/#!/linaroorg - http://www.linaro.org/linaro-blog
Hello,
I would like to share some developments and updates regarding
TrustedFirmware MISRA testing throughout September:
1. MISRA CI testing for TF-M was formally launched. That doesn't mean
it is running at full capacity yet, but the infrastructure is in place, and
the next steps are for the TF-M team to see how it fits into their
development workflow, and decide how to address identified MISRA issues
- either record them as deviations or fix them in the TF-M source code
(likely a combination of both).
2. One of the developments done for the TF-M testing was the implementation
of a cumulative report across multiple configurations (vs. a myriad of
individual per-configuration reports, which are hard to follow). This
feature was already forward-ported to the TF-A "daily" build. It
immediately made visible the fact that a MISRA mandatory rule violation
had crept into the codebase:
https://ci-builds.trustedfirmware.org/static-files/llodfObQwsfBE_M8BN9W1URq…
(select "Mandatory rules - violations"; note that the link will expire
after some time).
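The idea of a cumulative report can be sketched as follows. This is a minimal illustration only, not the actual CI implementation: the data layout (a mapping from configuration name to a list of `(rule, file, line)` tuples), the function name `merge_reports`, and all sample entries are assumptions made up for this example.

```python
from collections import defaultdict


def merge_reports(per_config):
    """Fold per-configuration violation lists into one cumulative mapping.

    `per_config` maps a configuration name to a list of violations, each a
    hashable (rule, file, line) tuple (hypothetical layout). The result maps
    each distinct violation to the sorted list of configurations showing it.
    """
    merged = defaultdict(set)
    for config, violations in per_config.items():
        for violation in violations:
            merged[violation].add(config)
    return {v: sorted(configs) for v, configs in merged.items()}


# Placeholder data: rule names, paths and configurations are invented.
reports = {
    "config_a": [("Rule 9.1", "src/a.c", 10), ("Rule 17.3", "src/b.c", 5)],
    "config_b": [("Rule 9.1", "src/a.c", 10)],
}
for violation, configs in merge_reports(reports).items():
    print(violation, "seen in", configs)
```

A violation that shows up in many configurations is then listed once, together with the configurations affected, instead of being repeated in each individual report.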
Further development plans are:
1. Cooperate with the TF-M team on the configuration of MISRA rules, etc.,
to get the reports into a shape useful for developers and
contributors.
2. Forward-port the cumulative report feature to the TF-A "delta" (i.e.
patch) testing.
These will be worked on starting from October, subject to other feature
development and maintenance work.
Thanks,
Paul
Hello all,
The Gerrit server on review.trustedfirmware.org will be offline for a
2-hour maintenance window starting today (Oct 4, 2023) at 19:00 UTC.
This downtime is needed in order to upgrade Gerrit to the current version
(3.8.1).
Changelog is available at:
--
Kelley Spoon <kelley.spoon(a)linaro.org>
Hello All,
The Jenkins server will stop processing jobs on 2023-10-03 at around
21:00 UTC, as the server will be put into "Shutdown mode".
This downtime is needed in order to upgrade Jenkins to address some minor
security vulnerabilities and upgrade internal dependencies off of obsolete
versions.
Start: 2023-10-03 21:00 UTC
Finish: 2023-10-03 23:00 UTC
--
Kelley Spoon <kelley.spoon(a)linaro.org>