Hi Varun,

The PSCI power down/up operations depend on the power domain level being managed. Depending on the power domain topology declared by the platform, the power domain locks corresponding to the suspend request and the containing CPU node are acquired.

For the sake of simplicity, let us assume that power domain level 0 represents the CPU and level 1 represents the cluster. When the cluster power domain is powered down, it affects hardware resources (like the L2 cache and interconnects) shared by the CPUs within the cluster, and hence the last CPU in the cluster (determined via PSCI state coordination) must hold a lock to maintain a critical section while these operations are performed. The case is similar for the first CPU in the cluster: it is responsible for initializing all the shared resources within the cluster, and other woken-up CPUs in the same cluster must not be allowed to initialize per-CPU resources before the cluster-level initialization is complete.
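To make the last-CPU/first-CPU point concrete, here is a minimal host-side sketch in C. It is not the TF-A implementation: `struct cluster_node`, `cpu_power_down()` and `cpu_power_up()` are hypothetical names, and a C11 `atomic_flag` spinlock stands in for the real lock. It only shows why the check "am I the last/first CPU?" and the cluster-level work have to sit inside one critical section.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define CPUS_PER_CLUSTER 4

/* Hypothetical model of one cluster power domain node (level 1). */
struct cluster_node {
    atomic_flag lock;               /* guards all fields below */
    bool cpu_on[CPUS_PER_CLUSTER];  /* state of each level 0 (CPU) node */
    bool cluster_on;                /* coordinated cluster state */
    int cluster_inits;              /* times shared resources were set up */
    int cluster_teardowns;          /* times shared resources were torn down */
};

static void node_lock(struct cluster_node *n)
{
    while (atomic_flag_test_and_set_explicit(&n->lock, memory_order_acquire))
        ; /* spin until the flag was previously clear */
}

static void node_unlock(struct cluster_node *n)
{
    atomic_flag_clear_explicit(&n->lock, memory_order_release);
}

/* Start with every CPU and the cluster powered on. */
void cluster_node_init(struct cluster_node *n)
{
    atomic_flag_clear(&n->lock);
    for (int i = 0; i < CPUS_PER_CLUSTER; i++)
        n->cpu_on[i] = true;
    n->cluster_on = true;
    n->cluster_inits = 0;
    n->cluster_teardowns = 0;
}

/* Returns true if the caller was the last CPU and tore the cluster down. */
bool cpu_power_down(struct cluster_node *n, int cpu)
{
    bool last = true;

    assert(cpu >= 0 && cpu < CPUS_PER_CLUSTER);
    node_lock(n);
    n->cpu_on[cpu] = false;
    for (int i = 0; i < CPUS_PER_CLUSTER; i++) {
        if (n->cpu_on[i])
            last = false;
    }
    if (last) {
        /* Stand-in for the shared-resource (L2, interconnect) teardown. */
        n->cluster_on = false;
        n->cluster_teardowns++;
    }
    node_unlock(n);
    return last;
}

/* Returns true if the caller was the first CPU and initialized the cluster. */
bool cpu_power_up(struct cluster_node *n, int cpu)
{
    bool first;

    assert(cpu >= 0 && cpu < CPUS_PER_CLUSTER);
    node_lock(n);
    first = !n->cluster_on;
    if (first) {
        /* Cluster init finishes before any other CPU gets past the lock. */
        n->cluster_on = true;
        n->cluster_inits++;
    }
    n->cpu_on[cpu] = true;
    node_unlock(n);
    return first;
}
```

Without the lock, two CPUs racing through `cpu_power_up()` could both see `cluster_on == false`, or one could continue with its per-CPU setup while the other is still in the middle of the cluster initialization.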

Note that Level 0 suspend requests will not need to acquire locks.

With the introduction of DynamIQ CPUs and the DSU, the cluster resources are managed by the hardware (in collaboration with the SCP), AIUI. This means that the firmware no longer has to run software sequences to initialize/finalize cluster resources; all that is needed is a write to the per-thread CLUSTERPWRDN register [1]. As a result, the PSCI lib no longer needs to maintain the critical section for the cluster power domain level, and the locks for the cluster level and lower can be optimized [state coordination still has to be done with the locks acquired, but this is a short step and the locks can be released after that].

If your platform does indeed have a DSU, then these cluster-level locks can be removed. This means that the generic PSCI lib code now needs to figure out the cluster power domain level from the platform (or CPU lib) and should not acquire locks for power domain nodes <= the cluster power domain level.
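As a rough illustration of that last point, the sketch below (plain C, runnable on a host) shows lock acquisition that starts above a hardware-managed level. Everything in it is an assumption made for illustration: `struct lock_state`, `acquire_domain_locks()`, `release_domain_locks()` and the `hw_managed_lvl` parameter are hypothetical and are not existing PSCI lib interfaces. It also leaves out the short state coordination step mentioned above.

```c
#include <assert.h>
#include <stdbool.h>

#define CPU_PWR_LVL 0U  /* level 0: CPU, never needs a lock */
#define MAX_PWR_LVL 2U  /* hypothetical topology: CPU, cluster, system */

/* Hypothetical stand-in for the per-level power domain locks. */
struct lock_state {
    bool held[MAX_PWR_LVL + 1U];
};

/*
 * Take locks for every level above hw_managed_lvl, up to end_lvl.
 * hw_managed_lvl == CPU_PWR_LVL models a platform without a DSU;
 * hw_managed_lvl == 1 models a DSU-managed cluster level.
 * Returns the number of locks taken.
 */
unsigned int acquire_domain_locks(struct lock_state *s, unsigned int end_lvl,
                                  unsigned int hw_managed_lvl)
{
    unsigned int taken = 0U;

    assert(end_lvl <= MAX_PWR_LVL);
    for (unsigned int lvl = hw_managed_lvl + 1U; lvl <= end_lvl; lvl++) {
        assert(!s->held[lvl]);
        s->held[lvl] = true;
        taken++;
    }
    return taken;
}

/* Release in the reverse order of acquisition. */
void release_domain_locks(struct lock_state *s, unsigned int end_lvl,
                          unsigned int hw_managed_lvl)
{
    for (unsigned int lvl = end_lvl; lvl > hw_managed_lvl; lvl--)
        s->held[lvl] = false;
}
```

With `hw_managed_lvl` set to the cluster level, a cluster-level suspend request takes no locks at all, and a system-level request only takes the lock above the cluster.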

Best Regards

Soby Mathew

[1] See section B1.14 in DSU TRM https://developer.arm.com/documentation/100453/0401

From: Varun Wadekar <vwadekar@nvidia.com>
Sent: 02 November 2021 11:53
To: Soby Mathew <Soby.Mathew@arm.com>; Manish Pandey2 <Manish.Pandey2@arm.com>; Dan Handley <Dan.Handley@arm.com>; tf-a@lists.trustedfirmware.org
Cc: Joanna Farley <Joanna.Farley@arm.com>; Matteo Carlini <Matteo.Carlini@arm.com>
Subject: RE: PSCI lock contention

Thanks, Soby. I agree that this needs to be re-evaluated for platforms. I think we should introduce an option to disable the locks, if required.

We plan to try some more experiments and hopefully remove the locks at least for Tegra platforms.

Looking forward to the elaborate answer.

From: Soby Mathew <Soby.Mathew@arm.com>
Sent: Tuesday, 2 November 2021 10:18 AM
To: Varun Wadekar <vwadekar@nvidia.com>; Manish Pandey2 <Manish.Pandey2@arm.com>; Dan Handley <Dan.Handley@arm.com>
Cc: Joanna Farley <Joanna.Farley@arm.com>; Matteo Carlini <Matteo.Carlini@arm.com>
Subject: RE: PSCI lock contention

Hi Varun,

The short answer is that the locks are used to differentiate the last-CPU-to-suspend and, similarly, the first-CPU-to-powerup at a given power domain level. Now, recent CPU features like DynamIQ mean that we don’t need to do this differentiation up to cluster level, which TF-A hasn’t optimized for yet AFAICS. I am happy to elaborate further, but could you please send the query to the TF-A mailing list, as I would prefer this discussion to happen in the open if possible.

Best Regards

Soby Mathew

From: Varun Wadekar <vwadekar@nvidia.com>
Sent: 01 November 2021 20:14
To: Soby Mathew <Soby.Mathew@arm.com>; Manish Pandey2 <Manish.Pandey2@arm.com>; Dan Handley <Dan.Handley@arm.com>
Cc: Joanna Farley <Joanna.Farley@arm.com>; Matteo Carlini <Matteo.Carlini@arm.com>
Subject: PSCI lock contention

Hi,

We have been benchmarking CPU_SUSPEND performance on Tegra platforms. We take all CPU cores to CPU_SUSPEND and then wake them up with IPIs, both all at once and in serial order. From the numbers, we see that the CPUs powering up later take more time than the first one. We have narrowed most of the time consumed down to the PSCI locks, which are documented at docs/perf/psci-performance-juno.rst.

Can you please help me understand why these locks were added? As a quick experiment we tried the same benchmarking *without* the locks and the firmware does not blow up, but I would like to understand the impact based on the analysis on Juno (docs/perf/psci-performance-juno.rst).

Happy to hop on a call to discuss further.

Thanks.
