Hi Varun,

We may need to understand how the cluster coordination mechanism in hardware works because, if we do go in the platform-option direction, this configuration would need to permeate into the various data structures and code in the PSCI lib (this option would cover more aspects than just the locks). I would consider that a significant design activity for TF-A.

 

Best Regards

Soby Mathew

 

From: Varun Wadekar <vwadekar@nvidia.com>
Sent: 11 November 2021 17:39
To: Soby Mathew <Soby.Mathew@arm.com>
Cc: tf-a@lists.trustedfirmware.org
Subject: RE: PSCI lock contention

 

Hi Soby,

 

Glad that we have a way forward. I understand that you are looking at it from the DSU side, but for older Tegra platforms (without a DSU) we have hardware mechanisms that can manage cluster power state entry/exit. So, we should look at this as a platform configuration option.

 

I hope this does not further complicate the design.

 

-Varun

 

From: Soby Mathew <Soby.Mathew@arm.com>
Sent: Monday, 8 November 2021 11:31 PM
To: Varun Wadekar <vwadekar@nvidia.com>
Cc: tf-a@lists.trustedfirmware.org
Subject: RE: PSCI lock contention

 

Hi Varun,

Further clarification on my statement:

 

> We do need a lock during state coordination because we reference the power-down requests from multiple CPUs for a given power domain level. But this lock could perhaps be decoupled from the power domain locks, which might allow for optimal lock removal.

 

On further reflection, my statement above describes a software design limitation in the PSCI lib. Effectively, the state coordination is trying to determine the shallowest sleep state agreed upon by all the PEs in the power domain. This is currently done via software data structures in code. With the introduction of the CLUSTERPWRDN register, we have a hardware voting mechanism wherein every PE/thread of execution votes on whether the cluster power domain can be powered off, and this has the same effect as the state coordination in software.
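
 

As an illustration, here is a minimal sketch of what the software coordination conceptually computes (the names and data structures are invented for clarity and do not match the actual PSCI lib code):

#include <stdint.h>

#define MAX_CPUS_PER_CLUSTER    8U

/* Requested local power state per CPU, e.g. 0 = run, 2 = power down */
static uint8_t requested_state[MAX_CPUS_PER_CLUSTER];

/*
 * The coordinated cluster state is the shallowest (minimum) state
 * requested by any CPU in the cluster: a single CPU that wants to
 * stay up keeps the whole cluster up.
 */
static uint8_t coordinate_cluster_state(unsigned int ncpus)
{
    uint8_t target = requested_state[0];

    for (unsigned int i = 1U; i < ncpus; i++) {
        if (requested_state[i] < target)
            target = requested_state[i];
    }
    return target;
}

The CLUSTERPWRDN vote effectively moves this min-across-CPUs computation for the cluster level into hardware.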

 

So the answer is: yes, we don't need to hold locks for state coordination up to the cluster level, but the work is more involved than just removing the associated locks. We would need some design changes to accommodate this hardware voting mechanism bypassing software state coordination. But I imagine the performance boost would be significant, as there is no longer a need to serialize suspend requests up to the cluster power domain level in firmware.

 

Best Regards

Soby Mathew

 

From: TF-A <tf-a-bounces@lists.trustedfirmware.org> On Behalf Of Soby Mathew via TF-A
Sent: 08 November 2021 10:50
To: Varun Wadekar <vwadekar@nvidia.com>
Cc: tf-a@lists.trustedfirmware.org
Subject: Re: [TF-A] PSCI lock contention

 

Hi Varun,

 

> [VW] Can we remove the need for a lock for this step? What do you reckon?

 

We do need a lock during state coordination because we reference the power-down requests from multiple CPUs for a given power domain level. But this lock could perhaps be decoupled from the power domain locks, which might allow for optimal lock removal.

 

> [VW] Can you elaborate on the mechanism to remove the locks? I was planning on removing the spinlock from the PSCI lib when HW_COHERENCY_SUPPORT=1. My understanding is that HW_COHERENCY_SUPPORT=1 is only for platforms that have a DSU.

> Are you saying that we need a runtime check instead? If yes, I recommend a platform-level function to indicate that the platform supports cluster resource management in hardware. Sounds good?

 

We cannot remove all the locks in the power domain topology. Only locks up to the cluster power domain level can be removed; locks at levels above the cluster domain should be retained. The other complexity is that the cluster power domain level depends on the platform/CPU: for example, on a multithreaded CPU the cluster level is 2, whereas on a non-threaded CPU it would be 1.
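
 

As an illustration, the cluster power domain level could be derived at runtime from the MT bit in MPIDR_EL1; the helper below is a hypothetical sketch (in practice this would come from the platform port or the CPU lib):

#include <stdint.h>

#define MPIDR_MT_MASK    (1ULL << 24)    /* Multithreading bit in MPIDR_EL1 */

static inline uint64_t read_mpidr_el1(void)
{
    uint64_t v;

    __asm__ volatile("mrs %0, mpidr_el1" : "=r"(v));
    return v;
}

/*
 * Hypothetical helper: on a multithreaded CPU the power domain levels
 * are thread (0), core (1), cluster (2); on a single-threaded CPU they
 * are core (0), cluster (1).
 */
static unsigned int plat_cluster_pwr_lvl(void)
{
    return ((read_mpidr_el1() & MPIDR_MT_MASK) != 0ULL) ? 2U : 1U;
}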

The PSCI lib would either need to acquire these locks for state coordination and release them soon after, or use a separate set of locks for state coordination and not acquire the power domain locks at all.
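
 

A rough sketch of the second option, assuming TF-A's <lib/spinlock.h> primitives (every other name here is hypothetical):

#include <stdint.h>
#include <lib/spinlock.h>    /* TF-A spinlock_t, spin_lock(), spin_unlock() */

/* Shallowest-state computation, as in the coordination sketch above */
static uint8_t coordinate_cluster_state(unsigned int ncpus);

/* Dedicated coordination lock, decoupled from the power domain tree locks */
static spinlock_t coord_lock;

static uint8_t coordinate_and_release(unsigned int ncpus)
{
    uint8_t target;

    spin_lock(&coord_lock);
    target = coordinate_cluster_state(ncpus);
    spin_unlock(&coord_lock);

    /*
     * The power domain tree locks are never taken; the rest of the
     * suspend path proceeds without serializing against other CPUs.
     */
    return target;
}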

 

The other aspect, which I haven't fully investigated, is whether the CLUSTERPWRDN register is well supported on Total Compute FVPs (SCP support needs to be present, as I understand it). We also need to understand the situation on AEM FVPs, since they may not expose the DSU registers for firmware to program, and the implications if the registers are not present. None of this prevents the generic PSCI lib optimization discussed above, but it is something we need to investigate on our side before this can be enabled and tested on Arm platforms.

 

Best Regards

Soby Mathew

 

From: Varun Wadekar <vwadekar@nvidia.com>
Sent: 05 November 2021 10:38
To: Soby Mathew <Soby.Mathew@arm.com>
Cc: tf-a@lists.trustedfirmware.org
Subject: RE: PSCI lock contention

 

Hi Soby,

 

The explanation matches my understanding. Thanks.

 

>> [There is still state coordination to do with the locks acquired, but this is a short step and the locks can be released after that]

 

[VW] Can we remove the need for a lock for this step? What do you reckon?

 

>> If your platform does indeed have a DSU, then these cluster-level locks can be removed. This means that the generic PSCI lib code now needs to figure out the cluster power domain level from the platform (or CPU lib) and should not acquire locks for power domain nodes at or below the cluster power domain level.

 

[VW] Can you elaborate on the mechanism to remove the locks? I was planning on removing the spinlock from the PSCI lib when HW_COHERENCY_SUPPORT=1. My understanding is that HW_COHERENCY_SUPPORT=1 is only for platforms that have a DSU.

 

Are you saying that we need a runtime check instead? If yes, I recommend a platform-level function to indicate that the platform supports cluster resource management in hardware. Sounds good?
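
 

For illustration, a minimal sketch of the hook I have in mind (the function name is invented; HW_COHERENCY_SUPPORT is the build flag mentioned above):

#include <stdbool.h>

/*
 * Hypothetical platform-level hook: returns true when the hardware
 * (e.g. the DSU, or the cluster power controller on older Tegra chips)
 * manages cluster power state entry/exit, so the generic PSCI lib can
 * skip cluster-level locking at runtime.
 */
bool plat_hw_manages_cluster_pm(void)
{
#if HW_COHERENCY_SUPPORT
    return true;
#else
    return false;
#endif
}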

 

-Varun

 

From: Soby Mathew <Soby.Mathew@arm.com>
Sent: Wednesday, 3 November 2021 3:03 PM
To: Varun Wadekar <vwadekar@nvidia.com>; tf-a@lists.trustedfirmware.org
Subject: RE: PSCI lock contention

 

Hi Varun,

The PSCI power down/up operations depend on the power domain level being managed. Depending on the power domain topology declared by the platform, the power domain locks corresponding to the suspend request and the containing CPU node are acquired.

 

For the sake of simplicity, let us assume that power domain level 0 represents the CPU and level 1 represents the cluster. When a cluster power domain is powered down, it affects hardware resources (like the L2 cache and interconnect) shared by the CPUs within the cluster, and hence the last CPU in the cluster (determined via PSCI state coordination) must hold a lock to maintain a critical section while these operations are performed. The case is similar for the first CPU in the cluster to wake up, as it is responsible for initializing all the shared resources within the cluster, and other woken-up CPUs in the same cluster must not be allowed to initialize per-CPU resources before the cluster-level initialization is complete.
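
 

In other words, something like this simplified sketch (invented names; the real PSCI lib operates on power domain tree nodes and per-node locks):

#include <lib/spinlock.h>    /* TF-A spinlock primitives */

static spinlock_t cluster_lock;
static unsigned int cpus_online_in_cluster;

void cpu_suspend_path(void)
{
    spin_lock(&cluster_lock);
    cpus_online_in_cluster--;
    if (cpus_online_in_cluster == 0U) {
        /* Last CPU: flush L2, quiesce the interconnect, etc. */
    }
    spin_unlock(&cluster_lock);
    /* ... enter power down ... */
}

void cpu_wakeup_path(void)
{
    spin_lock(&cluster_lock);
    if (cpus_online_in_cluster == 0U) {
        /* First CPU: re-initialize cluster-shared resources */
    }
    cpus_online_in_cluster++;
    spin_unlock(&cluster_lock);
    /* Per-CPU initialization can now proceed safely */
}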

 

Note that Level 0 suspend requests will not need to acquire locks.

 

With the introduction of DynamIQ CPUs and the DSU, the cluster resources are managed by the hardware (in collaboration with the SCP), as I understand it. This means that the firmware no longer has to execute software sequences to initialize/finalize cluster resources; all that is needed is a write to a per-thread CLUSTERPWRDN register [1]. The PSCI lib therefore no longer needs to maintain the critical section for the cluster power domain level, and the locks for the cluster level and lower can be optimized away [there is still state coordination to do with the locks acquired, but this is a short step and the locks can be released after that].
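
 

For illustration, a sketch of that per-thread write, assuming the CLUSTERPWRDN_EL1 encoding S3_0_C15_C3_6 from the DSU TRM referenced below; the bit assignment should be verified against the TRM for your DSU revision:

#include <stdint.h>

#define CLUSTERPWRDN_EL1        "S3_0_C15_C3_6"    /* DSU per-thread register */
#define CLUSTERPWRDN_PWRDN_BIT  (1ULL << 0)        /* request cluster power down */

static inline void vote_cluster_powerdown(void)
{
    uint64_t val;

    __asm__ volatile("mrs %0, " CLUSTERPWRDN_EL1 : "=r"(val));
    val |= CLUSTERPWRDN_PWRDN_BIT;    /* this thread's "power down" vote */
    __asm__ volatile("msr " CLUSTERPWRDN_EL1 ", %0" :: "r"(val));
    __asm__ volatile("isb");
}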

 

If your platform does indeed have a DSU, then these cluster-level locks can be removed. This means that the generic PSCI lib code now needs to figure out the cluster power domain level from the platform (or CPU lib) and should not acquire locks for power domain nodes at or below the cluster power domain level.
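
 

A sketch of that change to the lock acquisition path (all names are illustrative, not the actual PSCI lib API):

/* Hypothetical query, answered by the platform port or the CPU lib */
unsigned int plat_cluster_pwr_lvl(void);

/* Hypothetical per-node lock acquisition */
void lock_power_domain_node(unsigned int cpu_idx, unsigned int lvl);

void psci_acquire_locks(unsigned int cpu_idx, unsigned int end_pwrlvl)
{
    unsigned int cluster_lvl = plat_cluster_pwr_lvl();

    /* Level 0 (CPU) nodes are never locked; start at level 1 */
    for (unsigned int lvl = 1U; lvl <= end_pwrlvl; lvl++) {
        /* Hardware coordinates everything up to the cluster level */
        if (lvl <= cluster_lvl)
            continue;
        lock_power_domain_node(cpu_idx, lvl);
    }
}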

 

Best Regards

Soby Mathew

 

[1] See section B1.14 in DSU TRM https://developer.arm.com/documentation/100453/0401

 

From: Varun Wadekar <vwadekar@nvidia.com>
Sent: 02 November 2021 11:53
To: Soby Mathew <Soby.Mathew@arm.com>; Manish Pandey2 <Manish.Pandey2@arm.com>; Dan Handley <Dan.Handley@arm.com>; tf-a@lists.trustedfirmware.org
Cc: Joanna Farley <Joanna.Farley@arm.com>; Matteo Carlini <Matteo.Carlini@arm.com>
Subject: RE: PSCI lock contention

 

<Adding TF-A mailing list to the discussion>

 

Thanks, Soby. I agree that this needs to be re-evaluated for each platform. I think we should introduce an option to disable the locks, if required.

 

We plan to try some more experiments and hopefully remove the locks at least for Tegra platforms.

 

Looking forward to the elaborate answer.

 

From: Soby Mathew <Soby.Mathew@arm.com>
Sent: Tuesday, 2 November 2021 10:18 AM
To: Varun Wadekar <vwadekar@nvidia.com>; Manish Pandey2 <Manish.Pandey2@arm.com>; Dan Handley <Dan.Handley@arm.com>
Cc: Joanna Farley <Joanna.Farley@arm.com>; Matteo Carlini <Matteo.Carlini@arm.com>
Subject: RE: PSCI lock contention

 

Hi Varun,

The short answer is that the locks are used to identify the last CPU to suspend and, similarly, the first CPU to power up at a given power domain level. Now, recent CPU features like DynamIQ mean that we don't need to do this differentiation up to the cluster level, which TF-A hasn't optimized for yet, AFAICS. I am happy to elaborate further, but could you please send the query to the TF-A mailing list, as I would prefer this discussion to happen in the open if possible.

 

Best Regards

Soby Mathew

 

From: Varun Wadekar <vwadekar@nvidia.com>
Sent: 01 November 2021 20:14
To: Soby Mathew <Soby.Mathew@arm.com>; Manish Pandey2 <Manish.Pandey2@arm.com>; Dan Handley <Dan.Handley@arm.com>
Cc: Joanna Farley <Joanna.Farley@arm.com>; Matteo Carlini <Matteo.Carlini@arm.com>
Subject: PSCI lock contention

 

Hi,

 

We have been benchmarking CPU_SUSPEND performance on Tegra platforms. We take all CPU cores into CPU_SUSPEND and then wake them up with IPIs, both all at once and in serial order. From the numbers, we see that the CPUs powering up later take more time than the first one. We have narrowed most of the time consumed down to the PSCI locks documented at docs/perf/psci-performance-juno.rst.

 

Can you please help me understand why these locks were added? As a quick experiment, we tried the same benchmarking *without* the locks and the firmware does not blow up, but I would like to understand the impact in light of the analysis on Juno (docs/perf/psci-performance-juno.rst).

 

Happy to hop on a call to discuss further.

 

Thanks.
