Hi Soby et al.,
I'd like to implement a small new feature and ask for some guidance on
how to go about it: Chrome OS has the ability to automatically collect
crash reports from runtime crashes in Trusted Firmware, and we would
like to set up automated tests to ensure this feature stays working.
In order to do this we need a way for the non-secure OS to
intentionally trigger a panic in EL3. The obvious solution would be to
implement a new SMC for that. (It's common for operating systems to
have similar facilities, e.g. Linux can force a kernel panic by
writing 'c' into /proc/sysrq-trigger.)
My main question is: where should I get an SMC function ID for this?
This is not a silicon or OEM specific feature, so the SiP Service
Calls and OEM Service Calls ID ranges seem inappropriate (or do you
think it would make sense to treat Google or Chrome OS as the "OEM"
here, even though that's not quite accurate?). There are ranges for
Trusted Applications and the Trusted OS but unfortunately none for the
normal world OS. Is this something that would make sense to allocate
under Standard Service Calls? Could you just find an ID for me to use
there or does everything in that range need a big specification
document written by Arm?
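
For readers less familiar with the ID ranges mentioned above, here is a
minimal sketch of how a fast-call SMC function ID is laid out under the SMC
Calling Convention, and which Owning Entity Number (OEN) each range uses.
The macro and function names are illustrative only and are not taken from
the TF-A sources.

```c
#include <stdint.h>

/* Field layout of a fast-call SMC function ID (SMC Calling Convention). */
#define SMC_FAST_CALL   (UINT32_C(1) << 31)   /* bit 31: fast call */
#define SMC_CALL_64     (UINT32_C(1) << 30)   /* bit 30: SMC64 convention */
#define SMC_OEN_SHIFT   24                    /* bits 29:24: owning entity */
#define SMC_OEN_MASK    UINT32_C(0x3f)
#define SMC_FUNC_MASK   UINT32_C(0xffff)      /* bits 15:0: function number */

/* Owning Entity Numbers for the ranges discussed in this thread. */
#define OEN_ARM_ARCH    UINT32_C(0)    /* Arm Architecture Calls */
#define OEN_SIP         UINT32_C(2)    /* SiP Service Calls */
#define OEN_OEM         UINT32_C(3)    /* OEM Service Calls */
#define OEN_STD         UINT32_C(4)    /* Standard Service Calls (e.g. PSCI) */
#define OEN_TAP_FIRST   UINT32_C(48)   /* Trusted Applications: 48-49 */
#define OEN_TOS_FIRST   UINT32_C(50)   /* Trusted OS: 50-63 */

/* Build an SMC32 fast-call function ID from an OEN and a function number. */
static inline uint32_t smc_fast32_id(uint32_t oen, uint32_t func)
{
    return SMC_FAST_CALL |
           ((oen & SMC_OEN_MASK) << SMC_OEN_SHIFT) |
           (func & SMC_FUNC_MASK);
}

/* Extract the Owning Entity Number from a function ID. */
static inline uint32_t smc_get_oen(uint32_t fid)
{
    return (fid >> SMC_OEN_SHIFT) & SMC_OEN_MASK;
}
```

For example, smc_fast32_id(OEN_STD, 0) yields 0x84000000, the base of the
SMC32 Standard Service range in which the PSCI calls live. There is no OEN
set aside for the normal world OS, which is the gap described above.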
Thanks,
Julius
Hi Julius,
OK, in that case I can see that a solution based on TF-A's DebugFS
interface might not be desirable. Indeed, our original intention was to
make the whole DebugFS system a debug-only feature (hence its name!). As
such, I agree that it is likely not to get the same level of scrutiny
and testing as other features intended for production systems.
One of the main use cases we have in mind for DebugFS is being able to
peek and poke into the firmware for testing purposes. Today, when doing
functional testing from the normal world (for example, using TF-A
Tests), we are limited to what's exposed through the SMC interface. And
even then, we have limited visibility on what really happened in the
firmware, as we can only deduce so much from the SMC return value(s).
DebugFS could be used to bridge this gap, by providing a side channel
for getting internal firmware state information.
Going back to the SMC-based solution then, I am not quite convinced
SYSTEM_RESET2 is the right interface for intentionally triggering a
panic in TF-A. I think the semantics do not quite match. If anything, a
firmware crash seems more like a shutdown operation to me rather than a
reset (we don't recover from a firmware crash). I am not even sure we
should look into the PSCI SMC range, as it's not a power-management
operation.
Julius, you wrote:
> It's the same problem that the SMC/PSCI spec and the TF repository layout is only designed to deal with generic vs. SoC-vendor-specific differentiation. If the normal world OS needs a feature, we can only make it generic or duplicate it across all vendors running that OS.
So it sounds like it's not the first time that you hit this issue, is
it? Do you have any other example of Normal World OS feature you would
have liked to expose through a generic SMC interface? I am wondering
whether this could help choosing the right SMC range, if we can identify
some common criteria among a set of such features.
Regards,
Sandrine
Hi Julius
> -----Original Message-----
> From: Julius Werner <jwerner(a)chromium.org>
> Sent: 11 September 2019 03:00
> To: Dan Handley <Dan.Handley(a)arm.com>
> Cc: tf-a(a)lists.trustedfirmware.org
> Subject: Re: [TF-A] SMC to intentionally trigger a panic in TF-A
>
> Hi Dan,
>
> Whoops, sorry, this fell through the cracks for me since I wasn't on the to:
> line. Thanks for your response!
>
You're welcome.
<snip>
> > However, I think there might already be support for what you need. PSCI is
> part of the standard service and the function SYSTEM_RESET2 allows for both
> architectural and vendor-specific resets. The latter allows for vendor-
> specific semantics, which could include crashing the firmware as you suggest.
> >
> > Chrome OS could specify what such a vendor-specific reset looks like and
> each Chromebook's platform PSCI hooks could be implemented accordingly.
>
> Right, but defining a separate vendor-specific reset type for each platform
> is roughly the same as defining a separate SiP SMC for each of them. It's the
> same problem that the SMC/PSCI spec and the TF repository layout is only
> designed to deal with generic vs.
> SoC-vendor-specific differentiation. If the normal world OS needs a feature,
> we can only make it generic or duplicate it across all vendors running that
> OS.
>
Not quite. The SiP SMC range is already populated with existing SiP stuff, whereas the vendor-specific bits of the reset_type in PSCI SYSTEM_RESET2 are unlikely to contain much, if any, vendor-specific stuff. Therefore Chrome OS could define something "generic to Chrome OS" in this space that Chromebooks could implement. There could also be a Chrome OS specific folder for this kind of functionality that Chromebooks pull in.
> > Alternatively, this could potentially be defined as an additional
> architectural reset. This would enable a generic implementation but would
> require approval/definition by Arm's Architecture team. Like me they might
> have concerns about this being defined at a generic architectural level.
>
> Yes, I think that would be the best option. Could you kick off that process
> with the Architecture team? Or tell me who I should talk to about this?
>
OK, I'll fire off an email internally now and then either put you in contact or let you know how it goes.
Regards
Dan.
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
Hi Julius,
As you were mentioning that the Linux kernel uses /proc/sysrq-trigger
for a similar purpose, I was wondering whether you'd be open to a
solution based on a "DebugFS" entry. As you may have seen on the mailing
list, Olivier posted a proposal for introducing a firmware debug
interface, which has many similarities to how /proc or /sys works in the
kernel world:
https://lists.trustedfirmware.org/pipermail/tf-a/2019-October/000120.html
TF-A patches for this feature are up for review right now and Olivier
has also posted some TF-A Tests patches that demonstrate how this can be
used from normal world. In addition, we are also working on a Linux
driver for this.
As you can imagine, DebugFS uses an SMC interface under the hood
(currently allocated in the SiP range). But being an abstraction over
the SMC layer, which specific SMC function ID is used does not matter so
much and it does not need to be standardized by any Arm specification.
You'd need to mandate all Chrome OS devices to have this DebugFS entry
in the firmware but the backend could vary from platform to platform.
Would that suit your use case?
Regards,
Sandrine
Hi Julius
> -----Original Message-----
> From: TF-A <tf-a-bounces(a)lists.trustedfirmware.org> On Behalf Of Julius
> Werner via TF-A
> Sent: 20 August 2019 02:15
>
> Hi Soby et al.,
>
> I'd like to implement a small new feature and ask some guidance for how to go
> about it: Chrome OS has the ability to automatically collect crash reports
> from runtime crashes in Trusted Firmware, and we would like to set up
> automated tests to ensure this feature stays working.
> In order to do this we need a way for the non-secure OS to intentionally
> trigger a panic in EL3. The obvious solution would be to implement a new SMC
> for that. (It's common for operating systems to have similar facilities, e.g.
> Linux can force a kernel panic by writing 'c' into /proc/sysrq-trigger.)
>
OK I can see the use of that, although I'd be a bit concerned about such a thing being available as a general service in case it gets used as an attack vector. For example, a test program could aggressively use this service to try to get the firmware to leak secure world information or something about its behaviour.
> My main question is: where should I get an SMC function ID for this?
> This is not a silicon or OEM specific feature, so the SiP Service Calls and
> OEM Service Calls ID ranges seem inappropriate (or do you think it would make
> sense to treat Google or Chrome OS as the "OEM"
> here, even though that's not quite accurate?).
I guess in theory you could mandate that all Chrome OS SiPs provide a specific function ID in their own specific SiP service, but I don't think that's the right solution here...
> There are ranges for Trusted
> Applications and the Trusted OS but unfortunately none for the normal world
> OS.
I don't think the TOS range is right either.
> Is this something that would make sense to allocate under Standard
> Service Calls? Could you just find an ID for me to use there or does
> everything in that range need a big specification document written by Arm?
>
For sure everything in the standard or architectural ranges requires specification by Arm, although this does not necessarily need to be big.
However, I think there might already be support for what you need. PSCI is part of the standard service and the function SYSTEM_RESET2 allows for both architectural and vendor-specific resets. The latter allows for vendor-specific semantics, which could include crashing the firmware as you suggest.
Chrome OS could specify what such a vendor-specific reset looks like and each Chromebook's platform PSCI hooks could be implemented accordingly.
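
As a rough sketch of what this could look like: in PSCI SYSTEM_RESET2, bit 31
of the reset_type argument distinguishes vendor-specific resets from
architectural ones. The CROS_RESET2_FORCE_PANIC value and the
classify_reset2() helper below are hypothetical, invented purely for
illustration; no such reset type has been defined by Chrome OS or Arm.

```c
#include <stdbool.h>
#include <stdint.h>

#define PSCI_SYSTEM_RESET2_SMC32       UINT32_C(0x84000012)
#define PSCI_RESET2_VENDOR_BIT         (UINT32_C(1) << 31)  /* 1 = vendor-specific */
#define PSCI_RESET2_SYSTEM_WARM_RESET  UINT32_C(0)          /* architectural */

/* HYPOTHETICAL: a "generic to Chrome OS" vendor reset type. */
#define CROS_RESET2_FORCE_PANIC        (PSCI_RESET2_VENDOR_BIT | UINT32_C(0x1))

enum reset2_action {
    RESET2_ACTION_WARM_RESET,
    RESET2_ACTION_FORCE_PANIC,
    RESET2_ACTION_INVALID,
};

/* True if the reset_type lies in the vendor-specific space. */
static inline bool psci_reset2_is_vendor(uint32_t reset_type)
{
    return (reset_type & PSCI_RESET2_VENDOR_BIT) != 0;
}

/* Decide what a platform hook would do for a given reset_type. */
static inline enum reset2_action classify_reset2(uint32_t reset_type)
{
    if (!psci_reset2_is_vendor(reset_type)) {
        return (reset_type == PSCI_RESET2_SYSTEM_WARM_RESET)
                   ? RESET2_ACTION_WARM_RESET
                   : RESET2_ACTION_INVALID;
    }
    return (reset_type == CROS_RESET2_FORCE_PANIC)
               ? RESET2_ACTION_FORCE_PANIC
               : RESET2_ACTION_INVALID;
}
```

A real platform hook would act on the result (reset or panic) rather than
return it; the classification is kept separate here so it can be exercised
in isolation.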
Alternatively, this could potentially be defined as an additional architectural reset. This would enable a generic implementation but would require approval/definition by Arm's Architecture team. Like me they might have concerns about this being defined at a generic architectural level.
Regards
Dan.
Julius,
On the subject of DebugFS's purpose: it was envisaged, and is today, as Sandrine describes, a debug-build-only capability. That said, there have been some early thoughts that it could evolve into a Secure Debug feature, where this type of capability (or something like it) is always on and requires debug certificates for authenticated access. This is very much a possible future evolution and is not in the patches available today. We would welcome any thoughts on such an evolution in this space.
Joanna
On 13/12/2019, 13:01, "TF-A on behalf of Sandrine Bailleux via TF-A" <tf-a-bounces(a)lists.trustedfirmware.org on behalf of tf-a(a)lists.trustedfirmware.org> wrote:
Hi Julius,
OK, in that case I can see that a solution based on TF-A's DebugFS
interface might not be desirable. Indeed, our original intention was to
make the whole DebugFS system a debug-only feature (hence its name!). As
such, I agree that it is likely not to get the same level of scrutiny
and testing as other features intended for production systems.
One of the main use cases we have in mind for DebugFS is being able to
peek and poke into the firmware for testing purposes. Today, when doing
functional testing from the normal world (for example, using TF-A
Tests), we are limited to what's exposed through the SMC interface. And
even then, we have limited visibility on what really happened in the
firmware, as we can only deduce so much from the SMC return value(s).
DebugFS could be used to bridge this gap, by providing a side channel
for getting internal firmware state information.
Going back to the SMC-based solution then, I am not quite convinced
SYSTEM_RESET2 is the right interface for intentionally triggering a
panic in TF-A. I think the semantics do not quite match. If anything, a
firmware crash seems more like a shutdown operation to me rather than a
reset (we don't recover from a firmware crash). I am not even sure we
should look into the PSCI SMC range, as it's not a power-management
operation.
Julius, you wrote:
> It's the same problem that the SMC/PSCI spec and the TF repository layout is only designed to deal with generic vs. SoC-vendor-specific differentiation. If the normal world OS needs a feature, we can only make it generic or duplicate it across all vendors running that OS.
So it sounds like it's not the first time that you hit this issue, is
it? Do you have any other example of Normal World OS feature you would
have liked to expose through a generic SMC interface? I am wondering
whether this could help choosing the right SMC range, if we can identify
some common criteria among a set of such features.
Regards,
Sandrine
Hi Sandrine,
Yes, I think using debugfs on the kernel side to control this feature
(and other firmware control/debug stuff) is perfectly fine and as
good a solution as any other. I am not quite sure about pulling the
whole 9p interface into Trusted Firmware, though... it feels a bit
"heavy" for a firmware use case (e.g. the 9p layer alone is a thousand
lines of code). I'm not sure I see the benefit over using the same
debugfs interface on the kernel side but backing it by a kernel driver
that translates the file accesses into a simpler, record-based SMC
interface (that could also avoid the shared memory requirement, at
least for simple requests). Maybe this depends on what else you're
planning to do with this interface.
It also seems that you're intending for this to only be used for
developer builds and never make it into a production image (e.g. it's
printing a warning when not building with DEBUG=1). In that case, I
can see that code size and complexity may not be a big concern.
However, for the use case I had in mind I'd need this to be enabled in
production images (Chrome OS doesn't really distinguish between test
and production images for automated testing), so I'm looking more for
a lightweight API where each command can be enabled/disabled
individually at compile time.
I'd have to look at and test it in more detail when it's done, but
given that it will likely increase code size by quite a bit and that
I'm not sure how much I can trust it to be secure for production, I
don't think this would work for my use case.
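
To make the "each command can be enabled/disabled individually at compile
time" idea concrete, here is a minimal sketch. The flag names, command IDs
and return codes are all hypothetical, and the panic itself is only recorded
in a variable so that the example can run outside firmware.

```c
#include <stdint.h>

/* HYPOTHETICAL per-command build flags; a real build system would set these. */
#ifndef DEBUG_CMD_FORCE_PANIC
#define DEBUG_CMD_FORCE_PANIC 1   /* enabled in this sketch */
#endif
#ifndef DEBUG_CMD_READ_STATE
#define DEBUG_CMD_READ_STATE  0   /* compiled out in this sketch */
#endif

#define DBG_CMD_ID_FORCE_PANIC UINT32_C(1)
#define DBG_CMD_ID_READ_STATE  UINT32_C(2)

#define DBG_RET_OK       0
#define DBG_RET_UNKNOWN  (-1)   /* same idea as an "unknown function" SMC result */

#if DEBUG_CMD_FORCE_PANIC
int panic_requested;            /* stand-in for actually calling panic() */
#endif
#if DEBUG_CMD_READ_STATE
int state_reads;                /* stand-in for returning internal state */
#endif

/* Dispatch one command; disabled commands contribute no code at all. */
int debug_cmd_dispatch(uint32_t cmd)
{
    switch (cmd) {
#if DEBUG_CMD_FORCE_PANIC
    case DBG_CMD_ID_FORCE_PANIC:
        panic_requested = 1;    /* firmware would not return from here */
        return DBG_RET_OK;
#endif
#if DEBUG_CMD_READ_STATE
    case DBG_CMD_ID_READ_STATE:
        state_reads++;
        return DBG_RET_OK;
#endif
    default:
        /* Unknown and compiled-out commands look identical to the caller. */
        return DBG_RET_UNKNOWN;
    }
}
```

With the defaults above, command 1 is accepted while command 2 is rejected
exactly like an ID that was never defined, so a production image only
carries the handlers it explicitly opted into.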
Hi Dan,
Whoops, sorry, this fell through the cracks for me since I wasn't on
the to: line. Thanks for your response!
> OK I can see the use of that, although I'd be a bit concerned about such a thing being available as a general service in case it gets used as an attack vector. For example, a test program could aggressively use this service to try to get the firmware to leak secure world information or something about its behaviour.
Yes, of course, we can gate this with a build option so it would only
be available where desired.
> However, I think there might already be support for what you need. PSCI is part of the standard service and the function SYSTEM_RESET2 allows for both architectural and vendor-specific resets. The latter allows for vendor-specific semantics, which could include crashing the firmware as you suggest.
>
> Chrome OS could specify what such a vendor-specific reset looks like and each Chromebook's platform PSCI hooks could be implemented accordingly.
Right, but defining a separate vendor-specific reset type for each
platform is roughly the same as defining a separate SiP SMC for each
of them. It's the same problem that the SMC/PSCI spec and the TF
repository layout is only designed to deal with generic vs.
SoC-vendor-specific differentiation. If the normal world OS needs a
feature, we can only make it generic or duplicate it across all
vendors running that OS.
> Alternatively, this could potentially be defined as an additional architectural reset. This would enable a generic implementation but would require approval/definition by Arm's Architecture team. Like me they might have concerns about this being defined at a generic architectural level.
Yes, I think that would be the best option. Could you kick off that
process with the Architecture team? Or tell me who I should talk to
about this?
Thanks,
Julius
On Fri, Dec 13, 2019 at 6:20 AM Joanna Farley <Joanna.Farley(a)arm.com> wrote:
> On the subject of DebugFS's purpose: it was envisaged, and is today, as Sandrine describes, a debug-build-only capability. That said, there have been some early thoughts that it could evolve into a Secure Debug feature, where this type of capability (or something like it) is always on and requires debug certificates for authenticated access. This is very much a possible future evolution and is not in the patches available today. We would welcome any thoughts on such an evolution in this space.
I guess this gets into a bit of a philosophy discussion and becomes a
matter of opinion, so there's probably no one right answer.
Personally, adding authentication on top of this doesn't really
resolve my concerns and adds yet more on top. I'm a strong proponent
of the concept of a minimal Trusted Computing Base, i.e. keeping the
amount of code executing at the highest privilege level as small and
low-complexity as possible. Any code can have bugs, so the idea is
that the more complicated the code you run in EL3 is (and the more
complicated APIs it exposes), the more likely it becomes that you
accidentally have an exploitable vulnerability in there. Like a 9p
filesystem driver, a certificate-based authentication system
(especially if it's based on x509/ASN.1 which are notoriously hard to
implement safely) is a pretty complex piece of code with a pretty
large attack surface that I'd rather not have in my EL3 firmware if I
can avoid it. I understand that for certain use cases you may need
something like this (if you really want a very extensive and
extensible debugging API that must be restricted to a few
authenticated actors), but in my use case I really just need the
ability to trigger one small debugging feature and that feature itself
doesn't need to be restricted, so a minimal SMC interface would work
much better for that case.
> On 13/12/2019, 13:01, "TF-A on behalf of Sandrine Bailleux via TF-A" <tf-a-bounces(a)lists.trustedfirmware.org on behalf of tf-a(a)lists.trustedfirmware.org> wrote:
> Going back to the SMC-based solution then, I am not quite convinced
> SYSTEM_RESET2 is the right interface for intentionally triggering a
> panic in TF-A. I think the semantics do not quite match. If anything, a
> firmware crash seems more like a shutdown operation to me rather than a
> reset (we don't recover from a firmware crash). I am not even sure we
> should look into the PSCI SMC range, as it's not a power-management
> operation.
Crash recovery behavior is platform dependent (via
plat_panic_handler()). On all the platforms we use in Chrome OS we
have that implemented as a system reboot. I think for most systems
(whether it's a Chromebook, a server or some embedded device) that's
probably what you want for random runtime crashes (at least in a
production environment), but I agree that TF doesn't enforce any
standard behavior so it's hard to clearly match it to one or the other
SMC.
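
The platform dependence can be pictured with the small model below. This is
a simulation only: in TF-A a platform supplies its own plat_panic_handler()
to replace the default one, whereas here the override is modelled as a
function pointer so that the outcome can be observed instead of actually
spinning or rebooting. All type and variable names are made up for the
example.

```c
#include <stddef.h>

/* What happens to the system after a firmware panic, in this model. */
enum panic_outcome {
    PANIC_OUTCOME_SPIN,           /* hang until someone intervenes */
    PANIC_OUTCOME_SYSTEM_REBOOT,  /* e.g. trigger a watchdog/system reset */
};

typedef enum panic_outcome (*panic_handler_fn)(void);

/* Model of a default handler: the core simply stops making progress. */
static enum panic_outcome default_panic_handler(void)
{
    return PANIC_OUTCOME_SPIN;
}

/* Model of a platform override that reboots, as on the Chrome OS platforms. */
static enum panic_outcome reboot_panic_handler(void)
{
    return PANIC_OUTCOME_SYSTEM_REBOOT;
}

struct platform {
    const char *name;
    panic_handler_fn panic_handler;   /* NULL means "use the default" */
};

/* HYPOTHETICAL example platforms. */
const struct platform generic_platform = { "generic", NULL };
const struct platform rebooting_platform = { "rebooting", reboot_panic_handler };

/* Resolve and run the panic handler for a platform. */
enum panic_outcome do_panic(const struct platform *plat)
{
    panic_handler_fn fn =
        (plat->panic_handler != NULL) ? plat->panic_handler
                                      : default_panic_handler;
    return fn();
}
```

Because the outcome differs per platform, a caller of a "force panic" SMC
cannot assume whether the system will come back, which is why the operation
does not map cleanly onto either a reset or a shutdown call.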
> So it sounds like it's not the first time that you hit this issue, is
> it? Do you have any other example of Normal World OS feature you would
> have liked to expose through a generic SMC interface? I am wondering
> whether this could help choosing the right SMC range, if we can identify
> some common criteria among a set of such features.
No, it's the first time I've really run into this. But I think we
might quickly come up with more uses for a "non-secure OS" SMC range
if we had one. We often see roughly the same SMC again on different
platforms, because fundamentally they usually need to do the same
kinds of things -- for example, most platforms have some kind of DDR
frequency scaling which always needs part of it implemented in EL3, so
they all need some kind of SMC to switch to a new DDR frequency. Many
also need some kind of "write value to secure register" SMC that just
allows the non-secure OS to write a few whitelisted registers that are
only accessible in EL3 for some reason. If we could standardize these
interfaces in a non-vendor-specific SMC range, we might be able to
reduce some code duplication both on the TF and the Linux side.
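
A minimal sketch of the "write value to secure register" pattern follows.
The register file is a plain array standing in for EL3-only registers, and
the allow-list contents, return codes and function name are hypothetical; a
real implementation would take the index and value from the SMC arguments.

```c
#include <stddef.h>
#include <stdint.h>

#define SEC_REG_COUNT 8

/* Stand-in for registers that are only accessible from EL3. */
uint32_t fake_secure_regs[SEC_REG_COUNT];

/* HYPOTHETICAL allow-list: only these register indices may be written
 * on behalf of the non-secure OS. Every entry must be < SEC_REG_COUNT. */
static const uint32_t writable_regs[] = { 2, 5 };

#define REG_WRITE_OK      0
#define REG_WRITE_DENIED  (-1)

/* Write value to register index only if index is on the allow-list. */
int secure_reg_write(uint32_t index, uint32_t value)
{
    size_t n = sizeof(writable_regs) / sizeof(writable_regs[0]);

    for (size_t i = 0; i < n; i++) {
        if (writable_regs[i] == index) {
            fake_secure_regs[index] = value;
            return REG_WRITE_OK;
        }
    }
    /* Anything not listed, including out-of-range indices, is refused. */
    return REG_WRITE_DENIED;
}
```

Checking against an explicit list, rather than an address range, keeps the
exposed surface limited to exactly the registers a platform has chosen to
expose.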
I guess none of these things are really Linux-specific, now that I
think of it. So really, I guess the problem is that it would be great
to have a range of "generic" SMC IDs that can be easily and
unbureaucratically allocated to try out new features, without having
to ask Arm to write a big specification document about it every time.
It's sort of a development velocity issue.
On 13/12/2019 22:04, Julius Werner via TF-A wrote:
> On Fri, Dec 13, 2019 at 6:20 AM Joanna Farley <Joanna.Farley(a)arm.com> wrote:
>> On the subject of DebugFS's purpose: it was envisaged, and is today, as Sandrine describes, a debug-build-only capability. That said, there have been some early thoughts that it could evolve into a Secure Debug feature, where this type of capability (or something like it) is always on and requires debug certificates for authenticated access. This is very much a possible future evolution and is not in the patches available today. We would welcome any thoughts on such an evolution in this space.
>
> I guess this gets into a bit of a philosophy discussion and becomes a
> matter of opinion, so there's probably no one right answer.
> Personally, adding authentication on top of this doesn't really
> resolve my concerns and adds yet more on top. I'm a strong proponent
> of the concept of a minimal Trusted Computing Base, i.e. keeping the
> amount of code executing at the highest privilege level as small and
> low-complexity as possible. Any code can have bugs, so the idea is
> that the more complicated the code you run in EL3 is (and the more
> complicated APIs it exposes), the more likely it becomes that you
> accidentally have an exploitable vulnerability in there. Like a 9p
> filesystem driver, a certificate-based authentication system
> (especially if it's based on x509/ASN.1 which are notoriously hard to
> implement safely) is a pretty complex piece of code with a pretty
> large attack surface that I'd rather not have in my EL3 firmware if I
> can avoid it. I understand that for certain use cases you may need
> something like this (if you really want a very extensive and
> extensible debugging API that must be restricted to a few
> authenticated actors), but in my use case I really just need the
> ability to trigger one small debugging feature and that feature itself
> doesn't need to be restricted, so a minimal SMC interface would work
> much better for that case.
Hi Julius,
Just trying to understand: if TF-A were to expose a crash-inducing
SMC, would this still be restricted to special builds for your test runs?
It would not make it into production for Chromebooks, right?
I agree that a 9p filesystem is not desirable in an EL3 runtime firmware. We
could enhance it to use a tighter data structure, if there is a
desire in that direction.
If that is the case, leaving aside the 9p filesystem issues, can
DebugFS serve this requirement (we can remove the limitation that it is
restricted to only Debug builds)?
The intention is that DebugFS can prove useful at least in the
verification/testing space, and if there is more we can do to get there,
it would be good to know.
>
>> On 13/12/2019, 13:01, "TF-A on behalf of Sandrine Bailleux via TF-A" <tf-a-bounces(a)lists.trustedfirmware.org on behalf of tf-a(a)lists.trustedfirmware.org> wrote:
>> Going back to the SMC-based solution then, I am not quite convinced
>> SYSTEM_RESET2 is the right interface for intentionally triggering a
>> panic in TF-A. I think the semantics do not quite match. If anything, a
>> firmware crash seems more like a shutdown operation to me rather than a
>> reset (we don't recover from a firmware crash). I am not even sure we
>> should look into the PSCI SMC range, as it's not a power-management
>> operation.
>
> Crash recovery behavior is platform dependent (via
> plat_panic_handler()). On all the platforms we use in Chrome OS we
> have that implemented as a system reboot. I think for most systems
> (whether it's a Chromebook, a server or some embedded device) that's
> probably what you want for random runtime crashes (at least in a
> production environment), but I agree that TF doesn't enforce any
> standard behavior so it's hard to clearly match it to one or the other
> SMC.
>
>> So it sounds like it's not the first time that you hit this issue, is
>> it? Do you have any other example of Normal World OS feature you would
>> have liked to expose through a generic SMC interface? I am wondering
>> whether this could help choosing the right SMC range, if we can identify
>> some common criteria among a set of such features.
>
> No, it's the first time I've really run into this. But I think we
> might quickly come up with more uses for a "non-secure OS" SMC range
> if we had one. We often see roughly the same SMC again on different
> platforms, because fundamentally they usually need to do the same
> kinds of things -- for example, most platforms have some kind of DDR
> frequency scaling which always needs part of it implemented in EL3, so
> they all need some kind of SMC to switch to a new DDR frequency. Many
> also need some kind of "write value to secure register" SMC that just
> allows the non-secure OS to write a few whitelisted registers that are
> only accessible in EL3 for some reason. If we could standardize these
> interfaces in a non-vendor-specific SMC range, we might be able to
> reduce some code duplication both on the TF and the Linux side.
>
> I guess none of these things are really Linux-specific, now that I
> think of it. So really, I guess the problem is that it would be great
> to have a range of "generic" SMC IDs that can be easily and
> unbureaucratically allocated to try out new features, without having
> to ask Arm to write a big specification document about it every time.
> It's sort of a development velocity issue.
>
We have utilized the ARM SiP range for some "generic" purposes in the
past (see PMF and the execution state switch SMCs). This could be a
direction for some of the use cases. But if the SMCs are meant to be
truly generic and to be relied on by generic normal world
software components, they would need to be properly specified, I would think.
For dynamically modifying some EL3 registers, it would be good to get
these requirements out. Perhaps there is scope for architecting some of
them as an ARM specification. If not, we could revert to a TF-A standard
if there is enough pull for them (perhaps utilizing the ARM SiP range).
Best Regards
Soby Mathew