On Wed, 18 Dec 2024 at 19:04, Jens Wiklander jens.wiklander@linaro.org wrote:
On Wed, Dec 18, 2024 at 12:07 PM Sumit Garg sumit.garg@linaro.org wrote:
On Wed, 18 Dec 2024 at 16:02, Jens Wiklander jens.wiklander@linaro.org wrote:
On Wed, Dec 18, 2024 at 8:30 AM Sumit Garg sumit.garg@linaro.org wrote:
- Erik
Hi Jens,
On Wed, 18 Dec 2024 at 12:27, Jens Wiklander jens.wiklander@linaro.org wrote:
Hi Sumit,
On Fri, Dec 13, 2024 at 12:15 PM Sumit Garg sumit.garg@linaro.org wrote:
OP-TEE supplicant is a user-space daemon and it's possible for it being crashed or killed in the middle of processing an OP-TEE RPC call. It becomes more complicated when there is incorrect shutdown ordering of the supplicant process vs the OP-TEE client application which can eventually lead to system hang-up waiting for the closure of the client application.
In order to gracefully handle this scenario, let's add a long enough timeout to wait for supplicant to process requests. In case there is a timeout then we return a proper error code for the RPC request.
Signed-off-by: Sumit Garg sumit.garg@linaro.org
drivers/tee/optee/supp.c | 58 +++++++++++++++++++++++++--------------- 1 file changed, 36 insertions(+), 22 deletions(-)
diff --git a/drivers/tee/optee/supp.c b/drivers/tee/optee/supp.c index 322a543b8c27..92e86ac4cdd4 100644 --- a/drivers/tee/optee/supp.c +++ b/drivers/tee/optee/supp.c @@ -7,6 +7,15 @@ #include <linux/uaccess.h> #include "optee_private.h"
+/*
- OP-TEE supplicant timeout, the user-space supplicant may get
- crashed or killed while servicing an RPC call. This will just lead
- to OP-TEE client hung indefinitely just waiting for supplicant to
- serve requests which isn't expected. It is rather expected to fail
- gracefully with a timeout which is long enough.
- */
+#define SUPP_TIMEOUT (msecs_to_jiffies(10000))
struct optee_supp_req { struct list_head link;
@@ -52,8 +61,10 @@ void optee_supp_release(struct optee_supp *supp)
/* Abort all queued requests */ list_for_each_entry_safe(req, req_tmp, &supp->reqs, link) {
list_del(&req->link);
req->in_queue = false;
if (req->in_queue) {
list_del(&req->link);
req->in_queue = false;
} req->ret = TEEC_ERROR_COMMUNICATION; complete(&req->c); }
@@ -82,6 +93,7 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params, struct optee_supp_req *req; bool interruptable; u32 ret;
int res = 1; /* * Return in case there is no supplicant available and
@@ -108,28 +120,28 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params, /* Tell an eventual waiter there's a new request */ complete(&supp->reqs_c);
/*
* Wait for supplicant to process and return result, once we've
* returned from wait_for_completion(&req->c) successfully we have
* exclusive access again.
*/
while (wait_for_completion_interruptible(&req->c)) {
/* Wait for supplicant to process and return result */
while (res) {
res = wait_for_completion_interruptible_timeout(&req->c,
SUPP_TIMEOUT);
/* Check if supplicant served the request */
if (res > 0)
break;
mutex_lock(&supp->mutex);
/*
* There's no supplicant available and since the supp->mutex
* currently is held none can become available until the mutex
* released again.
*
* Interrupting an RPC to supplicant is only allowed as a way
* of slightly improving the user experience in case the
* supplicant hasn't been started yet. During normal operation
* the supplicant will serve all requests in a timely manner and
* interrupting then wouldn't make sense.
*/ interruptable = !supp->ctx;
if (interruptable) {
/*
* There's no supplicant available and since the
* supp->mutex currently is held none can
* become available until the mutex released
* again.
*
* Interrupting an RPC to supplicant is only
* allowed as a way of slightly improving the user
* experience in case the supplicant hasn't been
* started yet. During normal operation the supplicant
* will serve all requests in a timely manner and
* interrupting then wouldn't make sense.
*/
if (interruptable || (res == 0)) {
Are you fixing an observed problem or a theoretical one?
It is an observed problem, I was able to reproduce it using following sequence with OP-TEE buildroot setup:
$ xtest 600 & // Run some secure storage tests using supplicant in the background $ kill -9 `pidof tee-supplicant` // Kill supplicant when the tests are in progress.
This will cause the xtest to hang up.
If the supplicant has died then "interruptable" is expected to be true so the timeout shouldn't matter.
When the supplicant dies, it doesn't lead to releasing the supplicant context in the above test scenario. The reason is probably the supplicant shared memory reference is held by OP-TEE which is in turn is holding a reference to supplicant context.
This sounds like the problem Amirreza is trying to solve for the QCOMTEE driver. If we could get the supplicant context removed soon after the supplicant has died we wouldn't need this, except that we may need some trick to avoid ignoring an eventual signal received while tee-supplicant is dying.
That would be an improvement but it may still get unnoticed in future once something else starts ref counting the supplicant context.
That depends on the implementation. We were discussing adding a callback when the file descriptor is closed.
Wait, would it work to break the loop on SIGKILL? It's an uncatchable signal so there's no reason for the calling process to wait anyway.
I agree this can be one way to solve the issue when the supplicant gets killed but what if the supplicant gets crashed then it will be another signal to handle. This approach sounds error prone to me as we might miss a corner case.
So the question here is why do we need an infinite wait loop for the supplicant which breaks only if we receive events from the user-space? Isn't it rather robust for the kernel to have a bounded supplicant wait loop? Do you have any particular use-case where this bounded wait loop won't suffice?
It will make it tricky to debug tee-supplicant with GDB.
Debugging with GDB is always a bit tricky when you are debugging on the kernel/user-space boundary.
Is there any risk of timeout during suspend?
I don't think so since this timeout is per supplicant request and I can't think of a scenario where the system suspend occurs without all CPUs going into idle state first. That means there won't be any supplicant request running at that point.
-Sumit
Cheers, Jens
-Sumit
Cheers, Jens
-Sumit
Cheers, Jens
if (req->in_queue) { list_del(&req->link); req->in_queue = false;
@@ -141,6 +153,8 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params, req->ret = TEEC_ERROR_COMMUNICATION; break; }
if (res == 0)
req->ret = TEE_ERROR_TIMEOUT; } ret = req->ret;
-- 2.43.0