The OP-TEE supplicant is a user-space daemon, and it is possible for it to crash or be killed in the middle of processing an OP-TEE RPC call. The situation becomes more complicated when the shutdown ordering of the supplicant process vs. the OP-TEE client application is incorrect, which can eventually leave the system hung waiting for the client application to close.
To handle this scenario gracefully, add a sufficiently long timeout while waiting for the supplicant to process requests. If the timeout expires, return a proper error code for the RPC request.
Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
---
 drivers/tee/optee/supp.c | 58 +++++++++++++++++++++++++---------------
 1 file changed, 36 insertions(+), 22 deletions(-)
diff --git a/drivers/tee/optee/supp.c b/drivers/tee/optee/supp.c
index 322a543b8c27..92e86ac4cdd4 100644
--- a/drivers/tee/optee/supp.c
+++ b/drivers/tee/optee/supp.c
@@ -7,6 +7,15 @@
 #include <linux/uaccess.h>
 #include "optee_private.h"
 
+/*
+ * OP-TEE supplicant timeout, the user-space supplicant may get
+ * crashed or killed while servicing an RPC call. This will just lead
+ * to OP-TEE client hung indefinitely just waiting for supplicant to
+ * serve requests which isn't expected. It is rather expected to fail
+ * gracefully with a timeout which is long enough.
+ */
+#define SUPP_TIMEOUT	(msecs_to_jiffies(10000))
+
 struct optee_supp_req {
 	struct list_head link;
 
@@ -52,8 +61,10 @@ void optee_supp_release(struct optee_supp *supp)
 
 	/* Abort all queued requests */
 	list_for_each_entry_safe(req, req_tmp, &supp->reqs, link) {
-		list_del(&req->link);
-		req->in_queue = false;
+		if (req->in_queue) {
+			list_del(&req->link);
+			req->in_queue = false;
+		}
 		req->ret = TEEC_ERROR_COMMUNICATION;
 		complete(&req->c);
 	}
@@ -82,6 +93,7 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params,
 	struct optee_supp_req *req;
 	bool interruptable;
 	u32 ret;
+	int res = 1;
 
 	/*
 	 * Return in case there is no supplicant available and
@@ -108,28 +120,28 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params,
 	/* Tell an eventual waiter there's a new request */
 	complete(&supp->reqs_c);
 
-	/*
-	 * Wait for supplicant to process and return result, once we've
-	 * returned from wait_for_completion(&req->c) successfully we have
-	 * exclusive access again.
-	 */
-	while (wait_for_completion_interruptible(&req->c)) {
+	/* Wait for supplicant to process and return result */
+	while (res) {
+		res = wait_for_completion_interruptible_timeout(&req->c,
+								SUPP_TIMEOUT);
+		/* Check if supplicant served the request */
+		if (res > 0)
+			break;
+
 		mutex_lock(&supp->mutex);
+		/*
+		 * There's no supplicant available and since the supp->mutex
+		 * currently is held none can become available until the mutex
+		 * released again.
+		 *
+		 * Interrupting an RPC to supplicant is only allowed as a way
+		 * of slightly improving the user experience in case the
+		 * supplicant hasn't been started yet. During normal operation
+		 * the supplicant will serve all requests in a timely manner and
+		 * interrupting then wouldn't make sense.
+		 */
 		interruptable = !supp->ctx;
-		if (interruptable) {
-			/*
-			 * There's no supplicant available and since the
-			 * supp->mutex currently is held none can
-			 * become available until the mutex released
-			 * again.
-			 *
-			 * Interrupting an RPC to supplicant is only
-			 * allowed as a way of slightly improving the user
-			 * experience in case the supplicant hasn't been
-			 * started yet. During normal operation the supplicant
-			 * will serve all requests in a timely manner and
-			 * interrupting then wouldn't make sense.
-			 */
+		if (interruptable || (res == 0)) {
 			if (req->in_queue) {
 				list_del(&req->link);
 				req->in_queue = false;
@@ -141,6 +153,8 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params,
 			req->ret = TEEC_ERROR_COMMUNICATION;
 			break;
 		}
+		if (res == 0)
+			req->ret = TEE_ERROR_TIMEOUT;
 	}
 
 	ret = req->ret;
Hi Sumit,
On Fri, Dec 13, 2024 at 12:15 PM Sumit Garg <sumit.garg@linaro.org> wrote:
> [patch snipped]
>
> +		if (interruptable || (res == 0)) {
Are you fixing an observed problem or a theoretical one?

If the supplicant has died then "interruptable" is expected to be true,
so the timeout shouldn't matter.
Cheers, Jens
+ Erik
Hi Jens,
On Wed, 18 Dec 2024 at 12:27, Jens Wiklander <jens.wiklander@linaro.org> wrote:
> > [patch snipped]
> >
> > +		if (interruptable || (res == 0)) {
>
> Are you fixing an observed problem or a theoretical one?
It is an observed problem; I was able to reproduce it using the
following sequence with the OP-TEE buildroot setup:

$ xtest 600 &                     // Run some secure storage tests using the supplicant in the background
$ kill -9 `pidof tee-supplicant`  // Kill the supplicant while the tests are in progress

This will cause xtest to hang.
> If the supplicant has died then "interruptable" is expected to be
> true, so the timeout shouldn't matter.
When the supplicant dies in the above test scenario, the supplicant context isn't released. The reason is probably that OP-TEE holds a reference to the supplicant's shared memory, which in turn holds a reference to the supplicant context.
-Sumit
On Wed, Dec 18, 2024 at 8:30 AM Sumit Garg <sumit.garg@linaro.org> wrote:
> > > [patch snipped]
> >
> > If the supplicant has died then "interruptable" is expected to be
> > true, so the timeout shouldn't matter.
>
> When the supplicant dies in the above test scenario, the supplicant
> context isn't released. The reason is probably that OP-TEE holds a
> reference to the supplicant's shared memory, which in turn holds a
> reference to the supplicant context.
This sounds like the problem Amirreza is trying to solve for the QCOMTEE driver. If we could get the supplicant context removed soon after the supplicant has died we wouldn't need this, except that we may need some trick to avoid ignoring an eventual signal received while tee-supplicant is dying.
Wait, would it work to break the loop on SIGKILL? It's an uncatchable signal so there's no reason for the calling process to wait anyway.
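Something along these lines, as an untested sketch on top of the
current code (ignoring for the moment the case where the supplicant has
already dequeued the request):

	/*
	 * Sketch: sleep in TASK_KILLABLE state so that only fatal
	 * signals (SIGKILL) interrupt the wait, instead of any signal
	 * as with wait_for_completion_interruptible().
	 */
	if (wait_for_completion_killable(&req->c)) {
		/* The caller got a fatal signal, stop waiting. */
		mutex_lock(&supp->mutex);
		if (req->in_queue) {
			list_del(&req->link);
			req->in_queue = false;
		}
		mutex_unlock(&supp->mutex);
		req->ret = TEEC_ERROR_COMMUNICATION;
	}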
Cheers, Jens
On Wed, 18 Dec 2024 at 16:02, Jens Wiklander <jens.wiklander@linaro.org> wrote:
> > When the supplicant dies in the above test scenario, the supplicant
> > context isn't released. The reason is probably that OP-TEE holds a
> > reference to the supplicant's shared memory, which in turn holds a
> > reference to the supplicant context.
>
> This sounds like the problem Amirreza is trying to solve for the
> QCOMTEE driver. If we could get the supplicant context removed soon
> after the supplicant has died we wouldn't need this, except that we
> may need some trick to avoid ignoring an eventual signal received
> while tee-supplicant is dying.
That would be an improvement but it may still go unnoticed in the future once something else starts ref counting the supplicant context.
> Wait, would it work to break the loop on SIGKILL? It's an uncatchable
> signal so there's no reason for the calling process to wait anyway.
I agree this can be one way to solve the issue when the supplicant gets killed, but what if the supplicant crashes? Then it will be another signal to handle. This approach sounds error-prone to me as we might miss a corner case.
So the question here is: why do we need an infinite wait loop for the supplicant which breaks only if we receive events from user-space? Isn't it more robust for the kernel to have a bounded supplicant wait loop? Do you have any particular use case where this bounded wait loop won't suffice?
-Sumit
On Wed, Dec 18, 2024 at 12:07 PM Sumit Garg <sumit.garg@linaro.org> wrote:
> > This sounds like the problem Amirreza is trying to solve for the
> > QCOMTEE driver. If we could get the supplicant context removed soon
> > after the supplicant has died we wouldn't need this, except that we
> > may need some trick to avoid ignoring an eventual signal received
> > while tee-supplicant is dying.
>
> That would be an improvement but it may still go unnoticed in the
> future once something else starts ref counting the supplicant context.
That depends on the implementation. We were discussing adding a callback when the file descriptor is closed.
> > Wait, would it work to break the loop on SIGKILL? It's an
> > uncatchable signal so there's no reason for the calling process to
> > wait anyway.
>
> I agree this can be one way to solve the issue when the supplicant
> gets killed, but what if the supplicant crashes? Then it will be
> another signal to handle. This approach sounds error-prone to me as we
> might miss a corner case.
>
> So the question here is: why do we need an infinite wait loop for the
> supplicant which breaks only if we receive events from user-space?
> Isn't it more robust for the kernel to have a bounded supplicant wait
> loop? Do you have any particular use case where this bounded wait loop
> won't suffice?
It will make it tricky to debug tee-supplicant with GDB. Is there any risk of timeout during suspend?
Cheers, Jens
On Wed, 18 Dec 2024 at 19:04, Jens Wiklander <jens.wiklander@linaro.org> wrote:
> > So the question here is: why do we need an infinite wait loop for
> > the supplicant which breaks only if we receive events from
> > user-space? Isn't it more robust for the kernel to have a bounded
> > supplicant wait loop? Do you have any particular use case where this
> > bounded wait loop won't suffice?
>
> It will make it tricky to debug tee-supplicant with GDB.
Debugging with GDB is always a bit tricky when you are debugging on the kernel/user-space boundary.
> Is there any risk of timeout during suspend?
I don't think so, since this timeout is per supplicant request, and I can't think of a scenario where system suspend occurs without all CPUs going into an idle state first. That means there won't be any supplicant request running at that point.
-Sumit
Hi Jens,
On Fri, 13 Dec 2024 at 16:45, Sumit Garg <sumit.garg@linaro.org> wrote:
> [patch snipped]
Do you have any further comments here? Or is it fine for you to pick it up?
-Sumit
On Mon, Jan 20, 2025 at 10:24 AM Sumit Garg <sumit.garg@linaro.org> wrote:
> > [patch snipped]
>
> Do you have any further comments here? Or is it fine for you to pick
> it up?
I don't like the timeout, it's a bit too hacky.
Cheers, Jens
On Mon, 20 Jan 2025 at 19:01, Jens Wiklander <jens.wiklander@linaro.org> wrote:
> > Do you have any further comments here? Or is it fine for you to pick
> > it up?
>
> I don't like the timeout, it's a bit too hacky.
Can you please elaborate on why? Currently the kernel relies too heavily on the tee-supplicant daemon to respond properly, in an unbounded loop. I think this is really a bug for devices in production, in scenarios where tee-supplicant gets stuck for some reason, crashes in the middle, or gets killed. These can simply lead to system hang-up issues during shutdown, requiring a hard power-off or reset, which isn't expected from a sane system. So I rather consider the current unbounded wait loop for tee-supplicant a bug requiring a fix to be backported.
So do you have a better suggestion to fix this in the mainline, one that can also be backported to stable releases?
-Sumit
On Wed, Jan 22, 2025 at 10:15 AM Sumit Garg <sumit.garg@linaro.org> wrote:
> > I don't like the timeout, it's a bit too hacky.
>
> Can you please elaborate on why?
Tee-supplicant is supposed to respond in a timely manner. What counts as timely depends on the use case, but you now want to say that it's 10 seconds regardless of what's going on. This makes it impossible to debug tee-supplicant with a debugger unless you're very quick. It might also introduce random timeouts in a system under heavy IO load.
> Currently the kernel relies too heavily on the tee-supplicant daemon
> to respond properly, in an unbounded loop. I think this is really a
> bug for devices in production, in scenarios where tee-supplicant gets
> stuck for some reason, crashes in the middle, or gets killed.
Tee-supplicant is a system daemon; the system depends heavily on it. It should not be killable by unprivileged processes. What happens if init crashes or is killed?
> These can simply lead to system hang-up issues during shutdown,
> requiring a hard power-off or reset, which isn't expected from a sane
> system.
This is new information. Is the use case to avoid blocking for too long during shutdown or reset? We may need to do something to ensure that the tee-supplicant isn't killed before OP-TEE is done with it. Enforcing this timeout during shutdown makes sense, but not during normal operation.
> So I rather consider the current unbounded wait loop for
> tee-supplicant a bug requiring a fix to be backported.
Yeah, but only during certain circumstances.
> So do you have a better suggestion to fix this in the mainline, one
> that can also be backported to stable releases?
Let's start by finding out what problem you're trying to fix.
Cheers, Jens
On Wed, 22 Jan 2025 at 15:36, Jens Wiklander <jens.wiklander@linaro.org> wrote:
> > Can you please elaborate on why?
>
> Tee-supplicant is supposed to respond in a timely manner. What counts
> as timely depends on the use case, but you now want to say that it's
> 10 seconds regardless of what's going on. This makes it impossible to
> debug tee-supplicant with a debugger unless you're very quick. It
> might also introduce random timeouts in a system under heavy IO load.
Although I initially thought 10 seconds should be enough for any user-space process to be considered hung, I then saw DEFAULT_HUNG_TASK_TIMEOUT, which is 120 seconds for a task to be considered hung. How about a Kconfig option like OPTEE_SUPPLICANT_HUNG_TIMEOUT instead, defaulting to 120 seconds? It could be set to 0 to disable the timeout entirely for debugging purposes.
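Roughly like the following untested sketch, replacing the fixed
SUPP_TIMEOUT define from the patch (OPTEE_SUPPLICANT_HUNG_TIMEOUT is
just a proposed name at this point):

/*
 * Proposed: supplicant hung timeout in seconds, taken from a new
 * Kconfig option. A value of 0 disables the timeout by falling back
 * to an effectively unbounded wait.
 */
#if CONFIG_OPTEE_SUPPLICANT_HUNG_TIMEOUT != 0
#define SUPP_TIMEOUT	\
	msecs_to_jiffies(CONFIG_OPTEE_SUPPLICANT_HUNG_TIMEOUT * MSEC_PER_SEC)
#else
#define SUPP_TIMEOUT	MAX_SCHEDULE_TIMEOUT
#endif

wait_for_completion_interruptible_timeout() treats MAX_SCHEDULE_TIMEOUT
as "wait forever", so the loop wouldn't need a special case for 0
(though res would have to become a long then, since
MAX_SCHEDULE_TIMEOUT is LONG_MAX).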
> > Currently the kernel relies too heavily on the tee-supplicant
> > daemon to respond properly, in an unbounded loop. [...]
>
> Tee-supplicant is a system daemon; the system depends heavily on it.
> It should not be killable by unprivileged processes. What happens if
> init crashes or is killed?
I am not aware of any other place in the kernel where a kernel thread does an unbounded wait for another system daemon.
> > These can simply lead to system hang-up issues during shutdown,
> > requiring a hard power-off or reset, which isn't expected from a
> > sane system.
>
> This is new information. Is the use case to avoid blocking for too
> long during shutdown or reset? We may need to do something to ensure
> that the tee-supplicant isn't killed before OP-TEE is done with it.
> Enforcing this timeout during shutdown makes sense, but not during
> normal operation.
This was entirely the motivation for this patch; maybe the commit description wasn't as clear as I expected:
> The OP-TEE supplicant is a user-space daemon, and it is possible for
> it to crash or be killed in the middle of processing an OP-TEE RPC
> call. The situation becomes more complicated when the shutdown
> ordering of the supplicant process vs. the OP-TEE client application
> is incorrect, which can eventually leave the system hung waiting for
> the client application to close.
It's essentially trying to fix the system hang-up problem during shutdown/reboot. You can't always control user-space to do the right thing, but the golden rule is that user-space shouldn't be able to break or hang up the kernel.
> > So I rather consider the current unbounded wait loop for
> > tee-supplicant a bug requiring a fix to be backported.
>
> Yeah, but only during certain circumstances.
> > So do you have a better suggestion to fix this in the mainline, one
> > that can also be backported to stable releases?
>
> Let's start by finding out what problem you're trying to fix.
Let me know if the Kconfig option proposed above sounds reasonable to you.
-Sumit
[+Arnd for a question below]
On Wed, Jan 22, 2025 at 1:25 PM Sumit Garg <sumit.garg@linaro.org> wrote:
> > Tee-supplicant is supposed to respond in a timely manner. What
> > counts as timely depends on the use case, but you now want to say
> > that it's 10 seconds regardless of what's going on. [...]
>
> Although I initially thought 10 seconds should be enough for any
> user-space process to be considered hung, I then saw
> DEFAULT_HUNG_TASK_TIMEOUT, which is 120 seconds for a task to be
> considered hung. How about a Kconfig option like
> OPTEE_SUPPLICANT_HUNG_TIMEOUT instead, defaulting to 120 seconds? It
> could be set to 0 to disable the timeout entirely for debugging
> purposes.
Adding a timeout when a timeout isn't needed seems wrong, even if the timeout is very long. Arnd, what do you think?
> > Tee-supplicant is a system daemon; the system depends heavily on
> > it. It should not be killable by unprivileged processes. What
> > happens if init crashes or is killed?
>
> I am not aware of any other place in the kernel where a kernel thread
> does an unbounded wait for another system daemon.
I imagine that FUSE and NFS mounts behave similarly.
These can simply lead to system hung up issues during shutdown requiring a hard power off or reset which isn't expected from a sane system.
This is new information. Is the use case to avoid blocking for too long during shutdown or reset? We may need to do something to ensure that the tee-supplicant isn't killed before OP-TEE is done with it. Enforcing this timeout during shutdown makes sense, but not during normal operation.
This was entirely the motivation for this patch; maybe the commit description wasn't as clear as I expected.
OP-TEE supplicant is a user-space daemon and it's possible for it being crashed or killed in the middle of processing an OP-TEE RPC call. It becomes more complicated when there is incorrect shutdown ordering of the supplicant process vs the OP-TEE client application which can eventually lead to system hang-up waiting for the closure of the client application.
It's essentially trying to fix the system hang-up problem during shutdown/reboot. You can't always rely on user-space to do the right thing, but the golden rule is that user-space shouldn't be able to break or hang up the kernel.
So I rather consider the unbounded wait loop for tee-supplicant a bug we currently have, one that requires a fix that can be backported.
Yeah, but only during certain circumstances.
So do you have a better suggestion to fix this in the mainline as well as backported to stable releases?
Let's start by finding out what problem you're trying to fix.
Let me know if the Kconfig option as proposed above sounds reasonable to you.
No one is going to bother with that config option; recompiling the kernel to be able to debug tee-supplicant is a bit much.
Cheers, Jens
On Mon, Jan 27, 2025, at 08:33, Jens Wiklander wrote:
[+Arnd for a question below]
On Wed, Jan 22, 2025 at 1:25 PM Sumit Garg sumit.garg@linaro.org wrote:
On Wed, 22 Jan 2025 at 15:36, Jens Wiklander jens.wiklander@linaro.org wrote:
On Wed, Jan 22, 2025 at 10:15 AM Sumit Garg sumit.garg@linaro.org wrote:
On Mon, 20 Jan 2025 at 19:01, Jens Wiklander jens.wiklander@linaro.org wrote:
On Mon, Jan 20, 2025 at 10:24 AM Sumit Garg sumit.garg@linaro.org wrote:
Hi Jens,
On Fri, 13 Dec 2024 at 16:45, Sumit Garg sumit.garg@linaro.org wrote:
OP-TEE supplicant is a user-space daemon and it's possible for it being crashed or killed in the middle of processing an OP-TEE RPC call. It becomes more complicated when there is incorrect shutdown ordering of the supplicant process vs the OP-TEE client application which can eventually lead to system hang-up waiting for the closure of the client application.
In order to gracefully handle this scenario, let's add a long enough timeout to wait for supplicant to process requests. In case there is a timeout then we return a proper error code for the RPC request.
Signed-off-by: Sumit Garg sumit.garg@linaro.org
drivers/tee/optee/supp.c | 58 +++++++++++++++++++++++++--------------- 1 file changed, 36 insertions(+), 22 deletions(-)
Do you have any further comments here? Or is it fine for you to pick it up?
I don't like the timeout, it's a bit too hacky.
Can you please elaborate here as to why?
Tee-supplicant is supposed to respond in a timely manner. What counts as timely depends on the use case, but you now want to say that it's 10 seconds regardless of what's going on. This makes it impossible to debug tee-supplicant with a debugger unless you're very quick. It might also introduce random timeouts in a system under heavy IO load.
Initially I thought 10 seconds should be enough for any user-space process to be considered hung, but then I saw DEFAULT_HUNG_TASK_TIMEOUT, which allows 120 seconds before a task is considered hung. How about a Kconfig option instead, like OPTEE_SUPPLICANT_HUNG_TIMEOUT, defaulting to 120 seconds? It could be set to 0 to disable the timeout entirely for debugging purposes.
Adding a timeout when a timeout isn't needed seems wrong, even if the timeout is very long. Arnd, what do you think?
It's hard to put an upper bound on user space latency.
As far as I can tell, even that DEFAULT_HUNG_TASK_TIMEOUT limit is only for tasks that are in an unkillable state for more than two minutes, but the supplicant not providing results to the kernel could also happen when it waits in a killable or interruptible state, or when it does multiple I/Os in a row that each block for a time under 120 seconds.
A single sector write to an eMMC can easily take multiple seconds by itself when nothing else is going on and the device is in need of garbage collection. If the system is already low on memory or there are other tasks writing to the file system, you can have many such I/O operations queued up in the device when the supplicant adds another I/O to the back of the queue.
Looking at the function that Sumit suggested changing, I see another problem, both before and after the patch:
while (wait_for_completion_interruptible(&req->c)) {
        mutex_lock(&supp->mutex);
        interruptable = !supp->ctx;
        if (interruptable) {
                ...
        }
        mutex_unlock(&supp->mutex);

        if (interruptable) {
                req->ret = TEEC_ERROR_COMMUNICATION;
                break;
        }
}
The "_interruptible()" wait makes no sense here if the "interruptable" variable is unset: The task remains in interrupted state, so the while() loop that was waiting for the wake_up_interruptible() turns into a busy loop if the task actually gets a signal.
If the task at this point is running at a realtime priority, it would prevent the thing it is waiting for from getting scheduled on the same CPU.
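To make the hazard concrete, here is a simplified sketch (the queue-removal details under the mutex are elided); the fall-back to an uninterruptible wait is one conceivable escape, shown purely for illustration, not a fix from this thread:

while (wait_for_completion_interruptible(&req->c)) {
        /*
         * We only get here with a signal pending, and it stays
         * pending, so another interruptible wait would return
         * -ERESTARTSYS immediately.
         */
        mutex_lock(&supp->mutex);
        interruptable = !supp->ctx;
        mutex_unlock(&supp->mutex);

        if (interruptable) {
                req->ret = TEEC_ERROR_COMMUNICATION;
                break;
        }

        /*
         * Not allowed to abort: rather than spin, fall back to an
         * uninterruptible wait for the supplicant.
         */
        wait_for_completion(&req->c);
        break;
}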
So do you have a better suggestion to fix this in the mainline as well as backported to stable releases?
Let's start by finding out what problem you're trying to fix.
Let me know if the Kconfig option as proposed above sounds reasonable to you.
No one is going to bother with that config option; recompiling the kernel to be able to debug tee-supplicant is a bit much.
If a timeout is needed, having it runtime-configurable seems more useful than a build-time option.
Arnd
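A sketch of what a runtime-configurable timeout could look like as a module parameter; the parameter name and its 0-means-unbounded semantics are assumptions for illustration:

#include <linux/moduleparam.h>

/* Hypothetical parameter; not part of the posted patch. */
static unsigned int supp_timeout_ms = 120000;
module_param(supp_timeout_ms, uint, 0644);
MODULE_PARM_DESC(supp_timeout_ms,
                 "supplicant RPC timeout in milliseconds (0 = wait forever)");

static unsigned long supp_timeout(void)
{
        if (!supp_timeout_ms)
                return MAX_SCHEDULE_TIMEOUT;
        return msecs_to_jiffies(supp_timeout_ms);
}

With 0644 permissions the value could then be changed on a running system via /sys/module/optee/parameters/supp_timeout_ms (path assuming the code sits in the optee module), which avoids the recompile objection.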
On Mon, Jan 27, 2025 at 9:29 AM Arnd Bergmann arnd@arndb.de wrote:
On Mon, Jan 27, 2025, at 08:33, Jens Wiklander wrote:
[+Arnd for a question below]
On Wed, Jan 22, 2025 at 1:25 PM Sumit Garg sumit.garg@linaro.org wrote:
On Wed, 22 Jan 2025 at 15:36, Jens Wiklander jens.wiklander@linaro.org wrote:
On Wed, Jan 22, 2025 at 10:15 AM Sumit Garg sumit.garg@linaro.org wrote:
On Mon, 20 Jan 2025 at 19:01, Jens Wiklander jens.wiklander@linaro.org wrote:
On Mon, Jan 20, 2025 at 10:24 AM Sumit Garg sumit.garg@linaro.org wrote:
Hi Jens,
On Fri, 13 Dec 2024 at 16:45, Sumit Garg sumit.garg@linaro.org wrote:
OP-TEE supplicant is a user-space daemon and it's possible for it being crashed or killed in the middle of processing an OP-TEE RPC call. It becomes more complicated when there is incorrect shutdown ordering of the supplicant process vs the OP-TEE client application which can eventually lead to system hang-up waiting for the closure of the client application.
In order to gracefully handle this scenario, let's add a long enough timeout to wait for supplicant to process requests. In case there is a timeout then we return a proper error code for the RPC request.
Signed-off-by: Sumit Garg sumit.garg@linaro.org
drivers/tee/optee/supp.c | 58 +++++++++++++++++++++++++--------------- 1 file changed, 36 insertions(+), 22 deletions(-)
Do you have any further comments here? Or is it fine for you to pick it up?
I don't like the timeout, it's a bit too hacky.
Can you please elaborate here as to why?
Tee-supplicant is supposed to respond in a timely manner. What counts as timely depends on the use case, but you now want to say that it's 10 seconds regardless of what's going on. This makes it impossible to debug tee-supplicant with a debugger unless you're very quick. It might also introduce random timeouts in a system under heavy IO load.
Initially I thought 10 seconds should be enough for any user-space process to be considered hung, but then I saw DEFAULT_HUNG_TASK_TIMEOUT, which allows 120 seconds before a task is considered hung. How about a Kconfig option instead, like OPTEE_SUPPLICANT_HUNG_TIMEOUT, defaulting to 120 seconds? It could be set to 0 to disable the timeout entirely for debugging purposes.
Adding a timeout when a timeout isn't needed seems wrong, even if the timeout is very long. Arnd, what do you think?
It's hard to put an upper bound on user space latency.
As far as I can tell, even that DEFAULT_HUNG_TASK_TIMEOUT limit is only for tasks that are in an unkillable state for more than two minutes, but the supplicant not providing results to the kernel could also happen when it waits in a killable or interruptible state, or when it does multiple I/Os in a row that each block for a time under 120 seconds.
A single sector write to an eMMC can easily take multiple seconds by itself when nothing else is going on and the device is in need of garbage collection. If the system is already low on memory or there are other tasks writing to the file system, you can have many such I/O operations queued up in the device when the supplicant adds another I/O to the back of the queue.
Adding a timeout means we must somehow handle such delays to avoid spurious errors.
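One way to read that requirement, sketched under the assumption that a still-registered supplicant context means the request may yet complete (illustration only, not code from this thread; SUPP_TIMEOUT and TEE_ERROR_TIMEOUT are taken from the posted patch):

while (!wait_for_completion_timeout(&req->c, SUPP_TIMEOUT)) {
        bool alive;

        mutex_lock(&supp->mutex);
        alive = !!supp->ctx;
        mutex_unlock(&supp->mutex);

        /*
         * Timed out, but the supplicant is still registered: treat
         * this as slow I/O rather than failure and keep waiting.
         */
        if (alive)
                continue;

        req->ret = TEE_ERROR_TIMEOUT;
        break;
}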
Looking at the function that Sumit suggested changing, I see another problem, both before and after the patch:
while (wait_for_completion_interruptible(&req->c)) {
        mutex_lock(&supp->mutex);
        interruptable = !supp->ctx;
        if (interruptable) {
                ...
        }
        mutex_unlock(&supp->mutex);

        if (interruptable) {
                req->ret = TEEC_ERROR_COMMUNICATION;
                break;
        }
}
The "_interruptible()" wait makes no sense here if the "interruptable" variable is unset: The task remains in interrupted state, so the while() loop that was waiting for the wake_up_interruptible() turns into a busy loop if the task actually gets a signal.
If the task at this point is running at a realtime priority, it would prevent the thing it is waiting for from getting scheduled on the same CPU.
I see the problem, thanks.
So do you have a better suggestion to fix this in the mainline as well as backported to stable releases?
Let's start by finding out what problem you're trying to fix.
Let me know if the Kconfig option as proposed above sounds reasonable to you.
No one is going to bother with that config option; recompiling the kernel to be able to debug tee-supplicant is a bit much.
If a timeout is needed, having it runtime configurable seems more useful than a build time option.
To address Sumit's issue of hung shutdowns and reboots, the timeout could be activated only during shutdowns and reboots.
Thanks, Jens
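A sketch of that idea: a reboot notifier arms the timeout only once shutdown or reboot has started, so normal operation keeps the unbounded wait. The names and the 10-second figure are illustrative assumptions, not code from this thread:

#include <linux/atomic.h>
#include <linux/notifier.h>
#include <linux/reboot.h>

static atomic_t supp_shutting_down = ATOMIC_INIT(0);

static int supp_reboot_notify(struct notifier_block *nb,
                              unsigned long action, void *data)
{
        atomic_set(&supp_shutting_down, 1);
        return NOTIFY_OK;
}

static struct notifier_block supp_reboot_nb = {
        .notifier_call = supp_reboot_notify,
};

/* register_reboot_notifier(&supp_reboot_nb) would be called at probe */

static unsigned long supp_wait_timeout(void)
{
        /* Unbounded wait in normal operation ... */
        if (!atomic_read(&supp_shutting_down))
                return MAX_SCHEDULE_TIMEOUT;
        /* ... bounded only once shutdown/reboot is in progress. */
        return msecs_to_jiffies(10000);
}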