
Unexpectedly long time until a server session with subscriptions times out despite keepalive #1577

Closed
anyc opened this issue Feb 6, 2025 · 27 comments


anyc (Contributor) commented Feb 6, 2025

Hello,

I am currently experimenting with the new server-side keepalive feature, and it looks to me like the session takes too long to time out. In my server context, I set a keepalive interval of 180 seconds, as I just want to get notified when a (mobile) client vanishes so I can release resources. My client observes a lot of resources and, in this experiment, I kill it hard after starting a block transfer. Below is the last block received before the kill (the RTC was not synchronized during this run):

May 08 00:35:20.359 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::4dd8:5d0a:e344:f549]:49738 (if7) DTLS: netif: sent   45 bytes
May 08 00:35:20.360 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::4dd8:5d0a:e344:f549]:49738 (if7) DTLS: dtls:  sent   16 bytes
v:1 t:ACK c:2.31 i:a2f0 {01010000000001e8} [ Block1:256/M/1024 ]

Afterwards, I see occasional retransmits:

May 08 00:35:23.900 DEBG Observe PDU presented to app.
v:1 t:CON c:GET i:8b3f {01e0} [ Observe:, Uri-Path:my, Uri-Path:long, Uri-Path:path, Request-Tag:0xf8cd527a ]
May 08 00:35:23.900 DEBG call custom handler for resource 'my/long/path' (4)
May 08 00:35:23.901 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::4dd8:5d0a:e344:f549]:49738 (if7) DTLS: netif: sent  106 bytes
May 08 00:35:23.901 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::4dd8:5d0a:e344:f549]:49738 (if7) DTLS: dtls:  sent   77 bytes
v:1 t:CON c:2.05 i:8b3f {01e0} [ ETag:0x0172305e, Observe:116, Content-Format:text/plain ] :: '<some text>'
May 08 00:35:23.902 DEBG ** [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::4dd8:5d0a:e344:f549]:49738 (if7) DTLS: mid=0x8b3f: added to retransmit queue (2938ms)
May 08 00:35:26.841 DEBG ***EVENT: COAP_EVENT_MSG_RETRANSMITTED
May 08 00:35:26.841 DEBG ** [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::4dd8:5d0a:e344:f549]:49738 (if7) DTLS: mid=0x8b3f: retransmission #1 (next 5876ms)
May 08 00:35:26.842 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::4dd8:5d0a:e344:f549]:49738 (if7) DTLS: netif: sent  106 bytes
May 08 00:35:26.842 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::4dd8:5d0a:e344:f549]:49738 (if7) DTLS: dtls:  sent   77 bytes
v:1 t:CON c:2.05 i:8b3f {01e0} [ ETag:0x0172305e, Observe:116, Content-Format:text/plain ] :: '<some text>'

Then subscriptions are removed one at a time:

May 08 00:36:54.983 WARN ** [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::4dd8:5d0a:e344:f549]:49738 (if7) DTLS: mid=0x8b3f: give up after 4 attempts
May 08 00:36:54.983 DEBG removed subscription 0x73295aa0 with token '01e0' key 0x5838ae5c
[...]
May 08 00:38:23.153 WARN ** [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::4dd8:5d0a:e344:f549]:49738 (if7) DTLS: mid=0x8b46: give up after 4 attempts
May 08 00:38:23.153 DEBG removed subscription 0x73290010 with token '0128' key 0x3c2a5148

And after over ten minutes, COAP_EVENT_KEEPALIVE_FAILURE is emitted:

May 08 00:46:54.626 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::4dd8:5d0a:e344:f549]:49738 (if7) DTLS: netif: sent   33 bytes
May 08 00:46:54.626 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::4dd8:5d0a:e344:f549]:49738 (if7) DTLS: dtls:  sent    4 bytes
v:1 t:CON c:0.00 i:8b58 {} [ ]
May 08 00:46:54.627 DEBG ** [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::4dd8:5d0a:e344:f549]:49738 (if7) DTLS: mid=0x8b58: added to retransmit queue (2219ms)
May 08 00:46:54.627 DEBG ***EVENT: COAP_EVENT_KEEPALIVE_FAILURE
May 08 00:46:54.629 DEBG ** [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::4dd8:5d0a:e344:f549]:49738 (if7) DTLS: mid=0x8b58: removed (6)

and lots of other subscriptions are removed:

May 08 00:46:54.629 DEBG removed subscription 0x7328efc0 with token '9b' key 0x2b226c60
May 08 00:46:54.630 DEBG removed subscription 0x7328f080 with token '99' key 0xd5e32a5e
May 08 00:46:54.630 DEBG removed subscription 0x7328eed0 with token '9f' key 0xbcb5883e
May 08 00:46:54.630 DEBG removed subscription 0x7328ef60 with token '9d' key 0x9f98a611
May 08 00:46:54.630 DEBG removed subscription 0x7328ee70 with token 'a1' key 0xc9791bed
....

In my COAP_EVENT_KEEPALIVE_FAILURE handler, I call coap_session_release().

Interestingly, I still see debug output with the session's port 49738 afterwards:

May 08 01:01:14.146 DEBG removed subscription 0x7328a580 with token '06' key 0xa95b0aae
May 08 01:01:14.147 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::4dd8:5d0a:e344:f549]:49738 (if7) DTLS: netif: sent   34 bytes
May 08 01:01:14.147 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::4dd8:5d0a:e344:f549]:49738 (if7) DTLS: dtls:  sent    5 bytes
v:1 t:NON c:5.00 i:8b5c {06} [ ]

And after I stop the server application with CTRL+C there is more traffic:

^CMay 08 01:04:37.694 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::4dd8:5d0a:e344:f549]:49738 (if7) DTLS: netif: sent   34 bytes
May 08 01:04:37.695 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::4dd8:5d0a:e344:f549]:49738 (if7) DTLS: dtls:  sent    5 bytes
v:1 t:NON c:4.04 i:8b5d {12} [ ]
May 08 01:04:37.696 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::4dd8:5d0a:e344:f549]:49738 (if7) DTLS: netif: sent   34 bytes
May 08 01:04:37.696 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::4dd8:5d0a:e344:f549]:49738 (if7) DTLS: dtls:  sent    5 bytes
v:1 t:NON c:4.04 i:8b5e {10} [ ]
May 08 01:04:37.699 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::4dd8:5d0a:e344:f549]:49738 (if7) DTLS: netif: sent   34 bytes
May 08 01:04:37.699 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::4dd8:5d0a:e344:f549]:49738 (if7) DTLS: dtls:  sent    5 bytes
v:1 t:NON c:4.04 i:8b5f {22} [ ]
....

The long delay itself is not a problem for me; I just wanted to ask whether it is to be expected for some reason in this case. Regarding the traffic after the event: I guess I should check whether I still reference the session somewhere else, or is there a way to force-close a session after a keepalive failure so that my COAP_EVENT_*_CLOSED handlers release all my corresponding structures?

The server is using the fairly recent commit ba09d0f.

Thank you!

anyc (Contributor, Author) commented Feb 6, 2025

I just made a similar test with the current develop HEAD and the bundled coap-server and coap-client:

Feb 06 13:34:05.001 DEBG call custom handler for resource 'time' (4)
Feb 06 13:34:05.001 DEBG *  [::ffff:127.0.0.1]:5683 <-> [::ffff:127.0.0.1]:49215 (if1) UDP : netif: sent   30 bytes
v:1 t:CON c:2.05 i:33ee {} [ ETag:0x936c7beb, Observe:10, Content-Format:text/plain, Max-Age:1 ] :: 'Feb 06 12:34:05'
Feb 06 13:34:05.001 DEBG ** [::ffff:127.0.0.1]:5683 <-> [::ffff:127.0.0.1]:49215 (if1) UDP : mid=0x33ee: added to retransmit queue (2969ms)

Feb 06 13:34:07.970 DEBG ***EVENT: COAP_EVENT_MSG_RETRANSMITTED
Feb 06 13:34:07.970 DEBG ** [::ffff:127.0.0.1]:5683 <-> [::ffff:127.0.0.1]:49215 (if1) UDP : mid=0x33ee: retransmission #1 (next 5938ms)
Feb 06 13:34:07.970 DEBG *  [::ffff:127.0.0.1]:5683 <-> [::ffff:127.0.0.1]:49215 (if1) UDP : netif: sent   30 bytes
v:1 t:CON c:2.05 i:33ee {} [ ETag:0x936c7beb, Observe:10, Content-Format:text/plain, Max-Age:1 ] :: 'Feb 06 12:34:05'

Feb 06 13:34:13.908 DEBG ***EVENT: COAP_EVENT_MSG_RETRANSMITTED
Feb 06 13:34:13.908 DEBG ** [::ffff:127.0.0.1]:5683 <-> [::ffff:127.0.0.1]:49215 (if1) UDP : mid=0x33ee: retransmission #2 (next 11876ms)
Feb 06 13:34:13.908 DEBG *  [::ffff:127.0.0.1]:5683 <-> [::ffff:127.0.0.1]:49215 (if1) UDP : netif: sent   30 bytes
v:1 t:CON c:2.05 i:33ee {} [ ETag:0x936c7beb, Observe:10, Content-Format:text/plain, Max-Age:1 ] :: 'Feb 06 12:34:05'
Feb 06 13:34:25.784 DEBG ***EVENT: COAP_EVENT_MSG_RETRANSMITTED
Feb 06 13:34:25.784 DEBG ** [::ffff:127.0.0.1]:5683 <-> [::ffff:127.0.0.1]:49215 (if1) UDP : mid=0x33ee: retransmission #3 (next 23752ms)
Feb 06 13:34:25.784 DEBG *  [::ffff:127.0.0.1]:5683 <-> [::ffff:127.0.0.1]:49215 (if1) UDP : netif: sent   30 bytes
v:1 t:CON c:2.05 i:33ee {} [ ETag:0x936c7beb, Observe:10, Content-Format:text/plain, Max-Age:1 ] :: 'Feb 06 12:34:05'

Feb 06 13:34:49.536 DEBG ***EVENT: COAP_EVENT_MSG_RETRANSMITTED
Feb 06 13:34:49.536 DEBG ** [::ffff:127.0.0.1]:5683 <-> [::ffff:127.0.0.1]:49215 (if1) UDP : mid=0x33ee: retransmission #4 (next 29750ms)
Feb 06 13:34:49.536 DEBG *  [::ffff:127.0.0.1]:5683 <-> [::ffff:127.0.0.1]:49215 (if1) UDP : netif: sent   30 bytes
v:1 t:CON c:2.05 i:33ee {} [ ETag:0x936c7beb, Observe:10, Content-Format:text/plain, Max-Age:1 ] :: 'Feb 06 12:34:05'

Feb 06 13:35:19.286 WARN ** [::ffff:127.0.0.1]:5683 <-> [::ffff:127.0.0.1]:49215 (if1) UDP : mid=0x33ee: give up after 4 attempts
Feb 06 13:35:19.286 DEBG removed subscription 0x5fa9af952700 with token '' key 0xb84276ff

Feb 06 13:39:49.536 DEBG ***EVENT: COAP_EVENT_SERVER_SESSION_DEL
Feb 06 13:39:49.536 DEBG ***[::ffff:127.0.0.1]:5683 <-> [::ffff:127.0.0.1]:49215 (if1) UDP : session 0x5fa9af952010: closed

It takes over 5 minutes in this case until the session is closed, but there is no keepalive failure.

mrdeep1 (Collaborator) commented Feb 6, 2025

The (unmodified) coap-server sets the keepalive timeout to 30 secs, not the 180 secs you were referring to.

A keepalive is only transmitted if there has been no other traffic on the session for the duration of the keepalive timeout. In this case, a 'time' response is actively being transmitted but not getting any acknowledgement, hence the n retransmissions.

Here, the libcoap logic runs as follows: after retransmission 4 plus 30 secs it is time to send a keepalive, but it sees that a transmission has been active without a response for 30 seconds, and so it closes the session, as you are seeing. This happens instead of trying to send a keepalive, which it is not allowed to do because the number of active CON transmissions is already 1, the value of NSTART. So, I think the code is doing the correct thing here for this scenario.

The default time until the subscription is deleted is (2+4+8+16)*1.5 + 30 = 75 seconds; in your case this was 74.285 seconds (there needs to be an element of randomness here).

The default time for an idle server session to be deleted is 300 secs (configurable via coap_context_set_session_timeout()), which explains the gap between the subscription being deleted and the session being 'idled' out.
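The arithmetic above can be sketched in a few lines (an illustrative calculation only; the function names are mine, not libcoap API, and the values are the RFC 7252 defaults ACK_TIMEOUT = 2 s, ACK_RANDOM_FACTOR = 1.5 worst case, MAX_RETRANSMIT = 4, plus the 30 s keepalive):

```c
#include <assert.h>

/* MAX_TRANSMIT_SPAN (RFC 7252): time from the first transmission to the
 * start of the last retransmission, with the worst-case random factor:
 * ACK_TIMEOUT * (2^MAX_RETRANSMIT - 1) * ACK_RANDOM_FACTOR
 * = 2 * 15 * 1.5 = 45 s. */
double max_transmit_span(double ack_timeout, int max_retransmit,
                         double random_factor) {
  return ack_timeout * ((1 << max_retransmit) - 1) * random_factor;
}

/* Worst-case delay until the subscription of a vanished observer is
 * dropped: the retransmission span plus the keepalive window. */
double subscription_drop_worst_case(double keepalive_secs) {
  return max_transmit_span(2.0, 4, 1.5) + keepalive_secs;  /* 45 + 30 */
}
```

With the coap-server default keepalive of 30 s this gives the 75 seconds quoted above; the observed value is a little shorter whenever the random factor lands below 1.5.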

[Upping the value of NSTART is generally not recommended as there are lots of potential side effects.]

I am trying to emulate your initial scenario to see what is happening.

mrdeep1 (Collaborator) commented Feb 6, 2025

In my COAP_EVENT_KEEPALIVE_FAILURE handler, I call coap_session_release().

I think this is your issue. A server side session will have a reference count of 0 when created. An Observe request (that succeeds) will bump the reference count. When the subscription is deleted, the reference count is decremented.

A server session with a reference count of 0 will eventually time out and be deleted. However, if any new traffic comes in, this session will be reused for handling the request/response, again starting from a reference count of 0.

So, calling coap_session_release() is dangerous when the session is server side in the application unless the application has done a coap_session_reference() and the coap_session_release() is balancing it out.

This is different however for a client side session - here, the client side session starts off with a reference count of 1 and if the application calls coap_session_release(), then the client side session gets deleted. Subtly different, but has the potential for confusion.

However, server side, you can do coap_session_set_type_client(session); coap_session_release(session); which will cause the reference count to get bumped and then decremented, and hence the session immediately gets deleted.
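The reference-count rules described above can be illustrated with a toy model (plain C; this is not libcoap code, and all names here are invented for illustration):

```c
#include <assert.h>

/* Toy model of the session reference counting described above - not
 * libcoap internals. A server-side session starts at count 0; a
 * successful Observe bumps it; deleting the subscription drops it. */
typedef struct {
  int ref;       /* reference count */
  int freed;     /* set once the model "deletes" the session */
  int is_client; /* set by the modelled coap_session_set_type_client() */
} toy_session_t;

/* Modelled coap_session_release(): client-style sessions are deleted
 * when the count reaches 0; a server-style session at count 0 merely
 * becomes a candidate for the idle timeout. */
void toy_release(toy_session_t *s) {
  if (s->ref > 0)
    s->ref--;
  if (s->ref == 0 && s->is_client)
    s->freed = 1;
}

/* Observe established, then subscription deleted: the count returns to
 * 0 and the session idles out; an extra unbalanced release here would
 * be the dangerous case mentioned above. */
int server_observe_lifecycle(void) {
  toy_session_t s = {0, 0, 0};
  s.ref++;   /* Observe bumps the count */
  s.ref--;   /* subscription deleted */
  return s.ref;
}

/* The server-side escape hatch sketched above: switching the session
 * to client type bumps the count, and the balancing release then
 * deletes the session immediately. */
int set_type_client_then_release(void) {
  toy_session_t s = {0, 0, 0};
  s.is_client = 1;
  s.ref++;           /* modelled coap_session_set_type_client() */
  toy_release(&s);   /* modelled coap_session_release() */
  return s.freed;    /* session deleted right away */
}
```

(As noted further down in the thread, doing this from inside an event handler turned out to be unsafe; the model only illustrates the counting.)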

anyc (Contributor, Author) commented Feb 6, 2025

Thank you for the detailed explanation!

So, in theory, an observed resource could decide to publish a new value right before the keepalive timeout and postpone the timeout for another ~75 seconds, and with n such resources a "dead" session could in the worst case live on for almost n * 75 seconds (plus the session or keepalive timeout) and n * 4 PDUs would be sent out. Do you think a new COAP_EVENT_RETRANSMIT_TIMEOUT-like event type would help here to give an application more control?

I will try your suggestion and change the session type.

mrdeep1 (Collaborator) commented Feb 6, 2025

Thinking about things, it is unsafe to remove a session in an event handler as it is possible / likely that the code that called the event handler will continue to reference the session. So don't try my suggestion.

The libcoap code that calls the event handler with COAP_EVENT_KEEPALIVE_FAILURE goes on to delete all the observe subscriptions and then forces the removal of the session (done in coap_io_prepare_io_lkd()).

So - in theory - a observed resource ...

For a particular session, if the COAP_EVENT_KEEPALIVE_FAILURE event is triggered, then everything should be cleaned up and things should not continue as you are seeing. Certainly the n * 75 seconds should not be taking place. This needs to be reproduced and looked into further.

mrdeep1 (Collaborator) commented Feb 7, 2025

Please try #1578 with your original scenario and let us know how it goes.

anyc (Contributor, Author) commented Feb 11, 2025

I was just able to do a quick test and it looks better now, with less than 5 minutes until the COAP_EVENT_SERVER_SESSION_DEL. I will take a closer look tomorrow. Thank you!

May 13 05:10:40.304 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:55930 (if7) DTLS: netif: sent   45 bytes
May 13 05:10:40.304 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:55930 (if7) DTLS: dtls:  sent   16 bytes
v:1 t:ACK c:2.31 i:c52a {052b0000000001e6} [ Block1:1322/M/1024 ]
May 13 05:14:50.772 WARN ** [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:55930 (if7) DTLS: mid=0x0d9c: give up after 4 attempts
May 13 05:14:50.772 DEBG removed subscription 0x59bb38 with token 'b6' key 0x3ab24360
May 13 05:15:10.635 DEBG ***EVENT: COAP_EVENT_SERVER_SESSION_DEL
session [ff02::fd]:5683 <- [fe80::c02b:fc8f:99e5:337c]:55591 closed (0x57ced8 0x5711c0)
May 13 05:15:10.635 DEBG ***[ff02::fd]:5683 <-> [fe80::c02b:fc8f:99e5:337c]:55591 (if7) UDP : session 0x57ced8: closed

mrdeep1 (Collaborator) commented Feb 11, 2025

Thanks for the update. I note that there are two different sessions in the logs you recently added.

anyc (Contributor, Author) commented Feb 12, 2025

After some more tries it looks like the COAP_EVENT_KEEPALIVE_FAILURE event is now emitted around 1.5 to 2 minutes after killing the client:

May 14 04:08:27.449 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:39845 (if7) DTLS: netif: sent   45 bytes
May 14 04:08:27.449 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:39845 (if7) DTLS: dtls:  sent   16 bytes
v:1 t:ACK c:2.31 i:f1bb {02f30000000001f2} [ Block1:754/M/1024 ]

[...]

May 14 04:09:57.628 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:39845 (if7) DTLS: netif: sent   45 bytes
May 14 04:09:57.628 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:39845 (if7) DTLS: dtls:  sent   16 bytes
v:1 t:NON c:2.05 i:a0c7 {0128} [ ETag:0xe1fc34ff, Observe:13, Content-Format:text/plain ] :: '1'
May 14 04:10:20.496 WARN ** [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:39845 (if7) DTLS: mid=0xa0c0: give up after 4 attempts
May 14 04:10:20.496 DEBG ***EVENT: COAP_EVENT_KEEPALIVE_FAILURE
May 14 04:10:20.497 DEBG removed subscription 0x732b1070 with token '9b' key 0x1612d675
May 14 04:10:20.498 DEBG removed subscription 0x732b1130 with token '99' key 0xa1768a9b

In my event handler, I now call coap_session_set_type_client(session); coap_session_release(session); but it seems that the session is never truly closed. If I stop my server after some more minutes, I see a lot of additional subscription removals like:

^CMay 14 04:16:19.083 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:39845 (if7) DTLS: netif: sent   34 bytes
May 14 04:16:19.084 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:39845 (if7) DTLS: dtls:  sent    5 bytes
v:1 t:NON c:4.04 i:a0ca {12} [ ]
May 14 04:16:19.085 DEBG removed subscription 0x732ae280 with token '12' key 0x0c410ac7

and finally:

May 14 04:16:19.136 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:39845 (if7) DTLS: netif: sent   35 bytes
May 14 04:16:19.136 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:39845 (if7) DTLS: dtls:  sent    6 bytes
v:1 t:NON c:4.04 i:a0e8 {0168} [ ]
May 14 04:16:19.136 DEBG removed subscription 0x732b3800 with token '0168' key 0x4815a9b0
May 14 04:16:19.137 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:39845 (if7) DTLS: netif: sent   31 bytes
May 14 04:16:19.137 INFO *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:39845 (if7) DTLS: SSL3 alert write:warning:close notify
May 14 04:16:19.139 DEBG ***EVENT: COAP_EVENT_DTLS_CLOSED
DTLS [fe80::20f:69ff:feff:d6c3]:5684 <- [fe80::c02b:fc8f:99e5:337c]:39845 closed
May 14 04:16:19.139 DEBG ***[fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:39845 (if7) DTLS: session 0x69c37580: closed
github/libcoap/src/coap_openssl.c:3523: coap_dtls_send: Assertion `ssl != NULL' failed.
Aborted (core dumped)

I will check if I am doing something wrong here.

mrdeep1 (Collaborator) commented Feb 12, 2025

Definitely something I have missed here, but I am not sure what.

In my event handler, I now call coap_session_set_type_client(session); coap_session_release(session);

This is done in coap_session_server_keepalive_failed(), which is invoked when there is a server-side detection of a keepalive failure, so it should not be done in the application event handler.

Looking at coap_session_server_keepalive_failed():

coap_session_server_keepalive_failed(coap_session_t *session) {
  coap_session_reference_lkd(session);
  coap_handle_event_lkd(session->context, COAP_EVENT_KEEPALIVE_FAILURE, session);
  coap_cancel_all_messages(session->context, session, NULL);
  RESOURCES_ITER(session->context->resources, r) {
    coap_delete_observer(r, session, NULL);
  }

all the resources are iterated through, deleting any observer subscription the session has for that resource - but you are only seeing some of them deleted, as if there are multiple subscriptions active for the same resource and this logic is only deleting the first one per resource.
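A minimal sketch of that suspicion (a toy subscription list in plain C; not libcoap code, names invented): deleting only the first matching subscription per resource leaves later ones for the same session behind.

```c
#include <assert.h>
#include <stdlib.h>

/* Each resource keeps a list of subscriptions, several of which may
 * belong to the same session. */
typedef struct sub {
  int session_id;
  struct sub *next;
} sub_t;

sub_t *push(sub_t *head, int session_id) {
  sub_t *s = malloc(sizeof(*s));
  s->session_id = session_id;
  s->next = head;
  return s;
}

/* Remove the first subscription for session_id; returns the new head. */
sub_t *delete_first(sub_t *head, int session_id) {
  sub_t **p = &head;
  while (*p) {
    if ((*p)->session_id == session_id) {
      sub_t *victim = *p;
      *p = victim->next;
      free(victim);
      break;                 /* stops after one - the suspected bug */
    }
    p = &(*p)->next;
  }
  return head;
}

int count(sub_t *head, int session_id) {
  int n = 0;
  for (; head; head = head->next)
    if (head->session_id == session_id)
      n++;
  return n;
}

/* Two observers from session 7 on one resource: one pass of
 * delete_first leaves one subscription dangling. */
int leftover_after_single_pass(void) {
  sub_t *r = NULL;
  r = push(r, 7);
  r = push(r, 7);
  r = delete_first(r, 7);
  return count(r, 7);
}
```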

It may help to have visibility of your client code you are using for testing.

anyc (Contributor, Author) commented Feb 12, 2025

This is done in coap_session_server_keepalive_failed() which is invoked when there is a server side detection of a keepalive failure so should not be done in the application event handler.

Ok, I will remove it again.

as if there are multiple subscriptions active for the same resource, and this logic is only deleting the first one per resource.

This is indeed what is happening. The client in this case is a development tool that observes all available resources in one window and the relevant resources for a chosen topic in another, and (due to limited development time) every window observes the resources itself. I will try it again with only a single observation per resource tomorrow. Thank you!

mrdeep1 (Collaborator) commented Feb 12, 2025

Multiple observations per resource need to be tested out (as well as the single-observer case).

I have just pushed a minor change which should handle multiple observations per resource for further testing.

anyc (Contributor, Author) commented Feb 13, 2025

With the new commit, I see a lot more subscription removals after the keepalive failure but the session still stays alive:

May 14 21:04:37.316 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:58650 (if7) DTLS: netif: sent   44 bytes
May 14 21:04:37.316 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:58650 (if7) DTLS: dtls:  sent   15 bytes
v:1 t:ACK c:2.31 i:9bdc {930000000001e4} [ Block1:146/M/1024 ]
May 14 21:06:19.672 WARN ** [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:58650 (if7) DTLS: mid=0xdb03: give up after 4 attempts
May 14 21:06:19.672 DEBG ***EVENT: COAP_EVENT_KEEPALIVE_FAILURE
May 14 21:06:19.675 DEBG removed subscription 0x73131090 with token '9b' key 0x5622f7e9
May 14 21:06:19.676 DEBG removed subscription 0x73131150 with token '99' key 0xb25c522e
^CMay 14 21:51:39.655 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:58650 (if7) DTLS: netif: sent   35 bytes
May 14 21:51:39.655 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:58650 (if7) DTLS: dtls:  sent    6 bytes
v:1 t:NON c:4.04 i:db0b {01cb} [ ]
May 14 21:51:39.656 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:58650 (if7) DTLS: netif: sent   35 bytes
May 14 21:51:39.656 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:58650 (if7) DTLS: dtls:  sent    6 bytes
v:1 t:NON c:4.04 i:db0c {0162} [ ]
May 14 21:51:39.657 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:58650 (if7) DTLS: netif: sent   34 bytes
May 14 21:51:39.658 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:58650 (if7) DTLS: dtls:  sent    5 bytes
v:1 t:NON c:4.04 i:db0d {18} [ ]
May 14 21:51:39.658 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:58650 (if7) DTLS: netif: sent   34 bytes
May 14 21:51:39.659 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:58650 (if7) DTLS: dtls:  sent    5 bytes
v:1 t:NON c:4.04 i:db0e {0e} [ ]
May 14 21:51:39.659 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:58650 (if7) DTLS: netif: sent   34 bytes
May 14 21:51:39.660 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:58650 (if7) DTLS: dtls:  sent    5 bytes
v:1 t:NON c:4.04 i:db0f {0a} [ ]
May 14 21:51:39.660 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:58650 (if7) DTLS: netif: sent   34 bytes
May 14 21:51:39.661 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:58650 (if7) DTLS: dtls:  sent    5 bytes
v:1 t:NON c:4.04 i:db10 {08} [ ]
May 14 21:51:39.661 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:58650 (if7) DTLS: netif: sent   34 bytes
May 14 21:51:39.662 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:58650 (if7) DTLS: dtls:  sent    5 bytes
v:1 t:NON c:4.04 i:db11 {06} [ ]
May 14 21:51:39.662 DEBG removed subscription 0x73135cb0 with token '01cb' key 0xe5753faa
May 14 21:51:39.663 DEBG removed subscription 0x73133e20 with token '0162' key 0x3cfb62ce
May 14 21:51:39.663 DEBG removed subscription 0x7312d220 with token '18' key 0xce874583
May 14 21:51:39.663 DEBG removed subscription 0x7312d940 with token '0e' key 0xf503246b
May 14 21:51:39.664 DEBG removed subscription 0x7312c530 with token '0a' key 0xbc979201
May 14 21:51:39.664 DEBG removed subscription 0x7312c710 with token '08' key 0xef47a79f
May 14 21:51:39.664 DEBG *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:58650 (if7) DTLS: netif: sent   31 bytes
May 14 21:51:39.665 INFO *  [fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:58650 (if7) DTLS: SSL3 alert write:warning:close notify
May 14 21:51:39.668 DEBG ***EVENT: COAP_EVENT_DTLS_CLOSED
DTLS [fe80::20f:69ff:feff:d6c3]:5684 <- [fe80::c02b:fc8f:99e5:337c]:58650 closed
May 14 21:51:39.669 DEBG ***[fe80::20f:69ff:feff:d6c3]:5684 <-> [fe80::c02b:fc8f:99e5:337c]:58650 (if7) DTLS: session 0x72118680: closed
May 14 21:51:39.669 DEBG removed subscription 0x7312c980 with token '06' key 0xfcbe1df9

It looks like the remaining subscriptions are for resources that are handled by the unknown resource handler. If I am not mistaken, though, the handler creates resources dynamically for these requests.

mrdeep1 (Collaborator) commented Feb 13, 2025

It looks like the last subscriptions are resources that are handled by the unknown resource handler.
If I am not mistaken, the handler creates resources dynamically for these requests though.

Interesting thought. However, when these resources are created and then added, they go into the list in the same way as the initial resources, via coap_add_resource(). Note that coap_resource_unknown_init2() should not be used when creating these new resources from the unknown resource handler, just coap_resource_init().

mrdeep1 (Collaborator) commented Feb 13, 2025

I have just pushed a change that reports the resource URI for the subscription being deleted, to try to help establish why they are not all getting deleted on keepalive failure for the session.

anyc (Contributor, Author) commented Feb 13, 2025

As the initial request already activates observation, is this handled in a special way?

I found this note: "It is not possible to observe the unknown_resource"


but in the following it looks like coap_add_observer() is called nonetheless. Maybe the subscription is added to the unknown_resource of the context?

anyc (Contributor, Author) commented Feb 13, 2025

The remaining subscriptions are all for paths that were initially handled by the unknown resource handler.

mrdeep1 (Collaborator) commented Feb 13, 2025

The remaining subscriptions are all for paths that were initially handled by the unknown resource handler.

Good to establish that.

I see that coap_resource_set_get_observable() does not actually check if this is being used for the unknown resource. Easy to fix that to stop it happening, but is that the right answer?

Should we allow observation of the unknown resource? We do now track the subscription by uri?query, so that when the (unknown resource) observe fires, the pseudo request packet gets handled by the unknown resource handler, but for the correct uri?query.

However, if the unknown resource handler spawns off a new resource matching the uri while the (unknown resource) has observable set, then each time an (unknown resource) observe is triggered, a new resource is created: one reason for not being able to observe the unknown resource.
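The spawning hazard can be sketched with a toy model (plain C, not libcoap code; a real handler could guard creation by first checking for an existing resource, e.g. via coap_get_resource_from_uri_path()):

```c
#include <assert.h>
#include <string.h>

/* Toy resource table standing in for the context's resource list. */
#define MAX_RES 16
static const char *res_table[MAX_RES];
static int res_count;

int res_exists(const char *uri) {
  for (int i = 0; i < res_count; i++)
    if (strcmp(res_table[i], uri) == 0)
      return 1;
  return 0;
}

/* Unknown handler that spawns a resource unconditionally - the
 * hazardous pattern described above. */
void handler_naive(const char *uri) {
  if (res_count < MAX_RES)
    res_table[res_count++] = uri;
}

/* Guarded variant: create the resource only once per uri. */
void handler_guarded(const char *uri) {
  if (!res_exists(uri))
    handler_naive(uri);
}

/* Each (unknown resource) observe firing re-invokes the handler with
 * the same uri; return how many resources exist afterwards. */
int resources_after_triggers(void (*h)(const char *), int triggers) {
  res_count = 0;
  for (int i = 0; i < triggers; i++)
    h("my/long/path");
  return res_count;
}
```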

I am inclined to disable coap_resource_set_get_observable() for unknown resources.

mrdeep1 (Collaborator) commented Feb 13, 2025

See #1583 for disabling coap_resource_set_get_observable() for unknown resources.

anyc (Contributor, Author) commented Feb 13, 2025

In my case it would be good if the newly created resource took over the subscriptions from the unknown resource. On my device, I use one CoAP server that handles all incoming traffic and forwards certain URIs to other processes on the device, which handle requests that require elevated privileges. To avoid close coupling of these parts (especially during boot), the CoAP server/reverse proxy queries the .well-known/core of the other processes during the first unknown-handler call and then decides to which process the original request has to be forwarded.

If you think this is doable I can try to come up with an implementation. If an observe can only be triggered by coap_resource_notify_observers(), maybe disabling this function for the unknown resource would be good.

mrdeep1 (Collaborator) commented Feb 13, 2025

OK. So, if I understand what you are saying: an Observe request comes in, handle_request() sees it is for the Unknown Handler, (currently) sets up the subscription, and calls the unknown request handler, which creates a new resource so that future requests go to the appropriate back-end server, not via the unknown handler.

You would like that subscription currently associated with the unknown resource to be moved across to the newly created resource. Correct?

I guess my question is - who should be generating the unsolicited observe response - the reverse proxy or the back end server?
I think I would be expecting this to be the back end server.

I would be expecting the Observe Establish to be passed through the reverse proxy to the backend server without the reverse-proxy having to know about the observe.

If an observe can only be triggered by coap_resource_notify_observers(), maybe disabling this function
for the unknown resource would be good.

I need to double-check in the code that observes are only triggered by coap_resource_notify_observers().

anyc (Contributor, Author) commented Feb 13, 2025

You would like that subscription currently associated with the unknown resource to be moved across to the newly created resource. Correct?

Yes. This would also solve this last issue, I guess?

I guess my question is - who should be generating the unsolicited observe response - the reverse proxy or the back end server? I think I would be expecting this to be the back end server.

I would be expecting the Observe Establish to be passed through the reverse proxy to the backend server without the reverse-proxy having to know about the observe.

At the moment, the reverse proxy uses async to "delay" the response to the client and then sends its own (mostly copied) request to the backend. When the response is received, it triggers the async and returns the response (code and payload) to the client. If the initial request has Observe set, the request to the backend also has it set, and the following responses are "forwarded" to the client using coap_resource_notify_observers(). This part of the server is not well tested yet, but it seems to work. Most of the required functionality is still implemented in the server itself at the moment.

mrdeep1 (Collaborator) commented Feb 13, 2025

Yes. This would also solve this last issue, I guess?

Yes. I am not sure yet how to trap the adding of a new resource in the application's unknown handler so the subscription can be moved across. It may be that the application has to add in the subscription request.

At the moment, the reverse proxy uses async to "delay" the response

Have you looked at coap_proxy_forward_request() to do all of this handling? It does not modify the request PDU; it is left 'as is'. But there is nothing to stop you creating your own (mostly copied) PDU to use as the request parameter. Then the reverse-proxy response handler just needs to call coap_proxy_forward_response() for everything to be sent back to the client. I am pretty sure that this also works for unsolicited observe responses from the back-end server.

See coap_proxy_forward_request(3). It seems the man page is broken; I will get that fixed.

Then there would be no need to call coap_resource_notify_observers() on the reverse-proxy.

anyc (Contributor, Author) commented Feb 14, 2025

Yes. Not sure yet how to trap the adding of a new resource in application's unknown handler to move the subscription across. It may be that the application has to add in the subscription request.

Maybe some sort of coap_resource_init_from_unknown() kind of function? Though I could also see access to the subscription information being useful in other cases, like showing internal usage statistics in the logs from time to time, so an application does not have to keep track of it itself. Or can it even keep track itself? I'll have to check whether the handler is called for an "unsubscribe" PDU. But I have to leave now.

Have you looked at coap_proxy_forward_request() to do all of this handling?

Ah, I saw it, but I thought it was maybe for another use case, as I did not understand what I should put into the server_list parameter, and I "simply" went with the async approach.

Thank you for your help!

mrdeep1 (Collaborator) commented Feb 17, 2025

#1586 should now clear down all observations on keepalive timeout.

mrdeep1 (Collaborator) commented Feb 17, 2025

I did not understand what I should put into server_list parameter

If you have internal servers A, B, C and D, then you will need 4 different server_list entries, each describing how to connect to that server.

So for server A, it could be reverse_proxyA:

static const coap_uri_t uriA = {
  {sizeof("1.2.3.4")-1, (const uint8_t*)"1.2.3.4"},  /* host */
  5683,                                              /* port */
  {0, NULL}, {0, NULL},                              /* path, query */
  COAP_URI_SCHEME_COAP
};
static coap_proxy_server_t redirect_serverA[] = {{uriA, NULL, NULL, NULL}};
static coap_proxy_server_list_t reverse_proxyA = {
  redirect_serverA, 1, 0, COAP_PROXY_REVERSE_STRIP, 0, 1000
};

etc.

I have updated the documentation in #1587 to make usage clearer as well as fixed an issue with COAP_PROXY_REVERSE_STRIP.

anyc (Contributor, Author) commented Feb 19, 2025

The session is now completely shut down after a COAP_EVENT_KEEPALIVE_FAILURE, thank you!

I read the new proxy documentation and it is much clearer to me now, thanks. I received some additional tasks, so it might take a few days until I can replace my async logic with these proxy functions.

anyc closed this as completed Feb 19, 2025