Reported here: https://github.com/FRRouting/frr/issues/10826

From: frog-request@lists.frrouting.org
To: frog@lists.frrouting.org
Date: Sun, 03 Apr 2022 12:00:02 +0000
Subject: frog Digest, Vol 61, Issue 2

Today's Topics:

   1. BGPD hanging in FRR 8.2.2 (Philip Smith)

----------------------------------------------------------------------

Message: 1
Date: Sat, 2 Apr 2022 20:47:42 +0100
From: Philip Smith <philip@nsrc.org>
To: frog@lists.frrouting.org
Subject: [FROG] BGPD hanging in FRR 8.2.2
Message-ID: <54869a9a-07db-2033-cc16-c0b8a6612060@nsrc.org>
Content-Type: text/plain; charset=UTF-8; format=flowed

Hi everyone,

Just following up on my previous note about BGPD hanging in FRR 8.2.2. I now have more info to share.

As background, I've got around 60 BGP feeds total in 30 different "views", to form a route collector for analysis work I'm doing of the global R&E routing table. This hang seems to have a period of 5-7 days. Using FRR 8.2.2 on Ubuntu 20.04. Not had any issue with FRR 8.1.0; this only started with FRR 8.2.2.

The latest hang earlier today allowed a colleague to grab debug info which I hope will help.

/var/log/frr/frr.log shows entries like this:

Apr 2 11:46:42 frr watchfrr[52904]: [T58XM-TP956][EC 268435457] bgpd state -> unresponsive : no response yet to ping sent 90 seconds ago
Apr 2 11:46:42 frr watchfrr[52904]: [YFT0P-5Q5YX] Forked background command [pid 1674696]: /usr/lib/frr/watchfrr.sh restart bgpd
Apr 2 11:47:02 frr watchfrr[52904]: [ZE9RA-19PS5] restart bgpd child process 1674696 still running after 20 seconds, sending signal 15
Apr 2 11:47:02 frr watchfrr[52904]: [SK7QP-A2GT9] restart bgpd process 1674696 terminated due to signal 15
<snip>
Apr 2 14:18:03 frr watchfrr[52904]: [YFT0P-5Q5YX] Forked background command [pid 1697956]: /usr/lib/frr/watchfrr.sh restart bgpd
Apr 2 14:18:23 frr watchfrr[52904]: [ZE9RA-19PS5] restart bgpd child process 1697956 still running after 20 seconds, sending signal 15
Apr 2 14:18:23 frr watchfrr[52904]: [SK7QP-A2GT9] restart bgpd process 1697956 terminated due to signal 15

which just repeat every 10 minutes or so.

A few hours earlier I was getting:

Apr 1 22:53:19 frr bgpd[52925]: [YZRX4-ZXG0C][EC 100663315] Thread Starvation: {(thread *)0x5566a35c01a0 arg=0x556682b31da0 timer r=-5.940 bgp_announce_route_timer_expired() &paf->t_announce_route from bgpd/bgp_route.c:4763} was scheduled to pop greater than 4s ago
Apr 1 23:24:34 frr bgpd[52925]: [YZRX4-ZXG0C][EC 100663315] Thread Starvation: {(thread *)0x5567954b16c0 arg=0x556682f14870 timer r=-5.224 bgp_announce_route_timer_expired() &paf->t_announce_route from bgpd/bgp_route.c:4763} was scheduled to pop greater than 4s ago

Trying to connect by vtysh prints message of day, but never a command prompt. Same if trying to connect via telnet. The only way out is a kill -9 of the BGPD process, followed by a "systemctl restart frr".

The process stack for bgpd shows:

root@frr:~# cat /proc/52925/stack
[<0>] futex_wait_queue_me+0xbb/0x120
[<0>] futex_wait+0x105/0x290
[<0>] do_futex+0x157/0x4d0
[<0>] __x64_sys_futex+0x13f/0x170
[<0>] do_syscall_64+0x57/0x190
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Thread debugging shows:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__pthread_clockjoin_ex (threadid=139670697043712, thread_return=0x0, clockid=<optimized out>, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:145
145 pthread_join_common.c: No such file or directory.
(gdb) bt
#0 __pthread_clockjoin_ex (threadid=139670697043712, thread_return=0x0, clockid=<optimized out>, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:145
#1 0x00007f07b1f3d985 in ?? () from /lib/x86_64-linux-gnu/librtr.so.0
#2 0x00007f07b1f38dc1 in rtr_mgr_stop () from /lib/x86_64-linux-gnu/librtr.so.0
#3 0x00007f07b1f53ef0 in ?? () from /usr/lib/x86_64-linux-gnu/frr/modules/bgpd_rpki.so
#4 0x00007f07b1f53f7d in ?? () from /usr/lib/x86_64-linux-gnu/frr/modules/bgpd_rpki.so
#5 0x00007f07b1f543ca in ?? () from /usr/lib/x86_64-linux-gnu/frr/modules/bgpd_rpki.so
#6 0x00007f07b2586621 in thread_call () from /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0
#7 0x00007f07b2540198 in frr_run () from /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0
#8 0x00005566800b6678 in main ()

I've got about 2.5Mbytes of strace which I'll happily unicast to whoever would like to have a look at it. It looks very repetitive/boring to my non-developer eye, like something's got stuck waiting for something else.

BTW, this is what's running (after I killed and restarted), including command line options:

1707406 ? S<s 0:02 /usr/lib/frr/watchfrr -d -F traditional zebra bgpd staticd
1707423 ? S<sl 0:01 /usr/lib/frr/zebra -d -F traditional -A 127.0.0.1 -s 90000000
1707428 ? S<sl 17:03 /usr/lib/frr/bgpd -d -F traditional -Z -M rpki
1707435 ? S<s 0:00 /usr/lib/frr/staticd -d -F traditional -A 127.0.0.1

Any ideas? I'd hate to revert to 8.1 but...

philip
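
A note on the backtrace in the message: frame #0 shows the main thread sitting in __pthread_clockjoin_ex underneath rtr_mgr_stop, that is, waiting to join another thread, so the stacks of the other bgpd threads are probably the more informative ones. Below is a minimal sketch, assuming a Linux /proc layout and root access, of how the kernel stack of every thread could be collected in one pass instead of only the one from /proc/<pid>/stack. The function names thread_stacks and format_stacks and the proc_root parameter are hypothetical, introduced here only for illustration; they are not part of FRR.

```python
from pathlib import Path


def thread_stacks(pid, proc_root="/proc"):
    """Return {tid: (thread_name, kernel_stack_text)} for each thread of pid.

    Reads <proc_root>/<pid>/task/<tid>/comm and .../stack. Reading `stack`
    normally needs root; unreadable files are noted rather than raised.
    proc_root is a hypothetical parameter that exists only so the function
    can be pointed at a copy of the tree.
    """
    result = {}
    task_dir = Path(proc_root) / str(pid) / "task"
    # Every entry under task/ is named after a numeric thread ID.
    for entry in sorted(task_dir.iterdir(), key=lambda p: int(p.name)):
        tid = int(entry.name)
        try:
            name = (entry / "comm").read_text().strip()
        except OSError:
            name = "?"
        try:
            stack = (entry / "stack").read_text()
        except OSError as exc:
            stack = f"<unreadable: {exc.strerror}>\n"
        result[tid] = (name, stack)
    return result


def format_stacks(stacks):
    """Render the mapping from thread_stacks() as plain text."""
    parts = []
    for tid, (name, stack) in stacks.items():
        parts.append(f"== thread {tid} ({name}) ==\n{stack.rstrip()}")
    return "\n\n".join(parts)
```

With the hung process from the message this would be called as print(format_stacks(thread_stacks(52925))). The user-space counterpart inside gdb is "thread apply all bt", which prints a backtrace for every thread rather than only the current one.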