BGPD hanging in FRR 8.2.2
Hi everyone,

Just following up on my previous note about BGPD hanging in FRR 8.2.2. I now have more info to share.

As background, I've got around 60 BGP feeds in total, in 30 different "views", forming a route collector for analysis work I'm doing on the global R&E routing table.

The hang seems to recur every 5-7 days. This is FRR 8.2.2 on Ubuntu 20.04. I had no issues with FRR 8.1.0; this only started with FRR 8.2.2.

The latest hang, earlier today, allowed a colleague to grab debug info which I hope will help.

/var/log/frr/frr.log shows entries like this:

Apr 2 11:46:42 frr watchfrr[52904]: [T58XM-TP956][EC 268435457] bgpd state -> unresponsive : no response yet to ping sent 90 seconds ago
Apr 2 11:46:42 frr watchfrr[52904]: [YFT0P-5Q5YX] Forked background command [pid 1674696]: /usr/lib/frr/watchfrr.sh restart bgpd
Apr 2 11:47:02 frr watchfrr[52904]: [ZE9RA-19PS5] restart bgpd child process 1674696 still running after 20 seconds, sending signal 15
Apr 2 11:47:02 frr watchfrr[52904]: [SK7QP-A2GT9] restart bgpd process 1674696 terminated due to signal 15
<snip>
Apr 2 14:18:03 frr watchfrr[52904]: [YFT0P-5Q5YX] Forked background command [pid 1697956]: /usr/lib/frr/watchfrr.sh restart bgpd
Apr 2 14:18:23 frr watchfrr[52904]: [ZE9RA-19PS5] restart bgpd child process 1697956 still running after 20 seconds, sending signal 15
Apr 2 14:18:23 frr watchfrr[52904]: [SK7QP-A2GT9] restart bgpd process 1697956 terminated due to signal 15

which just repeat every 10 minutes or so.
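
For anyone who wants to put a number on "every 10 minutes or so", here is a rough Python sketch that pulls the watchfrr restart attempts out of frr.log and computes the gap between consecutive ones. It is not an FRR tool. The function names, the regular expression and the year argument are my own assumptions: syslog timestamps carry no year, so the caller has to supply one. The two sample lines are copied from the log excerpt above and sit on either side of the <snip>, so the gap between them is not the repeat interval.

```python
import re
from datetime import datetime

# Matches watchfrr "Forked background command ... restart bgpd" entries.
# The pattern is an assumption based only on the log lines quoted above.
RESTART_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+\S+\s+watchfrr\[\d+\]:"
    r".*Forked background command \[pid (?P<pid>\d+)\]: \S+ restart bgpd"
)


def restart_attempts(lines, year):
    """Return a list of (timestamp, pid) for each bgpd restart attempt.

    `year` must be supplied by the caller because syslog omits it.
    """
    attempts = []
    for line in lines:
        m = RESTART_RE.match(line)
        if m:
            ts = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
            attempts.append((ts, int(m["pid"])))
    return attempts


def gaps_in_minutes(attempts):
    """Return the time between consecutive restart attempts, in minutes."""
    return [
        (later[0] - earlier[0]).total_seconds() / 60
        for earlier, later in zip(attempts, attempts[1:])
    ]


# Two entries copied from the excerpt above (they straddle the <snip>).
SAMPLE = [
    "Apr 2 11:46:42 frr watchfrr[52904]: [YFT0P-5Q5YX] Forked background "
    "command [pid 1674696]: /usr/lib/frr/watchfrr.sh restart bgpd",
    "Apr 2 14:18:03 frr watchfrr[52904]: [YFT0P-5Q5YX] Forked background "
    "command [pid 1697956]: /usr/lib/frr/watchfrr.sh restart bgpd",
]

if __name__ == "__main__":
    # The year is an assumption; adjust it to match the log being read.
    found = restart_attempts(SAMPLE, year=2022)
    for minutes in gaps_in_minutes(found):
        print(f"{minutes:.1f} minutes between restart attempts")
```

To run it over the whole log, pass open("/var/log/frr/frr.log") as the lines argument instead of SAMPLE.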

A few hours earlier I was getting:

Apr 1 22:53:19 frr bgpd[52925]: [YZRX4-ZXG0C][EC 100663315] Thread Starvation: {(thread *)0x5566a35c01a0 arg=0x556682b31da0 timer r=-5.940 bgp_announce_route_timer_expired() &paf->t_announce_route from bgpd/bgp_route.c:4763} was scheduled to pop greater than 4s ago
Apr 1 23:24:34 frr bgpd[52925]: [YZRX4-ZXG0C][EC 100663315] Thread Starvation: {(thread *)0x5567954b16c0 arg=0x556682f14870 timer r=-5.224 bgp_announce_route_timer_expired() &paf->t_announce_route from bgpd/bgp_route.c:4763} was scheduled to pop greater than 4s ago

Trying to connect with vtysh prints the message of the day, but never a command prompt. The same happens when connecting via telnet. The only way out is a kill -9 of the bgpd process, followed by a "systemctl restart frr".

The process stack for bgpd shows:

root@frr:~# cat /proc/52925/stack
[<0>] futex_wait_queue_me+0xbb/0x120
[<0>] futex_wait+0x105/0x290
[<0>] do_futex+0x157/0x4d0
[<0>] __x64_sys_futex+0x13f/0x170
[<0>] do_syscall_64+0x57/0x190
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Thread debugging shows:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__pthread_clockjoin_ex (threadid=139670697043712, thread_return=0x0, clockid=<optimized out>, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:145
145 pthread_join_common.c: No such file or directory.
(gdb) bt
#0 __pthread_clockjoin_ex (threadid=139670697043712, thread_return=0x0, clockid=<optimized out>, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:145
#1 0x00007f07b1f3d985 in ?? () from /lib/x86_64-linux-gnu/librtr.so.0
#2 0x00007f07b1f38dc1 in rtr_mgr_stop () from /lib/x86_64-linux-gnu/librtr.so.0
#3 0x00007f07b1f53ef0 in ?? () from /usr/lib/x86_64-linux-gnu/frr/modules/bgpd_rpki.so
#4 0x00007f07b1f53f7d in ?? () from /usr/lib/x86_64-linux-gnu/frr/modules/bgpd_rpki.so
#5 0x00007f07b1f543ca in ?? () from /usr/lib/x86_64-linux-gnu/frr/modules/bgpd_rpki.so
#6 0x00007f07b2586621 in thread_call () from /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0
#7 0x00007f07b2540198 in frr_run () from /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0
#8 0x00005566800b6678 in main ()

I've got about 2.5 Mbytes of strace output which I'll happily unicast to whoever would like to have a look at it. It looks very repetitive/boring to my non-developer eye, like something's got stuck waiting for something else.

BTW, this is what's running (after I killed and restarted), including command line options:

1707406 ? S<s 0:02 /usr/lib/frr/watchfrr -d -F traditional zebra bgpd staticd
1707423 ? S<sl 0:01 /usr/lib/frr/zebra -d -F traditional -A 127.0.0.1 -s 90000000
1707428 ? S<sl 17:03 /usr/lib/frr/bgpd -d -F traditional -Z -M rpki
1707435 ? S<s 0:00 /usr/lib/frr/staticd -d -F traditional -A 127.0.0.1

Any ideas? I'd hate to revert to 8.1 but...

philip
--
Philip Smith