[FROG] BGPD hanging in FRR 8.2.2
Philip Smith
philip at nsrc.org
Sat Apr 2 19:47:42 UTC 2022
Hi everyone,
Just following up on my previous note about BGPD hanging in FRR 8.2.2. I
now have more info to share.
As background, I've got around 60 BGP feeds in total across 30 different
"views", forming a route collector for analysis work I'm doing on the
global R&E routing table.
The hang seems to recur every 5-7 days. I'm running FRR 8.2.2 on Ubuntu
20.04. I had no issues with FRR 8.1.0; this only started with FRR 8.2.2.
The latest hang earlier today allowed a colleague to grab debug info
which I hope will help.
/var/log/frr/frr.log shows entries like this:
Apr 2 11:46:42 frr watchfrr[52904]: [T58XM-TP956][EC 268435457] bgpd state -> unresponsive : no response yet to ping sent 90 seconds ago
Apr 2 11:46:42 frr watchfrr[52904]: [YFT0P-5Q5YX] Forked background command [pid 1674696]: /usr/lib/frr/watchfrr.sh restart bgpd
Apr 2 11:47:02 frr watchfrr[52904]: [ZE9RA-19PS5] restart bgpd child process 1674696 still running after 20 seconds, sending signal 15
Apr 2 11:47:02 frr watchfrr[52904]: [SK7QP-A2GT9] restart bgpd process 1674696 terminated due to signal 15
<snip>
Apr 2 14:18:03 frr watchfrr[52904]: [YFT0P-5Q5YX] Forked background command [pid 1697956]: /usr/lib/frr/watchfrr.sh restart bgpd
Apr 2 14:18:23 frr watchfrr[52904]: [ZE9RA-19PS5] restart bgpd child process 1697956 still running after 20 seconds, sending signal 15
Apr 2 14:18:23 frr watchfrr[52904]: [SK7QP-A2GT9] restart bgpd process 1697956 terminated due to signal 15
which just repeat every 10 minutes or so.
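For anyone wanting to check how regularly watchfrr retries, here is a minimal Python sketch that pulls the "Forked background command" entries out of the log and computes the gap between successive attempts. It assumes each entry occupies a single line in /var/log/frr/frr.log and that the timestamps carry no year, so the year is passed in. The names `restart_attempts` and `gaps_in_seconds` are my own, not part of FRR.

```python
import re
from datetime import datetime

# Syslog-style prefix followed by watchfrr's "Forked background command" message.
FORK_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+watchfrr\[\d+\]:.*"
    r"Forked background command \[pid (?P<pid>\d+)\]: (?P<cmd>.+)$"
)


def restart_attempts(lines, year=2022):
    """Return (timestamp, pid, command) for each watchfrr restart attempt."""
    attempts = []
    for line in lines:
        m = FORK_RE.match(line.strip())
        if m:
            # The log timestamp has no year, so the caller supplies one.
            ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
            attempts.append((ts, int(m["pid"]), m["cmd"]))
    return attempts


def gaps_in_seconds(attempts):
    """Seconds between each pair of consecutive restart attempts."""
    return [(b[0] - a[0]).total_seconds() for a, b in zip(attempts, attempts[1:])]
```

Feeding it the whole log file, e.g. `gaps_in_seconds(restart_attempts(open("/var/log/frr/frr.log")))`, should show whether the retries really are evenly spaced.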
A few hours earlier I was getting:
Apr 1 22:53:19 frr bgpd[52925]: [YZRX4-ZXG0C][EC 100663315] Thread Starvation: {(thread *)0x5566a35c01a0 arg=0x556682b31da0 timer r=-5.940 bgp_announce_route_timer_expired() &paf->t_announce_route from bgpd/bgp_route.c:4763} was scheduled to pop greater than 4s ago
Apr 1 23:24:34 frr bgpd[52925]: [YZRX4-ZXG0C][EC 100663315] Thread Starvation: {(thread *)0x5567954b16c0 arg=0x556682f14870 timer r=-5.224 bgp_announce_route_timer_expired() &paf->t_announce_route from bgpd/bgp_route.c:4763} was scheduled to pop greater than 4s ago
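To see whether the starvation warnings get worse in the run-up to a hang, the entries can be summarised with a small Python sketch like the one below. It assumes each warning is on a single line in the log, and that the `r=` field is the timer's remaining time in seconds, so a negative value means it is overdue; that reading is my assumption from the message text. The function name `starvation_events` is hypothetical.

```python
import re

# Captures the r= value, the callback name and the scheduling site
# from a bgpd "Thread Starvation" warning.
STARVE_RE = re.compile(
    r"Thread Starvation: \{.*?timer r=(?P<r>-?\d+\.\d+) "
    r"(?P<func>\w+)\(\).*? from (?P<src>\S+?)\} was scheduled"
)


def starvation_events(lines):
    """Return (r_seconds, callback, source_location) per starvation warning."""
    events = []
    for line in lines:
        m = STARVE_RE.search(line)
        if m:
            events.append((float(m["r"]), m["func"], m["src"]))
    return events
```

Sorting the result by the first element would show the most overdue timers first.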
Connecting with vtysh prints the message of the day but never reaches a
command prompt. The same happens when connecting via telnet.
The only way out is a kill -9 of the BGPD process, followed by a
"systemctl restart frr".
The process stack for bgpd shows:
root@frr:~# cat /proc/52925/stack
[<0>] futex_wait_queue_me+0xbb/0x120
[<0>] futex_wait+0x105/0x290
[<0>] do_futex+0x157/0x4d0
[<0>] __x64_sys_futex+0x13f/0x170
[<0>] do_syscall_64+0x57/0x190
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
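/proc/<pid>/stack only covers the main thread, and bgpd is multi-threaded, so the other threads' kernel stacks may be worth collecting next time. A minimal Python sketch follows, assuming it runs as root on a kernel that exposes /proc/<pid>/task/<tid>/stack. The function name `thread_stacks` and the `proc_root` parameter are my own, the latter only there so the function can be tried against a scratch directory.

```python
from pathlib import Path


def thread_stacks(pid, proc_root="/proc"):
    """Map thread id -> kernel stack text for every thread of `pid`."""
    stacks = {}
    task_dir = Path(proc_root) / str(pid) / "task"
    for tid_dir in sorted(task_dir.iterdir(), key=lambda p: int(p.name)):
        try:
            stacks[int(tid_dir.name)] = (tid_dir / "stack").read_text()
        except OSError:
            # The thread exited in the meantime, or we lack permission.
            continue
    return stacks
```

For the hung process above this would be called as `thread_stacks(52925)`, printing each entry to see which threads are also parked in a futex wait.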
Thread debugging shows:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__pthread_clockjoin_ex (threadid=139670697043712, thread_return=0x0,
clockid=<optimized out>, abstime=<optimized out>,
block=<optimized out>) at pthread_join_common.c:145
145 pthread_join_common.c: No such file or directory.
(gdb) bt
#0 __pthread_clockjoin_ex (threadid=139670697043712, thread_return=0x0,
clockid=<optimized out>, abstime=<optimized out>,
block=<optimized out>) at pthread_join_common.c:145
#1 0x00007f07b1f3d985 in ?? () from /lib/x86_64-linux-gnu/librtr.so.0
#2 0x00007f07b1f38dc1 in rtr_mgr_stop () from /lib/x86_64-linux-gnu/librtr.so.0
#3 0x00007f07b1f53ef0 in ?? () from /usr/lib/x86_64-linux-gnu/frr/modules/bgpd_rpki.so
#4 0x00007f07b1f53f7d in ?? () from /usr/lib/x86_64-linux-gnu/frr/modules/bgpd_rpki.so
#5 0x00007f07b1f543ca in ?? () from /usr/lib/x86_64-linux-gnu/frr/modules/bgpd_rpki.so
#6 0x00007f07b2586621 in thread_call () from /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0
#7 0x00007f07b2540198 in frr_run () from /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0
#8 0x00005566800b6678 in main ()
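The backtrace above covers only the thread that is blocked in the join under rtr_mgr_stop(); a backtrace of every thread would presumably show what the thread being joined is doing. A minimal Python sketch for capturing that on the next hang, assuming gdb is installed and the script runs as root. It uses gdb's standard `-p`, `-batch` and `-ex` options with `thread apply all bt`; the function names are my own.

```python
import subprocess


def gdb_all_threads_cmd(pid):
    """Build a non-interactive gdb command that backtraces every thread."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_all_threads(pid, timeout=60):
    """Attach to `pid`, print all thread backtraces, detach, return the text."""
    result = subprocess.run(
        gdb_all_threads_cmd(pid),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout
```

Running `dump_all_threads(52925)` before the kill -9 would capture the state of the librtr threads as well; with debug symbols installed, the `??` frames should resolve to function names.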
I've got about 2.5 MB of strace output which I'll happily unicast to
whoever would like to have a look at it. It looks very repetitive/boring
to my non-developer eye, like something's got stuck waiting for something
else.
BTW, this is what's running (after I killed and restarted), including
command line options:
1707406 ? S<s 0:02 /usr/lib/frr/watchfrr -d -F traditional zebra bgpd staticd
1707423 ? S<sl 0:01 /usr/lib/frr/zebra -d -F traditional -A 127.0.0.1 -s 90000000
1707428 ? S<sl 17:03 /usr/lib/frr/bgpd -d -F traditional -Z -M rpki
1707435 ? S<s 0:00 /usr/lib/frr/staticd -d -F traditional -A 127.0.0.1
Any ideas? I'd hate to revert to 8.1 but...
philip