Ran into a strange issue when I upgraded frr6 to frr7 on a FreeBSD 11 box. We have a LOT of bgp peers terminated (650, not all established) with a routing table of fewer than 500 prefixes. Everything was working just fine on 6, but after upgrading to 7, bgp would die with signal 6 (SIGABRT) a few minutes after startup. Not much in the logs:

Jan 16 12:25:33 kit-b zebra[88629]: [EC 4043309122] Client 'bgp' encountered an error and is shutting down.
Jan 16 12:25:33 kit-b zebra[88629]: [EC 4043309122] Client 'vnc' encountered an error and is shutting down.
Jan 16 12:25:33 kit-b zebra[88629]: release_daemon_table_chunks: Released 0 table chunks
Jan 16 12:25:33 kit-b zebra[88629]: zebra/zebra_ptm.c:1348 failed to find process pid registration
Jan 16 12:25:33 kit-b zebra[88629]: client 11 disconnected 66 bgp routes removed from the rib
Jan 16 12:25:33 kit-b zebra[88629]: release_daemon_table_chunks: Released 0 table chunks
Jan 16 12:25:33 kit-b zebra[88629]: client 24 disconnected 0 vnc routes removed from the rib
Jan 16 12:25:33 kit-b zebra[88629]: [EC 100663303] kernel_rtm: 0.0.0.0/0: rtm_write() unexpectedly returned -4 for command RTM_DELETE

It would also generate a lot of kernel messages while the daemon was running, such as

Jan 16 16:00:59 kit-b kernel: sonewconn: pcb 0xfffff801e65123a0: Listen queue overflow: 193 already in queue awaiting acceptance (622 occurrences)
Jan 16 16:01:59 kit-b kernel: sonewconn: pcb 0xfffff801e65123a0: Listen queue overflow: 193 already in queue awaiting acceptance (557 occurrences)
Jan 16 16:02:59 kit-b kernel: sonewconn: pcb 0xfffff801e65123a0: Listen queue overflow: 193 already in queue awaiting acceptance (622 occurrences)
Jan 16 16:03:59 kit-b kernel: sonewconn: pcb 0xfffff801e65123a0: Listen queue overflow: 193 already in queue awaiting acceptance (556 occurrences)

that are not generated when running frr6.

The problem version was built from ports, 7.5_1. The only option I used was to build vtysh.
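For anyone wanting to compare the overflow rate between frr6 and frr7 runs, here is a minimal sketch that tallies the sonewconn messages per socket. It assumes the log lines have exactly the shape quoted above, including the "(N occurrences)" suffix; the function name summarize_overflows and the SAMPLE list are made up for illustration, and lines that do not match are simply skipped.

```python
import re

# Matches the FreeBSD "Listen queue overflow" kernel messages quoted above.
# Assumption: every line of interest carries the "(N occurrences)" suffix.
PATTERN = re.compile(
    r"sonewconn: pcb (?P<pcb>0x[0-9a-f]+): Listen queue overflow: "
    r"(?P<queued>\d+) already in queue awaiting acceptance "
    r"\((?P<count>\d+) occurrences\)"
)


def summarize_overflows(lines):
    """Return {pcb: (total occurrences, largest reported queue length)}."""
    totals = {}
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue  # not an overflow message, or a different format
        pcb = match.group("pcb")
        occurrences, peak = totals.get(pcb, (0, 0))
        totals[pcb] = (
            occurrences + int(match.group("count")),
            max(peak, int(match.group("queued"))),
        )
    return totals


# The four kernel lines from the report, used as sample input.
SAMPLE = [
    "Jan 16 16:00:59 kit-b kernel: sonewconn: pcb 0xfffff801e65123a0: Listen queue overflow: 193 already in queue awaiting acceptance (622 occurrences)",
    "Jan 16 16:01:59 kit-b kernel: sonewconn: pcb 0xfffff801e65123a0: Listen queue overflow: 193 already in queue awaiting acceptance (557 occurrences)",
    "Jan 16 16:02:59 kit-b kernel: sonewconn: pcb 0xfffff801e65123a0: Listen queue overflow: 193 already in queue awaiting acceptance (622 occurrences)",
    "Jan 16 16:03:59 kit-b kernel: sonewconn: pcb 0xfffff801e65123a0: Listen queue overflow: 193 already in queue awaiting acceptance (556 occurrences)",
]

if __name__ == "__main__":
    for pcb, (occurrences, peak) in summarize_overflows(SAMPLE).items():
        print(f"{pcb}: {occurrences} overflows, queue length up to {peak}")
```

All four sample lines refer to the same pcb, so they collapse into a single entry, which suggests one listening socket is the one falling behind.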
In case it was some memory issue, I tried a version built with tcmalloc, but it failed as well. We use frr7 elsewhere with far fewer peers and all works just fine, but those boxes are on RELENG_12. As this is a production box, I can't do much experimenting on it, and I am not sure how to recreate this easily in the lab.

Do these problems ring a bell with anyone? The box is pretty quiet. There is no memory or CPU pressure on it. It doesn't seem to be a "thundering herd" problem, as I tried shutting down half the peers at startup to no avail.

---Mike
mike tancsa