Yesterday, one of my sites, which uses a Cradlepoint MBR1400, went down. This was especially odd because the area did not have an internet outage and because this site has a Comcast primary and a 4G secondary connection.
I was not on site, and the local staff chose to reboot the router, which resolved the issue in the short term. The question remained: why did the connection fail when both the primary and secondary connections were up?
The answer came from the failure check settings in the MBR1400. Both the primary and secondary connections were set to check the same single IP address (Google DNS) for failure testing, and failback was based on time. That address went down, causing the primary connection test to fail, so the Cradlepoint switched to the secondary connection. The secondary connection test against the same address also failed, causing the MBR1400 to retry the primary. Essentially, at this point the system entered a logical loop in which it stayed stuck until a reboot. The total downtime was approximately 25 minutes (unacceptably poor).
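To make the loop concrete, here is a toy model of the behaviour described above. It is not Cradlepoint's firmware logic: the function name `simulate_failover` and its parameters are made up for illustration, and it assumes the router simply switches links whenever the one shared check target fails to answer.

```python
def simulate_failover(target_up, links, rounds):
    """Toy model: every link is judged by the same single check target.

    Returns the link that was active in each round.
    """
    active = 0
    history = []
    for _ in range(rounds):
        history.append(links[active])
        if not target_up:
            # The check fails no matter which link is active,
            # so the router keeps switching to the other link.
            active = (active + 1) % len(links)
    return history


history = simulate_failover(target_up=False,
                            links=["primary", "secondary"],
                            rounds=6)
print(history)
```

With the target down, the model never settles on a link even though both links are healthy, which matches the flapping we saw until the reboot.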
There are two parts to the resolution here. First, the settings need to be altered, as they were programmed poorly (though the documentation on this is lacking). After some testing and communication with Cradlepoint, I will update with the settings we found to be effective. The second part of the resolution requires a fundamental change from Cradlepoint. Simply put, you cannot test a single IP; otherwise you may end up with a false outage, as we did, even if that IP has Google reliability. The failure check should test 2 to 4 IP addresses, with programmability for how quickly a positive result is acted upon.
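As a rough sketch of what I mean, the check below only reports an outage when every target fails, and only after a configurable number of failed rounds in a row. This is not a Cradlepoint feature or API: `link_is_down`, `FailureDetector`, `probe`, and the thresholds are hypothetical names I picked, the probe is a stand-in for a real ping or DNS query, and the addresses come from the reserved documentation ranges.

```python
from typing import Callable, Optional, Sequence


def link_is_down(targets: Sequence[str],
                 probe: Callable[[str], bool],
                 min_failures: Optional[int] = None) -> bool:
    """Return True only if at least min_failures targets fail to answer.

    By default every target must fail, so one dead target alone
    never counts as an outage.
    """
    if not targets:
        raise ValueError("at least one target is required")
    if min_failures is None:
        min_failures = len(targets)
    failures = sum(1 for target in targets if not probe(target))
    return failures >= min_failures


class FailureDetector:
    """Declare an outage only after several consecutive failed rounds."""

    def __init__(self, targets, probe, consecutive_rounds=3):
        self.targets = list(targets)
        self.probe = probe
        self.consecutive_rounds = consecutive_rounds
        self._failed_rounds = 0

    def check(self) -> bool:
        # A single good round resets the counter.
        if link_is_down(self.targets, self.probe):
            self._failed_rounds += 1
        else:
            self._failed_rounds = 0
        return self._failed_rounds >= self.consecutive_rounds


# Example with a fake probe: only one of three targets is unreachable.
TARGETS = ["192.0.2.1", "198.51.100.1", "203.0.113.1"]
unreachable = {"192.0.2.1"}


def fake_probe(target: str) -> bool:
    return target not in unreachable


print(link_is_down(TARGETS, fake_probe))  # one dead target: False
```

The `consecutive_rounds` value is the "how quickly" knob: a lower number fails over faster, a higher number tolerates brief blips.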
So, Cradlepoint, please improve your failure check so that it easily prevents false outages for the market you serve.
Update: Per recommendations from Cradlepoint, I changed the settings per the images below. The results were much improved: failover took 45 seconds with several lost packets on a primary line disconnect, and about 5 seconds with one lost packet on a primary modem power failure. Failback in both cases lost only one packet and didn’t happen until the cable modem was ready. We also worked with Comcast to improve the modem cycle time.
While these improvements are huge, there are still issues. Testing only one destination for our primary wired connection remains a clearly identifiable point of failure that largely negates the redundancy benefits. The failover event could also be improved in our primary line disconnect test case: 45 seconds and many lost packets is worse than what we know competing hardware can achieve.