The Hong Kong Digital Data Center operated by our company suffered a serious failure that interrupted all network services between 12:10 and 13:10 on January 2, 2022. We explain the incident as follows:
1. Our engineers investigated the cause of this failure. We are connected to many international operators, all of them run IPv4 and IPv6 dual-stack, and we also maintain four different BGP combinations. As a result, our routing table is several times the size of a normal BGP routing table, and memory consumption on the core routing equipment remained persistently high until the equipment was overwhelmed. The core routing equipment went down completely at about 12:00 on January 2 and was fully restored at around 13:10. This failure interrupted all network services for 70 minutes.
2. We had anticipated this risk and had already been actively deploying new equipment to relieve the pressure on the core routers. The expansion and cutover was originally planned for 9:00 to 21:00 on January 2. With the full cooperation of our on-site and remote engineers, we completed all expansion operations by 23:00 on January 2. The entire network architecture now uses a dual-redundancy scheme, from the routers to the access network core switches. We believe that such an extreme network failure will not occur again. The only remaining scenario is a power loss to the entire core network cabinet, and the probability of that is much lower than the probability of an equipment failure. We will also pursue redundancy across multiple data centers in the future. We cannot guarantee a 100% SLA, but we will do our best to come as close to 100% as possible.
3. We sincerely apologize for the trouble this failure has caused, and we once again express our heartfelt thanks to everyone who supports and cares about us!