Rockset is a highly elastic, distributed service deployed via Kubernetes. Because of this elasticity, servers are continually spun up and torn down during normal operation. Unknown to us, our container networking interface (CNI) plugin had a bug that failed to de-allocate the IP blocks assigned to servers. Eventually we ran out of IP blocks to assign to newly spun-up servers, which triggered a cascading failure and led to the long outage.
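To make the failure mode concrete, here is a toy model of the leak, not Rockset's actual CNI code: the cluster subnet is carved into fixed-size per-server IP blocks, but a buggy release path never returns blocks to the free pool, so normal server churn steadily drains it. All names here are illustrative.

```python
from ipaddress import ip_network

class LeakyBlockAllocator:
    """Toy per-server IP block allocator with the deallocation bug."""

    def __init__(self, cluster_cidr: str, block_prefix: int):
        # Pre-split the cluster CIDR into per-server IP blocks.
        self.free = list(ip_network(cluster_cidr).subnets(new_prefix=block_prefix))
        self.assigned = {}

    def allocate(self, server_id: str):
        if not self.free:
            # This is the point where new servers can no longer come up.
            raise RuntimeError("IP block pool exhausted")
        self.assigned[server_id] = self.free.pop()
        return self.assigned[server_id]

    def release(self, server_id: str):
        self.assigned.pop(server_id)
        # BUG: the block is never appended back to self.free,
        # so every spin-up/tear-down cycle leaks one block.

# A /16 split into /24 blocks yields 256 blocks. Under normal elastic
# churn (allocate then release), the pool shrinks by one each cycle.
alloc = LeakyBlockAllocator("10.0.0.0/16", block_prefix=24)
exhausted_at = None
for i in range(300):
    try:
        alloc.allocate(f"server-{i}")
        alloc.release(f"server-{i}")
    except RuntimeError:
        exhausted_at = i  # pool runs dry after 256 churn cycles
        break
```

Even though the cluster never runs more than one server at a time in this model, the pool is exhausted after 256 cycles, which mirrors how steady-state elasticity alone eventually starved us of IP blocks.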
The inability to assign IP blocks to new servers generated a flood of warning log lines on our Kubernetes control plane nodes. These logs eventually filled up the disks, which triggered the automatic eviction of several control plane services in an attempt to reclaim disk space; that attempt was ultimately unsuccessful. The absence of one of these control plane services in turn broke another of our services, one that provides fine-grained access to AWS APIs, which ultimately left our data nodes unable to do their job.
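The eviction step is standard kubelet behavior: when node disk usage crosses configured thresholds, the kubelet starts evicting pods to reclaim space. A sketch of the relevant configuration, with illustrative threshold values rather than our exact ones:

```
# Illustrative KubeletConfiguration fragment (values are examples only).
# Once log growth pushed free disk below thresholds like these, the
# kubelet began evicting pods -- including control plane services.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  nodefs.available: "10%"
  imagefs.available: "15%"
```

Eviction frees space by removing pods and their ephemeral storage, but it cannot reclaim space consumed by logs that keep growing, which is why the eviction here failed to resolve the disk pressure.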
We will be making several changes to our infrastructure to prevent this type of failure from recurring. Some changes address the root cause: fixing the underlying IP address allocation bug and monitoring the number of available internal IP addresses. Other changes are designed to stop failures in one part of the infrastructure from cascading to the rest, and to add more monitoring for all mission-critical services. We also have long-term plans that will eliminate this failure mode entirely.
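A monitoring check of the kind described above can be sketched as follows; the function name, block sizing, and alert threshold are all assumptions for illustration, not our production monitoring code.

```python
from ipaddress import ip_network

def ip_pool_alert(cluster_cidr: str, block_prefix: int,
                  assigned_blocks: int, min_free_ratio: float = 0.2) -> bool:
    """Return True when the free-block ratio drops below the safety margin.

    Firing well before exhaustion gives operators time to find a leak
    long before new servers fail to get IP blocks.
    """
    # Total per-server blocks the cluster CIDR can be split into,
    # e.g. a /16 split into /24 blocks -> 2**(24-16) = 256 blocks.
    total = 2 ** (block_prefix - ip_network(cluster_cidr).prefixlen)
    free = total - assigned_blocks
    return free / total < min_free_ratio  # True -> raise an alert

# 156 of 256 blocks free (~61%): healthy, no alert.
ip_pool_alert("10.0.0.0/16", 24, assigned_blocks=100)  # False
# 26 of 256 blocks free (~10%): below the 20% margin, alert fires.
ip_pool_alert("10.0.0.0/16", 24, assigned_blocks=230)  # True
```

A ratio-based threshold rather than an absolute count keeps the same check meaningful across differently sized clusters.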
We sincerely apologize for the impact this incident has had on our customers. We will be doing everything we can to learn from it and do better in the future.