At approximately 20:50 (EDT) on July 21, 2021, a virtual-server host stopped responding, for reasons unknown. This server hosts, among other services, one of our internal DNS servers.
Although we run two DNS servers for redundancy, some services were configured to use only one of them (this is being rectified). It is not yet clear to us why certain other hosted services did not query the backup server even though they were configured to do so. In either case, the affected services were unable to resolve important hostnames, such as those used for user authentication.
The unresponsive host and its guest VMs were returned to service approximately six hours later, shortly after staff in the Central European time zone became aware of the issue.
We will continue to test the specific conditions under which name resolution failed, and will implement fixes to ensure that the outage of a single DNS server does not cause these problems again.
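
As an illustration of the kind of check involved, the sketch below flags a resolver configuration that lists fewer than two distinct DNS servers. It assumes the affected services use the standard Linux resolver configuration format (resolv.conf); the function names and the IP addresses are hypothetical and are not taken from our actual setup. It only detects the single-server case and does not explain why a correctly configured service might fail to fall back.

```python
from typing import List


def nameservers(resolv_conf_text: str) -> List[str]:
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for raw in resolv_conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines (starting with '#' or ';').
        if not line or line[0] in "#;":
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def lacks_redundancy(resolv_conf_text: str) -> bool:
    """True if fewer than two distinct nameservers are configured."""
    return len(set(nameservers(resolv_conf_text))) < 2


# Hypothetical example configuration; the addresses are placeholders.
example = """\
# internal resolvers
nameserver 10.0.0.1
nameserver 10.0.0.2
"""
if lacks_redundancy(example):
    print("WARNING: only one DNS server configured")
else:
    print("OK: servers configured:", ", ".join(nameservers(example)))
```

A check like this could be run against each service's resolver configuration to find the ones that depend on a single DNS server.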