In the morning hours of June 28th, as traffic started picking up, one of our services began having issues serving web requests.
This service handles handover between the main chatbot service and external customer service systems, letting users talk to human agents. The main chatbot was unaffected and continued to respond as normal; only communication between end users and human agents was affected.
We received an automatic alert about the errors soon after the service started to fail and began to investigate.
One of the error messages from the webserver suggested that it might be out of memory, so we started by investigating this and increasing the memory available to the service. Unfortunately, the error message was misleading and led us to spend time pursuing the wrong solution.
The actual problem was latency: the webserver was responding to requests, but too slowly, and the connections were timing out. Once we figured this out, we scaled up the service to add more capacity. It recovered quickly and went back to handling requests in milliseconds instead of minutes.
To avoid this happening again, we have increased the maximum capacity that the service can automatically scale up to, so that it can respond dynamically to increased traffic.
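The exact mechanism depends on our hosting platform, but the idea behind raising that ceiling is roughly captured by the sketch below; the instance counts and latency target are illustrative placeholders, not our production values.

```python
# Illustrative sketch only: the real autoscaling is handled by our hosting
# platform, and the limits and thresholds here are hypothetical examples.

MAX_INSTANCES = 20       # raised ceiling so the service can grow under load
TARGET_LATENCY_MS = 500  # scale up well before requests approach timeouts


def desired_instances(current_instances: int, p95_latency_ms: float) -> int:
    """Return how many instances the service should run, based on latency."""
    if p95_latency_ms > TARGET_LATENCY_MS:
        # Add capacity while responses are slow, up to the new maximum.
        return min(current_instances + 1, MAX_INSTANCES)
    if p95_latency_ms < TARGET_LATENCY_MS / 4 and current_instances > 1:
        # Scale back down once traffic drops off again.
        return current_instances - 1
    return current_instances
```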
We are also looking into adding alerts that tell us explicitly when server response times start to increase significantly, so that we get an accurate signal pointing to the actual problem.
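As a rough illustration of the kind of check we have in mind, the sketch below times a handful of requests and raises an explicit alert when the service is slow rather than out of memory; the endpoint URL, sample count, and threshold are placeholders, not our real monitoring configuration.

```python
# Illustrative sketch only: the real alert will live in our monitoring stack.
import statistics
import time
import urllib.request

ENDPOINT = "https://handover.example.com/health"  # hypothetical health check URL
LATENCY_ALERT_MS = 2000                           # alert well before timeouts
SAMPLES = 5


def measure_latency_ms(url: str) -> float:
    """Time a single request to the service and return the latency in milliseconds."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=30) as response:
        response.read()
    return (time.monotonic() - start) * 1000


def check_response_time() -> None:
    """Report slow responses explicitly instead of relying on a generic error."""
    latencies = [measure_latency_ms(ENDPOINT) for _ in range(SAMPLES)]
    median_ms = statistics.median(latencies)
    if median_ms > LATENCY_ALERT_MS:
        print(f"ALERT: median response time {median_ms:.0f} ms exceeds "
              f"{LATENCY_ALERT_MS} ms - the service is slow, not out of memory")


if __name__ == "__main__":
    check_response_time()
```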
Finally, we are looking into whether it is possible to change the misleading error message from the webserver, so that we are not misled again if something similar happens in the future.