In the morning hours of June 28th, as traffic started picking up, one of our services began having issues serving web requests.
This service handles handover between the main chatbot service and external customer service systems, letting users talk to human agents. The main chatbot was unaffected and continued to respond as normal; only communication between end users and human agents was affected.
We received an automatic alert about the errors soon after the service started to fail and began to investigate.
One of the error messages from the webserver suggested that it might be out of memory, so we started by investigating this and increasing the memory available to the service. Unfortunately, the error message was misleading and led us to spend time pursuing the wrong solution.
The actual problem was latency: the webserver was responding to requests, but too slowly, and the connections were timing out. Once we figured this out, we scaled up the service to add more capacity. It recovered quickly and went back to handling requests in milliseconds instead of minutes.
To avoid this happening again, we have increased the maximum capacity that the service can automatically scale up to, so that it can respond dynamically to increased traffic.
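The exact mechanism depends on our hosting platform, but the idea behind raising that ceiling is roughly captured by the sketch below; the instance counts and latency target are illustrative placeholders, not our production values.

```python
# Illustrative sketch only: the real autoscaling is handled by our hosting
# platform, and the limits and thresholds here are hypothetical examples.

MAX_INSTANCES = 20       # raised ceiling so the service can grow under load
TARGET_LATENCY_MS = 500  # scale up well before requests approach timeouts


def desired_instances(current_instances: int, p95_latency_ms: float) -> int:
    """Return how many instances the service should run, based on latency."""
    if p95_latency_ms > TARGET_LATENCY_MS:
        # Add capacity while responses are slow, up to the new maximum.
        return min(current_instances + 1, MAX_INSTANCES)
    if p95_latency_ms < TARGET_LATENCY_MS / 4 and current_instances > 1:
        # Scale back down once traffic drops off again.
        return current_instances - 1
    return current_instances
```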
We are also looking into adding alerts that tell us explicitly when server response times start to increase significantly, so that we get an accurate signal pointing to the actual problem.
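As a rough illustration of the kind of check we have in mind, the sketch below times a handful of requests and raises an explicit alert when the service is slow rather than out of memory; the endpoint URL, sample count, and threshold are placeholders, not our real monitoring configuration.

```python
# Illustrative sketch only: the real alert will live in our monitoring stack.
import statistics
import time
import urllib.request

ENDPOINT = "https://handover.example.com/health"  # hypothetical health check URL
LATENCY_ALERT_MS = 2000                           # alert well before timeouts
SAMPLES = 5


def measure_latency_ms(url: str) -> float:
    """Time a single request to the service and return the latency in milliseconds."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=30) as response:
        response.read()
    return (time.monotonic() - start) * 1000


def check_response_time() -> None:
    """Report slow responses explicitly instead of relying on a generic error."""
    latencies = [measure_latency_ms(ENDPOINT) for _ in range(SAMPLES)]
    median_ms = statistics.median(latencies)
    if median_ms > LATENCY_ALERT_MS:
        print(f"ALERT: median response time {median_ms:.0f} ms exceeds "
              f"{LATENCY_ALERT_MS} ms - the service is slow, not out of memory")


if __name__ == "__main__":
    check_response_time()
```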
Finally, we are looking into whether it is possible to change the misleading error message from the webserver, so that we are not misled again if something similar happens in the future.