SchedulePro - Users Unable to Sign In
Incident Report for SchedulePro
Postmortem

On March 19th, our systems experienced three separate outages, each lasting approximately an hour before being resolved. Our investigation determined that all three outages had the same cause: an unexpected spike in authentication traffic. This surge rendered our backend servers unresponsive to login requests from both the web and mobile applications.

To address the immediate cause of the outages, we executed full reboots of all system components. While this action successfully restored functionality, it may have resulted in brief additional periods of outage for customers within the app, as well as apparent anomalies such as failed save requests during the reboot process.

Yesterday, our team investigated the root cause and implemented several solutions to prevent similar occurrences in the future:

  1. Implementation of Cache Lookup: We introduced a cache lookup mechanism in our API management layer. This enhancement ensures that the majority of traffic, especially during spikes, is intercepted before it reaches our backend services, so the servers are not overwhelmed.
  2. Introduction of Rate Limiting: Rate limiting protocols have been integrated into the API management layer. This measure guarantees that even in extreme scenarios, our system will remain operational for all users. High loads will no longer result in service outages.
  3. Scaling of API Backend Servers: We increased the capacity of our production API backend servers in Canada to accommodate higher load and traffic demands.
  4. Implementation of Proactive Alerts: We established new alert mechanisms to proactively notify the development team of potential issues similar to those experienced on March 19th. These alerts will ensure swift action and minimize downtime.
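
For readers who want a concrete picture of how measures 1 and 2 fit together, here is a minimal sketch in Python. It is not SchedulePro's implementation: the class names (TokenBucket, TTLCache, Gateway), the token-bucket algorithm, the cache TTL, and the use of HTTP-style status codes are all assumptions made for illustration. The idea is that a cache hit is answered without touching the backend, and a cache miss is forwarded only if the rate limiter still has budget.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TokenBucket:
    """Hypothetical limiter: bursts of up to `capacity`, refilled at `rate` tokens/second."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class TTLCache:
    """Hypothetical in-memory cache whose entries expire after `ttl` seconds."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl
        self.clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # Drop the stale entry.
            return None
        return value

    def put(self, key: str, value: str) -> None:
        self._entries[key] = (self.clock() + self.ttl, value)


class Gateway:
    """Assumed API management layer: cache lookup first, then rate limit, then backend."""

    def __init__(self, backend: Callable[[str], str],
                 cache: TTLCache, limiter: TokenBucket) -> None:
        self.backend = backend
        self.cache = cache
        self.limiter = limiter

    def handle(self, key: str) -> Tuple[int, str]:
        cached = self.cache.get(key)
        if cached is not None:
            return 200, cached  # Served from cache; the backend is not called.
        if not self.limiter.allow():
            return 429, "Too Many Requests"  # Shed load instead of overloading the backend.
        value = self.backend(key)
        self.cache.put(key, value)
        return 200, value
```

In this sketch, repeated requests for the same key during a spike cost the backend a single call per TTL window, and requests beyond the limiter's budget are rejected quickly rather than queued. The injectable clock exists only so the behavior can be checked without real waiting.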

The combination of these proactive measures and infrastructure improvements significantly reduces the likelihood of similar incidents affecting our customers in the future. We are confident that our system is now better equipped to handle unexpected spikes in traffic and maintain uninterrupted service.

Posted Mar 21, 2024 - 11:19 PDT

Resolved
This incident has been resolved.
Posted Mar 19, 2024 - 14:59 PDT
Monitoring
Our team has resolved the sign-in issues and is actively monitoring the system.
Posted Mar 19, 2024 - 14:11 PDT
Investigating
Users are currently unable to log in to the website. Our team is investigating connectivity issues and will provide an update as quickly as possible.
Posted Mar 19, 2024 - 13:10 PDT
This incident affected: System Access.