By Lee Smith
The following is an incident report for the Logit.io platform outage which occurred in the early hours of March 10, 2021.
We understand that our users rely on Logit.io and the high availability of its services to maintain their log management and managed Elastic Stacks. The platform is deployed in multiple data centres across the US, EU and UK. Unfortunately, a major fire broke out in one of the EU data centres where Logit.io had a number of servers.
Logit.io deploys its platform in a highly available manner. However, the scale of the impact on the data centre made some parts of our platform unavailable.
As soon as the data centre outage started to affect Logit.io systems, our uptime monitoring alerted us and called the on-call engineers to respond. Once we had confirmed that the issue was with the data centre, the Incident Management team contacted our hosting provider (OVH) to ask why our systems had gone offline.
Due to the impact of the outage, the Incident Management team began enacting our disaster recovery plan (DRP). As part of the DRP, the team worked with our on-call engineers to coordinate the activities needed to restore the affected services in priority order.
You can view the step-by-step timeline of events on our status page: https://status.logit.io/incidents/81qdgnmpbncb
The outage impacted several of the centralised services that we provide. Engineers started to recover these services from our backups to bring the platform back online in a newly provisioned data centre.
We have worked extremely hard over the years to automate the platform as efficiently and securely as possible. Due to the outage, we first had to rebuild our automation and provisioning services to allow us to rebuild other parts of the platform.
Once the initial installation into the new data centre was completed, we prioritised restoring the platform dashboard and databases from our backups.
Following this, our engineers started to recreate all the necessary infrastructure to bring the Elasticsearch API functionality and Kibana Service back online. Logstash Ingestion and Elasticsearch were unaffected by this outage.
After this step was completed, work began on restoring the shared ingestion API, which was brought back online along with several other essential core services.
The final hours of our resolution process involved bringing back our alerting infrastructure and ensuring that health checks for all of the ELK clusters were operational. The team then ensured that backups for all services were resumed.
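For readers planning their own recovery runbooks, the restore sequence described above can be sketched as a small dependency map. This is only an illustration: the service names and the strict one-after-another dependencies are assumptions made for the example, not a description of Logit.io's actual tooling.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each key lists what must be restored before it.
# The names and the strict chain are assumptions based on the order described
# in this report, not the real platform configuration.
RECOVERY_DEPENDENCIES = {
    "automation_and_provisioning": set(),
    "dashboard_and_databases": {"automation_and_provisioning"},
    "elasticsearch_api_and_kibana": {"dashboard_and_databases"},
    "shared_ingestion_api": {"elasticsearch_api_and_kibana"},
    "alerting_and_health_checks": {"shared_ingestion_api"},
    "backups_resumed": {"alerting_and_health_checks"},
}


def recovery_order(dependencies):
    """Return services in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for step, service in enumerate(recovery_order(RECOVERY_DEPENDENCIES), start=1):
        print(f"{step}. {service}")
```

Writing the plan down as data rather than prose makes it easy to check that nothing is scheduled before the services it relies on, which is why the automation and provisioning layer comes first here.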
Throughout this incident, effective communication within our teams and with the affected customers helped us to resolve it.
Now that our infrastructure has been fully restored to the expected availability levels, we continue to monitor the performance of our platform to ensure stability. We have already begun work on incremental improvements to our time to resolution (TTR).
- We will continue to improve the architecture of the platform to reduce the impact on core services of a single data centre going offline.
- We will continue to improve our disaster recovery plan, which greatly assisted in minimising long-term damage to our platform.
- We will look to further improve our response time by introducing additional resources to our platform.
- We also seek to improve our status page by displaying more granular information in the unlikely event that future incident reporting of this nature is required.
An emergency as widespread as the one that occurred in the EU data centre serves as a reminder of just how important an effective backup strategy and disaster recovery plan are to any business.
At this time, our thoughts are also with the team at OVH, who we hope can recover quickly, as well as with all of the site owners who have suffered as a result of this incident.
All of our staff here at Logit.io are fully committed to continuously improving our technology, processes and operations to prevent downtime, and to ensuring we can restore regular service when incidents occur outside of our control.
We sincerely apologise for any impact experienced by our customers and would like to thank you for your business, continued support and understanding.
The customer feedback we received during and after our disaster recovery process was much appreciated by our team, who worked hard to resume normal service.
As always, please feel free to share your comments with the team via your account manager, live chat or [email protected], and we’ll be happy to answer any questions you may have about this incident.