Cloud service outage

Incident Report for k6

Resolved

This incident was caused by unexpected memory consumption and table locking during a scheduled schema migration of our main metrics database. The database migration started on Monday at 20:00 UTC and ran in the background for 8 hours and 12 minutes without impacting the service. At 00:15 UTC, the migration of a large database table storing HTTP metrics began. At 04:12 UTC, the migration was consuming about 60 GB of memory and started to degrade INSERT performance, possibly because of a database lock (we are still investigating). Affected users started seeing delays in metrics insertion and errors when retrieving data through app.k6.io. Our engineers began investigating the issue and looking for the cause, and the database migration was aborted. The service was fully restored by 07:30 UTC.

64 k6 test runs timed out or were aborted during this time window.

We are continuing to investigate the root cause of the incident and are revising our internal procedures for monitoring long-running database migrations.

We apologize for any impact the service disruption may have had on your organization.

Posted Nov 23, 2021 - 05:30 CET