Confluent Cloud Metrics API error rates and latencies are higher than normal
Incident Report for Confluent Cloud
Resolved
This incident has been resolved.
Posted Mar 07, 2023 - 17:11 UTC
Monitoring
As of 3:00 UTC, error levels and latency values for the Confluent Cloud Metrics API are back to normal. The Confluent engineering team will continue to monitor the situation.
Posted Mar 07, 2023 - 03:54 UTC
Investigating
Confluent Cloud Metrics API error rates are higher than normal. This issue started at 00:34 UTC on Mar 7th. We are aware of the issue and are investigating. We will provide an update when we have more information.
Posted Mar 07, 2023 - 01:43 UTC
Update
Error levels and latency values for the Confluent Cloud Metrics API are back to normal. The Confluent engineering team will continue to monitor the situation.
Posted Mar 02, 2023 - 22:35 UTC
Update
The Confluent team continues to test the rollout of the changes as we move toward a full recovery. We will provide an update once we have completed applying the fixes.
Posted Mar 02, 2023 - 20:30 UTC
Update
The issue that was a major bottleneck has been fixed. The team will now continue with the earlier plan and deploy the remaining fixes. We will provide the next update in 4 hours.
Posted Mar 02, 2023 - 15:54 UTC
Monitoring
Confluent engineering is monitoring the success of this latest fix. At this point the fix appears to be working, and no new issues have been identified. We will provide an update in 4 hours as we work toward full recovery.
Posted Mar 02, 2023 - 11:24 UTC
Identified
Confluent engineering has seen error rates decrease after the latest fix. We will be monitoring the success of this latest fix and will provide an update in 6 hours as we continue to push for full recovery.
Posted Mar 02, 2023 - 03:39 UTC
Update
Confluent engineering has rolled out a mitigation to production and is currently monitoring it.

We will provide the next status update at 8 PM Pacific Time.
Posted Mar 02, 2023 - 02:09 UTC
Update
The Confluent Cloud Metrics API continues to see elevated latencies and error rates. The underlying telemetry data store is experiencing elevated query times, causing the Metrics API to time out. The Confluent engineering team is working around the clock to identify the bottlenecks, test, and roll out changes to alleviate them.

We will provide the next status update at 6 PM Pacific Time.
Posted Mar 01, 2023 - 23:39 UTC
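During a window like this, clients querying the Metrics API may see timeouts and 5xx responses; a common client-side mitigation is to retry with exponential backoff rather than fail immediately. A minimal sketch follows. The helper name, parameters, and the use of `TimeoutError` are illustrative assumptions, not part of any Confluent SDK:

```python
import time

def with_backoff(call, retries=5, base_delay=0.5, retryable=(TimeoutError,)):
    """Invoke `call`, retrying on retryable errors with exponential backoff.

    Sleeps base_delay * 2**attempt between attempts; re-raises the last
    error once all retries are exhausted.
    """
    for attempt in range(retries):
        try:
            return call()
        except retryable:
            if attempt == retries - 1:
                raise  # out of retries; surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))
```

In practice the `call` would wrap the HTTP request to the Metrics API query endpoint, and the retryable set would include request timeouts and 5xx status codes.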
Update
The latest mitigation attempts have failed, and consumers of the Confluent Cloud Metrics API may continue to see increased error rates and latencies. Confluent engineering is continuing its attempts to mitigate the issue. Further updates will be provided as they become available.
Posted Mar 01, 2023 - 20:06 UTC
Update
The team has been rolling out a set of fixes, but ran into an issue mid-rollout. That issue has been addressed, and the team will continue the rollout and provide updates on completion.
Posted Mar 01, 2023 - 14:25 UTC
Update
Confluent engineering is continuing to investigate issues with the Confluent Cloud Metrics API and apply fixes to mitigate impact. Customers may see requests to the Confluent Cloud Metrics API fail.
Posted Mar 01, 2023 - 03:59 UTC
Update
The Confluent team has identified several additional issues and is rolling out a series of updates to address them. These updates will be applied in a measured fashion. We will monitor the changes and provide updates when available.
Posted Feb 28, 2023 - 18:58 UTC
Update
We have identified and mitigated several issues, but are continuing to see errors.
We are still investigating and will continue to work toward a resolution.
Posted Feb 28, 2023 - 06:27 UTC
Investigating
Confluent Cloud Metrics API error rates and latencies are higher than normal. This issue started at 11:00 UTC. We are aware of the issue and are investigating. We will provide an update when we have more information.
Posted Feb 27, 2023 - 16:05 UTC
This incident affected: Confluent Cloud.