Slack outage Feb 2022

11/28/2023

Slack was born out of a gaming company called Tiny Speck, so some parts of its architecture resemble the multiplayer gaming world. Consider a video game in which we operate in a world and real-time events happen on top of it, such as controlling a bot to move and shoot. Slack works similarly: the world for a user is their team, and the real-time events are the messages the user sends and receives.

I have analyzed the information from QCon events on the updated Slack architecture as of 20 and listed a few interesting points.

When a user connects to Slack, the complete universe is downloaded to the user's PC. This is a fairly costly API call because it must generate a payload with all of the channels and the users in them, as well as markers for the most recent message in each channel. This could be a decision made during the architecture design process: load the complete system first, then deliver incremental real-time updates. However, once connected, the user receives a link for a WebSocket connection, which they can use to receive real-time messages and notifications from the messaging servers. Many messaging systems employ this paradigm to minimize server overload.

Slack is a producer-heavy system, especially during peak hours, so they adopted a Master-Master configuration to handle heavy write requests. Moreover, a single database cannot store messages from millions of users over years. So they sharded the data by channel, since most of the APIs deal with channel data once the user is connected. But this creates hotspots for some channels.

As previously stated, downloading the user's complete world while connecting is costly. So they started downloading the workspace model first using Flannel (a globally distributed cache) and then loading the rest of the payload slowly. This also helps them with mass reconnection issues. To reduce hotspots, they also made the user/channel-based sharding model more fine-grained.

With this limited information, let us analyze the issue summary given by the Slack team.

"A configuration change inadvertently led to a sudden increase in activity on our database infrastructure. Due to this increased activity, the affected databases failed to serve incoming requests to connect to Slack."

We don't know what configuration change caused the outage, but it appears that the database (DB) load grew as a result of this update. Because of the query processes already waiting in the queue, DB throughput (TPS) would likely have dropped. As a result, new DB queries for CRUD operations take longer, causing APIs to fail to complete within the time-out period configured in load-balancing/reverse-proxy services such as Nginx.

"We introduced tighter rate limits on connection requests to reduce the load on the system."
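Slack's summary says only that tighter rate limits were placed on connection requests; it does not say how they were implemented. One common way to build such a limit is a token bucket, so here is a minimal sketch under that assumption. The class name `ConnectionRateLimiter`, its parameters, the `tighten` method, and the injectable clock are all hypothetical names made up for illustration and are not taken from Slack's systems.

```python
import time


class ConnectionRateLimiter:
    """Hypothetical token bucket for incoming connection requests.

    Allows bursts of up to `capacity` requests and refills at `rate`
    tokens per second. Not Slack's actual implementation.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum burst size
        self.clock = clock               # injectable so it can be tested
        self.tokens = self.capacity      # start with a full bucket
        self.last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last call, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def allow(self):
        """Return True if one connection request may proceed right now."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def tighten(self, rate, capacity):
        """Switch to stricter limits, e.g. while the database is overloaded."""
        self._refill()  # settle the bucket under the old limits first
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = min(self.tokens, self.capacity)
```

In a setup like this, a gateway would call `allow()` for each incoming connection request and reject or delay the request when it returns `False`. During an incident, `tighten()` lowers the refill rate so that a wave of reconnecting clients is spread out over time instead of hitting the database all at once.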