Cyclic reference on org chart:
- Detail: The org chart no longer displays a user as their own manager, even when Active Directory lists them as their own manager.
- Ticket: 30721
- Work item: 55399
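A minimal sketch of this kind of guard, assuming the directory gives us a user ID and an optional manager ID. The function name `display_manager` is hypothetical and not taken from our codebase.

```python
from typing import Optional


def display_manager(user_id: str, manager_id: Optional[str]) -> Optional[str]:
    """Return the manager to show on the org chart, or None for no manager."""
    # A directory entry that names the user as their own manager is a
    # self-reference; treat it as "no manager" so the chart cannot loop.
    if manager_id is None or manager_id == user_id:
        return None
    return manager_id
```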
Queue resilience strategy
We've put in place a strategy to improve our message tracking, gather more detail for troubleshooting, and prevent malfunctions in our solutions. It has three key parts: retries and dead-lettering with quorum queues, improved error handling, and idempotent message processing.
We've improved how we handle messages using RabbitMQ, our messaging broker. Recent RabbitMQ versions offer quorum queues, which let us retry delivering a message to our services several times when there is a service malfunction or a connection problem.
If a message still can't be delivered after a certain number of attempts, RabbitMQ safely redirects it to a 'dead-letter' queue. This ensures that we don't lose any messages, and we get notified about the error and where it originated.
To make the most of this, we've upgraded our queues and added a new service that logs these dead-lettered messages, giving us the detail we need to troubleshoot and fix issues. The retry limit is currently set to two: a message is delivered once and, if something goes wrong, it is redelivered up to two more times. If the final attempt also fails, the message is sent to our dead-letter queue.
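The delivery behaviour described above (a first attempt, then retries, then dead-lettering) can be sketched as follows. `x-queue-type`, `x-delivery-limit` and `x-dead-letter-exchange` are RabbitMQ queue arguments; the exchange name and the `simulate_delivery` helper are assumptions made for illustration, and the loop is a simplified in-process model of the broker rather than our actual configuration.

```python
from typing import Callable

# Arguments one might pass when declaring the queue.
QUEUE_ARGUMENTS = {
    "x-queue-type": "quorum",
    "x-delivery-limit": 2,                    # redeliveries allowed after the first attempt
    "x-dead-letter-exchange": "dead-letter",  # hypothetical exchange name
}


def simulate_delivery(
    handler: Callable[[str], None],
    message: str,
    delivery_limit: int = QUEUE_ARGUMENTS["x-delivery-limit"],
) -> str:
    """Simplified model: redeliver until the handler succeeds or the limit is exceeded."""
    returned = 0  # how many times the message came back unprocessed
    while True:
        try:
            handler(message)
            return "acked"
        except Exception:
            returned += 1
            if returned > delivery_limit:
                return "dead-lettered"
```

With a limit of two, a handler that always fails is called three times before the message is dead-lettered.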
When our applications hit an unexpected error, such as a database connection problem or an issue with the Graph API, we now log it in more detail and handle it differently. The application signals the failure to RabbitMQ, so that RabbitMQ can apply its retry strategy.
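As an illustration of signalling a failure to the broker, here is a sketch of a consumer callback that acknowledges a message on success and negatively acknowledges it with a requeue on failure. It assumes a Python consumer whose channel object exposes `basic_ack` and `basic_nack`, as pika's channels do; our services may use a different client, and `make_callback` and `process` are hypothetical names.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("consumer")


def make_callback(process: Callable[[bytes], None]) -> Callable[..., None]:
    """Wrap a processing function in a pika-style on-message callback."""

    def on_message(channel: Any, method: Any, properties: Any, body: bytes) -> None:
        try:
            process(body)
        except Exception:
            # Log the full traceback, then hand the message back to the
            # broker so it can apply its retry and dead-letter rules.
            logger.exception("Processing failed for delivery %s", method.delivery_tag)
            channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)
        else:
            channel.basic_ack(delivery_tag=method.delivery_tag)

    return on_message
```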
Because we use a retry policy, a message can be processed twice, for example when an application crashes while handling it and processes it again after restarting. We wanted to make sure this doesn't introduce new issues, so we reviewed our message handlers and made sure they are idempotent: processing the same message twice leaves the system in the same state as processing it once.
While ensuring idempotence is straightforward for some processes, others pose a bit of a challenge. For these, we've implemented specific strategies:
General emails - Every process sending an email has a distinct context. To avoid duplicate emails, we've implemented a caching system that stores the context and target when an email is successfully processed. This way, if a message attempts to send the same email to the same recipient twice, it will be skipped, as the email has already been sent.
Profile validation email - This process sends profile validation emails for all profiles in a tenant. To keep it efficient, we cache checkpoints. If the message is retried, processing resumes from the last checkpoint, so emails are not prepared again for profiles already in the queue. This applies only to the initial profile validation; reminders are scheduled individually for each profile and don't raise the same resource usage concern.
Settings update - When updating a setting, like the frequency of profile validation emails, we used to skip the process if the new setting matched what was already stored. However, in the worst case, where the setting is updated in our database and the application crashes during the reschedule task, the process would be skipped when the message was received again, and the reschedule would never run. To address this, we now cache whether the message was fully processed. If it was, the process is skipped on subsequent receptions; if not, the whole process runs again, which is safe because each step is idempotent.
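The three strategies above can be sketched roughly as follows. This is an illustration under stated assumptions, not our implementation: the in-memory sets and dictionary stand in for a shared cache, and every name (`send_email_once`, `queue_validation_emails`, `update_setting`, `message_id`) is hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

# In-memory stand-ins for a shared cache (assumption for this sketch).
sent_emails: Set[Tuple[str, str]] = set()
checkpoints: Dict[str, int] = {}
processed_messages: Set[str] = set()


def send_email_once(context: str, recipient: str,
                    send: Callable[[str, str], None]) -> bool:
    """General emails: skip a (context, recipient) pair that was already sent."""
    key = (context, recipient)
    if key in sent_emails:
        return False
    send(context, recipient)
    sent_emails.add(key)  # recorded only after a successful send
    return True


def queue_validation_emails(message_id: str, profiles: List[str],
                            enqueue: Callable[[str], None]) -> None:
    """Profile validation: resume from the cached checkpoint on a retry."""
    start = checkpoints.get(message_id, 0)
    for index in range(start, len(profiles)):
        enqueue(profiles[index])
        checkpoints[message_id] = index + 1  # next profile to handle


def update_setting(message_id: str, store: Dict[str, str], key: str, value: str,
                   reschedule: Callable[[], None]) -> bool:
    """Settings update: skip only when this message is known to have completed."""
    if message_id in processed_messages:
        return False
    store[key] = value  # writing the same value again is harmless
    reschedule()        # assumed idempotent; may raise on a crash
    processed_messages.add(message_id)
    return True
```

In this sketch, a crash between an action and the cache write that records it could still repeat that one action on retry; narrowing that window depends on the cache and transport used.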