Why you should rethink your webhook strategy
Find out about the common problems with webhooks, like out-of-order events and traffic surges, and how the Events API solves them.
Webhooks are HTTP requests sent from one application to another when specific events occur. They are widely used for their simplicity and immediacy, making them a popular choice for developers who need instant updates. However, despite their advantages, webhooks have significant limitations and are highly prone to failures.
For example, what happens when webhooks fail to deliver updates in the right order? Or when they flood your system with too many requests at once? Imagine the chaos that ensues when your application processes outdated or incorrect data, leading to security risks, incorrect application state, and bugs that are difficult to diagnose and resolve.
In this article, we will talk about the challenges that webhooks introduce to systems and present an alternative solution: the Events API. This API is designed to ensure reliable, orderly, and efficient data synchronization, making it a superior choice for managing real-time data updates.
For illustration purposes, we will use a simple project management application (TaskTango) that integrates with WorkOS to manage users and group memberships. TaskTango uses lifecycle events such as `user.<created|updated|deleted>` and `group.<user_added|user_removed>` to:
- Grant the user access to the tool upon account creation.
- Update their role to ensure they have the appropriate access to the necessary projects.
- Revoke access when the customer removes a user.
How webhooks work
Webhooks allow one system to send data to another whenever a specific event occurs, enabling applications to react to changes without constant polling. This keeps systems in sync.
- Whenever event X happens in system A, then system A sends a message to system B.
- System B implements a handler that receives the request and performs some action asynchronously.
Let's see how this looks using our example.
- WorkOS exposes events as `user.<created|updated|deleted>` and `group.<user_added|user_removed>` for the application to subscribe to.
- A TaskTango developer implements a publicly accessible web endpoint at `/webhooks`. The endpoint filters for the event type and handles the payload according to the application’s requirements. Here is a code snippet for reference, using Express:
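(A minimal sketch of such a handler, assuming JSON request bodies and the event names above; the helper functions are placeholders for TaskTango's real logic, and signature verification is omitted.)

```typescript
import express from "express";

// Application-specific handlers: placeholders for TaskTango's real logic.
async function grantAccess(user: unknown) { /* provision the user */ }
async function updateRole(user: unknown) { /* adjust project access */ }
async function revokeAccess(user: unknown) { /* remove access */ }
async function syncGroupMembership(event: string, membership: unknown) { /* update group state */ }

const app = express();
app.use(express.json());

// Publicly accessible endpoint that receives lifecycle events.
app.post("/webhooks", async (req, res) => {
  const { event, data } = req.body;

  switch (event) {
    case "user.created":
      await grantAccess(data);
      break;
    case "user.updated":
      await updateRole(data);
      break;
    case "user.deleted":
      await revokeAccess(data);
      break;
    case "group.user_added":
    case "group.user_removed":
      await syncGroupMembership(event, data);
      break;
    default:
      break; // ignore event types the app doesn't handle
  }

  // Acknowledge quickly; heavy work should happen asynchronously.
  res.sendStatus(200);
});

app.listen(3000);
```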
- The TaskTango developer then enables the aforementioned WorkOS events to be delivered to `/webhooks`.
- Every time a user is created, updated, or deleted, a message with this payload is sent from WorkOS to `/webhooks`:
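The shape below is illustrative only, written as a TypeScript interface; consult the WorkOS documentation for the authoritative schema.

```typescript
// Illustrative shape of a user lifecycle event payload; not the verbatim WorkOS schema.
interface UserLifecycleEvent {
  id: string;                                               // unique event ID
  event: "user.created" | "user.updated" | "user.deleted";  // lifecycle event type
  created_at: string;                                       // ISO-8601 timestamp
  data: {
    id: string;                                             // the user's ID
    email: string;
    first_name?: string;
    last_name?: string;
  };
}
```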
- Every time a user is added or removed from a group, a message with this payload is sent from WorkOS to `/webhooks`:
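Again illustrative only; the field names are assumptions rather than the exact WorkOS schema.

```typescript
// Illustrative shape of a group membership event payload; not the verbatim WorkOS schema.
interface GroupMembershipEvent {
  id: string;
  event: "group.user_added" | "group.user_removed";
  created_at: string;
  data: {
    group_id: string;  // the group the user was added to or removed from
    user_id: string;   // references a user delivered earlier via user.created
  };
}
```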
Challenges with webhooks
Out of Order Updates
Webhooks do not always arrive in the order they are generated. Even when the originating application sends them in order, each webhook might be processed by the application at different times. Several reasons can account for this behavior:
- The earlier event traverses a more congested path on the open internet, taking longer to arrive at the application’s endpoint.
- The consuming application’s load balancer routes the earlier event to a busier web server, increasing processing latency.
Webhook consumers can also encounter intermittent errors:
- The web server handling the earlier event can crash due to an out-of-memory error, triggering a 500 response. Modern webhook delivery systems will retry failed deliveries, often introducing delays that increase the likelihood of out-of-order webhooks.
- The web server handling the earlier event can be forcefully killed or rotated by the container orchestrator.
- The web server handling the earlier event can be preempted by the operating system’s thread/process scheduler.
This lack of causal order can disrupt the application’s data integrity. If a newer update arrives before an older one, outdated information may overwrite recent changes. Developers often use timestamp tracking or API calls to retrieve the latest state to mitigate these issues, but these methods can increase complexity and latency.
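As a rough illustration of the timestamp-tracking approach (the payload fields and in-memory store here are assumptions made for the sketch):

```typescript
// Sketch: ignore a webhook if a newer update for the same user has already been applied.
// Assumes each payload carries an updated_at timestamp; a real app would use a database.
interface UserRecord {
  id: string;
  email: string;
  updatedAt: Date;
}

const users = new Map<string, UserRecord>();

function applyUserUpdate(payload: { id: string; email: string; updated_at: string }) {
  const incoming = new Date(payload.updated_at);
  const existing = users.get(payload.id);

  // Drop stale webhooks: only apply the update if it is newer than what we already have.
  if (existing && existing.updatedAt >= incoming) {
    return;
  }
  users.set(payload.id, { id: payload.id, email: payload.email, updatedAt: incoming });
}
```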
Using our example, let’s consider a case where WorkOS issues a `user.created` webhook event followed by `user.deleted`. However, due to the aforementioned reasons, TaskTango processes `user.deleted` before `user.created`. Since there is no user to delete yet, the `user.deleted` event is silently ignored. To make things worse, the subsequent `user.created` event may be successfully processed, leading to a loss of data integrity. As a result, TaskTango has an unintended active user lurking in the system, compromising the application’s security.
Dependent webhooks
The out-of-order updates described above are not a concern for mutually exclusive webhooks that do not reference the same entities. For example, two `user.created` events can be processed out of order without compromising data integrity, as their payloads do not share common entities.
However, a webhook sometimes references another recently created or updated API object. In that case, the application must have processed the earlier webhook before it can correctly handle subsequent webhooks that reference those objects.
In our example integration:
- Sequential processing: Processing any `user.updated` or `user.deleted` webhooks requires the application to have successfully processed `user.created` beforehand.
- Entity dependencies: Similarly, processing `group.user_added` or `group.user_removed` webhooks requires the application to have successfully processed `user.created` earlier. Otherwise, the `user_id` mentioned in the payload is meaningless.
If these dependencies aren’t managed properly, it can lead to failed retries or, worse, silent errors that compromise data accuracy. Such issues can create bugs that are hard to diagnose without thorough logging.
Spiky throughput
When a high volume of webhooks is delivered in a short period, it can create a "thundering herd" effect. This happens when numerous requests hit the system simultaneously, overwhelming the webhook consumers, and can result in:
- Cascading failures: As errors occur, retries can further exacerbate the risk of processing webhooks out of order.
- Performance bottlenecks: Applications without proper scaling strategies can struggle to handle spikes, causing errors and delays in processing.
- System overload: Without effective auto-scaling, databases and servers may not manage the sudden load, resulting in dropped requests or slow response times.
Imagine TaskTango signs a deal with VeryBigCorp, leading to the onboarding of all their employees (~100,000). WorkOS sends a sudden burst of webhooks to TaskTango. The application's ability to handle this traffic spike depends on its architecture.
Key factors include:
- Whether proper web-server auto-scaling policies are in place.
- Whether the database can gracefully manage a sudden spike in concurrent connections.
- Whether any write-through caching strategy is designed to avoid lock-contention delays and fails open appropriately when needed.
All these elements need to be in place to manage such a large influx of requests effectively.
Lack of audit trail
Typically, vendors do not provide a built-in history of webhooks. It falls on the developer to keep track of webhook events. The lack of audit trails creates challenges in scenarios like:
- Incident remediation: Without a detailed history, it becomes difficult to identify why certain events failed or went unprocessed.
- Data loss recovery: Reconstructing lost or inconsistent data states is nearly impossible without a record of every event.
Applications are often forced to rely on the current state or a snapshot in time of the system to reason about certain behaviors. While this approach might suffice in most cases, debugging complex bugs introduced by the issues mentioned in the “Out of Order Updates” or “Dependent webhooks” sections requires a detailed analysis or replay of every event that led to the current state.
Consider the example from the “Out of Order Updates” section: a customer reports a bug about a user who no longer works at VeryBigCorp still being active in TaskTango. Without an audit trail of every event, the TaskTango developer isn’t able to see that the `user.deleted` webhook was silently ignored because it was processed before the `user.created` webhook.
Syncing data with the Events API
The Events API is a strictly ordered list of immutable events exposed through a paginated API. Each event includes a sortable ID, type, and data payload. The API ensures events are returned in a consistent order using cursor pagination. Developers need to track the latest event ID to maintain their position in the event stream.
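Conceptually, each event and each page of results might look like this (a sketch of the shapes implied above, not the exact WorkOS schema):

```typescript
// Sketch of the shapes implied by the description above; illustrative only.
interface Event {
  id: string;          // sortable ID, e.g. time-ordered
  event: string;       // event type, e.g. "user.created"
  created_at: string;  // ISO-8601 timestamp
  data: unknown;       // event payload
}

interface EventsPage {
  data: Event[];                            // events in strict order
  list_metadata: { after: string | null };  // cursor for the next request (assumed field name)
}
```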
Key Features
Strict Order
All the events at the producer are stored and served via a paginated API in strict order. This design choice addresses the core webhook flaws described in “Out of Order Updates” as follows:
- Network Latency: The unpredictability in webhooks is addressed by allowing the consumer application (TaskTango) to query the Events API in a serial fashion. The application uses a single communication channel at any given time, thus eliminating any concerns of race conditions caused by parallel requests.
- Consumer Errors: Since control lies within a single process making the Events API call, the consumer application can enforce strict ordered processing of events. For example, the consumer process can proceed to the next event in the response or make the next paginated call only if the previous event is successfully processed.
- Dependent events: With the aforementioned points, the strict ordering inherently addresses this webhook flaw. The application can always be sure that any dependent objects referenced in a given event have already been successfully processed earlier; otherwise, it would not have reached this event at all.
Controlled Throughput
By using a polled approach, the API allows consumer applications to control data throughput according to their capacity. This solves the thundering herd problem by letting consumers request and process data at a manageable pace.
All the concerns discussed in the “Spiky Throughput” section are addressed by giving throughput control to the application. Whether a SmallCorp or VeryBigCorp is onboarded, the processing pace remains consistent. Developers can make a conscious choice to improve processing speed by vertically scaling to faster instances or sharding the API query per tenant.
To serve as a real-time alternative to webhooks, the Events API supports continuous polling for near real-time data syncing. Even more real-time alternatives are discussed later in the “Future Extensions” section.
Audit Trail
The design of the Events API requires the producer to store an ordered list of events. This indirectly acts as a historical log of event data, helping facilitate debugging and data recovery. Developers can reprocess event history without having to store historical event data themselves.
Another key design choice is the use of cursor pagination, allowing the latest event ID to act as a bookmark in the event stream. This feature is especially helpful for resuming from a partially processed state caused by consumer errors.
Sample Integration
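A minimal polling worker could look like the sketch below. The endpoint URL, query parameter, and response shape are assumptions based on the description above, and cursor persistence is reduced to stubs; it is not a drop-in WorkOS client.

```typescript
import { setTimeout as sleep } from "node:timers/promises";

const API_URL = "https://api.workos.com/events"; // assumed endpoint, for illustration
const API_KEY = process.env.WORKOS_API_KEY!;

// Cursor persistence stubs; a real worker would read and write a database row.
async function loadCursor(): Promise<string | null> { return null; }
async function saveCursor(cursor: string): Promise<void> {}

// Application-specific event handling (grant/revoke access, update roles, ...).
async function handleEvent(event: { id: string; event: string; data: unknown }): Promise<void> {}

async function pollEvents(): Promise<void> {
  let cursor = await loadCursor();

  while (true) {
    const url = new URL(API_URL);
    if (cursor) url.searchParams.set("after", cursor); // resume from the bookmark

    const res = await fetch(url, { headers: { Authorization: `Bearer ${API_KEY}` } });
    const page = await res.json();

    for (const event of page.data) {
      await handleEvent(event);    // process strictly in order, one event at a time
      await saveCursor(event.id);  // bookmark progress only after successful processing
      cursor = event.id;
    }

    // No new events: wait before polling again for near real-time syncing.
    if (page.data.length === 0) {
      await sleep(5_000);
    }
  }
}

pollEvents();
```

Because the cursor is saved only after an event is handled successfully, a crashed or restarted worker resumes from the last processed event rather than skipping or reprocessing out of order.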
Technical Design
While this blog primarily aims to guide application developers in making informed decisions by evaluating the tradeoffs between webhooks and the Events API, we also hope it encourages SaaS vendors to consider offering similar alternatives to webhooks. With that intention in mind, let's dive deeper into some of the technical choices WorkOS made as part of the Events API architecture.
Database
The producer must persist events in the critical path alongside the code that updates the API objects. Any database supporting ACID transactions to ensure atomicity and consistency of writes should suffice; WorkOS chose PostgreSQL.
The events table should ideally be stored in the same database as the other API object tables. This ensures that API object updates and event capture occur within a single atomic transaction. This can be challenging in a micro-service architecture where different services host their own databases, and events are captured in a centralized service/database via a message queue. In such scenarios, it’s critical to ensure the event-capture message is successfully queued before committing the API object update transaction.
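For instance, with a single relational database the object write and the event write can share one transaction. The sketch below uses a generic query client, and the table and column names are made up for illustration:

```typescript
// Sketch: persist the API object change and its event in one atomic transaction.
// `db` stands in for any client exposing raw queries (e.g. a node-postgres pool);
// table and column names are illustrative.
async function createUserWithEvent(
  db: { query: (sql: string, params?: unknown[]) => Promise<unknown> },
  user: { id: string; email: string }
): Promise<void> {
  await db.query("BEGIN");
  try {
    await db.query("INSERT INTO users (id, email) VALUES ($1, $2)", [user.id, user.email]);
    await db.query(
      "INSERT INTO events (event, data, created_at) VALUES ($1, $2, now())",
      ["user.created", JSON.stringify(user)]
    );
    await db.query("COMMIT"); // either both rows are written, or neither is
  } catch (err) {
    await db.query("ROLLBACK");
    throw err;
  }
}
```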
The access pattern of the Events API is generally read-heavy, focusing on date-based queries where the most recent dates receive the highest traffic. Older dates are typically needed only for audit or reconciliation purposes, and extremely old events can be moved to cold storage or dropped for compliance. It's crucial to optimize the table design to account for this pattern.
WorkOS chose to horizontally partition the events table by date using PostgreSQL's native partitioning. This approach simplifies retention policies and improves query performance. However, partitioning also imposes limitations, such as the necessity for queries to include the partition key and the inability to maintain a global unique index.
Given that events need to be captured in the critical path and the system must be scalable to handle high read throughput, it is recommended to decouple this dependency by introducing read replicas and directing all read traffic to them. The replication lag introduced by read replicas should not be a significant concern, as the events will remain strictly ordered, albeit with a slight delay in the availability of fresh data.
API
To handle high throughput efficiently, the event data is stored in a format similar to the API schema, avoiding complex data transformations and relations.
The Events API was designed with limited retention in mind, focusing on low latency and scalability. Retention limits are enforced by the API to ensure consistency.
Deployment recommendations
Serial execution
To ensure serial event processing, WorkOS recommends starting with a single worker to handle events. Deploying a dedicated worker for event handling simplifies and streamlines event consumption.
Scale
Determining an effective sharding mechanism can significantly enhance event processing throughput by enabling parallelization. For example:
- Dependent events could be scoped within a single organization or tenant. You can rely on the Events API to provide native filters, such as `/events/{organization_id}`.
- There could be mutually exclusive events generated from different products or use cases with no overlap.
The consumer application can spawn a separate worker for each logical shard, thereby processing multiple event streams in parallel.
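For example, the consumer could run one polling worker per organization, as in this sketch (pollEventsForOrganization stands in for a per-tenant variant of the polling loop shown earlier):

```typescript
// One polling worker per logical shard (here, per organization).
async function pollEventsForOrganization(organizationId: string): Promise<void> {
  // Stand-in for a per-tenant variant of the earlier polling loop:
  // filter the Events API by organization and track a separate cursor per organization.
}

const organizationIds = ["org_alpha", "org_beta", "org_gamma"]; // illustrative tenant IDs

async function main(): Promise<void> {
  // Shards run in parallel; events within each shard are still processed serially.
  await Promise.all(organizationIds.map((orgId) => pollEventsForOrganization(orgId)));
}

main();
```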
Handling replay side-effects
In some cases, it may be necessary to go back in time and “replay” events. When designing your application's event-handling logic, make sure it can safely accommodate a replay without undesired side effects.
Separate your app’s data handlers from transactional business logic like sending emails, communicating with third-party APIs, and so on. Keeping data handling separate allows replaying events to sync data state without triggering those side effects again.
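One way to structure that separation, sketched with placeholder handler names:

```typescript
// Sketch: keep data syncing separate from one-time side effects so events can be
// replayed safely. Function bodies are placeholders for application logic.
async function applyUserCreated(user: { id: string; email: string }): Promise<void> {
  // Pure data handling: upsert the user record. Safe to run again on replay.
}

async function onUserCreatedSideEffects(user: { id: string; email: string }): Promise<void> {
  // Transactional business logic: welcome emails, calls to third-party APIs, etc.
}

async function handleUserCreated(
  user: { id: string; email: string },
  isReplay: boolean
): Promise<void> {
  await applyUserCreated(user);
  if (!isReplay) {
    await onUserCreatedSideEffects(user); // skip side effects when replaying history
  }
}
```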
Future Extensions
WorkOS is exploring additional features to enhance the Events API:
- Long Polling: Long polling allows the client to hold an HTTP connection open if no new events are available. When new events become available, the API immediately returns with new data, and the client requests the next set of events or times out.
- Server-Sent Events: As an alternative to long polling, server-sent events allow the client to keep a persistent connection open while the server continuously returns events.
By addressing the limitations of webhooks and offering a robust, scalable alternative with the Events API, WorkOS enhances the developer experience and ensures reliable data synchronization between systems.