Why you should rethink your webhook strategy

Developers favor webhooks for their ease of implementation and similarity to web endpoints. However, it's crucial to balance simplicity and data integrity, like choosing between TCP for reliability and UDP for speed. For SaaS vendors, offering more alternatives to webhooks will give developers the flexibility to meet diverse application needs.


Webhooks are a crucial feature in modern application integrations, providing real-time feedback by sending data to other systems as soon as an event occurs. They are widely used for their simplicity and immediacy, making them a popular choice for developers who need instant updates. Despite their advantages, webhooks have significant limitations.

For example, what happens when webhooks fail to deliver updates in the right order? Or when they flood your system with too many requests at once? Imagine the chaos that ensues when your application processes outdated or incorrect data, leading to security risks, incorrect application state, and bugs that are difficult to diagnose and resolve.

In this article, we will introduce an alternative solution: the Events API. This API is designed to ensure reliable, orderly, and efficient data synchronization, making it a superior choice for managing real-time data updates.

For illustration purposes, consider a simple project management application (TaskTango) that integrates with WorkOS to manage users and group memberships. TaskTango consumes lifecycle events such as user.<created|updated|deleted> and group.<user_added|user_removed> to:

  1. Grant the user access to the tool upon account creation.
  2. Update their role to ensure they have the appropriate access to the necessary projects.
  3. Revoke access when the customer removes a user.

Webhooks

A typical webhook integration, consuming webhooks from WorkOS, looks like this:

  1. WorkOS exposes events as user.<created|updated|deleted> and group.<user_added|user_removed> for the application to subscribe to.
  2. Sample user and group event payloads look like this:
      
// User payload
{
  "id": "event_01ECAZ4NV9QMV47GW873HDCX74",
  "event": "user.created",
  "user_id": "user_01EZTR6WYX1A0DSE2CYMGXQ24Y",
  "organization_id": "org_01EHZNVPK3SFK441A1RGBFSHRT"
  "email": "bob@foocorp.com",
  "role": "manager",
  "created_at": "2021-06-25T19:07:33.155Z",
  "updated_at": "2021-07-10T07:18:14.298Z"
}

// Group payload
{
  "id": "event_01E1JG7J09H96KYP8HM9B0G5SJ",
  "event": "group.user_added",
  "group_id": "group_01ECAZ4NV9QMV47GW873HDCX74",
  "user_id": "user_01EZTR6WYX1A0DSE2CYMGXQ24Y",
  "organization_id": "org_01EHZNVPK3SFK441A1RGBFSHRT"
  "created_at": "2021-06-26T20:04:12.782Z",
  "updated_at": "2021-06-29T14:03:19.491Z"
}
      
      
  3. A TaskTango developer implements a publicly accessible web endpoint at /webhooks. The endpoint filters for the event type and appropriately handles the payload according to the application’s requirements.

Here is the code snippet for reference, using Express:

      
import express, { Request, Response } from 'express';
import { WorkOS } from '@workos-inc/node';

const app = express();
app.use(express.json());

const workos = new WorkOS(process.env.WORKOS_API_KEY);

app.post('/webhooks', async (req: Request, res: Response) => {
  // Verify the signature header and parse the event payload
  const webhook = await workos.webhooks.constructEvent({
    payload: req.body,
    sigHeader: req.headers['workos-signature'] as string,
    secret: process.env.WORKOS_WEBHOOK_SECRET,
  });

  switch (webhook.event) {
    case 'user.created':
    case 'user.updated':
    case 'user.deleted':
      // Business logic with contents of webhook.data
      break;
    default:
      // Unsupported event type
      res.sendStatus(400);
      return;
  }

  res.sendStatus(200);
});
      
      
  4. The TaskTango developer then enables the aforementioned WorkOS events to be delivered to the web URL from step 3.

Challenges

Out of Order Updates

Webhooks do not always arrive in the order they are generated. Even when the originating application sends them in order, each webhook can arrive at and be processed by the consuming application at a different time. Several factors account for this behavior:

  1. The earlier event traverses a more congested path on the open internet, taking longer to arrive at the application’s endpoint.
  2. The consuming application’s load balancer routes the earlier event to a busier web server, increasing processing latency.

No application is flawless, and webhook consumers can encounter intermittent errors:

  1. The web server handling the earlier event can crash due to an out-of-memory error, triggering a 500 response. Modern webhook delivery systems will retry failed deliveries, often introducing delays that increase the likelihood of out-of-order webhooks.
  2. The web server handling the earlier event can be forcefully killed or rotated by the container orchestrator.
  3. The web server handling the earlier event can be preempted by the operating system’s thread/process scheduler.

This lack of causal order can disrupt the application’s data integrity. If a newer update arrives before an older one, outdated information may overwrite recent changes. Tracking update times or making API calls to fetch the latest state can help mitigate these issues, but these workarounds add complexity and latency to the integration. One of them, a timestamp guard, is sketched below.
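
A last-write-wins guard persists each entity’s updated_at and discards any webhook whose payload is older than the stored state. A minimal sketch, where findUser and saveUser are illustrative helpers rather than part of the WorkOS SDK:

async function applyUserUpdate(payload) {
  const existing = await findUser(payload.user_id);

  // Discard stale webhooks: only apply the payload if it is newer
  // than what is already stored.
  if (existing && new Date(payload.updated_at) <= new Date(existing.updatedAt)) {
    return;
  }

  await saveUser({
    id: payload.user_id,
    email: payload.email,
    role: payload.role,
    updatedAt: payload.updated_at,
  });
}

Note that deletes still need special handling (for example, tombstone records), since a deleted row leaves nothing to compare timestamps against, as the next example shows.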

In the example integration, let’s consider a case where WorkOS issues a user.created webhook event followed by user.deleted. However, due to the aforementioned reasons, TaskTango processes user.deleted before user.created. Since there is no user to delete, the user.deleted event is silently ignored. To make things worse, the subsequent user.created event may be successfully processed, leading to a loss of data integrity. As a result, TaskTango has an unintended active user lurking in the system, compromising the application’s security.

Dependent webhooks

Out-of-order updates are not a concern for mutually exclusive webhooks that do not reference the same entities. For example, two user.created events for different users can be processed out of order without compromising data integrity, as their payloads do not share common entities.

However, a webhook sometimes references another recently created or updated API object. The application must have processed the earlier webhook before it can correctly handle subsequent webhooks that reference those objects.

In our example integration:

  1. Processing any user.updated or user.deleted webhooks requires the application to successfully process user.created beforehand.
  2. Similarly, processing group.user_added or group.user_removed webhooks requires the application to have successfully processed user.created earlier. Otherwise, the user_id mentioned in the payload is meaningless.

These errors manifest as exceptions in consumer applications, forcing retries. Worse, they can fail silently and compromise the data integrity of the system, as the sketch below illustrates.
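
To make the failure mode concrete, here is a sketch of a group.user_added handler; findUser and addUserToGroup are illustrative helpers. If user.created has not been processed yet, the handler must choose between failing loudly (forcing a retry) and failing silently (dropping the membership):

async function handleGroupUserAdded(payload) {
  const user = await findUser(payload.user_id);

  if (!user) {
    // user.created has not been processed yet. Throwing surfaces a 500
    // so the producer retries later; returning silently here would
    // drop the group membership forever.
    throw new Error(`Unknown user: ${payload.user_id}`);
  }

  await addUserToGroup(payload.group_id, user.id);
}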

Spiky throughput

When a high volume of webhooks is delivered in a short period, it can create a "thundering herd" effect. This phenomenon occurs when numerous requests hit the system simultaneously, overwhelming the webhook consumers. The result is a cascade of errors and retries, which further exacerbates the risk of processing webhooks out of order.

Imagine TaskTango signs a deal with VeryBigCorp, leading to the onboarding of all their employees (~100,000). WorkOS sends a sudden burst of webhooks to TaskTango. The application's ability to handle this traffic spike depends on its architecture.

Key factors include:

  • Whether proper web-server auto-scaling policies are in place.
  • Whether the database can gracefully manage many spiky, concurrent connections.
  • Whether any write-through caching strategy avoids lock-contention delays and fails open when needed.

All these elements need to be in place to manage such a large influx of requests effectively.
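
One common mitigation is to decouple receipt from processing: the endpoint verifies and durably enqueues each webhook, acknowledges immediately, and a separate worker drains the queue at a sustainable pace. A minimal sketch, where enqueueWebhook is an assumed helper backed by a durable queue of your choice (SQS, Kafka, or simply a database table):

app.post('/webhooks', async (req, res) => {
  // Keep the request path minimal: verify the signature and enqueue.
  const webhook = await workos.webhooks.constructEvent({
    payload: req.body,
    sigHeader: req.headers['workos-signature'],
    secret: process.env.WORKOS_WEBHOOK_SECRET,
  });

  await enqueueWebhook(webhook); // assumed durable-queue helper
  res.sendStatus(200);
});

// A separate worker dequeues and processes at its own pace,
// independent of the incoming burst rate.

This restores throughput control, but on its own does nothing for ordering: the queue consumer still sees events in arrival order.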

Audit trail

Typically, vendors do not provide a built-in history of webhooks. It falls on the TaskTango developer to keep track of webhook events. Without an inherent event history, crucial activities such as incident remediation or data loss recovery become significantly more challenging.

Applications are often forced to rely on the current state, or a snapshot in time, of the system to reason about certain behaviors. While this approach might suffice in most cases, debugging the complex bugs introduced by the issues described in “Out of Order Updates” and “Dependent webhooks” requires a detailed analysis or replay of every event that led to the current state.

Consider the example from the “Out of Order Updates” section: a customer reports a bug about a user who no longer works at VeryBigCorp still being active in TaskTango. Without an audit trail of every event, the TaskTango developer cannot see that the user.deleted webhook was silently ignored because it was processed before the user.created webhook.

Introducing the Events API

The Events API is a strictly ordered list of immutable events exposed through a paginated API. Each event includes a sortable ID, type, and data payload. The API ensures events are returned in a consistent order using cursor pagination. Developers need to track the latest event ID to maintain their position in the event stream.
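
Conceptually, a page of events looks like the sketch below (shape abridged and illustrative; consult the WorkOS API reference for the authoritative schema). The id of the last event on the page serves as the cursor for the next request:

{
  "data": [
    {
      "id": "event_01E1JG7J09H96KYP8HM9B0G5SJ",
      "event": "user.created",
      "data": { "...": "..." },
      "created_at": "2021-06-25T19:07:33.155Z"
    }
  ],
  "list_metadata": { "after": "event_01E1JG7J09H96KYP8HM9B0G5SJ" }
}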

Key Features

Strict Order

All events are stored at the producer and served via a paginated API in strict order. This design choice addresses the core webhook flaws described in “Out of Order Updates” as follows:

  1. Network Latency: The unpredictability in webhooks is addressed by allowing the consumer application (TaskTango) to query the Events API in a serial fashion. The application uses a single communication channel at any given time, thus eliminating any concerns of race conditions caused by parallel requests.
  2. Consumer Errors: Since control lies within a single process making the Events API call, the consumer application can enforce strict ordered processing of events. For example, the consumer process can proceed to the next event in the response or make the next paginated call only if the previous event is successfully processed.
  3. Dependent events: With the aforementioned points, the strict ordering inherently addresses this webhook flaw. The application can always be sure that any dependent objects referenced in a given event have already been successfully processed earlier; otherwise, it would not have reached this event at all.

Controlled Throughput

By using a polled approach, the API allows consumer applications to control data throughput according to their capacity. This solves the thundering herd problem by letting consumers request and process data at a manageable pace.

All the concerns discussed in the “Spiky Throughput” section are addressed by giving throughput control to the application. Whether a SmallCorp or VeryBigCorp is onboarded, the processing pace remains consistent. Developers can make a conscious choice to improve processing speed by vertically scaling to faster instances or sharding the API query per tenant.

To serve as a real-time alternative to webhooks, the Events API supports continuous polling for near real-time data syncing. Even lower-latency alternatives are discussed later in the “Future Extensions” section.

Audit Trail

The design of the Events API requires the producer to store an ordered list of events. This indirectly acts as a historical log of event data, helping facilitate debugging and data recovery. Developers can reprocess event history without having to store historical event data themselves.

Another key design choice is the use of cursor pagination, allowing the latest event ID to act as a bookmark in the event stream. This feature is especially helpful for resuming from a partially processed state caused by consumer errors.

Sample Integration

      
import { setTimeout } from 'node:timers/promises';
import { WorkOS } from '@workos-inc/node';

const workos = new WorkOS(process.env.WORKOS_API_KEY);

async function processEvents(after) {
  let bookmarkEventID = after || null;

  while (true) {
    const response = await workos.events.listEvents({
      events: [
        'user.created',
        'user.updated',
        'user.deleted',
      ],
      after: bookmarkEventID,
      limit: 100
    });

    const events = response.data;

    if (events.length === 0) {
      // No new events. Sleep for 1 minute...
      await setTimeout(60000);
    } else {
      for (const event of events) {
        try {
          switch (event.event) {
            case 'user.created':
              handleUserCreated(event);
              break;
            case 'user.updated':
              handleUserUpdated(event);
              break;
            case 'user.deleted':
              handleUserDeleted(event);
              break;
            default:
              console.log(`Unhandled event type: ${event.event}`);
          }
        } catch (error) {
          // Persist the last successfully processed event ID so the
          // stream can resume from this point once the error is resolved
          persistBookmarkEventID(bookmarkEventID);
          return; // Exit the loop to avoid further processing
        }

        bookmarkEventID = event.id;
      }

      // Persist the latest processed event ID
      persistBookmarkEventID(bookmarkEventID);
    }
  }
}

function handleUserCreated(event) {
  // Process user created event
}

function handleUserUpdated(event) {
  // Process user updated event
}

function handleUserDeleted(event) {
  // Process user deleted event
}

function persistBookmarkEventID(bookmarkEventID) {
  // Durably persist this information somewhere
}

function loadBookmarkEventID() {
  // Load the bookmarkEventID from persistent store, else return null
}

// Load the bookmarkEventID and start processing events
const bookmarkEventID = loadBookmarkEventID();
processEvents(bookmarkEventID).catch(err => {
  console.error('Error processing events:', err);
});
      
      

Technical Design

While this blog primarily aims to guide application developers in making informed decisions by evaluating the tradeoffs between webhooks and the Events API, we also hope it encourages SaaS vendors to consider offering similar alternatives to webhooks. With that intention in mind, let's dive deeper into some of the technical choices WorkOS made as part of the Events API architecture.

Database

The producer must persist events in the critical path alongside the code that updates the API objects. Any database supporting ACID transactions to ensure atomicity and consistency of writes should suffice; WorkOS chose PostgreSQL.

The events table should ideally live in the same database as the other API object tables. This ensures that API object updates and event capture occur within a single atomic transaction. This can be challenging in a microservice architecture where different services host their own databases and events are captured in a centralized service/database via a message queue. In such scenarios, it’s critical to ensure the event-capture message is successfully queued before committing the API object update transaction.
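
For illustration, here is a minimal producer-side sketch using node-postgres; the table and column names are illustrative, not WorkOS’s actual schema:

import { Pool } from 'pg';

const pool = new Pool();

async function updateUserRole(userId, eventId, role) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');

    // 1. Mutate the API object.
    await client.query(
      'UPDATE users SET role = $1, updated_at = now() WHERE id = $2',
      [role, userId],
    );

    // 2. Capture the event in the same transaction, so the object update
    //    and its event become visible atomically.
    await client.query(
      'INSERT INTO events (id, event_type, data) VALUES ($1, $2, $3)',
      [eventId, 'user.updated', { user_id: userId, role }],
    );

    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}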

The access pattern of the Events API is generally read-heavy, focusing on date-based queries where the most recent dates receive the highest traffic. Older dates are typically needed only for audit or reconciliation purposes, and extremely old events can be moved to cold storage or dropped for compliance. It's crucial to optimize the table design to account for this pattern.

WorkOS chose to horizontally partition the events table by date using PostgreSQL's native partitioning. This approach simplifies retention policies and improves query performance. However, partitioning also imposes limitations, such as the necessity for queries to include the partition key and the inability to maintain a global unique index.
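
A sketch of what date-based native partitioning can look like in PostgreSQL (schema illustrative). Note that the primary key must include the partition key, which is precisely the global-unique-index limitation mentioned above:

CREATE TABLE events (
  id         text        NOT NULL,
  event_type text        NOT NULL,
  data       jsonb       NOT NULL,
  created_at timestamptz NOT NULL,
  PRIMARY KEY (created_at, id)  -- must include the partition key
) PARTITION BY RANGE (created_at);

-- One partition per month; retention becomes a cheap DROP of old partitions.
CREATE TABLE events_2024_07 PARTITION OF events
  FOR VALUES FROM ('2024-07-01') TO ('2024-08-01');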

Given that events must be captured in the critical write path while the system also serves high read throughput, it is advisable to decouple the two by introducing read replicas and directing all read traffic to them. The replication lag introduced by read replicas should not be a significant concern: events remain strictly ordered, albeit with a slight delay in the availability of fresh data.

API

To handle high throughput efficiently, the event data is stored in a format similar to the API schema, avoiding complex data transformations and relations.

The Events API was designed with limited retention in mind, focusing on low latency and scalability. Retention limits are enforced by the API to ensure consistency.

Deployment recommendations

Serial execution

To ensure serial event processing, WorkOS recommends starting with a single worker to handle events. Deploying a dedicated worker for event handling simplifies and streamlines event consumption.

Scale

Determining an effective sharding mechanism can significantly enhance event processing throughput by enabling parallelization. For example:

  1. Dependent events could be scoped within a single organization or tenant. You can rely on the Events API to provide native filters, such as /events/{organization_id}.
  2. There could be mutually exclusive events generated from different products or use cases with no overlap.

The consumer application can spawn a separate worker for each logical shard, thereby processing multiple event streams in parallel.
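
A sketch of that fan-out, assuming processEvents and loadBookmarkEventID from the sample integration are extended to accept an organization scope, and that the Events API exposes an organization filter (an assumption based on the filter shape suggested above):

// One serial worker per tenant: events within an organization remain
// strictly ordered, while separate organizations progress in parallel.
const organizationIds = await loadOrganizationIds(); // illustrative helper

await Promise.all(
  organizationIds.map(async (organizationId) => {
    const bookmark = await loadBookmarkEventID(organizationId);
    // Hypothetical second argument scoping the event stream to one tenant
    return processEvents(bookmark, { organizationId });
  }),
);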

Handling replay side-effects

In some cases, it may be necessary to go back in time and “replay” events. Design your event-handling logic so that it can safely accommodate a replay without undesired side effects.

Separate your app’s data handlers from transactional business logic such as sending emails or calling third-party APIs. Keeping data handling separate allows you to replay events to re-sync data state without triggering those side effects, as sketched below.
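
A minimal sketch of this separation; upsertUser and sendWelcomeEmail are illustrative helpers:

async function handleUserCreated(event, { isReplay = false } = {}) {
  // Data handling: an idempotent upsert that is always safe to re-run.
  await upsertUser(event.data);

  // Transactional side effects: gated so a replay does not re-send
  // emails or re-notify third-party APIs.
  if (!isReplay) {
    await sendWelcomeEmail(event.data.email);
  }
}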

Future Extensions

WorkOS is exploring additional features to enhance the Events API:

  1. Long Polling: Long polling allows the client to hold an HTTP connection open when no new events are available. When new events arrive, the API returns immediately with the new data, and the client requests the next set of events or times out (see the sketch after this list).
  2. Server-Sent Events: As an alternative to long polling, server-sent events allow the client to keep a persistent connection open while the server continuously returns events.
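
To illustrate how long polling would change the consumer loop, here is a hypothetical sketch; waitSeconds is an invented parameter for illustration only, not part of the current API:

while (true) {
  const response = await workos.events.listEvents({
    after: bookmarkEventID,
    // Hypothetical: the server holds the connection open for up to
    // 30 seconds when no new events are available.
    waitSeconds: 30,
  });

  for (const event of response.data) {
    handleEvent(event); // illustrative dispatch to handlers like those above
    bookmarkEventID = event.id;
  }
}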

By addressing the limitations of webhooks and offering a robust, scalable alternative with the Events API, WorkOS enhances the developer experience and ensures reliable data synchronization between systems.
