nicosommi

Making proactive reactive

November 07, 2021
8 min read

Simplifying existing solutions and creating new ones are among the most important tasks an engineer does in a company.

A simpler solution is more effective than a complex one.

It often involves fewer assumptions.

This fact sounds obvious when you read it.

But in practice it is much harder to detect.

And the lesson may slip away silently if you don't often reflect on the things you do.

In this post, I'll explain a specific situation in which I was able to identify this.

When code is needed, I'll use JS-like pseudocode and sequence diagrams.

Structure of the post:

  • Introduction: Context
  • The problem with the initial state
  • Initial cache pattern
  • Solution 1: Proactive loading
  • Solution 2: Reactive loading
  • Current situation
  • Conclusions

That said, let's jump into it!

Context

I'm working on a marketing website, which is visited daily by more than 250k users. The servers receive more than 15M requests per day, according to the CDN stats.

When you have this kind of load, being up and replying fast is super important, and so is keeping that performance stable across time and locations. This notion is nothing new; it is one of the main goals of Site Reliability Engineering (SRE).

For this reason, it often makes sense to have a cache layer between the CMS (especially when it is a third party) and the Web server.

Also, it is worth mentioning that we always had different kinds of alerts in place (Opsgenie fed by Pingdom, plus Prometheus and Grafana), along with a bunch of metrics to understand the website's performance and load. On top of that, we had the CDN usage stats.

The problem with the initial state

So, we had a cache.

But there was a problem with it.

When we cleared it, we had micro-downtimes. And we cleared it whenever we deployed a new version of the app, or whenever an editor needed their content updated right away.

Note: Without the observability in place, this would have been much harder to detect: it would have required manual tests, plus random chance, plus correlating the results over time to find the pattern behind the problem.

Initial cache pattern

The initial version of the website was oriented more toward providing the editorial platform's functionality, while also providing a solution for caching.

Because of this, the initial implementation was oblivious to the micro-downtime problems under high load.

This is what a sequence diagram of it might look like

%%{init: {'theme': 'dark'}}%%
sequenceDiagram
  actor U as Browser
  participant A as Server
  participant R as Redis
  participant C as CMS
  U->>+A: request '/' route
  A->>+R: request content
  R->>-A: I don't have the content
  A->>+C: request content
  C->>-A: here you go
  A->>-U: here you go
  A->>+R: put content

Clients usually call the cache like this

cache.set(key, async () => {
  const value = await cms.get();
  return value;
});

And the cache.set implementation was roughly something like this

async function set(key, getFunction) {
  let value = await cache.get(key);
  if (!value) {
    value = await getFunction();
    cache.set(key, value); // fire and forget
  }
  return value;
}

This approach was enough for some time, but then some problems started to arise.

Because, after a cache clear, we wait for the CMS response, Node.js holds off on replying to all the currently pending requests.

And we had thousands of requests pending.

So they all wait.

They wait at least 600 ms, because the most basic API request hits 3 CMS endpoints (with cache, of course), and each of those takes around 200 ms. Some requests went up to 30 seconds or even more when something went out of control on the third party's side.
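To make that arithmetic concrete, here is a small simulation (not our actual code; the endpoint names and delays are illustrative) of three sequential ~200 ms CMS calls adding up to ~600 ms before the server can reply:

```javascript
// Helper: resolve after a given number of milliseconds.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Stand-in for one CMS endpoint call (~200 ms round trip).
async function fetchFromCms(endpoint) {
  await delay(200);
  return { endpoint, ok: true };
}

// The most basic API request hits three endpoints one after another,
// so their latencies add up instead of overlapping.
async function handleRequest() {
  const start = Date.now();
  const page = await fetchFromCms('/page');
  const navigation = await fetchFromCms('/navigation');
  const footer = await fetchFromCms('/footer');
  return { elapsedMs: Date.now() - start, parts: [page, navigation, footer] };
}
```

And that is the happy path for a single request; with thousands of pending requests, they all pay that wait at once.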

In addition, at some point our servers could not accept more requests and started to fail with status codes such as 503 Service Unavailable, socket timeouts, or 502 Bad Gateway. As a result, throughput dropped dramatically, and you could even observe the downtime from a browser.

But... why?

The CMS call is an API call, which means it goes through HTTP, with encryption on top and pointing to a different infrastructure (CMS provider).

We can then guess that, in the most common scenario, calling an API endpoint located in a different infrastructure is slower than calling a Redis database in our own.

We checked that in a Newrelic dashboard.

This dashboard showed us how much time a request spent on each function call. Redis read access is often under 25 ms and, in some rare scenarios, up to 75 ms.

At the same time, third-party calls were around 200 ms, and from time to time spiked to more than 10 seconds (in our particular case, that happened every two or three weeks).

We had to do something to improve this.

Solution 1: Proactive loading

So, what can we do about it?

Well, having the content in the cache in the first place sounded like an option to us.

So, what if we periodically simulate our users and put everything in cache?

This solution looks something like this

%%{init: {'theme': 'dark'}}%%
sequenceDiagram
  actor J as Cronjob
  actor U as Browser
  participant A as Server
  participant R as Redis
  participant C as CMS
  J->>+A: request all routes
  A->>+C: request content
  C->>-A: here you go
  A->>-J: ok, loaded
  U->>+A: request '/' route
  A->>+R: request content
  R->>-A: here you go
  A->>-U: here you go

We saw some light there, and we started to move in this direction.

We called it Proactive load.

The idea was to trigger a command via a cron job every hour or so that would load the most popular content, if not all of it, into the cache.
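A rough sketch of that hourly job might look like this. The route list and helper names are hypothetical, not our real code; the point is that the job simulates user visits so the server's cache fills up as a side effect:

```javascript
// Hypothetical list of the most popular routes to warm.
const routes = ['/', '/pricing', '/about'];

// Request one route on our own server; the server's cache.set path
// stores the CMS response as a side effect of handling the request.
async function warm(route, fetchFn) {
  const response = await fetchFn(route);
  return response.status === 200;
}

// The command the cron job triggers: visit every route in sequence
// and report which ones were warmed successfully.
async function proactiveLoad(fetchFn) {
  const results = [];
  for (const route of routes) {
    results.push({ route, ok: await warm(route, fetchFn) });
  }
  return results;
}
```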

Because we pull the data from a CMS, this is very similar to how Static Site Generation (SSG) works. For example, I used Gatsby, a web framework built around SSG, on my blog and a few other projects.

It is just that instead of generating HTML files, we'd load a bunch of objects into the cache.

When I started programming this, I had to add many new interface methods here and there to manipulate the cache keys and generate all the combinations of the data, which implied lots of new methods and data sources.

It didn't feel right, to be honest. The solution started to smell too complex.

In the meantime, a few other things happened.

We use Next.js for the website, and because of that, I was following its releases. I saw that they shipped a new feature on the framework, supported by their CDN, called Stale-While-Revalidate: a pattern I knew from theory but not so much in practice. They advertised it as even better than SSG. I didn't believe them (or understand them?). For me, Gatsby was the fancy new tool in town, and SSG was the new thing, better than traditional SSR and the old PHP-style approach suggested by Next.js.

Also, I was reading parts of the book "Reactive design patterns".

In this book, I learned a few things about how reactive systems are often superior because they tend to be more efficient.

Now, with this in mind, let's think back to that nasty smell of complexity. It seems related to these recent learnings.

To summarize: with this approach, we'd be running a job every hour and loading content into the cache that maybe is never going to be consumed. So, in the worst scenario, it won't solve the initial issue unless we always generate everything.

Then I remembered the whole Stale-While-Revalidate thing and decided to take a look at it.

Solution 2: Reactive loading

The Stale-While-Revalidate solution looks something like this

%%{init: {'theme': 'dark'}}%%
sequenceDiagram
  actor U as Browser
  participant A as Server
  participant R as Redis
  participant C as CMS
  U->>+A: request '/' route
  A->>+R: request content from layer 1
  R->>-A: I don't have the content
  A->>+R: request content from layer 2
  R->>-A: here you go
  A->>+C: request content if not found
  C->>-A: here you go
  A->>U: here you go
  A->>+A: Revalidate
  A->>+C: request content
  C->>-A: here you go
  A->>R: set content on both layers

There are many changes here.

First and foremost, it adds a second cache layer with a much higher TTL.

The second layer reduces the probability of having to go to the CMS while users wait, which is what caused the micro-downtimes.

It may look like it has the same problem, but the second layer makes the probability of waiting on the CMS so tiny that it is not a problem anymore.

Now, implementing this is super easy. We change the cache algorithm, and that's almost it.

There is no need for simulated requests because it uses the actual user event, making it reactive instead of proactive.
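A minimal sketch of the two-layer algorithm could look like this, assuming layer 1 holds the short TTL, layer 2 the much higher one, and `cms.get(key)` fetches fresh content. The names are illustrative, not our exact implementation:

```javascript
async function getWithSwr(key, layer1, layer2, cms) {
  // Fast path: layer 1 still has a fresh copy.
  const fresh = await layer1.get(key);
  if (fresh) return fresh;

  // Layer 1 expired, but layer 2 (much higher TTL) probably still has it.
  const stale = await layer2.get(key);
  if (stale) {
    // Serve the stale copy immediately and revalidate in the
    // background (fire and forget), so the user never waits on the CMS.
    cms.get(key).then((value) => {
      layer1.set(key, value);
      layer2.set(key, value);
    });
    return stale;
  }

  // Cold start: nothing cached at all, so we must wait for the CMS once.
  const value = await cms.get(key);
  await layer1.set(key, value);
  await layer2.set(key, value);
  return value;
}
```

The user-facing wait on the CMS only happens on a true cold start; every later expiry of layer 1 is absorbed by serving layer 2 while the revalidation runs in the background.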

Now, there is a downside: requests that take the content from layer 2 are technically getting old content. But at least they are getting content instead of an error. And the content is not very old either (only as old as the last request that triggered the revalidation function).

I said almost above because we have a CDN in front of our server. And because the CDN is in charge of caching, when returning a stale response, you need to tell it to cache that response only for a few seconds (roughly the delay of the request to the CMS) instead of the regular expiration, which may be 6 hours, for example. Otherwise, the CDN would keep replying with the stale content for those 6 hours.
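For example, with an Express-style response object, that header logic could be sketched like this. The directive values are illustrative; the exact syntax and TTLs depend on your CDN:

```javascript
// Pick the CDN TTL depending on whether we served stale content.
function cacheControlFor(isStale) {
  // Regular responses can be cached for hours; stale ones only for a
  // few seconds, roughly the duration of the background revalidation.
  return isStale ? 'public, s-maxage=5' : 'public, s-maxage=21600'; // 21600 s = 6 h
}

// Usage (hypothetical Express-style handler):
// res.set('Cache-Control', cacheControlFor(servedFromLayer2));
```

That way, the CDN re-asks our server a few seconds later, by which point the background revalidation has usually refreshed the cache.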

Current situation

The current situation is that we don't have any more micro-downtimes when clearing the cache.

No more micro-downtimes is great because an unstable system has performance and business implications, and it makes developers afraid to deploy new changes, which is not good.

Now we are dealing with other kinds of issues I'll share in the future.

Conclusions

So, how is SWR simpler than SSG?

First of all, the implementation is more straightforward. It involves fewer new pieces and reuses most of the existing ones.

Also, it works from the original user event, instead of assuming that an automated user with a clunky visit pattern (cron + script) could generate something useful.

In this particular case, for me, the complex implementation was the first sign.

My favorite takeaways are:

  • Try to make complex things simpler by choosing solutions with fewer assumptions
  • In reactive systems, it is essential to distinguish between original events and synthetic ones
  • Keep up with reading and learning; it helps you solve problems
  • Have observability in place; get to know your app at runtime

Special thanks to Dan, Dimi, and Vlad for their collaboration during this process.

nico
Copyright by nicosommi 🐱