First incident report, sort of

2021 Week 41 · 12 min read

giscus 💎

Someone wrote a blog post and submitted it to the "orange website" on Monday. Apparently, the post was quite controversial and gained a lot of traction.

That blog happens to have giscus installed on it. Well, that explained the sudden traffic surge I saw on my Vercel dashboard.

[Image: sudden traffic spike on Oct 4 in the Vercel dashboard]

How did I notice this? Well...

A GitHub employee raised an issue on giscus' repository, notifying me that there had been "an unusually high number of requests" to GitHub's app access token API that day.

Back when I was just starting to work on giscus, I wrote about how GitHub's Discussions API is only available through GraphQL. I also wrote about the workaround needed so that it can be accessed without signing in.

Much of that code is still running today. I had modified it so that instead of using my own account's giscus installation, I used the installation of the repository's owner. An installation can be used on multiple repositories within the same account. Thus, I also scoped the token so that it only has access to the specified repository.

Unfortunately, as I wrote in that post, the token only lasts for one hour. Note that giscus was built using Next.js and its serverless functions, without any external database whatsoever. That means I couldn't keep the token across multiple API calls.

Rather than finding a way to store the token for its lifetime (one hour), I chose the easy way: just request a new token from GitHub whenever there's an unauthenticated request 🤷

And for the most part, it worked.

Well, until that traffic surge happened. Every request to that blog post would also lead to a request to giscus. Given that most visitors are not signed in, that request would also lead to another request to GitHub. Requests, actually:

  • A request to get the installation ID for the specified repo.
  • A request to get the app access token for the installation.
  • A request to get the discussion data using the access token.
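
To make that more concrete, the chain for an unauthenticated visitor looks roughly like this. This is a simplified sketch, not the actual giscus code: getAppJwt and the query parameter are placeholders, and error handling is omitted.

// Simplified sketch of the per-request chain for an unauthenticated visitor.
// `getAppJwt` is a placeholder for signing the GitHub App JWT.
declare function getAppJwt(): string;

async function getDiscussionUnauthenticated(owner: string, repo: string, query: string) {
  const jwt = getAppJwt();
  const headers = { Authorization: `Bearer ${jwt}`, Accept: 'application/vnd.github.v3+json' };

  // 1. Get the installation ID for the repository.
  const installation = await fetch(
    `https://api.github.com/repos/${owner}/${repo}/installation`,
    { headers },
  ).then((res) => res.json());

  // 2. Create an app access token for the installation, scoped to the repository.
  const { token } = await fetch(
    `https://api.github.com/app/installations/${installation.id}/access_tokens`,
    { method: 'POST', headers, body: JSON.stringify({ repositories: [repo] }) },
  ).then((res) => res.json());

  // 3. Use the token to fetch the discussion data from the GraphQL API.
  return fetch('https://api.github.com/graphql', {
    method: 'POST',
    headers: { Authorization: `bearer ${token}` },
    body: JSON.stringify({ query }),
  }).then((res) => res.json());
}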

And those requests are doubled because... I want to match GitHub's pagination style, where the first and last pages are loaded immediately and there's a "Load more..." button in between them. To do this, I could go with one of these options:

  1. Search for the matching discussion and fetch its metadata first. Then, check the number of comments to determine whether we need to fetch only the first page or the last page as well. This kinda goes against the point of using GraphQL, which lets you search for and get the discussion's details within the same request.

  2. Search for the matching discussion and immediately fetch its metadata and the first page of the discussion's comments, all within the same request. Then, fetch the last page as necessary (based on the metadata). This is slightly better than the first option, but also slightly worse: you will have to wait for the first page's results to be returned before you send another request for the last page.

    You can defer the rendering until the last page is loaded, but it takes some time because the first and last page queries are done sequentially. If you don't defer the rendering, the last page will appear a few seconds after the first page is shown, which isn't ideal.

  3. Search for the matching discussion and immediately fetch its metadata and comments using two separate requests, one for the first page and one for the last page. Both requests are done in parallel, so this will be faster than the second option. The downside is that you always fire the request for the last page even if there's only one page, because you don't know the number of pages ahead of time.

  4. I'm not sure if I had thought about this option before; maybe I hadn't until just now. With GraphQL, we can fetch the first and last pages within the same request. The downside is that you will have two separate fields in the result, one for the first page and one for the last page. It's a bit unnatural to work with this type of data, especially when you come from REST.

    Plus, it doesn't make much sense to have those two fields when you fetch the middle page later on. I need the response data to have a consistent structure so that I don't have to define separate TypeScript definitions and handle the differences.

I initially implemented the second option, but then I changed it to the third option because I saw that the speed difference was noticeable. The downside is that now the requests are doubled.
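
In code, the third option boils down to firing both page queries at the same time. Here's a rough sketch; fetchFirstPage and fetchLastPage are hypothetical helpers (each sending one GraphQL request), not the actual giscus functions:

// Hypothetical helpers: each sends one GraphQL request for its page of comments.
declare function fetchFirstPage(token: string, term: string): Promise<unknown>;
declare function fetchLastPage(token: string, term: string): Promise<unknown>;

// Option 3: fetch the first and last pages of comments in parallel.
async function loadDiscussion(token: string, term: string) {
  const [firstPage, lastPage] = await Promise.all([
    fetchFirstPage(token, term), // search result, metadata, and first page of comments
    fetchLastPage(token, term), // last page of comments, sent even if there's only one page
  ]);

  // When there's only one page, the "last page" result just duplicates the
  // first page and can be discarded.
  return { firstPage, lastPage };
}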

And it's even worse because I also enabled SWR's revalidate on focus. This means that if a visitor is reading the blog post and switching back and forth between tabs, it will trigger a revalidation and send the first and last page requests again. I'm still a bit hesitant to disable the feature because it's nice to be able to automatically update the discussion data without having to reload the page.
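
For context, this is the SWR option in question. A generic example, not giscus' actual hook; the endpoint is made up:

import useSWR from 'swr';

const fetcher = (url: string) => fetch(url).then((res) => res.json());

function Discussion() {
  // With revalidateOnFocus enabled (the default), switching back to the tab
  // re-runs the fetcher, which re-sends the first and last page requests.
  const { data } = useSWR('/api/discussion', fetcher, {
    revalidateOnFocus: true, // set to false to stop refetching on tab focus
  });

  return null; // render the comments using `data` in the real component
}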

Combine all of those with the traffic surge, and what do you get?

You get rate limited.

Well, thankfully, because the rate limit is imposed on the installation (not on giscus itself), other sites continued to work just fine.

However, I missed another thing about the rate limit: handling the error itself.

GraphQL always responds with a 200 OK HTTP status code, even when it throws an error. To check for errors, we need to look into the response data. Based on my experience, the 401 Unauthorized error looks like this for the GraphQL API:

{
  "message": "Bad credentials",
  "documentation_url": "https://docs.github.com/graphql"
}

For the REST API, it's the same thing but the documentation_url points to /rest instead of /graphql.

Now, for the GraphQL API rate limit, I've never hit it myself because:

  • the GraphQL API requires authentication, which gives you
  • a rate limit of 5,000 points per hour, which is very generous for my use cases.

I couldn't find the docs for the rate limit error message, and I didn't want to spam GitHub until I got the message. So, based on the similarities between the GraphQL and REST API errors for the 401 error, I tried hitting the rate limit for the REST API instead. The REST API can be used without authentication with a limit of 60 requests/hour, which is pretty easy to hit.

The REST API rate limit error message looks like this:

{
  "message": "API rate limit exceeded for X.X.X.X. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)",
  "documentation_url": "https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting"
}

So, I assumed for the GraphQL API, it would be more or less the same. Maybe it would be something like this:

{
  "message": "API rate limit exceeded for <scope>",
  "documentation_url": "https://docs.github.com/en/graphql/overview/resource-limitations"
}

Thus, I handled the error this way:

if ('message' in data) {
  // Handle the error,
  // because I don't expect `message` to be present in a successful `data`
}

// Proceed to process `data`

Should be fine, right?

Wrong.

It turns out the response data looks like this:

{
  "errors": [
    {
      "type": "RATE_LIMITED",
      "message": "API rate limit exceeded for installation ID 12345678."
    }
  ]
}

So when I proceeded to process the data, it errored out because the properties I was expecting didn't exist. That explained the crashes shown on my Vercel dashboard.

I should've done something like this instead:

if ('discussion' in data) {
  // Process the data and return the result
}

if ('message' in data) {
  // Handle known errors (e.g. Bad credentials) and return an error response
}

// Return an unknown error response

That would've prevented the crashes even though I didn't know what the rate limit error looked like.

Granted, I could've experimented more with the GraphQL API and remembered that it mostly returns error messages in the form of { "errors": object[] }. However, in my experience, that only happens when your query is invalid against GitHub's GraphQL API schema. I was fairly confident that my queries would never hit that error format because they conform to the schema.

I also thought the rate limit error would use the { "message": string, "documentation_url": string } format because rate limiting should be something generic and not GraphQL-specific, just like the 401 Unauthorized error. I guess because the GraphQL API rate limit needs to be calculated in points rather than handled in the HTTP headers themselves, they decided to use the GraphQL error message format instead.

But the error message formats aren't in their GraphQL API schema anyway, so how was I supposed to know? 🤷

Well, lesson learned. I've opened an issue on GitHub's docs repository.


Anyway, to reduce the number of requests sent to GitHub, I finally implemented a way to cache the tokens. To do this, I created a database table on Supabase to store the access tokens per installation and their expiration date. The table looks like this:

[Image: table with five columns: installation_id, token, expires_at, created_at, and updated_at]

The created_at and updated_at columns are there just in case.
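
For reference, a row in that table roughly maps to this shape (my own sketch of the types, with timestamps as ISO strings):

// Rough shape of a row in the token cache table; column types are my guess.
interface InstallationToken {
  installation_id: number;
  token: string;
  expires_at: string; // ISO timestamp
  created_at: string;
  updated_at: string;
}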

Anyway, Supabase's auto-generated API is built on top of PostgREST, which is:

a standalone web server that turns your PostgreSQL database directly into a RESTful API. The structural constraints and permissions in the database determine the API endpoints and operations.

That sounds really cool and convenient, and it is. Supabase provides unlimited API requests and 500MB of database space on their free tier. Very generous!

After creating a database table, you immediately get a REST API to operate on that table. Without much configuration, I was able to get it to work for giscus.
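
For example, a cached row could hypothetically be fetched with a plain HTTP request like this. The project URL, table name, and key are placeholders, not the real giscus setup:

// Hypothetical example: PostgREST exposes each table at /rest/v1/<table>,
// with filters expressed in the query string (installation_id=eq.<value>).
const SUPABASE_KEY = process.env.SUPABASE_KEY!;

async function fetchCachedToken(installationId: number) {
  const res = await fetch(
    `https://your-project.supabase.co/rest/v1/installation_tokens` +
      `?installation_id=eq.${installationId}&select=token,expires_at`,
    {
      headers: {
        apikey: SUPABASE_KEY,
        Authorization: `Bearer ${SUPABASE_KEY}`,
      },
    },
  );
  return res.json(); // e.g. [{ token: '...', expires_at: '...' }]
}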

Now, for unauthenticated requests, giscus will:

  • ask GitHub for the installation ID using the repository owner and name, then
  • fetch the token from the Supabase table based on the installation ID, then
  • if the token's remaining lifetime is longer than 5 minutes, use it. Otherwise,
  • request a new token from GitHub, store it in the Supabase table, and use it.

This will prevent giscus from requesting too many tokens in an hour. It will still need to send requests to GitHub to get the installation ID, though. That's because I don't cache the mapping between the repository and the installation ID, for a few reasons I highlighted in this comment. I might elaborate more and revisit that decision later.

For now, I think it's already much better than not caching the tokens at all. Besides, my educated guess is that GitHub doesn't put a rate limit (or puts a really loose one) on the installation API.
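
Put together, the caching flow looks roughly like this with supabase-js. The table name and the createInstallationToken helper are placeholders, not the actual giscus code:

import { createClient } from '@supabase/supabase-js';

// Placeholder helper that requests a fresh app access token from GitHub.
declare function createInstallationToken(
  installationId: number,
): Promise<{ token: string; expires_at: string }>;

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);
const FIVE_MINUTES = 5 * 60 * 1000;

async function getToken(installationId: number): Promise<string> {
  // Look up a cached token for this installation.
  const { data } = await supabase
    .from('installation_tokens')
    .select('token, expires_at')
    .eq('installation_id', installationId)
    .limit(1);

  const cached = data?.[0];

  // Reuse the cached token if it still has more than 5 minutes left.
  if (cached && new Date(cached.expires_at).getTime() - Date.now() > FIVE_MINUTES) {
    return cached.token;
  }

  // Otherwise, request a new token from GitHub and store it in the table.
  const fresh = await createInstallationToken(installationId);
  await supabase
    .from('installation_tokens')
    .upsert({ installation_id: installationId, ...fresh });

  return fresh.token;
}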

I'm thinking of writing a proper post on how I set up Supabase to cache the tokens for giscus.


Someone submitted a PR to lay the groundwork for internationalization (i18n) support in giscus. That was fantastic! Someone else had opened an issue about it a while ago, and it was definitely something I wanted (and needed) to tackle in the future.

Previously, I was undecided on whether to use next-translate or next-i18next. Now that somebody else has made the choice for me, I can start adding proper support for localization (l10n) to giscus.

I've merged the PR. I will need to add a configuration step for the language, update the contributing guide, and add Indonesian translations. (Psst, you can beat me to it if you want. Hacktoberfest, anyone?)

Work

We finished the first sprint for the new project, and I've done all the tasks for the first part. We'll continue next week. I also submitted a request to take days off on October 21 and 22. We have a national holiday on October 20, so I'll have Wednesday through Sunday off, which is great.

I still had my TA duties on the weekend. It's rather tiring, but there's no way to back out now.

Entertainment & Friends

Sam couldn't make it to our weekly Friday session because he was preparing for a recruitment test and an IELTS test. We also didn't continue playing Resident Evil 6 because of that.

The good news is, his team won the bronze award in the competition. The competition had been running for the past few months, so I'm glad that their work paid off.

I ended up having the weekly session with my other friend. Let's just call him Sean from now on; it's not his real name, but it's very close to it. He's the one who went to medical school.

We talked about what we do these days. He's currently tasked with writing medical reports on patients and discussing them with professional doctors. Meanwhile, I talked about the stuff I had to implement at work.

Somehow that also led to me explaining what query params are and how they work. For someone with a non-technical background, he seemed to understand it quickly. Well, he's always been a smart one, and I'm so grateful to be his friend.

I also let him pick a few songs and watch me try to play them on the instrument.

On Sunday, I played Final Fantasy XIV again for around two hours. Still fun!