TypeScript Go was announced last week! A huge step forward for the language & community, as tsgo boasts a 10x performance improvement over the current tsc compiler.
This post was originally supposed to be a puff piece about how much faster the new compiler is. Instead, it turned into a 12-hour debugging session and a pull request that one Microsoft dev described as "cursed."
Where did it all go wrong??
The Backstory
I love TypeScript! The CompassRx stack is TS on both the frontend and the backend, with some special magic to export the API types from the server into the client code. This particular transformation isn't terribly slow, but it does incur a compilation cost of about 350ms at its current size, which is just slow enough to be annoying during development.
When the new TypeScript 7 project was announced, boasting a 10x speedup over the bootstrapped TypeScript compiler, I figured it made for a great opportunity to talk about the changes and how they would affect the Compass engineering team.
With that said, I wanted to be able to profile real code, and the fancy server compilation is hard to discuss without sharing actual implementation details of our backend (something my teammates might not love me for). As a result, I opted to profile it on an even more intensive project from a few years back: jsits.
Crimes against TypeScript
If you're interested in the JSiTS project, I'd recommend my original writeup from when I implemented it. For the sake of this article, however, the important thing to note is that it serves as an end-to-end VM for a subset of JavaScript, implemented entirely in TypeScript's type system.
For instance, you could write the following code and it would compile:
type FiveFactorial = Eval<`
  (limit) => {
    let rec = (val) => {
      if (val > limit) {
        return 1;
      }
      return val * rec(val + 1);
    };
    return rec(1);
  }
`, [5]>;
const fiveFac: FiveFactorial = 120;
Likewise, any other value for fiveFac would give us a type error. Also, as a point of note, Eval is a helper type that simply runs the input through a 4-stage parse & execution pipeline. type Result = Eval<T, A> is equivalent to:
type R1 = Tokenize<T>;
type R2 = Lex<R1>;
type R3 = Parse<R2>;
type Result = Run<R3, A>;
The fact that these two forms are (nearly) identical will be relevant to the upcoming issue.
Shattering my Hopes
I'll be completely honest here — I was fully anticipating this to be a quick, 2-hour writeup of the changes showcasing just how much faster tsgo is. I started out by cloning the nascent repo and patching out the type-instantiation guardrails for both the Go and TS implementations.
Brimming with excitement, I ran my patched variants against the JSiTS codebase, and waited.
And waited.
And waited some more.
Unsurprisingly, the TypeScript version took a few seconds to execute. More surprisingly, the Go version took a lot of seconds to execute. Like nearly 10x more. I ran a few different versions to check the outcome, as shown in this example diagnostic:
Descent into Madness
What?
After a few quick sanity checks to make sure the issue wasn't on my end, the next thing to do was run some deeper diagnostics. tsgo, being written in Go, has access to the native profiling tool pprof. Fortunately for us, the TypeScript team had already instrumented the binary with a profiling option, which one can invoke by passing the --pprofDir flag.
Once this execution completed, the binary spat out two profiles, one for memory usage and one for CPU usage. From there, one could use go tool pprof to examine them and track down what's happening.
The two profiles demonstrated roughly the same data so, without loss of generality, let's focus on the CPU report. Running top5 -cum in pprof listed the top 5 functions by the cumulative time they represent. The data here is telling:
Showing nodes accounting for 0.38s, 6.73% of 5.65s total
Dropped 124 nodes (cum <= 0.03s)
Showing top 5 nodes out of 150
flat flat% sum% cum cum%
0.13s 2.30% 2.30% 3.33s 58.94% ...checker.(*Checker).getObjectTypeInstantiation
0 0% 2.30% 3.33s 58.94% ...checker.(*Checker).instantiateType (partial-inline)
0.22s 3.89% 6.19% 3.33s 58.94% ...checker.(*Checker).instantiateTypeWithAlias
0.03s 0.53% 6.73% 3.33s 58.94% ...checker.(*Checker).instantiateTypeWorker
0 0% 6.73% 3.33s 58.94% ...checker.(*Checker).instantiateTypes (inline)
Regardless of what was happening internally, it's easy to see that the vast majority of our time was spent in type instantiation. (As an aside, it turns out the remaining time was basically entirely spent in the GC.)
Additionally, pprof has a handy command called web that produces a graph of how all these functions interact. You can see the full interactive graph here, but the interesting bit is this:
Ugh — that's a cycle. This isn't going to be easy, is it?
Time to talk Shop
So we're spending all of our time in a vicious cycle of type instantiation. That's great and all, but what exactly is type instantiation?
Some types in TypeScript are simple: type T = string. Some types have a little more complexity: type F<T> = T | string.
When the typechecker is passing around types, it will often do so lazily (e.g. just passing around T as-is). Once it's ready to actually do something with that type, it will go through the expensive process of "instantiating" it. In TypeScript, at least, this process can be thought of as "concretizing": taking a type description that may contain arbitrary variables (called "bindings") and turning it into a specific type that we can work with.
This process may require several steps, as each binding has its own bindings, and so on. Consider this relatively simple type:
type StringOrUndefined<T> = T | undefined;
type Result = NonNullable<StringOrUndefined<"A" | "B">>;
The value of Result goes through several steps of evaluation:
type Result = NonNullable<StringOrUndefined<"A" | "B">>;
{ T_0 -> StringOrUndefined<"A" | "B"> }
type Result = NonNullable<T_0>;
{ T_0 -> StringOrUndefined<"A" | "B"> }
type Result = T_0 extends null | undefined ? never : T_0;
{ T_0 -> T_1 | undefined; T_1 -> "A" | "B" }
type Result = T_0 extends null | undefined ? never : T_0;
{ T_0 -> "A" | "B" | undefined }
type Result = T_0 extends null | undefined ? never : T_0;
type Result = // Remember, conditionals will distribute over unions
  | ("A" extends null | undefined ? never : "A")
  | ("B" extends null | undefined ? never : "B")
  | (undefined extends null | undefined ? never : undefined);
type Result =
  | "A"
  | "B"
  | never;
type Result = "A" | "B";
Phew, that's a lot of work for an identity function. However, we can see why this process is necessary in order to use Result: to be functional, it must first be instantiated.
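To make the walkthrough concrete, here are the same two types from above as runnable code; the commented-out line shows the assignment that instantiation rules out:

```typescript
// The worked example from above: Result instantiates down to "A" | "B".
type StringOrUndefined<T> = T | undefined;
type Result = NonNullable<StringOrUndefined<"A" | "B">>;

// These assignments type-check because instantiation resolved Result:
const a: Result = "A";
const b: Result = "B";
// const c: Result = undefined; // type error: undefined was filtered out

console.log(a, b);
```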
Break the Cycle
Ok, back to our cycle. Knowing now what type instantiation is, it's unsurprising that the process would be cyclic. However, that still doesn't help us determine where the problem lies. tsgo is supposed to be a 1:1 port of tsc, so the control flows should directly mimic one another.
Narrowing this down turned out to be difficult, largely due to how frequently these functions run even for mundane tasks, and my difficulty in finding a small enough repro to validate what was happening.
So instead of walking you through the full depths of my insanity, we can sample it together by briefly outlining my process and then digging into the results.
Instrumentation
One of the first things I did was instrument both tsc and tsgo with an additional numeric metric simply called Zach's Stat. This would appear when run with --diagnostics, and I could use it to identify differences between the two by counting how many times we passed certain code paths (similar to cycle counting on binaries). Since the two are supposed to be identical variants of one another, we could easily identify happy paths where Zach's Stat was the same for both.
The second thing I did was try to figure out what part of JSiTS was causing the problem. After a bit of linear searching through its various components, I realized that it wasn't the VM causing problems, but rather the parser. Curiously, however, the issue only appeared when the parser was run directly on the output of Lex — if we ran the parser over a hand-lexed program, it ran perfectly.
So... Now What?
After all this work and a long scrutiny of the code, the only conclusion I could come to was that we were instantiating too many objects 🫠. In particular, the bit I mentioned earlier about TypeScript preferring to lazily instantiate types was coming back to bite us.
The lexer has a recursive type that looks like Lex<Tail, [...Acc, [Head, LexTokenMap[Key]]]>, which adds [Head, LexTokenMap[Key]] to an accumulator that is eventually returned. What I found through my instrumentation was that this type was being passed around as-is within the parser and needed to be re-instantiated hundreds of millions of times as the parser attempted to match it against a list of patterns.
In the case of the hand-lexed program, when the parser made its check, it would see a value such as ["(", "open-paren"]. There's nothing to instantiate here, as this type is terminal. However, when it saw [Head, LexTokenMap[Key]], it had to recursively instantiate Head, LexTokenMap, and Key, then use those to construct the resulting type (a comparatively expensive operation). For reasons beyond my understanding, it would do this each time the type was used, even if Head, LexTokenMap, and Key were unchanged.
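For a feel of the shape involved, here's a hypothetical miniature of that recursive accumulator; TokenMap and MiniLex are illustrative stand-ins, not JSiTS's actual definitions:

```typescript
// Illustrative stand-in for LexTokenMap.
type TokenMap = { "(": "open-paren"; ")": "close-paren" };

// Recursive accumulator in the same shape as the lexer's type:
// each step appends a [Head, TokenMap[Head]] pair to Acc.
type MiniLex<S extends string, Acc extends [string, string][] = []> =
  S extends `${infer Head}${infer Tail}`
    ? Head extends keyof TokenMap
      ? MiniLex<Tail, [...Acc, [Head, TokenMap[Head]]]>
      : never
    : Acc;

// Each [Head, TokenMap[Head]] pair must be instantiated before a parser
// can match on it. The hand-written tuple below is already terminal, so
// it checks against the type with nothing left to instantiate.
const lexed: MiniLex<"()"> = [["(", "open-paren"], [")", "close-paren"]];
console.log(lexed);
```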
Initially I thought that the difference must be attributed to tsc eagerly instantiating the type, but I could easily rule that out by listing the instantiations that occurred during parsing. As it turns out, tsc was also instantiating it in the same location -- just far less often.
Trimming the Fat (Control-Flow Graph)
Clearly the Typescript code was cutting the cycle somewhere that the Go code wasn't. With that said, there really aren't that many places the cycle can be cut. I narrowed it down to three different call-sites, and then finally to just one:
func (c *Checker) instantiateTypeWithAlias(t *Type, m *TypeMapper, alias *TypeAlias) *Type {
	if t == nil || m == nil || !c.couldContainTypeVariables(t) {
		return t
	}
	// ...
	result := c.instantiateTypeWorker(t, m, alias)
	// ...
	return result
}
At this point, my gut was telling me that couldContainTypeVariables must have a bug in it; it was easy to add some logging to see if the two implementations short-circuit the same way:
(Go on the left, TS on the right)
Aha! The TypeScript code short circuits but the Go code does not. Surely I had found the culprit and just needed to identify and fix the bug.
Spot the Difference
So where was that bug?
Spoiler alert: the two functions were identical. I spent a solid hour trying to suss out any difference between them and came up blank.
Guess my gut had lied to me.
The Light at the End
Ok, let's take a step back.
It's always Caching
If two processes are doing the same thing, but one does it less often than the other, it's a safe bet that there's a cache involved somewhere. In practice, the TypeScript compiler makes use of a number of caching mechanisms to speed up inference. However, one in particular seemed most likely, given that it would cause a short circuit during type instantiation.
The above function, couldContainTypeVariables, stores an "object flag" on types that caches the result of the method for future invocations. A quick check showed that although the logic was the same for both methods, the cached value differed between TypeScript and Go.
This struck me as particularly odd, though. If the method itself behaves the same way in both implementations, then the cached result should always be the same, regardless of when caching happens.
Back to where it all Began
How could this happen? Well, there were two possibilities: either the state of the object had changed between cache write and cache read, or the cache was being populated in more than one place. The former would have been nightmarish to track down — fortunately, I didn't need to.
Remember when we profiled our control flow and saw that we were spending too much time in getObjectTypeInstantiation? As it turns out, we weren't spending enough time there.
The TypeScript implementation had a final bit of code at the end of the function that made use of all the work it had already done. See, getObjectTypeInstantiation is expensive, but as a result it has a lot more context than couldContainTypeVariables. Consequently, if getObjectTypeInstantiation knows that there are no type variables for the type, it should pre-empt the cache with a value of false that the latter method can read without computing anything.
Furthermore, since this was computed during the type's construction (not instantiation), it meant that the reference was preserved across all usages and each of the hundreds (thousands?) of attempts to instantiate it would short circuit immediately.
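The pattern can be sketched as follows; this is a simplified model of the object-flags cache (names like Computed and preemptWithFalse are my own), not the checker's actual code:

```typescript
// Simplified model of the object-flags cache: one bit records that the
// answer was computed, another stores the answer itself.
const Computed = 1 << 0;
const CouldContainVars = 1 << 1;

interface Ty { objectFlags: number }

let expensiveWalks = 0;

// Stand-in for the costly structural walk the real method performs.
function computeCouldContainTypeVariables(t: Ty): boolean {
  expensiveWalks++;
  return false;
}

function couldContainTypeVariables(t: Ty): boolean {
  if (!(t.objectFlags & Computed)) {
    t.objectFlags |= Computed;
    if (computeCouldContainTypeVariables(t)) {
      t.objectFlags |= CouldContainVars;
    }
  }
  return !!(t.objectFlags & CouldContainVars);
}

// A caller that already knows the answer at construction time can
// pre-empt the cache with a "computed, and false" result:
function preemptWithFalse(t: Ty): void {
  t.objectFlags |= Computed;
}

const t: Ty = { objectFlags: 0 };
preemptWithFalse(t);
couldContainTypeVariables(t);
couldContainTypeVariables(t);
console.log(expensiveWalks); // 0: every call short-circuited
```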
While this might seem like it would only lead to a marginal time improvement, the recursive nature of the lexer meant that when we tried to examine a single lex token, we would recursively re-instantiate the full lex result. As a consequence, without this caching, any attempt to use the lexer's results would lead to a quadratic slowdown (based on the size of the program).
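As a toy cost model (my own simplification, not measured data): assume examining each of n tokens re-instantiates the full n-token lex result unless the cached flag short-circuits it.

```typescript
// Toy cost model: "instantiating" an n-token lex result costs n steps.
function instantiationSteps(nTokens: number, cached: boolean): number {
  let steps = 0;
  for (let token = 0; token < nTokens; token++) {
    if (!cached || token === 0) {
      steps += nTokens; // re-instantiate the whole lex result
    }
  }
  return steps;
}

console.log(instantiationSteps(100, false)); // 10000: quadratic in program size
console.log(instantiationSteps(100, true)); // 100: linear once the flag caches
```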
After glancing at the Go code for a moment, it became clear that this logic was simply missing. After about 30 seconds' worth of work to port it over, we see...
Oh look, there's the 10x speedup everyone was talking about! A pretty nice improvement from the 10x slowdown we were seeing before.
Puff the Magic Dragon
So how did tsgo stack up when all was said and done? The metric I'd been using for efficiency here is that factorial function from earlier. With the normal guardrails in place, both tsc and tsgo can only compute 6! before hitting
error TS2589: Type instantiation is excessively deep and possibly infinite.
However, with the protections removed, tsc can compute up to 9! before it crashes. Any guesses on how much tsgo can do?
Yeppp that's 21 factorial computed in just over a second. Guess there is something to this after all.
Final Thoughts
Wow, that was not how I expected this blog post to go. In the end, I cleaned up the code and submitted it to the TypeScript repo for review. It was quickly merged, and tsgo is now ready for use on all your JavaScript parsers.
Join Us!
Well, I spent like a day and a half doing this instead of my real work (hiring), so if anyone wants to help me out and just apply for a job that would be wonderful. Thanks!
No AI was used to write this, all typos are my own. Huge thanks to David Lanman and Lin He for editing!