Lellansin

Lellansin正在翻译

Maybe you don't need Rust and WASM to speed up your JS

原文链接: mrale.ph

Few weeks ago I noticed a blog post “Oxidizing Source Maps with Rust and WebAssembly” making rounds on Twitter - talking about performance benefits of replacing plain JavaScript in the core of source-map library with a Rust version compiled to WebAssembly.

This post piqued my interest, not because I am a huge on either Rust or WASM, but rather because I am always curious about language features and optimizations missing in pure JavaScript to achieve similar performance characteristics.

So I checked out the library from GitHub and departed on a small performance investigation, which I am documenting here almost verbatim.

Getting the Code

For my investigations I was using an almost default x64.release build of the V8 at commit 69abb960c97606df99408e6869d66e014aa0fb51 from January 20th. My only departure from the default configuration is that I enable disassembler via GN flags to be able to dive down to generated machine code if needed.

╭─ ~/src/v8/v8 ‹master›
╰─$ gn args out.gn/x64.release --list --short --overrides-only
is_debug = false
target_cpu = "x64"
use_goma = true
v8_enable_disassembler = true

Then I got a checkouts of source-map package at:

  • commit c97d38b, which was the last commit that updated dist/source-map.js before Rust/WASM started landed;
  • commit 51cf770 which was the most recent commit, when I did my investigation;

Profiling the Pure-JavaScript Version

Running benchmark in the pure-JS version was simple:

╭─ ~/src/source-map/bench ‹ c97d38b›
╰─$ d8 bench-shell-bindings.js
Parsing source map
console.timeEnd: iteration, 4655.638000
console.timeEnd: iteration, 4751.122000
console.timeEnd: iteration, 4820.566000
console.timeEnd: iteration, 4996.942000
console.timeEnd: iteration, 4644.619000
[Stats samples: 5, total: 23868 ms, mean: 4773.6 ms, stddev: 161.22112144505135 ms]

The first thing that I did was to disable the serialization part of the benchmark:

diff --git a/bench/bench-shell-bindings.js b/bench/bench-shell-bindings.js
index 811df40..c97d38b 100644
--- a/bench/bench-shell-bindings.js
+++ b/bench/bench-shell-bindings.js
@@ -19,5 +19,5 @@ load("./bench.js");
 print("Parsing source map");
 print(benchmarkParseSourceMap());
 print();
-print("Serializing source map");
-print(benchmarkSerializeSourceMap());
+// print("Serializing source map");
+// print(benchmarkSerializeSourceMap());

And then threw it into the Linux perf profiler:

╭─ ~/src/source-map/bench ‹perf-work›
╰─$ perf record -g d8 --perf-basic-prof bench-shell-bindings.js
Parsing source map
console.timeEnd: iteration, 4984.464000
^C[ perf record: Woken up 90 times to write data ]
[ perf record: Captured and wrote 24.659 MB perf.data (~1077375 samples) ]

Notice that I am passing --perf-basic-prof flag to the d8 binary which instructs V8 to generate an auxiliary mappings file /tmp/perf-$pid.map. This file allows perf report to understand JIT generated machine code.

Here is what we get from perf report --no-children after zooming on the main execution thread:

Overhead  Symbol
  17.02%  *doQuickSort ../dist/source-map.js:2752
  11.20%  Builtin:ArgumentsAdaptorTrampoline
   7.17%  *compareByOriginalPositions ../dist/source-map.js:1024
   4.49%  Builtin:CallFunction_ReceiverIsNullOrUndefined
   3.58%  *compareByGeneratedPositionsDeflated ../dist/source-map.js:1063
   2.73%  *SourceMapConsumer_parseMappings ../dist/source-map.js:1894
   2.11%  Builtin:StringEqual
   1.93%  *SourceMapConsumer_parseMappings ../dist/source-map.js:1894
   1.66%  *doQuickSort ../dist/source-map.js:2752
   1.25%  v8::internal::StringTable::LookupStringIfExists_NoAllocate
   1.22%  *SourceMapConsumer_parseMappings ../dist/source-map.js:1894
   1.21%  Builtin:StringCharAt
   1.16%  Builtin:Call_ReceiverIsNullOrUndefined
   1.14%  v8::internal::(anonymous namespace)::StringTableNoAllocateKey::IsMatch
   0.90%  Builtin:StringPrototypeSlice
   0.86%  Builtin:KeyedLoadIC_Megamorphic
   0.82%  v8::internal::(anonymous namespace)::MakeStringThin
   0.80%  v8::internal::(anonymous namespace)::CopyObjectToObjectElements
   0.76%  v8::internal::Scavenger::ScavengeObject
   0.72%  v8::internal::String::VisitFlat`<v8::internal::IteratingStringHasher>`
   0.68%  *SourceMapConsumer_parseMappings ../dist/source-map.js:1894
   0.64%  *doQuickSort ../dist/source-map.js:2752
   0.56%  v8::internal::IncrementalMarking::RecordWriteSlow

Indeed, just like the “Oxidizing Source Maps …” post has stated the benchmark is rather heavy on sort: doQuickSort appears at the top of the profile and also several times down the list (which means that it was optimized/deoptimized few times).

Optimizing Sorting - Argument Adaptation

One thing that jumps out in the profiler are suspicious entries, namely Builtin:ArgumentsAdaptorTrampoline and Builtin:CallFunction_ReceiverIsNullOrUndefined which seem to be part of the V8 implementation. If we ask perf report to expand call chains leading to them then we will notice that these functions are also mostly invoked from the sorting code:

- Builtin:ArgumentsAdaptorTrampoline
  + 96.87% *doQuickSort ../dist/source-map.js:2752
  +  1.22% *SourceMapConsumer_parseMappings ../dist/source-map.js:1894
  +  0.68% *SourceMapConsumer_parseMappings ../dist/source-map.js:1894
  +  0.68% Builtin:InterpreterEntryTrampoline
  +  0.55% *doQuickSort ../dist/source-map.js:2752

- Builtin:CallFunction_ReceiverIsNullOrUndefined
  + 93.88% *doQuickSort ../dist/source-map.js:2752
  +  2.24% *SourceMapConsumer_parseMappings ../dist/source-map.js:1894
  +  2.01% Builtin:InterpreterEntryTrampoline
  +  1.49% *SourceMapConsumer_parseMappings ../dist/source-map.js:1894

It is time to look at the code. Quicksort implementation itself lives in lib/quick-sort.js and it is invoked from parsing code in lib/source-map-consumer.js. Comparison functions used for sorting are compareByGeneratedPositionsDeflated and compareByOriginalPositions.

Looking at the definitions of these comparison functions and how they are invoked from quick-sort implementation reveals that the invocation site has mismatching arity:

function compareByOriginalPositions(mappingA, mappingB, onlyCompareOriginal) {
  // ...
}

function compareByGeneratedPositionsDeflated(mappingA, mappingB, onlyCompareGenerated) {
  // ...
}

function doQuickSort(ary, comparator, p, r) {
  // ...
      if (comparator(ary[j], pivot) <= 0) {
        // ...
      }
  // ...
}

Grepping through library sources reveals that outside of tests quickSort is only ever called with these two functions.

What if we fix the invocation arity?

diff --git a/dist/source-map.js b/dist/source-map.js
index ade5bb2..2d39b28 100644
--- a/dist/source-map.js
+++ b/dist/source-map.js
@@ -2779,7 +2779,7 @@ return /******/ (function(modules) { // webpackBootstrap
            //
            //   * Every element in `ary[i+1 .. j-1]` is greater than the pivot.
            for (var j = p; j < r; j++) {
-             if (comparator(ary[j], pivot) <= 0) {
+             if (comparator(ary[j], pivot, false) <= 0) {
                i += 1;
                swap(ary, i, j);
              }

[Note: I am doing edits directly in dist/source-map.js because I did not want to spend time figuring out the build process]

╭─ ~/src/source-map/bench ‹perf-work› [Fix comparator invocation arity]
╰─$ d8 bench-shell-bindings.js
Parsing source map
console.timeEnd: iteration, 4037.084000
console.timeEnd: iteration, 4249.258000
console.timeEnd: iteration, 4241.165000
console.timeEnd: iteration, 3936.664000
console.timeEnd: iteration, 4131.844000
console.timeEnd: iteration, 4140.963000
[Stats samples: 6, total: 24737 ms, mean: 4122.833333333333 ms, stddev: 132.18789657150916 ms]

Just by fixing the arity mismatch we improved benchmark mean on V8 by 14% from 4774 ms to 4123 ms. If we profile the benchmark again we will discover that ArgumentsAdaptorTrampoline has completely disappeared from it. Why was it there in the first place?

It turns out that ArgumentsAdaptorTrampoline is V8’s mechanism for coping with JavaScript’s variadic calling convention: you can call function that has 3 parameters with 2 arguments - in which case the third parameter will be filled with undefined. V8 does this by creating a new frame on the stack, copying arguments down and then invoking the target function:

Argument adaptation

[If you have never heard about execution stack, checkout out Wikipedia and Franziska Hinkelmann’s blog post.]

While such costs might be negligible for cold code, in this code comparator was invoked millions of times during benchmark run which magnified overheads of arguments adaptation.

An attentive reader might also notice that we are now explicitly passing boolean value false where previously an implicit undefined was used. This does seem to contribute a bit to the performance improvement. If we replace false with void 0 we would get slightly worse numbers:

diff --git a/dist/source-map.js b/dist/source-map.js
index 2d39b28..243b2ef 100644
--- a/dist/source-map.js
+++ b/dist/source-map.js
@@ -2779,7 +2779,7 @@ return /******/ (function(modules) { // webpackBootstrap
            //
            //   * Every element in `ary[i+1 .. j-1]` is greater than the pivot.
            for (var j = p; j < r; j++) {
-             if (comparator(ary[j], pivot, false) <= 0) {
+             if (comparator(ary[j], pivot, void 0) <= 0) {
                i += 1;
                swap(ary, i, j);
              }

╭─ ~/src/source-map/bench ‹perf-work U› [Fix comparator invocation arity]
╰─$ ~/src/v8/v8/out.gn/x64.release/d8 bench-shell-bindings.js
Parsing source map
console.timeEnd: iteration, 4215.623000
console.timeEnd: iteration, 4247.643000
console.timeEnd: iteration, 4425.871000
console.timeEnd: iteration, 4167.691000
console.timeEnd: iteration, 4343.613000
console.timeEnd: iteration, 4209.427000
[Stats samples: 6, total: 25610 ms, mean: 4268.333333333333 ms, stddev: 106.38947316346669 ms]

For what it is worth argument adaptation overhead seems to be highly V8 specific. When I benchmark my change against SpiderMonkey, I don’t see any significant performance improvement from matching the arity:

╭─ ~/src/source-map/bench ‹ d052ea4› [Disabled serialization part of the benchmark]
╰─$ sm bench-shell-bindings.js
Parsing source map
[Stats samples: 8, total: 24751 ms, mean: 3093.875 ms, stddev: 327.27966571700836 ms]
╭─ ~/src/source-map/bench ‹perf-work› [Fix comparator invocation arity]
╰─$ sm bench-shell-bindings.js
Parsing source map
[Stats samples: 8, total: 25397 ms, mean: 3174.625 ms, stddev: 360.4636187025859 ms]

[SpiderMonkey shell is now very easy to install thanks to Mathias Bynens’es jsvu tool.]

Let us get back to the sorting code. If we profile the benchmark again we will notice that ArgumentsAdaptorTrampoline is gone for good from the profile, but CallFunction_ReceiverIsNullOrUndefined is still there. This is not surprising given that we are still calling the comparator.

Optimizing Sorting - Monomorphisation

What usually performs better than calling the function? Not calling it!

The obvious option here is to try and get the comparator inlined into the doQuickSort. However the fact that doQuickSort is called with different comparator functions stands in the way of inlining.

To work around this we can try to monomorphise doQuickSort by cloning it. Here is how we do it.

We start by wrapping doQuickSort and other helpers into SortTemplate function:

function SortTemplate(comparator) {
  function swap(ary, x, y) {
    // ...
  }

  function randomIntInRange(low, high) {
    // ...
  }

  function doQuickSort(ary, p, r) {
    // ...
  }

  return doQuickSort;
}

Then we can produce clones of our sorting routines by converting SortTemplate into a string and then parsing it back into a function via Function constructor:

function cloneSort(comparator) {
  let template = SortTemplate.toString();
  let templateFn = new Function(`return ${template}`)();
  return templateFn(comparator);  // Invoke template to get doQuickSort
}

Now we can use cloneSort to produce a sort function for each comparator we are using:

let sortCache = new WeakMap();  // Cache for specialized sorts.
exports.quickSort = function (ary, comparator) {
  let doQuickSort = sortCache.get(comparator);
  if (doQuickSort === void 0) {
    doQuickSort = cloneSort(comparator);
    sortCache.set(comparator, doQuickSort);
  }
  doQuickSort(ary, 0, ary.length - 1);
};

Rerunning benchmark yields:

╭─ ~/src/source-map/bench ‹perf-work› [Clone sorting functions for each comparator]
╰─$ d8 bench-shell-bindings.js
Parsing source map
console.timeEnd: iteration, 2955.199000
console.timeEnd: iteration, 3084.979000
console.timeEnd: iteration, 3193.134000
console.timeEnd: iteration, 3480.459000
console.timeEnd: iteration, 3115.011000
console.timeEnd: iteration, 3216.344000
console.timeEnd: iteration, 3343.459000
console.timeEnd: iteration, 3036.211000
[Stats samples: 8, total: 25423 ms, mean: 3177.875 ms, stddev: 181.87633161024556 ms]

We can see that the mean time went from 4268 ms to 3177 ms (25% improvement).

Profiling reveals the following picture:

 Overhead Symbol
   14.95% *doQuickSort :44
   11.49% *doQuickSort :44
    3.29% Builtin:StringEqual
    3.13% *SourceMapConsumer_parseMappings ../dist/source-map.js:1894
    1.86% v8::internal::StringTable::LookupStringIfExists_NoAllocate
    1.86% *SourceMapConsumer_parseMappings ../dist/source-map.js:1894
    1.72% Builtin:StringCharAt
    1.67% *SourceMapConsumer_parseMappings ../dist/source-map.js:1894
    1.61% v8::internal::Scavenger::ScavengeObject
    1.45% v8::internal::(anonymous namespace)::StringTableNoAllocateKey::IsMatch
    1.23% Builtin:StringPrototypeSlice
    1.17% v8::internal::(anonymous namespace)::MakeStringThin
    1.08% Builtin:KeyedLoadIC_Megamorphic
    1.05% v8::internal::(anonymous namespace)::CopyObjectToObjectElements
    0.99% v8::internal::String::VisitFlat`<v8::internal::IteratingStringHasher>`
    0.86% clear_page_c_e
    0.77% v8::internal::IncrementalMarking::RecordWriteSlow
    0.48% Builtin:MathRandom
    0.41% Builtin:RecordWrite
    0.39% Builtin:KeyedLoadIC

Overheads related to invoking comparator have now completely disappeared from the profile.

At this point I became interested in how much time we spend parsing mappings vs. sorting them. I went into the parsing code and added few Date.now() invocations:

[I wanted to sprinkle performance.now() but SpiderMonkey shell apparently does not support it.]

diff --git a/dist/source-map.js b/dist/source-map.js
index 75ebbdf..7312058 100644
--- a/dist/source-map.js
+++ b/dist/source-map.js
@@ -1906,6 +1906,8 @@ return /******/ (function(modules) { // webpackBootstrap
            var generatedMappings = [];
            var mapping, str, segment, end, value;

+
+      var startParsing = Date.now();
            while (index < length) {
              if (aStr.charAt(index) === ';') {
                generatedLine++;
@@ -1986,12 +1988,20 @@ return /******/ (function(modules) { // webpackBootstrap
                }
              }
            }
+      var endParsing = Date.now();

+      var startSortGenerated = Date.now();
            quickSort(generatedMappings, util.compareByGeneratedPositionsDeflated);
            this.__generatedMappings = generatedMappings;
+      var endSortGenerated = Date.now();

+      var startSortOriginal = Date.now();
            quickSort(originalMappings, util.compareByOriginalPositions);
            this.__originalMappings = originalMappings;
+      var endSortOriginal = Date.now();
+
+      console.log(`${}, ${endSortGenerated - startSortGenerated}, ${endSortOriginal - startSortOriginal}`);
+      console.log(`sortGenerated: `);
+      console.log(`sortOriginal:  `);
          };

This yielded:

╭─ ~/src/source-map/bench ‹perf-work U› [Clone sorting functions for each comparator]
╰─$ d8 bench-shell-bindings.js
Parsing source map
parse:         1911.846
sortGenerated: 619.5990000000002
sortOriginal:  905.8220000000001
parse:         1965.4820000000004
sortGenerated: 602.1939999999995
sortOriginal:  896.3589999999995
^C

Here is how parsing and sorting times look like in V8 and SpiderMonkey per benchmark iteration run:

Parse and Sort times

In V8 we seem to be spending roughly as much time parsing mappings as sorting them. In SpiderMonkey parsing is considerably faster - while sorting is slower. This prompted me to start looking at the parsing code.

Optimizing Parsing - Removing Segment Cache

Lets take a look at the profile again

Overhead  Symbol
  18.23%  *doQuickSort :44
  12.36%  *doQuickSort :44
   3.84%  *SourceMapConsumer_parseMappings ../dist/source-map.js:1894
   3.07%  Builtin:StringEqual
   1.92%  v8::internal::StringTable::LookupStringIfExists_NoAllocate
   1.85%  *SourceMapConsumer_parseMappings ../dist/source-map.js:1894
   1.59%  *SourceMapConsumer_parseMappings ../dist/source-map.js:1894
   1.54%  Builtin:StringCharAt
   1.52%  v8::internal::(anonymous namespace)::StringTableNoAllocateKey::IsMatch
   1.38%  v8::internal::Scavenger::ScavengeObject
   1.27%  Builtin:KeyedLoadIC_Megamorphic
   1.22%  Builtin:StringPrototypeSlice
   1.10%  v8::internal::(anonymous namespace)::MakeStringThin
   1.05%  v8::internal::(anonymous namespace)::CopyObjectToObjectElements
   1.03%  v8::internal::String::VisitFlat`<v8::internal::IteratingStringHasher>`
   0.88%  clear_page_c_e
   0.51%  Builtin:MathRandom
   0.48%  Builtin:KeyedLoadIC
   0.46%  v8::internal::IteratingStringHasher::Hash
   0.41%  Builtin:RecordWrite

Removing the JavaScript code we recognize leaves us with the following:

Overhead  Symbol
   3.07%  Builtin:StringEqual
   1.92%  v8::internal::StringTable::LookupStringIfExists_NoAllocate
   1.54%  Builtin:StringCharAt
   1.52%  v8::internal::(anonymous namespace)::StringTableNoAllocateKey::IsMatch
   1.38%  v8::internal::Scavenger::ScavengeObject
   1.27%  Builtin:KeyedLoadIC_Megamorphic
   1.22%  Builtin:StringPrototypeSlice
   1.10%  v8::internal::(anonymous namespace)::MakeStringThin
   1.05%  v8::internal::(anonymous namespace)::CopyObjectToObjectElements
   1.03%  v8::internal::String::VisitFlat`<v8::internal::IteratingStringHasher>`
   0.88%  clear_page_c_e
   0.51%  Builtin:MathRandom
   0.48%  Builtin:KeyedLoadIC
   0.46%  v8::internal::IteratingStringHasher::Hash
   0.41%  Builtin:RecordWrite

When I started looking at call chains for individual entries I discovered that many of them pass through KeyedLoadIC_Megamorphic into SourceMapConsumer_parseMappings.

-    1.92% v8::internal::StringTable::LookupStringIfExists_NoAllocate
   - v8::internal::StringTable::LookupStringIfExists_NoAllocate
      + 99.80% Builtin:KeyedLoadIC_Megamorphic

-    1.52% v8::internal::(anonymous namespace)::StringTableNoAllocateKey::IsMatch
   - v8::internal::(anonymous namespace)::StringTableNoAllocateKey::IsMatch
      - 98.32% v8::internal::StringTable::LookupStringIfExists_NoAllocate
         + Builtin:KeyedLoadIC_Megamorphic
      + 1.68% Builtin:KeyedLoadIC_Megamorphic

-    1.27% Builtin:KeyedLoadIC_Megamorphic
   - Builtin:KeyedLoadIC_Megamorphic
      + 57.65% *SourceMapConsumer_parseMappings ../dist/source-map.js:1894
      + 22.62% *SourceMapConsumer_parseMappings ../dist/source-map.js:1894
      + 15.91% *SourceMapConsumer_parseMappings ../dist/source-map.js:1894
      + 2.46% Builtin:InterpreterEntryTrampoline
      + 0.61% BytecodeHandler:Mul
      + 0.57% *doQuickSort :44

-    1.10% v8::internal::(anonymous namespace)::MakeStringThin
   - v8::internal::(anonymous namespace)::MakeStringThin
      - 94.72% v8::internal::StringTable::LookupStringIfExists_NoAllocate
         + Builtin:KeyedLoadIC_Megamorphic
      + 3.63% Builtin:KeyedLoadIC_Megamorphic
      + 1.66% v8::internal::StringTable::LookupString

This sort of call stacks indicated to me that the code is performing a lot of keyed lookups of form obj[key] where key is dynamically built string. When I looked at the parsing I discovered the following code:

// Because each offset is encoded relative to the previous one,
// many segments often have the same encoding. We can exploit this
// fact by caching the parsed variable length fields of each segment,
// allowing us to avoid a second parse if we encounter the same
// segment again.
for (end = index; end < length; end++) {
  if (this._charIsMappingSeparator(aStr, end)) {
    break;
  }
}
str = aStr.slice(index, end);

segment = cachedSegments[str];
if (segment) {
  index += str.length;
} else {
  segment = [];
  while (index < end) {
    base64VLQ.decode(aStr, index, temp);
    value = temp.value;
    index = temp.rest;
    segment.push(value);
  }

  // ...

  cachedSegments[str] = segment;
}

This code is responsible for decoding Base64 VLQ encoded sequences, e.g. a string A would be decoded as [0] and UAAAA gets decoded as [10,0,0,0,0]. I suggest checking this blog post about source maps internals if you would like to understand the encoding itself better.

Instead of decoding each sequence independently this code attempts to cache decoded segments: it scans forward until a separator (, or ;) is found, then extracts substring from the current position to the separator and checks if we have previous decoded such segment by looking up the extracted substring in a cache - if we hit the cache we return cached segment, otherwise we parse and cache the segment in the cache.

Caching (aka memoization) is a very powerful optimization technique - however it only makes sense when maintaining the cache itself and looking up cached results is cheaper than performing computation itself again.

Abstract Analysis

Lets try to compare these two operations abstractly.

On one hand is pure parsing:

Parsing segment looks at each character of a segment once. For each character it performs few comparisons and arithmetic operations to convert a base64 character into an integer value it represents. Then it performs few bitwise operations to incorporate this integer value into a larger integer value. Then it stores decoded value into an array and moves to the next part of the segment. Segments are limited to 5 elements.

On the other hand caching:

  1. To look up a cached value we traverse all the characters of the segment once to find its end;
  2. We extract the substring, which requires allocation and potentially copying depending on how strings are implemented in a JS VM;
  3. We use this string as a key in a dictionary, which:
    1. first requires VM to compute hash for this string (traversing it again and performing various bitwise operations on individual characters), this might also require VM to internalize the string (depending on implementation);
    2. then VM has to perform a hash table lookup, which requires probing and comparing key by value with other keys (which might require it again to look at individual characters in a string);

Overall it seems that direct parsing should be faster, assuming that JS VM does good job with individual arithmetic/bitwise operations, simply because it looks at each individual character only once, where caching requires traversing the segment 2-4 times just to establish whether we hit the cache or not.

Profile seems to confirm this too: KeyedLoadIC_Megamorphic is a stub used by V8 to implement keyed lookups like cachedSegments[str] in the code above.

Based on these observations I set out to do few experiments. First I checked how big cachedSegments cache is at the end of the parsing. The smaller it is the more efficient caching would be.

Turns out that it grows quite big:

Object.keys(cachedSegments).length = 155478

Standalone Microbenchmarks

Now I decided to write a small standalone benchmark:

// Generate a string with [n] segments, segments repeat in a cycle of length
// [v] i.e. segments number 0, v, 2*v, ... are all equal and so are
// 1, 1 + v, 1 + 2*v, ...
// Use [base] as a base value in a segment - this parameter allows to make
// segments long.
//
// Note: the bigger [v], the bigger [cachedSegments] cache is.
function makeString(n, v, base) {
  var arr = [];
  for (var i = 0; i < n; i++) {
    arr.push([0, base + (i % v), 0, 0].map(base64VLQ.encode).join(''));
  }
  return arr.join(';') + ';';
}

// Run function [f] against the string [str].
function bench(f, str) {
  for (var i = 0; i < 1000; i++) {
    f(str);
  }
}

// Measure and report performance of [f] against [str].
// It has [v] different segments.
function measure(v, str, f) {
  var start = Date.now();
  bench(f, str);
  var end = Date.now();
  report(`${v}, ${f.name}, ${(end - start).toFixed(2)}`);
}

async function measureAll() {
  for (let v = 1; v <= 256; v *= 2) {
    // Make a string with 1000 total segments and [v] different ones
    // so that [cachedSegments] has [v] cached segments.
    let str = makeString(1000, v, 1024 * 1024);

    let arr = encoder.encode(str);

    // Run 10 iterations for each way of decoding.
    for (var j = 0; j `< 10; j++) {
     measure(j, i, str, decodeCached);
     measure(j, i, str, decodeNoCaching);
     measure(j, i, str, decodeNoCachingNoStrings);
     measure(j, i, arr, decodeNoCachingNoStringsPreEncoded);
     await nextTick();
    }
  }
}

function nextTick() { return new Promise((resolve) =>` setTimeout(resolve)); }

There are three different ways to decode Base64 VLQ segments that I tried against each other.

The first one decodeCached is exactly the same as the default implementation used by source-map - which I already listed above:

function decodeCached(aStr) {
  var length = aStr.length;
  var cachedSegments = {};
  var end, str, segment, value, temp = {value: 0, rest: 0};
  const decode = base64VLQ.decode;

  var index = 0;
  while (index < length) {
    // Because each offset is encoded relative to the previous one,
    // many segments often have the same encoding. We can exploit this
    // fact by caching the parsed variable length fields of each segment,
    // allowing us to avoid a second parse if we encounter the same
    // segment again.
    for (end = index; end < length; end++) {
      if (_charIsMappingSeparator(aStr, end)) {
        break;
      }
    }
    str = aStr.slice(index, end);

    segment = cachedSegments[str];
    if (segment) {
      index += str.length;
    } else {
      segment = [];
      while (index < end) {
        decode(aStr, index, temp);
        value = temp.value;
        index = temp.rest;
        segment.push(value);
      }

      if (segment.length === 2) {
        throw new Error('Found a source, but no line and column');
      }

      if (segment.length === 3) {
        throw new Error('Found a source and line, but no column');
      }

      cachedSegments[str] = segment;
    }

    index++;
  }
}

The next competitor is decodeNoCaching. It is essentially decodeCached but without the cache. Each segment is decoded independently. I also replaced Array with Int32Array for segment storage.

function decodeNoCaching(aStr) {
  var length = aStr.length;
  var cachedSegments = {};
  var end, str, segment, temp = {value: 0, rest: 0};
  const decode = base64VLQ.decode;

  var index = 0, value;
  var segment = new Int32Array(5);
  var segmentLength = 0;
  while (index `< length) {
    segmentLength = 0;
    while (!_charIsMappingSeparator(aStr, index)) {
      decode(aStr, index, temp);
      value = temp.value;
      index = temp.rest;
      if (segmentLength >`= 5) throw new Error('Too many segments');
      segment[segmentLength++] = value;
    }

    if (segmentLength === 2) {
      throw new Error('Found a source, but no line and column');
    }

    if (segmentLength === 3) {
      throw new Error('Found a source and line, but no column');
    }

    index++;
  }
}

Finally the third variant decodeNoCachingNoString tries to avoid dealing with JavaScript strings altogether by converting the string into utf8 encoded Uint8Array. This optimization is inspired by the fact that JS VMs are more likely to optimize an array load down to a single memory access. Optimizing String.prototype.charCodeAt to the same extent is harder due to the sheer complexity of the hierarchy of different string representations that JS VMs utilize.

I benchmarked both a version that encodes string into utf8 as part of the iteration and a version that uses preencoded string. With this latter “optimistic” version I am trying to estimate how much we could gain if we were able to skip typed array ⇒ string ⇒ typed array round trip. Which would be possible if we loaded the source map directly as an array buffer and parsed it directly from that buffer instead of converting it to string first.

let encoder = new TextEncoder();
function decodeNoCachingNoString(aStr) {
  decodeNoCachingNoStringPreEncoded(encoder.encode(aStr));
}

function decodeNoCachingNoStringPreEncoded(arr) {
  var length = arr.length;
  var cachedSegments = {};
  var end, str, segment, temp = {value: 0, rest: 0};
  const decode2 = base64VLQ.decode2;

  var index = 0, value;
  var segment = new Int32Array(5);
  var segmentLength = 0;
  while (index < length) {
    segmentLength = 0;
    while (arr[index] != 59 && arr[index] != 44) {
      decode2(arr, index, temp);
      value = temp.value;
      index = temp.rest;
      if (segmentLength < 5) {
        segment[segmentLength++] = value;
      }
    }

    if (segmentLength === 2) {
      throw new Error('Found a source, but no line and column');
    }

    if (segmentLength === 3) {
      throw new Error('Found a source and line, but no column');
    }

    index++;
  }
}

Here are the results I’ve gotten by running my microbenchmark in Chrome Dev 66.0.3343.3 (V8 6.6.189) and Firefox Nightly 60.0a1 (2018-02-11):

Different Decodes

There are few things to notice here:

  • the version that uses caching is slower than anything else on both V8 and SpiderMonkey. Its performance degrades steeply as number of cache entries grows - while performance of non-caching versions does not depend on that;
  • on SpiderMonkey it pays off to convert string into typed array as part of parsing, while on V8 character access is fast enough - so it only pays off to use array if you can move string-to-array conversion out of the benchmark (e.g. you load your data into typed arrays to begin with);

I was curious if V8 team did any work recently to improve charCodeAt performance - as I remembered rather vividly that Crankshaft never made an effort to specialize charCodeAt for a particular string representation at a call-site and instead expanded charCodeAt into a large chunk of code handling many different string representations, making loading characters from strings slower than loading elements from typed arrays.

I trawled V8 issue tracker and found few active issues like these:

Some of the comments on these issues reference commits from late January 2018 and onward, which indicated to me that performance of charCodeAt is being actively worked on. Out of curiosity I decided to rerun my microbenchmark in Chrome Beta and compare against Chrome Dev

Different Decodes

This comparison does in fact confirm that all those commits by the V8 team were not for nothing: performance of charCodeAt improved drastically from version 6.5.254.21 to 6.6.189. Comparing “no cache” and “using array” lines we can see that on an older V8 charCodeAt behaved so much worse that it made sense to convert the string into Uint8Array just to access it faster. However the overhead of doing this conversion inside the parse does not pay off anymore in a newer V8.

However it still pays off to use an array instead of a string as long as you don’t have to pay the conversion cost. Why is that? To figure that out I run the following code in tip-of-the tree V8:

function foo(str, i) {
  return str.charCodeAt(i);
}

let str = "fisk";

foo(str, 0);
foo(str, 0);
foo(str, 0);
%OptimizeFunctionOnNextCall(foo);
foo(str, 0);

╭─ ~/src/v8/v8 ‹master›
╰─$ out.gn/x64.release/d8 --allow-natives-syntax --print-opt-code --code-comments x.js

The command produced a gigantic assembly listing confirming my suspicion that V8 still does not specialize charCodeAt for a particular string representation. This lowering seems to come from this code in V8 sources, which resolves the mystery of why array access is faster than string charCodeAt.

Parsing Improvements

In light of these discoveries lets remove caching of parsed segments from source-map parsing code and measure the effect.

Parse and Sort times

Just like our microbenchmarking predicted caching was detrimental to the overall performance rather than being beneficial: removing it actually improves parsing times considerably.

Optimizing Sorting - Algorithmic Improvements

Now that we improved parsing performance lets take a look at the sorting again.

There are two arrays that are being sorted:

  1. originalMappings array is being sorted using compareByOriginalPositions comparator;
  2. generatedMappings array is being sorted using compareByGeneratedPositionsDeflated comparator.

Optimizing originalMappings Sorting

I took a look at compareByOriginalPositions first.

function compareByOriginalPositions(mappingA, mappingB, onlyCompareOriginal) {
  var cmp = strcmp(mappingA.source, mappingB.source);
  if (cmp !== 0) {
    return cmp;
  }

  cmp = mappingA.originalLine - mappingB.originalLine;
  if (cmp !== 0) {
    return cmp;
  }

  cmp = mappingA.originalColumn - mappingB.originalColumn;
  if (cmp !== 0 || onlyCompareOriginal) {
    return cmp;
  }

  cmp = mappingA.generatedColumn - mappingB.generatedColumn;
  if (cmp !== 0) {
    return cmp;
  }

  cmp = mappingA.generatedLine - mappingB.generatedLine;
  if (cmp !== 0) {
    return cmp;
  }

  return strcmp(mappingA.name, mappingB.name);
}

Here I noticed that mappings are being ordered by source component first and then by all other components. source specifies which source file the mapping originally came from. An obvious idea here is that instead of using a flat gigantic originalMappings array, which mixes together mappings from different source files, we can turn originalMappings into array of arrays: originalMappings[i] would be an array of all mappings from source file with index i. This way we can sort mappings into different originalMappings[i] arrays based on their source as we parse them and then sort individual smaller arrays.

[This is essentially a Bucket Sort]

Here is what we do in parsing loop:

if (typeof mapping.originalLine === 'number') {
  // This code used to just do: originalMappings.push(mapping).
  // Now it sorts original mappings already by source during parsing.
  let currentSource = mapping.source;
  while (originalMappings.length <= currentSource) {
    originalMappings.push(null);
  }
  if (originalMappings[currentSource] === null) {
    originalMappings[currentSource] = [];
  }
  originalMappings[currentSource].push(mapping);
}

And then after that:

var startSortOriginal = Date.now();
// The code used to sort the whole array:
//     quickSort(originalMappings, util.compareByOriginalPositions);
for (var i = 0; i < originalMappings.length; i++) {
  if (originalMappings[i] != null) {
    quickSort(originalMappings[i], util.compareByOriginalPositionsNoSource);
  }
}
var endSortOriginal = Date.now();

The compareByOriginalPositionsNoSource comparator is almost exactly the same as compareByOriginalPositions comparator except it does not compare source component anymore - those are guaranteed to be equal due to the way we constructed each originalMappings[i] array.

Parse and Sort times

This algorithmic change improves sorting times on both V8 and SpiderMonkey and additionally improves parsing times on V8.

[Parse time improvement is likely due to the reduction of costs associated with managing originalMappings array: growing a single gigantic originalMappings array is more expensive than growing multiple smaller originalMappings[i] arrays individually. However this is just my guess, which is not confirmed by any rigorous analysis.]

Optimizing generatedMappings Sorting

Let us take a look at generatedMappings and compareByGeneratedPositionsDeflated comparator.

function compareByGeneratedPositionsDeflated(mappingA, mappingB, onlyCompareGenerated) {
  var cmp = mappingA.generatedLine - mappingB.generatedLine;
  if (cmp !== 0) {
    return cmp;
  }

  cmp = mappingA.generatedColumn - mappingB.generatedColumn;
  if (cmp !== 0 || onlyCompareGenerated) {
    return cmp;
  }

  cmp = strcmp(mappingA.source, mappingB.source);
  if (cmp !== 0) {
    return cmp;
  }

  cmp = mappingA.originalLine - mappingB.originalLine;
  if (cmp !== 0) {
    return cmp;
  }

  cmp = mappingA.originalColumn - mappingB.originalColumn;
  if (cmp !== 0) {
    return cmp;
  }

  return strcmp(mappingA.name, mappingB.name);
}

Here we first compare mappings by generatedLine. There are likely considerably more generated lines than original source files so it does not make sense to split generatedMappings into multiple individual arrays.

However when I looked at the parsing code I noticed the following:

while (index < length) {
  if (aStr.charAt(index) === ';') {
    generatedLine++;
    // ...
  } else if (aStr.charAt(index) === ',') {
    // ...
  } else {
    mapping = new Mapping();
    mapping.generatedLine = generatedLine;

    // ...
  }
}

These are the only occurrences of generatedLine in this code, which means that generatedLine is growing monotonically - implying that generatedMappings array is already ordered by generatedLine and it does not make sense to sort the array as whole. Instead we can sort each individual smaller subarray. We change the code like this:

let subarrayStart = 0;
while (index < length) {
  if (aStr.charAt(index) === ';') {
    generatedLine++;
    // ...

    // Sort subarray [subarrayStart, generatedMappings.length].
    sortGenerated(generatedMappings, subarrayStart);
    subarrayStart = generatedMappings.length;
  } else if (aStr.charAt(index) === ',') {
    // ...
  } else {
    mapping = new Mapping();
    mapping.generatedLine = generatedLine;

    // ...
  }
}
// Sort the tail.
sortGenerated(generatedMappings, subarrayStart);

Instead of using quickSort for sorting smaller subarrays, I also decided to use insertion sort, similar to a hybrid strategy that some VMs use for Array.prototype.sort.

[Note: insertion sort is also faster than quick sort if input array is already sorted… and it turns out that mappings used for the benchmark are in fact sorted. If we expect generatedMappings to be almost always sorted after parsing then it would be even more efficient to simply check whether generatedMappings is sorted before trying to sort it.]

const compareGenerated = util.compareByGeneratedPositionsDeflatedNoLine;

function sortGenerated(array, start) {
  let l = array.length;
  let n = array.length - start;
  if (n `<= 1) {
    return;
  } else if (n == 2) {
    let a = array[start];
    let b = array[start + 1];
    if (compareGenerated(a, b) >` 0) {
      array[start] = b;
      array[start + 1] = a;
    }
  } else if (n < 20) {
    for (let i = start; i `< l; i++) {
      for (let j = i; j >` start; j--) {
        let a = array[j - 1];
        let b = array[j];
        if (compareGenerated(a, b) <= 0) {
          break;
        }
        array[j - 1] = b;
        array[j] = a;
      }
    }
  } else {
    quickSort(array, compareGenerated, start);
  }
}

This yields the following result:

Parse and Sort times

Sorting times drop drastically, while parsing times slightly increase - that happens because the code sorting generatedMappings as part of the parsing loop, making our breakdown slightly meaningless. Lets check comparison of cumulative timings (parsing and sorting together)

Improvements to Total Time

Parse and Sort times

Now it becomes obvious that we considerably improved overall mappings parsing performance.

Is there anything else we could do to improve performance?

It turns out yes: we can pull out a page from asm.js / WASM own playbook without going full-Rust on our JavaScript code base.

Optimizing Parsing - Reducing GC Pressure

We are allocating hundreds of thousands of Mapping objects, which puts considerable pressure on the GC - however we don’t really need these objects to be objects - we can pack them into a typed array. Here is how I did it.

[Few years ago I was really excited about Typed Objects proposal which would allow JavaScript programmers to define structs and arrays of structs and all other amazing things that would come extremely handy here. Unfortunately champions working on that proposal moved away to work on other things leaving us with a choice to write these things either manually or in C++ 😞]

First, I changed Mapping from a normal object into a wrapper that points into a gigantic typed array that would contain all our mappings.

function Mapping(memory) {
  this._memory = memory;
  this.pointer = 0;
}
Mapping.prototype = {
  get generatedLine () {
    return this._memory[this.pointer + 0];
  },
  get generatedColumn () {
    return this._memory[this.pointer + 1];
  },
  get source () {
    return this._memory[this.pointer + 2];
  },
  get originalLine () {
    return this._memory[this.pointer + 3];
  },
  get originalColumn () {
    return this._memory[this.pointer + 4];
  },
  get name () {
    return this._memory[this.pointer + 5];
  },
  set generatedLine (value) {
    this._memory[this.pointer + 0] = value;
  },
  set generatedColumn (value) {
    this._memory[this.pointer + 1] = value;
  },
  set source (value) {
    this._memory[this.pointer + 2] = value;
  },
  set originalLine (value) {
    this._memory[this.pointer + 3] = value;
  },
  set originalColumn (value) {
    this._memory[this.pointer + 4] = value;
  },
  set name (value) {
    this._memory[this.pointer + 5] = value;
  },
};

Then I adjusted the parsing and sorting code to use it like this:

BasicSourceMapConsumer.prototype._parseMappings = function (aStr, aSourceRoot) {
  // Allocate 4 MB memory buffer. This can be proportional to aStr size to
  // save memory for smaller mappings.
  this._memory = new Int32Array(1 * 1024 * 1024);
  this._allocationFinger = 0;
  let mapping = new Mapping(this._memory);
  // ...
  while (index `< length) {
    if (aStr.charAt(index) === ';') {

      // All code that could previously access mappings directly now needs to
      // access them indirectly though memory.
      sortGenerated(this._memory, generatedMappings, previousGeneratedLineStart);
    } else {
      this._allocateMapping(mapping);

      // ...

      // Arrays of mappings now store "pointers" instead of actual mappings.
      generatedMappings.push(mapping.pointer);
      if (segmentLength >` 1) {
        // ...
        originalMappings[currentSource].push(mapping.pointer);
      }
    }
  }

  // ...

  for (var i = 0; i `< originalMappings.length; i++) {
    if (originalMappings[i] != null) {
      quickSort(this._memory, originalMappings[i], util.compareByOriginalPositionsNoSource);
    }
  }
};

BasicSourceMapConsumer.prototype._allocateMapping = function (mapping) {
  let start = this._allocationFinger;
  let end = start + 6;
  if (end >` this._memory.length) {  // Do we need to grow memory buffer?
    let memory = new Int32Array(this._memory.length * 2);
    memory.set(this._memory);
    this._memory = memory;
  }
  this._allocationFinger = end;
  let memory = this._memory;
  mapping._memory = memory;
  mapping.pointer = start;
  mapping.name = 0x7fffffff;  // Instead of null use INT32_MAX.
  mapping.source = 0x7fffffff;  // Instead of null use INT32_MAX.
};

exports.compareByOriginalPositionsNoSource =
  function (memory, mappingA, mappingB, onlyCompareOriginal) {
  var cmp = memory[mappingA + 3] - memory[mappingB + 3];  // originalLine
  if (cmp !== 0) {
    return cmp;
  }

  cmp = memory[mappingA + 4] - memory[mappingB + 4];  // originalColumn
  if (cmp !== 0 || onlyCompareOriginal) {
    return cmp;
  }

  cmp = memory[mappingA + 1] - memory[mappingB + 1];  // generatedColumn
  if (cmp !== 0) {
    return cmp;
  }

  cmp = memory[mappingA + 0] - memory[mappingB + 0];  // generatedLine
  if (cmp !== 0) {
    return cmp;
  }

  return memory[mappingA + 5] - memory[mappingB + 5];  // name
};

As you can see readability does suffer quite a bit. Ideally I would prefer to allocate a temporary Mapping object whenever I need to work with its fields. However such code style would lean heavily on VMs ability to eliminate allocations of these temporary wrappers via allocation sinking, scalar replacement or other similar optimizations. Unfortunately in my experiments SpiderMonkey could not deal with such code well enough and thus I opted for much more verbose and error prone code.

[This sort of almost manual memory management might seem rather foreign in JS. That’s why I think it might be worth mentioning here that “oxidized” source-map actually requires users to manually manage its lifetime to ensure that WASM resources are freed]

Rerunning benchmark confirms that alleviating GC pressure yields a nice improvement

After reworking allocation

After reworking allocation

Interestingly enough on SpiderMonkey this approach improves both parsing and sorting times, which came as a surprise to me.

SpiderMonkey Performance Cliff

As I was playing with this code I also discovered a confusing performance cliff in SpiderMonkey: when I increased the size of preallocated memory buffer from 4MB to 64MB to gauge reallocation costs, benchmark showed a sudden drop in performance after 7th iteration.

After reworking allocation

This looked like some sort of polymorphism to me, but I could not immediately figure out how changing the size of an array can result in a polymorphic behavior.

Puzzled I reached out to a SpiderMonkey hacker Jan de Mooij who very quickly identified an asm.js related optimization from 2012 as a culprit… then he went and removed it from SpiderMonkey so that nobody hits this confusing cliff again.

Optimizing Parsing - Using Uint8Array Instead of a String.

Finally if we start using Uint8Array instead of a string for parsing we get yet another small improvement.

After reworking allocation

[This improvement is predicated on rewriting source-map to parse mappings directly from typed arrays, instead of using JavaScript string and parsing it with JSON.decode. I did not do such rewrite but I don’t anticipate any issues.]

Total Improvements Against the Baseline

Here is where we started:

$ d8 bench-shell-bindings.js
...
[Stats samples: 5, total: 24050 ms, mean: 4810 ms, stddev: 155.91063145276527 ms]
$ sm bench-shell-bindings.js
...
[Stats samples: 7, total: 22925 ms, mean: 3275 ms, stddev: 269.5999093306804 ms]

and this is where we are finishing

$ d8 bench-shell-bindings.js
...
[Stats samples: 22, total: 25158 ms, mean: 1143.5454545454545 ms, stddev: 16.59358125226469 ms]
$ sm bench-shell-bindings.js
...
[Stats samples: 31, total: 25247 ms, mean: 814.4193548387096 ms, stddev: 5.591064299397745 ms]

After reworking allocation

After reworking allocation

This is a factor of 4 improvement!

It might be also worth noting that we are still sorting all originalMappings arrays eagerly even though this is not really needed. There are only two operations that use originalMappings:

  • allGeneratedPositionsFor which returns all generated positions for the given line in the original source;
  • eachMapping(..., ORIGINAL_ORDER) which iterates over all mappings in their original order.

If we assume that allGeneratedPositionsFor is the most common operation and that we are only going to search within a handful of originalMappings[i] arrays then we can vastly improve parsing time by sorting originalMappings[i] arrays lazily whenever we actually need to search one of them.

Finally a comparison of V8 from Jan 19th to V8 from Feb 19th with and without untrusted code mitigations.

After reworking allocation

Comparing to Oxidized source-map Version

Following the publication of this post on February 19th, I got few requests to compare source-map with my tweaks against mainline oxidized source-map that uses Rust and WASM.

Quick look at Rust source code for parse_mappings revealed that Rust version does not collect or sort original mappings eagerly, only equivalent of generatedMappings is produced and sorted. To match this behavior I adjusted my JS version by commenting out sorting of originalMappings[i] arrays.

Here are benchmark results for just parsing (which also includes sorting generatedMappings) and for parsing and then iterating over all generatedMappings.

Parse only times

Parse and iterate times

Note that the comparison is slightly misleading because Rust version does not optimize sorting of generatedMappings in the same way as my JS version does.

Thus I am not gonna declare here that «we have successfully reached parity with the Rust+WASM version». However at this level of performance differences it might make sense to reevaluate if it is even worth the complexity to use Rust in source-map.

Update (Feb 27th 2018)

Nick Fitzgerald, the author of source-map, has updated Rust+WASM version with algorithmic improvements described in this article. Here is an amended performance graph for parse and iterate benchmark:

Parse and iterate times

As you can see WASM+Rust version is now around 15% faster on SpiderMonkey and approximately the same speed on V8.

Learnings

For a JavaScript Developer

Profiler Is Your Friend

Profiling and fine grained performance tracking in various shapes and forms is the best way to stay on top of the performance. It allows you to localize hot-spots in your code and also reveals potential issues in the underlying runtime. For this particular reason don’t shy away from using low-level profiling tools like perf - “friendly” tools might not be telling you the whole story because they hide lower level.

Different performance problems require different approaches to profiling and visualizing collected profiles. Make sure to familiarize yourself with a wide spectrum of available tools.

Algorithms Are Important

Being able to reason about your code in terms of abstract complexity is an important skill. Is it better to quick-sort one array with 100K elements or quick-sort 3333 30-element subarrays?

A bit of handwavy mathematics can guide us (\(100000 \log 100000\) is 3 times larger than \(3333 \times 30 \log 30\)) - and the larger your data is the more important it usually is to be able to do a tiny bit of mathematics.

In addition to knowing your logarithms, you need to posses a healthy amount of common sense and be able to evaluate how your code would be used on average and in the worst case: which operations are common, how the cost of expensive operations can be amortized, what the penalty for amortizing expensive operations?

VMs Are Work in Progress. Bug Developers!

Do not hesitate to reach out to developers to discuss strange performance issues. Not everything can be solved just by changing your own code. The Russian proverb says «It’s not gods who make pots!». VM developers are people and just like all others they make mistakes. They are also quite good at fixing those mistakes once you reach out to them. One mail or chat message or a DM might save you days of digging through foreign C++ code.

VMs Still Need a Bit of Help

Sometimes you need to write low-level code or know low-level details to squeeze the last drops of that performance juice out of JavaScript.

One could prefer a better language level facilities to achieve that, but it remains to be seen if we ever get there.

For a Language Implementor/Designer

Clever Optimizations Must be Diagnosable

If your runtime has any sort of built-in clever optimizations then you need to provide a straightforward tool to diagnose when these optimizations fail and deliver an actionable feedback to the developer.

In the context of languages like JavaScript this at minimum means that tools like profiler should also provide you with a way to inspect individual operations to figure out whether VM specializes them well and it it does not - what is the reason for that.

This sort of introspection should not require building custom versions of the VM with magic flags and then treading through megabytes of undocumented debug output. This sort of tools should be right there, when you open your DevTools window.

Language and Optimizations Must Be Friends

Finally as a language designer you should attempt to foresee where the language lacks features which make it easier to write well performing code. Are your users on the market for a way to layout and manage memory manually? I am sure they are. If your language is even remotely popular users would eventually succeed in writing code that performs poorly. Weight the cost of adding language features that fix performance problems against solving the same performance problems by other means (e.g. by more sophisticated optimizations or by asking users to rewrite their code in Rust).

This works the other way around too: if your language has features make sure that they perform reasonably well and their performance is both well understood by users and can be easily diagnosed. Invest in making your whole language optimized, instead of having a well performing core surrounded by poorly performing long tail of rarely used features.

Afterword

Most optimizations we discovered in this post fall into three different groups:

  1. algorithmic improvements;
  2. workarounds for implementation independent, but potentially language dependent issues;
  3. workarounds for V8 specific issues.

No matter which language you write in you still need to think about algorithms. It is easier to notice when you are using worse algorithms in inherently “slower” languages, but just reimplementing the same algorithms in a “faster” language does not solve the problem even though it might alleviate the symptoms. Large part of the post is dedicated to optimizations from this group:

  • sorting improvements achieved by sorting subsequences rather than the whole array;
  • discussions of caching benefits or lack of them there-off.

The second group is represented by the monomorphisation trick. Performance suffering due to polymorphism is not a V8 specific issue. Neither it is a JS specific issue. You can apply monomorphisation across implementations and even languages. Some languages (Rust, actually) apply it in some form for you under the hood.

The last and most controversial group is represented by argument adaptation stuff.

Finally an optimization I did to mappings representation (packing individual objects into a single typed array) is an optimization that spans all three groups. It’s about understanding limitations and costs of a GCed system as whole. It’s also about understanding strength of existing JS VMs and utilizing it for our advantage.

So… Why did I choose the title? That’s because I think that the third group represents all issues which should and would be fixed over time. Other groups represent universal knowledge that spans across implementations and languages.

Obviously each developer and each team are free to choose between spending N rigorous hours profiling and reading and thinking about their JavaScript code, or to spend M hours rewriting their stuff in a language X.

However: (a) everybody needs to be fully aware that the choice even exists; and (b) language designers and implementors should work together on making this choice less and less obvious - which means working on language features and tools and reducing the need in “group №3” optimizations.