Monday, May 15, 2017

Launching Ignition and TurboFan


Today we are excited to announce the launch of a new JavaScript execution pipeline for V8 5.9 that will reach Chrome Stable in M59. With the new pipeline, we achieve big performance improvements and significant memory savings on real-world JavaScript applications. We’ll discuss the numbers in more detail at the end of this post, but first let’s take a look at the pipeline itself.

The new pipeline is built upon Ignition, V8’s interpreter, and TurboFan, V8’s newest optimizing compiler. These technologies should be familiar to those of you who have followed the V8 blog over the last few years, but the switch to the new pipeline marks a big new milestone for both.

For the first time, Ignition and TurboFan are used universally and exclusively for JavaScript execution in V8 5.9. Furthermore, starting with 5.9, Full-codegen and Crankshaft, the technologies that served V8 well since 2010, are no longer used in V8 for JavaScript execution, since they are no longer able to keep pace with new JavaScript language features and the optimizations those features require. We plan to remove them completely very soon. That means that V8 will have a much simpler and more maintainable architecture overall going forward.

A Long Journey


The combined Ignition and TurboFan pipeline has been in development for almost 3½ years. It represents the culmination of the collective insight that the V8 team has gleaned from measuring real-world JavaScript performance and carefully considering the shortcomings of Full-codegen and Crankshaft. It is a foundation with which we will be able to continue to optimize the entirety of the JavaScript language for years to come.

The TurboFan project originally started in late 2013 to address the shortcomings of Crankshaft. Crankshaft can only optimize a subset of the JavaScript language. For example, it was not designed to optimize JavaScript code using structured exception handling, i.e. code blocks demarcated by JavaScript’s try, catch, and finally keywords. It is difficult to add support for new language features in Crankshaft, since these features almost always require writing architecture-specific code for nine supported platforms. Furthermore, Crankshaft’s architecture is limited in the extent to which it can generate optimal machine code. It can only squeeze so much performance out of JavaScript, despite requiring the V8 team to maintain more than ten thousand lines of code per chip architecture.

TurboFan was designed from the beginning not only to optimize all of the language features found in the JavaScript standard at the time, ES5, but also all the future features planned for ES2015 and beyond. It introduces a layered compiler design that enables a clean separation between high-level and low-level compiler optimizations, making it easy to add new language features without modifying architecture-specific code. TurboFan adds an explicit instruction selection compilation phase that makes it possible to write far less architecture-specific code for each supported platform in the first place. With this new phase, architecture-specific code is written once and it rarely needs to be changed. These and other decisions lead to a more maintainable and extensible optimizing compiler for all of the architectures that V8 supports.

The original motivation behind V8’s Ignition interpreter was to reduce memory consumption on mobile devices. Before Ignition, the code generated by V8’s Full-codegen baseline compiler typically occupied almost one third of the overall JavaScript heap in Chrome. That left less space for a web application’s actual data. When Ignition was enabled for Chrome M53 on Android devices with limited RAM, the memory footprint required for baseline, non-optimized JavaScript code shrank by a factor of nine on ARM64-based mobile devices.

Later the V8 team took advantage of the fact that Ignition’s bytecode can be used to generate optimized machine code with TurboFan directly rather than having to re-compile from source code as Crankshaft did. Ignition’s bytecode provides a cleaner and less error-prone baseline execution model in V8, simplifying the deoptimization mechanism that is a key feature of V8’s adaptive optimization. Finally, since generating bytecode is faster than generating Full-codegen’s baseline compiled code, activating Ignition generally improves script startup times and in turn, web page loads.

Coupling the design of Ignition and TurboFan closely brings even more benefits to the overall architecture. For example, rather than writing Ignition’s high-performance bytecode handlers in hand-coded assembly, the V8 team instead uses TurboFan’s intermediate representation to express the handlers’ functionality and lets TurboFan do the optimization and final code generation for V8’s numerous supported platforms. This ensures Ignition performs well on all of V8’s supported chip architectures while simultaneously eliminating the burden of maintaining nine separate platform ports.

Running the Numbers


History aside, now let’s take a look at the new pipeline’s real-world performance and memory consumption.

The V8 team continually monitors the performance of real-world use cases using the Telemetry - Catapult framework. Previously in this blog we’ve discussed why it’s so important to use the data from real-world tests to drive our performance optimization work and how we use WebPageReplay together with Telemetry to do so. The switch to Ignition and TurboFan shows performance improvements in those real-world test cases. Specifically, the new pipeline results in significant speed-ups on user interaction story tests for well-known websites:

Reduction in time spent in V8 for user interaction benchmarks

Although Speedometer is a synthetic benchmark, we’ve previously uncovered that it does a better job of approximating the real-world workloads of modern JavaScript than other synthetic benchmarks. The switch to Ignition and TurboFan improves V8’s Speedometer score by 5%-10%, depending on platform and device.

The new pipeline also speeds up server-side JavaScript. AcmeAir, a benchmark for Node.js that simulates the server backend implementation of a fictitious airline, runs more than 10% faster using V8 5.9.

Improvements on Web and Node.js benchmarks

Ignition and TurboFan also reduce V8’s overall memory footprint. In Chrome M59, the new pipeline slims V8’s memory footprint on desktop and high-end mobile devices by 5-10%. This reduction is a result of bringing the Ignition memory savings that have been previously covered in this blog to all devices and platforms supported by V8.

These improvements are just the start. The new Ignition and TurboFan pipeline paves the way for further optimizations that will boost JavaScript performance and shrink V8’s footprint in both Chrome and in Node.js for years to come. We look forward to sharing those improvements with you as we roll them out to developers and users. Stay tuned.

Posted by the V8 team

Wednesday, May 3, 2017

Energizing Atom with V8's custom start-up snapshot

The team behind Atom, a text editor based on the Electron framework, recently published an article detailing significant improvements to start-up time, owing big parts of the gains to the usage of V8's custom start-up snapshot. Awesome!

Electron apps, including Atom, use Chromium to display a GUI and Node.js as an execution environment, both of which embed V8 to run JavaScript. This allows Electron apps to take advantage of V8 snapshots to quickly initialize a previously serialized heap for faster startup. Electron developers have even released electron-link, a convenience library for setting up this feature, which Atom heavily relies on for its performance optimizations.

We announced V8's support for custom start-up snapshots more than a year ago. With v8::V8::CreateSnapshotDataBlob, embedders can provide an additional script to customize a start-up snapshot. New contexts created from this snapshot are initialized as if the additional script had already been executed. As the Atom team has shown, using a custom start-up snapshot can significantly boost start-up performance.
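
To make this more concrete, here is a minimal sketch of how an embedder might create and consume such a blob. It is illustrative only: the embedded script is made up, V8 platform and ICU initialization are omitted, and it assumes the snapshot_blob and array_buffer_allocator fields of v8::Isolate::CreateParams.

#include "v8.h"

v8::StartupData CreateCustomSnapshot() {
  // The additional script baked into the snapshot (illustrative example).
  const char* source =
      "function fib(n) { return n < 2 ? n : fib(n - 1) + fib(n - 2); }";
  return v8::V8::CreateSnapshotDataBlob(source);
}

void UseCustomSnapshot(v8::StartupData* blob) {
  v8::Isolate::CreateParams create_params;
  create_params.snapshot_blob = blob;
  create_params.array_buffer_allocator =
      v8::ArrayBuffer::Allocator::NewDefaultAllocator();
  v8::Isolate* isolate = v8::Isolate::New(create_params);
  {
    v8::Isolate::Scope isolate_scope(isolate);
    v8::HandleScope handle_scope(isolate);
    // This context starts out as if the embedded script had already run,
    // so fib() is available without parsing or compiling it again.
    v8::Local<v8::Context> context = v8::Context::New(isolate);
  }
  isolate->Dispose();
  delete create_params.array_buffer_allocator;
  // The embedder owns blob->data and is responsible for freeing it.
}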

To quote the Atom article:
The tricky part of using this technology, however, is that the code is executed in a bare V8 context. In other words, it only allows us to run plain JavaScript code and does not provide access to native modules, Node/Electron APIs or DOM manipulation.
To work around this restriction, electron-link goes to great lengths to make sure that native functions (i.e. functions backed by C++) are not included in the snapshot, since V8's serializer simply did not know how to serialize them. Instead, native functions are wrapped into helper functions that load them lazily at runtime.

Since our original post about snapshots, the V8 team has continued developing the snapshot API and added many features and improvements. These features include support for serializing native functions. Provided that the backing C++ functions have been registered with V8, the serializer can now recognize and encode native functions for the deserializer to restore later. We’re happy to announce that the work-around in electron-link is no longer necessary.

One caveat remains: the serializer cannot directly capture state outside of V8, for example changes to the DOM in the case of Atom. However, outside state directly attached to JavaScript objects via embedder fields (previously named "internal fields") can now be serialized and deserialized through a new callback API. With some work, this feature allows outside state to be put into the snapshot after all.

Some other highlights include:

  • FunctionTemplate and ObjectTemplate objects can now be added to and extracted from the snapshot, so that they do not have to be set up from scratch.
  • V8 can now run multiple scripts or apply any other modifications to the context before creating the snapshot, as opposed to only a single source string.
  • Multiple differently-configured contexts can now be included in the same start-up snapshot blob (a sketch follows the code listing below).

To make these features more accessible, we designed a new, more powerful API with v8::SnapshotCreator. The old API is now merely a wrapper around this underlying API. This is how v8::V8::CreateSnapshotDataBlob is implemented:
StartupData V8::CreateSnapshotDataBlob(const char* embedded_source) {
  StartupData result = {nullptr, 0};
  {
    SnapshotCreator snapshot_creator;
    // Obtain an isolate provided by SnapshotCreator.
    Isolate* isolate = snapshot_creator.GetIsolate();
    {
      HandleScope scope(isolate);
      // Create a new context and optionally run some script.
      Local<Context> context = Context::New(isolate);
      if (embedded_source != NULL &&
          !RunExtraCode(isolate, context, embedded_source, "")) {
        return result;
      }
      // Add the possibly customized context to the SnapshotCreator.
      snapshot_creator.SetDefaultContext(context);
    }
    // Use the SnapshotCreator to create the snapshot blob.
    result = snapshot_creator.CreateBlob(
        SnapshotCreator::FunctionCodeHandling::kClear);
  }
  return result;
}
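
The same SnapshotCreator can also hold more than one context, which is what the "multiple differently-configured contexts" bullet above refers to. The following rough sketch assumes the AddContext and Context::FromSnapshot entry points of the new API and leaves out error handling; it is written in the same unqualified style as the listing above:

StartupData CreateMultiContextSnapshotDataBlob() {
  SnapshotCreator snapshot_creator;
  Isolate* isolate = snapshot_creator.GetIsolate();
  size_t index = 0;
  {
    HandleScope scope(isolate);
    // The default context is what a plain Context::New(isolate) restores later.
    snapshot_creator.SetDefaultContext(Context::New(isolate));
    // An additional, differently-configured context is stored under an index.
    Local<Context> extra_context = Context::New(isolate);
    // ... customize extra_context here, e.g. by running set-up scripts ...
    index = snapshot_creator.AddContext(extra_context);
  }
  // In an isolate created from this blob, the extra context can be restored via
  //   MaybeLocal<Context> restored = Context::FromSnapshot(isolate, index);
  return snapshot_creator.CreateBlob(
      SnapshotCreator::FunctionCodeHandling::kClear);
}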

For more advanced code examples, take a look at these test cases:


The new API is available in V8 version 5.7 and later. We hope that these new features will help embedders make even better use of custom start-up snapshots. If you have any questions, please reach out to our v8-users mailing list.

Posted by Yang Guo

Thursday, April 27, 2017

V8 Release 5.9


Every six weeks, we create a new branch of V8 as part of our release process. Each version is branched from V8’s git master immediately before a Chrome Beta milestone. Today we’re pleased to announce our newest branch, V8 version 5.9, which will be in beta until it is released in coordination with Chrome 59 Stable in several weeks. V8 5.9 is filled with all sorts of developer-facing goodies. We’d like to give you a preview of some of the highlights in anticipation of the release.

Ignition+TurboFan launched

V8 5.9 is going to be the first version with Ignition+TurboFan enabled by default. In general, this switch should lead to lower memory consumption and faster startup for web applications across the board, and we don’t expect stability or performance issues because the new pipeline has already undergone significant testing. However, please reach out to us in case your code suddenly regresses significantly in performance.

A dedicated blog post will delve deeper into this topic soon.

WebAssembly TrapIf support on all platforms

The TrapIf support significantly reduced the time spent compiling code (~30%).


V8 API

Please check out our summary of API changes. This document is regularly updated a few weeks after each major release.

Developers with an active V8 checkout can use 'git checkout -b 5.9 -t branch-heads/5.9' to experiment with the new features in V8 5.9. Alternatively you can subscribe to Chrome's Beta channel and try the new features out yourself soon.

Posted by the V8 team

Wednesday, April 12, 2017

Retiring Octane

The genesis of Octane

The history of JavaScript benchmarks is a story of constant evolution. As the web expanded from simple documents to dynamic client-side applications, new JavaScript benchmarks were created to measure workloads that became important for new use cases. This constant change has given individual benchmarks finite lifespans. As web browser and virtual machine (VM) implementations begin to over-optimize for specific test cases, benchmarks themselves cease to be effective proxies for their original use cases. One of the first JavaScript benchmarks, SunSpider, provided early incentives for shipping fast optimizing compilers. However, as VM engineers uncovered the limitations of microbenchmarks and found new ways to optimize around SunSpider’s limitations, the browser community retired SunSpider as a recommended benchmark.

Designed to mitigate some of the weaknesses of early microbenchmarks, the Octane benchmark suite was first released in 2012. It evolved from an earlier set of simple V8 test cases and became a common benchmark for general web performance. Octane consists of 17 different tests, which were designed to cover a variety of different workloads, ranging from Martin Richards’ kernel simulation test to a version of Microsoft’s TypeScript compiler compiling itself. The contents of Octane represented the prevailing wisdom around measuring JavaScript performance at the time of its creation.

Diminishing returns and over-optimization

In the first few years after its release, Octane provided a unique value to the JavaScript VM ecosystem. It allowed engines, including V8, to optimize their performance for a class of applications that stressed peak performance. These CPU-intensive workloads were initially underserviced by VM implementations. Octane helped engine developers deliver optimizations that allowed computationally-heavy applications to reach speeds that made JavaScript a viable alternative to C++ or Java. In addition, Octane drove improvements in garbage collection which helped web browsers avoid long or unpredictable pauses.

By 2015, however, most JavaScript implementations had implemented the compiler optimizations needed to achieve high scores on Octane. Striving for even higher benchmark scores on Octane translated into increasingly-marginal improvements in the performance of real web pages. Investigations into the execution profile of running Octane versus loading common websites (such as Facebook, Twitter, or Wikipedia) revealed that the benchmark doesn’t exercise V8’s parser or the browser loading stack the way real-world code does. Moreover, the style of Octane’s JavaScript doesn’t match the idioms and patterns employed by most modern frameworks and libraries (not to mention transpiled code or newer ES2015+ language features). This means that using Octane to measure V8 performance didn’t capture important use cases for the modern web, such as loading frameworks quickly, supporting large applications with new patterns of state management, or ensuring that ES2015+ features are as fast as their ES5 equivalents.

In addition, we began to notice that JavaScript optimizations which eked out higher Octane scores often had a detrimental effect on real-world scenarios. Octane encourages aggressive inlining to minimize the overhead of function calls, but inlining strategies that are tailored to Octane have led to regressions from increased compilation costs and higher memory usage in real-world use cases. Even when an optimization may be genuinely useful in the real world, as is the case with dynamic pretenuring, chasing higher Octane scores can result in developing overly-specific heuristics which have little effect or even degrade performance in more generic cases. We found that Octane-derived pretenuring heuristics led to performance degradations in modern frameworks such as Ember. Optimizations around the `instanceof` operator were another example of tailoring to a narrow set of Octane-specific cases that led to significant regressions in Node.js applications.

Another problem is that over time, small bugs in Octane can become targets for optimizations themselves. For example, in the Box2DWeb benchmark, taking advantage of a bug where two objects were compared using the `<` and `>=` operators gave a ~15% performance boost on Octane. Unfortunately, this optimization had no effect in the real world and complicates more general types of comparison optimizations. Octane sometimes even penalizes real-world optimizations: engineers working on other VMs have noticed that Octane seems to penalize lazy parsing, a technique that helps most real websites load faster given the amount of dead code frequently found in the wild.

Beyond Octane and other synthetic benchmarks

These examples are just some of the many optimizations which increased Octane scores to the detriment of running real websites. Unfortunately, similar issues exist in other static or synthetic benchmarks, including Kraken and JetStream. Simply put, such benchmarks are insufficient methods of measuring real-world speed and create incentives for VM engineers to over-optimize narrow use cases and under-optimize generic cases, slowing down JavaScript code in the wild.

Given the plateau in scores across most JS VMs and the increasing conflict between optimizing for specific Octane benchmarks rather than implementing speedups for a broader range of real-world code, we believe that it is time to retire Octane as a recommended benchmark.

Octane enabled the JS ecosystem to make large gains in computationally-expensive JavaScript. The next frontier, however, is improving the performance of real web pages, modern libraries, frameworks, ES2015+ language features, new patterns of state management, immutable object allocation, and module bundling. Since V8 runs in many environments, including server side in Node.js, we are also investing time in understanding real-world Node applications and measuring server-side JavaScript performance through workloads such as AcmeAir.

Check back here for more posts about improvements in our measurement methodology and new workloads that better represent real-world performance. We are excited to continue pursuing the performance that matters most to users and developers!

Posted by the V8 team

Monday, March 20, 2017

V8 Release 5.8

Every six weeks, we create a new branch of V8 as part of our release process. Each version is branched from V8’s git master immediately before a Chrome Beta milestone. Today we’re pleased to announce our newest branch, V8 version 5.8, which will be in beta until it is released in coordination with Chrome 58 Stable in several weeks. V8 5.8 is filled with all sorts of developer-facing goodies. We’d like to give you a preview of some of the highlights in anticipation of the release.

Arbitrary heap sizes

Historically the V8 heap limit was conveniently set to fit the signed 32-bit integer range with some margin. Over time this convenience led to sloppy code in V8 that mixed types of different bit widths, effectively breaking the ability to increase the limit. In 5.8 we enabled the use of arbitrary heap sizes. See the dedicated blog post for more information.

Startup performance

In 5.8 we continued to incrementally reduce the time spent in V8 during startup. Reductions in the time spent compiling and parsing code, as well as optimizations in the IC system, yielded ~5% improvements on our real-world startup workloads.

V8 API

Please check out our summary of API changes. This document is regularly updated a few weeks after each major release.

Developers with an active V8 checkout can use 'git checkout -b 5.8 -t branch-heads/5.8' to experiment with the new features in V8 5.8. Alternatively you can subscribe to Chrome's Beta channel and try the new features out yourself soon.

Posted by the V8 team

Wednesday, March 1, 2017

Fast For-In in V8

For-in is a widely used language feature present in many frameworks. Despite its ubiquity, it is one of the more obscure language constructs from an implementation perspective. V8 went to great lengths to make this feature as fast as possible. Over the course of the past year, for-in became fully spec compliant and up to 3 times faster, depending on the context.

Many popular websites rely heavily on for-in and benefit from its optimization. For example, in early 2016 Facebook spent roughly 7% of its total JavaScript time during startup in the implementation of for-in itself. On Wikipedia this number was even higher at around 8%. By improving the performance of certain slow cases, Chrome 51 significantly improved the performance on these two websites:




Facebook and Wikipedia both improved their total script time by 4% due to various for-in improvements. Note that during the same period, the rest of V8 also got faster, which yielded a total scripting improvement of more than 4%.

In the rest of this blog post we will explain how we managed to speed up this core language feature and fix a long-standing spec violation at the same time.

The Spec

TL;DR: The for-in iteration semantics are fuzzy for performance reasons.

When we look at the spec text of for-in, it’s written in an unexpectedly fuzzy way, which is observable across different implementations. Let's look at an example of iterating over a Proxy object with the proper traps set:
let proxy = new Proxy({a: 1, b: 1}, {
  getPrototypeOf(target) {
    console.log("getPrototypeOf");
    return null;
  },
  ownKeys(target) {
    console.log("ownKeys");
    return Reflect.ownKeys(target);
  },
  getOwnPropertyDescriptor(target, prop) {
    console.log("getOwnPropertyDescriptor name=" + prop);
    return Reflect.getOwnPropertyDescriptor(target, prop);
  }
});

for (const key in proxy) {
  console.log(key);
}

In V8/Chrome 56 you get the following output:
ownKeys
getPrototypeOf
getOwnPropertyDescriptor name=a
a
getOwnPropertyDescriptor name=b
b

In contrast, you will see a different order of statements for the same snippet in Firefox 51:
ownKeys 
getOwnPropertyDescriptor name=a 
getOwnPropertyDescriptor name=b 
getPrototypeOf 
a 
b

Both browsers respect the spec, but for once the spec does not enforce an explicit order of instructions. To understand these loopholes properly, let's have a look at the spec text:
EnumerateObjectProperties ( O )
When the abstract operation EnumerateObjectProperties is called with argument O, the following steps are taken:
  1. Assert: Type(O) is Object. 
  2. Return an Iterator object (25.1.1.2) whose next method iterates over all the String-valued keys of enumerable properties of O. The iterator object is never directly accessible to ECMAScript code. The mechanics and order of enumerating the properties is not specified but must conform to the rules specified below. 
Now, usually spec instructions are precise about what exact steps are required. But in this case they refer to a simple list of prose, and even the order of execution is left to implementers. Typically, the reason for this is that such parts of the spec were written after the fact, when JavaScript engines already had different implementations. The spec tries to tie up the loose ends by providing the following instructions:
  1. The iterator's throw and return methods are null and are never invoked. 
  2. The iterator's next method processes object properties to determine whether the property key should be returned as an iterator value. 
  3. Returned property keys do not include keys that are Symbols. 
  4. Properties of the target object may be deleted during enumeration. 
  5. A property that is deleted before it is processed by the iterator's next method is ignored. If new properties are added to the target object during enumeration, the newly added properties are not guaranteed to be processed in the active enumeration. 
  6. A property name will be returned by the iterator's next method at most once in any enumeration. 
  7. Enumerating the properties of the target object includes enumerating properties of its prototype, and the prototype of the prototype, and so on, recursively; but a property of a prototype is not processed if it has the same name as a property that has already been processed by the iterator's next method. 
  8. The values of [[Enumerable]] attributes are not considered when determining if a property of a prototype object has already been processed. 
  9. The enumerable property names of prototype objects must be obtained by invoking EnumerateObjectProperties passing the prototype object as the argument. 
  10. EnumerateObjectProperties must obtain the own property keys of the target object by calling its [[OwnPropertyKeys]] internal method. 
These steps sound tedious; however, the specification also contains an example implementation which is explicit and much more readable:
function* EnumerateObjectProperties(obj) {
  const visited = new Set();
  for (const key of Reflect.ownKeys(obj)) {
    if (typeof key === "symbol") continue;
    const desc = Reflect.getOwnPropertyDescriptor(obj, key);
    if (desc && !visited.has(key)) {
      visited.add(key);
      if (desc.enumerable) yield key;
    }
  }
  const proto = Reflect.getPrototypeOf(obj);
  if (proto === null) return;
  for (const protoKey of EnumerateObjectProperties(proto)) {
    if (!visited.has(protoKey)) yield protoKey;
  }
}

Now that you've made it this far, you might have noticed from the previous example that V8 does not exactly follow the spec example implementation. As a start, the example for-in generator works incrementally, while V8 collects all keys upfront, mostly for performance reasons. This is perfectly fine, and in fact the spec text explicitly states that the order of operations 1 to 10 is not defined. Nevertheless, as you will find out later in this post, there are some corner cases where V8 did not fully respect the specification until 2016.

The Enum Cache

The example implementation of the for-in generator follows an incremental pattern of collecting and yielding keys. In V8 the property keys are collected in a first step and only then used in the iteration phase. For V8 this makes a few things easier. To understand why, we need to have a look at the object model.

A simple object such as {a:"value a", b:"value b", c:"value c"} can have various internal representations in V8 as we will show in a detailed follow-up post on properties. This means that depending on what type of properties we have—in-object, fast or slow—the actual property names are stored in different places. This makes collecting enumerable keys a non-trivial undertaking.

V8 keeps track of the object's structure by means of a hidden class or so-called Map. Objects with the same Map have the same structure. Additionally each Map has a shared data-structure, the descriptor array, which contains details about each property, such as where the properties are stored on the object, the property name, and details such as enumerability.

Let’s for a moment assume that our JavaScript object has reached its final shape and no more properties will be added or removed. In this case we could use the descriptor array as a source for the keys. This works if there are only enumerable properties. To avoid the overhead of filtering out non-enumerable properties each time, V8 uses a separate EnumCache accessible via the Map's descriptor array.


Given that V8 expects that slow dictionary objects frequently change (i.e. through addition and removal of properties), there is no descriptor array for slow objects with dictionary properties. Hence, V8 does not provide an EnumCache for slow properties. Similar assumptions hold for indexed properties, and as such they are excluded from the EnumCache as well.

Let’s summarize the important facts:
  • Maps are used to keep track of object shapes. 
  • Descriptor arrays store information about properties (name, configurability, visibility). 
  • Descriptor arrays can be shared between Maps. 
  • Each descriptor array can have an EnumCache listing only the enumerable named keys, not indexed property names.
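
The relationships in this list can be pictured with a small toy model. The following C++ is purely illustrative; the names echo the prose above but do not correspond to V8's actual class definitions:

#include <string>
#include <vector>

struct PropertyDetails {
  std::string name;
  bool enumerable;  // plus storage location, configurability, etc. in real V8
};

struct EnumCache {
  std::vector<std::string> keys;  // enumerable *named* keys only, no element indices
};

struct DescriptorArray {  // shared between Maps describing related object shapes
  std::vector<PropertyDetails> descriptors;
  EnumCache enum_cache;   // filled lazily, reused by later for-in loops
};

struct Map {  // the "hidden class": objects with the same shape share a Map
  DescriptorArray* instance_descriptors;
  int enum_length;  // number of enumerable own named properties
};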

The Mechanics of For-In

By now you have a partial picture of how Maps work and how the EnumCache relates to the descriptor array. V8 executes JavaScript via Ignition, a bytecode interpreter, and TurboFan, the optimizing compiler, which both deal with for-in in similar ways. For simplicity we will use a pseudo-C++ style to explain how for-in is implemented internally:

// For-In Prepare:
FixedArray* keys = nullptr;
Map* original_map = object->map();
if (original_map->HasEnumCache()) {
  if (object->HasNoElements()) {
    keys = original_map->GetCachedEnumKeys();
  } else {
    keys = object->GetCachedEnumKeysWithElements();
  }
} else {
  keys = object->GetEnumKeys();
}

// For-In Body:
for (size_t i = 0; i < keys->length(); i++) {
  // For-In Next:
  String* key = keys[i];
  if (!object->HasProperty(key)) continue;
  EVALUATE_FOR_IN_BODY();
}

For-in can be separated into three main steps:
  1. Preparing the keys to iterate over, 
  2. Getting the next key, 
  3. Evaluating the for-in body. 

The "prepare" step is the most complex of the three, and it is where the EnumCache comes into play. In the example above you can see that V8 directly uses the EnumCache if it exists and if there are no elements (integer-indexed properties) on the object (and its prototype). For the case where there are indexed property names, V8 jumps to a runtime function implemented in C++ which prepends them to the existing enum cache, as illustrated by the following example:

FixedArray* JSObject::GetCachedEnumKeysWithElements() {
  FixedArray* keys = object->map()->GetCachedEnumKeys();
  return object->GetElementsAccessor()->PrependElementIndices(object, keys);
}

FixedArray* Map::GetCachedEnumKeys() {
  // Get the enumerable property keys from a possibly shared enum cache
  FixedArray* keys_cache = descriptors()->enum_cache()->keys_cache();
  if (enum_length() == keys_cache->length()) return keys_cache;
  return keys_cache->CopyUpTo(enum_length());
}

FixedArray* FastElementsAccessor::PrependElementIndices(
      JSObject* object, FixedArray* property_keys) {
  Assert(object->HasFastElements());
  FixedArray* elements = object->elements();
  int nof_indices = CountElements(elements);
  FixedArray* result = FixedArray::Allocate(property_keys->length() + nof_indices);
  int insertion_index = 0;
  for (int i = 0; i < elements->length(); i++) {
    if (!HasElement(elements, i)) continue;
    result[insertion_index++] = String::FromInt(i);
  }
  // Insert property keys at the end.
  property_keys->CopyTo(result, nof_indices);
  return result;
}

In the case where no existing EnumCache was found we jump again to C++ and follow the initially presented spec steps:

FixedArray* JSObject::GetEnumKeys() {
  // Get the receiver’s enum keys.
  FixedArray* keys = this->GetOwnEnumKeys();
  // Walk up the prototype chain.
  for (JSObject* object : GetPrototypeIterator()) {
     // Append non-duplicate keys to the list.
     keys = keys->UnionOfKeys(object->GetOwnEnumKeys());
  }
  return keys;
}

FixedArray* JSObject::GetOwnEnumKeys() {
  FixedArray* keys;
  if (this->HasEnumCache()) {
    keys = this->map()->GetCachedEnumKeys();
  } else {
    keys = this->GetEnumPropertyKeys();
  }
  if (this->HasFastProperties()) this->map()->FillEnumCache(keys);
  return this->GetElementsAccessor()->PrependElementIndices(this, keys);
}


FixedArray* FixedArray::UnionOfKeys(FixedArray* other) {
  int length = this->length();
  FixedArray* result = FixedArray::Allocate(length + other->length());
  this->CopyTo(result, 0);
  int insertion_index = length;
  for (int i = 0; i < other->length(); i++) {
    String* key = other->get(i);
    if (this->IndexOf(key) == -1) {
      result->set(insertion_index, key);
      insertion_index++;
    }
  }
  result->Shrink(insertion_index);
  return result;
}

This simplified C++ code corresponds to the implementation in V8 until early 2016, when we started to look at the UnionOfKeys method. If you look closely, you will notice that we used a naive algorithm to exclude duplicates from the list, which can yield bad performance if there are many keys on the prototype chain. This is why we decided to pursue the optimizations described in the following section.

Problems with For-In

As we already hinted in the previous section, the UnionOfKeys method has bad worst-case performance. It was based on the valid assumption that most objects have fast properties and thus will benefit from an EnumCache. The second assumption is that there are only a few enumerable properties on the prototype chain, limiting the time spent finding duplicates. However, if the object has slow dictionary properties and many keys on the prototype chain, UnionOfKeys becomes a bottleneck, as we have to collect the enumerable property names each time we enter for-in.

Besides the performance issues, there was another problem with the existing algorithm: it was not spec-compliant. V8 got the following example wrong for many years:
var o = {
  __proto__: {b: 3},
  a: 1
};
Object.defineProperty(o, "b", {});

for (var k in o) print(k);

Output:
 "a"
 "b"
Perhaps counterintuitively, this should just print out "a" instead of "a" and "b". If you recall the spec text at the beginning of this post, steps 7 and 10 imply that non-enumerable properties on the receiver shadow properties on the prototype chain.

To make things more complicated, ES6 introduced the Proxy object, which broke a lot of assumptions in the V8 code. To implement for-in in a spec-compliant manner, we have to trigger the following 5 of the 13 different proxy traps:


Internal Method       Handler Method
[[GetPrototypeOf]]    getPrototypeOf
[[GetOwnProperty]]    getOwnPropertyDescriptor
[[HasProperty]]       has
[[Get]]               get
[[OwnPropertyKeys]]   ownKeys

This required a duplicate version of the original GetEnumKeys code, one which tried to follow the spec example implementation more closely. ES6 proxies and the lack of handling for shadowing properties were the core motivation for us to refactor how we extract all the keys for for-in in early 2016.

The KeyAccumulator

We introduced a separate helper class, the KeyAccumulator, which dealt with the complexities of collecting the keys for for-in. With the growth of the ES6 spec, new features like Object.keys or Reflect.ownKeys required their own slightly modified version of collecting keys. By having a single configurable place, we could improve the performance of for-in and avoid duplicated code.

The KeyAccumulator consists of a fast part that only supports a limited set of actions but is able to complete them very efficiently. The slow accumulator supports all the complex cases, like ES6 Proxies.

In order to properly filter out shadowing properties we have to maintain a separate list of non-enumerable properties that we have seen so far. For performance reasons we only do this after we figure out that there are enumerable properties on the prototype chain of an object.
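
As a rough illustration of that filtering step (this is not the actual KeyAccumulator code), keys collected from a prototype can be checked against the non-enumerable names seen closer to the receiver:

#include <set>
#include <string>
#include <vector>

// Illustrative only: `shadowing_keys` is the separate list of non-enumerable
// property names encountered on the receiver (or objects closer to it); any
// prototype key with the same name is dropped from the for-in result.
std::vector<std::string> FilterPrototypeKeys(
    const std::vector<std::string>& prototype_keys,
    const std::set<std::string>& shadowing_keys) {
  std::vector<std::string> result;
  for (const std::string& key : prototype_keys) {
    if (shadowing_keys.count(key) == 0) result.push_back(key);
  }
  return result;
}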

Performance Improvements

With the KeyAccumulator in place, a few more patterns became feasible to optimize. The first one was to avoid the nested loop of the original UnionOfKeys method which caused slow corner cases. In a second step we performed more detailed pre-checks to make use of existing EnumCaches and avoid unnecessary copy steps.

To illustrate that the spec-compliant implementation is faster, let’s have a look at the following four different objects:

var fastProperties = {
  __proto__: null,
  "property 1": 1,
  …
  "property 10": n
}

var fastPropertiesWithPrototype = {
  "property 1": 1,
  …
  "property 10": n
}

var slowProperties = {
  __proto__: null,
  "dummy": null,
  "property 1": 1,
  …
  "property 10": n
}
delete slowProperties["dummy"]

var elements = {
  __proto__: null,
  "1": 1,
  …
  "10": n
}

  • The fastProperties object has standard fast properties. 
  • The fastPropertiesWithPrototype object has additional non-enumerable properties on the prototype chain by using the Object.prototype. 
  • The slowProperties object has slow dictionary properties. 
  • The elements object has only indexed properties. 

The following graph compares the original performance of running a for-in loop a million times in a tight loop without the help of our optimizing compiler.


As we've outlined in the introduction, these improvements became very visible on Facebook and Wikipedia in particular. 




Besides the initial improvements available in Chrome 51, a second performance tweak yielded another significant improvement. The following graph shows our tracking data of the total time spent in scripting during startup on a Facebook page. The selected range around V8 revision 37937 corresponds to an additional 4% performance improvement!


To underline the importance of improving for-in we can rely on the data from a tool we built back in 2016 that allows us to extract V8 measurements over a set of websites. The following table shows the relative time spent in V8 C++ entry points (runtime functions and builtins) for Chrome 49 over a set of roughly 25 representative real-world websites.

Position  Name                                   Total Time
1         CreateObjectLiteral                    1.10%
2         NewObject                              0.90%
3         KeyedGetProperty                       0.70%
4         GetProperty                            0.60%
5         ForInEnumerate                         0.60%
6         SetProperty                            0.50%
7         StringReplaceGlobalRegExpWithString    0.30%
8         HandleApiCallConstruct                 0.30%
9         RegExpExec                             0.30%
10        ObjectProtoToString                    0.30%
11        ArrayPush                              0.20%
12        NewClosure                             0.20%
13        NewClosure_Tenured                     0.20%
14        ObjectDefineProperty                   0.20%
15        HasProperty                            0.20%
16        StringSplit                            0.20%
17        ForInFilter                            0.10%

The most important for-in helpers are at positions 5 and 17, accounting for an average of 0.7% of the total time spent in scripting on a website. In Chrome 57, ForInEnumerate has dropped to 0.2% of the total time and ForInFilter is below the measuring threshold due to a fast path written in assembler.

Posted by Camillo Bruni, @camillobruni

Friday, February 17, 2017

High-performance ES2015 and beyond

Over the last couple of months the V8 team focused on bringing the performance of newly added ES2015 and other even more recent JavaScript features on par with their transpiled ES5 counterparts.

Motivation

Before we go into the details of the various improvements, we should first consider why the performance of ES2015+ features matters despite the widespread usage of Babel in modern web development:
  1. First of all there are new ES2015 features that are only polyfilled on demand, for example the Object.assign builtin. When Babel transpiles object spread properties (which are heavily used by many React and Redux applications), it relies on Object.assign instead of an ES5 equivalent if the VM supports it.
  2. Polyfilling ES2015 features typically increases code size, which contributes significantly to the current web performance crisis, especially on mobile devices common in emerging markets. So the cost of just delivering, parsing and compiling the code can be fairly high, even before you get to the actual execution cost.
  3. And last but not least, client-side JavaScript is only one of the environments that rely on the V8 engine. There’s also Node.js for server-side applications and tools, where developers don’t need to transpile to ES5 code, but can directly use the features supported by the relevant V8 version in the target Node.js release.
Let’s consider the following code snippet from the Redux documentation:


function todoApp(state = initialState, action) {
  switch (action.type) {
    case SET_VISIBILITY_FILTER:
      return { ...state, visibilityFilter: action.filter }
    default:
      return state
  }
}

There are two things in that code that demand transpilation: the default parameter for state and the spreading of state into the object literal. Babel generates the following ES5 code:


"use strict";

var _extends = Object.assign || function (target) { for (var i = 1; i < arguments.length; i++) { var source = arguments[i]; for (var key in source) { if (Object.prototype.hasOwnProperty.call(source, key)) { target[key] = source[key]; } } } return target; };

function todoApp() {
  var state = arguments.length > 0 && arguments[0] !== undefined ? arguments[0] : initialState;
  var action = arguments[1];

  switch (action.type) {
    case SET_VISIBILITY_FILTER:
      return _extends({}, state, { visibilityFilter: action.filter });
    default:
      return state;
  }
}

Now imagine that Object.assign is orders of magnitude slower than the polyfilled _extends generated by Babel. In that case upgrading from a browser that doesn’t support Object.assign to an ES2015 capable version of the browser would be a serious performance regression and probably hinder adoption of ES2015 in the wild.

This example also highlights another important drawback of transpilation: The generated code that is shipped to the user is usually considerably bigger than the ES2015+ code that the developer initially wrote. In the example above, the original code is 203 characters (176 bytes gzipped) whereas the generated code is 588 characters (367 bytes gzipped). That’s already a factor of two increase in size. Let’s look at another example from the Async Iterators for JavaScript proposal:


async function* readLines(path) {
  let file = await fileOpen(path);

  try {
    while (!file.EOF) {
      yield await file.readLine();
    }
  } finally {
    await file.close();
  }
}

Babel translates these 187 characters (150 bytes gzipped) into a whopping 2987 characters (971 bytes gzipped) of ES5 code, not even counting the regenerator runtime that is required as an additional dependency:


"use strict";

var _asyncGenerator = function () { function AwaitValue(value) { this.value = value; } function AsyncGenerator(gen) { var front, back; function send(key, arg) { return new Promise(function (resolve, reject) { var request = { key: key, arg: arg, resolve: resolve, reject: reject, next: null }; if (back) { back = back.next = request; } else { front = back = request; resume(key, arg); } }); } function resume(key, arg) { try { var result = gen[key](arg); var value = result.value; if (value instanceof AwaitValue) { Promise.resolve(value.value).then(function (arg) { resume("next", arg); }, function (arg) { resume("throw", arg); }); } else { settle(result.done ? "return" : "normal", result.value); } } catch (err) { settle("throw", err); } } function settle(type, value) { switch (type) { case "return": front.resolve({ value: value, done: true }); break; case "throw": front.reject(value); break; default: front.resolve({ value: value, done: false }); break; } front = front.next; if (front) { resume(front.key, front.arg); } else { back = null; } } this._invoke = send; if (typeof gen.return !== "function") { this.return = undefined; } } if (typeof Symbol === "function" && Symbol.asyncIterator) { AsyncGenerator.prototype[Symbol.asyncIterator] = function () { return this; }; } AsyncGenerator.prototype.next = function (arg) { return this._invoke("next", arg); }; AsyncGenerator.prototype.throw = function (arg) { return this._invoke("throw", arg); }; AsyncGenerator.prototype.return = function (arg) { return this._invoke("return", arg); }; return { wrap: function wrap(fn) { return function () { return new AsyncGenerator(fn.apply(this, arguments)); }; }, await: function await(value) { return new AwaitValue(value); } }; }();

var readLines = function () {
  var _ref = _asyncGenerator.wrap(regeneratorRuntime.mark(function _callee(path) {
    var file;
    return regeneratorRuntime.wrap(function _callee$(_context) {
      while (1) {
        switch (_context.prev = _context.next) {
          case 0:
            _context.next = 2;
            return _asyncGenerator.await(fileOpen(path));

          case 2:
            file = _context.sent;
            _context.prev = 3;

          case 4:
            if (file.EOF) {
              _context.next = 11;
              break;
            }

            _context.next = 7;
            return _asyncGenerator.await(file.readLine());

          case 7:
            _context.next = 9;
            return _context.sent;

          case 9:
            _context.next = 4;
            break;

          case 11:
            _context.prev = 11;
            _context.next = 14;
            return _asyncGenerator.await(file.close());

          case 14:
            return _context.finish(11);

          case 15:
          case "end":
            return _context.stop();
        }
      }
    }, _callee, this, [[3,, 11, 15]]);
  }));

  return function readLines(_x) {
    return _ref.apply(this, arguments);
  };
}();

This is a 650% increase in size (the generic _asyncGenerator function might be shareable depending on how you bundle your code, so you can amortize some of that cost across multiple uses of async iterators). We don’t think it’s viable to ship only code transpiled to ES5 long-term, as the increase in size will not only affect download time/cost, but will also add additional overhead to parsing and compilation. If we really want to drastically improve page load and snappiness of modern web applications, especially on mobile devices, we have to encourage developers to not only use ES2015+ when writing code, but also to ship that instead of transpiling to ES5. Only deliver fully transpiled bundles to legacy browsers that don’t support ES2015. For VM implementors, this vision means we need to support ES2015+ features natively and provide reasonable performance.

Measurement methodology

As described above, absolute performance of ES2015+ features is not really an issue at this point. Instead, the highest priority currently is to ensure that the performance of ES2015+ features is on par with their naive ES5 counterparts and, even more importantly, with the version generated by Babel. Conveniently, there was already a project called six-speed by Kevin Decker that accomplishes more or less exactly what we needed: a performance comparison of ES2015 features vs. naive ES5 vs. code generated by transpilers.

Six-Speed benchmark


So we decided to take that as the basis for our initial ES2015+ performance work. We forked it and added a couple of benchmarks. We focused on the most serious regressions first, i.e. line items where the slowdown from the naive ES5 version to the recommended ES2015+ version was above 2x, because our fundamental assumption is that the naive ES5 version will be at least as fast as the somewhat spec-compliant version that Babel generates.

A modern architecture for a modern language

In the past, V8 had difficulties optimizing the kinds of language features that are found in ES2015+. For example, it never became feasible to add exception handling (i.e. try/catch/finally) support to Crankshaft, V8’s classic optimizing compiler. This meant V8’s ability to optimize an ES6 feature like for...of, which essentially has an implicit finally clause, was limited. Crankshaft’s limitations and the overall complexity of adding new language features to Full-codegen, V8’s baseline compiler, made it inherently difficult to ensure new ES features were added and optimized in V8 as quickly as they were standardized.

Fortunately, Ignition and TurboFan (V8’s new interpreter and compiler pipeline) were designed to support the entire JavaScript language from the beginning, including advanced control flow, exception handling, and most recently for...of and destructuring from ES2015. The tight integration of the architectures of Ignition and TurboFan makes it possible to add new features quickly and to optimize them fast and incrementally.

Many of the improvements we achieved for modern language features were only feasible with the new Ignition/TurboFan pipeline. Ignition and TurboFan proved especially critical to optimizing generators and async functions. Generators had long been supported by V8, but were not optimizable due to control flow limitations in Crankshaft. Async functions are essentially sugar on top of generators, so they fall into the same category. The new compiler pipeline leverages Ignition to make sense of the AST and generate bytecodes which de-sugar complex generator control flow into simpler local control-flow bytecodes. TurboFan can more easily optimize the resulting bytecodes since it doesn’t need to know anything specific about generator control flow, just how to save and restore a function’s state on yields.

How JavaScript generators are represented in Ignition and TurboFan

State of the union

Our short-term goal was to reach less than 2x slowdown on average as soon as possible. We started by looking at the worst test first, and from Chrome M54 to Chrome M58 (Canary) we managed to reduce the number of tests with slowdown above 2x from 16 to 8, and at the same time reduce the worst slowdown from 19x in M54 to just 6x in M58 (Canary). We also significantly reduced the average and median slowdown during that period:


You can see a clear trend towards parity of ES2015+ and ES5. On average we improved performance relative to ES5 by over 47%. Here are some highlights that we addressed since M54.


Most notably we improved performance of new language constructs that are based on iteration, like the spread operator, destructuring and for...of loops. For example, using array destructuring


function fn() {
  var [c] = data;
  return c;
}

is now as fast as the naive ES5 version


function fn() {
  var c = data[0];
  return c;
}

and a lot faster (and shorter) than the Babel generated code:


"use strict";

var _slicedToArray = function () { function sliceIterator(arr, i) { var _arr = []; var _n = true; var _d = false; var _e = undefined; try { for (var _i = arr[Symbol.iterator](), _s; !(_n = (_s = _i.next()).done); _n = true) { _arr.push(_s.value); if (i && _arr.length === i) break; } } catch (err) { _d = true; _e = err; } finally { try { if (!_n && _i["return"]) _i["return"](); } finally { if (_d) throw _e; } } return _arr; } return function (arr, i) { if (Array.isArray(arr)) { return arr; } else if (Symbol.iterator in Object(arr)) { return sliceIterator(arr, i); } else { throw new TypeError("Invalid attempt to destructure non-iterable instance"); } }; }();

function fn() {
  var _data = data,
      _data2 = _slicedToArray(_data, 1),
      c = _data2[0];

  return c;
}

You can check out the High-Speed ES2015 talk we gave at the last Munich NodeJS User Group meetup for additional details:



We are committed to continue improving the performance of ES2015+ features. In case you are interested in the nitty-gritty details please have a look at V8's ES2015 and beyond performance plan.

Posted by Benedikt Meurer @bmeurer, EcmaScript Performance Engineer