A few days ago a colleague of mine introduced me to the Advent of Code, which is a coding challenge with 25 programming tasks, one for each day of December leading up to Christmas. While talking about it, I had the idea of implementing the challenges in the visual scripting language I introduced in a previous post. This undertaking would serve two goals:

  1. Forcing me to implement features I have postponed for far too long (looking at you, generic types!).
  2. Testing how practical it is for general programming tasks.

I will definitely not be able to solve the challenges on the day they come out. I probably will not even start them in December, but it will nevertheless be a fun challenge, and we will see how far I get before I give up.

One of the things I need to do before even starting the challenges is to implement an array or list type. I have put it off for far too long because one of the drawbacks of scripting with static types is the need for generic types. However, before I even start on that, I want to address a problem I noticed a while ago when I wrote a benchmark comparing my language to Lua: performance! The benchmark measures three different aspects:

  1. Parsing speed
  2. Execution speed
  3. Function call overhead

Parsing speed and execution speed should be self-explanatory. Function call overhead here means how much overhead is added when a function is called through the virtual machine (VM) of the language instead of being called directly in C++. For example, in both languages you are able to bind C++ functions to the VM. In Lua with a call to

lua_pushcfunction(L, &SomeFunction);
lua_setglobal(L, "SomeFunction");

or in my language via

some_module->RegisterFunction<&SomeFunction>("SomeFunction")

In both cases, you are also able to call arbitrary script functions from C/C++ like

lua_getglobal(L, "SomeFunction");
lua_pcall(L, 0, 0, 0);

or

some_module->GetFunction("SomeFunction")->Call();

When calling functions with parameters or return values, the VM is forced to store the values in a common format internally when passing them to or from the function. For example, when calling a function in the VM of ovis (the name of my engine), all parameters are stored in the type-erased ovis::vm::Value class and then pushed onto the stack. The called function then takes the values from the stack and converts them back to their C++ types. This involves a lot of type checking, possibly conversions to base classes, etc. Thus, it will never be as fast as calling the C++ function directly. The first benchmark checks how much overhead is introduced in the Lua and ovis VMs. In the case of Lua, I use sol2 as a wrapper to simplify binding C++ functions.
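To illustrate where that overhead comes from, here is a deliberately simplified sketch of what a type-erased call roughly looks like. The names `Value`, `Stack`, and `CallAddThroughStack` are hypothetical stand-ins for this illustration, not the actual ovis::vm API:

```cpp
#include <cassert>
#include <stdexcept>
#include <variant>
#include <vector>

// Hypothetical stand-ins for the VM's type-erased value and stack.
using Value = std::variant<double, bool>;
using Stack = std::vector<Value>;

double Add(double lhs, double rhs) { return lhs + rhs; }

// Calling through the "VM": every argument is boxed into a Value and pushed
// onto the stack, then checked and unboxed again before the real call.
double CallAddThroughStack(Stack* stack, double lhs, double rhs) {
  stack->push_back(lhs);  // boxing: double -> Value
  stack->push_back(rhs);
  // The callee has to verify the types before converting them back.
  if (!std::holds_alternative<double>((*stack)[0]) ||
      !std::holds_alternative<double>((*stack)[1])) {
    throw std::runtime_error("type mismatch");
  }
  const double result = Add(std::get<double>((*stack)[0]),
                            std::get<double>((*stack)[1]));
  stack->clear();
  return result;
}
```

All of this boxing, checking, and unboxing happens on every call, which is also why the overhead grows with the number of arguments.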

For testing, I use the following C++ function, which calculates the product of an arbitrary number of values passed as parameters.

template <typename... T>
double Product(T... values) {
  static_assert((... && std::is_same_v<T, double>));
  return (values * ... * 1.0);
}

Then I added the following benchmarks using the Google Benchmark framework:

template <int... Values>
void BM_FunctionCallCPP(benchmark::State& state) {
  for (auto _ : state) {
    // ProductNoInline is presumably a non-inlined variant of Product, so the
    // compiler cannot optimize the call away.
    benchmark::DoNotOptimize(ProductNoInline(static_cast<double>(Values)...));
  }
}

template <int... Values>
static void BM_FunctionCallLua(benchmark::State& state) {
  sol::state lua;
  lua["Product"] = &Product<decltype(static_cast<double>(Values))...>;
  auto product = lua["Product"];

  for (auto _ : state) {
    benchmark::DoNotOptimize(product(static_cast<double>(Values)...));
  }
}

template <int... Values>
static void BM_FunctionCallVM(benchmark::State& state) {
  ovis::LoadCoreModule();
  auto module = ovis::vm::Module::Register("Benchmark");
  const std::string function_name = fmt::format("Product{}", sizeof...(Values));
  std::vector<std::string_view> parameter_names(sizeof...(Values), "");
  auto product = module->RegisterFunction<&Product<decltype(static_cast<double>(Values))...>>(function_name, parameter_names, { "result" });

  for (auto _ : state) {
    benchmark::DoNotOptimize(product->Call<double>(static_cast<double>(Values)...));
  }
}

The template parameters make the binding calls look way uglier than they normally are, but they allowed me to write the test once and reuse it for an arbitrary number of arguments. For example, I can test how much overhead there is for 0, 1, or 100 arguments. The table below shows the results for 0 to 10 arguments (all numbers are in nanoseconds).

| Arguments | C++  | Lua | Ovis VM |
|-----------|------|-----|---------|
| 0         | 1.15 | 119 | 222     |
| 1         | 1.14 | 121 | 452     |
| 2         | 1.38 | 128 | 673     |
| 3         | 1.15 | 134 | 904     |
| 4         | 1.15 | 133 | 1171    |
| 5         | 1.16 | 139 | 1468    |
| 6         | 1.61 | 147 | 1727    |
| 7         | 1.39 | 159 | 2012    |
| 8         | 1.62 | 152 | 2352    |
| 9         | 1.41 | 156 | 2712    |
| 10        | 1.71 | 155 | 3099    |

Neither Lua nor ovis comes close to a raw function call in C++. However, ovis scales especially badly with the number of arguments. It is still somewhat close when calling the function with no arguments at all, but with 10 arguments it takes nearly 20 times longer. And it does not look much better for parsing or executing a small factorial function (numbers again in nanoseconds):

| Benchmark | C++  | Lua  | Ovis VM |
|-----------|------|------|---------|
| Parsing   | -    | 3286 | 9787    |
| 1!        | 0.23 | 131  | 1421    |
| 8!        | 2.32 | 205  | 13981   |
| 18!       | 7.21 | 299  | 31965   |

Nearly 32 microseconds to calculate the factorial of 18?! That is way too much; I need to optimize this! This post covers the improvements made to the type system to decrease the function call overhead. The optimizations for executing scripts will be covered in a follow-up post. Parsing could obviously be much faster as well, but it is not really important right now, so I will not work on it in the near future.

In order to improve the function call overhead, I profiled the corresponding benchmark with the perf tool. If you want to learn about the tool and how to use it, I can recommend the talk by Chandler Carruth. After looking at the profiles, I realized that the type system was the major bottleneck in the current implementation.

The biggest issue was the usage of a custom smart pointer I wrote: the safe_ptr<T>. I may write another post about its design rationale in the future (or maybe not, I am not sure the idea was that good), but, in short, it is a pointer that automatically nulls itself when the object it points to gets destroyed. In the VM of my engine, I have a Type class that stores all information about a registered type, e.g., its name, parent type, properties, etc. A safe_ptr<Type> was used to store references to a type; for example, each value on the stack contained a type pointer. These pointers were thus created and destroyed extremely frequently, which is not a strong point of the implementation (although I have ideas on how to fix that).

At this point, you might ask: “Why is it necessary to have a smart pointer to Type? You wouldn’t ever dare to deregister a type at runtime, would you?” And yes, this may be true for the types that come from the engine, like Number, SceneObject, and so on. However, the user is also able to create custom types in the editor, and they will most likely remove a previously created type at some point. If they have also created a function using that type, there will be a dangling pointer. One solution would be to keep the type around and add a deleted flag. However, one would have to actively check that flag, which is easy to forget and would lead to some cumbersome bugs; hence, a smart pointer.
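For context, here is a minimal sketch of the self-nulling pointer idea. This is not the actual safe_ptr implementation, just the core mechanism under the assumption that the pointee keeps a list of its observers; copying/moving and other details are omitted for brevity:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

class SafePtrBase {
 public:
  virtual void Reset() = 0;  // called by the pointee when it is destroyed
 protected:
  ~SafePtrBase() = default;
};

// Pointees derive from Observable; it nulls all registered pointers in its
// destructor, so no safe_ptr can ever dangle.
class Observable {
 public:
  ~Observable() {
    for (SafePtrBase* ptr : observers_) ptr->Reset();
  }
  void AddObserver(SafePtrBase* ptr) { observers_.push_back(ptr); }
  void RemoveObserver(SafePtrBase* ptr) {
    observers_.erase(std::find(observers_.begin(), observers_.end(), ptr));
  }
 private:
  std::vector<SafePtrBase*> observers_;
};

template <typename T>
class safe_ptr final : public SafePtrBase {
 public:
  explicit safe_ptr(T* object) : object_(object) {
    if (object_) object_->AddObserver(this);
  }
  ~safe_ptr() {
    if (object_) object_->RemoveObserver(this);
  }
  safe_ptr(const safe_ptr&) = delete;  // copying omitted for brevity
  safe_ptr& operator=(const safe_ptr&) = delete;

  void Reset() override { object_ = nullptr; }
  T* get() const { return object_; }
  explicit operator bool() const { return object_ != nullptr; }

 private:
  T* object_;
};

// Hypothetical pointee for demonstration purposes.
struct Type : Observable {
  const char* name = "Number";
};

// Returns true if the pointer nulls itself when the Type is destroyed.
inline bool SelfNullingWorks() {
  auto* type = new Type();
  safe_ptr<Type> ptr(type);
  const bool was_valid = static_cast<bool>(ptr);
  delete type;  // ~Observable resets ptr
  return was_valid && !ptr;
}
```

The observer bookkeeping on every construction and destruction is exactly the kind of work that hurts when such pointers are created and destroyed for every stack value.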

So, first of all, I tediously exchanged the safe_ptr with the weak_ptr from the standard library, which already helped quite a bit. However, it was still not as efficient as the Lua/sol2 version (I am very sorry, I do not have any actual numbers for the intermediate steps, only for the overall improvement at the end, so be patient!).
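The weak_ptr approach can be sketched as follows, assuming a hypothetical registry that owns the types via shared_ptr while all other references are weak_ptrs that expire automatically on deregistration (the names here are made up for illustration):

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Type {
  std::string name;
};

// Hypothetical registry owning the registered types.
std::vector<std::shared_ptr<Type>> type_registry;

std::weak_ptr<Type> RegisterType(std::string name) {
  type_registry.push_back(std::make_shared<Type>(Type{std::move(name)}));
  return type_registry.back();
}

void DeregisterAllTypes() { type_registry.clear(); }
```

Note that every copy of a weak_ptr still has to touch the atomic reference count in its control block, which presumably explains why this was still noticeably slower than the plain ID described next.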

I wanted something much more lightweight than a smart pointer here, as most of the time I just need to check whether two types are equal. What I need is some kind of ID for each type! However, it should be possible to retrieve a pointer to the Type class from that ID extremely quickly. Optimally, it would just be an index into an array. The problem with using a plain index is that you are never allowed to fill that exact spot in the array again after a type is removed. One solution to this problem is what I call a versioned index (it most likely has another name, but I was too lazy to duckduckgo it before writing the corresponding class). The concept is extremely simple but very useful: the versioned index is an unsigned integer that represents an index, except that some bits are used to store a version. You can use it to refer to items in a vector, and each time an item is replaced, you increase the version number by one. Thus, it can serve as an ID that uniquely refers to an item in an array or array-like structure. I already use it to manage my graphics resources, and I created a small type for it:

#include <cassert>
#include <cstddef>
#include <limits>
#include <type_traits>

template <typename T>
concept IndexType = std::is_unsigned_v<T> && !std::is_same_v<T, bool>;

template <IndexType T, std::size_t VERSION_BITS>
struct VersionedIndex {
  static constexpr int BIT_COUNT = std::numeric_limits<T>::digits;
  static constexpr int INDEX_BITS = BIT_COUNT - VERSION_BITS;
  // Shift a value of type T, not a plain int, so wide masks do not overflow.
  static constexpr T INDEX_MASK = (static_cast<T>(1) << INDEX_BITS) - 1;
  static constexpr T VERSION_MASK = ((static_cast<T>(1) << VERSION_BITS) - 1) << INDEX_BITS;

  static_assert((INDEX_MASK & VERSION_MASK) == 0, "Index and version mask are overlapping");
  static_assert((INDEX_MASK | VERSION_MASK) == std::numeric_limits<T>::max(),
                "Index and version mask should fill the whole value");

  static constexpr T NextVersion(T current_version) {
    return (current_version + 1) & ((static_cast<T>(1) << VERSION_BITS) - 1);
  }

  VersionedIndex() = default;
  constexpr VersionedIndex(T index) : value(index) {
    assert(index <= INDEX_MASK);
  }
  // The version occupies the high bits, so it is shifted past the index bits.
  constexpr VersionedIndex(T index, T version) : value((version << INDEX_BITS) | index) {
    assert(index <= INDEX_MASK);
    assert(version < (static_cast<T>(1) << VERSION_BITS));
  }

  constexpr T index() const { return value & INDEX_MASK; }
  constexpr T version() const { return (value & VERSION_MASK) >> INDEX_BITS; }
  constexpr VersionedIndex<T, VERSION_BITS> next() const {
    return VersionedIndex<T, VERSION_BITS>(index(), NextVersion(version()));
  }

  T value;
};

template <IndexType T, std::size_t VERSION_BITS>
inline bool operator==(VersionedIndex<T, VERSION_BITS> lhs, VersionedIndex<T, VERSION_BITS> rhs) {
  return lhs.value == rhs.value;
}

template <IndexType T, std::size_t VERSION_BITS>
inline bool operator!=(VersionedIndex<T, VERSION_BITS> lhs, VersionedIndex<T, VERSION_BITS> rhs) {
  return lhs.value != rhs.value;
}

The user can define the underlying type as well as how many bits should be used for versioning. How many version bits you should use depends on how often items are replaced and how critical ID collisions are, as the system is not perfect. Choosing two version bits, for example, means that items will receive the same ID again after they have been replaced four times. If it is critical that two IDs never collide, you should increase the number of version bits and maybe also stop reusing a slot once its last version has been used.
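To make the wrap-around concrete, here is a tiny standalone sketch (independent of the class above, with made-up names) of a hypothetical 8-bit ID with two version bits:

```cpp
#include <cassert>
#include <cstdint>

// Toy layout: the low 6 bits are the index, the high 2 bits the version.
constexpr std::uint8_t kIndexBits = 6;

constexpr std::uint8_t MakeId(std::uint8_t index, std::uint8_t version) {
  return static_cast<std::uint8_t>((version << kIndexBits) | index);
}

constexpr std::uint8_t NextVersion(std::uint8_t version) {
  return (version + 1) & 0b11;  // two version bits: wraps back to 0 after 4
}
```

After four replacements the version wraps around, so the fifth ID handed out for a slot collides with the first one.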

For my type ID, I chose a versioned index with 32 bits, of which 20 are used for versioning. This means I can have 4096 (2^12) different types and can reuse each slot 1,048,576 (2^20) times before I get an ID collision. The number of version bits is probably excessive; 16 bits would likely be more than enough, but I can change that at any time. I would probably also get away with a 16-bit ID, but for now, the space is not that critical.

So, in the end, I replaced the smart pointers with my ID system, added some other minor optimizations, and re-ran the benchmark measuring the function call overhead. The following table shows the numbers from before alongside the new numbers.

| Arguments | C++ | Lua | Ovis VM (old) | Ovis VM (new) | Speedup |
|-----------|-----|-----|---------------|---------------|---------|
| 0         | 1.2 | 119 | 222           | 17            | 13.29x  |
| 1         | 1.1 | 121 | 452           | 26            | 17.73x  |
| 2         | 1.4 | 128 | 673           | 37            | 18.14x  |
| 3         | 1.2 | 134 | 904           | 48            | 18.79x  |
| 4         | 1.2 | 133 | 1171          | 63            | 18.74x  |
| 5         | 1.2 | 139 | 1468          | 69            | 21.43x  |
| 6         | 1.6 | 147 | 1727          | 80            | 21.53x  |
| 7         | 1.4 | 159 | 2012          | 95            | 21.18x  |
| 8         | 1.6 | 152 | 2352          | 104           | 22.62x  |
| 9         | 1.4 | 156 | 2712          | 116           | 23.38x  |
| 10        | 1.7 | 155 | 3099          | 130           | 23.84x  |

The new numbers look way better than before, yielding a speedup of over 23x for 10 double arguments. However, the scaling is still not great: the baseline with zero arguments is now much better than the Lua version, but the time each additional argument adds is still worse. This is most likely a result of the currently very inefficient implementation of the Value class. That, however, is a possible topic for Virtual Machine Optimizations Part III.

Even though it was not the primary goal, the optimizations also had an effect on the parsing and execution speed, as can be seen in the following table:

| Benchmark | C++   | Lua  | Ovis VM (old) | Ovis VM (new) | Speedup |
|-----------|-------|------|---------------|---------------|---------|
| Parsing   | -     | 3286 | 9787          | 7315          | 1.34x   |
| 1!        | 0.228 | 131  | 1421          | 163           | 8.72x   |
| 8!        | 2.32  | 205  | 13981         | 1665          | 8.40x   |
| 18!       | 7.21  | 299  | 31965         | 3760          | 8.50x   |

As mentioned, the numbers got better, but they are still far from acceptable. The next part of this “series” will focus on improving the execution performance, so stay tuned!