A few days ago a colleague of mine introduced me to the Advent of Code, which is a coding challenge with 25 programming tasks, one for each day of December leading up to Christmas. While talking about it, I had the idea of implementing the challenges in the visual scripting language I introduced in a previous post. This undertaking would serve two goals:

  1. Forcing me to implement features I have postponed for far too long (looking at you, generic types!).
  2. Testing how practical it is for general programming tasks.

I will definitely not be able to solve the challenges on the day they come out. I probably will not even start them in December, but it will nevertheless be a fun challenge, and we will see how far I get before I give up.

One of the things I need to do before even starting the challenges is to implement an array or list type. I have put it off for far too long because one of the drawbacks of scripting with static types is the need for generic types. However, before I even start on that, I want to address a problem I noticed a while ago when I wrote a benchmark comparing my language to Lua: performance! The benchmark measures three different aspects:

  1. Parsing speed
  2. Execution speed
  3. Function call overhead

Parsing speed and execution speed should be self-explanatory. Function call overhead here means how much overhead is added when a function is called through the virtual machine (VM) of the language instead of being called directly in C++. For example, in both languages you are able to bind C++ functions to the VM. In Lua with a call to

lua_pushcfunction(L, &SomeFunction);
lua_setglobal(L, "SomeFunction");

or in my language via

some_module->RegisterFunction<&SomeFunction>("SomeFunction")

In both cases, you are also able to call arbitrary script functions from C/C++ like

lua_getglobal(L, "SomeFunction");
lua_pcall(L, 0, 0, 0);

or

some_module->GetFunction("SomeFunction")->Call();

When calling functions with parameters or return values, the VM is forced to store the values in a common format internally when passing them to or from the function. For example, when calling a function in the VM of ovis (the name of my engine), all parameters are stored in the type-erased ovis::vm::Value class and then pushed onto the stack. The called function then takes the values from the stack and converts them back to their C++ types. This involves a lot of type checking, possibly conversions to base classes, etc. Thus, it will never be as fast as calling the C++ function directly. The first benchmark checks how much overhead is introduced in the Lua and ovis VMs. In the case of Lua, I use sol2 as a wrapper to simplify binding C++ functions.
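To illustrate where that overhead comes from, here is a deliberately simplified sketch of what a type-erased call roughly looks like. The names `Value`, `Stack`, and `CallAddThroughStack` are hypothetical stand-ins for this illustration, not the actual ovis::vm API:

```cpp
#include <cassert>
#include <stdexcept>
#include <variant>
#include <vector>

// Hypothetical stand-ins for the VM's type-erased value and stack.
using Value = std::variant<double, bool>;
using Stack = std::vector<Value>;

double Add(double lhs, double rhs) { return lhs + rhs; }

// Calling through the "VM": every argument is boxed into a Value and pushed
// onto the stack, then checked and unboxed again before the real call.
double CallAddThroughStack(Stack* stack, double lhs, double rhs) {
  stack->push_back(lhs);  // boxing: double -> Value
  stack->push_back(rhs);
  // The callee has to verify the types before converting them back.
  if (!std::holds_alternative<double>((*stack)[0]) ||
      !std::holds_alternative<double>((*stack)[1])) {
    throw std::runtime_error("type mismatch");
  }
  const double result = Add(std::get<double>((*stack)[0]),
                            std::get<double>((*stack)[1]));
  stack->clear();
  return result;
}
```

All of this boxing, checking, and unboxing happens on every call, which is also why the overhead grows with the number of arguments.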

For testing, I use the following C++ function, which calculates the product of an arbitrary number of values passed as parameters.

template <typename... T>
double Product(T... values) {
  static_assert((... && std::is_same_v<T, double>));
  return (values * ... * 1.0);
}

Then I added the following benchmarks using the Google Benchmark framework:

template <int... Values>
void BM_FunctionCallCPP(benchmark::State& state) {
  for (auto _ : state) {
    // ProductNoInline is presumably a non-inlined variant of Product, so the
    // compiler cannot optimize the call away.
    benchmark::DoNotOptimize(ProductNoInline(static_cast<double>(Values)...));
  }
}

template <int... Values>
static void BM_FunctionCallLua(benchmark::State& state) {
  sol::state lua;
  lua["Product"] = &Product<decltype(static_cast<double>(Values))...>;
  auto product = lua["Product"];

  for (auto _ : state) {
    benchmark::DoNotOptimize(product(static_cast<double>(Values)...));
  }
}

template <int... Values>
static void BM_FunctionCallVM(benchmark::State& state) {
  ovis::LoadCoreModule();
  auto module = ovis::vm::Module::Register("Benchmark");
  const std::string function_name = fmt::format("Product{}", sizeof...(Values));
  std::vector<std::string_view> parameter_names(sizeof...(Values), "");
  auto product = module->RegisterFunction<&Product<decltype(static_cast<double>(Values))...>>(function_name, parameter_names, { "result" });

  for (auto _ : state) {
    benchmark::DoNotOptimize(product->Call<double>(static_cast<double>(Values)...));
  }
}

The template parameters make the binding calls look way uglier than they normally are, but they allowed me to write the test once and reuse it for an arbitrary number of arguments. For example, I can test how much overhead there is for 0, 1, or 100 arguments. The table below shows the results for 0 to 10 arguments (all numbers are in nanoseconds).

| Arguments | C++  | Lua | Ovis VM |
|-----------|------|-----|---------|
| 0         | 1.15 | 119 | 222     |
| 1         | 1.14 | 121 | 452     |
| 2         | 1.38 | 128 | 673     |
| 3         | 1.15 | 134 | 904     |
| 4         | 1.15 | 133 | 1171    |
| 5         | 1.16 | 139 | 1468    |
| 6         | 1.61 | 147 | 1727    |
| 7         | 1.39 | 159 | 2012    |
| 8         | 1.62 | 152 | 2352    |
| 9         | 1.41 | 156 | 2712    |
| 10        | 1.71 | 155 | 3099    |

Neither Lua nor ovis comes close to a raw function call in C++. However, ovis scales especially badly with the number of arguments. It is still somewhat close when calling the function with no arguments at all, but with 10 arguments it takes nearly 20 times longer. And it does not look much better for parsing or executing a small factorial function (numbers again in nanoseconds):

| Benchmark | C++  | Lua  | Ovis VM |
|-----------|------|------|---------|
| Parsing   | -    | 3286 | 9787    |
| 1!        | 0.23 | 131  | 1421    |
| 8!        | 2.32 | 205  | 13981   |
| 18!       | 7.21 | 299  | 31965   |

Nearly 32 microseconds to calculate the factorial of 18?! That is way too much; I need to optimize this! This post covers the improvements made to the type system to decrease the function call overhead. The optimizations for executing scripts will be covered in a follow-up post. Parsing could obviously be much faster as well, but it is not really important right now, so I will not work on it in the near future.

In order to improve the function call overhead, I profiled the corresponding benchmark with the perf tool. If you want to learn about the tool and how to use it, I can recommend the talk by Chandler Carruth. After looking at the profiles, I realized that the type system was the major bottleneck in the current implementation.

The biggest issue was the usage of a custom smart pointer I wrote: the safe_ptr<T>. I may write another post about its design rationale in the future (or maybe not, I am not sure the idea was that good), but, in short, it is a pointer that automatically nulls itself when the object it points to gets destroyed. In the VM of my engine, I have a Type class that stores all information about a registered type, e.g., its name, parent type, properties, etc. A safe_ptr<Type> was used to store references to a type; for example, each value on the stack contained a type pointer. These pointers were thus created and destroyed extremely frequently, which is not a strong point of the implementation (although I have ideas on how to fix that).

At this point, you might ask: “Why is it necessary to have a smart pointer to Type? You wouldn’t ever dare to deregister a type at runtime, would you?” And yes, this may be true for the types that come from the engine, like Number, SceneObject, and so on. However, the user is also able to create custom types in the editor, and they will most likely remove a previously created type at some point. If they have also created a function using that type, there will be a dangling pointer. One solution would be to keep the type around and add a deleted flag. However, one would have to actively check that flag, which is easy to forget and would lead to some cumbersome bugs; hence, a smart pointer.
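For context, here is a minimal sketch of the self-nulling pointer idea. This is not the actual safe_ptr implementation, just the core mechanism under the assumption that the pointee keeps a list of its observers; copying/moving and other details are omitted for brevity:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

class SafePtrBase {
 public:
  virtual void Reset() = 0;  // called by the pointee when it is destroyed
 protected:
  ~SafePtrBase() = default;
};

// Pointees derive from Observable; it nulls all registered pointers in its
// destructor, so no safe_ptr can ever dangle.
class Observable {
 public:
  ~Observable() {
    for (SafePtrBase* ptr : observers_) ptr->Reset();
  }
  void AddObserver(SafePtrBase* ptr) { observers_.push_back(ptr); }
  void RemoveObserver(SafePtrBase* ptr) {
    observers_.erase(std::find(observers_.begin(), observers_.end(), ptr));
  }
 private:
  std::vector<SafePtrBase*> observers_;
};

template <typename T>
class safe_ptr final : public SafePtrBase {
 public:
  explicit safe_ptr(T* object) : object_(object) {
    if (object_) object_->AddObserver(this);
  }
  ~safe_ptr() {
    if (object_) object_->RemoveObserver(this);
  }
  safe_ptr(const safe_ptr&) = delete;  // copying omitted for brevity
  safe_ptr& operator=(const safe_ptr&) = delete;

  void Reset() override { object_ = nullptr; }
  T* get() const { return object_; }
  explicit operator bool() const { return object_ != nullptr; }

 private:
  T* object_;
};

// Hypothetical pointee for demonstration purposes.
struct Type : Observable {
  const char* name = "Number";
};

// Returns true if the pointer nulls itself when the Type is destroyed.
inline bool SelfNullingWorks() {
  auto* type = new Type();
  safe_ptr<Type> ptr(type);
  const bool was_valid = static_cast<bool>(ptr);
  delete type;  // ~Observable resets ptr
  return was_valid && !ptr;
}
```

The observer bookkeeping on every construction and destruction is exactly the kind of work that hurts when such pointers are created and destroyed for every stack value.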

So, first of all, I tediously exchanged the safe_ptr with the weak_ptr from the standard library, which already helped quite a bit. However, it was still not as efficient as the Lua/sol2 version (I am very sorry, I do not have any actual numbers for the intermediate steps, only for the overall improvement at the end, so be patient!).
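The weak_ptr approach can be sketched as follows, assuming a hypothetical registry that owns the types via shared_ptr while all other references are weak_ptrs that expire automatically on deregistration (the names here are made up for illustration):

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Type {
  std::string name;
};

// Hypothetical registry owning the registered types.
std::vector<std::shared_ptr<Type>> type_registry;

std::weak_ptr<Type> RegisterType(std::string name) {
  type_registry.push_back(std::make_shared<Type>(Type{std::move(name)}));
  return type_registry.back();
}

void DeregisterAllTypes() { type_registry.clear(); }
```

Note that every copy of a weak_ptr still has to touch the atomic reference count in its control block, which presumably explains why this was still noticeably slower than the plain ID described next.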

I wanted something much more lightweight than a smart pointer here, as most of the time I just need to check whether two types are equal. What I need is some kind of ID for each type! However, it should be possible to retrieve a pointer to the Type class from that ID extremely quickly. Optimally, it would just be an index into an array. The problem with using a plain index is that you are never allowed to fill that exact spot in the array again after a type is removed. One solution to this problem is what I call a versioned index (it most likely has another name, but I was too lazy to duckduckgo it before writing the corresponding class). The concept is extremely simple but very useful: the versioned index is an unsigned integer that represents an index, except that some bits are used to store a version. You can use it to refer to items in a vector, and each time an item is replaced, you increase the version number by one. Thus, it can serve as an ID that uniquely refers to an item in an array or array-like structure. I already use it to manage my graphics resources, and I created a small type for it:

#include <cassert>
#include <cstddef>
#include <limits>
#include <type_traits>

template <typename T>
concept IndexType = std::is_unsigned_v<T> && !std::is_same_v<T, bool>;

template <IndexType T, std::size_t VERSION_BITS>
struct VersionedIndex {
  static constexpr int BIT_COUNT = std::numeric_limits<T>::digits;
  static constexpr int INDEX_BITS = BIT_COUNT - VERSION_BITS;
  // Shift a value of type T, not a plain int, so wide masks do not overflow.
  static constexpr T INDEX_MASK = (static_cast<T>(1) << INDEX_BITS) - 1;
  static constexpr T VERSION_MASK = ((static_cast<T>(1) << VERSION_BITS) - 1) << INDEX_BITS;

  static_assert((INDEX_MASK & VERSION_MASK) == 0, "Index and version mask are overlapping");
  static_assert((INDEX_MASK | VERSION_MASK) == std::numeric_limits<T>::max(),
                "Index and version mask should fill the whole value");

  static constexpr T NextVersion(T current_version) {
    return (current_version + 1) & ((static_cast<T>(1) << VERSION_BITS) - 1);
  }

  VersionedIndex() = default;
  constexpr VersionedIndex(T index) : value(index) {
    assert(index <= INDEX_MASK);
  }
  // The version occupies the high bits, so it is shifted past the index bits.
  constexpr VersionedIndex(T index, T version) : value((version << INDEX_BITS) | index) {
    assert(index <= INDEX_MASK);
    assert(version < (static_cast<T>(1) << VERSION_BITS));
  }

  constexpr T index() const { return value & INDEX_MASK; }
  constexpr T version() const { return (value & VERSION_MASK) >> INDEX_BITS; }
  constexpr VersionedIndex<T, VERSION_BITS> next() const {
    return VersionedIndex<T, VERSION_BITS>(index(), NextVersion(version()));
  }

  T value;
};

template <IndexType T, std::size_t VERSION_BITS>
inline bool operator==(VersionedIndex<T, VERSION_BITS> lhs, VersionedIndex<T, VERSION_BITS> rhs) {
  return lhs.value == rhs.value;
}

template <IndexType T, std::size_t VERSION_BITS>
inline bool operator!=(VersionedIndex<T, VERSION_BITS> lhs, VersionedIndex<T, VERSION_BITS> rhs) {
  return lhs.value != rhs.value;
}

The user can define the underlying type as well as how many bits should be used for versioning. How many version bits you should use depends on how often items are replaced and how critical ID collisions are, as the system is not perfect. Choosing two version bits, for example, means that items will receive the same ID again after they have been replaced four times. If it is critical that two IDs never collide, you should increase the number of version bits and maybe also stop reusing a slot once its last version has been used.
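To make the wrap-around concrete, here is a tiny standalone sketch (independent of the class above, with made-up names) of a hypothetical 8-bit ID with two version bits:

```cpp
#include <cassert>
#include <cstdint>

// Toy layout: the low 6 bits are the index, the high 2 bits the version.
constexpr std::uint8_t kIndexBits = 6;

constexpr std::uint8_t MakeId(std::uint8_t index, std::uint8_t version) {
  return static_cast<std::uint8_t>((version << kIndexBits) | index);
}

constexpr std::uint8_t NextVersion(std::uint8_t version) {
  return (version + 1) & 0b11;  // two version bits: wraps back to 0 after 4
}
```

After four replacements the version wraps around, so the fifth ID handed out for a slot collides with the first one.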

For my type ID, I chose a versioned index with 32 bits, of which 20 are used for versioning. This means I can have 4096 (2^12) different types and can reuse each slot 1,048,576 (2^20) times before I get an ID collision. The number of version bits is probably excessive; 16 bits would likely be more than enough, but I can change that at any time. I would probably also get away with a 16-bit ID, but for now, the space is not that critical.

So, in the end, I replaced the smart pointers with my ID system, added some other minor optimizations, and re-ran the benchmark measuring the function call overhead. The following table shows the numbers from before alongside the new numbers.

| Arguments | C++ | Lua | Ovis VM (old) | Ovis VM (new) | Speedup |
|-----------|-----|-----|---------------|---------------|---------|
| 0         | 1.2 | 119 | 222           | 17            | 13.29x  |
| 1         | 1.1 | 121 | 452           | 26            | 17.73x  |
| 2         | 1.4 | 128 | 673           | 37            | 18.14x  |
| 3         | 1.2 | 134 | 904           | 48            | 18.79x  |
| 4         | 1.2 | 133 | 1171          | 63            | 18.74x  |
| 5         | 1.2 | 139 | 1468          | 69            | 21.43x  |
| 6         | 1.6 | 147 | 1727          | 80            | 21.53x  |
| 7         | 1.4 | 159 | 2012          | 95            | 21.18x  |
| 8         | 1.6 | 152 | 2352          | 104           | 22.62x  |
| 9         | 1.4 | 156 | 2712          | 116           | 23.38x  |
| 10        | 1.7 | 155 | 3099          | 130           | 23.84x  |

The new numbers look way better than before, yielding a speedup of over 23x for 10 double arguments. However, the scaling is still not great: the baseline with zero arguments is now much better than the Lua version, but the time each additional argument adds is still worse. This is most likely a result of the currently very inefficient implementation of the Value class. That, however, is a possible topic for Virtual Machine Optimizations Part III.

Even though it was not the primary goal, the optimizations also had an effect on the parsing and execution speed, as can be seen in the following table:

| Benchmark | C++   | Lua  | Ovis VM (old) | Ovis VM (new) | Speedup |
|-----------|-------|------|---------------|---------------|---------|
| Parsing   | -     | 3286 | 9787          | 7315          | 1.34x   |
| 1!        | 0.228 | 131  | 1421          | 163           | 8.72x   |
| 8!        | 2.32  | 205  | 13981         | 1665          | 8.40x   |
| 18!       | 7.21  | 299  | 31965         | 3760          | 8.50x   |

As mentioned, the numbers got better, but they are still far from acceptable. The next part of this “series” will focus on improving the execution performance, so stay tuned!