Speedups of the interpreter 2019/1

So, since I did some benchmarks (Simple while benchmark 2019.01.06), I had high hopes for a speedup from RPython's JIT.

As it turned out, it didn't work at all - the code actually ran many times slower (62 seconds instead of 17).

Several attempts at tweaking the code revealed where the problem was - I had to specify almost everything as a "red" variable, even things like the program counter, which I didn't expect at all.
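For context, an RPython JIT driver is declared roughly like this (a generic sketch, not tinySelf's actual driver; dispatch() is a made-up stand-in for the bytecode dispatch). Green variables identify a position in the interpreted program, red variables are everything that changes during execution - and in my case almost everything, including the program counter, ended up red:

from rpython.rlib import jit

jitdriver = jit.JitDriver(
    greens=["pc", "code"],         # identifies a position in the user program
    reds=["frame", "interpreter"]  # everything that changes during execution
)

def interpret(code, frame, interpreter):
    pc = 0
    while pc < len(code):
        # the JIT traces loops that repeatedly pass through this point
        jitdriver.jit_merge_point(pc=pc, code=code,
                                  frame=frame, interpreter=interpreter)
        pc = dispatch(code, pc, frame, interpreter)  # execute one bytecode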

With those improvements, the code ran 20% faster than without the JIT. Obviously, this wasn't acceptable, so I had to investigate.

Numeric ints

So far, I have been using boxed integer values represented like this:

def add(_, self, parameters):
    obj = parameters[0]
    # yeah, this can't be factored out, I've tried..
    assert isinstance(self, PrimitiveIntObject)
    assert isinstance(obj, _NumberObject)

    return obj.result_type(self.value + obj.value)

# ... other primitive functions ...

class PrimitiveIntObject(_NumberObject):
    def __init__(self, value, obj_map=None):
        _NumberObject.__init__(self, obj_map)

        self.value = value

        add_primitive_fn(self, "+", add, ["obj"])
        add_primitive_fn(self, "-", substract, ["obj"])
        add_primitive_fn(self, "*", multiply, ["obj"])
        add_primitive_fn(self, "/", divide, ["obj"])
        add_primitive_fn(self, "%", modulo, ["obj"])
        add_primitive_fn(self, "<", lt, ["obj"])
        add_primitive_fn(self, "<=", lte, ["obj"])
        add_primitive_fn(self, ">", gt, ["obj"])
        add_primitive_fn(self, ">=", gte, ["obj"])
        add_primitive_fn(self, "==", compare, ["obj"])
        add_primitive_fn(self, "asString", as_string, [])
        add_primitive_fn(self, "asFloat", as_float, [])

    @property
    def float_value(self):
        return float(self.value)

    def result_type(self, val):
        return PrimitiveIntObject(val)

    def __eq__(self, obj):
        if not hasattr(obj, "value"):
            return False

        return self.value == obj.value

    def __str__(self):
        return str(int(self.value))

This may look strange, but you have to understand that the functionality is not used from the class and its surrounding code itself (that is, not from RPython), but from inside tinySelf. add_primitive_fn() adds, into the slots of the tinySelf-level representation of PrimitiveIntObject, code objects which wrap the given "primitive" functions written in RPython.

As you can see, every integer value is boxed inside a class, to which functions manipulating the internal .value property are then bound. This is obviously highly inefficient.

RPython should translate this into C and that should in turn optimize the code, but is there a better way to tell RPython "this is an int, so treat it as an int"?

I thought the answer was UnboxedValue, but it turned out that I didn't remember that part of the documentation correctly. Ouch.

At least I've added the _immutable_fields_ = ["value"] declaration to all constant primitives, but it turned out that this didn't improve the performance either. Ouch^2.

Profiling logs

Instead of guessing, I decided to profile the code. I had only profiled code with a profiler once before, using valgrind, so I just googled around and found Linux Profiling tools and techniques, which gave me two commands to run:

valgrind --tool=callgrind ./tSelf tests/scripts/simple_while_benchmark.self

and then

kcachegrind callgrind.out.*

to visualize the output:

A few moments of clicking around clearly showed that an unholy amount of time was wasted copying the dicts in add_primitive_fn(). I quickly realized that each new int, float and string is created as a new object, together with a new map for that object.

New map for each int

As you may know, Self is a prototype-based language. That means it doesn't use classes, so by default the functionality lives in each object. Code reuse is achieved using parent slots.

In class-based languages, the internal object representation holds just the values and a pointer to the shared methods / description of the object's structure in the class. Self uses a similar trick with maps. When an object is created, it holds only data. Along with the object, its map is created, which holds the metadata - information about slots, their names and so on.

On the internal level, each time an object is cloned, it shares its map with the original object until some kind of structural change is made.
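To illustrate the idea, here is a minimal sketch of the object/map split with made-up names (not the actual tinySelf classes); the map holds the structure, the object holds only the values:

class Map(object):
    def __init__(self, slot_names):
        self.slot_names = slot_names  # {"slot name": index into obj.slots}

    def clone(self):
        return Map(self.slot_names.copy())


class Obj(object):
    def __init__(self, map, slots):
        self.map = map      # shared metadata (slot layout)
        self.slots = slots  # per-object data

    def clone(self):
        # the clone shares the map with the original; only the data is copied
        return Obj(self.map, list(self.slots))

    def get_slot(self, name):
        return self.slots[self.map.slot_names[name]]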

A quick look into the code confirmed that ints, floats and strings create new maps again and again:

class PrimitiveIntObject(_NumberObject):
    _immutable_fields_ = ["value"]
    def __init__(self, value, obj_map=None):
        _NumberObject.__init__(self, obj_map)

        ...

As you can see, _NumberObject's __init__ is called with obj_map set to None, which creates a new map. The constructor is called for every number, so in effect it is as if there were a (dynamically created) class for each number. Ouch. And for each string and float. Ouch ouch.

Fortunately, this is easy to fix: just share the map between all ints (and floats and strings). A quick fix and a benchmark for ints proved that this indeed speeds up the execution significantly; from ~18 seconds to 12 with an unoptimized RPython translation without the JIT.

Additional optimizations for strings and floats got the execution time down to around 8 seconds.

The constructor of the primitive int now looks like this:

class PrimitiveIntObject(_NumberObject):
    _OBJ_CACHE = ObjCache()
    _immutable_fields_ = ["value"]
    def __init__(self, value, obj_map=None):
        _NumberObject.__init__(self, PrimitiveIntObject._OBJ_CACHE.map)

        self.value = value

        if PrimitiveIntObject._OBJ_CACHE.map is not None:
            self.slots_references = PrimitiveIntObject._OBJ_CACHE.slots
            return

        add_primitive_fn(self, "+", add, ["obj"])
        add_primitive_fn(self, "-", substract, ["obj"])
        add_primitive_fn(self, "*", multiply, ["obj"])
        add_primitive_fn(self, "/", divide, ["obj"])
        add_primitive_fn(self, "%", modulo, ["obj"])
        add_primitive_fn(self, "<", lt, ["obj"])
        add_primitive_fn(self, "<=", lte, ["obj"])
        add_primitive_fn(self, ">", gt, ["obj"])
        add_primitive_fn(self, ">=", gte, ["obj"])
        add_primitive_fn(self, "==", compare, ["obj"])
        add_primitive_fn(self, "asString", as_string, [])
        add_primitive_fn(self, "asFloat", as_float, [])

        if PrimitiveIntObject._OBJ_CACHE.map is None:
            PrimitiveIntObject._OBJ_CACHE.store(self)

All the magic happens in the _OBJ_CACHE, which stores the .map and .slots properties shared across all instances of PrimitiveIntObject. That makes sense, as only the .value differs between instances.
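The ObjCache itself lives in the tinySelf sources; based on how it is used in the constructor above, it is essentially just a holder for the shared map and slot references. A rough sketch of the idea:

class ObjCache(object):
    def __init__(self):
        self.map = None
        self.slots = None

    def store(self, obj):
        # remember the map and slot references of the first fully built
        # instance, so all later instances can reuse them
        self.map = obj.map
        self.slots = obj.slots_references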

See the changes in 7946fdfecb339139f22947a0cefc871eccafdb1e.

Don't create new maps with each change

I usually work in pure Python and don't think much about the cost of creating a dictionary, because Python is so slow that it's negligible compared to the cost of other operations. This benchmark reminded me that dictionaries are costly to create, so naturally I asked: "where else do I create dictionaries?"

And then I remembered - every time I change the structure of an object, I currently create a copy of the map.

You see, when you have an object and its clone, and you add, remove or shift slots in one of them, they can no longer share the same map, otherwise they would both be changed. I didn't have any checks for this, so every operation like adding a new slot created a new map, even if the map wasn't shared between multiple objects.

It turned out that this also had a massive impact on performance, as it got the simple while benchmark execution time down to 4.2 seconds. And the only change it required was to add a new method and a bool property that are checked when doing structural changes:

def _clone_map_if_used_by_multiple_objects(self):
    if self.map.used_in_multiple_objects:
        self.map = self.map.clone()

as you can see in f5f0efe4559742bf91973184893a01923fff4e3b.
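The flag itself can be maintained on the map: cloning an object marks its map as shared, and any structural change copies the map first if needed. A simplified sketch of that idea (made-up names, not the actual tinySelf code; see the commit above for the real implementation):

class Map(object):
    def __init__(self, slot_names=None):
        self.slot_names = slot_names if slot_names is not None else {}
        self.used_in_multiple_objects = False

    def clone(self):
        return Map(self.slot_names.copy())


class Obj(object):
    def __init__(self, map, slots):
        self.map = map
        self.slots = slots

    def clone(self):
        self.map.used_in_multiple_objects = True  # the map is now shared
        return Obj(self.map, list(self.slots))

    def _clone_map_if_used_by_multiple_objects(self):
        if self.map.used_in_multiple_objects:
            self.map = self.map.clone()

    def meta_add_slot(self, name, value):
        # structural change: copy the map only when it is actually shared
        self._clone_map_if_used_by_multiple_objects()
        self.map.slot_names[name] = len(self.slots)
        self.slots.append(value)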

Clone block traits

Another low-hanging fruit was the block trait - an object which is created and dynamically bound to each block literal pushed on the stack. Simply by caching this object, cloning it and replacing only the slot that differs, I shaved off almost another second, getting down to 3.47 seconds.

Instead of

def add_block_trait(block):
    obj = Object()
    obj.meta_add_slot("value", block)
    obj.meta_add_slot("with:", block)
    obj.meta_add_slot("with:With:", block)
    obj.meta_add_slot("with:With:With:", block)
    obj.meta_add_slot("with:With:With:With:", block)
    obj.meta_add_slot("withAll:", block)
    add_primitive_fn(block, "asString", _print_block_source, [])
    add_primitive_fn(block, "getLineNumber", _get_lineno, [])

    obj.scope_parent = _BLOCK_TRAIT

    return obj

there is now a cached version:

def _create_block_trait_prototype():
    obj = Object()

    placer = PrimitiveNilObject()

    obj.meta_add_slot("value", placer, check_duplicates=True)
    obj.meta_add_slot("with:", placer, check_duplicates=True)
    obj.meta_add_slot("with:With:", placer, check_duplicates=True)
    obj.meta_add_slot("with:With:With:", placer, check_duplicates=True)
    obj.meta_add_slot("with:With:With:With:", placer, check_duplicates=True)
    obj.meta_add_slot("withAll:", placer, check_duplicates=True)

    obj.scope_parent = _USER_EDITABLE_BLOCK_TRAIT

    return obj


_BLOCK_TRAIT_PROTOTYPE = _create_block_trait_prototype()


def add_block_trait(block):
    obj = _BLOCK_TRAIT_PROTOTYPE.clone()
    obj.set_slot("value", block)

    add_primitive_fn(block, "asString", _print_block_source, [])
    add_primitive_fn(block, "getLineNumber", _get_lineno, [])

    return obj

The object still needs to be cloned, but the objects now share the same map and there is no dict creation involved.

See the full change in b41d8e9c2157e1571c3b3f0fcf401be540a35360.

Additional speedups

I find it really amazing how much a few simple tricks and caches in the right places can speed up the benchmark. From 16.26 seconds to 3.47. That's a speedup of more than 4.6×!

My goal is to get it down under one second, ideally around or under 100 milliseconds with the JIT.

Next things on my todolist:
