
tinySelf performance gains 2019/4

tinySelf is my pet programming language inspired by Self, which I am writing in my spare time. It was slow, so I've decided to speed it up a little. Last time (Speedups of the interpreter 2019/1) was a massive success in terms of speed gains, so I had high hopes.

First of all, I've decided to run another round of profiling with callgrind. Here is the result from the profiler:

Unsurprisingly, it can be seen at first glance that 73.45% of the time is spent in the _do_send() method:

Slot lookup takes 42%, which is not much of a surprise. Parent lookup takes 34% of that. Still, no surprise there.

A lot of time goes into _create_intermediate_params_obj(). That kind of surprised me, because it was already optimized once. I will have to devise some way of caching the parameters, maybe in the code context or something.

⚠️
I am using several computers for development, from my notebook to my desktop, so don't compare the times in the perf outputs, only the number of crunched instructions.

Intermediate parent lookup cache

This was the first optimization I've begun to implement. It has to do with the mechanism by which parameters are mapped into the namespace of method calls.

Each time the _do_send() method is invoked and the message send results in a code object, the _push_code_obj_for_interpretation() method is called. It in turn runs the tail call optimization check and calls _create_intermediate_params_obj(), which maps all parameters into one parent object that is then mounted as the scope_parent of the code object pushed at the top of the stack.

There is already some caching going on to avoid creating empty parents with parameters, as that massively slows the interpreter down when tail call optimization kicks in, but when the call is not recursive, new objects for parameters are still created. This is usually not an issue, but in while loops, new blocks are initialized and new parameter objects are created in each iteration.

The obvious solution is to cache the parameter object map for each block / code object call and only create a clone.
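
A minimal sketch of that idea, with made-up names (the real code works on tinySelf's Object and its maps, not on this toy stand-in): build the parameters object once per code object and hand out only clones.

class Obj(object):
    """Tiny stand-in for a tinySelf object: just named slots, no maps."""
    def __init__(self, slots=None):
        self.slots = dict(slots or {})

    def clone(self):
        return Obj(self.slots)


class CodeObjectSketch(object):
    """Hypothetical per-code-object state; only the parameter cache is shown."""
    def __init__(self, arg_names):
        self.arg_names = arg_names
        self._params_prototype = None

    def intermediate_params_obj(self, args):
        # create the parameters object (and thus its layout) only once ...
        if self._params_prototype is None:
            self._params_prototype = Obj(dict((name, None) for name in self.arg_names))

        # ... and give every call a cheap clone filled with the actual values
        params = self._params_prototype.clone()
        for i in range(len(self.arg_names)):
            params.slots[self.arg_names[i]] = args[i]
        return params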

So I've implemented it, but measuring the speed with time did not give conclusive results. It looked like nothing improved.

I've realized that I need a better tool than just measuring with time, because there is really a lot of stuff running in the background. I would like to see some more "scientific" results, like for example the number of executed instructions.

A quick Google search showed me that I can use perf stat.

Results for code without the new optimization:

$ sudo perf stat ./tSelf tests/scripts/simple_while_benchmark.self 
4.676559

 Performance counter stats for './tSelf tests/scripts/simple_while_benchmark.self':

       4639,594198      task-clock (msec)         #    0,987 CPUs utilized          
                55      context-switches          #    0,012 K/sec                  
                 0      cpu-migrations            #    0,000 K/sec                  
             9 863      page-faults               #    0,002 M/sec                  
    12 882 386 678      cycles                    #    2,777 GHz                    
    28 716 919 724      instructions              #    2,23  insn per cycle         
     5 525 961 270      branches                  # 1191,044 M/sec                  
        17 527 128      branch-misses             #    0,32% of all branches        

       4,700498870 seconds time elapsed

And for the version with the new optimization:

 Performance counter stats for './otSelf tests/scripts/simple_while_benchmark.self':

       4886,654256      task-clock (msec)         #    0,986 CPUs utilized          
                57      context-switches          #    0,012 K/sec                  
                 3      cpu-migrations            #    0,001 K/sec                  
             8 523      page-faults               #    0,002 M/sec                  
    13 340 962 314      cycles                    #    2,730 GHz                    
    29 316 255 135      instructions              #    2,20  insn per cycle         
     5 641 543 528      branches                  # 1154,480 M/sec                  
        15 815 261      branch-misses             #    0,28% of all branches        

       4,957247412 seconds time elapsed

Time is not that important, as it was different each run, and sometimes one version was faster than the other, just to switch again in the next measurement.

But overall, yeah, it is worse with the new optimization. I find it quite horrible that 1 000 000 increments take 29 316 255 135 instructions (29 316 per loop cycle), even though multiple blocks/lambdas evaluate the result and everything is highly dynamic.

Why is it slower? Some investigation showed that the code still creates new maps even when using the cached value, because only in some cases does it write the * parent into it, and that triggers the creation of a new map for the cloned object, because the cloned version was structurally changed. I am trying to cache an object which sometimes structurally changes, and that is obviously nonsense.

So, nope, wrong idea, time for a rollback. At least I've found and fixed one bug. Current status: 28G7 (that is, 28.7 billion instructions per benchmark).

Block traits

This is an idea of the same kind as the previous optimization, but for blocks. Block trait parents are created dynamically every time a block object is used, so I cache them for each code context, as the block's code context is called many times during each while cycle.

 Performance counter stats for './tSelf tests/scripts/simple_while_benchmark.self':

       4063,963996      task-clock (msec)         #    0,995 CPUs utilized          
                16      context-switches          #    0,004 K/sec                  
                 1      cpu-migrations            #    0,000 K/sec                  
             8 240      page-faults               #    0,002 M/sec                  
    11 404 916 268      cycles                    #    2,806 GHz                    
    27 102 974 881      instructions              #    2,38  insn per cycle         
     5 242 774 320      branches                  # 1290,064 M/sec                  
        15 914 696      branch-misses             #    0,30% of all branches

Such a simple optimization reduced the number of instructions to 27G1! That's 1 613 944 843 fewer instructions per one million while cycles. Nice.

Random optimization

I was giving another shot to the intermediate parent lookup cache optimization, and I've rewritten ._put_together_parameters() into a generator instead of a function returning an array. Quite surprisingly, it was suddenly slightly faster:

 Performance counter stats for './tSelf tests/scripts/simple_while_benchmark.self':

       3262,042236      task-clock (msec)         #    0,993 CPUs utilized          
                 8      context-switches          #    0,002 K/sec                  
                 1      cpu-migrations            #    0,000 K/sec                  
            15 792      page-faults               #    0,005 M/sec                  
    11 602 925 303      cycles                    #    3,557 GHz                    
    26 936 652 659      instructions              #    2,32  insn per cycle         
     5 190 354 343      branches                  # 1591,136 M/sec                  
        12 945 940      branch-misses             #    0,25% of all branches

26G9, out of nowhere.
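
The rewrite was essentially of this shape (a hypothetical sketch; the real ._put_together_parameters() does more work than this):

# before: build a whole list of (name, value) pairs and return it
def put_together_parameters_as_list(parameter_names, parameters):
    pairs = []
    for i in range(len(parameter_names)):
        pairs.append((parameter_names[i], parameters[i]))
    return pairs


# after: yield the pairs one by one, so no intermediate list is allocated
def put_together_parameters_as_generator(parameter_names, parameters):
    for i in range(len(parameter_names)):
        yield (parameter_names[i], parameters[i])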

Rewrite set / dict to list

While attempting yet another try at the intermediate parent lookup cache optimization, I've rewritten the _visited_objects variable in Object.parent_lookup() from RPython's sets (dicts with None as values) to a list, because the set was just an old relic. I really just needed a container to store objects for later, when I unvisit them (set all their .visited properties back to False).
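
Sketched with a stand-in class, the change is essentially this (in RPython a set really is just a dict with None values):

class VisitableSketch(object):
    """Stand-in for a tinySelf object; only the .visited flag matters here."""
    def __init__(self):
        self.visited = False


def visit_and_unvisit(objects):
    # a plain list is enough: it is never queried for membership, only
    # iterated once at the end to reset the .visited flags
    visited_objects = []
    for obj in objects:
        obj.visited = True
        visited_objects.append(obj)   # previously: visited_objects[obj] = None

    for obj in visited_objects:
        obj.visited = False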

This resulted in a nice speedup:

       2903,254674      task-clock (msec)         #    0,994 CPUs utilized          
                 7      context-switches          #    0,002 K/sec                  
                 0      cpu-migrations            #    0,000 K/sec                  
             2 232      page-faults               #    0,769 K/sec                  
    10 387 334 274      cycles                    #    3,578 GHz                    
    25 032 972 112      instructions              #    2,41  insn per cycle         
     4 838 632 101      branches                  # 1666,623 M/sec                  
        11 633 578      branch-misses             #    0,24% of all branches

25G, just like that. Wow.

Intermediate parent lookup cache attempt n. 3

As I said in the previous sub-chapter, I was attempting another try at the intermediate parent lookup cache optimization.

       2794,943301      task-clock (msec)         #    0,993 CPUs utilized          
                13      context-switches          #    0,005 K/sec                  
                 2      cpu-migrations            #    0,001 K/sec                  
             1 661      page-faults               #    0,594 K/sec                  
     9 921 690 448      cycles                    #    3,550 GHz                    
    23 893 490 688      instructions              #    2,41  insn per cycle         
     4 657 628 104      branches                  # 1666,448 M/sec                  
        13 391 723      branch-misses             #    0,29% of all branches

And this time, it looks like I've done it; 23G9 instructions!

Dynamic recompilation

Okay, so I've created a new branch to implement dynamic recompilation of the bytecode. The idea is the following:

When resolving local slots, instead of doing lookups in the slot dictionary, which is itself just an index-table indirection into the object's ._slot_values property, recompile the bytecode so that the whole sequence of PUSH_LITERAL and SEND bytecodes is replaced by a LOCAL_SEND bytecode.

This not only saves time because PUSH_LITERAL is skipped, but also removes the name resolution / dict lookup in obj.map._slots in the SEND instruction.

Of course, this is not that big a time saving for local sends, but it may be great for parent sends, because it potentially saves a lot of requests to several dictionaries in parents. Local sends are still a good proof of concept.
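
A sketch of such a recompilation pass for local sends; only the bytecode names PUSH_LITERAL, SEND, LOCAL_SEND and NOP come from the text above. The flat [opcode, operand, ...] layout, the opcode numbers and the helper mapping are made up, and the argument count carried by SEND is ignored for brevity.

NOP = 0
PUSH_LITERAL = 1   # operand: index of the literal holding the message name
SEND = 2           # operand: number of arguments (ignored in this sketch)
LOCAL_SEND = 3     # operand: index into the object's ._slot_values


def recompile_local_sends(bytecode, literal_to_slot_index):
    """Rewrite PUSH_LITERAL + SEND pairs whose message resolves to a local slot
    into NOP + LOCAL_SEND, keeping the code length (and jump targets) intact."""
    code = list(bytecode)
    for i in range(0, len(code) - 3, 2):
        if code[i] == PUSH_LITERAL and code[i + 2] == SEND:
            slot_index = literal_to_slot_index.get(code[i + 1], -1)
            if slot_index >= 0:
                code[i], code[i + 1] = NOP, 0
                code[i + 2], code[i + 3] = LOCAL_SEND, slot_index
    return code


# the message stored in literal 5 lives in local slot 0
print(recompile_local_sends([PUSH_LITERAL, 5, SEND, 0], {5: 0}))
# -> [0, 0, 3, 0], i.e. NOP 0 as padding, then LOCAL_SEND 0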

So I had to put some kind of measurement into object_layout.py, as it only makes sense to dynamically recompile objects which already have some slots.

I've implemented all this, introduced new bytecodes for local and parent sends, and then split the ._do_send() method into multiple parts used by the new bytecodes. And it worked, and it was slightly faster.

Then I've implemented parent caches as dicts, so it was remembered which object a lookup resolved to, and I've rewritten parent lookups to save their results to this cache, together with the versions of the whole parent tree. This is necessary, because the parent lookup cache has to be invalidated when something in the tree changes. Then I've added dynamic recompilation for parents, and it actually worked.

In theory, this should have been faster, but it wasn't:

       2694,114612      task-clock (msec)         #    0,993 CPUs utilized          
                 8      context-switches          #    0,003 K/sec                  
                 0      cpu-migrations            #    0,000 K/sec                  
             1 715      page-faults               #    0,637 K/sec                  
     9 767 838 987      cycles                    #    3,626 GHz                    
    24 752 967 353      instructions              #    2,53  insn per cycle         
     4 735 181 543      branches                  # 1757,602 M/sec                  
        11 451 403      branch-misses             #    0,24% of all branches

I've spent an afternoon debugging the code, when I noticed that some object in the parent tree is almost always changed, which triggers cache invalidation and recompilation in each cycle. This was annoying, the code was complex, so I've thrown most of it out.

This may seem like a premature decision, but there were also problems with inlined bytecode invalidation and with switching back to unoptimized bytecodes, because the optimized code had to be padded with NOPs in the places where the PUSH_LITERAL instructions were, and switching back to the unoptimized code could wreak havoc in the MethodStack when nested calls occurred.

You can see the code in dynamic_recompiler.py in the local_send_cache branch.

LightWeightDict

In the process of writing the dynamic recompiler, I had a revelation that a dict lookup is not that costly; it is the dict creation which is slow. So I've decided to implement my own wrapper class, which implements a dict-like interface and uses three properties to store values. Only when there are more than three values stored does it switch to a list of 8 values, and only when there are more than that does it fall back to a regular dict.
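
Roughly, the idea looks like this (a condensed sketch with made-up names; the intermediate 8-item list stage is left out and the real LightWeightDict covers much more of the dict interface):

class LightWeightDictSketch(object):
    """Keep up to three key/value pairs in plain attributes and switch to a
    real dict only when a fourth distinct key shows up."""

    def __init__(self):
        self._k1 = self._k2 = self._k3 = None
        self._v1 = self._v2 = self._v3 = None
        self._dict = None

    def __setitem__(self, key, value):
        if self._dict is not None:
            self._dict[key] = value
        elif self._k1 is None or self._k1 == key:
            self._k1, self._v1 = key, value
        elif self._k2 is None or self._k2 == key:
            self._k2, self._v2 = key, value
        elif self._k3 is None or self._k3 == key:
            self._k3, self._v3 = key, value
        else:
            # the fourth distinct key moves everything into a real dict
            self._dict = {self._k1: self._v1, self._k2: self._v2,
                          self._k3: self._v3, key: value}

    def __getitem__(self, key):
        if self._dict is not None:
            return self._dict[key]
        if self._k1 == key:
            return self._v1
        if self._k2 == key:
            return self._v2
        if self._k3 == key:
            return self._v3
        raise KeyError(key)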

I've also decided to switch to a full tree search in parent lookups, as that is the only correct way; otherwise you can get nasty runtime errors (as David Ungar says in one of the many papers about Self he published).

Sadly, I don't have exact numbers, but it was slightly faster (something around 300M instructions). And then I created a version with preallocated lists, and that was faster still.

Preallocated lists

The parent lookup algorithm uses lists to iterate: it appends to one list called objects, from which it pops in each iteration. Then there is a second list, visited_objects, where all visited objects are stored so the algorithm can `unvisit` them after it finishes the graph lookup.

Both lists are used only once and are never used for lookups; the mode of operation is much better suited for a FIFO queue. This led me to try to optimize the algorithm with a preallocated array and a pair of pointers into the list:

class TwoPointerArray(object):
    """Preallocated array used as a FIFO queue; `length` is the expected size."""

    def __init__(self, length):
        self._length = length
        self._array = [None for _ in xrange(length)]
        self._left_pointer = 0   # index of the first used cell
        self._right_pointer = 0  # index one past the last used cell

    def __len__(self):
        return self._right_pointer - self._left_pointer

    def __getitem__(self, index):
        return self._array[self._left_pointer + index]

    def __setitem__(self, key, value):
        # offset by the left pointer, so indexing is consistent with __getitem__
        self._array[self._left_pointer + key] = value

    def pop_first(self):
        if self._left_pointer == self._right_pointer:
            raise IndexError()

        rval = self._array[self._left_pointer]
        self._array[self._left_pointer] = None
        self._left_pointer += 1
        return rval

    def pop_last(self):
        if self._left_pointer == self._right_pointer:
            raise IndexError()

        rval = self._array[self._right_pointer - 1]
        self._array[self._right_pointer - 1] = None
        self._right_pointer -= 1
        return rval

    def append(self, item):
        # grow the underlying list only when the preallocated space runs out
        if self._right_pointer >= self._length:
            self._array.append(item)
            self._right_pointer += 1
            self._length += 1
            return

        self._array[self._right_pointer] = item
        self._right_pointer += 1

    def to_list(self):
        return self._array[self._left_pointer: self._right_pointer]

Python lists should already be preallocated under RPython, but I've found that my version brings huge performance gains:

       3239,654852      task-clock (msec)         #    0,994 CPUs utilized          
                 9      context-switches          #    0,003 K/sec                  
                 1      cpu-migrations            #    0,000 K/sec                  
             3 803      page-faults               #    0,001 M/sec                  
     9 196 081 084      cycles                    #    2,839 GHz                    
    20 420 359 470      instructions              #    2,22  insn per cycle         
     4 064 685 697      branches                  # 1254,666 M/sec                  
         6 971 611      branch-misses             #    0,17% of all branches 

20G4, that's 4G3 fewer instructions just from this simple optimization.

Parent lookup cache

I've decided to give another try to the parent cache, that is, the dict-based cache for parent lookups.

First, I've implemented an object version stored in the .map._version property as a numeric int. Each meta-operation, like setting new content of a slot, or adding or removing a slot, increments the version.

When a parent lookup is performed, the algorithm goes over the whole graph of parents, marks objects as .visited=True, and stores the list of all objects in the visited_objects variable, which is then iterated and each object unvisited.

I've added a cache in the format {"slot_name": [result, [VersionedObject(x) for x in visited_objects]]}. That is, the visited objects are stored in a wrapper class together with their current version.
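
A sketch of the cache entry and its validation (the VersionedObject wrapper and the cache format follow the description above; the object stand-in and the method names are made up, and only the validation is shown, while the real code also restricts the re-lookup to the objects that changed):

class ObjSketch(object):
    """Stand-in for a tinySelf object; .map._version is flattened into one attribute."""
    def __init__(self):
        self.version = 0   # incremented by every meta-operation on the object


class VersionedObject(object):
    """Remember an object together with the version it had when it was visited."""
    def __init__(self, obj):
        self.obj = obj
        self.version = obj.version

    def is_unchanged(self):
        return self.obj.version == self.version


class ParentLookupCacheSketch(object):
    """slot name -> [result, list of VersionedObject for all visited objects]"""
    def __init__(self):
        self._cache = {}

    def store(self, slot_name, result, visited_objects):
        self._cache[slot_name] = [result, [VersionedObject(o) for o in visited_objects]]

    def cached_result(self, slot_name):
        if slot_name not in self._cache:
            return None

        result, versioned_objects = self._cache[slot_name]
        for versioned in versioned_objects:
            if not versioned.is_unchanged():
                return None   # something in the parent tree changed; do a real lookup

        return result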

When the same parent lookup is performed next time, all versions are compared and only those objects which changed are used for the lookup algorithm. This resulted in 1G8 fewer instructions:

       2908,278592      task-clock (msec)         #    0,992 CPUs utilized          
                 8      context-switches          #    0,003 K/sec                  
                 0      cpu-migrations            #    0,000 K/sec                  
             3 709      page-faults               #    0,001 M/sec                  
     8 496 614 780      cycles                    #    2,922 GHz                    
    18 645 475 253      instructions              #    2,19  insn per cycle         
     3 584 880 617      branches                  # 1232,647 M/sec                  
         7 969 234      branch-misses             #    0,22% of all branches

Rewrite stack / frames to linked lists

All kinds of stacks were implemented as lists, which were .append()-ed and .pop()-ed. A simple rewrite to linked lists provided a significant performance boost:

       2607,500168      task-clock (msec)         #    0,984 CPUs utilized          
               418      context-switches          #    0,160 K/sec                  
                 1      cpu-migrations            #    0,000 K/sec                  
             3 688      page-faults               #    0,001 M/sec                  
     7 602 782 564      cycles                    #    2,916 GHz                    
    16 544 264 590      instructions              #    2,18  insn per cycle         
     3 225 193 435      branches                  # 1236,891 M/sec                  
         5 024 868      branch-misses             #    0,16% of all branches

16G5, that is 2G1 fewer instructions per benchmark.
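
For illustration, a linked-list stack can be as small as this (a generic sketch, not the actual tinySelf frame classes):

class StackNode(object):
    """One element of the stack, pointing at the element below it."""
    def __init__(self, value, below):
        self.value = value
        self.below = below


class LinkedListStack(object):
    """Push and pop only relink one pointer; no list resizing is involved."""
    def __init__(self):
        self._top = None

    def push(self, value):
        self._top = StackNode(value, self._top)

    def pop(self):
        if self._top is None:
            raise IndexError("pop from an empty stack")

        value = self._top.value
        self._top = self._top.below
        return value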

#pypy help

I've asked on the #pypy IRC channel what people use for profiling, and cfbolz took an interest in my interpreter and pushed some JIT hints. Until this point the JIT wasn't very useful for me, but with his changes, wow:

        979,203041      task-clock (msec)         #    0,981 CPUs utilized          
                 3      context-switches          #    0,003 K/sec                  
                 1      cpu-migrations            #    0,001 K/sec                  
             2 194      page-faults               #    0,002 M/sec                  
     2 813 573 468      cycles                    #    2,873 GHz                    
     6 730 985 108      instructions              #    2,39  insn per cycle         
     1 431 479 384      branches                  # 1461,882 M/sec                  
           533 154      branch-misses             #    0,04% of all branches

The whole benchmark now runs in under one second with the JIT.

Rewrite method stack to preallocated list

I've also added a version of MethodStack with a preallocated list, as I know how many items will be on each method stack from the number of literals in that object. This indeed resulted in slightly faster code, but performance under the JIT was slightly slower (around 7G3 instructions compared to 6G7 with linked lists).

       1888,295288      task-clock (msec)         #    0,990 CPUs utilized          
                 8      context-switches          #    0,004 K/sec                  
                 0      cpu-migrations            #    0,000 K/sec                  
             1 619      page-faults               #    0,857 K/sec                  
     6 736 850 842      cycles                    #    3,568 GHz                    
    16 131 116 476      instructions              #    2,39  insn per cycle         
     3 215 702 497      branches                  # 1702,966 M/sec                  
         2 876 745      branch-misses             #    0,09% of all branches        

       1,907438730 seconds time elapsed

Remove pre-initialization of slots and parents

As it was implemented, the __init__() method of Object looked like this:

    def __init__(self, map=None):
        self.map = map
        self.scope_parent = None
        self._slot_values = []
        self._parent_slot_values = []

That means the properties ._slot_values and ._parent_slot_values were initialized to empty lists. Simply changing them to None and initializing them lazily, only when they are actually used, changed the performance significantly:

       2174,622775      task-clock (msec)         #    0,991 CPUs utilized          
                19      context-switches          #    0,009 K/sec                  
                 1      cpu-migrations            #    0,000 K/sec                  
             1 540      page-faults               #    0,708 K/sec                  
     5 970 301 961      cycles                    #    2,745 GHz                    
    14 764 994 336      instructions              #    2,47  insn per cycle         
     2 942 506 811      branches                  # 1353,111 M/sec                  
         4 412 469      branch-misses             #    0,15% of all branches        

       2,194327578 seconds time elapsed
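
For comparison, the lazy version looks roughly like this (a sketch; the accessor shown here is hypothetical and the real Object has more state):

class Object(object):
    def __init__(self, map=None):
        self.map = map
        self.scope_parent = None
        # don't allocate the lists for the many objects that never get any slots
        self._slot_values = None
        self._parent_slot_values = None

    def set_slot_value(self, index, value):
        # allocate lazily, only on the first actual use
        if self._slot_values is None:
            self._slot_values = []
        while len(self._slot_values) <= index:
            self._slot_values.append(None)
        self._slot_values[index] = value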

Different kinds of LightWeightDict

I've decided it is time to measure different kinds of LightWeightDict:

I've got one that uses dynamic arrays, one that uses a statically allocated array before it switches internally to a dict, and several versions which use just a number of properties (1-8) before they switch to an internal dictionary to store the data.

The results clearly show that the one using three properties and then falling back to a dict is the fastest.

       1634,048187      task-clock (msec)         #    0,989 CPUs utilized
                 3      context-switches          #    0,002 K/sec
                 0      cpu-migrations            #    0,000 K/sec
             1 407      page-faults               #    0,861 K/sec
     5 922 070 772      cycles                    #    3,624 GHz
    14 590 886 554      instructions              #    2,46  insn per cycle
     2 927 349 398      branches                  # 1791,471 M/sec
         3 116 034      branch-misses             #    0,11% of all branches

       1,652842120 seconds time elapsed

and with JIT:

        780,210821      task-clock (msec)         #    0,977 CPUs utilized          
                 6      context-switches          #    0,008 K/sec                  
                 0      cpu-migrations            #    0,000 K/sec                  
             2 058      page-faults               #    0,003 M/sec                  
     2 683 932 130      cycles                    #    3,440 GHz                    
     6 062 913 589      instructions              #    2,26  insn per cycle         
     1 312 432 517      branches                  # 1682,151 M/sec                  
           460 458      branch-misses             #    0,04% of all branches        

       0,798466234 seconds time elapsed

So the final numbers are 14G6 without the JIT and 6G with the JIT.

Additional speedups

I am quite happy with the speed now, as the interpreter is finally usable, so I won't pursue further speedups for some time and will focus on features instead. cfbolz mentioned that there are still possibilities of at least a 10x performance speedup from the JIT alone.

Off the top of my head, there are also at least a few other possible speedups left.

