Monte: Typhon Profiling

Work continues on Typhon. I've recently yearned for a way to study the Monte-level call stacks for profiling feedback. After a bit of work, I think that I've built some things that will help me.

My initial desire was to get the venerable perf to work with Typhon. perf's output is easy to understand, with a little practice, and describes performance problems pretty well.

I'm going to combine this with Brendan Gregg's very cool flame graph system for visually summarizing call stacks, in order to show off how the profiling information is being interpreted. I like flame graphs and they were definitely a goal of this effort.

Maybe perf doesn't need any help. I was able to use it to isolate some Typhon problems last week. I'll use my Monte webserver for all of these tests, putting it under some stress and then looking at the traces and flame graphs.

Now seems like a good time to mention that my dorky little webserver is not production-ready; it is literally just good enough to respond to Firefox, siege, and httperf with a 200 OK and a couple bytes of greeting. This is definitely a microbenchmark.

With that said, let's look at what perf and flame graphs say about webserver performance:

An unhelpful HTTP server profile

You can zoom in on this by clicking. Not that it'll help much. This flame graph has two big problems:

  1. Most of the time is spent in the mysterious "[unknown]" frames. I bet that those are just caused by the JIT's code generation, but perf doesn't know that they're meaningful or how to label them.
  2. The combination of the JIT and builtin objects with builtin methods results in totally misleading call stacks, because most object calls don't push new frames onto the stack.

I decided to tackle the first problem first, because it seemed easier. Digging a bit, I found a way to generate information on JIT-created code objects and get that information to perf via a temporary file.

The technique is only documented via tribal knowledge and arcane blog entries. (I suppose that, in this regard, I am not helping.) It is described both in this kernel patch implementing the feature, and also in this V8 patch. The Typhon JIT hooks show off my implementation of it.
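For reference, the convention itself is tiny. perf looks for a file named /tmp/perf-<pid>.map and reads one line per JIT-generated code object: start address and size in hex, then a symbol name. A sketch of the idea (the helper is illustrative, not Typhon's actual hook):

import os

def writeJITSymbol(addr, size, name):
    # perf's map format: "startaddr size symbolname", addresses in hex.
    path = "/tmp/perf-%d.map" % os.getpid()
    with open(path, "a") as f:
        f.write("%x %x %s\n" % (addr, size, name))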

So, does it work? What does it look like?

I didn't upload a picture of this run, because it doesn't look any different from the earlier graph! The big [unknown] frames aren't improved at all. Sufficient digging reveals that the newly-annotated frames are almost never called. Clearly this was not a winning approach.

At this point, I decided to change tack entirely and wrote a completely new call stack tracer inside Typhon. I wanted a sampling profiler, but sampling is hard in RPython; the vmprof project might fix that someday. For now, I'll have to settle for a precise profiler.

Unlabeled HTTP server profile with correct atoms

I omitted the coding montage. Every time a call is made from within SmallCaps, the profiler takes a time measurement before and after the call. This is pretty great! But can we get more useful names?
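First, though, here's roughly how the bookkeeping works: a stack of labels plus a table of accumulated times, keyed by the folded stack, so that the totals feed straight into the flame graph scripts. This is a sketch with illustrative names, not Typhon's actual internals:

import time

class Profiler(object):

    def __init__(self):
        self.stack = []
        self.totals = {}

    def enter(self, label):
        # Called just before SmallCaps makes a call.
        self.stack.append(label)
        return time.time()

    def exit(self, startTime):
        # Called just after the call returns: charge the elapsed time to
        # the current folded stack.
        folded = ";".join(self.stack)
        self.totals[folded] = self.totals.get(folded, 0.0) + (time.time() - startTime)
        self.stack.pop()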

Names in Monte are different from names in, say, Python or Java, which both have class names; since Monte does not have classes, there is no class name to fall back on. The compromise we accept here is a "display name": for a user-level object literal, the pattern used to bind it, and for the runtime's builtin objects, the name of the implementing class.

HTTP server profile with correct atoms and useful display names

Note how the graphs are differently shaped; all of the frames are being split out properly and the graph is more detailed as a result. The JIT is still active during this entire venture, and it'd be cool to see what the JIT is doing. We can use RPython's rpython.rlib.jit.we_are_jitted() function to mark methods as being JIT'd, and we can ask the flame graph generator to colorize them.
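Here's roughly what the marking looks like. we_are_jitted() is real RPython; the labeling scheme is a sketch of mine, leaning on the fact that flamegraph.pl's Java palette colors frames whose names end in "_[j]":

from rpython.rlib.jit import we_are_jitted

def frameLabel(displayName, atom):
    label = u"%s.%s" % (displayName, atom)
    if we_are_jitted():
        # flamegraph.pl --color=java gives these frames their own color.
        label += u"_[j]"
    return label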

HTTP server profile with JIT COLORS HOLY FUCK I CANNOT BELIEVE IT WORKS

Oh man! This is looking pretty cool. Let's also colorize the frames that can sit directly below JIT entry points. I do this with a heuristic: a regular expression.
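The heuristic amounts to a scan over each folded stack: tag any frame whose immediate callee is already marked as JIT'd. A sketch (the suffixes are just ones flamegraph.pl's Java palette recognizes):

import re

JITTED = re.compile(r"_\[j\]$")

def tagBelowJIT(frames):
    # frames is a single folded stack, listed bottom-to-top.
    out = []
    for i, frame in enumerate(frames):
        if i + 1 < len(frames) and JITTED.search(frames[i + 1]):
            frame += "_[i]"
        out.append(frame)
    return out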

THE COLORS NEVER STOP, CAN'T STOP, WON'T STOP

This isn't even close to the kind of precision and detail from the amazing Java-on-Illumos profiles on Gregg's site, but it's more than enough to help my profiling efforts.

~ C.

Last modified on 2015-02-28 16:13:00

Monte: Typhon and SmallCaps

Two years of having a real job have made me bitter and grumpy, so I'm gonna stop with the cutesy blog post titles.

Today, I'm gonna talk a bit about Monte; specifically, about Typhon's current virtual machine. Typhon used to use an abstract syntax tree interpreter ("AST interpreter") as its virtual machine. It now uses a variant of the SmallCaps bytecode machine. I'll explain how and why.

When I started designing Typhon, I opted for an AST VM because it seemed like it matched Monte's semantics well. Monte is an expression language, which means that every syntactic node is an expression which can be evaluated to a single value. Monte is side-effecting, so evaluation must happen in an environment which can record the side effects. An AST interpreter could have an object representing each syntactic node, and each node could be evaluated within an environment to produce a value.

class Node(object):
    """
    An abstract syntax tree node.
    """

    def evaluate(self, environment):
        pass

Subclasses of Node can override evaluate() to have behavior specific to that node. The Assign class can assign values to names in the environment, the Object class can create new objects, and the Call class can pass messages to objects. This technique is amenable to RPython, since Node.evaluate() has the same signature for all nodes.
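For instance, an assignment node might look like this sketch (illustrative, not Typhon's actual code; recordAssignment is an assumed environment method):

class Assign(Node):

    def __init__(self, name, rvalue):
        self._name = name
        self._rvalue = rvalue

    def evaluate(self, environment):
        # Evaluate the right-hand side, record the side effect, and return
        # the value, since assignment is an expression in Monte.
        value = self._rvalue.evaluate(environment)
        environment.recordAssignment(self._name, value)
        return value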

How well does this work with the RPython just-in-time ("JIT") compiler? It does alright. The JIT has little problem seeing that the nodes are compile-time constant, and promotion applied strategically allows the JIT to see into things like Call nodes, which cause methods to be inlined relatively well. However, there are some problems.

The first problem is that Monte has no syntactic loops. Since RPython's JIT is designed to compile loops, this means that some extra footwork is needed to set up the JIT. I took Monte's looping object, __loop, and extended it to detect when its argument is likely to yield a JIT trace. The detection is simple, examining whether the loop body is a user-defined object and only entering the JIT when that is the case. In theory, most bodies will be user-defined, since Monte turns this:

object SubList:
    to coerce(specimen, ej):
        def [] + l exit ej := specimen
        for element in l:
            subguard.coerce(element, ej)
        return specimen

…into this:

object SubList:
    method coerce(specimen, ej):
        def via (__splitList.run([0])) [l] exit ej := specimen
        var validFlag__6 := true
        try:
            __loop.run([l, object _ {
                method run(_, value__8) {
                    __validateFor.run([validFlag__6])
                    subguard.coerce([value__8, ej])
                    null
                }
            }])
        finally:
            validFlag__6 := false
        specimen

This example is from Typhon's prelude. The expansion of loops, and other syntax, is defined to turn the full Monte language into a smaller language, Kernel-Monte, which resembles Kernel-E. Now, in this example, the nonce loop object generated by the expander is very custom, and probably couldn't be simplified any further. However, it is theoretically possible that an optimizer could detect loops that call a single method repeatedly and simplify them more aggressively. In practice, that optimization doesn't exist, so Typhon thinks that all loops are user-defined and allows the JIT to trace all of them.
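For a sense of what that footwork looks like, here is a sketch of a loop object entering the JIT. JitDriver and jit_merge_point are real RPython API; the loop protocol and helper names are illustrative:

from rpython.rlib.jit import JitDriver

# The green variable identifies the trace; something constant per loop, such
# as the compiled code of the loop body, works well.
loopDriver = JitDriver(greens=["code"], reds=["iterator", "body"])

def monteLoop(iterator, body):
    code = body.getCode()  # illustrative accessor
    while iterator.hasNext():
        loopDriver.jit_merge_point(code=code, iterator=iterator, body=body)
        body.call(u"run", [iterator.next()])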

The next hurdle has to do with names. Monte's namespaces are static, which means that it's possible to always know in advance which names are in a scope. This is a powerful property for compilers, since it opens up many kinds of compilation and lowering which aren't available to languages with dynamic namespaces like Python. However, Typhon doesn't have access to the scoping information from the expander, only the AST. This means that Typhon has to redo most of the name and scope analysis in order to figure out things like how big namespaces should be and where to store everything. I initially did all of this at runtime, in the JIT, but it is very slow. This is because the JIT has problems seeing into dictionaries, and it cannot trust that a dictionary is actually constant-size or constant-keyed.

RPython does have an answer to this, called virtual and virtualizable objects. A virtual object is one that is never constructed within a JIT trace, because the JIT knows that the object won't leave the trace and can be totally elided. (The literature talks at length of "escape analysis", the formal term for this.) Typhon's AST nodes occasionally generated virtual objects, but only rarely, because most objects are assigned to names in the environment, and the JIT refuses to ignore modifications to the environment.

Virtualizable objects solve this problem neatly. A virtualizable object has one or more attributes, called virtualizable attributes, which can be "out-of-sync" during a JIT trace. A virtualizable can delay being updated, as long as it's updated at some point before the end of the JIT trace. RPython allows fields of integers and floating-point values to be virtualizable, as well as constant-size lists. However, Typhon uses mappings of names for its environments, backed by dictionaries, and dictionaries can't be virtualizable.
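For reference, here's a sketch of what a virtualizable declaration looks like in RPython (the Frame class is illustrative, not Typhon's):

from rpython.rlib.jit import JitDriver

class Frame(object):
    # These fields may be left out-of-sync during a trace; "[*]" marks a
    # constant-size list.
    _virtualizable_ = ["stackDepth", "valueStack[*]"]

    def __init__(self, size):
        self.valueStack = [None] * size
        self.stackDepth = 0

frameDriver = JitDriver(greens=["code"], reds=["frame"],
                        virtualizables=["frame"])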

The traditional solution to this problem involves assigning an index to every name within a scope, and using a constant-size list with the indices. I did this, but it was arduous and didn't have the big payoff that I had wanted. Why not? Well, every AST node introduces a new scope! This caused scope creation to be expensive. I realized that I needed to compute closures as well.
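The bookkeeping for that index assignment looks something like this sketch (illustrative):

class ScopeLayout(object):

    def __init__(self):
        self.indices = {}

    def indexOf(self, name):
        # Assign each name a stable index the first time it's seen.
        if name not in self.indices:
            self.indices[name] = len(self.indices)
        return self.indices[name]

    def makeFrame(self):
        # Frames become constant-size lists, which the JIT handles well.
        return [None] * len(self.indices)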

Around this time, a few months ago, I began to despair, because debugging the AST VM is hard. RPython's JIT logging and tooling are all based around the assumption that there is a low-level virtual machine of some sort, with instructions and encapsulation, and the AST was just too hard to manage in this way. I had to invent my own tracebacks, my own logging, and my own log viewer. This wasn't going well. I wanted to join the stack-based VM crew and not have piles of ASTs to slog through whenever something wasn't going right. So, I decided to try to implement SmallCaps, from E. E, of course, is the inspiration for Monte, and shares many features with Monte. SmallCaps was based on old Smalltalk systems, but was designed to work with unique E features like ejectors.

So, enough talk, time for some code. First, let's lay down some ground rules. These are the guiding semantics of SmallCaps in Typhon. Keep in mind that we are describing a single-stack automaton with a side stack for exception handling and an environment with frames for local values and closed-over "global" values.

  • All expressions return a value. Therefore, an expression should always compile to some instructions which start with an empty stack and leave a single value on the stack.
  • All patterns perform some side effects in the environment and return nothing. Therefore, they should compile to instructions which consume two values from the stack (specimen and ejector) and leave nothing.
  • Exception handlers are pushed onto the side stack when required, and every handler must be dropped when it's no longer needed.

With these rules, the compiler's methods become very obvious.

class Str(Node):
    """
    A literal string.
    """

    def compile(self, compiler):
        index = compiler.literal(StrObject(self._s))
        compiler.addInstruction("LITERAL", index)

Strings are compiled into a single LITERAL instruction that places a string on the stack. Simple enough.

class Sequence(Node):
    """
    A sequence of nodes.
    """

    def compile(self, compiler):
        for node in self._l[:-1]:
            node.compile(compiler)
            compiler.addInstruction("POP", 0)
        self._l[-1].compile(compiler)

Here we compile sequences of nodes by compiling each node in the sequence, and then using POP to remove each intermediate node's result, since they aren't used. This nicely mirrors the semantics of sequences, which are to evaluate every node in the sequence and then return the value of the ultimate node's evaluation.

This also shows off a variant on the Visitor pattern which Allen, Mike, and I are calling the "Tourist pattern", where an accumulator is passed from node to node in the structure and recursion is directed by each node. This makes managing the Expression Problem much easier, since nodes completely contain all of the logic for each accumulation, and makes certain transformations much easier. More on that in a future post.

class FinalPattern(Pattern):

    def compile(self, compiler):
        # [specimen ej]
        if self._g is None:
            compiler.addInstruction("POP", 0)
            # [specimen]
        else:
            self._g.compile(compiler)
            # [specimen ej guard]
            compiler.addInstruction("ROT", 0)
            compiler.addInstruction("ROT", 0)
            # [guard specimen ej]
            compiler.call(u"coerce", 2)
            # [specimen]
        index = compiler.addFrame(u"_makeFinalSlot")
        compiler.addInstruction("NOUN_FRAME", index)
        compiler.addInstruction("SWAP", 0)
        # [_makeFinalSlot specimen]
        compiler.call(u"run", 1)
        index = compiler.addLocal(self._n)
        compiler.addInstruction("BINDSLOT", index)
        # []

This pattern is compiled to insert a specimen into the environment, compiling the optional guard along the way and ensuring order of operations. The interspersed comments show the stack contents between operations, because that helps me keep track of how things are compiled.

With this representation, the Compiler is able to see the names and indices of every binding introduced during compilation, which means that creating index-based frames as constant-size lists is easy. (I was going to say "trivial," but it was not trivial!)

I was asked on IRC about why I chose to adapt SmallCaps instead of other possible VMs. The answer is mostly that SmallCaps was designed and implemented by people that were much more experienced than me, and that I trust their judgement. I tried several years ago to design a much purer concatenative semantics for Kernel-E, and failed. SmallCaps works, even if it's not the simplest thing to implement. I did briefly consider even smaller semantics, like those of the Self language, but I couldn't find anything expressive enough to capture all of Kernel-E's systems. Ejectors are tricky.

That's all for now. Peace.

~ C.

Last modified on 2015-02-14 13:51:00

Lessons

I have a problem. It's not an unusual problem, and I wish to tackle it precisely because of its ubiquity. The problem is, roughly, this: I have a large chunk of code which does some blocking work. I wish to augment this code so that it can be used with Twisted, and I wish to do so in a way that satisfies the following conditions:

  1. A minimum of source code is changed.
  2. There are no deep hacks which are not trivial to explain when taken one by one.
  3. The caller may use either the Twisted or non-Twisted interfaces at their leisure.
  4. The difference between the Twisted and non-Twisted results at all of the borders of the module should be undone by maybeDeferred.

These conditions should not be difficult to achieve and yet they seem to constantly stymie many would-be IRC bot authors. I'm going to see if I can improve on this with a couple lessons from Haskell.

So, first, let's consider why we cannot simply pull data out of Twisted's interfaces. It's elementary: a Deferred computation's result cannot be accessed directly. Instead, actions have to be lifted up into the Deferred, which will run them when the result is ready.

To a Haskell programmer, this sounds quite a bit like how IO behaves, and indeed, Deferred is a Functor (and Monad) that cannot be unwrapped. So, with this in mind, let's look a bit at what kind of interface this would be in Haskell. First, let's consider the type of our computation:

computation :: a -> b

That is, we are taking some data of type a and returning some data of type b. The trick here, for those of you not well-versed in Haskell, is that computation may not do anything outside of these types; it cannot perform I/O or have any impure effects. (Okay, fine, I mean, you can perform horrid hacks to do those things, but remember our rules: No deep hackery here.)

Now, let's consider the type of the Twisted-style computation.

deferredComputation :: Deferred a -> Deferred b

That is, we're taking a Deferred containing data of type a, and returning a Deferred with data of type b.

Now, here's the fun part. We want to generate deferredComputation from computation, to achieve code reuse. How? Well, let's use fmap, since Deferred is a Functor!

computation :: a -> b

fmap :: Functor f => (a -> b) -> f a -> f b

deferredComputation :: Deferred a -> Deferred b
deferredComputation = fmap computation

Hey hey! Pretty cool, right? This might seem super-obvious to people with lots of Haskell experience, but I think it's still worth repeating since not everybody has done this sort of thing before.

And now we return to the land of Python. Python-land. It's time to construct this thing in Python, as well. So, how do we lift a function up into a Deferred in Python?

def computation(a):
    # a -> b
    return b

def deferredComputation(deferred):
    # Deferred a -> Deferred b
    deferred.addCallback(computation)
    return deferred

Think about this for a second. Remember, Deferred objects carry state around with them, so we need this "impure" sort of approach, which is really not actually impure but just object-at-a-time. If you're unsure of exactly what this snippet's doing, go through it one bit at a time.

  1. Take a Deferred which will fire with a value of type a.
  2. Append a callback which transforms a into b.
  3. Return a Deferred which will fire with a value of type b.

Now, let's make this concrete with an example. Let's say that we've got a system that has two implementations of a client, one which is synchronous, and one which is asynchronous. We've isolated and split out these clients such that they have exactly the same setup functions, and they return exactly the same data, with one single difference: One client is blocking and returns the data, and the other client is non-blocking and returns a Deferred which will fire with the data. This is exactly the difference that maybeDeferred can paper over. We've got all of the code set up just the way we want it, according to those conditions I listed earlier.
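To make that papering-over concrete, maybeDeferred normalizes both clients' results into Deferreds (the client objects here are stand-ins for the ones sketched below):

from twisted.internet.defer import maybeDeferred

# Wraps the synchronous client's plain result in an already-fired Deferred...
d1 = maybeDeferred(sync_client.request, "/greeting")
# ...and passes the asynchronous client's Deferred through untouched.
d2 = maybeDeferred(async_client.request, "/greeting")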

But! These clients only make up a couple dozen lines of code. There are still thousands of lines of code that only work with the synchronous client. How do we make them work with Twisted without losing our synchronous abilities?

Let's create some helper which will apply the computation to the data. This helper will come with the client and will be tailored to the client's output. For unlifted non-Twisted data, this is simply the classic builtin apply, known to Haskellers as ($):

($) :: (a -> b) -> a -> b
f $ a = f a

def apply(f, a):
    return f(a)

Note that my apply is not the Python apply builtin function, which unpacks a sequence of arguments rather than passing a single value.

And for the Deferred-handling case, let's create a slightly more interesting applier which will continue to move data through the Deferred. We already wrote this above, actually, and in Haskell, it would be fmap:

fmap :: Functor f => (a -> b) -> f a -> f b

def deferredApply(f, deferred):
    deferred.addCallback(f)
    return deferred

And now we're ready to put everything together! Here's a small skeleton:

class SyncClient(object):
    @staticmethod
    def applier(f, value):
        return f(value)

    def request(self, s):
        return sync_library_call(s)

class AsyncClient(object):
    @staticmethod
    def applier(f, deferred):
        deferred.addCallback(f)
        return deferred

    def request(self, s):
        return async_library_call(s)

def computation(data):
    transform(data)
    poke(data)
    return data

def request_and_compute(client, resource):
    data = client.request(resource)
    return client.applier(computation, data)

Look at request_and_compute. It has no idea whether it's handling synchronous or asynchronous data, and it doesn't really care; it asks the client to actually apply the computation to the data. And the computation itself is totally unaware of things going on around it. It doesn't even have to be pure; it could do all kinds of side effects with that data. The only requirement for the computation is that it must remember to return the data so that subsequent computations can access it.
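Usage is then symmetric, with only the type of the result differing (assuming the library calls above are filled in; the resource string is illustrative):

data = request_and_compute(SyncClient(), "/status")       # plain data
deferred = request_and_compute(AsyncClient(), "/status")  # a Deferred of data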

This is the approach I'm taking in a new library I'm hacking together for Ganeti, called Gentleman. I think it'll work out well.

~ C.

Last modified on 2012-09-21 15:09:00
