Decorator JITs: Python as a DSL (eli.thegreenplace.net)
cchianel 20 hours ago [-]
I had the misfortune of translating CPython bytecode to Java bytecode, and I do not wish that experience on anyone:

- CPython's bytecode is extremely unstable. Not only are opcodes added and removed each release, the meaning of existing opcodes can change. For instance, the meaning of the argument to JUMP_IF_FALSE_OR_POP depends on the CPython version: in CPython 3.10 and below, it is an absolute address; in CPython 3.11 and above, it is a relative address.

- The documentation for the bytecode in dis tends to be outdated or outright wrong. I often had to analyze the generated bytecode to figure out what each opcode means (and then submit a corresponding PR to update said documentation). Moreover, it assumes you know the inner details of how CPython works, from the descriptor protocol to how binary operations are implemented (each of which is about a 30-line function when written in Python).

- CPython's bytecode is extremely atypical. For instance, a for-loop keeps its iterator on the stack instead of storing it in a synthetic variable. As a result, when an exception occurs inside a for-loop, instead of the stack containing only the exception, it will also contain the for-loop iterator.
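
A quick way to see this for yourself (a minimal sketch; the exact opcodes vary by CPython version):

    import dis

    def f(xs):
        for x in xs:
            print(x)

    # Note GET_ITER / FOR_ITER: the loop's iterator lives on the value
    # stack for the duration of the loop; no local variable holds it.
    dis.dis(f)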

As for why I did this: I have Java calling CPython in a hot loop. Although direct FFI is fast, it causes a memory leak, since Java's and Python's garbage collectors need to track each other's objects. When using JPype or GraalPy, the overhead of calling Python in a Java hot loop is massive; I got a 100x speedup from translating the CPython bytecode to Java bytecode with identical behaviour (details can be found in my blog post: https://timefold.ai/blog/java-vs-python-speed).

I strongly recommend using the AST instead (although there are no backward compatibility guarantees with the AST either, it is far less likely to break between versions).
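
For illustration, the AST route can start as small as this (a hypothetical show_ast decorator, not code from either project):

    import ast
    import inspect

    def show_ast(fn):
        # Parse the function's own source into an AST; the ast module
        # has stayed far more stable across versions than the bytecode.
        tree = ast.parse(inspect.getsource(fn))
        print(ast.dump(tree, indent=2))  # indent= needs Python 3.9+
        return fn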

PaulHoule 1 day ago [-]
I read On Lisp by Graham recently and first thought "this is the best programming book I've read in a while", then had the urge to make copy-editing kinds of changes ("he didn't define nconc"), then thought "if he were using Clojure he wouldn't be fighting with nconc", and by the end thought "most of the magic is in functions; mostly he gets efficiency out of macros, and the one case that really needs macros is the use of continuations" and "I'm disappointed he didn't write any macros that do a real tree transformation".

Then a few weeks later I came to the conclusion that Python is the new Lisp when it comes to metaprogramming (and async in Python does the same thing he coded up with continuations). I think homoiconicity and the parentheses are a red herring; the real problem is that we're still stuck with parser generators that aren't composable. You really ought to be able to add

   unless(X) { ... }
to Java by adding one production to the grammar, a new AST node class, and a transformation for the compiler that rewrites it to

   if(!X) { ... }
Probably the actual code would be smaller than the POM file, if the compiler were built as if extensibility mattered.
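
In Python, by contrast, the rewrite step is already within reach of the standard library. Here is a toy version of exactly this unless macro done with ast.NodeTransformer (a sketch; every name in it is made up for illustration):

    import ast
    import inspect
    import textwrap

    class Unless(ast.NodeTransformer):
        # Rewrite `with unless(X): body` into `if not X: body`.
        def visit_With(self, node):
            self.generic_visit(node)
            ctx = node.items[0].context_expr
            if (isinstance(ctx, ast.Call)
                    and isinstance(ctx.func, ast.Name)
                    and ctx.func.id == "unless"):
                return ast.If(
                    test=ast.UnaryOp(op=ast.Not(), operand=ctx.args[0]),
                    body=node.body,
                    orelse=[])
            return node

    def macro(fn):
        tree = ast.parse(textwrap.dedent(inspect.getsource(fn)))
        tree.body[0].decorator_list = []  # don't re-apply @macro on exec
        tree = ast.fix_missing_locations(Unless().visit(tree))
        ns = {}
        exec(compile(tree, "<macro>", "exec"), fn.__globals__, ns)
        return ns[fn.__name__]

    @macro
    def demo(x):
        with unless(x > 3):
            return "small"
        return "big"

    print(demo(1), demo(5))  # -> small big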

Almost all the examples in this book (which claims to be a tutorial for Common Lisp programming)

https://www.amazon.com/Paradigms-Artificial-Intelligence-Pro...

are straightforward to code up in Python. The main retort to this I hear from Common Lisp enthusiasts is that some CL implementations are faster, which is true. Still, most languages today have a big helping of "Lisp, the good parts". Maybe some day the Rustifarians will realize the wide-ranging impacts of garbage collection, not least that you can smack together an unlimited number of frameworks and libraries into one program and never have to think about making the memory allocation and deallocation match up.

hatmatrix 1 day ago [-]
Peter Norvig himself has come around to embracing Python as an alternative to Lisp:

https://norvig.com/python-lisp.html

https://news.ycombinator.com/item?id=1803815

and there is indeed a Python implementation for the PAIP programs.

https://github.com/dhconnelly/paip-python

GC conferring additional composability is an interesting take - I hadn't thought of that (though I don't spend much time in this domain).

PaulHoule 24 hours ago [-]
Just imagine what a godawful mess it would be to write really complex libraries that are supposed to be composable in C. (I'm going to argue this is why there is no 'npm' or 'maven' for C)

The point of programming in C is that it is very low level and you have complete control over memory allocation, so you're losing much of the benefit of C if you have a one-size-fits-all answer.

The application might, in some cases, pass the library a buffer that it already allocated and tell the library to use it. In other cases the application might give the library malloc and free functions to use. It gets complicated if the application and library are sharing complicated data structures with a network of pointers.

In simple cases you can find an answer that makes sense, but in general the application doesn't know if a library is done with some memory and the library doesn't know if the application is done with it. But the garbage collector knows!

It is the same story in Rust: you can design some scheme that satisfies the borrow checker in some particular domain, but the only thing that works in general is to make everything reference counted. At least Rust gives you that option, although the "no circular references" problem is one of those design-limiting features too, as everything has to be a tree or a DAG, not a general-purpose graph.

pjmlp 9 hours ago [-]
The reason there is no npm/maven for C is that UNIX culture prefers Makefiles and packages in whatever format the actual UNIX implementation uses.

Depots on AIX, pkgsrc on Solaris, tgz/rpm/deb on Linux, ports on BSDs, ...

In any case, I would argue that we have npm/maven for C, and C++ nowadays, via CMake/Conan/vcpkg.

zahlman 9 hours ago [-]
> I think homoiconicity and the parenthesis are a red herring, the real problem is that we're still stuck with parser generators that aren't composable.

You might be interested in "PEP 638 – Syntactic Macros" (https://peps.python.org/pep-0638/), though it hasn't gotten very much attention.

I have similar (or at least related) thoughts for my own language design (Fawlty), too. My basic idea is that beyond operators, expressions are parsed primarily by pattern-matching, and there's a reserved namespace for keywords/operators. For example, things like Python's f-strings would be implemented as a compile-time f operator, which does an AST transformation. Some user-defined operators might conditionally either create some complex AST representation or just delegate to a runtime function call.

> Rustifarians

Heh, haven't heard that one.

eru 23 hours ago [-]
I don't even think you can define homoiconicity properly, in a way that captures what you want to capture.

Even the Wikipedia introduction at https://en.wikipedia.org/wiki/Homoiconicity agrees that the concept is 'informal' at best and meaningless at worst.

6gvONxR4sf7o 1 day ago [-]
I've had a lot of fun with tracing decorators in Python, but the limitation around data-dependent control flow (e.g. an if statement, a for loop) always ends up being more painful than I'd hoped. It's a shame, since it's such a great pattern otherwise.
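
The root of the pain is easy to demonstrate (a toy tracer; all names made up). Operator overloading records arithmetic fine, but the moment traced data reaches an if, Python demands a concrete bool:

    class Tracer:
        def __init__(self, expr):
            self.expr = expr  # a little expression tree

        def __add__(self, other):
            return Tracer(("add", self.expr, other))

        def __gt__(self, other):
            return Tracer(("gt", self.expr, other))

        def __bool__(self):
            # An `if` on traced data needs a real answer; a tracer can
            # only fail here or silently bake in one branch.
            raise RuntimeError("data-dependent control flow reached")

    def f(x):
        y = x + 1
        if y > 0:  # calls __bool__ on a Tracer
            return y
        return x

    try:
        f(Tracer("x"))
    except RuntimeError as e:
        print(e)  # -> data-dependent control flow reached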

Can anyone think of a way to get a smooth gradation of tracing-based transformations, graded by the effort required? I'd love to say "okay, in this case I'm willing to put in a bit more effort" and somehow get data-dependent if statements working, while not supporting data-dependent loops. All I know of now is either tracing with zero data-dependent control flow, or going all the way to writing a Python compiler with whatever set of semantics you want to support, and full failure on whatever you don't.

On a different note, some easy pdb integration for decorator DSLs would be an incredible enabler for these kinds of things. My coworkers are always trying to write little "engine" DSLs for one thing or another, and it sucks that whenever you implement your own execution engine, you completely lose all language tooling. As I understand it, in compiler tooling you always carry the burden of shepherding around maps of which part of the source a given thing corresponds to. Ditto for Python decorator DSLs, except nobody bothers, meaning you get the equivalent of a 1960s developer experience in that DSL.

ByteMe95 22 hours ago [-]
CSP (https://github.com/Point72/csp) has a healthy amount of AST parsing for their DSL. Looks like they have debug breakpoints working by augmenting the line numbers in the AST generation
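
The underlying trick (a sketch of the general technique, not CSP's actual code) is to keep the original file name and line numbers when recompiling, so pdb and IDE breakpoints still map to the real source:

    import ast
    import inspect
    import textwrap

    def recompile(fn):
        tree = ast.parse(textwrap.dedent(inspect.getsource(fn)))
        tree.body[0].decorator_list = []  # don't re-apply this decorator
        # ...transform the tree here...
        # Shift line numbers back to roughly where fn lives in its file.
        ast.increment_lineno(tree, fn.__code__.co_firstlineno - 1)
        code = compile(tree, inspect.getfile(fn), "exec")
        ns = {}
        exec(code, fn.__globals__, ns)
        return ns[fn.__name__]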
sega_sai 1 day ago [-]
I hope this is the future for Python: write in pure Python, but, if needed, the code can be JIT compiled (or not) into something faster (provided your code does not rely too much on low-level Python stuff, such as double-underscore (__) methods).
sevensor 1 day ago [-]
The misleading thing about this approach is that the decorated function is no longer Python at all. It’s another language with Python syntax. Which is a neat way to get a parser for free, but it’s going to set up expectations about semantics that are bound to be incorrect.
richard_shelton 1 day ago [-]
It's funny how close it is to the title of my talk "Python already has a frontend for your compiler": https://github.com/true-grue/python-dsls
dragonwriter 1 day ago [-]
> The misleading thing about this approach is that the decorated function is no longer Python at all. It’s another language with Python syntax.

Some of the current implementations are either strict Python subsets or very nearly so, but, yes, DSLs are distinct languages; that's what the L stands for.

almostgotcaught 1 day ago [-]
Worse: you can't set a breakpoint inside the "jitted" function (and maybe can't print, either...).
cl3misch 1 day ago [-]
In JAX you can.
almostgotcaught 1 day ago [-]
> In JAX you can

I'm always shocked when people in this line of work either take things at face value or just lie by omission/imprecision. This is not Python breakpoints - it's a wholeass other system they had to re-roll that emulates Python breakpoints:

> Unlike pdb, you will not be able to step through the execution, but you are allowed to resume it.

You think it works with PyCharm/VS Code/any other tooling? Spoiler alert: of course not.

So no, in JAX you cannot.

ByteMe95 22 hours ago [-]
You can in CSP nodes https://github.com/Point72/csp
Scene_Cast2 1 day ago [-]
If this is the way forward, I'd love a better developer experience.

I'm currently wrangling with torch.compile (flex attention doesn't like bias terms - see issues 145869 and 144511 if curious). As much as I love speed and optimization, JIT compilation (at least the PyTorch flavor) currently has weird return types that break VS Code's IntelliSense, weird stack traces, limitations around printing, random issues like the bias term, and gaps such as not supporting sparsity.

Speaking of PyTorch JIT workflows - what's a nice way of having a flag to turn off compilation?

singhrac 1 day ago [-]
You can decorate a function with @torch.compiler.disable() to disable a specific function (and anything further down the call stack).
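
For a global kill switch, one common pattern (a sketch; the MYAPP_COMPILE flag name is made up) is to make the decorator itself conditional:

    import os

    import torch

    COMPILE = os.environ.get("MYAPP_COMPILE", "1") == "1"

    def maybe_compile(fn):
        # Apply torch.compile only when the flag is on; otherwise
        # return the plain eager function unchanged.
        return torch.compile(fn) if COMPILE else fn

    @maybe_compile
    def step(x):
        return torch.relu(x) * 2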
dleeftink 1 day ago [-]
> random issues like the bias term

I'd like to know more about this!

dec0dedab0de 1 day ago [-]
I would rather have a JIT built right into the reference implementation. A JIT would help way more programs than removing the GIL, but everyone thinks the GIL affects them for some reason.
masklinn 1 day ago [-]
They serve different use cases. The function-JIT pattern is a manual opt-in, so it can be much more aggressive, supporting only restricted language patterns rather than... the entire thing. These JITs can also use bespoke annotations for better codegen: e.g. you can tell numba to codegen only for i32 -> i32 -> i32, rather than lazily codegen for any a -> b -> c.
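
With numba, both modes look roughly like this (a sketch; the signature syntax is numba's, the function names are made up):

    from numba import int32, jit

    # Lazy: numba compiles a new specialization per argument-type combo,
    # the first time the function is called with those types.
    @jit(nopython=True)
    def add(a, b):
        return a + b

    # Eager: exactly one i32 -> i32 -> i32 version, compiled at import time.
    @jit(int32(int32, int32), nopython=True)
    def add_i32(a, b):
        return a + b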
t-vi 14 hours ago [-]
If you like JIT wrappers and Python interpreters:

In Thunder [1], a PyTorch-to-Python JIT compiler for optimizing DL models, we maintain a bytecode interpreter covering 3.10-3.12 (and 3.13 soon) for our JIT. It allows running Python code while redirecting arbitrary function calls and operations, but is quite a bit slower than CPython.

While the bytecode changes between versions (and sometimes it is a back-and-forth, for example in the call handling), it is entirely workable once you embrace that there will be differences between Python versions.

One large change was the new zero-cost (in the happy path) exception handling, but I can totally see why Python made that change, moving away from setting up try-block frames.

I will say that I was happy not to support Python <= 3.9, as the changes were a lot more involved there (the bytecode format itself, etc.).

Of course, working on this also means knowing otherwise useless Python trivia afterwards. One of my favorites is how this works:

  l = [1, 2, 3]
  l[-1] += l.pop()
  print(l)
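
(Spoiler: l[-1] is read first, yielding 3; l.pop() then also returns 3 and shrinks the list to [1, 2]; finally 3 + 3 = 6 is stored back through the saved list/index pair, so it prints [1, 6].)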
1. https://github.com/Lightning-AI/lightning-thunder/
est 1 day ago [-]
Aha, anyone remember psyco from the python 2.x era?

https://psyco.sourceforge.net/psycoguide/node8.html

p.s. The psyco guys then went in another direction, called PyPy.

p.p.s. There's also a PyPy-based decorator, but it limits the parameters to basic types. Sadly I forgot the GitHub repo.

rented_mule 1 day ago [-]
Yes! I used psyco in production for a while, and the transition to psyco resulted in some interesting learning...

I had written a C extension to speed up an is-point-in-polygon function that was called multiple times during every mouse move in a Python-based graphical application (the pure Python version of the function resulted in too much lag on early 2000s laptops). When psyco came out, I tried moving the function back to Python to see how close its speed came to the C extension. I was shocked to see that psyco was significantly faster.

How could it be faster? In the C extension, I specified everything as doubles, because I called it with doubles in some places. It turns out the vast majority of the calls were working with ints. The C extension, as written, had to cast those ints to doubles and then do everything in floating point, even though none of the calculations would have had fractional parts. Psyco did specialization - it produced a version of the function for every type signature it was called with. So it had an all-int version and an all-double version. Psyco's all-int version was much faster than the all-double version I'd written in C, and it was what was being called 95% of the time.

If I'd spent enough time profiling, I could have made two C functions and split my calls between them. But psyco discovered this for me. As an experiment, I tried making two versions of the C functions. Unsurprisingly, that was faster than psyco. I shipped the psyco version as it was more than fast enough, and much simpler to maintain.

My conclusion... JITs have more information to use for optimization than compilers do (e.g., runtime data types, runtime execution environment, etc.), so they have the potential to produce faster code than compilers in some cases if they exploit that added information through techniques like specialization.

svilen_dobrev 1 day ago [-]
It was very good. But there was a win only if one could avoid the overhead of function calls, which are the slowest thing in Python - a magnitude-plus more than anything else (well, apart from exception throwing, which is even slower, but rare). In my case, the speedup in calculations was lost to the slowdown from function calls, so I ended up grouping and jamming most calculations into one-big-func(TM).. and then that got psyco-assembly-ized.

Btw, function calls are still the slowest thing: somedict.get(x) is almost 2x slower than (x in somedict and somedict[x]). In my attempt last year at optimizing the transit-protocol lib [0], bundling/copying a few one-line calls into one 5-line func was the biggest win - and, of course, not doing some things at all.

[0] https://github.com/svilendobrev/transit-python3/blob/master/...
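
That dict claim is easy to check for yourself (a micro-benchmark sketch; exact ratios vary by CPython version):

    import timeit

    setup = "d = {i: i for i in range(1000)}"
    # The claim above: the attribute lookup plus method call
    # is what makes .get() the slower of the two.
    print(timeit.timeit("d.get(500)", setup=setup))
    print(timeit.timeit("500 in d and d[500]", setup=setup))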

novosel 1 day ago [-]
simlevesque 1 day ago [-]
Yes
hardmath123 1 day ago [-]
Here's another blog post on this theme! https://github.com/kach/art-deco/blob/main/art-deco.ipynb
svilen_dobrev 1 day ago [-]
I needed to make the "tracing" part - which I called "explain" - without JITs, back in 2007-8, using a combination of operator overloading, variable-"declaring", and bytecode hacks [0].

Applied over a set of (constrained) functions, the result was a well-formed trace of which var got what value, because of what expression, over what values.

So can these ~hacks be avoided now - or not really?

[0] https://github.com/svilendobrev/svd_util/blob/master/tracer....

agumonkey 1 day ago [-]
Less complex libraries do Python AST analysis wrapped in decorators, to ensure purity of code for instance.

It's a fun foot-in-the-door trick to start getting into compilation.

nurettin 18 hours ago [-]
The problem with these JIT/kernel implementations is the extra step of copying data over to the function. Some implementations like numba work around that by exposing raw pointers to Python for numeric arrays, but I don't know of any framework which can do that with Python objects.

What we need is a framework which can mirror Python code in JIT code without effort.

    @jit.share
    class ComplexClass:
        def __init__(self, complex_parameters: SharedParameters):
            self.complex_parameters = complex_parameters
            self.x = 42
            self.y = 42
        
        def function_using_python_specific_libraries_and_stuff(self):
            ...


    @jit
    def make_it_go_fast(obj: ComplexClass):
        ...
        # modify obj, it is mirrored in jit code, 
        # but shares its memory with the python object 
        # so any change also affects the passed python object
        # initially, this can be done with observers, 
        # but then we will need some sort of memory aligned 
        # direct read/write regions for real speed

The complexity arises when function_using_python_specific_libraries_and_stuff uses exported libraries. The JIT code has to detect their calls as wrappers for shared objects and pass them through for seamless integration, and only compile the Python-specific AST.
bjourne 1 day ago [-]
Great, but AFAICT it's not a JIT. It is using LLVM to AOT-compile Python code. Decorators are called when their respective functions are compiled, not when they are called.
dec0dedab0de 1 day ago [-]
A decorator is run at import time, and its output replaces the decorated function. The replacement function could then JIT at runtime. I don't know if that's what's happening here, but using a decorator doesn't mean it can't also be a JIT.
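
A sketch of that shape (compile_somehow is a made-up stand-in for whatever backend would do the real codegen):

    import functools

    def compile_somehow(fn):
        # Stand-in for a real backend (LLVM codegen, etc.); identity here.
        return fn

    def lazy_jit(fn):
        compiled = None

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            nonlocal compiled
            if compiled is None:
                # Deferred: compilation happens at first call, not at import.
                compiled = compile_somehow(fn)
            return compiled(*args, **kwargs)

        return wrapper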
masklinn 1 day ago [-]
FWIW numba at least supports both cases, depending on how the decorator is used:

- if you just `@jit`, it will create a megamorphic function with specialisations generated at runtime

- if you pass a signature to `@jit`, it will compile a monomorphic function during loading

eliben 1 day ago [-]
JIT is something different people define in different ways.

In this sample, when the function itself is called (not when it's decorated), analysis runs, followed by LLVM codegen and execution. The examples in the blog post are minimal, but they can be trivially extended to cache the JIT step when needed, specialize on runtime argument types or values, etc.

If this isn't JIT, I'm curious to hear what you consider to be JIT?

willseth 1 day ago [-]
Python decorators simply wrap the function with decorator-defined logic, so while, yes, that is all evaluated when the Python program is first run, whether LLVM etc. runs then vs. when the function is first called is completely up to the implementation, i.e. anyone implementing a decorator-based compiler could choose to run the compilation step at Python compile time or at runtime, or make it configurable.