I gotta admit that my smell test tells me that this is a step backwards; at least naively (I haven't looked through the code thoroughly yet), this just kind of feels like we're going back to pre-protected-memory operating systems like AmigaOS; there are reasons that we have the process boundary and the overhead associated with it.
If data is shared zero-copy across threadprocs, what's to stop Threadproc B from screwing up Threadproc A's memory? And if the answer to this is "nothing", then what does this buy over a vanilla pthread?
I'm not trying to come off as negative, I'm actually curious about the answer to this.
jer-irl 32 minutes ago [-]
Not negative at all, thanks for commenting. You're right that the answer is "nothing," and that this is a major trade-off inherent in the model. From a safety perspective, you'd need to be extremely confident that all threadprocs are well-behaved, though a memory-safe language would help some. The benefit is that you get process-like composition as separate binaries launched at separate times, with thread-like single-address space IPC.
After building this, I don't think this is necessarily a desirable trade-off, and decades of OS development certainly suggest process-isolated memory is desirable. I see this more as an experiment to see how bending those boundaries works in modern environments, rather than a practical way forward.
tombert 19 minutes ago [-]
It's certainly an interesting idea and I'm not wholly opposed to it, though I certainly wouldn't use it as a default process scheduler for an OS (not that you were suggesting that). I would be very concerned about security stuff. If there's no process boundary saying that threadproc A can't grab threadproc B's memory, there could pretty easily be unintended leakages of potentially sensitive data.
Still, I actually do think there could be an advantage to this if you know you can trust the executables, and if the executables don't share any memory; if you know that you're not sharing memory, and you're only grabbing memory with malloc and the like, then there is an argument to be made that there's no need for the extra process overhead.
furyofantares 34 minutes ago [-]
I think it is possible for B to screw up A when you share a pointer but otherwise unlikely. B doing stuff with B's memory is unlikely to screw up A.
tombert 26 minutes ago [-]
Sure, but that's true of threads as well. The advantage of these threadprocs is that there can be zero-copy sharing, which isn't necessarily bad, but if things aren't copied then B could screw up A's stuff.
If you're ok with threads keeping their own memory and not sharing, then pthreads already do that competently without any additional library. The problem with threads is the shared address space: thread B can screw up thread A's memory, juggling concurrency is hard, and you need mutexes. Processes give isolation, but at the cost of some overhead, and IPC generally requires copying.
I'm just not sure what this actually provides over vanilla pthreads. If I'm in charge of ensuring that the threadprocs don't screw with each other then I'm not sure this buys me anything.
lstodd 27 minutes ago [-]
Judging by the description, it's exactly like AmigaOS, Windows 3.x or early Apple. Or MS-DOS things like DESQview.
I fail to see the point: if you control the code and need performance so badly that an occasional copy bites, you might as well link it all into a single address space without these hoop-jumps. It won't function as separate processes once it's modified to rely on passing pointers around anyway.
And if you don't, good luck chasing memory corruption issues.
Besides, what's wrong with shared memory?
philipwhiuk 4 minutes ago [-]
This is basically 'De-ASLR' is it not?
L8D 35 minutes ago [-]
It is lovely to see experimentation like this going on. I think this has a lot of potential with something like Haskell's green threading model, basically taking it up a notch and doing threading across pthreads instead of being restricted to the VM threads. Surely, if this can be well fleshed-out, it could be implemented at the compiler-level or library-level so existing multi-threaded Haskell software can switch over to something like this to squeeze out even more performance. I'm not an expert here, though, so ¯\_(ツ)_/¯ take my words with a grain of salt.
fwsgonzo 32 minutes ago [-]
I actually just published a paper about something like this, which I implemented in both libriscv and TinyKVM, called "Inter-Process Remote Execution (IPRE): Low-latency IPC/RPC using merged address spaces".
Here is the abstract:
This paper introduces Inter-Process Remote Execution (IPRE), whose primary function is enabling gated persistence for per-request isolation architectures with microsecond-latency access to persistent services. IPRE eliminates scheduler dependency for descheduled processes by allowing a virtual machine to directly and safely call and execute functions in a remote virtual machine's address space. Unlike prior approaches requiring hardware modifications (dIPC) or kernel changes (XPC), IPRE works with standard virtualization primitives, making it immediately deployable on commodity systems. We present two implementations: libriscv (12-14ns overhead, emulated execution) and TinyKVM (2-4us overhead, native execution). Both eliminate data serialization through address-space merging. Under realistic scheduler contention from schbench workloads (50-100% CPU utilization), IPRE maintains stable tail latency (p99<5us), while a state-of-the-art lock-free IPC framework shows 1,463× p99 degradation (4.1us to 6ms) when all CPU cores are saturated. IPRE thus enables architectural patterns (per-request isolation, fine-grained microservices) that incur millisecond-scale tail latency in busy multi-tenant systems using traditional IPC.
Bottom line: If you're doing synchronous calls to a remote party, IPRE wouldn't require any scheduler mediation. The same applies to your repo.
Passing allocator-less structures to the remote is probably a landmine waiting to happen. If you structure both parties to use custom allocators, at least for the remote calls, you can track and even steal allocations (using a shared memory area). With IPRE there is extra risk of stale pointers, because the remote part is removed from the caller's memory after it completes. The paper will explain all the details, but for example, since we control the VMM we can close the remote session if anything bad happens.
(This paper is not out yet, but it should be very soon)
The best part about this kind of architecture, which you immediately mention, is the ability to completely avoid serialization. Passing a complex struct by reference and being able to use the data as-is is a big benefit. It breaks down when you try to do this with something like Deno, unfortunately. But you could do Deno <-> C++, for example.
For libriscv the implementation is simpler: just loan remote-looking pages temporarily so that read/write/execute works, and then let exception handling deal with abnormal disconnection. With libriscv it's also possible for the host to take over the guest's global heap allocator, which makes it possible to free something that was remotely allocated. You can divide the address space among the possible callers and one or more remotes; then, if you give the remote a std::string larger than SSO, its address reveals the source, and since each source tracks its own allocations, we know if something didn't go right.
Note that this is only a side interest for me: even though (for example) libriscv is used in large codebases, the remote RPC feature is not used at all, and hasn't been attempted. It's a Cool Idea that kinda works out, but not ready for anything high stakes.
jeffbee 28 minutes ago [-]
Looks like you forgot the URL. Interested.
fwsgonzo 19 minutes ago [-]
Best I can do is reply to an e-mail if someone asks for the paper, since it's not out yet. The e-mail ends with hotmail.
whalesalad 38 minutes ago [-]
I feel like this could unlock some huge performance gains with Python. If you want to truly "defeat the GIL" you must use processes instead of threads. This could be a middle ground.
hun3 13 minutes ago [-]
This is exactly what subinterpreters are for! Basically isolated copies of Python in the same process:
https://docs.python.org/3/library/concurrent.interpreters.ht...
If you want a higher-level interface, there is InterpreterPoolExecutor:
https://docs.python.org/3/library/concurrent.futures.html#co...
How would this really help python though? This doesn't solve the difficult problem, which is that python objects don't support parallel access by multiple threads/processes, no? Concurrent threads, yes, but only one thread can be operating on a python object at a time (I'm simplifying here for brevity).
There are already means of passing around bulk data with zero copy characteristics in python, but there's a lot of bureaucracy around it. A true solution must work with the GIL (or remove it altogether), no?
jer-irl 23 minutes ago [-]
I'm not familiar with CPython GC internals, but I think there are mechanisms for Python objects to be safely handed to C/C++ libraries and used there in parallel? Perhaps one could implement a handoff mechanism that uses those same mechanisms? Interesting idea!