NHacker Next
  • new
  • past
  • show
  • ask
  • show
  • jobs
  • submit
splice() and the ghost of set_fs() (lwn.net)
mark_undoio 700 days ago [-]
Our software[1] got broken by a kernel change a few years ago. The experience was quite interesting.

We were making use of `/proc/PID/pagemap`, which is a kernel-generated file that previously would show the physical addresses of all the pages in a process's address space. Unfortunately, with the Rowhammer exploit, exposing this information - even for one's own processes - to unprivileged users went from being harmless to a security risk.

The first we saw of the change was when newer kernels started reporting zeros for all physical addresses, unless we ran as root. We raised this the LKML, explaining that we'd been relying on this feature to implement a somewhat esoteric optimisation.

Linus replied very helpfully - the security fix trumped userspace compatibility but he could see a secure way of getting us the information we really needed, given the technique we'd described. He invited us to submit a kernel patch and gave a few hints about potential gotchas.

I did work one up but the kernel community actually jumped on it as an opportunity to do more cleanup, so I ended up just signing off on the patch they produced. It was all a remarkably smooth and efficient process.

[1] Time travel debugging - http://undo.io

khuey 700 days ago [-]
Heh, as someone who maintains another exotic debugger[1] my experience has been that the kernel development process is kind of a pain. We've had good experiences with people fixing regressions once they're discovered but getting new features into the kernel has been difficult. I think it took 10 revisions for me to get cpuid faulting into the kernel, including multiple review cycles where I was first told "change X to Y" and then in a subsequent cycle told "change Y to X".

[1] https://rr-project.org/

tanelpoder 700 days ago [-]
Yeah, for similar reasons, the /proc/PID/wchan now shows just "0" (for other users' processes) on newer kernels, unless you run as root. Same with /proc/PID/stack, but it's implemented in a different way, I can open() that file successfully, but the read() syscall on the opened file descriptor returns EACCESS error...

  $ ls -l /proc/$$/stack
  -r-------- 1 tanel tanel 0 May 26 21:52 /proc/967141/stack
  $ 
  $ cat /proc/$$/stack
  cat: /proc/967141/stack: Permission denied
  $ 
  $ sudo cat /proc/$$/stack
  [<0>] do_wait+0x1c3/0x230
  [<0>] kernel_wait4+0xaf/0x150
  [<0>] __do_sys_wait4+0x85/0x90
  [<0>] __x64_sys_wait4+0x1e/0x20
  [<0>] do_syscall_64+0x49/0xc0
  [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Edit: Adding one more comment - my impression has been that the "no userland-visible changes" promise applies to system calls - how procfs presents data as human-readable text in the /proc files has changed every now and then before (I recall the sar command showing wrong numbers after a kernel update, for example).
mark_undoio 687 days ago [-]
We've found userspace ABI does change in some surprising ways occasionally but mostly it doesn't matter to anybody.

e.g. we've seen the format of signal stack frames change in the past but nobody relies on that layout so it's OK.

/proc is another one along those lines - technically somebody could probably complain if it changes but if nobody shouts then it will just drift over time.

Havoc 700 days ago [-]
> the security fix trumped userspace compatibility

TIL

All stories about kernels describe it as an absolute that userspace is never broken, but that actually makes sense

Quekid5 700 days ago [-]
Yeah, there are rare exceptions to The Rule.
zorgmonkey 699 days ago [-]
In case anyone else is curious here is the LKML thread he mentioned https://lkml.org/lkml/2015/4/24/379 and this is the followup thread with the patches https://lkml.org/lkml/2015/6/9/804
ithkuil 699 days ago [-]
Is there a way to try out liverecorder without having to speak with a salesperson?
mark_undoio 687 days ago [-]
Apologies for the slow response - but, yes (mostly).

If you go here: https://undo.io/udb-form/

Then you do have to fill out a contact details form but you will be able to download a trial of UDB, our interactive debugger. That gets you the Time Travel functionality.

(if you're a VSCode user then you can download the extension https://marketplace.visualstudio.com/items?itemName=Undo.udb and start from there instead)

If you want the full LiveRecorder experience - additional tool, library, etc then you do have to request a demo. https://undo.io/about-us/contact/request-demo/ - Mention that you had exchanged messages with me (Mark Williamson - Architect @ Undo) and I can help from my side if you have any technical issues.

(edit: s/issues/technical issues/)

teraflop 700 days ago [-]
Even if you don't want to directly support splice() for every possible filesystem or file descriptor, I don't understand why it couldn't be "emulated" in those cases, by having the kernel just pretend it was given a series of read()/write() calls. That might not be as efficient, but surely it would be better than just breaking completely.
derefr 700 days ago [-]
The point of splice(2) is to be a fast-path; it's not in POSIX, so software that wants to guarantee portability has to write the slow path too, and only use splice(2) if it's present.

Such a blind/naive compatibility shim, would likely be much slower than whatever hand-written slow path the developer has in place. If the whole point of a call is to be a fast/efficient version of something else, then the call is breaking its semantics if it isn't more fast/efficient than the alternative.

Because of this, it's better to just make autoconf et al detect splice(2) as absent for the given use-case (and fall back to the hand-rolled portable slow path), rather than relying on a naive kernel shim.

bonzini 700 days ago [-]
You cannot in general use autoconf to detect runtime behavior.
derefr 700 days ago [-]
Many of autoconf's more fraught checks work by attempting to compile and run little C programs to see what happens. These checks enable not just static analysis (seeing what the compiler does), but also "edge-case analysis" ala a fuzzer — e.g. seeing if the runtime environment of the compilation environment is one where passing certain printf(3) format-specifiers makes it choke, etc.

So if splice(2) no longer works when the source is e.g. /dev/random, then autoconf could attempt to compile+run a program that splice(2)s from /dev/random to a known-working sink, and see whether that program SIGSEGVs or not when run; and use that to decide whether to allow an --enable-splice configure flag to be passed, vs. bailing out if such a flag is passed.

Of course, there's the implicit assumption that if you're passing such a flag, the build environment is going to be the deploy environment; or at least, the build environment's feature-set will be a subset of the deploy environment's, such that the build environment's runtime features can be used as a conservative underestimate of the deploy environment's runtime features.

This "build is a subset of deploy" is usually a sensible assumption. Build environments are controllable, while deploy environments are arbitrary; so anyone who wants to make a build that works on many different deploy environments, can just set up their build environment to have the "lowest common denominator" of the features of the systems they want to target.

(Compare and contrast: microarchitectural optimizations. Same story.)

bonzini 700 days ago [-]
> then autoconf could attempt to compile+run a program that splice(2)s from /dev/random to a known-working sink

And then the next day you, or your user, update your kernel, the result is out of date and your program crashes. That's just not how autoconf (or cmake or meson for that matter) are used.

> running autoconf checks inside a target-machine emulator

Run tests are quite rare. Older autoconf used to use printf to detect the size or alignment of a type, but for 15-20 years it has instead been doing binary search (basically "guess the number") so that only compilation tests are needed instead.

(I am a former autoconf developer and GCC build system maintainer).

rootw0rm 700 days ago [-]
I don't think they're suggesting runtime use, but rather building software after the splice(2) change. That's how I read it at least.
vlovich123 700 days ago [-]
Aside from the “where you build isn’t where you run problem”, there’s the “this syscall doesn’t work when running on filesystem X or accessing /dev/urandom”.

Not only is the autoconf solution not fixing the problem, it’s placing a massive undue burden on developers. Linus has been inconsistent here. Telemetry would have been helpful here in aiding this work (ie support it with the slow path but report the event so that distro maintainers could provide feedback on broken paths). Once you think you’ve eliminated the long tail of issues, then remove and see if anything remains broken that telemetry didn’t catch.

bonzini 700 days ago [-]
It would break if compiled on old kernel and run on new kernel. When CI is containerized the kernel might not even have anything to do with the distro that you're building on.
baisq 700 days ago [-]
Doesn't that happen often anyway?
bonzini 700 days ago [-]
Nope. You can run programs to detect some behavior, for example printing sizeof(unsigned long) or checking how some rounding is performed. Those tests of course may affect the program behavior at runtime, but what they test is still the compilation environment.
tinus_hn 699 days ago [-]
That however is not how the linux kernel guarantee of the stability of userspace APIs works.
amluto 700 days ago [-]
It used to. But this emulation worked by doing set_fs() so that the internal (outdated) read/write implementation could follow a pointer into kernel memory, and that facility is gone now.
teraflop 700 days ago [-]
Ah, I think I get it. So the problem isn't just that these drivers don't specifically support splice() -- it's that they also only support reading/writing with userspace buffers, and set_fs() was just a hack around that limitation that is no longer supported. That's kind of unfortunate.
amluto 700 days ago [-]
Correct. The modern interface supports reading and writing from a rather generic concept of a buffer, and drivers that support that are usable with splice().
700 days ago [-]
londons_explore 700 days ago [-]
Perhaps because some of the semantics of a spliced pipe aren't exactly the same as read() followed by write()?
sc68cal 700 days ago [-]
I think that leaving splice() as it was, is worse than breaking the "NEVER BREAK USERSPACE" rule, in this specific instance. It should not become a regular occurence, but I don't see a better way. They tried to stick to it, things happened, but I am optimistic that they will fix it and try better the next time.
rwj 700 days ago [-]
If rule #1 is don't break userspace, we might need a rule #0, which would be don't support insecure behaviour. In the end, people might need to balance competing concerns...
manchmalscott 700 days ago [-]
The only time I’ve had a kernel update break things was a weird Proxmox bug where if you booted with the (at the time) latest kernel, it would fail to start the first VM, and nothing you did from the UI or the command line could touch it without timing out. Rebooting to the previous kernel release and it just started working again with no other changes.

That definitely broke my assumption that kernel updates were well vetted for regressions.

bityard 700 days ago [-]
"Don't break userspace" is more like a _goal_ than a law.

The kernel is an extremely complex piece of code, the developers can't be asked to test every kernel release against every piece of userspace software on every hardware configuration. That's one of the reasons that new code and significant changes require a bunch of mailing list discussion, reviews, and signing-off.

Also, Proxmox ships with its own supported kernel, it shouldn't be a big surprise that you ran into issues while straying off the beaten path.

londons_explore 700 days ago [-]
> The kernel is an extremely complex piece of code,

With no regression tests, nearly no unit tests, and lots of bits of supported hardware that nobody on the Dev team uses day to day...

It's a miracle that quality is as high as it is.

bonzini 700 days ago [-]
There are plenty of tests, just not in Linus's tree. KVM has not one but three suites of unit and integration tests, for example, but only 60ish tests are in the kernel's tools/testing/selftests directory.
bombcar 700 days ago [-]
I suspect kernel updates aren't vetted that well in general; the "don't break user space" is a ruling from Linus about what not to do - usually with regards to the user space API (don't remove or change things that already exist).

And this article is an example of where the decision was made to break user space.

josefx 700 days ago [-]
Is it even a hard break? It seems to be more of a short lived regression while patches for filesystem drivers are still coming in to restore support and I don't think Linux ever guaranteed a stable driver API.
caslon 700 days ago [-]
> The only time I’ve had a kernel update break things

> That definitely broke my assumption that kernel updates were well vetted for regressions.

Shouldn't that be evidence for the contrary? One break in a decade or two of use isn't so bad. It's actually pretty good.

yjftsjthsd-h 700 days ago [-]
> That definitely broke my assumption that kernel updates were well vetted for regressions.

I would argue that it's also a failure by the distro; unless there was something special about your exact setup that would have made the bug not show up in testing, I would argue that proxmox should have been testing updates to catch that kind of problem before users noticed.

fomine3 700 days ago [-]
Proxmox VE provides Enterprise and No-Subscription repository. Possibly GP uses latter repo that's for testing (and free!).

Maybe this? https://forum.proxmox.com/threads/latest-update-5-13-19-4-pv...

yjftsjthsd-h 700 days ago [-]
Ah, so free users hitting bugs is the test suite:)
simlevesque 700 days ago [-]
> kernel updates were well vetted for regressions.

That seems like the kind of things that is easier said than done.

867-5309 700 days ago [-]
I've just had that with the most recent Proxmox update and also had no option but to regress
Fnoord 700 days ago [-]
I had the issue as well but I managed to fix it by I changing my boot parameters (exact change differs per bootloader and hardware). Now I happily boot with .15 series (up from .13).

I expect some more users of Proxmox given Broadcom taking over VMware (e.g. I'd like to merge away from ESXi to Proxmox as I don't trust Broadcom). Hopefully it does the product and community well.

i13e 700 days ago [-]
Not sure if here is the place for this but here goes. Personally I've been dealing with a regression of sorts on my laptop: it no longer suspends without being quickly woken up again on the main kernel release (works fine on LTS). Does anyone know if and where I could file a bug report?
euank 700 days ago [-]
> Does anyone know if and where I could file a bug report?

The linux kernel does take bug reports: https://docs.kernel.org/admin-guide/reporting-issues.html

However, that bug probably isn't specific enough as you've described it, unless you can find the commit causing it (such as via a git bisect https://docs.kernel.org/admin-guide/bug-bisect.html), or come up with a clearer repro.

Alternatively, if you're seeing the issue on a distro-maintained kernel (such as on fedora/ubuntu/debian with their kernel package), reporting the issue to the distro maintainers may be more appropriate.

700 days ago [-]
asddubs 699 days ago [-]
i've had an ubuntu employee fix a kernel bug that rendered my machine unbootable on newer kernels. it was a newer thinkpad so there were quite a few affected users, it'd probably be lesser priority with more obscure hardware. Still, just to support your statement with some concrete experience, reporting to your distro can definitely help
corbet 700 days ago [-]
I've seen similar things. Messing with /proc/acpi/wakeup (https://unix.stackexchange.com/questions/698185/laptop-wakes...) made the problem go away for me. I've been meaning to try to bisect it down and find the real problem, but haven't had the time to do that yet...
Beltalowda 700 days ago [-]
I have some issues as well; I heard that [1] was the cause for some breakage, which should be fixed in 5.18, but I didn't verify as I haven't had much reason to use suspend lately.

[1]: https://github.com/torvalds/linux/commit/dfbba2518aac4204203...

csdvrx 700 days ago [-]
> my laptop: it no longer suspends without being quickly woken up again on the main kernel release

Long story short: there are a lot of things in your laptop generating interrupts.

Some of them you want to ignore, because it would cause the behavior you describe (preventing sleep)

Some of them you really want to listen to closely, because if it's an interrupt generated by say brushing on the power button, not listening to it means not waking up from sleep (traditional example: GPE96 on dells, cf for example https://bugzilla.kernel.org/show_bug.cgi?id=102281) until a longer press on the button generates an ACPI event or a powerup.

You can configure or finetune that with /proc/acpi/wakeup which hopefully will give you more context as to what other people have explained here.

mathstuf 700 days ago [-]
I noticed it recently too, but disconnecting my Bluetooth headset first allowed it to suspend properly (it seems like the wifi chip kept it awake before making like a yo-yo back and forth).
zonotope 700 days ago [-]
Me 3 and it's maddening. I've posted this [1] on the Arch Linux forums but haven't found a fix. I've seen it mentioned around the web that it's a known bug, but I have yet to find a bug report

1: https://bbs.archlinux.org/viewtopic.php?id=276064

csdvrx 700 days ago [-]
It's an open question whether Bluetooth should prevent sleep, or let sleep go through.

On a well done "modern suspend/suspend to idle", I use disconnected modern sleep + Windows Media Player on Windows 11 to listen to songs on my bluetooth headphones for a cost of about 1% of the battery per hour (as measured and plotted with powercfg) which can come down to about half of it, 0.5%/h when not using Bluetooth.

I wouldn't want Bluetooth to prevent sleep (a 1% drop per hour is better than not sleeping!)

I also wouldn't want sleep to prevent me from using my Bluetooth headset (the difference between 0.5% and 1% is significant, but it doesn't matter much in practice if my computer can be usable in the morning)

This is one of the rare examples of "modern suspend" delivering on its promises, and blowing the good old ACPI S3 away: I never saw a drop of battery <5% on S3 suspend-to-ram unless it also involved S4 in a hybrid sleep of "ACPI S3 suspend-to-ram then S4 suspend-to-disk after a while or when I run out of power whichever comes first"!

mathstuf 699 days ago [-]
Be that as it may, acting like a yo-yo is certainly the wrong direction on a "desired behavior" scale. Especially when the headset itself is announcing the up and down events in synthesized speech interrupting what else is going on.
__turbobrew__ 700 days ago [-]
I have noticed this as well
mc4ndr3 700 days ago [-]
with_resource(mutex file handle etc etc *o, func callback) is a safer pattern, for this reason. Thank you, Python

(with_resource is a generic example. Search for such capabilities in your library.)

Computers exist to automate. Let the computer remember to close resources upon scope end.

Go's defer is also helpful in this regard.

viraptor 700 days ago [-]
That was my first thought - would they leave the set_fs solution in place if C had a native solution for resource blocks / RAII+drop.
ars 700 days ago [-]
It's not that hard to do in C. Make a macro that takes a function, it runs:

  set_fs(KERN)
  *function_pointer()
  set_fs(USER)
There did it before, so they knew about that option, and if they didn't do it, they must have had a reason.
simias 700 days ago [-]
It's true, unfortunately the lack of lambdas in C makes this pattern cumbersome. You could achieve it with a macro instead, I suppose.
Matthias247 700 days ago [-]
There is a GCC extension (__attribute__ cleanup) [1] gcc which they probably could use if they wanted, given that the kernel relies primarily on gcc (and clang supports gcc extensions). Probably not allowed through the kernel coding style?

[1] https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attribute...

shaded-enmity 700 days ago [-]
A simple policy that both set_fs() calls need to happen within the same function body with corresponding CI test based on AST/DWARF inspection would have also prevented it. Do you really want to rely on stack unwinding/destructors for security sensitive code when stack is usually the first thing that gets controlled by the attacker? Exception handling (SEH) on Windows is an exploitation vector of it's own.
viraptor 700 days ago [-]
I'm talking about the general idea not specific implementation. Having something happen at function/block exit doesn't mean a runtime configurable behaviour. If you don't have exceptions, it's pretty easy to statically compile that behaviour and guarantee it rather than rely on checks.
shaded-enmity 700 days ago [-]
You still need to implement a CI rule so that I don't just call `set_fs()` without using `preferred_syntax(set_fs)`, don't you?
viraptor 700 days ago [-]
We're talking a theoretical language here, so... maybe? It could be also that set_fs is usable only in the preferred_syntax mode.
SolarNet 700 days ago [-]
Except that things are weird in a kernel, one cannot guarantee that language behavior is enforceable. Even with a RAII/drop style solution there is a possibility that kernel weirdness will prevent it from running correctly.

The issue with forgetting function calls like this are the edge cases, and the kernel has a lot more of those.

jcranmer 700 days ago [-]
The problem isn't that it relies on developer foresight to make the second call to set_fs. The problem (if you look through other LWN articles on the subject) is that the set_fs calls can be prone to not getting restored if the kernel code is interrupted.
lamontcg 700 days ago [-]
Don't like the lecturing tone of this article.

This seems entirely successful to me and retired crappy and insecure behavior.

The idea that you can never break anything means that some historical mistakes can never get fixed.

The someone complains about how the language / operating system is a bolted on mess of nonsense and people go chasing after the next hot thing which has the benefit of just being newer and being able to get away with breaking changes because people on the bleeding edge put up with it.

The linux kernel is probably striking the right balance here, and I doubt that this one breakage is a sign of the decay and downfall of the Empire.

raggi 700 days ago [-]
This could have interesting implications for Go programs. Go optimizes io.Copy down to splice if dst and src both support the necessary bits. Unlike C code where you could easily grep this out, this happens transparently, and could even by influenced by user input in various cases. The tail could be quite long for bugs that show up from dropping support. I'm sure some dynlangs may have similar optimizations.

edit: checked the current implementation in go, and it's only doing that for unix and tcp sockets, so it'll be fine.

freebuju 700 days ago [-]
Not too long ago I was constantly facing this kernel bug [0] that would kill audio whenever system memory would be under moderately high load. It's fixed now but it's kind of insane that the kernel would mess around with components as critical as sound for desktop users. [0]https://gitlab.freedesktop.org/pipewire/pipewire/-/issues/22...
kccqzy 700 days ago [-]
The "don't break user space" guideline should just be called "break user space, and if someone complains, fix it afterwards."
700 days ago [-]
encryptluks2 700 days ago [-]
Systemd had a regression recently that has broken booting a secondary encrypted LUKS device.
teddyh 693 days ago [-]
What does “booting a secondary […] device” mean? Either you boot from a device, and that makes it the primary device, or you boot off another device, in which case this is now the primary device.

Or did you mean unlocking an encrypted device after unlocking and booting on the primary device? What does systemd have to do with that? Unless you want to type in the password manually every boot, you store the password (for the secondary device) on the primary device, add the password file to /etc/crypttab and be done with it, right?

loeg 700 days ago [-]
> But it is true that this type of episode makes the kernel's "no regressions" rule look a bit more like just a guideline. It does not take too many of those to create breakage to the project's reputation that is hard to splice back together.
theteapot 700 days ago [-]
Shouldn't they have left it for 6.0?
mhitza 700 days ago [-]
Kernel doesn't adhere to a SemVer policy
theteapot 700 days ago [-]
Maybe they should. Maybe, they, should.
SAI_Peregrinus 700 days ago [-]
Naw. Kernel version number increases. Can't be more than 255 in any field, except the EXTRAVERSION which is a string. Kernel version numbering is exposed in the userspace ABI, so changing it to have meaning would be a breaking change for no substantial benefit. This[1] LWN article is a good overview of the history. Effectively the major gets incremented when the minor gets to 20 and Linus runs out of fingers & toes to keep track.

[1] https://lwn.net/Articles/871989/

mrlonglong 700 days ago [-]
How many digits does a penguin have?
ufo 700 days ago [-]
I believe it's 14. (Birds feet have 4 toes and their wings have 3 fingers)
theteapot 700 days ago [-]
> Can't be more than 255 in any field, except the EXTRAVERSION which is a string. Kernel version numbering is exposed in the userspace ABI, so changing it to have meaning would be a breaking change.

Sounds like they should leave this change for 6.0 then.

SAI_Peregrinus 699 days ago [-]
Why? The kernel version number has no meaning, other than to provide an ordering of releases.
theteapot 696 days ago [-]
In 6.0 it will.
charcircuit 700 days ago [-]
>Can't be more than 255 in any field

They can just change the max. It's not that big of a deal.

SAI_Peregrinus 699 days ago [-]
No, it's one byte per field (except EXTRAVERSION, which is a string). That's part of the userspace ABI, so it can't change without extremely good reason (critical security issue that can't be mitigated in another way).
ranger207 700 days ago [-]
As an aside, if you enjoy this article please consider subscribing to LWN at [0]. Normally articles are only available for free a week after they're posted; one of the perks of subscribing is being able to send a link like the one posted here so non-subscribers can view it before the week has passed. LWN's main source of revenue as I understand it is subscriptions, and the work they do is definitely worth it

[0] https://lwn.net/subscribe/Info

rtpg 700 days ago [-]
yep, LWN is a great resource, highly recommend for any developer (even/especially those who aren't linux people, they have lots of articles regarding other programming language development and the like).

I used to complain that they didn't have recurring subscriptions, but I believe they have fixed that! Just sign up yearly, at the worst you are a patron to good journalism that is not really happening elsewhere.

Just reading an article or two a week will give you loads of insight into what is going on in the tools you have everyday

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact
Rendered at 19:45:40 GMT+0000 (Coordinated Universal Time) with Vercel.