As noted elsewhere on this site, I use QEmacs as my text editor. I have an extensive patch tweaking QEmacs to my liking.
But it had this one bug. If I loaded up a file greater than a megabyte in size, it would catch SIGBUS when I tried to save it to disk again and the file would be truncated to zero length.
Well, I finally got sufficiently annoyed at this to work out firstly how to reproduce it, and then stick a load of wires into it with gdb to figure out what was going on.
As with a lot of editors, when QEmacs gets a file which is bigger than some threshold value (say, #define MIN_MMAP_SIZE (1024*1024), for example), instead of loading the file into core, it will call mmap() on the file. Don't ask me about the specifics of the what and the why; I've been at this for hours and am heading for sleep deprivation delirium. Finding this knob (via the tried and tested method of Reading The Source Code) allowed me to construct a test case: raising the threshold at which files are mmap()ed to e.g. 512MB reliably made the SIGBUS crashes go away when working with the 44MB CSV file that had spurred this investigation.
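For anyone who hasn't met this pattern before, here is a minimal sketch of the size-threshold idea. It is not QEmacs's actual loader: load_file() and struct loaded_file are names I made up for illustration, and the PROT_READ / MAP_SHARED flags are an assumption about how such a mapping would typically be set up.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define MIN_MMAP_SIZE (1024 * 1024) /* threshold value quoted above */

/* Hypothetical result type: the bytes, plus how they were obtained. */
struct loaded_file {
    void  *data;
    size_t len;
    int    mapped; /* 1: data is an mmap() region, 0: data is a malloc() copy */
};

/* Returns 0 on success, -1 on failure. */
int load_file(const char *path, struct loaded_file *out)
{
    struct stat st;
    int fd;

    assert(path != NULL && out != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }

    out->data = NULL;
    out->len = (size_t)st.st_size;
    out->mapped = 0;

    if (st.st_size >= MIN_MMAP_SIZE) {
        /* Big file: map it. Pages are read from the file on demand,
         * so the mapping stays tied to the file on disk. */
        void *p = mmap(NULL, out->len, PROT_READ, MAP_SHARED, fd, 0);
        close(fd); /* the mapping survives closing the descriptor */
        if (p == MAP_FAILED)
            return -1;
        out->data = p;
        out->mapped = 1;
        return 0;
    }

    /* Small file: copy the contents into core. */
    out->data = malloc(out->len ? out->len : 1);
    if (out->data == NULL) {
        close(fd);
        return -1;
    }
    for (size_t done = 0; done < out->len; ) {
        ssize_t n = read(fd, (char *)out->data + done, out->len - done);
        if (n <= 0) {
            free(out->data);
            out->data = NULL;
            close(fd);
            return -1;
        }
        done += (size_t)n;
    }
    close(fd);
    return 0;
}
```

The caller has to remember which branch was taken: a mapped buffer is released with munmap(), a copied one with free(). More importantly for this story, the mapped branch keeps depending on the file's current size long after the load has finished.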
So I knew I was dealing with files mapped into core instead of those whose contents were copied into core.
Next, I recompiled QEmacs with debugging symbols and wired it up to that really nice gdb trace script I've mentioned in a previous post, and then waited ten minutes for the editor to start, as the rendering function does a lot of stuff behind the scenes which takes ages when it's running under ptrace(). I aborted this line of attack, and switched back to running this under plain gdb and letting gdb drop me to a prompt when the unhandled SIGBUS fired.
As it turns out, the SIGBUS was generated down the file saving code path. It was falling over inside a memcpy() call. A denizen of a freenode channel I lurk in suggested trying AddressSanitizer (aka '-fsanitize=address -lasan'); this looked rather cool, but I don't have access to a 64-bit Intel machine with enough RAM to run that monster, and compiling a 32-bit version and trying that only told me that there was a pointer that somebody didn't like somewhere. Brilliant.
I recompiled a few more times, adding -fno-builtin (because gcc likes to do things its own way and override your libc occasionally) and linking statically against musl so I could get a sane looking stack trace; this was, ultimately, futile. Then came printf() debugging. Dumping the arguments of functions in the problematic call path did not help (though I ended up with some very nicely formatted hexadecimal 64-bit addresses -- "%.16llx" works wonders on pointers).
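A minimal sketch of that pointer formatting, for the record. format_ptr() is a hypothetical helper rather than anything from the QEmacs source; the cast through uintptr_t to unsigned long long is there because "%llx" expects an unsigned long long, not a pointer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: write p into buf as 16 zero-padded hex digits.
 * Returns what snprintf() returns, i.e. the number of characters the
 * full output needs (16 for any pointer of 64 bits or fewer). */
int format_ptr(char *buf, size_t size, const void *p)
{
    assert(buf != NULL && size > 0);
    /* The .16 precision pads with leading zeros to at least 16 digits. */
    return snprintf(buf, size, "%.16llx",
                    (unsigned long long)(uintptr_t)p);
}
```

Called as format_ptr(buf, sizeof buf, ptr) with a 17-byte buffer, this gives fixed-width addresses that line up nicely in a trace dumped to stderr.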
My last resort was to throw "linux mmap memcpy" at a search engine; I came up with a StackOverflow answer about calling ftruncate() on shared objects post-mmap(). I then looked at the code closely, and it turns out that ten lines above the call to the function which contains the grumpy memcpy() call is a call to open() with O_TRUNC among the arguments. Some further search engine usage and Reading Of The Source Code reveals that the underlying file backing the mmap() is truncated; it turns out that if you then try to copy out of the region mapped into core, the kernel gets a bit irritated (and for a damn good reason too).
I fixed the whole bloody thing with a single ftruncate() call which sets the output file size to the calculated length of the region to write.