QEmacs shenanigans


As noted elsewhere on this site, I use QEmacs as my text editor. I have an extensive patch tweaking QEmacs to my liking.

But it had this one bug. If I loaded up a file greater than a megabyte in size, it would die with SIGBUS when I tried to save it back to disk, and the file would be truncated to zero length.

Well, I finally got sufficiently annoyed at this to work out firstly how to reproduce it, and then stick a load of wires into it with gdb to figure out what was going on.

mmap()

As with a lot of editors, when QEmacs is handed a file bigger than some threshold (say, #define MIN_MMAP_SIZE (1024*1024) for example), it calls mmap() on the file instead of loading it into core. Don't ask me about the specifics of the what and the why; I've been at this for hours and am heading for sleep deprivation delirium. Finding this knob (via the tried and tested method of Reading The Source Code) allowed me to construct a test case: raising the mmap() threshold to e.g. 512MB reliably made the SIGBUS crashes go away when working with the 44MB CSV file that had spurred this investigation.
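To make the threshold idea concrete, here is a minimal sketch of a loader that picks between mmap() and a plain read() based on file size. It is not lifted from QEmacs: the load_file() function and the loaded_file struct are hypothetical names, and the PROT_READ / MAP_SHARED flags are my assumption about a typical read-only mapping. Only the MIN_MMAP_SIZE value comes from the text above.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define MIN_MMAP_SIZE (1024 * 1024)

/* Hypothetical container for a loaded file; not a QEmacs type. */
struct loaded_file {
    unsigned char *data;
    size_t size;
    int is_mapped;      /* 1 if data points into an mmap()ed region */
};

/* Load path into out. Returns 0 on success, -1 on failure. */
int load_file(const char *path, struct loaded_file *out)
{
    struct stat st;
    int fd;

    assert(path != NULL && out != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }

    out->size = (size_t)st.st_size;
    out->is_mapped = st.st_size >= MIN_MMAP_SIZE;

    if (out->is_mapped) {
        /* Big file: the bytes stay on disk and are paged in on demand. */
        void *p = mmap(NULL, out->size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            close(fd);
            return -1;
        }
        out->data = p;
    } else {
        /* Small file: copy the contents into heap memory. */
        size_t got = 0;
        out->data = malloc(out->size ? out->size : 1);
        if (out->data == NULL) {
            close(fd);
            return -1;
        }
        while (got < out->size) {
            ssize_t n = read(fd, out->data + got, out->size - got);
            if (n <= 0) {
                free(out->data);
                close(fd);
                return -1;
            }
            got += (size_t)n;
        }
    }

    close(fd);  /* a mapping stays valid after its descriptor is closed */
    return 0;
}
```

The point to take away is that in the mapped case the buffer is only a window onto the file: whatever later happens to the file happens to the buffer too.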

So I knew I was dealing with files mapped into core instead of those whose contents were copied into core.

memcpy()

Next, I recompiled QEmacs with debugging symbols and wired it up to that really nice gdb trace script I've mentioned in a previous post and then waited ten minutes for the editor to start, as the rendering function does a lot of stuff behind the scenes which takes ages when it's running under ptrace(). I aborted this line of attack, and switched back to running this under plain gdb and letting gdb drop me to a prompt when the unhandled SIGBUS fired.

As it turns out, the SIGBUS was generated down the file saving code path. It was falling over inside a memcpy() call. A denizen on a freenode channel I lurk in suggested trying AddressSanitizer (aka '-fsanitize=address -lasan'); this looked rather cool, however I don't have access to a 64-bit Intel machine with enough RAM to run that monster, and compiling a 32-bit version and trying that only told me that there was a pointer that somebody didn't like somewhere. Brilliant.

I recompiled a few more times, adding -fno-builtin (because gcc likes to do things its own way and override your libc occasionally) and linking statically against musl so I could get a sane looking stack trace; this was, ultimately, futile.

Next trick: printf() debugging. Dumping the arguments of functions in the problematic call path did not help (though I ended up with some very nicely formatted hexadecimal 64-bit addresses -- "%.16llx" works wonders on pointers).
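For the record, the pointer formatting looks roughly like the sketch below. The format_ptr() helper is a hypothetical name of mine; the only thing taken from the text is the "%.16llx" format string. The cast through uintptr_t is there because "%llx" expects an unsigned long long rather than a pointer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: write p into buf as 16 zero-padded hex digits.
 * Returns what snprintf() returns (the number of characters needed). */
int format_ptr(char *buf, size_t len, const void *p)
{
    assert(buf != NULL && len > 0);
    /* The precision of 16 forces at least 16 digits, padded with zeros. */
    return snprintf(buf, len, "%.16llx",
                    (unsigned long long)(uintptr_t)p);
}
```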

Ask question, get answer

My last resort was to throw "linux mmap memcpy" at a search engine; I came up with a StackOverflow answer about calling ftruncate() on shared objects post-mmap(). I then looked at the code closely, and it turns out that ten lines above the call to the function which contains the grumpy memcpy() call is a call to open() with O_TRUNC among its flags. Some further search engine usage and Reading Of The Source Code revealed that this truncates the underlying file backing the mmap(); if you then try to copy out of the region mapped into core, the pages no longer have any file behind them, and the kernel gets a bit irritated (and for a damn good reason too).
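The failure is easy to reproduce outside the editor. The sketch below is a standalone demo, not QEmacs code: signal_after_truncate() and DEMO_SIZE are hypothetical names, and the crash is done in a forked child so the caller survives to report which signal killed it. It maps a file, reopens the same path with O_TRUNC, and then calls memcpy() on the mapping.

```c
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define DEMO_SIZE 4096

/* Returns the signal that killed the child, 0 if the child exited
 * normally, or -1 if the setup failed. The file at path is created
 * and then removed. */
int signal_after_truncate(const char *path)
{
    char page[DEMO_SIZE];
    int fd, status;
    pid_t pid;

    assert(path != NULL);

    /* Create a file with some real content to map. */
    memset(page, 'x', sizeof page);
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, page, sizeof page) != (ssize_t)sizeof page) {
        close(fd);
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        char buf[DEMO_SIZE];
        char *map = mmap(NULL, DEMO_SIZE, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED)
            _exit(2);
        /* The "save" step: reopening with O_TRUNC empties the very
         * file that backs the mapping. */
        if (open(path, O_WRONLY | O_TRUNC) < 0)
            _exit(3);
        /* The mapped range now lies entirely past end-of-file. */
        memcpy(buf, map, sizeof buf);
        _exit(buf[0] == 'x' ? 0 : 1);
    }

    close(fd);
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    unlink(path);
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

On Linux I would expect this to return SIGBUS, matching what gdb reported for the editor.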

I fixed the whole bloody thing with a single ftruncate() call which sets the output file size to the calculated length of the region to write.
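The shape of that call, as a minimal sketch: open_output_sized() is a hypothetical helper rather than the actual patch, and total_len stands in for the calculated length mentioned above. One thing worth keeping in mind is that the bytes ftruncate() adds when it extends a file read back as zeros, so this only guarantees that the file covers the requested length again, not what is in it.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open path for writing, truncating it as before,
 * then immediately size it to total_len bytes.
 * Returns the descriptor, or -1 on failure. */
int open_output_sized(const char *path, off_t total_len)
{
    int fd;

    assert(path != NULL && total_len >= 0);

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* Restore the file's length so it spans total_len bytes again. */
    if (ftruncate(fd, total_len) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```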

Really?


