Sometimes well-intended optimizations can get in the way, and this is one of them.

In general, when a file is read from or written to, the operating system keeps the data around in case it's needed again. This is usually a good idea, except when it's not. Sometimes you know that you won't need the cached data again any time soon, and it's better for the operating system to forget about it instead of letting the cache balloon and push out more useful, albeit less recently used, data. For example, a backup reads a large amount of data from the disk and then writes it out to (typically) another medium. It's unlikely that much of the just-backed-up data is going to be needed again, at least until the next backup window. Unfortunately, on Linux the interface to control this is fairly limited. One can write an LD_PRELOAD library to tell Linux to uncache files as they are closed:

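#if 0
gcc $0 -shared -o $0.so -ldl -fPIC && LD_PRELOAD=$0.so exec "$@"
#endif
/* Run this file as a shell script: the lines above compile it into a
   shared library (named $0.so here) and preload it into the given command. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdlib.h>

/* Replacement for close(2): drop the file's cached pages, then really close it. */
int close(int fd)
{
    static int (*close_func)(int) = NULL;

    /* Offset 0 and length 0 cover the whole file; this is only advice, so errors are ignored. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    if (close_func == NULL)
        close_func = dlsym(RTLD_NEXT, "close");   /* look up the real close() */

    return close_func(fd);
}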

posix_fadvise(2) says the following about POSIX_FADV_DONTNEED:

POSIX_FADV_DONTNEED attempts to free cached pages associated with the specified region. This is useful, for example, while streaming large files. A program may periodically request the kernel to free cached data that has already been used, so that more useful cached pages are not discarded instead.
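
The periodic freeing that the manpage describes could look roughly like the sketch below, for a program you are able to modify. The helper name copy_dropping_cache and the 8 MiB window are made up for illustration, and the code assumes that in_fd is a regular file positioned at offset 0.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Illustrative value: how many bytes to consume before dropping them. */
#define DROP_WINDOW (8L * 1024 * 1024)

/*
 * Hypothetical helper: copy in_fd to out_fd and periodically ask the
 * kernel to forget the input pages that have already been consumed.
 * Assumes in_fd is positioned at offset 0.
 * Returns the number of bytes copied, or -1 on a read/write error.
 */
long long copy_dropping_cache(int in_fd, int out_fd)
{
    char buf[64 * 1024];
    off_t done = 0, dropped = 0;
    ssize_t n;

    while ((n = read(in_fd, buf, sizeof buf)) > 0) {
        ssize_t off = 0;
        while (off < n) {                      /* cope with short writes */
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0)
                return -1;
            off += w;
        }
        done += n;
        if (done - dropped >= DROP_WINDOW) {
            /* Only advice, so the return value is deliberately ignored. */
            posix_fadvise(in_fd, dropped, done - dropped, POSIX_FADV_DONTNEED);
            dropped = done;
        }
    }
    if (n < 0)
        return -1;

    /* A length of 0 means "until the end of the file". */
    posix_fadvise(in_fd, dropped, 0, POSIX_FADV_DONTNEED);
    return (long long)done;
}
```

This only takes care of the pages that were read. The pages written to out_fd are a different story, since they are still dirty when the loop moves on.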

Additionally, a length of 0 indicates that the advice applies until the end of the file. This is insufficient, however. If the program exits without closing its files, they will never be uncached. And when streaming large files, they won't be uncached until after the cache is already polluted. Those issues can be fixed by wrapping read, write, pread, pwrite and related functions, or by directly modifying the program when it's possible to do so. Still, the interface is unsatisfactory. For example, if a file was already cached before it's backed up, you don't want the backup process to uncache it. There's also no good way to use posix_fadvise to uncache written data, given what the manpage says:

Pages that have not yet been written out will be unaffected, so if the application wishes to guarantee that pages will be released, it should call fsync(2) or fdatasync(2) first.
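
Taken literally, what the manpage asks for would be something like the following sketch. The helper name flush_and_drop is hypothetical; the two calls it makes are the ones the manpage names.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: force written data to disk, then ask the kernel
 * to drop the given range from the cache (len 0 means "to end of file").
 * Returns 0 on success, or an error number on failure.
 */
int flush_and_drop(int fd, off_t offset, off_t len)
{
    /* Blocks until the file's dirty data, not just this range, is on disk. */
    if (fdatasync(fd) < 0)
        return errno;

    /* posix_fadvise returns the error number directly; it does not set errno. */
    return posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
}
```

A writer would have to call this after every chunk to keep the cache clean, and each call stalls it until the disk has caught up.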

Calling fsync(2) or fdatasync(2) while doing large streaming writes is simply unacceptable, since each call stalls the writer until the data has reached the disk. Instead, it would be much better if the advice could be given at file open time and lasted for the lifetime of the file descriptor. The kernel could then take the hint and not cache accesses made through that file descriptor. O_DIRECT can achieve this, but sometimes you don't want the extra semantics and requirements that it brings.
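
To give an idea of those requirements, here is a minimal sketch of writing a buffer through O_DIRECT. The names write_direct, round_up_align and DIRECT_ALIGN are made up, and the 4096-byte alignment is an assumption: the real constraint depends on the device and the filesystem, and some filesystems reject O_DIRECT altogether.

```c
#define _GNU_SOURCE          /* needed for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed alignment; the actual requirement varies by device and filesystem. */
#define DIRECT_ALIGN 4096

/* Round len up to the next multiple of DIRECT_ALIGN. */
size_t round_up_align(size_t len)
{
    return (len + DIRECT_ALIGN - 1) / DIRECT_ALIGN * DIRECT_ALIGN;
}

/*
 * Hypothetical helper: write len bytes to path while bypassing the page
 * cache. The data is copied into an aligned buffer, zero-padded to a
 * multiple of DIRECT_ALIGN, and the file is truncated back to len.
 * Returns 0 on success, -1 on failure (including a short write, or a
 * filesystem that does not support O_DIRECT).
 */
int write_direct(const char *path, const void *data, size_t len)
{
    size_t padded = round_up_align(len);
    void *buf = NULL;
    int ok = -1;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0600);

    if (fd < 0)
        return -1;

    /* Buffer address and transfer size both have to be aligned. */
    if (posix_memalign(&buf, DIRECT_ALIGN, padded ? padded : DIRECT_ALIGN) == 0) {
        memset(buf, 0, padded);
        if (len > 0)
            memcpy(buf, data, len);
        if (write(fd, buf, padded) == (ssize_t)padded &&
            ftruncate(fd, (off_t)len) == 0)
            ok = 0;
        free(buf);
    }
    close(fd);
    return ok;
}
```

The extra copy, the padding and the truncate are exactly the kind of baggage that a simple "don't cache this descriptor" hint would avoid.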

by khc on Tue Nov 8 00:41:41 2011
Tags: computer