Subsections


4. Jockey internals

Image interposer-structure

The above picture shows a high-level structure of Jockey. Internally, Jockey consists of three parts: interposerinterposer, Jockey proper, and fakethreadfakethread. The arrows show the flow of control when a system call is issued by the target application.

Interposer (the yellow boxes in the picture) is a low-level library that intercepts system calls and other CPU instructions with nondeterministic effects (rdtsc on x86). Interposer can be used for purposes other than execution record and replay. Section  describes its uses in more detail.

For simplicity and convenience, Jockey proper and interposer are combined into a single shared object file, libjockey.so. Simply loading libjockey.so on startup causes Jockey to take control of the process. As explained in Section 3, this can be done by starting the program via jockey frontend or manually linking libjockey.so to the program.

The fakethread library is used in conjunction with libjockey.so to deterministically record and replay pthread-based applications. It is actually named libpthread.so since it is a drop-in replacement for the standard pthreads. Section 3.7 describes it in more detail.

Linux's dynamic linker (ld.so) invokes Jockey and Interposer's initialization routines immediately after libjockey.so is loaded, before the target program starts execution.

Note: Jockey puts its initialization code in the .init_array section. The code placed in this section is called automatically when the object file is loaded, with arguments argc, argv, and environ.

Jockey's initialization routine performs the following tasks.

  1. The interposer scans the text sections in the process and discovers the occurrences of the system call instruction (int $0x80). Jockey rewrites each instruction intercepts its invocation. Jockey currently intercepts 80 types of Linux system calls, including time, recvfrom, and select. Jockey logs the values generated by these calls during recording, and reads the value from the log during replay.

  2. The interposer does the same for CPU instructions with nondeterministic effects. It currently patches only rdtsc, the x86 instruction for reading the CPU's timestamp counter. It is used, for example, as a pseudo random-number generator in libc.

  3. Jockey examines the program and excludedprogram options ( Section 3). If the program is not to be traced, it transfers the control to the target at this point. Jockey's system call and rdtsc handlers just forward the instructions to the original functions hereafter.

  4. Otherwise, Jockey checkpoints the process state just before returning control to the target program. In the replay mode, Jockey simply loads the checkpoint. Checkpointing is needed to ensure that the target sees the same set of environment variables and command line parameters during both record and replay. We discuss checkpointing in more detail in Section 3.2.

  5. Jockey transfers control to the target program. From this moment, Jockey becomes active only when the target executes a system call or a nondeterministic CPU instruction such as rdtsc.

The next section describes the first two steps in more detail. Section 4.2 discusses Jockey's efforts to segregate itself from the target program to avoid unnecessary interference. Section 3.2 describes Jockey's checkpointing function (step (3)), along with the challenges we had to overcome.


4.1 Instruction patching

Figure 4.1: Example of intercepting gettimeofday
\includegraphics{gettimeofday}

Jockey intercepts Linux system calls by patching the occurrences of system call instructions (int $0x80).

As an example, the above picture shows how the time system call is recorded and replayed. (a) shows an abridged version of time in libc.so. When Jockey starts, it reads the process's virtual memory structure by reading /proc/fd/maps, reading the ELF header of each loaded object file, and finds where the CPU instructions could be found. It then finds the occurrences of Linux system call instructions. If it manages to find the pattern mov $XX, %eax; int $0x80, then it overwrites writes a jmp instruction over these two, as shown in (b). It also fills nop up to the next instruction boundary.

Note: In some cases, int $0x80 appears without preceding mov $XX. In that case, we overwrite int $0x80 and the following instructions. We cross fingers so that there will be no instructions that branches into the overwritten part.

In part (e), Jockey also generates code that can be used to invoke the original implementation, using the hardwired knowledge of system call numbers and parameters.

Figure 4.1 (c) shows the pseudocode of the entry point for the new implementation of time. As we discuss further in Section 4.2, this code is dynamically generated so that Jockey can intercept system calls on a separate stack and avoid corrupting target memory.

Finally, part (d) shows newtime, Jockey's implementation of time. While recording, newtime calls the original implementation (e) and logs the returned value. While replaying, it simply supplies the value from the log without actually executing the system call.

One might wonder why libjockey.so does not just provide a new implementation of a system call with the same name--for example, defining a function named time to intercept the invocation of the time system call. In fact, LD_PRELOAD is often used for that purpose. There are a few reasons this does not work.

Task (2) is done in the same way. It finds the occurrences of context-dependent CPU instructions and overwrites the jmp instruction on them to intercept their invocations.

Jockey needs to parse CPU instructions during steps (1) and (2), not a trivial task given x86's complex instruction encoding. It uses a pidgin table-based parser for common instructions and consults libdisasm [#!libdisasm!#], an open-source x86 disassembler library, for uncommon cases. A few tables that map opcodes and operands to their instruction length let us quickly parse more than 80% of all instruction occurrences. Even using this technique, however, parsing all CPU instructions in a typical Linux program takes about 350 milliseconds on a 1.5GHz Pentium-M processor, which may be too slow for some users. To reduce the startup latency further, Jockey also employs cached-mode instruction patching. Here, after finishing the slow mode, Jockey writes the locations of nondeterministic instructions found for each shared object in file ~/.jockey-sig. When the program starts the next time, it just reads ~/.jockey-sig without scanning the process's virtual memory, unless the timestamp of the object file has changed.

Jockey's instruction-patching approach is simpler and faster than full-program binary translation, employed by ATOM [#!srivastava1994!#] or Valgrind [#!valgrind!#]. It needs to patch only a few bytes at the beginning of system calls and nondeterministic CPU instructions; the rest of the target program executes natively. Indeed, as we show in Section , Jockey's performance overhead is negligible for CPU-intensive programs.


4.2 Segregating resource usage

Jockey and the target application run as part of the same process and share all resources. Jockey must segregate the use of resources to prevent Jockey from unnecessarily changing the target's behavior, and to minimize the chance of a misbehaving target program breaking Jockey. This section discusses Jockey's treatment of three types of shared resources: heap, stack, and file descriptors.


4.2.1 Heap

Jockey cannot use standard libc functions, such as malloc or sbrk, to manage its internal data. Doing so increases the likelihood of a misbehaving target program breaking Jockey. Moreover, it changes the memory layout of the target process between record and replay. It would thus become impossible to replay invalid memory accesses correctly, e.g., accessing free'ed memory, which is one of the common programming errors.

Instead, Jockey stores all its internal data in a mmapped region at a fixed virtual address that is unlikely to be accessed accidentally by the target program. This address is by default set to 0x63000000, but it could be changed via an environment variable if the target needs to access this address legitimately. We use an internal malloc-like library to carve the memory out to individual data structures and build a custom C++ STL memory allocator on top of it. Thus, the Jockey code has full access to STL features, including maps and dynamic vectors. This design has considerably simplified the development of Jockey.

One restriction is that Jockey cannot make internal calls to libc functions that use malloc. Examples include high-level I/O functions (fopen, std::fstream) and DNS resolvers (gethostbyname).


4.2.2 Stack

Jockey also segregates the use of stack. This is necessary to replay a program that improperly accesses data beyond the stack pointer (e.g., accessing an on-stack array with a negative index). Figure 4.1 (d) illustrates how this is done. In the first few instructions after it intercepts the call to gettimeofday, Jockey saves the stack pointer to an internal variable, switches the stack to an internal buffer, copies the parameters to gettimeofday (a 4-byte pointer) from the old to the new stack, and calls newgettimeofday. Once the new implementation returns, Jockey restores the stack pointer. This allows for deterministic replay of even a buggy program because Jockey never uses the target's stack.

This stack-switching process must be done without touching any CPU register other than the stack pointer. For this purpose, all the data structures involved here are allocated statically. This makes Jockey non-reentrant, but it is not an issue because Jockey does not support multi-threading.


4.2.3 File descriptors

Jockey must perform its own file accesses occasionally, for example, when opening a log file or dumping a checkpoint. Because Jockey and the target process share the same file-descriptor table, Jockey must ensure that its file operations do not alter the descriptor allocation scheme seen by the target. To achieve this goal, Jockey moves file descriptors it internally opens to a fixed range not likely to be used by the target (430$\sim$439).

An alternative approach would be to create an indirection table that translates file descriptors between the target program and the operating system kernel. We chose the former approach for two reasons. First, the former approach simplifies implementation, especially for system calls such as select. Second, the absence of descriptor indirection allows the user to inspect the process's state more transparently, for example, by using the system call tracer (strace).

Gdb (debugger) poses another problem. When starting the target process, gdb opens a few extra file descriptors in addition to the usual stdin, stdout, and stderr. Thus, if an execution is recorded under a normal shell and then replayed under gdb, the files opened by the target processes will be assigned different descriptors, which make replaying divergent.4.1 We solve this problem by having Jockey open dummy files for descriptors 0 to 9 before starting the target program (it leaves descriptors inherited from the parent process untouched). Assuming that gdb opens at most 10 descriptors when it starts the target, we can ensure that the target has the same set of files opened upon record and replay.


4.3 Checkpointing

Jockey allows process state to be checkpointed automatically. Checkpointing serves two purposes. First, it allows the developer to time-travel through the history of execution quickly. Second, it bounds log-space consumption, because log records older than the oldest checkpoint can be deleted from disk.

Following the technique pioneered by libckpt [#!plank1995!#] and Flashback [#!srinivasan2004!#], Jockey first forks the target process. It then dumps the state of the child, while letting the parent continue running. Jockey reads the file /proc/N/maps (N is the process ID) to obtain the virtual memory mappings of the process and dumps only those sections that are mapped privately and read-write. To restore a checkpoint, for each section recorded in the checkpoint file, Jockey unmaps the memory region if it is already occupied, and either restores the contents from the checkpoint file or remaps the file.

We discuss two particular problems we faced, both related to dynamic linking.


4.3.1 Preventing brain damage to the dynamic linker

One of the challenges of checkpoint restoration is that Jockey needs to overwrite the memory that is potentially used by the restoration code itself. The process would crash if restoration is done naively. Here, two types of memory regions need to be taken care of: Jockey's internal heap ( Section ) and the heap used by the dynamic linker. For example, Jockey must execute the read system call to load checkpoint contents. If the call to read happens to be the first ever made by the target application or Jockey, then the dynamic linker is invoked to resolve the symbol ``read'', which involves modifying the linker's heap.

Jockey handles its internal heap by excluding it from checkpointing, but the dynamic linker poses a particular challenge--we cannot know a priori where the memory used by the dynamic linker is (the linker performs an anonymous mmap of its heap memory; all anonymous-mmapped sections look the same to Jockey). We resolve this issue by eagerly linking all libc functions that are called during snapshot restoration, by making dummy calls to functions such as open and read before it restores any checkpoint.


4.3.2 Exec shield

Exec-shield is a facility found in some Linux kernels (e.g., Red Hat, Fedora Core) to thwart buffer-overflow attacks [#!execshield!#]. One of its features is randomization of the loading addresses of shared-object files. This feature breaks Jockey because Jockey needs to keep data structures that are specific to the process's memory layout. We currently require that this feature be disabled by doing the below on machine boot.

echo 0 >/proc/sys/kernel/exec-shield


4.4 Handling signals

\includegraphics{signal}

Signals, especially those that happen asynchronously (for example, SIGALRM, SIGINT) present a special challenge, because they need to be delivered at exactly the same point in the execution during record and replay. We handle them in a way similar to [#!stodolsky1993!#]. Each signal delivery is first intercepted by Jockey. Jockey's signal handler simply records the parameters to the signal (signal number and the CPU register values) and finishes. At the end of the Jockey's handler for a system call or rdtsc CPU instruction, Jockey checks if a signal was intercepted in the past. If so, it logs the signal (so that it can be replayed) and calls the target-defined signal handler. This way, we convert asynchronous signals to synchronous upcalls that only happen immediately after a system call.

This technique may distort program behavior when the target program runs without issuing a system call (or executing nondeterministic CPU instructions) for a long period and receives signals in the meantime. However, our primary targets, I/O-oriented programs, usually do not suffer from this problem.


4.5 Reducing logging overhead for I/O system calls

Jockey employs two different types of logging techniques, depending on the types of system calls, to reduce the log-space overhead.

System calls such as read and write can operate on both types of files. We intercept calls to functions that create file descriptors--e.g., open, socket, and accept--remember the type of each descriptor, and dispatch based on the descriptor type. File descriptors inherited from the parent process (e.g., stdin) are always redo-logged.

Various studies have shown that majority of I/Os to regular files are reads, and that most of the write traffic is actually appends [#!ousterhout1985!#,#!vogels1999!#]. For these common cases, our design allows Jockey to only log the type and the offset of the requests, not the actual contents. Thus, it drastically reduces the logging overhead for file I/O system calls.

The downside of the undo-based logging is that the user cannot modify the files accessed by the target program between record and replay. So far, we have not found this to be a significant burden.


4.6 Handling memory-mapped I/Os

Updates to memory mapped files are handled using user-space memory-protection mechanisms. Jockey intercepts calls to the mmap system call. For each file requested to be mapped read-write in a shared mode (i.e., MAP_SHARED),4.2 Jockey makes the mapped region read-only, and takes a page fault (SIGSEGV signal) after the first write access to each page in the region. In the SIGSEGV handler, Jockey logs the current page contents ( Section 4.5), makes the page writable, and returns the control to the target. A similar approach is adopted by Flashback [#!srinivasan2004!#], albeit using a kernel extension.

These memory pages are made read-only again just before checkpointing, so that Jockey can restore the contents of the file at the moment of each checkpoint.



Footnotes

...4.1
In fact, this problem is not just specific to gdb. It happens whenever the target process inherits more than the standard number of file descriptors from the parent.
...MAP_SHARED),4.2
Accesses to a private mapping (MAP_PRIVATE) need not be intercepted, because private mapping is essentially a heap memory with particular initial contents.