The above picture shows a high-level structure of Jockey. Internally, Jockey consists of three parts: interposerinterposer, Jockey proper, and fakethreadfakethread. The arrows show the flow of control when a system call is issued by the target application.
(the yellow boxes in the picture)
is a low-level
library that intercepts system calls and other CPU instructions with
nondeterministic effects (
rdtsc on x86).
Interposer can be used
for purposes other than execution record and replay.
For simplicity and convenience, Jockey proper and interposer are combined into a single shared object file, libjockey.so. Simply loading libjockey.so on startup causes Jockey to take control of the process. As explained in Section 3, this can be done by starting the program via jockey frontend or manually linking libjockey.so to the program.
The fakethread library is used in conjunction with libjockey.so to deterministically record and replay pthread-based applications. It is actually named libpthread.so since it is a drop-in replacement for the standard pthreads. Section 3.7 describes it in more detail.
Linux's dynamic linker (ld.so) invokes Jockey and Interposer's initialization routines immediately after libjockey.so is loaded, before the target program starts execution.
.init_arraysection. The code placed in this section is called automatically when the object file is loaded, with arguments
Jockey's initialization routine performs the following tasks.
int $0x80). Jockey rewrites each instruction intercepts its invocation. Jockey currently intercepts 80 types of Linux system calls, including time, recvfrom, and select. Jockey logs the values generated by these calls during recording, and reads the value from the log during replay.
rdtsc, the x86 instruction for reading the CPU's timestamp counter. It is used, for example, as a pseudo random-number generator in libc.
excludedprogramoptions ( Section 3). If the program is not to be traced, it transfers the control to the target at this point. Jockey's system call and
rdtschandlers just forward the instructions to the original functions hereafter.
The next section describes the first two steps in more detail. Section 4.2 discusses Jockey's efforts to segregate itself from the target program to avoid unnecessary interference. Section 3.2 describes Jockey's checkpointing function (step (3)), along with the challenges we had to overcome.
Jockey intercepts Linux system calls by patching the occurrences of
system call instructions (
As an example, the above picture shows how the time system
call is recorded and replayed. (a) shows an abridged version of
time in libc.so. When Jockey starts, it reads the
process's virtual memory structure by reading /proc/fd/maps,
reading the ELF header of each loaded object file, and finds where the
CPU instructions could be found. It then finds the occurrences of
Linux system call instructions. If it manages to find the pattern
mov $XX, %eax; int $0x80, then it
overwrites writes a
jmp instruction over these two,
as shown in (b). It also fills
nop up to the next instruction
int $0x80appears without preceding
mov $XX. In that case, we overwrite
int $0x80and the following instructions. We cross fingers so that there will be no instructions that branches into the overwritten part.
In part (e), Jockey also generates code that can be used to invoke the original implementation, using the hardwired knowledge of system call numbers and parameters.
Figure 4.1 (c) shows the pseudocode of the entry point for the new implementation of time. As we discuss further in Section 4.2, this code is dynamically generated so that Jockey can intercept system calls on a separate stack and avoid corrupting target memory.
Finally, part (d) shows newtime, Jockey's implementation of time. While recording, newtime calls the original implementation (e) and logs the returned value. While replaying, it simply supplies the value from the log without actually executing the system call.
One might wonder why libjockey.so does not just provide a new
implementation of a system call with the same name--for example,
defining a function named
time to intercept the invocation of
time system call. In fact, LD_PRELOAD is often
used for that purpose. There are a few reasons this does not work.
libcor the dynamic linker--for example, a call to read made by getc. These internal calls are always resolved to invoke the definition inside
libc, and they cannot be overridden merely by redefining using LD_PRELOAD.
libcmay invoke a system call
foonot necessarily by invoking the function with the same name. For example, some versions of libc inlines the system call instruction in other functions. In other cases, there are multiple internal version of the system call (for example, the original call and non-cancellable call) with unpredictable (and often non-public) names.
Task (2) is done in the same way. It finds the occurrences of
context-dependent CPU instructions and overwrites the
instruction on them to intercept their invocations.
Jockey needs to parse CPU instructions during steps (1) and (2), not a trivial task given x86's complex instruction encoding. It uses a pidgin table-based parser for common instructions and consults libdisasm [#!libdisasm!#], an open-source x86 disassembler library, for uncommon cases. A few tables that map opcodes and operands to their instruction length let us quickly parse more than 80% of all instruction occurrences. Even using this technique, however, parsing all CPU instructions in a typical Linux program takes about 350 milliseconds on a 1.5GHz Pentium-M processor, which may be too slow for some users. To reduce the startup latency further, Jockey also employs cached-mode instruction patching. Here, after finishing the slow mode, Jockey writes the locations of nondeterministic instructions found for each shared object in file ~/.jockey-sig. When the program starts the next time, it just reads ~/.jockey-sig without scanning the process's virtual memory, unless the timestamp of the object file has changed.
Jockey's instruction-patching approach is simpler and faster than
full-program binary translation, employed by
ATOM [#!srivastava1994!#] or Valgrind [#!valgrind!#]. It
needs to patch only a few bytes
at the beginning of system calls and nondeterministic CPU instructions;
the rest of the target program executes natively.
Indeed, as we show
Jockey and the target application run as part of the same process and share all resources. Jockey must segregate the use of resources to prevent Jockey from unnecessarily changing the target's behavior, and to minimize the chance of a misbehaving target program breaking Jockey. This section discusses Jockey's treatment of three types of shared resources: heap, stack, and file descriptors.
Jockey cannot use standard libc functions, such as malloc or sbrk, to manage its internal data. Doing so increases the likelihood of a misbehaving target program breaking Jockey. Moreover, it changes the memory layout of the target process between record and replay. It would thus become impossible to replay invalid memory accesses correctly, e.g., accessing free'ed memory, which is one of the common programming errors.
Instead, Jockey stores all its internal data in a mmapped region at a fixed virtual address that is unlikely to be accessed accidentally by the target program. This address is by default set to 0x63000000, but it could be changed via an environment variable if the target needs to access this address legitimately. We use an internal malloc-like library to carve the memory out to individual data structures and build a custom C++ STL memory allocator on top of it. Thus, the Jockey code has full access to STL features, including maps and dynamic vectors. This design has considerably simplified the development of Jockey.
One restriction is that Jockey cannot make internal calls to libc
functions that use malloc. Examples include
high-level I/O functions (fopen,
DNS resolvers (gethostbyname).
Jockey also segregates the use of stack. This is necessary to replay a program that improperly accesses data beyond the stack pointer (e.g., accessing an on-stack array with a negative index). Figure 4.1 (d) illustrates how this is done. In the first few instructions after it intercepts the call to gettimeofday, Jockey saves the stack pointer to an internal variable, switches the stack to an internal buffer, copies the parameters to gettimeofday (a 4-byte pointer) from the old to the new stack, and calls newgettimeofday. Once the new implementation returns, Jockey restores the stack pointer. This allows for deterministic replay of even a buggy program because Jockey never uses the target's stack.
This stack-switching process must be done without touching any CPU register other than the stack pointer. For this purpose, all the data structures involved here are allocated statically. This makes Jockey non-reentrant, but it is not an issue because Jockey does not support multi-threading.
Jockey must perform its own file accesses occasionally, for example, when opening a log file or dumping a checkpoint. Because Jockey and the target process share the same file-descriptor table, Jockey must ensure that its file operations do not alter the descriptor allocation scheme seen by the target. To achieve this goal, Jockey moves file descriptors it internally opens to a fixed range not likely to be used by the target (430439).
An alternative approach would be to create an indirection table that
translates file descriptors between the target program and the
operating system kernel. We chose the former approach for two reasons.
First, the former approach simplifies implementation, especially for
system calls such as select. Second, the absence of descriptor
allows the user to inspect the process's state more transparently, for
example, by using the system call tracer (
Gdb (debugger) poses another problem. When starting the target process, gdb opens a few extra file descriptors in addition to the usual stdin, stdout, and stderr. Thus, if an execution is recorded under a normal shell and then replayed under gdb, the files opened by the target processes will be assigned different descriptors, which make replaying divergent.4.1 We solve this problem by having Jockey open dummy files for descriptors 0 to 9 before starting the target program (it leaves descriptors inherited from the parent process untouched). Assuming that gdb opens at most 10 descriptors when it starts the target, we can ensure that the target has the same set of files opened upon record and replay.
Jockey allows process state to be checkpointed automatically. Checkpointing serves two purposes. First, it allows the developer to time-travel through the history of execution quickly. Second, it bounds log-space consumption, because log records older than the oldest checkpoint can be deleted from disk.
Following the technique pioneered by libckpt [#!plank1995!#] and Flashback [#!srinivasan2004!#], Jockey first forks the target process. It then dumps the state of the child, while letting the parent continue running. Jockey reads the file /proc/N/maps (N is the process ID) to obtain the virtual memory mappings of the process and dumps only those sections that are mapped privately and read-write. To restore a checkpoint, for each section recorded in the checkpoint file, Jockey unmaps the memory region if it is already occupied, and either restores the contents from the checkpoint file or remaps the file.
We discuss two particular problems we faced, both related to dynamic linking.
One of the challenges of checkpoint restoration is that Jockey needs to
overwrite the memory that is potentially used by the restoration code
itself. The process would crash if restoration is done naively. Here, two
types of memory regions need to be taken care of: Jockey's internal
heap ( Section
Jockey handles its internal heap by excluding it from checkpointing, but the dynamic linker poses a particular challenge--we cannot know a priori where the memory used by the dynamic linker is (the linker performs an anonymous mmap of its heap memory; all anonymous-mmapped sections look the same to Jockey). We resolve this issue by eagerly linking all libc functions that are called during snapshot restoration, by making dummy calls to functions such as open and read before it restores any checkpoint.
Exec-shield is a facility found in some Linux kernels (e.g., Red Hat, Fedora Core) to thwart buffer-overflow attacks [#!execshield!#]. One of its features is randomization of the loading addresses of shared-object files. This feature breaks Jockey because Jockey needs to keep data structures that are specific to the process's memory layout. We currently require that this feature be disabled by doing the below on machine boot.
echo 0 >/proc/sys/kernel/exec-shield
Signals, especially those that happen asynchronously (for example,
SIGALRM, SIGINT) present a special challenge, because
they need to be delivered at exactly the same point in the execution during
record and replay.
We handle them in a way similar to [#!stodolsky1993!#]. Each
signal delivery is first intercepted by Jockey. Jockey's signal
handler simply records the parameters to the signal (signal number and
the CPU register values) and finishes. At the end of the Jockey's handler
for a system call or
rdtsc CPU instruction, Jockey checks if
a signal was intercepted in the past. If so, it logs the signal (so that
it can be replayed) and calls the target-defined signal handler. This
way, we convert asynchronous signals to synchronous upcalls that
only happen immediately after a system call.
This technique may distort program behavior when the target program runs without issuing a system call (or executing nondeterministic CPU instructions) for a long period and receives signals in the meantime. However, our primary targets, I/O-oriented programs, usually do not suffer from this problem.
Jockey employs two different types of logging techniques, depending on the types of system calls, to reduce the log-space overhead.
rdtsc--Jockey performs ``redo'' logging. Jockey logs the value produced by the event during recording, as illustrated in Figure
System calls such as read and write can operate on both types of files. We intercept calls to functions that create file descriptors--e.g., open, socket, and accept--remember the type of each descriptor, and dispatch based on the descriptor type. File descriptors inherited from the parent process (e.g., stdin) are always redo-logged.
Various studies have shown that majority of I/Os to regular files are reads, and that most of the write traffic is actually appends [#!ousterhout1985!#,#!vogels1999!#]. For these common cases, our design allows Jockey to only log the type and the offset of the requests, not the actual contents. Thus, it drastically reduces the logging overhead for file I/O system calls.
The downside of the undo-based logging is that the user cannot modify the files accessed by the target program between record and replay. So far, we have not found this to be a significant burden.
Updates to memory mapped files are handled using user-space memory-protection
mechanisms. Jockey intercepts calls to the mmap system call.
For each file requested to be mapped read-write in a shared mode (i.e.,
MAP_SHARED),4.2 Jockey makes
the mapped region read-only, and takes a page fault (
signal) after the first write access to each page in the region. In
SIGSEGV handler, Jockey logs the current page contents
( Section 4.5), makes the page
writable, and returns the control to the target. A similar
approach is adopted by Flashback [#!srinivasan2004!#], albeit using a
These memory pages are made read-only again just before checkpointing, so that Jockey can restore the contents of the file at the moment of each checkpoint.
MAP_PRIVATE) need not be intercepted, because private mapping is essentially a heap memory with particular initial contents.