Tuesday, October 29, 2013

How does clock_gettime work

clock_gettime() is a function that, as its name suggests, gives the time. On x86 architectures it has a vDSO implementation. The vDSO (virtual dynamic shared object) is a small shared library that the kernel maps into the address space of every user application. It allows the kernel to export functions to userland so that userspace processes can call them without the overhead of a system call.
clock_gettime() takes two arguments: the id of the wanted clock, and a pointer to a struct timespec variable in which the result is stored. struct timespec is simply a structure that contains two fields, tv_sec for seconds and tv_nsec for nanoseconds:

struct timespec {
    __kernel_time_t tv_sec;     /* seconds */
    long    tv_nsec;    /* nanoseconds */
};
Note: This blog post focuses on the clock ids CLOCK_MONOTONIC and CLOCK_REALTIME, as these are the clocks that the LTTng tracer uses for userspace tracing to put a timestamp on recorded events.
The value returned by clock_gettime() is relative to a certain time reference, i.e. some specific event in the past. The main difference on Linux between CLOCK_MONOTONIC and CLOCK_REALTIME is this reference. CLOCK_REALTIME gives the "real time", as in the wall clock time, or the time on your watch. Its time reference is the epoch, which is defined as 1 January 1970, 00:00:00 UTC. If I call:
clock_gettime(CLOCK_REALTIME, &ts);
at the time I am writing this post, the returned values are the following:
ts.tv_sec = 1383065479, ts.tv_nsec = 750367192.
If we take the number of seconds and convert it to years (dividing it by 3600, then 24, then 365.25), we get about 43.83. This means that about 43.83 years have elapsed between the epoch and the moment I called clock_gettime(CLOCK_REALTIME, &ts). This also means that if I manually change the clock (or the date) of my system, the change is reflected in the value returned by clock_gettime(CLOCK_REALTIME, &ts). Note that this is also true for time changes made by NTP. Thus, the time given by the CLOCK_REALTIME clock is not monotonic: it does not necessarily increase over time, and can jump backwards as well as forwards.
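To make the arithmetic above concrete, here is a minimal sketch of the same conversion in C. The helper names seconds_to_years() and years_since_epoch() are mine, chosen for illustration; only clock_gettime() and CLOCK_REALTIME come from the system headers.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Convert a number of seconds to years of 365.25 days, using the same
 * divisions as in the text: by 3600, then 24, then 365.25. */
double seconds_to_years(double seconds)
{
    return seconds / 3600.0 / 24.0 / 365.25;
}

/* Years elapsed since the epoch according to CLOCK_REALTIME.
 * Returns a negative value if clock_gettime() fails. */
double years_since_epoch(void)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_REALTIME, &ts) != 0)
        return -1.0;
    return seconds_to_years((double)ts.tv_sec + (double)ts.tv_nsec / 1e9);
}
```

Feeding the tv_sec value quoted above (1383065479) to seconds_to_years() gives roughly 43.8, and years_since_epoch() keeps growing as long as nobody sets the system clock back.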

This helps us introduce the other clock id, CLOCK_MONOTONIC. This clock is, as you could have guessed, updated in a monotonic fashion. In other words, consecutive reads of this clock never go backwards, even if the clock of my system is changed. On Linux, its time reference is the boot time of the system. Note that this is specific to Linux and not true of all POSIX systems: POSIX only requires the reference to be an unspecified point in the past. The time returned by clock_gettime(CLOCK_MONOTONIC, &ts) is thus the elapsed time since the system boot. If I call:
clock_gettime(CLOCK_MONOTONIC, &ts);
I get the following values:
ts.tv_sec = 103941, ts.tv_nsec = 959414826
Meaning that my (Linux) system booted 103941/3600 ≈ 28.9 hours ago. We can clearly see why this time reference guarantees monotonicity. The elapsed time since boot is independent of the wall clock time. If I change the clock of my system, the value given by the CLOCK_MONOTONIC clock is still relative to the boot time, which hasn't changed.
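Because the monotonic clock never goes backwards, subtracting two readings is a safe way to measure an interval. The following is a small sketch of that idea; timespec_diff_ns() and monotonic_hours() are hypothetical helpers written for this post, not functions from any library.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Nanoseconds elapsed from *start to *end.
 * Both values must have been read from the same clock. */
int64_t timespec_diff_ns(const struct timespec *start,
                         const struct timespec *end)
{
    int64_t sec = (int64_t)end->tv_sec - (int64_t)start->tv_sec;
    int64_t nsec = (int64_t)end->tv_nsec - (int64_t)start->tv_nsec;

    return sec * 1000000000LL + nsec;
}

/* Current CLOCK_MONOTONIC value expressed in hours, as in the
 * 103941 / 3600 computation above. Returns -1.0 on failure. */
double monotonic_hours(void)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
        return -1.0;
    return ((double)ts.tv_sec + (double)ts.tv_nsec / 1e9) / 3600.0;
}
```

Reading CLOCK_MONOTONIC before and after a piece of code and passing both values to timespec_diff_ns() gives a duration that is not affected by someone changing the date in between.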

As you can see, CLOCK_MONOTONIC is better for ordering events during the lifetime of a session, whereas CLOCK_REALTIME is better when an absolute time is needed. LTTng uses the monotonic clock to assign a timestamp to the recorded events in a trace. However, since it is more useful to have an actual wall clock time, LTTng stores the difference between CLOCK_REALTIME and CLOCK_MONOTONIC at the beginning of tracing in a metadata file. When LTTng is done tracing, a conversion from boot time to absolute time can be made by adding that value to all recorded timestamps.
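The offset trick described above can be sketched in a few lines. This is only an illustration of the idea and not LTTng's actual code: the function names clock_ns(), realtime_offset_ns() and monotonic_to_realtime_ns() are made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Read a clock as a single nanosecond count. Returns -1 on failure. */
int64_t clock_ns(clockid_t id)
{
    struct timespec ts;

    if (clock_gettime(id, &ts) != 0)
        return -1;
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Offset to record once, at the beginning of a session.
 * The two clocks are read one after the other, not atomically,
 * so the offset is off by the small delay between the two reads. */
int64_t realtime_offset_ns(void)
{
    int64_t mono = clock_ns(CLOCK_MONOTONIC);
    int64_t real = clock_ns(CLOCK_REALTIME);

    return real - mono;
}

/* Convert a recorded monotonic timestamp to wall clock time. */
int64_t monotonic_to_realtime_ns(int64_t monotonic_ns, int64_t offset_ns)
{
    return monotonic_ns + offset_ns;
}
```

The events themselves only ever carry monotonic timestamps, so their order is preserved; the single stored offset is enough to display them as absolute times afterwards.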

Now let's take a look at the source code of the VDSO implementation of clock_gettime(), in file arch/x86/vdso/vclock_gettime.c from the kernel source tree:
notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
{
    int ret = VCLOCK_NONE;

    switch (clock) {
    case CLOCK_REALTIME:
        ret = do_realtime(ts);
        break;
    case CLOCK_MONOTONIC:
        ret = do_monotonic(ts);
        break;
    case CLOCK_REALTIME_COARSE:
        return do_realtime_coarse(ts);
    case CLOCK_MONOTONIC_COARSE:
        return do_monotonic_coarse(ts);
    }

    if (ret == VCLOCK_NONE)
        return vdso_fallback_gettime(clock, ts);
    return 0;
}
This code snippet simply calls the time function corresponding to the requested clock id. Assuming we asked for CLOCK_MONOTONIC, let's take a look at the do_monotonic() function, from the same file:
notrace static int do_monotonic(struct timespec *ts)
{
    unsigned long seq;
    u64 ns;
    int mode;

    ts->tv_nsec = 0;
    do {
        seq = read_seqcount_begin(&gtod->seq);
        mode = gtod->clock.vclock_mode;
        ts->tv_sec = gtod->monotonic_time_sec;
        ns = gtod->monotonic_time_snsec;
        ns += vgetsns(&mode);
        ns >>= gtod->clock.shift;
    } while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
    timespec_add_ns(ts, ns);
  
    return mode;
}

As you can see, all this function does is "fill" the ts structure that was given as a parameter with the current values of tv_sec and tv_nsec. The do-while loop is simply a synchronization scheme (a sequence counter: the read is retried if the kernel updated the data in the meantime) and can be ignored for now.
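For readers who would rather not ignore the loop, here is a rough userspace sketch of a sequence-counter scheme using C11 atomics. It is not the kernel's seqcount implementation: struct shared_time and both functions are hypothetical names, and a single writer is assumed.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the data shared with readers. */
struct shared_time {
    atomic_uint seq;        /* odd while an update is in progress */
    _Atomic uint64_t sec;
    _Atomic uint64_t nsec;
};

/* Writer side (single writer assumed): make the counter odd,
 * update the data, then make the counter even again. */
void shared_time_write(struct shared_time *st, uint64_t sec, uint64_t nsec)
{
    unsigned int s = atomic_load_explicit(&st->seq, memory_order_relaxed);

    atomic_store_explicit(&st->seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&st->sec, sec, memory_order_relaxed);
    atomic_store_explicit(&st->nsec, nsec, memory_order_relaxed);
    atomic_store_explicit(&st->seq, s + 2, memory_order_release);
}

/* Reader side: retry while an update is in progress (odd counter)
 * or if the counter changed while the data was being read. */
void shared_time_read(struct shared_time *st, uint64_t *sec, uint64_t *nsec)
{
    unsigned int before, after;

    do {
        before = atomic_load_explicit(&st->seq, memory_order_acquire);
        *sec = atomic_load_explicit(&st->sec, memory_order_relaxed);
        *nsec = atomic_load_explicit(&st->nsec, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        after = atomic_load_explicit(&st->seq, memory_order_relaxed);
    } while ((before & 1u) || before != after);
}
```

The point of this design is that readers never block the writer: they take no lock and simply read again when they detect that an update happened underneath them, which suits data that is read very often and written rarely.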
ts->tv_sec is set to gtod->monotonic_time_sec, while the nanoseconds come from gtod->monotonic_time_snsec plus the value returned by vgetsns(), for finer granularity; the sum is shifted right by gtod->clock.shift and added to ts with timespec_add_ns(). gtod is simply a structure that mirrors the actual values kept in the kernel, which userspace processes can't access. Therefore, the values in gtod have to be updated regularly. This update happens in update_vsyscall(struct timekeeper *tk), from file arch/x86/kernel/vsyscall_64.c:
void update_vsyscall(struct timekeeper *tk)
{
    struct vsyscall_gtod_data *vdata = &vsyscall_gtod_data;

    write_seqcount_begin(&vdata->seq);

    /* copy vsyscall data */
    [...]
  
    vdata->monotonic_time_sec = tk->xtime_sec      // (1)
          + tk->wall_to_monotonic.tv_sec;
    vdata->monotonic_time_snsec = tk->xtime_nsec   // (2)
          + (tk->wall_to_monotonic.tv_nsec
            << tk->shift);
    while (vdata->monotonic_time_snsec >=
          (((u64)NSEC_PER_SEC) << tk->shift)) {
        vdata->monotonic_time_snsec -=
          ((u64)NSEC_PER_SEC) << tk->shift;
        vdata->monotonic_time_sec++;
    }

    [...]

    write_seqcount_end(&vdata->seq);
}

In (1), monotonic_time_sec is set, and in (2), monotonic_time_snsec is set. These are the values that are "exported" to userland via the vsyscall_gtod_data structure. By digging a little more in the kernel source, we can get an idea of how and when this structure is updated.

Depending on the frequency of "ticks" - see CONFIG_HZ
Hardware timer interrupt (generated by the Programmable Interval Timer - PIT)
-> tick_periodic();
  -> do_timer(1);
    -> update_wall_time();
      -> timekeeping_update(tk, false);
        -> update_vsyscall(tk);

Or, (on tickless kernels - see CONFIG_NO_HZ):
smp_apic_timer_interrupt()
  -> irq_enter()
    -> tick_check_idle()
      -> tick_check_nohz()
        -> tick_nohz_update_jiffies()
          -> tick_do_update_jiffies64()
            -> do_timer(ticks) // ex: ticks = 1344
              -> update_wall_time();
                -> timekeeping_update(tk, false);
                  -> update_vsyscall(tk);

So, to sum things up: clock_gettime() gives some values that are updated regularly, plus an interpolation to give better precision for the nanoseconds value. How regularly are these values updated? Simply upon timer interrupts.
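The "stored value plus interpolation" arithmetic can be sketched as follows. This is a simplified model, not the kernel code: struct shifted_time and both functions are hypothetical, and I am assuming that the interpolated part is a count of clock cycles since the last update multiplied by a scale factor (called mult here), which does not appear in the excerpts above.

```c
#include <stdint.h>

#define NS_PER_SEC 1000000000ULL

/* Hypothetical mirror of the exported values. */
struct shifted_time {
    uint64_t sec;     /* whole seconds */
    uint64_t snsec;   /* nanoseconds, shifted left by `shift` bits */
};

/* Updater side: carry whole seconds out of snsec, like the
 * while loop in update_vsyscall(). */
void shifted_normalize(struct shifted_time *t, unsigned int shift)
{
    const uint64_t one_sec = NS_PER_SEC << shift;

    while (t->snsec >= one_sec) {
        t->snsec -= one_sec;
        t->sec++;
    }
}

/* Reader side: add the interpolated part, then drop the fractional
 * bits. cycles_delta and mult are assumptions of this sketch. */
uint64_t shifted_read_ns(const struct shifted_time *t, uint64_t cycles_delta,
                         uint32_t mult, unsigned int shift)
{
    return (t->snsec + cycles_delta * mult) >> shift;
}
```

Keeping the nanoseconds in shifted form lets the fractional part survive between updates, so the reader only loses precision at the very last step, when it shifts the sum back down.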
