Tuesday, October 29, 2013

How does clock_gettime work

clock_gettime() is a function that, as its name suggests, gives the time. On x86 architectures, it has a VDSO implementation. The VDSO (virtual dynamic shared object) is a shared memory segment mapped between the kernel and each user application; it allows the kernel to export functions to userland so that userspace processes can call them without the overhead of a system call.
clock_gettime() takes two arguments: the first is the wanted clock id, and the second is a pointer to a struct timespec variable in which the values will be stored. struct timespec is simply a structure containing two fields, tv_sec for seconds and tv_nsec for nanoseconds:

struct timespec {
    __kernel_time_t tv_sec;     /* seconds */
    long    tv_nsec;    /* nanoseconds */
};
Note: The main focus of this blog post will be the clock ids CLOCK_MONOTONIC and CLOCK_REALTIME, as these are the clocks the LTTng tracer uses for userspace tracing to put a timestamp on recorded events.
The time returned by clock_gettime() is relative to a certain time reference, i.e. some specific event in the past. The main difference on Linux between CLOCK_MONOTONIC and CLOCK_REALTIME is this reference. CLOCK_REALTIME gives the "real time", as in the wall clock time, or the time on your watch. Its time reference is the epoch, defined as January 1st, 1970. If I call:
clock_gettime(CLOCK_REALTIME, &ts);
at the time I am writing this post, the returned values are the following:
ts.tv_sec = 1383065479, ts.tv_nsec = 750367192.
If we take the number of seconds and convert it to years (dividing it by 3600, then 24, then 365.25), we get 43.82. This means that 43.82 years have elapsed between the epoch and the moment I called clock_gettime(CLOCK_REALTIME, &ts). This also means that if I manually change the clock (or the date) of my system, the change will be reflected in the value returned by clock_gettime(CLOCK_REALTIME, &ts). Note that this is also true for time changes made by NTP. Thus, the time given by the CLOCK_REALTIME clock is not monotonic: it does not necessarily increase monotonically in time, and can jump backwards as well as forwards.

This helps us introduce the other clock id, CLOCK_MONOTONIC. This clock is, as you could have guessed, updated in a monotonic fashion. In other words, consecutive reads of this clock give never-decreasing values; this clock cannot go back in time, even if the clock of my system is changed. Its time reference is the boot time of the system. Note that this is specific to Linux, not to all POSIX systems. The time returned by clock_gettime(CLOCK_MONOTONIC, &ts) is therefore the elapsed time since the system booted. If I call:
clock_gettime(CLOCK_MONOTONIC, &ts);
I get the following values:
ts.tv_sec = 103941, ts.tv_nsec = 959414826
This means that my (Linux) system booted 103941 / 3600 ≈ 28.9 hours ago. We can clearly see why this time reference guarantees monotonicity: the elapsed time since boot is independent of the wall clock time. If I change the clock of my system, the value given by the CLOCK_MONOTONIC clock is still relative to the boot time, which hasn't changed.
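
To see the two clocks side by side, here is a minimal test program (on older glibc versions, link with -lrt):

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec real, mono;

    clock_gettime(CLOCK_REALTIME, &real);   /* seconds since the epoch */
    clock_gettime(CLOCK_MONOTONIC, &mono);  /* seconds since boot (on Linux) */

    printf("CLOCK_REALTIME:  %ld.%09ld\n", (long)real.tv_sec, real.tv_nsec);
    printf("CLOCK_MONOTONIC: %ld.%09ld\n", (long)mono.tv_sec, mono.tv_nsec);
    return 0;
}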

As you can see, CLOCK_MONOTONIC is better for ordering events during the lifetime of a tracing session, whereas CLOCK_REALTIME is better when an absolute time is needed. LTTng uses the monotonic clock to assign a timestamp to the events recorded in a trace. However, since an actual wall clock time is more useful, LTTng stores the difference between CLOCK_REALTIME and CLOCK_MONOTONIC at the beginning of the tracing session in a metadata file. When tracing is done, the timestamps can be converted from boot-relative time to absolute time by adding that offset to every recorded timestamp.
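
A minimal sketch of how such an offset could be computed (an illustration only, not LTTng's actual implementation):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Illustration only -- not LTTng's actual code. The two reads are not
 * atomic, so the result is off by the delay between them. */
int main(void)
{
    struct timespec real, mono;
    int64_t offset_ns;

    clock_gettime(CLOCK_REALTIME, &real);
    clock_gettime(CLOCK_MONOTONIC, &mono);

    offset_ns = (int64_t)(real.tv_sec - mono.tv_sec) * 1000000000LL
                + (real.tv_nsec - mono.tv_nsec);

    /* adding offset_ns to a CLOCK_MONOTONIC timestamp yields an
     * approximate CLOCK_REALTIME timestamp */
    printf("offset: %lld ns\n", (long long)offset_ns);
    return 0;
}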

Now let's take a look at the source code of the VDSO implementation of clock_gettime(), in file
arch/x86/vdso/vclock_gettime.c from the kernel source tree:
notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
{
    int ret = VCLOCK_NONE;

    switch (clock) {
    case CLOCK_REALTIME:
        ret = do_realtime(ts);
        break;
    case CLOCK_MONOTONIC:
        ret = do_monotonic(ts);
        break;
    case CLOCK_REALTIME_COARSE:
        return do_realtime_coarse(ts);
    case CLOCK_MONOTONIC_COARSE:
        return do_monotonic_coarse(ts);
    }

    if (ret == VCLOCK_NONE)
        return vdso_fallback_gettime(clock, ts);
    return 0;
}
This code snippet simply calls the time function corresponding to the requested clock id. Note the last test: if the clock source cannot be read from userland (ret == VCLOCK_NONE), the vDSO falls back to a real system call through vdso_fallback_gettime().
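The fallback, quoted from the same file (it may differ slightly between kernel versions), is a plain clock_gettime system call:

notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
{
    long ret;
    asm("syscall" : "=a" (ret) :
        "0" (__NR_clock_gettime), "D" (clock), "S" (ts) : "memory");
    return ret;
}

Assuming we asked for CLOCK_MONOTONIC, let's now take a look at the do_monotonic() function, from the same file: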
notrace static int do_monotonic(struct timespec *ts)
{
    unsigned long seq;
    u64 ns;
    int mode;

    ts->tv_nsec = 0;
    do {
        /* snapshot the sequence counter before reading */
        seq = read_seqcount_begin(&gtod->seq);
        mode = gtod->clock.vclock_mode;
        /* coarse values, updated by the kernel on timer interrupts */
        ts->tv_sec = gtod->monotonic_time_sec;
        ns = gtod->monotonic_time_snsec;
        /* interpolate using the hardware clock source (e.g. the TSC) */
        ns += vgetsns(&mode);
        /* snsec and vgetsns() are in "shifted" nanoseconds */
        ns >>= gtod->clock.shift;
        /* retry if the kernel updated gtod while we were reading it */
    } while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
    timespec_add_ns(ts, ns);

    return mode;
}

As you can see, all this function does is fill the ts structure passed as a parameter with the current values of tv_sec and tv_nsec. The do-while loop is simply a synchronization scheme (a seqcount retry loop) and can be ignored for now.
ts->tv_sec is set to gtod->monotonic_time_sec, while ts->tv_nsec is set to gtod->monotonic_time_snsec plus the return value of vgetsns(), for finer granularity. gtod is a structure that mirrors, in memory readable from userland, timekeeping values that are otherwise kept inside the kernel, out of reach of userspace processes.
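For reference, the fields of vsyscall_gtod_data used here look roughly like this (trimmed; see arch/x86/include/asm/vgtod.h in the same kernel tree for the exact definition):

struct vsyscall_gtod_data {
    seqcount_t seq;                 /* protects all the fields below */

    struct {                        /* extract of the current clocksource */
        int  vclock_mode;           /* VCLOCK_TSC, VCLOCK_HPET or VCLOCK_NONE */
        u32  mult;
        u32  shift;
        /* ... */
    } clock;

    time_t monotonic_time_sec;      /* seconds since boot */
    u64    monotonic_time_snsec;    /* "shifted" nanoseconds */
    /* ... wall time and coarse clock fields ... */
};

Since these are only copies, the values in gtod have to be updated regularly. This update happens in update_vsyscall(struct timekeeper *tk), from file arch/x86/kernel/vsyscall_64.c: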
void update_vsyscall(struct timekeeper *tk)
{
    struct vsyscall_gtod_data *vdata = &vsyscall_gtod_data;

    write_seqcount_begin(&vdata->seq);

    /* copy vsyscall data */
    [...]
  
    vdata->monotonic_time_sec = tk->xtime_sec      // (1)
          + tk->wall_to_monotonic.tv_sec;
    vdata->monotonic_time_snsec = tk->xtime_nsec   // (2)
          + (tk->wall_to_monotonic.tv_nsec
            << tk->shift);
    while (vdata->monotonic_time_snsec >=
          (((u64)NSEC_PER_SEC) << tk->shift)) {
        vdata->monotonic_time_snsec -=
          ((u64)NSEC_PER_SEC) << tk->shift;
        vdata->monotonic_time_sec++;
    }

    [...]

    write_seqcount_end(&vdata->seq);
}

In (1), monotonic_time_sec is set, and in (2), monotonic_time_snsec is set. Note that monotonic_time_snsec holds "shifted" nanoseconds (nanoseconds multiplied by 2^shift), which is why do_monotonic() shifts the sum right by gtod->clock.shift before calling timespec_add_ns(). These are the values that are "exported" to userland via the vsyscall_gtod_data structure. By digging a little more into the kernel source, we can get an idea of how and when this structure is updated.

These updates are driven by timer interrupts. On a periodic-tick kernel (the tick frequency is set by CONFIG_HZ), the hardware timer interrupt (generated by the Programmable Interval Timer, the PIT) leads to:
tick_periodic();
  -> do_timer(1);
    -> update_wall_time();
      -> timekeeping_update(tk, false);
        -> update_vsyscall(tk);

Or, on tickless kernels (see CONFIG_NO_HZ):
smp_apic_timer_interrupt()
  -> irq_enter()
    -> tick_check_idle()
      -> tick_check_nohz()
        -> tick_nohz_update_jiffies()
          -> tick_do_update_jiffies64()
            -> do_timer(ticks) // ex: ticks = 1344
              -> update_wall_time();
                -> timekeeping_update(tk, false);
                  -> update_vsyscall(tk);

So, to sum things up: clock_gettime() returns values that are updated regularly, plus an interpolation for better precision on the nanoseconds value. How regularly are these values updated? Simply upon timer interrupts.

Wednesday, October 2, 2013

Debug the Linux kernel using Qemu

As I am trying to understand in detail how virtualization works, I wanted to debug the Linux kernel in a virtual machine using Qemu, to get an idea of the real control flow during the execution of Linux under Qemu/KVM. So I looked on the internet for how to get this done, but most of the posts I found didn't work for me, mainly because they used Qemu to boot a kernel image directly from the host, which for some reason I wasn't able to get working. In this post, I will use another way to debug the Linux kernel using Qemu.

The workaround is to give gdb (on the host) the same kernel used by the virtual machine you want to debug. If you have compiled your own kernel on the host, this can easily be done by downloading the sources of the same kernel version in a VM and using the same configuration file you used on your host system. Note that you don't need to be running that version on your host; you only need to provide gdb with the same vmlinux file as the one used in your VM. You could also go the other way around: get the configuration file used to compile the kernel in the VM and use it to compile the same kernel version on the host.

These are the detailed steps from scratch, using Linux kernel version 3.11.3:
On the host, download the tarball from kernel.org and extract it (you can also use git or any other method; the only important thing is to have the same version both on the host and on the guest):
host $> wget https://www.kernel.org/pub/linux/kernel/v3.x/linux-3.11.3.tar.xz
host $> tar xvf linux-3.11.3.tar.xz
host $> cd linux-3.11.3
host $> make menuconfig
host $> make -jNUMBER_OF_THREADS
Now, it is very important that you add debugging information to your kernel. Make sure you have done this by inspecting the generated .config file and checking that CONFIG_DEBUG_INFO is set to "y":
CONFIG_DEBUG_INFO=y 

At this point, you should have a file called vmlinux in the current directory. Now you have two options. You can copy the entire linux-3.11.3 directory into your VM, cd into it, and run:
guest $> sudo make modules_install
guest $> sudo make install

Or download the same tarball in your VM, copy only the file linux-3.11.3/.config from your host into linux-3.11.3 in your VM, and run:
guest $> make
guest $> sudo make modules_install
guest $> sudo make install

Both of these methods lead to the same result: installing in your VM the same vmlinux file you have on your host. You can then reboot your VM and try to boot the kernel you just compiled. If you can't get this to work, review the steps from the beginning.

Assuming this worked, shut the VM down. Then use Qemu to start it again, adding the -s -S options to the qemu command line. If you're using virt-manager (libvirt), you can do this by modifying the XML file that libvirt uses to configure your VM (a sketch is shown after the note below). Otherwise, you can simply use the following command:
host $> qemu-system-x86_64 -smp 4 -m 4096 /var/lib/libvirt/images/name-of-vm.img -s -S

Note: -smp 4 means create 4 virtual CPUs and -m 4096 means allocate 4GB of memory to the VM.
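
For libvirt, a sketch of the XML change could look like the following (this assumes libvirt's QEMU command-line passthrough; note the extra xmlns:qemu declaration required on the root element):

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  ...
  <qemu:commandline>
    <qemu:arg value='-s'/>
    <qemu:arg value='-S'/>
  </qemu:commandline>
</domain>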

At this point, Qemu will have started and opened a gdbserver instance on port 1234 (-s option) but your VM will be stopped (because of the -S option). Now, from your host, launch gdb by giving it the vmlinux file we compiled earlier:
host $> gdb vmlinux

You will see a message on the output, ending with:
Reading symbols from some_path/linux-3.11.3/vmlinux...done.
If you see instead another message like:
Reading symbols from some_path/linux-3.11.3/vmlinux...(no debugging symbols found)...done.
Go back and make sure you have enabled the debugging information before compiling the kernel.

Then connect to the gdbserver launched by Qemu:
(gdb) target remote :1234
then:
(gdb) c

Now your VM will continue execution and start the boot process. Make sure you boot with the kernel we just compiled.
From this point on, you can use gdb as you would with any regular process. In the gdb shell, you can press Ctrl-C to pause the VM and set a breakpoint on a specific function:
(gdb) b do_timer

Then continue execution:
(gdb) c

Gdb will then break the next time function do_timer is called.

Tip: after pressing Ctrl-C to pause your VM, you can use Ctrl-X, Ctrl-A to show the source code of your kernel and the location at which the VM is paused.


Friday, July 12, 2013

How to change QEMU processes names

I am currently working on trace analysis and correlation between a host and multiple guests. I am using qemu/kvm as a hypervisor, and one of the problems I encountered was knowing which qemu process (or qemu-kvm, depending on the version you are using) belongs to which virtual machine. Then I discovered a very interesting qemu feature that lets you set the name of the virtual machine's process running on the host (keep in mind that each VM is a process, and each virtual CPU is a thread belonging to that process).

First, you have to enable the feature from /etc/libvirt/qemu.conf (or whatever configuration file you're using for qemu). Search for the line
#set_process_name = 1
and uncomment it (or make sure it is set to 1):
set_process_name = 1
Then you should restart the libvirt daemon (or reboot your system).

You can then simply pass the following additional option to QEMU to set the name of the process to "process_name":
$> qemu [options] -name vm_name,process="process_name"

If you omit the process="process_name" part, the process's name will be set to qemu:vm_name. Keep in mind that a process name can only be 15 characters long (the kernel's TASK_COMM_LEN is 16 bytes, including the terminating NUL), so with this method you have already lost 5 characters to the "qemu:" prefix. The rest of the name will be truncated.

Make sure this worked by running
$> top
or
$> pgrep -l process_name
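
For example, a hypothetical session (the VM name, process name and PID here are made up):
$> qemu-system-x86_64 -m 2048 -name debian7,process="qemu-debian7" debian7.img &
$> pgrep -l qemu-debian7
4242 qemu-debian7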



Sunday, June 30, 2013

Detecting virtualization: How to know if you're running inside a virtual machine

As you may know, the Linux kernel has some paravirtualization routines and setups to collaborate with the hypervisor when it is running inside a virtual machine (VM). The paravirtualization components can be modules to be loaded (for instance, the virtio family) or the initialization of an interface for paravirtualization routines (more on that in a future blog post). But how does the Linux kernel know that it is running inside a VM, so it can prepare the ground for paravirtualization?

This blog post is mostly dedicated to virtualization using KVM (Kernel-based Virtual Machine) as a hypervisor on x86 platforms, though the detection of virtualization is basically the same for any other hypervisor.

Everything starts in the setup_arch() function, in file arch/x86/kernel/setup.c from the kernel source tree. Here is a snippet of this function:
void __init setup_arch(char **cmdline_p)
{
  ...
  init_hypervisor_platform(); // (1)
  ...
  kvm_guest_init(); // (2)
  ...
}
1: Try to detect whether the kernel is running inside a VM, and which hypervisor is used.
2: Try to initialize the kernel as a KVM guest. This function does nothing if the system is not running as a KVM guest.

Now, let's take a closer look at the function init_hypervisor_platform(), in file arch/x86/kernel/cpu/hypervisor.c :
void __init init_hypervisor_platform(void)
{
  detect_hypervisor_vendor(); // (3)

  if (!x86_hyper)
    return;
 
  init_hypervisor(&boot_cpu_data);

  if (x86_hyper->init_platform)
    x86_hyper->init_platform();
}
3: detect_hypervisor_vendor() is the function where all the magic happens. Let's take a closer look at it (from the same file in the kernel source tree):
static inline void __init
detect_hypervisor_vendor(void)
{
  const struct hypervisor_x86 *h, * const *p;

  for (p = hypervisors; p < hypervisors + ARRAY_SIZE(hypervisors); p++) {
    h = *p;
    if (h->detect()) { // (4)
      x86_hyper = h;
      printk(KERN_INFO "Hypervisor detected: %s\n", h->name);
      break;
    }
  }
}

Here, "hypervisors" is an array of all the hypervisors that the Linux kernel knows about and that can be detected. This array is initialized in the same file and contains hypervisor_x86 structures that have already been created for the following hypervisors: Xen, VMWare, HyperV and KVM. The hypervisor_x86 structure is just a set of function pointers, each of which points to a function within each hypervisor's infrastructure. In (4), the detect() function of each hypervisor is called, until one of them succeeds. For KVM, the detect function pointer points to kvm_detect(). Let's take a look at it (from file arch/x86/kernel/kvm.c) :
static bool __init kvm_detect(void)
{
  if (!kvm_para_available())
    return false;
  return true;
}
kvm_detect() simply calls kvm_para_available(), and this is where the actual detection is done:
static inline bool kvm_para_available(void)
{ 
  unsigned int eax, ebx, ecx, edx;
  char signature[13];

  if (boot_cpu_data.cpuid_level < 0)
    return false; /* So we don't blow up on old processors */

  if (cpu_has_hypervisor) {
    cpuid(KVM_CPUID_SIGNATURE, &eax, &ebx, &ecx, &edx); // (5)
    memcpy(signature + 0, &ebx, 4);
    memcpy(signature + 4, &ecx, 4);
    memcpy(signature + 8, &edx, 4);
    signature[12] = 0;

    if (strcmp(signature, "KVMKVMKVM") == 0)
      return true;
  }

  return false;
}

What this function mainly does is ask for the signature of the CPU using the cpuid() function, which returns its results in registers ebx, ecx and edx. These registers, which form the signature, are then compared to the string "KVMKVMKVM"; a match indicates that KVM is used as the hypervisor.
From Intel's documentation (Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3, section 25.1.2, "Instructions That Cause VM Exits Unconditionally"): the cpuid instruction causes a VM exit whenever it is executed in VMX non-root mode. The hypervisor can then trap this instruction and handle it appropriately.
Note that it is not the C function cpuid() called in (5) that traps, but the assembly instruction that it ends up executing.
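
Out of curiosity, we can reproduce this check from userland. Here is a minimal sketch using GCC's <cpuid.h> (0x40000000 is the hypervisor CPUID leaf that KVM_CPUID_SIGNATURE refers to; on bare metal this leaf returns unrelated values):

#include <stdio.h>
#include <string.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    char signature[13];

    /* 0x40000000 is the hypervisor CPUID leaf (KVM_CPUID_SIGNATURE) */
    __cpuid(0x40000000, eax, ebx, ecx, edx);

    memcpy(signature + 0, &ebx, 4);
    memcpy(signature + 4, &ecx, 4);
    memcpy(signature + 8, &edx, 4);
    signature[12] = '\0';

    if (strcmp(signature, "KVMKVMKVM") == 0)
        printf("Running as a KVM guest\n");
    else
        printf("Hypervisor signature: \"%s\"\n", signature);
    return 0;
}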

Now, let's take a look at an interesting function in the KVM module, do_cpuid_ent(), from file arch/x86/kvm/cpuid.c :
static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
       u32 index, int *nent, int maxnent)
{
  ...
  case KVM_CPUID_SIGNATURE: {
    static const char signature[12] = "KVMKVMKVM\0\0";
    const u32 *sigptr = (const u32 *)signature;
    entry->eax = KVM_CPUID_FEATURES;
    entry->ebx = sigptr[0];
    entry->ecx = sigptr[1];
    entry->edx = sigptr[2];
    break;
  }
  ...
}
Note that this code is executed in the KVM module, on the host system, unlike the other code snippets shown earlier, which were executed by the guest kernel directly inside the VM.
The cpuid instruction executed by the guest is trapped by the hypervisor, which hands execution over to the KVM module on the host. The KVM module then sets "KVMKVMKVM" as the CPU signature and gives execution back to the VM. Finally, the CPU signature seen by the guest matches what the kernel is looking for, so it concludes that KVM is the hypervisor in use, and thus that it is running inside a VM.
The same approach is used for the other hypervisors, each having its own CPU signature: "Microsoft Hv" indicates Microsoft's Hyper-V, and "VMwareVMware" indicates VMware's hypervisor.

