Abstract
System-level virtualization provides several advantages: (i) customization is eased, since virtual machines may be based on different systems; (ii) virtual machines are isolated from the hardware, and applications are in turn isolated by the virtual machines; (iii) basic fault tolerance mechanisms are available, namely proactive fault tolerance through virtual machine migration and virtual machine snapshot/restore; and (iv) basic load balancing mechanisms are available, namely the capability to move and stop virtual machines running in the system. However, the current Xen implementation does not natively provide mechanisms for virtual machine checkpoint/restart.
This document presents the design of a reactive fault-tolerant system based on a checkpoint/restart mechanism for Xen virtual machines. We present the infrastructure for managing virtual machine checkpoint data, as well as the challenges of implementing a virtual machine checkpoint/restart mechanism based on Xen.