Xen Project Hypervisor: Virtualization and Power Management are Coalescing into an Energy-Aware Hypervisor
Power management in the Xen Project Hypervisor has historically targeted server applications, improving power consumption and heat management in data centers to reduce electricity and cooling costs. In the embedded space, the Xen Project Hypervisor faces very different applications, architectures and power-related requirements, which focus on battery life, heat, and size.
Although the same fundamental principles of power management apply, the power management infrastructure in the Xen Project Hypervisor requires new interfaces, methods, and policies tailored to embedded architectures and applications. This post recaps Xen Project power management, how the requirements change in the embedded space, and how this change may unite the hypervisor and power manager functions.
Evolution of Xen Project Power Management on x86
Time-sharing of computer resources by different virtual machines (VMs) was the precursor to scheduling and virtualization. Sharing of time using workload estimates was both a good and simple enough proxy for energy sharing. As in all main OSes, energy and power management in the Xen Project Hypervisor came as an afterthought.
Intel and AMD developed the first forms of power management for the Xen Project with the x86_64 architecture. Initially, the Xen Project used the `hlt' instruction for CPU idling and didn't have any support for deeper sleep states. Then, support for suspend-to-RAM, also known as ACPI S3, was introduced. It was entirely driven by Dom0 and meant to support manual machine suspensions by the user, for instance when the lid is closed on a laptop. It was not intended to reduce power utilization under normal circumstances. As a result, power saving was minimal and limited to the effects of `hlt'.
Finally, Intel introduced support for cpu-freq in the Xen Project in 2007. This was the first non-trivial form of power management for the Xen Project. Cpu-freq decreases the CPU frequency at runtime to reduce power consumption when the CPU is only lightly utilized. Again, cpu-freq was entirely driven by Dom0: the hypervisor allowed Dom0 to control the frequency of the underlying physical CPUs.
Not only was this a backward approach from the Xen architecture point of view, but it was also severely limiting. Dom0 didn't have a full view of the system to make the right decisions. In addition, it required one virtual CPU in Dom0 for each physical CPU, with each Dom0 virtual CPU pinned to a different physical CPU. It was not a viable option in the long run.
To address this issue, cpu-freq was re-architected, moving the cpu-freq driver to the hypervisor. The Xen Project Hypervisor thus became able to change the CPU frequency and make power saving decisions by itself, removing both the limited system view and the pinning requirement.
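The kind of decision a cpu-freq policy makes can be sketched in a few lines. This is a simplified, utilization-driven policy written for illustration only; it is not the Xen Project cpu-freq code, and the frequency table, the `next_frequency` name, and the 0.8 threshold are all assumptions.

```python
# Hypothetical frequency table in kHz, sorted from slowest to fastest.
FREQS_KHZ = [800_000, 1_600_000, 2_400_000]


def next_frequency(freqs_khz, current_khz, utilization, up_threshold=0.8):
    """Pick the next CPU frequency from the measured utilization (0.0 to 1.0).

    A busy CPU jumps straight to the highest frequency. A lightly used CPU
    drops to the lowest frequency that would still keep its utilization
    below the threshold.
    """
    if utilization > up_threshold:
        return freqs_khz[-1]
    # Work done per second, expressed as the frequency needed to do it
    # while staying under the threshold.
    needed_khz = current_khz * utilization / up_threshold
    for freq in freqs_khz:
        if freq >= needed_khz:
            return freq
    return freqs_khz[-1]
```

For example, a CPU running at 2.4 GHz with 10% utilization would be lowered to the slowest entry in the table, while a CPU at 800 MHz with 90% utilization would be raised to the fastest one.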
Intel and AMD introduced support for deep sleep states around the same time as the cpu-freq redesign. The Xen Project Hypervisor added the ability to idle physical CPUs beyond the simple `hlt' instruction. Deeper sleep states, also known as ACPI C-states, have better power saving properties, but come with a higher latency cost. The deeper the sleep state, the more power is saved, but the longer it takes to resume normal operation. The decision to enter a sleep state is based on two variables: time and energy. However, scheduling and idling remain largely separate activities. As an example, the scheduler has very limited influence on the choice of the particular sleep state.
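The time and energy trade-off behind a sleep state decision can be illustrated with a small sketch. It assumes a hypothetical table of idle states, each with an exit latency and a target residency (the minimum idle time for which entering the state saves energy), and picks the deepest state that fits the predicted idle period and a latency limit. The state names and numbers are made up, and this is not the selection logic used by the Xen Project Hypervisor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdleState:
    name: str
    exit_latency_us: int      # time needed to resume normal operation
    target_residency_us: int  # minimum idle time for the state to pay off


# Hypothetical table, ordered from shallowest to deepest state.
STATES = [
    IdleState("C1", exit_latency_us=2, target_residency_us=2),
    IdleState("C3", exit_latency_us=50, target_residency_us=200),
    IdleState("C6", exit_latency_us=200, target_residency_us=800),
]


def pick_idle_state(states, predicted_idle_us, latency_limit_us):
    """Return the deepest state that is worthwhile and fast enough to exit.

    The shallowest state is the fallback when nothing deeper qualifies.
    """
    choice = states[0]
    for state in states[1:]:
        saves_energy = state.target_residency_us <= predicted_idle_us
        wakes_in_time = state.exit_latency_us <= latency_limit_us
        if saves_energy and wakes_in_time:
            choice = state
    return choice
```

With this table, a predicted idle period of 5000 microseconds selects C6 when wakeups may take up to 1000 microseconds, but only C3 when the latency limit is 100 microseconds. The predicted idle time is exactly the kind of information a scheduler could supply if scheduling and idling were more tightly coupled.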
Xen Project Power Management on Arm
The first Xen release with Arm support was Xen 4.3 in 2013, but Xen power management on Arm was not actively addressed until very recently. One of the reasons may be the dominance of proprietary and in-house hypervisors for Arm in the embedded space and the overwhelming prevalence of x86 for servers. Due to the Xen Project’s maturity, its open source model and wide deployment, it is frequently used today in a variety of Arm-based applications. Power management support for the Xen Project Hypervisor on Arm is becoming essential, in particular in the embedded world.
In our next blog post, we will cover architectural choices for Xen on Arm in the embedded world and use cases on how to make this work.
Xen Power Management for Embedded Applications
Embedded applications require the same OS isolation and security capabilities that motivated the development of server virtualization, but come with a wider variety of multicore architectures, guest OSes, and virtual to physical hardware mappings. Moreover, most embedded designs are highly sensitive to the deterioration in performance, memory size, power efficiency and wakeup latency that often comes with hypervisors. As embedded devices become increasingly cooler, quieter, smaller and battery powered, efficient power management has become a critical requirement for the successful adoption of hypervisors in the embedded community.
Standard non-virtualized embedded devices manage power at two levels: the platform level and the OS level. At the platform level, the platform manager typically executes on dedicated on-chip or on-board processors and microcontrollers. It monitors and controls the energy consumption of the CPUs, the peripherals, the CPU clusters and all board level components by changing the frequencies, voltages, and functional states of the hardware. However, it has no intrinsic knowledge about the running applications, which is necessary for making the right decisions to save power.
This knowledge is provided by the OS, or, in some cases, directly by the application software itself. The Power State Coordination Interface (PSCI) and the Extensible Energy Management Interface (EEMI) are used to coordinate the power events between the platform manager, the OSes, and the processing clusters. Whereas PSCI coordinates the power events among the CPUs of a single processor cluster, EEMI is responsible for the peripherals and the power interaction between multiple clusters.
In contrast to the ACPI-based power management for x86 architectures typical for desktops and servers, PSCI and EEMI allow for much more direct control and enable precise power management of virtual clusters. In embedded systems, every microjoule counts, so precision in the timing and scope of power management actions is essential.
When a virtualization layer is inserted between the OSes and the platform manager, it effectively enables additional virtual clusters, which come with virtual CPUs, virtual peripherals, and even physical peripherals with device passthrough. The EEMI power coordination of the virtual clusters can execute in the platform manager, the hypervisor, or both. If the platform manager is selected, the power management can be made very precise, but at the expense of firmware memory bloat, as it needs to manage not only the fixed physical clusters but also the dynamically created virtual clusters.
Additionally, the platform manager requires stronger processing capabilities to optimally manage power, especially if it takes the cluster and system loads into consideration. As platform managers typically reside in low power domains, both memory space and processing power are in short supply.
The hypervisor usually executes on powerful CPU clusters, so it has enough memory and processing power at its disposal. It is also well informed about the partitioning and load of the virtual clusters, making it the ideal place to manage power. However, for proper power management, the hypervisor also requires an accurate energy model of the underlying physical clusters. Similar to the energy-aware scheduler in Linux, the hypervisor must coalesce the sharing of time and energy to manage power properly. In this case, OS-based power management is effectively transformed into hypervisor-based power management.
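To make the idea of an energy model concrete, the sketch below places a new virtual CPU load on the physical cluster where it is estimated to cost the least additional power. It is loosely inspired by the approach of energy-aware scheduling but is a simplification under stated assumptions: the cluster names, operating points, power figures, and the `place` function are all hypothetical and do not describe any existing Xen Project interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    capacity: int  # compute capacity at this frequency, arbitrary units
    power_mw: int  # cluster power when fully busy at this point


@dataclass(frozen=True)
class Cluster:
    name: str
    opps: tuple    # operating points, sorted by ascending capacity
    load: int      # current load, in the same units as capacity


def estimated_power(cluster, load):
    """Estimate power at the lowest operating point that can carry the load.

    Returns None when the load does not fit on the cluster at all.
    """
    for opp in cluster.opps:
        if opp.capacity >= load:
            return opp.power_mw * load / opp.capacity
    return None


def place(clusters, extra_load):
    """Return the cluster with the smallest estimated power increase."""
    best, best_delta = None, None
    for cluster in clusters:
        after = estimated_power(cluster, cluster.load + extra_load)
        if after is None:
            continue  # the cluster cannot absorb the extra load
        delta = after - estimated_power(cluster, cluster.load)
        if best_delta is None or delta < best_delta:
            best, best_delta = cluster, delta
    return best


# Hypothetical energy model of a two-cluster system.
CLUSTERS = [
    Cluster("little", (OperatingPoint(200, 100), OperatingPoint(400, 300)), load=100),
    Cluster("big", (OperatingPoint(500, 600), OperatingPoint(1000, 1800)), load=100),
]
```

With these numbers, a load of 150 units lands on the "little" cluster because the estimated increase is smaller there, while a load of 350 units no longer fits on it and goes to the "big" cluster. The point of the sketch is that the placement decision, normally a pure time-sharing question, now depends on the energy model as well.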
The Hypervisor and Power Manager Coalesce
Most embedded designs consist of multiple physical clusters or subsystems that are frequently put into inactive low-power states to save energy, such as sleep, suspend, hibernate or power-off suspend. Typical examples are the application, real-time video, or accelerator clusters that own multiple CPUs and share the system memory, peripherals, board level components, and the energy source. If all the clusters enter low-power states, their respective hypervisors are inactive, and the always-on platform manager has to take over the sole responsibility for system power management. Once the clusters become active again, the power management is passed back to the respective hypervisors. In order to secure optimum power management, the hypervisors and the power manager have to act as one, ultimately coalescing into a distributed system software covering both performance and power management.
A good example of a design in action indicative of such evolution is the power management support for the Xilinx Zynq UltraScale+ MPSoC. The Xen hypervisor running in the Application Processing Unit (APU) and the power manager in the Power Management Unit (PMU) have already evolved into a tight bundle around EEMI based power management and shall further evolve with the upcoming EEMI clock support.
The next blog in this series will cover the suspend-to-RAM feature for the Xen Project Hypervisor targeting the Xilinx Zynq UltraScale+ MPSoC, which lays the foundation for full-scale power management on Arm architectures.
Vojin Zivojnovic, CEO and Co-Founder at AGGIOS
Stefano Stabellini, Principal Engineer at Xilinx and Xen Project Maintainer