Virtualization Blog

Discussions and observations on virtualization.

Overview of the Performance Improvements between XenServer 6.2 and Creedence Alpha 2

The XenServer Creedence Alpha 2 has been released, and one of the main focuses in Alpha 2 was the inclusion of many performance improvements that build on the architectural improvements seen in Alpha 1. This post will give you an overview of these performance improvements in Creedence, and will start a series of in-depth blog posts with more details about the most important ones.

Creedence Alpha 1 introduced several architectural improvements that aim to improve performance and fix a series of scalability limits found in XenServer 6.2:

  • A new 64-bit Dom0 Linux kernel. The 64-bit kernel removes the cumbersome low/high-memory division present in the previous 32-bit Dom0 kernel, which limited the maximum amount of memory that Dom0 could use and added memory access penalties once Dom0 had more than 752MB RAM. This means that Dom0 memory can now be scaled up arbitrarily to cope with the memory demands of the latest vGPU, disk and network drivers, of support for more VMs, and of internal caches that speed up disk access (see, for instance, the Read-caching section below).

  • Dom0 Linux kernel 3.10 with native support for the Xen Project hypervisor. Creedence Alpha 1 adopted a very recent long-term Linux kernel. This modern Linux kernel contains many concurrency, multiprocessing and architectural improvements over the old xen-Linux 2.6.32 kernel used previously in XenServer 6.2. It contains pvops features to run natively on the Xen Project hypervisor, and streamlined virtualization features used to increase datapath performance, such as a grant memory device that allows Dom0 user space processes to access memory from a guest (as long as the guest agrees in advance). Additionally, the latest drivers from hardware manufacturers containing performance improvements can be adopted more easily.

  • Xen Project hypervisor 4.4. This is the latest Xen Project hypervisor version available, and it improves on the previous version 4.1 on many accounts. It vastly increases the number of virtual event channels available for Dom0 -- from 1023 to 131071 -- which can translate into a correspondingly larger number of VMs per host and larger numbers of virtual devices that can be attached to them. XenServer 6.2 was using a special interim change that provided 4096 channels, which was enough for around 500 VMs per host with a few virtual devices in each VM. With the extra event channels in version 4.4, Creedence Alpha 1 can have each of these VMs endowed with a richer set of virtual devices. The Xen Project hypervisor 4.4 also handles grant-copy locking requests more efficiently, improving aggregate network and disk throughput; it facilitates future increases to the supported amount of host memory and CPUs; and it adds many other helpful scalability improvements.

  • Tapdisk3. The latest Dom0 disk backend design is now enabled by default for all guest VBDs. While the previous tapdisk2 in XenServer 6.2 would establish a datapath to the guest in a circuitous way via a Dom0 kernel component, tapdisk3 in Creedence Alpha 1 establishes a datapath connected directly to the guest (via the grant memory device in the new kernel), minimizing latency and using less CPU. This results in big improvements in concurrent disk access and a much larger total aggregate disk throughput for the VBDs. We have measured aggregate disk throughput improvements of up to 100% on modern disks and machines when accessing large blocksizes with a large number of threads, and have observed local SSD arrays being maxed out when enough VMs and VBDs were used. (A few quick Dom0 checks for several of the items in this list are sketched after it.)

  • GRO enabled by default. The Generic Receive Offload is now enabled by default for all PIFs available to Dom0. This means that for GRO-capable NICs, incoming network packets will be transparently merged by the NIC and Dom0 will be interrupted less often to process the incoming data, saving CPU cycles and scaling much better with 10Gbps and 40Gbps networks. We have observed incoming single-stream network throughput improvements of 300% on modern machines.

  • Netback thread per VIF. Previously, XenServer 6.2 would have one netback thread for each existing Dom0 VCPU and a VIF would be permanently associated with one Dom0 VCPU. In the worst case, it was possible to end up with many VIFs forcibly sharing the same Dom0 VCPU thread, while other Dom0 VCPU threads were idle but unable to help. Creedence Alpha 2 improves this design and gives each VIF its own Dom0 netback thread that can run on any Dom0 VCPU. Therefore, the VIF load will now be spread evenly across all Dom0 VCPUs in all cases.
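
For reference, here is a minimal sketch of how some of the Alpha 1 changes above can be observed from Dom0. The interface name (eth0) is an assumption used for illustration; tap-ctl, ethtool and ps are standard Dom0 utilities.

# Tapdisk3: list the tapdisk processes currently serving guest VBDs.
tap-ctl list

# GRO: check that Generic Receive Offload is enabled on a physical interface.
ethtool -k eth0 | grep generic-receive-offload   # expect "generic-receive-offload: on"

# Netback thread per VIF: each guest VIF now gets its own kernel thread,
# named after the guest domain ID and device ID (e.g. vif12.0).
ps -eo comm | grep '^vif'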

Creedence Alpha 2 then introduced a series of extra performance enhancements on top of the architecture improvements of Creedence Alpha 1:

  • Read-caching. In some situations, several VMs are all cloned from the same base disk, so they share much of their data, while the few blocks they write differently are stored in differencing disks unique to each VM. In this case it is useful to cache the contents of the base disk in memory, so that every VM gets very fast access to it, reducing the amount of I/O going to and from physical storage. Creedence Alpha 2 introduces this read-caching feature, enabled by default, which we expect to yield substantial improvements in VM boot times and in other desktop and server workloads where the VMs mostly share a single base disk.

  • Grant-mapping on the network datapath. The pvops-Linux 3.10 kernel used in Alpha 1 had a VIF datapath that would need to copy the guest's network data into Dom0 before transmitting it to another guest or host. This memory copy operation was expensive and it would saturate the Dom0 VCPUs and limit the network throughput. A new design was introduced in Creedence Alpha 2, which maps the guest's network data into Dom0's memory space instead of copying it. This saves substantial Dom0 VCPU resources that can be used to increase the single-stream and aggregate network throughput even more. With this change, we have measured network throughput improvements of 250% for single-stream and 200% for aggregate stream over XenServer 6.2 on modern machines. 

  • OVS 2.1. An Open vSwitch network flow pairs a match on packet header fields with an action such as forward or drop. In OVS 1.4, present in XenServer 6.2, a flow had to match the header exactly. A typical server VM could have hundreds or more connections to clients, and OVS would need a flow for each of these connections. If the host had too many such VMs, the OVS flow table in the Dom0 kernel would become full and cause many round-trips to the OVS userspace process, significantly degrading the network throughput to and from the guests. Creedence Alpha 2 has the latest OVS 2.1, which supports megaflows. Megaflows are simply a wildcarded language for the flow table that allows OVS to express a flow as a group of matches, reducing the number of required entries in the flow table for the most common situations and improving the scalability of Dom0 when handling many server VMs connected to a large number of clients. (A quick way to inspect the resulting flow table is sketched after this list.)
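
As a quick illustration (not part of the original feature description), the kernel datapath flow table can be dumped from Dom0; with OVS 2.1, the entries carry wildcard masks (megaflows) rather than one fully specified flow per connection:

ovs-vsctl --version                # confirm the Open vSwitch version running in Dom0
ovs-dpctl dump-flows | head -n 5   # datapath flows; megaflow entries include wildcard masks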

Our goal is to make Creedence the most scalable and fastest XenServer release yet. You can help us in this goal by testing the performance features above and verifying if they boost the performance you can observe in your existing hardware.

Debug versus non-debug mode in Creedence Alpha

The Creedence Alpha releases use, by default, a version of the Xen Project hypervisor with debugging mode enabled, to facilitate functional testing. When testing the performance of these releases, you should first switch to the corresponding non-debugging version of the hypervisor so that it can run at full speed. So, before you start any performance measurements, please run in Dom0:

cd /boot
ln -sf xen-*-xs?????.gz xen.gz   #points to the non-debug version of the Xen Project hypervisor in /boot

Double-check that the resulting xen.gz symlink is pointing to a valid file and then reboot the host.
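
For example, one quick way to double-check the symlink (assuming the default /boot layout) is:

readlink -f /boot/xen.gz   # should resolve to the non-debug image, i.e. a filename without the "-d" suffix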

You can check if the hypervisor debug mode is currently on or off by executing in Dom0:

xl dmesg | fgrep "Xen version"

and checking if the resulting line has debug=y or debug=n. It should be debug=n for performance tests.

You can reinstate the hypervisor debugging mode by executing in Dom0:

cd /boot
ln -sf xen-*-xs?????-d.gz xen.gz   #points to the debug (-d) version of the Xen Project hypervisor in /boot

and then rebooting the host.

Please report any improvements and regressions you observe on your hardware to the mailing list. And keep an eye out for the next installments of this series!


How did we increase VM density in XenServer 6.2? (part 2)

In a previous article, I described how dom0 event channels can cause a hard limitation on VM density scalability.

Event channels were just one hard limit the XenServer engineering team needed to overcome to allow XenServer 6.2 to support up to 500 Windows VMs or 650 Linux VMs on a single host.

In my talk at the 2013 Xen Developer Summit towards the end of October, I spoke about a further six hard limits and some soft limits that we overcame along the way to achieving this goal. This blog article summarises that journey.

Firstly, I'll explain what I mean by hard and soft VM density limits. A hard limit is where you can run a certain number of VMs without any trouble, but you are unable to run one more. Hard limits arise when there is some finite, unsharable resource that each VM consumes a bit of. On the other hand, a soft limit is where performance degrades with every additional VM you have running; there will be a point at which it's impractical to run more than a certain number of VMs because they will be unusable in some sense. Soft limits arise when there is a shared resource that all VMs must compete for, such as CPU time.

Here is a run-down of all seven hard limits, how we mitigated them in XenServer 6.2, and how we might be able to push them even further back in future:

  1. dom0 event channels

    • Cause of limitation: XenServer uses a 32-bit dom0. This means a maximum of 1,024 dom0 event channels.
    • Mitigation for XenServer 6.2: We made a special case for dom0 to allow it up to 4,096 dom0 event channels.
    • Mitigation for future: Adopt David Vrabel's proposed change to the Xen ABI to provide unlimited event channels.
  2. blktap2 device minor numbers

    • Cause of limitation: blktap2 only supports up to 1,024 minor numbers, caused by #define MAX_BLKTAP_DEVICE in blktap.h.
    • Mitigation for XenServer 6.2: We doubled that constant to allow up to 2,048 devices.
    • Mitigation for future: Move away from blktap2 altogether?
  3. aio requests in dom0

    • Cause of limitation: Each blktap2 instance creates an asynchronous I/O context for receiving 402 events; the default system-wide number of aio requests (fs.aio-max-nr) was 444,416 in XenServer 6.1.
    • Mitigation for XenServer 6.2: We set fs.aio-max-nr to 1,048,576 (a quick way to inspect and raise this parameter is sketched after this list).
    • Mitigation for future: Increase this parameter yet further. It's not clear whether there's a ceiling, but it looks like this would be okay.
  4. dom0 grant references

    • Cause of limitation: Windows VMs used receive-side copy (RSC) by default in XenServer 6.1. In netbk_p1_setup, netback allocates 22 grant-table entries per virtual interface for RSC. But dom0 only had a total of 8,192 grant-table entries in XenServer 6.1.
    • Mitigation for XenServer 6.2: We could have increased the size of the grant-table, but for other reasons RSC is no longer the default for Windows VMs in XenServer 6.2, so this limitation no longer applies.
    • Mitigation for future: Continue to leave RSC disabled by default.
  5. Connections to xenstored

    • Cause of limitation: xenstored uses select(2), which can only listen on up to 1,024 file descriptors; qemu opens 3 file descriptors to xenstored.
    • Mitigation for XenServer 6.2: We made two qemu watches share a connection.
    • Mitigation for future: We could modify xenstored to accept more connections, but in the future we expect to be using upstream qemu, which doesn't connect to xenstored, so it's unlikely that xenstored will run out of connections.
  6. Connections to consoled

    • Cause of limitation: Similarly, consoled uses select(2), and each PV domain opens 3 file descriptors to consoled.
    • Mitigation for XenServer 6.2: We changed consoled to use poll(2) rather than select(2), which has no such limitation.
    • Mitigation for future: Continue to use poll(2).
  7. dom0 low memory

    • Cause of limitation: Each running VM eats about 1 MB of dom0 low memory.
    • Mitigation for future: Using a 64-bit dom0 would remove this limit.
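
As mentioned in item 3, fs.aio-max-nr is an ordinary sysctl, so it is easy to inspect and raise; the value below is simply the one XenServer 6.2 ships with, shown here as a sketch:

sysctl fs.aio-max-nr              # current system-wide limit on aio requests
sysctl -w fs.aio-max-nr=1048576   # raise it (add to /etc/sysctl.conf to persist across reboots)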

Summary of limits

Okay, so what does this all mean in terms of how many VMs you can run on a host? Well, since some of the limits concern your VM configuration, it depends on the type of VM you have in mind.

Let's take the example of Windows VMs with PV drivers, each with 1 vCPU, 3 disks and 1 network interface. Here are the number of those VMs you'd have to run on a host in order to hit each limitation:

Limitation              XS 6.1     XS 6.2     Future
dom0 event channels     150        570        no limit
blktap minor numbers    341        682        no limit
aio requests            368        869        no limit
dom0 grant references   372        no limit   no limit
xenstored connections   333        500        no limit
consoled connections    no limit   no limit   no limit
dom0 low memory         650        650        no limit

The first limit you'd arrive at in each release determines the overall ceiling. In XenServer 6.1 that is dom0 event channels, limiting us to 150 of these VMs. In XenServer 6.2, it's the number of xenstored connections that limits us to 500 VMs per host. In the future, none of these limits will hit us, but there will surely be an eighth limit when running many more than 500 VMs on a host.
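
For the curious, most of the XS 6.1 column can be roughly reproduced from the per-limit figures listed above (event channels are covered in the previous article). The arithmetic below is only a sketch; the xenstored figure comes out slightly higher than the table because some file descriptors are already in use by other clients:

echo $(( 1024 / 3 ))             # blktap minor numbers: 3 disks per VM -> 341
echo $(( 444416 / (402 * 3) ))   # aio requests: 402 per blktap instance, 3 instances per VM -> 368
echo $(( 8192 / 22 ))            # dom0 grant references: 22 per VIF with RSC -> 372
echo $(( 1024 / 3 ))             # xenstored connections: 3 fds per qemu -> ~341 (333 in practice)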

What about Linux guests? Here's where we stand for paravirtualised Linux VMs each with 1 vCPU, 1 disk and 1 network interface:

Limitation              XS 6.1     XS 6.2     Future
dom0 event channels     225        1000       no limit
blktap minor numbers    1024       2048       no limit
aio requests            368        869        no limit
dom0 grant references   no limit   no limit   no limit
xenstored connections   no limit   no limit   no limit
consoled connections    341        no limit   no limit
dom0 low memory         650        650        no limit

This explains why the supported limit for Linux guests can be as high as 650 in XenServer 6.2. Again, in the future, we'll likely be limited by something else above 650 VMs.

What about the soft limits?

After having pushed the hard limits such a long way out, we then needed to turn our attention towards ensuring that there weren't any soft limits that would make it infeasible to run a large number of VMs in practice.

Felipe Franciosi has already described how qemu's utilisation of dom0 CPUs can be reduced by avoiding the emulation of unneeded virtual devices. The other major change in XenServer 6.2 to reduce dom0 load was to reduce the amount of xenstore traffic. This was achieved by replacing code that polled xenstore with code that registers watches on xenstore and by removing some spurious xenstore accesses from the Windows guest agent.
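
To illustrate the difference (this is a sketch rather than the actual toolstack code, the xenstore path is made up, and it assumes the xenstore-read and xenstore-watch utilities are present in Dom0): polling re-reads a key on a timer and hits xenstored every time, whereas a watch makes xenstored notify the listener only when the key actually changes.

# Polling: wakes up and queries xenstored every second, whether or not anything changed.
while true; do xenstore-read /local/domain/12/data/updated; sleep 1; done

# Watch: prints the path each time it changes, generating no traffic in between.
xenstore-watch /local/domain/12/data/updated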

These things combine to keep dom0 CPU load down to a very low level. This means that VMs can remain healthy and responsive, even when running a very large number of VMs.


How did we increase VM density in XenServer 6.2?

One of the most noteworthy improvements in XenServer 6.2 is the support for a significantly increased number of VMs running on a host: now up to 500 Windows VMs or 650 Linux VMs.

We needed to remove several obstacles in order to achieve this huge step up. Perhaps the most important of the technical changes that led to this was the increase in the number of event channels available to dom0 (the control domain) from 1024 to 4096. This blog post is an attempt to shed some light on what these event channels are, and why they play a key role in VM density limits.

What is an event channel?

It's a channel for communications between a pair of VMs. An event channel is typically used by one VM to notify another VM about something. For example, a VM's paravirtualised disk driver would use an event channel to notify dom0 of the presence of newly written data in a region of memory shared with dom0.

Here are the various things that a VM requires an event channel for:

  • one per virtual disk;
  • one per virtual network interface;
  • one for communications with xenstore;
  • for HVM guests, one per virtual CPU (rising to two in XenServer 6.2); and
  • for PV guests, one to communicate with the console daemon.


Therefore a VM will typically require at least four dom0 event channels, depending on its configuration, and configurations requiring more than ten are not uncommon.

Why can event channels cause scalability problems when trying to run lots of VMs?

The total number of event channels any domain can use is part of a shared structure in the interface between a paravirtualised VM and the hypervisor; it is fixed at 1024 for 32-bit domains such as XenServer's dom0. Moreover, around 50--100 event channels are normally used for other purposes, such as physical interrupts; this is broadly related to the number of physical devices in the host. This overhead means that in practice there may be little more than 900--950 event channels available for VM use. So the number of available event channels becomes a limited resource that can impose a hard limit on the number of VMs you can run on a host.

To take an example: Before XenServer 6.2, if each of your VMs requires 6 dom0 event channels (e.g. an HVM guest with 3 virtual disks, 1 virtual network interface and 1 virtual CPU) then you'll probably find yourself running out of dom0 event channels if you go much over 150 VMs.

In XenServer 6.2, we have made a special case for dom0, allowing it to behave differently from other 32-bit domains and use up to four times the normal number of event channels. Hence there are now a total of 4096 event channels available.

So, on XenServer 6.2 in the same scenario as the example above, even though each VM of this type would now use 7 dom0 event channels, the increased total number of dom0 event channels means you'd have to run over 570 of them before running out.
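
A back-of-the-envelope check of those figures, assuming roughly 100 event channels of overhead as discussed above:

echo $(( (1024 - 100) / 6 ))   # XenServer 6.1: ~154, i.e. trouble not far over 150 VMs
echo $(( (4096 - 100) / 7 ))   # XenServer 6.2: ~570 VMs of the same configuration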

What happens when I run out of event channels?

On VM startup, the XenServer toolstack will try to plumb all the event channels through from dom0 to the nascent VM. If there are no spare slots, the connection will fail. The exact failure mode depends on which subsystem the event channel was intended for use in, but you may see error messages like these when the toolstack tries to connect up the next event channel after having run out:

error 28 mapping ring-refs and evtchn
message: xenopsd internal error: Device.Ioemu_failed("qemu-dm exited unexpectedly")

In other words, it's not pretty. The VM either won't boot or will run with reduced functionality.

That sounds scary. How can I tell whether there are enough spare event channels to start another VM?

XenServer has a utility called "lsevtchn" that allows you to inspect the event channel plumbing.

In dom0, run the following command to see what event channels are connected to a particular domain.

/usr/lib/xen/bin/lsevtchn <domid>

For example, here is the output from a PV domain with domid 36:

[root@xs62 ~]# /usr/lib/xen/bin/lsevtchn 36
   1: VCPU 0: Interdomain (Connected) - Remote Domain 0, Port 51
   2: VCPU 0: Interdomain (Connected) - Remote Domain 0, Port 52
   3: VCPU 0: Virtual IRQ 0
   4: VCPU 0: IPI
   5: VCPU 0: IPI
   6: VCPU 0: Virtual IRQ 1
   7: VCPU 0: IPI
   8: VCPU 0: Interdomain (Connected) - Remote Domain 0, Port 55
   9: VCPU 0: Interdomain (Connected) - Remote Domain 0, Port 53
  10: VCPU 0: Interdomain (Connected) - Remote Domain 0, Port 54
  11: VCPU 0: Interdomain (Connected) - Remote Domain 0, Port 56

You can see that six of this VM's event channels are connected to dom0.

But the domain we are most interested in is dom0. The total number of event channels connected to dom0 can be determined by running

/usr/lib/xen/bin/lsevtchn 0 | wc -l

Before XenServer 6.2, if that number is close to 1024 then your host is on the verge of not being able to run an additional VM. On XenServer 6.2, the number to watch out for is 4096. However, before you'd be able to get enough VMs up and running to approach that limit, there are various other things you might run into depending on configuration and workload. Watch out for further blog posts describing how we have cleared more of these hurdles in XenServer 6.2.
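
As a convenience, a rough headroom figure can be computed in one line (the 4096 limit applies to XenServer 6.2; substitute 1024 on earlier releases):

echo $(( 4096 - $(/usr/lib/xen/bin/lsevtchn 0 | wc -l) ))   # approximate spare dom0 event channels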


About XenServer

XenServer is the leading open source virtualization platform, powered by the Xen Project hypervisor and the XAPI toolstack. It is used in the world's largest clouds and enterprises.
 
Commercial support for XenServer is available from Citrix.