Virtualization Blog

Discussions and observations on virtualization.

Pushing XenServer limits with Creedence beta.2

Well folks, it's that time once again; we've another XenServer build ripe for your inspection, and this one is critical for a number of reasons. Today we've released XenServer Creedence beta.2, which is binary compatible with a Citrix Tech Preview refresh. The build number is 87850 and it presents itself to the outside world as 6.4.95. Over the past few announcements I've hinted at pushing the boundaries of XenServer and wanting the community at large to "have at it", but I've not put out too many details on the overall performance we're seeing internally. The most important attribute of this build is that, internally, it's going to form part of a series of long-term stability tests. Yes folks, we're that confident in what we're seeing, and I want to thank everyone who has participated in our pre-release activities by sharing a few performance tidbits:

  • Creedence can start and run 1000 PV VMs with only 8GB dom0 memory. That's up from the 650 we have in XenServer 6.2.
  • Booting 125 Windows 7 VMs on a single host takes only 350 seconds in a bootstorm scenario. That's down from 850 seconds in XenServer 6.2.
  • Aggregate disk throughput has been measured to improve by as much as 100% when compared to XenServer 6.2.
  • Aggregate intrahost network throughput has been measured to improve by as much as 200% when compared to XenServer 6.2.
  • The number of virtual disks per host has been raised by a factor of four when compared to XenServer 6.2.

When compared to beta.1, the team has been looking at a number of performance and scalability aspects of the system, with a primary focus on dom0 idle-state behavior at scale. This is a very important aspect of system operation, as overall system responsiveness is directly tied to the overhead of managing a large number of VMs. We did identify two distinct areas for investigation, and we invite the community to look into these and report others. Those two areas are:

  • When using 40Gb NICs, outbound (transmit) performance is below expectations. We have some internal fixes, but are encouraging anyone with such NICs to test and report their findings.
  • When large numbers of hosts are pooled, VM start times slow unexpectedly at high VM densities across the pool.

 

As always, we're actively encouraging you to test the beta and provide your feedback (both positive and negative) in an incident report. You can download beta.2 from here: http://xenserver.org/component/content/article/11-product/142-download-pre-release.html, and enter your feedback at https://bugs.xenserver.org.


In-memory read caching for XenServer

Overview

In this blog post, I introduce in-memory read caching, a new feature of XenServer Creedence alpha.4: how it works, the benefits it can provide, and how best to use it.

Technical Details

A common way of using XenServer is to have an OS image, which I will call the golden image, and many clones of this image, which I will call leaf images. XenServer implements cheap clones by linking images together in the form of a tree. When a VM accesses a sector of its disk, the data is retrieved from the leaf image if that sector has been written there; otherwise, the tree is traversed and the data is retrieved from a parent image (in this case, the golden image). All writes go into the leaf image. Astute readers will notice that no writes ever hit the golden image. This has an important implication and is what allows read caching to be implemented.

[Image: tree.png — leaf images linked to a golden image in a tree]
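If you want to see such a chain on a live system, the vhd-util tool in dom0 can report a VHD's parent. A minimal sketch, assuming a file-based SR mounted under /var/run/sr-mount/<SR-UUID> and that your build's vhd-util supports the query options shown (treat the exact paths as placeholders):

cd /var/run/sr-mount/<SR-UUID>
vhd-util query -n <leaf-VDI-UUID>.vhd -p    # prints the parent VHD (the golden image), if one exists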

tapdisk is the storage component in dom0 which handles requests from VMs (see here for many more details). For safety reasons, tapdisk opens the underlying VHD files with the O_DIRECT flag. The O_DIRECT flag ensures that dom0's page cache is never used; i.e. all reads come directly from disk and all writes wait until the data has hit the disk (at least as far as the operating system can tell; the data may still be in a hardware buffer). This allows XenServer to be robust in the face of power failures or crashes. Picture a situation where a user saves a photo and the VM flushes the data to its virtual disk, which tapdisk handles and writes to the physical disk. If this write merely went into the page cache as a dirty page and a power failure then occurred, the contract between tapdisk and the VM would be broken, since data would have been lost. Using the O_DIRECT flag avoids this situation and means that once tapdisk has handled a write for a VM, the data is actually on disk.
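To get a feel for what O_DIRECT changes, you can experiment from dom0 with dd, which exposes the same flag. This is purely an illustration of buffered versus direct I/O, not how tapdisk issues its requests, and the test file path is an arbitrary choice:

dd if=/dev/zero of=/tmp/ddtest bs=1M count=256                 # buffered: data lands in the page cache first
dd if=/dev/zero of=/tmp/ddtest bs=1M count=256 oflag=direct    # direct: each block must reach the disk before dd continues
rm -f /tmp/ddtest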

Because no data is ever written to the golden image, we don't need to maintain the safety property mentioned previously. For this reason, tapdisk can elide the O_DIRECT flag when opening a read-only image. This allows the operating system's page cache to be used which can improve performance in a number of ways:

  • The number of physical disk I/O operations is reduced (as a direct consequence of using a cache).
  • Latency is improved since the data path is shorter if data does not need to be read from disk.
  • Throughput is improved since the disk bottleneck is removed.

One of our goals for this feature was that it should have no drawbacks when enabled. An effect we noticed initially was that data appeared to be read twice from disk, which increases the number of I/O operations in the case where data is only read once by the VM. After a little debugging, we found that disabling O_DIRECT causes the kernel to automatically turn on readahead. Because the data access pattern of a VM's disk tends to be quite random, this had a detrimental effect on the overall number of read operations. To fix this, we made use of a POSIX feature, posix_fadvise, which allows an application to inform the kernel how it plans to use a file. In this case, tapdisk tells the kernel that access will be random using the POSIX_FADV_RANDOM flag. The kernel responds by disabling readahead, and the number of read operations drops to the expected value (the same as when O_DIRECT is enabled).

Administration

Because of the difficulty of maintaining cache consistency for storage operations across multiple hosts in a pool, read caching can only be used with file-based SRs, i.e. EXT and NFS SRs. For these SRs, it is enabled by default. There shouldn't be any performance problems associated with this; however, if necessary, it is possible to disable read caching for an SR:

xe sr-param-set uuid=<UUID> other-config:o_direct=true
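Conversely, to check the current setting, or to re-enable read caching after testing, the usual xe parameter commands apply. A small sketch, assuming the same other-config key as above (removing the key lets the SR fall back to the default, with read caching on):

xe sr-param-get uuid=<UUID> param-name=other-config param-key=o_direct      # show the current override, if any
xe sr-param-remove uuid=<UUID> param-name=other-config param-key=o_direct   # remove the override to restore the default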

You may wonder how read caching differs from IntelliCache. The major difference is that IntelliCache caches reads from the network onto a local disk, while in-memory read caching caches reads from either source in memory. The advantage of in-memory read caching is that memory is still an order of magnitude faster than an SSD, so performance in bootstorms and other heavy I/O situations should be improved. It is possible for both to be enabled simultaneously; in this case, reads from the network are cached by IntelliCache to a local disk, and reads from that local disk are cached in memory by read caching. It is still advantageous to have IntelliCache turned on in this situation because the amount of available memory in dom0 may not be enough to cache the entire working set, and reading the remainder from local storage is quicker than reading over the network. IntelliCache further reduces the load on shared storage when using VMs with disks that are not persistent across reboots, by only writing to the local disk rather than to the shared storage.

Talking of available memory, XenServer admins should note that to make best use of read caching, the amount of dom0 memory may need to be increased. Ideally, the amount of dom0 memory would be increased to the size of the golden image so that, once cached, no more reads hit the disk. In case this is not possible, one approach is to temporarily increase the amount of dom0 memory to the size of the golden image, boot up a VM and open the various applications typically used, determine how much dom0 memory is still free, and then reduce dom0's memory by this amount.
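As a sketch of that procedure: the exact mechanism for resizing dom0 memory varies by release, so treat the xen-cmdline invocation below as an assumption to verify against the documentation for your build, and the 4096M figure as an example sized to a hypothetical golden image:

/opt/xensource/libexec/xen-cmdline --set-xen dom0_mem=4096M,max:4096M   # step 1: raise dom0 memory, then reboot the host (path and syntax assumed)
free -m                                                                  # step 2: after booting a VM and exercising its typical applications, see how much dom0 memory remains free
# step 3: reduce dom0_mem by the amount still reported free and reboot again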

Performance Evaluation

Enough talk, let's see some graphs!

[Image: reads.png — bytes read over the network vs. number of VMs booted in parallel, with and without read caching]

In this first graph, we look at the number of bytes read over the network when booting a number of VMs on an NFS SR in parallel. Notice how, without read caching, the number of bytes read scales proportionally with the number of VMs booted, which makes sense since each VM's reads go directly to the disk. When O_DIRECT is removed, the number of bytes read remains constant regardless of the number of VMs booted in parallel. Clearly the in-memory caching is working!

[Image: time.png — boot time per VM vs. number of VMs booted in parallel, with and without read caching]

How does this translate into improvements in boot time? The short answer: see the graph! The longer answer is that it depends on many factors. In the graph, we can see that there is little difference in boot time when booting fewer than 4 VMs in parallel, because the NFS server is able to handle that much traffic concurrently. As the number of VMs increases, the NFS server becomes saturated and the difference in boot time becomes dramatic. It is clear that for this setup, booting many VMs is I/O-limited, so read caching makes a big difference. Finally, you may wonder why the boot time per VM increases slowly with the number of VMs when read caching is enabled. Since the disk is no longer a bottleneck, it appears that some other bottleneck has been revealed, probably CPU contention. In other words, we have transformed an I/O-limited bootstorm into a CPU-limited one! This improvement in boot times would be particularly useful for VDI deployments, where booting many instances of the same VM is a frequent occurrence.

Conclusions

In this blog post, we've seen that in-memory read caching can improve performance in read I/O-limited situations substantially without requiring new hardware, compromising reliability, or requiring much in the way of administration.

As future work to improve in-memory read caching further, we'd like to remove the limitation that it can only use dom0's memory. Instead, we'd like to be able to use the host's entire free memory. This is far more flexible than the current implementation and would remove any need to tweak dom0's memory.

Credits

Thanks to Felipe Franciosi, Damir Derd, Thanos Makatos and Jonathan Davies for feedback and reviews.


Overview of the Performance Improvements between XenServer 6.2 and Creedence Alpha 2

XenServer Creedence Alpha 2 has been released, and one of the main focuses of Alpha 2 was the inclusion of many performance improvements that build on the architectural improvements seen in Alpha 1. This post gives you an overview of these performance improvements in Creedence, and starts a series of in-depth blog posts with more details about the most important ones.

Creedence Alpha 1 introduced several architectural improvements that aim to improve performance and fix a series of scalability limits found in XenServer 6.2:

  • A new 64-bit Dom0 Linux kernel. The 64-bit kernel removes the cumbersome low/high-memory division present in the previous 32-bit Dom0 kernel, which limited the maximum amount of memory that Dom0 could use and added memory-access penalties when Dom0 had more than 752MB RAM. This means that Dom0 memory can now be scaled up arbitrarily to cope with the memory demands of the latest vGPU, disk and network drivers, support for more VMs, and internal caches to speed up disk access (see, for instance, the read-caching item below).

  • Dom0 Linux kernel 3.10 with native support for the Xen Project hypervisor. Creedence Alpha 1 adopted a very recent long-term Linux kernel. This modern Linux kernel contains many concurrency, multiprocessing and architectural improvements over the old xen-Linux 2.6.32 kernel used previously in XenServer 6.2. It contains pvops features to run natively on the Xen Project hypervisor, and streamlined virtualization features used to increase datapath performance, such as a grant memory device that allows Dom0 user space processes to access memory from a guest (as long as the guest agrees in advance). Additionally, the latest drivers from hardware manufacturers containing performance improvements can be adopted more easily.

  • Xen Project hypervisor 4.4. This is the latest Xen Project hypervisor version available, and it improves on the previous version 4.1 on many counts. It vastly increases the number of virtual event channels available for Dom0 -- from 1023 to 131071 -- which can translate into a correspondingly larger number of VMs per host and larger numbers of virtual devices that can be attached to them. XenServer 6.2 used a special interim change that provided 4096 channels, which was enough for around 500 VMs per host with a few virtual devices in each VM. With the extra event channels in version 4.4, Creedence Alpha 1 can endow each of these VMs with a richer set of virtual devices. The Xen Project hypervisor 4.4 also handles grant-copy locking requests more efficiently, improving aggregate network and disk throughput; it facilitates future increases to the supported amount of host memory and CPUs; and it adds many other helpful scalability improvements.

  • Tapdisk3. The latest Dom0 disk backend design has been enabled by default for all guest VBDs. While the previous tapdisk2 in XenServer 6.2 would establish a datapath to the guest in a circuitous way via a Dom0 kernel component, tapdisk3 in Creedence Alpha 1 establishes a datapath connected directly to the guest (via the grant memory device in the new kernel), minimizing latency and using less CPU. This results in big improvements in concurrent disk access and a much larger total aggregate disk throughput for the VBDs. We have measured aggregate disk throughput improvements of up to 100% on modern disks and machines when accessing large block sizes with a large number of threads, and have observed local SSD arrays being maxed out when enough VMs and VBDs were used.

  • GRO enabled by default. Generic Receive Offload is now enabled by default for all PIFs available to Dom0. This means that for GRO-capable NICs, incoming network packets will be transparently merged by the NIC and Dom0 will be interrupted less often to process the incoming data, saving CPU cycles and scaling much better with 10Gbps and 40Gbps networks. We have observed incoming single-stream network throughput improvements of 300% on modern machines. (A quick way to confirm GRO is active on an interface is shown after this list.)

  • Netback thread per VIF. Previously, XenServer 6.2 would have one netback thread for each existing Dom0 VCPU and a VIF would be permanently associated with one Dom0 VCPU. In the worst case, it was possible to end up with many VIFs forcibly sharing the same Dom0 VCPU thread, while other Dom0 VCPU threads were idle but unable to help. Creedence Alpha 2 improves this design and gives each VIF its own Dom0 netback thread that can run on any Dom0 VCPU. Therefore, the VIF load will now be spread evenly across all Dom0 VCPUs in all cases.
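Regarding the GRO item above: the offload state of a dom0 interface can be checked and toggled with ethtool. A small sketch; eth0 is a placeholder for whichever PIF device carries your traffic, and a change made with -K does not persist across reboots:

ethtool -k eth0 | grep generic-receive-offload   # expect "generic-receive-offload: on"
ethtool -K eth0 gro off                          # disable GRO temporarily, e.g. for comparison testing
ethtool -K eth0 gro on                           # re-enable it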

Creedence Alpha 2 then introduced a series of extra performance enhancements on top of the architecture improvements of Creedence Alpha 1:

  • Read-caching. In some situations, several VMs are all cloned from the same base disk and so share much of their data, while the few blocks they write differently are stored in differencing disks unique to each VM. In this case, it is useful to cache the contents of the base disk in memory, so that all the VMs benefit from very fast access to it and the amount of I/O going to and from physical storage is reduced. Creedence Alpha 2 introduces this read caching feature, enabled by default, which we expect to yield substantial performance improvements in the time it takes to boot VMs and in other desktop and server workloads where the VMs mostly share a single base disk.

  • Grant-mapping on the network datapath. The pvops-Linux 3.10 kernel used in Alpha 1 had a VIF datapath that would need to copy the guest's network data into Dom0 before transmitting it to another guest or host. This memory copy operation was expensive and it would saturate the Dom0 VCPUs and limit the network throughput. A new design was introduced in Creedence Alpha 2, which maps the guest's network data into Dom0's memory space instead of copying it. This saves substantial Dom0 VCPU resources that can be used to increase the single-stream and aggregate network throughput even more. With this change, we have measured network throughput improvements of 250% for single-stream and 200% for aggregate stream over XenServer 6.2 on modern machines. 

  • OVS 2.1. An Open vSwitch network flow maps a match on a network packet header to an action such as forward or drop. In OVS 1.4, present in XenServer 6.2, a flow had to be an exact match for the header. A typical server VM can have hundreds or more connections to clients, and OVS would need a flow for each of these connections. If the host had too many such VMs, the OVS flow table in the Dom0 kernel would become full, causing many round-trips to the OVS userspace process and significantly degrading network throughput to and from the guests. Creedence Alpha 2 has the latest OVS 2.1, which supports megaflows. Megaflows are simply a wildcarded language for the flow table, allowing OVS to express a flow as a group of matches. This reduces the number of entries required in the flow table for the most common situations and improves the ability of Dom0 to handle many server VMs connected to large numbers of clients. (A quick way to inspect the kernel flow table is shown below.)
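If you want to see what the datapath flow table looks like in practice, it can be dumped from dom0. A sketch, assuming the vswitch network backend is in use; in OVS 2.1 the wildcarded fields in each entry are what distinguish a megaflow from the exact-match flows of OVS 1.4:

ovs-dpctl dump-flows | head    # each line is one kernel datapath flow; wildcarded fields indicate megaflows
ovs-dpctl dump-flows | wc -l   # total number of entries currently installed in the kernel flow table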

Our goal is to make Creedence the most scalable and fastest XenServer release yet. You can help us reach this goal by testing the performance features above and verifying whether they boost the performance you observe on your existing hardware.

Debug versus non-debug mode in Creedence Alpha

By default, the Creedence Alpha releases use a version of the Xen Project hypervisor with debugging mode enabled to facilitate functional testing. When testing the performance of these releases, you should first switch to the corresponding non-debugging version of the hypervisor, so that you can measure its full potential. So, before you start any performance measurements, please run in Dom0:

cd /boot
ln -sf xen-*-xs?????.gz xen.gz   #points to the non-debug version of the Xen Project hypervisor in /boot

Double-check that the resulting xen.gz symlink is pointing to a valid file and then reboot the host.

You can check if the hypervisor debug mode is currently on or off by executing in Dom0:

xl dmesg | fgrep "Xen version"

and checking if the resulting line has debug=y or debug=n. It should be debug=n for performance tests.

You can reinstate the hypervisor debugging mode by executing in Dom0:

cd /boot
ln -sf xen-*-xs?????-d.gz xen.gz   #points to the debug (-d) version of the Xen Project hypervisor in /boot

and then rebooting the host.

Please report any improvements and regressions you observe on your hardware to the mailing list. And keep an eye out for the next installments of this series!


How did we increase VM density in XenServer 6.2? (part 2)

In a previous article, I described how dom0 event channels can cause a hard limitation on VM density scalability.

Event channels were just one hard limit the XenServer engineering team needed to overcome to allow XenServer 6.2 to support up to 500 Windows VMs or 650 Linux VMs on a single host.

In my talk at the 2013 Xen Developer Summit towards the end of October, I spoke about a further six hard limits and some soft limits that we overcame along the way to achieving this goal. This blog article summarises that journey.

Firstly, I'll explain what I mean by hard and soft VM density limits. A hard limit is where you can run a certain number of VMs without any trouble, but you are unable to run one more. Hard limits arise when there is some finite, unsharable resource that each VM consumes a bit of. On the other hand, a soft limit is where performance degrades with every additional VM you have running; there will be a point at which it's impractical to run more than a certain number of VMs because they will be unusable in some sense. Soft limits arise when there is a shared resource that all VMs must compete for, such as CPU time.

Here is a run-down of all seven hard limits, how we mitigated them in XenServer 6.2, and how we might be able to push them even further back in future:

  1. dom0 event channels

    • Cause of limitation: XenServer uses a 32-bit dom0. This means a maximum of 1,024 dom0 event channels.
    • Mitigation for XenServer 6.2: We made a special case for dom0 to allow it up to 4,096 dom0 event channels.
    • Mitigation for future: Adopt David Vrabel's proposed change to the Xen ABI to provide unlimited event channels.
  2. blktap2 device minor numbers

    • Cause of limitation: blktap2 only supports up to 1,024 minor numbers, caused by #define MAX_BLKTAP_DEVICE in blktap.h.
    • Mitigation for XenServer 6.2: We doubled that constant to allow up to 2,048 devices.
    • Mitigation for future: Move away from blktap2 altogether?
  3. aio requests in dom0

    • Cause of limitation: Each blktap2 instance creates an asynchronous I/O context for receiving 402 events; the default system-wide number of aio requests (fs.aio-max-nr) was 444,416 in XenServer 6.1.
    • Mitigation for XenServer 6.2: We set fs.aio-max-nr to 1,048,576.
    • Mitigation for future: Increase this parameter yet further. It's not clear whether there's a ceiling, but it looks like this would be okay. (A quick way to check the limit and current usage is shown after this list.)
  4. dom0 grant references

    • Cause of limitation: Windows VMs used receive-side copy (RSC) by default in XenServer 6.1. In netbk_p1_setup, netback allocates 22 grant-table entries per virtual interface for RSC. But dom0 only had a total of 8,192 grant-table entries in XenServer 6.1.
    • Mitigation for XenServer 6.2: We could have increased the size of the grant-table, but for other reasons RSC is no longer the default for Windows VMs in XenServer 6.2, so this limitation no longer applies.
    • Mitigation for future: Continue to leave RSC disabled by default.
  5. Connections to xenstored

    • Cause of limitation: xenstored uses select(2), which can only listen on up to 1,024 file descriptors; qemu opens 3 file descriptors to xenstored.
    • Mitigation for XenServer 6.2: We made two qemu watches share a connection.
    • Mitigation for future: We could modify xenstored to accept more connections, but in the future we expect to be using upstream qemu, which doesn't connect to xenstored, so it's unlikely that xenstored will run out of connections.
  6. Connections to consoled

    • Cause of limitation: Similarly, consoled uses select(2), and each PV domain opens 3 file descriptors to consoled.
    • Mitigation for XenServer 6.2: We use poll(2) rather than select(2). This has no such limitation.
    • Mitigation for future: Continue to use poll(2).
  7. dom0 low memory

    • Cause of limitation: Each running VM eats about 1 MB of dom0 low memory.
    • Mitigation for future: Using a 64-bit dom0 would remove this limit.
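Relating to item 3 above, both the aio limit and the current usage are visible through sysctl and procfs, so you can check how close a host is to the ceiling. A small sketch:

sysctl fs.aio-max-nr        # the system-wide limit (1,048,576 on XenServer 6.2)
cat /proc/sys/fs/aio-nr     # aio requests currently allocated by blktap2 instances and any other users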

Summary of limits

Okay, so what does this all mean in terms of how many VMs you can run on a host? Well, since some of the limits concern your VM configuration, it depends on the type of VM you have in mind.

Let's take the example of Windows VMs with PV drivers, each with 1 vCPU, 3 disks and 1 network interface. Here are the number of those VMs you'd have to run on a host in order to hit each limitation:

Limitation              XS 6.1     XS 6.2     Future
dom0 event channels     150 *      570        no limit
blktap minor numbers    341        682        no limit
aio requests            368        869        no limit
dom0 grant references   372        no limit   no limit
xenstored connections   333        500 *      no limit
consoled connections    no limit   no limit   no limit
dom0 low memory         650        650        no limit

The first limit you'd reach in each release is marked with an asterisk. So the overall limit is event channels in XenServer 6.1, limiting us to 150 of these VMs. In XenServer 6.2, it's the number of xenstored connections that limits us to 500 VMs per host. In the future, none of these limits will hit us, but there will surely be an eighth limit when running many more than 500 VMs on a host.

What about Linux guests? Here's where we stand for paravirtualised Linux VMs each with 1 vCPU, 1 disk and 1 network interface:

Limitation              XS 6.1     XS 6.2     Future
dom0 event channels     225 *      1000       no limit
blktap minor numbers    1024       2048       no limit
aio requests            368        869        no limit
dom0 grant references   no limit   no limit   no limit
xenstored connections   no limit   no limit   no limit
consoled connections    341        no limit   no limit
dom0 low memory         650        650 *      no limit

This explains why the supported limit for Linux guests can be as high as 650 in XenServer 6.2. Again, in the future, we'll likely be limited by something else above 650 VMs.

What about the soft limits?

After having pushed the hard limits such a long way out, we then needed to turn our attention towards ensuring that there weren't any soft limits that would make it infeasible to run a large number of VMs in practice.

Felipe Franciosi has already described how qemu's utilisation of dom0 CPUs can be reduced by avoiding the emulation of unneeded virtual devices. The other major change in XenServer 6.2 to reduce dom0 load was to reduce the amount of xenstore traffic. This was achieved by replacing code that polled xenstore with code that registers watches on xenstore and by removing some spurious xenstore accesses from the Windows guest agent.

These things combine to keep dom0 CPU load down to a very low level, which means that VMs can remain healthy and responsive even when a very large number of them are running on the host.


How did we increase VM density in XenServer 6.2?

One of the most noteworthy improvements in XenServer 6.2 is the support for a significantly increased number of VMs running on a host: now up to 500 Windows VMs or 650 Linux VMs.

We needed to remove several obstacles in order to achieve this huge step up. Perhaps the most important of the technical changes that led to this was the increase in the number of event channels available to dom0 (the control domain) from 1024 to 4096. This blog post is an attempt to shed some light on what these event channels are and why they play a key role in VM density limits.

What is an event channel?

It's a channel for communications between a pair of VMs. An event channel is typically used by one VM to notify another VM about something. For example, a VM's paravirtualised disk driver would use an event channel to notify dom0 of the presence of newly written data in a region of memory shared with dom0.

Here are the various things that a VM requires an event channel for:

  • one per virtual disk;
  • one per virtual network interface;
  • one for communications with xenstore;
  • for HVM guests, one per virtual CPU (rising to two in XenServer 6.2); and
  • for PV guests, one to communicate with the console daemon.


A VM will therefore typically require at least four dom0 event channels, depending on its configuration; requiring more than ten is not uncommon.

Why can event channels cause scalability problems when trying to run lots of VMs?

The total number of event channels any domain can use is part of a shared structure in the interface between a paravirtualised VM and the hypervisor; it is fixed at 1024 for 32-bit domains such as XenServer's dom0. Moreover, there are normally around 50--100 event channels used for other purposes, such as physical interrupts. This is normally related to the number of physical devices you have in your host. This overhead means that in practice there might be not too many more than 900--950 event channels available for VM use. So the number of available event channels becomes a limited resource that can cause you to experience a hard limit on the number of VMs you can run on a host.

To take an example: Before XenServer 6.2, if each of your VMs requires 6 dom0 event channels (e.g. an HVM guest with 3 virtual disks, 1 virtual network interface and 1 virtual CPU) then you'll probably find yourself running out of dom0 event channels if you go much over 150 VMs.

In XenServer 6.2, we have made a special case for our dom0, allowing it to behave differently from other 32-bit domains and to use up to four times the normal number of event channels. Hence there are now a total of 4096 event channels available.

So, on XenServer 6.2 in the same scenario as the example above, even though each VM of this type would now use 7 dom0 event channels, the increased total number of dom0 event channels means you'd have to run over 570 of them before running out.
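The arithmetic behind those two figures is straightforward. A rough sketch, assuming roughly 100 event channels of fixed dom0 overhead as described above:

echo $(( (1024 - 100) / 6 ))   # ~154: VMs of this type before dom0 event channels run out on XenServer 6.1
echo $(( (4096 - 100) / 7 ))   # ~570: the same calculation for XenServer 6.2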

What happens when I run out of event channels?

On VM startup, the XenServer toolstack will try to plumb all the event channels through from dom0 to the nascent VM. If there are no spare slots, the connection will fail. The exact failure mode depends on which subsystem the event channel was intended for use in, but you may see error messages like these when the toolstack tries to connect up the next event channel after having run out:

error 28 mapping ring-refs and evtchn
message: xenopsd internal error: Device.Ioemu_failed("qemu-dm exited unexpectedly")

In other words, it's not pretty. The VM either won't boot or will run with reduced functionality.

That sounds scary. How can I tell whether there are enough spare event channels to start another VM?

XenServer has a utility called "lsevtchn" that allows you to inspect the event channel plumbing.

In dom0, run the following command to see what event channels are connected to a particular domain.

/usr/lib/xen/bin/lsevtchn <domid>

For example, here is the output from a PV domain with domid 36:

[root@xs62 ~]# /usr/lib/xen/bin/lsevtchn 36
   1: VCPU 0: Interdomain (Connected) - Remote Domain 0, Port 51
   2: VCPU 0: Interdomain (Connected) - Remote Domain 0, Port 52
   3: VCPU 0: Virtual IRQ 0
   4: VCPU 0: IPI
   5: VCPU 0: IPI
   6: VCPU 0: Virtual IRQ 1
   7: VCPU 0: IPI
   8: VCPU 0: Interdomain (Connected) - Remote Domain 0, Port 55
   9: VCPU 0: Interdomain (Connected) - Remote Domain 0, Port 53
  10: VCPU 0: Interdomain (Connected) - Remote Domain 0, Port 54
  11: VCPU 0: Interdomain (Connected) - Remote Domain 0, Port 56

You can see that six of this VM's event channels are connected to dom0.

But the domain we are most interested in is dom0. The total number of event channels connected to dom0 can be determined by running

/usr/lib/xen/bin/lsevtchn 0 | wc -l

Before XenServer 6.2, if that number is close to 1024 then your host is on the verge of not being able to run an additional VM. On XenServer 6.2, the number to watch out for is 4096. However, before you'd be able to get enough VMs up and running to approach that limit, there are various other things you might run into depending on configuration and workload. Watch out for further blog posts describing how we have cleared more of these hurdles in XenServer 6.2.
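As a convenience, the count and the remaining headroom can be combined into a one-off check in dom0. A small sketch, assuming the 4096-channel limit of XenServer 6.2:

used=$(/usr/lib/xen/bin/lsevtchn 0 | wc -l)
echo "dom0 event channels in use: $used (headroom: $(( 4096 - used )))"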


About XenServer

XenServer is the leading open source virtualization platform, powered by the Xen Project hypervisor and the XAPI toolstack. It is used in the world's largest clouds and enterprises.
 
Commercial support for XenServer is available from Citrix.