
Overview of the Performance Improvements between XenServer 6.2 and Creedence Alpha 2

XenServer Creedence Alpha 2 has been released, and one of its main focuses was the inclusion of many performance improvements that build on the architectural improvements introduced in Alpha 1. This post gives you an overview of these performance improvements in Creedence, and it starts a series of in-depth blog posts with more details about the most important ones.

Creedence Alpha 1 introduced several architectural improvements that aim to improve performance and remove a series of scalability limits found in XenServer 6.2 (a sketch of Dom0 commands to verify these components on a running host follows the list):

  • A new 64-bit Dom0 Linux kernel. The 64-bit kernel removes the cumbersome low/high-memory division present in the previous 32-bit Dom0 kernel, which limited the maximum amount of memory that Dom0 could use and added memory access penalties in a Dom0 with more than 752MB RAM. This means that the Dom0 memory can now be scaled up arbitrarily to cope with the memory demands of the latest vGPU, disk and network drivers, with support for more VMs, and with internal caches that speed up disk access (see, for instance, the Read-caching section below).

  • Dom0 Linux kernel 3.10 with native support for the Xen Project hypervisor. Creedence Alpha 1 adopted a recent long-term Linux kernel. This modern kernel contains many concurrency, multiprocessing and architectural improvements over the old XenLinux 2.6.32 kernel used in XenServer 6.2. It runs natively on the Xen Project hypervisor thanks to its pvops features, and it provides streamlined virtualization features that increase datapath performance, such as a grant memory device that allows Dom0 user-space processes to access memory from a guest (as long as the guest agrees in advance). Additionally, the latest drivers from hardware manufacturers, containing further performance improvements, can be adopted more easily.

  • Xen Project hypervisor 4.4. This is the latest Xen Project hypervisor version available, and it improves on the previous version 4.1 in many respects. It vastly increases the number of virtual event channels available to Dom0 -- from 1023 to 131071 -- which can translate into a correspondingly larger number of VMs per host and a larger number of virtual devices that can be attached to them. XenServer 6.2 used a special interim change that provided 4096 channels, which was enough for around 500 VMs per host with a few virtual devices in each VM. With the extra event channels in version 4.4, Creedence Alpha 1 can endow each of these VMs with a richer set of virtual devices. The Xen Project hypervisor 4.4 also handles grant-copy locking requests more efficiently, improving aggregate network and disk throughput; it facilitates future increases in the supported amount of host memory and CPUs; and it adds many other helpful scalability improvements.

  • Tapdisk3. The latest Dom0 disk backend design is now enabled by default for all guest VBDs. While the previous tapdisk2 in XenServer 6.2 established a datapath to the guest in a circuitous way via a Dom0 kernel component, tapdisk3 in Creedence Alpha 1 establishes a datapath connected directly to the guest (via the grant memory device in the new kernel), minimizing latency and using less CPU. This results in big improvements in concurrent disk access and a much larger total aggregate disk throughput for the VBDs. We have measured aggregate disk throughput improvements of up to 100% on modern disks and machines accessing large block sizes with a large number of threads, and we have observed local SSD arrays being maxed out when enough VMs and VBDs were used.

  • GRO enabled by default. Generic Receive Offload is now enabled by default for all PIFs available to Dom0. This means that, for GRO-capable NICs, incoming network packets will be transparently coalesced on the receive path, so Dom0 has to process fewer, larger packets, saving CPU cycles and scaling much better with 10Gbps and 40Gbps networks. We have observed incoming single-stream network throughput improvements of 300% on modern machines.

  • Netback thread per VIF. Previously, XenServer 6.2 would have one netback thread for each existing Dom0 VCPU and a VIF would be permanently associated with one Dom0 VCPU. In the worst case, it was possible to end up with many VIFs forcibly sharing the same Dom0 VCPU thread, while other Dom0 VCPU threads were idle but unable to help. Creedence Alpha 2 improves this design and gives each VIF its own Dom0 netback thread that can run on any Dom0 VCPU. Therefore, the VIF load will now be spread evenly across all Dom0 VCPUs in all cases.
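
If you would like to confirm that these components are present on your own Creedence Alpha host, the following Dom0 commands are one way to do it. This is only an illustrative sketch: eth0 is a placeholder for one of your physical interfaces, the device names under /dev/xen can vary, and the exact names of the netback and tapdisk threads depend on the kernel and product build.

uname -rm                                        # expect a 3.10.x x86_64 Dom0 kernel
xl info | egrep "xen_major|xen_minor"            # expect Xen Project hypervisor 4.4
free -m                                          # amount of memory currently assigned to Dom0
ls /dev/xen/ 2>/dev/null                         # grant memory devices (e.g. gntdev) typically appear here
ethtool -k eth0 | grep generic-receive-offload   # expect "on" for GRO-capable NICs
ps -ef | egrep "vif|tapdisk" | grep -v grep      # per-VIF netback threads and per-VBD tapdisk3 processes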

Creedence Alpha 2 then introduced a series of extra performance enhancements on top of the architectural improvements of Creedence Alpha 1:

  • Read-caching. In some situations, several VMs are all cloned from the same base disk, so they share much of their data, while the few blocks each VM writes are stored in differencing disks unique to that VM. In this case, it is useful to cache the contents of the base disk in memory, so that all the VMs benefit from very fast access to it, reducing the amount of I/O going to and from physical storage. Creedence Alpha 2 introduces this read-caching feature, enabled by default, which we expect to yield substantial performance improvements in the time it takes to boot VMs and in other desktop and server workloads where the VMs mostly share a single base disk.

  • Grant-mapping on the network datapath. The pvops-Linux 3.10 kernel used in Alpha 1 had a VIF datapath that would need to copy the guest's network data into Dom0 before transmitting it to another guest or host. This memory copy operation was expensive and it would saturate the Dom0 VCPUs and limit the network throughput. A new design was introduced in Creedence Alpha 2, which maps the guest's network data into Dom0's memory space instead of copying it. This saves substantial Dom0 VCPU resources that can be used to increase the single-stream and aggregate network throughput even more. With this change, we have measured network throughput improvements of 250% for single-stream and 200% for aggregate stream over XenServer 6.2 on modern machines. 

  • OVS 2.1. An Open vSwitch (OVS) network flow is a match between a network packet header and an action such as forward or drop. In OVS 1.4, present in XenServer 6.2, a flow had to match the header exactly. A typical server VM could have hundreds or more connections to clients, and OVS would need a flow for each of these connections. If the host had too many such VMs, the OVS flow table in the Dom0 kernel would become full and cause many round-trips to the OVS userspace process, significantly degrading the network throughput to and from the guests. Creedence Alpha 2 has the latest OVS 2.1, which supports megaflows. Megaflows are essentially wildcarded flow-table entries that allow OVS to express a single flow as a group of matches, reducing the number of entries required in the flow table for the most common situations and improving the ability of Dom0 to handle many server VMs connected to a large number of clients (a quick way to inspect this from Dom0 is sketched after this list).
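
As a quick sanity check of the OVS upgrade, you can inspect the version and the kernel datapath flow table from Dom0. This is just an illustrative sketch; the flow count you see depends entirely on the traffic currently flowing through the host.

ovs-vsctl --version              # expect Open vSwitch 2.1.x on Creedence Alpha 2
ovs-dpctl dump-flows | wc -l     # number of flows installed in the kernel datapath;
                                 # with megaflows this should stay modest even with many client connections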

Our goal is to make Creedence the most scalable and fastest XenServer release yet. You can help us reach this goal by testing the performance features above and verifying whether they boost the performance you observe on your existing hardware.
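
If you would like to run some quick tests of your own, the sketch below shows one very simple way to exercise the network and disk datapaths. It makes several assumptions: iperf is installed at both ends of the network test, 192.0.2.10 is a placeholder for the receiver's IP address, and /dev/xvdb is a placeholder for a scratch data disk inside a guest whose contents you do not care about.

# Network: single-stream and multi-stream throughput between two guests (or a guest and an external host)
iperf -s                   # run on the receiving end
iperf -c 192.0.2.10        # single stream, run on the sending end
iperf -c 192.0.2.10 -P 8   # eight parallel streams

# Disk: crude sequential write and read against a scratch VBD inside a guest (destroys its contents)
dd if=/dev/zero of=/dev/xvdb bs=1M count=4096 oflag=direct
dd if=/dev/xvdb of=/dev/null bs=1M count=4096 iflag=direct

More thorough tools such as bonnie++ (used in the benchmark results in the comments below) will of course give a much more complete picture.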

Debug versus non-debug mode in Creedence Alpha

The Creedence Alpha releases use, by default, a version of the Xen Project hypervisor with debugging mode enabled to facilitate functional testing. When testing the performance of these releases, you should first switch to the corresponding non-debug version of the hypervisor, so that it can run at its full potential. Before you start any performance measurements, please run the following in Dom0:

cd /boot
ln -sf xen-*-xs?????.gz xen.gz   #points to the non-debug version of the Xen Project hypervisor in /boot

Double-check that the resulting xen.gz symlink is pointing to a valid file and then reboot the host.
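
For example, one way to double-check the symlink (the exact file name it resolves to will depend on your build):

ls -l /boot/xen.gz        # should be a symlink to a file without the -d (debug) suffix
readlink -f /boot/xen.gz  # prints the full path of the hypervisor image that will be booted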

You can check if the hypervisor debug mode is currently on or off by executing in Dom0:

xl dmesg | fgrep "Xen version"

and checking if the resulting line has debug=y or debug=n. It should be debug=n for performance tests.
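
If you prefer a one-liner, the following sketch prints a simple verdict based on the same banner (it assumes, as described above, that the banner contains either debug=y or debug=n):

xl dmesg | fgrep "Xen version" | grep -q "debug=n" && echo "non-debug hypervisor" || echo "debug hypervisor (or banner not found)"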

You can reinstate the hypervisor debugging mode by executing in Dom0:

cd /boot
ln -sf xen-*-xs?????-d.gz xen.gz   #points to the debug (-d) version of the Xen Project hypervisor in /boot

and then rebooting the host.

Please report any improvements and regressions you observe on your hardware to the mailing list. And keep an eye out for the next installments of this series!

Comments (11)

Tobias Kreidl on Wednesday, 18 June 2014 19:55

Turning off debugging will be very interesting to compare with the standard debug setting. I already have a set of benchmarks I ran under the standard (debug) release with bonnie++.
-=Tobias

Guest - james on Thursday, 19 June 2014 17:51

Can you post your test results, just for the community to see how XS Creedence alpha-2 performs?

Tobias Kreidl on Friday, 20 June 2014 04:30

@James:
Sure... Benchmarks with bonnie++ (V1.93c) on XenServer Creedence Alpha.2 with dom0 increased to 2 GB memory, otherwise the stock version, using the same NFS SR. Dell R200 platform. Linux VM running RHEL5 with 2 VCPUs and 2 GB RAM.

Values are representative of one of several runs, all of which generally delivered similar results (within a few percent). I would consider anything around 5% or less to be essentially a tie.
-=Tobias


                       Debug Off        Debug On        I/O difference
operation              KB/sec  %CPU     KB/sec  %CPU    Off vs. On
seq. output block       94427    17      81274    74       +14%
seq. output rewrite     49804     4      50074     6        -1%
seq. input block       117317     2     116485     3        +7%
random seek              8076    51       5891    61       +37%

seq. create create      38363    84      34090    80       +13%
seq. create read       168882    99     142384    99       +19%
seq. create delete      25733    50      19750    41       +30%

random create           35691    78      34023    79        +5%
random read            186617   100     154769   100       +21%
random delete           11109    23      10820    23        +3%

Guest - james on Friday, 20 June 2014 18:03

Thanks!
Any luck with an iSCSI SR performance test in comparison with XS 6.x?

Tobias Kreidl on Monday, 23 June 2014 17:59

Sure, here they are. The random operations seem to be worse for Alpha.2 in this case, but see an important note about this further down. The iSCSI connection is to an entirely different storage array, so these numbers should not be compared with the NFS benchmarks. Also, in all fairness, I have run additional tests on a different iSCSI array with more spindles and better overall throughput, and some of these values have increased significantly, though I do not have the same setup available to run comparable XS 6.2 tests. The most dramatic change was for random deletes, which improved by almost a factor of ten and, aside from read operations, were at least a factor of three greater for all other operations.

iSCSI                  Creedence Alpha.2    XenServer 6.2
operation              Debug Off            Stock             I/O difference
                       KB/sec  %CPU         KB/sec  %CPU      Alpha.2 vs. 6.2
seq. output block       10877     0           7261     0         +50%
seq. output rewrite      9729     0           7014     0         +39%
seq. input block       116975     2          76469     0         +53%
random seek              1479    13           1607    11          -8%

seq. create create      11637    25          11855    27          -2%
seq. create read       167521    99         142089    99         +18%
seq. create delete       3126     6           3512     7         -11%

random create           12144    26          11802    27          +3%
random read             84838   100         154270   100         -45%
random delete            1170     2           1543     3         -24%

Guest - james on Monday, 23 June 2014 18:25

@Tobias,
thanks for posting the numbers! It will be interesting to see a blog post from Citrix on their internal perf test numbers (one from Felipe as a follow-up to the tapdisk3 post earlier...).
Of course, there is a lot of interest and excitement within the XS community about the performance gains in alpha-2, and people will find your results helpful.
best,
-James

Tobias Kreidl on Monday, 23 June 2014 18:48

It will be great to see some internal reports. Felipe did some very interesting tests that identified some bottlenecks with SSD drives, so I very much look forward to his results on Creedence.

Just to give an idea of the difference in the improved iSCSI results (which are, BTW, very close to those from NFS using the same storage array geometry), here they are. Sorry, it's a pretty lightweight server, so it runs out of CPU power early on. Some tests with more powerful processors have yielded rates of over 200k for some metrics.

Creedence Alpha.2 iSCSI (Dell R200, debug=off, dom0=2GB x 2)
Operation              KB/sec  %CPU
seq. output block       65671     7
seq. output rewrite     39550     3
seq. input block        78333     0
random seek              6660    58

seq. create create      34537    75
seq. create read       168601    99
seq. create delete      19880    39

random create           35000    76
random read            187771   100
random delete           10344    21

Guest - Leonardo on Friday, 27 June 2014 14:15

Instead of a Windows-only application (XenCenter), why is XenServer not provided with a web interface for administration in the hypervisor itself? I think it would be simple and accessible from all operating systems.

Guest - midnight man on Sunday, 06 July 2014 00:00

Xen Orchestra (https://xen-orchestra.com) - Web GUI for XenServer and XAPI

Guest - james on Friday, 04 July 2014 17:33

Try Xen Orchestra - a web GUI for XenServer.

Tobias Kreidl on Friday, 27 June 2014 16:02

I am hoping to soon see a report from Felipe Franciosi on I/O tests on Creedence vs. XS 6.2!

