Virtualization Blog

Discussions and observations on virtualization.

XenServer 7.0 performance improvements part 4: Aggregate I/O throughput improvements

The XenServer team has made a number of significant performance and scalability improvements in the XenServer 7.0 release. This is the fourth in a series of articles that will describe the principal improvements. For the previous ones, see:

  1. http://xenserver.org/blog/entry/dundee-tapdisk3-polling.html
  2. http://xenserver.org/blog/entry/dundee-networking-multi-queue.html
  3. http://xenserver.org/blog/entry/dundee-parallel-vbd-operations.html

In this article we return to the theme of I/O throughput. Specifically, we focus on improvements to the total throughput achieved by a number of VMs performing I/O concurrently. Measurements show that XenServer 7.0 achieves aggregate network throughput more than three times that of XenServer 6.5, and also improves aggregate storage throughput.

What limits aggregate I/O throughput?

When a number of VMs are performing I/O concurrently, the total throughput that can be achieved is often limited by dom0 becoming fully busy, meaning it cannot do any additional work per unit time. The I/O backends (netback for network I/O and tapdisk3 for storage I/O) together consume 100% of available dom0 CPU time.

How can this limit be overcome?

Whenever there is a CPU bottleneck like this, there are two possible approaches to improving the performance:

  1. Reduce the amount of CPU time required to perform I/O.
  2. Increase the processing capacity of dom0, by giving it more vCPUs.
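To make the two options concrete, here is a minimal back-of-the-envelope sketch. It assumes dom0 CPU is the only limit and that every unit of I/O costs a fixed amount of dom0 CPU time. The function name and the numbers (8 or 16 vCPUs, 0.4 or 0.2 CPU-seconds per Gbit) are hypothetical and are not measurements from this article.

```python
def max_aggregate_throughput(dom0_vcpus, cpu_seconds_per_gbit):
    """Upper bound on aggregate throughput (Gbit/s) if dom0 CPU is the only limit.

    dom0_vcpus: CPU-seconds of backend work dom0 can do per second of wall time.
    cpu_seconds_per_gbit: hypothetical dom0 CPU cost of moving one Gbit.
    """
    if dom0_vcpus <= 0 or cpu_seconds_per_gbit <= 0:
        raise ValueError("both arguments must be positive")
    return dom0_vcpus / cpu_seconds_per_gbit


# Hypothetical figures, chosen only to illustrate the two approaches.
baseline = max_aggregate_throughput(8, 0.4)
cheaper_io = max_aggregate_throughput(8, 0.2)    # approach 1: halve the CPU cost per Gbit
more_vcpus = max_aggregate_throughput(16, 0.4)   # approach 2: double the dom0 vCPUs
```

In this idealised model both approaches double the ceiling. The model deliberately ignores any per-operation cost that grows with the number of vCPUs.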

Surely approach 2 is easy and will give a quick win...? Intuitively, we might expect the total throughput to increase proportionally with the number of dom0 vCPUs.

Unfortunately it's not as straightforward as that. The following graph shows what happens to the aggregate network throughput on XenServer 6.5 when the number of dom0 vCPUs is artificially increased. (In this case, we are measuring the total network throughput of 40 VMs communicating amongst themselves on a single Dell R730 host.)

b2ap3_thumbnail_5179.png

Counter-intuitively, the aggregate throughput decreases as we add more processing power to dom0! (This explains why the default was at most 8 vCPUs in XenServer 6.5.)

So is there no hope for giving dom0 more processing power...?

The explanation for the degradation in performance is that certain operations run more slowly when there are more vCPUs present. In order to make dom0 work better with more vCPUs, we needed to understand what those operations are, and whether they can be made to scale better.

Three such areas of poor scalability were discovered deep in the innards of Xen by Malcolm Crossley and David Vrabel, and improvements were made for each:

  1. Maptrack lock contention – improved by http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=dff515dfeac4c1c13422a128c558ac21ddc6c8db
  2. Grant-table lock contention – improved by http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=b4650e9a96d78b87ccf7deb4f74733ccfcc64db5
  3. TLB flush on grant-unmap – improved by https://github.com/xenserver/xen-4.6.pg/blob/master/master/avoid-gnt-unmap-tlb-flush-if-not-accessed.patch
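To see how shared locks and cross-CPU work can make throughput fall as vCPUs are added, here is a toy model. It uses the generic Universal Scalability Law, which is swapped in purely for illustration: it is not how the Xen changes above were analysed, and the coefficients are invented rather than fitted to the graphs in this article.

```python
def relative_throughput(vcpus, contention, coherency):
    """Universal Scalability Law: throughput relative to a single vCPU.

    contention: fraction of work serialised behind shared locks (hypothetical).
    coherency: pairwise cross-CPU cost, e.g. work that touches every other CPU (hypothetical).
    """
    n = vcpus
    return n / (1 + contention * (n - 1) + coherency * n * (n - 1))


def best_vcpu_count(contention, coherency, max_vcpus=32):
    """vCPU count in 1..max_vcpus that maximises the modelled throughput."""
    return max(
        range(1, max_vcpus + 1),
        key=lambda n: relative_throughput(n, contention, coherency),
    )


# Made-up coefficients: a heavily contended system and a much less contended one.
contended_peak = best_vcpu_count(0.05, 0.02)
improved_peak = best_vcpu_count(0.01, 0.0001)
```

With the first set of coefficients the modelled throughput peaks at a small vCPU count and then declines. With the smaller coefficients it keeps rising across the whole range, which is the qualitative behaviour one would hope for after reducing lock contention and avoiding unnecessary TLB flushes.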

The result of improving these areas is dramatic – see the green line in the following graph:

b2ap3_thumbnail_4190.png

Now, throughput scales very well as the number of vCPUs increases. This means that, for the first time, it is now beneficial to allocate many vCPUs to dom0 – so that when there is demand, dom0 can deliver. Hence we have given XenServer 7.0 a higher default number of dom0 vCPUs.

How many vCPUs are now allocated to dom0 by default?

Most hosts will now get 16 vCPUs by default, but the exact number depends on the number of CPU cores on the host. The following graph summarises how the default number of dom0 vCPUs is calculated from the number of CPU cores on various current and historic XenServer releases:

b2ap3_thumbnail_dom0-vcpus.png

Summary of improvements

I will conclude with some aggregate I/O measurements comparing XenServer 6.5 and 7.0 under default settings (no dom0 configuration changes) on a Dell R730xd.

  1. Aggregate network throughput – twenty pairs of 32-bit Debian 6.0 VMs sending and receiving traffic generated with iperf 2.0.5.
    b2ap3_thumbnail_aggr-intrahost-r730_20160729-093608_1.png
  2. Aggregate storage IOPS – twenty 32-bit Windows 7 SP1 VMs each doing single-threaded, serial, sequential 4KB reads with fio to a virtual disk on an Intel P3700 NVMe drive.
    b2ap3_thumbnail_storage-iops-aggr-p3700-win7.png

Improving dom0 Responsiveness Under Load

Recently, the XenServer Engineering Team has been working on improving the responsiveness of the control domain when it is under heavy load. Many VMs doing lots of I/O operations can prevent one from connecting to the host through ssh, or cause the XenCenter session to disconnect for no apparent reason. This happens when the control plane is starved by the datapath, leaving very little CPU for such important processes as sshd or xapi. Let's have a look at how much time it takes to repeatedly execute a simple xe vm-list command on a host with 20 VM pairs doing heavy network communication:

b2ap3_thumbnail_01.png

Most of the commands took around 2-3 seconds to complete, but some of them took as long as 30 seconds. The 2-3 seconds is slower than it should be, and 20-30 seconds is way outside of a reasonable operating window. The slow reaction time of 3 seconds and the heavy spikes of 30 seconds visible on the graph above are two separate issues affecting the responsiveness of the control commands. Let's tackle them one by one.
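A measurement of this kind can be reproduced with a small timing loop. The following is a minimal sketch, not the tooling used to produce the graph above. The helper name time_command is hypothetical, and the sample run times a trivial Python process so that the snippet works on any machine. On a XenServer host the argument list would be ["xe", "vm-list"].

```python
import subprocess
import sys
import time


def time_command(argv, repeats):
    """Run argv `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        # Discard the command's output; only the elapsed time is of interest.
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=True,
        )
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in command so the sketch runs anywhere; replace with ["xe", "vm-list"] on a host.
samples = time_command([sys.executable, "-c", "pass"], repeats=3)
slowest = max(samples)
```

Plotting the samples in order of execution gives the same kind of picture as the graph: a baseline latency with occasional spikes.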

To fix the 2-3 second slowdown, we took advantage of the Linux kernel feature called cgroups (control groups). Cgroups allow processes to be aggregated into separate hierarchies that manage their access to various resources (CPU, memory, network). In our case, we used CPU resource isolation, placing all control path daemons in the cpu control group subsystem. Giving them ten times more CPU share than the datapath processes guarantees that they get enough computing power to execute control operations in a timely fashion. It is worth pointing out that this does not slow down the datapath when the control plane is idle: the datapath reduces its CPU usage only when control operations need to run.
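The effect of a 10:1 share ratio can be worked through with a few lines of arithmetic. This is a simplified sketch that assumes two sibling groups competing for a single pool of CPU time. The group names and the helper cpu_fractions are hypothetical, and the real scheduler deals with hierarchies and multiple CPUs that this ignores.

```python
def cpu_fractions(shares, busy):
    """Split CPU time among the busy groups in proportion to their shares.

    shares: dict mapping group name to its CPU share weight.
    busy: set of group names that currently want CPU time.
    Idle groups receive 0.0, so their entitlement goes to whoever is busy.
    """
    total = sum(shares[name] for name in busy)
    return {
        name: (shares[name] / total if name in busy else 0.0)
        for name in shares
    }


# Ten times more share for the control path, as described above.
shares = {"control": 10, "datapath": 1}

both_busy = cpu_fractions(shares, {"control", "datapath"})
control_idle = cpu_fractions(shares, {"datapath"})
```

When both groups are busy, the control path is entitled to 10/11 of the CPU time (about 91%). When the control path is idle, the datapath gets all of it, which is why the datapath is not penalised in normal operation.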

b2ap3_thumbnail_02_20151222-162855_1.png

We can see that the majority of the commands took just a fraction of a second to execute, which solves the first of our problems.

What about the commands that took 20-30 seconds to print out the list of VMs? This was caused by the way in which xapi handles the creation of threads, limiting the rate based on current load and memory usage in dom0. When the load gets too high, there are not enough xapi threads to handle the requests, which results in periodic spikes in the execution times of the xe commands. This feature was useful when dom0 was 32-bit and an increased number of threads could have threatened the stability of the whole system. Now that dom0 is 64-bit, and with the control groups enabled, we decided it is perfectly safe to remove xapi’s thread-limiting feature.

With these changes applied, the execution times of control path commands became as one would expect them to be:

b2ap3_thumbnail_03_20151222-162856_1.png

In spite of heavy I/O load, control path processes receive all the CPU they need and complete their work without delay, leaving the user with a nicely responsive host regardless of the load in the guests. This makes a tremendous difference to the user experience when interacting with the host via XenCenter, the xe CLI or over SSH.

Another real-world example in which we expected significant improvements is the bootstorm. In this benchmark we start more than a hundred VMs and measure how much time it takes for the guests to become fully operational (time measured from starting the 1st VM to the completion of the n-th VM). The usual strategy is to run 25 VMs at a time. The following graph compares the results before and after the changes:
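The bookkeeping behind such a benchmark can be sketched as follows. This is an illustrative outline only: the helper names, the VM names and the timestamps are hypothetical, and the sketch does not say how the real harness decides when to launch the next group of 25.

```python
def batches(items, size):
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def time_to_nth_ready(first_start, ready_times):
    """Elapsed time from the first VM start until the n-th VM is ready, for each n."""
    return [t - first_start for t in sorted(ready_times)]


# Hypothetical bootstorm of 120 VMs, started 25 at a time.
vms = [f"vm{i}" for i in range(120)]
groups = batches(vms, 25)

# Made-up timestamps (seconds) for three VMs, just to show the calculation.
elapsed = time_to_nth_ready(100.0, [130.0, 112.5, 121.0])
```

Plotting elapsed against n gives the cumulative curve used in the comparison. The difference between consecutive entries approximates the extra time each additional VM adds.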

b2ap3_thumbnail_4495.png

Before, booting guests overloaded the control path, which slowed down the boot process of later VMs. After our improvements, the time to boot consecutive guests grows linearly, with the whole benchmark completing twice as fast as on the build without the changes.

Another view of the same data, showing the time to boot a single VM:

b2ap3_thumbnail_4496.png

CPU resource isolation and xapi improvements make VMs resilient to the load generated by the simultaneously booting guests. Each of them takes the same amount of time to become ready, compared to the significant increase seen on the host without the changes. That is how you would expect the control plane to operate.

What other benefits will these improvements bring to XenServer users? They will have no more problems with synchronizing XenCenter with the host and issuing commands to xapi. We now expect that XenDesktop users should be able to start many VMs on the pool master while it still responds to control path commands. This would allow them to start more VMs on the master, reducing the necessary hardware and decreasing the total cost of ownership. Cloud administrators can have increased confidence in their ability to administer the host despite unknown and unpredictable workloads in their tenants’ VMs.

TECHNICAL DETAILS

For anyone interested in playing around with the new feature, here are a couple of details of the implementation and the organisation of files in the dom0.

All the changes are contained in a single rpm called control-slice.

The control-slice itself is a systemd unit that defines a new slice to which all control-path daemons are assigned. You can find its configuration in the following file:

# cat /etc/systemd/system/control.slice 
[Unit]
Description=Control path slice
[Slice]
CPUShares=10240

By modifying the CPUShares parameter one can change the CPU shares that control-path processes get. Since the base value is 1024, assigning shares of, for example, 2048 would mean that control-path processes get twice the processing power of datapath processes. The default value for the control-slice is 10240, which means control-path processes get up to ten times more CPU than the datapath. To apply the changes one has to reload the configuration and restart the control.slice unit:

# systemctl daemon-reload
# systemctl restart control.slice
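Before changing the value, it can be handy to check what ratio a given slice file implies. The following is a minimal sketch that parses the unit text with Python's configparser and divides by the base value of 1024 quoted above. configparser is not a full systemd unit parser (it rejects repeated keys, for example), and the helper name control_share_ratio is hypothetical.

```python
import configparser

BASE_CPU_SHARES = 1024  # base value quoted in the text above


def control_share_ratio(unit_text):
    """Return CPUShares from the [Slice] section as a multiple of the base value."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names case-sensitive, as written in the unit
    parser.read_string(unit_text)
    return parser.getint("Slice", "CPUShares") / BASE_CPU_SHARES


# Contents of the slice file as shown earlier in this section.
unit_text = """\
[Unit]
Description=Control path slice
[Slice]
CPUShares=10240
"""

ratio = control_share_ratio(unit_text)
```

For the default file this gives a ratio of 10, matching the ten-to-one weighting described above. A value of 2048 would give 2.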

Each daemon that belongs to the control-slice has a simple configuration file that specifies the name of the slice that it belongs to, for example for xapi we have:

# cat /etc/systemd/system/xapi.service.d/slice.conf 
[Service]
Slice=control.slice
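The same pattern can be written down for any other service. The sketch below only builds the path and file contents, following the layout shown for xapi, and writes nothing to disk. The helper name slice_dropin is hypothetical, and whether a particular daemon should be moved into the control slice is a judgment call that this does not address.

```python
def slice_dropin(service, slice_name="control.slice"):
    """Return (path, contents) of a drop-in that assigns `service` to `slice_name`.

    Mirrors the xapi example above; nothing is written to the filesystem.
    """
    if not service.endswith(".service"):
        service += ".service"
    path = f"/etc/systemd/system/{service}.d/slice.conf"
    contents = f"[Service]\nSlice={slice_name}\n"
    return path, contents


path, contents = slice_dropin("xapi")
```

As with any change to unit configuration, systemd has to reload its configuration before a new drop-in takes effect.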

Last but not least, systemd provides admins with a powerful utility for monitoring the resource utilisation of cgroups. It can be run with the following command:

# systemd-cgtop

The above improvements are planned for the forthcoming XenServer Dundee release, and can be experienced with the Dundee beta.2 preview. Let us know if you liked them and whether they made a difference to you!


About XenServer

XenServer is the leading open source virtualization platform, powered by the Xen Project hypervisor and the XAPI toolstack. It is used in the world's largest clouds and enterprises.
 
Commercial support for XenServer is available from Citrix.